How to audit staging area?
Imagine that it is your first day in new company. Your manager is giving you an ambitious plan of restructuring the current Data Warehouse design.
He said that it is probably not a task for this week, but in 3 months he would like to have some proposal from you. He points some people that you might talk to.
You would like to make this process efficient. Imagine a set of questions that you could as to help you solve this puzzle.
How to create a DWH inventory? How to start? What to look for? What are the red flags?
Time is already ticking.
Let's have a good checklist for Staging layer at the beginning.
How to audit staging area - General questions
- Do you have a staging?
- What kind of staging do you have (permanent, volatile)?
- If it is permanent, do you have a backup and restore policy?
- What is the staging source? File? DB? Cloud?
- How it is processed into staging? ETL / SQL?
- What kind of load do you have to staging? Full load? Incremental load?
- What is the next layer? What are the consumers? DV? ODS? DWH? DM?
- Is there a landing zone while loading data to staging?
How to audit staging area - Verify load routine
- Do you have an audit table?
- How is staging load monitored?
- What is happening in case of successful load?
- What is happening in case of a failure?
- Is staging common? Can you truncate and reload data from staging at any time?
- What is the loading window? Start at fixed time?
- When do we know when to start the load into staging? Is it fixed time? Or we are waiting for an event?
- Can processing of the next layer start after staging load is finished?
- Are you loading data from multiple sources into staging?
How to audit staging area - Performance
- Are you using CDC mechanism? Is there any change detection mechanism? If yes, where is it implemented? On source side or on processing side?
- What is happening when load into staging takes 5 times more than yesterday?
- How long does the load to staging take?
- Do we have SLA to refresh the staging?
- Is there a load that you would like to optimize?
How to audit staging area - Data quality and security questions
- What happened when staging source is not available? Is error reported? To whom is it sent?
- Who have access to it? Is it a kitchen for DWH processing? Or are there consumers that are waiting for the staging data?
- How does the communication with upper layer looks like?
- Do anomalies (like duplicates) are reported when loading data into staging?
- Is loaded staging data compared to the source?
- What will happen when precision will change in the source side?
How to audit staging area - Change management questions
- Is staging structure/load logic changing frequently?
- How are change request handled?
- Do they need involvement and communication with other teams?
- What are the staging change constraints?
- Is there any business logic implemented when loading data into staging?
- Where do you keep your staging data structure? Do you have a modeling tool like PowerDesigner?
- What will happen when a column is added on the source level? Will the load have failed? Or it will run, but improper data might be loaded?
How to audit staging area - summary
Those are just questions that you can ask. They will help you explore current design. On the high level, look for the patterns that may cause some performance and data quality issues.
Look for what is causing development, change management and support process a pain.
Would you like to hear some answers? Or preferred answers? Some good practices? Is there a point that you are specially interested in? Are you interested in what are the red flags?
What are you staging are challenges?
Let me know in the comments, I will try to address your questions.