Sunday, May 6, 2012

Data Staging

The best image I could find?
In Extract, Transform, & Load ETL processes, you might save the data you are extracting to a temporary location and in an unadulterated state. This is called staging the data. Staging is an interesting word. It's a synonym for scaffolding, or a temporary support for something else, such as a career in cooking, a cancer treatment, or a rocket's trajectory. In this case, data staging is a preparation for data that has the format, structure, and purity you want. A data mart is a stage for a data warehouse.

Staging could seem like a bad idea for a number of reasons. First, you're duplicating data. Duplication is almost always a bad word when it comes to data, but staged data should not be used for anything besides staging and perhaps reporting. Still, it requires space, which can be a precious thing even when space is so cheap.

Second, staged data must be written to disk and then read from before doing the final transformations. You're effectively doubling your disk I/O. If the amount of data is significant, this could be a lot of reads and writes.

Written in 2010, a product of experience
These are real trade-offs, but there are a number of reasons to stage, some of which did not occur to me before reading Microsoft SQL Server 2008 Integration Services: Problem-Design-Solution, an excellent book on SSIS best practices by MVP's Erik Veerman, Jessica Moss, Brian Knight, and Jay Hackney.

  • Data Lineage - When you run into data purity issues in production, the first question your production support team will ask is, "Where did this data come from?" A staging area provides means of tracking the import of data and may allow you to run subsequent transformations in a temporary transaction so that you can identify duplicates, null values, or other data purity issues. You will need to keep data around for enough time for such problems to be identified and addressed.

  • Restartability - Jobs fail. The last thing you want to have happen at 3 AM during a conversion process is a job failure (yes, this has happened to me). A staging area should help you restart the job halfway through, after the necessary corrections have been made. This assumes you run your transformations in a transaction, which you are doing, right?

  • Source Alternative - A staging area can act as a reporting environment when your source system is highly transactional or under significant strain. Just make sure you're not doing any writes to stage.

  • Archive - Stage can also be an archive system, which may be useful if your destination system becomes corrupt and needs to be rebuilt. For instance, a data warehouse updated by incremental loads could be rebuilt by running through all transactions. I haven't run across this in practice, though.

  • Performance - Finally, an ETL process might perform best with staged data. Though SSIS can buffer data flows, databases can often sort and aggregate data faster and with less resource load.

As always, there are no universal solutions to data architecture. Don't stage data just for the sake of staging data. You'll have to weigh the pro's and con's according to your business, technology, QA, and production support needs.

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...