the Garden of Forking Paths: May 2012

Sunday, May 13, 2012

Object Relations

I recently talked about the architectural problem of mapping objects to databases. A related problem is the behavior of such mappings. Martin Fowler explains, "That behavioral problem is how to get the various objects to load and save themselves to the database." Though the architectural part is not simple (Jeff Atwood calls it the Vietnam of computer science), behavior can be a much more intractable problem.

It's magic

Think about it. If you load a bunch of stuff from the database into memory and then do some work--especially unpredictable work generated by user commands--you have to figure out what to write back to the database. Some things will have changed, and others will not. Some things might have been updated since you last wrote to the database.

The main thing you need is a unit of work, which keeps track of all changes, writes them back to the database, and checks to make sure there are no conflicts when updating. You could, of course, have a 'dirty' flag for each object, and check each one at the end of your unit of work in order what to determine what to write back. But it's often more elegant to create a unit of work object and take care of objects that are new, dirty, or removed. The problem remains of determining the order of write backs.

If you have a lot of overlap between different units of work in a single session, you may want an identity map, which ensures that objects are only loaded once. This way, you can't accidentally modify two different instances of the same information in memory. An identity map could be located in a session object or as a static object in units of work.

Finally, since you may not want to load all objects into memory at the same time, you might consider implementing lazy loading. The idea is that you don't load data until you actually need it. This may defeat the purpose of O/R mapping, if it ends up leading to a lot of separate loads, but it is a natural extension of the identity map.

Now, much hinges on the actual map itself. How do your objects relate to each other and to the database? If your objects simply mirror your database schema, you could use an identity field which stores the primary key for each object / row. If you do this, you'll probably need a foreign key association between objects / tables.

Things become more complicated if you're making use of inheritance. You could have one table to manage an inheritance hierarchy. This is called single table inheritance. This may lead to lots of NULL values in the database table and difficulty in naming things in a single namespace. Another option is to use class table inheritance, in which you have one table for each class. This simplifies the relationship between objects and tables, but it leads to complex joins. You'll also have to be careful about foreign key relationships to other tables, as a primary key cannot be referred to by other tables. A variation on this pattern is what Fowler calls concrete table inheritance, in which you have one table for each level in an inheritance hierarchy. This should help with the foreign key relationships.

An example of class table inheritance

All this would be a lot to implement by hand, which is why Microsoft created the Entity Framework, first introduced in .NET 3.5. Java coders may try Querydsl. I haven't had a ton of experience dealing with OR behavioral issues, but I can understand the appeal of the EF. You can simply build your objects, build the relations, and then export a database from Visual Studio. I worry about the performance of these tools, but they seem flexible enough to allow for customization. Whether or not this customization is worth the development effort required is another question.

Sunday, May 6, 2012

Data Staging

The best image I could find?

In Extract, Transform, & Load ETL processes, you might save the data you are extracting to a temporary location and in an unadulterated state. This is called staging the data. Staging is an interesting word. It's a synonym for scaffolding, or a temporary support for something else, such as a career in cooking, a cancer treatment, or a rocket's trajectory. In this case, data staging is a preparation for data that has the format, structure, and purity you want. A data mart is a stage for a data warehouse.

Staging could seem like a bad idea for a number of reasons. First, you're duplicating data. Duplication is almost always a bad word when it comes to data, but staged data should not be used for anything besides staging and perhaps reporting. Still, it requires space, which can be a precious thing even when space is so cheap.

Second, staged data must be written to disk and then read from before doing the final transformations. You're effectively doubling your disk I/O. If the amount of data is significant, this could be a lot of reads and writes.

Written in 2010, a product of experience

These are real trade-offs, but there are a number of reasons to stage, some of which did not occur to me before reading Microsoft SQL Server 2008 Integration Services: Problem-Design-Solution, an excellent book on SSIS best practices by MVP's Erik Veerman, Jessica Moss, Brian Knight, and Jay Hackney.

Data Lineage - When you run into data purity issues in production, the first question your production support team will ask is, "Where did this data come from?" A staging area provides means of tracking the import of data and may allow you to run subsequent transformations in a temporary transaction so that you can identify duplicates, null values, or other data purity issues. You will need to keep data around for enough time for such problems to be identified and addressed.

Restartability - Jobs fail. The last thing you want to have happen at 3 AM during a conversion process is a job failure (yes, this has happened to me). A staging area should help you restart the job halfway through, after the necessary corrections have been made. This assumes you run your transformations in a transaction, which you are doing, right?

Source Alternative - A staging area can act as a reporting environment when your source system is highly transactional or under significant strain. Just make sure you're not doing any writes to stage.

Archive - Stage can also be an archive system, which may be useful if your destination system becomes corrupt and needs to be rebuilt. For instance, a data warehouse updated by incremental loads could be rebuilt by running through all transactions. I haven't run across this in practice, though.

Performance - Finally, an ETL process might perform best with staged data. Though SSIS can buffer data flows, databases can often sort and aggregate data faster and with less resource load.

As always, there are no universal solutions to data architecture. Don't stage data just for the sake of staging data. You'll have to weigh the pro's and con's according to your business, technology, QA, and production support needs.