the Garden of Forking Paths: Database Refactoring

I'm taking ownership of a database, so I've been reading about database development best practices. Database development is unique from other kinds of development for a number of reasons. Most crucially, you can never start from scratch with a database, because it's a living, breathing thing--you can't simply recompile and redeploy. Second, you have to design your database so that it can be set up with representative data on developers' workstations, in a development environment, and a testing environment. Third, you have to make sure that any scripts can be deployed on any environment (it usually helps to have some way of knowing what 'version' a database is on). And you may have to do all this while being unsure of all the applications that access a database, or the manner in which they do it.

These challenges led me to read Scott Ambler and Pramod Sadalage's book on database refactoring (2006). "Refactoring" is a term popularized by Martin Fowler's eponymous book (1999). He defines it as a "disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior." It is important to have a concept like refactoring, because development is usually focused on adding functionality only. When things change quickly, who has time to refactor?

There is some anecdotal evidence that spending time refactoring will decrease the amount of time coding later, but there's no rule for how much time you should spend refactoring. However, if you can't do any of the things mentioned above (limit access to your database, know what version it is, deploy to multiple environments), you must refactor. The hoops you'll have to jump through just to do normal development are far costlier than the effort needed to refactor. Ambler and Pradalage also suggest that it might be time to refactor if you notice any of the following "database smells": multi-purpose columns, multi-purpose tables, redundant data, tables with many columns, tables with many rows, "smart" columns, and fear of change.

Defining refactoring in the way Fowler does is also important, because it simplifies testing by separating the development of added functionality from development that should preserve existing functionality. Ambler and Pradalage embrace a Test-Driven Development approach, where you test, deploy, and test again. If you make testing central to your development process, you can be sure that you're not breaking anything during refactoring.

Though refactoring seems necessary only in (hopefully) rare circumstances, the steps that constitute it should bleed into normal development. For example, it's always a good idea to develop in small, easily testable pieces, rather than making many large changes all at once. An iterative process also allows for feedback from users. In short, refactoring should help you to embrace AGILE principles in database development, and it should make further development faster and easier. Even if you can't make your database development completely AGILE, these principles should help you evaluate bottlenecks in your process and suggest means of addressing them.

Links:
-An article that contains the main sections of the book

4 comments:

Scout7October 28, 2011 at 3:21 PM
I think the big key is how the applications access the database and the data. I have always shied away from writing SQL code in an application for this reason (along with others), while on the db side I try to force applications to use specific accounts and go through stored procedures as much as possible. Doing so makes changing your database at least somewhat easier.

The real issue, though, is that database design, architecture, and programming requires a different mindset than other programming languages. I think that the db stuff tends to be more abstract, and requires thinking about things in a non-standard fashion.
Zach VanderVeenOctober 29, 2011 at 9:41 AM
It's funny how weird databases are--and all programmers think they know SQL! Do you have any other suggestions for how to explain the difference between database and object oriented programming? It seems to me like a big one, which I mention above, is that you can't start from scratch with a database by recompiling. Set logic, like you mention, is a big one too!
Scout7October 31, 2011 at 1:57 PM
Well, that's always the trick, isn't it? Especially if you're talking to OO programmers.

For me, I have to do both, which sometimes makes it hard to do one or the other well. The biggest issue I run into is NOT making tables that resemble objects, and vice versa. For this aspect, I think the key is to recognize that the database is really a storage medium for organizing and managing related information. Therefore, any tables you create should be done with that intent in mind. Now, the real blurring comes in with the indexes; good indexing strategies will cover the most common usage scenarios without negatively impacting overall performance. So now you have to look at HOW the data is accessed and used, not just how to store it.

This thought process is different than what OO does. The focus is on objects, and then describing those objects, then describing what they do. Whereas in the database, you want to focus on how the object descriptions inter-relate with each other. I may not be doing the best job with this, but I think you get the gist of it.
Zach VanderVeenOctober 31, 2011 at 9:32 PM
I like this rule--don't think of databases in terms of objects. One of Bill Karwin's anti-patters has to do with trying to do polymorphism in SQL!

Sunday, October 16, 2011

Database Refactoring

4 comments: