Sunday, November 20, 2011

Database Version Control Goals

I'm currently version controlling a database and developing version control processes which will be used by a new team at my company. Everyone knows you should version control your database, but what exactly this means and how to go about doing it are not obvious. As I've mentioned before, version controlling databases is much different from version controlling application code, because you can never start from scratch with a database.

To meet the minimum requirements of a version control system (VCS), you must be able to:
  1. Maintain a directory structure of all database objects
  2. Maintain change/upgrade scripts
  3. Isolate development from production code
The most basic version of Subversion will allow you accomplish these tasks, with two caveats. First, you have to either build your directory structure from the ground up, or you have to form some kind of process that will script out your database. I've tried the latter with Powershell, SSMS, Visual Studio database projects, and custom .Net code using SMO. In my opinion, the least bad option is .Net, since the code is not too difficult to write and you can customize it to your needs. You can add in functionality to script out only objects changed after a certain date. However SMO is incredibly slow. It took me 2 hours to create a baseline for my database.

The challenge shared by all of these home-brewed approaches for scripting out database objects is that you have to be consistent. For instance, if sometimes you use SSMS and sometimes Powershell, the object scripts may be slightly different and create false positives in your VCS.

If you stick with these minimum goals, you are also hampered by the annoying and uncertain process of having to maintain change scripts on top of your object scripts. I'm sure you always test your change scripts thoroughly on test environments that exactly mirror production, but you can never be 100% sure that your codebase actually matches your production database. Finally, if you want to create sandbox environments, you'll have to create another process for turning your object scripts into database generation scripts. This is more important if you work at a software company and need to be able to reproduce defects in old, supported code than if you work in an in-house database system. In either case, it's still a good idea to be able to create sandboxes for developers so that people don't break each other's code from working in the same database.

For these reasons, you also want to be able to:
  1. Generate change scripts automatically
  2. Create database environments with test data at any version
  3. Know what version a database is
Now we're talking about incurring some monetary costs, but most likely ones that will pay for themselves. The two main options for achieving these advanced goals are Red Gate tools and Visual Studio Ultimate / Database Edition.

Visual Studio database projects are a cool idea, since you can populate object scripts and compare them against your database, as long as you have something better than VS Premium. You can get Anhk SVN or Visual SVN to version control directly in VS. VS has an 'offline' development model, meaning that you do your code in VS and then later create change scripts by 'compiling' your database. I've had issues with circular dependencies among databases, making this an unfeasible process. You get what you pay for with Red Gate tools, and you'll really want both Source Control and SQL Compare. Source Control just helps you deal with the basic goals in an elegant way, but, when paired with SQL Compare, it will allow you to create scripts to turn any database at any version and turn it into one of any other version. You can also get tools like Data Compare to sync your test and production environments, and you can use the Data Generator to create non-sensitive test data. These features improve the development process, but are not necessarily within the scope of version control.

There are a few options for knowing what version a database is on. SQL Compare will allow you to compare the database with a version in a repository. You could also create a table that stores the database version, which is useful but will never provide 100% certainty about all the objects. Finally, you can create extended events for database objects but Phil Factor doesn't recommend it in the eBook, since there are not many options for documentation. Database version control is slowly making it's way out of the Dark Ages and into the modern, Agile world that other developers have long taken for granted. Once you are able to handle all of the advanced goals, you can do even more things, such as automate the build process and run automated test scripts, a process called Continuous Integration. I must admit that the developments like those Red Gate has made take some of the mystery and the fun away from good ole database development. But I'll probably get over it.

-Grant Fritchey's article on Simple Talk, which is chapter 3 in the Red Gate eBook
-Michael Baylon's own list of goals and solutions
-Another good article


  1. My experience with SQLCompare is based off of someone else running it and generating CREATE scripts for moving objects through dev, QA, and production environments.

    Personally, I did not like it. I'm sure the tool itself probably has far more options than what we were using, but we would inevitably run into issues with the deployment scripts. Things got even worse if you wanted to populate new tables (or update the data in existing tables).

    DB versioning is, in my mind, the holy grail of version control. It's difficult for all the reasons you've mentioned, and there's two sides to it (the objects, and the data itself). Ultimately, good database version control comes down to two things: a well thought-out architecture, and good documentation on the design and implementation. The first is important because it allows for the ability to limit the impact of changes. The second is important because it gives an historical record of the database purpose/goals, the decisions that went into it, and a record of what objects are supposed to do what.

  2. Interesting to hear that about Red Gate. If the automatic script generation doesn't work, I don't know why anyone would buy it!

    We're probably going to stick with the basics for a while. Perhaps we should focus on documentation instead--but I've never worked for a place that had good documentation.

    I'd be interested to hear your thoughts on improving documentation. Hunt and Thomas a really into avoiding duplication, including in documentation. If documentation is separate from the code, no one's going to update it. But code documentation always seems too granular to me--what I want is the big picture.

  3. I may have been a little too vague in my comment with regard to Red Gate's script generation. It works, and works pretty well. But the real issue that I've run into is having to do things slightly differently when moving to another environment. Try as we might, there is always different settings for things, and the data itself can cause certain issues, between production and dev environments.

    Documentation is something that I've gotten more involved with over the last few months. I agree that in-code documentation is too granular except for the developers (always good to have an understanding of what the purpose of a procedure of function is).

    For db documentation, what I have found to be really useful as a starting point is the standard relational model (like you can produce in Visio). Personally, I like having one page that shows all of the tables and their relations, and then other pages that cover specific sections of the database.

    Beyond that, you could use a combination of techniques, specifically a data dictionary, or using the Extended Properties. (I work all in SQL Server, so if you're using a different db, these links won't apply).

    But, I think the absolute most important thing is to have a document that outlines the decisions made in the development of the database and its architecture. It is invaluable to be able to go back a year from now and have a definitive answer why the foo table uses a surrogate key, while the bar table does not. Knowing why you did something is sometimes more important than knowing what you did.

  4. A lot of great suggestions--thanks! I've never heard of a data dictionary before. I'm hoping that this winter I can get some documentation done. This will be my first serious attempt. I can understand why so many people/firms just decide not to do it. Of course, you can always figure out what something does by looking through the code, but that's rarely the most efficient thing to do!


Related Posts Plugin for WordPress, Blogger...