Sunday, May 13, 2012

Object Relations

I recently talked about the architectural problem of mapping objects to databases. A related problem is the behavior of such mappings. Martin Fowler explains, "That behavioral problem is how to get the various objects to load and save themselves to the database." Though the architectural part is not simple (Ted Neward called it the Vietnam of computer science, a phrase Jeff Atwood later popularized), behavior can be a much more intractable problem.

It's magic
Think about it. If you load a bunch of stuff from the database into memory and then do some work--especially unpredictable work generated by user commands--you have to figure out what to write back to the database. Some things will have changed, and others will not. Some things might have been updated since you last wrote to the database.

The main thing you need is a unit of work, which keeps track of all changes, writes them back to the database, and checks to make sure there are no conflicts when updating. You could, of course, have a 'dirty' flag for each object, and check each one at the end of your unit of work in order to determine what to write back. But it's often more elegant to create a unit of work object that takes care of objects that are new, dirty, or removed. The problem remains of determining the order of write backs.
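Here's a minimal sketch of what such a unit of work object might look like, assuming objects are plain dicts with an "id" key. The RecordingStore class is a made-up stand-in for a real database gateway, and conflict checking is left out entirely. The insert-update-delete ordering is only one simple answer to the write-back order problem.

```python
class RecordingStore:
    """Hypothetical stand-in for a database gateway; it only records what it is asked to do."""

    def __init__(self):
        self.log = []

    def insert(self, obj):
        self.log.append(("insert", obj["id"]))

    def update(self, obj):
        self.log.append(("update", obj["id"]))

    def delete(self, obj):
        self.log.append(("delete", obj["id"]))


class UnitOfWork:
    """Collects new, dirty, and removed objects and writes them back together."""

    def __init__(self):
        self.new, self.dirty, self.removed = [], [], []

    def register_new(self, obj):
        self.new.append(obj)

    def register_dirty(self, obj):
        # A new object is inserted with its latest state, so it is not also marked dirty.
        if obj not in self.new and obj not in self.dirty:
            self.dirty.append(obj)

    def register_removed(self, obj):
        if obj in self.new:
            self.new.remove(obj)  # never reached the database, so there is nothing to delete
            return
        if obj in self.dirty:
            self.dirty.remove(obj)
        self.removed.append(obj)

    def commit(self, store):
        # One simple ordering: inserts, then updates, then deletes.
        for obj in self.new:
            store.insert(obj)
        for obj in self.dirty:
            store.update(obj)
        for obj in self.removed:
            store.delete(obj)
        self.new, self.dirty, self.removed = [], [], []
```

A real implementation would also need to order writes so that foreign key constraints are satisfied, which this sketch ignores.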

If you have a lot of overlap between different units of work in a single session, you may want an identity map, which ensures that objects are only loaded once. This way, you can't accidentally modify two different instances of the same information in memory. An identity map could be located in a session object or as a static object in units of work.
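As a rough sketch, an identity map can be little more than a dictionary in front of whatever does the loading. The loader function passed in here is an assumption of mine, standing in for a real database read.

```python
class IdentityMap:
    """Hands back the same in-memory object for the same key, loading it at most once."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical function: key -> freshly loaded object
        self._objects = {}

    def get(self, key):
        if key not in self._objects:
            self._objects[key] = self._loader(key)
        return self._objects[key]


# Usage: count how often the (fake) database is actually hit.
calls = []

def load_customer(key):
    calls.append(key)
    return {"id": key, "name": "customer-%d" % key}

customers = IdentityMap(load_customer)
first = customers.get(1)
second = customers.get(1)
```

Because both lookups return the very same object, a change made through one reference is visible through the other.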

Finally, since you may not want to load all objects into memory at the same time, you might consider implementing lazy loading. The idea is that you don't load data until you actually need it. This may defeat the purpose of O/R mapping, if it ends up leading to a lot of separate loads, but it is a natural extension of the identity map.
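A minimal sketch of lazy loading, assuming an order whose line items are expensive to fetch. The class name and the load_lines function are hypothetical; the point is only that the load is deferred until the first access and then remembered.

```python
class Order:
    """Loads its line items only when they are first asked for."""

    def __init__(self, order_id, load_lines):
        self.order_id = order_id
        self._load_lines = load_lines  # hypothetical function: order_id -> list of lines
        self._lines = None             # None means "not loaded yet"

    @property
    def lines(self):
        if self._lines is None:
            self._lines = self._load_lines(self.order_id)
        return self._lines


# Usage: the fake loader records each call so the deferral is visible.
loads = []

def load_lines(order_id):
    loads.append(order_id)
    return ["widget", "gadget"]

order = Order(42, load_lines)
loads_before_access = list(loads)
lines = order.lines
lines_again = order.lines
```

If you touch the lines of a thousand orders in a loop, this turns into a thousand separate loads, which is the trade-off mentioned above.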

Now, much hinges on the actual map itself. How do your objects relate to each other and to the database? If your objects simply mirror your database schema, you could use an identity field which stores the primary key for each object / row. If you do this, you'll probably need a foreign key association between objects / tables.

Things become more complicated if you're making use of inheritance. You could have one table to manage an inheritance hierarchy. This is called single table inheritance. This may lead to lots of NULL values in the database table and difficulty in naming things in a single namespace. Another option is to use class table inheritance, in which you have one table for each class. This simplifies the relationship between objects and tables, but it leads to complex joins. You'll also have to be careful about foreign key relationships to other tables, as a primary key cannot be referred to by other tables. A variation on this pattern is what Fowler calls concrete table inheritance, in which you have one table for each concrete class in an inheritance hierarchy. This should help with the foreign key relationships.
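To make the join cost of class table inheritance concrete, here's a small sketch using Python's built-in sqlite3 module. The person and employee tables and their columns are made up for illustration: employee is a subclass of person, and each class gets its own table sharing a primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per class; the subclass table shares the superclass key.
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY REFERENCES person(id),
        salary REAL NOT NULL
    );
    INSERT INTO person VALUES (1, 'Ada');
    INSERT INTO employee VALUES (1, 52000.0);
""")


def load_employee(conn, employee_id):
    """Rebuild one Employee object; every level of the hierarchy adds a join."""
    row = conn.execute(
        "SELECT p.id, p.name, e.salary "
        "FROM person p JOIN employee e ON e.id = p.id "
        "WHERE p.id = ?",
        (employee_id,),
    ).fetchone()
    return {"id": row[0], "name": row[1], "salary": row[2]}


employee = load_employee(conn, 1)
```

With single table inheritance the same load would be a single-table SELECT, at the price of a salary column that is NULL for everyone who isn't an employee.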

An example of class table inheritance

All this would be a lot to implement by hand, which is why Microsoft created the Entity Framework, first introduced in .NET 3.5 SP1. Java coders may try Querydsl. I haven't had a ton of experience dealing with O/R behavioral issues, but I can understand the appeal of the EF. You can simply build your objects, build the relations, and then export a database from Visual Studio. I worry about the performance of these tools, but they seem flexible enough to allow for customization. Whether or not this customization is worth the development effort required is another question.

Sunday, May 6, 2012

Data Staging

The best image I could find?
In Extract, Transform, and Load (ETL) processes, you might save the data you are extracting to a temporary location and in an unadulterated state. This is called staging the data. Staging is an interesting word. It's a synonym for scaffolding, or a temporary support for something else, such as a career in cooking, a cancer treatment, or a rocket's trajectory. In this case, data staging is a preparation for data that has the format, structure, and purity you want. A data mart is a stage for a data warehouse.

Staging could seem like a bad idea for a number of reasons. First, you're duplicating data. Duplication is almost always a bad word when it comes to data, but staged data should not be used for anything besides staging and perhaps reporting. Still, it requires space, which can be a precious thing even when space is so cheap.

Second, staged data must be written to disk and then read back before doing the final transformations. You're effectively doubling your disk I/O. If the amount of data is significant, this could be a lot of reads and writes.

Written in 2010, a product of experience
These are real trade-offs, but there are a number of reasons to stage, some of which did not occur to me before reading Microsoft SQL Server 2008 Integration Services: Problem-Design-Solution, an excellent book on SSIS best practices by MVPs Erik Veerman, Jessica Moss, Brian Knight, and Jay Hackney.

  • Data Lineage - When you run into data purity issues in production, the first question your production support team will ask is, "Where did this data come from?" A staging area provides a means of tracking the import of data and may allow you to run subsequent transformations in a temporary transaction so that you can identify duplicates, null values, or other data purity issues. You will need to keep data around long enough for such problems to be identified and addressed.

  • Restartability - Jobs fail. The last thing you want to have happen at 3 AM during a conversion process is a job failure (yes, this has happened to me). A staging area should help you restart the job halfway through, after the necessary corrections have been made. This assumes you run your transformations in a transaction, which you are doing, right?

  • Source Alternative - A staging area can act as a reporting environment when your source system is highly transactional or under significant strain. Just make sure you're not doing any writes to stage.

  • Archive - Stage can also be an archive system, which may be useful if your destination system becomes corrupt and needs to be rebuilt. For instance, a data warehouse updated by incremental loads could be rebuilt by running through all transactions. I haven't run across this in practice, though.

  • Performance - Finally, an ETL process might perform best with staged data. Though SSIS can buffer data flows, databases can often sort and aggregate data faster and with less resource load.
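To make the lineage and restartability points a bit more concrete, here's a minimal sketch using Python's built-in sqlite3 module rather than SSIS. The table names, columns, and batch_id scheme are all assumptions of mine. Raw rows land in a staging table untouched and tagged with a batch, and the transformation runs inside a transaction so it can be rerun safely from the staged copy.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stage_customer (batch_id TEXT, raw_name TEXT, raw_email TEXT);
        CREATE TABLE customer (batch_id TEXT, name TEXT, email TEXT);
    """)
    return conn


def stage_rows(conn, batch_id, rows):
    """Extract step: store the source rows as-is, tagged with the batch they came from."""
    conn.executemany(
        "INSERT INTO stage_customer (batch_id, raw_name, raw_email) VALUES (?, ?, ?)",
        [(batch_id, name, email) for name, email in rows],
    )
    conn.commit()


def load_from_stage(conn, batch_id):
    """Transform and load one batch; the transaction rolls back if anything raises."""
    with conn:
        # Clearing the batch first makes a restart after a failure idempotent.
        conn.execute("DELETE FROM customer WHERE batch_id = ?", (batch_id,))
        conn.execute(
            "INSERT INTO customer (batch_id, name, email) "
            "SELECT DISTINCT batch_id, TRIM(raw_name), LOWER(TRIM(raw_email)) "
            "FROM stage_customer "
            "WHERE batch_id = ? AND raw_email IS NOT NULL",
            (batch_id,),
        )


conn = make_db()
stage_rows(conn, "2012-05-06", [("Ada ", "ADA@example.com"),
                                ("Ada", "ada@example.com "),
                                ("Bob", None)])
load_from_stage(conn, "2012-05-06")
load_from_stage(conn, "2012-05-06")  # simulating a restart of the same batch
```

The rejected row (Bob, with no email) is still sitting in stage_customer, which is where the "where did this data come from?" question gets answered.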

As always, there are no universal solutions to data architecture. Don't stage data just for the sake of staging data. You'll have to weigh the pros and cons according to your business, technology, QA, and production support needs.

Sunday, April 29, 2012

Where's the Logic Go?

Typically, the business logic resides in the middle, or domain, layer
Every application has at least two components: the design of technology platforms, called the application logic, and the processes that need to happen, called the business logic. In theory, the business logic is independent of the application logic, since a business has rules, workflows, and transactions that have nothing to do with any programming languages or database systems. In practice, however, application logic can put constraints on business 'illogic.'

One of the key design choices in developing any application is deciding where the business logic should go. Database developers think it should go in the database, since keeping the code at the database level is often most performant. The problem with this is that SQL doesn't have many of the basic niceties of any Object-Oriented language. Furthermore, since stored procedures use proprietary SQL, they can prevent the migration of database code to another vendor.

OO developers think business logic should reside in the domain layer, since objects are best at representing the real world. Libraries and IDEs like Visual Studio make it very easy to get an OO application off the ground, and they help with maintainability. For many applications, however, the amount of code needed to build out an MVC structure, for example, is more than the problem warrants and may even be prohibitively burdensome.

In reality, no application design should be used for all problems. Martin Fowler provides four models that couple domain and database access logic.

Transaction Script / Row Data Gateway - Domain code simply passes requests from the UI to the database. Database access is modeled at the record level.

Table Module / Table Data Gateway - Domain code is organized in objects corresponding to tables in the database. Database access is modeled at the table level.

Domain Model / Active Record - Domain code is organized according to business rules. Database access is modeled by CRUD objects.

Domain Model / Data Mapper - Domain code is organized according to business rules. Database access is modeled by a mapping object layer.
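As a rough illustration of the last pairing, here's a sketch of a domain object that knows nothing about storage and a separate mapper that moves it in and out of a "table." The Order class, the discount rule, and the dictionary standing in for a database table are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Domain object: holds the business rule and has no idea how it is stored."""
    id: int
    quantity: int
    unit_price: float

    def total(self):
        # Made-up business rule: 10% off orders of ten units or more.
        subtotal = self.quantity * self.unit_price
        return subtotal * 0.9 if self.quantity >= 10 else subtotal


class OrderMapper:
    """Data mapper: the only place that knows how an Order becomes a row."""

    def __init__(self, rows):
        self._rows = rows  # stand-in for a table: {id: (quantity, unit_price)}

    def find(self, order_id):
        quantity, unit_price = self._rows[order_id]
        return Order(order_id, quantity, unit_price)

    def save(self, order):
        self._rows[order.id] = (order.quantity, order.unit_price)


table = {1: (10, 2.0)}
mapper = OrderMapper(table)
order = mapper.find(1)
order.quantity = 5
mapper.save(order)
```

With Active Record, find and save would live on Order itself. That is less code, but it ties the business object to the shape of the table.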

Fowler suggests that your choice of pairings should depend upon the complexity of your business logic. An application used for reporting can simply send requests to a database, but a complex sales order process should probably be mirrored by a domain model and a data mapper. A domain model will have a higher up-front cost, but it may pay off as the complexity of an application increases.

I think this general trade-off makes a lot of sense, and it helps me understand and categorize a number of applications I've seen. But, unless I am mistaken, pretty much any enterprise application is going to require a layer for business logic objects, a layer for data mapping, a layer for data access, a layer for the data itself, and, of course, the presentation layer. If it's possible to reduce the complexity of these layers, do so!

Sunday, April 22, 2012

Bigger Faster Stronger

Scalability is one of those words that everyone uses but few understand. It's a measure of how adding resources (typically hardware) affects performance. You can scale vertically by increasing the power of a server. You can scale horizontally by adding servers. The scalability of a system depends on how performance is defined. Martin Fowler suggests a few categories:
  • Response time, or the amount of time it takes to process a request
  • Responsiveness, or the amount of time it takes to acknowledge a request
  • Latency, or the minimum amount of time it takes to get any response (even when there is no work to do or data to return)
  • Throughput, such as transactions / second
  • Load, or the amount of stress a system is under
  • Load Sensitivity, or response time / load
  • Efficiency, or performance / resources
  • Capacity, as in maximum throughput or load
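A few of these measures are simple ratios, so here's a small sketch of how you might compute them from numbers you already collect. The function names are mine, and treating load sensitivity as the change in response time per unit of added load is one reading of "response time / load," not the only one.

```python
def throughput(requests_completed, seconds):
    """Transactions per second over a measurement window."""
    return requests_completed / seconds


def load_sensitivity(time_at_low_load, time_at_high_load, low_load, high_load):
    """How much response time grows for each unit of added load."""
    return (time_at_high_load - time_at_low_load) / (high_load - low_load)


def efficiency(performance, resources):
    """Performance delivered per unit of resource, e.g. tps per CPU."""
    return performance / resources


# Usage with made-up measurements.
tps = throughput(600, 60)                        # 600 requests in one minute
sensitivity = load_sensitivity(0.5, 1.0, 10, 20) # seconds per added user
two_cpu = efficiency(30, 2)                      # 30 tps on two CPUs
four_cpu = efficiency(40, 4)                     # 40 tps on four CPUs
```

The last two lines show why raw throughput can mislead: the four-CPU box does more work in total but less work per CPU.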

Systems must be designed to scale, but what scaling means will depend upon the purposes for which a system is built. It might be tempting for database professionals to think about scalability in terms of transactions / second or the number of active accounts. But what really matters is whether or not the system is usable given an increase in transactions or accounts, and this depends upon the use for which the system was created. If we're talking about an e-Commerce system, throughput is probably more important than response time, as long as responsiveness is high. If we're dealing with a manufacturing system, we'll probably be most interested in throughput.

It's important to design systems to be scalable. The Internet has pushed adoption rates to unprecedented levels. Consider Instagram, which has 30 million users after 2 years. Draw Something had 36 million users in three weeks. Scalability is a prerequisite for virality.

In the case of N-tier applications which have a Service-Oriented Architecture, it's usually easy to add hardware to the web and application servers. Load balancers and web farms can take care of extra load by distributing it evenly across a number of servers. The real problem is, as always, the database layer.

You can't just add servers to the database layer, because databases must be architected across multiple database servers. Concurrency adds to the difficulty, as database transactions must be ACID (atomic, consistent, isolated, and durable). In other words, you have to manage updates to multiple servers, making sure that an update to Server 2 does not depend on Server 1.

Lightning bolts make it faster
I thought the Cloud might be the solution to database scalability, but Microsoft Azure currently supports databases of only 150 GB in size. The Microsoft consultants I've talked with recommend 'sharding' databases. This means having a master database that directs transactions to the appropriate database server. For instance, all transactions dealing with North American accounts should go to Server 1, South America to Server 2. Sharding adds a layer of abstraction and a layer of complexity, and it requires duplication of database schema, but it's an increasingly popular approach.
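The routing piece of sharding can be sketched in a few lines. This is a minimal sketch, assuming each account carries a region field. The region keys and server names are hypothetical, and a real router would live in a master database or a middle tier rather than a Python dictionary.

```python
# Hypothetical shard directory: region -> database server.
SHARDS = {
    "north_america": "server1",
    "south_america": "server2",
}


def shard_for(account):
    """Pick the database server that owns this account's data."""
    try:
        return SHARDS[account["region"]]
    except KeyError:
        raise ValueError("no shard configured for account %r" % (account,)) from None


# Usage
server = shard_for({"id": 7, "region": "south_america"})
```

Every server behind the router needs the same schema, and any query that spans regions now has to be stitched together by the application.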

Another option is Oracle's RAC system or Microsoft's MatrixDB, which has basically been ported to Azure. I'm skeptical that MatrixDB will make it into the next edition of SQL Server (2012 has AlwaysOn, which is close, but the mirrors are read-only). In RAC or MatrixDB, databases are replicated across multiple servers and a load balancer directs reads and writes to the server with the least load. Changes are replicated asynchronously between database servers. Still, there are limitations to the size of databases for which this would be feasible.

Relational databases are great up to a certain size (though this is growing, thanks to SSDs and improved caching). It's hard to say exactly what this size is. In the end, scalable databases adhere to principles of normalization and partitioning. After a certain amount of data, RDBMSs will be of no use, and NoSQL solutions are the answer to a different problem. Are you ready to scale?

Sunday, April 15, 2012

The Reef and the Market

Philosophy professors like asking students where their ideas come from. "I just think them," students are bound to retort. "Aha!" The professor pounces. "But where does the idea of 'I' come from?"  Silence.

Much of philosophy involves trying to explain where various ideas come from. Socrates and Aristotle understood the advance of thought as a process of dialogue which builds upon the ideas of the past. René Descartes argued that the idea of God was at the root of all true ideas. Karl Marx thought many of our ideas, such as religion, are ideology and a product of power relations. W.E.B. DuBois believed that we understand ourselves and others through the lens of race and that these ideas have a contingent history. William James thought our ideas were a product of 'what worked' for us and people like us in the past. Thomas Kuhn argued that scientific theories belong to a history of evolving paradigms.

Steven Johnson is concerned with good ideas in his book, Where Good Ideas Come From. Three images of innovation orient his inquiry. There is the coral reef: reefs make up 0.1% of the Earth's surface but support 25% of all marine species. There is the city, which, as Geoffrey West has shown, increases in innovation in relation to population at a super-linear rate. And there is the web, which has decreased the time required for innovating and adopting new technologies from 20 years to 2.

By looking at innovation from a number of different scales, including at the level of brains, cities, and ecosystems, Johnson comes up with a framework I summarize in the following way. Good ideas are fostered by:
  • networks that can change
  • that have some stability
  • that favor chance encounters
  • that embrace error
  • that support re-use
  • that support building on other good ideas

More creativity per capita than any suburb
How can businesses foster innovation? Johnson shows that the majority of major inventions in the last two hundred years did not happen in R&D labs at major firms or in the garages of people who later struck it rich. They usually took place in colleges and universities, or organizations like CERN. This was surprising to me, given my experience in the academy, its silos of rival departments, and its distance from the real world. However, universities do allow people from very different backgrounds to work together and to circulate and build upon others' ideas freely. They allow for experiments to go wrong and let people research controversial things.

Many businesses today tout the importance of innovation, but few allow for failure, the open exchange of diverse ideas, change, or time for reflection. A recent survey of CEOs showed that they spend around 50 hours a week working but have little time to reflect, given the constant interruptions of BlackBerries. Google, on the other hand, requires employees to work on their own projects 20% of the time. Twitter built an open API and then built their services on top of that. Apple, while opaque to outsiders, has a very messy development process where everyone at each step of the development chain is involved with a new product at the very beginning.

Twitter or GM?
More personally, Johnson's book caused me to reflect on when I'm most creative. I'm best in my sleep or in writing after having a discussion with someone. I need time to let ideas simmer, but I'm lucky enough to have lots of smart people to discuss ideas with. Johnson concludes:

"Go for a walk; cultivate hunches; write everything down, but keep your folders messy; embrace serendipity; make generative mistakes; take on multiple hobbies; frequent coffeehouses and other liquid networks; follow the links; let others build on your ideas; borrow, recycle, reinvent. Build a tangled bank."

Sunday, April 8, 2012

Do Technologies Have Politics?

I've been exploring the idea that technology isn't just a collection of things which can be used interchangeably for either good or ill. Technologies shape the world and the range of choices we have. For example, in my last post, I suggested that real-time technologies may have effects for the ways we do finance. Our question today is: if technologies can shape financial decisions, can they shape politics as well?

Riot!
It's one thing to say that technologies are used for political ends and another to say that technology is political. The latter is Langdon Winner's claim. If politics is about power and authority, then at least some technologies are political if they embody, define, or exert the power of some people over others.

Technologies are obviously political when they are used as tools of control. It is often feared that new computer technologies will allow governments to spy on people in order to control illegal activities. But forms of control can take much subtler forms. For example, when Robert Moses designed many public works in New York City in the mid 20th century, he designed them so that people could not get to them via public transportation. The overpasses he constructed have a clearance of only 9 feet--far too low for buses and those who would ride them to get to Jones Beach. Similarly, many people don't realize that the grand boulevards of Paris were designed by Baron Haussmann in the mid 19th century as a form of riot control.  As a Penn State alum, I can tell you that narrow streets are much more conducive to riots than open fields.

These are examples of technologies being used to empower some and dis-empower others. But are some technologies inherently political? I think so. Marx and Engels called for factory workers to take over control of 'the means of production,' but Engels later argued that the very technology of industrial manufacturing requires a division between workers and elites. How can you run a factory without a boss?

Winner endorses solar power over nuclear energy because it can be decentralized and doesn't require the scientific and bureaucratic elites who make decisions without the knowledge of anyone else, as in the case of the Fukushima Daiichi disaster. For months, the Japanese government and TEPCO lied to people about the seriousness of the accident. Countless polls show that support for nuclear power is at an all-time low across the world, with Germany abandoning future plans for construction. Nuclear fallout is an issue, of course, but I think people are really worried about the forms of power that are inseparable from nuclear energy.

Visit Japan, future skate park capital of the world.

Technologies that shape public opinion and collective decision-making are also inherently political. Before the invention of the printing press, it would have been impossible to have democracy on anything other than a very small scale. Though it required a set of elites who broadcast information to information consumers, and though, as Walter Lippmann worried, most people would never get the whole truth, the dynamic duo of newspapers and democracy were better than any alternative. Pamphlets made the American Revolution possible.

I, for one, welcome our new search engine overlords.
Today, internet technologies are already reshaping power. The decline of major 1-directional or broadcast media, including newspapers, is heralded by some as the dawn of a new, more democratic age. Like much breathless optimism about technology, such claims should be taken with a grain of salt. Most internet traffic is routed through about five sites, a problem Matthew Hindman terms Googlearchy. Most influential political bloggers are white males who went to Ivy League schools. But there are many precedents for thinking that new technologies can undermine and recreate power in radical ways. These will be fought by the old guard and embraced by the new.

It comes back again to: what kind of technology do we want?

Sunday, April 1, 2012

Accounting and Thermodynamics

Predator-vision
A few years ago, I rented a very cheap house in a very cold part of the country. I wanted someplace big to play my drums, but I didn't realize what kind of heating bills I would get in the winter. I ended up keeping the house at 40 degrees Fahrenheit, using space heaters, and freezing a few pipes.

Besides earning a story to tell, I also learned how to see rates of flow. I was suddenly able to see the various heat sources and sinks in my house, with vectors of various strengths showing the direction and rates of flow. Unconsciously, I had always thought of heat as being a property of a room or building, but I now saw heating the way physicists see it.

Locke-vision
Such paradigm shifts, which overlay your present view of the world with a broader experience, are not uncommon. I always enjoyed studying geology, since it allows you to see the seemingly-fixed landscape as a fluid process and to see human activity from the perspective of the Earth. For thousands of years, astrology let people interpret ordinary events through the lens of the cosmos.

One of the most natural ways of seeing the world is as a collection of things with properties. This view was best put down on paper by modern philosophers like John Locke. They went back and forth about how subjective 'secondary' qualities like color and taste could be known to be true to the 'primary' essence of a thing, but they never questioned the atomistic model of the universe. This was only natural when the physics of the day characterized the interactions of the universe by analogy to billiard balls.

I've been trying to get my head around some hard accounting problems, and I realized that my problem was thinking of accounts as things with properties. It is correct, in a sense, to describe accounts as having a dollar amount. But it is more correct to think of them as part of a system of interconnected accounts with various directions and rates of flow, much like the heat in my cold house. This is because the value of an account is constantly changing, and because its changes are the direct result of transfers from other accounts. Even the cash in your wallet is not separate from this plumbing. I've begun to see the systems I build and maintain as part of the flow of the entire monetary system.
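Here's a minimal sketch of that shift in perspective, assuming a ledger that stores only transfers between accounts. The account names and amounts are made up. A balance is never stored as a property of an account; it's derived by adding up the flows in and out.

```python
from collections import defaultdict


def balances(transfers):
    """Derive each account's balance from the flows between accounts.

    Each transfer is (source, destination, amount_in_cents).
    """
    totals = defaultdict(int)
    for source, destination, cents in transfers:
        totals[source] -= cents       # money flows out of the source...
        totals[destination] += cents  # ...and into the destination
    return dict(totals)


# Usage: a paycheck arrives, then some of it moves on.
ledger = [
    ("employer", "checking", 200000),
    ("checking", "landlord", 90000),
    ("checking", "wallet", 4000),
]
result = balances(ledger)
```

Since every transfer leaves one account and enters another, the balances across the whole system always sum to zero, which is the conservation law I had in mind with the heat analogy.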

This flow is becoming particularly interesting with the growth of currency-less transactions like ACH. If you get direct deposit, you use ACH. In the future, there will be no paper or coin currency. We'll simply transfer funds between accounts with smartphones or other devices. There are many fascinating consequences of the death of currency. For instance, if governments do not have to pay the cost of printing money, the cost of transacting will be borne by retailers in the form of transaction fees. Someone will also need to bear the cost of information theft when you lose your phone.

$0.01, spent at all places and times
But I have a really crazy thought. If money becomes infinitely liquid, won't its velocity increase infinitely, thus increasing the money supply infinitely, and raising the cost of everything infinitely? I wonder if the laws of thermodynamics will continue to hold as currency becomes digitized. With real-time web services and other technologies that take us away from daily batch file ETL common to financial systems, we increase the liquidity of money with consequences that are not yet clear. Instead of rates of flow, we may have currency that is in all accounts at all times, much like the Heart of Gold's Infinite Improbability Drive. But I suppose I shouldn't borrow serious thoughts from Douglas Adams.
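One hedged way to frame the thought is the textbook equation of exchange, which I'm borrowing from economics rather than from anything above. Here M is the money supply, V is the velocity of money, P is the price level, and Q is the real volume of transactions.

```latex
\[
  M V = P Q
  \qquad\Longrightarrow\qquad
  P = \frac{M V}{Q}
\]
```

Under the simplifying assumption that M and Q stay fixed, P grows in proportion to V. In that identity it is the price level, rather than the money supply itself, that absorbs a rise in velocity.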