the Garden of Forking Paths: August 2011

Saturday, August 27, 2011

The Data Revolution

If you haven't read McKinsey Global Institute's report on Big Data (or at least the executive summary), let me try to convey how important a role data will play in the coming years. Every application has to make use of data, and many industries have built up a lot of data over the last few decades. The amount of data stored is expected to continue to grow at an astonishing 50% per year as storage space becomes cheaper and as our ability to access it becomes easier with cloud technologies. Growth rates are predicted to be 20% for structured data (databases) and 60% for unstructured data (documents, messages, pictures, videos). Our ability to store, architect, and analyze data will impact every major industry, not just Wall Street. Consider a few examples.

TaKaDu, an Israeli tech firm, has used data that already exists to better predict leaks in water mains, as well as diversions of the water supply. No new sensors are needed--just better analysis of the data at hand. As water struggles become more intense over the coming years, more efficient use of water is key.

The legal world is being changed through artificial intelligence and data mining as well. Blackstone specializes in e-discovery and provides a cheap alternative to expensive law firms and their armies of first- and second-year lawyers. Why pay people to look at thousands of documents when computers can scan millions?

One of the reasons Tesco has dominated the British groceries market is its ability to understand its customers. It does this through collecting information about purchases and then providing price adjustments and customized discounts. Individual stores are tailored according to the demographics of their customers.

Another interesting example involves policing. As funding is cut, police need to be more intelligent about where they patrol. Smart Policing helps police analyze trends to predict where crime will occur in the future, so less time is needed on the beat.

The data revolution will impact health-care, government, marketing, education--you name it. And, of course, all businesses are working internally to deal with the data deluge. Coworkers need to be able to share documents in a way that is productive. Products like SharePoint and MarkLogic have become increasingly popular to deal with such unstructured data.

The exponential increase in data and the need to wrangle it will be so great that we will struggle to cope for years. MGI predicts that by 2018 there will be a shortage of 140,000 data professionals in the US alone. It's not just DBAs and architects, but philosophers, analysts, miners, mavens, and entrepreneurs that are needed. One reason that Google and Facebook have become so popular is that they help filter out the noise. I don't pretend to know where this is all going, but I can say I'm excited to be a part of the revolution.

Sunday, August 21, 2011

Metaprogramming in Ruby

I've started learning some Ruby, one of the most popular object-oriented languages today. In OO programming, data and actions upon data are bundled together in objects that represent real-world things, such as cars, financial transactions, or database connections. Ruby is remarkable because it allows you to change object definitions on the fly due to 1) its open classes (any class, even a built-in one, can be reopened and modified) and 2) the fact that method calls are resolved at run-time rather than fixed at compile time. Since binding occurs at run-time, you can attempt to call any method on any object. If the object does not support that method, Ruby calls the method_missing method instead, which itself can be overridden to interesting effect.
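
To make this concrete, here is a minimal sketch of both ideas. The RecordList class and its find_by_ naming scheme are hypothetical names I made up for illustration; they are not part of Ruby or any library. The class answers any find_by_<field> call through method_missing, and is then reopened to gain a new method.

```ruby
# Hypothetical class for illustration: it answers any find_by_<field> call.
class RecordList
  def initialize(records)
    @records = records # an array of hashes
  end

  # Called by Ruby when no regular method matches the call.
  def method_missing(name, *args, &block)
    if name.to_s.start_with?("find_by_")
      field = name.to_s.sub("find_by_", "").to_sym
      @records.find { |record| record[field] == args.first }
    else
      super # anything else still raises NoMethodError
    end
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_by_") || super
  end
end

# Open classes: reopen RecordList later and add a method to it.
class RecordList
  def size
    @records.size
  end
end

people = RecordList.new([{ name: "Ada", lang: "Ruby" }, { name: "Bob", lang: "Prolog" }])
puts people.find_by_name("Bob")[:lang]
puts people.size
```

Calling super for unrecognized names matters: without it, every typo would silently return nil instead of raising NoMethodError.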

Though Ruby does not support multiple inheritance, you can alter classes dynamically by extending them with modules. These modules are called mixins, since you can mix them in whenever you want. This allows for amazing flexibility and an advanced programming technique called metaprogramming, or the programming of programs by programs.
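
A small sketch of mixins, using made-up Vehicle, Car, Drivable, and Floatable names: include adds a module's methods to every instance of a class, while extend adds them to a single object at run-time.

```ruby
# Hypothetical modules and classes, for illustration only.
module Drivable
  def drive
    "#{name} is driving"
  end
end

module Floatable
  def float
    "#{name} is floating"
  end
end

class Vehicle
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class Car < Vehicle
  include Drivable # every Car can drive
end

car = Car.new("Sedan")
puts car.drive

# A single object can pick up a module at run-time with extend.
amphibious = Car.new("Amphicar")
amphibious.extend(Floatable)
puts amphibious.float
puts car.respond_to?(:float) # the other Car is unaffected
```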

Metaprogramming sounds esoteric, but it is particularly useful when you're designing classes that need to have dynamic metadata. For example, Rails' Active Record library implements object-relational mapping (ORM), creating wrappers for database objects. A table, view, or stored procedure can be accessed with standardized methods that are generated automatically, once a database connection is established, according to the database object being instantiated.
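
Here is a minimal sketch of the generation step on its own. This is not how Active Record is actually implemented; TableWrapper, for_table, and the hard-coded SCHEMA hash are all hypothetical stand-ins (a real ORM would read the column names from the database). It uses define_method to create one reader and one writer per column.

```ruby
# Hypothetical wrapper: builds a class whose accessors match a table's columns.
class TableWrapper
  # Assumed stand-in for a schema lookup; a real ORM would query the database.
  SCHEMA = { "SalesOrders" => %w(sale_id product_id date_sold) }

  def self.for_table(table_name)
    columns = SCHEMA.fetch(table_name)
    Class.new(self) do
      columns.each do |col|
        define_method(col) { @row[col] }                        # reader
        define_method("#{col}=") { |value| @row[col] = value }  # writer
      end
      define_singleton_method(:columns) { columns }
    end
  end

  def initialize(row = {})
    @row = row
  end
end

sales_order_class = TableWrapper.for_table("SalesOrders")
order = sales_order_class.new("sale_id" => 1101421, "product_id" => 15981923)
order.date_sold = "11/4/2006"
puts order.sale_id
puts order.date_sold
puts sales_order_class.columns.inspect
```

If a column were added to the schema, the generated class would pick up the new accessors without any change to the wrapper code.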

I put together some (unfinished) code that shows how this might work. The DBObject class includes the BuildDBObjects module, which hooks in the BuildIncludes module; its constructor then mixes in the appropriate module for the requested type. If you construct a table wrapper, only the BuildTable module is mixed in. Each database type could use the same wrapper properties: metadata, data, and name.

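module BuildDBObjects
    # When a class includes this module, also give it the constructor below.
    def self.included(base)
        base.send(:include, BuildIncludes)
    end

    module BuildIncludes
        # Mix in only the module that matches the requested database object type.
        def initialize(dbtype, name)
            @name = name
            case dbtype
                when "table"
                    extend BuildTable   # adds define_table to this object only
                    define_table(name)
                when "view"
                    extend BuildView
                    define_view(name)
                    #...
            end
        end
    end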

    module BuildTable
        def define_table(name)
            # query the database...
            @metadata = %w(SaleID ProductID DateSold)
            @data = %w(1101421 15981923 11/4/2006)
        end
    end

    module BuildView
        #...
    end

    attr_accessor :metadata, :data, :name
end

class DBObject
    include BuildDBObjects
end

tbl = DBObject.new("table","SalesOrders")
puts tbl.name
puts tbl.metadata
puts tbl.data

This is metaprogramming, since the code itself writes the class definition for each instantiation of the DBObject class. The great thing about using metaprogramming to implement object-relational mapping is that your classes can change as your database schema changes. This cuts down on the amount of code you might have to write, depending on the way you're accessing data. (Of course, you should create a data access class that uses the ORM in order to decouple the database schema from the application code--otherwise your application code might break with the slightest changes to the database.)

Though this example isn't exactly esoteric, it's probably not something you're going to do every day. It also shows a downside to metaprogramming: you have to write code that is meant to be read by computers. That means it might not be particularly readable by humans. One of the many reasons people like Ruby is that it is very programmer-friendly. It's very easy to read while at the same time cutting down on a lot of "extra" code, like class accessors. Since metaprogramming is often less easy to read and understand, it's often more difficult to maintain.

A few links:
-Paolo Perrotta's book dedicated to metaprogramming Ruby, which contains an extended look at Active Record
-Ruby's core API
-Programming Ruby Pragmatic Programmer's Guide (2001)
-A good mixin tutorial

Sunday, August 14, 2011

Pragmatic Programming

I'm a big fan of Andy Hunt and David Thomas's Pragmatic Programmer (2000) and the series of books they publish at the Pragmatic Bookshelf. You might think this is because I wrote my dissertation on John Dewey, a famous member of a group of philosophers called the American Pragmatists. To be sure, philosophical pragmatism has a lot in common with what we commonly mean by the word "pragmatic." For instance, the Pragmatists thought traditional philosophical questions were abstracted from the problems they were really meant to solve. Philosophers asked "What is the beautiful?" without working with artists who were trying to make more beautiful art. They asked "What is the good?" apart from the moral dilemmas of actual people. And they tried to define "What is truth?" in abstraction from rival truth claims.

In short, philosophical pragmatism eschews head-in-the-clouds questions and focuses on solving problems. Any reflection that is not grounded in solving a problem risks being disconnected from practical consequences. It might even make us worse at solving problems by narrowing our thinking with theories that are untested by experience.

Though there are dangers to pure theory, I find most technical books (and blogs) to slide too far to the other side of the spectrum between theory and practice. These books are often written in the form: "If you want to do X, do Y." If you want to build an index on a table, here is the command. If you want to inherit one class from another, use this syntax. If you want to create a web service, you must create this kind of a connection. Such books are often so focused on the trees that you never see the forest. They are pragmatic in the sense that they don't give you a bunch of high-falutin' theory that you'll never use, but they aren't really pragmatic in the sense of helping you solve real-world problems.

Why is that? Technical books are often useless because they assume you know what the problem is, and anyone involved in software development for long knows that this is rarely the case. If I know that I need to create an index on a table, I can read a tutorial on creating indexes. But maybe what I really need is to write better queries. Or perhaps I need to normalize my table architecture and make more use of clustered indexes rather than throwing non-clustered indexes at performance problems. I might be using a relational database when I should be using a document store. It could be that my company's code review process needs to be improved so that issues like this are handled by database programmers and not our support staff. Perhaps performance isn't even what I should be worrying about, and the biggest problem facing my company is writing more robust code. Once you start to delve into even the simplest of problems, you can open a Pandora's box of other questions that change your understanding of the matter at hand. The reason I like the pragmatic programming books is that they go beyond the nuts-and-bolts approach and provide us with concepts to help us better prioritize and deal with the problems we actually face. In an ideal world, everyone would write perfect code. But in the real world, we are always in the process of improvement, and we need conceptual tools to determine what is most in need of improving.

One of the best recent examples I've seen of the kind of truly pragmatic conceptual tools I'm talking about is Brent Ozar's hierarchy of database needs. We all know that our databases could be more secure, robust, and responsive, but what should we focus on now? Well, if we aren't backing them up, we need to create a maintenance plan ASAP. Then we can worry about security, and so on. The point here is not that Ozar's hierarchy is perfect. As he himself says, the problem with doing backups isn't that we don't know we should be doing backups (I hope); the real problem is that the technology and business groups haven't worked together to prioritize what is most important. And a major cause of such a communication breakdown is the lack of a vocabulary to describe the problem the teams face.

I haven't seen a hierarchy of generalized software development needs, but Hunt and Thomas provide many tools to help you and your team develop your own. The base of the pyramid might be having a version control system, and then a ticketing system could come next. In the middle we'd want to make sure our code is orthogonal, or highly de-coupled. Finally at the top we could turn to documentation. Priorities will vary from place to place and from time to time, but without a language to talk about them, we cannot intelligently decide what to do next.

Sunday, August 7, 2011

You Can't Always Get What You Want

I'm slowly working my way through Bruce Tate's Seven Languages in Seven Weeks, starting with Prolog. Prolog is a purely declarative language, meaning that you tell it what you want rather than how to get it. This is a completely different paradigm from that of mainstream languages like Java or C++, which grew out of the imperative paradigm. A Java or C++ program consists of a series of steps that must be executed. A Prolog program is defined by a set of facts, a set of rules, and queries upon an abstract world outlined by rules and facts. Working in Prolog has stretched the limits of my understanding of a 'program.'

Though the basics are simple, the trick in programming Prolog is mastering recursion. One or two lines have a great deal of power. For example, you could define a predicate to reverse a list with the following (the second argument is an accumulator, so a query such as reverse([1,2,3],[],R) binds R to [3,2,1]):

reverse([X|Y],Z,W) :- reverse(Y,[X|Z],W).
reverse([],X,X).

You can quickly see how powerful this paradigm is for a certain set of problems. For example, Tate's program to solve Sudoku puzzles is only about 20 lines long. It simply describes the relationship between the squares on an abstract board. After you declare a board, Prolog takes care of the rest.

Once I understood Prolog's power, I immediately thought of a puzzle invented by Douglas Hofstadter. It has come to be called the MU puzzle, and it can be found in his book Gödel, Escher, Bach. Basically, there are four rules which involve the manipulation of strings composed of the characters M, I, and U. The question is whether or not these rules can be used to turn the string MI into MU. It took me some time to come up with a solution, and I'll admit I had to check mine with another one I was able to find. (Note: lower case letters are literals; upper case are variables.)

% rule 1: add u to the end of any string ending in i
rule1([i], [i,u]).
rule1([H|X], [H|Y]) :- rule1(X, Y).

% rule 2: anything after m can be doubled
rule2([m|X], [m|Y]) :- append(X, X, Y).

% rule 3: replace any iii with u
rule3([i,i,i|X],[u|X]).
rule3([H|X], [H|Y]) :- rule3(X, Y).

% rule 4: remove any uu
rule4([u,u|X],X).
rule4([H|X], [H|Y]) :- rule4(X, Y).

% the setup
map(X,Y,0) :- X=Y.
map(X,Y,1) :- rule1(X,Y).
map(X,Y,2) :- rule2(X,Y).
map(X,Y,3) :- rule3(X,Y).
map(X,Y,4) :- rule4(X,Y).

% the solution
solve([m,i],[m,i],0).
solve(X,Y,N) :- map(Z,Y,N), solve(X,Z,M).

Prolog's strength is in abstracting the implementation of the code from the world it defines, but this can also be its weakness. The problem with the MU puzzle, and with many other problems, is that Prolog's depth-first search will recurse infinitely along the left branch of the tree, since there is no way to create MU from MI using the first rule. For this reason, my program results in a stack overflow. (The code linked above limits the depth of recursion.) Even if your recursive stack eventually bottoms out, a complicated program could churn for hours and simply return the result: no.
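
To show what limiting the depth buys you, here is the same puzzle sketched in Ruby rather than Prolog. This is my own illustration, not the linked solution: the successors and derivable? helpers are hypothetical names, and the search is breadth-first with a fixed depth bound, so it always terminates. Strings are written in upper case, as in Hofstadter's book.

```ruby
require "set"

# Return every string reachable from s by applying one rule once.
def successors(s)
  out = []
  out << s + "U" if s.end_with?("I")                  # rule 1: xI -> xIU
  out << "M" + s[1..-1] * 2 if s.start_with?("M")     # rule 2: Mx -> Mxx
  (0..s.length - 3).each do |i|                       # rule 3: III -> U
    out << s[0...i] + "U" + s[i + 3..-1] if s[i, 3] == "III"
  end
  (0..s.length - 2).each do |i|                       # rule 4: drop UU
    out << s[0...i] + s[i + 2..-1] if s[i, 2] == "UU"
  end
  out.uniq
end

# Breadth-first search that gives up after max_depth rule applications.
def derivable?(start, goal, max_depth)
  return true if start == goal
  seen = Set.new([start])
  frontier = [start]
  max_depth.times do
    frontier = frontier.flat_map { |s| successors(s) }
                       .reject { |s| seen.include?(s) }
                       .uniq
    return true if frontier.include?(goal)
    seen.merge(frontier)
  end
  false
end

puts derivable?("MI", "MUI", 3) # MI -> MII -> MIIII -> MUI
puts derivable?("MI", "MU", 6)
```

A false here only means that no derivation exists within the bound. The bounded search cannot prove on its own that MU is unreachable at every depth, though it does at least return an answer instead of overflowing the stack.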

It's hard for me to think of many business applications for which Prolog would be useful besides scheduling. However, it has helped me understand SQL better. All the books say that SQL is a declarative and not imperative language, but I'm not sure I truly understood what this meant until now. SQL is the most widespread declarative language today, and it shares many of the same advantages and pitfalls as Prolog. It's very easy to query a database, but it is just as easy to do so in a horribly inefficient way. Without understanding database architecture--such as indexes, execution plans, memory management, and data files--you are likely to make major mistakes. For this reason, it is difficult to completely abstract what you want from how the program provides it for you.

SWI-Prolog does have an ODBC interface that I will have to experiment with some day. Until the data I'm working with has relationships that could be explored recursively, however, I'll probably stick with SQL.

A few links:
-Download GNU Prolog or SWI-Prolog
-Check out some detailed tutorials
-An excerpt from Tate's book