Sunday, May 5, 2013

Data Science and Its Uses

I wanted to put together a collection of interesting articles and studies in the emerging field of data science.  For your enjoyment:

First, what is data science?

McKinsey had a nice report in 2011 on big data.  It predicts the United States alone will need 140,000 to 190,000 additional data scientists by 2018.

Everyone has heard that companies like Facebook and Google make money off your data, but it's hard to know what that means.  It became all too clear when the New York Times explained how Target knows you're pregnant.

Nate Silver has correctly called the last two presidential elections--the most recent by getting all 50 states right.  This seems astounding, though he has claimed, “The bar set by the competition was invitingly low.  Someone could look like a genius simply by doing some fairly basic research.”

Google used search data over the last year to determine flu trends.  However, since their estimates were based on search terms, and the flu was over-hyped by the media, their numbers came out a little high.  Still, Google predicted the surge two weeks before the CDC.

Science--now eScience--has also been changed by data analytics.  Check out the Large Synoptic Survey Telescope, which creates 40 TB of data per day.  Jim Gray has called this the Fourth Paradigm for science, after experimental (18th C.), theoretical (19th C.), and computational (20th C.) paradigms.

Data can come from anywhere.  Some analysts are even mining recipes in order to see how flavor compounds vary by region.  What do you do with this information?  In principle, it helps us understand why certain foods pair well together.

Social media sites like Twitter provide a wealth of data about how people are feeling about just about anything.  Tweets can even be used to predict the stock market.

Even the Humanities have gone the way of big data.  Check out this analysis of emotions in 20th-century books.  Perhaps analytics will dethrone critical theory and identity politics.

No list would be complete without some cautionary tales.  It's not exactly data science, but it's important to recall Knight Capital's faulty trading algorithm, which cost $440 million in seconds.

And many, like Nassim Taleb, worry that big data can show correlations between anything.  The more data, the more false positives.  More data isn't always better.  Start small.
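
To see why more data can mean more false positives, here is a minimal sketch in Python.  It fills a table with pure random noise and counts how many pairs of columns look correlated anyway.  The function names, the 30 samples per column, and the 0.4 cutoff are my own choices for illustration, not anything taken from the worry described above.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(num_variables, num_samples=30, threshold=0.4, seed=0):
    """Count pairs of independent noise columns with |r| above the threshold.

    Every column is drawn independently, so any "hit" is a false positive.
    The sample size, threshold, and seed are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(num_samples)]
        for _ in range(num_variables)
    ]
    hits = 0
    for a, b in itertools.combinations(columns, 2):
        if abs(pearson(a, b)) > threshold:
            hits += 1
    return hits


# The number of pairs grows roughly with the square of the number of
# variables, and the number of chance "correlations" grows along with it.
for k in (5, 20, 100):
    pairs = k * (k - 1) // 2
    print(k, "variables,", pairs, "pairs,", count_spurious_pairs(k), "spurious hits")
```

None of these columns has anything to do with any other, yet the count of "significant" pairs climbs as you add variables, simply because there are more pairs to check.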

If you want to learn more, I recommend Bill Howe's Coursera class.  Make sure you know some Python.


  1. Hi Zach, are you following the Coursera course too?

    --David Dai

  2. Yeah, I'm working on the homework right now!

    1. Nice, I may discuss the resolution with you sometime:)


