Useful Data

But is the information useful, monkey? Is it?!
In a recent attempt to "cheat" with science while creating a March Madness bracket, I rediscovered FiveThirtyEight, the online data-geek candy store and brainchild of renowned statistician Nate Silver. In another incredible display of sexy stats, Silver and company have assembled a robust statistical model of the NCAA men's basketball tournament, adding evidence to the thesis that the geeks shall inherit the earth.

While the information on FiveThirtyEight is beautifully presented and likely as accurate as one will get, I began to wonder: is this useful? More broadly, what makes information useful?

I don't want to get too hung up on what "useful" means, but for my purposes I'd like to define useful as enabling better performance: greater accuracy, greater speed, or higher success rates in some activity.  What feature of data would make it useful?

Here's my opinion: for information to be useful, it must be actionable. In other words, for information to enable better performance or higher success rates, it must inform which actions should, or should not, be performed.

A great example of this type of information for me has been heart rate data during exercise.  When I run according to my heart rate training plan, I know that I'm working too hard on my easy day when my heart rate goes above some threshold.  This is actionable information.  My heart rate is too high so I change my behavior and slow down.

For the FiveThirtyEight March Madness predictions, the information is sort of actionable. The FiveThirtyEight bracket is structured as probabilities of a team winning at each stage of the tournament. This information is useful if I'm betting on a game (i.e. which team is likely to win) but isn't useful if I'm trying to make a bracket (i.e. which teams are most likely to be in each slot of the bracket).

The lesson here is that the structure of the information should match the decision to be made. In the example of my heart rate data, my current heart rate is only useful in the context of a threshold. I must know how my current data point relates to some useful scale. Only then can I take action to bring my individual measure back into range.
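To make the heart rate example concrete, here is a minimal sketch of what "actionable" looks like in code. The function name, the readings, and the 145 bpm ceiling are all made up for illustration; they don't come from my training plan or any real one. The point is only that a raw number becomes an instruction once it is compared against a threshold.

```python
def easy_day_advice(current_bpm: int, ceiling_bpm: int) -> str:
    """Turn a raw heart rate reading into an action for an easy run.

    Both arguments are in beats per minute. The ceiling is whatever
    threshold the training plan sets for an easy day (hypothetical here).
    """
    if current_bpm > ceiling_bpm:
        return "slow down"
    return "hold pace"


# Hypothetical easy-day ceiling, chosen only for this example.
EASY_DAY_CEILING_BPM = 145

# Hypothetical readings taken during a run.
for reading in (132, 144, 151):
    print(reading, "bpm:", easy_day_advice(reading, EASY_DAY_CEILING_BPM))
```

Without the ceiling, 151 is just a number. With it, 151 means "slow down."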

However, generating actionable data is incredibly complicated because it requires a solid understanding of the mechanisms that explain a phenomenon. In the case of heart rate training, heart rate is a well-established proxy for intensity, and systematically modulating intensity is important to balance improvements in fitness against risk of injury. Creating actionable data is also complicated by the need to understand how it will be used (i.e. betting on a game vs. making a bracket). For these reasons, generating data without a solid theory to back up action is no better than rock collecting.

A related piece on FiveThirtyEight highlights an interview with White House Chief Data Scientist Dr. D.J. Patil. A topic of discussion that caught my attention was that of "Data Products," which Dr. Patil explains as "How do you use data to do something really beneficial?" The true obstacle to creating powerful data products (the dream of Big Data) isn't access to data or data processing tools, as these are now as ubiquitous as the internet and the personal computer, respectively. Instead, the obstacle to data products, or "useful data" as I'm calling it, is creating a solid theory about the mechanism driving a phenomenon. Without this understanding, data remains noise.

For these reasons, while I am as enamored of sexy data as the next dork, I am reminded of the need for good scientific theories that allow us to interpret and structure sexy data in a way that makes it useful. That is the real challenge for data scientists in the age of big data.
