Viktor Mayer-Schönberger, Kenneth Cukier: Big Data: A Revolution That Will Transform How We Live, Work, and Think

May 17, 2015

Recently a lot of people around me have been reading Big Data, and since it’s a nice buzzword, I was looking forward to reading about the topic myself. I already know a bit about it since the IT world is full of it, but there’s no harm in exploring a new perspective.

This book wasn’t it. The authors would get excellent grades for their ability to expand simple ideas worth twenty pages into a 250-page book without using exactly the same words over and over, while still saying just a few things. Their take on the matter isn’t original in any way. The topic is treated from a high-level perspective: the authors don’t go into details (when they actually mention MapReduce and Hadoop, the phrasing is very misleading) and just skim the surface, picking well-known examples from the news. The main problem with well-known examples is that, well … everyone knows them. I was glad there were some minor cases I hadn’t heard about, but generally everyone knows about Google, Facebook, Amazon, UPS etc.

One point I do appreciate (though not the fact that it gets repeated on every page) is the emphasis on completeness of the data, rather than volume, in defining big data. I hadn’t really thought about it before, but it makes sense. Statistical sampling was developed to make gathering and processing data feasible in a world where completeness was cost-prohibitive, whereas many domains can now, with the right tooling, benefit from vast or even complete data collection. Complete data hopefully provides more insights and, especially, more opportunities for fine-grained analyses.
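To make the fine-grained point concrete, here is a minimal sketch with entirely made-up numbers (100,000 synthetic records spread evenly over 1,000 segments, and a sample of 500). None of this comes from the book. It only illustrates that a sample which is fine for overall estimates can leave most small segments with no data at all.

```python
import random
from collections import Counter


def records_per_segment(records):
    """Count how many records fall into each segment."""
    return Counter(segment for segment, _value in records)


def build_population(n_records=100_000, n_segments=1_000, seed=0):
    """Synthetic data: (segment id, value) pairs, spread evenly over segments."""
    rng = random.Random(seed)
    return [(i % n_segments, rng.random()) for i in range(n_records)]


population = build_population()
# Hypothetical sample size; a seeded generator keeps the run repeatable.
sample = random.Random(1).sample(population, 500)

full_counts = records_per_segment(population)
sample_counts = records_per_segment(sample)

print("segments in the full data:", len(full_counts))
print("records per segment in the full data:", min(full_counts.values()))
print("segments that appear in the sample:", len(sample_counts))
```

With 500 sampled records, at most 500 of the 1,000 segments can show up at all, so at least half of them cannot be analysed from the sample, while the complete data has 100 records for every segment.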

The middle section of the book deals with changes in the approach to extracting knowledge from data. The authors celebrate correlation as the winner and throw away the search for causation. This idea really grates on me. It’s true that non-obvious correlations might be sufficient for acting within a system (when visitors come to this page more than to that one, a spike in demand for product X often follows), but they’re in no way a substitute for understanding the causes. However, my aversion might be just an instinctive, status-quo-perpetuating reaction, and the authors could be right. Take medicine, for example: knowledge of a strange correlation is better than a lack of knowledge or an unsubstantiated habit; after all, the term evidence-based medicine even has to exist.

My hope is that causal understanding stays the ultimate goal, while the big data approach is used as a tool for generating ideas on how to reach that goal. This would require a delineation between known facts (certain, given some conditions etc.), fairly well-known facts, strange correlations, lazy common sense, habits, feelings and politics. Since we’re bad at distinguishing between these even without big data correlations, it’s doubtful that adding another piece to the mix would suddenly improve the situation.

Similarly, I’m skeptical about the data protection scheme the authors sketch out. Even though my personal beliefs are liberal, I find it hard to believe that governments will create any reasonable legislation that would both protect personal information and not hinder innovation (in a significant way). Even the idea of a natural development of personal protection agents, who would serve as middlemen between individuals and corporations and negotiate the terms and rules, seems far-fetched to me. Having seven billion people individually negotiating with (let’s say) hundreds of thousands of companies is impractical, but adding millions of agents between them doesn’t help much. The total number of connections is still way too high, and the agents would benefit from merging, eventually being matched 1:1 (or even fewer) to data users. An effective way to protect personal data would be to encourage users to use tools and technologies that anonymize their online activity (either completely or by creating multiple separate personas) and possibly to require data users to respect those boundaries. This would give the power to decide back to individuals, while enabling data users to collect as much data as they want or need.
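A back-of-the-envelope count for the agent scheme above, as a rough sketch. The figures are my own assumptions for illustration: 300,000 companies for "hundreds of thousands", 2,000,000 agents for "millions", and a simple model in which each person uses exactly one agent and every agent deals with every company.

```python
def direct_links(people, companies):
    """Every person negotiates with every company."""
    return people * companies


def brokered_links(people, agents, companies):
    """Each person has one agent; every agent deals with every company."""
    return people + agents * companies


PEOPLE = 7_000_000_000
COMPANIES = 300_000  # assumption standing in for "hundreds of thousands"
AGENTS = 2_000_000   # assumption standing in for "millions"

print(direct_links(PEOPLE, COMPANIES))               # 2100000000000000
print(brokered_links(PEOPLE, AGENTS, COMPANIES))     # 607000000000
# Agents merged until there is roughly one per company:
print(brokered_links(PEOPLE, COMPANIES, COMPANIES))  # 97000000000
```

Under these assumptions the agents cut the count by several orders of magnitude, yet it still sits in the hundreds of billions, and it keeps shrinking as agents merge, which is the incentive I mean.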

All in all, I had a lot of problems with this book. The number of new ideas was very low, and the book could be compressed to 10% of its actual length by leaving out the boring repetition. As for the content, many of the authors’ attitudes look overly optimistic or naive to me. If you’re in the market for big data literature, try looking elsewhere.