The 3 Fallacies of Big Data

Article by Rob Hawken | Published on June 27, 2017

The term Big Data has been getting a lot of airplay, but it’s neither new nor needed by every business. In this blog, I provide some background and look at the three fallacies of Big Data.

Definition:

According to Gartner: 

"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." 

If I were to summarise the popular understanding, it would be this: the sudden explosion in the quantity of data caused by the internet and associated technologies can only be processed and understood using new and complex techniques and software.

But it’s not new

Back in 1890, U.S. Census data was captured on punched cards using an electromechanical tabulating machine invented by Herman Hollerith. This resulted in approximately 50 million data records being created. For a more recent and personal example: in 1999, for a wholesale telco, we were loading 3 million call records a night (roughly 1,000 million records a year) into a data warehouse. Given the relative size difference between Telecom NZ and our client, Telecom would have been doing an order of magnitude more than that. 

Vendor Marketing Driven

A significant portion of the promotion and hype around Big Data is being driven by vendors whose basic pitch consists of ‘the use of our technology will provide you with instant insight’. This is understandable given vendors are continually looking for new ways to package existing products and services and develop new offerings. 

The challenge is to separate the hype and marketing fluff from the actual business value that is delivered.

The use of an expensive tool set is no guarantee of a good outcome. What you do with the data matters more than which tool you use. As an example, the telco data mentioned above was processed using flat files and UNIX scripting, with no fancy tools or costly software… and it worked!

Common Fallacies

There are several false premises behind the common stories associated with Big Data. 

1. Quantity over Quality?

The first of these is the idea that the sheer quantity of data overrides any quality issues there may be with it. To expand on this: the premise is that while some data will be of dubious quality, the sheer quantity of good data will prevent it from adversely affecting any conclusions. This is a particularly insidious premise, as it removes the requirement to assess and understand the data being used, to develop strategies to control for errors and biases, and generally to ensure that the data is fit for purpose and will not produce misleading answers. An excellent example of this is the 1936 US Presidential Election prediction made by The Literary Digest. It was based on a postal opinion poll of 2.4 million people, which predicted a win for the Republican candidate Alfred Landon by 57% to Roosevelt’s 43%. In fact, Roosevelt won by 61% to Landon’s 37%.
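The Literary Digest story can be sketched as a small simulation. The numbers below are illustrative assumptions, not the actual 1936 figures: when response rates differ between supporters and opponents, a huge poll produces a badly skewed estimate, while a far smaller random sample lands close to the truth.

```python
import random

random.seed(42)

# Hypothetical electorate of 200,000 voters, 61% of whom support candidate A
# (1 = supports A, 0 = does not). The 61% figure mirrors the 1936 result;
# everything else is assumed for illustration.
N = 200_000
TRUE_SUPPORT = 0.61
population = [1 if random.random() < TRUE_SUPPORT else 0 for _ in range(N)]

# Biased mega-poll: assume supporters of A respond at 40%, opponents at 80%
# (e.g. because the sampling frame over-represents one group).
biased_sample = [v for v in population
                 if random.random() < (0.4 if v else 0.8)]

# Small but properly random poll: 1,000 respondents drawn uniformly.
random_sample = random.sample(population, 1000)

biased_est = sum(biased_sample) / len(biased_sample)
random_est = sum(random_sample) / len(random_sample)

print(f"True support:  {TRUE_SUPPORT:.0%}")
print(f"Biased poll ({len(biased_sample):,} responses): {biased_est:.0%}")
print(f"Random poll (1,000 responses): {random_est:.0%}")
```

Despite being two orders of magnitude larger, the biased poll predicts the wrong winner, while the 1,000-person random sample stays within a couple of points of the truth.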

2. Correlation without Causation

The second false premise is relying on correlation without requiring or understanding causation. To simplify, this is looking for statistical patterns in the data. It is inherently attractive because computers are great at finding patterns in data, while understanding and testing what causes those patterns is best done by an expert in that area (until we have true AI, anyway). If I were going to be cynical, I’d say this approach is a bit like someone who relies on lucky underpants because the first time they wore them they won Lotto; therefore it was the underpants that caused the win! The original poster child for this approach was Google Flu Trends, which used search term data to predict the number of real-life flu outbreaks. This worked well for a few years but then began consistently overestimating the level of flu cases. On investigation, it was identified that because the term ‘flu’ was in the news, more searches were made, which caused the underlying algorithm to overestimate the number of cases. This is not to say the approach is entirely without merit, but it requires the input of subject area expertise to ensure the results are robust and repeatable. (NB: an excellent example of this is the work done with the Google Flu Trends data since the service’s demise.)
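The underlying trap can be demonstrated in a few lines: two series that share no causal link, only a common upward trend over time, will show a near-perfect correlation. The series, slopes and noise levels below are all made up for illustration.

```python
import random
import statistics

random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

# Two completely unrelated quantities that both happen to trend upward
# over 100 time steps (hypothetical units): no causal link, just a trend.
n = 100
series_a = [0.5 * t + random.gauss(0, 3) for t in range(n)]
series_b = [1.2 * t + random.gauss(0, 5) for t in range(n)]

r = pearson(series_a, series_b)
print(f"Correlation of two unrelated trending series: {r:.2f}")
```

The correlation comes out well above 0.9, yet neither series has anything to do with the other; a subject matter expert would spot the shared time trend immediately, while a naive pattern search would not.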

3. Source Bias

The third systemic issue that needs to be understood with Big Data is source bias. The issue here is that where you source the data from is likely to influence the outcomes of any analysis done using that data. As an example, Twitter users are mostly young (75% between 15 and 25), urban or suburban, and from the US (51%).

http://www.pewinternet.org/2015/08/19/mobile-messaging-and-social-media-2015/2015-08-19_social-media-update_11/

http://www.beevolve.com/twitter-statistics/#a1

So no matter the quantity of Twitter data you use, any results generated from it will be skewed by these inherent biases. 
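One standard way to control for source bias is to reweight the sample back to known population proportions (post-stratification). The sketch below uses entirely hypothetical numbers: a source that over-represents a young demographic inflates the raw estimate, and reweighting each group by its population share corrects it.

```python
import random

random.seed(1)

# Hypothetical population: 20% "young", 80% "older", with support for some
# policy assumed at 70% among the young and 30% among the older group.
pop_share = {"young": 0.20, "older": 0.80}
support   = {"young": 0.70, "older": 0.30}
true_rate = sum(pop_share[g] * support[g] for g in pop_share)

# A Twitter-like source over-represents the young: assume 75% of responses.
sample = []
for _ in range(50_000):
    group = "young" if random.random() < 0.75 else "older"
    sample.append((group, 1 if random.random() < support[group] else 0))

# Raw estimate: just average everything, ignoring the skewed source.
raw = sum(v for _, v in sample) / len(sample)

# Post-stratification: average within each group, then weight each group's
# mean by its known share of the real population.
by_group = {g: [v for grp, v in sample if grp == g] for g in pop_share}
reweighted = sum(pop_share[g] * sum(by_group[g]) / len(by_group[g])
                 for g in pop_share)

print(f"True rate:    {true_rate:.0%}")
print(f"Raw estimate: {raw:.0%}")
print(f"Reweighted:   {reweighted:.0%}")
```

The raw average lands far above the true rate because the source skews young; the reweighted figure recovers it, but only because we knew the real population shares, which is exactly the subject knowledge a naive Big Data approach skips.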

We need insight

What organisations and decision makers need is insight into how and why their organisations are performing. Simply pattern matching over Big Data is not insight, and neither is automated machine learning that does nothing more than look for patterns. Generating insight requires an understanding of the subject area, the business processes and the source data. That would be true AI, and nothing yet approaches what a skilled person can do.

What we want is to provide the right information, to the right people, in the right format, at the right time, regardless of the source and technology.


Rob Hawken is the General Manager for DATAMetrics Business Intelligence and Data Services, based in Christchurch, New Zealand. He has worked with a wide range of New Zealand and international businesses since 1995 and is a master of BI architecture, design, build and support of data-driven solutions.