Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to getting bigger. Big Data will soon become the key driver for almost any kind of decision that is to be made in manufacturing, retail, finance all the way to astronomy, oceanography etc. The common aspect of all industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the technique to analyze such large volumes of data.

Big Data represents the technique to handle the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text. tweets, click streams, html) flow through with tremendous volume & velocity. The key aspects of data in the world are the volume, variety and the velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable the decision makes to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes is multiple forms from click streams, tweets, html, texts, CSVs, structured and non –structured data.

The ability to detect patterns, determine trends, classify, identify outliers is no easy task

In this post I try to take a philosophical look at Big Data and ask whether it can really help us. Will it help or will take is on wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data. Big Data deals with data that is in the order of Petabytes to Exabytes. At this scale with multiple features our cognitive abilities are of no use. We must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions. Beyond that the data can hardly make any sense.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results. For example let us take the following case

An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way but what is more appropriate is that we have 2 shapes, the circle & the rectangle. A machine algorithm would try to work based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing because the amount of data is beyond our cognitive capabilities,

In other words, Big Data is full of perils and pitfalls.

When we let the machine to analyze on our behalf the possibility of coming to a wrong conclusion is fairly high. This coupled with the fact that we are sometimes led to erroneous judgments, as discussed below, the problem is further compounded.

In his book “Thinking fast, thinking slow” Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking. We come to wrong conclusions. Also our minds tend to detect patterns in data where there are none. Sometimes according to Kahneman ‘randomness appears as regularity or a tendency to cluster’. Also he says ‘the tendency to see patterns in randomness is overwhelming’. We could argue that in Big Data it is the algorithm that is determining the pattern we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none. Occasionally we fail to see the obvious.

In the ‘famous gorilla experiment’ the researchers tried to assess selective attention. The participants are asked to count the number of passes those in white t-shirts make. Surprisingly a large number of the participants were complete oblivious a gorilla that appears midway in the video. When we, as human fail to see such large objects, can we expect the machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in determining false positives for e.g. the Bonferroni correction. Simply put the Bonferroni correction tries to determine the possibility of getting at least 1 significant result when one is testing 20 hypothesis simultaneously. If we want to test 20 hypotheses with the significance of 0.05 then the probability of at least 1 significant result is

P(at least one significant result) = 1 – P(no significant results)

= 1 – (1 – 0:05)^20

= 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result, even if all of the tests are actually not significant. This would be a false positive.

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!

Find me on Google+

This post is a continuation of my earlier post Big Data-1: Move into the big league:Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar and key constructs in R and SparkR. While this post just focuses on the programming part of R and SparkR it is essential to understand and fully grasp the concept of Spark, RDD and how data is distributed across the clusters. This post like the earlier post shows how if you already have a good handle of R, you can easily graduate to Big Data with SparkR

Note 1: This notebook has also been published at Databricks community site Big Data-2: Move into the big league:Graduate from R to SparkRNote 2: You can download this RMarkdown file from Github at Big Data- Python to Pyspark and R to SparkR