Data Deluge: The Value is in the Data (Wrangling)

Sunday, 5 November 2017

The Value is in the Data (Wrangling) - 2017

Far too often, in the current headlong rush to apply machine learning and predictive modelling to data, people forget the central paradox of data: it is impossible to assess whether a given piece of data is any good simply by inspecting the data alone.

This paradox is as true for a gigabyte of electronic records in a fancy database as it is for a single measurement of the weight of a bag of sweets in grams. It holds for the simplest application of the fundamental statistical question, "Compared to what?", as well as for the most complex, multi-layered questioning of a predictive model. Even weighing a bag of sweets is a comparison: what is its mass compared to a standardised unit of mass, the gram?

How to proceed?

Here is a superb, modestly written explanation by Daniel Haight of the reality of what is involved in real data analysis.

He describes 7 steps:

1. Gather data from inside and outside the firewall
2. Understand (and document) your sources and their limitations
3. Clean up the duplicates, blanks, and other simple errors
4. Join all your data into a single table
5. Create new data by calculating new fields and recategorizing
6. Visualize the data to remove outliers and illogical results
7. Share your findings continuously
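
To make the cleaning, joining, deriving and outlier-checking steps concrete, here is a minimal sketch in plain Python. It is my own illustration and not Haight's method: the bag-of-sweets records, the second "nominal weight" source, the function names and the 50% tolerance are all hypothetical, chosen only to show the shape of the work.

```python
from statistics import median

# Hypothetical raw records: (bag_id, weight in grams); a blank weight is None.
weights = [
    ("A1", 101.2),
    ("A1", 101.2),   # exact duplicate
    ("A2", None),    # blank
    ("A3", 99.5),
    ("A4", 100.4),
    ("A5", 1004.0),  # possibly a misplaced decimal point
]
# Hypothetical second source: the nominal weight printed on each bag, in grams.
nominal = {"A1": 100.0, "A3": 100.0, "A4": 100.0, "A5": 100.0}


def clean(rows):
    """Drop exact duplicates and rows with a blank weight."""
    seen = set()
    out = []
    for row in rows:
        if row in seen or row[1] is None:
            continue
        seen.add(row)
        out.append(row)
    return out


def join(rows, nominal_by_id):
    """Join both sources into a single table, keeping only matched ids."""
    return [
        {"bag_id": bag_id, "weight_g": w, "nominal_g": nominal_by_id[bag_id]}
        for bag_id, w in rows
        if bag_id in nominal_by_id
    ]


def add_ratio(table):
    """Create a new field: measured weight relative to nominal weight."""
    for rec in table:
        rec["ratio"] = rec["weight_g"] / rec["nominal_g"]
    return table


def flag_outliers(table, tolerance=0.5):
    """Flag ratios far from the median so a person can review them."""
    mid = median(rec["ratio"] for rec in table)
    for rec in table:
        rec["suspect"] = abs(rec["ratio"] - mid) > tolerance * mid
    return table


table = flag_outliers(add_ratio(join(clean(weights), nominal)))
suspects = [rec["bag_id"] for rec in table if rec["suspect"]]
print(suspects)
```

The sketch flags the suspicious record instead of deleting it. Whether bag A5 is a typing error or a genuinely large bag cannot be decided from the numbers alone, which is the paradox described above; that decision needs knowledge of the source.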

I recognise these steps because they are what I learned, through tinkering and making mistakes, nearly 30 years ago when I was a data wrangler using image analysis kit, cameras and lenses, Lotus 1-2-3 macros, the FORTRAN and C programming languages, and hand-made data visualisations.
