Sunday 5 November 2017

The Value is in the Data (Wrangling) - 2017

Far too often, in the current headlong rush to apply machine learning and predictive modelling to data, people are wont to forget the central paradox of data: it is impossible to assess whether a given piece of data is any good simply by inspecting the data alone.

This paradox is as true for a gigabyte of electronic data records in a fancy database as for an individual measurement of the weight of a bag of sweets in grams. It is true for the simplest application of the fundamental statistical question, "Compared to what?", as well as for the most complex multi-layered questioning of a predictive model. (Even weighing a bag of sweets is a comparison: what is its mass compared to a standardised unit of mass, the gram?)

How to proceed?

Here is a superb, modestly written explanation by Daniel Haight of the reality of what is involved in real data analysis. 

He describes 7 steps: 
  1. Gather data from inside and outside the firewall
  2. Understand (and document) your sources and their limitations
  3. Clean up the duplicates, blanks, and other simple errors
  4. Join all your data into a single table
  5. Create new data by calculating new fields and recategorizing
  6. Visualize the data to remove outliers and illogical results
  7. Share your findings continuously
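
To make steps 3 to 6 a little more concrete, here is a minimal sketch in plain Python, returning to the bag-of-sweets example. It is my own illustration, not Daniel Haight's method. The records, the declared weights, the function names and the 50% tolerance are all made up for the purpose. Step 6 is about visualising, and a simple comparison against the median stands in for the eye here. Suspect rows are flagged for a human to look at, not deleted.

```python
from statistics import median

# Hypothetical raw measurements: (bag id, weight in grams as typed in).
raw_weights = [
    ("A1", "101.5"),
    ("A2", ""),        # blank measurement
    ("A1", "101.5"),   # exact duplicate of the first row
    ("A3", "99.0"),
    ("A4", "100.2"),
    ("A5", "1002.0"),  # suspicious: perhaps a misplaced decimal point
]

# Hypothetical second source: the declared weight of each bag, in grams.
declared = {"A1": 100.0, "A3": 100.0, "A4": 100.0, "A5": 100.0}


def clean(rows):
    """Step 3: drop blanks and exact duplicates, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for bag_id, text in rows:
        if not text.strip() or (bag_id, text) in seen:
            continue
        seen.add((bag_id, text))
        cleaned.append((bag_id, float(text)))
    return cleaned


def join(rows, lookup):
    """Step 4: build one table; unmatched ids are reported, not silently lost."""
    joined, unmatched = [], []
    for bag_id, weight in rows:
        if bag_id in lookup:
            joined.append(
                {"bag_id": bag_id, "weight_g": weight, "declared_g": lookup[bag_id]}
            )
        else:
            unmatched.append(bag_id)
    return joined, unmatched


def add_ratio(table):
    """Step 5: derive a new field, measured weight relative to declared weight."""
    for row in table:
        row["ratio"] = row["weight_g"] / row["declared_g"]
    return table


def flag_suspects(table, tolerance=0.5):
    """Step 6 stand-in: flag rows whose ratio is far from the median ratio."""
    mid = median(row["ratio"] for row in table)
    for row in table:
        row["suspect"] = abs(row["ratio"] - mid) > tolerance * mid
    return table


table, unmatched = join(clean(raw_weights), declared)
table = flag_suspects(add_ratio(table))
suspects = [row["bag_id"] for row in table if row["suspect"]]
print(suspects)
```

Note that the 1002.0 reading cannot be judged on its own. It only looks wrong once it is compared to something: the declared weight and the other bags.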

I recognise these steps because this is what I learned, through tinkering and making mistakes, nearly 30 years ago when I was a Data Wrangler, using image analysis kit, cameras and lenses, Lotus 1-2-3 macros, the FORTRAN and C programming languages, and hand-made data visualisations.