Far too often, in the current headlong rush to apply machine learning and predictive modelling to data, people are wont to forget the Central Paradox of Data: it is impossible to assess whether a given piece of data is any good by inspection of the data alone.
This paradox is as true for a gigabyte of electronic data records in a fancy database as it is for an individual measurement of the weight of a bag of sweets in grams. It is true for the simplest application of the fundamental statistical question, Compared to what?, as well as for the most complex multi-layered questioning of a predictive model (weighing a bag of sweets is a comparison: what is its mass compared to a standardised unit of mass, the gram?).
How to proceed?
Here is a superb, modestly written explanation by Daniel Haight of what is actually involved in real data analysis.
He describes 7 steps (a small code sketch of the middle few follows the list):
- Gather data from inside and outside the firewall
- Understand (and document) your sources and their limitations
- Clean up the duplicates, blanks, and other simple errors
- Join all your data into a single table
- Create new data by calculating new fields and recategorizing
- Visualize the data to remove outliers and illogical results
- Share your findings continuously
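To make the middle steps a little more concrete, here is a minimal sketch in Python using pandas and matplotlib. The tables, column names and categories are invented for illustration; they are not from Haight's article, and real work would involve many more sources and many more checks.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for an "inside the firewall" extract
# (a sales system) and an external reference table of regions.
sales = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 104],        # note the duplicate 102
    "region_code": ["N", "S", "S", "E", None],    # note the blank region
    "units": [12, 7, 7, 30, 5],
    "unit_price": [2.50, 2.50, 2.50, 2.40, 2.60],
})
regions = pd.DataFrame({
    "region_code": ["N", "S", "E", "W"],
    "region_name": ["North", "South", "East", "West"],
})

# Clean up the duplicates, blanks and other simple errors.
sales = sales.drop_duplicates(subset="order_id")
sales = sales.dropna(subset=["region_code"])

# Join the data into a single table.
table = sales.merge(regions, on="region_code", how="left")

# Create new data by calculating new fields and recategorising.
table["revenue"] = table["units"] * table["unit_price"]
table["order_size"] = pd.cut(table["units"], bins=[0, 10, 100],
                             labels=["small", "large"])

# Visualise the data to spot outliers and illogical results.
table.plot.scatter(x="units", y="revenue")
plt.title("Units vs revenue: look for points that make no sense")
plt.show()

print(table)
```

The point of the sketch is the shape of the workflow, not the tools: every step changes the data, so each one needs to be documented and shared, not just the final chart.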
I recognise these steps, because this is what I learned by a process of tinkering and making mistakes nearly 30 years ago, when I was a data wrangler using image-analysis kit, cameras and lenses, Lotus 1-2-3 macros, the FORTRAN and C programming languages, and hand-made data visualisations.