Data Deluge: June 2012

Saturday, 30 June 2012

In resolution we trust.

In 1609 Galileo made three inventions that extended human visual resolution of fine detail. In June or July he made his first three-powered spyglass, by August he had made an eight-powered instrument, and by November he had made a twenty-powered instrument. Galileo then used these instruments to observe the surface details of our Moon, discover the satellites of Jupiter and resolve individual stars from what had previously been fuzzy and indistinct nebular patches. By March 1610 Galileo had published this material, including careful observations of the movement of the moons of Jupiter. This book, Sidereus Nuncius, was perhaps the world's first distinctive scientific data set. From 1610 Galileo also turned his telescope to work at close range, and by 1623 or 1624 he had perfected a compound microscope.

In Sidereus Nuncius he presented the stars that had been visible to the naked eye and those now resolvable using his telescope. The image below shows the difference.

Voodoo correlations

Ed Vul works in the Department of Psychology at UC San Diego. In 2009 he and his colleagues wrote a nice paper called Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition.

Vul had wanted to call the paper Voodoo Correlations in Social Neuroscience, and I for one would have preferred his original title. You can see why when you read the abstract of the published paper;

ABSTRACT

Functional magnetic resonance imaging (fMRI) studies of emotion, personality, and social cognition have drawn much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8) correlations between brain activation and personality measures. We show that these correlations are higher than should be expected given the (evidently limited) reliability of both fMRI and personality measures. The high correlations are all the more puzzling because method sections rarely contain much detail about how the correlations were obtained. We surveyed authors of 55 articles that reported findings of this kind to determine a few details on how these correlations were computed. More than half acknowledged using a strategy that computes separate correlations for individual voxels and reports means of only those voxels exceeding chosen thresholds. We show how this non-independent analysis inflates correlations while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample. In addition, we argue that, in some cases, other analysis problems likely created entirely spurious correlations. We outline how the data from these studies could be reanalyzed with unbiased methods to provide accurate estimates of the correlations in question and urge authors to perform such reanalyses. The underlying problems described here appear to be common in fMRI research of many kinds—not just in studies of emotion, personality, and social cognition.

The full paper is HERE.
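
The selection effect described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis used by Vul and colleagues: the subject count, voxel count and threshold are made-up values, the function names are hypothetical, and every simulated voxel is pure noise, so the true correlation with the personality score is zero.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_subjects=16, n_voxels=5000, threshold=0.6, seed=1):
    """Compare a non-independent and an independent correlation estimate.

    All numbers here are illustrative assumptions. Every voxel is noise,
    so any correlation with the scores arises by chance alone.
    """
    rng = random.Random(seed)
    # Hypothetical personality score, one value per subject.
    scores = [rng.gauss(0, 1) for _ in range(n_subjects)]

    def noise_scan():
        # One activation value per subject for every voxel.
        return [[rng.gauss(0, 1) for _ in range(n_subjects)]
                for _ in range(n_voxels)]

    first_scan = noise_scan()
    second_scan = noise_scan()

    # Non-independent analysis: correlate every voxel with the scores,
    # keep only voxels above the threshold, then average those same values.
    first_r = [pearson(voxel, scores) for voxel in first_scan]
    selected = [i for i, r in enumerate(first_r) if r > threshold]
    if not selected:
        return 0, float("nan"), float("nan")
    biased = sum(first_r[i] for i in selected) / len(selected)

    # Independent check: measure the selected voxels again in fresh data.
    second_r = [pearson(second_scan[i], scores) for i in selected]
    unbiased = sum(second_r) / len(second_r)
    return len(selected), biased, unbiased


n_selected, biased, unbiased = simulate()
print(f"voxels selected:             {n_selected}")
print(f"mean r, same data:           {biased:.2f}")
print(f"mean r, independent data:    {unbiased:.2f}")
```

Because the same noisy data are used both to pick the voxels and to report their correlation, the first figure is guaranteed to exceed the threshold, while the same voxels measured in an independent scan should hover near zero.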

Figure 5 from the paper is shown below.

Antediluvian Quantification

It is hard to imagine today how profoundly scarce high quality scientific data was a hundred years ago. For example, in 1910 the American physicist Robert Millikan made a first report of a series of experimental measurements that he had made with his graduate student Harvey Fletcher using their own design of equipment. These measurements involved timing how long it took small drops of watch oil to move up and down in an electric field. The timings were then used to make estimates of the tiny electrical charge of individual electrons. This famous series of ‘oil drop’ experiments of Millikan allowed him to make an estimate of the electron charge that is correct to within about 0.5% of the currently accepted value. This series of experiments and the estimation of electron charge they allowed are celebrated to this day; when Millikan’s value for the electronic charge was inserted into Bohr's formula for the hydrogen spectrum, it accurately gave the Rydberg constant. This confluence of experimentally derived estimation and a new theory was impressive and the experiment is still seen by many scientists as one of the first and most convincing proofs of the quantum theory of the atom proposed by Bohr.

Millikan’s oil drop data were first reported in 1910, which prompted some controversy, particularly with the physicist Felix Ehrenhaft. After improving his experimental setup, Millikan published a full report of his work in 1913 in the Physical Review (Millikan 1913). In this paper Millikan reports the culmination of 4 years of hard experimental effort, designing and mastering a new technique and assessing the sources and magnitudes of the errors in his approach. The conclusions of the paper are based on the detailed analysis of 58 individual oil droplets, with measurements on these droplets having been made over a period of 60 consecutive days. On each of these drops about 40 individual timing measurements were made. The data set used for Millikan’s analysis is presented as Table XX in his 1913 paper. If these numbers are entered into a modern spreadsheet file the dataset is about 28 kilobytes.

Notwithstanding the limited number of kilobytes of data reported in this paper of Millikan, this was a fantastic piece of science. The oil-drop experiment showed that an elegant experimental method could not only provide an accurate determination of the fundamental unit of charge, it could also provide evidence that charge is quantized. In 1923 Robert Millikan was awarded the Nobel Prize in Physics, “for his work on the elementary charge of electricity and on the photoelectric effect”. In his Nobel Prize acceptance speech he first celebrated how science at that time was a close partnership of theory and experiment;

The fact that Science walks forward on two feet, namely theory and experiment, is nowhere better illustrated than in the two fields for slight contributions to which you have done me the great honour of awarding me the Nobel Prize in Physics for the year 1923.

Sometimes it is one foot which is put forward first, sometimes the other, but continuous progress is only made by the use of both - by theorizing and then testing, or by finding new relations in the process of experimenting and then bringing the theoretical foot up and pushing it on beyond, and so on in unending alternations.

He then describes the importance of the oil drop experiments he had completed a decade earlier; “..the electron itself, which man has measured…is neither an uncertainty or an hypothesis. It is a new experimental fact…” (Millikan 1924).

The sheer scarcity of high quality data was a major barrier to scientific progress, and many of the leading research scientists of the day spent an enormous amount of effort designing and constructing physical instruments that were capable of providing high quality, reproducible data. The reason that these scientists spent so much effort on creating physical devices for generating data for their studies was that such devices simply did not exist. Compared with today, that era of science was antediluvian; literally "before the deluge". Each and every data point had a real value to a practising scientist because they knew exactly how much effort had been expended in obtaining it.

Those days are long gone. We have never had so much “data”, or so much capacity to store this data, manipulate it and analyse it both mathematically and visually. Yet using data in science has never been more difficult. The problems are not set by the technical limitations of instruments, computers, memory or even mathematical and statistical techniques – though all of these will continue to develop and these developments can help scientists.

The real challenge is in how we can best use the tried and trusted intellectual frameworks that have been the basis of scientific research over the past 400 years in our era of data deluge. Many of our current archetypes of science and our scientific heroes are products of the data-scarce era of science; Galileo, Newton, Kelvin, Einstein, Curie, Feynman, Watson & Crick. These are all antediluvian heroes: they lived, worked and excelled prior to the data deluge.

How then can it be, that if the core of science is measurement and quantification and both are increasing in capacity at such a great rate, there is a problem?

The answer is that science is not just the accretion of raw data or even analysed data. It is a fundamentally creative act. It requires a human intelligence to combine what was already known (or thought to be known), with the new data from experiment and observation, into a new level of knowledge. Paradoxically, this process has historically been aided by the fact that it always took time to collect data. A scientist may well have had to design and build the apparatus before he or she could even begin to make the measurements they were interested in. They had to first conquer the experiment before really understanding the object of attention. This is classical science. And it is not so long ago. One of the key weaknesses in today's science is that modern scientists have become personally disconnected from the measurement process - in the sense that Millikan understood it - in that they have never designed, or built a piece of apparatus. This means that they have reduced the connection they have with their data and reduced the amount of down time in their research.

I remember for my own BSc. Chemistry project having to construct an apparatus to image and measure droplets. Gaining a deep understanding of the shortcomings of an experiment by designing and building the kit has aided the development of science over the past three centuries. Today much of this forced downtime has been driven out of modern research labs. A high quality piece of equipment can be purchased and deployed in a matter of weeks. And with modern analytical instruments having high reliability and high data storage capacity they can be set up to run virtually unattended all day every day.

In previous generations one of the key bottlenecks was the need for experimental quantitation. This in turn led to the need for scientists to be able to develop and deploy scientific instrumentation and the ability to understand what each measured data point meant. This is harked back to by older science teachers who still stress the need for “good lab practice” using paper notebooks, keeping observational records of what is going on etc. Now most scientists do not have a clue about what is going on in their instruments. They have no idea about how much data processing is going on before they get hold of raw data.

Figure 1 from Millikan's 1913 paper.

REFERENCE

On the elementary electrical charge and the Avogadro constant. R.A. Millikan.

The Physical Review. Vol II Series II 1913.

Sunday, 24 June 2012

Stepping Dividers (1585)

The western coastline of the British Isles has been eroded for millennia by the power of the Atlantic Ocean and the Irish Sea. To natives of the British Isles this coast has a recognisable and familiar irregularity. Yet this familiar coastline is the source of a puzzle that was first uncovered by the British scientist Lewis Fry Richardson (1881 -- 1953) in the 1950s and published posthumously in 1961. Richardson's paper was later used by Benoit Mandelbrot in his classic 1967 paper "How long is the Coast of Britain?".

The so-called coastline paradox is that the measured length of a stretch of coastline, such as the west coast of Britain, depends on the scale of measurement. The specific problem noted by Richardson was found when he approximated the length of the coastline by counting the number of steps of a fixed step length required to cover the whole coastline. Richardson notes that the smaller the step length used to make this estimate, the longer the total measured length of the coastline becomes.

For a general smooth shape this is not the case -- for a circle, as the step length decreases the approximation of the circle's circumference gets closer to the real value.

Measuring the length of even a smooth curve is tricky to do. One approach is to use the maritime method of stepping a set of dividers along the curve and then multiplying the number of steps by the distance between the divider points. This method works well for a circle - as the step length decreases, the estimate converges on a stable value.

A single-handed divider can be used to approximate a distance on a maritime chart or the length of a border between two countries or regions on a map. The divider points are set to a known distance apart and then stepped along the border. One point of the divider is placed on the border and then the divider is swung around until its other point meets the border again.
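
The contrast between a smooth curve and a rough coastline can be sketched numerically. The example below is an idealised illustration, not Richardson's data: it uses a unit circle as the smooth shape and the Koch curve as a hypothetical stand-in for a coastline, and it assumes the divider points land exactly on the vertices of each construction stage of the Koch curve. The function names are illustrative.

```python
import math


def circle_divider_length(radius, step):
    """Length of a circle as measured by dividers opened to `step`.

    Each step is a chord of length `step`, which subtends an angle of
    2 * asin(step / (2 * radius)) at the centre. A fractional final
    step is counted pro rata. Requires step <= 2 * radius.
    """
    angle_per_step = 2 * math.asin(step / (2 * radius))
    n_steps = 2 * math.pi / angle_per_step
    return n_steps * step


def koch_divider_length(stage):
    """Length of a unit-base Koch curve measured with step 3**-stage.

    At this step length the dividers take 4**stage steps, so the
    measured length is (4/3)**stage and keeps growing with `stage`.
    """
    step = 3.0 ** -stage
    n_steps = 4 ** stage
    return n_steps * step


# Shrink the divider opening and compare the two measured lengths.
for stage in range(6):
    step = 3.0 ** -stage
    print(
        f"step={step:.5f}  "
        f"circle={circle_divider_length(1.0, step):.5f}  "
        f"koch={koch_divider_length(stage):.5f}"
    )
```

As the step shrinks, the circle estimate settles towards 2π while the Koch estimate grows by a factor of 4/3 at every stage, which is the behaviour Richardson observed for real coastlines.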

The illustration below shows a set of dividers in use. It comes from a book published in Antwerp in 1585 by Christoffel Plantijn (image from HERE). This design of divider, usually made from brass and steel, remains virtually unchanged to this day, and dividers of this kind are still widely available to purchase.

Image is from Flickr user History of the Book, it is used under the Creative Commons licence BY-NC-SA 2.0 (HERE).

References

Richardson, L. (1961), The problem of contiguity: An appendix to Statistics of Deadly Quarrels, General Systems Yearbook. 6, pp. 139-187.

Friday, 22 June 2012

Mapping the Nation

Professor Susan Schulten of the University of Denver History department has just published a book with University of Chicago Press called Mapping the Nation (HERE). This describes the origins of thematic mapping and graphic knowledge.

There is also a website to support the book, where you can quickly access all 100+ maps in high-resolution and color (HERE).

Chart Showing the locations in which all the cases of Cholera at the Hospital and all the fatal cases elsewhere originated. Report on cholera in Boston, 1849. Map Creator Williams, Henry W. (Henry Willard), 1821-1895

The blurb from the publisher's site;

In the nineteenth century, Americans began to use maps in radically new ways. For the first time, medical men mapped diseases to understand and prevent epidemics, natural scientists mapped climate and rainfall to uncover weather patterns, educators mapped the past to foster national loyalty among students, and Northerners mapped slavery to assess the power of the South. After the Civil War, federal agencies embraced statistical and thematic mapping in order to profile the ethnic, racial, economic, moral, and physical attributes of a reunified nation. By the end of the century, Congress had authorized a national archive of maps, an explicit recognition that old maps were not relics to be discarded but unique records of the nation’s past.

All of these experiments involved the realization that maps were not just illustrations of data, but visual tools that were uniquely equipped to convey complex ideas and information. In Mapping the Nation, Susan Schulten charts how maps of epidemic disease, slavery, census statistics, the environment, and the past demonstrated the analytical potential of cartography, and in the process transformed the very meaning of a map.

Today, statistical and thematic maps are so ubiquitous that we take for granted that data will be arranged cartographically. Whether for urban planning, public health, marketing, or political strategy, maps have become everyday tools of social organization, governance, and economics. The world we inhabit—saturated with maps and graphic information—grew out of this sea change in spatial thought and representation in the nineteenth century, when Americans learned to see themselves and their nation in new dimensions.

Wednesday, 20 June 2012

The Razor (of Occam?)

There is an ancient reasoning principle, which is usually credited to the English Franciscan friar and philosopher William of Ockham (1288--1348), that offers the following advice;

entities should not be multiplied beyond necessity.

That is, when competing explanations for a situation are equal in other respects, we should prefer the one that uses the fewest assumptions and entities.

This principle is almost universally known as Occam's razor, though there is no evidence that William of Ockham was its author. The phrase Occam's razor did appear in 1852 in the work of the Scottish philosopher Sir William Hamilton (1788--1856) [1], but William Thorburn in 1918 showed that the wording of the classic formulation of Occam's Razor, as given here, was not medieval at all but a `modern myth'. In fact this formulation originated in 1639 with John Ponce of Cork, a follower of the medieval philosopher Duns Scotus.[2]

This principle is referred to as a razor because it stresses the need to shave away unnecessary assumptions to get to the simplest explanation. In science, Occam's razor has long been referred to as a heuristic minimal principle, or law of parsimony. It is still used by scientists as a rule of thumb in the development of models, but rarely used to make a qualitative or quantitative choice between alternative models.

A later expansion and clarification of the principle comes from Albert Einstein, who notes that we should seek models that have a balance between simplicity and veracity;

make everything as simple as possible, but not simpler.

References

[1] Sir William Hamilton (1852). Discussions on Philosophy and Literature. p580. HERE

[2] Thorburn, W.M. (1918). The myth of Occam's Razor. Mind. 27 (107) pp 345-353.

Full text of the paper and an introduction HERE.

[3] The image of a razor blade's edge is from Robert Hooke's Micrographia, published in 1665. Robert Hooke (1635--1703) was a polymath scientist, an early member of the Royal Society and its Curator by Office, responsible for putting on regular experiments for the fellows of the society. Hooke also wrote and illustrated the Royal Society's first book, Micrographia, which was published in September 1665.

Hooke, R. (1665). Micrographia: Or some physiological descriptions of minute bodies made by magnifying glasses with observations and enquiries thereupon. The famous diarist Samuel Pepys described Micrographia as, `the most ingenious book that I ever read in my life'. This was the world's first comprehensive illustrated book on scientific microscopy. It was full of beautifully rendered engravings of everyday objects and insects that Hooke had observed at low power through his microscope, including an engraving of the edge of his razor.

Tuesday, 19 June 2012

Black Locust Woodcut.

Much of my technical work over the past 20 years has been concerned with trying to infer what is happening in a 3-dimensional space from inspection of images of cross-sectional cuts through the 3D space (a field known as stereology).

Perhaps that's why I appreciate the work in Woodcut (HERE) - a book by the Connecticut-based artist Bryan Nash Gill. He takes cross sections of trees, telegraph poles, branches, planks and laminates and creates relief prints from them by inking the cross-section.

The image below is a Black Locust with bark, 87 years old when printed.

Monday, 18 June 2012

Just enough is more

As everyone knows, Less is more is an aphorism associated with the modernist architect Ludwig Mies van der Rohe.

But the motto appears to go back a lot further than that. In 1855 Robert Browning published a poem Andrea del Sarto which used the phrase.

Even earlier, the same motto in German, Und minder ist oft mehr, was used by Christoph Martin Wieland in his literary review series Der Teutsche Merkur, on page 4 of the January 1774 edition.

The graphic designer Milton Glaser has his own take on it;

Being a child of modernism I have heard this mantra all my life. Less is more. One morning upon awakening I realised that it was total nonsense, it is an absurd proposition and also fairly meaningless... However, I have an alternative to the proposition that I believe is more appropriate. `Just enough is more.'

M. Glaser (2001). Ten Things I Have Learned. Part of AIGA Talk in London November 22, 2001.

Friday, 15 June 2012

Uncertainty exists in the map, not in the territory.

Many years ago, when I was using Maximum Entropy data processing methods on a tricky problem in gas dispersion, I spent a lot of time trying to understand Bayesian inference. This was mainly for very practical reasons - so I could understand what the MaxEnt programmes were doing - but I did spend quite a bit of time reading up on the work of American physicist Ed Jaynes (a short bio HERE) - who I already referenced in an earlier blog post.[1]

Jaynes had a hang-up about what he called the "mind projection fallacy", which is when we assume that the way we see the world reflects the way the world really is. Jaynes also describes another form of the fallacy - when we assume that our own lack of knowledge about how things really are somehow means that they are indeterminate.

Jaynes illustrates this by discussing randomness when we shake a die;

Shaking does not make the result “random,” because that term is basically meaningless as an attribute of the real world; it has no clear definition applicable in the real world. The belief that “randomness” is some kind of real property existing in Nature is a form of the Mind Projection Fallacy which says, in effect, “I don’t know the detailed causes—therefore—Nature does not know them.” What shaking accomplishes is very different. It does not affect Nature’s workings in any way; it only ensures that no human is able to exert any willful influence on the result. Therefore nobody can be charged with “fixing” the outcome.

This is a tricky concept. Recently I have come across a good analogy that brings this to life a bit. We all know that a map is only a representation of the territory it purports to describe. The map is not reality and we rarely mistake the two; the map is not the territory. [2]

Furthermore, if there is any mismatch between the map and the territory, the uncertainty must exist in the map. Our man-made representation of the territory must by definition contain uncertainties and errors; it cannot be reality, as it is only a piece of coloured paper (or a coloured computer display screen). And if we change the map, for example by scribbling on it, these changes cannot in and of themselves change the territory. Changing our representations of the world (or, to put it another way, our beliefs) cannot change the real world. [3]

Which gets me on to computer models used in science. They are, like a map, not reality. They are a mathematically tractable way of making estimates of what might happen when a particular phenomenon unfolds.

In effect a computer-based scientific model is a complicated black box, into which a number of input parameters are fed and from which a “result” is delivered. But these results must be treated with care, as they are quite unlike the results that scientists obtain from experiments. In physical science we know that the laws of physics act always and everywhere the same, so when we ask a question of nature in the form of an experiment, the answer we get tells us something about both the question we asked and the laws of nature. In the case of a computer model this is not true.

The computer based model has been created by scientists who have had to find a way of writing a computer program that runs in a reasonable length of time on an available computer. In order to do this they inevitably make a long series of assumptions and approximations that allow them to make progress. Sometimes these assumptions are good and other times they are bad.

But once a model has become a de facto industry standard many of the original assumptions and caveats become forgotten, or deliberately ignored, and the model in some sense begins to take the place of reality in the minds of the many scientists and engineers who use the model on a daily basis. In other words it is an example of the mind projection fallacy - they begin to treat the map as if it were the territory.

Using such a model inevitably requires the practitioner to make a set of assumptions about the input parameters to be fed into the black box. In complex models there may be dozens or even hundreds of input parameters. Someone has to choose what these are, and those choices will be based in some cases on very firm physical insights and experimentally derived data and in other cases on assumptions, approximations, estimates and guesses.

It is not morally reprehensible to make assumptions, estimates and guesses in science, but it is incumbent on scientists who do so to be absolutely transparent about what they have done – in order that other scientists can challenge them. Best practice requires that the scientist who uses such a model explicitly explores how sensitive the output of the black box is to the different choices they have made about the input parameters.
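
One simple way to explore this is a one-at-a-time sensitivity check: nudge each input up and down by a fixed percentage and record how much the output moves. The sketch below assumes a hypothetical toy model (`toy_model`, with made-up parameter names and values loosely in the spirit of a dispersion calculation). It is not any real model, and a real study would likely need something more thorough than varying one input at a time.

```python
def toy_model(params):
    """Hypothetical black box: a made-up concentration estimate."""
    return params["emission_rate"] / (
        params["wind_speed"] * params["mixing_height"]
    )


def one_at_a_time_sensitivity(model, baseline, relative_change=0.10):
    """Relative change in output when each input is varied on its own.

    Returns {name: (change_when_lowered, change_when_raised)}, each
    expressed as a fraction of the baseline output.
    """
    base_output = model(baseline)
    results = {}
    for name, value in baseline.items():
        # Copy the baseline and perturb only the parameter under test.
        lowered = {**baseline, name: value * (1 - relative_change)}
        raised = {**baseline, name: value * (1 + relative_change)}
        results[name] = (
            (model(lowered) - base_output) / base_output,
            (model(raised) - base_output) / base_output,
        )
    return results


# Illustrative baseline values only; the units are arbitrary.
baseline = {"emission_rate": 2.0, "wind_speed": 5.0, "mixing_height": 100.0}

for name, (down, up) in one_at_a_time_sensitivity(toy_model, baseline).items():
    print(f"{name:>14}: -10% input -> {down:+.1%}, +10% input -> {up:+.1%}")
```

Reporting a table like this alongside the headline result lets a reader see at a glance which of the modeller's choices actually matter.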

The minimum that an informed public could expect of scientists who run computer models is transparency. Instead of talking about these models as if they were reality or as if they were based on fundamental physical laws, they should openly say;

These computer models are based on our informed assumptions, approximations and in some cases guesses. They will inevitably have errors in them due to our programming methods. In addition, and in common with ordinary lab based or field based experiments, they are prone to uncertainty and error. We indicate this by always reporting the outputs of our black box model runs with error bars or confidence intervals. Over the past 400 years these have been the accepted way of reporting to readers how much confidence they can have in the results and we use the same conventions.

Long may maps (and computer models) exist. They are a brilliant example of how humans can create new and useful means to represent reality. But let us not mistake the map for the territory, lest we get lost.

References

[1] Ed Jaynes's major book Probability Theory: The Logic of Science was published posthumously by Cambridge University Press. A PDF copy of the first three chapters of the book is HERE.

[2] Korzybski, A. (1931). A non-Aristotelian system and its necessity for rigour in mathematics and physics. Read at the American Association for the Advancement of Science, December 28, 1931.

[3] http://wiki.lesswrong.com/wiki/The_map_is_not_the_territory.

Saturday, 9 June 2012

The art of scientific investigation (1957)

William Ian Beardmore Beveridge (1908-2006) was an Australian animal pathologist, not to be confused with William Beveridge (1879-1963), the British economist and social reformer. W.I.B. Beveridge was Professor of Animal Pathology and Director of the Institute of Animal Pathology at Cambridge University, England, from 1947 until he retired in 1975.

It is tricky to find out much about him on the internet. But there is THIS fabulous book by Beveridge - a fascinating personal description of what Beveridge thought was important to scientific researchers in biology, based on his own experience of experimental work.

Here is an example of Beveridge's thoughts;

The role of hypothesis in research can be discussed more effectively if we consider first some examples of discoveries which originated from hypotheses. One of the best illustrations of such a discovery is provided by the story of Christopher Columbus' voyage; it has many of the features of a classic discovery in science. (a) He was obsessed with an idea—that since the world is round he could reach the Orient by sailing West, (b) the idea was by no means original, but evidently he had obtained some additional evidence from a sailor blown off his course who claimed to have reached land in the west and returned, (c) he met great difficulties in getting someone to provide the money to enable him to test his idea as well as in the actual carrying out of the experimental voyage, (d) when finally he succeeded he did not find the expected new route, but instead found a whole new world, (e) despite all evidence to the contrary he clung to the bitter end to his hypothesis and believed that he had found the route to the Orient, (f) he got little credit or reward during his lifetime and neither he nor others realised the full implications of his discovery, (g) since his time evidence has been brought forward showing that he was by no means the first European to reach America.

W.I. Beveridge was born in Junee, Australia, and attended the University of Sydney to study veterinary science. He graduated as a Bachelor of Veterinary Science in 1931 and as a Doctor of Veterinary Science in 1941. He received a Master of Arts degree from Cambridge University in 1947 and was later awarded an honorary degree as Doctor of Veterinary Medicine from Hannover Veterinary University, Germany, in 1963.

Beveridge also served as a research bacteriologist at McMaster Animal Health Lab Sydney, a Commonwealth Fund service fellow at Rockefeller Institute Princeton NJ and at the Bureau of Animal Industry in Washington DC.

In 1972 Beveridge published a book, Frontiers in Comparative Medicine, outlining his views in this area of science.

A full obituary of Beveridge that was published in the Sydney Morning Herald in 2006 is HERE.

Monday, 4 June 2012

The evolution (of evolution) on Wikipedia

Here is an interesting timeseries visualisation of the development of a Wikipedia entry for a controversial subject (evolution).

HERE

3 x 3 Marble Table

For those who used to play marbles, there is a post by John Foster HERE and a 3 x 3 table collection below.

Original individual images copyright Morphys Auction house.

Sunday, 3 June 2012

The great Liverpool shell find of 1855

Liverpool has been a crossroads for international trade for hundreds of years. But one of the strangest true stories I have found that illustrates this is the story of Philip Pearsall Carpenter's discovery of millions of sea-shells that had originally been collected in a remote region of the Sea of Cortez in Mexico and shipped back to Liverpool.

The story begins with the Belgian naturalist Frederick Reigen, who lived in the Mazatlán region of Mexico (close to the Sea of Cortez) between 1848 and 1850. Reigen started collecting sea shells and ended up amassing tons of shells - probably one of the biggest collections ever made. Somehow the bulk of this vast collection was sent to Liverpool in 1851 for sale. There the shells were spotted by a British minister and eccentric, Philip Pearsall Carpenter;

He had been instructed in natural science when a boy, had made a collection of shells, and had always had a taste for natural history. One day, in 1855, while walking down a street in Liverpool, Carpenter caught sight of some strange shells in a dealer’s window. He went in, and found that the specimens were part of a vast collection made by a Belgian naturalist named Reigen at Mazatlan in California. The collector had died, leaving his shells unsorted and unnamed. Carpenter bought them for 50l. There were fourteen tons of shells, each ton occupying forty cubic feet. The examination, description, naming, and classification of these shells was the chief work of the rest of Carpenter’s life. By the comparison of hundreds of examples, 104 previous species were shown to be mere varieties, while 222 new species were added to the catalogue of the mollusca. Thenceforward, though he sometimes preached, made speeches, and wrote pamphlets, most of Carpenter’s time was given to shells, and even when he received calls or paid visits he would wash and pack up shells during conversation. Their pecuniary value when named and arranged in series was great, but he never tried to grow rich by them, and his whole endeavour was to spread the knowledge of them and to supply as many public institutions as possible with complete collections of Mazatlan mollusca. A full report on them occupies 209 pages of the ‘British Association Reports’ for 1856, and further details are to be found in the same reports for 1863, and in the ‘Smithsonian Reports’ for 1860.

CARPENTER, PHILIP PEARSALL (1819-1877). Excerpt from the Dictionary of National Biography (Stephen and Lee 1921–22)

Carpenter's collection was catalogued by Katherine van Winkle Palmer in 1958 - the book is HERE and below is a plate from it.