Data Deluge: June 2009

Monday, 29 June 2009

Digitising is NOT measuring

Measurement is an operationally defined process. It requires a well defined protocol that describes what is being measured by what means and with what level of reproducibility and an understanding of the error structure of the problem.

Digitising is the capture of digital data from a sensor. It is a new and low cost way of generating petabytes of data. However, simply taking an analogue input and transforming it to a digital image or signal does not mean it is a measurement.

There are millions and millions of digital images captured every day (perhaps billions). The vast majority of these are NOT measurements; they are 'snapshots'. In order to use digital imagery as a measurement modality for science one needs to take care of magnification issues (not always equal in X and Y), linearity of grey scale response and/or colour response, bit depth, illumination set up to highlight features of interest, the effect of image compression, the frequency response of the lenses used, and so on.

Matt

Saturday, 27 June 2009

Science Studies the Sardine (Ed Ricketts 1947)

The following is the full text of an article published by Ed Ricketts in the Monterey Peninsula Herald in 1947.

Unfortunately for the sardine, and the extended marine food chain that it forms part of, Ricketts was killed in May 1948 and his deep ecological insight and campaigning capability never helped save the sardine. Much later, in 1998, the well known fisheries ecologist Dr Daniel Pauly published a seminal paper in Science called 'Fishing Down Marine Food Webs'. This is both a citation classic (with over 1,000 citations) and a highly influential paper. Pauly was not aware of Ricketts' work when his paper was published.

Science Studies the Sardine

Mysterious Disappearance Focuses Attention on Woeful Lack of Information Regarding Billion Dollar Fish

By EDWARD F. RICKETTS, Pacific Biological Laboratories

The Herald has undoubtedly reported waterfront opinion correctly in stating that the shortage of sardines is being attributed to a change in the currents. I doubt very much if we can rely on such a simple explanation. I am reminded instead of an old nursery rhyme. Like the "farmer-in-the-Dell" there's a long chain of events involved. And you have to be familiar with all of them in order to know any one very clearly.

This is complicated still further by the fact that scientists haven't yet unravelled it all completely. But already quite a lot of information is available, and from most of the parts we can piece together some sort of picture puzzle. The result may be a labyrinth, but there's no way to avoid it. We just have to be patient and try to follow it through. Because I've done this myself after a fashion, perhaps I can be better than no guide at all.

By means of a very efficient straining apparatus in the gills, sardines are able to feed directly on the most primitive foodstuffs in the ocean. This so-called "plankton," chiefly diatoms (free floating microscopic plants) is the product of oceanic pastures, and like grain and grass and root crops, bears a direct relation to sunlight and fertilizer. Except that in the oceans of course you don't have to depend on the rains for moisture.

Some of these ocean pastures produce more per acre than the others. This is due to variations in the amount of fertilizer brought to the surface by a process called "upwelling."

DUST TO DUST
Only the upper layers of water receive enough light to permit the development of plants. In most places, when the sunlight starts to increase in spring, these plants grow so rapidly that they deplete the fertilizer and die out as a result. In the familiar pattern of dust to dust and ashes to ashes, their bodies disintegrate and release the chemical elements. These, in the form of a rain of particles, constantly enrich the dark-deeper layers.

On the California coast - one of the few places in the world, incidentally, where this happens - winds from the land blow the surface waters of the shore far out to sea. To replace these waters, vertical currents are formed which bring up cool, fertilizer laden waters from the depths to enrich the surface layers, maintaining a high standard of fertility during the critical summer months. Since the seeds of the minute plants are everywhere ready and waiting to take advantage of favorable conditions, these waters are always blooming. It's a biological rule that where there's food, there's likely to be animals to make use of it. Foremost among such animals is the lowly sardine.

FISH "FACTORY"
In the long chain, this is the first link. Reproductive peculiarities of the sardine itself provide the second. Above their immediate needs, adult sardines store up food in the form of fat. During the breeding season, ALL this fat is converted into eggs and sperm. The adult sardine is nothing but a factory for sexual products.

The production of plankton is known to vary enormously from year to year. If feeding conditions are good, the sardines will all be fat, and each will produce great quantities of eggs. If the usual small percentage of eggs survive the normal hazards of their enemies, a very large year class will be hatched.

ABUNDANT ENEMIES
If then feeding conditions are still good at the time and place of hatching (spring in southern California and Northern Mexico), huge quantities of young sardines will be strewn along the Mexican coast by south-flowing currents. Here their enemies are also abundant. Including man - chiefly in the form of bait fishermen from the tuna clippers. Those surviving this hazard move back into California waters, extending each year further and further up the coast, the oldest finally migrating each summer clear up to Vancouver Island. All return during the winter to the breeding ground. This process builds up until a balance is established. Or until man, concluding that the supply is in fact endless, builds too many reduction plants on that assumption.

On the other hand, the small year classes resulting from decreased plankton production, will in themselves deplete the total sardine population. If then the enemies - again chiefly man - try desperately to take the usual amounts, by means for instance of more and larger boats, greater cruising radius, increased fishing skills, etc, sardine resources can be reduced to the danger point.

FORECAST FEASIBLE
All this being true, we might be able to forecast the year classes if we could get to know something about the conditions to which the adults are subjected at a given time. Their eggs are produced only from stored fat. The fat comes only from plankton. And there are ways of estimating the amount of plankton produced.

An investigator at the Scripps Institution of Oceanography long ago realized the fundamental importance of plankton. He has been counting the number of cells in daily water samples for more than 20 years. This scientist, now retiring perhaps discouraged because the university recently hasn't seen fit even to publish his not-very-spectacular figures, ought to be better known and more honored. He deals with the information of prime importance to the fisheries. But in order to get this information in sufficient detail, you have to know the man, and get him to write you about it personally.

FIGURES NEEDED
Figures from only a single point have very limited value. We should have data of this sort for a number of places up and down the coast so as to equalize the variations; and for many years. But we're lucky to have them even for La Jolla. The graph shows 1926, 1931, 1934 to have been poor years, but during those times the total sardine landings were still apparently well within the margin of safety. Years 1941-42 were also poor. But during this time the fishery was bringing in really large quantities. The evidence is that very lean adults resulted from these lean years, producing few eggs. And we would have done well to take only a few of them and their progeny. A glance at the chart labelled "California Sardine Landings" will show what actually happened instead.

There's still another index of sardine production. This time the cat of the wife of the farmer-in-the-dell catches herself a fine succulent mouse.

TEMPERATURE KNOWN
We know that the food of the sardine depends directly on fertility. And that fertility depends on upwelling. Obviously, the water recently brought up from depths is colder than at the surface. When upwelling is active, surface temperatures will be low. Now we can very easily calculate the mean annual sea water temperatures from the readings taken daily at Hopkins Marine Station and elsewhere. The chart which has been prepared to show these figures bears out again the fact that 1941 and '42 must have been unfortunate years for the sardines. Following those years we should have taken fewer so as not to deplete the breeding stock.

FISHING Vs. FISH
Instead, each year the number of canneries increased. Each year we expended more fishing energy pursuing fewer fish. Until in the 1944-46 seasons we reached the peak of effort, but with fewer and fewer results. Each year we've been digging a little further into the breeding stock.

A study of the tonnage chart will show all this, and more. The total figures, including Canada, tell us pretty plainly that the initial damage was done in 1936-37, when the offshore reduction plants were being operated beyond regulation. And the subsequent needs of the war years made it difficult then also for us to heed the warnings of the scientists of the Fish and Game Division.

THE ANSWER
The answer to the question "Where are the sardines?" becomes quite obvious in this light. They're in the cans! The parents of the sardines we need so badly now were being ground up then into fish meal, were extracted for oil, were being canned; too many of them, far too many.

But the same line of reasoning shows that even the present small breeding stock, given a decent break, will stage a slow comeback. This year's figures from San Pedro, however, indicate no such decent break.

During this time of low population pressure, the migrating few started late, went not far, and came back earlier than usual. Actually we had our winter run during August and September when the fishermen were striking for higher rates. By the time we had put our local house back in order, the fish had gone on south. Many had failed to migrate in the first place, and were milling around their birthplace in the crowded Southern California waters. And the San Pedro fleet, augmented by out-of-work boats from the northern ports, has been making further serious inroads into the already depleted breeding stock.

WE WILL FORGET
My own personal belief is nevertheless optimistic. Next fall I expect to see the fish arrive, early again, and in somewhat greater quantity. But a really good year will be an evil thing for the industry. And still worse for Monterey. Because we'll forget our fears of the moment, queer misguided mortals that we are! We'll disregard conservation proposals as we have in the past; we'll sabotage those already enacted. And the next time this happens we'll be really sunk. Monterey will have lost its chief industry. And this time for good!

If on the other hand, next year is moderately bad - not as bad as this year, Heaven forbid, but let's hope it's decently bad! - maybe we'll go along with such conservation measures as will have been suggested by a Fish and Game Commission which in the past has shown itself far, far too deferent to the wishes of the operators - or maybe to their lobby!

SWEDEN'S EXAMPLE
We have before us our own depleting forests. We have Sweden's example; now producing more forest products every year than it did a hundred years back when the promise of depletion forced the adoption of conservation measures. We have this year's report on our own halibut landings, like the old days - the result solely of conservation instituted by a United States-Canadian commission which took over in the face of obvious depletion.

If we be hoggish, if we fail to cooperate in working this thing out, Monterey COULD go the way of Nootka, Fort Ross, Notley's Landing, or communities in the Mother Lode, ghost towns that faded when the sea otter or lumber or the gold mining failed. If we'll harvest each year only that year's fair proportion (and it'll take probably an international commission to implement such a plan!) there's no reason why we shouldn't go on indefinitely profiting by this effortless production of sea and sun and fertilizer. The farmer in the dell can go on with his harvesting.

This text from material reprinted for educational purposes by Roy van de Hoek, in its entirety from the newspaper:
MONTEREY PENINSULA HERALD 12th Annual Sardine Edition, p.1,3 March 7, 1947.

Thursday, 18 June 2009

"Fact free science"

"I discuss below a particular example of a dynamic system 'Turing's morphogenetic waves' which gives rise to just the kind of structure that, as a biologist, I want to see. But first I must explain why I have a general feeling of unease when contemplating complex systems dynamics. Its devotees are practicing fact-free science. A fact for them is, at best, the output of a computer simulation: it is rarely a fact about the world."

John Maynard Smith, "Life at the Edge of Chaos?," The New York Review of Books, March 2, 1995

Tuesday, 16 June 2009

Microscopy Site

The humble light microscope is an icon of research science. It has a practical 'resolution' that on a good day almost achieves the theoretical resolution that was calculated by Abbe in the 1870s. HERE is a great and very large website dedicated to microscopy. Enjoy.

Ancient Geometry, Stereology & Modern Medics

This is a paper HERE that Vyvyan Howard and I had fun writing a few years back and published in Chance magazine. It's a popular science type article on “Ancient Geometry, Stereology & Modern Medics”, that was originally written in 2002 and which I have recently tidied up into a PDF. It links ancient ideas in geometry with random sampling and state-of-the-art scientific imaging.

One of the key characters is the famous mathematician Zu Gengzhi, whose father Zu Chongzhi (in the picture) was also a famous mathematician.

Zu Gengzhi was the first to describe what we now refer to as Cavalieri's theorem;

"The volumes of two solids of the same height are equal if the areas of the plane sections at equal heights are the same."

The translation by Wagner (HERE) is given as

If blocks are piled up to form volumes,
And corresponding areas are equal,
Then the volumes cannot be unequal.
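The principle can be checked numerically with a classic pair of solids: a hemisphere, and a cylinder with an inverted cone removed. The following is only a minimal Python sketch, assuming a unit radius and a simple midpoint sum over thin slabs ("blocks piled up"); the function and variable names are illustrative and do not come from the Chance article.

```python
import math

def volume_from_sections(area_at_height, height, n_slices=1000):
    """Approximate a volume by summing thin slabs: section area x slab thickness."""
    dh = height / n_slices
    # Evaluate each section area at the mid-height of its slab (midpoint rule).
    return sum(area_at_height((i + 0.5) * dh) * dh for i in range(n_slices))

R = 1.0  # assumed radius (and height) of both solids

def hemisphere_section(h):
    # Disc of radius sqrt(R^2 - h^2) at height h above the flat base.
    return math.pi * (R**2 - h**2)

def cylinder_minus_cone_section(h):
    # Cylinder of radius R with an inverted cone (apex at the base) removed:
    # an annulus with outer radius R and inner radius h.
    return math.pi * R**2 - math.pi * h**2

v_hemisphere = volume_from_sections(hemisphere_section, R)
v_cylinder_minus_cone = volume_from_sections(cylinder_minus_cone_section, R)

# Both values should be close to 2*pi*R^3/3.
print(v_hemisphere, v_cylinder_minus_cone)
```

Because the two section areas agree at every height, the two slab sums agree as well, which is the content of the theorem.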

PhenoQuant™

Inspired by a 2004 paper by Vyvyan Howard et al (HERE) I have been trying to develop a business model whereby high quality quantification of the phenotypic impact of specific gene knock-outs in well defined strains of transgenic mice is provided as a service.

The quantification is achieved with protocolised stereological techniques, and clients get the specific data they need on time and in full. At the same time the parent business builds a proprietary database that can be data-mined to provide new insights and discoveries.

I worked with a Master's student, James Stone, at the Manchester Science Enterprise Centre in 2007 to develop the concept and try to understand the market positioning for such a business, tentatively called PhenoQuant™. James and I did some outline work on a business plan and he explored routes for funding the opportunity. The plan is not mature and it is currently dormant but this is a real opportunity to bring quantification to one key aspect of modern gene-based biomedical research work.

Monday, 15 June 2009

The Art of the Soluble

"No scientist is admired for failing in the attempt to solve problems that lie beyond his competence. The most he can hope for is the kindly contempt earned by the Utopian politician. If politics is the art of the possible, research is surely the art of the soluble. Both are immensely practical-minded affairs." P B Medawar, The Art of the Soluble, 1967.

Galileo the quantifier

"The man who undertakes to solve a scientific question without the help of mathematics undertakes the impossible. We must measure what is measurable and make measurable what cannot be measured".

Galileo during his period in Padua before moving to Florence in 1610.

Cited in; The science of measurement: A historical survey. Herbert Arthur Klein. Dover 1988. p509.

The Data Deluge - 1987 Style

The earliest reference I have yet found to a 'Data Deluge' is in the New York Times 22 years ago (30th June 1987).

This article describes how mad baseball fans can get "...the ultimate service for the most avid baseball addict: daily, detailed reports on 16 minor leagues encompassing 154 teams and more than 3,000 players". This data deluge is made possible by facsimile machines installed in more than 150 minor league ball parks.

'Official scorers or a club official are supposed to file the information within an hour after the game and then the phone-line cost to the team is minimal since the fax transmissions take about 40 seconds each.'

Triple-Point Careers™ in Science and Technology

The cartoon phase diagram of a substance is shown in the figure. There is a unique combination of temperature and pressure at which all three common states of matter (Gas, Liquid and Solid) can be in equilibrium, the triple point.

Using this as an analogy we can define the space of opportunities for science knowledge workers who want to remain in science as the “phase diagram” of science careers. The solid state = Corporate R&D, the liquid state = Academic Research and the gaseous state = SME/startups.

Many people occupy positions fully within one of the three discrete phases; these people manage their careers by following the norms of the profession. They understand the rules of the game for progression and can see how to piece together a career development plan for themselves. Their managers can understand how to help them develop and have many points of comparison by which they can judge performance and progression.

Many science careers also include a single transition, or a small number of transitions, from one phase to another. Typically, for example, an academic may act as a consultant to a corporate R&D organization and then leave academia to take up a full time senior R&D manager position in the corporate. It is also quite common for a senior R&D manager, perhaps later in their career, to move from corporate R&D into a senior management position in academia. There are a number of national and international assistance schemes around to help scientists make these phase transitions at some stage in their career.

However, in addition to more mainstream careers and “phase transitions” there are a number of interesting career opportunities at the boundaries between the different states.

Solid-Liquid boundary = Corporate R&D staff who are successful in the corporate and manage to develop a successful part-time or visiting academic position (e.g. professorship).

Liquid-Gas boundary = Academic scientists who have spun out a company and act as CTO, or more rarely CEO.

Solid-Gas boundary = Corporate R&D staff who have a spin-out company (e.g. from UV type activity).

Triple Point = people who successfully manage a corporate R&D career with parallel activity in an academic position and spin-out.

Staff who are occupying one of these boundary positions require specialized career advice and management. Their career development is non-trivial and not catered for by the mainstream career development methods adopted in the Solid, Liquid or Gas phases.

The triple-point career is a relatively rare combination today but is likely to be more common in the future. This is what is known as a Portfolio Career in other professions (e.g. banking/company directorship = see Financial Times 24/1/2006).

It is likely that in the knowledge economy people who are at, or who gravitate towards, a boundary or triple-point career will be uniquely skilled. It may be that at any one time a triple-point career person is flexing two of the three positions they occupy, i.e. flipping at boundaries. However, over a period of time (2-3 years say) they will flex all three components of their career portfolio.

Issues

Trying to successfully develop and sustain a triple point career is difficult. There are few guidelines for how to get into one of these positions or manage the psychological pressures of being successful there.

Tools

Develop a core intellectual skill set and deploy it in maximally different applications within each “phase”. This is cross-disciplinary working. It allows a lower level of psychological stress.

Find at least one mentor who has a triple point career.

Develop tools for personal “Imagination Management”.

Develop tools for IP and idea management.

Hone to a high level the art of handling contradictory and possibly “conflicting” confidential information. Never waver from this.

Build a network of people who are also triple point career makers but in particular people at the other interfacial positions (Solid-Liquid etc).

Bayesian Networks

I have been aware of Bayesian Networks for a couple of years but never really got deeply into them. I don't know why - the combination of Bayes and graphics fits right into my spectrum of interests. Nevertheless I am going to make up for lost time and dig into the literature. Being of a practical mind I think the following volume looks interesting (based on the Google preview and downloading a couple of chapters): Bayesian Networks: A Practical Guide to Applications. It has been ordered and is hopefully on the way.

Sunday, 14 June 2009

Above all else do No harm

This James Lind library website is a good resource for scientists and statisticians - it is a historical archive of cornerstone articles, books and images that document the development of the concept of 'fair testing' in healthcare. Although I haven't found an explicit mention of the Hippocratic Oath the whole library seems to be motivated by it.

Saturday, 13 June 2009

Basic Ideas of Scientific Sampling

This is a short book by Alan Stuart, and it is not at all Bayesian, but it does describe very simply the concepts of random sampling needed for statistical analysis with unbiased estimators (such as the sample mean).

It has a good description of the 'Central Paradox of Sampling' = "Once a sample has been obtained it is impossible to tell by inspection of the sample whether it has been obtained by a simple random sampling mechanism or not".
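A small simulation makes both points concrete. The following is a minimal Python sketch under stated assumptions: the population (the integers 0 to 999), the sample size and the "selected" sampling rule are all hypothetical choices made for illustration, not examples from Stuart's book.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

population = list(range(1000))  # hypothetical population, mean 499.5
true_mean = statistics.mean(population)

def random_sample(n):
    # Simple random sampling: every subset of size n is equally likely.
    return random.sample(population, n)

def selected_sample(n):
    # A "carefully selected" sample: items drawn only from the upper half.
    return random.sample(population[500:], n)

def average_of_sample_means(sampler, n=20, repeats=2000):
    # Average the sample mean over many repeated samples.
    means = [statistics.mean(sampler(n)) for _ in range(repeats)]
    return statistics.mean(means)

avg_random = average_of_sample_means(random_sample)
avg_selected = average_of_sample_means(selected_sample)

print(true_mean, avg_random, avg_selected)
```

Averaged over many draws the simple random sample mean sits close to the population mean, while the selected sample mean does not. Any single list of 20 numbers from either sampler, however, looks equally plausible on inspection, which is the paradox.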

IOT: Baconian Science

A great episode of the In Our Time radio programme on Radio 4 in the UK (available as a PodCast or HERE).

Bacon was a Jacobean lawyer and politician turned advocate for science - famously writing a book Novum Organum which described how to use data collection, observation and experiment to do science.

Good material on Wikipedia about Bacon and in Oxford.

Quantification: A History of the Meaning of Measurement in the Natural and Social Sciences

This is a book that you can find on Questia (HERE). It was edited by Harry Woolf and is the proceedings of a conference held in 1959; the volume appeared in 1961.

There are some great articles, including one by the famous philosopher and historian of science Thomas Kuhn (he of 'Paradigm Shift' fame). There is also a great article by Wilks, which describes the role that individual measurements play in large systems of measurement and as inputs to quantitative models.

As Wilks puts it; "The subject of quantification in science is an enormous one with many aspects. The foundation of quantification is measurement, and any discussion of the nature of quantification must necessarily begin with a discussion of the nature of measurement."

Enjoy

Friday, 12 June 2009

When Summary Measures Aren't Enough

Another great Rafe Donahue post - HERE - this is a simple but well explained page that shows how powerful it is to show all of the data (what he calls the data atoms).

The figure shows how a plot can be constructed that shows all of the 'data atoms', together with summary measures of the distributions of the data.

As Donahue puts it;

"Analysis, from the Greek, implies breaking things down into component parts so as to understand the whole. Its opposite is synthesis, bringing together the parts to construct the whole. If we are going to do data analysis, then we must make attempts to break the data down to their component parts, their atoms. Computing summary measures like means and medians and percentiles and standard deviations and even F and chi-square and t statistics and P values, is not analysis; it is synthesis! And, worse than playing games with word meaning, data synthesis often obscures understanding the data. "

Matt

[Figure Copyright Rafe Donahue]

Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics

Rafe Donahue has an excellent extended set of notes HERE regarding how to use statistical concepts for making better graphics in scientific reports.

Norvig's More Ways to be wrong than right

More ways to be wrong than right.

Here is a nice meta list of all the mistakes one can make in experimental design and interpretation at Peter Norvig's website (http://norvig.com/experiment-design.html).

Comic Powers of Ten

One of my favourite web based comic artists, xkcd, has recently completed two sketches that when combined cover (almost) the 40+ orders of magnitude referred to by Edward Tufte and which are comic versions of the Eames Powers of Ten video and book.

From 46 billion light years up to Folks (Height = http://xkcd.com/482/)

From people at a PC to a Proton (Depth = http://xkcd.com/485/)

Matt

Quantifying in Scientific Images

I regularly teach histologists and biologists how to extract quantitative 3D information from 2D images and the issue of image dequantification is a key part of the program.

There are a number of linked issues;

(1) Often in experimentally derived images the x and y magnifications can be different due to non-square CCD pixels, optical distortions etc. This can be overcome by taking images of known graticules in both directions to check magnification.

(2) Using calculated "magnifications" derived from the nominal magnification on the side of the objective is often not accurate.

(3) Most histological images are 2D projections of a 3D space (admittedly thin) that has been subjected to considerable tissue processing, shrinkage etc since it was a live piece of tissue. This should be experimentally investigated and accounted for by the scientists. Note that relatively modest linear shrinkage becomes significant 3D volumetric shrinkage. Furthermore, not all tissue shrinks by the same amount. Famously, Herbert Haug, a German anatomist, found in the 1960s that the brain tissue of young humans shrank more than the brain tissue of old humans. This led to an apparent "loss" of neurons with age (older brains had a lower numerical density but, after correcting for differential shrinkage, the same total number).

(4) The act of taking a thin histological (or optical) section leads to an observed reduction in feature dimension. This is not widely reflected upon and there are no accepted ways to indicate this in standard histological images.

(5) Often the histology stains used do not stain tissue compartments uniformly.

(6) Histological images are in fact 2D samples (real Flatland stuff) from 3 space and in common with all statistical sampling they suffer from the "Central Paradox of Sampling" = simply by looking at the picture you have no idea if it has been randomly sampled or carefully selected (i.e. it is an unbiased or a biased sample).

(7) The short advice I give for all quantitative image analysis tasks is always to think Outside-In, not Inside-Out, i.e. think about what was done to obtain the image you have in front of you and whether this is appropriate for the scientific task at hand, and do NOT get obsessed by the image details, some of which are misleading.
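Points (1) and (3) are simple arithmetic and can be sketched in a few lines of Python. All of the numbers below (graticule length, pixel counts, shrinkage fractions, neuron count) are hypothetical values chosen for illustration, and the function names are my own; the shrinkage formula assumes the shrinkage is the same in all three directions.

```python
def pixel_size(graticule_length_um, measured_pixels):
    """Microns per pixel from an image of a graticule of known length."""
    return graticule_length_um / measured_pixels

def volume_shrinkage(linear_shrinkage):
    """Fraction of volume lost for a given isotropic linear shrinkage fraction."""
    return 1.0 - (1.0 - linear_shrinkage) ** 3

# Point (1): a 100 um graticule imaged along each axis (hypothetical counts).
um_per_px_x = pixel_size(100.0, 400.0)  # 0.25 um per pixel in x
um_per_px_y = pixel_size(100.0, 380.0)  # slightly larger in y
aspect_error = um_per_px_y / um_per_px_x - 1.0  # about 5% mismatch

# A feature that spans 120 x 120 pixels is not square in real units.
width_um = 120 * um_per_px_x
height_um = 120 * um_per_px_y

# Point (3): 10% linear shrinkage removes about 27% of the volume.
lost = volume_shrinkage(0.10)

# Same (hypothetical) neuron count in a unit fresh volume, different shrinkage:
# the block that shrinks more shows the higher numerical density.
neurons = 1000
density_young = neurons / (1.0 - volume_shrinkage(0.20))
density_old = neurons / (1.0 - volume_shrinkage(0.10))

print(round(aspect_error, 3), width_um, round(height_um, 2), round(lost, 3))
print(round(density_young), round(density_old))
```

The last two lines of the calculation mirror the Haug observation: identical total numbers, but a lower apparent density in the tissue that shrank less.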

Matt

How science works

I have been trying to get my head around 'evidence' as a general concept and unsurprisingly lawyers are pretty hung up about this. How 'evidence' in a scientific, engineering and statistical sense interacts with the law is the subject of this very large and authoritative US Federal government Handbook on how to use scientific evidence in law courts = www.fjc.gov/public/pdf.nsf/lookup/sciman00.pdf/$file/sciman00.pdf .

There are extensive chapters on science, stats and engineering and how evidence from these different fields of endeavour should be considered by lawyers and judges.

I had a good detailed read of the chapter on 'How science works' and find it an excellent, informed and irreverent description of what actually happens in science rather than what theoretically happens.

Matt

Moore's Law

Data from Wikipedia entry on Transistor Count.

March of Resolution

We can arbitrarily fix the first use of measurement in scientific research, as we would now know it, to Europe in the early 1600s. This modern paradigm of scientific research is not defined by measurement per se (the Egyptians were already doing that); it is defined by the use of measurements that have been obtained with specialised instruments constructed to extend the human capacity to resolve differences. The core concept here is resolution: the ability to detect, with some level of reliability, differences between two or more states. For example, in an optical microscope resolution is defined as the ability to distinguish and detect fine detail.

This relationship between science and resolution is not just nostalgic; there is an integral connection between the growth of science and the development of measurement resolution, as Edward Tufte describes it (Tufte 2003);

"The history of science over the centuries can be written in terms of improvements in resolution. From the beginning and all the way up to 1609, when Galileo's telescope first assisted human vision, scientific knowledge consisted of making descriptions and comparisons for events taking place at measurement scales accessible to the human eye, from about 10^-3 (a tiny speck) and up to 10^+7 meters (the Milky Way), some 11 orders of magnitude. Now, 400 years later, scientific descriptions and comparisons take place at scales from 10^-18 and up to 10^+25 meters, some 44 orders of magnitude. That is, from 1609 to 2003, scientific resolution improved an average of about 8 orders of magnitude per century (or 100 million-fold per century). Scientific resolution has increased an average 10,000,000 to 100,000,000 times per century in each of the 4 centuries since Galileo."

Although Tufte makes the argument with respect to the resolution of distances a similar argument can be made about improvements in temporal resolution (from sun dials with rough “hourly” resolution in antiquity to modern atomic clocks which can maintain an accuracy of about 10^-9 seconds per day), chemical resolution or mass resolution.
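The arithmetic in the Tufte quote can be checked in a few lines of Python. This is a minimal sketch that assumes the orders of magnitude are counted inclusively (both end decades included), which is the counting that reproduces his figures of 11 and 44; the function name is illustrative.

```python
def orders_spanned(min_exponent, max_exponent):
    # Count decades inclusively, which reproduces the figures in the quote.
    return max_exponent - min_exponent + 1

orders_1609 = orders_spanned(-3, 7)    # naked eye: 10^-3 m to 10^+7 m
orders_2003 = orders_spanned(-18, 25)  # instruments: 10^-18 m to 10^+25 m

centuries = (2003 - 1609) / 100        # 3.94 centuries
gain_per_century = (orders_2003 - orders_1609) / centuries

print(orders_1609, orders_2003, round(gain_per_century, 1))  # 11 44 8.4
```

The result, a little over 8 orders of magnitude per century, is consistent with the "about 8" and "100 million-fold" in the quote.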

Matt

Cramming more components onto integrated circuits

There are a number of ways to illustrate the revolutionary growth in computational power over the past 50 years or so, but perhaps one of the easiest is as a simple folk story.

In the good old days of the late fifties, 1958 to be exact, an electronics engineer at Texas Instruments called Jack Kilby built the very first integrated circuit (IC); it was a major technical achievement and was composed of five separate components (resistors, transistors and capacitors).

In 2004 Intel introduced the Itanium 2 chip with 9MB cache and 592 million transistors.

These arbitrary dates, 1958 and 2004, can be used to bookend an incredible growth in computer complexity and power. What is almost as surprising as this incredible growth is the fact that it was predicted in advance in 1965 by someone who was in the thick of the then new electronics industry. The paper in question was published under the heading, “The experts look ahead” and was written by Gordon Moore, later a co-founder of Intel, when he was the Research & Development Director of Fairchild Semiconductor. The article had a typical engineering swagger and directness about its title; “Cramming more components onto integrated circuits”, and it appeared in a trade journal called Electronics (Moore 1965).

In keeping with the down-to-earth tone of the title was the simple and crystal clear message that the author spelt out. Moore predicted that the pace of technical developments in the semiconductor industry would be extremely rapid. They would follow a growth law, now known as Moore's law, such that the number of components you could fit onto an integrated circuit would roughly double at regular intervals (every year in the 1965 paper, a figure Moore revised to every 2 years in 1975). Moore predicted that,

“Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years”.

In fact the doubling every two years that Moore predicted has remained essentially constant for the past 40 years.
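The two bookends above are enough to work out the doubling time they imply. This is a rough sketch using only the figures quoted in this post; it compares the component count of Kilby's device with the transistor count of a high-end processor, so it is an illustration rather than a careful test of Moore's law.

```python
import math

# Figures taken from the text above.
components_1958 = 5              # Kilby's first integrated circuit
transistors_2004 = 592_000_000   # Itanium 2 with 9MB cache

years = 2004 - 1958

# Number of times the count has to double to get from 5 to 592 million.
doublings = math.log2(transistors_2004 / components_1958)
doubling_time = years / doublings

print(f"{doublings:.1f} doublings in {years} years")
print(f"implied doubling time: {doubling_time:.2f} years")
```

The implied doubling time comes out at a little under two years, which is in line with the two-year figure.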

The End of Science

In August 2008 the editor-in-chief of Wired magazine, Chris Anderson, proclaimed “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. He backed up this claim with 12 case studies that illustrate his point: Google-quality data analysis has made the search for causation obsolete and “correlation is enough”. There will be many working scientists who will not lose much sleep over this claim. In the end the Supercrunchers have to have data to work on, and in science good data is still not trivial to obtain. Nevertheless, this article and its premise should provoke active scientists and statisticians to think deeply about the implications.

In contrast to Wired, we believe that the core tools of science and scientific inference are as valid today as they ever were, but the penalties for a lack of rigour are worse: there are now even more ways to do it wrong than right. More than ever, scientists need to be skilled in the art of Integrated Quantification (IQ): the ability to take into account all aspects of quantification in a holistic manner to achieve high quality scientific research.

This skillset is not yet taught in current university courses; neither science degrees nor statistics degrees deal with post-data-deluge quantification in a holistic manner. The ability to critically appraise the output of a Google-Algorithm that has found a putative correlation will be more important than ever. In business it may be enough to say that the correlation helps give you an edge when interacting with billions of humans, but in science this has yet to be proven. If we don’t build this skillset, and build methods to utilise it in both academic and commercial research, we will have a new generation of scientists who believe the Wired view that the scientific method is obsolete.
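One reason a putative correlation needs critical appraisal is that searching through enough variables will always turn something up. The sketch below is a toy simulation of my own, not anything from the Wired article: the sample size, the number of candidate variables and the random seed are arbitrary assumptions. It correlates one column of pure noise against 1,000 other columns of pure noise and reports the strongest match.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2009)   # fixed seed so the run is repeatable
n_samples = 20              # observations per variable (assumed)
n_candidates = 1000         # unrelated variables searched (assumed)

# The target and every candidate are independent noise by construction.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
correlations = [
    pearson(target, [rng.gauss(0, 1) for _ in range(n_samples)])
    for _ in range(n_candidates)
]

strongest = max(correlations, key=abs)
print(f"strongest correlation among {n_candidates} noise variables: {strongest:+.2f}")
```

Nothing here is related to anything else, yet the best of the 1,000 candidates looks like a respectable correlation. Without knowing how many comparisons were made, the number on its own tells you very little.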

Matt

Experimental Control Vs Statistical Control

This paper, HERE, is a very thought-provoking old paper dealing with which statistical procedures are suitable for single-subject studies.

One of the interesting thoughts is the role of Experimental Control Vs Statistical Control of variability.

Measurement Scientists build expertise in Experimental Control: they build equipment, modify instruments, write software and understand their system. They do not undertake hypothesis driven activities using “traditional” statistical inference approaches such as Null Hypotheses (H0) and Alternative Hypotheses (HA). They are mainly physicists. This skillset is very specific to a given experimental technique and is generally NOT transferable.

Data Scientists build expertise in Statistical Control: they design experiments and calculate the power of an experiment. They undertake hypothesis driven activities using “traditional” statistical inference approaches. They are mainly mathematicians or statisticians. This skillset is very generic and transferable.
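As an illustration of the kind of calculation meant by "the power of an experiment", here is a minimal sketch using the normal approximation for a two-sided, two-sample comparison of means. The function name, the effect size and the group size are my own illustrative choices, and a real design would normally use the t distribution rather than this approximation.

```python
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the common
    standard deviation; n_per_group is the number of subjects per group.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power = two_sample_power(effect_size=0.5, n_per_group=64)
print(f"power: {power:.2f}")
```

With a medium effect size of 0.5 and 64 subjects per group the power comes out at roughly 0.8, the conventional target. When the effect size is zero the function returns alpha, as it should.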

Ed Jaynes

Ed Jaynes was an American physicist who was born in 1922 and who carried on working pretty much until his death in 1998. Jaynes was one of the first people to fully explain that the classical probability theory developed by scientists such as Laplace (1749–1827) can be considered as a generalization of Aristotelian logic. In the case of reasoning about uncertainty, our logic is that of probability theory, and in the special case that our hypotheses are either true or false, we can reason using classical deductive logic.
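A small numerical sketch may help to show what "generalization of logic" means here. The numbers and the function are my own illustration, not taken from Jaynes. The statement "A implies B" is encoded as P(B | A) = 1, and Bayes' theorem then reproduces a deductive syllogism when B is known to be false, and a weaker "plausible" syllogism when B is known to be true.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a, b_observed=True):
    """Return P(A | B) if b_observed, else P(A | not B), by Bayes' theorem."""
    if not b_observed:
        # Condition on "not B" by switching to the complementary likelihoods.
        p_b_given_a = 1 - p_b_given_a
        p_b_given_not_a = 1 - p_b_given_not_a
    evidence = prior_a * p_b_given_a + (1 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / evidence


# Illustrative numbers; "A implies B" means P(B | A) = 1.
prior_a = 0.2
p_b_given_not_a = 0.5

# Deductive limit: B is false, therefore A is false.
a_given_not_b = posterior(prior_a, 1.0, p_b_given_not_a, b_observed=False)

# Plausible reasoning: B is true, therefore A becomes more plausible.
a_given_b = posterior(prior_a, 1.0, p_b_given_not_a, b_observed=True)

print(f"P(A | not B) = {a_given_not_b:.3f}")
print(f"P(A | B)     = {a_given_b:.3f} (prior was {prior_a})")
```

The first result is exactly zero, which is the classical deduction. The second is larger than the prior but short of certainty, which is the kind of reasoning that deductive logic alone cannot express.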

The distinctive approach of Ed Jaynes is not as widely known as it should be. Jaynes’s life’s work is summarised in a book called Probability Theory: The Logic of Science. This was published by Cambridge University Press and is HERE. Reviews include the following: ‘This is not an ordinary text. It is an unabashed, hard sell of the Bayesian approach to statistics. It is wonderfully down to earth, with hundreds of telling examples. Everyone who is interested in the problems or applications of statistics should have a serious look.’ SIAM News.

The statement that ‘this is not an ordinary text’ is a serious understatement. At one level the book is easy to read and has numerous real insights about the problems that a quantitative scientist faces when trying to reason about an unknown reality based on partial information – the situation most natural and social scientists face during their working lives. At a deeper level this is a difficult and very ambitious book. The aim of the book is no less than an explication of the foundations of applied scientific inference. It combines physical insight, philosophical challenges and mathematical difficulty. It will stretch your conceptions about how and where probability theory can and should be applied. In the final analysis much of the book’s technical material may well be shown to be wrong. However, simply trying to understand the book and its expansive vision of what scientific inference should be may well lead to a productive and creative life’s worth of scientific work.

Jaynes was a leading thinker in the field of Objective Bayesian statistics (see Wikipedia HERE).

Other bits and pieces about Jaynes, including a short bio, are HERE.

Tufte Principles of Analytical Design

Edward Tufte describes 6 fundamental principles for analytical design (that he claims are merely mirrors of 6 principles of analytical thinking).

In brief these are:

(1) Show comparisons, contrasts, differences.

(2) Show causality, mechanism, explanation, systematic structure.

(3) Show multivariate data; that is show more than 1 or 2 variables.

(4) Completely integrate words, numbers, images, diagrams.

(5) Thoroughly describe the evidence. Provide a detailed title, indicate the authors and sponsors, document the data sources, show complete measurement scales, point out relevant issues.

(6) Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

Matt

Antediluvian

Antediluvian - before the Deluge.

Only 100 years ago high quality scientific data was scarce. Very scarce. This scarcity was a major barrier to scientific progress and many of the leading research scientists of the day spent enormous efforts to design and construct physical instruments that were capable of providing high quality, reproducible data. For example, consider the life work of Hermann von Helmholtz (1821-1894), a German quantitative scientist who made important contributions to theoretical physics and to the early development of psychophysics. Helmholtz had broad interests in theoretical and practical science and Einstein famously said of him, “I admire ever more the original, free thinker Helmholtz” (Einstein 1899).

In the course of his wide ranging career Helmholtz invented a number of key measurement devices. In 1851 he revolutionised the study of the human eye with the invention of the ophthalmoscope and in the 1860s he invented the Helmholtz resonator to generate pure tones for his studies on human tone sensation and perception. Helmholtz was not a mere measurer of things; he also contributed new and deep thinking to the philosophy of science.

However, his famous dictum that “all science is measurement” was based on a lifetime’s commitment to the creation and use of high quality experimental data. The reason that Helmholtz spent so much effort on creating physical devices for generating data for his studies was that they simply did not exist. Compared with today, that era of science could be described as a data desert. Each and every data point had a real value to a practising scientist because they knew exactly how much effort had been expended in obtaining it.

Return of the Max

Recently I have been digging back in time to reacquaint myself with Maximum Entropy data processing (MaxEnt). It’s a full 11 years since I last earned a living by applying MaxEnt and trying to get back into the area has been a good experience – I have been pleasantly surprised by how things are in MaxEnt land.

Last I knew, MaxEnt and Bayes were officially two distinct things. There was an understanding that MaxEnt could be thought of as a Bayesian prior probability distribution for positive additive distributed quantities (e.g. images and spectra). Now it seems that they are deeply and fundamentally connected and that there is a single underlying theory (confusingly called Maximum Relative Entropy or MrE) that shows that Bayes updating and MaxEnt are both special cases of a unifying inference framework. I do not pretend to fully understand all of the nuances, either philosophical or mathematical, but it appears that Ariel Caticha and his PhD student Adom Giffin have been having great fun setting up the new formalism and applying it (see particularly Caticha’s lecture notes and Giffin’s thesis on arXiv.org, which I have been reading). My experience is that even the best ideas take time to catch on, because so many people have to unlearn so much, but somehow I suspect that in another 11 years or so we may all be maximising our relative entropies!
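For anyone who has not met MaxEnt before, the classic toy problem is Jaynes's die: all you are told is that the average score over many throws is 4.5 rather than the 3.5 of a fair die, and you want the least committal set of probabilities for the six faces. The sketch below is my own minimal version of that classic problem and is not an implementation of the newer MrE formalism. It uses the standard result that the maximum entropy solution is exponential in the face value and finds the multiplier by bisection; the function name and search limits are assumptions.

```python
import math

FACES = range(1, 7)


def maxent_die(target_mean, tol=1e-12):
    """Maximum entropy probabilities for a die with a given mean score.

    The solution has the form p_i proportional to exp(lam * i); the
    multiplier lam is found by bisection on the mean.
    """
    def distribution(lam):
        weights = [math.exp(lam * face) for face in FACES]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(face * p for face, p in zip(FACES, distribution(lam)))

    low, high = -50.0, 50.0   # the mean increases steadily with lam
    while high - low > tol:
        mid = (low + high) / 2
        if mean(mid) < target_mean:
            low = mid
        else:
            high = mid
    return distribution((low + high) / 2)


probs = maxent_die(4.5)
print([round(p, 4) for p in probs])
```

The probabilities rise smoothly from the 1 to the 6. A target mean of 3.5 gives back the uniform distribution, which is a useful sanity check.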

Matt

Wigmore Evidence Charts

I have been reading a fascinating book by two probabilists, Joseph Kadane and David Schum, called “A Probabilistic Analysis of the Sacco and Vanzetti Evidence”, about the famous US murder case from the 1920s.

One interesting feature of the book is their use of a particular type of graphic which they call a ‘Wigmore Evidence Chart’. They also show a complete appendix of these charts summarising the Sacco and Vanzetti evidence. Wigmore was a US jurist who developed a probabilistic approach to proof (he published this approach from 1913 onwards) and is seen by the authors as the earliest exponent of what today are called ‘inference networks’ (Wikipedia HERE).

I have not seen this type of chart before. They are interesting but visually a bit ropey; they could really do with improvements in clarity, line weights, text integration and colours.

However, they are not dead; a quick search on Google took me to this 2008 paper by Bruce Hay of Harvard Law School (HERE). The paper is from the journal “Law, Probability and Risk” and the Abstract says the following:

“Wigmore’s ‘The Problem of Proof’, published in 1913, was a path-breaking attempt to systematize the process of drawing inferences from trial evidence. In this paper, written for a conference on visual approaches to evidence, I look at the Wigmore article in relation to cubist art, which coincidentally made its American debut in New York and Chicago the same spring that the article appeared. The point of the paper is to encourage greater attention to the complex meanings embedded in visual diagrams, meanings overlooked by the prevailing cognitive scientific approaches to the Wigmore method.”

I will be ordering this paper and reading around it, but in the meantime you may find this interesting.

Just found more on Wigmore – a whole book by Anderson, Schum & Twining HERE that has a long explanation of how to apply the method.

Matt