Data Deluge: The Ecological Fallacy & MAUP

Tuesday, 29 June 2010

The ecological fallacy occurs when analyses that are based on grouped data lead to conclusions that are different from those based on the analysis of individual data. One of the early examples is given in Robinson (1950).

From the Wikipedia entry:

`An ecological fallacy (or ecological inference fallacy) is an error in the interpretation of statistical data in an ecological study, whereby inferences about the nature of specific individuals are based solely upon aggregate statistics collected for the group to which those individuals belong. This fallacy assumes that individual members of a group have the average characteristics of the group at large...'

(on Robinson (1950):)

`...for each of the 48 states in the US as of the 1930 census, he computed the literacy rate and the proportion of the population born outside the US. He showed that these two figures were associated with a positive correlation of 0.53 — in other words, the greater the proportion of immigrants in a state, the higher its average literacy. However, when individuals are considered, the correlation was -0.11 — immigrants were on average less literate than native citizens. Robinson showed that the positive correlation at the level of state populations was because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or "ecological" data'
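The sign flip described in the quote can be reproduced with a toy example. The sketch below uses made-up counts for three hypothetical states (these are not Robinson's figures): within every state immigrants are less literate than natives, but the states with more immigrants also have more literate natives. The names `states`, `r_individual` and `r_state` are chosen here purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical counts per state (NOT Robinson's data), 100 people each:
# (native literate, native illiterate, immigrant literate, immigrant illiterate)
states = {
    "A": (63, 27, 5, 5),
    "B": (68, 12, 12, 8),
    "C": (63, 7, 21, 9),
}

# Individual-level records: one 0/1 indicator pair per person.
immigrant, literate = [], []
for nat_lit, nat_ill, imm_lit, imm_ill in states.values():
    immigrant += [0] * (nat_lit + nat_ill) + [1] * (imm_lit + imm_ill)
    literate += [1] * nat_lit + [0] * nat_ill + [1] * imm_lit + [0] * imm_ill

# State-level ("ecological") records: one pair of proportions per state.
share_immigrant, literacy_rate = [], []
for nat_lit, nat_ill, imm_lit, imm_ill in states.values():
    total = nat_lit + nat_ill + imm_lit + imm_ill
    share_immigrant.append((imm_lit + imm_ill) / total)
    literacy_rate.append((nat_lit + imm_lit) / total)

r_individual = pearson(immigrant, literate)
r_state = pearson(share_immigrant, literacy_rate)
print(f"individual-level correlation: {r_individual:+.2f}")
print(f"state-level correlation:      {r_state:+.2f}")
```

With these counts the state-level correlation is positive while the individual-level correlation is negative, which is the same pattern (though not the same magnitudes) that Robinson reported.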

This is closely related to a problem that I have long known as the "change of support problem", which is the name used in the fields of mathematical morphology and integral geometry. I recently found out that within spatial statistics and GIS it has another name: the Modifiable Areal Unit Problem (MAUP). The basic problem is that for spatial data, such as health outcomes recorded by zip code or county, socio-demographic data from Census tracts, or safety and health exposure estimates within a region around a suspected source, statistical inference changes with scale.

A classic early paper is Gehlke and Biehl (1934), who found that the magnitude of the correlation between two variables tended to increase as the districts formed from Census tracts increased in size.
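One way to see how grouping can inflate a correlation is a small simulation. The sketch below is only an illustration under an assumed data model, not Gehlke and Biehl's data or method: two variables share a smooth trend along a line of 1200 hypothetical tracts, each with its own independent tract-level noise. Averaging neighbouring tracts into larger districts cancels much of the noise but keeps the shared trend.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_tracts = 1200         # hypothetical number of tracts along a transect

# Assumed model: a shared smooth trend plus independent noise per variable.
trend = [i / n_tracts for i in range(n_tracts)]
x = [t + rng.gauss(0, 1) for t in trend]
y = [t + rng.gauss(0, 1) for t in trend]

def aggregate(values, size):
    """Average consecutive runs of `size` tracts into one district."""
    return [sum(values[i:i + size]) / size for i in range(0, len(values), size)]

corr_by_size = {}
for size in (1, 10, 100):
    corr_by_size[size] = pearson(aggregate(x, size), aggregate(y, size))
    print(f"{n_tracts // size:5d} districts of {size:3d} tracts: "
          f"r = {corr_by_size[size]:+.2f}")
```

Under this model the tract-level correlation is weak because the noise dominates, and it grows as the districts get larger. Other data-generating assumptions can behave differently, so this shows one mechanism rather than a general rule.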

Waller & Gotway (2004) describe it as a "geographic manifestation of the ecological fallacy in which conclusions based on data aggregated to a particular set of districts may change if one aggregates the same underlying data to a different set of districts".

The paper by Openshaw and Taylor (1979) describes how they constructed all possible groupings of the 99 counties in Iowa into larger districts. When considering the correlation between % Republican voters and % elderly voters, they could produce "a million or so" correlation coefficients. A set of 12 districts could be contrived to produce correlations that ranged from -0.97 to +0.99.
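The zoning effect is easy to try out on synthetic data. The following is a rough sketch, not Openshaw and Taylor's procedure or their Iowa data: it draws two independent made-up percentages for 99 hypothetical units, groups the units into 12 districts at random (ignoring contiguity and assuming equal populations, both simplifications), and records the district-level correlation for each of 1000 random zonings.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
N_UNITS, N_DISTRICTS, N_ZONINGS = 99, 12, 1000

# Hypothetical unit-level percentages, drawn independently of each other.
pct_a = [rng.uniform(20, 60) for _ in range(N_UNITS)]
pct_b = [rng.uniform(5, 25) for _ in range(N_UNITS)]

def random_zoning():
    """Split the units into N_DISTRICTS non-empty groups at random."""
    order = list(range(N_UNITS))
    rng.shuffle(order)
    cuts = sorted(rng.sample(range(1, N_UNITS), N_DISTRICTS - 1))
    bounds = [0] + cuts + [N_UNITS]
    return [order[a:b] for a, b in zip(bounds, bounds[1:])]

def district_means(values, zoning):
    """Unweighted mean of the unit values in each district."""
    return [sum(values[i] for i in d) / len(d) for d in zoning]

correlations = []
for _ in range(N_ZONINGS):
    zoning = random_zoning()
    correlations.append(
        pearson(district_means(pct_a, zoning), district_means(pct_b, zoning))
    )

print(f"unit-level r: {pearson(pct_a, pct_b):+.2f}")
print(f"district-level r over {N_ZONINGS} zonings: "
      f"{min(correlations):+.2f} to {max(correlations):+.2f}")
```

Even with purely random groupings the same underlying data yield a wide spread of district-level correlations. A search that deliberately picks the zoning to push the correlation up or down, as in the contrived 12-district sets mentioned above, would be expected to reach further still.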

More here: http://www.samsi.info/200304/multi/cgcrawford.pdf

From Openshaw (1984):

`the areal units (zonal objects) used in many geographical studies are arbitrary, modifiable, and subject to the whims and fancies of whoever is doing, or did, the aggregating.'

Below is an example figure from Openshaw (1984).

References.

Gehlke, C. E. and K. Biehl (1934). Certain effects of grouping upon the size of the correlation coefficient in census tract material. Journal of the American Statistical Association, 29, 169-170.

Openshaw, S. (1984). The Modifiable Areal Unit Problem. CATMOG 38. ISBN 0 86094 134 5

Openshaw, S. and P. Taylor (1979). A million or so correlation coefficients. In N. Wrigley (Ed.), Statistical Methods in the Spatial Sciences, pp. 127-144. London: Pion.

Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15, 351-357.

Waller, L. A. and C. A. Gotway (2004). Applied Spatial Statistics for Public Health Data. Hoboken, NJ: John Wiley & Sons.