Correlation does not imply causality examples


Introduction

I am looking for examples showing that correlation does not imply causality. The target audience consists of undergraduate students (in their first year at the university, but in the BioMathStat track). All practicals are done in R. If you don't have R at hand yet, you can copy and paste the source code below into our RWeb server and then click on the submit button to reproduce the graphics.

Smoking is good for children

This great example was pointed out to me by Thomas Lumley from the University of Washington. The original dataset is here and its description there. Don't look at the original post on the R-help mailing list there, because it contains a hint.

fev <- read.table("http://pbil.univ-lyon1.fr/members/lobry/corcau/fev.txt")
names(fev) <-c("seqnbr","subjid","age","fev","height","sex","smoke")
smk  <- fev[fev$smoke == 1, ] # smokers
nsmk <- fev[fev$smoke == 2, ] # non-smokers
boxplot(fev$fev~fev$smoke, col = c("pink","lightblue"),
        names = c(paste("smoking (n =", nrow(smk), ")"),
                  paste("non smoking (n =", nrow(nsmk), ")")),
        notch = TRUE,
        ylab = "Forced expiratory volume (l/s)",
        main = paste("Respiratory function for", nrow(fev),"children"))

There is clearly a difference between the smoking and non-smoking groups. Since the notches of the two boxplots do not overlap, we can reject, at a critical level of roughly 5 %, the null hypothesis that the median respiratory function is the same in the two groups. But, but, but... wait a minute... the results are better in the smoking group! I won't give the solution to this apparently curious result here: there is enough information in this dataset to understand what is going on. Smoking is of course not good for children, so why does the smoking group have better respiratory results?
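
The notch argument used above can also be checked numerically. The sketch below uses simulated data (not the fev dataset), and the helper notches_overlap is a hypothetical name introduced here for illustration only. It relies on boxplot.stats(), whose conf component holds the notch limits, roughly median +/- 1.58 IQR/sqrt(n).

```r
# Simulated data, NOT the fev dataset: two groups with different medians
set.seed(1)
a <- rnorm(100, mean = 3.0, sd = 0.5)
b <- rnorm(100, mean = 2.5, sd = 0.5)

# boxplot.stats() returns the lower and upper notch limits in $conf
notch_a <- boxplot.stats(a)$conf
notch_b <- boxplot.stats(b)$conf

# Hypothetical helper: two intervals overlap unless one lies
# entirely above the other
notches_overlap <- function(x, y) !(x[1] > y[2] || y[1] > x[2])

overlap <- notches_overlap(notch_a, notch_b)
print(rbind(notch_a, notch_b))
print(overlap)
```

When overlap is FALSE, the two notches are disjoint, which is the situation read off the boxplot by eye.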

Super Bowls and stock markets

This example was pointed out to me by Jill Caviglia-Harris from Salisbury University. The Super Bowl is a popular sporting event in the USA, based on American football. The Super Bowl theory correlates USA stock market performance with the results of the Super Bowl. This theory (Sommers 2000) states that: "When a team from the National Football Conference wins the Super Bowl, the stock market goes up, and when it doesn't, the stock market goes down. Usually, that is. Alas, the predictive power of this rule is on the decline and in a few years it may be down to zero. The author does not predict what the next spurious correlation will be."

sbt <- read.table("http://pbil.univ-lyon1.fr/members/lobry/corcau/sbt.txt")
sbt <- sbt[sbt$Year < 1999, ] # Don't look at all data
nfc  <- sbt[sbt$Conference == "NFC", ] # National Football Conference
nnfc <- sbt[sbt$Conference == "AFC", ] # American Football Conference (non-NFC)

boxplot(sbt$Per~sbt$Conf, col = c("pink","lightblue"), 
        names = c(paste("non NFC (n =", nrow(nnfc), ")"), 
                 paste("NFC (n =", nrow(nfc), ")")), 
        notch = TRUE,
        ylab = "Percentage Change in DJIA ",
        main = paste("Super bowls and stock markets (",
                      min(sbt$Year), "-", max(sbt$Year), ")" ))

Argh... Too bad! I really loved this one, but the difference is not striking enough...

Si vis pacem para bellum

This dataset was posted by Albyn Jones from Reed College, in Portland, Oregon (USA), on the R-help mailing list. The expression "gun registration laws" refers to the mandatory licensing and registration of handguns, which some USA states do not require. The gun registration indicator is taken from a USA Today article (Tuesday, January 7, 1992, PAGE 5A). The article reported that gun registration laws lead to increased numbers of homicides, a conclusion reached by comparing the mean number of homicides in states with gun registration laws to that in states without registration laws:

guns <- read.table("http://pbil.univ-lyon1.fr/members/lobry/corcau/guns.txt", header = TRUE)
grl  <- guns[guns$gunreg == 1, ] # States with gun registration laws
ngrl <- guns[guns$gunreg == 0, ] # States without gun registration laws
boxplot(guns$homicide~guns$gunreg, col = c("pink","lightblue"),
        notch = TRUE,
        ylab = "Homicide counts (log scale)",
        names = c(paste("No gun law (n =", nrow(ngrl), ")"), 
                  paste("Gun law (n =", nrow(grl), ")")),
        main = paste("Homicides in", nrow(guns),"USA states"),
        log = "y")

There is clearly a difference between the states with and without gun registration laws. Since the notches of the two boxplots do not overlap, we can reject, at a critical level of roughly 5 %, the null hypothesis that the median number of homicides is the same in the two groups. But, but, but... wait a minute... the results are better in the states without gun registration laws! Here again, there is enough information in this dataset to understand what is going on.

Sunspots increase the lynx's libido

I have set this one up directly from R's standard datasets:

data(sunspots)
# library(ts) is no longer needed: the ts package was merged into stats,
# which is loaded by default in current versions of R
data(lynx)
# frequency = 1 keeps one value per year from the monthly sunspot series
spots <- window(sunspots, frequency = 1, start = 1880, end = 1900)
lnx <- window(lynx, start = 1880, end = 1900)
ratio <- max(lnx)/max(spots) # rescale sunspots to the lynx axis
par(mai = rep(1, 4))
plot(lnx, main = "Sun spots intensity\nand lynx population density",
     type = "b", ylab = "")
lines(ratio*spots, col = "red", type = "b")
axis(side = 4, col = "red", col.axis = "red", at = ratio*pretty(spots),
     labels = pretty(spots))
legend(1887, 4500, col = c("red", "black"), c("spots", "lynx"), pch = 21)

And the correlation vanishes when you consider all available data:

spots <- window(sunspots, frequency = 1, start = 1821, end = 1934)
ratio <- max(lynx)/max(spots)
plot(lynx, main = "Sun spots intensity\nand lynx population density",
     ylab = "")
lines(ratio*spots, col = "red")
axis(side = 4, col = "red", col.axis = "red", at = ratio*pretty(spots),
     labels = pretty(spots))
mtext("Sun spots", adj = 1, col = "red")
mtext("Lynx", adj = 0)
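
The visual impression given by the two plots can be backed by a number. The following is only a sketch: it assumes that Pearson's correlation coefficient is a reasonable summary here, and it reuses the same yearly subsampling of the sunspots series as above. The names spots_all, r_window and r_all are introduced for illustration.

```r
# One sunspot value per year over the period covered by the lynx series
spots_all <- window(sunspots, frequency = 1, start = 1821, end = 1934)

# Correlation over the hand-picked 1880-1900 window
r_window <- cor(as.numeric(window(lynx, start = 1880, end = 1900)),
                as.numeric(window(spots_all, start = 1880, end = 1900)))

# Correlation over all available years
r_all <- cor(as.numeric(lynx), as.numeric(spots_all))

print(c(window = r_window, all = r_all))
```

If the correlation really vanishes on the full series, r_all should be much closer to zero than r_window.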

As Eric Rexstad from the University of Alaska Fairbanks pointed out to me, it turns out that this is not a very good example because the lynx-sunspot relationship is a controversial issue (yes!). It was first put forward in Elton, C. (1924) Br. J. Exp. Biol., 2:119-163 and later rejected by the same author in Elton, C., Nicholson, M. (1942) J. Anim. Ecol., 11:215-244. However, using a new dataset on the past fluctuations of the main prey of the lynx, Sinclair et al. (1993) Am. Nat., 141:173-198 do not rule out a possible indirect modulation of the prey population dynamics by solar activity (but see also Lindstrom et al. (1996) There is nothing new under the sunspots. Oikos, 77:565-568; Ranta et al. (1997) Solar activity and hare dynamics: a cross-continental comparison. Am. Nat., 149:765-775; Sinclair & Gosline (1997) Solar activity and mammal cycles in the Northern Hemisphere. Am. Nat., 149:776-784). It is therefore advisable to avoid this example.

Solar radiation and the London Stock Exchange index

Pointed out to me by Jason Turner: on page 23 of Tufte's bible there is a graph showing a correlation between solar radiation and the London Stock Exchange index. I have not found the corresponding dataset so far.

Beer drinking in the USA and child mortality in Japan

I have only found this link so far. Paul Watzlawick is apparently the originator of this one.

Weather patterns and NYSE

Pointed out to me by Scott Milligan from McMaster University:

I seem to remember a tongue-in-cheek paper describing how weather 
patterns in some South American country were used to accurately predict 
fluctuations in the NYSE, but this morning I wasn't able to find it anywhere.
Nevertheless, you might be able to locate it. It was quite good.

Stork population and Human birth rate in Europe

Pointed out to me by Sean O'Riordain:
I seem to remember a truely wonderful correlation done between the declining 
stork population in Europe and the declining birth rate in these countries :-)
Spencer Graves posted on r-help:
George Box presented a beautiful plot that showed a high correlation 
between stork sightings and the number of human babies born year by year 
for several years in Oldenberg (Germany, I think).  I believe it appears 
early in Box, Hunter and Hunter (1978) Statistics for Experimenters 
(Wiley), but I don't have the book in my hand, so I can't confirm that 
or give a page reference. 
Sent to me by Patrick Connolly, in reply to a poster I could not identify:
|> Hi all:
|>   
|> And then there's the classic case of higher frequency of cases with
|> coronary heart disease in Arizona than NYC.  Does that support the
|> hypothesis that cleaner air causes CHD or that those with CHD choose
|> Arizona for its cleaner air?

Before that came the relationship between the number of births in 19th
century Sweden which was 'evidently' caused by the increase in the
number of storks (which find human created environments convenient and
proliferated as the human population increased).

Looks like you can have some fun with this one.  I had a quick look
for a stork dataset and found this page
http://horizon.unc.edu/courses/292/schedule.asp

I found this reference: Steffen Wirth (2003) King Kong, Storks and Birth Rates. Teaching Statistics, 25(1):29-32. Looks interesting.

More information on the stork example was kindly sent to me by Kevin Wright:

A link to a listserv here, with the following post:

Date:         Fri, 26 Nov 1999 12:15:14 +0000
Sender:       "National Birding Hotline Cooperative (Chat Line)"
From:         "R. Allan Reese"
Subject:      Storks and data - reference found

The storks and babies correlation is based on annual counts made 1930-36
in Oldenburg, Germany. The data are discussed in

Glass, GV (1984) Statistical Methods in Education and Psychology
                 Prentice-Hall pp104-105
who quotes the data from
Box, GEP; Hunter WG & Hunter JS (1978) Statistics for Experimenters
                 John Wiley pp8-9
who in turn credit the data to
Ornithologische Monatsberichte, 44 No 2, Jahrgang, 1936, Berlin
         "                    , 48 No 1, Jahrgang, 1940, Berlin
Statistiches Jahrbuch Deutscher Gemeinden, 27-33, Jahrgang, 1932-1938,
Gustav Fischer, Jena.

The spurious correlation is ascribed to the increase in storks and
humans over the years.

R. Allan Reese
Associate Manager                    Direct voice:   +44 1482 466845
Graduate Research Institute          Voice messages: +44 1482 466844
Hull University, Hull HU6 7RX, UK.   Fax:            +44 1482 466846
====================================================================

A link to a practical with this (Oldenburg, Germany) dataset:

Fertility_Enhancer	People 
140	55.5 
148	55.5 
175	64.9 
195	67.0 
245	69.0 
250	72.0 
250	75.5 
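
The strength of the association in this small table can be computed directly. A minimal sketch: the seven rows are copied from the table above, the data frame name storks is introduced here for illustration, and the reading of Fertility_Enhancer as a stork count and People as a population size is an assumption based on the post quoted earlier.

```r
# The seven rows of the table above
storks <- data.frame(
  Fertility_Enhancer = c(140, 148, 175, 195, 245, 250, 250),
  People             = c(55.5, 55.5, 64.9, 67.0, 69.0, 72.0, 75.5)
)

# Pearson correlation between the two columns
r <- cor(storks$Fertility_Enhancer, storks$People)
print(r)
```

The coefficient is well above 0.9, which of course says nothing about storks delivering babies: as the quoted post notes, both quantities simply increased over the years.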

Together with Sylvain Mousset, I have also recently used a stork example (with actual data) for an exam (in French).

Milk production and SP500 returns

Posted by Dirk Eddelbuettel on R-Help:

IIRC someone did a paper a few ago correlating "everything" in a OECD or
World Bank database of macroeconomic variables... and found something like
milk production in Indonesia to be a perfect predictor for subsequent SP500
returns.

The point was, of course, to show the fallacy in such 'data mining'.  I wish
I had the original reference.
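
The fallacy described in this post can be reproduced with purely random numbers. A minimal sketch, assuming 1000 unrelated candidate series of 20 observations each (both numbers are arbitrary choices made here, not taken from the paper mentioned in the post):

```r
set.seed(42)
n_obs  <- 20    # length of each series (arbitrary)
n_vars <- 1000  # number of candidate predictors (arbitrary)

# A random "target" and many random "predictors", all independent
target     <- rnorm(n_obs)
candidates <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

# Correlation of every candidate column with the target
r <- as.vector(cor(candidates, target))

# The best candidate found by dredging through all of them
best      <- which.max(abs(r))
max_abs_r <- abs(r[best])
print(c(best = best, max_abs_r = max_abs_r, typical = median(abs(r))))
```

Because the best of many candidates is selected after the fact, its correlation with the target can look impressive even though every series is pure noise.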

Contact

Any idea for one more dataset? Please contact Jean Lobry.
