I realize this is subjective, but I thought it would be nice to talk about our favorite datasets and what we think makes them interesting. There is a wealth of data out there, and what with all of the APIs (e.g., Datamob) along with classic datasets (e.g., R data), I think this could have some very interesting responses.
For example, I have always liked datasets like the “Boston Housing” dataset (unfortunate implications notwithstanding) and “mtcars” for their versatility. From a pedagogical standpoint, one can show the merits of a wide variety of statistical techniques using them; and Anderson/Fisher’s iris dataset will always have a place in my heart.
The low birth weight study
This is one of the datasets in Hosmer and Lemeshow’s textbook on Applied Logistic Regression (2000, Wiley, 2nd ed.). The goal of this prospective study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2,500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
It is available in R as `data(birthwt, package="MASS")` or in Stata with `webuse lbw`. A text version appears here: lowbwt.dat (description). Of note, there are several versions of this dataset because it was extended to a case-control study (1-1 or 1-3, matched on age), as illustrated by Hosmer and Lemeshow in ALR chapter 7.
I used to teach introductory courses based on this dataset for the following reasons:
- It is interesting from a historical and epidemiological perspective (data were collected in 1986); no prior background in medicine or statistics is required to understand the main ideas and what questions can be asked of that study.
- Several variables of mixed types (continuous, ordinal, and nominal) are available, which makes it easy to present basic association tests (t-test, ANOVA, χ²-test for two-way tables, odds ratio, Cochran-Armitage trend test, etc.). Moreover, birth weight is available as a continuous measure as well as a binary indicator (above or below 2.5 kg): we can start by building simple linear models, followed by multiple regression (with predictors of interest selected from prior exploratory analysis), and then switch to a GLM (logistic regression), possibly discussing the choice of a cutoff.
- It allows one to discuss different modeling perspectives (explanatory or predictive approaches), and the implications of the sampling scheme when developing models (stratification/matched cases).
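The progression from linear models to logistic regression mentioned in the list above might look like the following sketch. The variable names are those of `MASS::birthwt`; the choice of predictors (`lwt`, `age`, `smoke`) is only an illustrative assumption, not a recommended model.

```r
# Load the low birth weight data shipped with the MASS package
data(birthwt, package = "MASS")

# Simple linear model: birth weight (grams) on mother's weight (pounds)
fit_simple <- lm(bwt ~ lwt, data = birthwt)

# Multiple regression; these predictors are picked for illustration only
fit_multi <- lm(bwt ~ lwt + age + smoke, data = birthwt)

# Logistic regression on the binary indicator of low birth weight
fit_logit <- glm(low ~ lwt + age + smoke, data = birthwt, family = binomial)

# Odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(fit_logit), confint.default(fit_logit)))

# Cross-tabulate the 2.5 kg cutoff against the stored indicator
table(below_2500 = birthwt$bwt < 2500, low = birthwt$low)
```

Fitting the same predictors in both `fit_multi` and `fit_logit` makes it easy to compare what is gained or lost by dichotomizing the outcome.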
Other points can be emphasized, depending on the audience and its level of expertise with statistical software or with statistics in general:
In the dataset available in R, categorical predictors are scored as integers (e.g., for mother’s ethnicity we have ‘1’ = white, ‘2’ = black, ‘3’ = other), so the natural ordering of some predictors (e.g., number of previous premature labors or number of physician visits) and explicit labels are simply absent (it is always a good idea to use ‘yes’/‘no’ instead of 1/0 for binary variables, even if that doesn’t change anything in the design matrix!). As such, it is easy to discuss what issues may be raised by ignoring levels or units of measurement in data analysis.
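A minimal sketch of the recoding discussed above, using base R only. The label strings and the name `bw` are my own choices; the integer codes for `race` follow the scoring given in the text.

```r
data(birthwt, package = "MASS")

# Replace integer codes with explicit labels
bw <- within(birthwt, {
  race  <- factor(race,  levels = 1:3, labels = c("white", "black", "other"))
  smoke <- factor(smoke, levels = 0:1, labels = c("no", "yes"))
  low   <- factor(low,   levels = 0:1, labels = c("no", "yes"))
})

# Labelled two-way table instead of anonymous 1/2/3 by 0/1 cells
table(race = bw$race, low = bw$low)

# A numeric 0/1 variable and a two-level factor give the same dummy
# column under R's default treatment contrasts
x_num <- model.matrix(~ smoke, data = birthwt)[, 2]
x_fac <- model.matrix(~ smoke, data = bw)[, 2]
all(x_num == x_fac)
```

Leaving `race` as an integer in a model formula would instead fit a single slope across the codes 1, 2, 3, which is a convenient way to show students why the coding matters.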
Variables of mixed types are interesting when it comes to exploratory analysis and to discussing what kinds of graphical displays are appropriate for summarizing univariate, bivariate, or trivariate relationships. Likewise, producing nice summary tables, and more generally reporting, is another interesting aspect of this dataset (but the `Hmisc::summary.formula` command makes it so easy under R).
Hosmer and Lemeshow reported that the actual data were modified to protect subject confidentiality (p. 25). It might be interesting to discuss data confidentiality issues, as was done in one of our earlier Journal Clubs (see its transcript). (I must admit I never go into much detail with that.)
It is easy to introduce some missing or erroneous values (which are common issues in the real life of a statistician), which leads to a discussion of (a) their detection through a codebook (`codebook`) or exploratory graphics (always plot your data first!), and (b) possible remedies (data imputation, listwise deletion or pairwise measures of association, etc.).
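As a rough sketch of that exercise in base R: the seed, the number of missing values, and the implausible age below are arbitrary choices of mine, made only to have something to detect.

```r
data(birthwt, package = "MASS")

set.seed(101)                        # arbitrary seed, for reproducibility
bw_na <- birthwt
idx <- sample(nrow(bw_na), 10)       # 10 distinct rows picked at random
bw_na$lwt[idx] <- NA                 # introduce missing weights
bw_na$age[1] <- 450                  # hypothetical data-entry error

# Detection: count missing values and inspect ranges
colSums(is.na(bw_na))
summary(bw_na[, c("age", "lwt", "bwt")])

# Remedy 1: listwise deletion drops every row with a missing value
complete <- na.omit(bw_na)
nrow(complete)

# Remedy 2: pairwise measures of association use all available pairs
cor(bw_na[, c("age", "lwt", "bwt")], use = "pairwise.complete.obs")
```

The `summary()` output flags both problems at once: an `NA's` count for `lwt` and a maximum age that cannot be right.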