Truth or Truthiness: Distinguishing Fact from Fiction by Learning to Think Like a Data Scientist

Truth or Truthiness
Teacher tenure is a problem. Teacher tenure is a solution. Fracking is safe.
Fracking causes earthquakes. Our kids are overtested. Our kids are not tested
enough. We read claims like these in the newspaper every day, often with no
justification other than “it feels right.” How can we figure out what is true?
Escaping from the clutches of truthiness begins with one simple question:
“what’s the evidence?” With his usual verve and flair, Howard Wainer shows
how to cut through claims built on nonsense and outright deception. Using the
tools of causal inference he evaluates the evidence, or lack thereof, supporting
claims in many fields. This wise book is a must-read for anyone who’s ever
wanted to challenge the pronouncements of authority figures, and it offers a
lucid and captivating narrative that entertains and educates at the same time.
Howard Wainer is a Distinguished Research Scientist at the National Board
of Medical Examiners who has published more than four hundred scholarly
articles and books, including Medical Illuminations: Using Evidence,
Visualization and Statistical Thinking to Improve Healthcare.
Robert Weber, The New Yorker Collection/Cartoon Bank, reproduced with permission.
Distinguishing Fact from Fiction
by Learning to Think Like a Data Scientist
National Board of Medical Examiners
32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit
of education, learning, and research at the highest international levels of excellence.
Information on this title:
© Howard Wainer 2016
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2016
Printed in the United States of America
A catalog record for this publication is available from the British Library
ISBN 978-1-107-13057-9 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs
for external or third-party Internet Web sites referred to in this publication and does not
guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Sam & Jennifer
Laurent, Lyn & Koa
Annotated Table of Contents
Preface and Acknowledgments
Section I
Thinking Like a Data Scientist
How the Rule of 72 Can Provide Guidance to Advance Your
Wealth, Your Career, and Your Gas Mileage
Exponential growth is notoriously difficult to comprehend. In this
chapter we illustrate this with several examples drawn from history
and current experience. Then we introduce a simple rule of thumb,
often used to help financial planners tame the cognitive load of
exponential growth, and show how it can be used more widely to
help explain a broad range of other issues. The Rule of 72 illustrates
the power of having such “rules” in your toolbox for use as the need arises.
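The rule itself is simple enough to check in a few lines. The sketch below (my own illustration; the function names are invented for the example) compares the Rule of 72’s answer with the exact doubling time under compound growth:

```python
import math

def doubling_time_exact(rate_percent):
    """Exact number of periods to double under compound growth at rate_percent per period."""
    return math.log(2) / math.log(1 + rate_percent / 100)

def doubling_time_rule_of_72(rate_percent):
    """The rule-of-thumb estimate: 72 divided by the growth rate."""
    return 72 / rate_percent

# Compare the rule against the exact answer across a range of growth rates.
for r in (2, 6, 9, 12):
    print(f"{r}%: rule = {doubling_time_rule_of_72(r):.1f}, exact = {doubling_time_exact(r):.1f}")
```

At 6 percent the rule says 12 years while the exact answer is 11.9; the approximation stays within a few months of the truth across the rates a financial planner is likely to meet, which is what makes it such a handy mental tool.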
Piano Virtuosos and the Four-Minute Mile
The frequency of truly extreme observations and the size of the
sample of observations being considered are inexorably related.
Over the last century the number of musical virtuosos has grown
enormously; today there are many performers who can master pieces
that would have daunted all but the most talented artists in the past.
In this chapter we find that a simple mathematical model explains
this result as well as why a runner breaking the four-minute barrier
in the mile has ceased to be newsworthy.
Happiness and Causal Inference
Here we are introduced to Rubin’s Model for Causal Inference,
which directs us to focus on measuring the e\tect of a cause
rather than chasing a chimera by trying to nd the cause of
an e\tect. is reorientation leads us naturally to the random
assignment-controlled experiment as a key companion to the
laying out how we can untie the Gordian knot that entangles
happiness and performance. It also provides a powerful light
that can be used to illuminate the dark corners of baseless
Causal Inference and Death
The path toward estimating the size of causal effects does not
always run smoothly in the real world because of the ubiquitous
nuisance of missing data. In this chapter we examine the
very practical situation in which unexpected events occur
that unbalance carefully constructed experiments. We use a
medical example in which some patients die inconveniently
mid-experiment, and we must try to estimate a causal effect
despite this disturbance. Once again Rubin’s Model guides
us to a solution, which is unexpectedly both subtle and obvious.
Using Experiments to Answer Four Vexing Questions
Public education is a rich field for the application of rigorous
experiments, yet truthiness manifests itself widely within the
discussions surrounding public education, the efficacy of which is
often measured with tests. Thus it is not surprising that many topics
arise in which fervor on both sides of the question overwhelms facts.
We examine four questions that either have already been decided in
courts (but not decisively) or were on the way to court as this
chapter was being written.
Causal Inferences from Observational Studies: Fracking,
Injection Wells, Earthquakes, and Oklahoma
It is not always practical to perform an experiment, and we must
make do with an observational study. Over the past six years the
number of serious earthquakes in Oklahoma (magnitude 3.0 or
more) has increased from fewer than two a year to almost two a day.
In this chapter we explore how we can use an observational study
to estimate the size of the causal effect of fracking and the disposal
of wastewater through its high-pressure injection into the earth on
seismicity. The evidence for such a connection is overwhelming
despite denials from state officials and representatives of the oil and
gas industry.
Life Follows Art: Gaming the Missing Data Algorithm
A compelling argument can be made that the biggest problem
faced by data scientists is what to do about observations that
are missing (missing data). In this chapter we learn what happened
when one organization tried to improperly game the system. It also
illustrates what may be the most effective way to deal with such
shenanigans.
Section II
Communicating Like a Data Scientist
On the Crucial Role of Empathy in the Design of Communications
Graphical display is perhaps the most important tool data science
possesses for allowing the data to communicate their meaning to
the data scientist. Graphs are also unsurpassed in allowing the
scientist to communicate to everyone else as well. By far, the most
crucial attitude that anyone wishing to communicate effectively can
have is a strong sense of empathy. In this chapter we discuss two
different communications and show how the lessons learned apply
to the reporting of tests for mutations that increase a woman’s
likelihood of cancer.
Improving Data Displays: The Media’s and Ours
In an example where influence flows in both directions, we see how
advances in graphical display pioneered in the scientific literature
were adopted by the mass media, and how advances in the media
have unfortunately been slow to catch on among scientists.
Inside Out Plots
One of the greatest design challenges faced in visually displaying
high-dimensional data (data involving more than just two variables)
is the limitations of a two-dimensional plotting medium (a piece of
paper or a computer screen). In this chapter we illustrate how to use
inside out plots to reveal many of the stories such data can tell. The
example chosen compares the performances of six baseball all-stars
on eight variables.
A Century and a Half of Moral Statistics: Plotting
Evidence to Affect Social Policy
Data that have a geographic component (like election outcomes
by state or population per census tract) cry out for a map. Maps
are the oldest of graphical displays, with surviving examples from
Nilotic surveyors in ancient Egypt and even older Chinese maps
that long predate the Xia dynasty. The earliest maps used the two
dimensions of the plotting plane to represent the geographic
information. Only much later were other, nongeographic variables
added on top of the geographic background. In this chapter we are
introduced to the work of a nineteenth-century lawyer and
statistician, who plotted such moral statistics as the prevalence of
ignorance, bastardy, crime, and improvident marriages on maps of
England and Wales. The discussion considers how his goal of
increased social justice might have been aided through the use of
such displays.
Section III
Applying the Tools of Data Science
to Education
Public education touches everyone. We all pay for it through
our local property taxes, and almost all of us have partaken of its
services. It is hard to think of any other such broad-based activity
that is so riddled with misconceptions borne of truthiness. In this
section we will examine five different areas in which public opinion
has been shaped by claims made from anecdotes and not from
evidence. Each chapter describes a claim and then presents widely
available evidence that clearly refutes it. This section is meant as a
demonstration of how the tools developed in Sections I and II can
reinforce an attitude of skepticism while providing an evidence-based
approach to assessing the likelihood of the claims being credible.
Waiting for Achilles
The educational system of the United States is regularly
excoriated for the poor performance of its students and for the gap
in performance between black and white students. In this chapter
we use evidence to clarify both of these issues and, in so doing,
discover that the situation is not anywhere near as bleak as
truthiness-driven critics would have us believe.
How Much Is Tenure Worth?
Critics of public education often lay a considerable portion of the
blame for its shortcomings at the door of teacher tenure. In this
chapter we trace the origins of tenure and provide evidence that
eliminating it may be far more expensive and less effective than its
critics imagine.
Whenever tests have important consequences there is always the
possibility of cheating. To limit cheating, student performances are
often scrutinized for suspicious irregularities. In this chapter we
describe two instances in which the fervor of the investigation
outstripped the evidence supporting the claim of cheating.
When Nothing Is Not Zero: A True Saga of Missing Data,
Adequate Yearly Progress, and a Memphis Charter School
Schools are often judged by the test scores earned by their students.
In this chapter we learn of a charter school in Memphis that was
placed on probation because the average scores of its students were
too low. Unfortunately this apparent deficiency was not due to the
school but rather to the city’s treatment of missing data.
Musing about Changes in the SAT: Is the College Board
Modern college entrance exams in the United States have existed
for about a century, and over that time changes to their content,
their scoring, and their use have been made steadily. In this chapter
we use evidence and statistical thinking to focus our discussion of
three changes to the SAT that have been recently announced by the
College Board. Two of them are likely to have almost no effect, but
the third is a major change. I hypothesize why these particular
changes were selected and conclude that the College Board might
have been guided by the strategy developed by Dartmouth’s former
president John Kemeny when he led that institution.
For Want of a Nail:Why Worthless Subscores May Be
Seriously Impeding the Progress of Western Civilization
In the 2010 U.S. Census, it cost about forty dollars for each person
counted. This seems like an extravagant amount, because changes in
the population of the United States can be accurately estimated
simply by noting the steady rate at which it grows; what justifies the
expense is the small-area estimates that Census data provide. In this
chapter we argue, by analogy, that the opportunity costs of too-long
tests are likely extensive enough that they may be impeding progress
in a serious way.
Section IV
Try This at Home
Preface and Acknowledgments
There have been many remarkable changes in the world over the last
century, but few have surprised me as much as the transformation in
public attitude toward my chosen profession, statistics – the science of
uncertainty. Throughout most of my life “boring” was the most
common adjective associated with the noun “statistics.” In the statistics
courses I taught, the most prevalent reason that students gave for why
they were taking the course was “it’s required.” This dreary reputation
nevertheless gave rise to some small pleasures. Whenever I found myself
on a plane, happily involved with a book, and my seatmate inquired,
“What do you do?” I could reply, “I’m a statistician,” and confidently
expect the conversation to end there.
This attitude began to change among professional scientists decades ago
as the realization grew that statisticians were the scientific generalists of
our age. As John Tukey, who himself came to statistics from
mathematics, so memorably put it, “as a statistician, I can play in
everyone’s backyard.”
Statistics, as a discipline, grew out of the murk of applied probability as
practiced in gambling dens to wide applicability in demography,
agriculture, and the social sciences. But that was only the beginning. The
rise of empirical methods meant that all the sciences needed to
understand uncertainty. The health professions joined in, and
Evidence-Based Medicine became a proper noun. Prediction models
were built to forecast election outcomes. Economics and finance were
transformed as “quants” joined the investment teams, and their success
made it clear that you ignore statistical rigor in devising investment
schemes at your own peril.
These triumphs, as broad and wide ranging as they were, still did not
capture the public attention until Nate Silver showed up and started
predicting the outcomes of sporting events with uncanny accuracy. His
success at this gave him an attentive audience for his early predictions
of the outcomes of elections. Talking heads and pundits would opine,
using their years of experience and deeply held beliefs, but anyone who
truly cared about what would happen went to FiveThirtyEight, Silver’s
website, for the unvarnished truth.
After Nate Silver my life was not the same. The response to my
declaration about being a statistician became “Really? That’s way cool!”
The serenity of long-distance air travel was lost.
As surprising as this shift in attitudes has been, it is still more amazing
to me how resistant so many are to accepting evidence as a principal
basis for their beliefs. I can think of three possible reasons: (1) an
excessive dimness of mind that prevents connecting the dots of evidence
to yield a clear picture of the likely outcome; (2) a self-interest that no
amount of evidence can overcome; and (3) a belief in one’s own
invincibility.
The first reason is one of my principal motivations in writing this book.
The other was my own enthusiasm for this material and how much
I want to share its beauty with others.
The second reason was reflected in Upton Sinclair’s observation, “It is
difficult to get a man to understand something, when his salary depends
upon his not understanding it!” We have seen how senators from
coal-producing states are against clean air regulations; how the National
Rifle Association believes, despite all evidence, that more guns will lower
the homicide rate; and how purveyors of coastal real estate believe that
rising seas accompanying global warming are a pernicious myth.
The third reason is a late addition to my list, and it would be unfair
if the observed behavior could be explained by reason two. But I was
forced to include it when, on Thursday, February 26, 2015, Senator Jim
Inhofe (Republican from Oklahoma, who is Chairman of the Senate
Environment and Public Works Committee) brought a snowball onto
the floor of the Senate as proof that reactions to evidence about global
warming are hysterical, and that the report that 2014 was the warmest
year on record was anomalous. What could explain Senator Inhofe’s
statements? It could be (1), but as a senator he has been privy to endless
discussions by experts with exquisite pedigrees and credentials, which
anyone with any wit would be forced to acknowledge as credible. It
could be (2) if, for example, his state’s economy depends on fossil fuels
and its future would be grimmer if the nation were to take seriously the
role that the burning of such fuels has on global warming. I note that
three of the five billionaires in the state of Oklahoma (Harold Hamm,
George Kaiser, and Lynn Schusterman) owe their fortunes to the oil and
gas industry. That being the case, it isn’t surprising that Senator Inhofe
might owe some allegiance to the welfare of their financial interests.
What makes him a possible candidate for the third category is his
apparent belief that his argument would burnish his national reputation
rather than making him a punchline on newscasts and late-night TV.
I am reminded of Voltaire’s prayer, “Dear God, make my enemies
ridiculous.” He knew that politicians can endure anything but the sort of
ridicule that renders them a joke. That Senator Inhofe would purposely
place himself into such a position suggests including him in category (3).
Senator Inhofe is not alone on such a list. I would probably also want to
include then-senator (now governor) Sam Brownback (Republican from
Kansas), former governor Mike Huckabee (Republican from Arkansas),
and Representative Tom Tancredo (Republican from Colorado), who in
a 2007 presidential debate all raised their hands to indicate their lack of
belief in evolution. It isn’t hard to find other possible candidates.
Michele Bachmann, a six-term congresswoman from Minnesota, comes
immediately to mind for her avid support of the teaching of creationism in
schools, whose appeal to fundamentalist members of her constituency would
seem to place her in category (2). But her appearance on “most corrupt” lists
(being under investigation by the Federal Election Commission, House
Ethics Committee, and Federal Bureau of Investigation for violating
campaign finance laws while running for president by improperly paying staff
from her leadership Political Action Committee and using her campaign
resources to promote her memoir) suggests she holds a belief in her own
invincibility that makes (3) a more likely explanation.
I thoroughly understand that no presentation, no matter how lucid,
no matter how compelling, can have a direct effect on diminishing either
(2) or (3). However, I have hopes that some indirect help can be had by
improving the statistical literacy of the general population. The greater
the share of people who can recognize truthiness-based arguments, and
hence not be swayed by them, the less effective such arguments will
become. I do not believe that this will result in those people whose
arguments are based on truthiness changing to another approach. My
hopes lie in an educated electorate choosing different people.
Paraphrasing Einstein, “old arguments never die, just the people who
make them.”
I have recently become haunted by lasts. We are usually immediately
aware of the firsts in our lives: our first car, first love, first child. We
typically become aware of lasts only well after the event: the last time
I spoke with my father, the last time I carried my son on my shoulders,
the last time I hiked to the top of a mountain. Usually, at least for me,
the realization arrives too late to do anything about it. Had I known it
was the last time I would ever speak with my grandfather, there are some
things I would have liked to have spoken about. Had I known it was the
last time I would see my mother, I would have told her how much
I loved her.
As you read this, I will be well past the biblically prescribed life span of
three-score and ten. Although this is surely my latest book, it may well
be my last, and so I have been especially careful to thank all of those
who have contributed both to this specific work and to the more
general, and more difficult, task of shaping my thinking.
I begin with my employer, the National Board of Medical Examiners,
which has been my professional home since 2001 and has provided a
supportive environment for my work, due in no small part to
Donald Melnick, its longtime president, whose vision of the organization
included room for scholarship and basic research. My gratitude to him
is profound.
Next, I must thank my colleagues at the National Board, beginning
with Ron Nungester, senior vice president, and Brian Clauser, vice
president, who have always provided support and a thoughtful response
to any questions I might have had – both procedural and substantive.
I also thank the colleagues, among them Chase, Steve Clyman, Monica
Cuddy, Richard Feinberg, and Bob Galbraith, who sat patiently
through my explanations of one obscure thing or another. These
explanations would typically continue until I decided that I had, at long
last, made my point.
Over the course of the past half-century many intellectual debts have
accumulated to friends and colleagues who have taught me a great deal.
I have neither space nor memory enough to include everyone, but with
those limitations in mind, my major benefactors have been: Leona Aiken,
Joe Bernstein, Jacques Bertin, Al Biderman, Darrell Bock, Eric Bradlow,
Henry Braun, Rob Cook, Neil Dorans, John Durso, Steve Fienberg,
Paul Holland, Larry Hubert, Bill Lichten, George Miller, Bob Mislevy,
Malcolm Ree, Dan Robinson, Alex Roche, Tom Saka, Sam Savage, Billy
Skorupski, Ian Spence, Steve Stigler, Edward Tufte, Xiaohui Wang, Lee
Wilkinson, and Mike Zieky.
A very special thanks to David Thissen, my once student, longtime
collaborator, and very dear friend.
I spent three years at Princeton University acquiring my academic union
card. Under ordinary circumstances one would expect that those three
years would not have a very different effect on my life than any number
of other time periods of similar length. But that does not seem to have
been the case. On a regular basis in the forty-seven years since I left her
campus I have been in need of guidance of one sort or another. And
unfailingly, before I could flounder for too long, a former Tiger appeared
and gave me as much assistance as was needed. Those most prominent
in my continued education have been:
John Tukey *39, Fred Mosteller *46, Bert Green *51, Sam Messick *56,
Don Rubin ’65, Jim Ramsay *66, Shelby Haberman ’67, Bill Berg *67,
Linda Steinberg S*68 P07, Charlie Lewis *70, Michael Friendly *70, Dave
Hoaglin *71, Dick DeVeaux ’73, Paul Velleman *75, David Donoho ’79,
Cathy Durso ’83, and Sam Palmer ’07.
What’s the mystery? A quick glance through this list shows that only a
few of these Tigers overlapped with my time on campus; somehow
Princeton looks after her progeny and has arranged to keep doing so.
Whatever the mechanism, they and she have my gratitude.
Last, to the staff at Cambridge University Press, the alchemists who
transformed a pile of manuscript pages into the handsome volume you
hold in your hand now. Primus inter pares was my editor, Lauren
Cowles, who both saw the value in what I was doing and insisted that
I continue rewriting and revising until the result lived up to the initial
promise that she had divined. She has my sincerest thanks. In addition,
I am grateful for the skills and effort of copy editor Christine Dunn,
indexer Lin Maria Riotta, and Kanimozhi Ramamurthy and her staff at
Newgen KnowledgeWorks.
The modern method is to count;
The ancient one was to guess.
Samuel Johnson
In the run-up to the 2012 U.S. presidential election we were offered two
very different kinds of outcome predictions. On one side were partisans,
usually Republicans, telling us about the imminent defeat of President
Obama. They based their prognostication on experience, inside
information from “experts,” and talking heads from Fox News. On the
other side were “the Quants,” represented most visibly by Nate Silver,
whose predictions were based on a broad range of polls, historical data,
and statistical models. The one side offered anecdotes and feigned
fervor, which amplified the deeply held beliefs of their colleagues. The
other side relied largely on the stark beauty of unadorned facts.
Augmenting their bona fides was a history of success in predicting the
outcomes of previous elections, and, perhaps even more convincing, a
record of success in predicting the outcome of a broad range of sporting
events.
It would be easy to say that the apparent supporters of an
anecdote-based approach to political prediction didn’t really believe their
own claims. And perhaps that cynical conclusion was often true. But how
then do we explain the donors who poured real money into what was
almost surely a rat hole of failure? And what about Mitt Romney, a man
of uncommon intelligence, who appeared to believe that in January
2013 he was going to be moving into the White House? Perhaps, deep
in his pragmatic and quantitative soul, he knew that the presidency was
not his destiny, but I don’t think so. I believe that he succumbed to that
most natural of human tendencies, the triumph of hope over evidence.
We need not reach into the antics of America’s right wing to find
examples of humanity’s frequent preference for magical thinking over
empiricism; it is widespread. Renée Haynes (1906–94), a writer and
historian, introduced the useful concept of a boggle threshold: “the level
at which the mind boggles when faced with some new idea.” The
renowned Stanford anthropologist Tanya Luhrmann illustrates the
boggle threshold with a number of examples (e.g., “A god who has a
human son whom he allows to be killed is natural; a god with eight arms
and a lusty sexual appetite is weird”). She describes it, in her evocative
phrase, as the place “where reason ends and faith begins.”
The goal of this book is to provide an illustrated toolkit to allow us
to identify that line – that place beyond which evidence and reason have
been abandoned – so that we can act sensibly in the face of noisy claims
that lie beyond the boggle threshold.
The tools that I shall offer are drawn from the field of data science.
The character of the support for claims made to the right of the boggle
threshold we will call their “truthiness.”
Data science is the study of the generalizable extraction of
knowledge from data.
Peter Naur, 1960
Truthiness is a quality characterizing a “truth” that a person
making an argument or assertion claims to know intuitively
“from the gut” or because it “feels right,” without regard to
evidence, logic, intellectual examination, or facts.
Stephen Colbert, October 17, 2005
Data science is a relatively new term, expanded on by statisticians Jeff
Wu (in 1997) and Bill Cleveland (in 2001). They characterized data
science as an extension of the science of statistics to include
multidisciplinary investigations, models and methods, computing,
pedagogy, tool evaluation, and theory. The modern conception is a
complex mixture of ideas and methods drawn from many fields:
computing, mathematics, probability models, machine learning,
statistical learning, computer programming, data engineering, pattern
recognition and learning, visualization, uncertainty modeling, data
warehousing, and high-performance computing. It sounds complicated,
and so any attempt at even a partial mastery seems exhausting. And,
indeed, it is, but just as one needn’t master solid-state physics to
successfully operate a TV, so too one can, by understanding some basic
principles of data science, be able to think like an expert and so
recognize claims that are made without evidence, and by doing so banish
them from any place of influence. The core of data science is, in fact,
science, and its insistence on evidence that is replicable provides its
very soul.
This book is meant as a primer on thinking like a data scientist. It is a
series of loosely related case studies in which the principles of data
science are exemplified. There are only a few such principles illustrated,
but it has been my experience that these few can carry you a long way.
Truthiness, although a new word, is a very old concept that has long
predated science. It is so well inculcated in the human psyche that trying
to banish it is surely a task of insuperable difficulty. The best we can
hope for is to recognize that the core of truthiness’s origins lies in the
reptilian portion of our brains, and to curb it through the practice of
logical thinking.
Escaping from the clutches of truthiness begins with one simple
question. When a claim is made, the first question that we ought to ask
ourselves is “how can anyone know this?” And, if the answer isn’t
obvious, we must ask the person who made the claim, “what evidence do
you have to support it?”
It is beyond my immediate goals to discuss what sorts of evolutionary
pressures must have existed to establish and preserve truthiness. For such
an in-depth look there is no better place to begin than Daniel Kahneman’s
Thinking, Fast and Slow.
Talking with your baby will speed her development.
Having your child repeat kindergarten would be a good idea.
Sex with uncircumcised men is a cause of cervical cancer in women.
There are about one thousand fish in that pond.
Claims like these deserve scrutiny, for they only make sense if you say
them fast.
Consider the first claim. Ask yourself how anyone could know enough to
make such a claim, and then try to imagine how close, in the real world,
we could come to observing the same child both having and never having
had the treatment. In this situation we must compare the child’s
development after having regular conversations with its mother with how
it would have developed had there been only silence. Obviously the same
child cannot experience both conditions, so measuring the effect of an
action by comparing its outcome with that of a counterfactual isn’t
directly possible. Instead we must rely on inferences based on averages
within groups, in which we have one group that received the treatment
(the treatment group) and another group in which the alternative was
tried (the comparison group). If the two groups were formed through a
random process, it becomes plausible to believe that what was observed
in the comparison (control) group would have been observed in the
treatment group had that group had the control condition.
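The logic of that last sentence can be seen in a small simulation (my own illustration, not an example from the book): give every unit two hypothetical potential outcomes, reveal only one according to a random assignment, and check that the simple difference in group means recovers the true effect.

```python
import random
import statistics

random.seed(42)
n = 10_000
true_effect = 2.0

# Hypothetical potential outcomes: what each unit WOULD score under each condition.
y_control = [random.gauss(10, 3) for _ in range(n)]
y_treated = [y + true_effect for y in y_control]

# Randomize: half the units get the treatment; we observe only one outcome each.
treated_ids = set(random.sample(range(n), n // 2))
obs_treated = [y_treated[i] for i in range(n) if i in treated_ids]
obs_control = [y_control[i] for i in range(n) if i not in treated_ids]

# The control group's average stands in for the treated group's counterfactual.
estimate = statistics.mean(obs_treated) - statistics.mean(obs_control)
print(round(estimate, 2))  # should land near the true effect of 2.0
```

No individual’s counterfactual is ever observed, yet because assignment was random the group averages make the comparison legitimate; that is the whole trick of the randomized experiment.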
Next, what is the treatment? How much time must be spent conversing
with the child? Must it be real speech, or is just cooing permissible? And
what is the alternative condition? Is it total silence, or merely ordinary
household talk?
The tendency of stupid ideas to seem smarter when they come at you
quickly is known in some quarters as the “Dopeler Effect.”
Does the choice of language matter? What about the correctness of
syntax and grammar?
And finally, we need a dependent variable. What is meant by “the
child’s development”? Is it their final adult height? Or is it the speed with
which they acquire language? Their general happiness? What? And how
do we measure each child so as to be able to make the comparison? And
when? Is it at birth? At age one? Five? Twenty?
It seems sensible when confronted with claims like this to ask at least
some of these questions. The answers will allow you to classify the claim
as based on evidence or on truthiness.
Claim 2: Repeat Kindergarten
The same issues that arise in assessing the evidentiary support for the
efficacy of talking to children arise here. What would the child have
done if not held back? What are the dependent variables that reflect the
success of the intervention? Could there ever have been an experiment in
which some randomly chosen children were held back and others not?
And if this unlikely scenario had actually been followed, how was success
measured? Obviously children held back would be taller and older than
those who progressed normally. Perhaps such children would be happier
if their progress is delayed. Are they reading better?
It isn’t hard to construct a credible theory to support repeating a
grade: if a child can’t add integers, it makes little sense to move them
forward into a class where such skill is assumed. But such decisions are
rarely so cut and dried. It is more likely a quantitative decision: “Is this
child’s skill too low to be able to manage at the next level?” This is
knowable, but it requires the gathering of evidence. We might display
the results of such a study as a graph in which a child’s math score in
kindergarten is plotted on the horizontal axis and her math score in
first grade on the vertical axis. Such a plot shows the relation between
performances in the two grades, but it does not tell us about the efficacy
of repeating kindergarten. For that we need to know the counterfactual
event of what the child’s score would have been had she repeated
kindergarten. We would need to know how scores compared the first
time taking the test with the second time, that is, how she did in first
grade after repeating and how she would have done in first grade had
she not repeated.
Again, it is possible to construct such an experiment, based on average
group performance and random assignment, but the likelihood that any
such experiment has ever been performed is small.
Try to imagine the response to your asking what sort of evidence was
used to support a teacher’s recommendation that your child should
repeat kindergarten. The response would be overflowing with truthiness
and rich with phrases like “in my experience” or “I deeply feel.”
Claim 3: Male Circumcision as a Cause of Cervical Cancer
This example was brought to my attention by a student in STAT 112 at
the University of Pennsylvania. Each student was asked to find a claim in
the popular media and design a study that would produce the necessary
evidence to support that claim. Then they were to try to guess what data
were actually gathered and judge how close those were to what would be
required for a credible conclusion.
The student recognized that the decision to have a baby boy
circumcised was likely related to social variables that might have some
connection with cervical cancer. To eliminate this possibility, she felt
that a sensible experiment that controlled for an unseen connection
would need to randomly assign boys to be circumcised or not. She also
recognized that women’s choice of sex partner might have some
connection with those same variables, so the pairing of men and women
should also be done at random. Once such a design was carried out,
there would be nothing more to do than to keep track of all of the
women in the study for thirty or forty years and count up the frequency
of cervical cancer on the basis of the circumcision status of their sex
partner. Of course, they would need to keep the same partner for all that
time, or we would not have an unambiguous connection to the
treatment.
Last, she noted that in the United States about twelve thousand
women a year are diagnosed with cervical cancer (out of about 155
million women), or about one case for each thirteen thousand women.
So the study would probably need at least half a million women in each
of the two groups.
Once she had prepared this list of desiderata, she realized that such
an experiment was almost certainly never done. Instead, she guessed that
someone asked a bunch of women with cervical cancer about the status of
their companions and found an overabundance of uncircumcised men.
is led her to conclude that the possibilities of alternative explanations
were su\bciently numerous and likely to allow her to dismiss theclaim.
But must we always dismiss claims that lack such idealized data collection? In situations where the full experiment is too difficult to perform, there are a number of alternatives, like a case-control study, that could provide some of the credibility of a full randomized experiment with a vastly more practical format.
Modern science is a complex edifice built on techniques that may not be obvious or even understandable to a layperson. How are we to know what to believe?
Claim 4: Counting Fish in a Pond
"There are about one thousand fish in that pond." How could anyone know that? Did they drain the pond and count them? That sounds unlikely. And so, we may doubt the accuracy of the estimate. Though it is important to maintain a healthy skepticism, it is sensible to ask the person making the claim of one thousand fish what supporting evidence she might have. Had we done so, she might have explained her method, which requires clarification. And so she expands, "Last week we came here and caught 100 fish, tagged them, and threw them back. We allowed a week to pass so that the tagged fish could mix in with the others, and then we caught a second batch, of which 10 percent were tagged. The calculation is simple: 10% of the fish we caught were tagged, and we know that in total, 100 were tagged. Therefore there must be about 1,000 fish in the pond."
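Her calculation is the classic capture-recapture (Lincoln-Petersen) estimate. A minimal sketch; the second catch of exactly 100 fish is my assumption, chosen to match the 10 percent figure, and the estimate depends only on the tagged fraction:

```python
# Lincoln-Petersen capture-recapture estimate: if `tagged` fish are marked,
# and a later catch of `caught` fish contains `recaptured` marked ones, the
# population estimate is tagged * caught / recaptured (i.e., the number
# tagged divided by the tagged fraction of the second catch).

def lincoln_petersen(tagged, caught, recaptured):
    """Estimate total population from a single mark-recapture survey."""
    return tagged * caught / recaptured

# The text's numbers: 100 tagged, and 10% of the second catch carried tags.
print(lincoln_petersen(tagged=100, caught=100, recaptured=10))  # -> 1000.0
```

Any second catch that is 10 percent tagged gives the same answer, which is why the fisherman's report needs only the fraction, not the catch size.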
The use of capture-recapture procedures can be traced back at least to 1783, when the famous French polymath Pierre-Simon Laplace used it to estimate the population of France. This approach is widely used for many purposes; one is to estimate the number of illegal aliens in the United States.
The lesson to be learned from these four examples is that skepticism is important, but we must keep an open mind. We should try to imagine experiments that could yield the evidence that would support the claims made. If we can't imagine one that could work, or if whatever we imagine is unlikely to be practical, we should keep our skepticism, but ask for an explanation, based on science not anecdotes, from the person making the claim. The weaker that explanation, the more likely the claim is to be on the truthiness side of the boggle threshold.
This book has three parts:
How to think like a data scientist has, as its centerpiece, a beautiful approach to thinking about claims. In each situation, I illustrate the approach with a real-life claim and its supporting evidence. The questions examined include why there are so many virtuosos in both music and track; how much has fracking in Oklahoma affected the frequency of earthquakes in that state; and even how to evaluate experimental evidence the collection of which has been censored by death.
How data scientists communicate to themselves and others. I begin with some theory about the importance of empathy and effective communication, and then narrow the focus to the communication of quantitative phenomena.
The application of these tools of thinking and communicating to the field of education. Among the topics explored are the surprising trends in student performance over the past few decades, the point of teacher tenure in public schools, and what might have motivated the College Board in 2014 to institute three changes to the SAT.
In each section of this book a series of case studies describe some of the deep ideas of modern data science and how they can be used to help us defeat deception. The world of ideas is often divided into two camps: the practical and the theoretical. Experience has convinced me that nothing is so practical as a good theory. The problems associated with making causal inferences lie at the very core of all aspects of our attempts to understand the world we live in, and so there is really no other way to begin than with a discussion of causal inference. This discussion focuses on a very good theory indeed, one that has come to be called "Rubin's Model for Causal Inference" after the Harvard statistician Donald Rubin, who first laid it out forty years ago.
The opening chapters provide a brief warm-up, so that we can then turn our attention to the rudiments of Rubin's Model and show how it can be used to clarify a vexing chicken-and-egg question. It does this by guiding us to the structure of an experiment, the results of which can resolve the question. I then expand the applicability of Rubin's Model and show how it casts light into dark corners of scientific inquiry in ways that are surprising. We continue on this same tack, using the fundamental ideas of Rubin's Model to help us design experiments that can answer questions that appear, literally, beyond the reach of empirical solution. After this, the story ebbs and flows, but always with conviction borne of facts. I strive to avoid the passionate intensity that always seems to accompany evidence-starved truthiness.
Thinking Like a Data Scientist
An aid in thinking can come from some simple rules of thumb. We start with two examples. In the first, I show how the Rule of 72, long used in finance, can have much wider application. The second examines a puzzle posed by a New York Times music critic: why are there so many piano virtuosos? I unravel it with one simple twist of my statistical wrist. In these two chapters we see (1) that a simple rule can often provide a serviceable approximate answer and (2) that the likelihood of extreme observations increases apace with the size of the sample. This latter idea (that, for example, the tallest person in a group of one hundred is likely not as tall as the tallest in a group of one thousand) can be expressed explicitly with a little mathematics, but it can be understood intuitively without them and so be used to explain phenomena we encounter every day.
I consider the most important contribution to scientific thinking since David Hume to be Donald B. Rubin's Model for Causal Inference. Rubin's Model is the heart of this section and of this book. Although the fundamental ideas of Rubin's Model are easy to state, the deep contemplation its mastery requires changes you. In a very real sense learning this approach to causal inference is closely akin to learning how to swim or how to read. They are difficult tasks both, but once mastered you are changed forever. After learning to read or to swim, it is hard to imagine what it was like not being able to do so. In the same way, once you absorb Rubin's Model your thinking about the world will change. It will make you powerfully skeptical, pointing the way to truly find things out. I illustrate how to use this approach to assess the causal effect of school performance on happiness as well as the opposite: the causal effect happiness has on school performance. I then expand the discussion to show how Rubin's Model helps us deal with the inevitable situation of unanticipated missing data. The example I use is when subjects in an experiment die before it is complete; the conclusions are surprising and subtle.
Fundamental to the power of Rubin's Model is the control available through experimentation. I describe several vexing questions in educational testing, where small but carefully designed and executed experiments can yield unambiguous answers, even though prior observational studies, even those based on "Big Data," have only obscured the vast darkness of the topic.
Sometimes it is impossible to run experiments to answer important causal questions. When this occurs we are forced to do an observational study. We illustrate one compelling instance of this as we try to estimate the size of the causal effect of fracking (and the high-pressure injection of wastewater into disposal wells) on seismic activity. We show how the credibility of our estimates increases the closer we come in our observational study to the true experiment we would have preferred to run.
And finally, we take on what is one of the most profound problems in all practical investigations: missing data. After a discussion of what is generally done about it, I show two situations where the way that missing data were treated has yielded huge errors. In both cases the results were purposely manipulated by taking advantage of the quirks in the missing data algorithm that was chosen. In the first case, some of the manipulators lost their jobs; in the second, some went to jail. The point of the chapter is to emphasize how important an awareness of the potential for deception provided by missing data is, so that we are motivated to learn to focus our attention on how the inevitable missing data are treated.
How the Rule of 72 Can Provide Guidance to Advance Your Wealth, Your Career, and Your Gas Mileage
The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work.
John Von Neumann
Imagine being offered the choice between two prizes. You can opt for either:
(1) $10,000 every day for a month, or
(2) One penny on the first day of the month, two on the second, four on the third, and continued doubling every day thereafter for the entire month.
Which option would you prefer?
Some back-of-the-envelope calculations show that after ten days option (1) has already yielded $100,000, whereas option (2) has only yielded $10.23. The choice seems clear, but we continue with some more arithmetic. By day twenty option (2) has yielded $10,485.75. Is there any way that over the remainder of the month the tortoise of option (2) can possibly overtake the hare of option (1)? It can, for its momentum has become inexorable: by day twenty-one it is $20,971, by day twenty-two it is $41,943, and so by day twenty-five, even though option (1) has reached its laudable, but linear, total of $250,000, option (2) has passed it, reaching $335,544 and is sprinting away toward the end of the month.
If the month was a non-leap-year February, option (2) would yield $2,684,354, almost ten times option (1)'s total. But with the single extra day of a leap year it would double to $5,368,709. And, if you were fortunate enough to have the month chosen be one lasting thirty-one days, the penny-a-day doubling would have accumulated to $21,474,836.47, almost seventy times the penurious $10,000/day's total.
As we can now see, the decision of which option to choose is not even close. But how many of us could have foreseen it?
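The arithmetic is easy to verify: the doubled penny through day d totals 2**d - 1 cents (a geometric series), while the other prize grows linearly:

```python
# Compare the two prizes: $10,000 a day (linear growth) versus a penny
# doubled daily (exponential growth).

def linear_total(days, per_day=10_000):
    """Dollars accumulated at a flat $10,000 per day."""
    return per_day * days

def doubling_total_dollars(days):
    """Dollars from 1 + 2 + 4 + ... + 2**(days-1) cents = 2**days - 1 cents."""
    return (2**days - 1) / 100

for days in (10, 20, 25, 28, 29, 31):
    print(days, linear_total(days), doubling_total_dollars(days))
```

Running it reproduces the figures in the text: $10.23 at day ten, $10,485.75 at day twenty, and $21,474,836.47 for a thirty-one-day month.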
Exponential growth has befuddled human intuition for centuries. One of the earliest descriptions of the confusion it engendered was written by the Persian poet Ferdowski at around 1000 CE. The story revolves around the Indian mathematician Sessa, who invented the game of chess and showed it to his Maharajah. The Maharajah was so delighted with the new game that he gave Sessa the right to name his own prize for the invention. Sessa asked the Maharajah for his prize to be paid in wheat. The amount of wheat was to be determined by placing a single grain on the first square of the chessboard, two on the second square, four on the third, and so forth, doubling each time for all sixty-four squares. This seemed to the Maharajah to be a modest request, and he quickly agreed and ordered his treasurer to calculate the total amount and hand it over to Sessa. It turned out that the Maharajah had a far poorer understanding of exponential growth than did Sessa, for the total was so great that it would have taken far more wheat than the kingdom possessed. There are different versions of the end of the story. In one, Sessa becomes the new ruler; in another, Sessa was beheaded.
The exponential growth, invisible to the Maharajah, would yield 2^64 - 1 grains of wheat, more than eighteen quintillion.
But we don't have to reach back to preclassical antiquity to find examples of such confusion. In 1869 the British polymath Francis Galton (1822–1911) was studying the distribution of height in Britain. Using the power of the normal distribution Galton was able to project the distribution of heights for the entire population from his modest sample. Because he misestimated how quickly the exponential nature of the normal curve forces it to fall toward zero, he predicted that several nine-foot giants would be found, when at the time a nine-footer would have been about thirteen standard deviations above the mean, an event of truly tiny likelihood. Without doing the calculations you can check your own intuition by estimating the height of the normal curve in the middle if its height thirteen standard deviations from the mean is, say, one millimeter. If your answer is smaller than a light year, your intuition is likely as flawed as Galton's.
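The intuition check can be computed directly: the normal density falls off like exp(-z^2/2), so the ratio of the curve's height at the mean to its height 13 standard deviations out is exp(13^2/2). The one-millimeter anchor is just an illustration:

```python
# Ratio of the standard normal density at the mean to its value 13 standard
# deviations out: exp(13**2 / 2), roughly 5e36.
import math

ratio = math.exp(13**2 / 2)

# If the curve were 1 millimeter tall at 13 standard deviations, the center
# would be `ratio` millimeters tall. One light year is about 9.46e18 mm.
height_in_light_years = ratio / 9.46e18
print(f"center/tail ratio: {ratio:.2e}; "
      f"center height: {height_in_light_years:.2e} light years")
```

The center would tower hundreds of quadrillions of light years high, which is why even a trained scientist like Galton could misjudge how fast the tails vanish.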
Compound interest yields exponential growth, so financial planners have long known that its character is hard to grasp, deep in one's soul. To aid our intuition a rule of thumb was described (without derivation) by Luca Pacioli (1445–1514) in 1494.
In brief, the Rule of 72 gives a good approximation as to how long it will take for your money to double at any given compound interest rate. The doubling time is derived by dividing the interest rate into seventy-two. So at 6 percent your money will double in twelve years, at 9 percent in eight years, and so forth. Although this approximation is easy to compute in your head, it is surprisingly accurate (see the table below).
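The rule's accuracy is easy to check against the exact doubling time, ln(2) / ln(1 + r/100):

```python
# Rule of 72 versus the exact doubling time for compound interest.
import math

def exact_doubling_years(rate_percent):
    """Years for money to double at a compound annual rate, exactly."""
    return math.log(2) / math.log(1 + rate_percent / 100)

for rate in (3, 6, 9, 12):
    print(f"{rate}%: rule of 72 -> {72 / rate:.1f} years, "
          f"exact -> {exact_doubling_years(rate):.2f} years")
```

At 6 percent the rule says twelve years and the exact answer is about 11.9; over the range of ordinary interest rates the two rarely differ by more than a few months.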
But exponential growth happens in many circumstances outside of finance. When I was a graduate student, the remarkable John Tukey advised me that to succeed in my career, I would have to work harder than my colleagues: work just 10 percent harder, and in just 7 years you will know twice as much as they do. Stated in that way, it seems that at a cost of just forty-eight minutes a day you can come to know twice as much as your peers.
The height in the center of this normal curve is approximately 3.4 million times the diameter of the known universe.
The Rule of 72 is a rough approximation. A more accurate one replaces 72 with 100 x ln(2) = 69.3, which would provide a slightly more accurate estimate than 72; but because 72 has so many integer factors, it is easy to see why it has been preferred.
Now that we have widened our view of the breadth of application of the Rule of 72, we can easily see other avenues where it can provide clarity. At a college reunion I was dismayed at the state of my fellow graduates. But then I realized that those who had allowed their weight to creep upward at even a modest rate had, over the decades since graduation, grown to double the size I remembered from their yearbook portrait.
In the same way, if we can increase gas mileage just 4 percent each year, in only eighteen years the typical new car would use only half as much gasoline as today's cars.
[Table: Rule of 72. The power of compound interest shown as the number of years for money to double as a function of the interest rate.]
Of course, this rule also provides insight into how effective various kinds of plans for world domination might be. One possible way for a culture to dominate all others is for its population to grow faster than its competitors'. If its growth rate is just 6 percent greater, its population will double relative to theirs in just twelve years. Here I join with Mark Twain in that what we both like best about science is that "one gets such wholesale returns of conjecture out of such a trifling investment of fact."
Piano Virtuosos and the Four-Minute Mile
"Virtuosos becoming a dime a dozen," exclaimed Anthony Tommasini, music critic of the New York Times, in his column in the arts section of that newspaper on Sunday, August 14, 2011. Tommasini described, with some awe, the remarkable increase in the number of young musicians whose technical proficiency on the piano allows them to play anything. He contrasts this with some virtuosos of the past, singling out Rudolf Serkin as an example, who had only the technique they needed to play the music that was meaningful to them. Serkin did not perform pieces like "Prokofiev's finger-twisting Third Piano Concerto or the mighty Liszt Sonata," although such pieces are well within the capacity of most modern virtuosos.
What explains this wholesale "conquering the piano"? Tommasini doesn't attempt to answer this question directly, but he documents the phenomenon:
A new level of technical excellence is expected of emerging pianists. I see it not just on the concert circuit but also at conservatories and colleges. In recent years I have repeatedly been struck by the sheer level of instrumental expertise that seems a given.
The pianist Jerome Lowenthal, a longtime faculty member at Juilliard School of Music, observes it in his own studio. When the 1996 movie Shine, about the pianist David Helfgott, raised curiosity about Rachmaninoff's Third Piano Concerto, Mr. Lowenthal was asked whether the piece was as hard as the movie had suggested. He said that he had two answers: "One was that this piece truly is terribly hard. Two was that all my 16-year-old students were playing it."
Anthony Tommasini, New York Times, August 14, 2011.
We see an apparently unending upward spiral in remarkable levels of technical accomplishment; that is the implicit riddle. I don't mean to imply that this increase in musical virtuosity owes nothing to improvements in teaching and training, although I would be the last to gainsay their possible contribution. I think a major contributor to this remarkable increase in proficiency is population size. I'll elaborate.
The world record for running the mile has steadily improved by almost a third of a second a year. When the twentieth century began the record was 4:13. It took almost fifty years until Roger Bannister broke the four-minute barrier in 1954, and in a little more than a decade his record was being surpassed by high school runners. By the century's end Hicham El Guerrouj broke the tape at 3:43. What happened? How could the capacity of humans to run improve so drastically in such a relatively short time? Humans have been running for a very long time, and in the more distant past, the ability to run quickly was far more important for survival than it is today. A clue toward an answer lies in the names of the record holders. In the early part of the century the record was held by Scandinavians: Paavo Nurmi, Gunder Haag, and Arne Andersson. Then, as the century progressed, Africans arrived: first Filbert Bayi, then Noureddine Morceli and Hicham el Guerrouj. As the population from which elite milers were drawn expanded, the times improved. A runner who wins a race that is the culmination of events involving only thousands of competitors is likely to be slower than one who is the best of a million.
A simple statistical model, proposed and tested in 2002 by Scott Berry, captures this idea. It posits that human running ability has not changed over the past century; in both 1900 and 2000 the distribution of running ability of the human race is well characterized by a normal curve with the same average and the same variability. What has changed is how many people live under that curve. And so in 1900 the best miler in the world (as far as we know) was the best of a billion; in 2000 he was the best of six billion. It turns out that this simple model does a remarkably good job of predicting improvements in contests for which there is an objective criterion.
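Berry's premise rests on a simple statistical fact: the expected maximum of a sample grows with the sample size even when the underlying distribution never changes. A small simulation makes the point; the sample sizes and seed here are arbitrary choices of mine, not Berry's:

```python
# Fixed ability distribution, growing population: the best of a larger
# pool is reliably better, even though no individual improved.
import random

random.seed(1)

def best_of(n):
    """The largest of n draws from a standard normal distribution."""
    return max(random.gauss(0, 1) for _ in range(n))

# Average the maximum over repeated trials to smooth out chance.
trials = 200
avg_best_small = sum(best_of(100) for _ in range(trials)) / trials
avg_best_large = sum(best_of(10_000) for _ in range(trials)) / trials
print(avg_best_small, avg_best_large)
```

The best of one hundred averages about 2.5 standard deviations above the mean; the best of ten thousand, nearly 4. Scale that up from one billion people to six billion and the "impossible" record falls.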
The same phenomenon is surely taking place in other areas of human endeavor. Looking over the list of extraordinary young pianists mentioned by Tommasini, we see names that are commonplace now, but would have seemed wildly out of place at Carnegie Hall a century ago: Lang Lang, Yundi Li, and Yuja Wang. As the reach of classical music extended into areas previously untouched, is it any surprise that we would discover some pianists of remarkable virtuosity?
Tommasini illustrates his point with his reaction to eighty-year-old recordings of the respected pianist Alfred Cortot. He concludes that Cortot "would probably not be admitted to Juilliard now." This should not surprise us any more than the likelihood that Paavo Nurmi, the flying Finn, would have trouble making a Division I collegiate track team.
Of course, social factors, the shrinking and homogenization of the world among them, have increased the pool of potential champions, which early in the twentieth century was largely confined to a handful of countries. Not only was the global pool smaller, but an entire continent's population was excluded from it. At that time probably none of the long-legged Kalenjin peoples who lived in the highlands of Kenya had heard of the Olympics, or even of Paris, where they were held that year; and if they had, the practical as well as the conceptual possibility of traveling across a quarter of the world to compete was remote. It has been calculated that in 2005 the Kalenjin made up .0005 percent of the world's population but won 40 percent of the top international distance running events.
Happiness and Causal Inference
My old, and very dear, friend Henry Braun describes a data scientist as someone who lacks the personality to be an accountant. I like the ambiguity of the description, vaguely reminiscent of a sign next to a new housing development near me, "Never so much for so little." But although ambiguity has an honored place in humor, it is less suitable within science. I believe that although some ambiguity is irreducible, some could be avoided if we could just think more clearly about causation.
Issues of causality have haunted human thinkers for centuries, with the modern view usually ascribed to the Scot David Hume. Statisticians Ronald Fisher and Jerzy Neyman began to offer new insights into the topic in the 1920s, but the last forty years, beginning with Don Rubin's 1974 paper in an unlikely source, have witnessed an explosion in clarity and rigor. A signal event in statisticians' modern exploration of this ancient topic was Paul Holland's comprehensive 1986 paper "Statistics and Causal Inference," which laid out the foundations of what he referred to as "Rubin's Model for Causal Inference."
My gratitude to Don Rubin for encouragement and many helpful comments and suggestions.
Ovid, Metamorphoses, IV, c. 5
A key idea in Rubin's model is that finding the cause of an effect is a task of insuperable difficulty, and so science can make itself most valuable by measuring the effects of causes. What is the effect of a cause? It is the difference between what happened as a result of the treatment versus what would have been the result had it not been applied. The latter condition is a counterfactual and hence impossible to observe. Stated in another way, a causal effect is the difference between an observed outcome and some unobserved potential outcome.
Counterfactuals can never be observed; hence, for an individual, we can never calculate the size of a causal effect directly. What we can do is calculate the average causal effect for a group. This can credibly be done through randomization. If we divide a group randomly into a treatment group and a control group (to pick one obvious situation), it is credible to believe that, because there is nothing special about being in the control group, the result that we observe in the control group is what we would have observed had the treatment group been enrolled in the control condition instead. The difference between the two groups' average outcomes is a measure of the size of the average causal effect of the treatment (relative to the control condition). The randomization is the key to making this a credible conclusion. But, in order for randomization to be possible, we must be able to assign either treatment or control to any particular participant. Thus is derived Rubin's famous conclusion that there can be "no causation without manipulation."
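The logic of potential outcomes and randomization can be sketched in a few lines of simulation. Everything here (the outcome distribution, the constant treatment effect of 5) is invented for illustration; the point is only that the difference in observed group means recovers the true average causal effect:

```python
# Rubin's Model in miniature: every unit has two potential outcomes, but we
# ever observe only one of them. Randomization makes the group comparison fair.
import random

random.seed(0)

# Each unit's pair of potential outcomes; the true average causal effect is 5.
units = []
for _ in range(10_000):
    y0 = random.gauss(50, 10)     # outcome if the unit gets the control
    units.append((y0, y0 + 5))    # outcome if treated: control value plus 5

random.shuffle(units)             # random assignment to the two groups
treated, control = units[:5_000], units[5_000:]

# We "observe" only the outcome that matches each unit's assignment.
observed_treated = sum(y1 for _, y1 in treated) / len(treated)
observed_control = sum(y0 for y0, _ in control) / len(control)

ate = observed_treated - observed_control
print(ate)  # close to the true effect of 5
```

No individual's causal effect was ever observed, yet the average comes out right; that is exactly the leverage randomization buys.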
This simple result has important consequences. It means that some variables, like gender or race, cannot be fruitfully thought of as causal because we cannot randomly assign them. Thus the statement "she is short because she is a woman" is causally meaningless, for to measure the effect of being a woman we would have to know how tall she would have been had she been a man. The heroic assumptions required for such a conclusion remove it from the realm of empirical discussion.
The cause is hidden, but the effect is known.
Hume famously defined causality: "We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second, and, where, if the first object had not been, the second would never have existed." It is the emphasized second clause where counterfactuals enter the discussion. (1740, author's emphasis.)
Although the spread of Rubin's and Holland's ideas has been broad within the statistical community, their diffusion through much of the social sciences and the humanities, where they are especially relevant, has been disappointingly slow. The one exception to this is in economics, where making valid causal inferences has crucial importance. One goal of this chapter is to help speed that diffusion by showing how they can illuminate a vexing issue in science: assessing the direction of the causal arrow. Or, more precisely, how we can measure the size of the causal effect in each direction. This issue arose, most recently, in an article in the Journal of the American Medical Association that proposed a theory of obesity that turned the dominant theory on its head. Specifically, the authors argued that there is evidence that people eat too much because they are fat; not that they are fat because they eat too much. Obviously, measuring the relative size of the effects of the two plausible causes is of great practical importance. Here I examine a different manifestation of the same problem, which has some subtler aspects worth illuminating: the effect of performance on happiness and the effect of happiness on performance (see the Dilbert cartoon).
Happiness: Its Causes and Consequences
Much has been written about the connection between the human sense of well-being (what I will call "happiness") and successful performance on some cognitive task (say school grades or exam scores).
Ludwig and Friedman, May 16.
Dilbert cartoon. Source: Courtesy of AMU Reprints.
Some observe that happy students do better (e.g., the effect of being happy is higher grades); others point out that when someone does well, it pleases them and they are happier (e.g., the effect of doing well is increased happiness). How are we to disentangle this chicken and egg problem? Let us begin with the state of the art (as much as I could discern it) in the happiness literature.
Some claim that often the rigors associated with high performance generate unhappiness. To achieve the goal of making our children happier, this "finding" has led to the suggestion that academic standards should be relaxed. The existence of this suggestion, and the fact that it is being seriously considered, lifts the subject of the direction of the causal arrow into that of the practically important.
There is a sizable literature connecting happiness and performance. How credible is this evidence? This is hard to judge. Much of it appears in journals like the Journal of Happiness Studies and Education Research International, whose scientific rigor is unknown to me. I did note a fair number of cross-sectional studies. But they often carry a caveat akin to:
As with any study based on correlational evidence, care must be taken in interpreting these findings. Specifically, the nature of the evidence does not support the conclusion that a causal relationship does exist. Additional research would therefore be warranted to explore the relationships between and among the variables explored in the present study.
What sort of care? Happily, the authors help us with an elaboration:
a larger sample with a more even distribution of gender and race could also stand to strengthen the findings as would a sample of participants from beyond the Midwestern United States and from larger universities.
This notion falls into the category of a "rapid idea." Rapid ideas are those that only make sense if you say them fast.
Gilman and Huebner; Verkuyten and Thijs.
Is the character of the sample the only problem? In 2007 Quinn and Duckworth recognized that the matter needed to be explored "[i]n a prospective, longitudinal study," which they did. The value of a longitudinal study harks back to Hume's famous criteria for causality. A key one is that a cause must come before an effect. Without gathering longitudinal data we cannot know the order. But this is a necessary condition for causality, not a sufficient one. In Quinn and Duckworth's study they measured the happiness of a sample of students (along with some background variables) and then later recorded the students' grades. They concluded, "Participants reporting higher well-being were more likely to earn higher final grades" and "students earning higher grades tended to go on to experience higher well-being." They sum up that the findings suggest the two may "be reciprocally causal."
Trying to draw longitudinal inferences from cross-sectional data is a task of great difficulty. For example, I once constructed a theory of language acquisition based on a walking tour of south Florida. I noted that most people, when they were young, primarily spoke Spanish. But, when people were old, they usually spoke Yiddish. I tested this theory by noting that the adolescents who worked in local stores spoke mostly Spanish, but a little Yiddish. You can see where this is headed.
It is easy to see that the results obtained from a longitudinal study are less likely to suffer from the same artifacts as a cross-sectional one. But that a longitudinal study's causal conclusions suffer from fewer possible fatal flaws than a cross-sectional study's does not mean that such conclusions are correct. For that we must look to Rubin's Model for help.
My intuition does not conflict with Quinn and Duckworth's causal conclusion. But the important question is quantitative not qualitative. Can we design an experiment in which the treatments (e.g., happiness) can be randomly assigned?
Suppose we take a sample of students and randomly divide them in half, say into groups A and B. We now measure their happiness using whatever instruments are generally favored. Next we administer an exam to them and subtract fifteen points from the scores that we report to all students in group A while adding fifteen points to the scores reported to those in group B (it is easy to see generalizations of this in which the size of the treatments are modified in some systematic way, but that isn't important right now). Now we remeasure their happiness. My suspicion is that group B will have become happier and group A less happy. The amount of change in happiness is the causal effect of the treatment, relative to the control. Had a fuller experiment been done we could trace the functional relationship between the number of points added and the change in happiness. This ends Stage 1 of the experiment.
Next Stage 2: we now have two groups of students whose happiness was randomly assigned, so we can now readminister a parallel form of the same test. We then calculate the difference in the scores from the first administration (the actual score, not the modified one) to the second. The size of that difference is a measure of the causal effect of happiness on academic performance. The Stage 2 portion of the experiment, where the assignment probabilities depend on the outcomes of the first stage, is usually called a sequentially randomized or "split plot" design.
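The two-stage design can be sketched as a simulation. All the numbers here (score and happiness distributions, effect sizes) are invented; only the design itself, manipulating reported scores, remeasuring happiness, then retesting, follows the text:

```python
# A sketch of the two-stage happiness experiment under invented assumptions:
# the score bump moves happiness (Stage 1), and happiness in turn moves
# performance on the retest (Stage 2).
import random

random.seed(2)

a_happy, b_happy, a_score, b_score = [], [], [], []
for _ in range(2_000):
    ability = random.gauss(70, 8)                 # latent test-taking ability
    happiness = random.gauss(5, 1)                # baseline happiness
    bump = -15 if random.random() < 0.5 else 15   # Stage 1 treatment: A vs B
    happiness += 0.05 * bump                      # assumed effect of reported score
    score2 = ability + 1.0 * happiness            # assumed effect of mood on retest
    (a_happy if bump < 0 else b_happy).append(happiness)
    (a_score if bump < 0 else b_score).append(score2)

def mean(xs):
    return sum(xs) / len(xs)

stage1 = mean(b_happy) - mean(a_happy)  # causal effect of the bump on happiness
stage2 = mean(b_score) - mean(a_score)  # knock-on causal effect on the retest
print(stage1, stage2)
```

Because assignment is random, each difference in group means estimates one of the two causal effects, and their ratio gives exactly the kind of quantitative comparison the chapter is after.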
The ratio of the size of the two causal effects tells us the relative influence of the two treatments. An outcome from such an experiment could yield conclusions like "the effect of improved performance has ten times the effect on happiness than a similar increase in happiness has on performance." Such a quantitative outcome is surely more satisfying
Intuitively this surely makes sense, for if you don't know the answer to a question, being happier isn't going to change your knowledge base.
and more useful than continued speculation about the direction of the causal arrow.
Even a cursory reading of the happiness literature reveals the kind of conclusions that researchers would like to make. Typically, Zahra, Khak, and colleagues (225) tell us that "[r]esults showed that in addition to ... achievement of university students, happiness could also explain 13% of changes of academic achievement." You can feel the authors' desire to be causal, and they come very close to making a causal claim; it's just that their evidence won't support putting it that way. But the character of the evidence in this literature leaves much to be yearned for. Most were observational studies, and the rest might charitably be described as quasi-experimental. Through the use of Rubin's Model we can design true experimental studies that can provide answers to the questions we want to ask. Moreover, laying out the randomized experiment that would be needed to measure the causal effects of interest greatly clarifies what causal question is being asked.
The bad news is that such studies are not as easy as picking up whatever data happen to be lying about. But if doing it correctly were easy, everyone would do it.
Precision is important: note that the treatment in Stage 1 is not higher performance versus lower performance, but rather higher scores than expected versus lower scores than expected. A subtle, but important, distinction.
Causal Inference and Death
The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy!
Robert Burns, 1785
Earlier we learned how being guided by Rubin's Model for Causal Inference helps us design experiments to measure the effects of possible causes, and we used it to unravel a causal puzzle of happiness. Is it really this easy? The short answer is, unfortunately, no. But in the practical world, more complicated than the one evoked in my proposed happiness study, Rubin's Model is even more useful. In this chapter we go deeper into the dimly lit practical world, where participants in our causal experiment drop out for reasons outside our control. I show how statistical thinking in general, and Rubin's Model in particular, can light the way, if we go slowly and allow time for our eyes to acclimate to the darkness.
Controlled experimental studies are typically regarded as the gold standard for which all investigators should strive, and observational studies as their polar opposite, pejoratively described as "some data we found lying about." But the two are closer in character than we are often willing to admit. The distinguished statistician Paul Holland, expanding on Robert Burns, observed that
All experimental studies are observational studies waiting to happen.
This is an important and useful warning to all who are wise enough to heed it.
This chapter is a lightly rewritten version of Wainer and Rubin. It has benefited from comments by Linda Steinberg and, primus inter pares, Don Rubin, who generously suggested the area of application and provided the example.
e key to an experimental study is control. In an experiment, those
running it control:
What is the treatment condition,
What is the alternative condition,
What are the outcome (dependent) variables.
Consider an experiment to measure the causal effect of smoking on life expectancy. Were we to do an experiment, the treatment might be a pack of cigarettes a day, and the alternative condition would be no smoking. Then we would randomly assign people to smoke or not smoke, and the dependent variable would be their age at death.
Such an experiment is impractical because we cannot assign people to smoke or not at random. This is the most typical shortcoming of even a well-designed observational study. Investigators could search out and recruit treatment participants who smoke a pack a day, and also control participants who are nonsmokers. And they could balance the two groups on the basis of various observable characteristics (so that, say, the distributions of age and sex are the same in both groups). But when there is another variable related to length of life that is not measured, there may be no balance on it. Randomization, of course, provides balance, on average, for all such "lurking missing variables."
Holland, P. W., Personal communication, October 26, 1993.
This experiment has been attempted with animals, where random assignment is practical, but it has never shown any causal effect. This presumably is because easily available experimental animals (e.g., dogs or rats) do not live long enough for the carcinogenic effects of smoking to appear, and animals with long-enough lives (e.g., tortoises) cannot be induced to smoke.
This example characterizes the shortcoming of most observational studies: they lack the balance across treatment conditions that randomization achieves (on average) in an experiment. It also makes clear why an observational study needs to collect lots of ancillary information about each participant so that the kind of balancing required can be attempted. In a true experiment, with random assignment, such information is (in theory) not required. Here enters Paul Holland, whose observations about the inevitability of missing data will further illuminate our journey.
Suppose some participants drop out of the study. They might move away, stop showing up for annual visits, decide to quit smoking (for those in the treatment group), or disappear for some other reason unrelated to smoking. At this point the randomization is knocked into a cocked hat and we must try to rescue the study using the tools of observational studies. Such a rescue is impossible if we have not had the foresight to record all of the crucial ancillary information, the covariates, that are the hallmark of a good observational study.
Our inability to randomize has allowed missing variables to mislead us in the past. For a long time the effect of obesity on life expectancy was underestimated because smokers tended to both die younger and be more slender than nonsmokers. Hence the negative effect of smoking on life expectancy was conflated with the advantage of avoiding obesity. Experiments in which we randomly assign people to be obese or not are not possible. Modern research on the effects of obesity excludes participants who have ever smoked.
I used to be Snow White, but I drifted.
It is obvious now, if it wasn't already, that even the most carefully planned experiment can turn into an observational study when events intrude. The issue is clear: How can we estimate causal effects when unexpected events intrude on the data collection, causing some observations to be missing? The observations are gone – a fact that no amount of verbal or mathematical legerdemain can alter.
The Magic of Statistics Cannot Put Actual Numbers
Where There Are None
And so, when we run an experiment and some of the subjects drop out, taking with them the observations that were to be the grist for our inferential mill, what are we to do?
There are many ways to answer this question, but all that are credible must include an increase in the uncertainty of our answer over what it would have been had the observations not disappeared.
Coronary Bypass Surgery: An Illuminating Example
What can be done when, because of clogged arteries that have compromised its blood supply, the heart no longer operates effectively? For more than fifty years one solution to this problem has been coronary bypass surgery. In this procedure another blood vessel is harvested and sewn into the heart's blood supply, bypassing the clogged one. This surgery carries substantial risks, usually greater for a patient who is not in robust health. Before recommending this procedure widely, it is crucially important to assess how much it is likely to help each type of patient, which requires knowing the size of the causal effects that such surgery has on the health of the patients after the treatment.
Treatment and Control – the treatment is the bypass surgery; the control is a nonsurgical alternative such as medication and a program of exercise.
Participants – in both groups chosen at random from a pool of individuals judged to have one or more clogged arteries that supply blood to the heart.
I suspect that this has been stated many times by many seers, but my source of this version was that prince of statistical aphorisms, Paul Holland (October 26, 1993).
Outcome Measure – We shall judge the success of the intervention by a "quality of life" (QOL) score based on both medical and behavioral measures taken one year after the surgery.
Before starting, we examine the effectiveness of the randomization to assure ourselves that the two groups match on age, sex, SES (socioeconomic status), smoking, initial QOL, and everything else we could think of. It all checks out, and the experiment begins.
Der mentsh trakht und Got lakht.
Ancient Yiddish proverb
As the experiment progresses, some of the patients die. Some died before the surgery, some during, and some afterward. For none of these patients was the one-year QOL measured. What are we to do?
One option might be to exclude all patients with missing QOL from the analysis and proceed as if they were never in the experiment. This approach is too often followed in the analysis of survey data, in which the answers of those who responded are treated as if they represented everyone. This is almost surely a mistake, and the size of the bias introduced is generally proportional to the amount of missingness. One survey by a company of its employees tried to assess "engagement with the company." They reported that 86 percent of those responding were "engaged" or "highly engaged." They also found that only 22 percent of those polled were engaged enough to respond, which at minimum should cause the reader to wonder about the degree of engagement of the 78 percent who opted not to respond.
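The worry can be made numeric with nothing more than the two percentages the survey reported. The sketch below bounds the overall engagement rate; the extreme assumptions about the nonrespondents are, of course, hypothetical:

```python
# Bounding overall engagement when only some of those polled respond.
response_rate = 0.22            # fraction of those polled who responded
engaged_among_responders = 0.86

# Engaged responders as a fraction of everyone polled.
engaged_known = response_rate * engaged_among_responders

# Lower bound: assume no nonrespondent is engaged.
lower = engaged_known
# Upper bound: assume every nonrespondent is engaged (implausible).
upper = engaged_known + (1 - response_rate)

print(f"Overall engagement is somewhere between {lower:.0%} and {upper:.0%}")
```

The honest statement is thus not "86 percent engaged" but "somewhere between roughly 19 percent and 97 percent," and only an assumption about the silent 78 percent narrows that range.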
A second option might be to impute a value for the missing information. For the nonrespondents in the engagement survey one might impute a value of zero engagement, but that might be too extreme.
It is tempting to follow this approach with the bypass surgery and impute a zero QOL for everyone who is dead. But is this true? Many people rate death higher than some miserable living situations and construct living wills to enshrine this belief.
Man plans and God laughs.
Now consider the observed (and observable) data from such an experiment, which we summarize in the accompanying table. From this table we can deduce that 60 percent of those patients who received the treatment lived, whereas only 40 percent of those who received the control survived. Thus, we would conclude that on the intermediate, but still very important, variable of survival, the treatment is superior to the control. But, among those who lived, QOL was higher for those who received the control condition than for those who received the treatment (750 vs. 700) – perhaps suggesting that treatment is not without its costs.
We could easily have obtained these two inferences from any one of dozens of statistical software packages, but this path is fraught with danger. Remember Picasso's observation that for some things, "computers are worthless; they only give answers." It seems wise to think before we jump to calculate, and then to conclude anything from these calculations.
We can believe the life versus death conclusion because the two groups, treatment versus control, were assigned at random. Thus, we can credibly estimate the size of the causal effect of the treatment relative to the control on survival as 60 percent versus 40 percent.
But the randomization is no longer in full effect when we consider QOL, because the two groups being compared were not composed at random: death has intervened in a decidedly nonrandom fashion, and we would err if we concluded that treatment reduced QOL by 50 points. What are we to do?
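To fix ideas, here is the naive calculation laid out explicitly. It uses only the figures reported in the text (survival percentages and mean QOL among survivors); the dictionary layout is merely an illustrative convenience:

```python
# Observed data: survival rate and mean QOL among survivors only.
observed = {
    "treatment": {"pct_survived": 60, "mean_qol_survivors": 700},
    "control":   {"pct_survived": 40, "mean_qol_survivors": 750},
}

# The survival comparison IS protected by the random assignment.
survival_effect = (observed["treatment"]["pct_survived"]
                   - observed["control"]["pct_survived"])
print("Causal effect on survival:", survival_effect, "percentage points")

# The QOL comparison is NOT protected: the survivors in the two arms
# were selected by death, not by the randomization.
naive_qol_diff = (observed["treatment"]["mean_qol_survivors"]
                  - observed["control"]["mean_qol_survivors"])
print("Naive (and biased) QOL difference:", naive_qol_diff)
```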
The Observed Data

             Percent Survived    Mean QOL among Survivors
Treatment          60                      700
Control            40                      750

This example is taken, with only minor modifications, from Rubin.
Rubin’s Model provides clarity. Remember one of the key pieces of
Rubin’s Model is the idea of a potential outcome:each subject of the
experiment, before the experiment begins, has a potential outcome under
these two outcomes would be the causal e\tect of the treatment, but we
of a summary causal e\tect, for example, averaged over each of the two
experimental groups. And this is most credible with random assignment
of subjects to treatment.
The bypass experiment has one outcome measure that we have planned, the QOL score, and a second, an intermediate outcome, life or death, which was unplanned. Thus, for each participant there are four possibilities:
Live with treatment and live with control,
Live with treatment but die with control,
Die with treatment but live with control, or
Die with treatment and die with control.
Of course, each of these four survival strata is randomly divided into two groups: those who actually got the treatment and those who got the control. So for stratum 1 we can observe their QOL score regardless of which experimental group they are in. In stratum 2, we can only observe QOL for those who got the treatment, and similarly in stratum 3 we can only observe QOL for those who were in the control group. In stratum 4, which is probably made up of very fragile individuals, we cannot observe QOL for anyone.
The summary in the accompanying table makes it clear that if we simply ignore those who died, we are comparing the QOL of those in cells (a) and (b) with that of those in (e) and (g). We are ignoring all subjects in cells (c), (d), (f), and (h).
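The bookkeeping of which cells can yield a QOL score is simple enough to write down. The sketch below encodes the four survival strata and marks where QOL is observable; the stratum labels follow the text, while the data structure itself is an illustrative assumption:

```python
# Survival strata crossed with the condition actually received.
# QOL is observable only where the participant lives under the
# condition received.
strata = {
    "LL": {"treatment": True,  "control": True},   # lives either way
    "LD": {"treatment": True,  "control": False},  # lives only if treated
    "DL": {"treatment": False, "control": True},   # lives only under control
    "DD": {"treatment": False, "control": False},  # dies either way
}

# Cells with an observable QOL score.
observable = [(s, arm) for s, arms in strata.items()
              for arm, alive in arms.items() if alive]
print(observable)

# Only the LL stratum contributes QOL under BOTH conditions, so only
# there can we compare like with like.
both_arms = [s for s, arms in strata.items() if all(arms.values())]
print(both_arms)
```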
At this point three things are clear:
The use of Rubin's idea of potential outcomes has helped us to think clearly about where causal information can come from.
Only the first survival stratum can supply an unbiased estimate of the average causal effect of the treatment on QOL, for only in this principal stratum are the QOL data in each of the two treatment conditions generated from a random sample from those in that stratum, untainted by intervening death.
We need some way to decide whether a participant who lived and got the control was in stratum 1 or stratum 3. And, unless we can make this decision (at least on average), our insight is of no practical use.
To see what is at stake, let us first consider a supernatural solution. Specifically, suppose a miracle occurred and some benevolent deity decided to ease our task and give us the outcome data we require.
Those results are summarized in the accompanying table. We can easily expand it to make explicit what is happening: the first row has been broken into two rows, the first representing those participants who received the treatment and the second those who received the control. The rest of the columns are the same, except that some of the entries are counterfactual, given the experiment – what would
The Experiment Stratified by the Potential Results on the
Intermediate (Life/Death) Outcome

Survival Strata
1. Live with treatment – Live with control (LL)
2. Live with treatment – Die with control (LD)
3. Die with treatment – Live with control (DL)
4. Die with treatment – Die with control (DD)
It is unfortunate that the benevolent deity that provided the classifications didn't also provide us with the QOL scores for all experiment participants, but that's how miracles typically work, and we shouldn't be ungrateful. History is rife with partial miracles. Why mess around with seven plagues, when God could just as easily have transported the Children of Israel directly to the Land of Milk and Honey? And while we're at it, why shepherd them to the one area in the Middle East that had no oil? Would it have hurt to whisk them instead directly to Maui?
have happened had they been placed in the other condition instead. It also emphasizes that it is only from the two potential outcomes in stratum 1 that we can obtain estimates of the size of the causal effect. This expansion makes it clearer which strata provide us with a defined estimate of the causal effect of the treatment (LL) and which do not. It also shows us why the unbalanced inclusion of QOL outcomes from the LD and DL strata misled us about the causal advantage of the control condition.
In the affairs of life, it is impossible for us to count on miracles.
Immanuel Kant (1724–1804)
[Tables: "True (but Partially Unobserved) Outcomes" and "True (but Partially Unobserved) Outcomes Split by Treatment," giving the percent of participants and the Live or Die outcome under each condition for the treatment and control groups.]
The combination of Rubin's Model (to clarify our thinking) and a minor miracle (to provide the information we needed to implement that clarity of thought) has guided us to the correct causal conclusions. Unfortunately, as Kant so clearly pointed out, miracles are rare nowadays, and so we must look elsewhere for solutions to our contemporary problems.
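What the miracle buys us can be made concrete with a toy version of the experiment. The stratum shares and QOL values below are hypothetical, chosen only so that they reproduce the observed margins quoted earlier (60 percent versus 40 percent survival, and mean survivor QOL of 700 versus 750); they are not the book's table values:

```python
# Hypothetical "miracle" table: for each survival stratum, its share of
# the population and QOL under each condition (None = dead, no QOL).
strata = {
    #       share, QOL if treated, QOL if control
    "LL": (0.3, 800, 750),    # lives either way
    "LD": (0.3, 600, None),   # lives only if treated
    "DL": (0.1, None, 750),   # lives only under control
    "DD": (0.3, None, None),  # dies either way
}

def survivor_mean(arm):
    """Mean QOL among survivors in one randomized arm (0=treat, 1=control)."""
    cells = [(p, q[arm]) for p, *q in strata.values() if q[arm] is not None]
    total = sum(p for p, _ in cells)
    return sum(p * q for p, q in cells) / total, total

# Naive comparison: survivors only, ignoring the strata.
t_mean, t_live = survivor_mean(0)
c_mean, c_live = survivor_mean(1)
print(t_live, c_live)      # reproduces the 60% vs. 40% survival margins
print(t_mean - c_mean)     # reproduces the naive -50 QOL difference

# Correct comparison: within the LL stratum only.
ll = strata["LL"]
print(ll[1] - ll[2])       # the like-with-like causal effect on QOL
```

With these invented numbers the naive survivor comparison favors the control by 50 points, while the only like-with-like comparison, inside the LL stratum, favors the treatment by 50 points – precisely the kind of reversal the chapter warns about.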
Without the benefit of an occasional handy miracle, how are we to form the stratification that allowed us to estimate the causal effect of the treatment? Here enters Paul Holland's aphorism. If we were canny when we designed this experiment, we recognized the possibility that something might happen and so gathered extra information (covariates) that could predict each participant's survival stratum. Consider the sorts of conversations between a physician and a patient that might occur in the course of treatment.
"Fred, aside from your heart, you're in great shape. While I don't think you are in any immediate danger, the surgery will improve your life considerably." (LL)
"Fred, you're in trouble, and without the surgery we don't hold out much hope. But you're a good candidate for the surgery, and with it we think you'll do very well." (LD)
"Fred, for the specific reasons we've discussed, I don't think you can survive surgery, but you're in good enough shape so that with medication and time you will be strong enough for the surgery." (DL)
(To Fred's family) "Fred is in very poor shape, and nothing we can do will help. We don't think he can survive surgery, and without it, it is just a matter of time. I'm sorry." (DD)
Obviously, each of these conversations involved a prediction, made by the physician, of what is likely to happen with and without the surgery. Such a prediction would be based on previous cases in which various measures of the patient's health were used to estimate the likelihood of survival. The existence of such a system of predictors allows us, without the benefit of anything supernatural, to estimate each participant's survival stratum. Here we shall use each patient's initial QOL score, measured before the experiment began, as a simple preassignment predictor of survival stratum. We obtain this before any opportunity for postassignment dropouts.
The resulting table differs from the miracle-supplied one in two critical ways. First, it includes the covariate, initial QOL. Second, the determination of who was included in each of the survival strata was based on how the value of the covariate relates to the intermediate (live/die) outcome, and not on the whim of some benevolent deity. More specifically, individual participants were grouped and then stratified by the value of their initial QOL score – no miracle required.
Those with very low QOL (300) were in remarkably poor condition and died whether or not they received the treatment. Those with intermediate QOL scores (500) were not in very good shape and only survived if they had the treatment, whereas they perished if they did not. And last, those with very high QOL (900) survived under the control, their QOL declining to 800, which is still not bad; but if treated, they died. This is an unexpected outcome and requires follow-up interviews with their physicians. Perhaps they felt so good after the treatment that they engaged in some too-strenuous activity that turned out to be fatally unwise.
[Table: as Table 4.4 but now including the critical covariate "Initial QOL."]
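In this stylized example the classification rule is deterministic, so it can be stated as a few lines of code. The cutoffs below are arbitrary illustrations placed between the three initial QOL values the text mentions; note that no LL stratum appears in this particular example:

```python
def predicted_stratum(initial_qol):
    """Classify a patient into a survival stratum from initial QOL,
    following the chapter's stylized example (cutoffs are illustrative)."""
    if initial_qol <= 300:
        return "DD"   # very frail: dies with or without the surgery
    elif initial_qol <= 500:
        return "LD"   # survives only with the treatment
    else:
        return "DL"   # robust, but (unexpectedly) dies if treated

print([predicted_stratum(q) for q in (300, 500, 900)])
# ['DD', 'LD', 'DL']
```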
A successful theory has to work for a living. Let us summarize how Rubin's Model earns its keep here:
Before assignment in an experiment each subject has two potential outcomes – what the outcome would be under the treatment and what it would be under the control.
The causal effect of the treatment (relative to the control) is the difference between these two potential outcomes. We can never observe both for the same person, but through random assignment we make credible the assumption that what we observed among those in the control group is what we would have observed in the treatment group had they not gotten the treatment.
When someone dies (or moves away or wins the lottery or otherwise stops coming for appointments) before we can measure the dependent variable (QOL), we are out of luck. Thus, the only people who can provide estimates of the causal effect are those who would have lived under the treatment AND who would have lived under the control. No others can provide the information required.
So subjects who would survive under the treatment but not the control are no help; likewise those who would die under the treatment but not the control; and, of course, those who would die under both. Usable comparisons come only from those who survive under both conditions.
Each participant, however, is observed only under the condition to which they were assigned. To classify participants into strata that depend on the counterfactual condition, we need to predict that outcome using additional (covariate) information.
If such information doesn't exist, or if it isn't good enough to make a sufficiently accurate prediction, we are stuck. And no amount of yelling and screaming will change that.
In this chapter we have taken a step into the real world, where even randomized experiments can be tricky to analyze correctly. Although we illustrated this with death as the very dramatic reason for nonresponse, the same logic applies to milder dropout. For example, a parallel situation occurs in randomized medical trials in which, when a patient's condition deteriorates rapidly, rescue therapy (a standard, already approved drug) is used. As we demonstrated here, and in that context as well, one should compare the treatment groups only within the principal stratum of patients whose outcomes were not determined by the intervening event.
These are but a tiny sampling of the many situations in which the clear thinking afforded by Rubin's Model, with its emphasis on potential outcomes, can help us avoid being led astray. We also have emphasized the crucial importance of covariate information, which provided the pathway to forming the stratification that revealed the correct way to estimate the true value of the causal effect. For cogent and concise description, we have used a remarkably good covariate (initial QOL), which made the stratification of individuals unambiguous. Such covariates are welcome but rare. When such collateral information is not quite so good, we will face greater uncertainty; pursuing that complication lies beyond the goals of this chapter.
What we have learned in this exercise is that:
The naive estimate of the causal effect of the treatment on QOL, derived from the observed data alone, was wrong.
A correct estimate comes from classifying the study participants by their potential outcomes, for it is only in the survival stratum occupied by participants who would have lived under either treatment or control that such an unambiguous estimate is possible.
The classification can be done with ancillary information (here their prestudy QOL). The weaker that information, the greater the uncertainty in the estimate of the size of the causal effect.
This was a simplified and contrived example, but the deep ideas contained are correct, and so nothing that was said needs to be relearned by those who choose to go further.
Using Experiments to Answer Four
Vexing Questions
Quid gratis asseritur, gratis negatur.
In an earlier chapter the observed correlation between self-esteem (happiness) and school performance gave rise to various causal theories, at least one of which, if acted upon, could lead to unhappy outcomes. We saw how a simple experiment, by measuring the size of the causal effect, could provide evidence that would allow us to judge whether a causal claim was valid or specious. Every day we encounter many correlations that some claim to indicate causation. For example, there is a very high correlation between ice cream consumption and deaths by drowning. Some cautious folk, seeing such a strong relation, have suggested that eating ice cream should be sharply limited, especially for children. Happily, cooler heads prevailed and pointed out that the correlation was caused by a third variable, the extent of warm weather. When the weather is warm, more people go swimming, and hence risk drowning, and more people consume ice cream. Despite the high correlation, neither eating ice cream nor swimming tragedies are likely to cause the weather to warm
This is an ancient Latin proverb translated by Christopher Hitchens as, "What can be asserted without evidence can also be dismissed without evidence." The translation appeared in his 2007 book God Is Not Great: How Religion Poisons Everything.
up. We could design an experiment to confirm this conjecture, but it hardly seems necessary.
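That conjecture is easy to simulate. In the hypothetical sketch below, temperature drives both quantities and there is no causal link between them, yet the correlation is strong and stays strong no matter how many cases we collect:

```python
import random

def correlation(x, y):
    """Pearson correlation, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def simulate(n, seed=0):
    """Ice cream sales and drownings both driven by temperature,
    with no causal link between them (all parameters invented)."""
    rng = random.Random(seed)
    temps = [rng.gauss(20, 8) for _ in range(n)]            # daily warmth
    ice_cream = [3 * t + rng.gauss(0, 5) for t in temps]    # sales
    drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]  # tragedies
    return correlation(ice_cream, drownings)

# More data does not make the spurious correlation go away.
print(round(simulate(100), 2), round(simulate(100_000), 2))
```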
After it was noted that infants who sleep with the lights on were more likely to be nearsighted when they grew up, advice to new parents was both rampant and adamant to be sure to turn the lights off when their baby went to sleep. Only later was it revealed that nearsightedness has a strong genetic component, and that nearsighted parents were the ones more likely to leave lights on. Again, a controlled experiment – even a small one, though practically difficult to do – would have made clear the direction and size of the causal connection.
As one last example, consider the well-known fact that men who are married live longer, on average, than those who are single. The usual causal inference is that the love of a good woman, and the regularity of life that it yields, is the cause of the observed longer life. In fact, the causal arrow might go in the other direction. Men who are attractive to women typically are those who are healthier and wealthier, and hence more likely to live longer. Men who are in poor health and/or of limited means and prospects find it more difficult to find a wife.
Examples like these illustrate the danger that always accompanies the drawing of causal conclusions from observational data. The likelihood of such confusion is not diminished by increasing the amount of data, although the publicity given to "big data" would have us believe so. The spurious connection between drownings and ice cream does not diminish if we increase the number of cases from a few dozen to a few million. The amateur carpenter's complaint that "this board is too short, and even though I've cut it four more times, it is still too short," seems eerily appropriate.
It is too easy for the sheer mass of big data to overwhelm the sort of healthy skepticism that is required to defeat deception. And now, with so much of our lives being tied up with giant electronic memory systems, it is almost trivial for terabytes of data to accumulate on almost any topic. When sixteen gazillion data points stand behind some hypothesis, how could we go wrong?
At least one wag has suggested that life for married men is not longer; it just seems that way.
Data can be anything you have: shoe sizes, bank balances, horsepower, test scores, hair curl, skin reflectivity, family size, and on and on. Evidence is much narrower. Evidence is data related to a claim. Evidence has two characteristics that determine its value in supporting or refuting the claim: (1) how related is the evidence to the claim (the validity of the evidence), and (2) how much evidence is there (the reliability of the evidence).
On March 5, 2015, Jane Hsu, the principal of Public School 116 on Manhattan's east side, announced that henceforth they would ban homework for students in fifth grade or younger. She blamed the "well-established" negative effects of homework on student learning for the policy change. What would be evidence that would support such a claim? It is easy to construct an argument to support both sides of this issue. I'm sure Ms. Hsu has been the recipient of phone calls from parents complaining about the limited time their children have for other afterschool activities, and how stressed out both they and their children are trying to fit in the demands of learning multiplication tables, soccer practice, and all the rest (and, on the other side, from teachers asking their students to practice reading and, yes, learning math facts). Amassing evidence in support of, or refuting, the claims about homework for young children having a negative effect is just building an argument, and such arguments stand or fall on the two characteristics of their evidence: their validity and their reliability.
Looking at the number of Google hits on travel and entertainment websites from people who lived in PS 116's district would be data, but not evidence; and their evidentiary value would not improve even if there were millions of hits.
The mindless gathering of truckloads of data is mooted by the gathering of even a small amount of thoughtfully collected evidence. Imagine a small experiment in which we chose a few second grade classrooms and randomly assigned half of the children to read at home for, say, an hour a week, while allowing the other half to do whatever they wanted. Then after a semester we compared the gains in the reading scores of the two groups from the beginning of the school year. If we found that the group assigned to read showed greater gains, we would have evidence that their homework made a difference. If similar studies were done assigning the learning of math facts and found similar gains, this would add credence to the claim that homework helps for those topics for that age child. If the same experiment was repeated in many schools and the outcome was replicated, the support would be strengthened. Of course, this experiment does not speak to concerns of homework causing increased stress among students and parents, but that too could be measured.
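A minimal analysis of such an experiment is just a comparison of mean gains, with a permutation test to gauge how surprising the observed difference would be under chance assignment. All the gain scores below are invented for illustration:

```python
import random

# Hypothetical fall-to-spring reading gains for the two groups.
reading_group = [12, 15, 9, 14, 18, 11, 16, 13]   # assigned reading homework
free_group    = [10, 8, 12, 9, 11, 7, 13, 10]     # no assignment

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(reading_group) - mean(free_group)

# Permutation test: how often does a random relabeling of the children
# produce a difference in mean gains at least this large?
rng = random.Random(0)
pooled = reading_group + free_group
n = len(reading_group)
count = 0
trials = 10_000
for _ in range(trials):
    rng.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
        count += 1
p_value = count / trials

print(f"Observed difference in gains: {observed:.1f}, p ~ {p_value:.3f}")
```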
My point, and the point of the rest of this chapter (as well as most of this book), is that even a small, well-directed experiment can actually provide a credible answer that would elude enormous, but undirected, data gathering.
I have recurrent dreams in which I am involved in a trial in which I have shifted from my usual role as a witness and instead have been thrust onto the bench in judge's robes. One such dream involved an examinee with a disability who had requested an accommodation from the testing organization. After a careful study of the supporting materials submitted by the examinee, she was granted a 50 percent increase in testing time. She appealed for more time, and the appeal was turned down. The final step in the appellate process was the hearing at which I was presiding. The examinee argued that because of her physical limitations she required, at a minimum, double the standard time. The testing company disagreed and said that, in their expert opinion, 50 percent more time was sufficient.
The goal of accommodations on exams is to level the field so that someone with some sort of physical disability is not unfairly disadvantaged. But balance is important. We do not want to give such a generous accommodation that it confers an unfair advantage. Both sides agreed that this was not a qualitative question; the examinee ought to have some accommodation. The issue, in this instance, is the quantitative question "how much is enough?"
Causal claims are ill tested with casual data gathering, despite the superficial resemblance of the two words (differing only by a vowel movement).
As I sat there on the bench thinking about the two arguments, I felt the heavy weight of responsibility. Surprisingly, my mind drifted back to my teachers and their methods, and I imagined asking them, "What do you think?"
Over the past forty years it has been my good fortune to work with two modern masters, Paul Holland and Don Rubin. Though I am no more a substitute for them than Watson was for Holmes, I know their methods well enough to make a start. And so, sitting there, very much alone on that bench, I tried to apply what I had learned from them.
The key question was one of causal inference – in this instance the "treatment" is the size of the accommodation, and the causal effect of interest is the extent of the boost in score associated with a specified increase in testing time. The effect, once measured, then has to be compared with how much of a boost is enough. With the question phrased in this way, the character of a study constructed to begin to answer it was straightforward: randomly assign examinees to one of several time allocations, administer the test, and record the scores. The function connecting accommodation length to score could then inform decisions on the size of the accommodation. It would also tell us how important it is to limit testing time at all: if the function is flat from 50 percent more time onward, the testing organization could allow as much time as requested without compromising fairness or validity of the test.
This chain of reasoning led me to ask the representatives of the testing organization whether the relation between time and score was monotonically increasing – more time meant higher scores. They agreed that it was. It takes a while to see the need for such studies, to design them, and to carry them out, and so I asked my second question, "How long have you been giving this test with extra time?"
They replied, "About fifteen years."
I nodded and said, "That's enough time to have done the studies."
It is too bad that real life is so rarely as satisfying as fantasy.
Whether reality can ever be as satisfying as fantasy is one of the great questions of our age, but I will leave an exploration of this great question to other accounts.
On the Role of Experiments in Answering Causal Questions
In medical research a random-assignment controlled experiment has long been touted as the "gold standard" of evidence. Random assignment is a way to make credible the crucial assumption that the experimental and control groups were the same (on average) on all other variables, both measured and unmeasured. Without such an assumption we cannot be sure how much of the posttreatment result observed was due to the treatment and how much to preexisting differences between the groups that we did not account for.
Of course random assignment is not always practical, and for these cases there is a rich literature on how to make the assumption of no difference credible in an observational study. But even if done perfectly, an observational study can only approach, but never reach, the credibility of randomization in assuring that there is no missing third variable that accounts for the differences observed in the experimental outcome.
The randomized experiment is too rarely used in education. But when it is used correctly on an important question,
In an observational study the control of one or more of these aspects is lacking. Most commonly those in the treatment group self-select to do it (e.g., to measure the impact of smoking on health, we might compare smokers with nonsmokers, but people are not assigned to smoke at random). Several book-length treatments are the standard references in the field.
it can provide us with an answer. That is one reason why the Tennessee study of class size in the early school grades (discussed in many places, including an especially clear and comprehensive description) has assumed such a (well-deserved) place of prominence in educational research; it has provided a credible answer. Nor should one successful experiment keep researchers from doing more of it in the same problem area; there is plenty of darkness to go around.
Problem 1: Accommodation for Examinees with Disabilities
The experiment that I imagined in the dream that began this chapter resembles the low-dose extrapolation that is common in the study of drug efficacy and widely used in so-called Delaney Clause research. That clause in the law forbids any food additive that is carcinogenic. The challenge is to construct a dose-response curve for an additive suspected of being a possible carcinogen. To accomplish this, one assigns the experimental animals randomly to several groups, gives each group a different dosage of the additive, and then records the number of tumors as a function of the dose. But, and here is the big idea, for most additives, tumors are rare with the dosages typically encountered in everyday life. So, to boost the size of the effect, one group might receive the dose contained in a hundred cases a day; another the dose contained in ten cases/day; and a third might be five cases/day. Such massive doses, though unrealistic, would accentuate any carcinogenic effect the additive might have. To the results of this experiment we fit a function connecting dose and response and extrapolate downward to the anticipated low doses, hence the name low-dose extrapolation.
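The mechanics of low-dose extrapolation can be sketched with a straight-line fit on hypothetical high-dose data; real dose-response work uses more careful functional forms, and every number below is invented:

```python
# Hypothetical high-dose animal data: dose (cases/day) vs. tumor rate.
doses  = [5.0, 10.0, 100.0]
tumors = [0.02, 0.035, 0.31]   # fraction of animals with tumors (invented)

# Ordinary least-squares line through the high-dose points.
n = len(doses)
mean_x = sum(doses) / n
mean_y = sum(tumors) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(doses, tumors))
         / sum((x - mean_x) ** 2 for x in doses))
intercept = mean_y - slope * mean_x

# Extrapolate downward to a realistic everyday dose.
everyday_dose = 0.1
predicted = intercept + slope * everyday_dose
print(f"Predicted tumor rate at {everyday_dose} cases/day: {predicted:.4f}")
```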
A parallel experiment could help us treat all examinees fairly. The experiment would take a sample of examinees without any disabilities and randomly assign them to a number of different time limits, say, normal time, 25 percent more than normal, 50 percent more, 100 percent, 200 percent, and unlimited. Then administer the test, keep track of the mean (or median) score for each group, and connect the results with some sort of continuous interpolating function. With this function in hand we can allow examinees to take the test with whatever amount of time they prefer and estimate what their score would be with unlimited time (or, if required, what it would be for any specific time). Of course, it is not credible that the same function would suit examinees with disabilities, but we can allow examinees who require a time accommodation to have unlimited time. We can then compare their scores with the estimated asymptotic scores of the rest of the population. In this way we can make fair comparisons among all examinees, and thus there is no need to flag any scores as having been taken under nonstandard conditions.
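The scoring machinery just described can be sketched with invented group means: a piecewise-linear interpolating function through the mean score at each time allocation, with the unlimited-time group's mean serving as the asymptote:

```python
# Hypothetical mean scores by time allocation (percent of standard time).
time_pct   = [100, 125, 150, 200, 300]
mean_score = [500, 515, 524, 533, 538]
asymptote  = 540   # mean score of the unlimited-time group (invented)

def estimated_score(pct):
    """Piecewise-linear interpolation of mean score at a given time limit."""
    if pct > time_pct[-1]:
        return asymptote   # beyond the longest timed condition
    for (x0, y0), (x1, y1) in zip(zip(time_pct, mean_score),
                                  zip(time_pct[1:], mean_score[1:])):
        if x0 <= pct <= x1:
            return y0 + (y1 - y0) * (pct - x0) / (x1 - x0)
    raise ValueError("time below the standard allocation")

print(estimated_score(175))   # between the 150% and 200% conditions
```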
Some practical questions may need to be resolved, but this approach provides the bones of an empirical solution to a very difficult problem.
Note the importance of the experimental design. Had we instead used an observational approach and merely kept track of how long each examinee took, and what their score was, we would likely find that those who finished most quickly had the highest scores and those who took the most time had the lowest. We would then reach the (unlikely) conclusion that to boost examinees' scores we should give them less time. Reasoning like this led researcher Roy Freedle in 2003 to conclude that we could reduce the race differences in standardized tests by making them more difficult; an approach that was subsequently dubbed Freedle's Folly to commemorate the error.
An experiment in which examinees were randomly assigned to different time allocations was performed on an October 2000 administration of the SAT.
e practical limitations involved in separately timing a spe
cic section of the SAT led to a clever alternative. Instead, a section of
the verbal portion of the test (which is not part of the examinee’s score),
which had twenty-ve items on it, was modied so that some examin
ees had two more items added in randomly, others had ve more, and
a fourth group had ten more items. But only the core twenty-ve were
Chapter 7 in Wainer; Bridgeman, Trapani, and Curley.
scored. Obviously, those with more items had less time per item. A parallel manipulation was done on the math portion of the exam. What they discovered was revealing and important. On the math portion of the SAT the more time allocated, the higher the score, but not uniformly across the score distribution (see the figure). For those who had an SAT-M score of 300, extra time was of no benefit. For those with a score of 700, 50 percent more time yielded a gain of almost forty points. The inference was that examinees of higher ability could work out answers to hard questions, if they had enough time, whereas examinees of lower ability were stumped, and it didn't matter how much time they were given.
The results from the verbal portion of the exam (see the figure) tell a different, but equally important, story. It turned out that extra time, over the time ordinarily allocated, had almost no effect. This means that extra time can be allocated freely without concern that it will confer an unfair advantage.
Expected gain scores on the various experimental math sections over what was expected from the standard-length section. Results are shown conditional on the score from the appropriate operational SAT-M form.
and, more quantitatively, how much of an effect the speededness has on performance.
Problem 2: Unplanned Interruptions in Testing
To be comparable, test scores must be obtained under identical conditions. When conditions are not identical, we should try to measure how much change in scores the differences produce. Sometimes, during the course of a test, events occur that interrupt testing. Perhaps
Allocating extra time on the SAT-V does not seem to have any consistent effect.
A speeded test is one where a sizable proportion of the examinees do not have enough time to answer 95 percent of the items.
by Brian Clauser and his colleagues on a medical licensing exam.
one school had a fire scare, or a storm and the electricity went out, or an examinee got sick and disrupted the exam temporarily. When this happened, remedies were improvised over the years, and perhaps they sufficed. But now, with the widespread use of computerized test administrations, the likelihood of interruptions has increased. This increase has at least two reasons: (1) computers are apt to go wrong, and (2) computerized tests are usually administered continuously rather than on a few fixed dates, creating a vastly increased time span during which the tests are being given.
What do we do when this happens? Typically, attempts to measure the effect of the interruption are made after the fact. Such studies typically compare the average score before the interruption with that afterward. If the scores are the same, the testing organization heaves a sigh of relief and infers that the interruption was of no consequence and did not affect the scores' validity.
Of course, if the difficulty of the test is not the same in these two parts, that has to be adjusted for. Similarly, if there are fatigue or speededness effects, they too must be accounted for. All of these adjustments require assumptions, and all entail some error. Is such an approach good enough? It is in the testing company's best interest, when there is an interruption, to look for an effect and not find one, that is, to accept the null hypothesis. To do this is easy: simply do a poor study with a small sample size, large errors, and an insensitive design. That is why it is imperative to use the largest samples and the most powerful and sensitive designs possible. Then, if no significant effects are found, it adds credibility to the desired conclusion.
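The sample-size point can be made concrete with a textbook two-sample power calculation (a normal-approximation sketch; the effect sizes and score standard deviation below are illustrative, not from the text):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sample z-test to detect a mean
    difference delta when scores have standard deviation sigma."""
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

# A subtle interruption effect demands thousands of examinees per group;
# an underpowered study with a few dozen will "find nothing" almost surely.
subtle = n_per_group(delta=10, sigma=100)  # 10-point shift on a 100-SD scale
gross = n_per_group(delta=50, sigma=100)   # gross effect, easy to detect
```

The asymmetry is the point: a study too small to detect a plausible effect lends no credibility to a null result.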
This sort of opportunistic before-and-after analysis has too many weaknesses to be relied upon. A second approach, a definite improvement, compares the difference in the before
Davis 2013; Solochek 2011; and Moore 2010.
Bynum, Hoffman, and Swain; Hill; Mee, Clauser, and Harik.
and after scores for those who have suffered an interruption (the treatment group) with those who have not – a comparison group. Because the two groups were not created by random assignment (although the interruption was unpredicted), the comparison group must be matched to the treatment group on the basis of as many suitable “covariates” as have been recorded.
Such careful observational studies of the effect of interruptions are a definite improvement over the raw change score studies, but they are too rare. In 2014, distinguished California researcher Sandip Sinharay laid out a program for such studies to allow more credible estimates of the effects of interruptions. He also noted the uncertain availability of appropriate matching groups.
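The matched-comparison idea can be sketched with a toy nearest-neighbor match on a single recorded covariate. Everything here is simulated; real studies would match on many covariates at once:

```python
import random

random.seed(1)

# Simulated ability proxies: the interrupted group happens to be somewhat
# abler than the uninterrupted pool, so a raw comparison would be biased.
interrupted = [random.gauss(0.3, 1.0) for _ in range(200)]
pool = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Greedy nearest-neighbor matching without replacement on the covariate.
matched = []
available = pool.copy()
for a in interrupted:
    best = min(available, key=lambda b: abs(b - a))
    matched.append(best)
    available.remove(best)

def mean(xs):
    return sum(xs) / len(xs)

# After matching, the comparison group mirrors the treatment group on the
# recorded covariate; unrecorded covariates remain a threat.
balance_gap = abs(mean(matched) - mean(interrupted))
```

Matching can only balance what was recorded, which is exactly the limitation the chapter goes on to discuss.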
There are limits to the accuracy of any reactive studies of the causal effect of an interruption. Any such study that tries to estimate causal effects suffers from the very real possibility of some missing third variable either being the actual cause of the differences observed or of mitigating a true effect by imparting another effect in the opposite direction. The most credible estimates of the size of the causal effect of an interruption come from a true experiment with random assignment of individuals to treatment groups. Running such experiments means that we must be proactive. One such study design would divide up examinees into groups defined by when and for how long interruptions occur. This approach would allow us to estimate the effect of the interruption directly.
Problem 3: Measuring the Effect of Feedback to Judges
Standard setting is a crucial step in testing. For licensing tests it establishes the cut-score for passing; for many educational tests it establishes boundaries for the various categories of performance. Its use on tests such as the National Assessment of Educational Progress has led to the canonization (reification?) of the four category labels Advanced, Proficient, Basic, and Below Basic.
Now almost all K–12 state tests have these same categories, which are then used as if they had real meaning. Children are promoted (or not) on the basis of which category they are placed in; educational programs and the careers of teachers and other school officials hang in the balance. In the United States, these boundaries must be decided upon by the judgment of experts using one or more of several kinds of tasks.
Many of the most consequential judgments are made during the standard-setting process. One way that this takes place is that after the judges arrive at a consensus on where the pass-fail line should be drawn (or, on the boundaries for multiple performance categories), they are told what would have been the consequences of such boundaries had they been used in a previous test administration. They might be told how, using their standards, only 20 percent of the examinees would pass, only 12 percent of minority examinees would pass, and so forth. The judges are then encouraged to reconvene and reconsider the boundaries they had agreed upon. Usually the committee of experts, using their judgment in combination with the real-world information given them by those running the standard setting, revises its initial decision.
For example, suppose the standards committee goes through one of the accepted procedures and arrives at a cut-score. They then find out that, had that cut-score been in use the previous year, only 30 percent of examinees would have passed instead of the more usual 70 percent. They then reconsider and decide that perhaps they had been too severe, and, with the wisdom borne of further discussion, revise the cut-score and find, with only a modest change, that 69.4 percent would now have passed. They then agree that the new cut-score is the correct one, and they go home, assured of a job well done.
Anyone of a suspicious or a cynical nature would immediately ask: What is the causal effect of the
I suspect that Below Basic was adopted as a term to avoid using a more pejorative label, but perhaps I am wrong.
See Cizek and Bunch or Zieky, Perie, and Livingston.
feedback? Does it improve judgments through its infusion of reality? Or do the experts slavishly follow wherever the feedback leads them? If the latter, we might as well use the feedback's implied cut-score first as last. Or, alternatively, do away with the feedback and accept the more error-laden, but untainted, judgments of the unaided experts.
How can we tell? If we just follow protocol, all we can tell is how much good information affects judgments. We don't know how much bad information would affect them. Can the expert ratings be manipulated any way we want? The best, and perhaps the only, way to find out is to separate the two parts of the “treatment”: the feedback and the accuracy of that feedback. This calls for an experimental design in which some judges receive accurate feedback and others receive wildly inaccurate information. The goal is to find out how much the judges' ratings can be manipulated.
Happily, in 2009 Brian Clauser and his colleagues at the National Board of Medical Examiners published one such study. They found that the judges' ratings followed the feedback they were given, which tells us how much we can believe what comes out of it.
Problem 4: What Is the Value of “Teaching to the Test”?
Considerable discussion has focused on the increased consequences of students' test scores for both students and school faculty. Specifically, it is feared that these high stakes lead teachers and students to focus undue attention on test preparation. It would be worthwhile to know how much advantage “teaching to the test” conveys. An old adage says that one should not learn “the tricks of the trade” but instead should learn the trade. Suppose that some teachers follow this advice, so that their instruction focuses primarily on the subject and deals only minimally with test preparation. An experiment that explores this would be worthwhile. If such an experiment shows that extended test preparation confers no advantage, or even a disadvantage, it would suggest that teachers teach the subject and let the test scores take care of themselves. Or, if it shows that such excessive preparation does indeed help, it would suggest restructuring the exams so that this is no longer true. One suitable experiment could have two conditions: only minimal test preparation and a heavy focus on such preparation. Students and teachers would be randomly assigned to each condition (with careful supervision to ensure that the treatment assigned was actually carried out). Then the exam scores would be compared.
I don't know of any experiments that have ever been carried out like this, although enormous numbers of observational studies have examined the value of coaching courses for major examinations. Typically such courses focus strictly on the test. The most studied are coaching courses for the SAT, and the overwhelming finding of independent researchers (not studies by coaching schools, which tend to be self-serving and use idiosyncratically selected samples) is that the effects of coaching are very modest. In 1983 Harvard's Rebecca DerSimonian and Nan Laird, in a meta-analysis of such studies, found about a twenty-point gain on a 1,200-point scale, replicating and reinforcing earlier results. More recently, Colorado's Derek Briggs reached much the same conclusion, and studies on coaching for the LSAT and the USMLE show even smaller effects. Such results, though strongly suggestive, have apparently not been confirmed with a carefully designed experiment that included such ancillary factors as student ability.
Discussion and Conclusions
If you think doing it right is expensive, try doing it wrong.
In this chapter I have described four very different, but important, research questions in educational measurement. In all four, approximate
With apologies to Derek Bok, whose oft-quoted comment on the cost of education is “If you think education is expensive, try ignorance.”
answers, of varying credibility, can be obtained through observational studies. However, the data used in observational studies are found lying around; because we did not control how they were generated, we can never be entirely sure of their meaning. So we must substitute assumptions for control. When we run a randomized experiment, in which we are in control of both what is the treatment and who receives it, the randomization process provides a credible substitute for those assumptions.
Why is this? In all the situations I have discussed, the goal was always to estimate what would have happened under a different condition; in all cases we are interested in measuring the size of the causal effect. That effect is the difference between what happened when the experimental group received the treatment and the counterfactual event of what would have happened if that same group had received the control condition. But we do not know what would have happened had the experimental group received the control condition – that is why we call it a counterfactual. We can know what happened to the control group when they got the control condition. We can only substitute the outcome obtained from the control group for the counterfactual associated with the experimental group if we have credible evidence that there are no systematic differences between the experimental group and the control group. If the people in either group were selected by a process we do not understand (e.g., Why did someone elect to finish the exam quickly? Why was one exam interrupted and another not?), we have no evidence to allow us to believe that the two groups are the same on all other conditions except the treatment. Randomization largely removes all these concerns.
Because of the randomization the average outcome of the control group is equal to what the average would have been in the experimental group had they received the control. This is so because nothing is special about the units in either condition – any subject is as likely to be in one group as the other.
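A small simulation (entirely synthetic data; the effect size and score scale are invented) illustrates the point: under random assignment, the control group's average estimates the experimental group's counterfactual, so the simple difference in means recovers the true causal effect even though each subject reveals only one of their two potential outcomes.

```python
import random

random.seed(0)

TRUE_EFFECT = 15.0  # causal effect built into the simulation

# Rubin's model: every subject carries two potential outcomes,
# (y_control, y_treatment), only one of which is ever observed.
subjects = []
for _ in range(10_000):
    ability = random.gauss(0, 1)                       # hidden covariate
    y_control = 500 + 30 * ability + random.gauss(0, 20)
    subjects.append((y_control, y_control + TRUE_EFFECT))

random.shuffle(subjects)                               # random assignment
treat, control = subjects[:5_000], subjects[5_000:]

# Observe treatment outcomes in one group, control outcomes in the other.
est = (sum(y_t for _, y_t in treat) / len(treat)
       - sum(y_c for y_c, _ in control) / len(control))
# est lands near TRUE_EFFECT because randomization balances "ability".
```

Replace the shuffle with self-selection on ability and the same difference in means would be badly biased; that substitution is exactly what an observational study risks.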
A full formal discussion of Rubin's Model is well beyond the goals of this chapter; interested readers are referred to Paul Holland's justly famous 1986 paper “Statistics and Causal Inference.” As I pointed out earlier, a key idea in Holland's paper (derived from Rubin's foundational 1974 paper) is that the estimate of the average outcome seen in the control group is the same as what would have been observed in the experimental group, had they been in the control condition. The truth of this counterfactual rests on the randomization.
Without the control offered by a true randomized experiment, we lose the power of homogeneity provided by the random assignment. And so I have argued for greater use of the gold standard of causal inference – randomized, controlled experiments – instead of the easier, but more assumption-laden, observational studies. It has not escaped my attention that the tasks associated with doing true designed experiments are more difficult than merely analyzing some data that happen to be available. Sometimes we must make do with observational studies. But experience has taught us that what might have been thought impossible can be done.
For example, suppose some horrible disease is killing and crippling our children. Suppose further that researchers have developed a vaccine, and that animal research gives us high hopes of it working. One approach is an observational study in which we give the vaccine to everyone and then compare the incidence of the horrible disease with what had happened in previous years. If the incidence was lower, we could conclude that the vaccine worked – or that this was a lucky year. Obviously the evidence from such a study is weaker than if we had done a true random assignment experiment. But imagine how difficult that would be. The dependent variable is the number of children who come down with the disease – the size of the causal effect is the difference in that dependent variable between the two groups – and the consequences of denying the treatment to the control group are profound.
In my opinion, as well as those of many others, this paper ranks among the most important.
In 1954 just such an experiment was done to test the Salk vaccine against polio. It had more than five million dollars of direct costs ($43.6 million in 2014 dollars) and involved 1.8 million children. In one portion of the experiment 200,745 children were vaccinated and 201,229 received a placebo. These enormous sample sizes were required to show the effect size anticipated. There were eighty-two cases of polio in the treatment group and 162 cases in the placebo group. This difference was large enough to prove the value of the vaccine. I note in passing that there was enough year-to-year variation in the incidence of polio that if an uncontrolled experiment had been performed in 1931 the drop in incidence in 1932 would have incorrectly indicated that the treatment tried was a success.
If so massive an experiment was worth doing to get a credible answer, surely the cost of an experimental insertion of a testing delay is well within the range of tolerability.
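Using the counts quoted above, a standard two-proportion z-test (a textbook calculation, not one performed in the book) shows how decisive those enormous samples made the result:

```python
from math import sqrt
from statistics import NormalDist

# Counts reported for one portion of the 1954 Salk trial.
cases_vacc, n_vacc = 82, 200_745     # vaccinated group
cases_plac, n_plac = 162, 201_229    # placebo group

p_vacc = cases_vacc / n_vacc
p_plac = cases_plac / n_plac
p_pool = (cases_vacc + cases_plac) / (n_vacc + n_plac)

# Standard error of the difference in proportions under the null hypothesis.
se = sqrt(p_pool * (1 - p_pool) * (1 / n_vacc + 1 / n_plac))
z = (p_plac - p_vacc) / se                  # approximately standard normal
p_value = 2 * (1 - NormalDist().cdf(z))     # two-sided
```

Even though polio struck fewer than one child in a thousand, the difference of 82 versus 162 cases yields a z-statistic above five; with smaller samples the same rates would have been indistinguishable from chance.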
The Salk experiment is not an isolated incident. In 1939 an Italian surgeon named Fieschi introduced a surgical treatment for angina that involved ligation of two arteries to improve blood flow to the heart. It worked very well indeed. In 1959 Leonard Cobb operated on seventeen patients; eight had their arteries ligated and nine got incisions in their chests but no ligation, and the sham surgery worked as well as the real operation. Of course, with the advent of “informed consent” such sham surgery is much more difficult to do now, but the fact that it was done reflects on the value of randomized, controlled experiments.
These medical examples put the financial and human costs of doing true experimentation within the area of education in perspective. None of the experiments I have proposed here, or variations on them, have consequences for their participants that are as profound as those seen in their medical counterparts. In addition, we must always weigh the costs of doing the experiment against the ongoing costs of not doing it.
Causal Inferences from Observational Studies: Fracking, Injection Wells, Earthquakes, and Oklahoma
On November 11, 1854, Henry David Thoreau observed, “Some circumstantial evidence is very strong, as when you find a trout in the milk.” He was referring to an 1849 dairyman's strike in which some of the purveyors were suspected of watering down the product. Thoreau is especially relevant when we are faced with trying to estimate a causal effect, but do not have easily available the possibility of doing a suitable experiment, and so are constrained to using available data for an observational study.
Earlier we saw how the key to estimating the causal effect of some treatment was comparing what occurred under that treatment with what would have occurred without it. We showed how the structure of a randomized, controlled experiment was ideally suited for the estimation of causal effects. But such experiments are not always practical, and when that is the case we are constrained to use an observational study, with naturally occurring groups, to estimate the size of the causal effect. When we do not have randomization to balance the treatment and control groups we must rely on some sort of post hoc matching to make the equivalence of the two groups credible. Results from observational studies must rely on evidence that is circumstantial.
The balance of this chapter deals with a single remarkable example, an observational study of the connection between oil and gas exploration and earthquakes. More specifically, we will explore the consequences of the unfortunate combination of using a drilling technique called hydraulic fracturing (fracking) and the disposal of wastewater by the high-pressure injection of it back into the earth. I believe that the circumstantial evidence for a causal connection is very strong.
An oil well is considered to be exhausted when the amount of oil it yields is no longer sufficient to justify the cost of its extraction. Most of Oklahoma's wells fell into this category by the 1990s because of the immense amount of wastewater that was brought up along with the diminishing amount of oil. But in the twenty-first century the combination of dewatering technologies and the rising price of oil made many of Oklahoma's abandoned wells economically viable again. The idea was to just pull up the water with the oil – about ten barrels of water for each barrel of oil. This has yielded billions of barrels of wastewater annually, which is disposed of by using high-pressure pumps to inject it back into the earth in wastewater wells.
Fracking is the process of drilling down into the earth before a high-pressure water mixture is directed at the rock to release the gas inside. Water, sand, and chemicals are injected into the rock at high pressure that allows the gas to flow out to the head of the well. This procedure has been in use for about sixty years. However, horizontal drilling is a new wrinkle introduced by 1990 that could dramatically increase the yield of the well. Horizontal drilling is a horizontal shaft added onto the vertical one, after the vertical drilling has reached the desired depth (as deep as two miles). This combination expands the region of the well substantially. The high-pressure liquid mixture injected into the well serves several purposes: it extends the fractures in the rock, adds lubrication, and carries materials (proppants) to hold the fractures open and thus extend the life of the well. Horizontal fracking is especially useful in shale formations that are not sufficiently permeable to be economically viable in a vertical well. The liquid mixture that is used in fracking is disposed of in the same way as the wastewater from dewatering wells.
The principal concern about the use of fracking began with the volume of water required per well and the subsequent possible contamination of drinking water if the chemicals used in fracking leached into the groundwater. But it was not too long before concerns arose about the disposal of wastewater generated from fracking and dewatering, causing a substantial increase in seismic activities. Most troubling was a vast increase in earthquakes in areas unused to them. It is the concern that most of these earthquakes are manmade that is the principal focus of this chapter.
A Possible Experiment to Study the Seismic Effects of Fracking
If we had a free hand to do whatever we wished to estimate the causal effect that the injection of large quantities of wastewater has on earthquakes, all sorts of experiments suggest themselves. One might be to divide a region into matched pairs of areas, choose one member of each pair at random and institute a program of water injection (the treatment group), and leave the other undisturbed (the control group). Of course, we would have to make sure that all the areas chosen were sufficiently far from one another that the treatment does not have an effect on a member of the control group. Then we start the experiment, keep track of the
A January 2015 study in The Bulletin of the Seismological Society of America found that fracking built up subterranean pressures that repeatedly caused slippage in an existing fault as close as a half-mile beneath the wells (accessed August 27, 2015).
number of earthquakes in the region of the treatments, and keep track of the number of earthquakes in the control regions. It might take some time, but eventually we would have both a measure of the causal effect of such injections and a measure of the variability within each of the two groups.
But waiting for the results of such a study before we take action is of little solace to those people, like Prague, Oklahoma, resident Sandra Ladra, who on November 5, 2011 landed in the hospital from injuries she suffered when the chimney of her house collapsed in a 5.7 magnitude earthquake (the largest ever recorded in Oklahoma) – the same series of quakes that destroyed fifteen homes in her neighborhood as well as the spire on Benedictine Hall at St. Gregory's University, in nearby Shawnee. Subsequently, researchers analyzed the data from that quake and concluded that the quake that injured Ms. Ladra was likely due to injection of fluids associated with oil and gas exploration. The quake was felt in at least seventeen states, and the researchers reported that “the tip of the initial rupture plane is within [a short distance] of active injection wells.”
One Consequence of Not Having Good Estimates
It isn't hard to imagine the conflicting interests associated with the finding of a causal effect associated with oil and gas exploration in Oklahoma. Randy Keller, director of the Oklahoma Geological Survey, posted a position paper saying that it believes that the increase in earthquakes is the result of natural causes. In 2014, when faced with the increase of seismic activity, Mary Fallin, the governor of Oklahoma, advised Oklahomans to buy earthquake insurance. Unfortunately, many policies specifically exclude coverage for earthquakes that are induced by human activity.
An Observational Study
So we are faced with the unlikely event of doing a true, random-assignment experiment; the apparent effect of fracking and high-volume wastewater injection on seismic activity; and the urgent need to estimate what that effect is. What can we do? The answer must be an observational study. One way to design an observational study is to first consider what would be the optimal experimental design and then approximate it as closely as the available data allow.
Treatment Condition. The treatment condition is oil exploration using fracking and dewatering in which the wastewater generated is injected under pressure into disposal wells. This will be in the state of Oklahoma during the time period 2008 to the present, which is when these techniques became increasingly widespread.
Control Condition. Don't do it; no fracking and, especially, no disposal of wastewater using high-pressure injection into disposal wells. The control condition would be what existed in the state of Oklahoma for the thirty years from 1978 until 2008, and for the same time period in the state of Kansas, which abuts Oklahoma to the north. Kansas shares the same topography, climate, and geology, and, over the time period from 1973 to the present, has had far less gas and oil exploration.
Dependent Variable. The dependent variable is the number of earthquakes with magnitude of 3.0 or greater. We chose 3.0 because that is about the smallest quake reliably recorded by seismic equipment. Since Oklahoma has begun to experience increased seismic activity, the U.S. and Oklahoma Geological Surveys (USGS and OGS) have monitored it closely.
In a chart from the USGS we see Oklahoma's seismic activity summarized over the past thirty-eight years.
By the end of 2014 there had been 585 earthquakes of magnitude 3.0 or greater. If smaller earthquakes were to be included the total would be greater than five thousand! So far in 2015 there has been an average of two earthquakes of magnitude 3.0 or greater each day. Before the recent expansion of oil and gas exploration there averaged fewer than two earthquakes of magnitude 3.0 or greater each year.
This three-hundred-fold increase has not gone unnoticed by the general population. Oklahomans receive daily earthquake reports like they do weather. Oklahoma native, and New Yorker writer, Rivka Galchen reports that driving by an electronic billboard outside Oklahoma City last November she saw, in rotation, “an advertisement for one per cent cash back at the Thunderbird Casino, an advertisement for a Cash N
The frequency of 3.0+ earthquakes in Oklahoma since 1978 (from the USGS).
Gold pawnshop, a three-day weather forecast, and an announcement of a 3.0 earthquake in Noble County.” Driving by the next evening she saw that “the display was the same, except that the earthquake was a 3.4 near [another town].”
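The arithmetic behind the "three-hundred-fold" characterization, together with a rough significance check, is worth making explicit (a back-of-the-envelope sketch using the rates quoted above and a normal approximation to a Poisson count):

```python
from math import sqrt

old_rate = 2        # M3.0+ quakes per year, before the injection era
new_pace = 2 * 365  # the 2015 pace: roughly two such quakes per day

ratio = new_pace / old_rate  # a several-hundred-fold rise

# If the old rate still held, 2009-2014 should have produced roughly
# 2 quakes/year * 6 years = 12; the observed count was 585.
expected = old_rate * 6
observed = 585
# For a Poisson count the variance equals the mean, so a crude z-score is:
z = (observed - expected) / sqrt(expected)
```

The z-score lands in the hundreds, so no plausible year-to-year fluctuation in the background rate can explain the observed count.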
The geographic distribution of quakes is shown in a map from the USGS, in which the blue dots represent the eighty-nine earthquakes in the thirty-nine years prior to 2009, and the other 960 dots represent the five and a quarter years since then.
Finally, what about the control group? What was the seismic activity in Kansas over the same period? A companion map shows it, although the coding of the plotted points is different than in the Oklahoma map; what is coded for each quake is its magnitude. The four quakes shown for the period 1973 to the present were all shallow and of 3.5 to 4.0 magnitude.
The geographic distribution of 3.0+ earthquakes in Oklahoma since 1970. The clusters surround injection wells for the disposal of wastewater (from the USGS).
The inferences that can credibly be drawn from observational studies have limitations. Consider the well-established fact that among elementary school children foot size is strongly related to reading proficiency. Of course, there is a third variable, age, that generates the observed relation. Older children have bigger feet and read better. A study that does not adjust for this variable would draw the wrong inference. And there is always the possibility of such a missing third variable unless there is random assignment to the treatment and control groups. For it is through randomization that all missing third variables, known or unknown, are balanced on average.
The evidence presented here makes it clear that there is a strong positive relation between wastewater injection and earthquakes. Still, there may be some
The geographic distribution of 3.5+ earthquakes in Kansas since 1973 (from the USGS).
missing third variable that would explain the observed phenomenon, and the size of the apparent causal connection could shrink, or even disappear.
However, no one would believe that foot size has any direct causal connection with reading proficiency, because we know about reading and how children develop. Similarly, because much is known about the structure of earthquakes and the character of the rock substrata that lies beneath the state of Oklahoma, we can draw credible causal conclusions about the evidence presented here.
The inferences to be drawn from these results seem straightforward to me. I could not imagine what missing third variable might account for what we have observed. What other plausible explanation could there be for the huge increase in seismic activity? But what I believe is of small importance; the opinions of knowledgeable geologists are much more credible. What do they think?
In an interview with Rivka Galchen, William Ellsworth, a research geologist at the USGS, said, “We can say with virtual certainty that the increased seismicity in Oklahoma has to do with recent changes in the way that oil and gas are being produced. ... Scientifically, it's really quite clear.” There is a substantial chorus of other geologists who echo Ellsworth's views in the recent scientific literature.
But not everyone sees it that way. Michael Teague, Oklahoma's secretary of energy and environment, interviewed on the local NPR station, said “we need to learn more.” His perspicacity on environmental topics was illuminated when he was asked if he believed in climate change. He replied that he believed that climate changed every day.
On April 6, 2015, CBS News reporter Manuel Bojorquez interviewed Kim Hatfield of the Oklahoma Independent Petroleum Association. She said the science to prove a definitive link simply isn't there. “Coincidence is not correlation,” said Hatfield. “This area has been seismically active over eons and the fact that this is unprecedented in our experience doesn't necessarily mean it hasn't happened before.”
Her view was echoed by Jim Inhofe, Oklahoma's senior senator, who, on April 8, 2015, said that “Oklahoma is located on a fault line and has always had seismic activity. While there has been an increase in activity over the past year, the data on earthquakes that is being examined only goes back to 1978. Seismic activity is believed to have been going on for thousands of years in Oklahoma, so looking at just the last 35 years to make definitive conclusions about trends and industry connections is short sighted. ... We shouldn't jump to making rash conclusions at this point. Many credible organizations, such as the National Academies, have said there is very little risk of seismic activity from the practice of hydraulic fracturing. The question is whether wastewater, which is being regulated and comes from a number of sources other than just oil and gas extraction, is causing seismic activity. The scientists are looking at it, and before the issue becomes hyper-politicized by environmentalists, we should let them complete their work.”
The evidence I have presented here is certainly circumstantial, but compelling. Moreover, there have been a substantial number of studies published by the foremost of authorities in the most prestigious of peer-reviewed journals that support a link between fracking, and especially its associated wastewater disposal, and the onslaught of earthquakes that have besieged Oklahoma. I have not been able to find any credible reports to the contrary.
It is easy to understand why state officials would find it hard to acknowledge evidence linking their activities to negative outcomes, regardless of the credibility of that evidence. I don't know how experimental evidence would fare any better; no doubt the same objections would be raised. It is reminiscent of the April 14, 1994 congressional testimony of the CEOs of the seven largest tobacco companies, who all swore that, to the best of their knowledge, nicotine was not addictive.
The responses by those who deny that the dramatic increase in earthquakes is due to human activity are eerily parallel to those who deny that global warming has a human component. Indeed these are often the same people.
In the preface we were introduced to Senator Inhofe, who took the floor in the U.S. Senate holding a snowball he made just a few minutes earlier. He indicated that this was proof positive of the fallaciousness of global warming. Of course, Senator Inhofe has deep ties to the petroleum industry. It shouldn't be a surprise that anyone in thrall to that industry would find it difficult to be convinced of their culpability.
An argument often used in both situations (climate change and
increased seismicity) is that we have had such symptoms before; we have
had heat waves and droughts before, many worse than what we have
now, so why credit global warming? And similarly, from Senator Inhofe,
“Seismic activity is believed to have been going on for thousands of years
in Oklahoma.”
Both statements are undoubtedly true; is there an effective answer? I recall an exchange between a businessman and the physicist John Durso that seems relevant. The businessman argued that sure there were some hot days and some strong storms, but over the course of his life he could remember hotter days and stronger storms. He didn't buy this as evidence of global warming. Professor Durso replied, “You may remember a couple of years ago, when a short section of the interstate highway that goes by your town had to close down for repairs. During that period traffic was rerouted onto local roads. There was also a substantial increase in accidents during that period. While it is certainly true that you couldn't point to any one accident and say it was caused by the closure of the highway, you would be foolish to assert that the increase in accidents was not related to the closure.”
The businessman nodded in agreement. The combination of the facts and the argument convinced him. Sadly, wisdom borne of long experience tells me that he was an unusual man. The evidence I have described here […] will not be enough to sway everyone. But it is a place to start.
But not to finish; we can gather more evidence in support of the claim that the lion's share of the seismic activity in Oklahoma is manmade by keeping track (1) of what happens when other states ignore Oklahoma's experience and institute their own programs of wastewater disposal through injection wells (e.g., North Dakota and Alberta, Canada) and (2) of seismic activity should there be a substantial decline in such wastewater disposal. The latter is not likely to yield a fast answer, for there appears to be a substantial latency possible. But those are avenues for adding to the supporting evidence.
Life Follows Art
Gaming the Missing Data Algorithm
In 1969 Bowdoin College was pathbreaking when it changed its admissions policy to make college admissions tests optional. About one-third of its accepted classes took advantage of this policy and did not submit SAT scores. I followed up on Bowdoin's class of 1999 and found that the 106 students who did not submit SAT scores did substantially worse in their first-year grades at Bowdoin than did their 273 classmates who did submit SAT scores (see the accompanying figure). Would their SAT scores, had they been available to Bowdoin's admissions office, have predicted their diminished academic performance?
As it turned out, all of those students who did not submit SAT scores actually took the test but decided not to submit them to Bowdoin. Why? There are many plausible reasons, but one of the most likely is that they did not think that their test scores were high enough to help their cause. Under most circumstances, this speculative answer is not the beginning of an investigation, but its end. The SAT scores of students who did not submit them have to be treated as missing data – at least by Bowdoin's admissions office, but not by me. Through a special data-gathering effort at the Educational Testing Service, their scores were recovered. The 273 students who submitted SAT scores averaged 1323 (the sum of their verbal and quantitative scores); the 106 who didn't submit them averaged only 1201 – more than a standard deviation lower!

[Figure: Normal approximations to the distributions of first-year grade point averages for students who did and did not submit SAT scores, and for all students.]

As it turned out, had the admissions office had access to these scores they could have predicted the lower collegiate performance of these students (see the accompanying figures).
Why would a college opt for ignorance of useful information? Again there is a long list of possible reasons, and your speculations are at least as valid as mine, so I will focus on just one: the consequences of treating missing data as missing at random (that means that the average missing score is equal to the average score that was reported, or that those who did not report their SAT scores did just as well as those who did). The average SAT score for Bowdoin's class of 1999 was observed to be 1323, but the true average, including all members of the class, was 1288. An average score of 1323 places Bowdoin comfortably ahead of such fine institutions as Carnegie Mellon, Barnard, and Georgia Tech, whereas 1288 drops Bowdoin below them. The influential US News and World Report college rankings use average SAT score as an important component. But those rankings use the reported scores as the average, essentially assuming that the missing scores were missing at random. Thus, by making the SAT optional, a school could game the rankings and thereby improve its position.
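The consequence of treating the missing scores as missing at random is simple arithmetic. A minimal sketch, using only the subgroup counts and means quoted above (the variable names are mine):

```python
# Bowdoin class of 1999, figures quoted in the text.
n_submit, mean_submit = 273, 1323      # students who submitted SAT scores
n_withheld, mean_withheld = 106, 1201  # students who withheld them

# What a ranking sees if it assumes the missing scores are
# "missing at random": just the reported average.
reported_average = mean_submit

# The true class average weights each subgroup by its size.
true_average = (n_submit * mean_submit + n_withheld * mean_withheld) / (
    n_submit + n_withheld
)

print(reported_average, round(true_average))
```

The weighted mean lands near the text's 1288 (the small discrepancy is rounding in the published subgroup means), some 34 points below the reported 1323 – the whole gain from making the SAT "optional."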
Of course, Bowdoin's decision to adopt a policy of "SAT Optional" predates the US News and World Report rankings, so gaming them was certainly not their motivation. But that cannot be said for all other schools that have adopted such a policy in the interim. Or so I thought. I had congratulated myself on uncovering a subtle way that colleges were manipulating rankings – silly, pompous me. I suspect that one should never assume subtle, modest manipulations if obvious large changes are so easy. US News and World Report gets its data directly from the schools, thus allowing the schools to report anything they damn well please.
[Figure: A normal approximation to the distributions of SAT scores among all members of the Bowdoin class of 1999.]

In 2013 it was reported that six prestigious institutions admitted sending falsified data to US News and World Report (and also to the U.S. Department of Education and their own accrediting agencies).
Claremont McKenna College simply sent in inflated SAT scores; Bucknell admitted they had been boosting their scores by sixteen points for years; Tulane upped theirs by thirty-five points; and Emory used the mean scores of all the students that were admitted, which included students who opted to go elsewhere – they also inflated class ranks! And there are lots more examples.
In a chapter of my 2011 book Uneducated Guesses, I discussed the use of value-added models for the evaluation of teachers. As part of this discussion I described the treatment of missing data that was then in use, in which the change from pretest scores at the beginning of the school year to posttest scores at the end was allocated among the school, the student, and the teacher. The average change associated with each teacher was that teacher's "value-added." There were consequences for teachers with low value-added scores and different ones for high-scoring teachers. There were also consequences for school administrators based on their school's component of the total value-added amount.
There are fundamentally two approaches taken in dealing with the inevitable missing data. One is to deal only with students who have complete data, implicitly treating the missing students as if they were just like the observed students (missing at random). A more sophisticated approach that is sometimes used is to impute the missing values based on the scores of the students who had scores, perhaps matched on other information that was available. Both approaches rest on what Harvard's Don Rubin characterized as "heroic assumptions."
To make the problems of such a missing data strategy more vivid, I suggested (tongue firmly in cheek) that were I a principal in a school being evaluated, I would take advantage of the imputation scheme by keeping my likely low-scoring students home on the day of the posttest. The missing groups would have scores imputed for them based on the average of the scores of those who were there. Such a scheme would boost the change scores, and the amount of the increase would be greatest for schools with the most diverse populations. Surely a tempting prospect.
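To see why the scheme works, here is a small simulation. The scores below are invented for illustration only, chosen so that weaker students show smaller gains; it compares the honest average gain with the gain after the weakest students are kept home and their posttest scores are imputed from the attendees' mean:

```python
# Hypothetical pre/post test scores for ten students, ordered from
# weakest to strongest; gains grow with ability.
pretest  = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
posttest = [42, 46, 52, 58, 66, 72, 78, 84, 90, 96]

def average_gain_with_absences(n_kept_home):
    """Mean gain when the n weakest students are absent on posttest day
    and their posttest scores are imputed as the attendees' average."""
    present = posttest[n_kept_home:]
    imputed_mean = sum(present) / len(present)
    completed = [imputed_mean] * n_kept_home + present
    return sum(post - pre for post, pre in zip(completed, pretest)) / len(pretest)

honest = average_gain_with_absences(0)
gamed = average_gain_with_absences(3)   # keep the three weakest home
print(honest, gamed)
```

With these numbers the honest average gain is 5.9 points, but keeping the three weakest students home inflates it to over 15: the absentees inherit the attendees' high average while keeping their own low pretests.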
Whenever I gave a talk about value-added and mentioned this scheme to game the school evaluations it always generated guffaws from most of the audience (although there were always a few who busied themselves taking careful notes). I usually appended the obiter dictum that if I could think of this scheme, the school administrators in the field, whose career advancement was riding on the results, would surely be even more inventive. Sadly, I was prescient.
On October 13, 2012, Manny Fernandez reported in the New York Times that former El Paso schools superintendent Lorenzo Garcia was sentenced to prison for his role in orchestrating a testing scandal. The Texas Assessment of Knowledge and Skills (TAKS) is a state-mandated test for high school sophomores. The TAKS missing data algorithm was to treat missing data as missing at random, and hence the score for an absent student was, in effect, the average of those who took the test. Such a methodology is so easy to game that it was clearly a disaster waiting to happen. And it did. The missing data algorithm used by Texas was obviously understood by school administrators, for a key aspect of their scheme was to keep potentially low-scoring students out of the classroom so they would not take the test and possibly drag scores down. Students identified as likely low performing "were transferred to charter schools, discouraged from enrolling in school or were visited at home by truant officers and told not to go to school on test day."
Some students had their transcripts or grades changed from passing to failing so they could be reclassified; others, who were intentionally held back, were allowed to catch up before graduation with "turbo-mesters," in which a student could acquire the necessary credits for graduation in a few hours in front of a computer.
Superintendent Garcia boasted of his special success at Bowie High School, calling his program "the Bowie Model." The school and its administrators earned praise and bonuses in 2008 for its high rating. Parents came to speak of "los desaparecidos" (the disappeared). It received this name because in the fall of 2007, 381 students were enrolled in Bowie as freshmen; however, the following fall the sophomore class was far smaller.
It is an ill wind indeed that doesn't blow some good. Although these two examples employed the shortcomings of the missing data schemes then in use to game the system, they also tell us two important things:

Dealing with missing data is a crucial part of any practical situation, and doing it poorly is not likely to end well; and

opponents of modern missing data methods, such as the method of multiple imputation pioneered by Rod Little and Don Rubin, cannot claim that they are too complicated for ordinary people to understand. The unfolding of events has shown conclusively their general comprehensibility. But this is not likely to be enough. To be effective we almost surely need to use some sort of serious punishment for those who are caught.
This conclusion was sent to me in an e-mail from Don Rubin on December 12, 2013. He wrote, "Amazing how simple-minded these manipulators are. The only way to address these […]"
Communicating Like a Data Scientist
[…] looking for. Tukey was giving voice to what all data scientists now accept as gospel – statistical graphs are powerful tools for the discovery of quantitative phenomena, for their communication, and even for the efficient storage of information. The graphical display of data is a relatively modern invention. Its origins are not shrouded in history like the invention of the wheel or of fire. There was no reason for the invention of graphs until empirical evidence became an accepted part of scientific epistemology. Thus it isn't surprising that graphs only began to appear during the eighteenth-century Enlightenment, after the writings of the British empiricists John Locke (1632–1704), George Berkeley (1685–1753), and David Hume (1711–76) popularized and justified empiricism as a way of knowing things.
Graphical display did not emerge from the preempirical murk in bits and pieces. Once the epistemological ground was prepared, its birth was more like Botticelli's Venus – arising fully adult. The year 1786 is the birthday of modern statistical graphics, devised by the Scottish iconoclast William Playfair (1759–1823), who invented what was an almost entirely new way to communicate quantitative phenomena. Playfair showed England's imports and exports over time with line charts; the extent of Turkey that lay in each of three continents with the first pie chart; and the characteristics of Scotland's trade in a single year with a bar chart. Thus in a single remarkable volume, an atlas that contained not a single map, he provided spectacular versions of three of the four most important graphical forms. His work was celebrated and has subsequently been seized upon by data scientists as a crucial tool to communicate empirical findings to one another and, indeed, even to oneself.

"Invented" may be a trifle too strong because there were scattered examples prepared previously, most notably in the study of weather, but they tended to be ad hoc primitive affairs, whereas Playfair's depictions were fully developed and, even by modern standards, beautiful and well designed.
In this section we celebrate graphical display as a tool to communicate quantitative evidence by telling four stories. In the first we consider effective communications of any sort, although there is a modest tilting toward visual communications. The primary example shows how communicating the results of tests for mutated genes that increase the likelihood of cancer requires both empathy and care. Next we examine the influence that graphs designed by scientists have had on the media, and vice versa, commenting specifically on how media designers seem, at least at this moment in history, to be ahead.
The history of data display has provided innumerable examples of clear depictions of two-dimensional data arrayed over a two-dimensional surface (a map depicting the locally two-dimensional surface of the Earth is the earliest and best example). The design challenge is how to depict more than two dimensions on that same flat surface. Shading a map to depict population is a popular approach to show a third dimension on top of the two geographic ones. But what about four dimensions? Or five? Or more? Evocative solutions to this challenge are justly celebrated (the most celebrated is Charles Joseph Minard's 1869 six-dimensional depiction of Napoleon's catastrophic Russian campaign). We introduce the inside-out plot as a way of exploring very high-dimensional data, and then we see how shading was laid on top of the two geographic variables to illustrate the distribution of variables like crime, ignorance, bastardy, and improvident marriages across England. By juxtaposing these "moral maps" he tried to elicit causal conclusions (e.g., areas of high ignorance were also areas of high crime) and so suggest how alleviating one might have a positive effect on the other (e.g., increased funding of education would reduce crime). After that we turn to a more modern graphic form and show how qualitative arguments can be made quantitatively using scatterplots.
On the Crucial Role of Empathy
in the Design of Communications
Good information design is clear thinking made visible, while bad design
is stupidity in action.
Edward Tufte
An effective memo should have, at most, one point.
Paul Holland
The effectiveness of any sort of communication depends strongly on the extent to which the person preparing the communication is able to empathize with its recipients. To maximize effectiveness it is absolutely […] there are those rare communications that do it well. More than a dozen years […] dean of admission, recognized that the recipient is principally interested […] parallel version (imagined in […]) that might have been sent to the remaining 92 percent. […] The parallel version I ginned up was never in the cards. When I asked about […]: "We can say 'Yes' quickly; 'No' takes a lot longer."
Admitting a student to a university or offering a job are but two of the many important situations in which information is conveyed to a candidate. Crafting such messages well requires empathy, but these circumstances, as important as they are, pale in comparison with other kinds of communications. One of these arose on May 14, 2013, when Angelina Jolie wrote an OpEd essay in the New York Times describing her decision to have a double mastectomy.

The trail that ended with this drastic decision began many years earlier, when her forty-six-year-old mother contracted breast cancer, which she succumbed to ten years later. Ms. Jolie feared a familial component and so was tested for a mutation that would substantially increase the likelihood of her suffering the same fate as her mother.
Ordinarily, one in eight women will contract breast cancer during their lives, but this 12 percent chance is increased six- to sevenfold if a woman carries a mutated form of the 187delAG BRCA1, 5385insC BRCA1, or 617delT BRCA2 genes. Such mutations are, thankfully, very rare, but among Ashkenazi Jewish women the likelihood of such a mutation rises to about 2.6 to 2.8 percent. In addition, there are other risk factors. […] Ms. Jolie learned that she carried an unfortunate mutation, and, on that basis, decided to have a prophylactic double mastectomy.
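The risk figures quoted above combine by simple multiplication; a minimal sketch (the variable names are mine):

```python
# Figures quoted in the text.
baseline_risk = 1 / 8      # ordinary lifetime breast-cancer risk (~12 percent)
risk_multiplier = (6, 7)   # six- to sevenfold increase for a mutation carrier

low, high = (baseline_risk * m for m in risk_multiplier)
print(low, high)
```

The product puts a carrier's lifetime risk in the range of roughly 75 to 87.5 percent, which is what makes the test result so consequential.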
How do you communicate the results of such a test? If ever a situation required empathy, this is it. This situation differs from admission to a university in that the vast majority of those tested receive happy news. It is the remaining small part of the population that requires special care.

Although it contains a great deal of verbiage, the report resembles the rejection letter in that it buries the summary message of principal interest to the recipient. I suspect that it was drafted by committee over a long period of time. Inevitably, such collaborations generate a kind of entropy that keeps adding more while subtracting nothing.

[…] the recipient wants to know. One display that responds to this enlarges the principal message, leaving the details to be read later, should the recipient ever care to do so. The positive report reaches only a small fraction of the people tested, but the news it carries portends a nightmare. Dean LeMenager's wise advice looms large. The good news we can say fast; the bad news takes longer.

Surprisingly, in form it is identical to the one that carries the happy news of no mutations. Only the content varies. Is this the best we can do?
Before considering changes, if any, it is crucial that we place this report in its proper context. The patient does not receive the report alone; a genetic counselor is present to explain what everything means and what the various options available are. Usually […] the counselor. In the grand scheme of things, such extra help is neither […].

[Figure: Notice of negative finding of gene mutation from Myriad Laboratories.]
Remembering that the core of effective communication is empathy, consider the process from the patient's point of view. The patient is given an appointment a fortnight or so in the future and asked to come in then for the results. At the appointed hour they arrive and sit nervously in a large waiting room. Often a loved one accompanies them, as they await, in terror, what might be the potential outcome. After what must seem an eternity, the counselor opens the folder and delivers the news.

The vast majority of the time the test reveals no mutation, and though, rarely, the test has missed a mutation, the mood is celebratory, and the associated caveats fade into the background. After a short time, perhaps only after leaving the clinic, the thought occurs: "Why did I have to come in? Why couldn't someone call as soon as the results were available and tell me 'everything's OK'?"

[Figure: Suggested revision of the notice of negative finding of gene mutation that emphasizes the principal message.]
[…] the oncological discussions are, quite literally, a matter of life and death.

[Figure: Notice of positive finding of gene mutation from Myriad Laboratories.]
Were the laboratory to send an early communication that says "All Clear," those who hear nothing, and hence must come into the clinic for their appointment, can infer the bad news, and hence know it in advance of having the support of a counselor.

The issue is clear. Is the reduced angst yielded by providing early news to the 97 percent who do not have a mutation overbalanced by the lack of immediate counseling for those with a positive report? This is not an easy calculation. Perhaps it is helpful to remember that when the news is bad, there are only two kinds of reports. One kind causes the recipient great sadness and terror, and the other does the same thing a little worse. One has no good way to convey this sort of information; one has only a bad way and a worse way. I am reminded of what the effect must have been during World War II on a family that received a telegram from the War Department. Even without opening it, they knew that their worst nightmare had been realized. How much difference would it have made had the telegram been replaced by a skilled clinical psychologist who walked up the front steps to deliver the news in person?

It seems a cogent argument could be made for a policy that schedules all patients for an appointment when results become available, but for those with a negative outcome a phone call could deliver the happy news and cancel the appointment. The balance would come in as scheduled. It surely merits further discussion.
A revised report, with the verbiage in a very subsidiary role, could be of value. It seems worthwhile to try such a change in format, as well as changing the time and mode of delivery. It is also abundantly clear that, for those who receive the news of a relevant mutant gene, no format change would make any material difference.

The story told in this chapter provides a dramatic example of the thought processes that ought to accompany all communications. Data are always gathered within a context, and to understand and communicate those data without seriously considering that context is to court disaster. For this reason it is always a mistake to "staff out" the preparation of data displays. The mildest error that could emerge from such a practice is that you could miss what you might have found; more serious errors can easily be imagined; for example, consider […]
Improving Data Displays
The Media's and Ours
More than thirty years ago I wrote an article with the ironic title "How to Display Data Badly." In it I chose a dozen or so examples of flawed displays and suggested some paths toward improvement. Two major newspapers, the New York Times and the Washington Post, were the source of most of my examples. Those examples were drawn over a remarkably short period of time. It wasn't hard to find examples of bad graphs.

Happily, in the intervening years those papers have become increasingly aware of the canons of good practice and have improved their data displays profoundly. Indeed, when one considers both the complexity of the data that are often displayed as well as the short time intervals permitted for their preparation, the results are often remarkable.
Eight years ago, over the course of about a fortnight, I picked out a few graphs from the New York Times that were especially notable. Over the same period I noticed decidedly inferior graphs in the scientific literature for data that had the same features. At first I thought that it felt more comfortable in the "good old days" when we scientists did it right and the media's results were flawed. But the old days were not actually so good. Graphical practices in scientific journals have not evolved as fast as those of the mass media. This year I redid the same investigation and reached the same conclusions. It is time we in the scientific community learned from the media's example.

This is the third incarnation of this essay. In 2007 it appeared as an article in Chance, a magazine aimed at statisticians; a revised version was published as chapter 11 in my 2009 book Picturing the Uncertain World. In 2015, as I was preparing this book, I decided that the essay's message was even more relevant today than it was when it was originally prepared and so opted to include the revised version you see here. I am grateful to […]
The U.S. federal government is fond of producing pie charts, and so it was no surprise to find a pie chart of the sources of government receipts. Of course, the grapher felt it necessary to "enliven" the presentation by adding a specious extra dimension, and to pull forward the segment representing corporate taxes, which has the unfortunate perceptual consequence of making that segment look larger than it is.

[Figure: Pie charts of the sources of federal government receipts (individual income tax, corporate income tax, excise tax, and others) for FY 2000 and FY 2007.]
Placing the two pies side by side should allow the viewer to see the changes that have taken place over that time period (roughly the span of the Bush administration). The only change I was able to discern was shrinkage in the contribution of individual income taxes. I replotted the data in a format that provides a clearer view and immediately saw how the decrease in individual income taxes was made up; specifically, increasing social security taxes, whose effect ends after the first hundred thousand dollars of earned income, paid for the cost of tax cuts aimed principally at the wealthy.
[Figure: A restructuring of the same data (percent of federal government receipts by source, FY 2000 and FY 2007) that makes the changes clearer.]

My expectation of data displays constructed for broad consumption was not high. Hence, when I was told of a graph in an article in the New York Times Sunday Magazine about the topics that American clergy choose to speak out on, I anticipated the worst. My imagination created, floating before my eyes, a pie chart displaying such data, complete with specious extra dimensions and the categories ordered by size.
[Figure: A typical pie chart representation of the relative popularity of various topics among the U.S. clergy ("Percentage of American churchgoers whose clergy members speak out on:").]

Instead, I found a pleasant surprise. The graph (produced by an organization named "Catalogtree") was a variant of a pie chart in which the segments all subtend the same central angle, but their radii are proportional to the amount being displayed. This is a striking improvement on the usual pie because it facilitates comparisons among pies: corresponding segments always sit in consistent places. The results from each segment are always in the same place, whereas with pie charts the locations of segments may vary as the data change. Compare it with the pie in the imagined figure, where one segment leapt out for no apparent reason, except possibly to mislead. In the Catalogtree display the extended segment that represents the topics of hunger and poverty is eye-catching for a good reason – it represents the single topic that dominates all others discussed in church.
This plot also indicates enough historical consciousness to evoke Florence Nightingale's famous Rose of the Crimean War […]. The original Nightingale Rose dramatically showed the far greater death toll caused by lack of sanitation than battle wounds and was hence very effective in helping her reform the military ministry's battlefield medical practices.

[Figure: A display from the February 18, 2007 New York Times Sunday Magazine (page 11) showing, as a Nightingale Rose, the percentage of American churchgoers whose clergy speak out on various topics (including views regarding intelligent design); the data come from an August 2006 survey, by the Pew Research Center for the People and the Press and the Forum on Religion and Public Life, of Americans who attend religious services at least monthly.]

[Figure: A redrafting of Florence Nightingale's famous "coxcomb" display (what has since become known as a Nightingale Rose) showing the variation in mortality over the months of the year (chapter 11 in Wainer).]
Sadly, this elegant display contains one small flaw that distorts our perceptions. The length of the radius of each segment is proportional to the percentage depicted; but the area of the segment, not its radius, influences us. Thus, the radii need to be proportional to the square root of the percentage for the areas to be perceived correctly. An alternative configuration with this characteristic is shown in the figure that follows.
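The geometry behind the correction can be checked directly: an equal-angle sector has area (θ/2)r², so plotting the raw value as the radius makes the area grow as the square of the value. A minimal sketch with hypothetical percentages:

```python
import math

# Hypothetical percentages for four equal-angle rose segments.
values = [10, 20, 30, 40]
theta = 2 * math.pi / len(values)      # central angle of each segment

def sector_area(radius):
    # Area of a circular sector with central angle theta.
    return 0.5 * theta * radius ** 2

# Naive rose: radius proportional to the value, so area grows as value squared.
naive_areas = [sector_area(v) for v in values]

# Corrected rose: radius proportional to sqrt(value), so area tracks the value.
corrected_areas = [sector_area(math.sqrt(v)) for v in values]

print(naive_areas[-1] / naive_areas[0],
      corrected_areas[-1] / corrected_areas[0])
```

In the naive rose the 40 percent segment covers sixteen times the area of the 10 percent segment; with square-root radii the ratio is the correct four.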
Journalists might complain that the staid nature of the corrected version does not make the visual point about hunger and poverty being the big story the way the New York Times rose does. No it doesn't. And that is exactly my point. Although hunger and poverty are, in combination, the most common topic, the topic does not dominate the way our perceptions of the original suggest. The goal of good display is full and accurate communication, and so a display that yields misperceptions is a failed display.

[Figure: The same data (stem-cell research, the death penalty, gay marriage, hunger and poverty; percent of American churchgoers) recast as a rose whose areas are proportional to the percentages.]
In 1973 Jacques Bertin, the acknowledged master theorist of modern graphics, explained that when one produces a graph, it is best to label each of the elements in the graph directly. He proposed this as the preferred alternative to appending some sort of legend that defines each element. His point was that when the two are connected, you could comprehend the graph in a single moment of perception, as opposed to having to first look at the lines, then read the legend, and then match the legend to the lines.

This advice is too rarely followed. For example, Michigan State's Mark Reckase, in a simple plot of two lines, chose not to label the lines directly – even though there was plenty of room to do so – and instead chose to put in a legend. And the legend reverses the order of the lines, so the top line in the graph becomes the bottom line in the legend, thus increasing the opportunity for reader error.
[…] from Pfeffermann and Tiller comes a valiant effort to do so. Here the legend is hidden in the figure caption, and again its order does not match the order of the lines in the graph. Moreover, the only way to distinguish BMK from UnBMK is to notice a small dot. The only way I could think of to make the connection between the graph's elements and their identifiers worse would be to move the latter to an appendix.
How does the New York Times fare on this aspect of effective display? Very well indeed. Below are two plots roughly following a New York Times design that describe one hundred years of employment in New Haven County. In each panel the lines are labeled directly, making the decline of manufacturing jobs clear. In the following week another graph appeared showing five time series over three decades. A redrafted and corrected version of that New York Times graph is also shown below. In this plot the long lines and their crossing patterns made it possible for the viewer to confuse one line with another. Labeling both ends of each line ameliorated this possibility; a fine idea, worthy of being copied by those of us whose data share the same characteristics. Unfortunately, in the original the grapher added some confusion by using equally spaced […].

[Figure: A graph taken from Reckase ("Intended Cut Score") in which, rather than labeling the graph lines directly, they are identified through a legend – indeed a legend whose order reverses that of the lines.]

[Figure: A graph taken from Pfeffermann and Tiller showing […], Benchmarked, and Unbenchmarked estimates of total monthly unemployment, South Atlantic Division (numbers in 10,000), in which the three data series are identified in acronym form in the caption. There is plenty of room on the plot to label the lines directly.]

[Figure: Two panels in a New York Times style, showing employment in New Haven County in three industries and the percent of all employment in New Haven County, with three lines in each panel, each line identified directly. See panels 9.9a and 9.9b.]
Example 3: Channeling Playfair to Measure China's Industrial Expansion
On the business pages of the March 13, 2007 New York Times appeared a graph used to support the principal thesis of an article on how the growth of China's economy has fueled an expansion of acquisitions. That graph has two data series. The first shows the amount of money spent by China on external acquisitions from 1990 through 2006. The second time series shows the number of such acquisitions.
[Figure: A graph modeled after one from the News of the Week in Review section of the February 25, 2007 New York Times (page 15) showing, as five long lines, the percent of respondents who express "a great deal of confidence" in […], organized religion, the military, the press, and Congress; each line is identified directly at both of its extremes, making identification easy even when the lines cross.]
[Figure: A graph taken from the business section of the March 13, 2007 New York Times (page C1; source: Thomson Financial) showing two data series on China's acquisitions.]
The display format, while reasonably original in its totality, borrows heavily from William Playfair. First, the idea of including two quite different data series in the same chart is reminiscent of Playfair's chart comparing the cost of wheat with the salary of a mechanic. However, in plotting China's expenditures the graphers had to confront the vast increases over the time period shown; a linear scale would have obscured the changes in the early years. The solution they chose was also borrowed from Playfair's plot of Hindoostan in his 1801 Statistical Breviary. Playfair showed the areas of various parts of Hindoostan as circles. The areas of the circles were proportional to the areas of the segments of the country, but the radii are proportional to the square root of the areas. Thus, by lining up the circles on a common line, we can see the differences in the heights of the circles – which is, in effect, a square-root transformation of the areas. This visual transformation helps to place diverse data points on a more comparable scale.
[Figure: A graph taken from Playfair containing two data series that are meant to be compared; the first is a line representing the "weekly wages of a good mechanic."]

[Figure: A graph taken from Playfair containing three data series. The area of each circle is proportional to the area of the geographic location indicated; the vertical line to the left of each circle expresses the number of inhabitants, in millions; the vertical line to the right represents the revenue generated in that region, in millions of pounds sterling.]
The New York Times's plot of China's increasing acquisitiveness has two things going for it. It contains thirty-four data points, which by mass media standards is data rich, showing vividly the concomitant increases in the two data series over a seventeen-year period. This is a long way from Playfair's penchant for showing a century or more, but in the modern world, where changes occur at a less leisurely pace than in the eighteenth century, seventeen years is often enough. And second, by using Playfair's circle representation it allows the visibility of expenditures over their enormous range.
Panel 9.14a is a straightforward scatter plot showing the linear increases in the number of acquisitions that China has made over the past seventeen years. The slope of the fitted line tells us that over those seventeen years China has, on average, increased its acquisitions by 5.5/year. This is shown by the fitted regression line in the scatter plot. Panel 9.14b shows the increase in money spent on acquisitions over those same seventeen years. The plot is on a log scale, and its overall trend is well described by a straight line. The slope of that line corresponds to an increase of about 32 percent per year. Thus, the trend established over these seventeen years shows that China has both increased the number of companies it acquires and, even faster, the amount it spends on them.
The key advantage of using paired scatter plots with linearizing transformations and fitted straight lines is that they provide a quantitative measure of how China’s acquisitiveness has changed. This distinguishes them from the New York Times display, which, although it contained all the quantitative information necessary to do these calculations, had primarily a qualitative message.
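The growth-rate arithmetic described above can be sketched in a few lines of Python. The numbers below are invented for illustration (the actual acquisition figures are not reproduced here); the point is only that a straight-line fit to the logarithm of a series recovers its constant percentage growth:

```python
import math

# Hypothetical spending series growing 32% per year (illustrative only;
# the book's actual acquisition figures are not reproduced here).
years = list(range(1990, 2007))                      # seventeen years
spending = [100 * 1.32 ** (y - 1990) for y in years]

# Ordinary least-squares fit of log10(spending) on year.
n = len(years)
xbar = sum(years) / n
logy = [math.log10(s) for s in spending]
ybar = sum(logy) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(years, logy))
         / sum((x - xbar) ** 2 for x in years))

# A straight line on the log scale means constant multiplicative growth:
annual_growth = 10 ** slope - 1
print(f"estimated growth: {annual_growth:.1%} per year")  # 32.0% per year
```

The same fit on a linear scale would report an average dollar increase per year; only the log scale turns constant percentage growth into a straight line.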
Magnum esse solem philosophus probabit, quantus sit mathematicus.
Seneca, Epistulae 88.27
Roughly translated: “While philosophy says the sun is large, mathematics takes its measure.”
[Figure 9.14: The data redrafted as two scatter plots. Panel a, “How Many Acquisitions Has China Made?” (China acquired 5.5 more companies each year than it had the year before); Panel b, “Value of China’s Acquisitions Outside of China (in millions of US dollars).” The plot of money is shown on a log scale, which linearizes the relationship. Fitted straight lines in both panels allow the viewer to draw quantitative inferences about the rates of growth that were not possible with the original depiction.]
two hundred years ago. Since that time rules have been codified, and many books have described and exemplified good graphical practice. All of these have had an effect on graphical practice. But it would appear, from my sample of convenience, that the effect was larger on the mass media than on the scientific literature. I don’t know why, but I will postulate two possible reasons. First, scientists make graphs with the software they have available and tend, more often than is proper, to accept the defaults it offers. A second reason why poor graphical practices persist is akin to Einstein’s observation on the persistence of incorrect scientific theories: “Old theories never die, just the people who believe in them.”
Graphical display is prose’s nonverbal partner in the quest to effectively communicate quantitative phenomena. When France’s Louis XVI, an amateur of geography and owner of many fine atlases, first saw the statistical graphics invented by Playfair, he immediately grasped their meaning and their importance. He said, “They spoke all languages and were very clear and easily understood.” The requirement of clarity is in the orthogonal complement of my earlier definition of truthiness. Telling the truth is of no help if the impression it leaves is fuzzy or, worse, incorrect. The media tend to eschew displays of great clarity, like scatter plots, because they are thought to be too dreary or too technical. The scientific community may avoid clear displays because individual scientists lack the training to make them clear and/or the empathy to care. I view misleading your readers out of ignorance as a venial sin, because it can be fixed with training. Using truthiness to twist the conclusions of the audience toward “that which is false” just to suit your own point of view is a mortal
E.g., Bertin; Tufte; and Wainer.
This is a gentler version of Wolfgang Pauli’s well-known quip that “Science advances one funeral at a time.”
Reported in William Playfair (1822–3) in an unpublished manuscript held by John Lawrence Playfair, Toronto, Canada (transcribed and annotated by Ian Spence).
sin. As we saw in the discussion of the effects of fracking and wastewater injection on earthquakes in Oklahoma, clarity was never an issue; quite the opposite, the goal was to obfuscate the connection. I fear that by showing some ways that data displays can confuse I may be inadvertently aiding those who are enemies of the truth. I hope not.
Inside Out Plots
The modern world is full of complexity. Data that describe it too often must mirror that complexity. Statistical problems with only one independent variable and a single dependent variable are usually only found in textbooks. The real world is hopelessly multivariate and filled with interconnected variables. Any data display that fails to represent those complexities risks misleading us. Einstein’s advice that “everything should be as simple as possible, but no simpler” looms prescient.
If we couple Einstein’s advice with Tukey’s (1977) observation (discussed in the introduction to this section) that the best way to find what we are not expecting is with a well-designed graphic display, we have an immediate problem. Most of our data displays must be represented on a two-dimensional surface, in which bigger numbers are represented by a bigger bar, a larger pie segment, a line that reaches higher, or any of the other Cartesian representations. Many schemes have been developed to display multivariate data on a two-dimensional surface. These include:
Icons that contain many features, each paired with a variable whose size or shape relates to the size of the variable – e.g., polygons or cartoon faces.
Complex periodic functions where each variable represented is paired with a separate Fourier component.
Forty years ago Yale’s John Hartigan proposed a simple approach for looking at some kinds of multivariate data. This is now called the “Inside Out Plot.” Most data begin as a table, and so it is logical that we use a semigraphic display to help us look at such tabular data. A well-constructed table can be an effective display, giving us hope that a semigraphic variant can sidestep the limitations of a two-dimensional plotting surface.
As in most instances on the topic of display, explanation is best done through the use of an example. The popularity of the movie provides the topic.
A Multivariate Example: Joe Mauer vs. Some Immortals
In the February 17, 2010 issue of USA Today there was an article about the Minnesota Twins all-star catcher Joe Mauer. Mauer’s first six years in the major leagues have been remarkable by any measure, but especially from an offensive perspective (during that time he won three batting titles). The article’s author (Bob Nightingale) tries to cement his point by comparing Mauer’s offensive statistics with those of five other great catchers during their first six years. The data he presents are shown here.
How can we look at these data to see what messages they carry? Obviously we would like to somehow summarize across all of these offensive categories, but how? They are a mixture of variables with different units. How, for example, does a batting average compare with batting in 102 runs? How is a raven like a writing desk? Before they can be compared, and thence summarized, we must first somehow place all of the variables on a common scale. We’ll do this in two steps. First we’ll center each column by subtracting out some middle value of each column.
Minard’s sublime six-dimensional map showing Napoleon’s ill-fated Russian campaign as a rushing river crossing into Russia from Poland and a tiny stream trickling back.
There are many other approaches.
[Table: Offensive Statistics for the First Six Years in the Careers of Six Great Catchers. Mauer vs. Other Catching Greats after Six Seasons. Rows: Joe Mauer, Mickey Cochrane, Yogi Berra, Johnny Bench, Ivan Rodriguez, Mike Piazza; columns include At Bats and On Base. OPS = On Base + Slugging.]
this is done we can make comparisons across columns and characterize the overall performance of each player. Exactly how to do this will become clearer as we go along.
But first we must clean up the table. We can begin by simplifying it, removing the column indicating the years they played. It may be of some background interest but it is not a performance statistic. Also, because the OPS is just the sum of two other columns (on-base percentage and slugging percentage), keeping it in would just give extra weight to its component variables. There seems to be no reason to count those measures twice, and so we will also elide the OPS column. The shrunken table augments each column with the median value for that variable. This augmentation allows us to easily answer the obvious question “what’s a typical value for this variable?” We choose the median, rather than the mean, because we want a robust measure that will not be overly affected by an unusual data point, and that, once removed, will allow unusual points to stick out. Also, because it is merely the middle value of those presented, it is very easy to compute.
Now that the table is cleaned up, we can center all of the columns by subtracting out the column medians. Such a table of column-centered variables is shown next. After the columns are all centered we see that there is substantial variability within each column. The RBI column, for example, shows very large spreads; compare this to batting averages, in which Johnny Bench’s was .040 lower than the median whereas Mike Piazza’s was .024 above the median, a difference of .064. How are we to compare .064 batting average points with 193 RBIs? It is that raven and writing desk again. Obviously, to make comparisons we need to equalize the variation within each column. This is easily done by characterizing the amount of variability in each column and then dividing all elements of that column by that characterization.
At the bottom of each column is the Median Absolute Deviation (MAD). This is the median of the absolute values of all of the entries in
[Table: Original Data with Years and OPS Elided and Column Medians Calculated. Rows: Joe Mauer, Mickey Cochrane, Yogi Berra, Johnny Bench, Ivan Rodriguez, Mike Piazza.]
[Table: Results from the previous table, Column Centered by Subtracting Out Column Medians, with Median Absolute Deviations (MADs) calculated.]
that column. The MAD is a robust measure of spread. We use the MAD instead of the standard deviation for exactly the same reasons that we preferred the median to the mean. Now that we have a robust measure of scale, we can divide every column by its MAD. This finally allows us to summarize each player’s performance across columns. We do this by taking row medians, which yield, for each player, a single robust summary value – what we can call the player effects. This is shown in the next table.
The player “effects” provide at least part of the answer to the questions for which these data were gathered. We can see the outcome more easily if we reorder the players by the player effects. Such a reordered table is shown as Table 10.5.
Now we can see that Mike Piazza is, overall, the best performing offensive catcher among this illustrious group, and Ivan Rodriguez the worst. We also see that Joe Mauer clearly belongs in this company, sitting in the middle of it. A second issue, at least as important as the overall rankings, is an understanding of any unusual performances of these players on one or more of the component variables. To understand these we must first remove the row effects by subtracting them out, and look at the residuals that remain. The result is shown next. On its right flank are the player effects, and its entries are the doubly centered and column-scaled residuals. If we want to see the extent to which a specific player does unusually well or unusually poorly on one or more of the various performance measures it will be worth our time to look at these residuals carefully. But how?
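The sequence of steps just described (center each column at its median, scale by its MAD, take row medians as the player effects, then remove those effects to expose residuals) can be sketched in Python. The players and numbers below are made up; only the procedure follows the text:

```python
# A minimal sketch of the inside-out plot computations, using invented
# numbers (the book's actual table is not reproduced here).
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# rows = players, columns = offensive statistics (hypothetical values)
players = ["A", "B", "C", "D"]
data = [
    [0.320, 90, 150],
    [0.280, 70, 120],
    [0.300, 110, 140],
    [0.260, 60, 100],
]

ncol = len(data[0])
# Step 1: center each column by its median.
col_meds = [median([row[j] for row in data]) for j in range(ncol)]
centered = [[row[j] - col_meds[j] for j in range(ncol)] for row in data]

# Step 2: scale each column by its MAD (median absolute deviation),
# giving a kind of robust z-score.
mads = [median([abs(row[j]) for row in centered]) for j in range(ncol)]
scaled = [[row[j] / mads[j] for j in range(ncol)] for row in centered]

# Step 3: row medians are the "player effects."
effects = [median(row) for row in scaled]

# Step 4: subtract the row effects to expose unusual residuals.
residuals = [[v - e for v in row] for row, e in zip(scaled, effects)]

for name, e in sorted(zip(players, effects), key=lambda t: -t[1]):
    print(f"{name}: effect = {e:+.2f}")
```

Because each residual row has had its own median removed, a residual far from zero flags a performance that is unusual for that player, which is exactly what the inside out plot displays.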
Indeed, when the data are Gaussian, it is, in expectation, a fixed fraction of the standard deviation.
Thus each column is centered and scaled as a kind of robust z-score.
In traditional statistical jargon these are the column standardized row effects.
The stem-and-leaf display was proposed by John Tukey as a simple display to show the distribution of a single variable quickly to the flying eye. The stem is just a vertically arranged, equally spaced list of the numbers under consideration. The leaves are labels associated with each of those numbers.
[Table: Results from the previous table, Rescaled by Dividing Each Entry in a Column by its MAD, with Row Medians (Player Effects) Calculated.]
[Table 10.5: The same table with its Rows Reordered by Row (Player) Effects: Mike Piazza, Mickey Cochrane, Johnny Bench, Joe Mauer, Yogi Berra, Ivan Rodriguez.]
A conventional table has names on the outside and numbers on the inside. To view the results in this new way we turn the table inside out, using the numbers as the frame and the names as the plotted points. Such an inside out plot is shown in the accompanying figure. Note we used the convention of using the plotting symbol /–/ to represent all the unnamed players whose residuals were essentially zero on that variable, when there are too many to fit their names in explicitly. This is sensible because it renders anonymous those players whose residuals on this variable are too small to be of interest anyway.
Even a quick glance at the figure provides enlightenment. We see that Johnny Bench had an unusually high number of at bats, whereas Yogi Berra and Mike Piazza had fewer than we would expect from their overall ranking. Mickey Cochrane scored a great deal more runs than we would have expected, although many fewer home runs. Joe Mauer’s strongest showings are on highly related variables: hits, batting average, and on-base percentage, whereas Johnny Bench is at or near the opposite extreme on those. And the biggest residual is reserved for Mike Piazza’s slugging percentage, which outshines even Yogi Berra on this measure.
The inside out plot provides a simple robust way to look at data that are not measured on the same scale. We do not suggest that other approaches will not also bear fruit, but only that this one is simple and
[Figure: Stem-and-leaf diagram of player effects. From top: Mike Piazza, Mickey Cochrane, Johnny Bench, Joe Mauer, Yogi Berra, Ivan Rodriguez.]
[Table: The Results from Table 10.5 with the Row Effects Removed. A Doubly Centered and Column Scaled Data Matrix (Column Standardized and Reordered with Row Effects Removed).]
[Figure: Standardized residuals plotted inside out, with variables such as On Base, At Bats, and Home Runs framing the display and player names plotted within.]
easy, providing a way to look at the fit and also at the residuals from the fit. It requires little more special processing beyond that provided by a spreadsheet.
Of course, with a toy example like this, comprised of only six players and eight variables, much of what we found could also be seen through careful inspection of the tables. The value of inside out plotting would be clearer with a larger sample of players, or if it included defensive statistics. We suspect that, had the latter been done, Ivan Rodriguez would have fared considerably better. But although a larger example would have made the power of this technique more obvious, it would also have been more cumbersome; this was the suitable size for a demonstration.
But how far does this approach scale upward? Suppose we had twenty or eighty catchers? Inside out plots will still work, although some modifications may be helpful; replacing each catcher’s name with a less evocative, but more compact, representation is a start. Next, remember that residuals near zero are where the most names will pile up and are also of least interest. Hence replacing large numbers of names with a single /–/ works fine.
A second issue – that of highly correlated variables – has already been hinted at with our dismissal of OPS as redundant. Decisions about which variables to include and which to elide must be made, but are not settled directly by the plot. If two variables are highly related the inside out plot will tell us, for their residuals will appear very similar. As is true for all investigations, inside out plots are often iterative, where one plot provides information on how to make the next.
One can also easily imagine dynamic augmentations to inside out plots. For example, one might prepare a program for inside out plots so that if you point to a player’s name a line appears that connects that name across all variables. One strategy might be to construct a series of these augmented versions dynamically, and subsequently choose a few especially interesting ones to yield a series of static displays. We leave it to the readers’ imaginations to devise other variations.
A Century and a Half of Moral Statistics
Plotting Evidence to Affect Social Policy
It is a truth universally acknowledged, that any person in possession of a geographic data set, must be in want of a map.
Jane Austen
The sophisticated use of thematic maps began on November 30, 1826, when Charles Dupin (1784–1873) gave an address on the topic of popular education and its relation to French prosperity. He used a map shaded according to the proportion of male children in school relative to the size of the population in that département. This graphic approach was improved upon in 1830 when Frère de Montizon produced a map showing the population of France in which he represented the population by dots, each dot representing ten thousand people. Although little noted at the time, it was perhaps the most important conceptual breakthrough in thematic mapping. It was the direct linear antecedent of John Snow’s famous map of the 1854 cholera epidemic in London. Snow’s map used bars to show the location and number of cholera deaths, which Snow plotted in relation to the neighborhood water pumps. Although the vectors of cholera contagion were unknown at that time, the
Ironically, the same year as the London epidemic, the Italian Filippo Pacini identified the Vibrio cholerae as the proximal cause of the disease.
pattern of deaths relative to the water pump suggested to Snow the cause. The pump’s handle was removed, and within a week the epidemic, which had taken 570 lives, ended. This may not be history’s most powerful example of Tukey’s observation about a graph being the best way to find what you were not expecting, but it is certainly among the top five.
Maps of moral statistics started appearing at about the same time; these typically focused on various aspects of crime. Most widely known were those of Adriano Balbi (1782–1848) and André-Michel Guerry (1802–66), whose 1829 map pairing the popularity of instruction with the incidence
[Figure: John Snow’s 1854 cholera map of London.]
of crimes was but the start of their continuing investigations. Their work was soon extended; an 1831 map of crimes against property in France was a marvel, in which the shading was continuous across internal boundaries rather than being uniform within department. Guerry expanded his investigation in the 1830s to add three other moral variables (illegitimacy, charity, and suicides) to crime and instruction. Thus, despite graphical display’s English birth, its early childhood was spent on the continent.
In 1847 the first of several long papers was published. It was filled with tables and but a single, rudimentary map. In fact, its author explicitly rejected the use of maps, arguing that one could study the columns of numbers and thus avoid the bother and expense of drafting maps. He changed his mind, and two years later published two articles of the same title (but of much greater length) as his 1847 paper. These included many shaded maps that were innovative in both content and format. Format first.
He did not shade his maps by the raw numbers but instead by their deviations from the mean. And so the scale of tints varied from the middle, an approach that is common now. In 1849, the approach was an innovation. The advantage of centering is that it allowed him to produce many maps, on very different topics and with very different scales, and place them side by side for comparison without worrying about the location of the scale. He also oriented the shading, as much as possible, in the same way. In his words,
In all the Maps it will be observed that the … numbers are appropriated to the … end of the scale.
“… maps accompanies these tables, to illustrate the most important branches of the investigation, and I have endeavoured to supply the deficiency which H.R.H. Prince Albert was pleased to point out, of the want of more illustrations of this kind.”
Of course, with variables like population density it is hard to know which end is favorable. He chose lesser population as more favorable because, we speculate, it seemed to accompany favorable outcomes on the other variables. Although his format innovations were noteworthy, the content he chose makes him special. He did not make maps of obvious physical variables like wind direction or altitude or even the distribution of the population (although he did make one map of that, it was for comparative purposes). No, he had more profound concerns. For example, beside his plot of ignorance in England and Wales he placed a verbal description of the data. He wrote,
We thus find the decline in ignorance to be slowest …
He then tied this analysis to the phenomenon made observable through his maps:
The darkest tints of ignorance go with those of crime, from the more southern of the Midland Manufacturing Counties, through the South Midland and Eastern Agricultural … and it will be well to observe, as an example of their use, that all four of the tests of moral influences now employed are seen to be on the side of the more instructed districts.
And then, looking into the phenomenon at a more microscopic level, he noted:
The two least criminal regions are at the opposite extremes in this respect (the Celtic and the Scandinavian), with this important difference, that in the region where there is the greatest decline of absolute ignorance among the criminals (the Scandinavian), there is not one-half of the amount of it in the population at large which exists in the other.
In addition to these maps, he also produced parallel plots of bastardy in England and Wales, of improvident marriages, of persons of independent means, of pauperism, and of many other variables that could plausibly be thought of as either causes or effects of other variables. His goal was to generate and test hypotheses, which might then be used to guide subsequent social action and policy.
A sense of the relation between crime and ignorance, as well as their relation over time, can be attained by comparing thematic maps, but the process is neither easy nor precise, even using modern tools. The map was innovative but perhaps not the ideal instrument for such comparisons; a scatter plot serves better. In his “investigation of the orbits of revolving double stars,” published in 1833, British astronomer John Frederick William Herschel plotted the positions of stars on one axis against the observation year on the other, earning a claim as inventor of what is now called a scatter plot. Such a representation shows the relation between two variables directly.
Guns, Murders, Life, Death, and Ignorance
Friendly and Denis
connection suggesting that improving schooling would simultaneously reduce ignorance and crime. A reanalysis of the same data a century later supported the policies of improved public education. However obvious the link between ignorance and crime may seem, empirical evidence to support such a claim, even partially, is welcome, for some will for one reason or another dispute such claims, regardless of how obvious they seem. So,
I shall make two kinds of claims and hence mobilize two kinds of displays in their support. The first kind of claim will be about the relation between pairs of variables, for example, between the number of guns in a state and the number of people murdered by guns in that state. These claims will be illustrated with scatter plots. Such plots carry a great deal of quantitative information. We augment the scatter plots a little by indicating the states that voted in the majority for Barack Obama in the 2012 presidential election (“blue states”) and those that voted for Mitt Romney (“red states”).
The second kind of claim relates more directly to the geographic distribution of the variables displayed in the scatter plots. To accomplish this we use shaded maps of the sort pioneered more than 150 years ago. We modernize the format slightly, following the lead of Andrew Gelman in his maps of the 2008 presidential election. We shade each map so that states above the median in the direction generally agreed as positive (e.g., fewer murders, longer life expectancy, lesser ignorance, greater income) will be in blue (the further from the median, the more saturated the blue). And states below the median are shaded in red (the lower, the more saturated the red). The closer to the median that a state scores, the paler the coloration, so that a state at the median is shaded white. Our goal is to examine the extent to which geography is destiny. Is there coherence among states that always seem to find themselves at the unfortunate end of these variables? Perhaps such a view can suggest ways that those states can amend their policies to improve the lives of their inhabitants. Or, if those states resist such change, the maps can guide their inhabitants to alternative places to live.
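The shading rule just described can be sketched as follows. The states and values are invented, and the particular red-white-blue mixing is one plausible choice; the cap at two standard deviations follows the map legends:

```python
import statistics

# A sketch of the shading rule: blue above the median (in the "good"
# direction), red below, white at the median; saturation capped at two
# standard deviations, as in the map legends. States and values invented.
def shade(value, values, higher_is_better=True):
    """Return an (r, g, b) triple in [0, 1] for one state's value."""
    med = statistics.median(values)
    sd = statistics.stdev(values)
    z = (value - med) / sd
    if not higher_is_better:
        z = -z
    t = max(-1.0, min(1.0, z / 2))       # cap at +/- 2 SD
    if t >= 0:                           # better than the median: toward blue
        return (1 - t, 1 - t, 1.0)
    else:                                # worse than the median: toward red
        return (1.0, 1 + t, 1 + t)

life_expectancy = {"A": 81.0, "B": 79.5, "C": 77.0, "D": 75.5}  # invented
vals = list(life_expectancy.values())
for state, v in life_expectancy.items():
    print(state, shade(v, vals))
```

The `higher_is_better` flag handles variables like firearm deaths, where a lower value is the favorable direction, so that blue always marks the better-off states.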
Variables and Data Sources
Variable 1. Guns. There is no accurate estimate of the number of guns in each state, so we used, as a proxy, the 2012 NICS firearm background checks per one hundred thousand residents in each state. We assume that the more background checks, the more guns in the state. We assume that the actual number of guns purchased far exceeds the number of background checks, but that the relation between the two is monotonic. Some might claim that by not requiring background checks we might thus reduce gun violence, but we remain unconvinced by this argument.
Variable 2. Firearm death rates per one hundred thousand residents.
Variable 3. 2010–11 life expectancy from the American Human Development Project.
Variable 4. Ignorance. We used the 2011 eighth grade reading score from NAEP (popularly referred to as “The Nation’s Report Card”) as a measure of the ignorance of the state’s population. NAEP scores are based on a random sample from the population, and all scores on all age groups and all subjects are strongly and positively related to one another; hence we can simply choose any one and it would be representative of all of them. We chose eighth grade reading, but any other would have served as well. The nineteenth-century mapmakers used the proportion of individuals who signed their wedding license with an “X” as their indicant of ignorance. We believe that our choice represents an improvement.
Variable 5. Income. 2010 per capita income from U.S. Census Bureau 2012 Statistical Abstracts.
More technically, they form a “tight positive manifold” in the exam space.
Claim 1. The more people killed by guns in a state, the lower the life expectancy.
Claim 2. The more guns registered in a state, the greater the number of people killed by guns.
Claim 3. The greater the ignorance in a state, the greater the number of firearm deaths.
Claim 4. The lesser the ignorance, the greater the income.
There is a coherence among states that has remained stable for at least a century and a half. The states that fare poorly on a large mixture of indices of quality of life closely resemble what
[Figure: A scatter plot of 2010–11 life expectancy versus firearm death rate per one hundred thousand by state. The solid dots are states that voted for Obama in the 2012 presidential election; states with the open circles voted for Romney.]
[Figure: The horizontal axis shows the 2012 NICS firearm background checks per one hundred thousand in each state; the vertical axis is the firearm death rate per one hundred thousand. Once again, states with solid dots voted for Obama in 2012; states with open circles voted for Romney.]
[Figure: The horizontal axis shows 2011 NAEP eighth grade reading scores for each state; the vertical axis has the firearm death rate per one hundred thousand. Solid dots represent states that voted for Obama in 2012; states with open circles voted for Romney.]
we see in our twenty-first-century data. Though people differ in their values, few would dispute the worth of a long life (life expectancy), of safety from violent death (firearm death rates), or of the knowledge needed to participate in decisions about the future and welfare of themselves and their progeny (NAEP reading scores).
Even a cursory study of these figures reveals a startling similarity in the geographic patterns of these variables. This pattern repeats itself in the map of gun background checks, suggesting a plausible causal connection that seems worth more serious consideration. The red and blue coloration scheme used here is based solely on the states’ location on the variable plotted, not on their voting in the 2012 presidential election. Nevertheless, the similarities between these maps and the voting map are striking. The direction of the causal arrow connecting the voting habits of states and their position on these variables is uncertain.
[Figure: The horizontal axis shows 2011 NAEP eighth grade reading scores; the vertical axis shows the 2010 per capita income. The solid dots represent states that voted for Obama in 2012; the open circles voted for Romney.]
Are the citizens of states ignorant because of the policies espoused by
Governor Romney? Or are ignorant people attracted to such policies?
Our data do not illuminate this question.
[Figure: 2012 U.S. life expectancy. Scale shows life expectancy transformed to deviations above and below the median U.S. life expectancy, from +2 SD to –2 SD.]
[Figure: 2012 U.S. firearm death rate, shown per 100,000 people as deviations above and below the median U.S. state rate, from +2 SD to –2 SD.]
[Figure: 2012 U.S. per capita income, in 2005 US dollars, shown as deviations above and below the median U.S. per capita income.]
[Figure: 2011 NAEP scores, eighth grade reading, shown as deviations from the U.S. median.]
[Figure: 2012 NICS background checks per capita, shown as deviations from the U.S. median.]
[Figure: 2012 U.S. presidential popular vote, shown as the percent difference in the popular vote.]
The investigation we have performed here is not causal in any serious way. To credibly make causal claims we would need to hew more closely to the approach laid out earlier, using Rubin’s Model. Indeed the rough-and-ready approach used here was designed to be exploratory, what might be called “reconnaissance at a distance,” which suggested possible causal arrows that could be explored more deeply in other ways. Such an exploratory approach is perfectly fine so long as the investigator is well aware of its limitations. Principal among them is the ecological fallacy, in which apparent structure exists in grouped (e.g., average state) data that disappears or even reverses on the individual level. All that said, investigations like these provide circumstantial causal evidence that can often prove to be quite convincing.
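The ecological fallacy mentioned above is easy to demonstrate with a toy example: within each invented “state” the two variables move in opposite directions, yet the state averages move together:

```python
# A toy demonstration of the ecological fallacy: within each invented
# "state" the relation between x and y is negative, but the relation
# between the state averages is positive.
def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

state1 = ([1, 2, 3], [5, 4, 3])     # within state 1: y falls as x rises
state2 = ([7, 8, 9], [11, 10, 9])   # within state 2: same negative relation

mean = lambda v: sum(v) / len(v)
group_x = [mean(state1[0]), mean(state2[0])]   # state averages of x
group_y = [mean(state1[1]), mean(state2[1])]   # state averages of y

print(corr(*state1), corr(*state2))   # both -1.0 (individual level)
print(corr(group_x, group_y))         # 1.0 (aggregated level)
```

A pattern seen in state averages therefore need not hold, and may even reverse, for the individuals within those states.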
maps. Their strengths are answering the dual questions “what’s happening here?” and “where is this happening?” They are less powerful in helping us answer questions of relation: are states that are ignorant also crime ridden? Questions of the latter type are answered better by scatter plots, but scatter plots carry no information about geographic distribution. This result drove us to the inexorable conclusion that both kinds of displays are needed.
To clarify how such a strategy can be used, we looked at data from the twenty-first-century United States and used them to illustrate how such displays can form the beginning of an evidence-based discussion, which in the case we chose leads to an inexorable conclusion.
Circumstantial evidence can sometimes be very convincing, like when you find a trout in the milk. The structure found in the milky waters of twenty-first-century political discourse, when looked at carefully, does bear a startling resemblance to a fish.
Rather than paraphrase the philosophical underpinning of the tack that Hobbes took, I shall let him speak. Hobbes described the natural state of mankind, without the control of some sort of central government, as a “war of every man against every man,” and the result of such a circumstance as lives that are “solitary, poor, nasty, brutish, and short.” We have seen that, as access to firearms becomes less and less controlled, lives become shorter, wealth diminishes, and ignorance increases. Or at least that is what the evidence shows.
What Hobbes actually said was, “Whatsoever therefore is consequent to a time of Warre, where every man is Enemy to every man; the same is consequent to the time, wherein men live without other security, than what their own strength, and their own invention shall furnish them withall. In such condition, there is no place for Industry; because the fruit thereof is uncertain; and consequently no Culture of the Earth; no Navigation, nor use of the commodities that may be imported by Sea; no commodious Building; no Instruments of moving, and removing such things as require much force; no Knowledge of the face of the Earth; no account of Time; no Arts; no Letters; no Society; and which is worst of all, continuall feare, and danger of violent death; And the life of man, solitary, poore, nasty, brutish, and short.”
Applying the Tools of Data Science
From 1996 until 2001 I served as an elected member of the Board of Education, during which time the district asked its voters to pass a $61 million bond issue. Each board member was assigned to appear in several public venues to describe the projects and try to convince those in attendance of their value so that they would agree to support the bond issue. The repayment of the bond was projected to add about $500 to the annual school taxes for the average house, which would continue for the forty years of the bond. It was my misfortune to be named as the board representative to a local organization of senior citizens. One after another, members of the audience explained that they had no children in the schools, that the schools were more than good enough, and that they were living on fixed incomes and any substantial increase in taxes could constitute a hardship, which would likely continue for the rest of their lives. During all of this I wisely remained silent. Then,
when a pugnacious octogenarian strode to the microphone, I feared the worst. He glared out at the gathered crowd and proclaimed, “You’re all idiots.” He then elaborated, “What can you add to your house for $500/year that would increase its value as much as this massive improvement to the schools? The value of your house will go up immediately, and you won’t live long enough to pay even a small portion of the cost. You’re idiots.” Then he stepped down. A large number of the gray heads in the audience turned to one another and nodded in agreement. The bond issue passed overwhelmingly.
Each year, when the real estate tax bill arrives, every homeowner is reminded of what the schools cost. But if the system works well, it is money well spent, even for those residents without children in the schools. For as surely as night follows day, real estate values march in lockstep with the reputation of the local schools. Of course, the importance of education to all of us goes well beyond the money spent on it. The effects of schools touch everyone’s lives in a deep and personal way – through our own experiences, our children’s, and everyone we know. Thus it isn’t surprising that education issues are so often featured in media reports. What is surprising is how often those reports are based on claims that lie well beyond the boggle threshold, with support relying more on truthiness than on evidence. The ubiquity of educational policies that rest on weak evidentiary bases, combined with the importance of education in all of our lives, justifies devoting a fair amount of effort to understanding such policies. In this section I discuss six important contemporary issues in education and try to see what evidence there is to support the claims made about each.
A popular narrative in the mass media proclaims that public schools are failing and that racial performance differences are building a permanent underclass. In the first of these discussions we look at data gathered over the last twenty years by the National Assessment of Educational Progress (NAEP – often called “the Nation’s Report Card”) to see if those data provide evidence to support these claims. Instead what emerges is clear evidence of remarkable, nationwide improvements in student performance in all states, with the improvement in minority performance even greater than that of white students.
Part of that same narrative assigns causes for the nonexistent decline in educational performance. Most commonly, the blame falls on teachers, portrayed as shielded from just treatment by intransigent unions and the power of tenure, which provides a sinecure for older and more highly paid teachers. Calls for the abolition of tenure are regularly heard, and some states are beginning to act on those calls. They are doing this in the hope of easing the removal of highly paid teachers and opening the door for new, younger, and cheaper ones, thereby reducing costs. In the second, we consider the origins of teacher tenure and how one of its goals was, ironically, to control costs. We show data that illustrate unambiguously how removing tenure is likely to cause payroll costs to rise.
Much of what we know of the performance of our educational system is based on student performance on tests. For test scores to be a valid indicator of performance, we must be sure that they represent the students’ abilities and are not distorted through cheating. In the third, we learn how one testing organization used faulty statistical measures to accuse an examinee of cheating without sufficient evidence. Moreover, they did this despite the existence of an easy and much more efficacious alternative.
We often hear about failing schools, whose students do not reach even minimal standards. In the fourth, we learn of a Memphis charter school whose state charter was threatened because their fifth graders scored zero on the statewide assessment. This remarkable result was obtained not because of poor performance by the students but because a state rule mandated that any student who did not take the test be automatically given a zero. This story and its outcome are the topic of that chapter.
The College Board is at least as much a political organization as a scientific one. So when it announces changes to its iconic college admissions test (the SAT), the announcement and the changes it describes garner substantial media attention. Thus, in March 2014 when, with great fanfare, we learned of three changes to the next version of the SAT, it seemed sensible to try to understand what the effect of those changes was going to be and to try to infer what led to the decision to make these changes. In the fifth, we do exactly this and find a surprising link to some advice given to Yale’s then-president Kingman Brewster about how to smooth Yale’s passage to admitting women.
In 2014 the media was full of reports decrying the overuse of tests in the United States. A Google search of the phrase “Testing Overload in America’s Schools” generated more than a million hits. Is too much time spent testing? In the sixth, we examine the evidence surrounding the actual length of tests and try to understand how much is enough, as well as the consequences of overlong testing.
Waiting for Achilles
A famous paradox, attributed to the Greek mathematician Zeno, involves a race between the swift Achilles and a tortoise. Because of their vastly different speeds, the tortoise was granted a substantial head start. The race began, and in a short time Achilles had reached the tortoise’s starting spot. But in that short time, the tortoise had moved slightly ahead. In the second stage of the race Achilles quickly covered that short distance, but the tortoise was not stationary and moved a little further onward. And so they continued – Achilles would reach where the tortoise had been, but the tortoise would always inch ahead, just out of his reach. From this example the great Aristotle concluded that, “In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.”
The lesson that we should take from this paradox is that when we fix our attention on a narrowing gap, we can lose sight of the big picture. Nowhere is this more obvious than in the current public discussion of the racial achievement gap: the gap between white and black students on the well-developed scale of the tests of the NAEP has shrunk by only about 25 percent over the past two decades. The conclusion drawn was that even though the change is in the right direction, it is far too slow.
But focusing on the difference blinds us to a remarkable success in education over the past twenty years. Although student improvements of similar direction and size occur across many subject areas and many age groups, I will describe just one – fourth grade mathematics. The dots in the accompanying figure represent the average scores for all available states on NAEP’s fourth grade mathematics test (with the nation as a whole as well as the state of New Jersey’s dots labeled for emphasis), for black students and white students in 1992 and 2011. Both racial groups made steep gains over this time period (somewhat steeper gains for blacks than for whites).
There remains a gap between the performances of black and white students, but here comes Achilles. New Jersey’s black students performed as well in 2011 as New Jersey’s white students did in 1992. Given the differences between the two groups in circumstances that have always been inextricably connected with student performance, reaching this mark is an accomplishment worthy of celebration.
State average fourth grade mathematics scores of black and white students in 1992 and 2011 on the NAEP.
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, NAEP, 1992, 2000, and 2011 Mathematics Assessments.
Importantly, we also see that the performance of New Jersey’s students was among the very best of all states in both years and for both groups.
If we couple our concerns about American education with the remarkable success shown in these data, it seems sensible to try to understand what occasioned it. Before offering a candidate cause, let me make one observation. A little more than thirty years ago, several
lawsuits were working their way through the courts that challenged the
fairness of local property taxes as the principal source of public school
financing. In California it was Serrano v. Priest and in New Jersey it was Abbott v. Burke; there were others elsewhere. The courts decided that, in order for the mandated “equal educational opportunity” to be upheld, per pupil expenditures in all school districts should be about equal. In order for this to happen, given the vast differences in the tax base across districts, state money had to flow disproportionately to the poorer districts. The fact that substantially increased funding has accompanied these robust improvements in student performance must be considered as a prime candidate in any search for cause.
This conclusion, albeit with a broader purview, was expressed by the economist John Kenneth Galbraith:
Money is not without its advantages and the case to the contrary, although it has often been made, has never proved widely persuasive.
How Much Is Tenure Worth?
The fiscal crises of state and local governments have led to a number of proposals that are remarkable in one way or another. It seems worthwhile to examine one of them: New Jersey Governor Chris Christie’s proposal to end teacher tenure. Although there are many ways of characterizing the value of tenure, I will focus here on the fiscal one.
The fiscal goal of removing tenure is to make it easier, during periods of limited funds, for school administrators to lay off more expensive (i.e., more senior/tenured) teachers in favor of keeping less experienced/cheaper ones. Without the protection of tenure, administrators would not have to gather the necessary evidence, which due process requires, to terminate a senior teacher. Thus, it is argued, school districts would have an easier time controlling payroll costs. How is this likely to work?
To answer that, it helps to ask how the policy of giving public schoolteachers tenure evolved in the first place. The canonical reason given for tenure is usually to protect academic freedom,
This work was made possible through the help of Mary Ann Awad (NYSUT), David Helfman (MSEA), Rosemary Knab (NJEA), and Harris Zwerling (PSEA), who provided the data from New York, Maryland, New Jersey, and Pennsylvania, respectively. They have my gratitude. Sine quibus non.
Governor Christie is not alone in proposing changes to long-standing progressive policies. Most notorious is Wisconsin Governor Scott Walker’s attempt to eliminate collective bargaining among state employees. I assume that many, if not all, of these efforts, which seem only tangentially related to the fiscal crisis, reflected the same spirit expressed by then White House Chief-of-Staff Rahm Emanuel’s observation that one ought not waste a serious crisis.
to allow teachers to provide instruction in what might be viewed as controversial topics. This is surely true, but battles akin to those that resulted from John Scopes’s decision to teach evolution in the face of a dogmatic school board in Dayton, Tennessee, are, happily, rare. The more practical reason that most teachers would want tenure is that it provides them with increased job security in general and, in particular, protection against arbitrary dismissal.
A more interesting question is why did states agree to grant tenure in the first place? In fact, local school boards had no direct say in the matter, for it was mandated by the state. The state officials who made this decision must have known that it would reduce flexibility in modifying school staff, and that it would make following due process in terminating a tenured teacher more time consuming and expensive. Why then is tenure almost uniformly agreed to in all states? I don’t know for sure, but I am sure that most progressive officials appreciate, and value, the importance of academic freedom. But that is not the most pressing practical reason. They recognize that for teachers, tenure is a job benefit, much like health insurance, pensions, and sick time. As such it has a cash value. But it is a different kind of benefit, for unlike the others it has no direct cost. Sure, there are extra expenses when a tenured teacher is to be terminated. But if reasonable care is exercised in hiring and promotion, such expenses occur very rarely. So, I conclude, tenure was instituted to save money, exactly the opposite of what is being claimed by those who seek to abolish it.
Who is right – Governor Christie or I? Happily this is a question that, once phrased carefully, is susceptible to empirical investigation. The answer has two parts. The first part is the title of this chapter: How Much Is Tenure Worth? The second part is: Do we save enough by shifting the salary distribution of staff to compensate for the cost of its removal? I have no data on the latter and so will focus on the former.
One approach to valuing tenure would be to survey teachers and ask them how much money we would have to give them in order for them to relinquish it. The responses to such a
As I mentioned in the introduction to this section, I served for five years as an elected school board member, serving on its Personnel Committee.
survey would surely be complex. Teachers at various stages in their careers would value tenure differently: a teacher near the end of a career might not care very much, whereas someone in mid-career might insist on a much larger number. Teachers in subject areas, like mathematics, in which there is a shortage of qualified instructors, might not ask for much. Others, like elementary schoolteachers, among whom there is a large supply, might hold tenure very dear indeed. One thing is certain: the likely answer to such a survey is going to be many thousands of dollars.
An alternative would be to take an experimental approach and run an experiment in the spirit of those described earlier. Suppose we run the following experiment: we select a few school districts, say five or six, and designate them the experimental districts, in which teachers do not have tenure. Then we pair them with an equal number of districts, matched on all of the usual variables that characterize such districts; these serve as the control districts, and they do have tenure. We also try to have each experimental district geographically near a matched control district.
Now we run the experiment. All of the faculty members from all the districts are put in one giant pool, and the personnel directors of each district are directed to sample from that pool of teachers in order to staff their district. This continues until the pool is empty and all of the districts are fully staffed. At this point we look at a number of dependent variables, principal among them the salary costs in the two groups. We might also want to look at the differences in the conditional distributions of salary, after conditioning on subject expertise and years of experience. I suspect that nontenure districts would pay more for teachers at a given level of training and experience, but be forced to concentrate their recruiting at the lowest end of the scale.
Although I am sure that the experimental (nontenure) districts will have to pay more for their staff, the question of importance is how much more; indeed, they may run out of money before they are fully staffed. I don’t expect an experiment of this sort to be run anytime soon, although I would welcome it – perhaps so too might the teachers.
Happily, we actually have some observational data that shed light on the question. Let us consider the teachers’ salaries shown in the first figure. In it are shown the annual mean salaries of public schoolteachers for four northeastern states along with that of the entire United States. The average salary increase for the four states was just slightly less than $1,500/year (with New Jersey the highest of the four with a mean increase of $1,640). For the United States as a whole, it was about $1,230. The increases over these thirty years have a strong linear component.
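The claim of a “strong linear component” amounts to fitting a least-squares line to each state’s salary series and reading the slope as the mean annual increase. A minimal sketch of that computation; the salary series below is invented for illustration (Wainer’s actual data are not reproduced here) and is constructed to rise by exactly $1,640/year:

```python
def trend_slope(years, salaries):
    """Least-squares slope of salary on year: the mean annual increase."""
    n = len(years)
    mean_y = sum(years) / n
    mean_s = sum(salaries) / n
    cov = sum((y - mean_y) * (s - mean_s) for y, s in zip(years, salaries))
    var = sum((y - mean_y) ** 2 for y in years)
    return cov / var

# Invented, exactly linear series rising $1,640/year (New Jersey's reported
# mean increase); real data would scatter around such a line.
years = list(range(1977, 2010))
salaries = [15_000 + 1_640 * (y - 1977) for y in years]
print(trend_slope(years, salaries))  # 1640.0
```

With real, noisy data the slope would approximate, rather than equal, the generating increase; averaging the slopes across states gives the “average slope of the fitted regression lines” the footnote mentions.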
These data can serve as a background for the rest of the discussion. The next figure repeats the New Jersey results for teachers and adds the mean salary for school superintendents. As we can see, superintendents make more than teachers. In 1980 they made a little bit more, but by 2009 they made a great deal more. Superintendents’ salaries increased, on average, more than $4,000/year. This shows the increasing disparity, but not the time at which this increase began. For this, we need to look at a different plot.
Mean teachers’ salaries for four states and for the United States. (The salary increases cited are, more precisely, the average slopes of the fitted regression lines for each state.)
In the last figure of this trio I have plotted the ratio of average superintendents’ salaries to average teachers’ salaries for the past thirty-three years. We quickly see that in 1977 the average superintendent in New Jersey earned about 2.25 times as much as the average teacher, but this disparity was diminishing, so that in the early 1990s the average superintendent was earning just twice what the average teacher earned. Then the disparity began to increase sharply, so that by 2009 superintendents’ pay was two-and-a-half times that of teachers. What happened that should occasion this dramatic change? In 1991 the New Jersey legislature eliminated tenure for school superintendents. There was a lag between the removal of tenure and the relative increase in superintendents’ wages, which is likely due to the need for existing three- or four-year superintendent contracts to expire before they could be renegotiated in the new nontenure environment. But once this happened, it is clear that superintendents were compensated for the loss of tenure, and that compensation involved tens of thousands of dollars.
Salaries for New Jersey superintendents and teachers, 1977–2009.
And now the real irony. New Jersey’s legislators are well aware of the cost of eliminating tenure. On August 19, 2010 Assemblywoman Joan Voss, vice chair of the Assembly Education Committee, decried in a press release the “[b]loated salaries and over-inflated, lavish compensation packages taxpayers are currently forced to fund.” To remedy this, she and Assemblyman Ralph Caputo proposed legislation (A-2359) that would reinstitute the system of career tenure for school superintendents.
Salaries of school superintendents in New Jersey, relative to those of teachers (with fitted quadratic function), soared after career tenure was removed in 1991.
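The “fitting quadratic function” in the plot refers to fitting a second-degree polynomial to the salary-ratio series, which captures its fall-then-rise shape. A sketch with an invented ratio series (the real values – about 2.25 in 1977, near 2.0 in the early 1990s, roughly 2.5 by 2009 – are only approximated):

```python
import numpy as np

# Invented ratio series shaped like the one described in the text:
# ~2.25 in 1977, dipping toward 2.0 around 1990, rising toward ~2.5 by 2009.
t = np.arange(0, 33)                      # years since 1977
ratio = 2.25 - 0.025 * t + 0.001 * t**2   # exactly quadratic, for the demo

# np.polyfit returns coefficients highest degree first; on exactly quadratic
# data it recovers the generating coefficients.
c2, c1, c0 = np.polyfit(t, ratio, deg=2)
print(round(c2, 4), round(c1, 4), round(c0, 4))  # 0.001 -0.025 2.25
```

On real data the fitted curve smooths year-to-year noise; the positive quadratic coefficient is what records the post-1991 turnaround.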
If It Could Have Been, It Must Have Been
My summer was interrupted by a phone call from an old friend. It seemed that he was hired as a statistical consultant for an unfortunate young man who had been accused of cheating on a licensing exam. My friend, a fine statistician, had concluded that what led to the young man’s problem was directly due to fallacies associated with the unprincipled exploratory analyses of big data. The accusers had reached their decision because they did not consider the likelihood of false positives. As we shall see, a little thought should precede the rush to calculate.
Fishing for Cheaters
The young man had taken, and passed, a licensing test. It was the third time he took the exam – the first two times he failed by a small amount,
Si cela aurait pu être, il doit avoir été. (I would like to have translated this into Latin, something akin to the well-known “Post hoc, ergo propter hoc,” but my Latin was not up to the task; hence the French version. I would be grateful for classical help.)
but this time he passed, also by a small amount. The licensing agency, as part of their program to root out cheating, did a pro forma screening of all pairs of examinees based on the number of incorrect item responses those examinees had in common. For the eleven thousand examinees they calculated this index for all pairs, and found that forty-six examinees (twenty-three pairs) were much more similar than one would expect by chance. Of these twenty-three pairs, only one took the exam at the same time in the same room, and they sat but one row apart. An investigator then examined the test booklet, where examinees could do scratch work before deciding on the correct answer. The investigator concluded on the basis of this examination that there was insufficient evidence that the examinee had actually done the work, and so reached the decision that the examinee had copied and his score was not earned. His passing score was then disallowed, and he was forbidden from applying to take this exam again for ten years. The second person of the questioned pair had sat in front, and so it was decided that she could not have copied, and hence she faced no disciplinary action. The examinee who was to be punished brought suit.
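The agency’s screen can be sketched as follows. This is my reconstruction of the general idea – counting wrong answers in common across every pair of examinees – not the agency’s actual index (operational indices also model item difficulty and the attractiveness of each wrong option):

```python
from itertools import combinations

def shared_wrong(resp_a, resp_b, key):
    """Count items both examinees missed with the same wrong option."""
    return sum(1 for a, b, k in zip(resp_a, resp_b, key) if a == b and a != k)

def flag_pairs(responses, key, threshold):
    """Flag every pair whose shared-wrong-answer count reaches threshold.
    With ~11,000 examinees this visits ~60 million pairs, so even a tiny
    false-positive rate per pair yields many flagged pairs overall."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(responses), 2)
            if shared_wrong(a, b, key) >= threshold]

key = "ABCDA"
responses = ["ABCDA",   # all correct: shares no wrong answers with anyone
             "BBCDC",   # wrong on items 0 and 4
             "BBCDC",   # same wrong options on both: flagged
             "CBCDD"]   # also wrong on items 0 and 4, but different options
print(flag_pairs(responses, key, threshold=2))  # [(1, 2)]
```

The design choice that matters is the denominator: trawling all pairs turns the screen into an exploratory fishing expedition, which is exactly what the professional standards discussed next forbid.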
Industry Standards
There are many testing companies. Most of them are special-purpose organizations that deal with one specific exam. Typically their resources are too limited to allow the kinds of careful, rigorous research required to justify the serious consequences that an investigation of cheating might bring. Happily, guidance exists in the published standards of the three major professional organizations that are concerned with testing. Any deviation from these established practices would require extensive (and expensive) justification. The key standards for this situation follow.
(A note in passing: two examinees can share many wrong answers quite innocently if, for example, they guessed on a large proportion of the items and just chose the same option repeatedly.)
Statistical evidence of cheating is almost never the primary motivator of an investigation. Usually there is an instigating event, for example, a proctor’s report, an accusation of having cheated, other examinees reporting knowledge of collusion, or an extraordinarily large increase in score from a previous time. Only when such reports are received and documented is a statistical analysis of the sort done in this case prepared – but as confirmation. Such analyses are not done as an exploratory procedure.
The subject of the investigation is the score, not the examinee. After an examination in which all evidence converges to support the hypothesis of copying, the testing organization simply states that the testing company cannot stand behind the validity of the test score and hence will not report it to whoever required the result.
The finding is tentative. Typically the examinee is provided with five options:
(i) The test taker may provide information that might explain the questioned circumstances (e.g., after an unusually large jump in score the test taker might bring a physician’s note confirming severe illness when the test was originally taken). If accepted, the score obtained is immediately reported.
(ii) The test taker may retake the examination, at no cost and at a convenient time, to confirm the score being questioned. If the retest score confirms the questioned score, the original score is confirmed. If it is not, the test taker is offered a choice of other options.
(iii) The test taker may choose independent arbitration of the issue by the American Arbitration Association. The testing organization pays the arbitration fee and agrees to be bound by the arbitrator’s decision.
(iv) The test taker may opt to have the score reported accompanied by the testing organization’s reason for questioning it and the test taker’s explanation.
(v) The test taker may request cancellation of the score. The test fee is then refunded or applied to a future administration.
Option (ii) is, by far, the one most commonly chosen.
The American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education jointly publish Standards for Educational and Psychological Testing. The most recent edition appeared in 1999. The committee that prepared each edition of the Standards included members who at one time or another had spent significant time at one (or more) of the major testing organizations. The preceding synthesis was derived from that report.
Why Are These Standards Important?
Obviously, all three of these standards were violated in this summer’s investigation. And, I will conclude, the testing organization ought to both remedy the wrong done to this examinee and repair its procedures. Long practice and the wisdom borne of experience have led serious testing organizations to scrupulously follow these guides. The example of mammograms, which follows, shows how the problem of false positives is the culprit that bedevils attempts to find rare events.
False Positives and Mammograms
Annually about 180,000 new cases of invasive breast cancer are diagnosed in women in the United States. About forty thousand of these women are expected to die from breast cancer. Breast cancer is second only to skin cancer as the most commonly diagnosed cancer, and second only to lung cancer in death rates. Among U.S. women, about one in four cancers is breast cancer, and one out of every eight U.S. women can expect to be diagnosed with breast cancer at some time in their lives.
However, some progress in the battle against the horrors of breast cancer has been made. Death rates have been declining over the past twenty years, due at least in part to screening with mammograms. The strategy is then to investigate any unusual lumps found by these screenings with further tests, most particularly a biopsy.
How effective are mammograms? One measure of their effectiveness is characterized in a simple statistic. If a mammogram is found to be positive, what is the probability that it is cancer? We can estimate this probability from a fraction that has two parts. The numerator is the number of breast cancers found, and the denominator is the number of positive mammograms. The denominator contains both the true and the false positives.
The denominator has two parts: the true positives, 180,000, plus the false positives. How many of these are there? Each year thirty-seven million mammograms are given in the United States. The accuracy of mammograms varies from 80 percent to 90 percent depending on circumstances; for this discussion let us credit them with 90 percent accuracy. This means that when there is a cancer, 90 percent of the time it will be found, and when there is no cancer, 90 percent of the time it will indicate no cancer. But this means that 10 percent of the time it will indicate a possible cancer when there is none. So, 10 percent of thirty-seven million mammograms yields 3.7 million false positives. And so the denominator of our fraction is 180,000 plus 3.7 million, or roughly 3.9 million positive mammograms.
Therefore the probability of someone with a positive mammogram having breast cancer is 180,000/3.9 million, or about 5 percent. That means that 95 percent of the women who receive the horrible news that their mammogram was positive do not, in fact, have breast cancer.
There is a huge research literature on breast cancer, much of which looks into this very question. One study found that mammography had an accuracy of 78 percent, but when combined with ultrasound this was boosted to 91 percent. So the figure that I use of 90 percent accuracy for mammograms alone does no damage to the reputation of mammography. Of key importance, this 90 percent figure is the probability of finding a cancer given that it is there. But this is not what we want. We are interested in the inverse question: what is the probability of cancer given that the test says it is there. To answer it formally requires a simple application of Bayes’s Theorem, which I do, informally, in the subsequent derivation.
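The informal Bayes calculation above is easy to check mechanically. A sketch following the text’s own simplification (all 180,000 cancers are counted as true positives, rather than the 90 percent a fully consistent treatment would use):

```python
cancers = 180_000                 # annual U.S. invasive breast cancer cases
mammograms = 37_000_000           # annual U.S. mammograms
false_positive_rate = 0.10        # 90 percent accuracy, as in the text

false_pos = false_positive_rate * mammograms          # 3.7 million
ppv = cancers / (cancers + false_pos)                 # P(cancer | positive)
print(round(100 * ppv, 1))                            # 4.6 -> "about 5 percent"

# Without any prescreening: test all three hundred million Americans.
false_pos_all = false_positive_rate * 300_000_000     # 30 million
ppv_all = cancers / (cancers + false_pos_all)
print(round(100 * ppv_all, 1))                        # 0.6 -> ~99.5% of positives wrong
```

The second computation is the no-prescreening scenario discussed a few paragraphs later; the point is that the positive predictive value collapses as the screened population grows while the number of true cases stays fixed.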
Is a test with this level of accuracy worth doing? The answer requires weighing the costs of doing mammograms versus the costs of not doing them.
The standards used in doing mammograms follow industry standards of testing. Obviously the 5 percent figure would be much worse if those standards were not followed, and we note that it is never recommended that the entire population of the United States have annual mammograms. Instead there is some prescreening based on other characteristics. Modern standards suggest that only women who have a family history of breast cancer or are more than a specified age be screened regularly, and serious thought is being given to revising these to make them more restrictive still.
The third standard is also followed (with option (ii) the most frequently chosen one even in this situation), in that anyone tagged as positive is retested rather than treated on the spot. Consider what would happen if mammograms were administered as carelessly as the statistical screen that snared the unfortunate victim of the current investigation. First, instead of thirty-seven million mammograms with 3.7 million false positives, we would have three hundred million people tested, yielding thirty million false positives. This would mean that about 99.5 percent of all positive mammograms would be wrong – making the test almost worthless as a diagnostic tool.
Suppose after a positive mammogram, instead of continued testing, we moved straight to treatment including, but not limited to, a mastectomy with radiation and chemotherapy. Obviously, this would be a remarkably dopey idea.
Yet is that not the moral equivalent of barring an examinee from the profession for which they have been training for several years? This is especially troubling when the accusation rests on statistical evidence alone.
The cost associated with not doing a mammogram is measured in the number of women whose cancers go undetected, and such costs have been central to justifying the widespread use of mammograms. But in 2010, a large Norwegian study that examined the efficacy of modern, directed therapy showed that there was essentially no difference in survival rates of women who had mammograms and those who did not (Kalager et al.). This led to the suggestion that mammogram use be much more limited.
How Likely Is Cheating in the Case in Question?
The short answer is, “I don’t know.” To calculate this we would need to know (1) how many cheaters there are – analogous to how many cancer cases there are – and (2) how accurate the detection procedure is. We know neither, but we can treat them as we would any missing data and, through multiple imputations, develop a range of plausible values. I will impute a single value for these unknowns but invite readers to stick in as many others as you wish to span the space of plausibility.
Suppose there are one hundred true cheaters among the eleven thousand examinees. If this number seems unreasonable to you, substitute another. Suppose also that the statistical screen is 90 percent accurate at identifying a cheater (although because the reliability of the test is about 0.80, this is surely too generous), and that it is equally accurate, 90 percent, at identifying an honest examinee. This too is unrealistic because the screen has little power against an examinee who only copied a very few items, or who, by choosing a partner wisely, only copied correct answers. But let us see where these assumptions carry us.
We are trying to estimate the probability of being a cheater given that the screen has flagged you. The numerator of our fraction is the ninety honest-to-goodness cheaters identified with 90 percent accuracy. The denominator is the true cheaters identified – 90 – plus the 1,100 false positives (10 percent of the eleven thousand honest examinees). So the fraction is 90/1,190 = 7.6 percent.
Or, given these assumptions, more than 92 percent of those identified as cheaters were falsely accused! If you believe that these assumptions are wrong, change them and see where it leads. The only way to escape this conclusion is to assume that the screen is essentially 100 percent accurate. As Damon Runyon pointed out, “nothing in life is more than 3 to 1.”
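The same arithmetic, wrapped in a function so readers can “stick in” other values, as the text invites. The false-positive count follows the text in taking 10 percent of all eleven thousand examinees (not of the 10,900 honest ones):

```python
def prob_cheater_given_flag(n_examinees, n_cheaters, hit_rate, honest_rate):
    """P(true cheater | flagged), under the text's illustrative assumptions."""
    true_hits = hit_rate * n_cheaters              # e.g., 90 of 100 cheaters
    false_hits = (1 - honest_rate) * n_examinees   # e.g., 1,100 false positives
    return true_hits / (true_hits + false_hits)

p = prob_cheater_given_flag(11_000, 100, 0.90, 0.90)
print(round(100 * p, 1))   # 7.6 -- so over 92 percent of flags are false
```

Substituting other plausible values (more cheaters, a sharper screen) moves the number around, but any realistic false-positive rate keeps the majority of flags false.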
In the current situation the pool of positive results was limited to forty-six individuals. How can we adapt this result to the model illustrated by the mammogram example? One approach might be to treat those forty-six as being drawn at random from the population of the 1,190 yielded by our example, thus maintaining the finding that only 7.6 percent of those identified as cheaters were correctly so identified.
As noted, the screening procedure has little power in finding an examinee that only copied a few answers from a neighbor. This reality might have suggested that the cost of missing many cheaters was accepted in order to catch those who had not only copied a substantial number of items, but also were unfortunate enough to have copied a substantial number of wrong answers. The screen was set so fine that only forty-six people out of eleven thousand were snared as possible cheaters. Then they were forced to conclude that twenty-two of the twenty-three pairs of examinees were false positives because of the physical impossibility of copying. Thus, despite the uncontroversial evidence that at least twenty-two of twenty-three were false positives, they concluded that if it could be, then it was. It is certainly plausible that all twenty-three pairs were false positives and that one of them happened to occur in the same testing center. This conclusion is bolstered by the facts that the validity of the statistical index was never established and that the scratch work was a red herring; for although we know that the one candidate in question had what were deemed insufficient supporting calculations, we don’t know how common such sparse scratch work was among examinees who were never questioned.
It seems clear from these analyses that the evidence is insufficient to support the punishment meted out.
Of course, it is important for any testing organization to impose standards of proper behavior upon its examinees. A certified test score must mean what it claims to mean. But any screening procedure is imperfect. And the costs of its imperfections are borne by the innocents as well as the guilty. With mammograms, the inconvenience and expense involved for the vast majority of women who are falsely flagged is the price paid for the small percentage of women whose lives are lengthened because of early detection. Sensible reforms are those aimed at reducing the false positives without increasing the risk for those women with cancer.
The arguments of those opposed to capital punishment have been strengthened enormously by the more than two hundred innocent men released from long prison terms in the twenty years since DNA evidence became more widely used. This result dramatically makes the point that we should shun irreversible measures when alternatives are available.
After extensive and expensive legal wrangling, the testing organization relented and allowed the young man to retake the exam under heightened security. He passed.
When Nothing Is Not Zero
A True Saga of Missing Data, Adequate Yearly
Progress, and a Memphis Charter School
One of the most vexing problems in all of school evaluation is missing data. Whether a school, a teacher, or a student is being evaluated on the basis of their performance, the most common missing data are test scores.
If we want to measure growth we need both a pre- and a postscore.
What are we to do when one, the other, or both are missing? If we want
to measure school performance, what do we do when some of the student
test scores are missing?
Sometimes the missingness can be ignored, but usually this approach is reasonable only when the proportion of missing data is very small. Otherwise the most common strategy is “data imputation” (discussed, along with an example of its misuse, in an earlier chapter). Data imputation involves deriving some plausible numbers and inserting them in the holes. How we choose those numbers depends on the situation and on what ancillary information is available.
Promise Academy is an inner-city charter school in Memphis, Tennessee, that enrolled students in kindergarten through fourth grade for the 2010–11 school year. Its performance has been evaluated on many criteria, but of relevance
here is the performance of Promise students on the state's reading/language arts (RLA) test. This score is dubbed its Reading/Language Arts Adequate Yearly Progress (AYP) and it depends on two components: the scores of third- and fourth-grade students on the RLA portion of the test and the performance of fifth-grade students on a separate writing test.
We observe the RLA scores, but because Promise Academy does not have any fifth-grade students, all of the writing scores are missing. What scores should we impute to allow us to calculate a plausible total score?
The state of Tennessee's rules require a score of zero be inserted. In some circumstances imputing scores of zero might be reasonable. For example, if a school only tests half of its students, we might reasonably infer that it was trying to game the system by choosing to omit the scores from what are likely to be the lowest-performing students. This strategy is mooted by imputing a zero for each missing score. But this is not the situation for Promise Academy. Here the missing scores are structurally missing– they could not submit fifth-grade scores because they have no fifth graders! Yet the imputed zeros dragged down the school's AYP enough to put the state on a path toward revoking Promise Academy's charter.
What scores should we impute for the missing ones so that we can more sensibly compute the required AYP summary? A good starting point is the performance of all Tennessee schools on the third- and fourth-grade RLA tests. We see that Promise Academy did reasonably well on both the third- and fourth-grade tests.
Not surprisingly, the calculated figure for AYP can be estimated from third- and fourth-grade RLA performance alone. It will not be perfect, because AYP also includes fifth-grade performance on the writing test, but it turns out to estimate AYP very well indeed. A scatter plot of the bivariate distribution of all the schools, with their AYP RLA score on the horizontal axis and the predicted AYP RLA score on the vertical axis, makes this clear. Drawn in on this plot is the regression line that provides the prediction of AYP RLA from the best linear combination of third- and fourth-grade test scores. In this plot we have included Promise Academy. As can be seen from the plot, the prediction equation is:

AYP = 0.12 + 0.48 × (3rd-grade RLA) + 0.44 × (4th-grade RLA)
Promise Academy’s predicted AYP score is much higher than its actual
AYP score (0.33 predicted vs. 0.23 actual) because it does not have any
fth graders and hence the writing scores were missing. e di\terence
ing score for that missing value. e tted straight line gives the best
estimate of what Promise’s AYP would have been in the counterfactual
case of their having had fth graders to test. We thus estimate that they
would have nished thirty-seventh among all schools in AYP rather than
eighty-eighth based on the imputation of azero.
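The regression-based imputation described above can be sketched directly. The coefficients come from the prediction equation in the text; the two grade-level scores passed in below are hypothetical stand-ins, since the school-level inputs are not reproduced here.

```python
# Regression-based prediction of AYP RLA from 3rd- and 4th-grade RLA scores.
# The coefficients are those given in the text; the input scores are
# hypothetical illustrations, not Promise Academy's actual values.

def predict_ayp(rla_3rd: float, rla_4th: float) -> float:
    """Best linear prediction of AYP RLA from the two observed grade scores."""
    return 0.12 + 0.48 * rla_3rd + 0.44 * rla_4th

# Hypothetical scores for a school with no fifth graders:
print(round(predict_ayp(0.20, 0.25), 3))  # 0.326
```

Using the fitted line this way imputes a value consistent with everything the school's existing students did, rather than the structurally impossible zero.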
This dramatic shift in Promise Academy's rank is clearer when the two AYP scores are displayed graphically.
Figure: The distributions of all Tennessee schools on RLA for third and fourth grades, with Promise Academy's scores emphasized.
Standard statistical practice would strongly support imputing an estimated score for the impossible-to-obtain fifth-grade writing score, rather than the grossly implausible value of zero.
The solid performance of Promise Academy's third and fourth graders on the language arts tests they took was paralleled by their performance in mathematics. In a companion display we can compare the math performance of Promise's students with those from the 110 other schools directly. This adds further support to the estimated AYP RLA score that we derived for Promise Academy.
The analyses performed thus far focus solely on the actual scores of the students at Promise Academy. They ignore entirely one important aspect of the school's performance that has been a primary focus for the evaluation of Tennessee schools for more than a decade– value added. It has long been felt that having a single bar over which all schools must pass was unfair to schools whose students pose a greater challenge for instruction. Thus, evaluators have begun to assess not just what level the students reach, but also how far they have come. To do this, a complex statistical model, pioneered by William Sanders and his colleagues, is fit to “before
Figure: A scatter plot comparing the actual AYP on the horizontal axis with its predicted value obtained without using fifth-grade writing on the vertical axis.
and after” data to estimate the gains– the value added– that characterize the students at a particular school. That work provides ancillary information to augment the argument made so far about the performance of Promise Academy. Two independent research organizations (Stanford's Center for Research on Education Outcomes and Tennessee's own Value-Added Assessment System) both report that in terms of value added, Promise Academy's instructional program has yielded results that place it among the best-performing schools in the state. Though many issues regarding the utility of the value-added models remain unresolved, none of them would suggest that a large gain implies a bad result.
When data are missing, there will always be greater uncertainty in
the estimation of any summary statistic that has, as one component, a
missing piece. When this occurs, standards of statistical practice require
Figure: A comparison of the actual AYP scores based on all three years of data with their predicted values based on just third- and fourth-grade data.
that we use all available ancillary data to estimate what was not measured.
In the current case, the missing piece was the performance of fifth graders in a school that ends at fourth grade. In this instance, it is clear from the third- and fourth-grade information in both RLA and mathematics, as well as from the value-added results, that the AYP score obtained by imputing a zero score for the missing fifth-grade writing assessment grossly misestimates the school's performance.
In October 2011 the school presented its case to the Memphis City Schools administration. Using all the information available, the administration agreed, and the school's charter was renewed for another year, at which time Promise Academy's structurally missing data will vanish.
But the lesson from this experience is important and should be remembered: nothing is not always zero.
Figure: A display of the performance of all Tennessee schools in third- and fourth-grade mathematics shows that the performance of Promise Academy's students in RLA was not an isolated instance.
Musing about Changes in the SAT
During the first week of March 2014 the front page of most U.S. newspapers reported a story released by the College Board of big changes planned for the SAT. These were amplified and augmented by the cover story of the March 9 issue of the New York Times’ Sunday Magazine (Balf 2014).
Having spent twenty-one years as an employee of the Educational Testing Service, the College Board's principal contractor in the production, administration, and scoring of the SAT, I read these reports with great interest. In the end I was left wondering why these changes, in particular, were being proposed and, moreover, why all the hoopla.
The announced changes were three:
1. The scoring would no longer have a penalty for guessing.
2. They would reduce the amount of arcane vocabulary used on the test.
3. The required test would contain just the canonical verbal and quantitative sections, leaving writing as a separate, optional section.
By my reckoning the first change is likely to have only a small effect, but it probably will increase the error variance of the test, making the scores a bit noisier. The second change does not strike me as much of a problem, but it is always wise to be vigilant about including item characteristics that are unrelated to the trait being measured; I am not sanguine about any changes being implemented successfully. The third change is probably a face-saving reaction to the 2005 introduction of the writing section, which did not work as hoped, and is the one modification that is likely to yield a positive effect.
No Penalty for Guessing
The current SAT uses what is called “formula scoring.” That is, an examinee's score is equal to the number right diminished by one-fourth the number wrong (for the kinds of five-choice items commonly used on the SAT). If an examinee guesses completely at random among the choices on the items for which they don't know the answer, then for every four items they answer wrong by guessing they will, on average, get one right by chance. So, under these circumstances, the expected gain from guessing is zero. Thus there is neither any benefit to guessing nor does guessing add bias to the scores, although there is the binomial variance due to guessing that is unnecessarily added to the score. Note that if an examinee has partial knowledge, and can eliminate one or more of the distractors (the wrong answers), the expected gain from guessing, even after correction, is positive, thus giving some credit for such partial knowledge. The proposed change would do away with this correction.
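The expected-value argument above is easy to verify with a few lines of arithmetic. This is a minimal sketch; the function and its parameters are illustrative, with the 1/(k − 1) penalty being the general form of the one-fourth deduction described in the text.

```python
# Expected formula-score gain from guessing on one k-choice item,
# after eliminating `eliminated` distractors through partial knowledge.

def expected_gain(k: int, eliminated: int = 0) -> float:
    remaining = k - eliminated          # options still in play
    p_right = 1.0 / remaining           # chance a random guess is correct
    p_wrong = 1.0 - p_right
    penalty = 1.0 / (k - 1)             # deduction per wrong answer
    return p_right * 1.0 - p_wrong * penalty

print(expected_gain(5))     # 0.0    -> blind guessing neither helps nor hurts
print(expected_gain(5, 1))  # 0.0625 -> partial knowledge earns positive credit
```

The zero expectation for blind guessing is exactly why the correction neither rewards nor punishes the guesser; only the added variance remains.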
What is likely to be the effect? My memory is that the correlation between formula scores and simple number-right scores is extremely high, so whatever change occurs will probably be small. But, maybe not, for removing the correction rewards random guessing. Such guessing adds no information to the score, just noise, so it is hard to make a coherent argument for why we would want to encourage it. But perhaps it was decided that if the effect of making the change is
Or (k − 1) for k-choice items.
small, why not do it– perhaps it would make the College Board look
responsive to critics without making any real change.
Reduce the Amount of Arcane Vocabulary
Murphy’s December 2013
article), but the meaning of arcane, in
Denition:arcane (adjective)
known or understood by very few;
She knew a lot about Sanskrit grammar and other arcane
Arcane words, defined in this way, bring to mind obscure words used in very narrow contexts. One such word names the seven-and-a-half-minute periods that make up a polo match and is rumored to have last been used as part of an SAT more than sixty years ago; a second derives from the Welsh word for dwarf.
But this does not seem to characterize what critics of SAT vocabulary have in mind. An even dozen words have been used to illustrate the complaint, among them transform and unscrupulous. Words of this character are less often heard in common conversation than met in serious reading. Wanting to rid the SAT of the lexical richness of words accumulated through broad reading seems hard to justify. I will not try.
How much arcane vocabulary is on the SAT? I suspect that, using the true definition of arcane, there is close to none. Using my modified definition of “literary” vocabulary, there is likely some; but with the promised shift to including more “foundational” documents on the test (e.g., Declaration of Independence, Federalist Papers), it seems unavoidable that certain kinds of literary, if not arcane, vocabulary will show up. In an introductory paragraph of Alexander Hamilton's Federalist #1 I found a fair number of my illustrative dozen (indicated in the box below).
Federalist #1 General Introduction
by Alexander Hamilton
“An over-scrupulous jealousy of danger to the rights of the people, which is more commonly the fault of the head than of the heart, will be represented as mere pretense and artifice, the stale bait for popularity at the expense of the public good. It will be forgotten, on the one hand, that jealousy is the usual concomitant of love, and that the noble enthusiasm of liberty is apt to be infected with a spirit of narrow and illiberal distrust. On the other hand, it will be equally forgotten that the vigor of government is essential to the security of liberty; that, in the contemplation of a sound and well-informed judgment, their interest can never be separated; and that a dangerous ambition more often lurks behind the specious mask of zeal for the rights of the people than under the forbidding appearance of zeal for the firmness and efficiency of government. History will teach us that the former has been found a much more certain road to the introduction of despotism than the latter, and that of those men who have overturned the liberties of republics, the greatest number have begun their career by paying an obsequious court to the people; commencing demagogues, and ending tyrants.”
Is supporting the enrichment of language with unusual words necessarily a bad thing? I found Hamilton's “General Introduction” to be lucid and well argued. Was it so in spite of his vocabulary? Or because of it? James Murphy, in his December 2013 article, “The Case for SAT Words,” argues persuasively in support of enrichment. I tend to agree.
But some pointlessly arcane words may still appear on the SAT and rarely appear anywhere else (akin to words that today appeared in my own writing for the first time). If such vocabulary is actually on the SAT, how did it find its way there? This question probably has many answers, but I believe that they all share a common root. Consider a typical item on the verbal section of the SAT (or any other verbal exam)– say a verbal reasoning item or a verbal analogy. Its difficulty can come from the complexity of the reasoning or the complexity of the analogy. It is a sad, but inexorable fact about test construction that item writers cannot write items that are more difficult than they are smart. And so the distribution of item difficulties looks a great deal like the distribution of item-writer ability. But, in order to discriminate among candidates at high levels of ability, test specifications require a fair number of difficult items. How is the item writer going to respond when her supervisor tells her to write ten hard items? Often the only way to generate such items is to dig into a thesaurus and insert words that are outside broad usage (the very definition of arcane).
Clearly the inclusion of such vocabulary is not directly related to the trait being tested (e.g., verbal reasoning), any more than obscure wording would make a mathematics item a better measure of mathematics. If the proposed change curtails this practice, I applaud it. But how, then, will difficult verbal items be generated? One possibility is to hire much smarter item writers; but such people are not easy to find, nor are they cheap. The College Board's plan may work for a while, as long as unemployment among Ivy League graduates remains high, but as the job market improves such talent will become rarer. Thus, so long as the need for difficult verbal items remains, I fear we may see the inexorable seepage of a few arcane words back onto the test. But with all the checks and edits a prospective SAT item must negotiate, I don't expect to see many.
Making the Writing Portion Optional
To discuss this topic fully, we need to review the purposes of a test. There are at least three:
1. Test as contest– here the most important characteristic a test must have is fairness.
2. Test as measuring instrument– the outcome of the test is used for further decision making, and so what matters most is accuracy.
3. Test as prod– Why are you studying? I have a test. Or, more particularly, why does the instructor insist that students write essays? Because they will need to write on the test. For this purpose the test doesn't even have to be scored, although that practice would not be sustainable.
With these purposes in mind, why was the writing portion added to the core SAT in 2005? I don't know. I suspect for a combination of reasons, but principally (3), as a prod.
Why a prod and not for other purposes?
Scoring essays, because of its inherent subjectivity, is a very difficult task on which to obtain much uniformity of opinion. More than a century ago it was found that there was as much variability among scorers of a single essay as there was across all the essays seen by a single scorer. After uncovering this disturbing result the examiners concluded that essay scoring was too unreliable to be trusted. A more modern analysis of a California writing test found that the variance component for raters was the same as that for examinees. So a hundred years of experience training essay raters didn't help.
A study done in the mid-1990s used a test made up of three thirty-minute sections. Two sections were essays, and one was a multiple-choice test of verbal ability. The multiple-choice score correlated more highly with each of the essay scores than the essay scores did with each other. This means that if you want to predict how an examinee will do on some future writing task, you can do so more accurately with the multiple-choice verbal score than with a score from another essay.
This conclusion is supported in the words of Richard Atkinson, who was president of the University of California system at the time of the addition of the writing section (and who was seen as the prime mover behind the changes). He was quite explicit that he wanted to use it as a prod– “an important aspect of admissions tests was to convey to students, as well as their teachers and parents, the importance of learning to write and the necessity of mastering at least 8th through 10th grade mathematics” and “From my viewpoint, the most important reason for changing the SAT is to send a clear message to K–12 students, their teachers and parents that learning to write and mastering a solid background in mathematics is of critical importance. The changes that are being made in the SAT go a long way toward accomplishing that goal.” (accessed August 24, 2015).
Thus, I conclude that the writing section must have been included primarily as a prod, so that teachers would emphasize writing as part of their curriculum. Of course, the writing section also included a multiple-choice portion (which was allocated thirty-five of the sixty minutes for the whole section) to boost the reliability of the scores. The essay itself drew criticism from teachers of writing, who claimed, credibly, at least to me, that allocating twenty-five minutes yielded no real measure of a student's ability. This view was supported by the sorts of canned general essays that coaching schools had their students memorize. Such essays contained the key elements of a high-scoring essay (four hundred words long, three quotations from famous people, seven complex words, and the suitable insertion of some of the words in the “prompt” that instigated the essay).
Which brings us to the point where we can examine what has moved the College Board to reverse field and start removing it. I suspect that at least part of the reason was that the College Board had unreasonable expectations for a writing task that could be administered in the one hour available for it. It was thus an unrealistic task that was expensive to administer and score, and it yielded an unreliable measure of little value.
Making it optional, as well as scoring it on a different scale than the rest of the test, allows the College Board to abandon it gracefully and gradually. Considering the resources planned for the continued development and scoring of this section, it appears that the College Board is guessing that very few colleges will require it and few students will elect to take it.
The SAT has been in existence, in one form or another, since 1926. Its character was not arrived at by whim. Strong evidence, accumulated over those nine decades, supports many of the decisions made in its construction. But it is not carved in stone, and changes have occurred continually. However, those changes were small, inserted with the hope of making an improvement if they work and not being too disastrous if they do not.
This follows the best advice of experts in quality control and has served the College Board well. The current changes fall within these same limits. They are likely to make only a very small difference, but with luck the difference will be a positive one. The most likely place for an improvement is in the handling of the writing section.
Some insight into the hoopla is provided by recalling a conversation between the presidents of Dartmouth and Yale. Dartmouth had just
finalized the plans to go coed and successfully avoided the ire of those alumni who inevitably tend to oppose any changes. Yale was about to make the same move.
At the same time that Dartmouth made the enrollment change, they also switched their mascot from the Dartmouth Indian to the Big Green. Alumni apparently were so up in arms about the change in mascot that they hardly noticed the girls. By the time they did, it was a fait accompli (and they then noticed that they could now send their daughters to Dartmouth and were content).
Could it be that the College Board's announced changes vis-à-vis guessing and arcane vocabulary were merely their bulldog, which they planned to use to divert attention from the reversal of opinion represented by the diminution of importance of the writing section? Judging from the reaction in the media to the College Board's announcement, the diversion seems to have worked.
For Want of a Nail
Why Worthless Subscores May Be Seriously
Impeding the Progress of Western Civilization
How’s yourwife?
Compared towhat?
Henny Youngman
Tests used to evaluate mastery of coursework, to choose among applicants for college admission, or to license candidates for various professions are often marathons. Tests designed to evaluate knowledge of coursework typically use the canonical hour, admissions tests are usually two to three hours, and licensing exams can take days. Why are they as long as they are? The first answer that jumps to mind involves the relation between a test's length and its reliability. And so to know how long a test should be we must
Issues of overtesting have been prominently in the news since the implementation of “No Child Left Behind” because it is felt that too much of students' instructional time is being used for testing. The argument I make here is certainly applicable to those concerns but is more general. I will argue that for most purposes tests, and hence testing time, can be reduced substantially and still serve those purposes.
Reliability is a measure of the stability of a score, essentially telling you how much the score would change were the person to take another test on the same subject later, with all else kept constant. In an earlier chapter I defined reliability in a heuristic way, as the amount of evidence brought into play to support a claim. High reliability typically implies a great deal of evidence. This definition fits well here, for the longer the test, the more reliable the score.
first ask, “How much reliability is needed?” Which brings us to Henny Youngman's quip, compared to what? There is a long answer to this, but the short one is that reliability grows quickly at first and then flattens, so the marginal gain in reliability with increases in test length soon becomes small. The accompanying figure shows the reliability of a typical professionally prepared test as a function of its length. It shows that the marginal gain of moving from a thirty-item test to a sixty- or even longer test is small unless extraordinarily high reliability is required. As we will show shortly, it is hard to think of a situation where that would be the case, so why are tests as long as they are? What is to be gained through the excessive length? What is lost?
A Clarifying Example– the U.S. Census
Our intuitions can be clarified with an example, the Decennial U.S. Census. On midnight of January 1, 2010, the year of the last Census, it was estimated that there were 308,745,538 souls within the borders of the United States. The cost of the 2010 Census was $13 billion, or approximately $42 per person. Is it worth it? To begin answering that question, note that a single clerk, with access to an adding machine, could in a minute or two estimate the change from the previous Census (to an accuracy of ±0.1 percent).

Figure: Spearman-Brown function showing the reliability of a test as a function of its length, if a one-item test has a reliability of 0.15.
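The figure's Spearman-Brown function can be computed directly. A minimal sketch, assuming the one-item reliability of 0.15 given with the figure; the formula is the standard Spearman-Brown prophecy formula, r_n = n·r1 / (1 + (n − 1)·r1).

```python
# Spearman-Brown prophecy formula: reliability of an n-item test given
# the reliability r1 of a single item (r1 = 0.15 is the figure's
# illustrative value, not a measured one).

def spearman_brown(n: int, r1: float = 0.15) -> float:
    return n * r1 / (1 + (n - 1) * r1)

# Reliability climbs steeply at first, then flattens:
for n in (10, 30, 60, 90):
    print(n, round(spearman_brown(n), 3))
```

The flattening is the whole point of the chapter's argument: doubling a thirty-item test buys only a few hundredths of reliability.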
It doesn’t take a congressional study group and the O\bce of
was just that single number it would be a colossal waste of taxpayer
money. However, the constitutionally mandated purpose of the Census
is far broader than just providing a single number. It must also provide
the estimates used by states to allocate congressional representation, as
well as much, much narrower estimates (small area estimates, like “how
many households with two parents and three or more children live in
the Bushwick section of Brooklyn?”). In statistics these are called
area estimates
, and they are crucial for the allocation of social services
and for all sorts of other purposes. e Census provides such small area
estimates admirably well, and because it can do so, makes it worth its
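The clerk's shortcut mentioned above (and spelled out in a footnote below: the population grows by roughly one person every thirteen seconds) can be written as a one-line update rule. The elapsed-time value in the example is arbitrary, chosen only to show the scale of the increment.

```python
# The clerk's update rule: new estimate = old count + elapsed_seconds / 13,
# since the U.S. population grows by roughly one person every 13 seconds.

def update_population(previous_count: int, elapsed_seconds: float) -> int:
    return round(previous_count + elapsed_seconds / 13)

# Example: one year (~31,557,600 seconds) after the 2010 count.
print(update_population(308_745_538, 31_557_600))  # 311173046
```

A minute of arithmetic thus reproduces the single headline number; it is everything else the Census provides that justifies its cost.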
Back to Tests
For tests, the relevant currency is not dollars but examinee time. Is it worth using an hour (or two or more) of examinee time to estimate just a single number– a single score? Is the small marginal increase in accuracy obtained from a sixty- or eighty-item test over, say, a thirty-item test worth the extra examinee time?
A glance at the gradual slope of the Spearman-Brown curve suggests it is not. Multiplying the time saved for
To estimate the size of the total population at any moment one merely needs to ascertain how much time has elapsed since the last estimate, in seconds, divide by thirteen, and add in that increment.
each examinee by the millions of examinees that often take such tests
makes this conclusion stronger still. What would be the circumstances in which a test score with a reliability of 0.89 will not suffice, but a marginally higher one would?
Perhaps there are other uses for the information gathered by the test that require additional length; the equivalent of the small area estimates of the Census. In testing such estimates are usually called subscores. They are really small area estimates on various aspects of the subject matter of the test. On a high school math test these might be subscores on algebra, geometry, and the like; on a licensing examination in veterinary medicine there might be subscores on the pulmonary system, the cardiovascular system, and so on. One can even imagine a system of cross-classified subscores in which the same item is used on more than one subscore– perhaps one on dogs, another on cats, and others on cows, horses, and pigs. Such cross-classified subscores are akin to the Census's small area estimates.
The production of meaningful subscores would be a justification for tests that contain more items than would be required merely for an accurate enough estimate of total score. What is a meaningful subscore? It is one that is reliable enough for its prospective use and one that has information that is not adequately contained in the total test score.
There are at least two prospective uses of such subscores:
1. To aid examinees in assessing their strengths and weaknesses, often with an eye toward remediating the latter, and
2. To aid individuals and institutions (e.g., teachers and schools) in assessing the effectiveness of their instruction, again with an eye toward remediating weaknesses.
In the first case, helping examinees, the subscores need to be reliable enough so that attempts to remediate weaknesses do not become just the futile pursuit of noise. And, obviously, the subscore must contain information not already available in the total score. Reliability and orthogonality are thus the two characteristics of a worthwhile subscore.
Subscores' reliability is governed by the same inexorable rules of reliability as overall scores– as the number of items they are based on decreases, so too does their reliability. Thus if we need reliable subscores we must have enough items for that purpose. This would mean that the overall test's length would have to be greater than would be necessary for merely a single score.
For the second use, helping institutions, the test's length would not have to increase, for the reliability would be calculated over the number of individuals from that institution who took the items of interest. If that number was large enough the estimate could achieve high reliability.
Thus one justification for what seems at first to be the excessive lengths of most common tests is to provide feedback through subscores. How successful are test developers in providing such subscores?
Not particularly, for such scores are typically based on few items and hence are not very reliable. This result led to the development of empirically augmented subscores, which borrow strength from other items on the test that empirically yield an increase in reliability. Augmentation increased the reliability of subscores substantially, but at the same time the influence of items from the rest of the test reduced the orthogonality of those subscores to the total score. What was needed was a way to measure the value of an augmented subscore that took both characteristics into account. Until such a measure became available the instigating question “how successful are test developers in providing useful subscores?” would still remain unanswered.
Happily, the ability to answer this important question was improved markedly in 2008 with the publication of Shelby Haberman's powerful new statistic that combined both reliability and orthogonality. With this tool Sandip Sinharay searched high and low for subscores that had added value over total score, but came up empty. Sinharay's empirical results were validated in simulations he did that matched the structure of data commonly encountered in different kinds of testing situations (see chapter 9 of Thissen and Wainer's Test Scoring). Sinharay's results were enriched and expanded by Richard Feinberg, and the paucity of subscores worth having was confirmed. This same finding, of subscores adding no marginal value over total score, was reconfirmed in subsequent work.
While it is too early to say that there are no subscores that are ever reported that are worth having, it seems sensible to conclude that unless tests are massively redesigned such subscores are likely rare.
Surprisingly, at least to me, the search for subscores of value to institutions also seems to have been futile. Most of the institutions that were attempting to use such scores had fewer than fifty examinees, and so those scores too were neither reliable nor orthogonal enough to be of much use.
If Not for Subscores, Is There Another Justification?
Where does this leave us? Unless we can find a viable purpose for which unreliable and nonorthogonal subscores have marginal value over just the total score, it is hard to justify continuing to administer tests that take more examinee time than is justified by the accuracy required for a single score.
As we saw earlier, one possible purpose is the use of the test as a prod to motivate students to study all aspects of the curriculum and for
the teachers to teach it. Surely, if the test is much shorter, fewer aspects of the curriculum will be well represented. But this is easily circumvented in a number of ways. If the curriculum is sampled cleverly, neither the teachers nor the examinees will know exactly what will be on the test and so will have to include all of it in their study. Another approach is to follow NAEP's lead and use some sort of clever design in which all sectors of the curriculum are tested but each examinee takes only a portion of them (Haberman, Sinharay, and Puhan). Such a design will allow estimates of subarea mastery to be estimated in the aggregate and, through the statistical magic of score equating, still allow all examinees to be placed on a common scale. Motivation need not suffer– a result that has been shown repeatedly with adaptive tests, in which we can give a test of half its usual length with no loss of motivation. Sure there are fewer items of each sort, but examinees must still study all aspects because on a shorter test each item “counts” more toward the final score.
So, unless evidence can be gathered that shows a radical change in teaching and studying behavior with shorter tests, we believe that we can reject motivation as a reason for giving overly long tests.
Another justification for the apparent excessive lengths of tests is that the small additional reliability due to that extra length is of practical importance. Of course, the legitimacy of such a claim would need to be examined on a case-by-case basis, but perhaps we can gain some insight through a careful study of one artificial situation that has similarities to a typical licensing examination– say one that has three hundred items, takes eight hours to administer, and has a reliability of 0.95. Such characteristics show a marked similarity to many existing licensing exams (e.g., for physicians or nurses or certified public accountants). The purpose of such an exam is to make a pass-fail decision; suppose the passing score is 63 percent correct.
What happens to the accuracy of our decisions if we make truly draconian reductions in the test's length? Suppose we reduce the test to just seventy-five items. Because of the gradual slope of the reliability curve, this kind of reduction would only shrink the reliability to 0.83. Is this still high enough to safeguard the public? One way to find out is to calculate how many wrong pass-fail decisions would be made.
With the original test 3.13 percent of the decisions would be incorrect, composed of false positives (passing when they should have failed) of 1.18 percent and false negatives (failing when they should have passed) of 1.95 percent. How well does our shrunken test do? First, the overall error rate increases to 6.06 percent, almost double what the longer test yielded. This breaks down to a false positive rate of 2.26 percent and a false negative rate of 3.80 percent. Is the inevitable diminution of accuracy sufficiently large to justify the fourfold increase in test length? Of course, that is a value judgment, but before making it we must realize that the cost in accuracy can be eased. The false positive rate for this test is the most important one, for it measures the proportion of unqualified candidates who would be certified. We can control the false positive rate by simply raising the passing score. If we increase the passing score to 65 percent instead of 63 percent the false positive rate drops back to the same 1.18 percent we had with the full test. Of course, by doing this, the false negative rate grows to 6.6 percent, but this is a venial sin that can be ameliorated easily by adding additional
items to those candidates who only barely failed. Note that the same
(shown in
adding more items to decrease the false negative rate. e parallel to the
Spearman-Brown curve is shown in
e improvement in the false negative rate yielded through the
shows us that by adding but forty items
for only those examinees just below the cut score (those whose scores
range from just below the minimal passing score of 65percent to about
62percent) we can attain false negative rates that are acceptably close
to those obtained with the three-hundred-item test. is extension can
most of the examinees their test would take but one-fourth the time as it
would have previously, and even for the small number of examinees who
had to answer extra items, these were few enough so that even they still
we show a summary of these results as well as par
allel results for an even more dramatically reduced test form with but
Thus, at least for simple pass-fail decisions, it seems that we can dismiss accuracy as a reason for the excessive test lengths used; for within plausible limits we can obtain comparable error rates with much shorter tests, albeit with an adaptive stopping rule.
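The exact error rates quoted above depend on the score distribution the book assumed, which is not reproduced here; still, the logic of trading false positives for false negatives by raising the operational cut score can be sketched with a simple classical-test-theory simulation. The distributional parameters below are illustrative assumptions of mine, not the book's:

```python
import random

def error_rates(reliability, observed_cut, true_cut=0.63,
                true_mean=0.70, true_sd=0.08, n=200_000, seed=1):
    """Simulate pass-fail decision errors under a classical-test-theory model.

    True proportion-correct scores are normal(true_mean, true_sd) -- assumed
    values, not the book's.  Observed = true + error, with the error variance
    sized so that var(true) / var(observed) equals the stated reliability.
    """
    rng = random.Random(seed)
    err_sd = true_sd * ((1 - reliability) / reliability) ** 0.5
    fp = fn = 0
    for _ in range(n):
        true = rng.gauss(true_mean, true_sd)
        observed = true + rng.gauss(0, err_sd)
        if observed >= observed_cut and true < true_cut:
            fp += 1   # passed, but true ability is below the standard
        elif observed < observed_cut and true >= true_cut:
            fn += 1   # failed, but true ability is above the standard
    return fp / n, fn / n

# Keep the standard of competence at 63%, but raise the operational cut to 65%:
fp63, fn63 = error_rates(reliability=0.83, observed_cut=0.63)
fp65, fn65 = error_rates(reliability=0.83, observed_cut=0.65)
# Raising the cut shrinks false positives at the price of more false negatives.
```

Whatever the exact numbers, the direction of the trade is guaranteed: a higher cut can only convert passes to failures, so false positives fall and false negatives rise.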
As it turns out, this is also true if a purpose of the test score is to pick winners. This is a more complex process than simply seeing if a score is above or below some fixed passing score, for now we must look at a pair
[Table: Summary of Passing Statistics]
The Costs of Using Excessively Long Tests
and the Progress of Western Civilization
Costs can be measured in various ways and, of course, they accrue differentially to different portions of the interested populations: the testing organization, the users of the scores, and the examinees.
The cost to the users of the scores is nil because neither their time nor money is used to gather the scores.
The cost to the testing organization is likely substantial because a single operational item's cost is considerably greater than $2,500. Add to this the costs of “seat time” paid to whoever administers the exam, grading costs, and so forth, and it adds up to a considerable sum. Fixed costs being what they are, giving a new test of one-fourth the length of an older one does not mean one-fourth the cost, but it does portend worthwhile savings. We are also well aware that concerns about test speededness at current lengths could be ameliorated easily if the time allowed for the test were shrunk, though not quite as far as the number of items would suggest.
Which brings us to the examinees. Their costs are of two kinds: (1) the actual fees paid to the testing organization, which could be reduced if the costs to that organization were dramatically reduced, and (2) the opportunity costs of time.
If the current test takes eight hours, then a shortened form of only one-fourth its length might plausibly take but two hours, a savings of six hours per examinee. Multiplied by perhaps one hundred thousand examinees who annually seek legal licensure, this would yield a time savings of six hundred thousand hours. Keeping in mind that the examinees taking a bar exam are (or shortly will be) all attorneys, what can be accomplished with six hundred thousand extra hours of attorneys' time? Think of how much good six hundred thousand annual hours of pro bono legal aid could do – or a like amount of effort from fledgling accountants at tax time, architects, professional engineers, physicians, or others. The spare time from pre- or just-licensed professionals could accelerate the progress of our civilization.
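The time-savings arithmetic can be checked directly, using the figures given in the text:

```python
current_hours = 8                      # current bar-exam length, per the text
shortened_hours = current_hours / 4    # a form one-fourth as long
examinees = 100_000                    # annual examinees, per the text

saved_per_examinee = current_hours - shortened_hours
total_saved = saved_per_examinee * examinees
print(saved_per_examinee, total_saved)  # → 6.0 600000.0
```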
These opportunity costs, if translated to K–12 schoolchildren, should be measured in lost
The facts presented here leave us with two possibilities:
1. To shorten our tests to the minimal length required to yield acceptable accuracy for the total score, and thence choose more profitable activities for the time freed up, or
2. To reengineer our tests so that the subscores that we calculate have the properties that we require.
The fault, dear Brutus, is not in our statistical measures, but in our tests, that they are poorly designed.
Cassius to Brutus in Shakespeare's Julius Caesar (Act 1, Scene II)
It has been shown clearly that the shortcomings found in the subscores calculated on so many of our large-scale tests are due to flaws in the tests' designs. Thus the second option, the one we find most attractive, requires redesigning our tests. A redesign is necessary because, in a parallel to what Cassius so clearly pointed out to Brutus more than four centuries ago, that information was not built in to begin with. Happily, a blueprint for how to do this was carefully laid out in 2003 when Bob Mislevy, Linda Steinberg, and Russell Almond provided the principles and procedures for what they dubbed “Evidence Centered Design.” It seems worth a shot to try it. In the meantime we ought to stop wasting resources giving tests that are longer than the information they yield is worth.
Census:while reliable small area estimates can be made, they are not
likely to come cheap.
It may be that such reengineering will not help if the examinee population is too homogeneous. Consider a health inventory. Surely everyone would agree that recording both height and weight is important. While the two variables are correlated, each contains important information not found in the other; a person who weighs 220 pounds is treated differently if he is 6'5" than if he is 5'6". Additionally, both can be recorded with great accuracy. So, by the rules of valuable subscores, both should be kept in the inventory. But suppose the examinees were all men from a regiment of marines. Their heights and weights would be so highly correlated that if you know one you don't need the other. This may be the case for such highly selected populations as attorneys or physicians, and so there is likely little prospect for worthwhile subscores. But there remains the strong possibility of advancing civilization.
Sinharay, Haberman, and Wainer 2011.
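The marines example can be put in numbers: if two measures correlate at r, a linear prediction of one from the other leaves only 1 − r² of its variance unexplained. The correlations below are illustrative values of mine, not figures from the book:

```python
def unexplained_variance(r: float) -> float:
    """Fraction of one variable's variance left unexplained when it is
    predicted linearly from another variable correlated with it at r."""
    return 1 - r ** 2

# General population: height and weight correlate only moderately
# (r = 0.5 is an illustrative value), so each measure carries real
# information that the other lacks.
print(unexplained_variance(0.5))   # → 0.75

# A highly selected group, like the regiment of marines (illustrative
# r = 0.95): knowing one measure tells you nearly everything about the other.
print(unexplained_variance(0.95))  # about 0.10
```

Selection shrinks the unexplained share, and with it the value of reporting the second measure as a separate score.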
Try This at Home
For most of its history, science shared many characteristics of a sect. It had its own language, strict requirements for membership, and its own closely guarded practices. It sat atop a mountain, existing at an altitude far above where ordinary people lived. So it has been for the 2,500 years since the time of the Pythagoreans, but science is now in the midst of a revolutionary change. This shift is occurring because it has become clear how readily its basic tools can be used for humans to understand the world we inhabit.
One of the principal purposes of this book has been to illustrate how much can be accomplished without deep technical machinery; one can ride a bicycle without knowing the equations of motion that describe how to balance upright on just two wheels. Stanford's Sam Savage has an evocative phrase, only slightly inaccurate, that some tasks we can learn with the seat of our pants rather than with our intellect. Obviously, both parts of our anatomy are often useful to master a broad range of situations.
Most of the knotty problems discussed so far were unraveled using little more than three of the essential parts of scientific investigations:
1. Some carefully gathered data, combined with
2. Clear thinking, and
3. Graphical displays that permit the results of the first two steps to be communicated.
Of course, many of the problems encountered by scientists are not susceptible to amateurs – coronary bypass surgery and the design of nuclear reactors come immediately to mind; there are many others. This is not surprising. What is remarkable is how broad the range of problems is that are susceptible to illumination by thoughtful and industrious amateurs.
The Cost of Health Care
It is well documented that the costs of medical care are not available to patients in advance of treatment. One demonstration came from Rosenthal, Lu, and Cram who, in 2013, tried, unsuccessfully, to find out what would be the cost of a hip replacement. Their findings were replicated and amplified by New York Times science reporter Gina Kolata who, on behalf of her pregnant and uninsured daughter, called a number of hospitals to try to find out how much the delivery would cost. She was not particularly successful until she identified herself as a reporter and got through to the Kennedy Health System's president and chief executive, Martin A. Bieber, who told her that a normal delivery is approximately $4,900, and newborn care costs about $1,400. And, he added, Kennedy charges all uninsured patients those prices, which are 115 percent of the Medicare rates, no matter what their income. She also reported that Dartmouth Medical Center, one of the few places that posts its prices, charges the uninsured about $11,000 for a normal delivery and newborn care.
However, one can argue that such procedures as hip replacements or even giving birth are complex and idiosyncratic; thus the costs, which depend heavily on the specific situation, might vary enormously. This variation may make cost estimates too difficult to predict accurately. In addition, it is uncertain what strictures about revealing prices to the public are placed on hospital employees.
This issue was discussed around the dinner table at the Bernstein household in Haverford, Pennsylvania, recently. Jill Bernstein (age 14), advised and guided by her father Joseph (an orthopedic surgeon), devised an investigation that she felt would shed additional light on the situation.
Ms. Bernstein called twenty local (Philadelphia) hospitals and explained she needed an electrocardiogram (ECG) but had no insurance. She then asked how much one would cost. An ECG is simple and its costs ought not to vary by case. Seventeen of the twenty hospitals refused to tell her, and the three that did quoted prices of $137, $600, and $1,200, respectively (even surpassing the vast proportional variation in delivery costs that Gina Kolata found). Then (and here's Ms. Bernstein's genius) she called the hospitals back, said that she was coming in, and asked what parking would cost. Nearly all of the hospitals provided parking costs (ten offered free or discounted parking, indicating an awareness of consumers' concerns about cost). She then wrote up and published her results and was subsequently interviewed on National Public Radio.
How Often Is an Active Cell Phone in a Car
Like an Open Beer Can?
In 2003 a study by Utah psychologist David Strayer and his colleagues found that “people are as impaired when they drive and talk on a cell phone as they are when they drive intoxicated at the legal blood-alcohol limit” of 0.08 percent. This result was derived from a simulated driving task in which the subjects of the experiment had to “drive” while either conversing on a cell phone or legally intoxicated.
This finding also sparked discussion around the Bernstein dinner table. Now it was Jill's sixteen-year-old brother James who was interested in further investigation. He was curious how often drivers had their cell phone at the ready, along with such other variables as:
Having the seat belt fastened,
Driver's gender, and
The type of vehicle.
He recorded how many drivers, while waiting at a red light for about thirty seconds, used the time to speak or text on their phones. He found about 20 percent (of the thousand cars he recorded) did, and that amount was unrelated to the sex of the driver or the type of vehicle, but it was related to seatbelt use. Many drivers continued speaking as they drove off. He concluded that having an active cell phone in a car should be regarded in much the same way as an “open bottle” of alcohol – a dangerous temptation.
Bernstein and Bernstein 2014.
0.08 percent is the minimum level that defines illegal drunken driving in most U.S. states.
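A data-scientific reader might ask how precise James's 20 percent figure is. A standard normal-approximation confidence interval (my addition, not part of his study) gives a sense of it:

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# About 20 percent of the thousand cars James recorded had drivers
# on their phones at the light:
low, high = proportion_ci(0.20, 1000)
# The interval runs from roughly 17.5 percent to 22.5 percent.
```

With a thousand cars, the estimate is tight enough that the qualitative conclusion does not hinge on sampling noise.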
How Do Monarch Butterflies Come
North Each Spring?
In the early 1970s Fred Urquhart and his colleagues at the University of Toronto used a tiny tag to identify Monarch Butterflies (Danaus plexippus) and so to track them to their winter home in Mexico. With the help of these tags they eventually found their destination. It was a site in the Sierra Madre mountains west of Mexico City, where the trees were covered by orange Monarchs enjoying the warm weather until it was time to head north again. The next step in the investigation was to track all of the butterflies on the trip north. But this step was beyond the resources of this research team – they were flummoxed.
Thus far this butterfly study is typical of many scientific investigations – being run by experts using fancy techniques and high-tech instrumentation. But all of this horsepower was insufficient for the final step, for to accomplish it would require following all of the swarms of butterflies when they left Mexico and keeping track of their many different pathways – both where they went and when they got there; a task well beyond the resources available.
Bernstein & Bernstein, 2015.
My thanks to Lee Wilkinson.
Figure C.1 shows where and when Monarchs traveled on their way north. How was this accomplished?
I have always depended on the kindness of strangers.
Blanche DuBois
It will come as no surprise, given the theme of this section, that the cavalry that rode to the rescue of this investigation was hundreds of children, teachers, and other observers who reported to the Journey North Project under the auspices of the Annenberg/CPB Learner Online program. Each data point in the graph is the location and date of an observer's first sighting of an adult Monarch during the first six months of 1997.
[Figure C.1: map of first sightings; the date legend runs from April 15 through July 15, beginning in Mexico.]
Science has come down from the ivory tower and can be used by anyone with sufficient grit and determination to contribute, if the results are important enough, to the general understanding.
I have reached the end of my tale. Each chapter was meant to convey
a little of the way a modern data scientist thinks, and how to begin to
solve what may have seemed like impossibly vexing problems. Underlying
all of this are some deep ideas about the critical role in science played
by evidence and some of its important characteristics. Especially crucial
are (1) explicit hypotheses, (2) evidence that can test those hypotheses, (3) reproducibility, and (4) the control of hubris.
A pitcher and a chemist each emphasized these characteristics in memorable ways.
The pitcher first – paraphrasing the great Satchel Paige's comment on the value of money:
Evidence may not buy happiness,
but it sure does steady the nerves.
This idea was augmented and emphasized by the chemist August Wilhelm von Hofmann (1818–92), the discoverer of formaldehyde:
A century later one key aspect of scientific evidence, reproducibility, was made explicit by the journal editor G.H. Scherr (1983):
The glorious endeavour that we know today as science has grown out of the murk of sorcery, religious ritual, and cooking. But while witches, priests, and chefs were developing their arts, scientists learned a way of determining the validity of their results; they learned to ask: Are they reproducible?
And finally, on the crucial importance of modesty and controlling the natural hubris that would have us draw inferences beyond our data, from Professor Hubert N. Alyea:
I say not that it is, but that it seems to be;
as it now seems to me to seem to be.
Professor Alyea's observation seems like a remarkably apt way for me to express my own feelings as I end this book.
American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
Amoros, J. (2014). Recapturing Laplace.
Andrews, D.F. (1972). Plots of high-dimensional data.
Angoff, C., and Mencken, H.L. (1931). The worst American state.
Arbuthnot, J. (1710). An argument for divine providence taken from the constant regularity in the births of both sexes. Philosophical Transactions of the Royal Society.
Austen, J. (1817). Northanger Abbey. London: John Murray, Albemarle Street.
Balf, T. (March 9, 2014). The SAT is hated by – all of the above. New York Times Sunday Magazine.
Beardsley, N. (February 23, 2005). The Kalenjins: A look at why they are so good at long-distance running. Human Adaptability.
vs. mammography alone in women at elevated risk of breast cancer. Journal of the American Medical Association.
Bernstein, J.R.H., and Bernstein, J. (2014). Availability of consumer prices from Philadelphia area hospitals for common services: Electrocardiograms vs. parking. JAMA Internal Medicine (accessed December 2, 2014).
Bernstein, J.J., and Bernstein, J. (2015). Texting at the light and other forms of device distraction behind the wheel. BMC Public Health, 15:968. DOI 10.1186/s12889-015-2343-8 (accessed September 29, 2015).
Berry, S. (2002). One modern man or 15 Tarzans?
Bertin, J. (1973). Semiologie Graphique. The Hague: Mouton-Gautier. 2nd ed. (English translation by William Berg and Howard Wainer published as Semiology of Graphics, Madison: University of Wisconsin Press, 1983.)
Bock, R.D. (June 17, 1991). The California Assessment. A talk given at the Educational Testing Service.
Bridgeman, B., Cline, F., and Hessinger, J. (2004). Effect of extra time on verbal and quantitative GRE scores. Applied Measurement in Education.
Bridgeman, B., Trapani, C., and Curley, E. (2004). Impact of fewer questions per section on SAT I scores. Journal of Educational Measurement.
Briggs, D.C. (2001). The effect of admissions test preparation: Evidence from
Bynum, B.H., Hoffman, R.G., and Swain, M.S. (2013). A statistical investigation of the effects of computer disruptions on student and school scores. Final report prepared for Minnesota Department of Education, Human Resources Research Organization.
Chernoff, H. (1973). The use of faces to represent points in k-dimensional space. Journal of the American Statistical Association.
Cizek, G.J., and Bunch, M.B. (2007). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage.
Clauser, B.E., Mee, J., Baldwin, S.G., Margolis, M.J., and Dillon, G.F. (2009). A standard-setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement.
Cleveland, W.S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review.
Cobb, L.A., Thomas, G.I., Dillard, D.H., Merendino, K.A., and Bruce, R.A. (1959). An evaluation of internal-mammary-artery ligation by a double-blind technic. New England Journal of Medicine.
the worst places to live in the UK and the US. In C. Kostelnick and M. Kimball (Eds.), Visible Numbers, the History of Statistical Graphics. Farnham, UK: Ashgate Publishing (forthcoming).
(2012). A century and a half of moral statistics in the United Kingdom: Variations
Education Week
DerSimonian, R., and Laird, N. (1983). Evaluating the effect of coaching on SAT scores. Harvard Education Review.
Dorling, D. (2005). Human Geography of the UK. London: Sage Publications.
Dorling, D., and Thomas, B. (2011). Bankrupt Britain: An Atlas of Social Change. Bristol, UK: Policy Press.
Educational Testing Service (1993). Test Security: Assuring Fairness for All. Princeton, NJ: Educational Testing Service.
Feinberg, R.A. (2012). A simulation study of the situations in which reporting subscores can add value to licensure examinations. PhD diss., University of Delaware. Accessed October 31, 2012, from ProQuest Digital Dissertations database (Publication No. 3526412).
Fernandez, M. (October 13, 2012). El Paso schools confront scandal of students who “disappeared” at test time. New York Times.
Fisher, R.A. (1925). Statistical Methods for Research Workers.
Fletcher, J. (1849). Summary of the Moral Statistics of England and Wales. Privately printed.
(1847). Moral and educational statistics of England and Wales. Journal of the Statistical Society of London.
Fox, P., and Hender, J. (2014). The science of data science. Big Data.
Freedle, R. (2003). A method for reestimating SAT scores. Harvard Educational Review.
Friendly, M., and Denis, D. (2005). The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences.
Friendly, M., and Wainer, H. (2004). Nobody's perfect.
Galchen, R. (2015). The arrival of man-made earthquakes. The New Yorker.
Gelman, A. (2008). Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do.
Gilman, R., and Huebner, E.S. (2006). Characteristics of adolescents who report very high life satisfaction. Journal of Youth and Adolescence.
Graunt, J. (1662). Natural and Political Observations on the Bills of Mortality. London: John Martyn and James Allestry.
Haberman, S. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics.
Haberman, S.J., Sinharay, S., and Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology.
Halley, E. (1686). An historical account of the trade winds, and monsoons, observable in the seas between and near the tropicks. Philosophical Transactions, 183:153–68.
Hand, E. (July 4, 2014). Injection wells blamed in Oklahoma earthquakes. Science.
Harness, H.D. (1837). Atlas to Accompany the Second Report of the Railway Commissioners, Ireland. Dublin: Irish Railway Commission.
Hartigan, J.A. (1975). Clustering Algorithms. New York: Wiley.
Haynes, R. (1961). The Hidden Springs: An Enquiry into Extra-Sensory Perception. London: Hollis and Carter. Rev. ed. Boston: Little, Brown, 1973.
Hill, R. (2013). An analysis of the impact of interruptions on the 2013 administration of the Indiana Statewide Testing for Educational Progress–Plus.
Hobbes, T. (1651). Leviathan, or the matter, forme, and power of a commonwealth, ecclesiasticall and civill. Republished in 2010, ed. Ian Shapiro (New Haven, CT: Yale University Press).
Holland, P.W. (October 26, 1980). Personal communication.
(1986). Statistics and causal inference. Journal of the American Statistical Association.
(October 26, 1993). Personal communication.
Hopkins, Eric (1989). Birmingham: The First Manufacturing Town in the World. London: Weidenfeld and Nicolson.
Hume, D. (1740). A Treatise of Human Nature.
Huygens, C. (1669). In Huygens, C. (1895). Oeuvres complètes, Tome Sixième, Correspondance. Martinus Nijhoff.
Kahneman, D. (2011). Thinking, Fast and Slow. New York: Farrar, Straus and Giroux.
Kalager, M., Zelen, M., Langmark, F., and Adami, H. (2010). Effect of screening mammography on breast-cancer mortality in Norway. New England Journal of Medicine.
Kant, I. (1793). Religion within the Limits of Reason Alone (pp. 83–4), trans. T.M. Greene and H.H. Hudson. New York: Harper Torchbook.
Keranen, K.M., Savage, H.M., Abers, G.A., and Cochran, E.S. (June 2013). Links between wastewater injection and the 2011 Mw 5.7 earthquake sequence. Geology.
Keranen, K.M., Weingarten, M., Abers, G.A., Bekins, B.A., and Ge, S. (July 25, 2014). Sharp increase in central Oklahoma seismicity since 2008 induced by massive wastewater injection. Science, 345(6195):448–51. Published online July 3, 2014.
40–59 kg/m²) and mortality: A pooled analysis of 20 prospective studies.
Kolata, G. (July 8, 2013). What does birth cost? Hard to tell. New York Times.
Mémoires de l'Académie Royale des Sciences de Paris, 1783.
Little, R.J.A., and Rubin, D.B. (1987). Statistical Analysis with Missing Data.
Ludwig, D.S., and Friedman, M.I. (2014). Increasing adiposity: Consequence or cause of overeating? Journal of the American Medical Association. Published online May 16, 2014. doi:10.1001/jama.2014.4133.
Luhrmann, T.M. (July 27, 2014). Where reason ends and faith begins. New York Times, News of the Week in Review, p. 9.
Macdonell, A.A. (January 1898). The origin and early history of chess. Journal of the Royal Asiatic Society.
Mee, J., Clauser, B., and Harik, P. (April 2003). An examination of the impact of computer interruptions on the scores from computer-administered examinations. Paper presented at the Council of Educational Measurement, Chicago.
Meier, P. (1977). The biggest health experiment ever: The 1954 field trial of the Salk poliomyelitis vaccine. In Statistics: A Guide to the Study of the Biological and Health Sciences (pp. 88–100). New York: Holden-Day.
Psychological Bulletin.
Mislevy, R.J., Steinberg, L.S., and Almond, R.G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
Moore, A. (August 27, 2010). Wyoming Department of Education Memorandum Number 2010–151: Report on Effects of 2010 PAWS Administration Irregularities on Students' Scores.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. Future of Children.
Murphy, J.S. (December 11, 2013). The case for SAT words. The Atlantic.
National Institutes of Health (2014). Estimates of Funding for Various Research, Condition, and Disease Categories (accessed September 29, 2014).
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Translation of excerpts by D. Dabrowska and T. Speed.
Nightingale, F. (1858). Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army. London: Harrison and Sons.
Pacioli, Luca (1494). Summa de Arithmetica. Venice, folio 181, p. 44.
Pfeffermann, D., and Tiller, R. (2006). Small-area estimation with state-space models subject to benchmark constraints. Journal of the American Statistical Association.
Playfair, W. (1821). A Letter on Our Agricultural Distresses, Their Causes and Remedies. London: W. Sams.
(1801). The Statistical Breviary; Shewing on a Principle Entirely New, the Resources of Every State and Kingdom in Europe; Illustrated with Stained Copper-Plate Charts, Representing the Physical Powers of Each Distinct Nation with Ease and Perspicuity. Edited and introduced by Howard Wainer and Ian Spence. New York: Cambridge University Press.
(1786). The Commercial and Political Atlas, Representing, by Means of Stained Copper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts of England, during the Whole of the Eighteenth Century. Facsimile reprint edited and annotated by Howard Wainer and Ian Spence. New York: Cambridge University Press, 2006.
Quinn, P.D., and Duckworth, A.L. (May 2007). Happiness and academic achievement. Paper presented at the annual meeting of the Association for Psychological Science, Washington, DC.
models proposed by Schulz. Educational Measurement: Issues and Practice.
Robbins, A. (2006). The Overachievers: The Secret Lives of Driven Kids. New York: Hyperion.
Robinson, A.H. (1982). Early Thematic Mapping in the History of Cartography. Chicago: University of Chicago Press.
Rosen, G. (February 18, 2007). Narrowing the religion gap. New York Times Sunday Magazine.
Rosenbaum, P. (2009). Design of Observational Studies. New York: Springer.
(2002). Observational Studies. 2nd ed. New York: Springer.
Rosenthal, J.A., Lu, X., and Cram, P. (2013). Availability of consumer prices from US hospitals. JAMA Internal Medicine.
Rubin, D.B. (2006). Causal inference through potential outcomes and principal stratification: Application to studies with “censoring” due to death.
(2005). Causal inference using potential outcomes: Design, modeling, decisions. 2004 Fisher Lecture. Journal of the American Statistical Association.
(1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics.
(1975). Bayesian inference for causality: The importance of randomization. In Social Statistics Section, Proceedings of the American Statistical Association.
(1974). Estimating causal effects of treatments in randomized and non-randomized studies. Journal of Educational Psychology.
Scherr, G.H. (1983). Irreproducible science: Editor's introduction. In The Best of the Journal of Irreproducible Results. New York: Workman Publishing.
Sinharay, S. (2014). Analysis of added value of subscores with respect to classification. Journal of Educational Measurement.
(2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement.
Sinharay, S., Haberman, S.J., and Wainer, H. (2011). Do adjusted subscores lack validity? Don't blame the messenger. Educational and Psychological Measurement.
Sinharay, S., Wan, P., Whitaker, M., Kim, D-I., Zhang, L., and Choi, S. (2014). Study of the overall impact of interruptions during online testing on the test scores. Unpublished manuscript.
Slavin, Steve (1989). All the Math You'll Ever Need (pp. 153–4). New York: John Wiley and Sons.
Solochek, J. (May 17, 2011). Problems, problems everywhere with Pearson's testing system. Tampa Bay Times.
Strayer, D.L., Drews, F.A., and Crouch, D.J. (2003). Fatal distraction? A comparison of the cell-phone driver and the drunk driver. In D.V. McGehee, J.D. Lee, and M. Rizzo (Eds.), Driving Assessment 2003: International Symposium on Human Factors in Driver Assessment, Training, and Vehicle Design (pp. 25–30). Public Policy Center, University of Iowa.
Thacker, A. (2013). Oklahoma interruption investigation. Presented to the Oklahoma State Board of Education.
Thissen, D., and Wainer, H. (2001). Test Scoring. Hillsdale, NJ: Lawrence Erlbaum Associates.
Tufte, E.R. (2006). Beautiful Evidence. Cheshire, CT: Graphics Press.
(November 15, 2000). Lecture on information display given as part of the Yale Graduate School's Tercentennial lecture series “In the Company of Scholars” at the Levinson Auditorium of the Yale University Law School.
(1997). Visual Explanations. Cheshire, CT: Graphics Press.
(1990). Envisioning Information. Cheshire, CT: Graphics Press.
(1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Twain, M. (1883). Life on the Mississippi. Montreal: Dawson Brothers.
Verkuyten, M., and Thijs, J. (2002). School satisfaction of elementary school children. Social Indicators Research.
Wainer, H. (2011a). Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies.
(2011b). Value-added models to evaluate teachers: A cry for help.
(2009). Picturing the Uncertain World: How to Understand, Communicate and Control Uncertainty through Graphical Display.
(2007). Galton's normal is too platykurtic.
(2005). Graphic Discovery: A Trout in the Milk and Other Visual Adventures.
(2002). Clear thinking made visible: Redesigning score reports for students.
standards and the Charybdis of court decisions.
Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.
(1984). How to display data badly. The American Statistician.
(1983). Multivariate displays. In M.H. Rizvi, J. Rustagi, and D. Siegmund (Eds.), Recent Advances in Statistics (pp. 469–508). New York: Academic Press.
Wainer, H., and Rubin, D.B. (2015). Causal inference and death.
Wainer, H., Lukele, R., and Thissen, D. (1994). On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests. Journal of Educational Measurement.
Wainer, H., Sheehan, K., and Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement.
Wainer, H., Bridgeman, B., Najarian, M., and Trapani, C. (2004). How much does extra time on the SAT help?
Walker, C.O., Winn, T.D., and Lutjens, R.M. (2008). Examining relationships … Education Research International (2012), Article ID 643438, 7 pages (accessed August 27, 2015).
Waterman, A.S. (1993). Two conceptions of happiness: Contrasts of personal expressiveness (eudaimonia) and hedonic enjoyment. Journal of Personality and Social Psychology.
Wilkinson, L. (2005). The Grammar of Graphics. 2nd ed. New York: Springer-Verlag.
Winchester, S. (2009). The Map That Changed the World. New York: Harper Perennial.
Wyld, James (1815), in Jarcho, S. (1973). Some early demographic maps. Bulletin of the New York Academy of Medicine.
to produce 2010 county population estimates. U.S. Census Bureau Working Paper No. 100.
model of personality-happiness and the academic achievement of physical education … European Journal of Experimental Biology.
standards of performance on educational and occupational tests (accessed August 27, 2015).
How the Rule of 72 Can Provide Guidance to Advance Your Wealth, Your Career,
and Your Gas Mileage.
Piano Virtuosos and the Four-Minute Mile.
Happiness and Causal Inference.
Causal Inference and Death (with D.B. Rubin).
Using Experiments to Answer Four Vexing Questions.
Life Follows Art:Gaming the Missing Data Algorithm.
On the Crucial Role of Empathy in the Design of Communications:Testing as an
Improving Data Displays: Ours and the Media's. In Wainer, H., Picturing the Uncertain World.
Inside Out Plots (with J.O. Ramsay).
A Century and a Half of Moral Statistics in the United Kingdom:Variations on
Waiting for Achilles.
How Much Is Tenure Worth?
New York: Routledge,
When Nothing Is Not Zero:ATrue Saga of Missing Data, Adequate Yearly Progress,
and a Memphis Charter School.
For Want of a Nail:Why Worthless Subscores May Be Seriously Impeding the
Progress of Western Civilization (with R.Feinberg).
12(1), 16–21, 2015.
Try This at Home.
29(2), forthcoming, 2016.
Abbott v.
Achilles/tortoise paradox,
Almond, Russell,
Alyea, Hubert N.,
angina, surgical treatment for (Fieschi),
Atkinson, Richard,
Bachmann, Michelle,
Balbi, Adriano,
Bannister, Roger,
Bayes’ Theorem,
Bayi, Filbert,
Beardsley, Noah,
Bench, Johnny,
Berkeley, George,
Bernstein, James,
Bernstein, Jill,
Bernstein, Joseph,
Berra, Yogi,
Berry, Scott,
Bertin, Jacques,
Bieber, Martin A.,
boggle threshold
Bojorquez, Manuel,
Bok, Derek,
Bowdoin College and optional SAT
Bowie High School (El Paso, Texas),
Bowie Model,
Braun, Henry,
breast cancer,
Brewster, Kingman,
Bridgeman, Brent,
Briggs, Derek,
Brownback, Sam,
Bulletin of the Seismological Society of America
capture-recapture procedures,
carcinogenic additives,
“Case for SAT Words, The” (Murphy),
covariate information and,
interruptions and,
missing variable and,
defined (Hume),
ordering cause and effect,
Census, U. S. Decennial,
chess reward (Ferdowski),
strengths and weaknesses,
Claremont McKenna College,
Clauser, Brian,
clean air regulations,
Cleveland, Bill,
Cobb, Leonard,
Cochran, Mickey,
Coe, Sebastian,
communication, and empathy,
compound interest,
conjecture, and fact,
counterfactuals and,
through experimentation,
coronary by-pass surgery,
correlation, and coincidence,
Cortot, Alfred,
Christie, Chris,
bias size and,
excluding subjects,
uncertainty and,
control and,
Delaney Clause research,
delivery (birth) costs,
DerSimonian, Rebecca,
from the mean, plotting,
Median Absolute (MAD),
drilling, horizontal and vertical,
driving impaired,
drug trials,
Duckworth, A.L.,
Dupin, Charles,
Durso, John,
educational performance
Promise Academy and,
property taxes,
tenure and,
Education Research International
Einstein, Albert,
El Guerrouj, Hicham,
electrocardiogram (ECG) costs,
Elliot, Herb,
Ellsworth, William,
Emanuel, Rahm,
boggle threshold
characteristics of,
resistance to,
Evidence Centered Design (Mislevy, Steinberg, and Almond),
Fallin, Mary,
false positives,
Federalist #1
Feinberg, Richard,
Ferdowski Tusi, Hakim Abu Al-Qasim,
Fernandez, Manny,
Fieschi, Davide,
Fisher, Ronald,
fracking (hydraulic fracturing),
seismic effects of,
Freedle, Roy,
Freedle’s Folly,
Galchen, Rivka,
Galton, Francis,
Garcia, Lorenzo,
Gelman, Andrew,
graphical display,
pie charts,
measuring, requirements for,
Guerry, André-Michel,
Haag, Gunder,
Haberman, Shelby,
Hamilton, Alexander,
Hamm, Harold,
performance and,
Harder, Donelle,
Hargadon, Fred,
Hartigan, John,
Hatfield, Kim,
Haynes, Renée,
health inventory,
healthcare, controlling cost of,
height distribution study (Galton),
Helfgott, David,
Herschel, John Frederick William,
Hindoostan, Playfair's plot of,
hip replacement costs,
Hobbes, Thomas,
Hofmann, August Wilhelm von,
Holland, Paul,
Holmes, Sherlock,
homework ban,
How to Display Data Badly
Hsu, Jane,
Huckabee, Mike,
Hume, David,
ice cream and drownings,
Dopeler Effect
based on averages within groups,
beyond data,
limits of, in observational studies,
longitudinal, and cross-sectional data,
randomized, controlled experiments and,
Inhofe, Jim,
Inside-Out Plot (Hartigan),
Jolie, Angelina,
Journal of Happiness Studies
Journal of the American Medical Association
Journey North Project (Annenberg/CPB Learner
Jungeblut, Ann,
Kahneman, Danny,
Kaiser, George,
Kant, Immanuel,
Keino, Kip,
Keller, Randy,
Kemeny, John,
knowledge acquisition, exponential,
Kolata, Gina,
Ladra, Sandra,
Laird, Nan,
Landy, John,
Laplace, Pierre-Simon,
LeMenager, Steve,
Li, Yundi,
Liszt, Franz,
Little, Rod,
Locke, John,
Lowenthal, Jerome,
Lu, Xin,
Luhrmann, Tanya,
effectiveness of,
French population (Montizon),
instruction and crime (Balbi and
London cholera epidemic (Snow),
male students and population (Dupin),
marriage and life expectancy,
Mauer, Joe,
Median Absolute Deviation (MAD),
Mencken, H. L.,
Messick, Sam,
Minard, Charles Joseph,
Mislevy, Bob,
Monarch Butterflies,
Montizon, Frère de,
Morceli, Noureddine,
Murphy, James,
National Assessment of Educational Progress
National Board of Medical Examiners,
National Rifle Association,
New Haven County employment,
Neyman, Jerzy,
nicotine, addictiveness of,
Nightingale, Bob,
Nightingale, Florence,
“no causation without manipulation”,
“No Child Left Behind”,
nothing is not always zero
Nurmi, Paavo,
Obama, Barack,
observational studies,
sample size and,
covariate information and,
dependent variable,
sequentially randomized (split plot)
Pacini, Filippo,
Pacioli, Luca,
Page, Satchel,
Pauli, Wolfgang,
performance, and happiness,
Piazza, Mike,
Picasso, Pablo,
Picturing the Uncertain World
Playfair, William,
population size and,
Prokofiev, Sergei,
Promise Academy (Memphis TN),
Public School 116 (New York City NY),
Quality of Life (QOL) score,
Quinn, P.D.,
Rachmaninoff, Sergei,
credible assumptions and,
missing variables and,
reading and shoe size,
Reading/Language Arts Adequate Yearly Progress
(AYP, Tennessee),
reason, and faith,
Reckase, Mark,
reliability (dened),
rescue therapy,
Rodriguez, Ivan,
Romney, Mitt,
Rose of the Crimean War (Nightingale),
Rosenthal, Jaime,
Rubin, Donald B.,
Rubin’s Model for Causal Inference,
central idea of,
control and,
Rule of 72 (Pacioli),
as an approximation,
Runyon, Damon,
Salk vaccine,
sample size,
Sanders, William,
Savage, Sam,
2008 presidential election (Gelman),
2012 presidential election,
gun ownership and gun violence,
Scherr, G. H.,
Scholastic Aptitude Test (SAT)
Bowdoin College and,
March 2014,
October 2000,
coaching for,
Schusterman, Lynn,
scientific investigations, essentials of,
Scopes, John,
sequentially randomized (split plot) design,
Serkin, Rudolf,
Serrano v. Priest
Sinclair, Upton,
Sinharay, Sandip,
small area estimates,
Snow, John,
Spearman-Brown curve,
Standards for Educational and Psychological Testing
Statistical Breviary
moral (maps of),
origins of,
Statistics and Causal Inference
Steinberg, Linda,
stem-and-leaf display (Tukey),
Strayer, David,
as gold standard for evidence,
observational studies and,
ancillary information and,
covariates and,
characteristics of,
uses of,
Tancredo, Tom,
teacher evaluation value-added models,
Teague, Michael,
academic freedom and,
value of,
as measuring instruments,
as prods,
statistical evidence of,
costs of,
fourth-grade mathematics,
length of, and reliability,
prototypical licensing,
purposes of,
scores and racial groups,
unplanned interruptions in,
increased likelihood of,
Texas Assessment of Knowledge and Skills
Thinking Fast, Thinking Slow
Third Piano Concerto
Thoreau, Henry David,
Tommasini, Anthony,
circumcision and cervical cancer,
repeating kindergarten,
reptilian brain and,
Tukey, John,
Twain, Mark,
statistics as science of,
Uneducated Guesses
Urquhart, Fred,
US News and World Report
highly correlated,
two or more (high-dimension),
Voss, Joan,
Walker, Scott,
Wang, Yuja,
Watson, John,
weight gain, exponential,
Wu, Jeff,
Youngman, Henny,
