Measuring Up: What Educational Testing Really Tells Us, by Daniel Koretz [book review]
reviewed by Philip Staradamskis
In Measuring Up, Daniel Koretz continues his defense of the theory with which he is most famously identified: "Score inflation is a preoccupation of mine." He argues that high stakes induce "teaching to the test," which in turn produces artificial test-score gains (i.e., test-score inflation). The result, according to Koretz:
Scores on high-stakes tests - tests that have serious consequences for students or teachers - often become severely inflated. That is, gains in scores on these tests are often far larger than true gains in students' learning. Worse, this inflation is highly variable and unpredictable, so one cannot tell which school's scores are inflated and which are legitimate. (p. 131)
Thus, Koretz, a long-time associate of the federally funded National Center for Research on Evaluation, Standards, and Student Testing (CRESST), provides the many educators predisposed to dislike high-stakes tests anyway a seemingly scientific (and seemingly not self-serving or ideological) argument for opposing them. Meanwhile, he hands policymakers a conundrum: if scores on high-stakes tests improve, the gains are likely meaningless, leaving policymakers no external measure of school improvement. So they might just as well do nothing as bother doing anything.
Measuring Up supports this theory by ridiculing straw men, by declaring a pittance of flawed supporting evidence sufficient (pp. 11, 59, 63, 132, & chapter 10) and a superabundance of contrary evidence nonexistent, and mostly by repeatedly insisting that its author is right. (See, for example, chapter 1, pp. 131-133, & 231-236.) Koretz also shows little patience for those who choose to disagree with him. They want "simple answers", speak "nonsense", assert "hogwash", employ "logical sleight(s) of hand", write "polemics", or are "social scientists who ought to know better".
The concept of test-score inflation emerged in the late 1980s from the celebrated studies of the physician John J. Cannell (1987, 1989). Dr. Cannell caught every U.S. state bragging that its students' average scores on national norm-referenced tests were "above the national average," a mathematical impossibility. The phenomenon was dubbed the "Lake Wobegon Effect," in tribute to the mythical radio comedy community in which "all the children are above average."
What had caused the Lake Wobegon Effect? Cannell identified several suspects, including educator dishonesty and conflict of interest; lax test security; and inadequate or outdated norms. But Cannell's seemingly straightforward conclusions did not make it unscathed into the educational literature. For instance, one prominent CRESST study provided a cross-tabulation that summarized (allegedly all) the explanations offered for the spuriously high scores (Shepard 1990, 16). Conspicuously absent from the table, however, were Cannell's two primary suspects: educator dishonesty and lax test security.
Likewise, Koretz and several CRESST colleagues followed up with their own study in an unnamed school district, with unnamed tests and unidentified content frameworks. Contrasting steadily rising scores on a new, "high stakes" test with the substantially lower scores recorded on an older, no-stakes test, Koretz and his colleagues attributed the inflation to the alleged high stakes. Not examined was why two different tests, developed by two completely different groups of people under entirely separate conditions, using no common standard for content, would be expected to produce nearly identical scores.
In more detail, the study traced the annual trend in average scores on a third-grade test "perceived to be high stakes" over several years, then administered a different third-grade test, with no stakes, that had been administered in the district several years earlier. Finding steadily rising scores on the new test and a substantially lower score on the old, no-stakes test, the researchers attributed the rise on the new test to inflation caused by the alleged high stakes. The study either ignored most of the other factors that could have influenced the results, such as differing content, teachers, students, and incentives, or speculated that they must have conveniently cancelled each other out, and then declared that high stakes must have done it.
Nearly two decades later, much of the study remains shrouded in mystery: "The price of admission [to conduct the study] was that we take extraordinary steps to protect the anonymity of the [school] district, so I cannot tell you its name, the state it was in, or even the names of the tests we used." Thus, the study is neither replicable nor falsifiable. An easy solution would be a content match study between the two tests used for comparison. If, as claimed, the two tests represented the same domain (unidentified; it could have been, and likely was, as broad as a "grade level" of mathematics from two completely different content frameworks with non-parallel topical sequences), why not support that assertion with some empirical evidence?
This research framework presaged what was to come. The Lake Wobegon Effect continued to receive considerable attention, but Cannell's main points, that educator cheating was rampant and test security inadequate, were dismissed out of hand and persistently ignored thereafter. The educational consensus, supported by the work of CRESST and other researchers, fingered "teaching to the test" for the crime, manifestly under pressure from the high stakes of the tests.
Problematically, however, only one of Cannell's dozens of score-inflated tests had any stakes attached. All but that one were no-stakes diagnostic tests, administered without test-security protocols. The absence of security allowed education administrators to manipulate various aspects of the tests' administration, artificially inflate scores, and then advertise the phony score trends as evidence of their own managerial prowess. Ironically, many of the same states simultaneously administered separate, genuinely high-stakes tests with tight security and no evidence of score inflation.
Much of Measuring Up recapitulates the author's earlier writings, but on page 243, we do learn what he and his colleagues actually found in that influential follow-up to Cannell's findings. Exactly why had scores risen so dramatically on the new, high-stakes third-grade test they examined?
[A]lthough the testing system in this district was considered high-stakes by the standards of the late 1980s, by today's standards it was tame. There were no cash awards . . . threats to dissolve schools or remove students in response to low scores. . . . The pressure arose only from less tangible things, such as publicity and jawboning.
In other words, this foundational study did not include a high-stakes test. After all, in our open democracy, all tests are subject to "publicity and jawboning," whether they genuinely have high stakes or no stakes. (Koretz, incidentally, is also incorrect in characterizing the test as "high stakes by the standards of the late 1980s": at the time more than twenty states administered high school graduation exams, for which failing students were denied diplomas.)
Do as I Say, Not as I Do
Many testing researchers (unsurprisingly, not associated with CRESST) caution against two simplistic assumptions: that any test will generalize to any other simply because the two share a subject field name, and that one test can be used to benchmark trends in the scores of another (Archbald, 1994; Bhola, Impara, and Buckendahl, 2003, 28; Buckendahl, et al., 2000; Cohen and Spillane, 1993, 53; Freeman, et al., 1983; Impara, 2001; Impara, et al., 2000; Moore, 1991; Plake, et al., 2000; Schmidt, 2004; Wainer, 2011). Ironically, despite himself, Koretz cannot help agreeing with them. Much of the space in Measuring Up is devoted to cautioning the reader against doing exactly what he does: making apples-to-oranges comparisons with scores or score trends from different tests. For example:
One sometimes disquieting consequence of the incompleteness of tests is that different tests often provide somewhat inconsistent results. (p. 10)
Even a single test can provide varying results. Just as polls have a margin of error, so do achievement tests. Students who take more than one form of a test typically obtain different scores. (p. 11)
Even well-designed tests will often provide substantially different views of trends because of differences in content and other aspects of the tests' design. . . . [W]e have to be careful not to place too much confidence in detailed findings, such as the precise size of changes over time or of differences between groups. (p. 92)
[O]ne cannot give all the credit or blame to one factor . . . without investigating the impact of others. Many of the complex statistical models used in economics, sociology, epidemiology, and other sciences are efforts to take into account (or 'control for') other factors that offer plausible alternative explanations of the observed data, and many apportion variation in the outcome - say, test scores - among various possible causes. . . . A hypothesis is only scientifically credible when the evidence gathered has ruled out plausible alternative explanations. (pp. 122-123)
[A] simple correlation need not indicate that one of the factors causes the other. (p. 123)
Any number of studies have shown the complexity of the non-educational factors that can affect achievement and test scores. (p. 129)
Koretz's vague suggestion that educators teach to "a broader domain" would dilute coverage of required content that typically has been developed through a painstaking public process of expert review and evaluation. In its place, educators would teach what exactly? Content that Koretz and other anti-standards educators prefer? When the content domain of a test is the legally (or intellectually) mandated curriculum, teachers who "teach to the test" are not only teaching what they are told they should be teaching, they are teaching what they are legally and ethically obligated to teach (Gardner 2008).
Another example of an imprudent recommendation: the Princeton Review sells test preparation services, most prominently for the ACT and SAT college admission tests. Its publishers argue that students need not learn subject matter to do well on the tests, only learn some test-taking tricks. Pay a small fortune for one of their prep courses and you, too, can learn these tricks, they advertise. Curiously, independent studies have been unable to confirm the Princeton Review's claims (see, for example, Camara, 2008; Crocker, 2005; Palmer, 2002; Tuckman, 1994; Tuckman and Trimble, 1997; Allensworth, Correa, & Ponisciak, 2008), but Koretz supports them: "…this technique does often help to raise scores."
After investigations and sustained pressure from better business groups, the Princeton Review in 2010 voluntarily agreed to pull its advertising that promised score increases from taking its courses (National Advertising Division, 2010).
Scripting a Hoax
Around 1910, a laborer at the Piltdown quarries of southern England discovered the first of two skulls that appeared to represent the missing link between ape and human. In the decades following, mainstream science and some of the world's most celebrated scientists would accept "Piltdown Man" as an authentic specimen of an early hominid. Along the way, other scientists, typically of the less famous variety, proffered criticisms of the evidence, but were routinely ignored. Only in the 1950s, after a new dating technique applied to the fossil remains found them to be modern, was the accumulated abundance of contrary evidence widely considered. The Piltdown fossils, it turned out, were cleverly disguised forgeries.
"Piltdown man is one of the most famous frauds in the history of science," writes Richard Harter in his review of the hoax literature (1996-1997). Why was it so successful? Harter offers these explanations:
• some of the world's most celebrated scientists supported it;
• it matched what prevailing theories at the time had led scientists to expect;
• various officials responsible for verification turned a blind eye;
• the forgers were knowledgeable and skilled in the art of deception;
• the evidence was accepted as sufficient despite an absence of critical details; and
• contrary evidence was repeatedly ignored or dismissed.
Measuring Up's high-stakes-cause-test-score-inflation myth-making fits the hoax script perfectly.
Citation: Staradamskis, P. (2008, Fall). Measuring up: What educational testing really tells us. Book review, Educational Horizons, 87(1).
Allensworth, E., M. Correa, and S. Ponisciak. 2008, May. From High School to the Future: ACT Preparation–Too Much, Too Late: Why ACT Scores Are Low in Chicago and What It Means for Schools. Chicago, Ill.: Consortium on Chicago School Research at the University of Chicago.
Archbald, D. 1994. On the Design and Purposes of State Curriculum Guides: A Comparison of Mathematics and Social Studies Guides from Four States (RR-029). Consortium for Policy Research in Education.
Bhola, D. D., J. C. Impara, and C. W. Buckendahl. 2003, Fall. "Aligning Tests with States' Content Standards: Methods and Issues." Educational Measurement: Issues and Practice, 21-29.
Buckendahl, C. W., B. S. Plake, J. C. Impara, and P. M. Irwin. 2000. Alignment of standardized achievement tests to state content standards: A comparison of publishers' and teachers' perspectives. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, La.
Camara, W. J. 2008. "College Admission Testing: Myths and Realities in an Age of Admissions Hype," in Correcting Fallacies about Educational and Psychological Testing (chapter 4), ed. R.P. Phelps. Washington, D.C.: American Psychological Association.
Cannell, J. J. 1987. Nationally Normed Elementary Achievement Testing in America's Public Schools: How All Fifty States Are above the National Average. (2nd Ed.), Daniels, W. Va.: Friends for Education.
---. 1989. How Public Educators Cheat on Standardized Achievement Tests. Albuquerque, N.M.: Friends for Education.
Cohen, D. K., and J. P. Spillane. 1993. "Policy and Practice: The Relations between Governance and Instruction." Designing Coherent Education Policy: Improving the System, ed. S. H. Fuhrman, 35-95. San Francisco: Jossey-Bass.
Crocker, L. 2005. "Teaching for the Test: How and Why Test Preparation Is Appropriate." In Defending Standardized Testing, ed. R. P. Phelps, 159-174. Mahwah, N.J.: Lawrence Erlbaum.
Freeman, D., et al. 1983. "Do Textbooks and Tests Define a National Curriculum in Elementary School Mathematics?" Elementary School Journal 83(5): 501-514.
Gardner, W. 2008, April 17. "Good Teachers Teach to the Test: That's Because It's Eminently Sound Pedagogy." Christian Science Monitor.
Harter, R. 1996-1997. Piltdown Man: The Bogus Bones Caper. The TalkOrigins Archive. Downloaded May 13, 2008, from <http://www.talkorigins.org/faqs/piltdown.html>.
Impara, J. C. 2001, April. Alignment: One element of an assessment's instructional utility. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, Washington.
Impara, J. C., B. S. Plake, and C. W. Buckendahl. 2000, June. The comparability of norm-referenced achievement tests as they align to Nebraska's language arts content standards. Paper presented at the Large Scale Assessment Conference, Snowbird, Utah.
Moore, W. P. 1991. Relationships among teacher test performance pressures, perceived testing benefits, test preparation strategies, and student test performance. PhD dissertation, University of Kansas, Lawrence.
National Advertising Division. 2010. The Princeton Review voluntarily discontinues certain advertising claims; NAD finds company's action necessary and appropriate. New York, N.Y.: National Advertising Review Council, Council of Better Business Bureaus, CBBB Children's Advertising Review Unit, National Advertising Review Board, and Electronic Retailing Self-Regulation Program. Retrieved June 25, 2010, from http://www.nadreview.org/DocView.aspx?DocumentID=8017&DocType=1
Palmer, J. S. 2002. Performance Incentives, Teachers, and Students: Estimating the Effects of Rewards Policies on Classroom Practices and Student Performance. PhD dissertation. Columbus, Ohio: Ohio State University.
Plake, B. S., C. W. Buckendahl, and J. C. Impara. 2000, June. A comparison of publishers' and teachers' perspectives on the alignment of norm-referenced tests to Nebraska's language arts content standards. Paper presented at the Large Scale Assessment Conference, Snowbird, Utah.
Schmidt, W. 2004, October 22. The role of content in value-added. Talk presented at the conference Value-Added Modeling: Issues with Theory and Application, University of Maryland, College Park.
Shepard, L. A. 1990, Fall. "Inflated Test Score Gains: Is the Problem Old Norms or Teaching the Test?" Educational Measurement: Issues and Practice, 15-22.
Tuckman, B. W. 1994, April 4-8. Comparing incentive motivation to metacognitive strategy in its effect on achievement. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, La. Available from ERIC (ED368790).
Tuckman, B. W., and S. Trimble. 1997, August. Using tests as a performance incentive to motivate eighth-graders to study. Paper presented at the annual meeting of the American Psychological Association, Chicago. Available from ERIC (ED418785).
Wainer, H. 2011. Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies, 134-137. Princeton, N.J.: Princeton University Press.