Should test scores be used AT ALL for teacher evaluation?
By Valerie Strauss


Earlier this week a major report was released saying that “value-added” formulas, which use standardized test scores to evaluate teachers, are unreliable and should not be used as a major factor in teacher assessment.

“Value-added modeling” has become the new big phrase in the education world. Essentially, it means measures that use test scores to track the growth of individual students as they progress through the grades to see how much “value” a teacher has added.

The value-added movement is supported by the Obama administration, which encouraged states to change laws to allow teachers to be evaluated primarily by such measures. And the Los Angeles Times recently used such a formula to grade more than 6,000 California teachers in a project that is highly controversial.

This would all be fine if assessment experts hadn’t repeatedly warned that standardized tests designed for students should not be used to evaluate teachers. But they have. In addition, value-added formulas do not include other factors that affect students, and can skew results by giving better scores to teachers who “teach to the test” and lesser scores to teachers who are assigned students with the greatest educational needs.

In this climate, the Economic Policy Institute, a nonpartisan, nonprofit think tank based in Washington, released the report, which concludes that VAM methods should not dominate high-stakes decisions about teacher evaluation and pay.

The report, written by 10 prominent educators and researchers, says:

There is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.

For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach.

And it warns of negative consequences if “value added” is a key component in evaluation — including more “teaching to the test” and narrowed curriculum. Further, the study says, teachers may try to avoid being assigned particularly needy students because they do worse on standardized tests.

With all of that said, I wondered why the report did not say that these measures should not be used at all in evaluation.

The executive summary says:

Legislatures should not mandate a test-based approach to teacher evaluation that is unproven and likely to harm not only teachers, but also the children they instruct.

But it also says:

Adopting an invalid teacher evaluation system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it.

So, what gives? Why should VAM measures be used if there is no consensus that they are reliable assessment tools? Why should they be given any weight? There are better, albeit more time-consuming, ways to weed out bad teachers.

I asked EPI to query the authors about this, and received a response from Helen F. Ladd, professor of public policy and economics at Duke University and president-elect of the Association for Public Policy Analysis and Management.

You can see the full list of authors, which includes Diane Ravitch and Linda Darling-Hammond, here, along with the executive summary.

I asked: If student standardized test scores are unreliable as stated in the study, why should they be used at all in teacher evaluation? Why doesn’t the study say they should not be used, period, for this purpose? Was the study bending to political reality?

Ladd: “There is no perfect way to evaluate teachers. Test scores are unreliable; so are principal observations, or peer evaluations, or analysis of videotapes, and so on. The only way to evaluate teachers fairly is to gather information from a variety of imperfect sources, each of which may contribute some information. If a teacher seemed to be ineffective in all of these measures, I’d be pretty confident that the teacher was ineffective. But if a teacher were ineffective only on one of them, I would be reluctant to make that conclusion.

“Test scores are unreliable, but they are still more often right than wrong, but not sufficiently more often to justify making high-stakes decisions on the basis of test scores alone. But giving test scores too much weight in a balanced evaluation system runs the additional danger of creating incentives to narrow the curriculum, as we described in the paper. If they are not given too much weight, this danger is lessened. How much weight they should be given should be a matter of local experimentation and judgment. All we say in the paper is that giving them 50 percent of the weight is too much.”