Since the Los Angeles Times began publishing its value-added series and making available the evaluations of 6,000 elementary school teachers, the project has become the nation’s hottest education story and the evaluation system the subject of national debate.
Meanwhile, the paper and several commentators have portrayed the issue in an overly simplistic way. It goes like this: On one side are the teachers and their obstructionist union; on the other are reformers, parents, students and a gutty newspaper.
Actually, it’s much more ambiguous. An important part of the dispute is whether the evaluations are accurate and whether the Times has done enough to make readers aware of the limitations of the value-added concept. These points have been raised by independent academic researchers.
In the value-added system, a student's past performance on tests is used to project his or her future results. The difference between the prediction and the student's actual performance after a year is the "value" that the teacher added or subtracted.
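As a minimal sketch of that arithmetic (the Times’s actual model is a more elaborate statistical regression; the numbers and function names here are purely illustrative):

```python
# Illustrative sketch of the value-added idea. The real model controls
# for many more factors; this shows only the core prediction-vs-actual gap.

def predict_score(past_scores):
    """Project a student's next test score from past performance.
    Here: a naive average. Real value-added models use regression."""
    return sum(past_scores) / len(past_scores)

def value_added(students):
    """Average (actual - predicted) gap across a teacher's students.
    Each student is a pair: (list of past scores, actual score)."""
    gaps = [actual - predict_score(past) for past, actual in students]
    return sum(gaps) / len(gaps)

# Hypothetical class of three students: (past scores, this year's score)
classroom = [([60, 64], 70), ([80, 82], 79), ([70, 72], 75)]
print(round(value_added(classroom), 1))  # prints 3.3
```

In this toy example the teacher’s students beat their projections by about 3.3 points on average, which the system would credit as “value added”; a negative number would count as value subtracted.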
A warning on value-added came from Mathematica Policy Research, an organization working for the U.S. Education Department, which is headed by Education Secretary Arne Duncan, a supporter of the method. Mathematica said the value-added system was a useful evaluation tool but:
“Our results strongly support the notion that policymakers must carefully consider system error rates in designing and implementing teacher performance measurement systems that are based on value-added models. Consideration of error rates is especially important when evaluating whether and how to use value-added estimates for making high-stakes decisions regarding teachers (such as tenure and firing decisions)...
“Using rigorous statistical methods and realistic performance measurement schemes, (this) paper presents evidence that value-added estimates for teacher-level analyses are subject to a considerable degree of random error ”
In the Times’s evaluation system, the possibility of error is expressed as confidence intervals, stated as plus-or-minus margins. The margin of error for English scores is plus or minus five for the highest and lowest scorers; for math, it is plus or minus seven. Accuracy is even lower for teachers in the middle. I checked the evaluation of a teacher I know to be outstanding. Like the others, his name was listed with a colored bar, divided into sections ranging from “least effective” to “most effective.” A diamond marked his position on the chart. I looked at the chart and read the brief text but saw no reference to confidence levels or the possibility of error.
The teacher was rated “more effective.” He was on the line between “more effective” and “most effective” in math, and near the line in English and in overall standing. Given the plus-or-minus margin of error, he could well belong on the other side of that line in math, in the “most effective” category; the same could be true in English and in the overall standing. A parent looking at the evaluation could wrongly conclude that he was not as good as a “most effective” teacher when, in fact, measurement error may have kept him out of that top category.
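The problem described above is simple interval arithmetic. In this sketch (the percentile score, margin and category cutoffs are invented for illustration, not the Times’s actual figures), a score near a boundary is consistent with more than one category:

```python
# Illustrative only: shows how a plus-or-minus margin can straddle a
# category line. The score, margin and cutoff bands here are made up.

def categories_within_margin(score, margin, cutoffs):
    """Return the set of categories the true score could fall into,
    given a plus-or-minus margin of error around the reported score."""
    lo, hi = score - margin, score + margin
    possible = set()
    for name, (band_low, band_high) in cutoffs.items():
        if hi >= band_low and lo <= band_high:  # interval overlaps the band
            possible.add(name)
    return possible

bands = {"more effective": (60, 79), "most effective": (80, 100)}
# A hypothetical teacher at the 78th percentile with a +/-7 math margin:
print(sorted(categories_within_margin(78, 7, bands)))
# prints ['more effective', 'most effective']
```

With a reported 78 and a margin of 7, the true standing could lie anywhere from 71 to 85, so both labels are statistically plausible even though the chart shows only one.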
I asked Times Asst. Managing Editor David Lauter about this. “We took several steps to deal with the inherent error rate that is involved in any statistical measure,” he said in an e-mail. “First and most importantly, we did not publish any individual score. Instead, we reported the scores only in broad categories.”
True, no individual numbers were published. But the sharp-looking color bar and the text delivered as powerful a message as publishing the actual ratings would have. And while the Times explained the possibility of sampling error, it did so in an article found elsewhere in the teacher evaluation package on its web site. It would take a dedicated reader to dig out this important bit of information. Lauter disagreed, saying the information was accessible. He has more faith than I do in readers rooting out information from the Times web site.
Another research organization, the Economic Policy Institute, also raised a point about the sampling. Such test results, EPI said, usually do not come from classes where students were enrolled at random, and random assignment is important in statistical analysis. Classes, EPI said, usually are formed by “non-random selection”: principals spread high and low achievers among classrooms, separate troublemaking friends, reward or punish teachers with “good” or “bad” classes, and yield to parental pressure.
Lauter said some researchers “have made a considerable point” of the non-random selection issue, but he said the Times’s analysis “reliably takes into account differences among students and produces unbiased results.”
He said the paper explained the error factor to its readers and that its articles repeatedly said, “value-added is just one measure” of a teacher’s work.
The Times should have done much better in filling out the story of this controversial evaluation measure and emphasizing the chance for error. It should have made a greater point of value-added being just one factor in judging a teacher; most experts figure it should count for about 30 percent. The online evaluation of the teacher I know makes no mention of the many other qualities he brings to the classroom. All anyone sees is his value-added score.
From the powerful first story featuring two “least effective” teachers to the posted database of the elementary teachers’ scores, the package suggests that value-added was the decisive factor in the Times evaluation of teachers. In fact, the construction and play of the stories and the charts told readers that this was THE evaluation.