After leading the most ambitious study of teacher effectiveness ever undertaken, Thomas Kane says we can measure quality in ways that are fair to teachers—and to students who deserve the best instructors schools can provide
April 09, 2013
A CENTRAL FOCUS of education reform efforts over the last decade has been the premise that classroom teachers are the single most important in-school variable affecting student learning.
The idea that there are big differences in teacher performance is axiomatic to anyone who recalls sitting at a classroom desk, or to any parent who has marveled at the impact of a great teacher on a child—or fretted over the lack of learning when that child was saddled with a clearly ineffective teacher. But researchers, school leaders, and teachers unions have argued over how to measure differences in teacher effectiveness and how to translate any such findings into policy and practice.
Thomas Kane, a professor at the Harvard Graduate School of Education, has been at the helm of the largest research study to date that has tried to address many of the questions roiling the teacher effectiveness debate. The Bill & Melinda Gates Foundation funded a massive, $45 million, three-year study under Kane’s leadership, the final report of which was issued in January. The Measures of Effective Teaching project recruited 3,000 teachers to volunteer for the study in six US school districts.
The study assessed teachers of students in grades 4 through 9 using three different measures: student test scores, classroom observations by trained evaluators, and student feedback surveys. The researchers concluded that combining the information from all three approaches, with test scores accounting for one-third to one-half of a composite index, provides the best measure of teacher effectiveness.
Teachers were assessed based on their students’ progress on standardized test scores in math and English. Such “value-added” assessments are designed to measure the true impact of a teacher by taking into account the varying achievement levels at which students begin each school year.
Teachers were also assessed based on classroom observations by trained evaluators. Unlike conventional in-person teacher evaluations, these involved videotaping multiple classroom sessions, which were then reviewed by at least two different observers.
Finally, the study gathered feedback from those with the most extensive knowledge of a teacher’s day-to-day classroom practice—his or her students. Student appraisals of such qualities as a teacher’s ability to explain concepts in different ways or to manage classroom time well were remarkably consistent with how that teacher fared in the assessment based on student achievement scores. Ronald Ferguson, a Harvard colleague of Kane’s, developed the student feedback questionnaire used in the study. “Kids know effective teaching when they experience it,” Ferguson told The New York Times in 2010, when preliminary results from the student survey component were released.
Kane, an economist who has spent years studying teacher effectiveness and education policy, seems temperamentally well-suited to navigate the minefield of today’s education reform debates. He speaks with thoughtful deliberation, betraying more than a hint of his North Carolina roots, and makes his points strongly but without the finger-pointing that too often undermines any hope of finding common ground in such discussions. Kane is quick to acknowledge the many nuances involved in examining what makes for great teaching. But he is steadfast in arguing that any areas of uncertainty shouldn’t hold back policy reforms that make use of the firm knowledge that has been gained, which he believes can contribute to big improvements in American schools.
How much change the MET study will spur in teacher policies remains to be seen, but its significance in adding to our understanding of teacher effectiveness is hard to overstate. Andrew Smarick, a well-known national education policy expert, wrote that it “may prove to be the most important K–12 research study of this generation.”
I sat down with Kane at his office in Cambridge to better understand why that may be the case. What follows is an edited transcript of our conversation.
CW: The whole premise of the research project was that there are meaningful differences in the effectiveness of teachers and that we might be able to measure these reliably. Can you talk a little about how that idea developed? What have been the prevailing norms or policies that it is challenging?
THOMAS KANE: A couple of coauthors and I wrote a paper in 2006 where we noted what many people had been observing for decades: First, that there are these large differences in student achievement gains in different teachers’ classrooms; second, that those differences had very little to do with the teachers’ initial credentials [like whether they have a master’s degree or not]; and third, that the teachers seemed to improve from their first to their second to their third year of teaching but then plateaued afterward. One way to read those three initial facts is that it’s important to try to measure performance on the job because there are big differences in test score gains, but school systems weren’t organizing themselves to do this. There’s been a lot of confusion about our examination of test score gains. The way I view them is they are, in a sense, the tip of the iceberg. They are the thing that is suggesting there’s a lot more below the surface, that there presumably are big differences in teacher practice that are generating those achievement differences. We designed the study to try to also learn more about what those differences in practice are and to see whether you can measure them. All the evidence has been suggesting for years that measuring performance on the job is critical, and yet we’ve never taken those results seriously.
CW: In many ways, the policies in place for how we hire, evaluate, and compensate teachers in the country have been proceeding along without recognition of any of these factors. Was it like two ships passing in the dark?
KANE: Actually, it’s worse than that. It’s almost as if our policies were designed as if just the opposite were true. It’s not just that the policies were designed unaware of the facts. The policies were designed as if we were living in some parallel universe where the opposite of what the evidence was suggesting was true. All of our policies focus on teachers’ credentials at the time they start teaching. So when states think of licensure, when they think of their role in ensuring a high-quality teaching force, they think of raising the standards for entering teaching. But all the evidence suggests that’s not where the action is, that the licensure requirements are not that related to student achievement gains. It’s hard to know who the great teachers are going to be before they get into the classroom. Then, once they’re there, the things we have done have been perfunctory. Teacher evaluations and classroom observations have been required in school districts for years, and yet in the typical system, 98 percent or more of teachers are given the same satisfactory rating. That’s just making a mockery of the whole idea.
CW: So you went at it with the premise that we need to work backwards—we need to see where real strong learning is taking place, in whose classrooms, and from that begin to map backwards and hopefully figure out what it is that’s making those teachers effective?
KANE: There had been attempts over the years to propose ways to identify effective teaching practice in the classroom. Back in 1996, Charlotte Danielson published a book, Framework for Teaching, which proposed a set of behaviors that education research implied would be related to student achievement gains. And there were others. We said, let’s set up a study where we could compare and contrast those and test some of these hypotheses about what effective teaching looks like. And let’s do so in a way that lets us combine several different measures. So we said, suppose for the same group of classrooms we also collected student surveys. And suppose we also collected a test of teachers’ pedagogical content knowledge. And let’s see whether the combination of measures could do better than any one alone. That had not been done before, and it certainly had not been done on such a broad scale—3,000 classrooms.
CW: How would you summarize the big-picture conclusion from the project?
KANE: The most important finding was we learned that if you combine those measures, you can identify sets of teachers who cause greater student achievement to happen. And I use that word “cause” because in the second year of the study we randomly assigned classrooms [of students] to different teachers. We had measured teachers’ practice in the first year, when they were assigned to classrooms the usual way. From that we had an impression—based on the observations, student surveys, and achievement gains—of who the more effective and less effective teachers are. We said, let’s test those impressions by randomly assigning classrooms to both groups, and let’s see what happens. And what we learned was not only were there differences in achievement following random assignment, but the differences were similar to what we would have predicted based on data in the first year. Teachers do tend to be assigned different kinds of students, but it’s based largely on the students’ prior test scores, and that is something you can control for.
CW: What did you find in terms of quantitative measurements that try to capture what it means to be in the classroom of a highly effective teacher versus the least effective?
KANE: In math, a top-quartile teacher on the combined measure generated 7.6 more months of learning in a typical school year than a teacher in the bottom quartile. To put it another way, that’s a quarter of the black-white achievement gap closed in a single year. Or, still another way, if we could find a way to ensure that our average teacher generated gains similar to a top-quartile teacher, we would close the US achievement gap with Japan in two years. The effects are smaller in English language arts. In ELA, having a top-quartile teacher generates 2.6 additional months of learning in a single year relative to a bottom-quartile teacher.
CW: The broader conclusion that the study drew was that the ideal way to assess effectiveness is by using all three of the measures. Some critics have said the achievement score gains alone tell you all you need to know about who is and who isn’t effective. They’ve suggested that because assessing teachers based on student test scores has become so controversial, the Gates Foundation tried to paper that over by dressing up the conclusions to say student feedback and classroom observations are also key parts of the equation.
KANE: It’s only dressing up the findings if you don’t accept the idea that this is about more than just accountability. My frustration with that criticism is that we were very explicit that being able to predict student achievement gains in the future is only one of the goals in trying to measure effective teaching. Another goal is to be able to provide feedback to a teacher on specific practices that they might change in order to see future achievement gains. And a third goal would be to come up with a measure that’s not going to fluctuate too wildly from year to year or from classroom to classroom. Each of the three measures (student achievement gains on state tests, student surveys, and classroom observations) had different strengths and weaknesses.
Measuring a teacher’s track record at being able to cause large gains in achievement in the past does seem to have considerable predictive power with respect to their likely gains with future groups of students. But those measures are terrible in terms of providing teachers with suggestions about what they might do differently in terms of actual practice, and they also tend to fluctuate from classroom to classroom. The classroom observations and the student surveys, because they refer to specific aspects of somebody’s practice, could have that sort of diagnostic value of pointing a teacher to think about aspects of their practice they should work on. In particular, the student surveys boost reliability. They are the least likely to fluctuate from year to year and from classroom to classroom. What we learned from the classroom observations by adults was that even with trained observers there was a significant amount of judgment involved. If you read any of these instruments, like the one on questioning skills, you’ll see there’s plenty of room for judgment in there.
CW: You mean, are teachers good at Socratic teaching or getting that kind of dialogue going?
KANE: Yes. So the instrument is fairly clear at the extremes. A teacher will get a low score if they ask a bunch of yes/no questions, if the teacher does all the questioning, if there are very few students involved in the questioning. A teacher will get a high score if the questioning is not just yes/no, if it requires students to explain their understanding of something, because it turns out that any new knowledge is based on old knowledge—it’s the way we learn. Those are the two extremes. The middle categories become a little squishier to identify, and so when I say we learned that judgment matters, we learned that even after we trained these raters, in watching any given video, the raters often disagreed. One way to increase reliability is to average. We learned that when you’re having adults observe, you want to have at least two adults and four observations, because we also saw that a teacher’s practice varied from class to class, and you wouldn’t want to just base it all on one class. Now you can average two or three adults and you can average three or four lessons. But with the students, you’re averaging over 25 students in an elementary classroom, and 75 or 100 students for middle and high school teachers. So you get the power of averaging over just lots more observers, and the students are not there for just three or four lessons, they’re there for 180 days over the course of the school year.
CW: So it’s a matter of statistical power or sample size?
KANE: Right. So even if the average student is not nearly as discerning as the average adult, you get to have many more students involved rather than a couple of adults.
CW: Some people have asked, won’t students just view favorably teachers that are nice, teachers that are lenient, teachers that have qualities that aren’t necessarily the ones of interest that you might be able to correlate with what we think of as effective teaching or the ability to drive learning?
KANE: In these questions, we tried not to just conduct a popularity contest. The questions focus on specific student experiences in the classroom. In that sense, I actually think the questions are even better than the questions that are used in higher education as part of our student evaluations.
Can I circle back to one thing that I think is important? You stated the criticism from the right—why include observations and student surveys when student achievement gains are the most predictive? But there’s criticism from the left that says, why use student achievement gains? I thought we’d made a lot of progress on that question. Five years ago, it was not rare to hear somebody say student achievement should play no role whatsoever—
CW: And that schools were becoming test factories.
KANE: Right. I think that has become rarer now, but it has not disappeared. There are two things that that argument misses. One is, suppose that a teacher is getting gains using very unconventional methods. That teacher will be happy that student achievement gains are part of the picture and not just classroom observations and student surveys. So think of Doug Flutie, who doesn’t look like the classic NFL quarterback, and yet through his career he won football games. So while I do think it’s important to try to identify the key aspects of practice that are associated with better student learning and provide teachers with feedback on that, I also think we need to be humble enough to recognize that even those general pointers might not work for a subset of teachers and that there needs to be an outlet for recognizing that. If student achievement gains aren’t part of the outlet, we’re driving people to conformity. So if we want to leave room for people who are getting great results with unconventional means, student achievement gains have to be a part of it.
A second point I wanted to make about the why-use-test-scores critique is that people have correctly pointed out, hey, look, these tests that we’re using are only measuring a subset of the skills that we want teachers to be teaching. That’s obviously correct. But the conclusion that some people draw from that doesn’t necessarily follow. High-stakes decisions are being made all the time. It’s not that we can avoid making them. In most collective bargaining agreements, if a teacher’s contract is renewed at the end of the second year, they get tenure in their third year. A principal is trying to use whatever information they have for making a difficult decision. Even if the test score gains on the state test are a limited measure and lots of other things will matter, what makes us think that those would be less related to student achievement gains on the state tests than the other things like experience or master’s degrees? One of the things we could do in this study was to take seriously the criticism that the state tests are incomplete. So in addition to the state tests, we gave students supplemental assessments that in math probed for conceptual understanding. So they’re not just applying an algorithm that they’ve used in school; they’re given a word problem that might require an understanding of addition to solve.
CW: So it sits outside of the teach-to-the-test structure?
KANE: Right. And in literacy, we had a test that required kids to provide some short-answer, written responses to the prompts rather than just multiple choice questions. We wanted to ask, are the teachers that are generating gains on these assessments different from the teachers that are generating gains on the state tests? And while they weren’t perfectly correlated, they were related, and they were much more related to each other than experience or master’s degrees, the two other things that we are making lots of high-stakes decisions on now. Now that doesn’t mean that if it were possible to include more items requiring conceptual understanding on the state test that wouldn’t be a good idea. It actually probably would be a good idea. But to think that the current tests are measuring things that are completely unrelated to conceptual understanding, I think, is just wrong, and is sort of taken as implicit in many of the arguments against it.
CW: The whole discussion around these issues often gets reduced to the idea that we’re looking for a way to axe bad teachers or reward great ones. I suppose at the extremes that might be the case, but it’s probably not going to be useful for making those kinds of decisions for the big bulk of teachers who fall somewhere in the middle.
KANE: I agree they probably won’t be useful for making those kinds of decisions, but I actually would bet that’s where the most information is. You’ve hit on what I think is one of the fundamental mistakes we make when we frame this discussion. Accountability is definitely part of the discussion. But from reading the blog posts or public debate or cocktail party discussions, you’d think it was 85, 95 percent of the issue. But you’re absolutely right, there is no way that any system could hope to fire its way out of this problem. Rather, for the vast majority of teachers these measures will be used as feedback, hopefully, for driving changes and improvement in practice. We will not see dramatic gains in student achievement unless we see dramatic differences in teaching. But that’s adult behavior change, and that’s hard. Think about this as a public health challenge. Imagine if we were all to try to lose 20 pounds. Imagine how we would possibly try to do that without bathroom scales. One way to think about the MET project is we’re trying to build a rudimentary bathroom scale that could be used as a basis for a massive adult behavior change effort. We can’t underestimate the difficulty of that challenge. We also can’t underestimate the necessity of that challenge. So a starting point is to ensure that we’re providing feedback to folks on the job. I think teachers will embrace that aspect of it, and I’m afraid that too much emphasis on the accountability side of it has led teachers to be afraid of it and alienated by it.
CW: But as you say, accountability is still part of the story here. What is your view on the use of value-added assessments based on test scores to make high-stakes accountability decisions—whether this involves terminating those at the bottom or somehow rewarding exceptional talent seen at the upper end?
KANE: As we think about making high-stakes decisions, we need to focus where the benefits are highest. That’s the initial tenure decision. After a teacher has been on the job for a couple of years, a district knows an awful lot more than they did at the moment of hiring about the teacher’s performance. Principals should ensure that only those who are clearly more effective than the average novice are granted tenure. It’s like in professional sports: No team manager would forgo a future draft pick for a player whose performance is not as high as the average rookie. Today, virtually anyone willing to remain with a district for three or more years will be granted tenure, the equivalent of a long-term contract. And yet many of those teachers have measured performance below the average novice. Every time that happens, students are harmed, the status of the profession is diminished, and more effective teachers get a colleague that they will have to cover for.
At the upper end, there’s been a lot of confusion about merit pay. Merit pay in schools is not about spurring on greater levels of effort. Teachers are already working pretty hard. Rather, I think the best use of merit pay is to retain young teachers who have demonstrated great promise with their performance in the classroom. Early in their career is when teachers are at greatest risk of leaving. It may be useful to grant longer-term bonuses, say, five years, to the teachers with the greatest track record of early career performance. Changing behavior is the key. Paying bonuses to someone who would have stayed anyway may be fair, but it’s also costly. Paying bonuses to retain high-performing teachers who would have left can yield high returns for children. While it’s impossible to know if any given teacher would leave, we know it’s a greater risk early in their career, when turnover is highest. By the way, I would make these decisions on the basis of the combination of achievement gains, student surveys, and observations. I would not base it solely on value-added scores. Moreover, I think it’s important to preserve principals’ discretion. After all, the measures can be misleading and the principal will know more about any given candidate. If a principal wants to override a decision and grant tenure to someone with performance below the average novice, they should be allowed to do so. However, they should be expected to notify the parents in the school that they are doing it and describe their reasons.
CW: So what have you learned from the project about what excellent instruction looks like? If we know there are differences, can you point to what some of them are?
KANE: So one of them is the questioning skill that we talked about. Another is time management. What’s the pace in class? Are there five or 10 minutes in class that are just wasted handing out papers and so forth? A third is just classroom management and how successful a teacher is in establishing an orderly environment.
CW: And obviously content knowledge at some point figures in here, too.
KANE: That’s the part that I think we have to make the most progress on. Right now, the classroom observation instruments that school districts are implementing are content agnostic. You could score well in terms of your questioning skills and your classroom management and your time management and be teaching incorrect stuff.
CW: This is the largest, most comprehensive study to ever look at these questions. Does it feel like it’s been pretty path-breaking?
KANE: I think this is a start. I feel like for four decades we’ve known that there are large differences in student achievement gains in different classrooms, and for four decades we’ve ignored that. So we’ve started to pay attention to it, and we’ve made some suggestions about what better classroom observations and what better student surveys should look like. But hopefully, as these systems get implemented, they will then be improved, and there will then be a next generation of classroom observations that will parse the important aspects of teaching even more effectively than the ones we tested, and there’ll be better student surveys and maybe there’ll be even some other things. Again, at least we’ve gotten started.
CW: How do you see all this fitting in with what is sometimes referred to as the education wars that pit teachers unions against reform advocates, where the use of test scores to evaluate teachers is often the polarizing issue? I saw a comment from Randi Weingarten, the president of the American Federation of Teachers, around the time of the release of your final report, that was surprisingly positive about the study, especially about the idea that it embraced using multiple measures of effective teaching. I don’t know if it’s Pollyannaish to think that we’re coming to a point where the work you and others are doing is no longer seen as an attempt to attack or undermine teachers, but is seen as trying to boost the profession and recognize them as professionals and, ultimately, improve learning for kids.
KANE: I think that this debate has evolved dramatically in the last three or four years, and it will continue to evolve. Remember, all of the teachers that were a part of this study were volunteers, and one of the six districts we worked with was New York City. We had the support and cooperation of the United Federation of Teachers in New York, which helped us recruit teachers to be part of the study. And Randi had been the president of the UFT in New York before moving to the national office. I give them a lot of credit for that. They had been saying, look, it can’t be just test scores. But they also realized that the rest of the system was completely broken, that classroom observations were not discerning. I don’t mean to say they’re enthusiastic about the student surveys, because there are a lot of reservations that teachers have raised about those. But they were at least open to trying to develop some other things besides test scores, and they realized that the infrastructure just wasn’t there. I think you’ve hit on, though, the way that this sort of large coordinated research project can help make progress in important policy debates in education. It’s hardly ever by completely overturning somebody’s prior belief. That happens, but it’s a very rare event that, on an issue where somebody has taken a very hard position, evidence is going to be enough to push them off that. But what this kind of research project often does is identify other areas where there had not been battle lines drawn, where there’s room for common ground and common understanding and discussion. And so, on classroom observations and the importance of training observers, I think there’s a lot of common ground. I think the debate has become a lot more nuanced than it was four years ago, and I think it is partially because the evidence base has evolved. 
When people become cynical about the value of research, I think they fail to recognize that you don’t have to change somebody’s mind about an issue to still have an impact on their thinking, and I think we’re seeing that on both sides of the education wars. It’s not like there’s 100 percent agreement on everything, but I think there’s been progress.