Should K-12 students’ test scores determine their teachers’ pay?

Success in teaching is notoriously difficult to evaluate. The Bill and Melinda Gates Foundation, which funded the three-year Measures of Effective Teaching Project, notes that “[t]wo-thirds of American teachers feel that traditional evaluations don’t accurately capture the full picture of what they do in the classroom” (Bill & Melinda Gates Foundation, n.d.). Berk (2005) suggests using a combination of “student ratings, peer ratings, and self-evaluation… [to] build on the strengths of all sources, while compensating for the weaknesses in any single source” (p. 48). Despite the difficulty of measuring teaching success, the debate around merit pay (also known as performance pay) for educators has recurred intermittently since the 1920s (Pham, Nguyen, & Springer, 2017). In the age of standardization, proponents advocate basing teachers’ salaries on their students’ test scores or offering financial incentives for favorable results. Writing for Forbes, Nick Morrison (2013) acknowledges the lack of evidence linking merit pay with increases in test scores yet takes the neoliberal stance and calls the system “fairer.” Since then, Pham, Nguyen, and Springer (2017) have conducted a meta-analysis of 40 merit pay studies and found that “the presence of a merit pay program is associated with a modest, statistically significant, positive effect on student test scores” (p. 2). While the authors are appropriately conservative in their conclusion, the overly broad range of studies included in their meta-analysis prevents their findings from offering a realistic account of how effective merit pay would be if implemented nationally in the US.

The researchers narrowed an initial database search of nearly 20,000 results to a manageable 40 studies, but the range of locations represented in the data complicates their generalizability. Though affiliated with Vanderbilt University in the United States, the investigators included twelve studies from outside the country in their meta-analysis. Among these twelve is an unspecified number from developing countries (Pham, Nguyen, & Springer, 2017, p. 20), data that may have questionable applicability to the United States. Furthermore, of the 28 US-based studies, sixteen were categorized as “merit pay + other,” meaning that “merit pay was implemented in conjunction with other reforms such as additional training” (p. 43). Because merit pay was not an isolated variable in these studies, their findings are not necessarily representative of its effects. Returning to the issue of inconsistency in data origins, only one of the twelve non-US studies fell into this category, leaving the confounded designs concentrated almost entirely in the US results.
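To make the imbalance implied by these counts explicit (an illustrative calculation using only the figures cited above; the authors do not report these proportions themselves), the share of US-based studies that bundle merit pay with other reforms is

\[
\frac{16}{28} \approx 57\%, \qquad \text{compared with} \qquad \frac{1}{12} \approx 8\%
\]

of the non-US studies, so nearly all of the confounded designs sit on the US side of the sample.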

Also obscuring the findings of the meta-analysis is the breadth of methods used across the 40 studies. The authors acknowledge that “most studies reported effect sizes at multiple time points, with multiple estimation techniques, for different subject areas, and at different levels of analysis” (Pham, Nguyen, & Springer, 2017, p. 15). While they report accounting for this using a random-effects model, which “allows the true effect size to vary across studies” (p. 15), this substantial variation is still not ideal. As they note, “studies included in the analysis are decidedly different[,] and poorly produced studies may inject considerable bias” (p. 18). Considering that “almost [fewer than!] half of the studies are peer-reviewed publications” (p. 20), such a broad sampling seems overly generous. For example, the authors mention excluding “case studies of fewer than five teachers” (p. 13) from their meta-analysis. This implies that a sample of only five teachers was enough for inclusion, which appears to be the case given the lower bound of 323 students in the reported range of total sample sizes (p. 43): if that smallest study did in fact involve only five teachers, the student-to-teacher ratio would be 64.6, a plausible figure if, for instance, each teacher taught three classes of 21 to 23 students. Although the largest study in the meta-analysis includes 8,561,194 students, accepting samples as small as five teachers makes the inclusion criteria seem too lenient. The excessively broad range of data included in this meta-analysis impairs the overall reliability of the findings instead of adding merit to the studies’ generalizability.
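To lay out that back-of-the-envelope check (the five-teacher figure is inferred from the stated exclusion rule, not a number the authors report):

\[
\frac{323 \text{ students}}{5 \text{ teachers}} = 64.6 \text{ students per teacher}, \qquad 3 \times 21 = 63 \;\leq\; 64.6 \;\leq\; 69 = 3 \times 23,
\]

so a five-teacher study of 323 students is arithmetically consistent with each teacher carrying three classes of roughly 21 to 23 students.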

Although Pham, Nguyen, and Springer (2017) present the findings of this meta-analysis in an appropriately conservative way, investigation of their data collection reveals a lack of integrity in the underlying sample that prevents them from concluding even a “modest, statistically significant, positive effect on student test scores” (p. 2) associated with merit pay programs. The inconsistency in the data’s locations, quality, and methods is a structural flaw in the meta-analysis that invalidates the researchers’ claims.

References

Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International Journal of Teaching and Learning in Higher Education, 17(1), 48-62. Retrieved from http://www.isetl.org/ijtlhe/pdf/IJTLHE8.pdf

Bill & Melinda Gates Foundation. (n.d.). Measures of Effective Teaching (MET) Project. Retrieved from http://k12education.gatesfoundation.org/blog/measures-of-effective-teaching-met-project/

Morrison, N. (2013, November 26). Merit pay for teachers is only fair. Forbes. Retrieved from https://www.forbes.com/sites/nickmorrison/2013/11/26/merit-pay-for-teachers-is-only-fair/#7f3b6164215d

Pham, L. D., Nguyen, T. D., & Springer, M. G. (2017). Teacher merit pay and student test scores: A meta-analysis. Retrieved from https://pdfs.semanticscholar.org/6d88/33216d5a96a46a3fced5af4ffac7d5d29077.pdf?_ga=2.257686413.705186148.1550997599-2007298045.1550997599
