Student Evaluations of Teaching Are Not Valid

It is time to stop using SET scores in personnel decisions.
By John W. Lawrence

In a review of the literature on student evaluations of teaching (SET), Philip B. Stark and Richard Freishtat—of the University of California, Berkeley, statistics department and the Center for Teaching and Learning, respectively—concluded, “The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of ‘effectiveness’ do not measure teaching effectiveness.” This is a startling conclusion, given that SET scores are the primary measure that many colleges and universities use to evaluate professors’ teaching. But a preponderance of evidence suggests that average SET scores are not valid measures of teaching effectiveness.

There are many statistical problems with SET scores. The response rate of student evaluations is often low, and there is no reason to assume that students who skip the surveys would respond in the same pattern as those who complete them. Some colleges assume that a low response rate is the professor’s fault; however, no basis exists for this assumption. Average SET scores in small classes are also more heavily influenced by outliers, luck, and error. Finally, SET scores are ordinal categorical variables: students rate instructors on a scale from poor (one) to great (seven). Stark and Freishtat point out that these numbers are labels, not values. We cannot assume the difference between one and two is the same as the difference between five and six. It does not make statistical sense to average categorical variables.
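Even setting the ordinal-scale objection aside, the small-class problem is easy to demonstrate. The following sketch uses invented numbers, not data from any study cited here: it draws 1–7 ratings from one assumed distribution and shows how much noisier the class average is at seminar-sized enrollment than at lecture-sized enrollment.

```python
import random
import statistics

random.seed(0)

def simulate_mean_scores(class_size, n_classes=1000):
    """Average SET scores for many hypothetical classes of a given size,
    drawing each student's 1-7 rating from the same fixed distribution."""
    ratings = [1, 2, 3, 4, 5, 6, 7]
    weights = [1, 2, 4, 8, 12, 10, 5]  # assumed, skewed-high rating distribution
    return [
        statistics.mean(random.choices(ratings, weights, k=class_size))
        for _ in range(n_classes)
    ]

small = simulate_mean_scores(8)    # a small seminar
large = simulate_mean_scores(120)  # a large lecture

# Identical teaching (by construction), yet the small-class averages
# scatter several times more widely around the true mean.
print(f"spread of class averages, n=8:   {statistics.stdev(small):.2f}")
print(f"spread of class averages, n=120: {statistics.stdev(large):.2f}")
```

Because both sets of classes are generated from the same distribution, any difference between a small class's average and a large class's average here is pure sampling noise, which is exactly the kind of variation a one-semester SET average cannot distinguish from teaching quality.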

Even if SET score averages were statistically meaningful, it is impossible to compare them with other scores, such as the departmental average, without knowing the distribution of scores. For example, in baseball, if you don’t know the distribution of batting averages, you can’t know whether the difference between a .270 and .300 batting average is meaningful. Also, it makes no sense to compare SET scores of very different classes, such as a small physics course and a large lecture class on Shakespeare and hip-hop.
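A toy example with hypothetical numbers makes the point: the same 0.2-point gap below a departmental mean of about 6.0 can be either dramatic or trivial, depending entirely on the spread of instructors' averages.

```python
import statistics

# Hypothetical departmental averages (invented for illustration):
# both departments have a mean near 6.0, but very different spreads.
dept_tight = [5.9, 6.0, 6.0, 6.1, 6.0, 5.95, 6.05, 6.0]
dept_wide  = [4.5, 6.8, 5.2, 6.9, 6.3, 5.5, 6.6, 6.2]

def z_score(score, dept):
    """How many standard deviations a score sits from the department mean."""
    return (score - statistics.mean(dept)) / statistics.stdev(dept)

# A 5.8 looks alarming against the tight department but unremarkable
# against the wide one, even though both department means are ~6.0.
print(f"z in tight dept: {z_score(5.8, dept_tight):+.1f}")
print(f"z in wide dept:  {z_score(5.8, dept_wide):+.1f}")
```

This is the batting-average point in miniature: a raw difference from the departmental mean carries no information about whether an instructor is an outlier until the distribution behind that mean is known.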

More problematic are the substantive concerns. SET scores are a poor measure of teaching effectiveness. They are correlated with many variables unrelated to teaching effectiveness, including the student’s grade expectation and enjoyment of the class; the instructor’s gender, race, age, and physical attractiveness; and the weather the day the survey is completed. In a 2016 study by economist Anne Boring and statisticians Kelli Ottoboni and Philip Stark, students in both France and the United States rated online professors more positively when they believed the instructor was male. The researchers concluded that “SET are more sensitive to students’ gender bias and grade expectations than they are to teaching effectiveness.” In 2007, psychologists Robert Youmans and Benjamin Jee found that giving students chocolate before they completed teaching evaluations improved SET scores.

A 2016 study by two other psychologists, Nate Kornell and Hannah Hausman, reviewed the literature on the relationship between course SET scores and course performance, with performance measured by administering the same final test to multiple sections of the same course. Across six reviewed meta-analyses, SET scores accounted for between 0 and 18 percent of the variance in student performance. The researchers also reviewed two rigorous studies with random assignment of students that tested whether SET scores predicted performance in a subsequent class—for example, do SET scores in Calculus 1 predict grades in Calculus 2? In both studies, student performance in the second class was negatively correlated with SET scores in the first class: students whose professors received relatively low SET scores in the first semester tended to perform better in the second class. Kornell and Hausman posited that one possible explanation for these findings is that skilled teachers achieve an optimal level of difficulty in their courses, one that facilitates long-term learning. Students like it less but learn more.

Psychologist Wolfgang Stroebe has argued that reliance on SET scores for evaluating teaching may contribute, paradoxically, to a culture of less rigorous education. He reviewed evidence that students tend to rate more lenient professors more favorably. Moreover, students are more likely to take courses that they perceive as being less demanding and from which they anticipate earning a high grade. Thus, professors are rewarded for being less demanding and more lenient graders both by receiving favorable SET ratings and by enjoying higher student enrollment in their courses. Stroebe reviewed evidence that significant grade inflation over the last several decades has coincided with universities increasingly relying on average SET scores to make personnel decisions. This grade inflation has been greater at private colleges and universities, which often emphasize “customer satisfaction” more than public institutions. In addition, the amount of time students dedicate to studying has fallen, as have gains in critical-thinking skills resulting from college attendance.

If SET scores are such poor measures of teaching effectiveness and provide incentives for leniency, why do colleges and universities continue to use them? First, they are relatively easy to administer and inexpensive. Second, because they result in numerical scores, they have the appearance of being “objective.” Third, the neoliberal zeitgeist emphasizes the importance of measuring individual performance instead of working as a community to address challenges such as improving teaching. And fourth, SET scores are part of a larger problem in higher education in which corporate administrators use assessment and credentialing procedures to exert control over faculty and students. The validity of the “assessments” is assumed and of secondary importance. In reality, the valid assessment of multifaceted phenomena such as teaching effectiveness is complex and requires scientifically rigorous investigations.

So, if SET scores are not measures of teacher effectiveness, how should colleges and universities evaluate teaching? Stark and Freishtat assert that measuring both teaching effectiveness and learning is complex and that we cannot do so reliably and routinely without applying rigorous experimental methods to all courses. They suggest that we focus on evaluating teaching. This requires evaluating the materials that professors create for classes and doing regular teaching observations. Yet, as a former department chair, I have read hundreds of peer teaching observations, and my impression is that the interrater reliability of peer observations is itself not particularly high.

Given the complexity of measuring good teaching, colleges and universities need to engage in this task with humility. It may be more fruitful for institutions to approach teaching observations as a strategy for facilitating collegial dialogue on striving for high-quality teaching and experimentation. The emphasis needs to be on mentoring, not potentially punitive evaluations. Moreover, in regard to improving student performance, institutional changes are likely more effective than focusing on individual professors’ performance. Particularly for minority and first-generation students, small class sizes and increased opportunities to interact with professors and peers can improve a variety of outcomes.

John W. Lawrence teaches psychology at the City University of New York College of Staten Island. He is also a grievance counselor for the union representing CUNY faculty and staff, the Professional Staff Congress. His email address is john.lawrence@csi.cuny.edu.

Illustration by Toxitz/iStock.

Comments

The author misses another significant reason for using SET scores: consumer happiness. A happy consumer continues to consume the company's product. By the standards of business, it is good to maximize immediate satisfaction. If the academy continues mistaking education for business, then SET scores will become ever more important. If we return to the discarded educational model for higher education (not sure why we don't now call it "higher business"), learning outcomes become primary. Under the business model, they remain secondary. Nice to learn but not necessary. Until we, through our legislators, once again support higher education through tax revenue, schools will have to favor tuition revenue over learning.

I left academia a long time ago, but I can't help recalling a year when I taught the same subject in two different sections at a large state university, using the same materials, moving at the same pace, and making every effort to engage the students. One section rated me highly, with glowing comments, and the other section rated me poorly, with sarcastic comments.

So was I a good teacher or a bad teacher? The second class had a high percentage of what my colleagues called "the baseball cap crowd," young fraternity and athletic types, while the first class had more women, international students, and older students.

This kind of thing happened again and again in my 11 years in higher education.

As I look back over my own student days, I know that I clearly had some bad teachers, by which I mean the ones who never prepared and just b.s.'d their way through class or the ones who spent the hour reading aloud from the handouts that they had just given the students. I also had some superb teachers who made everyone interested in subjects that are commonly assumed to be boring.

But there's a huge cohort in between the awful and the superb, and it is hard to know how to evaluate them fairly. Perhaps it is best to mentor the underachievers or the uneven performers within a department by partnering each new instructor with an experienced instructor and holding department-wide meetings in which faculty members present their most successful teaching techniques.

As both a former student and former teacher (now retired), I believe student evaluations can be important both for evaluating teachers and for helping them improve, but ONLY if they are open-ended (essay), not multiple choice (rate from 1 to 7). Open-ended evaluations allow one to see WHY students give the ratings they do. Unfortunately, most users of student evaluations (i.e., administrators and sometimes faculty peers) don't really care enough to spend the time assessing USEFUL evaluations, but prefer the pointless and stupid multiple-choice type, which allows obtaining a numerical score with little effort (or meaning). Of course this is why every company in the world administers such evaluations after every transaction: they don't really care about quality or WHY you are happy or not; they just want an easy number to get (which, among other things, facilitates getting rid of "troublemakers" if they're looking for some excuse, which they are!).

In my experience, open-ended (essay) evaluations can be an invitation for disparaging, and even abusive, comments by a student.

The AAUP Committee on Teaching, Research, and Publication conducted a nationwide survey of faculty members' experiences with and views on student evaluations of teaching in 2015. Our suggestions were for the faculty cohort, usually the department, to define profession-appropriate standards for coursework and support one another in achieving them; to provide mentoring and development opportunities for all faculty members; to share methods and solve problems together; and to conduct faculty-based evaluations of teaching. The findings were published as Craig Vasey and Linda Carroll, "How Do We Evaluate Teaching? Findings from a Survey of Faculty Members," Academe 102.3 (May–June 2016): 34–39, https://www.aaup.org/article/how-do-we-evaluate-teaching#.Ww9AMCBlDIU. The survey has been cited by Colleen Flaherty, "Flawed Evaluation," Inside Higher Ed, June 10, 2015; Michelle Falkoff, "Why We Must Stop Relying on Student Ratings of Teaching," The Chronicle of Higher Education, April 25, 2018 (distributed by the Modern Language Association to its members in an email message of May 16, 2018); and Colleen Flaherty, "Teaching Eval Shakeup," Inside Higher Ed, May 22, 2018.

Is your university rethinking evaluation of teaching? Please complete this short survey to help us understand your perspective.

https://baseline.campuslabs.com/wsu/evaluationofteaching

Bravo to the authors for pointing out the bankruptcy of taking course evaluations too seriously. I have argued at my school that after we come to some consensus on what specific content we are trying to teach, we should prepare written or oral comprehensive exams or other similar student assessments to see if we succeeded. The consensus should be at the department or school level to abstract away from individual teacher's characteristics or preferred topics. The question then should be whether they have internalized the content, rather than whether they were happy trying.

I enjoyed reading the discussions about SET. After fifty-two years of haematology-oncology practice and teaching at a university, I have retired but continue teaching. There is no easy way to become an outstanding teacher in the eyes of students. Evaluation is very subjective. Less demanding, spoon-feeding teachers get good scores, particularly near the time of examination. Problem-based tutorials are more effective than lectures.

Readers may be interested in this article: https://ocufa.on.ca/blog-posts/significant-arbitration-decision-on-use-o... . It provides links to the Freishtat and Stark reports, as well as to the arbitration at Ryerson University in which they served as evidence.

Performance in subsequent courses seems the closest thing to an objective evaluation of teaching effectiveness. This is most obvious in multi-semester sequences, but for courses seen as foundational, one should expect the effects to extend across multiple parts of the student's curriculum. Teachers whose students consistently underperform or overperform on a GPA basis compared with teachers of similar student cohorts (no comparing ESL or developmental English teachers to accelerated English teachers) would seem to be worth examining for faulty or exemplary methods.
