
The belief that student evaluations are the gold standard for judging teaching quality in medical education is wrong. Not just “imperfect.” Wrong.
We’ve built entire promotion systems, faculty bonuses, and teaching awards on an instrument that—when you actually read the data—tracks more with charisma, bias, and grade inflation than with whether learners become safer, more competent clinicians.
Let me walk through what the evidence actually shows, and why blindly worshipping student ratings is quietly damaging medical education.
What Student Evaluations Actually Measure (Hint: Not What You Think)
If student evaluations truly captured “teaching quality,” they’d be strongly linked to:
- Objective knowledge gains
- Long-term retention
- Performance on standardized exams
- Clinical performance and patient outcomes
But the best studies say: not really.
Multiple meta-analyses in higher education (and a smaller number in medical education specifically) show only weak and inconsistent relationships between student ratings and actual learning. One widely cited meta-analysis in general higher ed found correlations between student ratings and learning outcomes hovering around 0.2–0.3 at best. Square that and student ratings explain maybe 4–9% of the variance in learning. That's noise territory for a high-stakes metric.
In medical education, you see the same pattern. I’ve seen clerkship directors pull up scatterplots of OSCE performance vs. teaching evaluations for clinical preceptors. Looks like buckshot. The “amazing” attending with sky-high evals has students who perform mediocrely. The “tough but fair” attending with brutally honest feedback? Lower evals, stronger OSCE scores.
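To get a feel for what a correlation in that range actually looks like, here's a minimal Python sketch using synthetic data; the 0.3 correlation comes from the meta-analytic range above, and everything else (the 200 preceptors, the quartile comparison) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_preceptors = 200
r_true = 0.3  # upper end of the meta-analytic range cited above

# Synthetic "teaching evaluation" and "learning outcome" z-scores with a
# built-in correlation of r_true.
evals = rng.normal(0.0, 1.0, n_preceptors)
noise = rng.normal(0.0, 1.0, n_preceptors)
learning = r_true * evals + np.sqrt(1 - r_true**2) * noise

observed_r = np.corrcoef(evals, learning)[0, 1]
print(f"observed correlation: {observed_r:.2f}")
print(f"variance in learning explained by ratings: {observed_r**2:.1%}")

# How often does the top-rated quartile of preceptors also land in the
# top quartile for learning outcomes?
top_eval = evals >= np.quantile(evals, 0.75)
top_learn = learning >= np.quantile(learning, 0.75)
print(f"top-rated who are also top-quartile for learning: {top_learn[top_eval].mean():.0%}")
```

Plot those two synthetic variables against each other and you get exactly the buckshot pattern those clerkship directors see.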
Why? Because student evaluations are heavily influenced by everything except rigorous teaching:
- Likeability and entertainment value
- How easy the rotation or exam felt
- How stressed or overworked the learners were
- Whether the teacher “felt supportive” even if they were educationally useless
In other words, we are often measuring whether the teacher was pleasant company during a difficult time, not whether they made learners better physicians.
The Bias Problem: Systematically Rewarding the Wrong People
Let’s address the part many faculty whisper about but administrators often ignore: student evaluations are biased. Not occasionally, structurally.
Studies across higher ed (and replicated in medical education) have shown systematic differences in ratings based on:
- Gender
- Race and ethnicity
- Accent or non-native English
- Age and physical appearance
Same syllabus, same content, different identity → different ratings.
One experiment outside medicine used an online course where the same instructor pretended to be “male” in one section and “female” in another. Exact same teaching, videos, and assignments. The “male” version got higher ratings.
I’ve heard the same story from women and minority faculty in med schools:
- “When I hold firm standards, I’m ‘mean’ or ‘unsupportive.’ When my male colleague does it, he’s ‘high expectations, pushes us to excellence.’”
- “Residents love the older white male attending who lets them leave early. My evals take a hit when I hold them to duty hour and documentation expectations.”
These aren’t hypotheticals. They show up in the numbers.
| Group | Mean evaluation score (1–5 scale) |
|---|---|
| Male-identified instructor | 4.4 |
| Female-identified instructor | 4.0 |
Even when the effect sizes look “small” on paper, remember how these scores are used: cutoffs for teaching awards, triggers for remediation, factors in promotion decisions. A 0.3–0.4 difference on a 5-point scale is enough to consistently disadvantage certain groups.
So if you treat student evaluations as your gold standard, you’re not just being lazy. You’re encoding systemic bias into who gets labeled “excellent teacher” and who quietly gets sidelined.
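To see how a "small" gap interacts with hard cutoffs, here's a minimal simulation with synthetic, normally distributed scores. The 0.3-point gap echoes the figures above; the 0.5 spread and the 4.5 award cutoff are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sd = 0.5            # assumed spread of individual instructors' mean ratings
award_cutoff = 4.5  # hypothetical "excellent teacher" threshold

# Two groups of instructors whose true means differ by 0.3 on a 5-point scale.
group_a = np.clip(rng.normal(4.4, sd, n), 1, 5)
group_b = np.clip(rng.normal(4.1, sd, n), 1, 5)

share_a = (group_a >= award_cutoff).mean()
share_b = (group_b >= award_cutoff).mean()
print(f"Group A clearing the cutoff: {share_a:.0%}")
print(f"Group B clearing the cutoff: {share_b:.0%}")
print(f"Relative chance of clearing the bar: {share_a / share_b:.1f}x")
```

Under these assumptions, the group sitting 0.3 points lower is roughly half as likely to clear the award threshold, year after year.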
The Likeability Trap: Why “Fun” ≠ “Effective”
In medicine, serious learning is uncomfortable by definition. Good clinical teachers:
- Confront knowledge gaps
- Push learners slightly beyond their comfort zone
- Give specific, sometimes harsh feedback
- Insist on preparation, reading, and repetition
That doesn’t feel good in the moment. Especially to exhausted students and residents juggling notes, exams, and sleep deprivation.
So what gets rewarded instead?
The attending who lets everyone go home early. The preceptor who says “Don’t worry about that guideline; just write this phrase.” The lecturer who replaces difficult pathophysiology with simplified, feel-good stories and curated memes.
You know the comments that correlate with high ratings:
- “Best attending ever, so chill, never pimped us.”
- “Made the rotation low-stress; didn’t care about small details.”
- “Super nice, would work with again.”
Now compare with the teachers who actually sharpen clinical reasoning:
- “Asked hard questions; felt like I was always on the spot.”
- “Very critical; made me feel incompetent sometimes.”
- “Too much feedback about small things.”
Guess which group gets tagged as “supportive and excellent” on evals, and which gets dragged for “not fostering a positive learning environment.”
I’ve watched committees read eval phrases like “intimidating” or “pimps too much” and barely ask: Did the learners actually get better? Did their exam performance improve? Did their notes or clinical decisions improve over time?
Nope. The vibe wins.
The Perverse Incentives: Grade Inflation and Soft Expectations
Once faculty realize that their career advancement hinges on student satisfaction scores, predictable behaviors follow.
I’ve literally heard this in faculty lounges:
- “Why am I going to fail a student and take a hit on evals? I’ll just pass them and document ‘remediated.’”
- “I stopped giving critical feedback on end-of-rotation forms. Every time I did, my scores tanked.”
- “If you want good evals, make the exam easy and give everyone honors.”
This isn’t rare. It’s rational behavior in a broken system.
When student evaluations become the dominant metric:
- Grades creep upward
- Honest feedback disappears
- Struggling learners get “kindness” instead of remediation
- Rigor is quietly downgraded in the name of being “supportive”
Which all feels nice in the short term. Until those same learners show up in your residency with dangerous gaps and zero experience receiving honest feedback.

The Data Problem: Garbage In, Governance Out
Even if student evaluations were conceptually perfect (they’re not), the way many institutions implement them is amateurish.
Common problems I’ve seen over and over:
- Low response rates: Often 20–40%. That’s not representative; it’s self-selection of the very happy or very pissed off.
- Tiny sample sizes: One bad day with three students can tank your mean. That’s statistically meaningless, but HR doesn’t care about power analysis.
- Poorly written items: Vague, double-barreled questions like “Created a positive and effective learning environment.” What does that even mean?
- No validation: Many schools “adapt” instruments without any psychometric analysis, then treat the output as precise measurement.
Then those shaky numbers get turned into rank lists, percentiles, and—my personal favorite—three-decimal-place averages reported in promotion packets like they’re serum sodium levels.
| Problem | Practical Consequence |
|---|---|
| Low response rates | Skewed toward extreme opinions |
| Small n per instructor | Huge volatility year to year |
| Vague questions | Scores reflect mood, not teaching |
| No validity evidence | False confidence in precise-looking data |
| Over-interpretation | High-stakes decisions on weak signals |
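To make the small-n volatility in that table concrete, here's a quick simulation; the instructor's "true" rating tendency and the response counts are assumptions, not real data:

```python
import numpy as np

rng = np.random.default_rng(7)
ratings = np.arange(1, 6)
# An instructor whose "true" tendency averages about 4.2 on a 5-point scale.
true_probs = [0.02, 0.03, 0.10, 0.40, 0.45]

for n_respondents in (3, 5, 10, 30):
    # 5,000 simulated evaluation cycles with n_respondents responses each.
    sims = rng.choice(ratings, size=(5000, n_respondents), p=true_probs)
    means = sims.mean(axis=1)
    low, high = np.percentile(means, [2.5, 97.5])
    print(f"n={n_respondents:>2}: 95% of observed means land between {low:.2f} and {high:.2f}")
```

With three respondents, the same instructor can look like a star or a problem depending on who happened to fill out the form.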
We’re pretending to do measurement. What we’re actually doing is institutionalized vibes analysis with numbers attached.
What Correlates Better With Real Learning?
If student ratings are a weak and biased signal, what should you use instead?
Not a single magic tool. But a portfolio of evidence with at least some connection to actual learning and professional outcomes.
Here’s what tends to track more meaningfully with real teaching quality:
Direct observation of teaching by trained peers
Not your buddy from fellowship. Trained observers using structured tools (e.g., frameworks like SETQ, Stanford Faculty Development Program criteria, or modified ICOs for the clinical setting). Yes, it takes time. That's the point.
Learner performance data over time
Not just one exam. Patterns: rotation exam scores, OSCE performance, progression in workplace-based assessments, stabilization or improvement of performance after structured teaching interventions.
Quality of feedback and assessment
Are this teacher's evaluations of learners specific, behavior-based, and aligned with actual performance? Or is everything "meets expectations" and copy-paste comments? Programs that audit narrative comments quickly see which faculty are doing real educational work.
Structured learner input focused on behavior, not "liking"
Rebuild your evaluation forms. Ask about specific, observable teaching behaviors (a minimal sketch of such a form appears at the end of this section):
- "Provided concrete, actionable feedback weekly"
- "Asked reasoning questions at the bedside and walked through answers"
- "Used patient cases to explicitly teach diagnostic reasoning steps"
These make it harder for pure likeability to dominate the signal.
Self-reflection and improvement over time
Does the teacher engage in faculty development? Do they change specific behaviors in response to evidence and feedback? You can track that. You should.
| Evidence source | Rough usefulness as a signal of real learning (0–10) |
|---|---|
| Student ratings alone | 2 |
| Peer observation | 6 |
| Learner performance trends | 7 |
| Quality of feedback | 7 |
| Multi-source portfolio | 9 |
Are these perfect? No. But they’re at least pointed at the right target: whether learners actually improve in ways that matter for patient care.
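Here is the sketch promised above: a behavior-anchored learner input form represented as plain data, reporting item-level frequencies instead of a single likeability mean. The item wording comes from the list above; the frequency scale, the helper function, and the sample responses are illustrative assumptions.

```python
from collections import Counter

# Behavior-based items (from the list above), rated on observed frequency
# rather than likeability.
ITEMS = [
    "Provided concrete, actionable feedback weekly",
    "Asked reasoning questions at the bedside and walked through answers",
    "Used patient cases to explicitly teach diagnostic reasoning steps",
]
SCALE = ["never", "rarely", "sometimes", "usually", "always"]

def summarize(responses: list[dict[str, str]]) -> dict[str, Counter]:
    """Tally how often each behavior was reported, item by item."""
    summary = {item: Counter() for item in ITEMS}
    for response in responses:
        for item in ITEMS:
            summary[item][response[item]] += 1
    return summary

# Hypothetical responses from three learners for one preceptor.
responses = [
    {ITEMS[0]: "usually", ITEMS[1]: "always", ITEMS[2]: "sometimes"},
    {ITEMS[0]: "always", ITEMS[1]: "usually", ITEMS[2]: "usually"},
    {ITEMS[0]: "usually", ITEMS[1]: "always", ITEMS[2]: "rarely"},
]

for item, counts in summarize(responses).items():
    print(item)
    for level in SCALE:
        if counts[level]:
            print(f"  {level}: {counts[level]}")
```

The design point is simple: every item maps to an observable behavior, so a "never" or "rarely" tells you something actionable in a way a 3.8 never will.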
How to Use Student Evaluations Without Letting Them Wreck Your Culture
I’m not saying burn all student evaluation forms and never ask learners anything. Learner perspective is part of the picture. Just not the whole painting, and definitely not the frame.
Here’s how to de-weaponize them:
- Stop using raw means as high-stakes cutoffs. Look at patterns over time, qualitative comments, and context. A "3.8" in a notoriously demanding ICU rotation may be more meaningful than a "4.6" in a cushy elective.
- Adjust for known bias factors where possible. At least be honest in committee discussions: "We know women and underrepresented faculty get lower ratings on average; we'll interpret these scores with that in mind."
- Weight them appropriately. Student ratings should be one component among several, maybe 20–30% of the teaching evaluation picture, not 90–100% (a minimal weighting sketch follows below).
- Train learners on how to give useful feedback. Short, focused orientations can move comments away from "nice / not nice" toward "specific behaviors that helped or hindered my learning."
- Separate satisfaction questions from teaching questions. "I liked this rotation" is not the same as "This teacher improved my clinical reasoning."
Put together, the workflow looks like this: frame the teaching evaluation as a multi-source process; collect student evaluations, peer observation, learner performance data, a feedback quality review, and evidence of faculty development engagement; and only then let the result inform promotion and reward decisions.
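If you want the weighting to be explicit rather than implicit, a minimal sketch might look like the following. The component names mirror the workflow above; the specific weights, with student ratings at roughly 25% in line with the 20–30% suggestion, are illustrative assumptions, not a validated scheme:

```python
# Illustrative weights for a multi-source teaching evaluation.
# Student ratings are one input (~25%), not the whole picture.
WEIGHTS = {
    "student_evaluations": 0.25,
    "peer_observation": 0.25,
    "learner_performance_trends": 0.20,
    "feedback_quality_review": 0.20,
    "faculty_development_engagement": 0.10,
}

def composite_teaching_score(components: dict[str, float]) -> float:
    """Weighted composite of normalized (0-1) component scores.

    Missing components are skipped and the remaining weights are
    re-normalized, so one absent data source doesn't zero the score.
    """
    available = {k: w for k, w in WEIGHTS.items() if k in components}
    total_weight = sum(available.values())
    return sum(components[k] * w for k, w in available.items()) / total_weight

# Example: hypothetical, already-normalized scores for one faculty member.
example = {
    "student_evaluations": 0.82,
    "peer_observation": 0.70,
    "learner_performance_trends": 0.75,
    "feedback_quality_review": 0.60,
}
print(f"Composite: {composite_teaching_score(example):.2f}")
```

Whatever the exact numbers, writing them down forces the committee to say out loud how much popularity is actually worth.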
If your department claims to value teaching but only talks about student scores in annual reviews, you don’t value teaching. You value popularity.
What You Should Do as an Individual Teacher
You can’t fix your institution alone, but you’re not powerless.
- Read your evals, but don’t internalize them as a referendum on your worth. Look for repeated, specific comments. Ignore one-off venting.
- Ask a trusted, skilled colleague to observe you teach and give real feedback. That’s worth ten anonymous comment boxes.
- Document your teaching impact: learners who improved, curricular changes you led, assessment tools you built, faculty development you completed. Bring that to your annual review.
- When you’re in the room where decisions are made, challenge the lazy assumption that “4.7 means great teacher, 4.1 means problem.” Ask: “What else do we know about their teaching? What do their learners’ outcomes look like?”
And if you’re a program director or clerkship director, you have more leverage than you think. You can pilot multi-source evaluation, rework evaluation forms, and stop pretending that a biased 5-point survey is sacred data.
The Bottom Line
Three things I want you to walk away with:
- Student evaluations are a weak, biased proxy for actual teaching quality, and the evidence is very clear on that.
- Over-reliance on these ratings actively harms medical education by rewarding likeability, punishing rigor, and amplifying systemic bias.
- Real evaluation of teaching requires a portfolio: peer observation, learner performance, quality of feedback, and yes, carefully interpreted student input—but never student ratings alone as the “gold standard.”
If you build your educational culture on popularity scores, don’t be surprised when you graduate popular teachers and poorly trained clinicians.