
The happy-face ratings you collect at the end of teaching sessions are statistically weak predictors of objective performance. That is the uncomfortable truth. The data show a modest, often misleading relationship between learner evaluation scores and what actually matters: knowledge gain, skill competence, patient outcomes, and long‑term practice behavior.
If you build your teaching career around keeping evaluations “above 4.5,” you are optimizing for the wrong metric.
Let’s walk through what the numbers actually say.
What exactly are we comparing?
“Learner evaluation scores” and “objective outcomes” sound tidy. In reality, they come from messy, different data-generating processes.
On the ratings side, most medical educators see some mix of:
- 5‑point Likert teaching effectiveness scores (often skewed 4–5)
- End‑of‑rotation global ratings of faculty and residents
- Course or conference satisfaction surveys
- Narrative comments coded into “top, middle, bottom” tiers
On the outcomes side, we have a very different set of variables:
- Written exam scores (course exams, NBME, board-style)
- OSCE/skills performance, milestone ratings
- Procedure logs and technical competency assessments
- Clinical metrics: guideline adherence, complication rates, readmissions
- Long‑term markers: board pass rates, prescribing patterns, referral quality
These do not move in lockstep. They barely share variance.
When researchers put these on the same page, the typical Pearson correlation (r) between teaching ratings and objective learning is in the 0.1–0.3 range. Translate that to variance explained (r²), and you are in the 1–9% territory. That is tiny.
To make comparisons easier, here is how that stacks up against other predictors in medical education.
| Predictor | Typical Correlation (r) with Objective Outcomes |
|---|---|
| Prior exam scores (e.g., USMLE) | 0.50 – 0.70 |
| Structured OSCE/skills ratings | 0.40 – 0.60 |
| Faculty milestone-based assessments | 0.30 – 0.50 |
| Learner satisfaction/teaching evals | 0.10 – 0.30 |
| Single global “good teacher” items | 0.05 – 0.20 |
So yes, there is some signal. But it is faint compared to other predictors.
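To make the contrast concrete, you can convert the correlation ranges in the table above into variance explained (r²). Here is a minimal sketch in Python, using the ranges from the table as illustrative summaries rather than estimates from any specific dataset:
```python
# Convert typical correlation ranges (from the table above) into variance explained (r^2).
# The ranges are illustrative summaries, not estimates from a specific dataset.
predictors = {
    "Prior exam scores (e.g., USMLE)": (0.50, 0.70),
    "Structured OSCE/skills ratings": (0.40, 0.60),
    "Faculty milestone-based assessments": (0.30, 0.50),
    "Learner satisfaction/teaching evals": (0.10, 0.30),
    "Single global 'good teacher' items": (0.05, 0.20),
}

for name, (r_low, r_high) in predictors.items():
    # Variance explained is r squared, expressed here as a percentage.
    low_pct, high_pct = 100 * r_low**2, 100 * r_high**2
    print(f"{name}: r = {r_low:.2f}-{r_high:.2f} -> {low_pct:.0f}-{high_pct:.0f}% of variance")
```
Teaching evaluations come out at roughly 1–9% of variance explained, against 25–49% for prior exam scores.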
What the meta-analyses actually show
Two main literatures matter here: general higher education and health professions education. The patterns are surprisingly consistent.
Global picture from higher education
Large meta‑analyses in higher ed (Marsh, Cohen, Feldman, and others) have repeatedly found:
- Common correlation between student ratings and achievement: r ≈ 0.20–0.30
- After adjusting for measurement error, course type, and other noise: maybe creeping toward 0.30–0.40 in the best datasets
- Heavy ceiling effects: most ratings clustered near “good” or “very good”
That translates to about 4–9% of variance in test performance explained by how students rate the teaching.
The more rigorous the design (e.g., multi-section courses with common final exams and random assignment of students to instructors), the more plausible the signal. But even then, we are nowhere near a strong predictive tool.
Health professions education: narrower lens, same story
Health professions education studies are smaller and noisier, but the central tendency is the same:
- Clinical teaching evaluations vs exam/OSCE performance: r typically 0.10–0.25
- Course satisfaction vs standardized assessment scores: often non‑significant or very small effects
- High-performing programs (by objective measures) do not consistently have “perfect” instructor ratings
Where the numbers are a bit better is at the program or course level, not individual instructor:
- Curricula with systematically higher teaching quality ratings sometimes have higher mean scores on NBME or specialty in‑training exams.
- The correlation here might sit in the 0.30 range when averaged across years and cohorts.
That is a key point: aggregation helps. At an individual teacher level, your personal rating this block is a very noisy proxy for what your learners will score on any exam.
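One way to see why aggregation helps is a quick simulation: assume a modest “true” rating-outcome correlation of about 0.2, then compare how much the estimated correlation bounces around with a single block of 20 learners versus five pooled cohorts. A minimal sketch (the true correlation, cohort size, and number of cohorts are assumptions chosen only for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
true_r = 0.20          # assumed "true" rating-outcome correlation
cohort_n = 20          # learners rating one teacher in one block
n_cohorts = 5          # e.g., five years of data pooled together
n_sim = 2000           # number of simulated teacher-blocks

def simulated_correlations(n, sims):
    """Sample correlation estimates from bivariate normal data with correlation true_r."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    estimates = []
    for _ in range(sims):
        x, y = rng.multivariate_normal([0, 0], cov, size=n).T
        estimates.append(np.corrcoef(x, y)[0, 1])
    return np.array(estimates)

single_block = simulated_correlations(cohort_n, n_sim)
pooled = simulated_correlations(cohort_n * n_cohorts, n_sim)

# With n = 20, estimates scatter widely around 0.2; pooling narrows the spread substantially.
print(f"single block  : mean r = {single_block.mean():.2f}, SD = {single_block.std():.2f}")
print(f"pooled cohorts: mean r = {pooled.mean():.2f}, SD = {pooled.std():.2f}")
```
Under these assumptions, single-block estimates routinely range from slightly negative to above 0.4 around a true value of 0.2; pooling five cohorts cuts that spread by more than half.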
To visualize the issue, think in terms of explained variance:
| Predictor | Approx. Variance Explained (%) |
|---|---|
| Teaching Evals | 6 |
| Faculty Milestones | 20 |
| OSCE Scores | 25 |
| Prior Exams | 40 |
Even being generous, teaching evaluations are a small sliver of the performance picture.
Bias: where the ratings go off the rails
Weak predictive power would be tolerable if evaluations were at least fair. They are not. Bias is not an abstract risk; it shows up in the numbers over and over.
Common documented biases:
- Gender: Female-identifying faculty frequently receive lower scores than male colleagues for similar objective outcomes, especially in surgical and procedural fields.
- Race/ethnicity: Underrepresented faculty often receive disproportionately negative ratings, even after adjusting for learner performance and course factors.
- Specialty stereotypes: “Soft” specialties (pediatrics, psych, family medicine) sometimes score differently than procedural or “hard” specialties for equivalent teaching behaviors.
- Grading leniency: Instructors perceived as “easier graders” or “less demanding” often receive higher ratings, even when learning outcomes are not better.
- Enthusiasm/entertainment: Charisma and humor inflate evaluations independent of knowledge gain.
In practical terms, I have seen teaching dashboards where a highly demanding ICU attending had median ratings of 3.7/5 but their residents consistently scored in the top decile of in‑training exams. Meanwhile, a very “nice” conference speaker sat at 4.8/5 with no detectable effect on objective performance.
The data were not subtle.
From a measurement standpoint, these biases do two things:
- Add systematic error (unrelated to teaching quality).
- Distort subgroup comparisons (e.g., punishing women and URM faculty).
So you get the worst of both worlds: low predictive validity and biased decisions.
When ratings and performance diverge
Let’s be specific about the mismatch between ratings and objective outcomes. There are at least four recurring patterns I see in institutional data.
1. Short-term satisfaction vs long-term retention
Learners often rate highly when content feels easy, clear, and not threatening. But “easy” is not always good.
Cognitively challenging teaching (the testing effect, spaced retrieval, interleaving, problem-solving before explanation) can feel frustrating. In the short term, learners underestimate how much they are actually learning from such approaches.
Studies comparing “smooth” lectures vs active learning consistently find:
- Higher satisfaction and “clarity” ratings for the smooth, traditional lecture.
- Equal or better exam performance for the active learning group, especially on higher-order items.
In other words, the session that “felt better” often does not produce better test scores.
2. Tough but effective teachers
Some of the most effective clinical educators in terms of exam analytics look “average” on evaluations:
- They cold-call on rounds.
- They give blunt feedback.
- They assign extra reading and practice questions.
- They enforce punctuality and preparedness.
These things do not boost your Likert averages. They do, however, tend to correlate with:
- Higher resident in‑training exam percentiles.
- Faster improvement curves for junior learners.
- Better milestone progression.
When you run regression models with both teaching evaluations and such “demandingness” variables, the softer satisfaction scores often drop out. The real driver of outcomes is structured challenge and feedback, not likeability.
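To make “drop out” concrete, here is a minimal regression sketch on synthetic data. The variable names, coefficients, and data-generating assumptions are all invented for illustration; the point is only to show the model specification, not to reproduce any real dataset.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300  # synthetic learners

# Invented data-generating process: structured challenge/feedback drives exam scores,
# while satisfaction ratings track likeability more than challenge.
challenge = rng.normal(size=n)                     # e.g., questions assigned, feedback frequency
likeability = rng.normal(size=n)                   # charisma, approachability
satisfaction = 0.2 * challenge + 0.8 * likeability + rng.normal(scale=0.5, size=n)
exam_score = 0.5 * challenge + rng.normal(scale=1.0, size=n)

# Regress exam performance on both satisfaction and the "demandingness" variable.
X = sm.add_constant(np.column_stack([satisfaction, challenge]))
model = sm.OLS(exam_score, X).fit()

# Under these assumptions, the satisfaction coefficient hovers near zero
# while the challenge coefficient stays close to its true value of 0.5.
print(model.params)   # [intercept, satisfaction, challenge]
print(model.pvalues)
```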
3. Entertainment confounding
If you lecture with flashy slides, humor, and compelling stories, you will almost always score well. But when you link those scores to exam performance, the effect is often negligible once you control for baseline learner ability and time on task.
I once saw a year‑over‑year curriculum review where the two most “beloved” lecturers had no discernible impact on the specialty in‑training exam subscores tied to their content. Their sessions were objectively fun. But fun is not a learning outcome.
4. Grade-expectation feedback loop
Learners who expect a good grade, or believe an instructor is lenient, tend to give higher evaluations. Data from multiple fields show that teaching ratings correlate more strongly with the expected grade than with actual learning.
In medicine, where grading is increasingly pass/fail, the analog is perceived exam difficulty or rotation workload. “Light” rotations or courses with easy exams reliably get a ratings boost.
If you then tie promotion or bonuses to those same ratings, you create powerful incentives for grade inflation and reduced rigor. That is not theoretical; it happens. You can see score distributions shift over 3–5 years when departments start using evaluations as a primary promotion metric.
Where ratings have some value
Let me be fair. Learner ratings are not useless. They become more informative when you:
- Aggregate across multiple years and cohorts.
- Focus on specific behaviorally anchored items (e.g., “gave timely feedback”).
- Use them as one component in a broader portfolio.
At the program level, sustained low ratings usually flag something real: disorganized teaching, chronic disrespect, lack of feedback. At the educator level, repeated low scores on specific dimensions can identify teachable skills.
The shift is from “ratings predict performance” to “ratings highlight experiences and behaviors that might indirectly affect performance.”
They are process indicators, not outcome metrics.
What predicts performance better than ratings?
If you want data that actually predict performance, you need to move away from single global Likert scores and toward structured, criterion‑related measures.
Strong predictors, based on multiple datasets:
- Prior performance: Board scores, NBME shelf results, OSCE scores. Correlations with future exams commonly in the 0.5–0.7 range.
- Structured observations: Direct observation checklists, mini‑CEX, EPA assessments with multiple raters; often in the 0.3–0.6 range when done properly.
- Deliberate practice exposure: Number and quality of supervised cases, procedure repetition with feedback, simulation session counts.
- Programmatic assessment aggregates: Combining multiple low-stakes data points into longitudinal progress metrics.
To illustrate relative predictive power conceptually:
| Measure | Relative Predictive Power (illustrative index) |
|---|---|
| Global Teaching Evals | 15 |
| Specific Eval Items | 25 |
| Direct Observation | 45 |
| Prior Exam Scores | 60 |
| Programmatic Composite | 55 |
The message: you get much more predictive accuracy from performance-based and longitudinal data than from satisfaction snapshots.
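As a concrete version of the “programmatic assessment aggregates” idea above, here is a minimal sketch that standardizes several low-stakes data points and averages them into a longitudinal composite per learner. The components, scales, and equal weighting are assumptions for illustration, not a validated scoring model:
```python
import pandas as pd

# Hypothetical low-stakes data points for a handful of learners.
data = pd.DataFrame({
    "learner": ["A", "A", "B", "B", "C", "C"],
    "quarter": [1, 2, 1, 2, 1, 2],
    "quiz_pct": [62, 71, 80, 84, 55, 63],        # weekly quiz average (%)
    "mini_cex": [4, 5, 6, 6, 3, 4],              # mini-CEX rating (illustrative scale)
    "cases_logged": [12, 18, 20, 25, 8, 14],     # supervised cases with feedback
})

# Standardize each component within the cohort, then average into a composite.
components = ["quiz_pct", "mini_cex", "cases_logged"]
z = (data[components] - data[components].mean()) / data[components].std()
data["composite"] = z.mean(axis=1)

# Track the composite over time per learner instead of reacting to any single score.
print(data.pivot(index="learner", columns="quarter", values="composite").round(2))
```
The design choice that matters here is longitudinal tracking: decisions key off the trajectory of the composite, not any single quiz or observation.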
How to redesign evaluation systems with data in mind
If you care about aligning evaluations with real performance, you have to re-engineer the system. Tweaking the Likert anchors will not fix this.
1. Separate satisfaction from learning evidence
Stop pretending they are the same construct.
Use learner evaluations to answer:
- Did learners feel respected?
- Was the learning environment psychologically safe?
- Were logistics clear and communication professional?
Use objective performance metrics to answer:
- Did knowledge, skills, and behaviors improve?
- Did exam performance meet targets?
- Are patient outcomes improving?
Those are different analytics streams and should be labeled as such.
2. Build multi-source educator evaluation
A credible teaching portfolio should combine:
- Learner ratings (heavily contextualized, not over-weighted).
- Peer observation of teaching with structured rubrics.
- Evidence of learning impact: pre/post scores, OSCE data, in‑training exam subscore changes.
- Scholarship: curriculum development with evaluation data, educational research.
- Contribution to program outcomes: board pass rates, milestone progression, remediation rates.
You would never base a clinical competency decision on a single patient satisfaction form. Do not do it for teaching either.
3. Use analytics at the right level
At the individual instructor level, the correlation between ratings and exam scores will always be noisy because the sample size is small (one teacher, one cohort, n≈10–40).
You gain power by:
- Aggregating across several cohorts (3–5 years).
- Looking at course or block level trends.
- Using subscore analyses: Does cardio teaching map to cardio exam subscores? Does procedural teaching map to skill stations?
A basic approach I have seen work:
| Step | Description |
|---|---|
| Step 1 | Define Course Objectives |
| Step 2 | Map to Specific Exams or OSCE Stations |
| Step 3 | Tag Teaching Sessions to Objectives |
| Step 4 | Collect Ratings and Performance Data |
| Step 5 | Analyze Objective Subscores by Session Cluster |
| Step 6 | Identify High and Low Impact Teaching |
The point is to tie teaching episodes to specific, measurable outcomes, not to generic satisfaction.
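In code, steps 3 through 6 reduce to tagging sessions with objectives, joining exam subscores on the same tags, and looking for divergence. A minimal sketch with hypothetical table and column names (the data layout is an assumption, not the export format of any particular exam vendor or LMS):
```python
import pandas as pd

# Hypothetical exports: teaching sessions tagged to objectives, and exam subscores per learner.
sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4],
    "objective": ["cardio", "cardio", "renal", "renal"],
    "mean_rating": [4.1, 4.7, 3.9, 4.5],     # learner evaluation of the session
})
subscores = pd.DataFrame({
    "learner": ["A", "B", "C", "A", "B", "C"],
    "objective": ["cardio"] * 3 + ["renal"] * 3,
    "pct_correct": [78, 82, 71, 64, 70, 59], # exam subscore tied to the same objective
})

# Step 5: objective-level performance, next to the ratings for the sessions that taught it.
performance = subscores.groupby("objective")["pct_correct"].mean()
ratings = sessions.groupby("objective")["mean_rating"].mean()
summary = pd.concat([ratings, performance], axis=1)

# Step 6: look for objectives where ratings and subscores diverge (high ratings, weak subscores,
# or the reverse) rather than ranking teachers by satisfaction alone.
print(summary.round(2))
```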
4. Adjust for known biases where possible
You cannot fully “correct” bias, but you can at least reduce its influence:
- Avoid simple cross-faculty ranking by mean score.
- Examine rating distributions by gender, race/ethnicity, and specialty; look for systematic shifts.
- Use within-instructor trends over time rather than raw between-instructor comparisons.
- When using ratings in promotion decisions, down-weight the global “overall effectiveness” item and up-weight specific, behaviorally anchored items.
If your faculty dashboard is essentially a leaderboard sorted by mean Likert score, you are enshrining bias, not measuring excellence.
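As one concrete alternative to a leaderboard, here is a minimal sketch of within-instructor trends plus a routine subgroup audit. The column names and data layout are assumptions for illustration:
```python
import pandas as pd

# Hypothetical evaluation export: one row per instructor per academic year.
evals = pd.DataFrame({
    "instructor": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [2021, 2022, 2023] * 2,
    "mean_rating": [3.7, 3.9, 4.1, 4.6, 4.5, 4.4],
    "gender": ["F", "F", "F", "M", "M", "M"],
})

# Within-instructor trend: deviation from each instructor's own multi-year baseline,
# instead of a raw leaderboard of mean scores across faculty.
evals["delta_from_own_baseline"] = (
    evals["mean_rating"] - evals.groupby("instructor")["mean_rating"].transform("mean")
)

# Routine bias audit: compare rating distributions by subgroup and look for systematic shifts.
print(evals.groupby("gender")["mean_rating"].describe()[["mean", "std"]])
print(evals[["instructor", "year", "delta_from_own_baseline"]])
```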
Practical advice for individual educators
You cannot single‑handedly redesign your institution’s evaluation system, but you can change how you interpret your own scores.
Here is the data-driven way to think about it:
- Stop treating minor fluctuations (4.5 vs 4.3) as meaningful. With small n and strong ceiling effects, those differences are often noise.
- Pay attention to patterns over years, not single blocks.
- Take specific negative comments seriously when they cluster around the same behavior (e.g., “rarely gives feedback”), because behavior is modifiable and may affect learning conditions.
- Do not equate “low but improving” satisfaction with failure. If you increase cognitive challenge and active engagement, your ratings may dip while outcomes improve. Watch both.
- Where possible, track your own teaching impact using data you control: pre/post quizzes, OSCE station performance on content you teach, simulation metrics.
If you are consistently producing better exam or OSCE outcomes while hovering around “4.0 – 4.3” on ratings, you are probably doing more good than the “4.8 – everyone loves them” colleague who leaves no measurable performance trace.
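For the “data you control” suggestion above, tracking your own impact can be as simple as paired pre/post quiz scores and a paired effect size per session. A minimal sketch with made-up scores (the quiz values and scale are assumptions):
```python
import numpy as np

# Hypothetical paired pre/post quiz scores (% correct) for one teaching session.
pre = np.array([45, 52, 38, 60, 47, 55, 41, 50])
post = np.array([68, 70, 55, 75, 66, 72, 58, 69])

gains = post - pre
# Paired Cohen's d: mean gain divided by the standard deviation of the gains.
d = gains.mean() / gains.std(ddof=1)

print(f"mean gain: {gains.mean():.1f} points")
print(f"paired Cohen's d: {d:.2f}")
```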
A reality check for leaders and promotion committees
If you are involved in faculty review, stop pretending that the 0.2 spread in mean Likert score between two educators is a robust signal. Statistically, it is not.
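The arithmetic behind that claim is straightforward: with typical evaluation counts and rating spreads, the uncertainty around a difference in means dwarfs a 0.2 gap. A minimal sketch with assumed values (the rating SD and the number of evaluations per educator are illustrative):
```python
import math

# Assumed values: each educator has ~20 evaluations with a rating SD of ~0.6.
n_a, n_b = 20, 20
sd = 0.6
observed_gap = 0.2  # e.g., mean ratings of 4.5 vs 4.3

# Standard error of the difference between two independent means.
se_diff = math.sqrt(sd**2 / n_a + sd**2 / n_b)
ci_half_width = 1.96 * se_diff  # approximate 95% CI half-width

print(f"SE of the difference: {se_diff:.2f}")
print(f"95% CI for the gap: {observed_gap:.1f} ± {ci_half_width:.2f}")
# Under these assumptions the interval spans zero, so a 0.2 spread is not a robust signal.
```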
Better uses of evaluation data in promotion:
- Minimum threshold for professionalism and learning environment (e.g., no pattern of disrespect or abuse; consistent basic adequacy).
- Contextualized narrative that triangulates ratings with peer review and objective outcome data.
- Reward for sustained, documented impact on high‑stakes outcomes (board pass rates, milestone progression, remediation reductions), even if satisfaction is moderate rather than stellar.
The cold fact: institutions that over‑weight learner satisfaction in promotion implicitly incentivize lower standards, shallower challenge, and entertainment‑heavy teaching. The correlation data support that concern.
Three points to keep in your head:
- Learner evaluation scores predict only a small fraction of objective performance. Typical correlations are 0.1–0.3, which means 1–9% of variance explained. That is weak.
- Ratings are heavily biased and confounded by factors unrelated to learning—gender, race, grading leniency, and entertainment value. They are not a fair stand‑alone metric for judging teaching quality.
- If you care about real outcomes, build systems around objective performance data, structured observations, and multi-source portfolios. Use learner ratings as context and process feedback, not as the primary proxy for educational effectiveness.