
The data show a harsh truth: most traditional clinical evaluations in medical school are statistically weak predictors of who will be a strong resident.
Not zero predictive value. But weaker, noisier, and more biased than people like to admit.
If you are a medical student obsessing over each “Above Expectations” box on your surgery rotation, you should understand what the numbers actually say about how those evaluations relate to residency performance, board scores, and future competence.
Let’s walk through the evidence like a stats consult, not a pep talk.
How Clinical Evaluations Are Supposed To Function
On paper, clinical evaluations exist to measure:
- Medical knowledge in the clinical context
- Clinical reasoning and decision-making
- Professionalism and teamwork
- Communication with patients and staff
- Work ethic and reliability
In practice, most U.S. schools use some combination of:
- End-of-rotation global rating forms (Likert scales + narrative comments)
- Mini-CEX / direct observation checklists
- OSCEs (structured patient encounters)
- Shelf exams (NBME subject exams) as an objective component
The core question: which of these have measurable predictive validity for residency performance?
To answer that, you must define “residency performance” numerically. Studies tend to operationalize it as some mix of:
- In-training exam (ITE) scores
- Board exam (USMLE Step 3, specialty boards)
- Program director global ratings
- Milestones scores (ACGME competencies)
- Occasionally: remediation, probation, or dismissal rates
So the pipeline is:
Clinical evaluations → MSPE / grades → Program selection → Residency metrics.
The reality: every link in that chain leaks signal.
What The Data Say: Overall Predictive Power
The literature is messy, but the pattern is consistent: clinical evaluations have at best modest correlations with residency outcomes.
Think “r = 0.2–0.3” territory for many measures. That is a small-to-moderate effect size. Not useless, not decisive.
| Measure | Typical correlation (r) with residency outcomes |
|---|---|
| Clerkship Grades | 0.25 |
| Narrative Evaluations | 0.18 |
| OSCE Scores | 0.22 |
| Shelf Exams | 0.32 |
| Step 2 CK | 0.45 |
Interpretation:
- Step 2 CK: strongest of this group for predicting future exam performance (ITE, boards).
- Shelf exams: moderate predictor.
- OSCEs and clinical ratings: weaker, often noisy predictors.
- Narrative comments: highly qualitative; difficult to quantify but show low-to-moderate correlations when coded.
Correlation coefficients in the 0.2–0.3 zone mean:
- They explain roughly 4–9% of the variance in residency performance (since R² = r²).
- The remaining 91–96% is explained by other factors: later training, personality, environment, luck, life events, program fit.
So if you are looking for a clean, linear “honors in medicine = star resident,” the data do not support that.
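If you want to sanity-check the variance-explained arithmetic yourself, here is a minimal sketch in Python. The r values are the rough figures quoted in the table above, not data from any single study:

```python
# Convert a correlation coefficient r into "variance explained" (R^2 = r^2).
# The r values below are the rough estimates quoted above, not study data.
for label, r in [("Clerkship grades", 0.25),
                 ("Narrative evaluations", 0.18),
                 ("OSCE scores", 0.22),
                 ("Shelf exams", 0.32),
                 ("Step 2 CK", 0.45)]:
    print(f"{label:22s} r = {r:.2f} -> explains ~{r**2:.0%} of outcome variance")
```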
What Specific Studies Actually Show
Let’s break this down by assessment type. I will summarize typical findings across multiple studies rather than hang everything on one outlier paper.
Shelf Exams and Step 2 CK: The Stronger Signals
Multiple cohorts have shown:
- Clerkship shelf exams correlate with residency in-training exams around r = 0.25–0.35.
- Step 2 CK correlates with ITE and board exam performance around r = 0.4–0.6, depending on specialty.
Translation: standardized, knowledge-heavy measures carry more predictive weight for future test-based outcomes. No surprise.
| Measure (correlation r) | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| Clerkship Shelf Exams | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 |
| Step 2 CK | 0.35 | 0.45 | 0.5 | 0.55 | 0.6 |
| Clinical Ratings | 0.05 | 0.15 | 0.2 | 0.25 | 0.3 |
Notice where clinical ratings sit: lower and more variable.
This is exactly why program directors cling to Step 2 CK after Step 1 became pass/fail. Because the numbers, flawed as they are, carry more predictive signal than subjective clerkship comments in isolation.
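One way to build intuition for what “more predictive signal” means: simulate pairs of students under an assumed correlation and ask how often the one with the higher predictor score turns out to be the stronger resident. This is a toy simulation, and the r values are assumptions taken from the ranges above:

```python
# Toy simulation: for predictors of different strength, how often does
# the higher-scoring of two randomly paired students turn out to be the
# better resident? The r values are assumptions from the ranges above.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # number of simulated students (even, so they pair cleanly)

def head_to_head_accuracy(r: float) -> float:
    # Draw (predictor, outcome) pairs from a bivariate normal with correlation r.
    cov = [[1.0, r], [r, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    # Pair students off and check whether the one with the higher
    # predictor score also has the higher outcome.
    a, b = np.arange(0, n, 2), np.arange(1, n, 2)
    return float(np.mean((x[a] > x[b]) == (y[a] > y[b])))

for label, r in [("Clinical ratings (r ≈ 0.2)", 0.2),
                 ("Shelf exams (r ≈ 0.3)", 0.3),
                 ("Step 2 CK (r ≈ 0.5)", 0.5)]:
    print(f"{label}: picks the better resident ~{head_to_head_accuracy(r):.0%} of the time")
```

Under these assumptions, an r ≈ 0.2 rating beats a coin flip only modestly (mid-50s percent), while an r ≈ 0.5 exam identifies the stronger resident roughly two-thirds of the time. That gap is what program directors are responding to.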
Global Clerkship Grades: Some Signal, Lots of Noise
Most schools boil clinical evaluations and exams into final clerkship grades: Honors, High Pass, Pass, etc.
Several studies have examined whether:
- The number of honors / high passes predicts residency outcomes.
- Being in the top tertile of clerkship performance maps to stronger resident ratings.
Findings are mixed but generally:
- More honors / higher clerkship GPA shows a small positive relationship with residency performance.
- Effect sizes again hover in the r = 0.2–0.3 range for faculty global resident ratings and milestones scores.
- Once you control for Step 2 CK, the incremental value of clerkship grades often shrinks.
So, honors versus pass is not irrelevant. It just is not the crystal ball many students think it is.
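To make the “control for Step 2 CK” point concrete, here is a small simulated sketch of incremental R². The latent structure and coefficients are assumptions chosen only to show the mechanics, not estimates from any study:

```python
# Simulated sketch of incremental validity: how much does adding clerkship
# grades on top of Step 2 CK improve prediction of an ITE-like outcome?
# All coefficients below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed latent structure: grades and Step 2 partly reflect the same
# underlying ability that also drives the residency outcome.
ability = rng.normal(size=n)
step2   = 0.7 * ability + rng.normal(scale=0.7, size=n)
grades  = 0.5 * ability + rng.normal(scale=0.9, size=n)
outcome = 0.6 * ability + rng.normal(scale=0.8, size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_step2_only = r_squared(step2.reshape(-1, 1), outcome)
r2_both = r_squared(np.column_stack([step2, grades]), outcome)
print(f"R^2, Step 2 alone:         {r2_step2_only:.3f}")
print(f"R^2, Step 2 + grades:      {r2_both:.3f}")
print(f"Incremental R^2 of grades: {r2_both - r2_step2_only:.3f}")
```

Because the two predictors share so much of the same underlying signal in this setup, the grades add only a small slice of R² once Step 2 is already in the model, which is the pattern the studies describe.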
Narrative Evaluations and MSPE Comments
Narratives feel rich and specific. “Outstanding,” “superstar,” “top 5%,” “quiet but reliable,” “needs to improve efficiency.” The problem is turning them into data.
Studies that coded narrative comments and MSPE language into quantitative categories found:
- Some phrases (“one of the best students I have worked with,” “top 10%”) correlate modestly with residency director ratings and earlier promotion.
- Mildly negative language (“requires supervision,” “needs to work on follow-through”) predicts higher risk of professionalism concerns and remediation.
- Overall predictive strength is still modest: r around 0.2–0.25 for positive phrases; stronger for clearly negative flags.
The clearest signal is at the tails:
- Glowingly superlative comments: often do map to high-performing residents.
- Subtle or explicit negative comments: disproportionately associated with performance issues down the line.
The vast middle (“good team player,” “strong work ethic,” “pleasant to work with”) is nearly indistinguishable noise from a predictive standpoint.
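As a toy illustration of what “coding” narrative comments looks like in practice, here is a minimal keyword-based sketch. The phrase lists and category labels are invented for the example; real studies use validated coding schemes and trained raters:

```python
# Toy example of sorting narrative comments into crude categories.
# Phrase lists are illustrative, not from any published instrument.
SUPERLATIVE = ["one of the best", "top 5%", "top 10%", "outstanding", "superstar"]
CONCERNING  = ["requires supervision", "needs to work on", "needs improvement",
               "unprofessional", "late to rounds"]

def code_comment(comment: str) -> str:
    text = comment.lower()
    if any(phrase in text for phrase in CONCERNING):
        return "flag"         # negative language gets priority: it carries the most signal
    if any(phrase in text for phrase in SUPERLATIVE):
        return "superlative"
    return "neutral"          # the vast, nearly uninformative middle

comments = [
    "One of the best students I have worked with this year.",
    "Good team player, strong work ethic, pleasant to work with.",
    "Solid fund of knowledge but needs to work on follow-through.",
]
for c in comments:
    print(f"{code_comment(c):12s} <- {c}")
```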
Why Clinical Evaluations Are So Noisy
If you design a measurement system that produces mostly “above average” results, you should not expect strong predictive power. Clinical evaluations are a case study.
Here is what the data and experience show:
Ceiling effects.
Most students receive high ratings. In many systems, 80–90% of scores cluster near the top of the scale. With that little spread, correlations with outcomes are mathematically limited.
Rater variability.
Attendings differ widely in how they “use the scale.” I have seen one attending give everyone “meets expectations” because they think honors should be “Nobel laureate level,” while another gives “exceeds expectations” to any student who reads one paper.
Halo and horns effects.
One salient behavior (great presentation, one big mistake, memorable patient interaction) biases entire evaluations.
Gender and racial bias.
Multiple analyses have now demonstrated systematic differences in narrative language and ratings by gender and race.
Typical pattern:
- Women: more likely to be praised for being “hardworking,” “diligent,” “caring.”
- Men: more likely to be praised for being “brilliant,” “leader,” “independent.”
- Underrepresented minorities: more likely to receive “competent” or “solid” rather than “outstanding,” with more mentions of needing support or development.
Bias does not just make the system unfair. It dilutes predictive validity, because ratings now reflect rater bias + performance, not performance alone.
Limited direct observation.
Many evaluations are based on snapshots: a few days of real observation, then a lot of hearsay and impressions. Half the time, the attending is relying on residents or nurses, or just general “vibe.”
Non-specific constructs.
Forms try to rate 10–15 competencies at once (knowledge, judgment, empathy, efficiency, communication, etc.), but in practice raters often give essentially the same score across all domains.
This is why standardized tools like OSCEs and structured Mini-CEX have slightly better reliability. Narrower focus. More direct observation. Still, their predictive strength for residency is not massive.
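The ceiling-effect point is easy to demonstrate with a quick simulation: take a rating that tracks true performance reasonably well, then force most students onto the top of a narrow scale and watch the correlation shrink. The numbers are illustrative assumptions, not empirical estimates:

```python
# Why clustering at the top of the scale caps predictive power.
# All parameters here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

true_perf  = rng.normal(size=n)                          # "actual" clinical ability
raw_rating = true_perf + rng.normal(scale=1.0, size=n)   # noisy but informative rating

# Lenient scale use: bottom 5% get a 3, next 10% a 4, the top 85% a 5,
# mirroring the "80-90% near the top" pattern described above.
cutoffs = np.quantile(raw_rating, [0.05, 0.15])
scale_rating = np.digitize(raw_rating, cutoffs) + 3      # values in {3, 4, 5}

print("r, raw rating vs true performance:  ",
      round(np.corrcoef(raw_rating, true_perf)[0, 1], 2))
print("r, 3/4/5 rating vs true performance:",
      round(np.corrcoef(scale_rating, true_perf)[0, 1], 2))
```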
Differences by Specialty: Does Predictive Value Change?
Yes, somewhat. Specialty culture and outcome metrics matter.
Broadly:
Knowledge-heavy, exam-dense specialties (internal medicine, anesthesiology, radiology):
Shelf exams, Step 2, and basic science performance show stronger correlations with ITE and board outcomes. Clinical evaluations still matter but are often overshadowed by test-based markers.
Procedural specialties (surgery, OB/GYN, ortho):
Some evidence that clerkship surgical evaluations and technical OSCEs modestly predict procedural competence assessments and surgical milestones. Again, effect sizes are modest.
Primary care fields (family medicine, pediatrics, psychiatry):
Communication and professionalism comments may carry a bit more weight in predicting longitudinal performance and professionalism issues. But from a numbers standpoint, Step 2 CK and ITEs still dominate exam-related outcomes.
Here is a simplified comparison of “typical” predictive strengths across a few specialties:
| Specialty | Tests (Step 2, ITE) | Clerkship Grades | Clinical Narratives |
|---|---|---|---|
| Internal Med | Strong | Moderate | Weak–Moderate |
| General Surgery | Moderate–Strong | Moderate | Moderate |
| Pediatrics | Strong | Moderate | Moderate |
| Psychiatry | Moderate | Moderate | Moderate |
| Family Med | Moderate | Moderate | Moderate |
“Strong” here means correlations often above 0.4. “Moderate” in the 0.2–0.4 zone. “Weak” below 0.2.
The pattern: tests are consistently the best predictors of future tests. Clinical evaluations contribute more modest, sometimes specialty-specific, incremental signal.
What About ACGME Milestones and Resident Evaluations?
You might think: residency evaluations are more structured, so maybe they are closer to the “truth” and can validate medical school clinical scores.
The data are not that pretty.
Several programs have tried to correlate medical school performance indicators (clerkship grades, narratives, OSCEs, Step 2) with early residency milestones and faculty global ratings.
Patterns:
- Step 2 CK and ITE scores still show the strongest correlations with knowledge and patient care milestones.
- Clerkship grades show small positive associations, mostly in the first year, which often fade over time.
- Professionalism-related issues in medical school do predict higher likelihood of professionalism concerns in residency. That is one area where signal is more robust.
The effect of time is important. Initial differences wash out:
- By PGY-2 or PGY-3, performance is driven heavily by residency environment, case mix, supervision quality, and the resident’s growth curve, not what they did as an M3 on medicine ward A at Hospital B.
In other words, clinical performance predictions decay with time. Which fits intuition.
How Program Directors Actually Use Clinical Evaluations
Program directors are not statisticians, but they are not naive either. Surveys and real behavior suggest they use clinical evaluations and MSPE content as:
Red flag detectors:
They look hard for negative language, professionalism concerns, remediation, failed rotations. Those have disproportionate impact and are more predictive of future problems than “slightly below average on knowledge.”
Tie-breakers:
When two applicants look identical on Step scores and research, clerkship honors count. Being “Outstanding” on medicine and surgery is a signal of reliability and work ethic, even if the pure predictive correlation is modest.
Context markers:
Some PDs adjust their interpretation by school reputation. They know School X gives everyone honors, School Y is stingy. You are being judged relative to your school’s grading culture, not in a national vacuum.
Here is how different components typically factor into selection decisions (broadly averaged across specialties and studies):
| Component | Approximate weight (%) |
|---|---|
| USMLE/COMLEX Scores | 30 |
| Clerkship Grades | 15 |
| MSPE & Narratives | 10 |
| Letters of Recommendation | 20 |
| Interviews & Fit | 25 |
This is not universal, but it is a decent approximation:
- Clinical evaluations (grades + narratives) might represent ~25% of the decision.
- Tests, letters, and interview performance drive the rest.
So, yes, they matter. But they are one part of a broader portfolio.
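For a sense of how those rough weights combine, here is a purely illustrative sketch. The weights come from the table above; the component names and 0–100 scores are made up for the example and do not reflect any real program's rubric:

```python
# Illustrative only: combining application components with the rough
# weights from the table above. Component scores (0-100) are invented.
WEIGHTS = {
    "usmle_comlex_scores": 0.30,
    "clerkship_grades":    0.15,
    "mspe_narratives":     0.10,
    "letters":             0.20,
    "interview_fit":       0.25,
}

def composite_score(components: dict) -> float:
    # Weighted average of 0-100 component scores.
    return sum(WEIGHTS[name] * score for name, score in components.items())

applicant = {
    "usmle_comlex_scores": 82,
    "clerkship_grades":    70,
    "mspe_narratives":     75,
    "letters":             88,
    "interview_fit":       80,
}
print(f"Composite: {composite_score(applicant):.1f} / 100")
```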
Practical Implications for Medical Students
Now the part you actually care about: what to do with all this data.
1. Stop Treating Each Rotation Grade As Destiny
Given the modest predictive power:
- One “Pass” or “High Pass” in a core rotation does not statistically doom your residency performance or match outcomes.
- A pattern of consistent low performance or professionalism issues is another story. That is where the predictive signal is strongest.
You should care about your evaluations. You should not catastrophize every small deviation from perfection.
2. Focus on Skills That Do Carry Forward
The pieces of clinical performance that most reliably show up again in residency:
- Reliability and follow-through.
- Ability to learn from feedback and correct mistakes.
- Communication with staff and patients.
- Patterns of unprofessional behavior (chronic lateness, dishonesty, poor teamwork).
Faculty consistently recall these traits when they write strong letters and MSPE narratives. And PDs pay attention when comments cluster in these domains.
3. Understand the Role of Standardized Exams
The hard reality from the data:
- If your Step 2 CK is strong, your chance of doing well on residency in-training exams and boards is high, regardless of a few mediocre clinical grades.
- If your Step 2 is weak, “outstanding” clinical evaluations will help, but they will not fully offset exam-performance concerns in the eyes of many programs.
You cannot ignore exams and hope that glowing clinical write-ups will fix everything. They won’t, statistically.
4. Negative Comments Matter More Than Slight Grade Differences
From a predictive standpoint, what really hurts:
- Documented professionalism issues.
- Comments hinting at dishonesty, poor judgment, unsafe behavior.
- Needing remediation or repeating rotations.
Those are associated with future problems at a much higher rate than “solid but not exceptional” comments. Guard your professionalism record fiercely.
Where The System Is Moving (Slowly)
Educators know the current system is flawed. There is ongoing work to:
- Standardize evaluation language and anchors.
- Use entrustable professional activities (EPAs) with clearer thresholds (“can independently manage overnight cross-cover calls on medicine”).
- Increase the use of structured direct observation tools.
- Develop better methods for quantifying narrative data without amplifying bias.
But none of this is moving fast. For your medical school life right now, you are living in a world where:
- Clinical evaluations are partly signal, partly social performance, partly bias.
- Their predictive power for residency success is real but limited.
- Standardized exams and clear red flags carry more deterministic weight than granular clinical score differences.
Key Takeaways
- Clinical evaluations and clerkship grades have modest predictive power for residency performance (correlations ~0.2–0.3); they matter, but they are not destiny.
- Standardized exams (Step 2 CK, shelf exams) consistently show stronger predictive validity for residency in‑training exams and boards than subjective clinical ratings.
- Negative or concerning professionalism comments carry disproportionate predictive weight compared with small differences among “good” or “strong” evaluations—protect your professionalism record above all.