
The data show a harsh truth: most traditional clinical evaluations in medical school are statistically weak predictors of who will be a strong resident.
Not zero predictive value. But weaker, noisier, and more biased than people like to admit.
If you are a medical student obsessing over each “Above Expectations” box on your surgery rotation, you should understand what the numbers actually say about how those evaluations relate to residency performance, board scores, and future competence.
Let’s walk through the evidence like a stats consult, not a pep talk.
How Clinical Evaluations Are Supposed To Function
On paper, clinical evaluations exist to measure:
- Medical knowledge in the clinical context
- Clinical reasoning and decision-making
- Professionalism and teamwork
- Communication with patients and staff
- Work ethic and reliability
In practice, most U.S. schools use some combination of:
- End-of-rotation global rating forms (Likert scales + narrative comments)
- Mini-CEX / direct observation checklists
- OSCEs (structured patient encounters)
- Shelf exams (NBME subject exams) as an objective component
The core question: which of these have measurable predictive validity for residency performance?
To answer that, you must define “residency performance” numerically. Studies tend to operationalize it as some mix of:
- In-training exam (ITE) scores
- Board exam (USMLE Step 3, specialty boards)
- Program director global ratings
- Milestones scores (ACGME competencies)
- Occasionally: remediation, probation, or dismissal rates
So the pipeline is:
Clinical evaluations → MSPE / grades → Program selection → Residency metrics.
The reality: every link in that chain leaks signal.
What The Data Say: Overall Predictive Power
The literature is messy, but the pattern is consistent: clinical evaluations have at best modest correlations with residency outcomes.
Think “r = 0.2–0.3” territory for many measures. That is a small-to-moderate effect size. Not useless, not decisive.
| Measure | Typical correlation (r) with residency outcomes |
|---|---|
| Clerkship Grades | 0.25 |
| Narrative Evaluations | 0.18 |
| OSCE Scores | 0.22 |
| Shelf Exams | 0.32 |
| Step 2 CK | 0.45 |
Interpretation:
- Step 2 CK: strongest of this group for predicting future exam performance (ITE, boards).
- Shelf exams: moderate predictor.
- OSCEs and clinical ratings: weaker, often noisy predictors.
- Narrative comments: highly qualitative; difficult to quantify but show low-to-moderate correlations when coded.
Correlation coefficients in the 0.2–0.3 zone mean:
- They explain roughly 4–9% of the variance in residency performance (since R² = r²).
- The remaining 91–96% is explained by other factors: later training, personality, environment, luck, life events, program fit.
So if you are looking for a clean, linear “honors in medicine = star resident,” the data do not support that.
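If you want to sanity-check the variance-explained arithmetic yourself, here is a minimal sketch in Python. The r values are the rough figures quoted in the table above, not data from any single study:

```python
# Convert a correlation coefficient r into "variance explained" (R^2 = r^2).
# The r values below are the rough estimates quoted above, not study data.
for label, r in [("Clerkship grades", 0.25),
                 ("Narrative evaluations", 0.18),
                 ("OSCE scores", 0.22),
                 ("Shelf exams", 0.32),
                 ("Step 2 CK", 0.45)]:
    print(f"{label:22s} r = {r:.2f} -> explains ~{r**2:.0%} of outcome variance")
```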
What Specific Studies Actually Show
Let’s break this down by assessment type. I will summarize typical findings across multiple studies rather than hang everything on one outlier paper.
Shelf Exams and Step 2 CK: The Stronger Signals
Multiple cohorts have shown:
- Clerkship shelf exams correlate with residency in-training exams around r = 0.25–0.35.
- Step 2 CK correlates with ITE and board exam performance around r = 0.4–0.6, depending on specialty.
Translation: standardized, knowledge-heavy measures carry more predictive weight for future test-based outcomes. No surprise.
| Measure (correlation r) | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| Clerkship Shelf Exams | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 |
| Step 2 CK | 0.35 | 0.45 | 0.5 | 0.55 | 0.6 |
| Clinical Ratings | 0.05 | 0.15 | 0.2 | 0.25 | 0.3 |
Notice where clinical ratings sit: lower and more variable.
This is exactly why program directors cling to Step 2 CK after Step 1 became pass/fail. Because the numbers, flawed as they are, carry more predictive signal than subjective clerkship comments in isolation.
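One way to build intuition for what “more predictive signal” means: simulate pairs of students under an assumed correlation and ask how often the one with the higher predictor score turns out to be the stronger resident. This is a toy simulation, and the r values are assumptions taken from the ranges above:

```python
# Toy simulation: for predictors of different strength, how often does
# the higher-scoring of two randomly paired students turn out to be the
# better resident? The r values are assumptions from the ranges above.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # number of simulated students (even, so they pair cleanly)

def head_to_head_accuracy(r: float) -> float:
    # Draw (predictor, outcome) pairs from a bivariate normal with correlation r.
    cov = [[1.0, r], [r, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    # Pair students off and check whether the one with the higher
    # predictor score also has the higher outcome.
    a, b = np.arange(0, n, 2), np.arange(1, n, 2)
    return float(np.mean((x[a] > x[b]) == (y[a] > y[b])))

for label, r in [("Clinical ratings (r ≈ 0.2)", 0.2),
                 ("Shelf exams (r ≈ 0.3)", 0.3),
                 ("Step 2 CK (r ≈ 0.5)", 0.5)]:
    print(f"{label}: picks the better resident ~{head_to_head_accuracy(r):.0%} of the time")
```

Under these assumptions, an r ≈ 0.2 rating beats a coin flip only modestly (mid-50s percent), while an r ≈ 0.5 exam identifies the stronger resident roughly two-thirds of the time. That gap is what program directors are responding to.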
Global Clerkship Grades: Some Signal, Lots of Noise
Most schools boil clinical evaluations and exams into final clerkship grades: Honors, High Pass, Pass, etc.
Several studies have examined whether:
- The number of honors / high passes predicts residency outcomes.
- Being in the top tertile of clerkship performance maps to stronger resident ratings.
Findings are mixed but generally:
- More honors / higher clerkship GPA shows a small positive relationship with residency performance.
- Effect sizes again hover in the r = 0.2–0.3 range for faculty global resident ratings and milestones scores.
- Once you control for Step 2 CK, the incremental value of clerkship grades often shrinks.
So, honors versus pass is not irrelevant. It just is not the crystal ball many students think it is.
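To make the “control for Step 2 CK” point concrete, here is a small simulated sketch of incremental R². The latent structure and coefficients are assumptions chosen only to show the mechanics, not estimates from any study:

```python
# Simulated sketch of incremental validity: how much does adding clerkship
# grades on top of Step 2 CK improve prediction of an ITE-like outcome?
# All coefficients below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed latent structure: grades and Step 2 partly reflect the same
# underlying ability that also drives the residency outcome.
ability = rng.normal(size=n)
step2   = 0.7 * ability + rng.normal(scale=0.7, size=n)
grades  = 0.5 * ability + rng.normal(scale=0.9, size=n)
outcome = 0.6 * ability + rng.normal(scale=0.8, size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_step2_only = r_squared(step2.reshape(-1, 1), outcome)
r2_both = r_squared(np.column_stack([step2, grades]), outcome)
print(f"R^2, Step 2 alone:         {r2_step2_only:.3f}")
print(f"R^2, Step 2 + grades:      {r2_both:.3f}")
print(f"Incremental R^2 of grades: {r2_both - r2_step2_only:.3f}")
```

Because the two predictors share so much of the same underlying signal in this setup, the grades add only a small slice of R² once Step 2 is already in the model, which is the pattern the studies describe.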
Narrative Evaluations and MSPE Comments
Narratives feel rich and specific. “Outstanding,” “superstar,” “top 5%,” “quiet but reliable,” “needs to improve efficiency.” The problem is turning them into data.
Studies that coded narrative comments and MSPE language into quantitative categories found:
- Some phrases (“one of the best students I have worked with,” “top 10%”) correlate modestly with residency director ratings and earlier promotion.
- Mildly negative language (“requires supervision,” “needs to work on follow-through”) predicts higher risk of professionalism concerns and remediation.
- Overall predictive strength is still modest: r around 0.2–0.25 for positive phrases; stronger for clearly negative flags.
The clearest signal is at the tails:
- Glowingly superlative comments: often do map to high-performing residents.
- Subtle or explicit negative comments: disproportionately associated with performance issues down the line.
The vast middle (“good team player,” “strong work ethic,” “pleasant to work with”) is nearly indistinguishable noise from a predictive standpoint.
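As a toy illustration of what “coding” narrative comments looks like in practice, here is a minimal keyword-based sketch. The phrase lists and category labels are invented for the example; real studies use validated coding schemes and trained raters:

```python
# Toy example of sorting narrative comments into crude categories.
# Phrase lists are illustrative, not from any published instrument.
SUPERLATIVE = ["one of the best", "top 5%", "top 10%", "outstanding", "superstar"]
CONCERNING  = ["requires supervision", "needs to work on", "needs improvement",
               "unprofessional", "late to rounds"]

def code_comment(comment: str) -> str:
    text = comment.lower()
    if any(phrase in text for phrase in CONCERNING):
        return "flag"         # negative language gets priority: it carries the most signal
    if any(phrase in text for phrase in SUPERLATIVE):
        return "superlative"
    return "neutral"          # the vast, nearly uninformative middle

comments = [
    "One of the best students I have worked with this year.",
    "Good team player, strong work ethic, pleasant to work with.",
    "Solid fund of knowledge but needs to work on follow-through.",
]
for c in comments:
    print(f"{code_comment(c):12s} <- {c}")
```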
Why Clinical Evaluations Are So Noisy
If you design a measurement system that produces mostly “above average” results, you should not expect strong predictive power. Clinical evaluations are a case study.
Here is what the data and experience show:
Ceiling effects.
Most students receive high ratings. In many systems, 80–90% of scores cluster near the top of the scale. With that little spread, correlations with outcomes are mathematically limited.
Rater variability.
Attendings differ widely in how they “use the scale.” I have seen one attending give everyone “meets expectations” because they think honors should be “Nobel laureate level,” while another gives “exceeds expectations” to any student who reads one paper.
Halo and horns effects.
One salient behavior (great presentation, one big mistake, memorable patient interaction) biases entire evaluations.
Gender and racial bias.
Multiple analyses have now demonstrated systematic differences in narrative language and ratings by gender and race.
Typical pattern:
- Women: more likely to be praised for being “hardworking,” “diligent,” “caring.”
- Men: more likely to be praised for being “brilliant,” “leader,” “independent.”
- Underrepresented minorities: more likely to receive “competent” or “solid” rather than “outstanding,” with more mentions of needing support or development.
Bias does not just make the system unfair. It dilutes predictive validity, because ratings now reflect rater bias + performance, not performance alone.
Limited direct observation.
Many evaluations are based on snapshots: a few days of real observation, then a lot of hearsay and impressions. Half the time, the attending is relying on residents or nurses, or just general “vibe.”
Non-specific constructs.
Forms try to rate 10–15 competencies at once (knowledge, judgment, empathy, efficiency, communication, etc.), but in practice raters often give essentially the same score across all domains.
This is why standardized tools like OSCEs and structured Mini-CEX have slightly better reliability. Narrower focus. More direct observation. Still, their predictive strength for residency is not massive.
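The ceiling-effect point is easy to demonstrate with a quick simulation: take a rating that tracks true performance reasonably well, then force most students onto the top of a narrow scale and watch the correlation shrink. The numbers are illustrative assumptions, not empirical estimates:

```python
# Why clustering at the top of the scale caps predictive power.
# All parameters here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

true_perf  = rng.normal(size=n)                          # "actual" clinical ability
raw_rating = true_perf + rng.normal(scale=1.0, size=n)   # noisy but informative rating

# Lenient scale use: bottom 5% get a 3, next 10% a 4, the top 85% a 5,
# mirroring the "80-90% near the top" pattern described above.
cutoffs = np.quantile(raw_rating, [0.05, 0.15])
scale_rating = np.digitize(raw_rating, cutoffs) + 3      # values in {3, 4, 5}

print("r, raw rating vs true performance:  ",
      round(np.corrcoef(raw_rating, true_perf)[0, 1], 2))
print("r, 3/4/5 rating vs true performance:",
      round(np.corrcoef(scale_rating, true_perf)[0, 1], 2))
```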
Differences by Specialty: Does Predictive Value Change?
Yes, somewhat. Specialty culture and outcome metrics matter.
Broadly:
Knowledge-heavy, exam-dense specialties (internal medicine, anesthesiology, radiology):
Shelf exams, Step 2, and basic science performance show stronger correlations with ITE and board outcomes. Clinical evaluations still matter but are often overshadowed by test-based markers.
Procedural specialties (surgery, OB/GYN, ortho):
Some evidence that clerkship surgical evaluations and technical OSCEs modestly predict procedural competence assessments and surgical milestones. Again, effect sizes are modest.
Primary care fields (family medicine, pediatrics, psychiatry):
Communication and professionalism comments may carry a bit more weight in predicting longitudinal performance and professionalism issues. But from a numbers standpoint, Step 2 CK and ITEs still dominate exam-related outcomes.
Here is a simplified comparison of “typical” predictive strengths across a few specialties:
| Specialty | Tests (Step 2, ITE) | Clerkship Grades | Clinical Narratives |
|---|---|---|---|
| Internal Med | Strong | Moderate | Weak–Moderate |
| General Surgery | Moderate–Strong | Moderate | Moderate |
| Pediatrics | Strong | Moderate | Moderate |
| Psychiatry | Moderate | Moderate | Moderate |
| Family Med | Moderate | Moderate | Moderate |
“Strong” here means correlations often above 0.4. “Moderate” in the 0.2–0.4 zone. “Weak” below 0.2.
The pattern: tests are consistently the best predictors of future tests. Clinical evaluations contribute more modest, sometimes specialty-specific, incremental signal.
What About ACGME Milestones and Resident Evaluations?
You might think: residency evaluations are more structured, so maybe they are closer to the “truth” and can validate medical school clinical scores.
The data are not that pretty.
Several programs have tried to correlate medical school performance indicators (clerkship grades, narratives, OSCEs, Step 2) with early residency milestones and faculty global ratings.
Patterns:
- Step 2 CK and ITE scores still show the strongest correlations with knowledge and patient care milestones.
- Clerkship grades show small positive associations, mostly in the first year, which often fade over time.
- Professionalism-related issues in medical school do predict higher likelihood of professionalism concerns in residency. That is one area where signal is more robust.
The effect of time is important. Initial differences wash out:
- By PGY-2 or PGY-3, performance is driven heavily by residency environment, case mix, supervision quality, and the resident’s growth curve, not what they did as an M3 on medicine ward A at Hospital B.
In other words, clinical performance predictions decay with time. Which fits intuition.
How Program Directors Actually Use Clinical Evaluations
Program directors are not statisticians, but they are not naive either. Surveys and real behavior suggest they use clinical evaluations and MSPE content as:
Red flag detectors:
They look hard for negative language, professionalism concerns, remediation, failed rotations. Those have disproportionate impact and are more predictive of future problems than “slightly below average on knowledge.”
Tie-breakers:
When two applicants look identical on Step scores and research, clerkship honors count. Being “Outstanding” on medicine and surgery is a signal of reliability and work ethic, even if the pure predictive correlation is modest.
Context markers:
Some PDs adjust their interpretation by school reputation. They know School X gives everyone honors, School Y is stingy. You are being judged relative to your school’s grading culture, not in a national vacuum.
Here is how different components typically factor into selection decisions (broadly averaged across specialties and studies):
| Component | Approximate weight (%) |
|---|---|
| USMLE/COMLEX Scores | 30 |
| Clerkship Grades | 15 |
| MSPE & Narratives | 10 |
| Letters of Recommendation | 20 |
| Interviews & Fit | 25 |
This is not universal, but it is a decent approximation:
- Clinical evaluations (grades + narratives) might represent ~25% of the decision.
- Tests, letters, and interview performance drive the rest.
So, yes, they matter. But they are one part of a broader portfolio.
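For a sense of how those rough weights combine, here is a purely illustrative sketch. The weights come from the table above; the component names and 0–100 scores are made up for the example and do not reflect any real program's rubric:

```python
# Illustrative only: combining application components with the rough
# weights from the table above. Component scores (0-100) are invented.
WEIGHTS = {
    "usmle_comlex_scores": 0.30,
    "clerkship_grades":    0.15,
    "mspe_narratives":     0.10,
    "letters":             0.20,
    "interview_fit":       0.25,
}

def composite_score(components: dict) -> float:
    # Weighted average of 0-100 component scores.
    return sum(WEIGHTS[name] * score for name, score in components.items())

applicant = {
    "usmle_comlex_scores": 82,
    "clerkship_grades":    70,
    "mspe_narratives":     75,
    "letters":             88,
    "interview_fit":       80,
}
print(f"Composite: {composite_score(applicant):.1f} / 100")
```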
Practical Implications for Medical Students
Now the part you actually care about: what to do with all this data.
1. Stop Treating Each Rotation Grade As Destiny
Given the modest predictive power:
- One “Pass” or “High Pass” in a core rotation does not statistically doom your residency performance or match outcomes.
- A pattern of consistent low performance or professionalism issues is another story. That is where the predictive signal is strongest.
You should care about your evaluations. You should not catastrophize every small deviation from perfection.
2. Focus on Skills That Do Carry Forward
The pieces of clinical performance that most reliably show up again in residency:
- Reliability and follow-through.
- Ability to learn from feedback and correct mistakes.
- Communication with staff and patients.
- Patterns of unprofessional behavior (chronic lateness, dishonesty, poor teamwork).
Faculty consistently recall these traits when they write strong letters and MSPE narratives. And PDs pay attention when comments cluster in these domains.
3. Understand the Role of Standardized Exams
The hard reality from the data:
- If your Step 2 CK is strong, your chance of doing well on residency in-training exams and boards is high, regardless of a few mediocre clinical grades.
- If your Step 2 is weak, “outstanding” clinical evaluations will help, but they will not fully offset exam-performance concerns in the eyes of many programs.
You cannot ignore exams and hope that glowing clinical write-ups will fix everything. They won’t, statistically.
4. Negative Comments Matter More Than Slight Grade Differences
From a predictive standpoint, what really hurts:
- Documented professionalism issues.
- Comments hinting at dishonesty, poor judgment, unsafe behavior.
- Needing remediation or repeating rotations.
Those are associated with future problems at a much higher rate than “solid but not exceptional” comments. Guard your professionalism record fiercely.
Where The System Is Moving (Slowly)
Educators know the current system is flawed. There is ongoing work to:
- Standardize evaluation language and anchors.
- Use entrustable professional activities (EPAs) with clearer thresholds (“can independently manage overnight cross-cover calls on medicine”).
- Increase the use of structured direct observation tools.
- Develop better methods for quantifying narrative data without amplifying bias.
But none of this is moving fast. For your medical school life right now, you are living in a world where:
- Clinical evaluations are partly signal, partly social performance, partly bias.
- Their predictive power for residency success is real but limited.
- Standardized exams and clear red flags carry more deterministic weight than granular clinical score differences.
Key Takeaways
- Clinical evaluations and clerkship grades have modest predictive power for residency performance (correlations ~0.2–0.3); they matter, but they are not destiny.
- Standardized exams (Step 2 CK, shelf exams) consistently show stronger predictive validity for residency in‑training exams and boards than subjective clinical ratings.
- Negative or concerning professionalism comments carry disproportionate predictive weight compared with small differences among “good” or “strong” evaluations—protect your professionalism record above all.