Residency Advisor

Practice Test vs Real Step 2 CK: Predictive Accuracy by Resource

January 5, 2026
15 minute read


The myth that “practice tests always underpredict your real Step 2 CK score” is statistically wrong. Some do. Some do not. The data show clear patterns—by resource—that you can and should exploit.

You are not guessing here. You are running a forecasting problem on yourself with small but usable datasets. The question is: which practice exams give the highest predictive accuracy, and how should you interpret them?

Let’s walk through it like a data set, not a superstition contest.


The Core Question: How Predictive Is Each Resource?

When I say “predictive,” I mean two things:

  1. Correlation with real Step 2 CK score (how well high vs low practice scores track real outcomes).
  2. Calibration (how close the predicted score is, on average, to your real Step 2 CK score).

High correlation but bad calibration = it tracks rank-order (who’s stronger vs weaker) but systematically overshoots or undershoots. Most Step 2 resources fall into one of three buckets:

  • Strong predictor and well calibrated
  • Strong predictor but slightly biased (consistently under- or overpredicts)
  • Useful for practice, poor for prediction

Here is a synthesized summary from large self-report datasets (r/Step2 and other Reddit spreadsheets, med student forums, tutoring databases) combined with tutoring cohorts I have seen:

Practice Test Predictive Summary for Step 2 CK
Resource         | Correlation (approx) | Typical Bias vs Real CK | Best Use Timing
NBME Forms 9–13  | 0.75–0.85            | 0 to -5 pts             | Final 4–6 weeks
UWSA 2           | 0.80–0.90            | +0 to +5 pts            | 1–3 weeks before
UWSA 1           | 0.70–0.80            | +3 to +8 pts            | 2–5 weeks before
Free 120 (newer) | 0.60–0.75            | -5 to +5 pts (wide)     | 1–2 weeks before
Old NBME forms   | 0.60–0.70            | -5 to -10 pts           | Early diagnostic

These numbers are approximate but directionally consistent across thousands of datapoints: NBME + UWSA 2 are your primary forecasting tools. The rest are supporting evidence.


NBME Step 2 CK Forms: The Gold Standard (With Caveats)

NBME forms 9–13 (and newer forms as they appear) are the closest thing to a calibrated prediction engine you have.

Patterns from aggregated score spreadsheets and tutoring cohorts:

  • Correlation with real Step 2 CK: ~0.8 (strong)
  • Mean error (absolute difference): typically ~5–8 points
  • Bias: usually a small underprediction (0–5 points below final score) when taken in the last 2–3 weeks

How each NBME typically behaves

Below is a simplified average, assuming the exam is taken within 3–4 weeks of the real test and you are actively studying:

NBME Step 2 CK Forms vs Real Exam (Typical Behavior)
NBME Form | Average Bias vs Real CK | Comment
NBME 9    | -3 to -7 points         | Slightly harder feel; underpredicts
NBME 10   | -2 to -6 points         | Often closest for mid-240s–260s
NBME 11   | -3 to -8 points         | Many report the biggest underprediction
NBME 12   | 0 to -5 points          | Better calibration at higher scores
NBME 13   | -2 to -6 points         | Newer; similar to 10/12 pattern

Key pattern: Almost none of these significantly overpredict if taken late. The risk is more often being scared by a low NBME that ends up 5–10 points under your real score.

How to interpret an NBME numerically

If an NBME is within 2 weeks of your exam:

  • Single NBME score → your most likely window is about ±7–8 points.
  • Multiple NBME scores over a 4–6 week window → take the mean of your last two forms, with more weight on the most recent.

Example:

  • 5 weeks out: NBME 10 = 243
  • 3 weeks out: NBME 11 = 247
  • 1 week out: NBME 12 = 250

Weighted estimate: recent scores matter more, but you also see a trend. You could model it as:

  • Last NBME (50% weight) = 250
  • Previous (30% weight) = 247
  • Oldest (20% weight) = 243

Predicted center: 0.5×250 + 0.3×247 + 0.2×243 = 247.7 ≈ 248

Then adjust by typical NBME underprediction of ~3–5 points → expected Step 2 CK ≈ 250–253. That matches what I have seen for many students.
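The recency-weighted estimate above can be sketched in a few lines. The 0.5/0.3/0.2 weights and the +3 to +5 underprediction correction are this article's rules of thumb, not official NBME conversions:

```python
# Recency-weighted NBME estimate. Weights and the bias correction are
# heuristics from aggregated self-reports, not official conversions.
def weighted_nbme_estimate(scores_oldest_first, weights=(0.2, 0.3, 0.5)):
    """Weighted mean of the last three NBME scores, most recent weighted most."""
    assert len(scores_oldest_first) == len(weights)
    return sum(s * w for s, w in zip(scores_oldest_first, weights))

center = weighted_nbme_estimate([243, 247, 250])   # 5, 3, and 1 weeks out
low, high = center + 3, center + 5                 # typical NBME underprediction
print(f"center ~ {center:.1f}, expected real ~ {low:.0f}-{high:.0f}")
```

This reproduces the example: a weighted center near 248, adjusted upward into the low 250s.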


UWorld Self-Assessments (UWSA1 & UWSA2): Strong but Slightly Optimistic

UWorld SA 1 and 2 are heavily used and heavily mythologized. The data show they are powerful predictors, but you must understand the bias.

UWSA 2: High correlation, mild overprediction

From self-reported datasets:

  • Correlation with real Step 2 CK: ~0.85–0.9
  • Mean absolute error: 5–7 points
  • Bias: +0 to +5 points relative to the real test, for most students who take it in the last 7–10 days

When people say “UWSA2 predicted my score almost perfectly,” they are usually in this scenario:

  • Took UWSA2 within 1–7 days of Step 2 CK
  • No major off-day on exam day
  • Score band: ~230–265

In that band, UWSA2 is often your single most powerful data point. But you need to correct mentally:

If UWSA2 = 252 one week out, you should not be expecting 260. A realistic expectation is roughly 247–252, with a central guess around 250.

UWSA 1: Slightly noisier, often more optimistic

UWSA 1 patterns:

  • Correlation: ~0.7–0.8
  • Bias: about +3 to +8 points above real Step 2 CK
  • More variable if taken early (>4 weeks out)

UWSA 1 tends to make people feel better than their NBME does. That is not automatically bad; it might reflect stronger question style fit. But as a prediction, I mentally “discount” UWSA1 by 5 or so points.

Example scenario I have seen multiple times:

  • NBME 10: 242 (3.5 weeks out)
  • UWSA 1: 255 (2.5 weeks out)
  • NBME 12: 246 (1.5 weeks out)
  • UWSA 2: 250 (5 days out)
  • Real Step 2 CK: 248–252 range

Notice the pattern: UWSAs a bit higher than NBME, real score in between, usually closer to last NBME / UWSA2.


Free 120: Signal, but Noisy and Often Misused

The Free 120 is abused as a prediction tool. It was not built for that. But people will force a number out of anything.

Historically, older Step 2 CK Free 120 versions had a slightly better reputation for calibration when a percentage→score mapping was applied. For the current style:

  • Correlation with Step 2 CK: moderate (~0.6–0.75)
  • Error: wide; students ±10–12 points is common
  • Bias: depends on your baseline. High scorers often find it underpredicts, mid scorers see closer alignment.

Here is a very rough conversion that tends to match aggregated experiences, assuming you take it in the final 1–2 weeks and under realistic test conditions:

Approximate Step 2 CK Score vs Free 120 Percent Correct

Free 120 % Correct | Approx. Step 2 CK Score
65%                | 225
70%                | 235
75%                | 242
80%                | 250
85%                | 258
90%                | 265
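If you want a number between the anchor points, simple linear interpolation over that table is about as much precision as the data deserve. The anchors are this article's approximations; treat the output as a ballpark:

```python
# Linear interpolation over the rough percent->score anchors above.
# The anchors are approximations from aggregated self-reports, not official.
ANCHORS = [(65, 225), (70, 235), (75, 242), (80, 250), (85, 258), (90, 265)]

def free120_to_score(pct):
    """Interpolate an approximate Step 2 CK score from Free 120 percent correct."""
    if pct <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if pct >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (p0, s0), (p1, s1) in zip(ANCHORS, ANCHORS[1:]):
        if p0 <= pct <= p1:
            return s0 + (s1 - s0) * (pct - p0) / (p1 - p0)

print(free120_to_score(78))  # lands between the 75% and 80% anchors
```

Given the ±10–12 point error described above, reporting anything finer than a 5-point band from this mapping is false precision.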

This is an approximation, not a guarantee. I have seen:

  • 75% Free 120 → real 245
  • 82% Free 120 → real 252
  • 78% Free 120 → real 241

Same percentage, very different final scores. Good for sanity-checking that you are not wildly off. Not good for arguing whether you will get a 250 vs 253.

Use case: if your NBMEs are clustered around 245–250, and you pull 80–82% on a recent Free 120 under strict conditions, the combined data say you are likely in the ~245–255 real score range.


Combining Data: How to Build a Personal Forecast

You should treat your exam prep like a mini time-series forecasting problem, not a single datapoint guess.

Here is a simple, pragmatic model that aligns well with what I have seen across many students.

Step 1: Weight resources by predictive power

Assign only rough weights:

  • Recent NBME (taken ≤3 weeks out): weight 1.0
  • UWSA2 (≤2 weeks out): weight 0.9
  • UWSA1 (≤4 weeks): weight 0.7
  • Free 120 (≤2 weeks): weight 0.5
  • Older tests or >5 weeks out: weight 0.3 or ignore unless trend is clear

Step 2: Adjust for known bias

Use typical biases:

  • NBME: add 2–4 points (they often underpredict slightly if taken late)
  • UWSA1: subtract 4–6 points
  • UWSA2: subtract 2–4 points
  • Free 120: do not hard-adjust; keep as percentage and map loosely

Example student:

  • 5 weeks out: UWSA1 = 245 → bias-adjusted ≈ 239–241 (use 240)
  • 4 weeks out: NBME 9 = 238 → bias-adjusted ≈ 241 (add 3)
  • 2 weeks out: NBME 12 = 244 → bias-adjusted ≈ 247 (add 3)
  • 1 week out: UWSA2 = 249 → bias-adjusted ≈ 245–247 (use 246)
  • 5 days out: Free 120 = 78% → rough ≈ 242–248 (central ≈ 245)

Now create a weighted average:

Let us pick central estimates:

  • UWSA1 adj: 240, weight 0.5 (older)
  • NBME 9 adj: 241, weight 0.7
  • NBME 12 adj: 247, weight 1.0
  • UWSA2 adj: 246, weight 0.9
  • Free 120 central: 245, weight 0.5

Weighted forecast:

  • Numerator = (240×0.5) + (241×0.7) + (247×1.0) + (246×0.9) + (245×0.5)
    = 120 + 168.7 + 247 + 221.4 + 122.5 = 879.6

  • Denominator = 0.5 + 0.7 + 1.0 + 0.9 + 0.5 = 3.6

  • Predicted score ≈ 879.6 / 3.6 ≈ 244.3

Then remember there is residual variance. Realistic outcome band: ~240–250. That is how a data-driven tutor would set expectations.
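Steps 1 and 2 above reduce to a few lines of code: bias-adjust each exam, then take a weighted mean. The weights and bias corrections are this article's heuristics, and the central estimates below are the worked example's numbers:

```python
# Weighted forecast from bias-adjusted practice exams (the worked example
# above). Weights and bias corrections are heuristics, not a fitted model.
exams = [
    # (central bias-adjusted estimate, weight)
    (240, 0.5),  # UWSA1, 5 weeks out (raw 245 minus ~5)
    (241, 0.7),  # NBME 9, 4 weeks out (raw 238 plus ~3)
    (247, 1.0),  # NBME 12, 2 weeks out (raw 244 plus ~3)
    (246, 0.9),  # UWSA2, 1 week out (raw 249 minus ~3)
    (245, 0.5),  # Free 120 central estimate (78%)
]

numerator = sum(score * w for score, w in exams)
denominator = sum(w for _, w in exams)
forecast = numerator / denominator
print(f"weighted forecast ~ {forecast:.1f}")  # ~ 244.3; realistic band ~240-250
```

The point of automating this is not precision; it is forcing yourself to use all the data points with sane weights instead of anchoring on the one score you like.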

Step 3: Track trajectory, not just absolutes

Trend matters. Someone moving 230 → 238 → 244 → 248 in 4–5 weeks has momentum. Someone bouncing 243 → 246 → 244 → 245 is probably plateaued.

Use a very simple mental model:

  • If last 3 exams show consistent +3 to +5 steps, you can reasonably add 2–5 points to your prediction if there is still 1–2 weeks left of focused studying.
  • If last 3 scores are flat within a 3-point band, assume minimal further gain unless you dramatically change your approach (rare this late).
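That trajectory rule can be written as a simple check on the steps between your last three exams. The thresholds (+3 to +5 steps, a 3-point flat band) come from the heuristic above, not from a fitted model:

```python
# Trajectory heuristic: climbing scores earn a small upward nudge,
# flat scores earn none. Thresholds follow the article's rule of thumb.
def trend_adjustment(last_three_scores, weeks_remaining):
    """Return extra points to add to the forecast based on recent momentum."""
    steps = [b - a for a, b in zip(last_three_scores, last_three_scores[1:])]
    spread = max(last_three_scores) - min(last_three_scores)
    if all(3 <= s <= 5 for s in steps) and weeks_remaining >= 1:
        return 2  # consistent climb with study time left: add a couple of points
    if spread <= 3:
        return 0  # plateaued: assume minimal further gain
    return 0      # mixed signal: do not adjust

print(trend_adjustment([238, 242, 246], weeks_remaining=2))  # climbing -> 2
print(trend_adjustment([243, 246, 244], weeks_remaining=2))  # flat -> 0
```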

Timing: When Each Resource Is Most Valuable

Timing interacts with predictive accuracy. A good predictor used too early becomes noisy.

Here is a simple timeline that aligns with what tends to work:

Step 2 CK Practice Test Timing Strategy

Phase                 | Action                          | Purpose
Early (6–8 weeks out) | Baseline NBME (older form or 9) | Score check
Early (6–8 weeks out) | Begin UWorld blocks             | Content + style
Mid (4–6 weeks out)   | NBME 9/10                       | Calibration
Mid (4–6 weeks out)   | UWSA 1                          | Confidence + range check
Late (2–3 weeks out)  | NBME 11/12/13                   | Primary predictor
Late (2–3 weeks out)  | Targeted review                 | Fix weak systems
Final (0–2 weeks out) | UWSA 2                          | Final range estimate
Final (0–2 weeks out) | Free 120                        | Style + sanity check
Final (0–2 weeks out) | Light review                    | Avoid burnout

Using a high-predictive test (NBME, UWSA2) 6–8 weeks out is fine for diagnosis, but do not use that score as a strict forecast. It does not account for your learning curve.

Wait until at least 2–3 weeks out before you start taking the numbers seriously as “what will I get.”


Common Misinterpretations and Bad Data Habits

I see the same analytical errors over and over:

  1. Overweighting a single outlier test.
    You cannot build a forecast on one data point. A bad test day, wrong timing, or fatigue can swing you ±10 points easily.

  2. Ignoring form-to-form difficulty variance.
    Not all NBMEs or UWSAs feel equally hard. You see this when you drop 2 points on a form but your percent correct is similar or slightly up. Look at the scale, not just the raw three-digit.

  3. Mixing very old datasets with current scoring.
    Step 2 CK changed format and scoring distributions over the years. A 2017 Free 120→score curve is not cleanly applicable to 2025.

  4. Assuming practice questions % correct = score.
    UWorld QBank percentages are a noisy, selection-biased metric (order of blocks, reuse of old knowledge, mixing timed vs untimed). I have seen 60% UWorld → 260 and 70% → 240 depending on how people used it.

  5. Emotion-driven interpretation.
    A UWSA1 that is 12 points higher than your NBME is emotionally comforting. That does not make it statistically more valid. You have to be willing to believe the “worse” number when the evidence says it is the better predictor.


Resource-by-Resource: Clear Takeaways

To make this concrete, here is the “data analyst verdict” on each major exam type for Step 2 CK prediction.

Relative Predictive Power of Step 2 CK Practice Exams
(Scale 1–10, relative within this ecosystem.)

Exam          | Predictive Power
NBME (recent) | 9
UWSA 2        | 9
UWSA 1        | 7
Free 120      | 5
Old NBME      | 5

NBME (Recent Forms 9–13)

  • Use as the backbone of your prediction.
  • Expect slight underprediction if taken late and you are still studying.
  • Two recent NBMEs averaged are usually more credible than any single non-NBME exam.

UWSA 2

  • Treat as “NBME-level” predictive power with a small positive bias.
  • Best value is 5–10 days before the exam under strict test conditions.
  • Do not panic if it is a few points off prior NBME; use it as a band, not a single point.

UWSA 1

  • Good secondary data point, not the primary anchor.
  • Most students should mentally subtract ~5 points from the raw score.
  • Use it for confidence and question exposure more than for strict prediction.

Free 120

  • Use for style, pacing, and broad sanity check, not fine-grained prediction.
  • Convert percent to very rough ranges, not exact scores.
  • If your Free 120 is wildly inconsistent with your recent NBMEs (e.g., 60% but 250 NBMEs), trust the NBMEs more.

How to Decide if You Are “Ready” Using the Numbers

You want a threshold. Some cutpoint where the data say “risk is acceptable.”

Here is a pragmatic rule set:

  1. If your last two NBMEs (within 3 weeks) are:

    • Both above your personal target score or at least above the pass–fail comfort zone you want, and
    • Not wildly divergent (≤8-point spread),

    then you are statistically ready. You may still gain a few points, but the risk of failing or collapsing is low if you are not burned out.

  2. If your last NBME and UWSA2 disagree by >10 points:

    • Look at timing (was one much earlier?).
    • Look at conditions (fatigue, breaks, distractions).
    • Consider a tiebreaker NBME if time allows. Err on the more conservative score.
  3. If your metrics are climbing and you have 2+ weeks left:

    • A consistent +3–5 per week trend suggests you can improve another 3–6 points before plateauing.
    • But do not anchor on a fantasy ceiling; let data from the last two high-quality exams drive your expectation.
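The ruleset above is simple enough to express as a checklist function. The 8-point spread and 10-point disagreement thresholds come straight from the rules; the function shape itself is just an illustration:

```python
# Readiness checklist from the rules above. Thresholds (8-point spread,
# 10-point NBME/UWSA2 disagreement) are the article's heuristics.
def readiness_check(last_two_nbmes, target, uwsa2=None):
    """Return a short verdict string from the last two recent NBME scores."""
    both_above = all(s >= target for s in last_two_nbmes)
    spread_ok = abs(last_two_nbmes[0] - last_two_nbmes[1]) <= 8
    if uwsa2 is not None and abs(uwsa2 - last_two_nbmes[-1]) > 10:
        return "conflicting data: consider a tiebreaker NBME"
    if both_above and spread_ok:
        return "statistically ready"
    return "not yet: keep studying or reassess timing"

print(readiness_check([247, 251], target=245, uwsa2=253))
```

The value is the discipline, not the code: you decide the thresholds before seeing your scores, so a disappointing number cannot talk you into moving the goalposts.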

One More Point: Variance Will Always Exist

Even with perfect modeling, human performance has variance. Sleep, anxiety, odd question mix, experimental items, interface issues. All of it adds about ±5 points of irreducible noise.

So you do not use these predictions to chase an exact score like 252 vs 253. You use them to answer practical questions:

  • Am I more likely than not to be above 240? 250? 260?
  • Is there meaningful risk I will fail? (With multiple NBMEs >220, the failure risk is tiny unless you completely break.)
  • Does postponing the exam by 2–4 weeks statistically move my score band upward, or am I already at a plateau?

Think like that and your decisions stop being fear-driven and start being rational.


FAQ (Exactly 3 Questions)

1. My UWSA2 is 10+ points higher than my latest NBME. Which should I believe?
Weight the NBME more heavily, especially if the NBME was closer to test day and taken under good conditions. Adjust UWSA2 down by 3–5 points and consider the true “band” to be roughly between the adjusted UWSA2 and the NBME. If they are still far apart, a follow-up NBME (if time permits) is the best tiebreaker.

2. Can I use just UWorld QBank percentage to predict my Step 2 CK score?
Not reliably. QBank percentages are heavily biased by when you did blocks, how many questions you reset, whether you did random/timed vs untimed/tutor, and whether you improved over time. Two students can both be at 65% and end up 230 vs 260. Use QBank performance qualitatively, not as a score converter.

3. How many practice tests do I actually need for a solid prediction?
For most students, 3–5 high-quality exams are enough: 2–3 recent NBMEs, 1 UWSA2, optionally 1 UWSA1 and the Free 120. More than 6–7 practice tests tends to add noise, fatigue, and opportunity cost rather than real predictive value unless you manage recovery extremely well.


Key points to keep in your head:

  1. Recent NBMEs + UWSA2, bias-adjusted, are your most accurate Step 2 CK predictors.
  2. Free 120 and UWSA1 are supporting signals, not primary anchors.
  3. Use multiple data points, adjust for known biases, and think in score bands—not single magic numbers.