
The myth that “practice tests always underpredict your real Step 2 CK score” is statistically wrong. Some do. Some do not. The data show clear patterns—by resource—that you can and should exploit.
You are not guessing here. You are running a forecasting problem on yourself with small but usable datasets. The question is: which practice exams give the highest predictive accuracy, and how should you interpret them?
Let’s walk through it like a data set, not a superstition contest.
The Core Question: How Predictive Is Each Resource?
When I say “predictive,” I mean two things:
- Correlation with real Step 2 CK score (how well high vs low practice scores track real outcomes).
- Calibration (how close the predicted score is, on average, to your real Step 2 CK score).
High correlation but bad calibration = it tracks rank-order (who’s stronger vs weaker) but systematically overshoots or undershoots. Most Step 2 resources fall into one of three buckets:
- Strong predictor and well calibrated
- Strong predictor but slightly biased (consistently under- or overpredicts)
- Useful for practice, poor for prediction
Here is a synthesized summary from large self-reports (r/Step2, Reddit spreadsheets, med student forums, tutoring databases) combined with tutor pools I have seen:
| Resource | Correlation (approx) | Typical Bias vs Real CK | Best Use Timing |
|---|---|---|---|
| NBME Forms 9–13 | 0.75–0.85 | -0 to -5 pts | Final 4–6 weeks |
| UWSA 2 | 0.80–0.90 | +0 to +5 pts | 1–3 weeks before |
| UWSA 1 | 0.70–0.80 | +3 to +8 pts | 2–5 weeks before |
| Free 120 (Newer) | 0.60–0.75 | -5 to +5 pts (wide) | 1–2 weeks before |
| Old NBME forms | 0.60–0.70 | -5 to -10 pts | Early diagnostic |
These numbers are approximate but directionally consistent across thousands of datapoints: NBME + UWSA 2 are your primary forecasting tools. The rest are supporting evidence.
NBME Step 2 CK Forms: The Gold Standard (With Caveats)
NBME forms 9–13 (and newer forms as they appear) are the closest thing to a calibrated prediction engine you have.
Patterns from aggregated score spreadsheets and tutoring cohorts:
- Correlation with real Step 2 CK: ~0.8 (strong)
- Mean error (absolute difference): typically ~5–8 points
- Bias: usually a small underprediction (0–5 points below final score) when taken in the last 2–3 weeks
How each NBME typically behaves
Below is a simplified average, assuming the exam is taken within 3–4 weeks of the real test and you are actively studying:
| NBME Form | Average Bias vs Real CK | Comment |
|---|---|---|
| NBME 9 | -3 to -7 points | Slightly harder feel, underpredicts |
| NBME 10 | -2 to -6 points | Often closest for mid-240s–260s |
| NBME 11 | -3 to -8 points | Many report biggest underprediction |
| NBME 12 | -0 to -5 points | Better calibration at higher scores |
| NBME 13 | -2 to -6 points | Newer; similar to 10/12 pattern |
Key pattern: Almost none of these significantly overpredict if taken late. The risk is more often being scared by a low NBME that ends up 5–10 points under your real score.
How to interpret an NBME numerically
If an NBME is within 2 weeks of your exam:
- Single NBME score → your most likely window is about ±7–8 points.
- Multiple NBME scores over a 4–6 week window → average your most recent forms, weighting the most recent most heavily.
Example:
- 5 weeks out: NBME 10 = 243
- 3 weeks out: NBME 11 = 247
- 1 week out: NBME 12 = 250
Weighted estimate: recent scores matter more, but you also see a trend. You could model it as:
- Last NBME (50% weight) = 250
- Previous (30% weight) = 247
- Oldest (20% weight) = 243
Predicted center: 0.5×250 + 0.3×247 + 0.2×243 = 125 + 74.1 + 48.6 = 247.7 ≈ 248
Then adjust by the typical NBME underprediction of ~3–5 points → expected Step 2 CK ≈ 251–253. That matches what I have seen for many students.
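The recency-weighted estimate above can be sketched in a few lines. This is a minimal sketch, assuming the article's heuristics: the 0.2/0.3/0.5 weights and the +3 to +5 late-NBME underprediction correction are rough rules of thumb from this piece, not official NBME conversions.

```python
# Recency-weighted NBME estimate, using this article's rough heuristics.
# Weights and the underprediction correction are assumptions, not official.

def weighted_nbme_estimate(scores_old_to_new, weights=(0.2, 0.3, 0.5)):
    """Return a (low, high) band: weighted center of the last three NBMEs,
    plus the typical late-NBME underprediction correction of ~3-5 points."""
    center = round(sum(s * w for s, w in zip(scores_old_to_new, weights)))
    return center + 3, center + 5

# The example from the text: NBME 10 = 243, NBME 11 = 247, NBME 12 = 250.
print(weighted_nbme_estimate([243, 247, 250]))  # (251, 253)
```

The point of coding it is discipline: you commit to the weighting scheme before you see the number, instead of picking whichever mental model flatters you.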
UWorld Self-Assessments (UWSA1 & UWSA2): Strong but Slightly Optimistic
UWorld SA 1 and 2 are heavily used and heavily mythologized. The data show they are powerful predictors, but you must understand the bias.
UWSA 2: High correlation, mild overprediction
From self-reported datasets:
- Correlation with real Step 2 CK: ~0.85–0.9
- Mean absolute error: 5–7 points
- Bias: +0 to +5 points relative to the real test, for most students who take it in the last 7–10 days
When people say “UWSA2 predicted my score almost perfectly,” they are usually in this scenario:
- Took UWSA2 within 1–7 days of Step 2 CK
- No major off-day on exam day
- Score band: ~230–265
In that band, UWSA2 is often your single most powerful data point. But you need to correct mentally:
UWSA2 = 252 one week out
You should not expect 260. A realistic expectation is roughly 247–252, with a central guess around 250.
UWSA 1: Slightly noisier, often more optimistic
UWSA 1 patterns:
- Correlation: ~0.7–0.8
- Bias: about +3 to +8 points above real Step 2 CK
- More variable if taken early (>4 weeks out)
UWSA 1 tends to make people feel better than their NBME does. That is not automatically bad; it might reflect stronger question style fit. But as a prediction, I mentally “discount” UWSA1 by 5 or so points.
Example scenario I have seen multiple times:
- NBME 10: 242 (3.5 weeks out)
- UWSA 1: 255 (2.5 weeks out)
- NBME 12: 246 (1.5 weeks out)
- UWSA 2: 250 (5 days out)
- Real Step 2 CK: 248–252 range
Notice the pattern: UWSAs a bit higher than NBME, real score in between, usually closer to last NBME / UWSA2.
Free 120: Signal, but Noisy and Often Misused
The Free 120 is abused as a prediction tool. It was not built for that. But people will force a number out of anything.
Historically, older Step 2 CK Free 120 versions had a slightly better reputation for calibration when a percentage→score mapping was applied. For the current style:
- Correlation with Step 2 CK: moderate (~0.6–0.75)
- Error: wide; being off by ±10–12 points is common
- Bias: depends on your baseline. High scorers often find it underpredicts, mid scorers see closer alignment.
Here is a very rough conversion that tends to match aggregated experiences, assuming you take it in the final 1–2 weeks and under realistic test conditions:
| Free 120 % | Approx. Step 2 CK Score |
|---|---|
| 65% | 225 |
| 70% | 235 |
| 75% | 242 |
| 80% | 250 |
| 85% | 258 |
| 90% | 265 |
This is an approximation, not a guarantee. I have seen:
- 75% Free 120 → real 245
- 82% Free 120 → real 252
- 78% Free 120 → real 241
Same percentage, very different final scores. Good for sanity-checking that you are not wildly off. Not good for arguing whether you will get a 250 vs 253.
Use case: if your NBMEs are clustered around 245–250, and you pull 80–82% on a recent Free 120 under strict conditions, the combined data say you are likely in the ~245–255 real score range.
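If you want a number out of the Free 120 anyway, linear interpolation over the rough table above is about as precise as the data deserve. A sketch, assuming the table's anchor points (which are approximations from aggregated self-reports, not an official conversion):

```python
# Loose percent-to-score mapping for the Free 120, interpolating between
# the approximate anchor points from the table above. Not an official curve.

ANCHOR_POINTS = [(65, 225), (70, 235), (75, 242), (80, 250), (85, 258), (90, 265)]

def free120_rough_score(pct):
    """Linearly interpolate between anchors; clamp outside the table range."""
    if pct <= ANCHOR_POINTS[0][0]:
        return ANCHOR_POINTS[0][1]
    if pct >= ANCHOR_POINTS[-1][0]:
        return ANCHOR_POINTS[-1][1]
    for (p0, s0), (p1, s1) in zip(ANCHOR_POINTS, ANCHOR_POINTS[1:]):
        if p0 <= pct <= p1:
            return round(s0 + (pct - p0) / (p1 - p0) * (s1 - s0))

print(free120_rough_score(78))  # 247
```

Treat the output as the center of a wide band (±10 points or more), which is exactly why the Free 120 stays a sanity check rather than an anchor.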
Combining Data: How to Build a Personal Forecast
You should treat your exam prep like a mini time-series forecasting problem, not a single datapoint guess.
Here is a simple, pragmatic model that aligns well with what I have seen across many students.
Step 1: Weight resources by predictive power
Assign only rough weights:
- Recent NBME (taken ≤3 weeks out): weight 1.0
- UWSA2 (≤2 weeks out): weight 0.9
- UWSA1 (≤4 weeks): weight 0.7
- Free 120 (≤2 weeks): weight 0.5
- Older tests or >5 weeks out: weight 0.3 or ignore unless trend is clear
Step 2: Adjust for known bias
Use typical biases:
- NBME: add 2–4 points (they often underpredict slightly if taken late)
- UWSA1: subtract 4–6 points
- UWSA2: subtract 2–4 points
- Free 120: do not hard-adjust; keep as percentage and map loosely
Example student:
- 5 weeks out: UWSA1 = 245 → bias-adjusted ≈ 239–241 (use 240)
- 4 weeks out: NBME 9 = 238 → bias-adjusted ≈ 241 (add 3)
- 2 weeks out: NBME 12 = 244 → bias-adjusted ≈ 247 (add 3)
- 1 week out: UWSA2 = 249 → bias-adjusted ≈ 245–247 (use 246)
- 5 days out: Free 120 = 78% → rough ≈ 242–248 (central ≈ 245)
Now create a weighted average:
Let us pick central estimates:
- UWSA1 adj: 240, weight 0.5 (older)
- NBME 9 adj: 241, weight 0.7
- NBME 12 adj: 247, weight 1.0
- UWSA2 adj: 246, weight 0.9
- Free 120 central: 245, weight 0.5
Weighted forecast:
Numerator = (240×0.5) + (241×0.7) + (247×1.0) + (246×0.9) + (245×0.5)
= 120 + 168.7 + 247 + 221.4 + 122.5 = 879.6
Denominator = 0.5 + 0.7 + 1.0 + 0.9 + 0.5 = 3.6
Predicted score ≈ 879.6 / 3.6 ≈ 244.3
Then remember there is residual variance. Realistic outcome band: ~240–250. That is how a data-driven tutor would set expectations.
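The whole three-step model fits in a short script. A sketch, assuming the weights and bias adjustments from Steps 1 and 2 (which are this article's heuristics, not validated constants):

```python
# Weighted forecast across resources, as in the worked example above.
# Bias adjustments have already been applied to each score; weights follow
# the article's rough heuristics and are assumptions, not validated values.

def combined_forecast(points):
    """points: (bias_adjusted_score, weight) pairs; returns weighted mean."""
    return sum(s * w for s, w in points) / sum(w for _, w in points)

points = [
    (240, 0.5),  # UWSA1, bias-adjusted, taken early so down-weighted
    (241, 0.7),  # NBME 9, +3 underprediction adjustment applied
    (247, 1.0),  # NBME 12, most recent NBME, full weight
    (246, 0.9),  # UWSA2, -3 overprediction adjustment applied
    (245, 0.5),  # Free 120, central estimate from the rough mapping
]
print(round(combined_forecast(points), 1))  # 244.3
```

Writing it down like this also makes the residual-variance point concrete: the output is a band center, and the honest statement is "244 ± 5," not "244."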
Step 3: Track trajectory, not just absolutes
Trend matters. Someone moving 230 → 238 → 244 → 248 in 4–5 weeks has momentum. Someone bouncing 243 → 246 → 244 → 245 is probably plateaued.
Use a very simple mental model:
- If last 3 exams show consistent +3 to +5 steps, you can reasonably add 2–5 points to your prediction if there is still 1–2 weeks left of focused studying.
- If last 3 scores are flat within a 3-point band, assume minimal further gain unless you dramatically change your approach (rare this late).
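The trend rule above can be made mechanical. A sketch, assuming the article's thresholds (+3 to +5 per step for a real climb, a 3-point band for a plateau); the category names are illustrative, not standard terminology:

```python
# Trend-vs-plateau classifier using this article's thresholds, which are
# rough heuristics rather than validated cutoffs.

def classify_trend(last_three_scores):
    """Classify the last three exam scores as climbing, plateau, or mixed."""
    steps = [b - a for a, b in zip(last_three_scores, last_three_scores[1:])]
    if all(3 <= s <= 5 for s in steps):
        return "climbing"   # may reasonably add 2-5 points with time left
    if max(last_three_scores) - min(last_three_scores) <= 3:
        return "plateau"    # assume minimal further gain
    return "mixed"          # lean on the most recent high-quality exams

print(classify_trend([238, 242, 246]))  # climbing
print(classify_trend([243, 246, 244]))  # plateau
```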
Timing: When Each Resource Is Most Valuable
Timing interacts with predictive accuracy. A good predictor used too early becomes noisy.
Here is a simple timeline that aligns with what tends to work:
| Phase | Activity | Purpose |
|---|---|---|
| Early (6–8 weeks out) | Baseline NBME (older form or 9) | Score check |
| Early (6–8 weeks out) | Begin UWorld blocks | Content + style |
| Mid (4–6 weeks out) | NBME 9/10 | Calibration |
| Mid (4–6 weeks out) | UWSA 1 | Confidence + range check |
| Late (2–3 weeks out) | NBME 11/12/13 | Primary predictor |
| Late (2–3 weeks out) | Targeted review | Fix weak systems |
| Final (0–2 weeks out) | UWSA 2 | Final range estimate |
| Final (0–2 weeks out) | Free 120 | Style + sanity check |
| Final (0–2 weeks out) | Light review | Avoid burnout |
Using a high-predictive test (NBME, UWSA2) 6–8 weeks out is fine for diagnosis, but do not use that score as a strict forecast. It does not account for your learning curve.
Wait until at least 2–3 weeks out before you start taking the numbers seriously as “what will I get.”
Common Misinterpretations and Bad Data Habits
I see the same analytical errors over and over:
- Overweighting a single outlier test. You cannot build a forecast on one data point. A bad test day, wrong timing, or fatigue can easily swing you ±10 points.
- Ignoring form-to-form difficulty variance. Not all NBMEs or UWSAs feel equally hard. You see this when you drop 2 points on a form even though your percent correct is similar or slightly up. Look at the scaled score in context, not just the raw three-digit number.
- Mixing very old datasets with current scoring. Step 2 CK's format and scoring distributions have changed over the years. A 2017 Free 120-to-score curve does not cleanly apply in 2025.
- Assuming practice-question percent correct equals a score. UWorld QBank percentages are a noisy, selection-biased metric (order of blocks, reuse of old knowledge, mixing timed vs untimed). I have seen 60% UWorld → 260 and 70% → 240 depending on how people used the bank.
- Emotion-driven interpretation. A UWSA1 that is 12 points higher than your NBME is emotionally comforting. That does not make it statistically more valid. You have to be willing to believe the "worse" number when the evidence says it is the better predictor.
Resource-by-Resource: Clear Takeaways
To make this concrete, here is the “data analyst verdict” on each major exam type for Step 2 CK prediction.
| Resource | Predictive Value (1–10) |
|---|---|
| NBME (recent) | 9 |
| UWSA 2 | 9 |
| UWSA 1 | 7 |
| Free 120 | 5 |
| Old NBME | 5 |
(Scale 1–10, relative within this ecosystem.)
NBME (Recent Forms 9–13)
- Use as the backbone of your prediction.
- Expect slight underprediction if taken late and you are still studying.
- Two recent NBMEs averaged are usually more credible than any single non-NBME exam.
UWSA 2
- Treat as “NBME-level” predictive power with a small positive bias.
- Best value is 5–10 days before the exam under strict test conditions.
- Do not panic if it is a few points off prior NBME; use it as a band, not a single point.
UWSA 1
- Good secondary data point, not the primary anchor.
- Most students should mentally subtract ~5 points from the raw score.
- Use for confidence and question exposure > strict prediction.
Free 120
- Use for style, pacing, and broad sanity check, not fine-grained prediction.
- Convert percent to very rough ranges, not exact scores.
- If your Free 120 is wildly inconsistent with your recent NBMEs (e.g., 60% but 250 NBMEs), trust the NBMEs more.
How to Decide if You Are “Ready” Using the Numbers
You want a threshold. Some cutpoint where the data say “risk is acceptable.”
Here is a pragmatic rule set:
If your last two NBMEs (within 3 weeks) are:
- Both above your personal target score or at least above the pass–fail comfort zone you want, and
- Not wildly divergent (≤8-point spread),
then you are statistically ready. You may still gain a few points, but the risk of failing or collapsing is low if you are not burned out.
If your last NBME and UWSA2 disagree by >10 points:
- Look at timing (was one much earlier?).
- Look at conditions (fatigue, breaks, distractions).
- Consider a tiebreaker NBME if time allows. Err on the more conservative score.
If your metrics are climbing and you have 2+ weeks left:
- A consistent +3–5 per week trend suggests you can improve another 3–6 points before plateauing.
- But do not anchor on a fantasy ceiling; let data from the last two high-quality exams drive your expectation.
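The readiness rule set above reduces to two checks on your last two recent NBMEs. A sketch, assuming the article's thresholds (both at or above target, spread ≤8 points); the function name is illustrative:

```python
# Readiness check from the rule set above: both recent NBMEs at or above
# target, and not wildly divergent. Thresholds are this article's heuristics.

def statistically_ready(last_two_nbmes, target, max_spread=8):
    """True if both recent NBME scores clear the target and agree closely."""
    a, b = last_two_nbmes
    above_target = min(a, b) >= target
    consistent = abs(a - b) <= max_spread
    return above_target and consistent

print(statistically_ready([247, 250], target=245))  # True
print(statistically_ready([236, 250], target=235))  # False: 14-point spread
```

If the check fails only on the spread condition, that is the scenario where a tiebreaker NBME (time permitting) earns its cost.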
One More Point: Variance Will Always Exist
Even with perfect modeling, human performance has variance. Sleep, anxiety, odd question mix, experimental items, interface issues. All of it adds about ±5 points of irreducible noise.
So you do not use these predictions to chase an exact score like 252 vs 253. You use them to answer practical questions:
- Am I more likely than not to be above 240? 250? 260?
- Is there meaningful risk I will fail? (With multiple NBMEs >220, the failure risk is tiny unless you completely break.)
- Does postponing the exam by 2–4 weeks statistically move my score band upward, or am I already at a plateau?
Think like that and your decisions stop being fear-driven and start being rational.
FAQ
1. My UWSA2 is 10+ points higher than my latest NBME. Which should I believe?
Weight the NBME more heavily, especially if the NBME was closer to test day and taken under good conditions. Adjust UWSA2 down by 3–5 points and consider the true “band” to be roughly between the adjusted UWSA2 and the NBME. If they are still far apart, a follow-up NBME (if time permits) is the best tiebreaker.
2. Can I use just UWorld QBank percentage to predict my Step 2 CK score?
Not reliably. QBank percentages are heavily biased by when you did blocks, how many questions you reset, whether you did random/timed vs untimed/tutor, and whether you improved over time. Two students can both be at 65% and end up 230 vs 260. Use QBank performance qualitatively, not as a score converter.
3. How many practice tests do I actually need for a solid prediction?
For most students, 3–5 high-quality exams are enough: 2–3 recent NBMEs, 1 UWSA2, optionally 1 UWSA1 and the Free 120. More than 6–7 practice tests tends to add noise, fatigue, and opportunity cost rather than real predictive value unless you manage recovery extremely well.
Key points to keep in your head:
- Recent NBMEs + UWSA2, bias-adjusted, are your most accurate Step 2 CK predictors.
- Free 120 and UWSA1 are supporting signals, not primary anchors.
- Use multiple data points, adjust for known biases, and think in score bands—not single magic numbers.