Residency Advisor

Practice Test Correlation: Which Step 2 CK Self-Assessment Best Predicts Score?

January 5, 2026
14 minute read


Step 2 CK self-assessments are not equal, and the data makes that brutally clear.

If you treat every practice test as equally predictive, you will overestimate your score, underestimate your risk, and walk into exam day with false confidence. I have seen it over and over: a student hanging their hopes on one inflated form while ignoring the one assessment that was screaming a very different story.

Let me be direct: correlation with the real Step 2 CK is a measurable statistic. You can rank the major self-assessments by how closely they track real scores. You can quantify which ones overpredict, which ones underpredict, and by how much. That is what we will do here.

We will focus on the big four categories of predictors:

  • NBME self-assessment forms (the newer Step 2 CK forms)
  • UWSA2
  • UWSA1
  • The Free 120

And the core question: Which Step 2 CK self-assessment best predicts your real score, and how should you interpret each number?


The Core Metric: Correlation, Bias, and Error

Before arguing about which exam is “best”, we need three numbers:

  1. Correlation coefficient (r) – how tightly do practice scores and real scores move together?
  2. Bias (mean prediction error) – on average, how many points do they over- or underpredict?
  3. Spread (standard deviation of error) – how wide is the typical miss?
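These three numbers are straightforward to compute from any shared spreadsheet of paired practice/real scores. A minimal sketch in Python; every score pair below is a made-up illustrative value, not real cohort data:

```python
# Sketch: computing correlation, bias, and spread from paired scores.
# The score pairs are made-up illustrative numbers, not real student data.
from math import sqrt
from statistics import mean, stdev

practice = [238, 245, 251, 260, 232, 248]  # hypothetical self-assessment scores
real     = [240, 243, 249, 255, 236, 247]  # hypothetical real Step 2 CK scores

def pearson_r(xs, ys):
    """Correlation coefficient: how tightly the two score lists move together."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

errors  = [p - r for p, r in zip(practice, real)]  # Predicted − Actual
r_value = pearson_r(practice, real)
bias    = mean(errors)    # average over/under-prediction
spread  = stdev(errors)   # typical size of a miss (1 SD)

print(f"r = {r_value:.2f}, bias = {bias:+.1f}, spread = ±{spread:.1f}")
```

The same arithmetic works at any scale; the aggregated spreadsheets just have hundreds of rows instead of six.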

You will not get perfect peer-reviewed meta-analyses for every form. But you do not need perfection. You need approximate, stable patterns pulled from:

  • School-level shared spreadsheets (classes of 150–200 students)
  • Big Reddit / SDN Google sheets (hundreds to low thousands of entries)
  • Tutoring group internal trackers (dozens to hundreds of students, but with clean, verified data)

When you aggregate multiple such sources, the signal is surprisingly consistent.

Here is an evidence-synthesized summary. Numbers are approximate, but directionally reliable.

Step 2 CK Self-Assessment Predictive Performance
Exam Type         | Correlation with Real CK (r) | Average Bias vs Real (Predicted − Actual) | Typical Error Band (±SD)
NBME newer forms  | 0.80–0.87                    | −2 to +3 points                           | ~8–10 points
UWSA2             | 0.80–0.85                    | +2 to +5 points                           | ~8–11 points
UWSA1             | 0.70–0.78                    | +5 to +10 points                          | ~10–13 points
Free 120 (scaled) | 0.65–0.75                    | −5 to +5 points (centered, very noisy)    | ~12–15 points

The exact decimal place does not matter. The hierarchy does.

  • Top tier correlation: NBME newest forms, UWSA2
  • Second tier: UWSA1
  • Utility tool, not a score predictor: Free 120

And yes, this means what you probably suspect: NBME CBSSAs and UWSA2 are your best predictors. But they behave differently, and their errors are not symmetric. That matters for planning.


NBME Step 2 CK Forms: The Gold(ish) Standard

NBME forms are the least glamorous but most useful. No fancy interface. No “wow, I jumped 30 points” stories. Just steady, boring predictive power.

Across multiple class spreadsheets I have seen, NBME forms usually show:

  • Correlation with real Step 2 CK: r ≈ 0.80–0.87
  • Average bias: extremely low, often within −2 to +3 points
  • Error band (1 SD): roughly ±8–10 points

So if your NBME 10 score converts to a 245, a realistic outcome range (based on actual data, not wishful thinking) is approximately 235–255 for most students, assuming no major change in prep between that exam and test day.

Why NBME forms track so well

The data shows three reasons:

  1. Item style and calibration. NBME writes the real exam. Their question construction, distractor style, and difficulty distribution mirror the actual test more closely than any third-party resource.
  2. Score scaling tuned to population data. These forms are built off real-use performance data from large cohorts, not a guess from a private company.
  3. Less content gimmickry. They measure how you handle the actual Step 2 CK style, not how cute you are with UWorld patterns.

Are NBME forms “harder” or “easier”?

Students say both, often in the same week.

Quantitatively, what I see:

  • Lower raw percent correct than UWorld blocks at equivalent ability levels.
  • But scaled scores that are quite close to the real exam, especially if taken 1–4 weeks before test day.

The story that matches the data:
NBME forms feel harder because they remove some of the pattern-recognition crutches UWorld gives you. They also punish conceptual gaps more harshly. But the scaled score you get from a reasonably recent NBME form is usually your best single-number predictor.

So if you want one anchor number to plan around, use your latest NBME.


UWSA2: The High-Confidence “Ceiling” Predictor

Now we get to the one that generates the most emotional reactions: UWSA2.

Students love UWSA2 because:

  • Scores are often higher than their earliest NBME scores.
  • Correlation with the real exam is strong.
  • It “feels” like Step 2 CK in length and style.

The numbers from large online datasets and internal logs:

  • Correlation with real Step 2 CK: r ≈ 0.80–0.85
  • Average bias: +2 to +5 points (mildly optimistic)
  • Error band: about ±8–11 points

So yes, UWSA2 is very predictive. But systematically a bit high.

Average Prediction Bias by Step 2 CK Self-Assessment (bar chart)

Exam     | Average Bias (points)
NBME     | −1
UWSA2    | +4
UWSA1    | +8
Free 120 | 0

Interpreted:

  • NBME: clusters around real score, tiny under/over on average
  • UWSA2: roughly +4 over real is typical
  • UWSA1: can sit +8 (or more) over real, especially for mid-range scorers
  • Free 120: centered but noisy; some +15, some −15

How to read a UWSA2 score

If you take UWSA2 within 7–14 days of your test date, under exam-like conditions (single day, timed, no pausing, minimal distractions), a practical interpretation:

  • Score − 5 points ≈ conservative floor you can expect if test day is “average”
  • Score ≈ realistic midpoint, if stress and fatigue are normal
  • Score + 5 points ≈ best-case scenario

So a 250 on UWSA2 often means:

  • You are not “actually a 260” yet.
  • You are most likely in the 245–255 band, assuming consistency.
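The floor/midpoint/ceiling reading is simple enough to write down as a tiny helper; the ±5-point band is this article's approximation, not an official rule:

```python
# Sketch of the floor/midpoint/ceiling reading of a recent UWSA2 score.
# The ±5-point band is the article's rough approximation, not an official rule.
def interpret_uwsa2(score: int) -> dict:
    return {
        "floor": score - 5,    # conservative estimate for an "average" test day
        "midpoint": score,     # realistic central estimate
        "ceiling": score + 5,  # best-case scenario
    }

print(interpret_uwsa2(250))  # → {'floor': 245, 'midpoint': 250, 'ceiling': 255}
```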

I have seen multiple cases where:

  • UWSA2: 261
  • NBME form (within a week): 248
  • Real Step 2 CK: 251

Pattern: UWSA2 inflates slightly, NBME pulls you back to reality, actual exam lands in between.

Use UWSA2 as your optimistic but still data-grounded view of your score ceiling.


UWSA1: Useful, but Noisy and Often Inflated

UWSA1 generates the worst false confidence.

Statistically:

  • Correlation: r ≈ 0.70–0.78 with real Step 2 CK
  • Average bias: +5 to +10 points (frequently overpredicts)
  • Error band: ±10–13 points or worse for some cohorts

You can feel this in anecdotes:

  • Student has UWSA1 = 252, gets excited, slows intensity.
  • NBME 2 weeks later = 238. Panic.
  • Real exam = ~240–245.

UWSA1 correlates with the real exam, yes, but it runs systematically too high and too unreliable to serve as your primary target.

Where UWSA1 is still valuable

It is not useless. Far from it. The data and experience say:

  • Good for early-to-mid prep benchmarking.
  • Useful for relative progress tracking (e.g., from 220 → 240).
  • Helpful for question-style practice and timing early in dedicated.

But if you must choose which score to trust between UWSA1 and NBME/UWSA2, the hierarchy is simple:

NBME ≈ UWSA2 > UWSA1.

If only UWSA1 is high while NBME and UWSA2 are subdued, believe NBME/UWSA2. Every time.


The Free 120: Great for Style, Mediocre for Prediction

The Free 120 is a trap if you treat it like a full predictor.

Quantitatively:

  • Correlation: r ≈ 0.65–0.75 with real Step 2 CK (varies a lot by dataset)
  • Bias: near zero on average, but with large variance
  • Error band: ±12–15 points or more

In other words, Free 120 is centered but noisy. Across hundreds of students, it “evens out” to a solid regressed mean. For an individual student, it can be off by a mile.

The main reasons:

  • Only 120 questions; fewer data points mean wider confidence intervals.
  • Scoring is rough, and conversion tables students use online are not consistently calibrated.
  • The content mix sometimes lags behind exam trends.

I treat Free 120 as:

  • A style and confidence check, not a score predictor.
  • A good tool to ensure you are comfortable with the Prometric interface, exhibit format, and item length.
  • A quick way to see if your percent correct is at least in the neighborhood of what your NBME/UWSA2 scores imply.

Rule of thumb I have seen hold decently:

  • If your Free 120 percentage, converted with a reasonable rule of thumb (often roughly percent correct × 3 + 10), sits more than 10 points below your recent NBME/UWSA2 scores, that is a yellow flag.
  • If it is within ±5 points of those, you are probably fine.
  • If it is higher, ignore the optimism and default back to NBME/UWSA2.
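That concordance check can be encoded directly. A hedged sketch: the percent-to-score conversion (% × 3 + 10) is a rough community heuristic, and the thresholds are this article's rules of thumb, not validated cutoffs:

```python
# Sketch of the Free 120 concordance check. The ×3 + 10 conversion is a rough
# community heuristic, and the cutoffs are the article's rules of thumb.
def free120_flag(free120_pct: float, recent_anchor: float) -> str:
    est = free120_pct * 3 + 10     # crude 3-digit estimate from % correct
    diff = est - recent_anchor     # vs. latest NBME/UWSA2 score
    if diff < -10:
        return "yellow flag"           # well below your anchor: investigate
    if diff > 5:
        return "ignore the optimism"   # default back to NBME/UWSA2
    return "concordant"                # within the expected band

print(free120_flag(78, 245))   # 78% → est. 244, close to the anchor
```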

Q‑Bank Percentages: Background Signal, Not a Forecast

Students badly overvalue their UWorld percentage. The correlation is real but weaker than they think, and the bias is heavy.

Across multiple groups:

  • Cumulative UWorld percentage tends to show r ≈ 0.60–0.70 with real CK.
  • Early blocks drag the average down, so late-improving students often “look worse” on the cumulative.
  • Stronger correlation if you filter to last 600–800 questions or self-assessment–adjacent weeks.

The approximate mapping (from classes I have seen, assuming random/timed, mostly first pass):

Approximate UWorld Percent to Step 2 CK Range
UWorld % (1st pass) | Approx Typical CK Range
50–54%              | 220–230
55–59%              | 225–240
60–64%              | 235–250
65–69%              | 245–260
70%+                | 255+ (wide spread)

But this mapping has huge variance. Two students at 62% can end up at 240 vs 255, depending on how their curves evolved.

So use Q‑bank data as:

  • A background indicator of readiness,
  • A sanity check when your self-assessment scores jump or drop dramatically,
  • Not as the primary predictor.

If your Q‑bank % and your self-assessments fundamentally disagree, trust the self-assessments—especially NBME and UWSA2.


How to Combine Multiple Practice Tests into a Single Prediction

Individual self-assessments have noise. The smartest way to use them is to treat each score as one sample from your underlying “true ability” distribution.

Multiple samples → better estimate.

In practice, you can do a quick-and-dirty weighted average that reflects predictive power:

  • Assign highest weight to your most recent NBME and UWSA2.
  • Moderate weight to other recent NBMEs.
  • Light weight to UWSA1.
  • Free 120: use mainly as a tiebreaker.

Example. Let us say your recent scores (within 3 weeks of exam) are:

  • NBME 11: 242
  • NBME 12: 246
  • UWSA2: 251
  • UWSA1: 256
  • Free 120: ~78% (roughly maps to low–mid 240s typically)

You could approximate like this:

  1. Ignore UWSA1 for primary prediction; use it as “max upside”.
  2. Average NBME 11 and 12: (242 + 246) / 2 = 244
  3. Compare to UWSA2 (251). Difference ≈ 7 points.
  4. Given known UWSA2 optimism (+4 on average), shift UWSA2 down slightly → adjusted UWSA2 ≈ 247.
  5. Take a weighted mean:
    • 244 (NBMEs) with 60% weight,
    • 247 (adj. UWSA2) with 40% weight.
      Predicted ≈ 245–246.

Now look at Free 120 percent; if conversion roughly matches mid‑240s, you have concordant evidence.

Your realistic target: mid‑240s, with a likely range of about 238–253. That is how the data stacks when you treat each test as noisy evidence, not gospel.
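The worked example above can be sketched in a few lines; the 60/40 weights and the +4 UWSA2 bias correction are this article's approximations, not validated constants:

```python
# Sketch of the weighted-average prediction walked through above.
# The 60/40 weights and the +4 UWSA2 bias correction are the article's
# approximations, not validated constants.
nbme_scores = [242, 246]      # recent NBME forms
uwsa2 = 251
uwsa2_bias = 4                # typical UWSA2 optimism, per the bias table

nbme_avg  = sum(nbme_scores) / len(nbme_scores)   # 244.0
uwsa2_adj = uwsa2 - uwsa2_bias                    # 247

predicted = 0.60 * nbme_avg + 0.40 * uwsa2_adj    # ≈ 245.2
print(round(predicted, 1))
```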


Trend vs Snapshot: The Time Dimension Matters

Correlation metrics ignore a crucial factor: time between self-assessment and real exam.

Here is what internal tracking usually shows:

  • Self-assessments taken >6 weeks before the real exam have weaker predictive value. Many students are still on the rising part of their curve.
  • Predictive strength increases sharply for tests taken in the final 2–4 weeks before the exam.
  • Scores taken in the final 7–10 days, especially NBME/UWSA2, are often within ±5–7 points for 60–70% of people.

So do not anchor your entire identity to a UWSA1 from 8 weeks ago. The score might be obsolete—up or down.

Predictive Strength vs Time Before Step 2 CK (line chart)

Time Before Exam | Correlation (r)
8+ weeks         | 0.55
6 weeks          | 0.65
4 weeks          | 0.75
2 weeks          | 0.82
1 week           | 0.85

The line is conceptual, but it matches what I have observed: correlation tightens as you approach game day.

If your last 7–10 day scores show a clear upward trend, weight those more heavily than older, lower scores. If they show a downward trend, that is a separate red flag—often reflecting burnout, rushing, or poor sleep.
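One simple way to weight recent scores more heavily is an exponential recency decay. A sketch, where the 14-day half-life is my illustrative assumption, not a value drawn from any dataset:

```python
# Sketch: recency-weighting self-assessments so tests taken closer to exam day
# count more. The 14-day half-life is an illustrative assumption only.
def recency_weighted(scores_with_days, half_life=14.0):
    """scores_with_days: list of (score, days_before_exam) tuples."""
    pairs = [(s, 0.5 ** (d / half_life)) for s, d in scores_with_days]
    total = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total

# Older, lower scores count less than the recent upward trend:
print(round(recency_weighted([(235, 42), (242, 21), (248, 7)]), 1))
```

Note how the estimate lands near the most recent score rather than the plain average (about 241.7), which matches the "trust the trend" rule above.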


Practical Rules: Which Self-Assessment “Wins” in a Conflict?

Let me give you the actual decision rules I recommend to students when data points disagree.

  1. NBME vs UWSA1 conflict (big gap):

    • NBME 238, UWSA1 255
    • Trust: NBME. UWSA1 is optimistic. Your realistic range is probably low‑240s, not mid‑250s.
  2. NBME vs UWSA2 conflict (moderate gap):

    • NBME 245, UWSA2 255
    • Interpretation:
      • UWSA2 likely inflated by ~5–7 points.
      • NBME might be slightly conservative if older by >7–10 days.
      • Expect real score in the middle, maybe 247–250, assuming stable prep.
  3. Multiple NBME forms trending up, UWSA1 flat:

    • NBME 235 → 242 → 248
    • UWSA1 sitting at 244 somewhere in the middle
    • Trust the trend and the most recent NBME. You improved after that UWSA1.
  4. Free 120 much lower than others in final week:

    • NBME 252, UWSA2 255, but Free 120 ~72% (translating to low‑240s)
    • Check conditions: were you distracted, tired, guessing on many?
    • If exam-like conditions were good, consider that your test-day performance is sensitive to fatigue or integration. You probably still land closer to 245–250 than to 260.
  5. Only UWSA1 is great; everything else mediocre:

    • UWSA1 260, NBME 238–242, UWSA2 245
    • Treat UWSA1 as an outlier. Reality is likely mid‑240s to low‑250s barring a last-minute jump in performance and content mastery.

Notice the pattern: NBME and UWSA2 almost always control the final estimate.


Final Takeaways: What the Data Actually Says

Strip away the anecdotes and wishful thinking, and the data reduces to three blunt points:

  1. NBME and UWSA2 are your best predictors.
    NBME forms slightly edge out in calibration; UWSA2 runs very close but tends to overpredict by a few points. If you need one “anchor” number, use your latest NBME. Use UWSA2 as the optimistic boundary of your likely range.

  2. UWSA1 and Free 120 are supporting actors, not the lead.
    UWSA1 is useful for run-up practice and early benchmarking but frequently inflated. Free 120 is excellent for interface familiarity and style, poor as a standalone score predictor due to wide variance.

  3. Trends and timing matter as much as individual scores.
    Self-assessments taken in the final 2–4 weeks under exam-like conditions carry far more predictive weight. Combine them using simple, rational weighting rather than obsessing over any single score, and assume an error band of at least ±7–10 points no matter how good the predictor looks.

If you use the exams this way—NBME and UWSA2 as calibrated instruments, everything else as context—you are making decisions based on actual signal, not noise. That is how you walk into Step 2 CK with confidence grounded in data, not in luck.
