
Most students are using the wrong metric to judge USMLE question banks. The logo does not matter. The correlation does.
You care about one thing: “If I get X% on this Q‑bank, what USMLE score should I expect?” Everything else is noise.
Let’s answer that with data, not vibes.
1. The Only Question That Matters: Correlation
Strip it down to basics. You have:
- Performance on a Q‑bank (percent correct, cumulative or timed blocks).
- Actual USMLE score (Step 1 numerical in the old era, Step 2 CK score now).
The core analytic question is: How strongly does Q‑bank performance predict the real exam score?
Statistically, that means:
- Use Pearson correlation coefficient, r, between:
- Q‑bank % correct
- Real exam score (or a strong proxy like NBME/Free 120 score)
A rough guide to interpreting r:
- 0.1–0.3 = weak
- 0.3–0.5 = moderate
- 0.5–0.7 = strong
- 0.7–0.9 = very strong
But correlation alone is not enough. You need:
- Slope: how many USMLE points per 1% increase in Q‑bank?
- Intercept: the baseline offset of the prediction line, i.e., where predictions start before your percent is factored in.
- Calibration: does 65% in the Q‑bank actually map to what the line predicts, or is it systematically high/low?
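As a sketch of what that analysis looks like, here is a minimal least-squares fit. The (Q-bank %, score) pairs are entirely hypothetical, invented for illustration; real self-reported data is far noisier.

```python
from statistics import mean, pstdev

def fit_line(x, y):
    """Pearson r plus least-squares slope/intercept for y = slope*x + intercept."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    r = cov / (pstdev(x) * pstdev(y))
    slope = cov / pstdev(x) ** 2
    intercept = my - slope * mx
    return r, slope, intercept

# Hypothetical (Q-bank % correct, real exam score) pairs -- illustration only
pairs = [(58, 221), (63, 230), (68, 238), (72, 244), (78, 254)]
qbank, score = zip(*pairs)
r, slope, intercept = fit_line(qbank, score)
# With data this clean, r lands near 1; the slope tells you roughly how many
# exam points each additional Q-bank percent is worth.
```

The slope and intercept together are the calibration: they tell you what score the line actually predicts for, say, a 65% performer, which you can then compare against observed outcomes.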
Let me be explicit: self‑reported Reddit scatter plots are noisy but directionally useful. When you see hundreds of datapoints, patterns stabilize.
Based on compiled user data, known NBME correlations, and personal data from students I have worked with, here is the synthesized picture.
2. Q‑Bank Landscape: What the Data Consistently Shows
There are four major players students endlessly compare for correlation:
- UWorld
- NBME (forms, not exactly a Q‑bank but functionally a predictive bank)
- AMBOSS
- Kaplan
We will treat NBME forms as the reference standard, then compare the banks to that.
| Resource | Typical r with Real Score | Comment |
|---|---|---|
| NBME Forms | 0.85–0.90 | Gold standard predictive tool |
| UWorld | 0.70–0.80 | Strong, especially near exam |
| AMBOSS | 0.55–0.70 | Moderate–strong, more variable |
| Kaplan | 0.45–0.60 | Moderate at best |
Are these perfect? No. But the ordering is remarkably stable across cohorts:
- NBME
- UWorld
- AMBOSS
- Kaplan
If you want a one‑line summary: NBME predicts best, UWorld comes second, and everything else is a support act.
Now we break this down.
3. UWorld: The Workhorse with a Strong but Imperfect Correlation
Most people treat UWorld percentage as a pseudo‑score. That’s dangerous if you do not understand the pattern behind it.
What the data shows
Across multiple unofficial datasets (student spreadsheets, survey collections, Reddit mega‑threads):
- Cumulative UWorld percent correct correlates with Step 1 / Step 2 CK around r = 0.70–0.80.
- The correlation tightens as:
- You complete more of the bank (ideally >60–70%).
- Your blocks are timed and random rather than untimed / subject‑only.
Rough rule of thumb that has held surprisingly well for Step 2 CK:
- Step 2 CK score ≈ (UWorld % correct × 1.1–1.3) + 150–160
For example:
- 60% UWorld → prediction band roughly 220–235
- 70% UWorld → prediction band roughly 235–250
- 80% UWorld → prediction band roughly 250–265+
Does everyone fit? Of course not. But you see dense clustering around these bands.
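The rule of thumb above can be written as a small band calculator. The coefficients come straight from that formula; note the resulting envelope is slightly wider than the quoted bands, which are tightened by judgment.

```python
def ck_band(uworld_pct):
    """Rough Step 2 CK prediction band from cumulative UWorld % correct.

    Uses the rule of thumb: score ~ (UWorld % x 1.1-1.3) + 150-160.
    """
    low = round(uworld_pct * 1.1 + 150)
    high = round(uworld_pct * 1.3 + 160)
    return low, high

# The quoted 60%/70%/80% bands sit inside these envelopes.
band_60 = ck_band(60)
band_70 = ck_band(70)
```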
Why UWorld correlates reasonably well
Three data‑driven reasons:
- Content alignment: The blueprint and difficulty distribution are deliberately tuned to USMLE style. That reduces construct mismatch.
- Statistical scaling: UWorld tracks huge volumes of user performance, then continuously adjusts question difficulty and explanations. The “percent correct” is not raw chance; it is anchored against a large user base.
- Test‑taking behavior similarity: Students tend to treat UWorld like a serious tool (timed, random) more than, say, random free banks. This makes performance more “exam‑like.”
Where it goes wrong:
- People using UWorld as a learning tool early (untimed, subject‑only, peeking at explanations), then treating that % like a prediction. That tanks correlation in personal datasets.
If your goal is prediction, not just learning, your UWorld data must be:
- Timed
- Random
- Near the exam (last 25–40% of questions especially)
Otherwise, you are just generating noise.
4. NBME: The Predictive Gold Standard (Even Though It Is Not a Q‑Bank)
NBME is not a classic Q‑bank, but in practice:
- Repeated NBME forms act like a high‑signal question bank with embedded scoring.
- Performance on NBME forms is the single best statistical predictor of your actual USMLE score.
| Resource | Approx. r with Step Score |
|---|---|
| NBME | 0.88 |
| UWorld | 0.75 |
| AMBOSS | 0.63 |
| Kaplan | 0.52 |
Typical numbers from compiled student data:
- Single NBME form score vs Step score: r ≈ 0.85–0.90
- Average of the last 2–3 NBMEs vs Step score: r ≈ 0.90+
The score difference between your last NBME and real exam is commonly:
- Within ±5 points in many cases
- Within ±10 points for the vast majority
- Outliers exist, but they are rare and usually due to:
- Severe test anxiety
- Illness / sleep issues
- Major content gaps unmasked on test day
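A minimal way to operationalize those typical deviations: center on the mean of your last forms and report a tight and a wide band. The ±5/±10 widths are the rough figures above, not guarantees.

```python
def nbme_prediction(last_forms, tight=5, wide=10):
    """Center a prediction on the mean of the last 2-3 NBME forms.

    tight/wide reflect the typical +/-5 and +/-10 point deviations.
    """
    center = sum(last_forms) / len(last_forms)
    return {
        "center": center,
        "likely": (center - tight, center + tight),
        "very_likely": (center - wide, center + wide),
    }

pred = nbme_prediction([232, 238])  # centers at 235
```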
So if you are being strictly data‑driven and the question is “Which Q‑bank correlates best,” the pedantic but correct answer is:
The “Q‑bank” that correlates best is NBME forms, not any commercial bank.
NBME is built from the same test writers and blueprint as the real exam. That is why the correlation is so tight. No commercial Q‑bank can fully replicate that.
5. AMBOSS: Dense, Good for Learning, Slightly Weaker as a Predictor
AMBOSS fans love the explanations, tables, and integration with the library. Fair. Pedagogically, it is strong.
Predictively? Middle of the pack.
From multiple user‑reported scatter plots and cohort analyses I have seen:
- AMBOSS % correct vs Step score: r ≈ 0.55–0.70
- Tends to over‑predict for weaker students and slightly under‑predict or match for strong students.
Why does the correlation come out weaker than UWorld?
Several reasons show up repeatedly:
- Usage pattern: Many students use AMBOSS earlier in prep, during content building. Untimed, tutor mode, topic‑targeted. That destroys predictive value.
- Question style: AMBOSS sometimes leans denser and more reading‑heavy than the real exam. Performance on dense questions does not always scale linearly to NBME style.
- User base composition: UWorld has near‑universal penetration among test‑takers; AMBOSS's user base is more mixed. Correlation coefficients are sensitive to the population you sample.
I tell students this explicitly:
- Use AMBOSS for learning and remediation.
- Do not obsess over your AMBOSS % as an exam score proxy, especially early.
Once again: the more your usage mimics the actual exam (timed, random, near test date), the more predictive the bank becomes. AMBOSS is no exception.
6. Kaplan and Others: Decent Practice, Modest Predictive Power
Kaplan has been around forever. Some schools bundle it. That alone does not make it a strong predictor.
Aggregate patterns:
- Kaplan % correct vs Step score: r ≈ 0.45–0.60
- Tends to have a slightly different question flavor than NBME.
- Students often use it very early, sometimes even M1/M2, before serious dedicated prep.
All of that erodes correlation.
Kaplan is useful for:
- Early exposure to question‑based learning
- Filling in some basic science gaps
- Rotating topics during preclinical years
But Kaplan performance is not a high‑confidence signal for your real exam score. If you try to reverse‑engineer “Kaplan % to Step score” with one simple formula, you will create more anxiety than insight.
The same applies, often more so, to the long tail of smaller banks and “free question sites.” The data on them is thin and inconsistent, and user behavior is highly variable. Translation: their correlation estimates are garbage.
7. How Score Prediction Actually Works in Practice
Let us talk mechanics. Suppose you want to use your data like an adult:
- You have:
- UWorld % correct (say, 68% cumulatively)
- AMBOSS % (say, 74%)
- Last two NBME forms (say, 232 and 238 for Step 2 CK)
How should you weigh these?
Think of each resource as a noisy estimator with known approximate precision. NBME is your high‑precision instrument. UWorld is medium precision. AMBOSS is lower precision. Kaplan barely counts for prediction.
A simple weighted approach that I have used with students:
- Take the average of your last 2–3 NBME scores.
- Adjust slightly using UWorld, only if:
- You have done >60–70% of UWorld timed + random, and
- Your cumulative % is clearly above or below what NBME suggests.
Concrete example:
- Last NBMEs: 232, 238 → average 235
- UWorld cumulative: 78% (solid)
- The typical 78% UWorld band for CK is around 245–255.
You are underperforming slightly on NBMEs relative to UWorld expectation. I would predict:
- Most likely test‑day range: 238–248
- Centered maybe at 243–245
- With a tail risk if test anxiety or fatigue hits.
Flip it around:
- Last NBMEs: 248, 252 → average 250
- UWorld cumulative: 64%
- That 64% UWorld maps more like a 225–240 band.
- Here NBMEs say you perform better under “true NBME style” conditions than your longer‑term UWorld record. That happens if you learned a lot late or did UWorld sloppily early.
In that scenario, I trust NBME much more and keep 250 as the anchor, with maybe a slightly wider band like 242–255.
| Signal | Weight (%) |
|---|---|
| NBME Contribution | 70 |
| UWorld Contribution | 25 |
| Other Banks | 5 |
Weights are not universal, but this 70/25/5 breakdown roughly reflects how much trust each signal deserves.
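As a sketch, the 70/25/5 weighting can be applied like this. The UWorld-to-score conversion uses the midpoint of the earlier rule of thumb, and both the weights and that conversion are judgment calls, not published formulas.

```python
def weighted_prediction(nbme_scores, uworld_pct, other_estimate=None):
    """Combine noisy estimators with rough 70/25/5 weights (assumed, not official)."""
    nbme_avg = sum(nbme_scores) / len(nbme_scores)
    uworld_est = uworld_pct * 1.2 + 155  # midpoint of the rough CK formula
    if other_estimate is None:
        # No third signal: renormalize the remaining 70/25 weights
        return (0.70 * nbme_avg + 0.25 * uworld_est) / 0.95
    return 0.70 * nbme_avg + 0.25 * uworld_est + 0.05 * other_estimate

# First worked example from the text: NBMEs 232/238, UWorld 78%
est = weighted_prediction([232, 238], 78)
```

Because NBME dominates the weighting, a strong or weak UWorld record nudges the estimate rather than overriding the NBME anchor.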
8. Timing and Mode: The Hidden Variables Students Ignore
Correlation is not just “which brand of questions.” It is when and how you use them. I have seen students with the same raw UWorld % have wildly different outcomes purely because of test‑taking discipline.
Key variables that matter more than you think:
Completion
- Less than 40–50% completed = very weak predictor
- 60–80% completed = moderate predictor
- 100% completed = best signal, assuming consistent mode
Mode
- Timed + random blocks correlate best with real exam.
- Untimed + tutor mode destroys predictive value.
- System‑based blocks are fine for learning, weak for prediction.
Recency
- Early‑phase Q‑bank performance predicts almost nothing about scores months later.
- Data from the last 4–6 weeks (especially NBME and Free 120) dominates prediction.
If you want your Q‑bank to function as a predictive tool, you must treat at least a portion of it like the exam:
- 40‑question blocks
- Timed, no pausing
- Random systems
- Honest review of mistakes without changing answers retroactively
Without this, you are not building a dataset. You are just doing homework.
9. Putting It All Together: Which Q‑Bank “Correlates Best”?
Let us answer your original question bluntly.
If we restrict the definition to “commercial Q‑banks” (ignoring NBME):
- UWorld has the strongest and most reliable correlation with real USMLE scores.
- AMBOSS is second, useful but more variable.
- Kaplan and others lag as predictors, though they can be fine learning tools.
If we expand the scope honestly to include everything you use for questions:
- NBME forms correlate best. Period.
- UWorld is your best continuous practice + secondary predictor.
- AMBOSS and Kaplan are supplemental.
Here is a simplified mapping of bank usage vs predictive value:
| Scenario | Predictive Quality | Comment |
|---|---|---|
| NBME + Free 120 near exam | Excellent | Primary anchor |
| UWorld timed/random, >70% complete | Strong | Best commercial predictor |
| AMBOSS timed/random, near exam | Moderate–Strong | Helpful but noisier |
| Kaplan early, subject-based, untimed | Weak | Learning tool, not a predictor |
| Any bank in tutor mode, early in prep | Very weak | Do not use % as score proxy |
And to visualize relative predictive strength:
| Resource / Usage | Predictive Strength Index |
|---|---|
| NBME + Free 120 | 95 |
| UWorld (timed/random) | 80 |
| AMBOSS (timed/random) | 65 |
| Kaplan (mixed use) | 50 |
| Other small banks | 40 |
(The numbers are a rough “predictive strength index,” not literal r values, but the ranking reflects real patterns.)
10. How to Use This Data Strategically
Let me translate all this into a plan, because raw stats without decisions are useless.
Anchor your expectations on NBME + Free 120.
- Treat these as your real exam dress rehearsals.
- Last 2–3 NBME scores drive your final prediction.
Use UWorld as both learning tool and secondary predictor.
- Early: learn from explanations, tag weaknesses.
- Late: switch to timed, random blocks to generate high‑quality predictive data.
- Interpret your final cumulative % in the context of your NBME trend.
Use AMBOSS for depth and remediation, not primary prediction.
- Great when UWorld explanations feel thin.
- Good for targeted drilling during clerkships.
- Do not obsess if your AMBOSS % looks “low” while NBME and UWorld are strong.
Ignore Kaplan % as a score proxy unless you have no other data.
- If you must, treat Kaplan as a very rough lower‑precision signal.
- But once you have UWorld + NBME, Kaplan is basically demoted.
Stop comparing raw percentages across banks.
- 70% on AMBOSS does not equal 70% on UWorld.
- 70% on NBME is itself a different beast entirely because it maps directly to a scored scale.
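To see why raw percentages do not transfer across banks, imagine each bank having its own calibration line. The coefficients below are invented purely for illustration; only the structure (a different slope and intercept per bank) is the point.

```python
# Hypothetical per-bank calibration lines: (slope, intercept).
# These numbers are illustrative assumptions, not published values.
CALIBRATION = {
    "uworld": (1.2, 155),
    "amboss": (1.0, 165),
}

def bank_to_score(bank, pct):
    """Map a bank's raw % correct to a predicted score via its own line."""
    slope, intercept = CALIBRATION[bank]
    return round(slope * pct + intercept)

# The same raw 70% maps to different predicted scores on different banks.
uworld_70 = bank_to_score("uworld", 70)
amboss_70 = bank_to_score("amboss", 70)
```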
1. Start from your question performance data.
2. NBME available? If yes, base your prediction on the NBME average; if not, collect NBME data as soon as possible.
3. UWorld timed/random done? If yes, refine the range with UWorld data; otherwise use your UWorld % cautiously.
4. Where NBME and UWorld disagree, trust NBME more.
5. Adjust your study plan based on the gaps the data exposes.
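The steps above can be sketched as a tiny decision function; the branch logic mirrors the hierarchy, and the returned strings are just labels.

```python
def prediction_strategy(nbme_scores, uworld_timed_random_done):
    """Pick the prediction anchor per the hierarchy: NBME first, then UWorld."""
    if nbme_scores:
        if uworld_timed_random_done:
            return "base on NBME average, refine range with UWorld data"
        return "base on NBME average alone"
    if uworld_timed_random_done:
        return "use UWorld % cautiously; collect NBME as soon as possible"
    return "no reliable signal yet; collect NBME as soon as possible"
```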
That is the hierarchy you should internalize.
11. Final Takeaways
Three points, stripped of nonsense:
- NBME forms correlate best with real USMLE scores. They are your predictive anchor. Everything else is secondary.
- UWorld is the strongest commercial Q‑bank predictor, with solid correlation when used in timed, random mode and completed to a substantial degree.
- AMBOSS, Kaplan, and others are primarily learning tools, not scoring oracles. Use their explanations aggressively, but let NBME + UWorld numbers drive your expectations and decisions.