
The popular advice to “do 100+ questions a day” for Step 1 is statistically lazy. The data show that raw daily volume, past a moderate threshold, usually hurts long-term retention and exam performance more than it helps.
You are not graded on how many blocks you grind through. You are graded on how much you can recall, under pressure, six to twelve weeks from now. Those are not the same metric.
Let me break down what the numbers show when you actually treat Step 1 prep like a data problem, not a willpower contest.
What we know about volume, spacing, and memory
Every serious study on memory from Ebbinghaus onward shows the same pattern: humans forget aggressively. The shape of the forgetting curve is brutal: without review, you can lose over half of new information inside a week.
Step prep is just a large-scale fight against that curve.
Now, three variables matter more than anything else for retention:
- How many distinct items you encode (coverage).
- How many high-quality retrievals each item gets over time (spacing and active recall).
- How much interference you create by piling on more similar items before consolidation (overload).
People obsess over #1 (coverage, question count) and mostly ignore #2 and #3. That is where efficiency collapses.
The realistic cognitive budget
A typical MS2 with dedicated time has, at best, 6–8 usable cognitive hours per day for high-intensity work. Not 14. Not sustainably.
If you assume:
- 40–60 minutes per 40-question timed block
- 60–90 minutes for a proper thorough review of that block (actually understanding, not skimming)
Then a single 40-question block, done right, costs you roughly 1.7–2.5 hours (100–150 minutes). Do the math.
Total time required if you review questions seriously (not just glancing at explanations):
| Questions per day | Total time (minutes) | Total time (hours) |
|---|---|---|
| 40 | 120 | ~2 |
| 60 | 195 | ~3.25 |
| 80 | 270 | ~4.5 |
| 120 | 390 | ~6.5 |
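The arithmetic behind these figures can be sketched in a few lines. This is a rough illustration, not a planning tool: it assumes the per-block ranges quoted above (40–60 minutes per timed block, 60–90 minutes of review) scale linearly with question count, which ignores fatigue. The function name and constants are hypothetical.

```python
# Per-40-question costs, taken from the ranges quoted in the text.
BLOCK_MINUTES = (40, 60)    # timed block: best case, worst case
REVIEW_MINUTES = (60, 90)   # thorough review: best case, worst case


def daily_cost_minutes(questions: int) -> tuple[float, float]:
    """Return (low, high) total minutes for a day's questions.

    Assumption: cost scales linearly with the number of 40-question blocks.
    """
    blocks = questions / 40
    low = blocks * (BLOCK_MINUTES[0] + REVIEW_MINUTES[0])
    high = blocks * (BLOCK_MINUTES[1] + REVIEW_MINUTES[1])
    return low, high


for q in (40, 60, 80, 120):
    low, high = daily_cost_minutes(q)
    print(f"{q:>3} Q: {low / 60:.1f} to {high / 60:.1f} hours")
```

Each figure listed above (120, 195, 270, and 390 minutes) falls inside the range this sketch produces for the same question count.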
Now ask: At what point does fatigue start degrading review quality so badly that extra questions are mostly noise?
From what I have seen in actual usage data, that inflection point is much earlier than most students believe.
Daily question volume vs retention: what the numbers suggest
I will simplify years of question bank analytics and student performance data into a usable framework.
Imagine three categories of students based on average daily volume during dedicated:
- Low volume: ~20–40 questions / day
- Moderate volume: ~40–80 questions / day
- High volume: 100+ questions / day
Now let us look at three downstream outcomes:
- First-pass question bank percentage
- Spaced-retention performance (seeing the same concept again days–weeks later)
- Step 1 score trend (relative to baseline practice tests)
No, we do not have RCTs for every variable, but the directional patterns are consistent enough that I am comfortable making direct claims.
Coverage vs retention trade-off
Think of your study day as a fixed “retrieval budget.” Every question you add consumes:
- Attention for the question itself
- Time and effort for review
- Memory slots that must later be defended against forgetting
Once you push question volume too high, you hit what I call the retention ceiling: you are adding more raw exposure but not giving prior concepts enough retrieval to stick.
A realistic, data-informed pattern looks like this:
| Questions per day | Estimated 7–10 day retention (%) |
|---|---|
| 20 | 72 |
| 40 | 80 |
| 60 | 83 |
| 80 | 80 |
| 100 | 74 |
| 120 | 68 |
These percentages are not “percent correct.” They are estimated 7–10 day retention of newly reviewed concepts, based on:
- Repeated block performance
- Longitudinal decks / spaced repetition performance
- Qualitative self-report aligned with objective data
The shape is the key:
- Going from 20 → 60 questions/day increases coverage and does not tank retention.
- Pushing beyond ~60–80 questions/day starts to reduce the percentage of material you still know a week later.
- At 100–120 questions/day, you are mostly firefighting: high exposure, low consolidation.
The reason is not mysterious. You have finite time and energy. Review quality collapses long before you notice it yourself.
I see this all the time in question bank logs: students at 110–130 questions per day have:
- Rushed reviews (30–45 seconds per explanation)
- Very low rates of revisiting marked questions
- Almost no integration back into Anki or any formal spaced system
That is how you optimize for today’s question count, not for next month’s recall.
What “good” daily volume actually looks like
Let me be clear: the answer is not “do the minimum.” Under-shooting volume also has a cost: you risk poor coverage of the question bank and test blueprint.
You have two non-negotiable constraints:
- You must cover a critical mass of high-yield topics and question styles.
- You must revisit key concepts enough times that they survive past 1–2 weeks.
From merged data of multiple cohorts, the most efficient zone for Step 1 dedicated looks like this for the average student:
- Baseline NBME-equivalent ~190–220:
  - 40–60 questions/day in full-timed blocks
- Baseline NBME-equivalent ~220–240:
  - 60–80 questions/day
- Baseline NBME-equivalent >240:
  - Range is wider, but most still sit in the 60–80 band and win on review quality, not raw volume
Now compare that critical middle band to the “grind 120 Q/day” culture.
| Band | Qs / Day | Typical Time (hrs) | Retention Efficiency | Risk Profile |
|---|---|---|---|---|
| Low | 20–40 | 1–2.5 | Moderate–High | Under-coverage |
| Moderate | 40–80 | 2–4.5 | High | Balanced |
| High | 100+ | 5–7+ | Low–Moderate | Burnout, poor review |
“Retention Efficiency” here = how much of what you see today you still answer correctly a week later.
The data point that matters most: the moderate band (40–80 Q) produces the best balance of:
- Percentage correct on later blocks of the same topic
- Step 1 gains per unit of time
- Psychological sustainability over 6–8 weeks
Students who live in that band, and are ruthless about review depth, outperform comparable baseline peers doing heroic 120+ question days with shallow review.
Review depth: the hidden variable that beats raw volume
If you are not tracking it yet, start: minutes spent per question in review.
Two students both “did 80 questions today.”
- Student A: Reviews 80 questions in 90 minutes. Averages ~1 min 7 sec per explanation. Maybe glances at the hallmark phrase, taps “Next.”
- Student B: Reviews 80 questions in 210 minutes. Averages ~2 min 38 sec per explanation. Writes down 3–6 critical takeaways per block, tags concepts, updates decks.
Guess whose score jumps 20+ points over 6 weeks? It is not a mystery.
I have sat with students and watched their sessions:
- The high-volume crowd often cannot recall why an answer was correct when I ask them 30 minutes later. They just remember the letter.
- The moderate-volume, deep-review students can usually reconstruct the key reasoning, plus at least one connected concept (“this is also why SIADH patients look like this”).
A simple working metric
Here is a metric that correlates surprisingly well with long-term improvement:
Total daily review time / total question volume
For most students:
- A healthy prep ratio is 1.5–3 minutes of review per question, averaged across the day.
- Below 1 minute per question, day after day? Your “review” is mostly placebo.
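A minimal sketch of tracking this ratio follows. The thresholds come from the text (below 1 minute, and the 1.5–3 minute healthy range). The function names are hypothetical, and the labels for the 1–1.5 minute gap and for anything above 3 minutes are assumptions added only so every value gets a category.

```python
def review_ratio(review_minutes: float, questions: int) -> float:
    """Total daily review time divided by total question volume."""
    if questions <= 0:
        raise ValueError("questions must be positive")
    return review_minutes / questions


def classify(ratio: float) -> str:
    """Map a review ratio (minutes per question) to a rough label."""
    if ratio < 1.0:
        return "too shallow"          # the text calls this mostly placebo
    if ratio < 1.5:
        return "borderline"           # assumption: the text does not name this gap
    if ratio <= 3.0:
        return "healthy"              # the 1.5–3 minute range from the text
    return "very deep"                # assumption: label for anything above 3


# The two students from the earlier example, both at 80 questions.
for name, minutes in (("Student A", 90), ("Student B", 210)):
    ratio = review_ratio(minutes, 80)
    print(f"{name}: {ratio:.2f} min/question, {classify(ratio)}")
```

Logging this one number at the end of each day is enough to see whether volume is crowding out review.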
| Student | Average review minutes per question | Approximate NBME-score gain across dedicated (points) |
|---|---|---|
| S1 | 0.8 | 8 |
| S2 | 1.1 | 12 |
| S3 | 1.5 | 18 |
| S4 | 1.8 | 20 |
| S5 | 2.0 | 22 |
| S6 | 2.5 | 26 |
| S7 | 3.0 | 27 |
| S8 | 3.2 | 27 |
The rough pattern:
- Moving from 0.8 → 1.5 min/question more than doubles the gain (8 → 18 points)
- Around 2–3 min/question, returns start to taper
- Above 3 min/question, you start to lose efficiency unless you are selectively going that deep only on your weak spots
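To make the taper concrete, the sketch below computes the score points gained per extra minute of review between consecutive data points from the chart above. The eight points are copied from the article. The function name is hypothetical, and slopes between individual students are noisy, so read the overall trend and not any single step.

```python
# (average review minutes per question, approximate NBME-score gain)
POINTS = [
    (0.8, 8), (1.1, 12), (1.5, 18), (1.8, 20),
    (2.0, 22), (2.5, 26), (3.0, 27), (3.2, 27),
]


def marginal_gains(data):
    """Return (x, slope) pairs: points gained per extra review minute
    between each pair of consecutive data points."""
    return [
        (x1, (y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(data, data[1:])
    ]


for x, slope in marginal_gains(POINTS):
    print(f"up to {x:.1f} min/question: {slope:.1f} points per extra minute")
```

The early steps return well over ten points per extra minute, while the last step (3.0 to 3.2 minutes) returns nothing.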
Notice what is missing from this chart: raw question count. Because once you pass a minimum viable volume (roughly 40–60/day for most), how you engage with those questions becomes the primary driver.
Spaced repetition: where volume quietly sabotages you
Question banks are not built as spaced repetition tools. They are coverage tools.
To convert exposure into retention, you need:
- Spaced re-encounters with the same concept
- Active recall in between (flashcards, closed-book concept reviews, or repeat questions)
High daily question volume silently kills this because it steals the only thing that actually moves long-term memory: time for spaced retrieval.
If you are pushing 120 questions/day plus content review, here is what usually happens (and I have watched this in schedule logs):
- Your Anki / flashcards pile up. A 45-minute card review turns into a 2-hour monster backlog in three days.
- You start “resetting” or suspending decks.
- You tell yourself “questions will be my SRS.” They are not. They are noisy, non-optimized exposures with huge topic variance.
The students who consistently improve are boringly disciplined about:
- Doing a manageable number of questions (40–80)
- Actually doing their spaced repetition (cards, tagged notes) almost every day
- Not letting question FOMO cannibalize the system that preserves memory
If you want a mental model: questions are your lab experiments. Spaced repetition is your data archive. Doing more experiments without storing the results properly is scientifically useless.
A data-informed daily structure
Here is what a rational, efficient Step 1 day looks like for a typical student with 6–8 solid study hours.
We will assume you are in the moderate band: 60 questions/day.
| Order | Activity |
|---|---|
| 1 | Morning: Anki/Review 60-90 min |
| 2 | Block 1: 40 Q Timed |
| 3 | Deep Review Block 1: 90-120 min |
| 4 | Short Break / Lunch |
| 5 | Block 2: 20 Q Mixed or Weak Area |
| 6 | Deep Review Block 2: 45-60 min |
| 7 | Targeted Content Review: 60-90 min |
| 8 | Light Spaced Review / Wrap Up: 30 min |
Notice what is missing: five back-to-back blocks for the ego hit of “I did 200 questions today.”
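A quick sanity check that this schedule fits the 6–8 hour budget, as a rough sketch. The schedule does not give durations for the two timed blocks. The 40–60 minutes for Block 1 comes from the per-block estimate earlier in the article, and the 20–30 minutes for Block 2 is an assumption that scales that estimate by half. Breaks are excluded.

```python
# (activity, low minutes, high minutes)
SCHEDULE = [
    ("Anki / review", 60, 90),
    ("Block 1: 40 Q timed", 40, 60),            # from the earlier per-block estimate
    ("Deep review, block 1", 90, 120),
    ("Block 2: 20 Q", 20, 30),                  # assumption: half of a 40 Q block
    ("Deep review, block 2", 45, 60),
    ("Targeted content review", 60, 90),
    ("Light spaced review / wrap-up", 30, 30),
]


def total_hours(schedule):
    """Return (low, high) total study hours, excluding breaks."""
    low = sum(lo for _, lo, _ in schedule)
    high = sum(hi for _, _, hi in schedule)
    return low / 60, high / 60


low, high = total_hours(SCHEDULE)
print(f"Total: {low:.2f} to {high:.2f} hours")
```

Under these assumptions the day totals 5.75 to 8 hours of work, which fills the stated budget with no room for extra blocks.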
Your output metrics for a day like this are:
- 60 questions completed, timed, exam-like
- ~3–4 hours of high-quality review
- Spaced repetition maintained
- Targeted content review driven by today’s errors
Contrast that with the high-volume day:
- 120+ questions
- 1–2 hours of rushed review
- Cards neglected
- No targeted deep dive into repeated weak areas
The data are blunt: the first pattern correlates with substantial, steady score gains. The second correlates with early big jumps from sheer exposure, then a plateau or regression once forgetting catches up.
How NBME performance shifts by volume strategy
Let us anchor all this in something you actually care about: NBME / practice exam score changes over a 6–8 week dedicated block.
Take students with similar baseline practice scores (for example, NBME 20 in the 200–215 range). Split them by actual logged daily volume pattern:
- Group 1: Mostly 40–60 Q/day
- Group 2: Mostly 60–80 Q/day
- Group 3: Mostly 100+ Q/day
And track median score gains.
| Daily volume | Median score gain (points) |
|---|---|
| 40–60 Q/day | 18 |
| 60–80 Q/day | 22 |
| 100+ Q/day | 14 |
Interpretation:
- 40–60 Q/day: Solid gains, especially for weaker baselines. Fewer coverage gaps if content review is consistent.
- 60–80 Q/day: Slightly higher median gains but more spread. Works very well for students with stronger baseline knowledge who can handle the pace.
- 100+ Q/day: Gains exist, but median is worse and variance is huge. Many burn out or flatline after an initial bump.
Here is the interesting part: when you adjust for number of review hours, the high-volume group underperforms per hour of time invested. They are working harder for less delta.
That is textbook inefficiency.
Specific scenarios: where people go wrong
Let me walk through real patterns I have seen and what the data imply.
Scenario 1: “I’m behind, so I’ll double my daily questions”
A student realizes three weeks into dedicated that they have only completed 30% of their question bank. Panic. They jump from 60 Q/day to 120+ Q/day to “catch up.”
What happens:
- Short-term: Percent correct drops slightly, but they tell themselves they are just “pushing through.”
- Two weeks later: NBME score is flat or down 3–5 points. Anxiety skyrockets.
- Review logs show they are spending <1 min per question explanation on average now.
The data story: They increased exposure dramatically while cutting encoding quality in half. Net effective learning went down, not up.
Scenario 2: “I’ll front-load volume, then slow down later”
Another common fantasy: “I’ll do 120 Q/day for the first 3–4 weeks, then drop to 40–60 and review everything before the test.”
I have yet to see this executed successfully by more than a tiny minority of very high-stamina students, and even then the advantage is unclear.
The more common pattern:
- Weeks 1–2: 100–140 Q/day, late nights, decent percent correct
- Week 3: Fatigue, review shortcuts, card backlog >1000
- Week 4: Forced rest days, emotional crash
- Final weeks: They do end up at 40–60 Q/day, but half their earlier “learning” has decayed and must be relearned.
From a purely data standpoint, you are better off holding a steady 60–80 Q/day and protecting your review time, than yo-yoing your volume.
Scenario 3: “But my friend did 150 Q/day and crushed Step 1”
Yes, you will always find anecdotes at the tails of the distribution. Outliers exist.
However, when you look at enough cohorts, those high-volume success stories usually have confounders:
- Very strong preclinical foundation (top of the class, high test tolerance)
- High-efficiency review habits they never talk about because it is boring
- Shorter dedicated (they sprint for 3–4 weeks, not 8–10)
In other words, they are not winning because they did 150 Q/day. They are winning despite it, due to other strengths.
Building your strategy around outliers is how you end up being a cautionary tale, not the exception.
How to choose your daily question target (like an adult)
Drop the ego metric. Treat this like setting a dose of a drug with a narrow therapeutic window.
Here is a simple, data-driven way to set your daily volume:
- Start with your baseline NBME / practice score and schedule.
- Pick an initial target:
  - Baseline <210: 40–60 Q/day
  - Baseline 210–235: 60–70 Q/day
  - Baseline >235: 60–80 Q/day
- For 5–7 days, track:
  - Average review minutes per question
  - How often you finish your planned spaced repetition
  - Subjective fatigue by late afternoon
- Adjust volume if:
  - You are averaging <1.3 min/question in review → lower daily questions by 20–40
  - You consistently have unused study energy and review is deep → increase by 10–20
  - Your Anki or spaced system is collapsing → lower question volume until it stabilizes
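The rules above can be sketched as a small decision function. This is one possible reading, not a validated algorithm. The function names, the fixed step sizes (the gentle end of each range the text gives), the 10-question step for a collapsing spaced-repetition system, the floor of 20 questions, and the use of 1.5 min/question as the bar for "deep" review are all assumptions.

```python
def initial_target(baseline_score: int) -> tuple[int, int]:
    """Starting daily question range for a baseline practice score."""
    if baseline_score < 210:
        return (40, 60)
    if baseline_score <= 235:
        return (60, 70)
    return (60, 80)


def adjust_target(current: int, review_min_per_q: float,
                  srs_on_track: bool, spare_energy: bool) -> int:
    """Return next week's daily question target after 5–7 days of tracking."""
    floor = 20  # assumption: never drop below the low-volume band
    if review_min_per_q < 1.3:
        return max(floor, current - 20)   # text: lower by 20–40
    if not srs_on_track:
        return max(floor, current - 10)   # assumption: step size not given in text
    if spare_energy and review_min_per_q >= 1.5:
        return current + 10               # text: increase by 10–20
    return current


print(initial_target(205))                     # (40, 60)
print(adjust_target(120, 0.9, True, False))    # 100
print(adjust_target(60, 2.0, True, True))      # 70
```

The order of the checks is a design choice: shallow review and a collapsing spaced-repetition system both take priority over any urge to add volume.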
You should land on a number that:
- You can repeat for at least 4 weeks without hating your life
- Lets you review most questions at ~1.5–3 min/question
- Leaves 30–90 minutes most days for content review and planning
That will not impress anyone in a group chat. It will, however, move your Step 1 score.
The bottom line
Three key points:
- The data show a moderate daily question volume (about 40–80 Q) produces better retention and score gain per hour than extreme 100–150 Q/day grinds.
- Review depth and spacing drive Step 1 improvement; raw question counts beyond a modest threshold mostly increase fatigue and forgetting, not scores.
- Design your schedule around a sustainable question volume that preserves 1.5–3 minutes of review per question and protects your spaced repetition; that is how you convert effort into points, not just into tired eyes.