
Most “MCQs” used in medical teaching are not questions. They’re guessing games wearing a lab coat.
Let me fix that.
You want to construct high‑quality MCQs for medical exams—licensing-level, OSCE written components, in‑training tests, CME assessments. Not “quiz night” questions. Not trivia. You want items that actually discriminate between competent and unsafe, between superficial and deep understanding.
That requires discipline. There are item‑writing rules you simply do not break if you want valid, defensible exams. The good news: they are teachable, learnable, and repeatable.
I will walk through them systematically—stem, options, content, cognitive level, fairness, and review—using real medical examples and the same standards used by organizations like NBME, RCPSC, MRCP(UK), and specialty boards.
1. What High‑Quality Medical MCQs Actually Do
Before rules, you need the target.
High‑quality MCQs in medicine:
- Test application or higher-order thinking using authentic clinical problems.
- Focus on one clearly defined learning objective.
- Are answerable by a well‑trained candidate without guessing tricks.
- Differentiate between minimally competent and not‑yet‑competent candidates.
- Are psychometrically sound: good difficulty, good discrimination, clean functioning distractors.
Poor MCQs usually fail at least one of these. Classic patterns:
- Trivia about eponyms, dates, or numbers no clinician would memorize.
- Vague stems where two answers seem “sort of right.”
- Negatively worded questions (“Which of the following is NOT…”) that punish careless reading more than they reveal knowledge gaps.
- Implausible distractors that no serious candidate would choose, inflating scores artificially.
If you want your exam to stand up to scrutiny—from trainees, from your department, from accreditation—everything below becomes non‑optional.
2. Core Item‑Writing Rules: The Stem
The stem is where 80% of the quality is won or lost.
2.1 Use a focused clinical vignette
Most medical MCQs should present a short clinical scenario. Not because vignettes are fashionable, but because we do not treat lab values in the wild. We treat patients.
Bad stem (too abstract, decontextualized):
Which of the following is a side effect of amiodarone?
Better stem:
A 64‑year‑old man with ischemic cardiomyopathy and atrial fibrillation has been taking amiodarone for 18 months. He reports new shortness of breath and non‑productive cough. His oxygen saturation is 89% on room air. Chest X‑ray shows diffuse interstitial infiltrates.
Which of the following is the most likely cause of his pulmonary findings?
Now the item tests applied pharmacology, not a memorized list.
Rules for a focused stem:
- Present only information needed to answer the question—no padding.
- Include age, sex (if relevant), important comorbidities, key exam and investigation findings.
- Remove “window dressing” labs and extraneous details that do not affect the decision.
2.2 Ask a direct, single question
The stem must end with a single, clearly formulated question. No “double questions,” no trailing ambiguity.
Good questions:
- “What is the most appropriate next step in management?”
- “Which of the following is the most likely diagnosis?”
- “Which test is most appropriate to confirm the suspected diagnosis?”
- “Which complication is this patient most at risk of developing?”
Bad patterns:
- “Which of the following is TRUE about this condition?” (Too broad)
- “What is the diagnosis and the best treatment?” (Two questions in one)
- “Which is the best INITIAL and long‑term management?” (Again, two tasks)
Force yourself: one task per item. Diagnosis OR next step OR mechanism. Not two.
2.3 Keep all necessary information in the stem, not the options
The stem should be self‑contained. The candidate should not have to scan the options to understand the question.
Poor:
Which of the following best explains this patient’s condition?
A. Autoimmune destruction of…
B. Infection with…
C. Deficiency of…
D. …
(But the stem does not describe the condition well enough.)
Better: Make the clinical picture explicit in the stem, then ask for the mechanism based on that picture.
2.4 Avoid “window dressing”
Overly long stems are a common sin in medical exams. The problem is not length; it is irrelevance.
If a detail does not change the correct answer, remove it. Clerking‑style novels belong in the chart note, not the exam.
Test yourself while editing:
- If I delete this sentence, does the correct answer change?
- If no → cut it.
- If yes → keep it.
You will be surprised how much fluff you shed when you are strict.
3. Core Item‑Writing Rules: The Options
If the stem is the brain, the options are the nerves. When they misfire, your whole item collapses.
3.1 Use a single‑best‑answer format (not true/false, not multiple response)
For high‑stakes medical exams, stick to single‑best‑answer MCQs:
- Exactly one option is best supported by current evidence and practice.
- Other options are clearly less correct, incorrect, or incomplete for the specific stem.
Avoid:
- True/False or A–E “each statement is true/false” sets: they have terrible psychometrics and encourage cueing and test‑wiseness.
- Multiple‑response (“Select all that apply”) items in high‑stakes settings: they are harder to standardize and validate.
3.2 Make options homogeneous in content and structure
Options should belong to the same “family.” If the question is about next step in management, all options should be management actions, not a mix of diagnosis, pathophysiology, and management.
Bad (mixed types):
Which of the following is the best next step?
A. Pulmonary embolism
B. Perform CT pulmonary angiography
C. Give aspirin
D. Start warfarin
Here, A is a diagnosis, B is a test, C and D are treatments. Sloppy.
Better:
Which of the following is the most appropriate next step in management?
A. Begin low‑molecular‑weight heparin
B. Order CT pulmonary angiography
C. Begin aspirin therapy
D. Reassure and schedule outpatient follow‑up
E. Start warfarin without imaging
Now all options are “actions” you might reasonably consider next.
Also keep parallel grammar and similar length across options. An option that is noticeably longer or more detailed than the rest often cues the correct answer, even subconsciously.
3.3 Avoid overlapping or mutually inclusive options
Options should not overlap in logic. If two options could both be considered correct under some interpretation, your item is broken.
Example of overlap:
A. Give IV fluids
B. Give 1 L of normal saline
C. Give vasopressors
Here, B is a subset of A. If IV fluids are correct, both A and B are technically correct.
Fix: Either specify exactly what you want (e.g., “Which fluid regimen is most appropriate?”) or restructure the options so they are mutually exclusive.
3.4 Make distractors plausible, but clearly inferior
High‑quality distractors:
- Are plausible to a borderline candidate with partial knowledge.
- Address common misconceptions or typical errors.
- Represent alternative diagnoses or management options that fit part of the vignette but not all.
Poor distractors:
- Ridiculous diagnoses unrelated to the case.
- Obsolete treatments no one uses.
- Options that no trainee would consider after 3 months in clinic.
Example:
Patient with acute coronary syndrome and ST‑elevation.
Bad set:
A. Aspirin
B. Penicillin
C. Vitamin C
D. Herbal supplement X
Only one option is even remotely relevant. This item will have high facility (too many get it right) and terrible discrimination.
Better set:
A. Immediate percutaneous coronary intervention
B. Stress echocardiography
C. Start high‑dose atorvastatin only
D. Schedule outpatient cardiology review
All are things students may have seen in some context. Only A is correct in the acute STEMI scenario.
3.5 Do not hide the answer in “all/none of the above”
Serious exams do not use:
- “All of the above”
- “None of the above”
Why?
- They measure test‑wiseness rather than competence.
- They create problems when content updates (one option becomes arguable and “all of the above” dies).
- Candidates can infer correct answers once they know one option is clearly false or clearly true.
4. Banned Tricks and Sloppy Habits
Some rules are not just “best practice.” They are do not do this rules from every modern item‑writing guideline.

4.1 Avoid negative stems (“EXCEPT”, “NOT”, “LEAST”)
These are notorious for penalizing sloppy reading more than they test knowledge.
Bad:
Which of the following is NOT a feature of nephrotic syndrome?
Better:
Which of the following findings most strongly suggests an alternative diagnosis rather than nephrotic syndrome?
If you absolutely must test “least appropriate” or “not recommended” (for guidelines, sometimes you have to):
- Bold or capitalize the negative word: NOT, LEAST, EXCEPT.
- Keep the stem as short and simple as possible.
- Apply the format consistently across the exam (do not sprinkle in one or two negative stems at random).
But honestly, in high‑stakes exams, these tend to be weeded out in review.
4.2 Do not include irrelevant difficulty (language, math, trickiness)
Candidates should struggle with content, not with language parsing.
Avoid:
- Overly complex sentence structures, double negatives, or idioms (especially in international exams).
- Unnecessary mental arithmetic with awkward numbers if the point is not calculation skill.
- “Which of the following is MOST correct?” nonsense—if you need to argue that two are “kind of right,” your stem is under‑specified.
4.3 Do not cue the answer inadvertently
Common unintentional cues:
- Correct answer is consistently longer or more specific.
- Correct answer has formal language while distractors are colloquial.
- Grammatical mismatch: stem uses singular, only one option is singular.
- Repeated pattern: always option C, etc. Trainees do notice.
Check:
- Length balance.
- Grammatical agreement.
- Position of correct answers across your exam (use item randomization or a planned distribution; a quick audit is sketched below).
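If your draft exam lives in a spreadsheet or a script, this position check is easy to automate. Below is a minimal sketch in Python, assuming you can export item IDs with their keyed positions; the item IDs and the 1.5× flagging threshold are made up for illustration.

```python
# Sketch: audit how often each option position (A-E) carries the key across
# a draft exam, so no single position is conspicuously overused.
# (Item IDs and the 1.5x threshold are illustrative assumptions.)
from collections import Counter
import random

answer_key = {"item01": "C", "item02": "C", "item03": "A", "item04": "C",
              "item05": "B", "item06": "C", "item07": "D", "item08": "C"}

counts = Counter(answer_key.values())
expected = len(answer_key) / 5          # even spread across five positions
for pos in "ABCDE":
    n = counts.get(pos, 0)
    flag = "  <-- check" if n > 1.5 * expected else ""
    print(f"{pos}: {n}{flag}")

# One way to fix an imbalance: shuffle the option order of flagged items
# with a fixed seed before final layout, then re-letter and update the key.
rng = random.Random(2024)
options = ["Begin IV fluids", "Order CT", "Begin aspirin", "Reassure", "Start warfarin"]
rng.shuffle(options)
print(options)
```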
5. Matching Cognitive Level to Exam Purpose
Not every exam needs USMLE‑level puzzles, but every serious medical exam must exceed simple recall.
Think in terms of cognitive levels (Bloom-like, adapted for medicine):
- Recall: “Which nerve innervates the deltoid?”
- Comprehension: “What is the effect of vagal stimulation on heart rate?”
- Application: “Patient on drug X develops symptom Y—why?”
- Analysis/Management: “Given this complex case, what is the next step?”
Licensing and in‑training exams should live mostly at application and analysis. CME questions can mix comprehension through analysis, but if all your items are pure recall, you are running trivia, not education.
Example evolution:
Recall level:
Which bacterium is most commonly responsible for community‑acquired pneumonia?
Application level:
A 45‑year‑old previously healthy man presents with 2 days of fever, productive cough, and pleuritic chest pain. Chest X‑ray shows right lower lobe consolidation. Sputum Gram stain shows gram‑positive diplococci.
Which organism is the most likely cause of his pneumonia?
Same content, different cognitive demand.
For high‑quality MCQs:
- Start from a clearly written learning objective at the required level.
- Design the stem to force the candidate to use knowledge, not just recite it.
- Reserve pure recall for low‑stakes quizzes or foundation years.
6. Constructing Items Step‑by‑Step (with a Live Example)
Let me walk through this in a concrete way. This is how I teach faculty in item‑writing workshops.
Step 1: Define the tested objective precisely
Bad objective: “Understand DKA.”
Useful objective:
“Given a patient with suspected diabetic ketoacidosis, identify the most appropriate initial management step in the emergency department.”
This implies: applied knowledge, early management decisions, not minutiae of pathophysiology.
Step 2: Build a lean, discriminating vignette
Draft stem:
A 22‑year‑old woman with type 1 diabetes presents with 1 day of vomiting and abdominal pain. She has missed multiple insulin doses. On exam, she is drowsy, tachycardic, hypotensive, and has deep respirations. Capillary glucose is 28 mmol/L (504 mg/dL). Arterial blood gas shows metabolic acidosis. Urine is positive for ketones.
Which of the following is the most appropriate initial management step?
We deliberately include:
- History suggesting missed insulin.
- Physical exam (Kussmaul respirations, hypotension).
- Labs consistent with DKA.
We exclude:
- Unnecessary family history, drug list, social history, extraneous labs. None of that changes the first step.
Step 3: Generate the correct answer
Correct answer: Start IV fluid resuscitation with isotonic saline.
Now we create the other options.
Step 4: Build plausible distractors based on real errors
Typical errors:
- Starting insulin infusion before fluids.
- Giving bicarbonate early.
- Ignoring potassium until it is too late.
- Delaying treatment for imaging.
Options:
A. Begin continuous IV insulin infusion immediately
B. Start IV fluid resuscitation with normal saline
C. Administer IV sodium bicarbonate
D. Order emergent CT scan of the abdomen
E. Give a bolus of long‑acting subcutaneous insulin
Check:
- All are “management actions.”
- All could be considered by a novice.
- Only one is best initial step.
Step 5: Clean language and balance options
- Parallel structure: Begin/Start/Administer/Order/Give.
- Similar length.
- No quantification required.
This is a high‑quality, application‑level item aligned with a real learning objective.
7. Content Rules: Currency, Relevance, and Blueprinting
High‑quality MCQs do not live in isolation; they live inside an exam built on a blueprint.
A simple cognitive‑level blueprint, for example, might allocate items like this:
| Cognitive level | Share of items (%) |
|---|---|
| Recall | 10 |
| Comprehension | 20 |
| Application | 40 |
| Analysis/Management | 30 |
7.1 Align every item with a blueprint
Before you write a single item, you need:
- A content outline of the exam (systems, disciplines, domains).
- Cognitive level targets (e.g., 70% application/analysis, 30% foundational).
- Weighting by importance and frequency in real practice, not by how easy something is to write questions about.
Then:
- For each slot in the blueprint, write explicit learning objectives.
- Construct items to hit those specific targets.
If you do not blueprint, you end up with 15 questions on glomerulonephritis because your nephrologist likes writing them, and 2 questions on delirium because nobody volunteered.
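A blueprint only earns its keep if you check drafts against it before the deadline, not after. Here is a minimal sketch of that gap report in Python; the categories, targets, and item IDs are invented purely for illustration.

```python
# Sketch: compare items written so far against blueprint targets.
# (Categories, targets, and item IDs are illustrative assumptions.)
from collections import Counter

blueprint_targets = {            # target number of items per category
    "Cardiology": 12,
    "Nephrology": 8,
    "Neurology": 10,
    "Delirium/Geriatrics": 6,
    "Endocrine": 8,
}

draft_items = [                  # (item_id, blueprint_category)
    ("q01", "Nephrology"), ("q02", "Nephrology"), ("q03", "Cardiology"),
    ("q04", "Nephrology"), ("q05", "Endocrine"),
]

written = Counter(category for _, category in draft_items)
for category, target in blueprint_targets.items():
    have = written.get(category, 0)
    status = "OK" if have >= target else f"need {target - have} more"
    print(f"{category:22s} {have:2d}/{target:2d}  {status}")
```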
7.2 Test important, not obscure, content
Obvious but often violated.
Do not test:
- Rare syndromes no one sees in typical practice, unless it is a specialist exam where they matter.
- Useless minutiae (exact gene locus of a common condition, drug half‑lives to the decimal).
Do test:
- Presentations and first‑line management of common conditions.
- Early recognition of life‑threatening emergencies.
- Pitfalls and high‑risk decisions (e.g., when you must not send someone home).
A quick rule I use: “Would a competent clinician be legitimately criticized for not knowing this?” If yes, test it. If no, reconsider.
7.3 Keep content current and referenced
Clinical guidelines change. So must your items.
For high‑stakes exams:
- Attach a reference (guideline, textbook, paper) to each item in the item bank.
- Set review cycles (e.g., 2–3 years) or earlier if guidelines in that domain change.
- Kill or revise items that conflict with new standards of care.
8. Fairness, Bias, and Accessibility
You are not just testing “who has seen this obscure corner case in their rotation.” You are testing competence, across diverse backgrounds.
8.1 Minimize construct‑irrelevant variance
Things that wrongly affect performance:
- Cultural references, idioms, jokes.
- Patient names or scenarios that rely on cultural assumptions.
- Vague references to local healthcare structures that international graduates may not recognize.
Use clear, neutral language. If local context matters (e.g., TB prevalence, resource‑limited settings), be explicit.
8.2 Avoid demographic stereotypes
When using demographics:
- Use diverse names and backgrounds across the exam.
- Do not associate pathology stereotypically with certain ethnic groups unless epidemiologically relevant and explicitly stated.
- When race/ethnicity is included, it should be clinically relevant (e.g., sickle cell disease in populations where it is actually more prevalent) and not used as a lazy clue.
8.3 Write for non‑native speakers without dumbing down content
Medical English is hard enough. Do not compound it with slang, idioms, or region‑specific phrases.
Examples to avoid:
- “He is under the weather.” → Say “He feels unwell.”
- “She is out of sorts.” → Say “She feels fatigued and irritable.”
The goal is not literary flair. It is clarity.
9. Item Review, Piloting, and Psychometrics
Serious question writing ends with data, not with your gut.
Typical benchmarks once you have response data:
| Metric | Typical good range |
|---|---|
| Difficulty (P value) | 0.3–0.8 |
| Discrimination (rpb) | ≥ 0.2 (preferably ≥ 0.3) |
| Non‑functioning distractors | Fewer than 20% of distractors per item |
| Response pattern | No single option chosen disproportionately often |
9.1 Peer review before exam use
Every new item should be:
- Reviewed by at least one content expert and one assessment‑savvy educator.
- Checked against current guidelines.
- Screened for ambiguity, overlaps, flaws listed above.
I have seen “obvious” items die in committee because two experts gave different correct answers. Better to find that in review than after the exam with 200 angry residents.
9.2 Pilot items where possible
If your exam volume allows, include pretest items:
- They appear indistinguishable from scored items.
- Their performance is analyzed but they do not count toward the candidate’s score.
- Poorly performing items are revised or dropped.
This is how large boards maintain item quality over time.
9.3 Interpret item statistics and act
The basics:
Difficulty (P value): proportion of candidates answering correctly.
- Very high (>0.9): may be too easy or trivial.
- Very low (<0.2): may be too hard, miskeyed, or poorly written.
Discrimination (point‑biserial, rpb): correlation between getting the item right and overall test score.
- Low or negative: high performers are not choosing the keyed answer. That is a big red flag.
Distractor analysis: count how often each distractor is chosen.
- Distractors that no one picks across administrations are non‑functioning and should be revised or removed.
Items with poor metrics should be:
- Investigated for content errors or ambiguity.
- Revised based on evidence.
- Possibly removed from scoring (post‑hoc) if truly flawed.
Boards do this routinely; local programs almost never do. That is a mistake.
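None of this needs commercial software. The sketch below, using toy response data, shows how the three statistics above can be computed in plain Python; the corrected point‑biserial here excludes the item itself from the total score, which is a common convention, and all names and numbers are invented for illustration.

```python
# Sketch: difficulty (P), point-biserial discrimination (rpb), and distractor
# counts from a matrix of candidate responses. (Toy data, illustrative only.)
from statistics import mean, pstdev
from collections import Counter

key = {"q1": "B", "q2": "D"}                       # keyed answers
responses = {                                      # candidate -> option chosen per item
    "cand1": {"q1": "B", "q2": "D"},
    "cand2": {"q1": "B", "q2": "A"},
    "cand3": {"q1": "C", "q2": "D"},
    "cand4": {"q1": "B", "q2": "C"},
    "cand5": {"q1": "A", "q2": "D"},
}

scores = {c: {q: int(a == key[q]) for q, a in r.items()} for c, r in responses.items()}
totals = {c: sum(s.values()) for c, s in scores.items()}

def pearson(xs, ys):
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return float("nan")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

for q in key:
    item = [scores[c][q] for c in responses]
    rest = [totals[c] - scores[c][q] for c in responses]   # total minus this item
    p_value = mean(item)                                   # difficulty
    rpb = pearson(item, rest)                              # discrimination
    chosen = Counter(responses[c][q] for c in responses)   # distractor analysis
    print(f"{q}: P = {p_value:.2f}, rpb = {rpb:.2f}, responses = {dict(chosen)}")
```

With real data, you would flag items whose P or rpb falls outside the ranges in the table above and inspect any distractor that nobody chose.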
10. Practical Workflow for Busy Medical Educators
You do not have time to reinvent the wheel for every exam. You need a practical system.
10.1 Use an item template
For each new item, fill in:
- Learning objective (with cognitive level).
- Content area / blueprint category.
- Stem.
- Options (A–E).
- Correct key.
- Rationale: short explanation of why the key is correct and why each distractor is wrong.
- Reference(s).
- Date created, author.
The rationale is not just for learners. It helps future you (or colleagues) understand what the item was meant to test when revising years later.
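If you prefer something more structured than a blank document, the same template maps directly onto a record you can validate automatically. A minimal sketch follows; the field names and checks are suggestions, not a standard.

```python
# Sketch: the item template as a structured record with a basic completeness
# check. (Field names and validation rules are illustrative assumptions.)
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MCQItem:
    objective: str                   # learning objective, with cognitive level
    blueprint_category: str          # content area / blueprint slot
    cognitive_level: str             # e.g. "application"
    stem: str
    options: dict                    # {"A": "...", ..., "E": "..."}
    key: str                         # correct option letter
    rationale: str                   # why the key is right, why each distractor is wrong
    references: list = field(default_factory=list)
    author: str = ""
    created: date = field(default_factory=date.today)

    def completeness_problems(self) -> list:
        problems = []
        if self.key not in self.options:
            problems.append("key does not match any option letter")
        if not self.rationale:
            problems.append("missing rationale")
        if not self.references:
            problems.append("missing reference")
        return problems
```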
10.2 Write items in batches, then revise
I have watched faculty burn out trying to write “the perfect question” one at a time.
Better:
- Draft 10–15 items quickly without over‑polishing.
- Take a break.
- Return and systematically apply the rules above: check stem clarity, option homogeneity, distractor plausibility.
- Then send the batch for peer review.
Volume first, quality second. Revision is where quality emerges.
10.3 Build and curate an item bank
Even a small department can:
- Store items in a simple database or spreadsheet initially, then migrate to a professional system later.
- Tag items by topic, cognitive level, difficulty (once known), year.
- Retire, revise, and rotate items based on performance and guideline changes.
Over a few years, you will have a mature item bank, and “writing exams” becomes largely a task of selecting and updating, not generating from scratch every time.
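Even a plain spreadsheet becomes powerful once items carry tags. Here is a minimal sketch of selecting candidates for a new exam draft from a tagged bank; the columns, tags, and cut‑offs are made up for illustration.

```python
# Sketch: filter a tagged item bank (spreadsheet export) when assembling an
# exam draft. (Column names, tags, and cut-offs are illustrative assumptions.)
import csv
from io import StringIO

bank_csv = """id,topic,cognitive_level,last_reviewed,retired
q101,DKA,application,2023,false
q102,STEMI,application,2019,false
q103,Nephrotic syndrome,recall,2024,false
q104,Delirium,analysis,2024,true
"""

bank = list(csv.DictReader(StringIO(bank_csv)))

candidates = [
    item for item in bank
    if item["retired"] == "false"
    and int(item["last_reviewed"]) >= 2022
    and item["cognitive_level"] in {"application", "analysis"}
]
print([item["id"] for item in candidates])    # -> ['q101']
```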
FAQ
1. How many options (A–E) should a high‑quality medical MCQ have?
Four or five options are standard. There is nothing magical about five; what matters is that you can generate several plausible distractors. If you can only create two good distractors, use four options. Forcing weak fifth options degrades question quality more than it improves psychometrics.
2. Is it acceptable to reuse questions across different cohorts or years?
Yes, provided you have an item bank and track item performance. Many licensing bodies reuse items. However, overexposure (circulating recalls, shared banks) will gradually erode discrimination. High‑stakes programs typically mix a core of known good items with new items each year and periodically retire or substantially revise overused ones.
3. Should I give partial credit if more than one answer seems correct after the exam?
For properly written one‑best‑answer MCQs, partial credit makes little sense and complicates scoring. If post‑exam review shows that two options are defensibly correct, you should either: (a) accept both as correct for scoring, or (b) remove the item from scoring altogether. Then revise or retire the item for future use.
4. Can I test procedural skills or professionalism with MCQs at all?
You cannot fully assess hands‑on procedural skill or observed professionalism with MCQs, but you can test knowledge about procedures (indications, complications, steps) and ethical reasoning in professionalism scenarios. These should be part of a multimodal assessment strategy that includes OSCEs, workplace‑based assessments, and supervisor reports.
5. How many items do I need for a reliable exam?
Reliability depends on item quality and test length. For most in‑training or end‑of‑rotation exams, 40–60 well‑constructed items can give acceptable reliability (Cronbach’s alpha around 0.7–0.8). For high‑stakes summative exams, 100+ items are common to ensure stability. Poorly written items will undermine reliability no matter how many you include, so focus on quality first.
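For those who want to see where the alpha figure comes from, it is straightforward to compute yourself. A minimal sketch with toy scores follows; the formula is the standard one (k items, sum of item variances divided by the variance of total scores).

```python
# Sketch: Cronbach's alpha from a 0/1 score matrix.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
# (The score matrix is toy data for illustration.)
from statistics import pvariance

score_matrix = [        # rows = candidates, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

k = len(score_matrix[0])
item_vars = [pvariance([row[i] for row in score_matrix]) for i in range(k)]
total_var = pvariance([sum(row) for row in score_matrix])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.2f}")
```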
Key points:
- High‑quality MCQs are built on disciplined item‑writing rules: focused vignettes, single clear questions, homogeneous plausible options, no negative stems or trickery.
- Every item must map to a blueprint and a specific cognitive level, be current with clinical guidelines, and survive rigorous peer and statistical review.
- Sustainable exam quality comes from systems: templates, item banks, routine revision cycles, and the willingness to kill or fix flawed questions instead of defending them.