
Most milestone systems are glorified checklists pretending to be assessment frameworks.
If you are going to design one for a residency, you either do it rigorously or you create an illusion of precision that fools faculty, misleads residents, and hurts patient care.
Let me walk you through how to do this correctly.
1. Start With the Uncomfortable Question: What Are You Actually Judging?
Most programs skip this step. They download the ACGME milestones PDF, throw it into an evaluation form, and call it a day. That is lazy assessment design.
You must answer one question first:
What decisions will this assessment system support?
Not “what data will we generate,” but:
- Who gets promoted from PGY-1 to PGY-2?
- Who is allowed to take independent call?
- Who needs an individualized remediation plan?
- Who is safe to graduate and practice unsupervised?
If your milestones system does not clearly connect to these decisions, it is decor. Pretty decor, maybe, but still decor.
Now translate those decisions into assessment purposes:
- High-stakes (promotion, non-renewal, graduation)
- Medium-stakes (privileges like ICU call, procedures, consults)
- Low-stakes (coaching feedback, learning plans, self-assessment)
Your design must be fit-for-purpose. A system built only for “formative feedback” will be too soft for promotion decisions. A system built only for high-stakes will crush honest feedback and generate inflated ratings.
The right answer is usually:
One system. Two layers:
- Continuous formative data (frequent, low-friction, behavior-based)
- Periodic summative decisions (CCC review, milestone levels, formal documentation)
Design everything with those two layers in mind.
2. Milestones Are Not Checkboxes: Deconstruct Them Correctly
ACGME-style milestones are global, synthetic descriptors. “Manages complex medical conditions with indirect supervision” sounds nice, but what does that look like at 2 a.m. on night float?
You have to unpack each milestone into:
- Observable behaviors
- Real clinical contexts
- Expected degree of supervision
If you do not do this, your faculty will “assess” based on vibes, not performance.
A practical deconstruction workflow
Pick a core domain. Example from Internal Medicine: “Patient Care – Clinical Judgment and Decision Making.”
Step 1: Identify key tasks where this shows up:
- Admission decision and initial orders
- Daily plan on ward rounds
- Cross-cover calls
- ICU triage decisions
Step 2: Convert those into entrustable professional activities (EPAs):
- “Evaluate and admit an acutely ill adult patient from the ED”
- “Provide overnight cross-cover for a general medicine ward team”
- “Lead daily rounds for a small inpatient team”
Step 3: Map EPAs to milestone levels.
| EPA Level | Description | Supervision Level |
|---|---|---|
| Level 1 | Needs explicit step-by-step guidance | Direct |
| Level 2 | Manages straightforward cases with frequent input | Direct–Indirect |
| Level 3 | Manages most routine cases, asks for help appropriately | Indirect |
| Level 4 | Manages complex cases reliably, anticipates problems | Distal |
| Level 5 | Functions like a new attending in usual settings | Autonomous (program complete) |
Step 4: Define sample behavioral anchors for each level, in context.
For cross-cover EPA, “responding to acute change in clinical status”:
- Level 2: Calls senior immediately for borderline hypotension without initial assessment.
- Level 3: Assesses at bedside, reviews vitals and meds, tries 1–2 appropriate interventions, then calls senior with clear summary and plan.
- Level 4: Triages multiple sick patients, prioritizes correctly, initiates appropriate resuscitation steps, calls for help early when thresholds are met.
Now you have something faculty can actually rate.
3. Build the Spine: EPAs, Milestones, and Supervision Levels
The biggest design mistake I see: programs confuse competencies, milestones, and EPAs and try to assess everything everywhere. It becomes noise.
The solution is to build a spine that everything hangs off:
- Define 8–15 core EPAs that actually matter for safe independent practice in your specialty.
- Map your existing milestones to those EPAs, not the other way around.
- Tie supervision levels to clear entrustment decisions.
Example: Internal Medicine core EPAs snapshot
| EPA # | EPA Name | Typical Entrustment Timing |
|---|---|---|
| 1 | Admit and manage a general ward patient | Mid-PGY1 to early PGY2 |
| 2 | Provide cross-cover for ward team | Early–Mid PGY2 |
| 3 | Lead family meeting on goals of care | Late PGY2–PGY3 |
| 4 | Run a rapid response on the ward | Late PGY2–PGY3 |
| 5 | Manage ICU patient on call | Late PGY3 / Fellowship |
Tie these to supervision scales:
- 1 – Not allowed to perform
- 2 – Direct supervision (attending or senior physically present)
- 3 – Indirect, immediately available
- 4 – Indirect, available but not constantly present
- 5 – Supervision at a distance / post-hoc review
Now your milestone levels have teeth. “Reaching Level 3 or higher on EPA 2 = eligible for independent cross-cover.”
This is where resident buy-in changes. They see the connection:
“Once I consistently show Level 3–4 behaviors on these EPAs, I get X privilege.”
Not “once I get some random 4’s on an opaque form, maybe someone upgrades my level.”
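If your platform (or even a spreadsheet export run through a script) stores EPA ratings, you can make that entrustment logic explicit rather than implicit. Below is a minimal sketch, assuming hypothetical EPA labels, level thresholds, and "sustained ratings" counts that your own program would need to define:

```python
# A minimal sketch of encoding entrustment rules as data.
# EPA names, thresholds, and privileges below are illustrative, not a standard.
from dataclasses import dataclass

@dataclass
class EntrustmentRule:
    epa: str                 # which EPA the rule covers
    min_level: int           # minimum entrustment level (1-5)
    sustained_ratings: int   # consecutive ratings that must meet the threshold
    privilege: str           # what the resident is entrusted to do

RULES = [
    EntrustmentRule("EPA 2: cross-cover for ward team", 3, 5,
                    "Independent overnight cross-cover"),
    EntrustmentRule("EPA 5: manage ICU patient on call", 3, 8,
                    "Independent ICU call"),
]

def eligible(recent_levels: list[int], rule: EntrustmentRule) -> bool:
    """True if the most recent ratings all meet or exceed the rule's threshold."""
    window = recent_levels[-rule.sustained_ratings:]
    return (len(window) == rule.sustained_ratings
            and all(level >= rule.min_level for level in window))
```

The point is not the code; it is that the threshold ("Level 3+, sustained across several ratings") is written down somewhere residents and faculty can read, instead of living in one program director's head.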
4. Design the Instruments: Stop Using 40-Item Frankenstein Forms
If your evaluation form takes more than 90 seconds to complete, it will be:
- Ignored.
- Filled in with straight-line 4’s.
- Completed days later from memory, which is useless.
You need multiple, small, purpose-built instruments, not one giant omnibus form pretending to do everything.
Core instrument types you actually need
Micro-assessments (work-based, high frequency)
Short, context-specific forms completed after real work:
- 5–7 items maximum
- Single domain focus (e.g., inpatient cross-cover, clinic precepting, OR scrubbed case)
- 1–2 milestones/EPAs per form
- One narrative comment field
Multi-source feedback instruments
Structured input from:
- Nurses
- Peers (other residents)
- Interprofessional staff (pharmacy, PT, social work)
Tailor these to the domains those raters actually see: communication, teamwork, professionalism, reliability.
Direct observation tools
For procedures, handoffs, family meetings, etc.
Highly behavior-anchored, designed to be completed in real time or immediately after.
Self-assessment + learning plan templates
Residents pick 1–2 milestones per block that they are working on, predict their level, and propose specific behaviors to target. This is not busywork if you align it with the same EPA language faculty are using.
CCC synthesis templates
This is not an instrument completed with the resident. It is a structured way the Clinical Competency Committee aggregates data into milestone ratings and promotion decisions.
You want lots of low-friction data points, not a few bloated evaluations that everyone hates.
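For programs that script their own exports or dashboards, it helps to agree early on what one micro-assessment record contains. Here is a minimal sketch matching the "5–7 items, single context, 1–2 EPAs, one comment" design above; the field names and the 1–5 scale are illustrative assumptions, not a mandated schema:

```python
# A minimal sketch of a single micro-assessment record (illustrative fields only).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MicroAssessment:
    resident_id: str
    assessor_id: str
    assessor_role: str            # "faculty", "senior_resident", "nurse", ...
    context: str                  # "ward cross-cover", "clinic precepting", ...
    epa_ids: list[int]            # the 1-2 EPAs this form targets
    item_ratings: dict[str, int]  # the 5-7 anchored items, each rated 1-5
    narrative: str                # the single free-text comment field
    encounter_date: date = field(default_factory=date.today)
```

If every instrument emits records shaped roughly like this, aggregation for residents and the CCC becomes trivial; if every form invents its own structure, it never will be.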
5. Anchor Everything or Accept Garbage Data
Unanchored global ratings are statistically noisy and psychologically biased. “3 vs 4” means nothing if not tied to observable behaviors.
Your job: aggressively anchor.
How to write good behavioral anchors
Take a common domain: “Clinical reasoning – assessment and plan on ward rounds.”
Bad anchors:
- 1 – Poor
- 3 – Average
- 5 – Excellent
Slightly less terrible but still weak:
- 1 – Often misses important diagnoses
- 3 – Usually identifies key diagnoses
- 5 – Always identifies correct diagnoses
Better anchors – concrete behaviors:
- Level 1: Lists unprioritized problems, often omits key data, plans are copied from prior notes or senior; cannot explain reasoning when asked.
- Level 2: Identifies main active problem, but differential is narrow; management plans are reactive; struggles to adapt when patient worsens.
- Level 3: Prioritizes problems appropriately, offers a reasonable differential with justifying data; plans mostly evidence-based; recognizes when a plan is failing.
- Level 4: Anticipates complications; uses probabilistic thinking; weighs risks/benefits with patient-specific factors; independently adjusts complex regimens.
- Level 5: Teaches juniors reasoning frameworks; recognizes rare presentations early; consistently synthesizes across multi-morbid conditions.
Now each rating implies specific observed patterns. Faculty can think:
“Today on rounds, this resident anticipated AKI risk with contrast, adjusted meds, and changed the imaging plan after discussing with radiology. That’s a Level 4 behavior.”
Do this anchoring for:
- Common scenarios (ward cross-cover, resuscitation, clinic visit)
- Key domains (communication, professionalism, systems-based practice)
- EPAs tied to high-stakes decisions (independent call, ICU, procedures)
Yes, this is work. But you only need to do it once, then refine.
6. Data Flow and Frequency: Build a System, Not a Pile of Forms
You need a data architecture almost as much as you need the instruments themselves.
Here is what an actually functional assessment flow looks like.
| Step | Description |
|---|---|
| Step 1 | Clinical Encounter |
| Step 2 | Micro-assessment by faculty |
| Step 3 | Peer or nurse feedback |
| Step 4 | Direct observation tool if targeted EPA |
| Step 5 | Assessment database |
| Step 6 | Resident dashboard |
| Step 7 | CCC dashboard |
| Step 8 | Resident reflection and learning plan |
| Step 9 | CCC milestone rating meeting |
| Step 10 | Promotion and entrustment decisions |
| Step 11 | Feedback to resident |
Key design decisions:
How often do you expect micro-assessments?
Example: 1–2 per week per resident during inpatient, 1 per clinic half-day, 1 per OR day.
Who is required to complete them?
Not “anyone.” Be explicit. Senior resident? Attending? Fellow?
What triggers direct observation tools?
For example: first 5 central lines, first 10 family meetings, each new rotation type.
How does the CCC see the data?
Raw forms alone are overwhelming. They need:
- Aggregated trends by EPA and milestone
- Trajectories over time
- Distribution of supervision levels
- Outlier flags (rapid drop-off, plateauing below target)
Here is what that trajectory might look like for one resident.
| Training stage | EPA 1 - Ward Management | EPA 2 - Cross-cover |
|---|---|---|
| Start PGY1 | 1.5 | 1.0 |
| Mid PGY1 | 2.3 | 1.8 |
| End PGY1 | 3.0 | 2.5 |
| Mid PGY2 | 3.5 | 3.0 |
| End PGY2 | 4.0 | 3.7 |
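None of this requires fancy analytics. Here is a minimal sketch, assuming micro-assessment records exported as dicts with `epa_id`, `period`, and `level` fields (names are illustrative), of how the trend above and a basic plateau flag might be computed:

```python
# A minimal sketch of CCC-style aggregation: mean entrustment level per period
# per EPA, plus a simple plateau flag. Field names and thresholds are illustrative.
from collections import defaultdict
from statistics import mean

def epa_trajectory(records, epa_id):
    """Return {period: mean entrustment level} for one EPA, e.g. {"Mid PGY1": 2.3}."""
    by_period = defaultdict(list)
    for r in records:
        if r["epa_id"] == epa_id:
            by_period[r["period"]].append(r["level"])
    return {period: round(mean(levels), 1) for period, levels in by_period.items()}

def plateau_flag(trajectory, period_order, min_gain=0.3):
    """Flag a resident whose last three periods show less than `min_gain` total improvement."""
    # Example: period_order = ["Start PGY1", "Mid PGY1", "End PGY1", "Mid PGY2", "End PGY2"]
    recent = [trajectory[p] for p in period_order if p in trajectory][-3:]
    return len(recent) == 3 and (recent[-1] - recent[0]) < min_gain
```

The exact threshold matters less than the fact that someone looks at trajectories, not isolated scores.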
If you do not design the flow, your system collapses into random, sporadic evaluations that the CCC cannot interpret.
7. CCC: Where the System Lives or Dies
The Clinical Competency Committee is not a rubber-stamp promotion body. It is the engine that converts noisy, real-world data into coherent milestone judgments.
If your CCC meeting is:
- A spreadsheet of average scores,
- 3 comments per resident,
- And “I worked with them, they’re fine,”
…your system is failing.
A rigorous CCC process has:
Pre-meeting case preparation
Each member reviews a defined set of residents, focusing on:
- EPA trajectories
- Narrative comments (sorted by domain)
- Supervision level changes
- Any professionalism flags
Structured discussion template
For each resident, the chair drives:
- Quick data review (1–2 minutes)
- Focus on discrepancies (e.g., clinic vs wards, faculty vs nurses)
- Synthesis toward each relevant milestone group (Patient Care, Medical Knowledge, etc.)
- Explicit decision: stay on track, watch, or intervene
Written rationale for major decisions
“Promoted to independent call on wards because of consistent Level 3–4 behaviors on EPA 2; no concerns from nursing; two different attendings commented on excellent triage decisions.”
Feedback loop to the resident
Not just “you’re PGY2 now.” Instead:
- Current milestone/EPA level (in plain language, not just a number)
- Specific strengths with examples
- 1–3 targeted goals for next period, again tied to EPAs/milestones
- Explicit statement about entrustment (e.g., “We now expect you to handle cross-cover at this level; we will be watching X and Y closely.”)
You want the CCC to behave like a clinical case conference. Data, interpretation, consensus, plan.
8. Technology: Use It, But Do Not Worship It
A fancy platform does not fix a bad assessment design. But a good design implemented in email and sticky notes will die.
Pick tools that support, not dictate, your system. The core tech needs:
- Fast, mobile-friendly micro-assessment entry
- Role-based forms (faculty vs nurse vs peer)
- Real-time dashboards for residents
- CCC views that show trends by EPA/milestone
Avoid:
- Forcing every milestone onto every form
- Pages of free-text fields that nobody reads
- Making residents dig through 100 PDFs to understand their progress
A simple target mix of data sources:
| Assessment source | Share of data points (%) |
|---|---|
| Faculty micro-assessments | 50 |
| Nurse feedback | 20 |
| Peer feedback | 10 |
| Direct observations | 15 |
| Patient feedback | 5 |
Aim for something like that distribution: faculty micro-assessments as the backbone, diversified with other perspectives.
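If your platform exports raw records, a short script can tell you whether a resident's actual mix is drifting away from that target. A minimal sketch, assuming hypothetical source labels and records shaped as dicts with a `source` field; the percentages mirror the illustrative table above, not a mandated split:

```python
# A minimal sketch: compare a resident's actual mix of assessment sources
# against an illustrative target distribution.
from collections import Counter

TARGET_SHARE = {
    "faculty_micro": 0.50,
    "nurse": 0.20,
    "peer": 0.10,
    "direct_observation": 0.15,
    "patient": 0.05,
}

def source_gaps(records, tolerance=0.10):
    """Return {source: actual share - target share} for sources off-target by more than `tolerance`."""
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values()) or 1
    gaps = {}
    for source, target in TARGET_SHARE.items():
        diff = counts.get(source, 0) / total - target
        if abs(diff) > tolerance:
            gaps[source] = round(diff, 2)
    return gaps
```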
9. Faculty Development: Without This, Everything Above Is Theory
Here is the hard truth: most faculty were never trained to use milestones. They fill out forms in under a minute, often in bulk at block’s end, and they hate every second of it.
You can complain, or you can fix it. That means real, targeted faculty development:
One-page guides per assessment form
Plain language:
- What this form is for
- When to use it
- Concrete examples of behaviors at each level
- What not to worry about (e.g., “this is not a global judgment of the resident”)
Short, case-based workshops (30–45 minutes)
In each workshop:
- Show 3 video clips or written scenarios of resident performance
- Have faculty rate them with your forms
- Display the variation
- Argue through to consensus
This quickly calibrates expectations and exposes leniency/severity biases.
A hard line on expecting honest ratings
You must explicitly say:
“A 3 is not a bad score. It means appropriate for training level, still progressing.”
And you must have leadership back that up when residents complain about not getting 4’s and 5’s every time.
Feedback training
Teach attendings to translate milestone language into plain, specific, time-bound feedback:
- Not: “You need to work on your clinical judgment.”
- Instead: “On cross-cover, start by going to the bedside yourself before calling for help. Then present in this structured way…”
Without this, your lovely system generates trash data.
10. Residents Are Not Passive Targets: Make the System Transparent
Residents will either:
- See milestones as a secret promotion game they cannot understand, or
- Use them as a roadmap for growth.
You control which.
Concrete actions:
During orientation, explicitly map:
- Milestones → EPAs → Real privileges
(“To take independent ICU call, you must be consistently at Level 3+ on EPAs 2, 4, and 5.”)
Give them a simple dashboard of:
- Current EPA levels
- Trends over 6–12 months
- Narrative feedback, tagged by domain
Embed milestones in:
- Semi-annual meetings with program leadership
- Individual learning plans
- Resident self-reflection exercises
When a PGY2 says, “I want to get ready for independent cross-cover,” you should be able to respond with:
“Great. Right now, you are at Level 2–3 on cross-cover EPA. To move up, we are looking for these three behaviors consistently. Let us target those in the next two months and get more direct observations in that context.”
That is how an assessment system becomes an educational tool instead of a bureaucratic requirement.
11. Start Small, Iterate, and Be Honest About Trade-offs
You will not build a perfect system in one year. If you try, you will frustrate everyone.
Better approach:
Year 1:
- Identify 3–4 critical EPAs.
- Build short, anchored forms for those contexts.
- Calibrate a small group of faculty.
- Run pilot on 1–2 rotations.
Year 2:
- Expand EPAs and settings.
- Refine anchors based on faculty feedback.
- Build CCC dashboards.
- Start using data in real entrustment decisions.
Year 3:
- Integrate multi-source feedback.
- Tighten links between EPA levels and privileges.
- Adjust thresholds if you find a mismatch (e.g., you “graduated” Level 4 residents who struggled in early attending jobs).
Transparency with your faculty and residents matters:
“This year, we are piloting new EPA-based tools on wards and ICU. They will matter for feedback and learning plans, but not yet for high-stakes promotion. In 18 months, we will begin linking them to call privileges.”
If you hide the stakes or pretend the system is more valid than it is, you destroy trust fast.
12. Common Failure Modes You Should Avoid
I have seen these patterns over and over:
Milestone shopping
Faculty scroll through 30 milestones and pick random ones to rate. Result: Swiss cheese data. Solution: pre-map forms tightly to EPAs and specific milestone subsets.
End-of-rotation “global” masterpieces
One long form dictated by the GME office, all filled out the night before forms are due. Solution: move to high-frequency micro-assessments during the rotation, plus a short summary form at the end.
Everyone is “at level” all the time
Either an overly lenient culture or fear of harming residents. Solution: clear messaging that levels are developmental, interpreted in the context of training level, and that no one expects PGY1s to be Level 4.
CCC drowning in uninterpretable comments
Piles of “hardworking, pleasant, great to work with.” Zero actionable detail. Solution: require faculty to anchor comments to specific EPAs or behaviors: “On cross-cover, he triaged three new patients correctly and escalated care appropriately when labs worsened.”
Tech-first thinking
Buying whatever the institution uses and trying to retrofit your educational needs. Solution: design your assessment logic and forms first on paper, then choose or adapt tools to match.
If you see any of these, you do not need a new platform. You need to re-architect the system.
FAQ
1. How many EPAs should a residency program define without overwhelming faculty and residents?
Most core residencies function best with 8–15 EPAs. Below 8 and you are oversimplifying; above 15 and nobody can keep them straight. Start with the true “cannot graduate without this” activities: managing a ward team, running a code, handling overnight cross-cover, running a clinic panel, performing core procedures. You can always add nuance later, but early on you need high-yield, clearly distinguishable EPAs tied to concrete privileges.
2. Should milestone ratings be shared with residents verbatim or translated into simpler language?
Share them. But do not dump the raw ACGME grid on residents with no interpretation. Use both: the official levels, plus a plain-language translation and examples. For instance: “You are at Level 2–3 for cross-cover EPA. That means you can handle most straightforward issues but still need help anticipating deterioration in complex patients.” Residents are adults. They can handle nuanced feedback if it is explained clearly and connected to specific behaviors and expectations.
3. How do you handle faculty who consistently rate everyone at the top of the scale?
You deal with it directly. First, show them anonymized comparative data: “Your average ratings are 0.8 points higher than the faculty median across all domains.” Then revisit the behavioral anchors with specific resident examples. If that fails, adjust the weight of their evaluations in CCC deliberations, or in extreme cases, limit their formal assessment role. Protecting the integrity of the system is more important than avoiding one difficult conversation.
4. Can a milestone-based system work in smaller programs with limited faculty numbers?
Yes, but you have to be smarter about design. Smaller programs actually have an advantage: faculty know residents better. Focus on a tight set of EPAs, use brief but frequent micro-assessments, and lean on multi-source feedback to diversify perspectives. The CCC can be more efficient and in-depth because it is discussing fewer residents. The trap small programs fall into is copying big-program systems wholesale; instead, build leaner instruments but keep the same rigor in anchoring, data review, and decision making.
Key takeaways:
Design assessment backward from real decisions and privileges, not forward from a milestones PDF. Build a spine of EPAs, tightly anchored behaviors, and frequent micro-assessments that feed a serious CCC process. And remember: the system only works when faculty and residents both understand how milestones map to the everyday work of being a safe, independent physician.