
The way most applicants “rank residencies” is statistically indefensible.
They go with vibes. A few interview impressions. What friends say. Whatever the PD said about “strong operative experience.” Then they throw that into a mental blender and call it a rank list.
If you want to behave like a data-literate adult instead of a lottery participant, you build a scoring system. In a spreadsheet. With weights, criteria, and actual numbers. Is it perfect? No. Is it dramatically better than hand-waving? The data says yes.
Let’s walk through how to do this properly.
Why You Need a Spreadsheet Scoring System
I will be direct: once you interview at more than about 8–10 programs, unaided human memory collapses. You cannot reliably compare Program #3 from early October with Program #17 from late January. Recency bias, fatigue, and social pressure dominate.
A structured scoring system fixes several predictable errors:
- Recency bias: The last 3 programs you saw feel “better” just because you remember them.
- Halo effect: You like one feature (e.g., free housing) and let that overshadow weak education or malignant culture.
- Anchoring: A big-name institution tilts everything in its favor, regardless of actual training fit.
- Emotional noise: Bad travel, delayed flight, awkward co-interviewers color your perception of the program.
A spreadsheet does not remove your judgment. It forces it to become explicit and consistent.
You decide what matters. But then you apply that decision to every program the same way. That is the point.
Step 1: Define Your Core Criteria (What Actually Matters)
The biggest failure I see? People copy someone else’s criteria. Your data model needs to reflect your utility function, not your class group chat.
From hundreds of rank list reviews, the same broad buckets show up again and again:
- Training quality and outcomes
- Lifestyle and workload
- Career positioning (fellowship, research, brand)
- Location and personal life
- Program culture and support
- Financial considerations
You do not need 40 criteria. Start with 8–15 meaningful ones. Split vague buckets into measurable pieces and merge anything that overlaps. "Operative experience" and "case volume" might be one combined metric. "Wellness" and "burnout" probably correlate, so choose the sharper one.
Here is a concrete, data-friendly set many residents end up using:
- Clinical volume / hands-on experience
- Teaching quality / structure
- Fellowship match or job outcomes
- Call schedule / hours / workload
- Program culture (supportiveness, toxicity)
- Location fit (family, partner job, city size)
- Compensation and cost of living
- Reputation / brand strength
- Research opportunities and support
- Autonomy and graduated responsibility
You can add specialty-specific metrics (ICU exposure for anesthesia, continuity clinic quality for IM, trauma load for EM, etc.).
The critical move: define each criterion in advance in writing. Two sentences each. That prevents you from subtly changing definitions to justify your feelings about a specific program.
Step 2: Assign Weights (All Criteria Are Not Equal)
Raw scores without weights assume that “nice city” = “good fellowship placement” in importance. That is rarely true.
You need to translate your preferences into numbers. That means assigning a weight (importance) to each criterion.
Practical approach:
- Start with 100 total points of importance.
- Distribute those 100 points across your criteria.
- Force tradeoffs; you cannot give everything a 10.
Example weight distribution for a surgery applicant serious about fellowship:
| Criterion | Weight (out of 100) |
|---|---|
| Clinical volume / operative exp. | 18 |
| Teaching and education structure | 14 |
| Fellowship match outcomes | 16 |
| Call schedule / workload | 10 |
| Program culture / support | 14 |
| Location fit | 8 |
| Compensation / cost of living | 6 |
| Reputation / brand | 8 |
| Research environment | 6 |
Right away, you see a clear statement of values:
- Training and career outcomes (volume + education + fellowship + brand + research) = 62% of decision weight.
- Lifestyle / culture / location / money = 38%.
Change the numbers to match your reality. A married applicant with kids might push location and call schedule way up and research way down. That is not wrong. It is just a different utility function.
One more sanity check: if you give “prestige” a giant weight while saying “I care most about being happy,” you have a misalignment. Fix it now, not during PGY-2 meltdown.
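If you would rather keep the weights in a script than a sheet, the 100-point budget is trivial to enforce. A minimal Python sketch using the surgery example above (the criterion names are just labels I chose):

```python
# Importance points per criterion (the surgery example from the table above).
weights = {
    "clinical_volume": 18, "teaching": 14, "fellowship_outcomes": 16, "workload": 10,
    "culture": 14, "location": 8, "compensation": 6, "reputation": 8, "research": 6,
}

total_points = sum(weights.values())
assert total_points == 100, f"weights sum to {total_points}, not 100; redistribute"

# Fractional weights for the scoring formulas later on.
fractional = {criterion: points / 100 for criterion, points in weights.items()}
```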
Step 3: Build a Simple, Scalable Scoring Template
Open Excel, Google Sheets, or Numbers. The software does not matter. The structure does.
Minimal effective columns:
- Program name
- Program ID (optional)
- Each criterion’s score
- Weighted score per criterion (score × weight)
- Total score (sum of weighted scores)
- Notes / qualitative comments
Basic layout (rows = programs, columns = criteria):
- Column A: Program
- Column B: City / State (for quick filtering)
- Columns C–K: Criteria scores (0–10 or 1–5 scale)
- Columns L–T: Weighted scores (criterion score × weight)
- Column U: Total score
- Column V: Free-text impressions
Example formula structure using a 1–10 scoring scale and 0–1 weights:
- Put each weight in row 1 above its criterion column (e.g., C1 = 0.18 for volume, D1 = 0.14 for teaching, and so on through K1).
- Enter Program 1's scores in row 2 (C2:K2).
- Cell L2 (weighted volume score): =C2*C$1. Locking only the row lets you copy the formula across to T2 and down to every program row.
- Cell U2 (total score): =SUM(L2:T2)
This is not complex modeling. It is linear weighting. But even this crude model is superior to no model.
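If you would rather script it than maintain formulas, the same linear weighting is a few lines of Python. This is a minimal sketch, not a prescribed tool: pandas is my choice, the weights mirror the surgery example in Step 2, and the program names and scores are placeholders.

```python
import pandas as pd

# Fractional weights (must sum to 1.0); these mirror the surgery example in Step 2.
weights = pd.Series({
    "volume": 0.18, "teaching": 0.14, "fellowship": 0.16, "workload": 0.10,
    "culture": 0.14, "location": 0.08, "compensation": 0.06,
    "reputation": 0.08, "research": 0.06,
})

# One row per program, one column per criterion, scored 1-10 (placeholder numbers).
scores = pd.DataFrame(
    {
        "volume": [8, 7, 9], "teaching": [7, 9, 6], "fellowship": [9, 7, 6],
        "workload": [5, 8, 6], "culture": [6, 9, 7], "location": [9, 7, 5],
        "compensation": [4, 7, 9], "reputation": [8, 6, 5], "research": [9, 7, 4],
    },
    index=["Program 1", "Program 2", "Program 3"],
)

assert abs(weights.sum() - 1.0) < 1e-9, "weights must sum to 1"

# Linear weighting: score x weight per criterion, summed per program, then ranked.
totals = scores.mul(weights, axis=1).sum(axis=1)
print(totals.sort_values(ascending=False).round(2))
```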
For data cleanliness, use data validation (dropdown lists) for some fields:
- Program type (academic / hybrid / community)
- State or region
- Applicant-level flags (e.g., “couples acceptable,” “avoid this city”)
This lets you filter and subset later without text-matching chaos.
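If you happen to build the sheet programmatically, openpyxl can create those dropdowns for you. A minimal sketch, assuming the program type lives in column B; the category values are the ones listed above:

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.append(["Program", "Type", "Region"])  # header row

# Dropdown restricting program type to three clean categories.
type_dv = DataValidation(type="list", formula1='"academic,hybrid,community"', allow_blank=True)
ws.add_data_validation(type_dv)
type_dv.add("B2:B100")  # apply to the Type column for the first ~100 programs

wb.save("rank_list.xlsx")
```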
Step 4: Standardize Your Scoring Scale
If your scoring scale is loose, your whole system degrades into wishful thinking. You need clear anchors.
Pick a range. Two common choices:
- 1–5 (coarse but easy)
- 1–10 (more granularity, slightly more work)
For most applicants, 1–10 works better. But then you must define what “10” and “5” actually mean.
Example for “Clinical volume / hands-on experience”:
- 9–10: Top decile volume nationally; senior residents consistently report “never struggling to get cases”; early autonomy; graduates highly confident.
- 7–8: Strong volume; residents rarely complain about case numbers; few gaps in exposure.
- 5–6: Adequate minimums; some residents need case-trading or electives to hit certain benchmarks.
- 3–4: Documented worries about volume in key areas; residents express concern about readiness.
- 1–2: Serious deficits or chronic structural issues limiting experience.
Do a similar rubric for each major criterion. Does this feel overly fussy? Maybe. But the alternative is "I kinda liked the place; call it an 8?", which is not data; it is mood.
Two more calibration tips:
- Use anchor programs. After a few interviews, pick one “baseline mid” program and one “clear top” program to anchor your high and mid scores. That keeps you from inflating everything to 8–10 by January.
- Allow 0 or N/A where appropriate. If a program has essentially no research environment, a 0 is honest. Or use N/A and adjust weighting if needed.
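On the N/A option: "adjust weighting" concretely means re-normalizing the remaining weights so a missing criterion does not silently drag the total down. A minimal sketch of that idea, with illustrative names and numbers:

```python
def weighted_total(scores, weights):
    """Weighted mean over the criteria that were actually scored (skip None / N/A)."""
    scored = {c: s for c, s in scores.items() if s is not None}
    weight_sum = sum(weights[c] for c in scored)
    return sum(s * weights[c] for c, s in scored.items()) / weight_sum

# Hypothetical program with no research environment at all: research marked N/A.
weights = {"volume": 0.18, "teaching": 0.14, "culture": 0.14, "research": 0.06}
scores = {"volume": 8, "teaching": 7, "culture": 6, "research": None}
print(round(weighted_total(scores, weights), 2))  # 7.09: mean over the scored criteria only
```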
Step 5: Populate Scores Systematically (Not From Memory)
The scoring system only works if you feed it decent input. That requires discipline.
Here is a simple, data-respecting workflow that I have seen work across multiple cycles:
- During interview day: Take quick notes per criterion in a small template (paper or digital). Do not assign final scores yet. Just impressions.
- Within 24 hours after interview: Transfer notes into your spreadsheet and assign preliminary scores across all criteria. This timing matters; recall drops fast.
- End of each interview week: Revisit that week’s programs in one sitting and normalize scores. Ensure you are not scoring Thursday’s program 1–2 points higher just because you remember it better than Monday’s.
You are essentially doing intra-week calibration to combat drift.
Do not wait until after all interviews are done to score everything. That is a guaranteed data quality disaster.
Step 6: Turn Scores Into Ranks (And Check for Sanity)
Once you have all your programs scored:
- Compute total weighted score for each program.
- Sort by total in descending order.
- That gives you your data-driven rank order.
At this point, you will usually notice one of three patterns:
- The list matches your gut closely. Good. You are internally well-calibrated.
- The top 3 make sense, but mid-tier programs shift a lot. That is normal; your brain is bad at fine-grained comparison.
- A program you “liked” is numerically weak, or a program you “felt meh” about is numerically strong. That is where the real work starts.
Instead of dismissing the numbers or dismissing your feelings, you investigate the discrepancy.
Example:
- Program A felt exciting. Charismatic PD. Big-name hospital. But in your sheet: brutal call, high burnout, weak fellowship matches, expensive city, mediocre teaching. Total score: 71.
- Program B felt quieter. No hype. Residents were tired but honest. Very strong autonomy, good outcomes, decent lifestyle, affordable city. Total score: 83.
If your stated goal was “good training and fellowship in a livable setup,” Program B is objectively better aligned.
You can override the numbers. But if you do, write one clear sentence: “I am moving Program A above B because __________.” That forces you to confront whether this is a rational reweighting or pure emotional noise.
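One low-tech way to make that confrontation visible is to line up your gut order against the model's order and flag big gaps. A tiny sketch with made-up rank positions:

```python
# Rank position per program: 1 = top. Both lists are hypothetical.
model_rank = {"Program A": 4, "Program B": 1, "Program C": 2, "Program D": 3}
gut_rank = {"Program A": 1, "Program B": 3, "Program C": 2, "Program D": 4}

for program, m in model_rank.items():
    gap = gut_rank[program] - m
    if abs(gap) >= 2:
        print(f"{program}: gut #{gut_rank[program]} vs model #{m} -> write down why")
```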
Visualizing Your Data: Seeing Patterns You Will Miss Otherwise
Raw tables are useful. Plots are better. They reveal structure your brain does not see.
Here is a very simple visualization that many applicants find clarifying: total scores by program.
| Program | Total score |
|---|---|
| Program A | 83 |
| Program B | 79 |
| Program C | 75 |
| Program D | 71 |
| Program E | 68 |
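If you want the picture rather than the table, a minimal matplotlib sketch (my choice of plotting library; values copied from the table above) looks like this:

```python
import matplotlib.pyplot as plt

# Total weighted scores from the table above.
programs = ["Program A", "Program B", "Program C", "Program D", "Program E"]
totals = [83, 79, 75, 71, 68]

fig, ax = plt.subplots()
ax.bar(programs, totals)
ax.set_ylabel("Total weighted score")
ax.set_title("Total score by program")
ax.set_ylim(60, 90)  # tighten the y-axis so gaps between programs stay visible
plt.show()
```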
When you chart all your programs, you usually see:
- A clear top tier (scores cluster high, separated from the rest by 3–5+ points).
- A messy middle (scores within 2–3 points of each other).
- A bottom tier (obvious drop-off).
A 1–2 point difference can be statistical noise. A 6–10 point gap probably is not.
You can also plot specific criteria across your top 5–10 programs. For example, compare location versus training quality:
| Program | Training score | Location score |
|---|---|---|
| Program A | 9 | 5 |
| Program B | 8 | 7 |
| Program C | 7 | 9 |
| Program D | 8 | 6 |
| Program E | 6 | 9 |
Imagine x-axis = training score, y-axis = location score. You will instantly see tradeoffs:
- Top-right: strong training, great location (rare).
- Top-left: okay training, great location (lifestyle programs).
- Bottom-right: strong training, poor location (classic “go suffer for 3–5 years and come out strong” sites).
- Bottom-left: avoid.
Those pictures make the decision landscape explicit.
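Here is a minimal matplotlib sketch of that training-versus-location scatter, using the scores from the table above:

```python
import matplotlib.pyplot as plt

# (training score, location score) per program, from the table above.
programs = {"Program A": (9, 5), "Program B": (8, 7), "Program C": (7, 9),
            "Program D": (8, 6), "Program E": (6, 9)}

fig, ax = plt.subplots()
for name, (training, location) in programs.items():
    ax.scatter(training, location)
    ax.annotate(name, (training, location), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Training score (1-10)")
ax.set_ylabel("Location score (1-10)")
ax.set_title("Training vs. location tradeoff")
plt.show()
```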
Example: Three Programs, One Applicant, Different Outcomes
Let’s run a concrete scenario. Internal medicine applicant. Wants cardiology fellowship, moderate lifestyle, coastal city if possible.
Weights (out of 100):
- Clinical training / complexity: 18
- Teaching quality: 14
- Fellowship placement: 18
- Call / workload: 12
- Culture: 14
- Location: 12
- Cost of living: 4
- Research: 8
Three hypothetical programs:
- Program X (Big city academic)
- Program Y (Mid-size city, hybrid)
- Program Z (Smaller city, high-volume community)
Applicant’s 1–10 scores (based on interviews, resident data):
| Criterion | Weight | Program X | Program Y | Program Z |
|---|---|---|---|---|
| Clinical training | 18 | 8 | 7 | 9 |
| Teaching | 14 | 7 | 9 | 6 |
| Fellowship placement | 18 | 9 | 7 | 6 |
| Call / workload | 12 | 5 | 8 | 6 |
| Culture | 14 | 6 | 9 | 7 |
| Location | 12 | 9 | 7 | 5 |
| Cost of living | 4 | 4 | 7 | 9 |
| Research | 8 | 9 | 7 | 4 |
Convert to weighted scores (score × weight, then sum; the short sketch after these totals spells out the arithmetic):
- Program X total ≈ 7.44 / 10 equivalent
- Program Y total ≈ 7.68 / 10
- Program Z total ≈ 6.52 / 10
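Here is that arithmetic as a short Python sketch, with the numbers copied straight from the table above:

```python
# Weights (out of 100) and 1-10 scores from the table above.
weights = {"training": 18, "teaching": 14, "fellowship": 18, "workload": 12,
           "culture": 14, "location": 12, "cost": 4, "research": 8}

programs = {
    "Program X": {"training": 8, "teaching": 7, "fellowship": 9, "workload": 5,
                  "culture": 6, "location": 9, "cost": 4, "research": 9},
    "Program Y": {"training": 7, "teaching": 9, "fellowship": 7, "workload": 8,
                  "culture": 9, "location": 7, "cost": 7, "research": 7},
    "Program Z": {"training": 9, "teaching": 6, "fellowship": 6, "workload": 6,
                  "culture": 7, "location": 5, "cost": 9, "research": 4},
}

for name, s in programs.items():
    # Weighted sum out of 1000, scaled to a 0-10 equivalent.
    total = sum(s[c] * w for c, w in weights.items()) / 100
    print(f"{name}: {total:.2f} / 10")
# Program X: 7.44 / 10, Program Y: 7.68 / 10, Program Z: 6.52 / 10
```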
If the applicant only looked at brand and fellowship, Program X “feels” like the winner. But once you factor in culture, teaching, and survivable workload, Program Y edges out as the best overall fit.
That roughly quarter-point difference on a 10-point scale is meaningful. Not gigantic, but enough to make you pause before blindly chasing prestige.
Avoiding Common Statistical and Cognitive Traps
People manage to break even simple scoring systems. The same mistakes repeat.
Here are the main failure modes I see:
- Using too many criteria. Once you have 20+ variables, your scoring becomes noisy. Redundancy creeps in. Keep it lean.
- Changing weights mid-season without re-scoring. If your priorities change (they sometimes do), either:
  - Recompute totals using new weights across all programs (easy in a spreadsheet), or
  - Create a second sheet ("v2 weighting") and compare results.
- Score creep. By December, applicants start throwing 8s and 9s like candy. Re-anchor against your early-season scores.
- Letting one criterion dominate unconsciously. If location is actually 40% of your decision, then give it 40% weight explicitly. Do not pretend it is 10% and then override everything to live near a beach.
- Ignoring qualitative red flags. A spreadsheet is not an excuse to dismiss “PGY-3 quietly told me: ‘Run’” just because the numbers look good. That goes in a separate “deal-breaker” column.
A simple rule: quantitative model first, common sense and red-flag check second.
Advanced Tweaks for Data Nerds (Optional, but Powerful)
If you enjoy playing with data, you can extend this system a bit.
- Sensitivity analysis. Vary a key weight (say location from 5 to 20) and see how your top 5 programs reshuffle. This shows you how robust your rank order is to preference shifts.
- Scenario sheets. Create separate tabs for “Career-first,” “Lifestyle-balanced,” and “Location-maximizing” scenarios with different weights. Compare where each program lands in each scenario.
- Z-scoring each criterion. If you have many programs, you can convert each criterion into a z-score (how many standard deviations above/below the mean a program is). That helps when your raw scoring scale drifts.
- Flagging tier breaks. Add conditional formatting to highlight when total scores differ by more than, say, 5 points. That naturally creates tiers rather than a fake precise 1–N listing.
Do you need any of this to make a solid rank list? No. But if you are the kind of person who enjoys a regression table, you may appreciate it.
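For the sensitivity analysis in particular, here is a minimal pandas sketch of what that sweep can look like, reusing the internal medicine example's weights and scores (the choice of pandas and the sweep values are mine, not a prescription):

```python
import pandas as pd

# Baseline weights (points out of 100) and 1-10 scores from the IM example above.
weights = pd.Series({"training": 18, "teaching": 14, "fellowship": 18, "workload": 12,
                     "culture": 14, "location": 12, "cost": 4, "research": 8})
scores = pd.DataFrame(
    {"training": [8, 7, 9], "teaching": [7, 9, 6], "fellowship": [9, 7, 6],
     "workload": [5, 8, 6], "culture": [6, 9, 7], "location": [9, 7, 5],
     "cost": [4, 7, 9], "research": [9, 7, 4]},
    index=["Program X", "Program Y", "Program Z"],
)

# Sweep the location weight, re-normalize so the weights still sum to 100,
# and check whether the rank order reshuffles.
for location_weight in (5, 12, 20):
    w = weights.copy()
    w["location"] = location_weight
    w = w / w.sum() * 100
    totals = scores.mul(w, axis=1).sum(axis=1) / 100
    order = " > ".join(totals.sort_values(ascending=False).index)
    print(f"location weight {location_weight}: {order}")
# In this toy data the order stays Y > X > Z throughout, i.e., the ranking is
# robust to that particular preference shift.
```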
Integrating the Spreadsheet with NRMP Strategy
One more layer: you are dealing with a matching algorithm. The NRMP runs an applicant-proposing match, which means listing programs in your true order of preference is the optimal strategy; guessing how programs ranked you cannot improve your outcome.
Your scoring system should feed your actual preference list, not some game-theory distortion.
Workflow:
- Finalize your weighted score–based rank order.
- Apply red-flag filters (toxic vibe, deal-breaker location, partner cannot move, etc.).
- Adjust for non-negotiables (couples match constraints, visa issues, absolute must-avoid cities).
- The final ordering after this should be the list you submit.
If you are tempted to move a program up purely because “I think they ranked me high,” you are now ignoring both the algorithm and your own data. That is not strategy; that is superstition.
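To make that workflow concrete, here is a minimal sketch of the final filtering pass; every name, score, and flag below is made up:

```python
# Score-based order from the spreadsheet, highest total first (made-up numbers).
scored_order = [("Program B", 83), ("Program C", 79), ("Program A", 75), ("Program E", 68)]

# Hard deal-breakers and non-negotiables identified outside the scoring model.
deal_breakers = {"Program E"}  # e.g., partner cannot move there, serious red flag

# The list you actually certify: scored order minus the disqualified programs.
final_rank_list = [name for name, total in scored_order if name not in deal_breakers]
print(final_rank_list)  # ['Program B', 'Program C', 'Program A']
```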
A Simple, Practical Build Timeline
To make this concrete, here is how I would time this across a typical application season:
| Phase | Task | Timing |
|---|---|---|
| Pre-interviews | Define criteria and weights | Sep |
| Pre-interviews | Build spreadsheet template | Sep |
| Interview season | Score each program within 24h of the interview | Oct–Jan |
| Interview season | Weekly normalization and review | Oct–Jan |
| Rank list phase | Final scoring and tiering | Feb |
| Rank list phase | Sensitivity checks and adjustments | Feb |
| Rank list phase | Submit NRMP rank list | Late Feb–early Mar |
This is not busywork. You are building the data backbone of a 3–7 year decision.
The Real Point: Forcing Yourself to Be Honest
A spreadsheet scoring system will not magically choose your perfect program for you. That is not the point.
The point is discipline.
- You declare what matters to you.
- You weight it.
- You apply those weights consistently.
- You confront, in numbers, when your emotional pull conflicts with your stated values.
That process alone puts you ahead of the majority of applicants who scribble a rank list three nights before the deadline based on who gave the best catered lunch.
So build the sheet. Argue with yourself about the weights. Score ruthlessly. Then, when you finally drag those programs into order on the NRMP screen, you will know that list is anchored in something more than a blur of hotel rooms and hospital tours.
With that kind of data backbone behind your rankings, you are ready for the next real challenge: thriving once you actually land in the program you chose. But that is another analysis entirely.