
The biggest misconception about AI in radiology is this: people think it is “a magic black box that reads images.” It is not. It is a pipeline. And most of the clinically dangerous failures happen in the seams between those pipeline stages, not inside the neural network itself.
Let me break this down specifically.
1. What a Modern Radiology Detection Pipeline Actually Looks Like
Forget the marketing slides with glowing 3D brains. A real computer vision pipeline in radiology is closer to a messy industrial assembly line.
At a high level, nearly every system doing “detection” in radiology follows some version of this chain:
- Data ingestion and harmonization
- Preprocessing and normalization
- Spatial standardization (registration / resampling)
- Detection / segmentation network(s)
- Post‑processing and candidate filtering
- Clinical context integration (EHR, priors)
- Presentation to the radiologist (UI, overlays, worklist)
- Logging, audit, and feedback loops
If you only understand the neural net in step 4, you do not understand how these systems will behave in real hospitals.
1.1 Data ingestion: where the mess starts
In theory: “We load DICOMs.”
In practice:
- Mixed CT protocols: chest CT, CT pulmonary angiogram, CT abdomen/pelvis with overlapping coverage
- Variable slice thickness: 0.6 mm to 5 mm
- Reconstruction kernels: soft tissue, lung, bone — all changing edge contrast and noise
- Vendor differences: Siemens vs GE vs Philips vs Canon, each with their own header quirks
- Partial scans: truncated lungs, motion, contrast timing issues
You get a DICOM series. You think it is “chest CT.” The AI thinks so too. Except half the lungs are missing because the scan is limited to the upper abdomen. Now your “missed PE” rate just exploded.
Typical ingestion steps:
- Series selection (which of the 10+ series from the exam do we feed into AI?)
- Basic QA: reject corrupted / incomplete series
- Identification of modality, body part, orientation from DICOM headers (which are sometimes wrong)
- Mapping of pixel data to Hounsfield units (for CT), normalization of rescale slope/intercept
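To make those last steps concrete, here is a minimal sketch of CT series loading with basic QA, assuming pydicom-readable files with standard CT headers. The function name and the specific checks are illustrative, not any vendor's actual logic.

```python
import numpy as np
import pydicom

def load_ct_series_as_hu(paths):
    """Load a CT series, sort slices spatially, and map stored values to Hounsfield units."""
    slices = [pydicom.dcmread(p) for p in paths]
    # Basic QA: reject series that are obviously not what the downstream model expects.
    if any(getattr(s, "Modality", "") != "CT" for s in slices):
        raise ValueError("Non-CT instances in series")
    # Sort along z using ImagePositionPatient; file or instance-number order is not reliable.
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
    volume = np.stack([s.pixel_array.astype(np.float32) for s in slices])
    # HU = RescaleSlope * stored_value + RescaleIntercept, applied per slice.
    slope = np.array([float(s.RescaleSlope) for s in slices])[:, None, None]
    intercept = np.array([float(s.RescaleIntercept) for s in slices])[:, None, None]
    return volume * slope + intercept
```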
Misclassification at this stage leads to classic error patterns:
- Running lung nodule detection on abdominal CT slices that partially include basal lungs → nonsense detections
- Running ICH (intracranial hemorrhage) detection on post‑op CTs with craniotomy defects and drains → “false positives” that are actually entirely predictable
You see the theme: the “AI error” often starts long before the model sees the pixels.
2. Preprocessing: Subtle Choices, Big Consequences
Radiology computer vision lives and dies on preprocessing. Most academic papers underplay this. Most production teams obsess over it.
Common preprocessing elements:
- Resampling to uniform voxel spacing (e.g., 1 mm isotropic)
- Intensity windowing / clipping (e.g., lung window: −1000 to 400 HU; brain window: 0 to 80 HU)
- Intensity normalization (zero‑mean, unit variance per volume or per slice)
- Cropping or patch extraction (to meet GPU memory limits)
- Noise reduction or denoising filters (especially in low‑dose CT)
These are not cosmetic. They define what the network “sees” as signal vs noise.
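As a concrete illustration, here is a minimal sketch of two of these steps, assuming a CT volume already in HU and per-axis voxel spacing in millimeters. The target spacing and window values are illustrative defaults, not recommendations.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(volume_hu, spacing_mm, target_mm=1.0):
    """Resample to uniform voxel spacing (e.g., 1 mm isotropic)."""
    factors = [s / target_mm for s in spacing_mm]
    # Trilinear interpolation (order=1) is a common compromise between speed and smoothing.
    return zoom(volume_hu, factors, order=1)

def window_and_normalize(volume_hu, lo=-1000.0, hi=400.0):
    """Clip to a lung-style window, then scale to zero mean / unit variance."""
    clipped = np.clip(volume_hu, lo, hi)
    return (clipped - clipped.mean()) / (clipped.std() + 1e-6)
```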
Example: pulmonary embolism (PE) detection in CT angiography.
- If you resample too coarsely (say to 2–3 mm) to save memory, distal subsegmental PEs become a single fuzzy voxel. Your sensitivity plummets.
- If you keep voxel spacing but aggressively crop to the central area to avoid empty space, you may clip peripheral vessels entirely.
Another example: brain CT for ICH detection.
Many pipelines clip CT intensities to a fixed range around brain tissue, say −100 to 200 HU (wider than the display brain window), to reduce dynamic range. Good idea for most parenchymal hemorrhages. But hyperdense calcified lesions, bone fragments, or surgical material get pulled into a similar intensity range. You start confusing materials. Post‑op scans become a minefield.
These borderline calls drive specific error patterns:
- Loss of small but clinically important findings (tiny PEs, microbleeds, subtle early infarcts)
- Over-smoothing that erases margins of lesions
- Windowing choices that exaggerate normal variants into “lesions” (e.g., prominent perivascular spaces on brain MRI)
3. Spatial Standardization: Registration, Orientation, and the “Upside‑Down” Problem
You cannot overstate how often simple spatial assumptions break.
Most 3D detection networks in radiology:
- Assume fixed orientation (e.g., head-first supine, left-right correct, cranial-caudal correct)
- Assume fairly complete coverage of the anatomic region
- Often assume specific body region (e.g., chest CT vs abdomen CT)
Radiology reality:
- Feet-first exams
- Prone positioning
- Cropped fields of view (e.g., limited temporal bone CT)
- Scans with gantry tilt or other geometric distortions
The pipeline usually does:
- DICOM orientation handling: convert raw slices into a standardized orientation (e.g., radiological convention)
- Volume reconstruction: sort slices, account for variable spacing
- Optional registration to an atlas (brain MRI, sometimes chest CT)
- Optional cropping using body segmentation (e.g., separate lungs from mediastinum)
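A minimal sketch of the orientation-handling step, assuming the volume axes are ordered (slice, row, column) and that a 3×3 direction matrix has already been derived from ImageOrientationPatient and the slice ordering. The convention chosen here is purely illustrative.

```python
import numpy as np

def to_canonical_orientation(volume, directions):
    """Flip axes whose patient-space direction points against the chosen convention.

    directions: 3x3 array; row i is the patient-space unit vector of volume axis i.
    """
    out = volume
    for axis in range(3):
        dominant = int(np.argmax(np.abs(directions[axis])))
        if directions[axis][dominant] < 0:
            # This axis runs "backwards" relative to the convention, so mirror it.
            out = np.flip(out, axis=axis)
    return out
```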
If orientation handling is wrong or incomplete, you get:
- Liver lesions detected in the spleen region because left-right is flipped
- Lung nodules flagged in mediastinal fat because the lung mask failed
- Brain abnormalities mapped to nonsense coordinates when trying to compare to priors
I have literally seen demos where the AI heatmap was mirrored left-right and the vendor did not catch it until a radiologist pointed out that “this right MCA infarct is being highlighted on the left.” That is downstream of sloppy orientation handling.
Standardization also interacts with detection thresholds.
Many pipelines resample everything to a fixed cube, say 256×256×256. If the incoming scan covers from neck to pubis, the lungs occupy a smaller fraction of the normalized volume and fine patterns become harder to resolve. Design the model on tight, chest-only datasets, then deploy it to much larger fields of view, and performance tanks for exactly this reason.
4. Detection Models: One‑Stage vs Two‑Stage and Segmentation‑As‑Detection
This is where everyone focuses, so I will be blunt: architecture matters, but not as much as people think. Choices in the rest of the pipeline often dominate.
That said, you should understand the three main patterns used in radiology.
4.1 Direct detection networks (“one‑stage”)
Analogs of YOLO / RetinaNet adapted to medical images.
Characteristics:
- Output bounding boxes + class probabilities (e.g., “nodule, 8 mm, upper lobe”)
- Operate in 2D (per-slice) or 3D (volumetric)
- Often faster, easier to deploy in near‑real‑time scenarios
Used for:
- Chest X‑ray abnormalities (pneumothorax, effusion, consolidation, lines/tubes)
- Mammography calcification clusters or masses
- Single-view images such as extremity radiographs
Error patterns:
- Multiple overlapping boxes on the same lesion → requires clustering / merging (a sketch follows this list)
- Sensitivity drop for small lesions or low-contrast findings
- Misses in regions with less training representation (e.g., apices of lungs, clavicle-overlapped zones in CXR)
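The clustering/merging mentioned above is usually some form of non-maximum suppression. A minimal sketch, assuming boxes as (x1, y1, x2, y2, score) tuples; the IoU cutoff is an arbitrary illustrative value.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2, score)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_overlapping_boxes(boxes, iou_threshold=0.3):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop overlapping duplicates."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept
```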
4.2 Proposal‑based (“two‑stage”) detectors
Region proposal network + classifier/regressor, often extended to 3D.
Used in:
- Lung nodule detection on CT
- Liver lesion detection on CT
- Bone metastasis detection
Pattern:
- Generate candidate regions (regions-of-interest) with high recall but many false positives
- Classify and refine these proposals to reduce false positives
Error patterns:
- Systematically “blind” areas where proposal network underperforms (e.g., near diaphragm, near hilum)
- Very sensitive to training data sampling — if rare lesion types were underrepresented, proposals will never fire
4.3 Segmentation‑first pipelines
For many tasks, the best detection is actually segmentation:
- Intracranial hemorrhage: segment blood and then summarize volume/location
- PE: segment vessels, then segment clot within
- Lung nodules: segment candidate regions, then classify nodule vs vessel vs artifact
Typical flow:
- Segmentation network (often a U‑Net variant, 2D/2.5D/3D)
- Connected components analysis to extract individual lesions
- Size, shape, location feature extraction → candidate scoring
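A minimal sketch of the connected-components step, assuming a binary mask and a precomputed per-voxel volume. The minimum-size cutoff is an illustrative assumption, and it is exactly the kind of threshold discussed in the next section.

```python
import numpy as np
from scipy.ndimage import label, center_of_mass

def mask_to_candidates(mask, voxel_volume_ml, min_ml=0.01):
    """Turn a binary segmentation mask into per-lesion candidates with volume and centroid."""
    labeled, n_components = label(mask)
    candidates = []
    for i in range(1, n_components + 1):
        component = labeled == i
        volume_ml = float(component.sum()) * voxel_volume_ml
        if volume_ml < min_ml:
            continue  # size cutoff: this is where tiny findings silently disappear
        candidates.append({
            "volume_ml": volume_ml,
            "centroid": center_of_mass(component),
        })
    return candidates
```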
Error patterns here are more geometric:
- Over‑segmentation: vascular structures mis-labeled as nodules
- Under‑segmentation: extension of hemorrhage or tumor not captured, underestimating volume
- Leakage: contrast into adjacent structures being labeled as lesion (classic in vascular tasks)
Segmentation-based approaches tend to have better localization but can be more brittle to imaging artifacts (metal, motion, streaks).
5. Post‑Processing: Where Engineering “Fixes” Become Clinical Bugs
After the detector/segmenter outputs raw predictions, there is a whole layer of logic that turns that into something radiologists see.
This layer often contains:
- Confidence thresholds per lesion type
- Size thresholds (ignore nodules <3 mm, or hemorrhage <0.1 mL)
- Non-maximum suppression or clustering to merge overlapping detections
- Rule-based heuristics (e.g., discard candidate if outside lung mask; discard ICH candidate if inside skull bone mask)
- Temporal logic (compare with prior studies; ignore lesions that are perfectly stable over multiple exams)
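To show how opaque this layer can be, here is a minimal sketch of such a rule stack. Every constant and field name is an assumption for illustration, not any vendor's configuration.

```python
def filter_candidates(candidates, score_threshold=0.5, min_diameter_mm=5.0,
                      min_lung_overlap=0.5):
    """Apply the kind of confidence, size, and mask rules that shape what radiologists see."""
    kept = []
    for c in candidates:
        if c["score"] < score_threshold:
            continue  # below the confidence threshold for this lesion type
        if c["diameter_mm"] < min_diameter_mm:
            continue  # size rule: small nodules vanish here, not in the network
        if c["lung_mask_overlap"] < min_lung_overlap:
            continue  # heuristic: discard candidates mostly outside the lung mask
        kept.append(c)
    return kept
```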
These rules are rarely transparent to clinicians. But they shape error patterns dramatically.
Typical examples:
- To reduce false positives, a vendor raises the size threshold for lung nodules from 3 mm to 5 mm. Radiologists now complain about missed early nodules; vendor points to “guidelines” that often start surveillance at 6 mm. Clinically nuanced cases (high‑risk patients) fall through the cracks.
- For ICH, minor hyperdensities in the sulci of an elderly patient are suppressed by a “volume < X mL = ignore” rule. But that rule also suppresses subtle traumatic subarachnoid hemorrhage. The AI flags nothing; the radiologist was depending on it to triage.
Post‑processing is also where context flags are added: “suspicious for malignancy,” “requires urgent attention,” etc. The mapping from raw model score to “critical result” is mostly arbitrary and tuned in small retrospective datasets.
In other words: the last mile of the pipeline is a quiet source of clinically meaningful bias.
6. Integration with Clinical Context: The Missing Piece (Most of the Time)
A truly intelligent detection pipeline in radiology would combine:
- Imaging findings
- Demographics (age, sex)
- Clinical presentation (chest pain vs fever vs trauma)
- Prior imaging
- Labs (D‑dimer, troponin, creatinine, etc.)
Most production tools barely scrape the surface. At best:
- They read patient age and sex from DICOM / RIS
- They compare to prior studies of the same modality to detect size changes
- Maybe they read the indication text to route to a specific AI model (e.g., “CTPA” vs “routine chest CT”)
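A minimal sketch of that kind of text-based routing, assuming DICOM series and study descriptions are available. The keywords and model names are invented for illustration.

```python
def route_to_model(series_description, study_description):
    """Pick which AI model (if any) should receive this series, based on description text."""
    text = f"{series_description} {study_description}".lower()
    if "ctpa" in text or "pulmonary angio" in text:
        return "pe_detection"
    if "head" in text or "brain" in text:
        return "ich_detection"
    if "chest" in text:
        return "lung_nodule_detection"
    # Returning nothing is safer than silently running the wrong model on the wrong series.
    return None
```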
Without context, systematic error patterns emerge:
- Overcalling benign congenital anomalies in young patients (for example, a normal thymus in a pediatric patient flagged as a mediastinal mass)
- Undercalling subtle new lesions when priors are not accessible or are from external systems
- Misaligned risk assessment — e.g., incidental 2 mm PE in a young trauma patient being given the same “critical” flag as a large PE in a hemodynamically unstable patient
We are starting to see imaging+EHR models in research (multimodal transformers, etc.), but they are nowhere near routine deployment.
7. Common Error Patterns in Radiology Computer Vision
Now the part you actually care about: how these systems fail. Not just “false positive” and “false negative,” but the patterns you can practically expect.
I will group them into six buckets.
7.1 Anatomy‑specific “blind spots”
Certain anatomic regions are chronically underdetected:
- Apices and bases of lungs in CXR and CT
- Hilar and perihilar regions (high vessel density, overlapping structures)
- Posterior fossa and brainstem on CT (beam hardening, bone)
- Skull base lesions on head CT (complex bone, variable recon)
- Subtle pelvic fractures on trauma CT (motion, large field of view)
Why? Training sets skew to “clean” mid‑lung, supratentorial brain, central pelvis. Vendors rarely show you performance by subregion.
7.2 Device and artifact confusion
Models are famously bad at:
- Distinguishing lines, tubes, wires from pathology (e.g., confusing chest tube tracks with pneumothorax or scar)
- Handling metal artifact (hip prostheses, dental work, spinal hardware)
- Motion artifact (especially in cardiac and trauma CT)
Concrete patterns:
- “Pneumothorax” flags along the path of a chest tube or near skin folds in portable CXRs
- “Hemorrhage” flags surrounding metal clips in post‑op brain CTs
- “Mass” in the pelvis where there is bowel gas plus hip replacement artifact
These are not random. They are extremely predictable once you see them a few dozen times.
7.3 Domain shift and protocol drift
Most models are trained on:
- A limited set of scanners from a few vendors
- Limited reconstruction kernels
- Certain dose ranges
- Adult patients, often outpatient or emergency settings
Deploy them on:
- Pediatric cases
- Very low‑dose screening CT (e.g., lung cancer screening programs)
- New reconstruction algorithms (e.g., deep learning reconstruction, iterative reconstruction upgrades)
- Non‑standard protocols (dual‑energy CT, spectral CT, research sequences)
You get:
- Sensitivity degradation, especially on low-contrast and small lesions
- Calibration drift: same lesion now gets lower confidence scores, falling below threshold
- Weird, previously unseen artifacts that the network tries to “explain” as one of its known classes
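One crude but useful guard is to compare a new scan's intensity distribution against the training distribution before any detection runs. A minimal sketch, assuming a reference HU histogram computed from the training set; the bins and cutoff are illustrative.

```python
import numpy as np

def histogram_drift(volume_hu, reference_hist, bins):
    """Total variation distance between a scan's HU histogram and a training-set reference."""
    hist, _ = np.histogram(np.clip(volume_hu, bins[0], bins[-1]), bins=bins)
    hist = hist / (hist.sum() + 1e-9)
    ref = reference_hist / (reference_hist.sum() + 1e-9)
    return 0.5 * float(np.abs(hist - ref).sum())  # 0 = identical, 1 = disjoint

# If drift exceeds a locally validated cutoff (say 0.3), route the study to
# human-only reading instead of letting the model "explain" unfamiliar artifacts away.
```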
Domain shift is why some institutions find that an “FDA‑cleared” algorithm works beautifully on their Siemens 64‑slice CTs but falls apart on newer photon‑counting CT, or on pediatric chest CT done with ultra‑low‑dose protocols.
7.4 Prevalence and calibration bias
Most clinical AI systems are trained and tuned on datasets with much higher prevalence of the target condition than reality. Example:
- Training ICH detection with 30–40% positive cases
- Real ED head CT prevalence of ICH: closer to 5–10%, depending on population
What happens? In practice:
- Positive predictive value (PPV) plummets
- Radiologists start ignoring “ICH suspected” flags because most are false alarms in their environment
- Some teams try to “fix” this by adjusting thresholds post‑hoc, which in turn reduces sensitivity
The rarer the condition (aortic dissection, free air under the diaphragm, spinal epidural abscess), the more calibration becomes a headache. This is not a minor detail; it shapes user trust.
| Prevalence | PPV (%) |
|---|---|
| 5% | 55 |
| 10% | 70 |
| 20% | 82 |
| 40% | 90 |
The table above is illustrative: the same sensitivity/specificity model will show drastically different PPV depending on prevalence.
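The arithmetic behind that effect is simple Bayes. A minimal worked example, assuming a fixed sensitivity and specificity of 0.95 each (illustrative numbers, slightly different from the table above):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

for prevalence in (0.05, 0.10, 0.20, 0.40):
    print(f"prevalence {prevalence:.0%}: PPV {ppv(0.95, 0.95, prevalence):.0%}")
# prevalence 5%: PPV 50%; prevalence 40%: PPV 93%. Same model, very different PPV.
```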
7.5 Cascading pipeline failures
This is the under‑discussed part.
Example chain for PE detection:
- Series selector accidentally picks non‑contrast chest CT instead of CTPA
- Preprocessor still applies “angiographic lung window” assumptions
- Detector, trained almost exclusively on contrast CT, is now seeing noise
- Post‑processor sees low overall model confidence, but a few noisy candidates survive
- UI flags “possible PE” on a non‑contrast scan
To the radiologist: “AI thinks this non‑contrast chest CT has a PE. The AI is stupid.”
In reality: series selection bug.
Another:
- Lung segmentation fails due to massive effusion and consolidation
- Rule‑based post‑processing discards all candidates outside lung mask
- A peripheral PE within the collapsed lung remains undetected because the failed mask excluded it
- Radiologist sees a normal‑appearing right pulmonary artery, misses the small peripheral PE, and blames themselves or the AI depending on training.
Failures propagate. And they often interact with human behavior in perverse ways.
7.6 Human–AI interaction errors
Not a vision problem per se, but absolutely part of the “error pattern” picture.
Some actual phrases I have heard from residents using AI‑enabled PACS:
- “If AI does not mark anything, I relax a bit on that case.”
- “If AI flags hemorrhage, I spend more time trying to see it, even if I am not actually convinced.”
- “I trust it a lot on chest X‑rays, less on CT, and almost not at all on weird post‑op scans.”
Error modes:
- Omission: Radiologist misses a lesion because AI failed to flag it and they over‑trusted the tool.
- Commission: Radiologist over‑calls something because AI flagged it and they anchor to that suggestion.
- Automation bias: Under‑reading non‑flagged studies, over‑reading flagged ones.
These error modes are strongest early after deployment, then moderate as radiologists gain a mental model of the AI’s strengths and weaknesses. But they never fully disappear.
8. How These Pipelines Are Changing Radiology Practice
Despite all the pitfalls, computer vision in radiology is not going away. It is expanding.
Where it is genuinely strong today:
- Triage and worklist prioritization
  - ICH on non‑contrast head CT
  - Large vessel occlusion on CTA
  - Pneumothorax on CXR
- Quality checks
  - Positioning (e.g., incomplete lung coverage)
  - Contrast timing (e.g., inadequate CTPA bolus)
  - Line/tube/drain positioning in ICU CXRs
- Quantification
  - Coronary calcium scoring
  - Emphysema percentage on chest CT
  - Liver fat quantification
  - Volumetric measurements of nodules or aneurysms
These tasks rely on relatively robust signals and clear objective labels.
Where it is still fragile:
- Early, subtle disease (very small lesions, early interstitial lung disease, microinfarcts)
- Heavily post‑operative anatomy
- Rare diseases and unusual presentations
- Any task requiring nuanced clinical judgment rather than pattern recognition alone
| Task Type | Current Reliability | Typical Use |
|---|---|---|
| ICH detection (CT) | High | ED triage, worklist |
| Pneumothorax (CXR) | High | ED/ICU triage |
| Lung nodule (CT) | Moderate | Secondary read, follow-up |
| PE detection (CTPA) | Moderate | Assist, not autonomous |
| Post-op CT (any region) | Low | Generally unsupported |
Notice how “moderate” often means “good on clean, typical cases; sketchy on complex ones.”
9. What Needs to Change in Future Detection Pipelines
If we want computer vision in radiology to move from “helpful gadget” to “reliable clinical infrastructure,” a few things must evolve.
9.1 Make the pipeline explicit and monitored
Hospitals should not install “black box” AI.
They should have:
- Clear documentation of each pipeline step (series selection rules, preprocessing details, thresholds)
- Monitoring dashboards that track failure points separately (e.g., segmentation failure rates, series misclassification rates)
- Institution‑specific calibration (adjust thresholds based on local prevalence and protocols)
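Even a very simple per-stage event log gets you most of the way toward such a dashboard. A minimal sketch; the stage and status vocabularies are assumptions, not a standard.

```python
import datetime
import json

def log_pipeline_event(study_uid, stage, status, detail=""):
    """Record one pipeline event so per-stage failure rates can be monitored separately."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "study_uid": study_uid,
        "stage": stage,    # e.g. "series_selection", "segmentation", "thresholding"
        "status": status,  # e.g. "ok", "rejected", "failed", "out_of_distribution"
        "detail": detail,
    }
    # In production this would feed a monitoring store; printing keeps the sketch self-contained.
    print(json.dumps(event))
```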
Regulators will eventually push for this level of transparency. Smart institutions will get there earlier.
9.2 Multimodal and temporal context as first‑class citizens
Single‑scan models will increasingly look primitive.
The future pipeline will:
- Ingest prior imaging to track lesion trajectories automatically
- Ingest basic EHR context (age, key labs, indication, risk factors)
- Adjust thresholds based on pre‑test probability
Example: an incidental 3 mm lung nodule in a 24‑year‑old non‑smoker with no symptoms deserves a different flagging behavior than the same nodule in a 70‑year‑old heavy smoker enrolled in a screening program. The model should behave differently. Most do not.
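A minimal sketch of what “behave differently” could mean in practice, assuming a score-based detector whose flagging threshold is relaxed as pre-test probability rises. The risk factors and numbers are invented for illustration, not a validated rule.

```python
def nodule_flag_threshold(age, smoker, in_screening_program, base_threshold=0.7):
    """Lower the flagging threshold when pre-test probability of malignancy is higher."""
    threshold = base_threshold
    if age >= 65:
        threshold -= 0.1
    if smoker:
        threshold -= 0.1
    if in_screening_program:
        threshold -= 0.1
    return max(threshold, 0.3)

# The 24-year-old non-smoker keeps the strict 0.7 cutoff; the 70-year-old heavy
# smoker in a screening program gets flagged from 0.4 upward.
```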
9.3 Continual learning with guardrails
Right now, most systems are “frozen” at the version shipped. That is absurd in a domain where scanners, protocols, and populations drift continuously.
The right model:
- Has a formal process to ingest new labeled data from the local site
- Retrains or fine‑tunes on a controlled schedule
- Validates on a held‑out local test set before activating updates
- Keeps versioned performance reports per site
Without this, domain shift will erode performance silently over time.
9.4 Human‑centered interfaces
The UI is not a cosmetic afterthought. It is a safety layer.
Better patterns:
- Show confidence intervals or at least relative confidence bins, not just green/red flags
- Visually encode uncertainty (faint vs bold highlighting)
- Allow “why did you think this?” inspection: show heatmaps, segmentation masks, or exemplars
- Let the radiologist provide structured feedback (e.g., “false positive”, “missed lesion here”) that feeds back into QA processes
I have seen systems where AI alerts pop up as modal dialogs, forcing radiologists to click through them. That is how you generate alert fatigue and eventually blind clicking. Do not do this.
| Step | Pipeline Stage |
|---|---|
| Step 1 | Raw DICOM Studies |
| Step 2 | Series Selection |
| Step 3 | Preprocessing and Resampling |
| Step 4 | Detection and Segmentation Models |
| Step 5 | Post Processing and Thresholding |
| Step 6 | UI Presentation in PACS |
| Step 7 | Radiologist Assessment |
| Step 8 | AI QA and Monitoring |
| Step 9 | Model and Rule Updates |
That feedback loop at the end of the table (monitoring feeding back into model and rule updates)? Almost nobody has it fully built. But that is where long‑term reliability will come from.
10. Practical Takeaways for Clinicians and Builders
If you are a radiologist:
- Do not think of “the AI” as a single thing. Ask: how are they selecting series? What study types are excluded? How do they handle post‑op cases?
- Learn your tool’s blind spots. Ask for stratified performance: by body region, scanner model, protocol, patient age. Vendors hate giving this, which is exactly why you should ask.
- Use AI as a triage and cross‑check tool, not a primary reader. If it changes your read, you should be able to articulate why — not just “because the AI said so.”
If you are building these systems:
- Spend at least as much engineering time on ingestion, preprocessing, post‑processing, and UI as you do on the core model. That is where your real clinical reliability will be decided.
- Log everything. Series selection failures, segmentation failures, out‑of‑distribution scans, performance drift over time. Silent failures are the real risk.
- Treat integration into the radiologist workflow as a design problem, not an afterthought. A perfectly calibrated detector that nobody trusts or uses is functionally dead.
FAQ
1. Why do FDA‑cleared radiology AI tools still make so many obvious errors?
Because FDA clearance is usually based on performance in curated validation datasets with known distributions and specific use conditions. Real hospitals have different scanners, protocols, patient populations, and a ton of edge cases. The pipeline around the core model (series selection, preprocessing, thresholds, UI) is rarely validated as rigorously across those variations. What looked great in a controlled study can degrade significantly in your specific environment.
2. Are 3D models always better than 2D models for radiology detection?
No. 3D models capture volumetric context and can improve lesion detection in CT/MRI, but they are more memory‑intensive, harder to train, and more sensitive to resampling artifacts. Many high‑performing systems use 2.5D (multiple adjacent slices) or hybrid architectures. For tasks like chest X‑ray interpretation, 2D remains natural and effective. The key is matching architecture to the imaging modality, resolution, and clinical constraints—not blindly using 3D.
3. Why do AI tools perform poorly on post‑operative or hardware‑heavy scans?
Because most training datasets underrepresent those cases. Post‑op anatomy, surgical changes, and metal hardware introduce patterns and artifacts that look nothing like the “normal vs pathology” examples the model learned. Without explicit training on these cases, the network will either hallucinate pathology (false positives) or suppress everything as out‑of‑distribution noise. Handling post‑op imaging well requires dedicated datasets and often separate logic paths.
4. Can radiologists “game” AI tools to get better performance?
To a degree, yes. If you understand how the pipeline behaves, you can adapt: choosing the right protocol so AI is triggered, avoiding non‑standard reconstructions the model was not trained on, previewing which series are AI‑processed, and mentally discounting AI flags in known weak regions (e.g., apices, areas with hardware). This is not cheating; it is using your knowledge of the tool’s limitations to avoid predictable failure modes.
5. Will future radiology AI fully replace detection by humans?
Highly unlikely in the foreseeable future. Detection will become increasingly automated for common, well‑defined tasks, especially in clean, protocol‑standard studies. But complex cases, rare diseases, postoperative anatomy, and context‑heavy decisions will still require human judgment. The more realistic future is: AI handles large volumes of routine detection and quantification, while radiologists focus on integration, ambiguity, and accountability. In other words, the job changes, but it does not vanish.
Key points to remember:
First, radiology AI is a pipeline, not a magical reader; most dangerous errors come from the stages around the model. Second, error patterns are systematic and predictable—by anatomy, protocol, artifacts, and calibration—not random flukes. Third, the future value will come from explicit, monitored pipelines that integrate clinical context and human feedback, not from yet another slightly better network architecture.