AI & Machine Learning in Oncology
Note: This page is educational and reflects the state of the literature in 2025. It does not replace medical advice.
TL;DR
Machine learning is now embedded across oncology — screening (lung LDCT, mammography), digital pathology, radiomics, genomic interpretation, drug discovery, trial matching, and outcome prediction. The technical advances are real, but most production deployments still face the same hard problems: distribution shift, label quality, calibration in subgroups, regulatory framing, and the gap between AUC on a held-out test set and improving outcomes for a real patient. This page is the orientation map; see ML pitfalls in oncology for the failure modes you must internalize.
1. Where ML actually has measurable impact today
| Domain | Examples (2024–2025) | Status |
|---|---|---|
| Lung cancer screening on LDCT | AI-driven nodule detection, malignancy risk scoring, end-to-end risk prediction[1] | Several FDA-cleared tools; integrated in clinical workflow at scale |
| Mammography triage | Worklist prioritization, missed-cancer reduction | Multiple cleared tools; deployed in screening programs |
| Digital pathology | Prostate Gleason assist, breast HER2/Ki-67 quantification, MSI prediction from H&E | Cleared tools; integration with LIS expanding |
| PD-L1 / TMB / TME prediction | ML scoring of immunotherapy benefit signals from images and omics[2] | Mostly research-stage; some IVD validation in progress |
| Genomic variant prediction | Pathogenicity scoring, splice impact, structural variant classification | Used widely in clinical interpretation pipelines |
| Drug discovery | Protein structure (AlphaFold), generative chemistry, property prediction | Reshaped early discovery; clinical translation lags |
| Trial matching | EHR/NLP → eligibility, ClinicalTrials.gov linkage | Several platforms in production |
| Outcome / risk prediction | Survival, treatment response, toxicity | Many models; few well-validated for deployment |
| Operations | Schedule optimization, no-show prediction, sepsis alerts | Common in academic centers |
For depth on the screening and immunotherapy threads, see refs [1] and [2].
2. Data modalities, briefly
- Genomics / multi-omics — VCF, expression matrices, methylation, variant interpretation. See From FASTQ to variants and Multi-omics.
- Imaging — radiology (CT, MRI, PET, US, mammography), pathology (WSI), endoscopy, dermatology.
- Free text — pathology reports, radiology reports, oncology notes, discharge summaries.
- Structured EHR — labs, vitals, medications, ICD codes, procedures.
- Trial / outcome registries — pre-registered endpoints, AE reports.
- Patient-reported outcomes — symptom diaries, ePROs, wearable signals.
Each modality has its own quirks: imaging needs preprocessing pipelines and acquisition-protocol awareness; genomics needs versioned reference and variant annotation; text needs strong de-identification and ontology grounding.
3. Model classes commonly used
| Model class | Where it shines |
|---|---|
| Gradient-boosted trees (XGBoost, LightGBM) | Tabular EHR data, structured features; strong baseline |
| CNNs | Medical imaging (still the dominant class in production) |
| Transformers / vision transformers | Whole-slide pathology, multi-modal fusion |
| Foundation models for imaging | RETFound, BiomedCLIP, pathology FMs — emerging |
| Foundation models for text | BioGPT, GatorTron, Med-PaLM — early operational uses |
| Graph neural networks | Drug discovery, network biology, patient similarity |
| Survival models | Cox-PH, DeepSurv, time-aware transformers |
| Diffusion models | Synthetic data, augmentation, generative chemistry |
| Reinforcement learning | Adaptive trial design, dose optimization (research) |
The model class often matters less than how the data is split, what the labels really mean, and how the model is evaluated in clinical workflow.
4. Honest evaluation
Three layers of evaluation, in increasing rigor:
- Discrimination — AUC, sensitivity/specificity at operating points.
- Calibration — does the model's predicted probability match observed frequency? Far more important than AUC for clinical use; far less reported.
- Clinical utility — does the model change a decision that improves an outcome? Decision-curve analysis, prospective deployment, randomized trials of AI vs. no AI.
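A calibration check is straightforward to compute and rarely reported. The sketch below bins predictions, compares mean predicted probability to observed event frequency per bin, and computes the Brier score. It is a minimal illustration with synthetic predictions, not a substitute for a proper reliability analysis on real validation data.

```python
# Minimal sketch: reliability (calibration) check by equal-width binning.
# The probabilities and labels below are synthetic, for illustration only.

def calibration_bins(probs, labels, n_bins=5):
    """Group (prob, label) pairs into equal-width bins; return per-bin
    (mean predicted probability, observed event frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            obs = sum(y for _, y in b) / len(b)
            out.append((round(mean_p, 3), round(obs, 3), len(b)))
    return out

def brier_score(probs, labels):
    """Mean squared error between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

probs  = [0.1, 0.2, 0.15, 0.8, 0.7, 0.9, 0.4, 0.5, 0.3, 0.6]
labels = [0,   0,   0,    1,   1,   1,   0,   1,   0,   1]
print(calibration_bins(probs, labels))
print(round(brier_score(probs, labels), 3))
```

In a well-calibrated model, mean predicted probability and observed frequency track each other across bins; a large gap in any bin is exactly the kind of subgroup miscalibration the section warns about.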
Common reporting failures:
- AUC reported on the same site/scanner used for training (no external validation).
- Test set leakage (multiple slices from the same patient split across train/test).
- Class imbalance ignored (rare cancers always look "high accuracy" if you predict "no cancer").
- No subgroup analysis (sex, age, race/ethnicity, scanner manufacturer, geography).
- Threshold optimized post-hoc to maximize a single metric.
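The second failure above (patient-level leakage) is mechanical to prevent: assign whole patients, never rows, to train or test. A minimal sketch with synthetic patient IDs:

```python
# Minimal sketch: split by patient, not by row, so multiple slices from
# the same patient never straddle train and test. IDs are synthetic.
import random

def split_by_patient(rows, test_frac=0.3, seed=0):
    """rows: list of dicts with a 'patient_id' key. Assigns whole
    patients to train or test, then partitions rows accordingly."""
    patients = sorted({r["patient_id"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in rows if r["patient_id"] not in test_ids]
    test = [r for r in rows if r["patient_id"] in test_ids]
    return train, test

# Five synthetic patients, four slices each:
rows = [{"patient_id": pid, "slice": s} for pid in "ABCDE" for s in range(4)]
train, test = split_by_patient(rows)
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
print(overlap)  # set() — no patient appears on both sides
```

The same grouping idea extends to splitting by site and by time period, which catches scanner and era effects that a random row split hides.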
Reference standards: TRIPOD-AI, CONSORT-AI, SPIRIT-AI for clinical AI reporting and trial design.
5. Regulatory and deployment context
- FDA (US) — SaMD framework; pre-cert pilot; Predetermined Change Control Plan (PCCP) for AI/ML lifecycle management.
- EU — MDR + AI Act (2024) — high-risk medical AI requires conformity assessment, transparency, human oversight, post-market surveillance.
- Brazil — ANVISA RDC 657/2022 and RDC 751/2022 cover SaMD; AI-specific guidance evolving; LGPD for data protection.
Deployment-time obligations beyond accuracy:
- Versioning and reproducibility — exact model + preprocessing reproducible from a tag.
- Monitoring for drift — input distribution and outcome calibration tracked over time.
- Safety logging — actionable alerts when performance degrades.
- User interface and decision support — show the model's confidence and uncertainty meaningfully.
- Human oversight — mandatory for high-risk recommendations.
- Recall and rollback — ability to disable or revert a model fast when problems are found.
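Drift monitoring from the list above can start very simply, e.g. with a Population Stability Index (PSI) on each model input against bin edges frozen at deployment time. This is a sketch; the 0.1 ("watch") and 0.25 ("act") thresholds are common rules of thumb, not standards.

```python
# Minimal sketch: Population Stability Index (PSI) for input drift,
# computed on bin edges frozen at deployment time.
import math

def psi(expected, actual, edges):
    """PSI between a baseline sample and a live sample, on fixed bins."""
    def frac(sample, lo, hi):
        n = sum(1 for x in sample if lo <= x < hi)
        return max(n / len(sample), 1e-6)  # floor to avoid log(0)
    score = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        e, a = frac(expected, lo, hi), frac(actual, lo, hi)
        score += (a - e) * math.log(a / e)
    return score

baseline = [i / 100 for i in range(100)]            # uniform on [0, 1)
shifted  = [min(x * 1.5, 0.999) for x in baseline]  # drifted live inputs
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
print(round(psi(baseline, baseline, edges), 3))  # 0.0 — no drift
print(round(psi(baseline, shifted, edges), 3))   # above the 0.25 "act" level
```

Tracking outcome calibration over time requires outcome labels, which arrive with delay; input-distribution checks like this one are the early-warning layer.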
For the regulatory framing in detail, see Regulatory & ethics.
6. Fairness, equity, and the data-shift problem
Models trained on US/European populations frequently underperform on Brazilian, African, Asian, or Indigenous populations [2]. Causes:
- Distribution shift — different acquisition equipment, patient demographics, comorbidity patterns.
- Label bias — historical care disparities encoded as ground truth.
- Sampling bias — minority and low-resource populations are under-represented in training data.
Mitigations:
- Test on local data before deployment, period.
- Monitor subgroup performance, not just overall metrics.
- Recalibrate models for local populations when feasible.
- Build local datasets — Brazilian initiatives (e.g., A.C. Camargo, Albert Einstein, USP, INCA, ABRACE) are filling part of the gap.
- Federated learning where centralization isn't possible.
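Local recalibration, the third mitigation above, often needs nothing more than a Platt-style logistic rescaling of the existing model's scores fitted on a local validation set. The sketch below fits it with plain gradient descent on synthetic scores and labels; in practice you would use held-out local data and a library optimizer.

```python
# Minimal sketch: Platt-style recalibration of an existing model's raw
# scores against local outcomes. Scores and labels are synthetic.
import math

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a*s + b) to local outcomes by minimizing log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n  # gradient of log loss w.r.t. a
            gb += (p - y) / n      # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

def recalibrate(score, a, b):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# A model that is systematically overconfident on the local population:
scores = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
a, b = fit_platt(scores, labels)
print(round(recalibrate(0.0, a, b), 3))  # pulled below 0.5 by the local fit
```

Recalibration changes the probability mapping without retraining the underlying model, which keeps the regulatory footprint of the change smaller than a full retrain.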
7. Best practices for technologists building oncology ML
- Read the protocol before writing code. Domain framing is the highest-leverage decision.
- Get a clinician on the team. Not as a stakeholder — as a co-developer.
- Prefer simple, well-calibrated baselines before reaching for foundation models.
- Hold out by patient, by site, by time — not by row.
- External validation is non-negotiable for clinical use.
- Calibration > AUC for clinical decision support.
- Plan for monitoring before you plan for deployment.
- Document everything the way a regulator would expect to read it (pipelines, data lineage, evaluation, change history).
- Measure clinical utility, not just statistical performance. A trial of "AI vs. no AI" is the gold standard.
- Read ML pitfalls in oncology before starting.
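Several of the practices above (subgroup monitoring, honest evaluation) reduce to the same habit: report operating-point metrics per subgroup, never only pooled. A minimal sketch with synthetic records, grouped here by site (any attribute — scanner, sex, age band — works the same way):

```python
# Minimal sketch: report sensitivity per subgroup instead of one pooled
# number. Records and the 'site' attribute are synthetic.
from collections import defaultdict

def sensitivity_by_group(records, threshold=0.5, key="site"):
    """records: dicts with 'prob', 'label', and a grouping attribute.
    Returns {group: sensitivity}, computed over positive cases only."""
    tp = defaultdict(int)
    pos = defaultdict(int)
    for r in records:
        if r["label"] == 1:
            pos[r[key]] += 1
            if r["prob"] >= threshold:
                tp[r[key]] += 1
    return {g: tp[g] / pos[g] for g in pos}

records = [
    {"site": "A", "prob": 0.9, "label": 1},
    {"site": "A", "prob": 0.7, "label": 1},
    {"site": "A", "prob": 0.2, "label": 0},
    {"site": "B", "prob": 0.4, "label": 1},  # missed at threshold 0.5
    {"site": "B", "prob": 0.6, "label": 1},
    {"site": "B", "prob": 0.1, "label": 0},
]
print(sensitivity_by_group(records))  # {'A': 1.0, 'B': 0.5}
```

A pooled sensitivity of 0.75 here would hide the fact that site B misses half its cancers at the chosen threshold — exactly the subgroup failure the checklist warns about.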
8. Common myths to push back on
- "Bigger model = better." Not for clinical use. Smaller, well-calibrated, locally validated models often outperform.
- "We can replace radiologists / pathologists." Augmentation is the realistic 5-year story; replacement is mostly not the goal nor possible.
- "Privacy isn't an issue if we de-identify." Genomic data are inherently re-identifiable; treat them like PHI.
- "Higher AUC always helps patients." Often not: calibration, threshold, and workflow integration matter more.
- "If it works at one site, it works everywhere." Almost never true.
See also
- Multi-omics
- Network biology
- Precision medicine
- ML pitfalls in oncology
- Biomarkers & companion diagnostics
- Regulatory & ethics
- Data governance & LGPD
References
1. Adams SJ, Mikhael P, Wohlwend J, et al. Artificial Intelligence and Machine Learning in Lung Cancer Screening. Thorac Surg Clin 2023;33:401-409. PMID 37806742. https://doi.org/10.1016/j.thorsurg.2023.03.001
2. Gao Q, Yang L, Lu M, Jin R, Ye H, Ma T. The artificial intelligence and machine learning in lung cancer immunotherapy. J Hematol Oncol 2023;16:55. PMID 37226190. https://doi.org/10.1186/s13045-023-01456-y
3. Meyer ML, Fitzgerald BG, Paz-Ares L, et al. New promises and challenges in the treatment of advanced non-small-cell lung cancer. Lancet 2024;404:803-822. PMID 39121882. https://doi.org/10.1016/S0140-6736(24)01029-8
4. U.S. National Cancer Institute. https://www.cancer.gov/about-cancer/understanding/what-is-cancer
5. American Cancer Society. https://www.cancer.org/cancer.html
6. Cleveland Clinic. Cancer (overview). https://my.clevelandclinic.org/health/diseases/12194-cancer
7. A.C. Camargo Cancer Center. https://accamargo.org.br
8. Fundação do Câncer (Brasil). https://www.cancer.org.br/
9. Ministério da Saúde / BVS. ABC do câncer. https://bvsms.saude.gov.br/bvs/publicacoes/abc_do_cancer.pdf
10. ANVISA — Agência Nacional de Vigilância Sanitária. https://www.gov.br/anvisa/pt-br