
ML in Oncology Pitfalls

Machine learning holds great promise for cancer research, but it's fraught with pitfalls that can lead to overly optimistic results and failed clinical translation. Understanding these challenges is crucial for developing robust, clinically useful models.

Skeptic's corner: Most ML models in oncology fail when tested on new data, if they are ever tested at all. The key is understanding why they fail and how to avoid the common pitfalls. Not every correlation is meaningful, and not every model is clinically useful.


Common Pitfalls in ML for Oncology

Data Leakage

  • Definition: Information unavailable at prediction time (including test-set information) leaks into model training
  • Examples: Using post-treatment data to predict treatment response
  • Impact: Artificially inflated performance estimates
  • Prevention: Careful temporal separation of data (see the sketch after this list)
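
A minimal sketch of temporal separation, assuming a pandas DataFrame with hypothetical `sample_date` and `outcome` columns; the column names and cutoff date are illustrative, not part of any established protocol.

python
# Hedged sketch: split by time so no post-cutoff information reaches training.
# Assumes 'sample_date' is a datetime column; names are hypothetical.
import pandas as pd

def temporal_split(data: pd.DataFrame, cutoff: str):
    """Train on samples collected before `cutoff`, test on those at or after it."""
    data = data.sort_values('sample_date')
    train = data[data['sample_date'] < pd.Timestamp(cutoff)]
    test = data[data['sample_date'] >= pd.Timestamp(cutoff)]
    # Any preprocessing (scaling, feature selection, imputation) must be
    # fit on `train` only and then applied unchanged to `test`.
    return train, test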

Overfitting

  • Definition: Model fits noise in the training data rather than generalizable signal
  • Examples: Complex models trained on small datasets
  • Impact: Poor generalization to new data
  • Prevention: Cross-validation, regularization, simpler models (see the sketch after this list)
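
As one concrete guard, a hedged sketch of a regularized baseline: an L2-penalized logistic regression whose penalty strength is tuned by internal cross-validation. The synthetic data from `make_classification` stand in for a real cohort.

python
# Hedged sketch: a simple, regularized baseline instead of a complex model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 samples, 50 features
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# LogisticRegressionCV tunes the penalty strength C by cross-validation
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty='l2', max_iter=5000),
)
model.fit(X, y)

# Chosen penalty strength (one value per class)
print(model.named_steps['logisticregressioncv'].C_)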

Selection Bias

  • Definition: Non-representative sample selection
  • Examples: Convenience sampling, exclusion criteria
  • Impact: Biased results, poor generalizability
  • Prevention: Random sampling, diverse populations

Statistical and Methodological Issues

Multiple Testing

  • Problem: Testing many hypotheses inflates the false-positive rate
  • Solution: Bonferroni correction (family-wise error), FDR control (e.g., Benjamini-Hochberg); see the sketch after this list
  • Example: Testing thousands of genes for association with outcome
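
A small sketch of both corrections using `statsmodels.stats.multitest.multipletests`; the uniform random p-values are placeholders for real per-gene test results.

python
# Hedged sketch: correcting per-gene p-values for multiple testing.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=10_000)  # placeholder: one p-value per gene

# Family-wise error control (conservative)
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')

# False discovery rate control (Benjamini-Hochberg)
reject_fdr, qvals, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

print(reject_bonf.sum(), reject_fdr.sum())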

Cross-Validation Pitfalls

  • Data leakage: Information bleeding between folds (e.g., preprocessing fit on the full dataset)
  • Temporal bias: Using future data to predict past events
  • Stratification: Ensure folds preserve the outcome distribution
  • Nested CV: Tune hyperparameters in an inner loop, estimate performance in an outer loop (see the sketch after this list)
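
A minimal nested-CV sketch with scikit-learn on synthetic data; the `max_depth` grid is illustrative. Inner folds tune the hyperparameter, outer folds estimate performance, and stratified splits keep the outcome distribution comparable across folds.

python
# Hedged sketch: nested cross-validation with stratified splits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=30, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: hyperparameter tuning
tuned_model = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [3, 5, None]},
    cv=inner_cv,
)

# Outer loop: unbiased performance estimate
scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print(scores.mean(), scores.std())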

Feature Selection Bias

  • Problem: Selecting features using the entire dataset before splitting
  • Impact: Overly optimistic performance estimates
  • Solution: Perform feature selection within CV folds (see the Data Leakage Example below)

Data Quality Issues

Missing Data

  • Types: Missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR)
  • Impact: Biased results, reduced statistical power
  • Solutions: Imputation; complete-case analysis (unbiased only under MCAR); see the sketch after this list
  • Validation: Sensitivity analysis comparing approaches
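
A hedged sketch comparing imputation inside a pipeline (so imputation statistics come from training folds only) with complete-case analysis; the data are synthetic with values deleted at random.

python
# Hedged sketch: imputation vs. complete-case analysis as a sensitivity check.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data with ~10% of values deleted at random
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.uniform(size=X.shape) < 0.1] = np.nan

# Imputation inside the pipeline: medians are learned per training fold
imputed = make_pipeline(
    SimpleImputer(strategy='median', add_indicator=True),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(imputed, X, y, cv=5).mean())

# Complete-case analysis (valid only under MCAR)
mask = ~np.isnan(X).any(axis=1)
print(cross_val_score(LogisticRegression(max_iter=1000), X[mask], y[mask], cv=5).mean())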

Measurement Error

  • Sources: Instrument error, human error, assay variability
  • Impact: Noisy features, reduced performance (see the simulation after this list)
  • Solutions: Quality control, error modeling
  • Validation: Replication studies
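
A small simulation of measurement error on synthetic data: Gaussian noise of increasing magnitude is added to the features and cross-validated accuracy is tracked. The noise levels are arbitrary.

python
# Hedged sketch: measurement error degrades cross-validated performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

for noise_sd in [0.0, 0.5, 1.0, 2.0]:
    X_noisy = X + rng.normal(scale=noise_sd, size=X.shape)
    score = cross_val_score(LogisticRegression(max_iter=1000), X_noisy, y, cv=5).mean()
    print(f"noise sd={noise_sd}: accuracy={score:.3f}")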

Batch Effects

  • Definition: Systematic differences between batches
  • Sources: Different labs, time points, equipment
  • Impact: Spurious associations
  • Solutions: Batch correction (e.g., ComBat), balanced study design; see the sketch after this list
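
As a minimal illustration (not a substitute for dedicated methods such as ComBat), a sketch of per-batch mean-centering; the `batch` column name is hypothetical.

python
# Hedged sketch: the simplest form of batch correction, per-batch
# mean-centering of each feature. 'batch' is a hypothetical column name.
import pandas as pd

def center_by_batch(data: pd.DataFrame, feature_cols, batch_col='batch'):
    """Subtract each batch's mean from its samples, feature by feature."""
    corrected = data.copy()
    corrected[feature_cols] = (
        data.groupby(batch_col)[feature_cols].transform(lambda g: g - g.mean())
    )
    return corrected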

Clinical Translation Challenges

External Validation

  • Problem: Models fail on new datasets
  • Causes: Population differences, protocol changes
  • Solutions: Multi-site validation, diverse populations
  • Standards: TRIPOD guidelines

Clinical Utility

  • Problem: Models don't improve patient outcomes
  • Causes: Poor clinical integration, workflow issues
  • Solutions: Clinical workflow integration
  • Validation: Randomized controlled trials

Regulatory Approval

  • Problem: Models don't meet regulatory standards
  • Causes: Insufficient validation, safety concerns
  • Solutions: Early regulatory engagement
  • Standards: FDA/EMA guidelines

Code Examples: Common Pitfalls

Data Leakage Example

python
# WRONG: data leakage via feature selection on the full dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def wrong_approach(data: pd.DataFrame):
    """
    WRONG: features are chosen using the entire dataset, so the test
    set influences which features the model sees.
    """
    # Correlations are computed on all rows, including the rows that will
    # later form the test set; 'outcome' is dropped so the target is not
    # selected as its own feature.
    important_features = (
        data.corr()['outcome'].abs().drop('outcome').nlargest(10).index
    )

    X = data[important_features]
    y = data['outcome']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    model = RandomForestClassifier(random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Optimistically biased: the test set already shaped the features
    return accuracy_score(y_test, predictions)

def correct_approach(data: pd.DataFrame):
    """
    CORRECT: feature selection happens inside each CV training fold.
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline

    X = data.drop('outcome', axis=1)
    y = data['outcome']

    # The pipeline refits feature selection on each training fold, so the
    # held-out fold never influences which features are kept
    pipeline = Pipeline([
        ('feature_selection', SelectKBest(f_classif, k=10)),
        ('classifier', RandomForestClassifier(random_state=0)),
    ])

    # Cross-validation
    scores = cross_val_score(pipeline, X, y, cv=5)

    return scores.mean(), scores.std()

Overfitting Example

python
# Overfitting demonstration
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier

def demonstrate_overfitting(X, y):
    """
    Compare learning curves for a low- and a high-complexity model.
    A persistent gap between training and validation scores signals
    overfitting.
    """
    # Low complexity: few, shallow trees
    simple_model = RandomForestClassifier(
        n_estimators=10, max_depth=3, random_state=0
    )

    # High complexity: fully grown trees (max_depth=None is the main
    # driver of complexity here; extra trees mostly reduce variance)
    complex_model = RandomForestClassifier(
        n_estimators=1000, max_depth=None, random_state=0
    )

    # Learning curves: score as a function of training-set size
    train_sizes, train_scores_simple, val_scores_simple = learning_curve(
        simple_model, X, y, cv=5, n_jobs=-1
    )

    train_sizes, train_scores_complex, val_scores_complex = learning_curve(
        complex_model, X, y, cv=5, n_jobs=-1
    )
    
    # Plot results
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(train_sizes, train_scores_simple.mean(axis=1), 'o-', label='Training')
    plt.plot(train_sizes, val_scores_simple.mean(axis=1), 'o-', label='Validation')
    plt.title('Simple Model (Low Complexity)')
    plt.xlabel('Training Size')
    plt.ylabel('Score')
    plt.legend()
    
    plt.subplot(1, 2, 2)
    plt.plot(train_sizes, train_scores_complex.mean(axis=1), 'o-', label='Training')
    plt.plot(train_sizes, val_scores_complex.mean(axis=1), 'o-', label='Validation')
    plt.title('Complex Model (High Complexity)')
    plt.xlabel('Training Size')
    plt.ylabel('Score')
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    return train_scores_simple, val_scores_simple, train_scores_complex, val_scores_complex

Best Practices

Study Design

  • Clear objectives: Specific, measurable goals
  • Appropriate sample size: Power calculations
  • Representative data: Diverse populations
  • Temporal separation: No data leakage

Model Development

  • Cross-validation: Proper CV strategy
  • Feature selection: Within CV folds
  • Hyperparameter tuning: Nested CV
  • Ensemble methods: Reduce overfitting

Validation

  • External validation: Independent datasets
  • Clinical validation: Real-world performance
  • Sensitivity analysis: Robustness testing
  • Error analysis: Understanding failures

Regulatory Considerations

FDA Guidelines

  • Software as Medical Device: SaMD classification
  • Clinical validation: Real-world performance
  • Risk assessment: Patient safety
  • Quality management: Development process

EMA Guidelines

  • Medical device regulation: MDR compliance
  • Clinical evaluation: Performance assessment
  • Risk management: ISO 14971
  • Quality system: ISO 13485

Practical Recommendations

Before Starting

  1. Define clear objectives: What are you trying to predict?
  2. Assess data quality: Is the data fit for purpose?
  3. Plan validation strategy: How will you validate?
  4. Consider clinical utility: Will this help patients?

During Development

  1. Use proper CV: Avoid data leakage
  2. Start simple: Build complexity gradually
  3. Validate early: Test on external data
  4. Document everything: Reproducibility is key

After Development

  1. External validation: Test on new data
  2. Clinical integration: Work with clinicians
  3. Regulatory engagement: Early consultation
  4. Continuous monitoring: Performance tracking

FAQ

Q: Why do most ML models fail in validation? A: Common reasons include data leakage, overfitting, selection bias, and lack of external validation.

Q: How can we avoid overfitting? A: Use proper cross-validation, start with simple models, use regularization, and validate on external data.

Q: What's the difference between statistical significance and clinical significance? A: Statistical significance means the result is unlikely due to chance; clinical significance means the result is meaningful for patient care.


Contributing

  1. Review existing content for accuracy
  2. Add missing pitfalls or solutions
  3. Create practical examples and code snippets
  4. Cite recent research and best practices

This article provides the foundation for understanding common pitfalls in ML for oncology. Master these concepts to develop robust, clinically useful models.

Early public release. Content evolves through continuous review. Questions: [email protected] · CC BY 4.0 where applicable.