
Code Examples

Welcome to the examples section! Here you'll find practical code examples, tutorials, and real-world applications that demonstrate how to apply your technical skills to cancer research.

What you'll find

  • Python Scripts: Data analysis, API integration, visualization
  • R Code: Statistical analysis, bioinformatics workflows
  • Jupyter Notebooks: Interactive tutorials and demonstrations
  • Command Line Tools: Shell scripts and automation
  • Web Applications: Dashboards and data portals

Who is this section for?

  • Software developers learning bioinformatics
  • Data scientists applying ML to cancer data
  • Bioinformaticians looking for practical examples
  • Researchers wanting to automate analyses
  • Students learning computational biology

Getting Started

Prerequisites

  1. Python 3.8+ with scientific libraries
  2. R 4.0+ with Bioconductor
  3. Jupyter Notebooks for interactive learning
  4. Git for version control

First Steps

  1. Clone the repository and explore examples
  2. Install dependencies for your chosen examples
  3. Run simple examples to verify setup
  4. Modify parameters to experiment
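Steps 1–4 assume a working environment; a quick sanity check catches missing pieces early. This is a generic sketch, not a script shipped with the repository — the minimum version and package list simply mirror the prerequisites above:

```python
import importlib.util
import sys

def environment_ok(min_python=(3, 8), packages=("pandas", "numpy")):
    """Return True when the interpreter and listed packages meet the prerequisites."""
    if sys.version_info[:2] < min_python:
        return False
    # find_spec returns None for any package that is not importable
    return all(importlib.util.find_spec(p) is not None for p in packages)
```

Run it before diving into the examples; a `False` result usually means a missing dependency rather than a broken script.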

Python Examples

Data Analysis

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load cancer data
df = pd.read_csv("cancer_data.csv")

# Basic exploration
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Missing values:\n{df.isnull().sum()}")

# Visualize distributions
plt.figure(figsize=(12, 8))
for i, col in enumerate(['age', 'tumor_size', 'survival_days']):
    plt.subplot(2, 2, i+1)
    sns.histplot(data=df, x=col, hue='cancer_type', alpha=0.7)
plt.tight_layout()
plt.show()

API Integration

python
import requests
import json
from typing import Dict, List, Optional

class CancerDataAPI:
    def __init__(self, base_url: str, api_key: Optional[str] = None):
        self.base_url = base_url
        self.api_key = api_key
        self.session = requests.Session()
        if self.api_key:
            self.session.headers.update({"Authorization": f"Bearer {self.api_key}"})

    def search_cases(self, project_id: str, filters: Optional[Dict] = None) -> List[Dict]:
        """Search for cancer cases in a specific project."""
        url = f"{self.base_url}/cases"
        # By default, restrict results to the requested project
        if filters is None:
            filters = {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            }
        params = {
            "filters": json.dumps(filters),
            "format": "json",
            "size": "100",
        }

        response = self.session.get(url, params=params)
        response.raise_for_status()

        return response.json()["data"]["hits"]

# Usage example
api = CancerDataAPI("https://api.gdc.cancer.gov")
cases = api.search_cases("TCGA-BRCA")
print(f"Found {len(cases)} breast cancer cases")

Machine Learning

python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Prepare data
X = df.drop(['cancer_type', 'patient_id'], axis=1)
y = df['cancer_type']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))

# Save model
joblib.dump(rf_model, 'cancer_classifier.pkl')

R Examples

Statistical Analysis

r
library(tidyverse)
library(survival)
library(survminer)

# Load data
cancer_data <- read_csv("cancer_survival.csv")

# Survival analysis
fit <- survfit(Surv(time, status) ~ cancer_type, data = cancer_data)

# Plot survival curves
ggsurvplot(fit, 
           data = cancer_data,
           pval = TRUE,
           conf.int = TRUE,
           risk.table = TRUE,
           palette = "Set1",
           title = "Survival Analysis by Cancer Type")

# Cox proportional hazards
cox_model <- coxph(Surv(time, status) ~ age + sex + cancer_type, data = cancer_data)
summary(cox_model)

Bioinformatics

r
library(DESeq2)
library(ggplot2)
library(pheatmap)

# Create DESeq dataset
dds <- DESeqDataSetFromMatrix(
    countData = counts_matrix,
    colData = sample_info,
    design = ~ condition
)

# Run differential expression
dds <- DESeq(dds)
res <- as.data.frame(results(dds))

# Volcano plot (ggplot2 needs a plain data frame, not a DESeqResults object)
ggplot(res, aes(x = log2FoldChange, y = -log10(padj))) +
    geom_point(aes(color = padj < 0.05)) +
    scale_color_manual(values = c("black", "red")) +
    theme_minimal() +
    labs(title = "Volcano Plot: Control vs Treatment")

Command Line Examples

Data Processing

bash
#!/bin/bash

# Process multiple VCF files
for file in *.vcf; do
    echo "Processing $file..."

    # Filter by quality and bgzip-compress (bcftools merge needs indexed .vcf.gz inputs)
    bcftools filter -i 'QUAL>30' "$file" -Oz -o "filtered_${file}.gz"
    bcftools index "filtered_${file}.gz"

    # Count variant records (-H drops the header lines from the count)
    variant_count=$(bcftools view -H "filtered_${file}.gz" | wc -l)
    echo "$file: $variant_count variants"
done

# Combine all filtered files
bcftools merge filtered_*.vcf.gz -o combined_filtered.vcf

Automation

bash
#!/bin/bash

# Download cancer data automatically
PROJECTS=("TCGA-BRCA" "TCGA-LUAD" "TCGA-COAD")

for project in "${PROJECTS[@]}"; do
    echo "Downloading $project data..."
    
    # Create directory
    mkdir -p "data/$project"
    
    # Download metadata (-G with --data-urlencode safely encodes the JSON filter)
    curl -sG "https://api.gdc.cancer.gov/cases" \
         --data-urlencode "filters={\"op\":\"in\",\"content\":{\"field\":\"cases.project.project_id\",\"value\":[\"$project\"]}}" \
         --data-urlencode "format=json" \
         --data-urlencode "size=100" \
         -o "data/$project/metadata.json"
    
    echo "Downloaded $project metadata"
done

Web Application Examples

Flask Dashboard

python
from flask import Flask, render_template, jsonify
import pandas as pd
import plotly.express as px
import plotly.utils
import json

app = Flask(__name__)

@app.route('/')
def dashboard():
    return render_template('dashboard.html')

@app.route('/api/cancer-stats')
def cancer_stats():
    # Load data
    df = pd.read_csv("cancer_data.csv")
    
    # Calculate statistics (cast NumPy scalars to plain floats so jsonify can serialize them)
    stats = {
        "total_cases": len(df),
        "cancer_types": df['cancer_type'].value_counts().to_dict(),
        "avg_age": float(df['age'].mean()),
        "survival_rate": float((df['survival_days'] > 365).mean())
    }
    
    return jsonify(stats)

@app.route('/api/survival-plot')
def survival_plot():
    df = pd.read_csv("cancer_data.csv")
    
    # Create plot
    fig = px.box(df, x="cancer_type", y="survival_days", 
                 title="Survival Days by Cancer Type")
    
    return json.dumps(fig, cls=plotly.utils.PlotlyJSONEncoder)

if __name__ == '__main__':
    app.run(debug=True)

Streamlit App

python
import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

st.set_page_config(page_title="Cancer Research Dashboard", layout="wide")

# Load data
@st.cache_data
def load_data():
    return pd.read_csv("cancer_data.csv")

df = load_data()

# Sidebar
st.sidebar.header("Filters")
selected_cancer = st.sidebar.multiselect(
    "Cancer Type", 
    options=df['cancer_type'].unique(),
    default=df['cancer_type'].unique()
)

# Filter data
filtered_df = df[df['cancer_type'].isin(selected_cancer)]

# Main content
st.title("Cancer Research Dashboard")
st.write(f"Showing {len(filtered_df)} cases")

# Metrics
col1, col2, col3, col4 = st.columns(4)
with col1:
    st.metric("Total Cases", len(filtered_df))
with col2:
    st.metric("Average Age", f"{filtered_df['age'].mean():.1f}")
with col3:
    st.metric("Survival Rate", f"{(filtered_df['survival_days'] > 365).mean():.1%}")
with col4:
    st.metric("Avg Tumor Size", f"{filtered_df['tumor_size'].mean():.1f} cm")

# Charts
col1, col2 = st.columns(2)

with col1:
    fig = px.histogram(filtered_df, x="age", color="cancer_type", 
                       title="Age Distribution by Cancer Type")
    st.plotly_chart(fig, use_container_width=True)

with col2:
    fig = px.scatter(filtered_df, x="tumor_size", y="survival_days", 
                     color="cancer_type", title="Tumor Size vs Survival")
    st.plotly_chart(fig, use_container_width=True)

Jupyter Notebooks

Interactive Tutorials

  • Data Exploration: Load, clean, and visualize cancer data
  • API Integration: Connect to GDC, TCGA, and other databases
  • Machine Learning: Build predictive models for cancer outcomes
  • Bioinformatics: Process genomic data and identify patterns

Example Notebooks

  1. Cancer_Data_Analysis.ipynb: Complete workflow from raw data to insights
  2. GDC_API_Tutorial.ipynb: Step-by-step API usage examples
  3. ML_Cancer_Prediction.ipynb: Machine learning for cancer classification
  4. Genomic_Visualization.ipynb: Creating publication-ready plots

Advanced Examples

Parallel Processing

python
import multiprocessing as mp
import pandas as pd

def process_chunk(chunk_data, output_file):
    """Process a chunk of data and save results."""
    # Your processing logic here
    results = chunk_data.groupby('cancer_type').agg({
        'survival_days': ['mean', 'std', 'count']
    })

    results.to_csv(output_file)
    return output_file

def parallel_processing(data_file, chunk_size=10000):
    """Process large datasets in parallel."""
    # Read data in chunks
    chunks = pd.read_csv(data_file, chunksize=chunk_size)

    # Process chunks in parallel
    with mp.Pool() as pool:
        results = []
        for i, chunk in enumerate(chunks):
            output_file = f"results_chunk_{i}.csv"
            result = pool.apply_async(
                process_chunk,
                args=(chunk, output_file)
            )
            results.append(result)

        # get() blocks until each chunk is done and re-raises any worker exception
        for result in results:
            result.get()

if __name__ == "__main__":
    # Pool creation must run under this guard on platforms that spawn worker processes
    parallel_processing("cancer_data.csv")

Cloud Deployment

text
# requirements.txt
flask==2.3.3
pandas==2.0.3
plotly==5.17.0
gunicorn==21.2.0

dockerfile
# Dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]

Learning Path

Beginner Level

  1. Data Loading: Read CSV, JSON, and API data
  2. Basic Analysis: Calculate statistics and create simple plots
  3. Data Cleaning: Handle missing values and outliers
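The cleaning step above usually boils down to two moves: fill missing values and tame outliers. Here is a dependency-free sketch of the idea, using median fill and percentile clipping; in a real workflow you would apply the same pattern to a pandas column with `fillna`, `quantile`, and `clip`:

```python
from statistics import median

def clean_numeric(values):
    """Median-fill missing entries (None) and clip to the observed 1st-99th percentile."""
    observed = sorted(v for v in values if v is not None)
    med = median(observed)
    # Nearest-rank percentiles; fine for a sketch, pandas' quantile() is more precise
    lo = observed[int(0.01 * (len(observed) - 1))]
    hi = observed[int(0.99 * (len(observed) - 1))]
    filled = [med if v is None else v for v in values]
    return [min(max(v, lo), hi) for v in filled]
```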

Intermediate Level

  1. API Integration: Connect to cancer databases
  2. Statistical Analysis: Perform hypothesis tests and modeling
  3. Visualization: Create publication-ready charts
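For the statistical-analysis step, most workflows reach for `scipy.stats.ttest_ind`, but the statistic itself is worth seeing once. Below is a minimal, dependency-free Welch's t; the two samples could be survival days for two cancer types, though the grouping here is purely illustrative:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se
```

In real analyses, pair this with `scipy.stats.ttest_ind(a, b, equal_var=False)` to get the p-value as well.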

Advanced Level

  1. Machine Learning: Build predictive models
  2. Big Data: Process large genomic datasets
  3. Web Applications: Deploy interactive dashboards

Contributing Examples

Have a great example? Share it!

  1. Test your code thoroughly
  2. Document dependencies and setup
  3. Include sample data if possible
  4. Add comments explaining key concepts
  5. Provide expected outputs and results
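One lightweight way to cover points 1 and 5 at once is a doctest: the expected output sits next to the code and is verified automatically. A minimal sketch (the function and its numbers are illustrative, not part of any example here):

```python
def survival_rate(days, threshold=365):
    """Fraction of patients surviving past `threshold` days.

    >>> survival_rate([100, 400, 800])
    0.6666666666666666
    """
    return sum(d > threshold for d in days) / len(days)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```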


This section provides practical examples to help you apply your technical skills to cancer research.

Early public release. Content evolves through continuous review. Questions: [email protected] · CC BY 4.0 where applicable.