Code Examples
Welcome to the examples section! Here you'll find practical code examples, tutorials, and real-world applications that show how to apply technical skills to cancer research.
What you'll find
- Python Scripts: Data analysis, API integration, visualization
- R Code: Statistical analysis, bioinformatics workflows
- Jupyter Notebooks: Interactive tutorials and demonstrations
- Command Line Tools: Shell scripts and automation
- Web Applications: Dashboards and data portals
Who is this section for?
- Software developers learning bioinformatics
- Data scientists applying ML to cancer data
- Bioinformaticians looking for practical examples
- Researchers wanting to automate analyses
- Students learning computational biology
Getting Started
Prerequisites
- Python 3.8+ with scientific libraries
- R 4.0+ with Bioconductor
- Jupyter Notebooks for interactive learning
- Git for version control
First Steps
- Clone the repository and explore examples
- Install dependencies for your chosen examples
- Run simple examples to verify setup (see the sketch below)
- Modify parameters to experiment
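Before moving on, it can help to confirm that the core Python dependencies import cleanly. A minimal sketch, assuming the package list from the prerequisites above; adjust it to the examples you plan to run:
python
import importlib
# Core packages used throughout the Python examples
packages = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "requests"]
for name in packages:
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{name}: NOT INSTALLED")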
Python Examples
Data Analysis
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load cancer data
df = pd.read_csv("cancer_data.csv")
# Basic exploration
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Missing values:\n{df.isnull().sum()}")
# Visualize distributions
plt.figure(figsize=(12, 8))
for i, col in enumerate(['age', 'tumor_size', 'survival_days']):
    plt.subplot(2, 2, i + 1)
    sns.histplot(data=df, x=col, hue='cancer_type', alpha=0.7)
plt.tight_layout()
plt.show()
API Integration
python
import requests
import json
from typing import Dict, List
class CancerDataAPI:
    def __init__(self, base_url: str, api_key: str = None):
        self.base_url = base_url
        self.api_key = api_key
        self.session = requests.Session()
    def search_cases(self, project_id: str, filters: Dict = None) -> List[Dict]:
        """Search for cancer cases in a specific project."""
        url = f"{self.base_url}/cases"
        if filters is None:
            # Default to restricting results to the requested project
            filters = {
                "op": "in",
                "content": {"field": "cases.project.project_id", "value": [project_id]}
            }
        params = {
            "filters": json.dumps(filters),
            "format": "json",
            "size": "100"
        }
        if self.api_key:
            self.session.headers.update({"Authorization": f"Bearer {self.api_key}"})
        response = self.session.get(url, params=params)
        response.raise_for_status()
        return response.json()["data"]["hits"]
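    # Hypothetical helper, not part of the original example: it assumes a
    # GDC-style /data/{file_id} download endpoint on the same base URL.
    def download_file(self, file_id: str, out_path: str) -> str:
        """Download a single file by UUID and write it to out_path."""
        response = self.session.get(f"{self.base_url}/data/{file_id}", stream=True)
        response.raise_for_status()
        with open(out_path, "wb") as handle:
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
        return out_path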
# Usage example
api = CancerDataAPI("https://api.gdc.cancer.gov")
cases = api.search_cases("TCGA-BRCA")
print(f"Found {len(cases)} breast cancer cases")Machine Learning
python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib
# Prepare data
X = df.drop(['cancer_type', 'patient_id'], axis=1)
y = df['cancer_type']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Evaluate
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))
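# The classification report gives precision/recall per class; the confusion
# matrix (already imported above) shows exactly which classes get confused.
print(confusion_matrix(y_test, y_pred))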
# Save model
joblib.dump(rf_model, 'cancer_classifier.pkl')
R Examples
Statistical Analysis
r
library(tidyverse)
library(survival)
library(survminer)
# Load data
cancer_data <- read_csv("cancer_survival.csv")
# Survival analysis
fit <- survfit(Surv(time, status) ~ cancer_type, data = cancer_data)
# Plot survival curves
ggsurvplot(fit,
           data = cancer_data,
           pval = TRUE,
           conf.int = TRUE,
           risk.table = TRUE,
           palette = "Set1",
           title = "Survival Analysis by Cancer Type")
# Cox proportional hazards
cox_model <- coxph(Surv(time, status) ~ age + sex + cancer_type, data = cancer_data)
summary(cox_model)
Bioinformatics
r
library(DESeq2)
library(ggplot2)
library(pheatmap)
# Create DESeq dataset
dds <- DESeqDataSetFromMatrix(
  countData = counts_matrix,
  colData = sample_info,
  design = ~ condition
)
# Run differential expression
dds <- DESeq(dds)
res <- results(dds)
# Convert to a data frame for ggplot (and avoid masking the results() function)
res_df <- as.data.frame(res)
# Volcano plot
ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(color = padj < 0.05)) +
  scale_color_manual(values = c("black", "red")) +
  theme_minimal() +
  labs(title = "Volcano Plot: Control vs Treatment")
Command Line Examples
Data Processing
bash
#!/bin/bash
# Process multiple VCF files
for file in *.vcf; do
    echo "Processing $file..."
    # Filter by quality
    bcftools filter -i 'QUAL>30' "$file" > "filtered_${file}"
    # Count variants (-H drops the header lines from the count)
    variant_count=$(bcftools view -H "filtered_${file}" | wc -l)
    echo "$file: $variant_count variants"
done
# Combine all filtered files (bcftools merge expects bgzip-compressed, indexed inputs)
bcftools merge filtered_*.vcf > combined_filtered.vcf
Automation
bash
#!/bin/bash
# Download cancer data automatically
PROJECTS=("TCGA-BRCA" "TCGA-LUAD" "TCGA-COAD")
for project in "${PROJECTS[@]}"; do
echo "Downloading $project data..."
# Create directory
mkdir -p "data/$project"
# Download metadata
curl -o "data/$project/metadata.json" \
"https://api.gdc.cancer.gov/cases?filters={\"op\":\"in\",\"content\":{\"field\":\"cases.project.project_id\",\"value\":[\"$project\"]}}&format=json&size=100"
echo "Downloaded $project metadata"
doneWeb Application Examples
Flask Dashboard
python
from flask import Flask, render_template, jsonify
import pandas as pd
import plotly.express as px
import plotly.utils
import json
app = Flask(__name__)
@app.route('/')
def dashboard():
    return render_template('dashboard.html')
@app.route('/api/cancer-stats')
def cancer_stats():
    # Load data
    df = pd.read_csv("cancer_data.csv")
    # Calculate statistics (cast NumPy scalars to plain Python types for jsonify)
    stats = {
        "total_cases": len(df),
        "cancer_types": {k: int(v) for k, v in df['cancer_type'].value_counts().items()},
        "avg_age": float(df['age'].mean()),
        "survival_rate": float((df['survival_days'] > 365).mean())
    }
    return jsonify(stats)
@app.route('/api/survival-plot')
def survival_plot():
    df = pd.read_csv("cancer_data.csv")
    # Create plot
    fig = px.box(df, x="cancer_type", y="survival_days",
                 title="Survival Days by Cancer Type")
    return json.dumps(fig, cls=plotly.utils.PlotlyJSONEncoder)
if __name__ == '__main__':
    app.run(debug=True)
Streamlit App
python
import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
st.set_page_config(page_title="Cancer Research Dashboard", layout="wide")
# Load data
@st.cache_data
def load_data():
    return pd.read_csv("cancer_data.csv")
df = load_data()
# Sidebar
st.sidebar.header("Filters")
selected_cancer = st.sidebar.multiselect(
    "Cancer Type",
    options=df['cancer_type'].unique(),
    default=df['cancer_type'].unique()
)
# Filter data
filtered_df = df[df['cancer_type'].isin(selected_cancer)]
# Main content
st.title("Cancer Research Dashboard")
st.write(f"Showing {len(filtered_df)} cases")
# Metrics
col1, col2, col3, col4 = st.columns(4)
with col1:
    st.metric("Total Cases", len(filtered_df))
with col2:
    st.metric("Average Age", f"{filtered_df['age'].mean():.1f}")
with col3:
    st.metric("Survival Rate", f"{(filtered_df['survival_days'] > 365).mean():.1%}")
with col4:
    st.metric("Tumor Size", f"{filtered_df['tumor_size'].mean():.1f} cm")
# Charts
col1, col2 = st.columns(2)
with col1:
    fig = px.histogram(filtered_df, x="age", color="cancer_type",
                       title="Age Distribution by Cancer Type")
    st.plotly_chart(fig, use_container_width=True)
with col2:
    fig = px.scatter(filtered_df, x="tumor_size", y="survival_days",
                     color="cancer_type", title="Tumor Size vs Survival")
    st.plotly_chart(fig, use_container_width=True)
Jupyter Notebooks
Interactive Tutorials
- Data Exploration: Load, clean, and visualize cancer data
- API Integration: Connect to GDC, TCGA, and other databases
- Machine Learning: Build predictive models for cancer outcomes
- Bioinformatics: Process genomic data and identify patterns
Example Notebooks
- Cancer_Data_Analysis.ipynb: Complete workflow from raw data to insights
- GDC_API_Tutorial.ipynb: Step-by-step API usage examples
- ML_Cancer_Prediction.ipynb: Machine learning for cancer classification
- Genomic_Visualization.ipynb: Creating publication-ready plots
Advanced Examples
Parallel Processing
python
import multiprocessing as mp
from functools import partial
import pandas as pd
def process_chunk(chunk_data, output_file):
    """Process a chunk of data and save results."""
    # Your processing logic here
    results = chunk_data.groupby('cancer_type').agg({
        'survival_days': ['mean', 'std', 'count']
    })
    results.to_csv(output_file)
    return output_file
def parallel_processing(data_file, chunk_size=10000):
    """Process large datasets in parallel."""
    # Read data in chunks
    chunks = pd.read_csv(data_file, chunksize=chunk_size)
    # Process chunks in parallel
    with mp.Pool() as pool:
        results = []
        for i, chunk in enumerate(chunks):
            output_file = f"results_chunk_{i}.csv"
            result = pool.apply_async(
                process_chunk,
                args=(chunk, output_file)
            )
            results.append(result)
        # Wait for all workers to finish; get() also re-raises any worker errors
        for result in results:
            result.get()
Cloud Deployment
text
# requirements.txt
flask==2.3.3
pandas==2.0.3
plotly==5.17.0
gunicorn==21.2.0
dockerfile
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
Learning Path
Beginner Level
- Data Loading: Read CSV, JSON, and API data
- Basic Analysis: Calculate statistics and create simple plots
- Data Cleaning: Handle missing values and outliers (see the sketch below)
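A minimal sketch of these beginner tasks, assuming a hypothetical cancer_data.csv / cancer_data.json with the same age, tumor_size, survival_days, and cancer_type fields used in the examples above:
python
import pandas as pd
# Data loading: CSV and JSON (API responses usually arrive as JSON too)
df = pd.read_csv("cancer_data.csv")
df_json = pd.read_json("cancer_data.json")
# Basic analysis: summary statistics
print(df[['age', 'tumor_size', 'survival_days']].describe())
# Data cleaning: fill numeric gaps with the median, drop rows missing the label
for col in ['age', 'tumor_size', 'survival_days']:
    df[col] = df[col].fillna(df[col].median())
df = df.dropna(subset=['cancer_type'])
# Data cleaning: clip extreme tumor sizes to the 1st/99th percentiles
low, high = df['tumor_size'].quantile([0.01, 0.99])
df['tumor_size'] = df['tumor_size'].clip(low, high)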
Intermediate Level
- API Integration: Connect to cancer databases
- Statistical Analysis: Perform hypothesis tests and modeling
- Visualization: Create publication-ready charts
Advanced Level
- Machine Learning: Build predictive models
- Big Data: Process large genomic datasets
- Web Applications: Deploy interactive dashboards
Contributing Examples
Have a great example? Share it!
- Test your code thoroughly
- Document dependencies and setup
- Include sample data if possible
- Add comments explaining key concepts
- Provide expected outputs and results, as in the header template sketched below
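One possible layout for the top of a contributed Python example, reflecting the checklist above (the file names, versions, and outputs are purely illustrative):
python
"""Example: survival summaries from cancer_survival.csv.
Dependencies: pandas>=2.0, matplotlib>=3.7 (pip install pandas matplotlib).
Sample data: ship a small synthetic CSV (e.g. data/sample_cancer_survival.csv) with the script.
Expected output: a PNG figure of survival-day distributions plus printed per-group medians.
"""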
Additional Resources
Documentation
Tutorials
- Real Python: Python tutorials
- R for Data Science: R tutorials
- DataCamp: Interactive learning
Communities
- Stack Overflow: Programming Q&A
- Biostars: Bioinformatics community
- Reddit r/learnpython: Python learning
This section provides practical examples to help you apply your technical skills to cancer research.