Data & APIs
Welcome to the data and APIs section! Here you'll find access to cancer research databases, genomic data repositories, and programmatic interfaces for building research applications.
What you'll find
- Genomic Databases: DNA sequences, gene expression, mutations
- Clinical Data: Patient outcomes, treatment responses, survival data
- Research APIs: Programmatic access to scientific resources
- Data Formats: Standards for biological data exchange
- Integration Tools: Software for combining multiple data sources
Who is this section for?
- Data scientists working with cancer datasets
- Software developers building research applications
- Bioinformaticians analyzing genomic data
- Researchers looking for data sources
- Students learning data science in biology
Getting Started
Essential Data Sources
- GDC: Genomic Data Commons (TCGA and more)
- ICGC‑ARGO: Successor to ICGC (harmonized international data)
- UCSC Xena: Harmonized cohorts across hubs
- cBioPortal: Multi‑omics cancer studies
- TCIA: Cancer Imaging Archive (radiology/pathology)
- CDA: Cancer Data Aggregator (cross‑commons search)
- SRA/ENA: Raw sequence archives
First Steps
- Explore data formats and standards
- Set up API access and authentication
- Download sample datasets for testing
- Build simple queries and filters
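The last step, building simple queries and filters, can be sketched for the GDC API, whose `filters` parameter takes nested JSON made of operators and operands. The helper names `field_in`, `all_of`, and `to_params` are hypothetical conveniences rather than part of any GDC client, and the field names are assumed to match GDC project fields.

```python
import json

def field_in(field, values):
    """Build a GDC-style 'in' clause: the field matches any of the values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}

def all_of(*clauses):
    """Combine clauses with a logical AND."""
    return {"op": "and", "content": list(clauses)}

def to_params(filters, size=50):
    """Serialize a filter tree into query parameters for a GET request."""
    return {"filters": json.dumps(filters), "format": "json", "size": size}

# Assumed field names: primary_site and program.name on the projects endpoint
filters = all_of(
    field_in("primary_site", ["breast"]),
    field_in("program.name", ["TCGA"]),
)
params = to_params(filters)
print(params["filters"])
```

Keeping the filter as a plain dictionary until the last moment makes it easy to inspect or extend before it is sent.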
Data Types & Formats
Genomic Data
- DNA Sequences: FASTA, FASTQ formats
- Gene Expression: RNA-seq count matrices
- Variants: VCF (Variant Call Format)
- Annotations: GFF, GTF, BED files
- Alignments: SAM, BAM, CRAM formats
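To make the sequence formats above more concrete, here is a minimal sketch of reading FASTQ, assuming the common layout of exactly four lines per record and Phred+33 quality encoding. The function names are illustrative, and the reads are synthetic.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The read ID is the first whitespace-delimited token after '@'
        yield header[1:].split()[0], seq, qual

def mean_phred(qual, offset=33):
    """Average base quality, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)

# Synthetic two-read example
example = io.StringIO("@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
records = list(read_fastq(example))
for read_id, seq, qual in records:
    print(read_id, len(seq), mean_phred(qual))
```

For compressed files, the same function can be given a handle from `gzip.open(path, "rt")`. For production work a maintained parser such as Biopython's `SeqIO` is usually the safer choice.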
Clinical Data
- Patient Demographics: Age, gender, ethnicity
- Diagnosis: Cancer type, stage, grade
- Treatment: Surgery, chemotherapy, radiation
- Outcomes: Survival time, recurrence, response
- Biomarkers: Protein levels, genetic mutations
Metadata
- Sample Information: Collection date, processing
- Quality Metrics: Read depth, coverage
- Experimental Design: Batch effects, controls
- Ethics & Consent: IRB approval, data sharing
Available APIs
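Cancer Data Aggregator (CDA), 2024
```python
# pip install cdapython
from cdapython import Q

# NOTE: the cdapython query interface has changed between releases;
# check this call pattern against the documentation for your installed version.
q = (
    Q('Subject')
    .filter(Q('ResearchSubject.primary_diagnosis_site') == 'Breast')
    .select('id', 'sex', 'race', 'vital_status')
)
results = q.run()
print(len(results))
```
UCSC Xena, 2024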
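```python
# pip install xenaPython
import xenaPython as xena

# Dataset names like "TCGA-BRCA.htseq_counts.tsv" follow the GDC hub's naming, so the
# GDC hub is assumed here. Confirm hub, cohort, and dataset names in the Xena browser,
# since they change between data releases.
hub = "https://gdc.xenahubs.net"
dataset = "TCGA-BRCA.htseq_counts.tsv"

# List datasets for one cohort (cohort name is an assumption)
datasets = xena.dataset_list(hub, ["GDC TCGA Breast Cancer (BRCA)"])

# Get BRCA HTSeq counts for selected genes, averaged over each gene's probes
samples = xena.dataset_samples(hub, dataset, None)  # None = no limit on sample count
expr = xena.dataset_gene_probe_avg(hub, dataset, samples, ["TP53", "BRCA1", "BRCA2"])
```
cBioPortal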
```bash
# REST example
curl "https://www.cbioportal.org/api/studies?projection=SUMMARY"
```
Genomic Data Commons (GDC), 2024 API
```python
import json
import requests

BASE = "https://api.gdc.cancer.gov"

# Example: list projects whose primary site is breast (first page, up to 50 hits)
filters = {
    "op": "in",
    "content": {"field": "primary_site", "value": ["breast"]},
}
params = {
    "filters": json.dumps(filters),
    "format": "json",
    "size": 50,  # page size; add "from" to request later pages
}
resp = requests.get(f"{BASE}/projects", params=params, timeout=30)
resp.raise_for_status()
projects = resp.json()["data"]["hits"]
print(f"Projects: {len(projects)}")

# Example: BAM slicing endpoint (requires an auth token for controlled data).
# file_uuid and token are placeholders you must supply.
# r = requests.post(
#     f"{BASE}/slicing/view/{file_uuid}",
#     headers={"Content-Type": "application/json", "X-Auth-Token": token},
#     data=json.dumps({"regions": ["chr17:43044295-43125483"]}),
#     stream=True,
# )
```
NCBI APIs
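```python
from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own
Entrez.email = "your.name@example.org"

# Search PubMed for articles with "cancer" in the title
handle = Entrez.esearch(db="pubmed", term="cancer[Title]")
record = Entrez.read(handle)
handle.close()
print(record["Count"], record["IdList"])
```
UCSC Genome Browser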
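```python
import pybedtools

# pybedtools works on local files, e.g. tracks exported from the UCSC Table Browser.
# Query genomic regions: report the parts of each region that overlap features in genes.gtf
bed = pybedtools.BedTool("regions.bed")
genes = bed.intersect("genes.gtf")
```
Cancer Cell Line Encyclopedia (CCLE)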
```python
import pandas as pd

# Load expression data and sample metadata (file names are placeholders for local CSV exports)
expression = pd.read_csv("ccle_expression.csv")
metadata = pd.read_csv("ccle_metadata.csv")
```
Data Download & Storage
Direct Downloads
- FTP servers: Large file transfers
- HTTP downloads: Web-based access
- Cloud storage: AWS S3, Google Cloud Storage
- Torrents: Peer-to-peer sharing
Programmatic Access
- REST APIs: HTTP-based interfaces
- GraphQL: Flexible data queries
- Python clients: Specialized libraries
- R packages: Bioconductor tools
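REST APIs usually return large result sets one page at a time. The sketch below shows one way to walk through all pages, assuming a GDC-style response in which hits sit under `data.hits` and the total count under `data.pagination.total`. The `fetch_page` callable and the stand-in `fake_fetch` are hypothetical, so the loop can be tried without network access.

```python
def iter_hits(fetch_page, page_size=100):
    """Yield every hit from a paginated endpoint.

    fetch_page(offset, size) must return the decoded JSON response as a dict
    shaped like {"data": {"hits": [...], "pagination": {"total": N}}}.
    """
    offset = 0
    while True:
        data = fetch_page(offset, page_size)["data"]
        hits = data["hits"]
        yield from hits
        offset += len(hits)
        # Stop on an empty page or once every reported hit has been seen
        if not hits or offset >= data["pagination"]["total"]:
            break

def fake_fetch(offset, size):
    """Stand-in for a real HTTP call: serves five synthetic records."""
    items = [{"project_id": f"P{i}"} for i in range(5)]
    page = items[offset:offset + size]
    return {"data": {"hits": page, "pagination": {"total": len(items)}}}

all_hits = list(iter_hits(fake_fetch, page_size=2))
print(len(all_hits))
```

With `requests`, `fetch_page` could wrap a call such as `requests.get(url, params={"from": offset, "size": size}).json()`. GDC list endpoints accept `from` and `size`, but other APIs on this page paginate differently, so check each service's documentation.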
Data Management
- Version control: Track data changes
- Compression: Reduce storage requirements
- Indexing: Fast data retrieval
- Backup: Multiple storage locations
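Compression and backups are easier to trust when each archive carries a checksum. The following sketch, using only the standard library, gzips a file and writes a SHA-256 manifest next to it. The function names and the `.sha256` manifest convention are assumptions for illustration, not a required layout.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compress_with_manifest(src):
    """Gzip src and write '<checksum>  <name>' to a sidecar manifest file."""
    src = Path(src)
    archive = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(archive, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    manifest = archive.with_name(archive.name + ".sha256")
    manifest.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return archive, manifest

def verify(archive, manifest):
    """Return True if the archive still matches its recorded checksum."""
    expected = Path(manifest).read_text().split()[0]
    return sha256_of(archive) == expected

# Round trip on a small synthetic file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "counts.tsv"
    sample.write_text("gene\tcount\nTP53\t42\n")
    archive, manifest = compress_with_manifest(sample)
    ok = verify(archive, manifest)
    with gzip.open(archive, "rt") as fh:
        roundtrip = fh.read()
print(ok)
```

Running `verify` after copying an archive to a second storage location is a cheap way to confirm the backup is intact.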
Data Repositories & Formats (2024–2025)
Verified Repositories
| Repository | URL | API Status | Latest Update |
|---|---|---|---|
| GDC Portal | https://portal.gdc.cancer.gov | ✅ Active | 2024 Q3 |
| UCSC Xena | http://xena.ucsc.edu | ✅ Active | 2024 refresh |
| cBioPortal | https://www.cbioportal.org | ✅ Active | Ongoing |
| Cancer Genome Interpreter | https://www.cancergenomeinterpreter.org/api/v1 | ✅ Active | Current |
| TCIA (Imaging) | https://www.cancerimagingarchive.net | ✅ Active | 2025 updates |
| CDA | https://cda.readthedocs.io | ✅ Active | 2024‑07 |
Data Formats (current)
- Genomic Sequences: FASTA, FASTQ
- Variants: VCF
- Gene Expression: RNA‑seq matrices
- Alignments: SAM, BAM, CRAM
- Annotations: GFF, GTF, BED
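As an illustration of how one of these formats is laid out, the sketch below parses the eight fixed, tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It ignores per-sample genotype columns, and the records are synthetic. For real analyses a dedicated library such as `pysam` or `cyvcf2` is the more robust route.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # flag entries carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),  # multiple alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }

def read_vcf(lines):
    """Yield parsed records, skipping header lines (which start with '#')."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)

# Synthetic records for illustration only
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t7674220\t.\tC\tT\t50\tPASS\tDP=100;SOMATIC\n",
    "chr17\t7675088\trs1\tC\tA,G\t12.5\tq10\t.\n",
]
variants = list(read_vcf(example))
high_quality = [v for v in variants if v["qual"] is not None and v["qual"] > 30]
print(len(variants), len(high_quality))
```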
Data Quality & Validation
Quality Metrics
- Completeness: Missing data assessment
- Accuracy: Validation against known standards
- Consistency: Cross-reference checks
- Timeliness: Data freshness
Common Issues
- Missing values: Handle appropriately
- Format inconsistencies: Standardize data
- Outliers: Identify and investigate
- Batch effects: Control for technical variation
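Two of the issues above, missing values and outliers, can be checked with a few lines of standard-library Python. This is a minimal sketch: it treats `None` and empty strings as missing and uses Tukey's 1.5 × IQR rule for outliers, which is one common convention among several. The records are synthetic.

```python
import statistics

def missing_fraction(rows, column):
    """Fraction of rows where the column is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Synthetic clinical rows: two ages are missing and one is implausible
rows = [
    {"age": 54}, {"age": 61}, {"age": None}, {"age": 58},
    {"age": 57}, {"age": 190}, {"age": 63}, {"age": ""},
]
ages = [row["age"] for row in rows if row["age"] not in (None, "")]
print(missing_fraction(rows, "age"))
print(iqr_outliers(ages))
```

Flagged values are candidates for investigation, not automatic removal: an age of 190 is a data-entry error, but an extreme expression value may be real biology.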
Data Processing Tools
Python Ecosystem
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and explore data
df = pd.read_csv("cancer_data.csv")
df.info()  # prints its own summary and returns None
print(df.describe())

# Plot the age distribution, split by cancer type
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="age", hue="cancer_type")
plt.show()
```
R Ecosystem
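```r
library(tidyverse)
library(DESeq2)

# Load data (assumes the first column of each file holds IDs):
# gene IDs become row names of the counts, sample IDs row names of the metadata
counts <- read.csv("gene_counts.csv", row.names = 1, check.names = FALSE)
metadata <- read.csv("sample_info.csv", row.names = 1)

# Count columns must be in the same order as metadata rows
stopifnot(all(colnames(counts) == rownames(metadata)))

# Differential expression
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
```
Command Line Tools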
```bash
# Process VCF files: keep records with QUAL above 30
bcftools filter -i 'QUAL>30' -o filtered.vcf input.vcf
# Convert file formats: SAM to BAM
samtools view -b -o output.bam input.sam
# Quality control
fastqc sample.fastq.gz
```
Data Visualization
Static Plots
- Histograms: Distribution analysis
- Scatter plots: Correlation studies
- Box plots: Group comparisons
- Heatmaps: Matrix visualization
Interactive Visualizations
- Plotly: Web-based charts
- Bokeh: Python interactive plots
- D3.js: Custom web visualizations
- Tableau: Business intelligence
Specialized Tools
- IGV: Genomic data browser
- UCSC Browser: Web-based genome viewer
- Circos: Circular plots for genomics
- R2: Cancer genomics platform
Data Access & Ethics
Access & Authentication (2024–2025)
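- GDC: Open-access data needs no login; controlled-access data requires dbGaP authorization and an NIH (eRA Commons) login, with an authentication token supplied for API downloads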
- CDA: Public endpoints; follows source commons auth for controlled data
- cBioPortal / Xena: Open access for public studies; some hubs may be restricted
- TCIA: Open datasets + registered access for certain collections
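For token-based access, the GDC API expects the token in an `X-Auth-Token` request header. The sketch below only builds the request object, so it can be inspected without contacting the server. The `GDC_TOKEN` environment variable and the helper names are assumptions for illustration, and the all-zero UUID is a placeholder.

```python
import os
import urllib.request

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"

def load_token(path=None):
    """Read a token from a file, or fall back to the (hypothetical) GDC_TOKEN variable."""
    if path:
        with open(path) as fh:
            return fh.read().strip()
    return os.environ.get("GDC_TOKEN")

def build_download_request(file_uuid, token=None):
    """Build a download request; the token header is only added when a token is given."""
    req = urllib.request.Request(f"{GDC_DATA_ENDPOINT}/{file_uuid}")
    if token:
        req.add_header("X-Auth-Token", token)
    return req

# Placeholder UUID and token, for illustration only
req = build_download_request("00000000-0000-0000-0000-000000000000", token="example-token")
open_req = build_download_request("00000000-0000-0000-0000-000000000000")
print(req.full_url)
```

Tokens grant access to controlled data, so keep them out of source code and version control; reading them from a file with restricted permissions or from the environment, as above, is one option.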
Public Datasets
- Open access: No restrictions
- Creative Commons: Attribution required
- Government data: Public domain
- Academic sharing: Research use
Controlled Access
- dbGaP: Database of Genotypes and Phenotypes (NIH)
- EGA: European Genome-phenome Archive
- ICGC: International Cancer Genome Consortium
- Institutional: Local data sharing
Ethical Considerations
- Patient privacy: HIPAA compliance
- Data consent: Informed permission
- Re-identification: De-anonymization risks
- Commercial use: Licensing restrictions
Advanced Topics
Big Data Technologies
- Apache Spark: Distributed computing
- Hadoop: MapReduce framework
- Dask: Parallel Python computing
- Ray: Distributed AI/ML
Cloud Computing
- AWS Genomics: Specialized services
- Google Cloud: Healthcare APIs
- Azure: Medical data solutions
- DNAnexus: Genomic platform
Real-time Data
- Streaming: Live data feeds
- WebSockets: Real-time updates
- Kafka: Event streaming
- Pub/Sub: Message queuing
Learning Resources
Documentation
Tutorials
Communities
- Biostars: Bioinformatics Q&A
- SeqAnswers: Sequencing community
- Reddit r/bioinformatics
Contributing Data
Have data to share? Contribute!
- Document your dataset thoroughly
- Provide metadata and descriptions
- Include usage examples and code
- Share data quality assessments
- Update regularly with new versions
This section provides access to the data you need for cancer research and analysis.