
Data & APIs

Welcome to the data and APIs section! Here you'll find access to cancer research databases, genomic data repositories, and programmatic interfaces for building research applications.

What you'll find

  • Genomic Databases: DNA sequences, gene expression, mutations
  • Clinical Data: Patient outcomes, treatment responses, survival data
  • Research APIs: Programmatic access to scientific resources
  • Data Formats: Standards for biological data exchange
  • Integration Tools: Software for combining multiple data sources

Who is this section for?

  • Data scientists working with cancer datasets
  • Software developers building research applications
  • Bioinformaticians analyzing genomic data
  • Researchers looking for data sources
  • Students learning data science in biology

Getting Started

Essential Data Sources

  1. GDC: Genomic Data Commons (TCGA and more)
  2. ICGC‑ARGO: Successor to ICGC (harmonized international data)
  3. UCSC Xena: Harmonized cohorts across hubs
  4. cBioPortal: Multi‑omics cancer studies
  5. TCIA: The Cancer Imaging Archive (radiology/pathology)
  6. CDA: Cancer Data Aggregator (cross‑commons search)
  7. SRA/ENA: Raw sequence archives

First Steps

  1. Explore data formats and standards
  2. Set up API access and authentication
  3. Download sample datasets for testing
  4. Build simple queries and filters

Data Types & Formats

Genomic Data

  • DNA Sequences: FASTA, FASTQ formats
  • Gene Expression: RNA-seq count matrices
  • Variants: VCF (Variant Call Format)
  • Annotations: GFF, GTF, BED files
  • Alignments: SAM, BAM, CRAM formats
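
Most of the formats above are line-oriented text, so their structure is easy to inspect by hand. As a minimal sketch (not a replacement for a dedicated parser such as bcftools), the snippet below splits one VCF data line into its eight fixed columns using only the standard library. The function name `parse_vcf_line` and the example record are illustrative assumptions.

```python
from typing import Any, Dict

# The eight fixed, tab-separated columns defined by the VCF specification
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse one VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])          # 1-based position
    record["ALT"] = record["ALT"].split(",")    # several alternate alleles are comma-separated
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    info: Dict[str, Any] = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # keys without '=' are flags
    record["INFO"] = info
    return record


# Hypothetical record, for illustration only
example = "chr17\t7674220\t.\tC\tT\t50\tPASS\tDP=100;SOMATIC"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["ALT"], rec["INFO"])
```

INFO values are left as strings here; converting them to typed values requires the definitions in the VCF header, which this sketch does not read.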

Clinical Data

  • Patient Demographics: Age, gender, ethnicity
  • Diagnosis: Cancer type, stage, grade
  • Treatment: Surgery, chemotherapy, radiation
  • Outcomes: Survival time, recurrence, response
  • Biomarkers: Protein levels, genetic mutations

Metadata

  • Sample Information: Collection date, processing
  • Quality Metrics: Read depth, coverage
  • Experimental Design: Batch effects, controls
  • Ethics & Consent: IRB approval, data sharing

Available APIs

Cancer Data Aggregator (CDA) — 2024

python
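# pip install cdapython
# Note: the cdapython query interface has changed between releases; check this
# snippet against the CDA documentation for the version you have installed.
from cdapython import Q

q = (
    Q('Subject')
    .filter(Q('ResearchSubject.primary_diagnosis_site') == 'Breast')
    .select('id', 'sex', 'race', 'vital_status')
)
results = q.run()
print(len(results))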

UCSC Xena — 2024

python
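# pip install xenaPython
# Function, hub, and cohort names below should be checked against the installed
# xenaPython version and the hub's dataset listing.
import xenaPython as xena

# The GDC-harmonized TCGA-BRCA HTSeq counts are assumed to live on the GDC hub
hub = "https://gdc.xenahubs.net"
dataset = "TCGA-BRCA.htseq_counts.tsv"

# list datasets for one cohort
datasets = xena.dataset_list(hub, ["GDC TCGA Breast Cancer (BRCA)"])
# get BRCA HTSeq counts for selected genes (probe values averaged per gene)
samples = xena.dataset_samples(hub, dataset, None)
expr = xena.dataset_gene_probe_avg(hub, dataset, samples, ["TP53", "BRCA1", "BRCA2"])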

cBioPortal

bash
# REST example
curl "https://www.cbioportal.org/api/studies?projection=SUMMARY"
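
For use inside a script, the same request can be sketched in Python with the standard library alone. The helper names `build_studies_url` and `fetch_studies` are assumptions made for this example; the only endpoint and parameter used are the ones in the curl command above.

```python
import json
import urllib.parse
import urllib.request

CBIOPORTAL_API = "https://www.cbioportal.org/api"


def build_studies_url(projection: str = "SUMMARY") -> str:
    """Build the URL for the /studies endpoint with the given projection."""
    query = urllib.parse.urlencode({"projection": projection})
    return f"{CBIOPORTAL_API}/studies?{query}"


def fetch_studies(projection: str = "SUMMARY", timeout: float = 30.0) -> list:
    """Fetch the study list as parsed JSON (performs a network request)."""
    request = urllib.request.Request(
        build_studies_url(projection),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(build_studies_url())
```

Calling `fetch_studies()` should return a list of study objects; it is left uncalled here so the snippet runs without network access.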

Genomic Data Commons (GDC) — 2024 API

python
import json

import requests

BASE = "https://api.gdc.cancer.gov"

# Example: list projects whose primary_site includes Breast
# (first page only, up to `size` results; the value is written as GDC stores it)
filters = {
    "op": "in",
    "content": {"field": "primary_site", "value": ["Breast"]}
}
params = {
    "filters": json.dumps(filters),
    "format": "json",
    "size": 50
}
resp = requests.get(f"{BASE}/projects", params=params, timeout=60)
resp.raise_for_status()
projects = resp.json()["data"]["hits"]
print(f"Projects: {len(projects)}")

# Example: BAM slicing endpoint (controlled data needs a token in the X-Auth-Token header)
# r = requests.post(f"{BASE}/slicing/view/{file_uuid}", headers={"Content-Type": "application/json", "X-Auth-Token": token}, data=json.dumps({"regions": ["chr17:43044295-43125483"]}), stream=True)
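
The request above returns at most `size` hits. To walk through a larger result set, the GDC API accepts a `from` offset and reports the total count under `data.pagination`. The following is a minimal sketch of that loop; the names `iter_gdc_hits` and `http_get_json` are assumptions for this example, and the fetch function is injectable so the paging logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional

GDC_BASE = "https://api.gdc.cancer.gov"


def http_get_json(url: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """GET a URL with query parameters and return the parsed JSON body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=60) as response:
        return json.load(response)


def iter_gdc_hits(
    endpoint: str,
    filters: Optional[Dict[str, Any]] = None,
    page_size: int = 50,
    fetch: Callable[[str, Dict[str, Any]], Dict[str, Any]] = http_get_json,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit from a GDC endpoint, requesting one page at a time."""
    offset = 0
    while True:
        params: Dict[str, Any] = {"format": "json", "size": page_size, "from": offset}
        if filters is not None:
            params["filters"] = json.dumps(filters)
        data = fetch(f"{GDC_BASE}/{endpoint}", params)["data"]
        hits = data["hits"]
        yield from hits
        offset += len(hits)
        # Stop when a page comes back empty or all reported records were seen
        if not hits or offset >= data["pagination"]["total"]:
            break
```

A typical call would be `for project in iter_gdc_hits("projects", filters=filters): ...`, reusing a filter dictionary like the one above.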

NCBI APIs

python
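from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own
Entrez.email = "your.name@example.org"

# Search PubMed for records with "cancer" in the title
handle = Entrez.esearch(db="pubmed", term="cancer[Title]", retmax=20)
record = Entrez.read(handle)
handle.close()
print(record["Count"], record["IdList"])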

UCSC Genome Browser

python
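import pybedtools

# pybedtools works on local files and does not query UCSC itself;
# regions.bed and genes.gtf are placeholders for tracks you have downloaded
# (for example, exported from the UCSC Table Browser)
bed = pybedtools.BedTool("regions.bed")
# Report the portions of each region that overlap an annotated feature
genes = bed.intersect("genes.gtf")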

Cancer Cell Line Encyclopedia (CCLE)

python
import pandas as pd

# Load expression data
expression = pd.read_csv("ccle_expression.csv")
metadata = pd.read_csv("ccle_metadata.csv")

Data Download & Storage

Direct Downloads

  • FTP servers: Large file transfers
  • HTTP downloads: Web-based access
  • Cloud storage: AWS S3, Google Cloud Storage
  • Torrents: Peer-to-peer sharing

Programmatic Access

  • REST APIs: HTTP-based interfaces
  • GraphQL: Flexible data queries
  • Python clients: Specialized libraries
  • R packages: Bioconductor tools

Data Management

  • Version control: Track data changes
  • Compression: Reduce storage requirements
  • Indexing: Fast data retrieval
  • Backup: Multiple storage locations
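
Two of the practices above, compression and tracking changes, can be sketched with the standard library: gzip a file and record a SHA-256 checksum so later copies and backups can be verified. The function names `sha256sum` and `gzip_file` are illustrative assumptions.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def sha256sum(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_file(src: PathLike, dst: Optional[PathLike] = None) -> Path:
    """Write a gzip-compressed copy of src (default name: src + '.gz') and return its path."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Storing the checksum of the uncompressed file next to the archive makes it possible to confirm, after any transfer or restore, that the decompressed content is unchanged.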

Data Repositories & Formats (2024–2025)

Verified Repositories

| Repository | URL | API Status | Latest Update |
| --- | --- | --- | --- |
| GDC Portal | https://portal.gdc.cancer.gov | ✅ Active | 2024 Q3 |
| UCSC Xena | http://xena.ucsc.edu | ✅ Active | 2024 refresh |
| cBioPortal | https://www.cbioportal.org | ✅ Active | Ongoing |
| Cancer Genome Interpreter | https://www.cancergenomeinterpreter.org/api/v1 | ✅ Active | Current |
| TCIA (Imaging) | https://www.cancerimagingarchive.net | ✅ Active | 2025 updates |
| CDA | https://cda.readthedocs.io | ✅ Active | 2024‑07 |

Data Formats (current)

  • Genomic Sequences: FASTA, FASTQ
  • Variants: VCF
  • Gene Expression: RNA‑seq matrices
  • Alignments: SAM, BAM, CRAM
  • Annotations: GFF, GTF, BED

Data Quality & Validation

Quality Metrics

  • Completeness: Missing data assessment
  • Accuracy: Validation against known standards
  • Consistency: Cross-reference checks
  • Timeliness: Data freshness

Common Issues

  • Missing values: Handle appropriately
  • Format inconsistencies: Standardize data
  • Outliers: Identify and investigate
  • Batch effects: Control for technical variation
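
As a rough illustration of the first and third checks above, the sketch below measures the fraction of missing values in a column and flags outliers with the common 1.5 × IQR rule. It uses only the standard library; the helper names and the `ages` values are made up for the example, and the IQR rule is one convention among several, not a recommendation specific to any dataset.

```python
import statistics
from typing import List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def iqr_outliers(values: Sequence[Optional[float]], k: float = 1.5) -> List[float]:
    """Return observed values lying more than k * IQR outside the quartiles."""
    observed = sorted(v for v in values if v is not None)
    q1, _median, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in observed if v < low or v > high]


# Hypothetical ages with two missing entries and one implausible value
ages = [52, 61, None, 47, 58, 63, 55, 140, None, 49]
print(missing_fraction(ages))
print(iqr_outliers(ages))
```

Flagged values are candidates for investigation, not automatic removal: an age of 140 is likely a data-entry error, but an extreme expression value may be real biology.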

Data Processing Tools

Python Ecosystem

python
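import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and explore data (cancer_data.csv is a placeholder file
# with "age" and "cancer_type" columns)
df = pd.read_csv("cancer_data.csv")
df.info()  # prints its own summary; wrapping it in print() would also print "None"
print(df.describe())

# Plot the age distribution by cancer type
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="age", hue="cancer_type")
plt.show()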

R Ecosystem

r
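library(tidyverse)  # also attaches ggplot2
library(DESeq2)

# Load data: genes in rows (IDs in the first column), samples in columns;
# sample_info.csv is assumed to have sample IDs in its first column
counts <- as.matrix(read.csv("gene_counts.csv", row.names = 1))
metadata <- read.csv("sample_info.csv", row.names = 1)

# Sample order must match between the count matrix and the metadata
stopifnot(all(colnames(counts) == rownames(metadata)))

# Differential expression
dds <- DESeqDataSetFromMatrix(countData = counts,
                             colData = metadata,
                             design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)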

Command Line Tools

bash
# Process VCF files
bcftools filter input.vcf -i 'QUAL>30' > filtered.vcf

# Convert file formats
samtools view -b input.sam > output.bam

# Quality control
fastqc sample.fastq.gz

Data Visualization

Static Plots

  • Histograms: Distribution analysis
  • Scatter plots: Correlation studies
  • Box plots: Group comparisons
  • Heatmaps: Matrix visualization

Interactive Visualizations

  • Plotly: Web-based charts
  • Bokeh: Python interactive plots
  • D3.js: Custom web visualizations
  • Tableau: Business intelligence

Specialized Tools

  • IGV: Genomic data browser
  • UCSC Browser: Web-based genome viewer
  • Circos: Circular plots for genomics
  • R2: Cancer genomics platform

Data Access & Ethics

Access & Authentication (2024–2025)

  • GDC: Open access vs. controlled access (requires dbGaP authorization and an NIH login); authentication tokens for controlled downloads
  • CDA: Public endpoints; follows source commons auth for controlled data
  • cBioPortal / Xena: Open access for public studies; some hubs may be restricted
  • TCIA: Open datasets + registered access for certain collections
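
For GDC specifically, a controlled-access download carries the user's token in the `X-Auth-Token` request header. The sketch below only builds such a request; the helper names and the token file path are assumptions for illustration, and the token itself is downloaded from the GDC portal after login.

```python
import urllib.request
from pathlib import Path
from typing import Optional

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def load_token(token_path: str) -> str:
    """Read a GDC token from a local file, stripping surrounding whitespace."""
    return Path(token_path).read_text().strip()


def build_download_request(file_uuid: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a download request; the token header is added only when a token is given."""
    headers = {"X-Auth-Token": token} if token else {}
    return urllib.request.Request(f"{GDC_DATA_ENDPOINT}/{file_uuid}", headers=headers)


# Open-access files need no token; controlled files would use something like
# build_download_request(file_uuid, load_token("gdc-user-token.txt"))  # hypothetical path
```

Keeping the token in a file outside version control, and passing it only when needed, avoids leaking credentials into scripts and notebooks.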

Public Datasets

  • Open access: No restrictions
  • Creative Commons: Attribution required
  • Government data: Public domain
  • Academic sharing: Research use

Controlled Access

  • dbGaP: Database of Genotypes and Phenotypes
  • EGA: European Genome-phenome Archive
  • ICGC: International Cancer Genome Consortium
  • Institutional: Local data sharing

Ethical Considerations

  • Patient privacy: HIPAA compliance
  • Data consent: Informed permission
  • Re-identification: De-anonymization risks
  • Commercial use: Licensing restrictions

Advanced Topics

Big Data Technologies

  • Apache Spark: Distributed computing
  • Hadoop: MapReduce framework
  • Dask: Parallel Python computing
  • Ray: Distributed AI/ML

Cloud Computing

  • AWS Genomics: Specialized services
  • Google Cloud: Healthcare APIs
  • Azure: Medical data solutions
  • DNAnexus: Genomic platform

Real-time Data

  • Streaming: Live data feeds
  • WebSockets: Real-time updates
  • Kafka: Event streaming
  • Pub/Sub: Message queuing

Learning Resources

Documentation

Tutorials

Communities

Contributing Data

Have data to share? Contribute!

  1. Document your dataset thoroughly
  2. Provide metadata and descriptions
  3. Include usage examples and code
  4. Share data quality assessments
  5. Update regularly with new versions

This section provides access to the data you need for cancer research and analysis.

Early public release. Content evolves through continuous review. Questions: [email protected] · CC BY 4.0 where applicable.