Getting started
A practical onboarding for technologists. Setup, one real example end-to-end, and where to go after.
TL;DR
In about 30 minutes you can go from a clean machine to a real query against the NCI Genomic Data Commons (GDC) — a public cancer-research portal — and a small dataset you can analyze locally. This page walks the minimum path. It does not try to teach you cancer biology in 5 minutes; the Fundamentals and Omics sections do that.
Project status
HackCancer is an early public release. APIs change. PRs and issues welcome at github.com/hack-cancer. For project state and how to help today, see Project status.
1. What you actually need
| Tool | Why | Min version |
|---|---|---|
| Python | runnable examples, pipelines | 3.8+ |
| Git | clone the repo, contribute back | any modern |
| A terminal | bash, zsh, PowerShell, Windows Terminal | — |
| Node.js (optional) | only if you want to run the docs site locally | 20+ |
| Docker (optional) | for containerized workflows later | — |
If you only want to read the docs, you don't need anything — the site is live. If you want to run examples, you need Python and Git.
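A quick way to confirm the two required tools are present is a small standard-library check. This is a minimal sketch: the file name and the function `check_prerequisites` are hypothetical and not shipped in the repo, and the 3.8 floor is taken from the table above.

```python
# check_prereqs.py (hypothetical helper, not part of the repo)
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the table above


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict describing whether Python and Git look usable."""
    return {
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git_found": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

If `python_ok` or `git_found` comes back `False`, fix that before moving on to the setup step.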
2. Setup (5 minutes)
# 1. Clone
git clone https://github.com/hack-cancer/Hack-Cancer.git
cd Hack-Cancer
# 2. Python virtual environment
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate
# 3. Install Python dependencies for the examples
pip install -r src/requirements.txt
If pip install fails on your system, the bare minimum for the example below is:
pip install requests pandas
3. Your first real query (10 minutes)
The point of this section is to get you to a real result against real cancer data, not a toy. Below is a self-contained snippet — no need to have cloned the repo first.
It queries the NCI GDC /cases endpoint for breast cancer cases (TCGA-BRCA project), pulls 20, and shows demographics. The GDC API is public, well-documented, and does not require authentication for public data. Sources: [1]
# gdc_quickstart.py
import json
import requests
import pandas as pd
API = "https://api.gdc.cancer.gov/cases"
# Filter: cases in project TCGA-BRCA (breast cancer)
filters = {
    "op": "in",
    "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]},
}
params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id,project.project_id,primary_site,"
              "demographic.gender,demographic.age_at_index",
    "format": "json",
    "size": "20",
}
r = requests.get(API, params=params, timeout=30)
r.raise_for_status()
hits = r.json()["data"]["hits"]
rows = []
for c in hits:
    demo = c.get("demographic") or {}
    if isinstance(demo, list):
        demo = demo[0] if demo else {}
    rows.append({
        "submitter_id": c["submitter_id"],
        "project": c["project"]["project_id"],
        "site": c.get("primary_site"),
        "gender": demo.get("gender"),
        "age_at_index": demo.get("age_at_index"),
    })
df = pd.DataFrame(rows)
print(df.head())
print(f"\nGender distribution:\n{df['gender'].value_counts(dropna=False)}")
ages = pd.to_numeric(df['age_at_index'], errors='coerce')
print(f"\nMean age at diagnosis: {ages.mean():.1f} years")
Run it:
python gdc_quickstart.py
What you should see
A table similar to the following (exact case IDs vary as the GDC dataset evolves):
   submitter_id    project    site  gender  age_at_index
0  TCGA-AR-A1AR  TCGA-BRCA  Breast  female            61
1  TCGA-A2-A0CL  TCGA-BRCA  Breast  female            55
2  TCGA-BH-A0BO  TCGA-BRCA  Breast  female            58
3  TCGA-AC-A2FB  TCGA-BRCA  Breast    male            66
4  TCGA-E2-A14P  TCGA-BRCA  Breast  female            48
Gender distribution:
female    19
male       1
Mean age at diagnosis: 58.4 years
What just happened:
- You asked the GDC for public clinical metadata of breast-cancer cases in the TCGA-BRCA cohort.
- You got back a small pandas.DataFrame with one row per case.
- You did a sanity check (gender skew toward female, mean age in the expected range for breast cancer).
- No data left the public domain. No PHI. No login.
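The sanity check described above can be made explicit as a small function. This is a sketch under stated assumptions: `sanity_check` is a hypothetical helper, the age bounds are illustrative defaults rather than clinical thresholds, and the sample frame is hand-built from the five rows in the sample output so the block runs without network access.

```python
import pandas as pd


def sanity_check(df, min_mean_age=40.0, max_mean_age=80.0):
    """Return a list of human-readable problems; an empty list means nothing looked off.

    The age bounds are illustrative assumptions, not clinical thresholds.
    """
    if df.empty:
        return ["no rows returned"]
    problems = []
    if df["submitter_id"].duplicated().any():
        problems.append("duplicate submitter_id values")
    # Coerce to numbers so missing or malformed ages become NaN instead of raising.
    ages = pd.to_numeric(df["age_at_index"], errors="coerce")
    if ages.notna().sum() == 0:
        problems.append("no usable age_at_index values")
    elif not (min_mean_age <= ages.mean() <= max_mean_age):
        problems.append(f"mean age {ages.mean():.1f} outside expected range")
    return problems


# Hand-made frame using the five rows shown in the sample output above.
sample = pd.DataFrame({
    "submitter_id": ["TCGA-AR-A1AR", "TCGA-A2-A0CL", "TCGA-BH-A0BO",
                     "TCGA-AC-A2FB", "TCGA-E2-A14P"],
    "gender": ["female", "female", "female", "male", "female"],
    "age_at_index": [61, 55, 58, 66, 48],
})
print(sanity_check(sample))
```

Returning a list of problems rather than raising keeps the check usable in a notebook, where you usually want to see every issue at once.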
This is the smallest non-trivial query. From here you can pull mutations, gene expression, copy-number, files, etc. — all public.
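To narrow a query beyond a single project, GDC filters can be nested under an `and` operator. The following is a minimal sketch, not code from the repo: the helper names `in_filter` and `all_of` are hypothetical, and the `cases.demographic.gender` field name is assumed to follow the same pattern as the project filter above, so check it against the GDC field listing before relying on it.

```python
import json


def in_filter(field, values):
    """One GDC 'in' clause: the field matches any of the given values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def all_of(*clauses):
    """Combine clauses with the GDC 'and' operator."""
    return {"op": "and", "content": list(clauses)}


# Example: female cases in TCGA-BRCA (field names are assumptions, see above).
filters = all_of(
    in_filter("cases.project.project_id", ["TCGA-BRCA"]),
    in_filter("cases.demographic.gender", ["female"]),
)

# Same parameter layout as the quickstart snippet; only the filter changed.
params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id,demographic.gender,demographic.age_at_index",
    "format": "json",
    "size": "20",
}
print(json.dumps(filters, indent=2))
```

The resulting `params` dict can be passed to `requests.get` exactly as in the quickstart snippet.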
4. The expanded example in this repo
src/examplesapi/gdc_data_access.py is a more complete script with three steps:
- Search: list cases in TCGA-BRCA with demographics.
- Mutations: query /ssms for somatic mutations in the first case.
- File metadata: list RNA-seq quantification files for the cohort.
To run it:
cd src/examplesapi
python gdc_data_access.py
The script uses the same public GDC endpoints; output structure follows the GDC API contract.
If a network or proxy issue blocks calls, you'll see a RequestException; the snippet in §3 is the simpler fallback. Sources: [1]
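Some network failures are transient, and a short retry loop can get past them. This is a hedged sketch rather than what the repo script does: `call_with_retries` and its parameters are hypothetical. It catches `OSError` because `requests.exceptions.RequestException` derives from it, which keeps the helper free of third-party imports. A proxy block will not be fixed by retrying; in that case the last error is re-raised.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(); on a network-type error, wait and retry with exponential backoff.

    fetch    -- zero-argument callable that performs the request
    attempts -- total number of tries before giving up
    sleep    -- injectable so tests can avoid real waiting
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with the quickstart snippet (requires network access):
# r = call_with_retries(lambda: requests.get(API, params=params, timeout=30))
```

Note that `requests.get` only raises on connection-level problems; HTTP error statuses still need `raise_for_status()` inside the callable if you want them retried too.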
5. After you've run the example: where to go next
Pick whichever matches what you want to do next:
| You want to… | Read |
|---|---|
| Understand what cancer actually is | What is cancer? |
| Understand the data layers (DNA, RNA, protein…) | Omics overview |
| Pull patient cohorts and run analyses | Data & APIs, Examples |
| Process raw sequencing data | From FASTQ to variants |
| Build models, but responsibly | ML pitfalls in oncology, AI & ML overview |
| Know the legal and ethical frame | Limits & responsibility, Regulatory & ethics, Data governance & LGPD |
| Why the project exists | Mission |
6. Running the docs site locally (optional)
If you want to preview your edits to the docs:
# Requires Node 20+
npm ci
npm run docs:dev
Open the URL printed in the terminal (usually http://localhost:5173). The Portuguese version lives at /pt-br/.
To build the static site:
npm run docs:build
Output goes to docs/.vitepress/dist.
7. Common pitfalls
- Network errors against api.gdc.cancer.gov: corporate proxies and some country networks may need a proxy config. Try the snippet from a personal network first to isolate the problem.
- Mismatched Python versions: a virtualenv pins to whichever python you used to create it; check python --version inside the venv.
- Pandas version warnings: usually harmless; pin the version in requirements.txt if you need stability.
- Treating GDC metadata as patient identity: it isn't, but treat downloaded data with the same hygiene you'd use for any health data anyway.
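The first two pitfalls can be diagnosed from inside the interpreter. The sketch below is illustrative only: `environment_report` is a hypothetical helper, and it simply reports the proxy environment variables that `requests` reads by default, along with whether the running interpreter belongs to a virtual environment.

```python
import os
import sys


def environment_report(environ=None):
    """Summarize interpreter and proxy settings relevant to the pitfalls above."""
    environ = os.environ if environ is None else environ
    proxies = {}
    for key in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
        # Proxy variables are commonly set in either upper or lower case.
        value = environ.get(key) or environ.get(key.lower())
        if value:
            proxies[key] = value
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "proxies": proxies,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `in_venv` is `False` after activation, or `executable` points outside the `venv` directory, the shell is picking up a different Python than the one the environment was created with.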
See also
- Project status — what's actually done, what isn't, how to help today
- Mission — the why
- Limits & responsibility — what HackCancer is not
- Roadmap — direction without promises
- Contact — when you need a human
References
[1] NCI Genomic Data Commons API documentation. https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/ (public; no authentication required for open data).