
Getting started

A practical onboarding for technologists. Setup, one real example end-to-end, and where to go after.

TL;DR

In about 30 minutes you can go from a clean machine to a real query against the NCI Genomic Data Commons (GDC) — a public cancer-research portal — and a small dataset you can analyze locally. This page walks the minimum path. It does not try to teach you cancer biology in 5 minutes; the Fundamentals and Omics sections do that.

Project status

HackCancer is an early public release. APIs change. PRs and issues welcome at github.com/hack-cancer. For project state and how to help today, see Project status.


1. What you actually need

Tool                 Why                                              Min version
Python               runnable examples, pipelines                     3.8+
Git                  clone the repo, contribute back                  any modern
A terminal           bash, zsh, PowerShell, Windows Terminal
Node.js (optional)   only if you want to run the docs site locally    20+
Docker (optional)    for containerized workflows later

If you only want to read the docs, you don't need anything — the site is live. If you want to run examples, you need Python and Git.
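
One way to confirm the two hard requirements before moving on is a short check like the one below. This is a minimal sketch, not something shipped in the repo: the name check_prerequisites is hypothetical, and the 3.8 floor is simply the minimum version listed for Python above.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum Python version listed in the requirements above


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict describing whether Python and Git look usable."""
    return {
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        # shutil.which returns the executable path, or None if git is not on PATH
        "git_path": shutil.which("git"),
    }


report = check_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

If git_path prints None, Git is not on your PATH and the clone step in the next section will fail.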


2. Setup (5 minutes)

bash
# 1. Clone
git clone https://github.com/hack-cancer/Hack-Cancer.git
cd Hack-Cancer

# 2. Python virtual environment
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate

# 3. Install Python dependencies for the examples
pip install -r src/requirements.txt

If pip install fails on your system, the bare minimum for the example below is:

bash
pip install requests pandas
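
To confirm the install landed in the environment you expect, a small check along these lines can help. It is a sketch under assumptions: the helper names in_virtualenv and missing_packages are hypothetical, and the venv test relies on sys.prefix differing from sys.base_prefix, which holds for environments created with python -m venv.

```python
import importlib.util
import sys

REQUIRED = ("requests", "pandas")  # the bare minimum named above


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != sys.base_prefix


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


print("inside a virtual environment:", in_virtualenv())
print("missing packages:", missing_packages() or "none")
```

An empty result from missing_packages() means both libraries are importable by the interpreter that ran the check.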

3. Your first real query (10 minutes)

The point of this section is to get you to a real result against real cancer data, not a toy. Below is a self-contained snippet — no need to have cloned the repo first.

It queries the NCI GDC /cases endpoint for breast cancer cases (TCGA-BRCA project), pulls 20, and shows demographics. The GDC API is public, well-documented, and does not require authentication for public data. Sources: [1]

python
# gdc_quickstart.py
import json
import requests
import pandas as pd

API = "https://api.gdc.cancer.gov/cases"

# Filter: cases in project TCGA-BRCA (breast cancer)
filters = {
    "op": "in",
    "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]},
}

params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id,project.project_id,primary_site,"
              "demographic.gender,demographic.age_at_index",
    "format": "json",
    "size": "20",
}

r = requests.get(API, params=params, timeout=30)
r.raise_for_status()
hits = r.json()["data"]["hits"]

rows = []
for c in hits:
    demo = c.get("demographic") or {}
    if isinstance(demo, list):
        demo = demo[0] if demo else {}
    rows.append({
        "submitter_id": c["submitter_id"],
        "project": c["project"]["project_id"],
        "site": c.get("primary_site"),
        "gender": demo.get("gender"),
        "age_at_index": demo.get("age_at_index"),
    })

df = pd.DataFrame(rows)
print(df.head())
print(f"\nGender distribution:\n{df['gender'].value_counts(dropna=False)}")
ages = pd.to_numeric(df['age_at_index'], errors='coerce')
print(f"\nMean age at diagnosis: {ages.mean():.1f} years")

Run it:

bash
python gdc_quickstart.py

What you should see

A table similar to the following (exact case IDs vary as the GDC dataset evolves):

   submitter_id    project        site  gender  age_at_index
0  TCGA-AR-A1AR  TCGA-BRCA      Breast  female            61
1  TCGA-A2-A0CL  TCGA-BRCA      Breast  female            55
2  TCGA-BH-A0BO  TCGA-BRCA      Breast  female            58
3  TCGA-AC-A2FB  TCGA-BRCA      Breast    male            66
4  TCGA-E2-A14P  TCGA-BRCA      Breast  female            48

Gender distribution:
female    19
male       1

Mean age at diagnosis: 58.4 years

What just happened:

  • You asked the GDC for public clinical metadata of breast-cancer cases in the TCGA-BRCA cohort.
  • You got back a small pandas.DataFrame with one row per case.
  • You did a sanity check (gender skew toward female, mean age in the expected range for breast cancer).
  • No data left the public domain. No PHI. No login.

This is the smallest non-trivial query. From here you can pull mutations, gene expression, copy-number, files, etc. — all public.
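
A natural next step is narrowing the cohort with more than one condition. The sketch below composes two "in" clauses under an "and" operator, which is how nested filters are expressed in the GDC filter syntax as far as we understand it. The helper names in_filter and all_of are hypothetical, and the gender field and value are illustrative choices rather than anything prescribed by this guide.

```python
import json


def in_filter(field, values):
    """One GDC 'in' clause: the field value must be one of `values`."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def all_of(*clauses):
    """Combine clauses with the 'and' operator; content is a list of clauses."""
    return {"op": "and", "content": list(clauses)}


filters = all_of(
    in_filter("cases.project.project_id", ["TCGA-BRCA"]),
    in_filter("cases.demographic.gender", ["female"]),  # illustrative condition
)

# Same parameter layout as the quickstart script; only the filter changed
params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id,demographic.gender,demographic.age_at_index",
    "format": "json",
    "size": "20",
}
print(json.dumps(filters, indent=2))
```

The resulting params dict can be passed to requests.get exactly as in the quickstart script.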


4. The expanded example in this repo

src/examplesapi/gdc_data_access.py ships a more complete script with three steps:

  1. Search — list cases in TCGA-BRCA with demographics.
  2. Mutations — query /ssms for somatic mutations in the first case.
  3. File metadata — list RNA-seq quantification files for the cohort.

To run it:

bash
cd src/examplesapi
python gdc_data_access.py

The script uses the same public GDC endpoints, and its output structure follows the GDC API contract. If a network or proxy issue blocks the calls, you'll see a requests.exceptions.RequestException; the snippet in §3 is the simpler fallback. Sources: [1]
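
For orientation, the file-metadata step could look roughly like the sketch below. This is not the code in the repo script: the function names are hypothetical, the filter values ("RNA-Seq", "Gene Expression Quantification") are assumptions about how the cohort's quantification files are labeled, and paging is assumed to use a zero-based from offset together with the total reported under data.pagination.

```python
import json

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_file_params(project_id, size=100, offset=0):
    """Query parameters for RNA-seq quantification files in one project."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "files.experimental_strategy",
                                     "value": ["RNA-Seq"]}},
            {"op": "in", "content": {"field": "files.data_type",
                                     "value": ["Gene Expression Quantification"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type,file_size",
        "format": "json",
        "size": str(size),
        "from": str(offset),
    }


def iter_file_hits(fetch_json, project_id, page_size=100):
    """Yield file records page by page.

    `fetch_json(url, params)` must return the decoded JSON response. It is
    injected so the paging logic can be exercised without network access.
    """
    offset = 0
    while True:
        params = build_file_params(project_id, page_size, offset)
        data = fetch_json(FILES_ENDPOINT, params)["data"]
        hits = data["hits"]
        yield from hits
        offset += len(hits)
        # Stop on an empty page or once every reported record has been seen
        total = data.get("pagination", {}).get("total", offset)
        if not hits or offset >= total:
            break
```

With requests installed, fetch_json can be as small as lambda url, params: requests.get(url, params=params, timeout=30).json(). Keeping the HTTP call outside the generator makes it easy to swap in a stub when you are offline.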


5. After running the example: where to go next

Pick whichever matches what you want to do next:

You want to…                                       Read
Understand what cancer actually is                 What is cancer?
Understand the data layers (DNA, RNA, protein…)    Omics overview
Pull patient cohorts and run analyses              Data & APIs, Examples
Process raw sequencing data                        From FASTQ to variants
Build models, but responsibly                      ML pitfalls in oncology, AI & ML overview
Know the legal and ethical frame                   Limits & responsibility, Regulatory & ethics, Data governance & LGPD
Why the project exists                             Mission

6. Running the docs site locally (optional)

If you want to preview your edits to the docs:

bash
# Requires Node 20+
npm ci
npm run docs:dev

Open the URL printed in the terminal (usually http://localhost:5173). The Portuguese version lives at /pt-br/.

To build the static site:

bash
npm run docs:build

Output goes to docs/.vitepress/dist.


7. Common pitfalls

  • Network errors against api.gdc.cancer.gov — corporate proxies and some country networks may need a proxy config. Try the snippet from a personal network first to isolate.
  • Mismatched Python versions — virtualenv pins to whichever python you used to create it; check python --version inside the venv.
  • Pandas version warnings — usually harmless; pin in requirements.txt if you need stability.
  • Treating GDC metadata as patient identity — it isn't, but treat downloaded data with the same hygiene you'd use for any health data anyway.
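
For the network pitfall in particular, wrapping the call so that failures come back as a readable message can make it easier to tell a proxy problem from a bug in your query. The sketch below is one possible shape, not the project's own helper: fetch_cases is a hypothetical name, and the proxy address in the docstring is a placeholder.

```python
import requests

API = "https://api.gdc.cancer.gov/cases"


def fetch_cases(params, proxies=None, get=requests.get):
    """Return (payload, error); exactly one of the two is None.

    `proxies` is passed straight to requests, for example
    {"https": "http://proxy.example.internal:8080"} (a placeholder address).
    `get` is injectable so the error path can be tested offline.
    """
    try:
        response = get(API, params=params, proxies=proxies, timeout=30)
        response.raise_for_status()
        return response.json(), None
    except requests.exceptions.RequestException as exc:
        # Covers connection failures, timeouts, and HTTP error statuses
        return None, f"{type(exc).__name__}: {exc}"
```

Calling fetch_cases({"size": "1"}) on a blocked network should return an error string naming the exception class instead of a traceback. By default requests also reads proxy settings from environment variables such as HTTPS_PROXY, so setting that variable may be enough without passing proxies explicitly.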


References

  1. NCI Genomic Data Commons API documentation. https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/ — public, no authentication required for open data.

Early public release. Content evolves through continuous review. CC BY 4.0 where applicable.