Getting started

A practical onboarding for technologists. Setup, one real example end-to-end, and where to go after.

TL;DR

In about 30 minutes you can go from a clean machine to a real query against the NCI Genomic Data Commons (GDC) — a public cancer-research portal — and a small dataset you can analyze locally. This page walks the minimum path. It does not try to teach you cancer biology in 5 minutes; the Fundamentals and Omics sections do that.

Project status

HackCancer is an early public release. APIs change. PRs and issues welcome at github.com/hack-cancer. For project state and how to help today, see Project status.

1. What you actually need

Tool	Why	Min version
Python	runnable examples, pipelines	3.8+
Git	clone the repo, contribute back	any modern
A terminal	bash, zsh, PowerShell, Windows Terminal	—
Node.js (optional)	only if you want to run the docs site locally	20+
Docker (optional)	for containerized workflows later	—

If you only want to read the docs, you don't need anything — the site is live. If you want to run examples, you need Python and Git.
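
A quick way to confirm both prerequisites before going further is a small check script. The sketch below is illustrative and not part of the repo: the function name check_prerequisites is hypothetical, and the 3.8 floor is taken from the table above.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the table above


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ required, found {found}")
    # shutil.which returns None when the executable is not on PATH
    if shutil.which("git") is None:
        problems.append("git not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("Ready to run the examples." if not issues else "\n".join(issues))
```

An empty result means Python and Git are both usable from the current shell.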

2. Setup (5 minutes)

bash

# 1. Clone
git clone https://github.com/hack-cancer/Hack-Cancer.git
cd Hack-Cancer

# 2. Python virtual environment
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate

# 3. Install Python dependencies for the examples
pip install -r src/requirements.txt

If pip install fails on your system, the bare minimum for the example below is:

bash

pip install requests pandas
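
To confirm the install worked, one option is to ask Python whether the two packages can be located, without importing them. This is a minimal sketch; missing_packages is a hypothetical helper name, not something the repo ships.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_packages(["requests", "pandas"])
    if missing:
        print("Missing: " + ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("requests and pandas are available.")
```

Run it with the virtual environment activated, so that it checks the same interpreter the examples will use.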

3. Your first real query (10 minutes)

The point of this section is to get you to a real result against real cancer data, not a toy. Below is a self-contained snippet — no need to have cloned the repo first.

It queries the NCI GDC /cases endpoint for breast cancer cases (TCGA-BRCA project), pulls 20, and shows demographics. The GDC API is public, well-documented, and does not require authentication for public data. Sources: [1]

python

# gdc_quickstart.py
import json
import requests
import pandas as pd

API = "https://api.gdc.cancer.gov/cases"

# Filter: cases in project TCGA-BRCA (breast cancer)
filters = {
    "op": "in",
    "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]},
}

params = {
    "filters": json.dumps(filters),
    "fields": "submitter_id,project.project_id,primary_site,"
              "demographic.gender,demographic.age_at_index",
    "format": "json",
    "size": "20",
}

r = requests.get(API, params=params, timeout=30)
r.raise_for_status()
hits = r.json()["data"]["hits"]

rows = []
for c in hits:
    demo = c.get("demographic") or {}
    if isinstance(demo, list):
        demo = demo[0] if demo else {}
    rows.append({
        "submitter_id": c["submitter_id"],
        "project": c["project"]["project_id"],
        "site": c.get("primary_site"),
        "gender": demo.get("gender"),
        "age_at_index": demo.get("age_at_index"),
    })

df = pd.DataFrame(rows)
print(df.head())
print(f"\nGender distribution:\n{df['gender'].value_counts(dropna=False)}")
ages = pd.to_numeric(df['age_at_index'], errors='coerce')
print(f"\nMean age at diagnosis: {ages.mean():.1f} years")

Run it:

bash

python gdc_quickstart.py

What you should see

A table similar to the following (exact case IDs vary as the GDC dataset evolves):

   submitter_id    project        site  gender  age_at_index
0  TCGA-AR-A1AR  TCGA-BRCA      Breast  female            61
1  TCGA-A2-A0CL  TCGA-BRCA      Breast  female            55
2  TCGA-BH-A0BO  TCGA-BRCA      Breast  female            58
3  TCGA-AC-A2FB  TCGA-BRCA      Breast    male            66
4  TCGA-E2-A14P  TCGA-BRCA      Breast  female            48

Gender distribution:
female    19
male       1

Mean age at diagnosis: 58.4 years

What just happened:

You asked the GDC for public clinical metadata of breast-cancer cases in the TCGA-BRCA cohort.
You got back a small pandas.DataFrame with one row per case.
You did a sanity check (gender skew toward female, mean age in the expected range for breast cancer).
No data left the public domain. No PHI. No login.
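
The sanity check in the third point can also be written down explicitly. The sketch below uses only the standard library and works on a list of dicts shaped like rows in §3. The function names and the 30 to 90 age bounds are illustrative assumptions, not clinical thresholds, and the sample rows are made up.

```python
from collections import Counter
from statistics import mean


def summarize_rows(rows):
    """Summarize gender counts and mean age from rows shaped like §3's `rows`."""
    genders = Counter(r.get("gender") or "unknown" for r in rows)
    # Keep only numeric ages; missing values (None) are skipped
    ages = [r["age_at_index"] for r in rows
            if isinstance(r.get("age_at_index"), (int, float))]
    return {
        "n": len(rows),
        "genders": dict(genders),
        "mean_age": mean(ages) if ages else None,
    }


def looks_plausible(summary, age_range=(30, 90)):
    """Loose plausibility check: mostly female, mean age within age_range."""
    if summary["n"] == 0 or summary["mean_age"] is None:
        return False
    female_share = summary["genders"].get("female", 0) / summary["n"]
    low, high = age_range
    return female_share > 0.5 and low <= summary["mean_age"] <= high


# Example with made-up rows (not real GDC records)
sample = [
    {"gender": "female", "age_at_index": 61},
    {"gender": "female", "age_at_index": 55},
    {"gender": "male", "age_at_index": 66},
    {"gender": "female", "age_at_index": None},
]
summary = summarize_rows(sample)
print(summary)
print("plausible:", looks_plausible(summary))
```

A failed check does not prove the data is wrong; it is a prompt to look at the query and the returned fields before analyzing further.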

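This is the smallest non-trivial query. From here you can pull mutations, gene expression, copy-number data, file metadata, and more. Much of this is open access, though some GDC data (such as raw sequencing reads) is controlled-access and requires authorization.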
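
Larger pulls usually need two things the snippet above does not show: combining several conditions in one filter, and paging through results. The sketch below only builds the request parameters and makes no network calls. It assumes the GDC filter syntax in which an "and" operator wraps several "in" clauses, and the from/size paging parameters described in the GDC API guide [1]. The helper names and the gender field path are illustrative assumptions.

```python
import json


def in_filter(field, values):
    """One GDC-style 'in' clause: the field must match any of the values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def and_filter(*clauses):
    """Combine clauses so that all of them must hold."""
    return {"op": "and", "content": list(clauses)}


def page_params(filters, fields, page_size, total):
    """Yield one query-parameter dict per page, using 0-based 'from' offsets."""
    for offset in range(0, total, page_size):
        yield {
            "filters": json.dumps(filters),
            "fields": ",".join(fields),
            "format": "json",
            "size": str(page_size),
            "from": str(offset),
        }


filters = and_filter(
    in_filter("cases.project.project_id", ["TCGA-BRCA"]),
    in_filter("cases.demographic.gender", ["female"]),  # assumed field path
)
pages = list(page_params(filters, ["submitter_id", "primary_site"], 100, 250))
print(len(pages), [p["from"] for p in pages])
```

In practice the total would come from the pagination block of the first response (data.pagination.total), and each dict would be passed as params= to requests.get as in §3.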

4. The expanded example in this repo

src/examplesapi/gdc_data_access.py ships a more complete script with three steps:

Search — list cases in TCGA-BRCA with demographics.
Mutations — query /ssms for somatic mutations in the first case.
File metadata — list RNA-seq quantification files for the cohort.

To run it:

bash

cd src/examplesapi
python gdc_data_access.py

The script uses the same public GDC endpoints; output structure follows the GDC API contract. If a network or proxy issue blocks calls, you'll see a RequestException; the snippet in §3 is the simpler fallback. Sources: [1]
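
For transient network failures, a small retry wrapper is one option to try before falling back. This is a generic sketch and not what gdc_data_access.py does: call_with_retries is a hypothetical helper, and the exception types to retry are passed in by the caller (for the scripts here that would be requests.RequestException).

```python
import time


def call_with_retries(func, retry_on, attempts=3, base_delay=1.0):
    """Call func(); on an exception listed in retry_on, wait and try again.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Demonstration with a fake call that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = call_with_retries(flaky, retry_on=(ConnectionError,), base_delay=0)
print(result, "after", calls["n"], "calls")
```

With requests, the call would look like call_with_retries(lambda: requests.get(API, params=params, timeout=30), retry_on=(requests.RequestException,)). Retrying does not help when a proxy blocks the host outright, so keep the attempt count small.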

5. After you've run the example: where to go next

Pick whichever matches what you want to do next:

You want to…	Read
Understand what cancer actually is	What is cancer?
Understand the data layers (DNA, RNA, protein…)	Omics overview
Pull patient cohorts and run analyses	Data & APIs, Examples
Process raw sequencing data	From FASTQ to variants
Build models, but responsibly	ML pitfalls in oncology, AI & ML overview
Know the legal and ethical frame	Limits & responsibility, Regulatory & ethics, Data governance & LGPD
Why the project exists	Mission

6. Running the docs site locally (optional)

If you want to preview your edits to the docs:

bash

# Requires Node 20+
npm ci
npm run docs:dev

Open the URL printed in the terminal (usually http://localhost:5173). The Portuguese version lives at /pt-br/.

To build the static site:

bash

npm run docs:build

Output goes to docs/.vitepress/dist.

7. Common pitfalls

Network errors against api.gdc.cancer.gov: corporate proxies and some national networks may require a proxy configuration. Try the snippet from a personal network first to isolate the cause.
Mismatched Python versions: a virtual environment is tied to whichever python created it; check python --version inside the venv.
Pandas version warnings: usually harmless; pin the version in requirements.txt if you need stability.
Treating GDC metadata as patient identity: it isn't, but handle downloaded data with the same hygiene you'd use for any health data anyway.
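
The first two pitfalls can often be diagnosed from inside Python. The sketch below reports which interpreter is running, whether it is inside a virtual environment, and which proxy settings Python picks up. It uses only the standard library; the function names are hypothetical, and the venv test assumes an environment created with python -m venv as in §2.

```python
import sys
import urllib.request


def in_virtualenv():
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


def environment_report():
    """Collect the facts most relevant to network and version problems."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        "executable": sys.executable,
        "in_virtualenv": in_virtualenv(),
        # Proxy settings found in the environment (e.g. HTTPS_PROXY) or,
        # on some platforms, in the system configuration
        "proxies": urllib.request.getproxies(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If in_virtualenv is False after activation, the shell is probably still resolving python to a system interpreter.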

References

[1] NCI Genomic Data Commons API documentation. https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/ (public; no authentication required for open data).
