From FASTQ to Variants
Understanding the bioinformatics pipeline from raw sequencing data to variant calling is essential for cancer genomics. This pipeline transforms raw sequencing reads into meaningful biological insights.
Skeptic's corner: Not all variants are created equal. The challenge is distinguishing between driver mutations, passenger mutations, and sequencing artifacts. Quality control is crucial at every step.
The Bioinformatics Pipeline
Overview
- Quality Control: Assess sequencing quality
- Preprocessing: Adapter trimming, quality filtering
- Alignment: Map reads to reference genome
- Post-alignment: Duplicate removal, quality recalibration
- Variant Calling: Identify genetic variants
- Annotation: Predict functional effects
- Filtering: Remove artifacts and low-quality variants
Quality Control and Preprocessing
FASTQ Format
@SEQ_ID
ATCGATCGATCGATCG
+
IIIIIIIIIIIIIIII- Line 1: Sequence identifier
- Line 2: DNA sequence
- Line 3: Separator (+)
- Line 4: Quality scores (Phred+33)
Quality Assessment
# FASTQ quality analysis
import matplotlib.pyplot as plt
import numpy as np
from Bio import SeqIO
def analyze_fastq_quality(fastq_file):
"""
Analyze quality metrics from FASTQ file
"""
qualities = []
lengths = []
gc_content = []
for record in SeqIO.parse(fastq_file, "fastq"):
# Quality scores
qual_scores = record.letter_annotations["phred_quality"]
qualities.extend(qual_scores)
# Read lengths
lengths.append(len(record.seq))
# GC content
gc_count = record.seq.count('G') + record.seq.count('C')
gc_content.append(gc_count / len(record.seq) * 100)
# Calculate statistics
mean_quality = np.mean(qualities)
mean_length = np.mean(lengths)
mean_gc = np.mean(gc_content)
return {
'mean_quality': mean_quality,
'mean_length': mean_length,
'mean_gc': mean_gc,
'qualities': qualities,
'lengths': lengths,
'gc_content': gc_content
}Preprocessing Steps
- Adapter trimming: Remove sequencing adapters
- Quality filtering: Remove low-quality reads
- Length filtering: Remove too short/long reads
- Contamination removal: Remove host DNA
Alignment and Mapping
Reference Genome
- Human: GRCh38 (hg38), GRCh37 (hg19)
- Mouse: GRCm39 (mm39), GRCm38 (mm38)
- Format: FASTA files with index
Alignment Algorithms
- BWA-MEM: Burrows-Wheeler Aligner
- STAR: Spliced Transcripts Alignment
- HISAT2: Hierarchical Indexing for Spliced Alignment
- Minimap2: Long-read alignment
Alignment Quality Metrics
- Mapping rate: Percentage of aligned reads
- Properly paired: Both reads aligned correctly
- Insert size: Distance between paired reads
- Coverage: Reads per genomic position
Post-Alignment Processing
Duplicate Removal
- PCR duplicates: Identical reads from PCR amplification
- Optical duplicates: Same cluster on flow cell
- Tools: Picard MarkDuplicates, samtools rmdup
Quality Recalibration
- Base quality scores: Adjust for systematic errors
- Tools: GATK BaseRecalibrator
- Reference: Known variant sites (dbSNP, Mills)
Indel Realignment
- Local realignment: Around indels
- Tools: GATK IndelRealigner (deprecated)
- Modern: HaplotypeCaller handles this
Variant Calling
Variant Types
- SNVs: Single nucleotide variants
- Indels: Insertions and deletions
- CNVs: Copy number variants
- SVs: Structural variants
Calling Algorithms
- GATK HaplotypeCaller: Gold standard for SNVs/indels
- Mutect2: Somatic variant calling
- VarScan: Alternative caller
- Strelka2: Somatic variant calling
Variant Calling Pipeline
# GATK variant calling pipeline
# 1. Preprocessing
gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
# 2. Quality recalibration
gatk BaseRecalibrator -I marked.bam -R reference.fa --known-sites dbsnp.vcf -O recal.table
gatk ApplyBQSR -I marked.bam -R reference.fa --bqsr-recal-file recal.table -O recal.bam
# 3. Variant calling
gatk HaplotypeCaller -R reference.fa -I recal.bam -O variants.vcf
# 4. Variant filtering
gatk VariantFiltration -R reference.fa -V variants.vcf -O filtered.vcf --filter-expression "QD < 2.0" --filter-name "QD2"Variant Annotation
Functional Annotation
- VEP: Variant Effect Predictor
- ANNOVAR: Comprehensive annotation
- SnpEff: Effect prediction
Annotation Categories
- Consequence: Missense, nonsense, synonymous
- Impact: High, moderate, low, modifier
- Gene: Affected gene
- Transcript: Affected transcript
- Protein: Amino acid change
Database Integration
- ClinVar: Clinical significance
- COSMIC: Cancer mutations
- gnomAD: Population frequencies
- OMIM: Disease associations
Variant Filtering and Prioritization
Quality Filters
- Quality score: QD, QUAL, GQ
- Read depth: DP, AD
- Strand bias: SB, FS
- Mapping quality: MQ, MQRankSum
Population Filters
- Allele frequency: gnomAD, ExAC
- Common variants: Remove if >1% frequency
- Rare variants: Keep if <0.1% frequency
Functional Prioritization
- Consequence: Missense > synonymous
- Impact: High > moderate > low
- Conservation: PhyloP, GERP++
- Pathogenicity: SIFT, PolyPhen, CADD
Laboratory Techniques
Quality Control Tools
- FastQC: Read quality assessment
- MultiQC: Aggregate QC reports
- Qualimap: BAM file analysis
- Picard: Java-based tools
Variant Analysis
- VCFtools: VCF file manipulation
- BCFtools: BCF file manipulation
- GATK: Comprehensive toolkit
- ANNOVAR: Annotation and filtering
Clinical Relevance
Diagnostic Applications
- Germline variants: Hereditary cancer risk
- Somatic variants: Tumor characterization
- Pharmacogenomics: Drug response prediction
- Biomarkers: Therapeutic targets
Therapeutic Implications
- Targeted therapy: Mutation-guided treatment
- Immunotherapy: TMB, MSI status
- Resistance mechanisms: Secondary mutations
- Clinical trials: Biomarker-driven studies
Research Applications
Cancer Genomics
- Driver mutation identification: Recurrent variants
- Pathway analysis: Affected biological processes
- Evolutionary analysis: Tumor progression
- Heterogeneity analysis: Subclonal populations
Precision Medicine
- Molecular profiling: Comprehensive characterization
- Targeted therapy: Mutation-guided treatment
- Biomarker discovery: Response prediction
- Clinical trials: Biomarker-driven studies
Practical Considerations
Computational Requirements
- Memory: 32-64 GB RAM
- Storage: 1-10 TB per sample
- CPU: 16-32 cores
- Time: 1-3 days per sample
Data Management
- Raw data: FASTQ files
- Intermediate files: BAM files
- Final results: VCF files
- Backup: Multiple copies
FAQ
Q: How do we know if a variant is real? A: Through quality metrics, population frequencies, and functional evidence.
Q: What's the difference between germline and somatic variants? A: Germline variants are inherited and present in all cells; somatic variants are acquired and present only in tumor cells.
Q: How do we prioritize variants for further study? A: Based on quality, frequency, functional impact, and clinical relevance.
References (APA Style)
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., ... & Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213-219.
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., ... & Wilson, R. K. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568-576.
Contributing
- Review existing content for accuracy
- Add missing pipeline steps or tools
- Create practical examples and code snippets
- Cite recent research and software updates
This article provides the foundation for understanding the bioinformatics pipeline from sequencing to variants. Master these concepts to understand cancer genomics and precision medicine.