Skip to content

From FASTQ to Variants

Understanding the bioinformatics pipeline from raw sequencing data to variant calling is essential for cancer genomics. This pipeline transforms raw sequencing reads into meaningful biological insights.

Skeptic's corner: Not all variants are created equal. The challenge is distinguishing between driver mutations, passenger mutations, and sequencing artifacts. Quality control is crucial at every step.

FASTQ to variants pipeline


The Bioinformatics Pipeline

Overview

  1. Quality Control: Assess sequencing quality
  2. Preprocessing: Adapter trimming, quality filtering
  3. Alignment: Map reads to reference genome
  4. Post-alignment: Duplicate removal, quality recalibration
  5. Variant Calling: Identify genetic variants
  6. Annotation: Predict functional effects
  7. Filtering: Remove artifacts and low-quality variants

Quality Control and Preprocessing

FASTQ Format

@SEQ_ID
ATCGATCGATCGATCG
+
IIIIIIIIIIIIIIII
  • Line 1: Sequence identifier
  • Line 2: DNA sequence
  • Line 3: Separator (+)
  • Line 4: Quality scores (Phred+33)

Quality Assessment

python
# FASTQ quality analysis
import matplotlib.pyplot as plt
import numpy as np
from Bio import SeqIO

def analyze_fastq_quality(fastq_file):
    """
    Analyze quality metrics from FASTQ file
    """
    qualities = []
    lengths = []
    gc_content = []
    
    for record in SeqIO.parse(fastq_file, "fastq"):
        # Quality scores
        qual_scores = record.letter_annotations["phred_quality"]
        qualities.extend(qual_scores)
        
        # Read lengths
        lengths.append(len(record.seq))
        
        # GC content
        gc_count = record.seq.count('G') + record.seq.count('C')
        gc_content.append(gc_count / len(record.seq) * 100)
    
    # Calculate statistics
    mean_quality = np.mean(qualities)
    mean_length = np.mean(lengths)
    mean_gc = np.mean(gc_content)
    
    return {
        'mean_quality': mean_quality,
        'mean_length': mean_length,
        'mean_gc': mean_gc,
        'qualities': qualities,
        'lengths': lengths,
        'gc_content': gc_content
    }

Preprocessing Steps

  • Adapter trimming: Remove sequencing adapters
  • Quality filtering: Remove low-quality reads
  • Length filtering: Remove too short/long reads
  • Contamination removal: Remove host DNA

Alignment and Mapping

Reference Genome

  • Human: GRCh38 (hg38), GRCh37 (hg19)
  • Mouse: GRCm39 (mm39), GRCm38 (mm38)
  • Format: FASTA files with index

Alignment Algorithms

  • BWA-MEM: Burrows-Wheeler Aligner
  • STAR: Spliced Transcripts Alignment
  • HISAT2: Hierarchical Indexing for Spliced Alignment
  • Minimap2: Long-read alignment

Alignment Quality Metrics

  • Mapping rate: Percentage of aligned reads
  • Properly paired: Both reads aligned correctly
  • Insert size: Distance between paired reads
  • Coverage: Reads per genomic position

Post-Alignment Processing

Duplicate Removal

  • PCR duplicates: Identical reads from PCR amplification
  • Optical duplicates: Same cluster on flow cell
  • Tools: Picard MarkDuplicates, samtools rmdup

Quality Recalibration

  • Base quality scores: Adjust for systematic errors
  • Tools: GATK BaseRecalibrator
  • Reference: Known variant sites (dbSNP, Mills)

Indel Realignment

  • Local realignment: Around indels
  • Tools: GATK IndelRealigner (deprecated)
  • Modern: HaplotypeCaller handles this

Variant Calling

Variant Types

  • SNVs: Single nucleotide variants
  • Indels: Insertions and deletions
  • CNVs: Copy number variants
  • SVs: Structural variants

Calling Algorithms

  • GATK HaplotypeCaller: Gold standard for SNVs/indels
  • Mutect2: Somatic variant calling
  • VarScan: Alternative caller
  • Strelka2: Somatic variant calling

Variant Calling Pipeline

bash
# GATK variant calling pipeline
# 1. Preprocessing
gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt

# 2. Quality recalibration
gatk BaseRecalibrator -I marked.bam -R reference.fa --known-sites dbsnp.vcf -O recal.table
gatk ApplyBQSR -I marked.bam -R reference.fa --bqsr-recal-file recal.table -O recal.bam

# 3. Variant calling
gatk HaplotypeCaller -R reference.fa -I recal.bam -O variants.vcf

# 4. Variant filtering
gatk VariantFiltration -R reference.fa -V variants.vcf -O filtered.vcf --filter-expression "QD < 2.0" --filter-name "QD2"

Variant Annotation

Functional Annotation

  • VEP: Variant Effect Predictor
  • ANNOVAR: Comprehensive annotation
  • SnpEff: Effect prediction

Annotation Categories

  • Consequence: Missense, nonsense, synonymous
  • Impact: High, moderate, low, modifier
  • Gene: Affected gene
  • Transcript: Affected transcript
  • Protein: Amino acid change

Database Integration

  • ClinVar: Clinical significance
  • COSMIC: Cancer mutations
  • gnomAD: Population frequencies
  • OMIM: Disease associations

Variant Filtering and Prioritization

Quality Filters

  • Quality score: QD, QUAL, GQ
  • Read depth: DP, AD
  • Strand bias: SB, FS
  • Mapping quality: MQ, MQRankSum

Population Filters

  • Allele frequency: gnomAD, ExAC
  • Common variants: Remove if >1% frequency
  • Rare variants: Keep if <0.1% frequency

Functional Prioritization

  • Consequence: Missense > synonymous
  • Impact: High > moderate > low
  • Conservation: PhyloP, GERP++
  • Pathogenicity: SIFT, PolyPhen, CADD

Laboratory Techniques

Quality Control Tools

  • FastQC: Read quality assessment
  • MultiQC: Aggregate QC reports
  • Qualimap: BAM file analysis
  • Picard: Java-based tools

Variant Analysis

  • VCFtools: VCF file manipulation
  • BCFtools: BCF file manipulation
  • GATK: Comprehensive toolkit
  • ANNOVAR: Annotation and filtering

Clinical Relevance

Diagnostic Applications

  • Germline variants: Hereditary cancer risk
  • Somatic variants: Tumor characterization
  • Pharmacogenomics: Drug response prediction
  • Biomarkers: Therapeutic targets

Therapeutic Implications

  • Targeted therapy: Mutation-guided treatment
  • Immunotherapy: TMB, MSI status
  • Resistance mechanisms: Secondary mutations
  • Clinical trials: Biomarker-driven studies

Research Applications

Cancer Genomics

  1. Driver mutation identification: Recurrent variants
  2. Pathway analysis: Affected biological processes
  3. Evolutionary analysis: Tumor progression
  4. Heterogeneity analysis: Subclonal populations

Precision Medicine

  1. Molecular profiling: Comprehensive characterization
  2. Targeted therapy: Mutation-guided treatment
  3. Biomarker discovery: Response prediction
  4. Clinical trials: Biomarker-driven studies

Practical Considerations

Computational Requirements

  • Memory: 32-64 GB RAM
  • Storage: 1-10 TB per sample
  • CPU: 16-32 cores
  • Time: 1-3 days per sample

Data Management

  • Raw data: FASTQ files
  • Intermediate files: BAM files
  • Final results: VCF files
  • Backup: Multiple copies

FAQ

Q: How do we know if a variant is real? A: Through quality metrics, population frequencies, and functional evidence.

Q: What's the difference between germline and somatic variants? A: Germline variants are inherited and present in all cells; somatic variants are acquired and present only in tumor cells.

Q: How do we prioritize variants for further study? A: Based on quality, frequency, functional impact, and clinical relevance.


References (APA Style)

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.

Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., ... & Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213-219.

Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., ... & Wilson, R. K. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568-576.


Contributing

  1. Review existing content for accuracy
  2. Add missing pipeline steps or tools
  3. Create practical examples and code snippets
  4. Cite recent research and software updates

This article provides the foundation for understanding the bioinformatics pipeline from sequencing to variants. Master these concepts to understand cancer genomics and precision medicine.

Early public release. Content evolves through continuous review. Questions: [email protected] · CC BY 4.0 where applicable.