Step-by-Step Manual: SNP Calling Workflow

January 9, 2025 Off By admin

SNP calling has evolved with the introduction of new tools and improvements in existing ones. Below is an updated workflow incorporating recent tools and best practices for SNP calling, along with tips for optimizing your analysis.

Table of Contents

1. Prepare Your Data

Input Data: Paired-end or single-end FASTQ files (e.g., R1.fastq.gz, R2.fastq.gz).
Reference Genome: A reference genome in FASTA format (e.g., reference.fasta).
Known Variants: A VCF file of known variants (e.g., known_sites.vcf) for base quality recalibration.

2. Align Reads to the Reference Genome

Use BWA-MEM (Burrows-Wheeler Aligner) for alignment, as it is optimized for Illumina reads.

Step 2.1: Index the Reference Genome

bwa index reference.fasta

Step 2.2: Align Reads

For paired-end reads:

bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

For single-end reads:

bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz > aln.sam

3. Convert SAM to BAM and Sort

Use SAMtools or sambamba for faster processing.

Step 3.1: Convert SAM to BAM

samtools view -S -b aln.sam > aln.bam

Step 3.2: Sort BAM File

sambamba sort -o aln_sorted.bam aln.bam

Tip: Use sambamba for faster sorting and indexing.

4. Mark Duplicates

Use Picard or sambamba to mark duplicate reads.

Step 4.1: Mark Duplicates

sambamba markdup -t 8 aln_sorted.bam aln_dedup.bam

-t: Number of threads.
Tip: Use sambamba markdup for faster duplicate marking.

5. Base Quality Score Recalibration

Use GATK for base quality score recalibration.

Step 5.1: Generate Recalibration Table

gatk BaseRecalibrator -R reference.fasta -I aln_dedup.bam --known-sites known_sites.vcf -O recal_data.table

Step 5.2: Apply Recalibration

gatk ApplyBQSR -R reference.fasta -I aln_dedup.bam --bqsr-recal-file recal_data.table -O aln_recalibrated.bam

6. Variant Calling

Use GATK HaplotypeCaller or DeepVariant for variant calling.

Step 6.1: Call Variants with GATK HaplotypeCaller

gatk HaplotypeCaller -R reference.fasta -I aln_recalibrated.bam -O raw_variants.vcf

Step 6.2: Call Variants with DeepVariant

sudo docker run -v "/path/to/data:/data" google/deepvariant:latest /opt/deepvariant/bin/run_deepvariant --model_type=WGS --ref=/data/reference.fasta --reads=/data/aln_recalibrated.bam --output_vcf=/data/raw_variants.vcf

Tip: Use DeepVariant for highly accurate SNP calling, especially for complex genomes.

7. Filter Variants

Filter variants based on quality scores and other metrics.

Step 7.1: Extract SNPs

gatk SelectVariants -R reference.fasta -V raw_variants.vcf --select-type-to-include SNP -O raw_snps.vcf

Step 7.2: Filter SNPs

gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filter-name "SNP_FILTER" -O filtered_snps.vcf

8. Annotate Variants

Use SnpEff, VEP, or ANNOVAR for variant annotation.

Step 8.1: Annotate with SnpEff

java -jar snpEff.jar -v genome_version filtered_snps.vcf > annotated_snps.vcf

Step 8.2: Annotate with VEP

vep -i filtered_snps.vcf -o annotated_snps.vcf --cache --dir_cache /path/to/cache --species homo_sapiens --assembly GRCh38

9. Visualize Results

Use IGV (Integrative Genomics Viewer) or Tablet for visualizing aligned reads and variants.
Generate summary statistics and plots using VCFtools or R.

10. Automate the Workflow

Use workflow managers like Snakemake, Nextflow, or WDL to automate and reproduce the pipeline.

Example Snakemake Workflow

rule all:
    input:
        "annotated_snps.vcf"

rule align:
    input:
        "reference.fasta",
        "R1.fastq.gz",
        "R2.fastq.gz"
    output:
        "aln.sam"
    shell:
        "bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01' {input} > {output}"

rule sort:
    input:
        "aln.sam"
    output:
        "aln_sorted.bam"
    shell:
        "sambamba sort -o {output} {input}"

rule markdup:
    input:
        "aln_sorted.bam"
    output:
        "aln_dedup.bam"
    shell:
        "sambamba markdup -t 8 {input} {output}"

rule recalibrate:
    input:
        "aln_dedup.bam",
        "known_sites.vcf"
    output:
        "aln_recalibrated.bam"
    shell:
        "gatk BaseRecalibrator -R reference.fasta -I {input[0]} --known-sites {input[1]} -O recal_data.table && "
        "gatk ApplyBQSR -R reference.fasta -I {input[0]} --bqsr-recal-file recal_data.table -O {output}"

rule call_variants:
    input:
        "aln_recalibrated.bam"
    output:
        "raw_variants.vcf"
    shell:
        "gatk HaplotypeCaller -R reference.fasta -I {input} -O {output}"

rule filter_variants:
    input:
        "raw_variants.vcf"
    output:
        "filtered_snps.vcf"
    shell:
        "gatk SelectVariants -R reference.fasta -V {input} --select-type-to-include SNP -O raw_snps.vcf && "
        "gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression 'QD < 2.0 || FS > 60.0 || MQ < 40.0' --filter-name 'SNP_FILTER' -O {output}"

rule annotate:
    input:
        "filtered_snps.vcf"
    output:
        "annotated_snps.vcf"
    shell:
        "java -jar snpEff.jar -v genome_version {input} > {output}"

Recent Tools and Tips

DeepVariant: A deep learning-based variant caller from Google, known for high accuracy.
sambamba: Faster alternative to SAMtools for sorting and marking duplicates.
VEP (Variant Effect Predictor): Comprehensive annotation tool for variants.
Snakemake/Nextflow: Workflow managers for automating and reproducing pipelines.
MultiQC: Summarize QC metrics from multiple tools in one report.

Tips for SNP Calling

Use Known Variants: Incorporate known variants for base quality recalibration.
Filter Carefully: Adjust filtering thresholds based on your data quality.
Parallelize Tasks: Use tools like sambamba and workflow managers to speed up processing.
Validate Results: Cross-check results with orthogonal methods (e.g., Sanger sequencing).

By following this updated workflow, you can efficiently and accurately call SNPs using the latest tools and best practices.