computer-bioinformatics

Step-by-Step Manual: SNP Calling Workflow

January 9, 2025 Off By admin
Shares

SNP calling has evolved with the introduction of new tools and improvements in existing ones. Below is an updated workflow incorporating recent tools and best practices for SNP calling, along with tips for optimizing your analysis.


1. Prepare Your Data

  • Input Data: Paired-end or single-end FASTQ files (e.g., R1.fastq.gzR2.fastq.gz).
  • Reference Genome: A reference genome in FASTA format (e.g., reference.fasta).
  • Known Variants: A VCF file of known variants (e.g., known_sites.vcf) for base quality recalibration.

2. Align Reads to the Reference Genome

Use BWA-MEM (Burrows-Wheeler Aligner) for alignment, as it is optimized for Illumina reads.

Step 2.1: Index the Reference Genome

bash
Copy
bwa index reference.fasta

Step 2.2: Align Reads

For paired-end reads:

bash
Copy
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

For single-end reads:

bash
Copy
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz > aln.sam

3. Convert SAM to BAM and Sort

Use SAMtools or sambamba for faster processing.

Step 3.1: Convert SAM to BAM

bash
Copy
samtools view -S -b aln.sam > aln.bam

Step 3.2: Sort BAM File

bash
Copy
sambamba sort -o aln_sorted.bam aln.bam
  • Tip: Use sambamba for faster sorting and indexing.

4. Mark Duplicates

Use Picard or sambamba to mark duplicate reads.

Step 4.1: Mark Duplicates

bash
Copy
sambamba markdup -t 8 aln_sorted.bam aln_dedup.bam
  • -t: Number of threads.
  • Tip: Use sambamba markdup for faster duplicate marking.

5. Base Quality Score Recalibration

Use GATK for base quality score recalibration.

Step 5.1: Generate Recalibration Table

bash
Copy
gatk BaseRecalibrator -R reference.fasta -I aln_dedup.bam --known-sites known_sites.vcf -O recal_data.table

Step 5.2: Apply Recalibration

bash
Copy
gatk ApplyBQSR -R reference.fasta -I aln_dedup.bam --bqsr-recal-file recal_data.table -O aln_recalibrated.bam

6. Variant Calling

Use GATK HaplotypeCaller or DeepVariant for variant calling.

Step 6.1: Call Variants with GATK HaplotypeCaller

bash
Copy
gatk HaplotypeCaller -R reference.fasta -I aln_recalibrated.bam -O raw_variants.vcf

Step 6.2: Call Variants with DeepVariant

bash
Copy
sudo docker run -v "/path/to/data:/data" google/deepvariant:latest /opt/deepvariant/bin/run_deepvariant --model_type=WGS --ref=/data/reference.fasta --reads=/data/aln_recalibrated.bam --output_vcf=/data/raw_variants.vcf
  • Tip: Use DeepVariant for highly accurate SNP calling, especially for complex genomes.

7. Filter Variants

Filter variants based on quality scores and other metrics.

Step 7.1: Extract SNPs

bash
Copy
gatk SelectVariants -R reference.fasta -V raw_variants.vcf --select-type-to-include SNP -O raw_snps.vcf

Step 7.2: Filter SNPs

bash
Copy
gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filter-name "SNP_FILTER" -O filtered_snps.vcf

8. Annotate Variants

Use SnpEffVEP, or ANNOVAR for variant annotation.

Step 8.1: Annotate with SnpEff

bash
Copy
java -jar snpEff.jar -v genome_version filtered_snps.vcf > annotated_snps.vcf

Step 8.2: Annotate with VEP

bash
Copy
vep -i filtered_snps.vcf -o annotated_snps.vcf --cache --dir_cache /path/to/cache --species homo_sapiens --assembly GRCh38

9. Visualize Results

  • Use IGV (Integrative Genomics Viewer) or Tablet for visualizing aligned reads and variants.
  • Generate summary statistics and plots using VCFtools or R.

10. Automate the Workflow

  • Use workflow managers like SnakemakeNextflow, or WDL to automate and reproduce the pipeline.

Example Snakemake Workflow

Copy
rule all:
    input:
        "annotated_snps.vcf"

rule align:
    input:
        "reference.fasta",
        "R1.fastq.gz",
        "R2.fastq.gz"
    output:
        "aln.sam"
    shell:
        "bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01' {input} > {output}"

rule sort:
    input:
        "aln.sam"
    output:
        "aln_sorted.bam"
    shell:
        "sambamba sort -o {output} {input}"

rule markdup:
    input:
        "aln_sorted.bam"
    output:
        "aln_dedup.bam"
    shell:
        "sambamba markdup -t 8 {input} {output}"

rule recalibrate:
    input:
        "aln_dedup.bam",
        "known_sites.vcf"
    output:
        "aln_recalibrated.bam"
    shell:
        "gatk BaseRecalibrator -R reference.fasta -I {input[0]} --known-sites {input[1]} -O recal_data.table && "
        "gatk ApplyBQSR -R reference.fasta -I {input[0]} --bqsr-recal-file recal_data.table -O {output}"

rule call_variants:
    input:
        "aln_recalibrated.bam"
    output:
        "raw_variants.vcf"
    shell:
        "gatk HaplotypeCaller -R reference.fasta -I {input} -O {output}"

rule filter_variants:
    input:
        "raw_variants.vcf"
    output:
        "filtered_snps.vcf"
    shell:
        "gatk SelectVariants -R reference.fasta -V {input} --select-type-to-include SNP -O raw_snps.vcf && "
        "gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression 'QD < 2.0 || FS > 60.0 || MQ < 40.0' --filter-name 'SNP_FILTER' -O {output}"

rule annotate:
    input:
        "filtered_snps.vcf"
    output:
        "annotated_snps.vcf"
    shell:
        "java -jar snpEff.jar -v genome_version {input} > {output}"

Recent Tools and Tips

  1. DeepVariant: A deep learning-based variant caller from Google, known for high accuracy.
  2. sambamba: Faster alternative to SAMtools for sorting and marking duplicates.
  3. VEP (Variant Effect Predictor): Comprehensive annotation tool for variants.
  4. Snakemake/Nextflow: Workflow managers for automating and reproducing pipelines.
  5. MultiQC: Summarize QC metrics from multiple tools in one report.

Tips for SNP Calling

  • Use Known Variants: Incorporate known variants for base quality recalibration.
  • Filter Carefully: Adjust filtering thresholds based on your data quality.
  • Parallelize Tasks: Use tools like sambamba and workflow managers to speed up processing.
  • Validate Results: Cross-check results with orthogonal methods (e.g., Sanger sequencing).

By following this updated workflow, you can efficiently and accurately call SNPs using the latest tools and best practices.

Shares