Step-by-Step Manual: SNP Calling Workflow
January 9, 2025SNP calling has evolved with the introduction of new tools and improvements in existing ones. Below is an updated workflow incorporating recent tools and best practices for SNP calling, along with tips for optimizing your analysis.
1. Prepare Your Data
- Input Data: Paired-end or single-end FASTQ files (e.g.,
R1.fastq.gz
,R2.fastq.gz
). - Reference Genome: A reference genome in FASTA format (e.g.,
reference.fasta
). - Known Variants: A VCF file of known variants (e.g.,
known_sites.vcf
) for base quality recalibration.
2. Align Reads to the Reference Genome
Use BWA-MEM (Burrows-Wheeler Aligner) for alignment, as it is optimized for Illumina reads.
Step 2.1: Index the Reference Genome
bwa index reference.fasta
Step 2.2: Align Reads
For paired-end reads:
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
For single-end reads:
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz > aln.sam
3. Convert SAM to BAM and Sort
Use SAMtools or sambamba for faster processing.
Step 3.1: Convert SAM to BAM
samtools view -S -b aln.sam > aln.bam
Step 3.2: Sort BAM File
sambamba sort -o aln_sorted.bam aln.bam
- Tip: Use
sambamba
for faster sorting and indexing.
4. Mark Duplicates
Use Picard or sambamba to mark duplicate reads.
Step 4.1: Mark Duplicates
sambamba markdup -t 8 aln_sorted.bam aln_dedup.bam
-t
: Number of threads.- Tip: Use
sambamba markdup
for faster duplicate marking.
5. Base Quality Score Recalibration
Use GATK for base quality score recalibration.
Step 5.1: Generate Recalibration Table
gatk BaseRecalibrator -R reference.fasta -I aln_dedup.bam --known-sites known_sites.vcf -O recal_data.table
Step 5.2: Apply Recalibration
gatk ApplyBQSR -R reference.fasta -I aln_dedup.bam --bqsr-recal-file recal_data.table -O aln_recalibrated.bam
6. Variant Calling
Use GATK HaplotypeCaller or DeepVariant for variant calling.
Step 6.1: Call Variants with GATK HaplotypeCaller
gatk HaplotypeCaller -R reference.fasta -I aln_recalibrated.bam -O raw_variants.vcf
Step 6.2: Call Variants with DeepVariant
sudo docker run -v "/path/to/data:/data" google/deepvariant:latest /opt/deepvariant/bin/run_deepvariant --model_type=WGS --ref=/data/reference.fasta --reads=/data/aln_recalibrated.bam --output_vcf=/data/raw_variants.vcf
- Tip: Use DeepVariant for highly accurate SNP calling, especially for complex genomes.
7. Filter Variants
Filter variants based on quality scores and other metrics.
Step 7.1: Extract SNPs
gatk SelectVariants -R reference.fasta -V raw_variants.vcf --select-type-to-include SNP -O raw_snps.vcf
Step 7.2: Filter SNPs
gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filter-name "SNP_FILTER" -O filtered_snps.vcf
8. Annotate Variants
Use SnpEff, VEP, or ANNOVAR for variant annotation.
Step 8.1: Annotate with SnpEff
java -jar snpEff.jar -v genome_version filtered_snps.vcf > annotated_snps.vcf
Step 8.2: Annotate with VEP
vep -i filtered_snps.vcf -o annotated_snps.vcf --cache --dir_cache /path/to/cache --species homo_sapiens --assembly GRCh38
9. Visualize Results
- Use IGV (Integrative Genomics Viewer) or Tablet for visualizing aligned reads and variants.
- Generate summary statistics and plots using VCFtools or R.
10. Automate the Workflow
Example Snakemake Workflow
rule all: input: "annotated_snps.vcf" rule align: input: "reference.fasta", "R1.fastq.gz", "R2.fastq.gz" output: "aln.sam" shell: "bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01' {input} > {output}" rule sort: input: "aln.sam" output: "aln_sorted.bam" shell: "sambamba sort -o {output} {input}" rule markdup: input: "aln_sorted.bam" output: "aln_dedup.bam" shell: "sambamba markdup -t 8 {input} {output}" rule recalibrate: input: "aln_dedup.bam", "known_sites.vcf" output: "aln_recalibrated.bam" shell: "gatk BaseRecalibrator -R reference.fasta -I {input[0]} --known-sites {input[1]} -O recal_data.table && " "gatk ApplyBQSR -R reference.fasta -I {input[0]} --bqsr-recal-file recal_data.table -O {output}" rule call_variants: input: "aln_recalibrated.bam" output: "raw_variants.vcf" shell: "gatk HaplotypeCaller -R reference.fasta -I {input} -O {output}" rule filter_variants: input: "raw_variants.vcf" output: "filtered_snps.vcf" shell: "gatk SelectVariants -R reference.fasta -V {input} --select-type-to-include SNP -O raw_snps.vcf && " "gatk VariantFiltration -R reference.fasta -V raw_snps.vcf --filter-expression 'QD < 2.0 || FS > 60.0 || MQ < 40.0' --filter-name 'SNP_FILTER' -O {output}" rule annotate: input: "filtered_snps.vcf" output: "annotated_snps.vcf" shell: "java -jar snpEff.jar -v genome_version {input} > {output}"
Recent Tools and Tips
- DeepVariant: A deep learning-based variant caller from Google, known for high accuracy.
- sambamba: Faster alternative to SAMtools for sorting and marking duplicates.
- VEP (Variant Effect Predictor): Comprehensive annotation tool for variants.
- Snakemake/Nextflow: Workflow managers for automating and reproducing pipelines.
- MultiQC: Summarize QC metrics from multiple tools in one report.
Tips for SNP Calling
- Use Known Variants: Incorporate known variants for base quality recalibration.
- Filter Carefully: Adjust filtering thresholds based on your data quality.
- Parallelize Tasks: Use tools like
sambamba
and workflow managers to speed up processing. - Validate Results: Cross-check results with orthogonal methods (e.g., Sanger sequencing).
By following this updated workflow, you can efficiently and accurately call SNPs using the latest tools and best practices.