bioinformatics-statistics

SNP Calling: A Step-by-Step Guide

January 3, 2025 Off By admin
Shares

SNP Calling refers to the process of identifying single nucleotide polymorphisms (SNPs) in sequencing data, distinguishing genuine variations from sequencing errors. Here’s a step-by-step guide to perform SNP calling using bioinformatics tools and scripts:


Step 1: Prepare Your Data

  1. Obtain Sequence Data: Ensure you have quality-checked FASTQ files containing your sequencing reads.
  2. Reference Genome: Download the reference genome for your organism of interest (e.g., from NCBI or Ensembl).

Step 2: Align Reads to the Reference Genome

Use an aligner like BWA or Bowtie2 to map reads to the reference genome:

bash
bwa mem reference_genome.fasta input_reads.fastq > aligned_reads.sam

Convert the SAM file to BAM, sort, and index:

bash
samtools view -Sb aligned_reads.sam | samtools sort -o aligned_reads.sorted.bam
samtools index aligned_reads.sorted.bam

Step 3: Mark Duplicates (Optional)

Use Picard Tools to mark PCR duplicates:

bash
picard MarkDuplicates I=aligned_reads.sorted.bam O=dedup_reads.bam M=metrics.txt
samtools index dedup_reads.bam

Step 4: Call Variants

Use a variant caller like GATK HaplotypeCaller:

bash
gatk HaplotypeCaller \
-R reference_genome.fasta \
-I dedup_reads.bam \
-O raw_variants.vcf

Alternatively, use BCFtools:

bash
bcftools mpileup -f reference_genome.fasta dedup_reads.bam | bcftools call -mv -Ov -o raw_variants.vcf

Step 5: Filter Variants

Filter SNPs based on quality metrics:

bash
gatk VariantFiltration \
-R reference_genome.fasta \
-V raw_variants.vcf \
--filter-expression "QUAL < 30.0 || DP < 10" \
--filter-name "LowQual" \
-O filtered_variants.vcf

Step 6: Annotate Variants

Annotate SNPs with functional information using SnpEff or ANNOVAR:

bash
java -jar snpEff.jar GRCh38.99 filtered_variants.vcf > annotated_variants.vcf

Recent Online Tools and Software for SNP Calling


Example Python Script for Simple SNP Calling (using Pysam)

import pysam

bamfile = pysam.AlignmentFile("aligned_reads.sorted.bam", "rb")
ref_fasta = pysam.FastaFile("reference_genome.fasta")

for pileupcolumn in bamfile.pileup():
ref_base = ref_fasta.fetch(reference=pileupcolumn.reference_name, start=pileupcolumn.reference_pos, end=pileupcolumn.reference_pos+1)
base_counts = {base: 0 for base in 'ACGTN'}
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
base_counts[pileupread.alignment.query_sequence[pileupread.query_position]] += 1
print(f"Position: {pileupcolumn.reference_pos + 1}, Ref: {ref_base}, Counts: {base_counts}")


Conclusion

SNP calling involves aligning reads, identifying variants, and filtering them for accuracy. Tools like GATK, BCFtools, and DeepVariant offer robust solutions. Combining computational methods with biological interpretation ensures high-confidence SNP discovery.

Shares