python-bioinformatics-basics

Understanding Samtools View Output

January 3, 2025 Off By admin
Shares

This step-by-step guide will help you comprehend the output of samtools view. The guide covers essential details about the fields in the output, provides example scripts for processing, and mentions recent tools and software for interpreting the data.


Step 1: Command Overview

The samtools view command retrieves alignment information from BAM or SAM files for specified regions.

Example:

bash
samtools view BAMFILE chr2:1,000,000-2,000,000

This command extracts alignments from BAMFILE in the region spanning 1,000,000 to 2,000,000 on chromosome 2.


Step 2: Output Format Explanation

The output of samtools view is a tab-delimited format, following the SAM format. Each row represents a single alignment with the following columns:

  1. QNAME: Query name (e.g., read identifier).
  2. FLAG: Bitwise flag indicating alignment information (e.g., strand, paired-end, etc.).
  3. RNAME: Reference sequence name (e.g., chromosome).
  4. POS: Leftmost 1-based position of the alignment on the reference.
  5. MAPQ: Mapping quality (Phred scale).
  6. CIGAR: Encoded representation of alignment (e.g., matches, insertions, deletions).
  7. MRNM: Mate reference sequence (= if the same as RNAME).
  8. MPOS: 1-based position of the mate.
  9. ISIZE: Inferred insert size.
  10. SEQ: Aligned sequence.
  11. QUAL: ASCII-encoded quality score for each base in the sequence.

Optional fields (tags):

  • RG: Read group.
  • NM: Edit distance.
  • OQ: Original quality.
  • E2: Second sequence.
  • Additional tags may vary.

Step 3: Script for Parsing SAM/BAM

Using Python (pysam):

import pysam

# Open BAM file
bamfile = pysam.AlignmentFile("BAMFILE.bam", "rb")

# Fetch alignments in a region
for read in bamfile.fetch("chr2", 1000000, 2000000):
print(f"QNAME: {read.query_name}")
print(f"FLAG: {read.flag}")
print(f"RNAME: {read.reference_name}")
print(f"POS: {read.reference_start + 1}")
print(f"MAPQ: {read.mapping_quality}")
print(f"CIGAR: {read.cigarstring}")
print(f"SEQ: {read.query_sequence}")
print(f"QUAL: {read.qual}")
print("-" * 40)

Using Unix:

bash
samtools view BAMFILE.bam chr2:1000000-2000000 | cut -f1,2,3,4,5,6

Using R (Rsamtools):

R
library(Rsamtools)

# Open BAM file
bam <- scanBam("BAMFILE.bam", param=ScanBamParam(which=GRanges("chr2", IRanges(1000000, 2000000))))
bam[[1]]


Step 4: Online Tools for SAM/BAM Interpretation

  1. IGV (Integrative Genomics Viewer)
    Visualize BAM/SAM files alongside the genome.
    Website: IGV
  2. Galaxy
    Offers tools for processing and interpreting BAM/SAM files.
    Website: Galaxy Project
  3. SAMStat
    Generates summaries and quality control metrics for SAM/BAM files.
    Website: SAMStat
  4. BamTools
    Comprehensive toolkit for BAM file analysis.
    Website: BamTools

Step 5: Reference Documentation

Shares