computer-bioinformatics

Guide to Extracting Reads from BAM Files for Specific Genomic Regions

December 27, 2024 Off By admin
Shares

Extracting reads from a BAM file that fall entirely within a given region. The process ensures you exclude reads overlapping but not contained within the specified region.


Step-by-Step Guide to Extract Reads Entirely Within a Region

1. Requirements

  • Software:
  • Input files:
    • A sorted and indexed BAM file (.bam, .bai)
    • A region file in BED or GFF format (e.g., regions.bed).

2. Install Necessary Tools

  • On a Unix/Linux system, use the following commands to install required tools:
    bash
    sudo apt-get install samtools bedtools

    Alternatively, use conda:

    bash
    conda install -c bioconda samtools bedtools

3. Prepare Input Files

  • Ensure your BAM file is sorted and indexed:
    bash
    samtools sort input.bam -o input_sorted.bam
    samtools index input_sorted.bam
  • Format the BED file to specify the regions:
    text
    chr10 1000 2000

    Save this as regions.bed.

4. Extract Reads Within Regions

  • Use samtools view to extract reads overlapping the region:
    bash
    samtools view -L regions.bed input_sorted.bam > overlapping_reads.sam
  • Convert to BAM format and include the header for downstream compatibility:
    bash
    samtools view -b -L regions.bed input_sorted.bam > overlapping_reads.bam

5. Filter Reads Entirely Within a Region

  • Convert BAM to BED format for detailed filtering:
    bash
    bedtools bamtobed -i overlapping_reads.bam > overlapping_reads.bed
  • Use awk or bedtools to filter reads entirely within regions:
    bash
    bedtools intersect -a overlapping_reads.bed -b regions.bed -f 1.0 > reads_within.bed
    • Explanation:
      • -f 1.0 ensures that reads are fully contained within the region.
      • The output file reads_within.bed contains filtered reads.

6. Convert Filtered BED Back to BAM

  • Convert the filtered BED file back to BAM format:
    bash
    bedtools bedtobam -i reads_within.bed -g genome_file > filtered_reads.bam

    Replace genome_file with your genome’s chromosome size file (e.g., genome.chrom.sizes).

7. Validate the Output

  • View the extracted reads:
    bash
    samtools view filtered_reads.bam | less -S
  • Count the number of reads:
    bash
    samtools view -c filtered_reads.bam
  • Visualize the BAM file using tools like IGV or Tablet.

Script for Automation

You can automate the above steps using a shell script:

bash
#!/bin/bash

# Input BAM file and region file
BAM_FILE="input_sorted.bam"
REGIONS="regions.bed"
GENOME="genome.chrom.sizes"

# Step 1: Extract overlapping reads
samtools view -b -L $REGIONS $BAM_FILE > overlapping_reads.bam

# Step 2: Convert BAM to BED
bedtools bamtobed -i overlapping_reads.bam > overlapping_reads.bed

# Step 3: Filter reads entirely within the region
bedtools intersect -a overlapping_reads.bed -b $REGIONS -f 1.0 > reads_within.bed

# Step 4: Convert BED back to BAM
bedtools bedtobam -i reads_within.bed -g $GENOME > filtered_reads.bam

# Step 5: Index and validate
samtools index filtered_reads.bam
samtools view -c filtered_reads.bam

echo "Filtered reads saved in filtered_reads.bam"


Advanced Tips

  • If working with paired-end reads, ensure both ends meet the filtering criteria using samtools view -f 2.
  • To handle multiple regions in the BED file, ensure all regions are formatted correctly.
  • Use R with GenomicRanges for a flexible programming solution:
    R
    library(GenomicAlignments)
    bam <- readGAlignments("input_sorted.bam")
    region <- GRanges("chr10", IRanges(1000, 2000))
    contained <- subsetByOverlaps(bam, region, type = "within")
    export(contained, "filtered_reads.bam", format = "BAM")

This guide provides an effective workflow for extracting reads strictly within a specific region, suitable for bioinformatics beginners.

Shares