
Step-by-Step Guide: Is My BAM File Sorted?
December 28, 2024
Introduction:
In bioinformatics, BAM (Binary Alignment/Map) files are the standard binary format for storing aligned sequencing data. Sorting BAM files is crucial for various downstream applications, including variant calling, indexing, and visualization. A BAM file can be unsorted, sorted by query name, or sorted by genomic coordinates, with the last being necessary for indexing and most downstream analyses. In this guide, "sorted" means sorted by coordinates.
Why Is It Important to Check if a BAM File is Sorted?
- Efficient Access: Many tools, such as samtools and Picard, assume that BAM files are sorted by genomic coordinates. Without sorting, these tools may not function properly or efficiently.
- Variant Calling: Most variant callers, like GATK, expect sorted BAM files to ensure correct processing of sequence alignments for variant discovery.
- Indexing: To index a BAM file, it must be sorted by coordinates. An unsorted BAM file can cause indexing failures or incorrect outputs.
- Visualizations: Genome browsers like IGV expect BAM files to be sorted to display alignments accurately.
How to Check If a BAM File is Sorted?
Method 1: Using samtools (Modern Approach)
Samtools is one of the most popular tools for handling BAM files. In recent versions, the command samtools stats provides an easy way to check if a BAM file is sorted.
- Run samtools stats on your BAM file and look for the "is sorted" line in the summary numbers (lines beginning with SN).
- Interpret the output:
  - If the output is "is sorted: 1", it means the BAM file is sorted by genomic coordinates.
  - If the output is "is sorted: 0", the BAM file is not sorted.
Example: a summary line reading "SN is sorted: 1" indicates the BAM file is sorted.
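The check described above can be sketched as a small shell function. This is a minimal sketch, assuming samtools is on the PATH and that the summary lines are tab-separated (SN, key, value); the function name sorted_flag is illustrative, not part of samtools.

```shell
# Print the "is sorted" value (1 or 0) from the summary numbers of samtools stats.
# sorted_flag is a hypothetical helper name; samtools must be installed.
sorted_flag() {
    # SN lines are tab-separated: "SN", the key, then the value
    samtools stats "$1" | awk -F'\t' '$1 == "SN" && $2 == "is sorted:" { print $3 }'
}

# Usage: sorted_flag input.bam
```

Filtering on both the SN prefix and the exact key avoids matching unrelated lines in the rest of the report.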
Method 2: Inspecting the BAM Header
The BAM file header contains metadata, including sorting information. You can check this using samtools view:
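- Check the BAM file header: run samtools view with the -H option, which prints only the header.
- Look for the sort order (SO) tag on the @HD line:
  - If the header contains SO:coordinate, the BAM file is sorted by coordinates.
  - If it contains SO:queryname, the file is sorted by read name, not by coordinates.
  - If it contains SO:unsorted, the BAM file is not sorted.
Example: an @HD line such as "@HD VN:1.6 SO:coordinate" indicates that the BAM file is sorted by coordinates.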
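The header inspection above might look like the following. It is a sketch under the assumption that samtools is on the PATH; header_sort_order is a hypothetical helper name. Note that the SO tag records what the tool that wrote the file declared, so it is not an independent verification of the read order.

```shell
# Print the sort order declared on the @HD header line
# (for example coordinate, queryname, or unsorted).
# header_sort_order is a hypothetical helper name.
header_sort_order() {
    # -H prints only the header; header fields are tab-separated
    samtools view -H "$1" | awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) print substr($i, 4)
        }'
}

# Usage: header_sort_order input.bam
```

If the @HD line has no SO tag, the function prints nothing, which is worth treating as "unknown" rather than "unsorted".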
Method 3: Using samtools index
Samtools can also help identify whether a BAM file is sorted. If the BAM file is unsorted, samtools index might produce an error or a smaller index file. However, this method is not foolproof, as it can sometimes run without error on an unsorted file.
- Index the BAM file: run samtools index on the file.
- Check the return code (in a shell, $? holds the exit status of the last command):
  - If the file is unsorted, the command may produce a truncated index file, or you may receive an error like "[bam_index_core] the alignment is not sorted".
  - If the BAM file is sorted, samtools index will run without errors and generate a correct index file.
Example:
  - A return code of 0 indicates that samtools index has run successfully, suggesting the file is sorted.
  - A return code other than 0 may indicate the file is unsorted.
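The return-code check can be sketched as follows, assuming samtools is on the PATH. The function name check_by_index and the messages are illustrative. As noted above, a successful exit is only suggestive, so the wording stays cautious.

```shell
# Try to index the file and report based on the exit status.
# check_by_index is a hypothetical helper name.
# Side effect: on success, samtools writes an index file next to the BAM.
check_by_index() {
    if samtools index "$1" 2>/dev/null; then
        echo "index succeeded: file is probably coordinate-sorted"
    else
        echo "index failed: file may be unsorted"
    fi
}

# Usage: check_by_index input.bam
```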
How to Sort a BAM File?
If you find that your BAM file is unsorted, you can sort it using samtools:
- Sort the BAM file by coordinates: run samtools sort, using the -o option to name the sorted output file.
- Verify the sort: After sorting, you can verify that the BAM file is now sorted by checking its header or running the samtools stats command again.
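A minimal sketch of the sorting step, assuming samtools is on the PATH; the function name sort_bam and the file names in the usage line are placeholders.

```shell
# Sort a BAM file by coordinate (the default order for samtools sort)
# and write the result to a new file, leaving the input untouched.
# sort_bam is a hypothetical helper name.
sort_bam() {
    # $1 = input BAM, $2 = sorted output BAM
    samtools sort -o "$2" "$1"
}

# Usage: sort_bam input.bam input.sorted.bam
```

Writing to a separate output file keeps the original available until the sorted copy has been verified.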
Additional Considerations:
- MarkDuplicates: When using tools like Picard’s MarkDuplicates, it’s crucial to set the ASSUME_SORTED=true option only if the BAM file is actually sorted. If you’re unsure, it’s always safest to run samtools sort first.
- samtools view: When working with SAM files, you can pipe the output of samtools view into a sorting command if necessary.
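The piping approach for SAM input could be sketched like this, assuming samtools is on the PATH; sam_to_sorted_bam is a hypothetical helper name and the file names are placeholders.

```shell
# Convert a SAM file to BAM and sort it by coordinate in one pipeline.
# sam_to_sorted_bam is a hypothetical helper name.
sam_to_sorted_bam() {
    # $1 = input SAM, $2 = sorted output BAM
    # -b makes samtools view emit BAM; "-" tells samtools sort to read standard input
    samtools view -b "$1" | samtools sort -o "$2" -
}

# Usage: sam_to_sorted_bam input.sam input.sorted.bam
```

The pipeline avoids writing an intermediate unsorted BAM file to disk.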
Unix or Perl Scripts for Automation:
You can automate the checking process with a simple shell script that takes a BAM file as input, checks if it is sorted, and outputs the result.
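A minimal sketch of such a script, assuming samtools is on the PATH and relying on the "is sorted" line of samtools stats. The script name check_sorted.sh, the function name, and the messages are all illustrative.

```shell
#!/bin/sh
# check_sorted.sh (hypothetical name): report whether a BAM file is coordinate-sorted.

check_sorted() {
    bam="$1"
    if [ ! -f "$bam" ]; then
        echo "File not found: $bam" >&2
        return 2
    fi
    # Extract the value of the "is sorted:" summary line (tab-separated SN record)
    flag=$(samtools stats "$bam" | awk -F'\t' '$1 == "SN" && $2 == "is sorted:" { print $3 }')
    if [ "$flag" = "1" ]; then
        echo "$bam is sorted by coordinate"
    else
        echo "$bam is NOT sorted by coordinate"
        return 1
    fi
}

# Run the check only when a file argument is given
if [ "$#" -gt 0 ]; then
    check_sorted "$1"
fi
```

Invoked as sh check_sorted.sh input.bam, the script exits with status 0 for a sorted file and a nonzero status otherwise, so it can gate later steps in a pipeline.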
Conclusion:
Ensuring that BAM files are sorted is crucial for the proper functioning of various bioinformatics tools. The most reliable way to check whether a BAM file is sorted is by using samtools stats. If necessary, you can sort the file using samtools sort. By automating these checks, you can streamline your bioinformatics workflow and avoid errors in downstream analyses.


















