Step-by-Step Guide: Understanding and Redefining the CIGAR String in SAM/BAM Format
January 10, 2025The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string in SAM/BAM files is a crucial component for representing sequence alignments. However, its current definition has limitations, particularly in distinguishing between matches and mismatches. This guide explores the CIGAR string’s design, its limitations, and whether it should be redefined.
1. Understand the Current CIGAR String Format
The CIGAR string describes how a read aligns to a reference sequence. It consists of a series of operations, each represented by a letter and a number indicating the length of the operation.
Key CIGAR Operations:
- M: Alignment match (can be a sequence match or mismatch).
- I: Insertion in the read.
- D: Deletion in the read.
- N: Skipped region (e.g., intron in RNA-Seq).
- S: Soft clipping (sequence present in the read but not aligned).
- H: Hard clipping (sequence not present in the read).
- =: Sequence match (added later to distinguish matches).
- X: Sequence mismatch (added later to distinguish mismatches).
Example CIGAR String:
10M1I2M1D3M
- 10M: 10 alignment matches (could be matches or mismatches).
- 1I: 1 insertion in the read.
- 2M: 2 alignment matches.
- 1D: 1 deletion in the read.
- 3M: 3 alignment matches.
2. Limitations of the Current CIGAR String
The primary limitation is that the M
operator does not distinguish between matches and mismatches. This makes it impossible to:
- Calculate the exact number of matches and mismatches.
- Compute sequence similarity (e.g., Levenshtein distance) directly from the CIGAR string.
Example:
A CIGAR string 10M
could represent:
- 10 matches.
- 9 matches and 1 mismatch.
- 8 matches and 2 mismatches.
- And so on.
3. Proposed Redefinition of the CIGAR String
To address these limitations, the CIGAR string could be redefined to use =
for matches and X
for mismatches explicitly.
Proposed CIGAR Operations:
- =: Sequence match.
- X: Sequence mismatch.
- M: Retained for backward compatibility but discouraged.
Example Redefined CIGAR String:
8=1X1I2=1D3=
- 8=: 8 sequence matches.
- 1X: 1 sequence mismatch.
- 1I: 1 insertion in the read.
- 2=: 2 sequence matches.
- 1D: 1 deletion in the read.
- 3=: 3 sequence matches.
4. Pros and Cons of Redefining the CIGAR String
Pros:
- Clarity: Distinguishes matches and mismatches explicitly.
- Accuracy: Enables precise calculation of sequence similarity.
- Simplification: Reduces reliance on additional tags like
MD
.
Cons:
- Backward Compatibility: Existing tools and files would need updates.
- Verbosity: Increases the size of the CIGAR string.
- Complexity: Adds complexity to alignment algorithms.
5. Practical Steps to Redefine the CIGAR String
If the community agrees to redefine the CIGAR string, the following steps could be taken:
Step 1: Update the SAM/BAM Specification
- Define
=
andX
as the standard operators for matches and mismatches. - Deprecate the use of
M
for matches/mismatches.
Step 2: Modify Alignment Tools
- Update alignment tools (e.g., BWA, Bowtie) to output
=
andX
instead ofM
. - Ensure tools can still read legacy CIGAR strings with
M
.
Step 3: Update Downstream Tools
- Modify tools that process SAM/BAM files to handle the new CIGAR format.
- Ensure compatibility with both old and new formats.
Step 4: Educate the Community
- Provide documentation and tutorials on the new CIGAR format.
- Encourage adoption through workshops and conferences.
6. Example Implementation in Python
Here’s how you could convert a legacy CIGAR string to the proposed format using Python:
def convert_cigar(cigar, md_tag): """ Convert a legacy CIGAR string to the proposed format using the MD tag. :param cigar: Legacy CIGAR string (e.g., "10M1I2M1D3M"). :param md_tag: MD tag (e.g., "8A1^C2"). :return: Redefined CIGAR string (e.g., "8=1X1I2=1D3="). """ import re new_cigar = [] md_index = 0 for op in re.findall(r'\d+[MIDNSHP=X]', cigar): length = int(op[:-1]) operator = op[-1] if operator == 'M': # Parse MD tag to distinguish matches and mismatches md_part = md_tag[md_index:md_index + length] matches = 0 mismatches = 0 for char in md_part: if char.isdigit(): matches += int(char) elif char in 'ACGTN': mismatches += 1 elif char == '^': # Skip deletion in MD tag continue new_cigar.append(f"{matches}=") if mismatches > 0: new_cigar.append(f"{mismatches}X") md_index += len(md_part) else: new_cigar.append(op) return ''.join(new_cigar) # Example usage cigar = "10M1I2M1D3M" md_tag = "8A1^C2" new_cigar = convert_cigar(cigar, md_tag) print(f"Redefined CIGAR: {new_cigar}")
7. Conclusion
Redefining the CIGAR string to explicitly distinguish matches and mismatches could improve clarity and accuracy in sequence alignment representation. However, this change would require significant updates to tools and workflows. The decision to redefine the CIGAR string should involve broad community discussion and careful consideration of the trade-offs.