Step-by-Step Guide to Analyze Sequence Length Distribution from a FASTQ File

December 27, 2024 Off By admin

This guide provides a beginner-friendly manual to determine the sequence length distribution of reads in a FASTQ file. It incorporates Unix and Python methods and offers alternative approaches using common bioinformatics tools.

Table of Contents

Step 1: Understand FASTQ Format

A FASTQ file contains four lines per read:
1. Sequence identifier (e.g., @SEQ_ID)
2. Sequence (e.g., AGCTGAC...)
3. Separator (+)
4. Quality scores.

Only the second line of each four-line group contains the sequence whose length needs to be measured.

Step 2: Using AWK for Sequence Length Distribution

AWK is a lightweight and efficient text processing tool.

Example Command:

Explanation:

NR % 4 == 2: Processes every second line (sequence lines).
lengths[length($0)]++: Calculates the sequence length and increments its count in an associative array.
END {for (len in lengths) print len, lengths[len]}: Outputs the length and frequency.

Output:

The file length_distribution.txt contains two columns:
1. Sequence length.
2. Frequency of sequences with that length.

Step 3: Using Python Script

If you prefer Python, you can use the following script:

python

#!/usr/bin/env python3
 import sys
if len(sys.argv) != 2:
 print("Usage: python script.py file.fastq")
 sys.exit(1)
file_path = sys.argv[1]
 length_counts = {}
with open(file_path, 'r') as fastq_file:
 for i, line in enumerate(fastq_file):
 if i % 4 == 1: # Sequence lines
 seq_len = len(line.strip())
 length_counts[seq_len] = length_counts.get(seq_len, 0) + 1
# Write results to a file
 output_file = file_path.split('.')[0] + "_length_distribution.txt"
 with open(output_file, 'w') as out:
 for length, count in sorted(length_counts.items()):
 out.write(f"{length}\t{count}\n")

print(f"Length distribution saved to {output_file}")

Execution:

Step 4: Visualizing the Data in R

After generating the length distribution file, visualize the data in R:

Step 5: Using Prebuilt Tools

Option 1: FastQC

Install FastQC:
bash
sudo apt-get install fastqc
Run FastQC:
bash
fastqc file.fastq
The output includes sequence length distribution in a graphical format.

Option 2: BBMap’s readlength.sh

Install BBMap:
bash
sudo apt-get install bbmap
Run the command:
bash
readlength.sh in=file.fastq out=length_histogram.txt

Step 6: Handling Compressed FASTQ Files

For .gz files, use zcat or gunzip to decompress on the fly:

Step 7: Common Issues and Debugging

Error in AWK Command: Ensure there is a space between print and length($0).
Large Files: Use tools like seqtk or samtools for faster processing:
bash
seqtk fqchk file.fastq > stats.txt