Data Scientists-genomics

Step-by-Step Guide to Analyze Sequence Length Distribution from a FASTQ File

December 27, 2024 Off By admin
Shares

This guide provides a beginner-friendly manual to determine the sequence length distribution of reads in a FASTQ file. It incorporates Unix and Python methods and offers alternative approaches using common bioinformatics tools.


Step 1: Understand FASTQ Format

  • A FASTQ file contains four lines per read:
    1. Sequence identifier (e.g., @SEQ_ID)
    2. Sequence (e.g., AGCTGAC...)
    3. Separator (+)
    4. Quality scores.

Only the second line of each four-line group contains the sequence whose length needs to be measured.


Step 2: Using AWK for Sequence Length Distribution

AWK is a lightweight and efficient text processing tool.

Example Command:

bash
awk 'NR % 4 == 2 {lengths[length($0)]++} END {for (len in lengths) print len, lengths[len]}' file.fastq > length_distribution.txt

Explanation:

  1. NR % 4 == 2: Processes every second line (sequence lines).
  2. lengths[length($0)]++: Calculates the sequence length and increments its count in an associative array.
  3. END {for (len in lengths) print len, lengths[len]}: Outputs the length and frequency.

Output:

  • The file length_distribution.txt contains two columns:
    1. Sequence length.
    2. Frequency of sequences with that length.

Step 3: Using Python Script

If you prefer Python, you can use the following script:

python
#!/usr/bin/env python3
import sys

if len(sys.argv) != 2:
print("Usage: python script.py file.fastq")
sys.exit(1)

file_path = sys.argv[1]
length_counts = {}

with open(file_path, 'r') as fastq_file:
for i, line in enumerate(fastq_file):
if i % 4 == 1: # Sequence lines
seq_len = len(line.strip())
length_counts[seq_len] = length_counts.get(seq_len, 0) + 1

# Write results to a file
output_file = file_path.split('.')[0] + "_length_distribution.txt"
with open(output_file, 'w') as out:
for length, count in sorted(length_counts.items()):
out.write(f"{length}\t{count}\n")

print(f"Length distribution saved to {output_file}")

Execution:

bash
python3 script.py file.fastq

Step 4: Visualizing the Data in R

After generating the length distribution file, visualize the data in R:

R
data <- read.table("length_distribution.txt", header=FALSE)
plot(data$V1, data$V2, type="l", xlab="Read Length", ylab="Frequency", col="blue", main="Read Length Distribution")

Step 5: Using Prebuilt Tools

Option 1: FastQC

  • Install FastQC:
    bash
    sudo apt-get install fastqc
  • Run FastQC:
    bash
    fastqc file.fastq
  • The output includes sequence length distribution in a graphical format.

Option 2: BBMap’s readlength.sh

  • Install BBMap:
    bash
    sudo apt-get install bbmap
  • Run the command:
    bash
    readlength.sh in=file.fastq out=length_histogram.txt

Step 6: Handling Compressed FASTQ Files

For .gz files, use zcat or gunzip to decompress on the fly:

bash
zcat file.fastq.gz | awk 'NR % 4 == 2 {lengths[length($0)]++} END {for (len in lengths) print len, lengths[len]}' > length_distribution.txt

Step 7: Common Issues and Debugging

  1. Error in AWK Command: Ensure there is a space between print and length($0).
  2. Large Files: Use tools like seqtk or samtools for faster processing:
    bash
    seqtk fqchk file.fastq > stats.txt

Step 8: Recommended Workflow

  1. For large datasets: Use BBMap or seqtk.
  2. For visualization: Export results to a .txt file and use R or Python plotting libraries.

This step-by-step guide ensures both simplicity and flexibility, suitable for beginners and advanced users alike.

Shares