
How To Efficiently Parse A Huge Fastq File?

January 2, 2025 · By admin

Parsing large FASTQ files requires memory-efficient and computationally optimized methods. Below are step-by-step instructions using various tools and programming languages to extract specific reads efficiently.


Step 1: Understand the Problem

  • You have a large FASTQ file (e.g., 19 GB) containing millions of reads.
  • You want to extract a subset of reads (e.g., 10,000) based on their identifiers.
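To ground the approaches below: a FASTQ record spans exactly four lines, and id_list.txt is assumed to hold one bare read identifier per line, where the identifier is the token after the leading @ and before any whitespace. A minimal illustration (the read shown is made up):

```python
# A FASTQ record is exactly four lines (hypothetical read for illustration):
record = (
    "@SRR000001.1 071112_SLXA-EAS1_s_7:5:1:817:345\n"  # header: @ + read ID + optional description
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"            # sequence
    "+\n"                                               # separator
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC\n"            # per-base quality string
)
# id_list.txt is assumed to contain the token after '@' and before
# the first whitespace, one per line:
wanted_id = record.split(None, 1)[0].lstrip("@")
print(wanted_id)  # SRR000001.1
```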

Step 2: Recommended Approaches

Option 1: Python (Biopython)

Efficient parsing using FastqGeneralIterator:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
import sys

# Input: file with the list of read IDs; the FASTQ comes from stdin
id_file = sys.argv[1]
fastq_file = sys.stdin

# Read the IDs into a set for O(1) membership lookups
with open(id_file) as fh:
    ids = set(line.strip() for line in fh)

# Stream the FASTQ and print only the matching records
for title, seq, qual in FastqGeneralIterator(fastq_file):
    if title.split(None, 1)[0] in ids:
        print(f"@{title}\n{seq}\n+\n{qual}")

Usage:

bash
cat large.fastq | python script.py id_list.txt > extracted_reads.fastq

Option 2: Unix Tools

Use grep for a quick, pattern-based extraction. Note that grep is not FASTQ-aware, so spot-check the output:

bash
grep --no-group-separator -A 3 -Ff id_list.txt large.fastq > extracted_reads.fastq
  • -A 3: Include the 3 lines after each match (sequence, +, quality).
  • -Ff: Match fixed strings read from id_list.txt.
  • --no-group-separator: Suppress the -- lines GNU grep inserts between match groups.

Caveat: an ID can also match inside a sequence or quality line; prefixing each ID with @ in id_list.txt reduces false positives.

Option 3: Seqtk

Seqtk is a fast and lightweight toolkit for FASTQ file manipulation.

Installation:

bash
git clone https://github.com/lh3/seqtk.git
cd seqtk
make

Usage:

bash
seqtk subseq large.fastq id_list.txt > extracted_reads.fastq

Option 4: Perl Script

Simple FASTQ parser in Perl:

perl
use strict;
use warnings;

my $id_file = shift;
open my $fh, '<', $id_file or die "Cannot open $id_file: $!";

# Store IDs in a hash for O(1) lookups
my %ids = map { chomp; $_ => 1 } <$fh>;
close $fh;

# Process FASTQ from stdin, four lines per record
while (my $header = <>) {
    my $seq  = <>;
    my $plus = <>;
    my $qual = <>;
    last unless defined $qual;    # guard against a truncated file

    if ($header =~ /^@(\S+)/ && exists $ids{$1}) {
        print $header, $seq, $plus, $qual;
    }
}

Usage:

bash
perl script.pl id_list.txt < large.fastq > extracted_reads.fastq

Step 3: Tools and Libraries

  1. Seqtk – A fast and versatile tool for FASTQ/FASTA files.
  2. Biopython – Python library for bioinformatics tasks.
  3. fastq-tools – A small collection of command-line utilities for FASTQ file manipulation.

Step 4: Best Practices

  • Index the File: For repeated queries, index the FASTQ with samtools fqidx so individual reads can be fetched without rescanning the whole file.
  • Subset Files: If possible, split the large FASTQ file into smaller chunks using split or seqtk.
  • Optimize Memory Usage: Use lazy loading and generators in Python.
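The lazy-loading point can be sketched in plain Python: a generator that yields one four-line record at a time keeps memory constant regardless of file size. This is an illustrative sketch (fastq_records is my own name, not from any library above):

```python
import io
import itertools

def fastq_records(handle):
    """Lazily yield (header, seq, plus, qual) tuples, four lines at a time."""
    while True:
        chunk = list(itertools.islice(handle, 4))
        if not chunk:
            return
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield tuple(line.rstrip("\n") for line in chunk)

# Usage sketch on an in-memory two-record file (made-up reads):
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nFFFF\n")
wanted = {"r1"}
hits = [rec for rec in fastq_records(demo) if rec[0][1:].split()[0] in wanted]
print(len(hits))  # 1
```

Because the generator never materializes more than one record, the same loop works unchanged on a 19 GB file opened with open().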

Step 5: Benchmarking

  • Benchmark different tools/scripts with a subset of data to determine the fastest solution for your hardware and dataset.
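A simple harness for such a comparison is Python's subprocess plus time.perf_counter. The commands and file names below are placeholders for your own test subset:

```python
import os
import shutil
import subprocess
import time

def time_command(cmd):
    """Run a shell pipeline and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    return time.perf_counter() - start

# Hypothetical comparison on a small subset (adjust paths/commands):
candidates = {
    "python": "python script.py id_list.txt < subset.fastq > /dev/null",
    "seqtk":  "seqtk subseq subset.fastq id_list.txt > /dev/null",
}
for name, cmd in candidates.items():
    # Skip candidates whose tool or test file is not present
    if shutil.which(cmd.split()[0]) and os.path.exists("subset.fastq"):
        print(f"{name}: {time_command(cmd):.2f}s")
```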

By following these steps and using the suggested tools, you can efficiently parse large FASTQ files to extract the required reads.
