Circular RNAs (circRNAs)

Step-by-Step Guide: Counting Sequence Numbers in a FASTQ Zipped File

December 28, 2024 Off By admin
Shares

ntroduction

Counting sequences in a FASTQ.GZ file is a common task in bioinformatics for verifying data integrity, ensuring proper preprocessing, and confirming the expected number of reads. FASTQ files store sequence data, where each sequence spans four lines:

  1. Sequence identifier
  2. Sequence string
  3. Separator (+)
  4. Quality scores

Why It’s Important

  • Data Validation: Ensures no sequences are missing or corrupted.
  • Pipeline QA: Confirms that all sequences are processed in alignment and analysis steps.
  • Resource Estimation: Helps estimate compute and storage requirements.

Tools Required


Approach

1. Basic Counting: Using zcat and wc

This method calculates the total lines in the file, dividing by 4 to get the sequence count. It is simple but may be slow for very large files.

Command:

bash
zcat file.fastq.gz | wc -l | awk '{print $1 / 4}'

Explanation:

  • zcat: Decompresses the gzipped file to stdout.
  • wc -l: Counts the lines.
  • awk: Divides the line count by 4 to compute the sequence count.

2. Optimized Counting: Using zgrep

Search for the sequence headers using zgrep with a regex matching valid FASTQ sequence identifiers.

Command:

bash
zgrep -c '^@.*/[0-9]$' file.fastq.gz

Explanation:

  • zgrep: Searches within compressed files.
  • ^@.*/[0-9]$: Matches lines starting with @ followed by read identifiers.

3. Parallel Processing with GNU Parallel

For multiple files, GNU Parallel can speed up counting.

Command:

bash
parallel "echo {} && zcat {} | wc -l | awk '{print $1 / 4}'" ::: *.fastq.gz

Explanation:

  • parallel: Runs the command across multiple files simultaneously.
  • ::: *.fastq.gz: Specifies all gzipped FASTQ files in the directory.

4. Python Script for Automation

To integrate this in a Python pipeline:

Script:

python
import os
import subprocess

def count_sequences_in_fastq(filepath):
"""Counts sequences in a gzipped FASTQ file."""
cmd = f"zcat {filepath} | wc -l"
result = subprocess.check_output(cmd, shell=True)
lines = int(result.strip())
return lines // 4

# Path to directory containing FASTQ files
fastq_dir = "/path/to/fastq/files"

for file in os.listdir(fastq_dir):
if file.endswith(".fastq.gz"):
filepath = os.path.join(fastq_dir, file)
seq_count = count_sequences_in_fastq(filepath)
print(f"{file}: {seq_count} sequences")

5. BAM File Sequence Counting

To count reads in BAM files, use samtools:

Command:

bash
samtools view -c file.bam

Python Function:

python
import subprocess

def count_bam_reads(bamfile):
"""Counts mapped and unmapped reads in a BAM file."""
cmd = f"samtools idxstats {bamfile}"
result = subprocess.check_output(cmd, shell=True).decode().strip().split("\n")
mapped, unmapped = 0, 0
for line in result:
_, _, nm, nu = line.split("\t")
mapped += int(nm)
unmapped += int(nu)
return mapped, unmapped


Notes and Considerations

  1. Line Wrapping in FASTQ Files: Verify there’s no wrapping with:
    bash
    zcat file.fastq.gz | ruby -e '
    last_char = "+"
    skip_line = false

    while gets
    if skip_line
    skip_line = false
    next
    end

    if $_.start_with?("@") && last_char == "+"
    last_char = "@"
    elsif $_.start_with?("+") && last_char == "@"
    last_char = "+"
    else
    puts "WARNING: Wrapping detected at line #{$.}"
    exit(1)
    end

    skip_line = true
    end
    '

  2. End-of-File Newline: Missing newline may cause issues. Fix with:
    bash
    zcat file.fastq.gz | sed -e '$a\' | gzip > fixed_file.fastq.gz
  3. Compressed File Formats: For large-scale workflows, consider using file systems with real-time compression, such as ZFS.

Applications

This guide should make sequence counting in FASTQ.GZ files accessible and efficient for both beginners and experienced bioinformaticians!

Shares