Step-by-Step Guide: Counting Sequences in a FASTQ.GZ File

January 10, 2025 Off By admin

Counting the number of sequences in a FASTQ file is a common task in bioinformatics. When dealing with compressed FASTQ files (.fastq.gz), it’s important to use efficient methods to avoid unnecessary decompression and processing. This guide provides step-by-step instructions for counting sequences in .fastq.gz files using command-line tools and Python.

Table of Contents

1. Understand the FASTQ Format

A FASTQ file contains sequence records, each consisting of four lines:

Sequence Identifier: Starts with @.
Sequence: The actual nucleotide sequence.
Separator: Starts with +.
Quality Scores: Encoded quality values for the sequence.

Since each sequence record spans four lines, the total number of sequences in a FASTQ file is:

$Number of Sequences = 4 Total Lines$

2. Use Command-Line Tools

Command-line tools like zcat, wc, and zgrep are efficient for counting sequences in .fastq.gz files.

Method 1: Count Lines and Divide by 4

Use zcat to decompress the file and wc -l to count lines.

zcat file.fastq.gz | wc -l | awk '{print $1/4}'

zcat: Decompresses the .gz file on the fly.
wc -l: Counts the number of lines.
awk '{print $1/4}': Divides the line count by 4 to get the number of sequences.

Method 2: Use `zgrep` to Count Sequence Identifiers

Count lines starting with @ (sequence identifiers).

zgrep -c '^@' file.fastq.gz

zgrep -c '^@': Counts lines starting with @.

Note: This method may overcount if @ appears in quality scores or other lines.

3. Use Python for Counting Sequences

For more flexibility, use Python to count sequences in .fastq.gz files.

Example: Count Sequences Using Python

import gzip

def count_sequences(fastq_gz_file):
    with gzip.open(fastq_gz_file, 'rt') as f:
        return sum(1 for line in f) // 4

# Example usage
file_path = "file.fastq.gz"
num_sequences = count_sequences(file_path)
print(f"Number of sequences: {num_sequences}")

gzip.open: Opens the .gz file in text mode.
sum(1 for line in f): Counts all lines.
// 4: Divides the line count by 4 to get the number of sequences.

4. Use GNU Parallel for Multiple Files

If you have multiple .fastq.gz files, use GNU Parallel to count sequences in parallel.

Install GNU Parallel

sudo apt-get install parallel

Count Sequences in Multiple Files

parallel "echo {} && zcat {} | wc -l | awk '{print \$1/4}'" ::: *.fastq.gz

parallel: Runs commands in parallel.
::: *.fastq.gz: Processes all .fastq.gz files in the directory.

5. Validate Results

Compare the results from different methods to ensure accuracy. For example:

Use zcat | wc -l and zgrep -c '^@' to cross-check.
Verify with a small subset of the data.

6. Handle Edge Cases

Wrapped Lines: Ensure the FASTQ file does not contain wrapped lines (one line per sequence).
Missing Newlines: Check for missing newlines at the end of the file.

Check for Wrapped Lines

zcat file.fastq.gz | awk 'NR % 4 == 2 {if (length($0) > 100) print "Wrapped line at record", (NR/4)+1}'

Check for Missing Newlines

zcat file.fastq.gz | tail -n 1 | cat -e

If the output does not end with $, the file is missing a newline.

7. Count Sequences in SAM/BAM Files

For .sam or .bam files, use samtools to count aligned reads.

Count Reads in a BAM File

samtools view -c file.bam

samtools view -c: Counts the number of aligned reads.

Count Reads in a SAM File

grep -v '^@' file.sam | wc -l

grep -v '^@': Excludes header lines.
wc -l: Counts the remaining lines (aligned reads).

By following these steps, you can efficiently count sequences in .fastq.gz, .sam, and .bam files, ensuring accurate and reliable results for your bioinformatics analyses.