Direct RNA Sequencing (DRS)

Step-by-Step Guide: Counting Sequences in a FASTQ.GZ File

January 10, 2025 Off By admin
Shares

Counting the number of sequences in a FASTQ file is a common task in bioinformatics. When dealing with compressed FASTQ files (.fastq.gz), it’s important to use efficient methods to avoid unnecessary decompression and processing. This guide provides step-by-step instructions for counting sequences in .fastq.gz files using command-line tools and Python.


1. Understand the FASTQ Format

A FASTQ file contains sequence records, each consisting of four lines:

  1. Sequence Identifier: Starts with @.
  2. Sequence: The actual nucleotide sequence.
  3. Separator: Starts with +.
  4. Quality Scores: Encoded quality values for the sequence.

Since each sequence record spans four lines, the total number of sequences in a FASTQ file is:

Number of Sequences=Total Lines4


2. Use Command-Line Tools

Command-line tools like zcatwc, and zgrep are efficient for counting sequences in .fastq.gz files.

Method 1: Count Lines and Divide by 4

Use zcat to decompress the file and wc -l to count lines.

bash
Copy
zcat file.fastq.gz | wc -l | awk '{print $1/4}'
  • zcat: Decompresses the .gz file on the fly.
  • wc -l: Counts the number of lines.
  • awk '{print $1/4}': Divides the line count by 4 to get the number of sequences.

Method 2: Use zgrep to Count Sequence Identifiers

Count lines starting with @ (sequence identifiers).

bash
Copy
zgrep -c '^@' file.fastq.gz
  • zgrep -c '^@': Counts lines starting with @.

Note: This method may overcount if @ appears in quality scores or other lines.


3. Use Python for Counting Sequences

For more flexibility, use Python to count sequences in .fastq.gz files.

Example: Count Sequences Using Python

python
Copy
import gzip

def count_sequences(fastq_gz_file):
    with gzip.open(fastq_gz_file, 'rt') as f:
        return sum(1 for line in f) // 4

# Example usage
file_path = "file.fastq.gz"
num_sequences = count_sequences(file_path)
print(f"Number of sequences: {num_sequences}")
  • gzip.open: Opens the .gz file in text mode.
  • sum(1 for line in f): Counts all lines.
  • // 4: Divides the line count by 4 to get the number of sequences.

4. Use GNU Parallel for Multiple Files

If you have multiple .fastq.gz files, use GNU Parallel to count sequences in parallel.

Install GNU Parallel

bash
Copy
sudo apt-get install parallel

Count Sequences in Multiple Files

bash
Copy
parallel "echo {} && zcat {} | wc -l | awk '{print \$1/4}'" ::: *.fastq.gz
  • parallel: Runs commands in parallel.
  • ::: *.fastq.gz: Processes all .fastq.gz files in the directory.

5. Validate Results

Compare the results from different methods to ensure accuracy. For example:

  • Use zcat | wc -l and zgrep -c '^@' to cross-check.
  • Verify with a small subset of the data.

6. Handle Edge Cases

  • Wrapped Lines: Ensure the FASTQ file does not contain wrapped lines (one line per sequence).
  • Missing Newlines: Check for missing newlines at the end of the file.

Check for Wrapped Lines

bash
Copy
zcat file.fastq.gz | awk 'NR % 4 == 2 {if (length($0) > 100) print "Wrapped line at record", (NR/4)+1}'

Check for Missing Newlines

bash
Copy
zcat file.fastq.gz | tail -n 1 | cat -e
  • If the output does not end with $, the file is missing a newline.

7. Count Sequences in SAM/BAM Files

For .sam or .bam files, use samtools to count aligned reads.

Count Reads in a BAM File

bash
Copy
samtools view -c file.bam
  • samtools view -c: Counts the number of aligned reads.

Count Reads in a SAM File

bash
Copy
grep -v '^@' file.sam | wc -l
  • grep -v '^@': Excludes header lines.
  • wc -l: Counts the remaining lines (aligned reads).

By following these steps, you can efficiently count sequences in .fastq.gz.sam, and .bam files, ensuring accurate and reliable results for your bioinformatics analyses.

Shares