Selecting Random Pairs from FASTQ Files: A Beginner’s Guide

December 28, 2024 Off By admin

Introduction: In bioinformatics, selecting random pairs of reads from paired-end sequencing data is a common task. The process is often necessary for downsampling large datasets or for quality control before further analysis. Illumina paired-end (PE) FASTQ files contain two sets of sequences (forward and reverse reads) that should be maintained together. It’s crucial to retain both ends of a read pair in the correct order when performing random sampling.

This guide will walk you through the process of selecting random pairs from FASTQ files using tools like seqtk, shuf, and custom scripts. It covers why random sampling is important, how to use existing tools for this task, and provides practical examples with Unix and Python code.

Table of Contents

Why Is Random Sampling Important?

Random sampling from sequencing data helps in:

Data Reduction: For large sequencing datasets, random sampling allows researchers to work with a subset of the data, reducing computational costs.
Reproducibility: By selecting a random subset, you can perform experiments or simulations under controlled, reproducible conditions.
Bias Avoidance: Random sampling ensures that there is no systematic bias in the data, which could skew results in downstream analyses.

Prerequisites:

To follow along, you’ll need:

Basic knowledge of command-line tools like bash or Unix/Linux commands.
A FASTQ file: This is a standard file format that contains the nucleotide sequences and quality scores.
Tools: Install seqtk, shuf, or awk if not already available on your system. You can also use Python for more advanced scripting.

Method 1: Using `seqtk` for Random Sampling

seqtk is a fast and flexible tool for processing sequences in FASTA or FASTQ format. It’s particularly useful for random sampling because it can handle large files efficiently without consuming much memory.

Steps:

Install seqtk: If seqtk is not installed on your system, you can install it via apt on Ubuntu:
bash
sudo apt-get install seqtk
Use seqtk to sample reads: To sample a fixed number of reads, use the seqtk sample command:
bash
seqtk sample -s100 read1.fq 10000 > sub1.fq seqtk sample -s100 read2.fq 10000 > sub2.fq
- -s100: This is the random seed for reproducibility.
- 10000: This is the number of reads you want to sample.
- read1.fq and read2.fq: These are the input paired FASTQ files.
- sub1.fq and sub2.fq: These are the output files containing the randomly sampled reads.
Using a Fraction Instead of a Fixed Number: You can also sample a fraction of the reads:
bash
seqtk sample -s100 read1.fq 0.1 > sub1.fq seqtk sample -s100 read2.fq 0.1 > sub2.fq
- 0.1: This specifies that 10% of the reads will be sampled.

Explanation of Parameters:

Random Seed (-s): This ensures that the random selection is reproducible. Using the same seed number will produce the same results each time.
Fraction: Instead of specifying the number of reads to sample, you can specify a fraction (e.g., 0.1 means 10% of the total reads).

Method 2: Using `shuf` and `awk` for More Control

If you need more control over the sampling process or want to perform additional operations (e.g., shuffling reads), you can use shuf (part of GNU core utilities) to shuffle the lines in your FASTQ files. Here’s how you can do it:

Steps:

Shuffling the Reads: Use shuf to shuffle the reads:
bash
paste read1.fq read2.fq | shuf | split -l 4 - read
- paste read1.fq read2.fq: Combines the two FASTQ files.
- shuf: Shuffles the combined input.
- split -l 4 -: Splits the shuffled lines back into two FASTQ files.
Splitting the Shuffled Reads: After shuffling, you can use awk or sed to format the output as two separate files:
bash
awk '{printf("%s",$0); n++; if(n%4==0) {printf("\n");} else {printf("\t\t");}}' read | \ sed 's/\t\t/\n/g' | \ awk '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}'

Explanation of Commands:

paste: Joins the two input files line by line.
shuf: Randomly shuffles the input.
split -l 4 -: Ensures that each output file contains complete reads (4 lines per read).
awk and sed: Used for formatting and splitting the shuffled output back into two files.

Method 3: Using Python for Random Sampling (Advanced)

If you prefer using Python for better flexibility and error handling, you can use the following Python script to select random pairs from two FASTQ files:

Python Script:

python

import random
# File paths
 read1_path = 'read1.fq'
 read2_path = 'read2.fq'
 output1_path = 'sub1.fq'
 output2_path = 'sub2.fq'
# Read all lines from both files
 with open(read1_path) as f1, open(read2_path) as f2:
 read1_lines = f1.readlines()
 read2_lines = f2.readlines()
# Number of reads to sample
 num_reads = 10000
# Make sure the number of reads to sample doesn't exceed the available reads
 assert len(read1_lines) // 4 >= num_reads, "Not enough reads to sample."
# Randomly sample the read indices
 sample_indices = random.sample(range(0, len(read1_lines) // 4), num_reads)
# Write the sampled reads to output files
 with open(output1_path, 'w') as out1, open(output2_path, 'w') as out2:
 for idx in sample_indices:
 out1.writelines(read1_lines[idx*4:(idx+1)*4])
 out2.writelines(read2_lines[idx*4:(idx+1)*4])

print(f"Random sampling complete: {num_reads} pairs selected.")

How the Python Script Works:

Reads the FASTQ files: It opens the input files and reads all lines.
Random Sampling: The random.sample function is used to randomly select read indices.
Writing Output: The selected reads are written to two separate output files.

Final Thoughts:

Ensuring Pair Integrity: It is essential to keep the read pairs together. This can be achieved easily when working with paired FASTQ files by maintaining synchronization between the forward and reverse reads.
Memory Management: For very large FASTQ files, using streaming tools like seqtk or shuf is essential, as these tools do not require loading the entire file into memory.

By following this guide, you should be able to efficiently select random pairs from paired-end FASTQ files using either command-line tools or Python, depending on your preference and requirements.