Efficient Processing of SFF/FastQ Files
January 10, 2025
Processing SFF and FastQ files is a common task in bioinformatics. Below is a step-by-step guide to viewing, analyzing, clipping, converting, demultiplexing, and dereplicating SFF/FastQ files efficiently with a combination of tools and one-liners.
1. Viewing FastQ Files
To quickly inspect the contents of a FastQ file, use the following commands:
View the First Few Reads
zcat file.fastq.gz | head -n 12
- zcat: Decompresses .gz files on the fly.
- head -n 12: Displays the first 12 lines (3 reads).
View Specific Reads
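awk 'NR>=1001 && NR<=1012' file.fastq
- Displays lines 1001 to 1012, which hold reads 251 to 253. Each read spans 4 lines, so read n starts at line 4(n-1)+1; starting at line 1000 would begin on the quality line of read 250.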
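Because each record spans four lines, it is often easier to select by read number than by line number. The following is a minimal sketch, assuming strict 4-line records (no wrapped sequence or quality lines); the function name fastq_reads is made up for illustration.

```shell
# Print reads FIRST..LAST (1-based, inclusive) from a FastQ stream on stdin.
# Assumes strict 4-line records.
fastq_reads() {
    awk -v first="$1" -v last="$2" 'NR > (first - 1) * 4 && NR <= last * 4'
}

# Tiny inline example with three reads; prints only the second read.
printf '@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' | fastq_reads 2 2
```

On a real file this would be used as: zcat file.fastq.gz | fastq_reads 251 253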
2. Analyzing FastQ Files
Calculate Basic Statistics
Use fastqc for a comprehensive quality report:
fastqc file.fastq.gz
Calculate Read Length Distribution
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
- Outputs the distribution of read lengths.
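awk does not guarantee any order when looping over an array with "for (l in lengths)", so the histogram may come out unsorted. A small sketch that sorts the histogram numerically and adds a one-line summary; the function names fastq_length_hist, fastq_length_summary, and demo_reads are illustrative, not part of any tool.

```shell
# Read-length histogram ("length count"), sorted by length.
fastq_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' | sort -n
}

# One-line summary: number of reads, shortest, longest, and mean length.
fastq_length_summary() {
    awk 'NR % 4 == 2 {
             len = length($0); total += len; reads++
             if (reads == 1 || len < min) min = len
             if (len > max) max = len
         }
         END {
             if (reads) printf "reads=%d min=%d max=%d mean=%.1f\n", reads, min, max, total / reads
         }'
}

# Three demo reads with lengths 4, 2, and 4.
demo_reads() { printf '@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n@r3\nTTGA\n+\nIIII\n'; }

demo_reads | fastq_length_hist      # prints "2 1" then "4 2"
demo_reads | fastq_length_summary   # prints "reads=3 min=2 max=4 mean=3.3"
```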
Calculate Average Quality
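zcat file.fastq.gz | awk 'BEGIN {for (i=33; i<=126; i++) ord[sprintf("%c", i)] = i} NR%4 == 0 && length($0) > 0 {sum=0; for (i=1; i<=length($0); i++) {sum+=ord[substr($0,i,1)]-33}; print sum/length($0)}'
- Calculates the average quality score per read, assuming Phred+33 encoding. awk has no built-in ord() function, so the BEGIN block builds a character-to-code lookup table.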
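A quick way to check a quality calculation is to run it on a tiny input whose answer can be worked out by hand. The sketch below assumes Phred+33 encoding and builds a character-to-code lookup table in a BEGIN block, since POSIX awk does not provide an ord() function. The function name fastq_mean_quality is hypothetical.

```shell
# Mean Phred score per read from a FastQ stream on stdin (assumes Phred+33).
fastq_mean_quality() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 0 && length($0) > 0 {
             sum = 0
             for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
             print sum / length($0)
         }'
}

# "I" is ASCII 73 (Q40) and "5" is ASCII 53 (Q20),
# so the two reads below have mean qualities 40 and 30.
printf '@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nI5\n' | fastq_mean_quality
```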
3. Clipping Low-Quality Ends
Trim Low-Quality Bases
Use cutadapt to trim low-quality bases:
cutadapt -q 20 -o trimmed.fastq.gz file.fastq.gz
- -q 20: Trims low-quality bases from the 3' end of each read, using a quality cutoff of 20.
- -o: Output file.
Remove Adapters
cutadapt -a ADAPTER_SEQ -o trimmed.fastq.gz file.fastq.gz
- -a ADAPTER_SEQ: Specifies the 3' adapter sequence to remove.
4. Converting File Formats
Convert SFF to FastQ
Use sff2fastq or biopython:
sff2fastq -o output.fastq file.sff
Convert FastQ to FASTA
zcat file.fastq.gz | awk 'NR%4==1{print ">" substr($0,2)} NR%4==2{print}' > output.fasta
- Converts FastQ to FASTA format.
5. Demultiplexing
Split by Barcodes
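Use cutadapt to demultiplex, running it once per barcode:
cutadapt -g ^BARCODE1 --discard-untrimmed -o output1.fastq.gz file.fastq.gz
cutadapt -g ^BARCODE2 --discard-untrimmed -o output2.fastq.gz file.fastq.gz
- -g ^BARCODE: Specifies the barcode as a 5' adapter; the ^ anchors it to the start of the read.
- --discard-untrimmed: Keeps only reads in which the barcode was found. Without it, every read is written to every output file.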
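To make the idea of barcode matching at the start of a read concrete, here is a minimal awk sketch that demultiplexes by exact match only (no mismatches, no indels). It is an illustration of the technique, not a replacement for a dedicated tool. The function name demux_exact and the two-column barcode table layout ("sample_id barcode") are assumptions made for this example.

```shell
# Demultiplex a FastQ stream on stdin by exact barcode match at the read start.
# Usage: demux_exact BARCODE_TABLE OUT_DIR
# BARCODE_TABLE (hypothetical layout): one "sample_id barcode" pair per line.
# Reads with no matching barcode go to OUT_DIR/unmatched.fastq.
demux_exact() {
    awk -v out="$2" '
        NR == FNR { id[$2] = $1; len[length($2)] = 1; next }   # load barcode table
        { rec[FNR % 4] = $0 }                                  # buffer the 4-line record
        FNR % 4 == 0 {
            target = "unmatched"; cut = 0
            for (l in len) {                                   # try each barcode length
                bc = substr(rec[2], 1, l)
                if (bc in id) { target = id[bc]; cut = l + 0 }
            }
            file = out "/" target ".fastq"
            # strip the barcode from both the sequence and the quality string
            print rec[1] "\n" substr(rec[2], cut + 1) "\n" rec[3] "\n" substr(rec[0], cut + 1) > file
        }
    ' "$1" -
}

# Example call (file names are placeholders):
#   zcat file.fastq.gz | demux_exact barcodes.tsv demux_out
```

The barcode table must be non-empty and the output directory must already exist. If one barcode is a prefix of another, the match is ambiguous and this sketch does not resolve it.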
Demultiplex with fastq-multx
fastq-multx -b barcodes.txt file.fastq.gz -o output_%.fastq.gz
- barcodes.txt: File containing barcode sequences.
- %: Placeholder in the output file name that is replaced by each barcode ID.
6. Dereplicating Sequences
Remove Duplicate Reads
Use fastx_collapser from the FASTX toolkit:
zcat file.fastq.gz | fastx_collapser -o dereplicated.fasta
Dereplicate with vsearch
vsearch --derep_fulllength file.fastq --output dereplicated.fasta
- --derep_fulllength: Dereplicates identical sequences.
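What dereplication does can be shown in a few lines of awk: count each distinct sequence, then emit one FASTA record per sequence, most abundant first. This is only a sketch of the idea. It holds every unique sequence in memory, and the function name fastq_derep and the ">seqN;size=COUNT" header layout are choices made for this example rather than the output format of any particular tool.

```shell
# Collapse identical sequences in a FastQ stream on stdin into FASTA records,
# ordered by decreasing abundance (ties broken alphabetically by sequence).
fastq_derep() {
    awk 'NR % 4 == 2 { count[$0]++ } END { for (s in count) print count[s] "\t" s }' |
        sort -k1,1nr -k2,2 |
        awk -F '\t' '{ printf ">seq%d;size=%d\n%s\n", NR, $1, $2 }'
}

# ACGT occurs twice and TTTT once, so this prints:
#   >seq1;size=2
#   ACGT
#   >seq2;size=1
#   TTTT
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' | fastq_derep
```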
7. Splitting Large Files
Split by Number of Reads
zcat file.fastq.gz | split -l 4000000 -d - part_
- Splits into files of 4 million lines (1 million reads) each. Keep the line count a multiple of 4 so that no read is split across files.
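Since split counts lines rather than reads, a thin wrapper that does the multiplication by 4 helps avoid off-by-a-factor mistakes. A minimal sketch, assuming strict 4-line records and a split that supports -d (as GNU coreutils does); the function name split_fastq_reads is hypothetical.

```shell
# Split a FastQ stream on stdin into chunks of READS reads each.
# Usage: split_fastq_reads READS PREFIX
# Output files are named PREFIX00, PREFIX01, ...
split_fastq_reads() {
    split -l "$(( $1 * 4 ))" -d - "$2"
}

# Example call (file names are placeholders), 1 million reads per chunk:
#   zcat file.fastq.gz | split_fastq_reads 1000000 part_
```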
Split by Size
zcat file.fastq.gz | split -b 1G -d - part_
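- Splits into 1 GB chunks. Byte-based splitting can cut a record in the middle, so chunks are not guaranteed to start or end on a read boundary.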
8. Removing Contaminants
Filter Out Contaminant Sequences
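Use bowtie2 to align and filter:
bowtie2 -x contaminant_index -U file.fastq.gz --un-gz clean.fastq.gz -S /dev/null
- --un-gz clean.fastq.gz: Writes reads that do not align to the contaminant index, gzip-compressed. Plain --un writes uncompressed output regardless of the file name.
- -S /dev/null: Discards the SAM alignments, which bowtie2 otherwise prints to standard output.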
9. Visualizing Data
Generate Quality Plots
Use fastqc:
fastqc file.fastq.gz
Plot GC Content
zcat file.fastq.gz | awk 'NR%4 == 2 {gc=0; for (i=1; i<=length($0); i++) {if (substr($0,i,1) ~ /[GC]/) gc++}; print gc/length($0)}' > gc_content.txt
- Outputs GC content per read.
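The per-read GC fraction can also be computed without an inner loop, because awk's gsub() returns the number of substitutions it made. The sketch below uses that, counts lowercase g/c as well (soft-masked or lowercase input), and skips empty sequence lines to avoid division by zero. The function name fastq_gc is made up for illustration.

```shell
# GC fraction per read from a FastQ stream on stdin.
fastq_gc() {
    awk 'NR % 4 == 2 && length($0) > 0 {
             n = length($0)
             gc = gsub(/[GCgc]/, "")   # gsub returns the number of replacements
             print gc / n
         }'
}

# GGCC is all GC (1), ATGC is half GC (0.5), aatt has none (0).
printf '@r1\nGGCC\n+\nIIII\n@r2\nATGC\n+\nIIII\n@r3\naatt\n+\nIIII\n' | fastq_gc
```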
10. Parallel Processing
Run Tasks in Parallel
Use GNU parallel to process multiple files:
ls *.fastq.gz | parallel -j 8 "fastqc {}"
- -j 8: Runs 8 jobs in parallel.
11. Combining Results
Merge FastQ Files
cat *.fastq.gz > combined.fastq.gz
Merge FASTA Files
cat *.fasta > combined.fasta
12. Advanced Tools
Use BioApps (if available)
If you have access to the BioApps tool mentioned in the original post, it provides a GUI for many of these tasks, including:
- Quality filtering.
- Demultiplexing.
- File conversion.
- Visualization.
Summary
- Viewing: Use zcat, head, or awk to inspect files.
- Analyzing: Use fastqc or custom awk scripts.
- Clipping: Use cutadapt for trimming and adapter removal.
- Converting: Use sff2fastq or awk for format conversion.
- Demultiplexing: Use cutadapt or fastq-multx.
- Dereplicating: Use fastx_collapser or vsearch.
- Splitting: Use split for large files.
- Visualizing: Use fastqc or custom scripts for plots.
These tools and one-liners should help you efficiently process SFF/FastQ files for your bioinformatics workflows.