Efficient Processing of SFF/FastQ Files

January 10, 2025 By admin

Processing SFF and FastQ files is a routine task in bioinformatics. This step-by-step guide shows how to view, analyze, clip, convert, demultiplex, and dereplicate SFF/FastQ files using a combination of command-line tools and one-liners.


1. Viewing FastQ Files

To quickly inspect the contents of a FastQ file, use the following commands:

View the First Few Reads

zcat file.fastq.gz | head -n 12
  • zcat: Decompresses .gz files on the fly.
  • head -n 12: Displays the first 12 lines (3 reads).

View Specific Reads

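awk 'NR>=1001 && NR<=1012' file.fastq
  • Displays lines 1001 to 1012, i.e. reads 251 to 253. Each read spans 4 lines, so start at a line number of the form 4k+1 to stay on record boundaries.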
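
Selecting by read number instead of line number avoids the off-by-one arithmetic. The following is a minimal sketch, assuming strict four-line records; fastq_reads is a hypothetical helper name, not part of any toolkit.

```bash
# Print COUNT reads starting at read number START (1-based) from a
# four-line-per-record FASTQ stream on stdin.
# Usage: fastq_reads START COUNT
fastq_reads() {
    awk -v start="$1" -v count="$2" '
        BEGIN { first = (start - 1) * 4 + 1; last = first + count * 4 - 1 }
        NR > last { exit }      # stop reading once the window has passed
        NR >= first             # default action: print the line
    '
}

# Example: reads 251 to 253, the same window as lines 1001 to 1012
# zcat file.fastq.gz | fastq_reads 251 3
```

Exiting as soon as the window has passed means a large file is not read to the end.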

2. Analyzing FastQ Files

Calculate Basic Statistics

Use fastqc for a comprehensive quality report:

fastqc file.fastq.gz

Calculate Read Length Distribution

zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' | sort -n
  • Outputs each read length and the number of reads with that length, sorted by length.
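
When the full distribution is more detail than needed, a one-line summary may be enough. This is a small sketch under the same four-line-record assumption; fastq_length_summary is a hypothetical helper name.

```bash
# Summarize read lengths from a FASTQ stream on stdin:
# prints the read count, minimum, maximum, and mean length.
fastq_length_summary() {
    awk 'NR % 4 == 2 {
            n++; len = length($0); total += len
            if (n == 1 || len < min) min = len
            if (n == 1 || len > max) max = len
         }
         END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
         }'
}

# Example:
# zcat file.fastq.gz | fastq_length_summary
```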

Calculate Average Quality

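zcat file.fastq.gz | awk 'BEGIN {for (i=33; i<=126; i++) ord[sprintf("%c", i)]=i} NR%4 == 0 && length($0) > 0 {sum=0; for (i=1; i<=length($0); i++) {sum+=ord[substr($0,i,1)]-33}; print sum/length($0)}'
  • Calculates the average quality score per read, assuming Phred+33 encoding.
  • awk has no built-in ord() function, so the BEGIN block builds a character-to-code lookup table.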
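
The per-read average can also drive a simple filter. The sketch below keeps reads whose mean Phred score reaches a threshold; it assumes Phred+33 encoding and four-line records, and fastq_filter_meanq is a hypothetical helper name. Because awk has no ord() function, it builds a character-to-code table first.

```bash
# Keep only reads whose mean Phred score is at least MINQ.
# Usage: fastq_filter_meanq MINQ < reads.fastq
fastq_filter_meanq() {
    awk -v minq="$1" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            sum = 0; n = length($0)
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            if (n > 0 && sum / n >= minq)
                print header "\n" seq "\n" plus "\n" $0
        }
    '
}

# Example:
# zcat file.fastq.gz | fastq_filter_meanq 20 | gzip > filtered.fastq.gz
```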

3. Clipping Low-Quality Ends

Trim Low-Quality Bases

Use cutadapt to trim low-quality bases:

cutadapt -q 20 -o trimmed.fastq.gz file.fastq.gz
  • -q 20: Trims low-quality bases from the 3' end of each read, using a quality cutoff of 20.
  • -o: Output file.
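
To see what a quality cutoff does to a read, here is a sketch of the running-sum trimming idea described in the cutadapt documentation: subtract the cutoff from each quality, sum from the 3' end, and cut where the sum is lowest. It is an illustration only, not cutadapt's code; it assumes Phred+33 encoding, and fastq_trim3 is a hypothetical helper name.

```bash
# Trim the 3' end of each read with a running-sum quality cutoff.
# Usage: fastq_trim3 CUTOFF < reads.fastq
fastq_trim3() {
    awk -v cutoff="$1" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            n = length($0); cut = n + 1; s = 0; min = 0
            for (i = n; i >= 1; i--) {
                s += ord[substr($0, i, 1)] - 33 - cutoff
                if (s > 0) break                    # good bases outweigh the bad tail
                if (s < min) { min = s; cut = i }   # lowest sum so far: cut here
            }
            print header "\n" substr(seq, 1, cut - 1) "\n" plus "\n" substr($0, 1, cut - 1)
        }
    '
}

# Example:
# zcat file.fastq.gz | fastq_trim3 20 | gzip > trimmed.fastq.gz
```

A single good base inside a poor-quality tail does not stop the trimming, which is the main difference from simply removing trailing bases below the cutoff.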

Remove Adapters

cutadapt -a ADAPTER_SEQ -o trimmed.fastq.gz file.fastq.gz
  • -a ADAPTER_SEQ: Specifies the 3' adapter sequence to remove; the adapter and any bases after it are trimmed.

4. Converting File Formats

Convert SFF to FastQ

Use sff2fastq or biopython:

sff2fastq -o output.fastq file.sff

Convert FastQ to FASTA

zcat file.fastq.gz | awk 'NR%4==1{print ">" substr($0,2)} NR%4==2{print}' > output.fasta
  • Converts FastQ to FASTA format: keeps the header (with @ replaced by >) and the sequence line, and drops the quality scores.
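
The awk conversion relies on every record occupying exactly four lines. A quick structural check before converting can catch truncated or multi-line files. This is a minimal sketch that tests only the layout, not the characters used; fastq_check is a hypothetical helper name.

```bash
# Exit non-zero and report the problem if stdin is not strict 4-line FASTQ.
fastq_check() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header must start with @"; bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator must start with +"; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
        END {
            if (!bad && NR % 4 != 0) { print "truncated record: " NR " lines"; bad = 1 }
            exit bad + 0
        }
    '
}

# Example:
# zcat file.fastq.gz | fastq_check && echo "layout looks fine"
```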

5. Demultiplexing

Split by Barcodes

Use cutadapt to demultiplex:

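cutadapt -g ^BARCODE1 --discard-untrimmed -o output1.fastq.gz file.fastq.gz
cutadapt -g ^BARCODE2 --discard-untrimmed -o output2.fastq.gz file.fastq.gz
  • -g ^BARCODE: Specifies the barcode as an anchored 5' adapter, so it must occur at the start of the read; the barcode is removed from matching reads.
  • --discard-untrimmed: Drops reads that do not carry the barcode. Without it, every read is written to each output file.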
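
Running one pass per barcode rereads the input each time. For exact 5' matches, a single pass is possible with plain awk. The sketch below assumes a two-column barcode table (sample name, barcode), no mismatches, and barcodes that are not prefixes of one another; fastq_demux and the output naming are hypothetical.

```bash
# Split a FASTQ stream on stdin by exact 5-prime barcode match.
# $1: barcode table, whitespace-separated: sample name, barcode (must not be empty)
# $2: output prefix; reads go to PREFIX<sample>.fastq or PREFIXunmatched.fastq
# The matched barcode is removed from the sequence and quality lines.
fastq_demux() {
    awk -v prefix="$2" '
        NR == FNR { if (NF >= 2) sample[$2] = $1; next }   # first file: barcode table
        FNR % 4 == 1 { header = $0 }
        FNR % 4 == 2 { seq = $0 }
        FNR % 4 == 3 { plus = $0 }
        FNR % 4 == 0 {
            name = "unmatched"; clip = 0
            for (bc in sample)
                if (substr(seq, 1, length(bc)) == bc) { name = sample[bc]; clip = length(bc); break }
            out = prefix name ".fastq"
            print header "\n" substr(seq, clip + 1) "\n" plus "\n" substr($0, clip + 1) > out
        }
    ' "$1" -
}

# Example:
# zcat file.fastq.gz | fastq_demux barcodes.txt output_
```

Dedicated tools remain the better choice when mismatches, paired reads, or many samples are involved.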

Demultiplex with fastq-multx

fastq-multx -B barcodes.txt file.fastq.gz -o output_%.fastq.gz
  • -B barcodes.txt: File listing a sample ID and a barcode sequence on each line.
  • %: Replaced by the sample ID in each output file name.

6. Dereplicating Sequences

Remove Duplicate Reads

Use fastx_collapser from the FASTX toolkit:

zcat file.fastq.gz | fastx_collapser -o dereplicated.fasta

Dereplicate with vsearch

vsearch --derep_fulllength file.fastq --output dereplicated.fasta
  • --derep_fulllength: Dereplicates identical sequences.
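
If neither tool is installed, exact-match dereplication can be approximated with awk and sort. This is a rough sketch, assuming four-line records and enough memory to hold every distinct sequence; fastq_derep and the seqN labels are hypothetical, while the ;size= annotation follows the style vsearch writes with --sizeout.

```bash
# Collapse identical sequences from a FASTQ stream on stdin and print FASTA
# records sorted by abundance, labelled with rank and copy number.
fastq_derep() {
    awk 'NR % 4 == 2 { count[$0]++ }
         END { for (s in count) print count[s] "\t" s }' |
    sort -k1,1nr -k2,2 |
    awk -F '\t' '{ print ">seq" NR ";size=" $1 "\n" $2 }'
}

# Example:
# zcat file.fastq.gz | fastq_derep > dereplicated.fasta
```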

7. Splitting Large Files

Split by Number of Reads

zcat file.fastq.gz | split -l 4000000 -d - part_
  • Splits into files of 4,000,000 lines (1 million reads) each. Keep the line count a multiple of 4 so that no read is split across files.
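
To think in reads rather than lines, the multiplication by 4 can be wrapped in a helper. The sketch below uses awk instead of split so that the chunks get a .fastq extension; fastq_split_reads and the file naming are hypothetical.

```bash
# Split a FASTQ stream on stdin into files of READS reads each.
# $1: reads per file
# $2: output prefix (files are PREFIX000.fastq, PREFIX001.fastq, ...)
fastq_split_reads() {
    awk -v reads="$1" -v prefix="$2" '
        (NR - 1) % (reads * 4) == 0 {
            if (out != "") close(out)               # finish the previous chunk
            out = sprintf("%s%03d.fastq", prefix, part++)
        }
        { print > out }
    '
}

# Example: 1 million reads per file
# zcat file.fastq.gz | fastq_split_reads 1000000 part_
```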

Split by Size

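zcat file.fastq.gz | split -b 1G -d - part_
  • Splits into 1GB chunks. Byte-based splitting can cut a read in the middle, so the chunks are only valid FastQ again once they are concatenated in order.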

8. Removing Contaminants

Filter Out Contaminant Sequences

Use bowtie2 to align and filter:

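bowtie2 -x contaminant_index -U file.fastq.gz --un-gz clean.fastq.gz -S /dev/null
  • --un-gz clean.fastq.gz: Writes reads that do not align to the contaminant index, gzip-compressed (plain --un writes uncompressed output).
  • -S /dev/null: Discards the SAM alignments, which would otherwise be printed to standard output.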

9. Visualizing Data

Generate Quality Plots

Use fastqc:

fastqc file.fastq.gz

Plot GC Content

zcat file.fastq.gz | awk 'NR%4 == 2 {gc=0; for (i=1; i<=length($0); i++) {if (substr($0,i,1) ~ /[GC]/) gc++}; print gc/length($0)}' > gc_content.txt
  • Outputs GC content per read.
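
Per-read values usually get binned before plotting. The following sketch prints the mean of the per-read GC fractions and a histogram in 10% bins, ready for any plotting tool; it counts G and C case-insensitively, and fastq_gc_summary is a hypothetical helper name.

```bash
# Print the mean per-read GC fraction, then a histogram of per-read GC
# content in 10% bins (first column is the lower bound of the bin in percent).
fastq_gc_summary() {
    awk 'NR % 4 == 2 && length($0) > 0 {
            seq = toupper($0); len = length(seq)
            gc = gsub(/[GC]/, "", seq)          # gsub returns the number of replacements
            frac = gc / len
            total += frac; n++
            bin = int(frac * 10); if (bin == 10) bin = 9
            hist[bin]++
         }
         END {
            if (n == 0) exit
            printf "mean_gc=%.3f\n", total / n
            for (b = 0; b < 10; b++) printf "%d\t%d\n", b * 10, hist[b]
         }'
}

# Example:
# zcat file.fastq.gz | fastq_gc_summary > gc_histogram.txt
```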

10. Parallel Processing

Run Tasks in Parallel

Use GNU parallel to process multiple files:

ls *.fastq.gz | parallel -j 8 "fastqc {}"
  • -j 8: Runs 8 jobs in parallel.
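
Where GNU parallel is not installed, xargs can cover the simple one-command-per-file case. This is a sketch, assuming an xargs that supports -0 and -P (both are extensions found in GNU and BSD xargs, not POSIX); run_per_file is a hypothetical helper name.

```bash
# Run COMMAND once per file, with up to JOBS processes at a time.
# Usage: run_per_file JOBS COMMAND FILE...
run_per_file() {
    njobs=$1; cmd=$2; shift 2
    [ "$#" -gt 0 ] || return 0                      # nothing to do without files
    # NUL-separated names keep file names with spaces intact
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$njobs" "$cmd"
}

# Example: run fastqc on every compressed FASTQ file, 8 at a time
# run_per_file 8 fastqc *.fastq.gz
```

Passing the file names as arguments rather than parsing the output of ls avoids problems with unusual file names.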

11. Combining Results

Merge FastQ Files

cat *.fastq.gz > combined.fastq.gz
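
Concatenated gzip files form a valid multi-member gzip stream, which is why cat works here without decompressing. A read count before and after merging is a cheap sanity check. A minimal sketch, assuming four-line records; fastq_count_gz is a hypothetical helper name.

```bash
# Count reads in one or more gzip-compressed FASTQ files (4 lines per read).
# With several files, the total across all of them is printed.
fastq_count_gz() {
    gzip -dc "$@" | awk 'END { print NR / 4 }'
}

# Example: the two numbers should match
# fastq_count_gz sample1.fastq.gz sample2.fastq.gz
# fastq_count_gz combined.fastq.gz
```

A result that is not a whole number points to a truncated or multi-line file.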

Merge FASTA Files

cat *.fasta > combined.fasta

12. Advanced Tools

Use BioApps (if available)

If you have access to the BioApps tool mentioned in the original post, it provides a GUI for many of these tasks.


Summary

  • Viewing: Use zcat, head, or awk to inspect files.
  • Analyzing: Use fastqc or custom awk scripts.
  • Clipping: Use cutadapt for trimming and adapter removal.
  • Converting: Use sff2fastq or awk for format conversion.
  • Demultiplexing: Use cutadapt or fastq-multx.
  • Dereplicating: Use fastx_collapser or vsearch.
  • Splitting: Use split for large files.
  • Visualizing: Use fastqc or custom scripts for plots.

These tools and one-liners should help you efficiently process SFF/FastQ files for your bioinformatics workflows.
