How To Filter Multi-FASTA By Length?
January 3, 2025
Filtering a multi-FASTA file by sequence length is a common task in bioinformatics. Below are step-by-step instructions using Unix, Perl, Python, and R, along with some online tools and software.
1. Using Unix (AWK)
AWK is a powerful text-processing tool that can be used to filter FASTA files by sequence length.
Step-by-Step Instructions:
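- Save the following AWK script in a file (e.g., filter_fasta.awk):
    #!/bin/awk -f
    # Read one ">"-delimited FASTA record at a time, with one line per field.
    BEGIN { RS = ">"; FS = "\n"; ORS = "" }
    # Skip the empty record that precedes the first ">".
    $0 != "" {
        seq = ""
        # Field 1 is the header; join the remaining lines so wrapped sequences are measured in full.
        for (i = 2; i <= NF; i++) seq = seq $i
        if (length(seq) >= min_length) print ">" $0
    }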
- Make the script executable:
chmod +x filter_fasta.awk
- Run the script with your FASTA file:
awk -v min_length=200 -f filter_fasta.awk input.fasta > output.fasta
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
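The record-based idea behind the AWK approach (split the file on ">", treat the first line of each chunk as the header, and measure everything after it) can be sketched in Python as well. This is only an illustrative sketch, not part of the original script: the function name `filter_fasta_text` and the sample records are made up for the example, and it assumes that ">" never appears inside a header.

```python
def filter_fasta_text(text, min_length):
    """Return the FASTA records in `text` whose sequence has at least `min_length` residues."""
    kept = []
    # Splitting on ">" gives one chunk per record; the chunk before the first ">" is empty.
    for chunk in text.split(">"):
        if not chunk.strip():
            continue
        # The first line of the chunk is the header, the rest is the sequence.
        header, _, body = chunk.partition("\n")
        # Remove line breaks so a wrapped sequence is measured in full.
        sequence = "".join(body.split())
        if len(sequence) >= min_length:
            kept.append(">" + header + "\n" + body.rstrip("\n") + "\n")
    return "".join(kept)


# Hypothetical sample data: one short record and one record wrapped over two lines.
example = ">seq1 short\nACGT\n>seq2 wrapped\nACGTACGT\nACGTACGT\n"
filtered = filter_fasta_text(example, min_length=10)
print(filtered, end="")
```

Only the second record passes here, because its two 8-base lines are counted together as 16 bases.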
2. Using Perl
Perl is another versatile scripting language for bioinformatics tasks.
Step-by-Step Instructions:
- Save the following Perl script in a file (e.g., filter_fasta.pl):
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $minlen = shift or die "Error: `minlen` parameter not provided\n";
    {
        local $/ = ">";                # read one ">"-delimited record at a time
        while (<>) {
            chomp;                     # strip the trailing ">"
            next unless /\w/;          # skip the empty chunk before the first record
            s/>$//gs;
            my @chunk  = split /\n/;
            my $header = shift @chunk; # first line is the header
            my $seqlen = length join "", @chunk;
            print ">$_" if ($seqlen >= $minlen);
        }
    }
- Make the script executable:
chmod +x filter_fasta.pl
- Run the script with your FASTA file:
perl filter_fasta.pl 200 input.fasta > output.fasta
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
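Whichever script produces output.fasta, it can be worth confirming that no record shorter than the threshold slipped through. The following is a rough sketch of such a check using only the Python standard library. The helper name `sequence_lengths` is hypothetical, the in-memory sample stands in for a real output file, and the sketch assumes record names (the first word of each header) are unique.

```python
import io


def sequence_lengths(handle):
    """Map each record name (first word of the header) to its total sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Add up every sequence line so wrapped records are counted in full.
            lengths[name] += len(line)
    return lengths


# Stand-in for open("output.fasta"); the records are made up for illustration.
sample_output = io.StringIO(">a\nACGTAC\nGT\n>b desc\nACGTACGTAC\n")
lengths = sequence_lengths(sample_output)
too_short = [name for name, length in lengths.items() if length < 8]
print(lengths)
print("records below threshold:", too_short)
```

With a real file, `sequence_lengths(open("output.fasta"))` would be the equivalent call, and an empty `too_short` list suggests the filter behaved as intended.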
3. Using Python
Python is widely used in bioinformatics due to its readability and extensive libraries.
Step-by-Step Instructions:
- Save the following Python script in a file (e.g., filter_fasta.py). It requires Biopython, which provides the Bio module:
    import sys

    from Bio import SeqIO


    def filter_fasta(input_file, output_file, min_length):
        # Write only the records whose sequence is at least min_length long.
        with open(output_file, "w") as out_handle:
            for record in SeqIO.parse(input_file, "fasta"):
                if len(record.seq) >= min_length:
                    SeqIO.write(record, out_handle, "fasta")


    if __name__ == "__main__":
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        min_length = int(sys.argv[3])
        filter_fasta(input_file, output_file, min_length)
- Run the script with your FASTA file:
python filter_fasta.py input.fasta output.fasta 200
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
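If Biopython is not available, a similar filter can be approximated with the standard library alone. The sketch below streams the input line by line, so it does not need to hold the whole file in memory. The names `read_fasta` and `filter_fasta_stream` are invented for this example, and it assumes a plain, uncompressed FASTA file in which every record starts with a ">" line.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence_lines) for each record in an open FASTA handle."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None and line:
            lines.append(line)
    # Emit the final record, which has no following ">" line to trigger it.
    if header is not None:
        yield header, lines


def filter_fasta_stream(in_handle, out_handle, min_length):
    """Copy records with at least `min_length` residues; return how many were kept."""
    kept = 0
    for header, lines in read_fasta(in_handle):
        if sum(len(part.strip()) for part in lines) >= min_length:
            out_handle.write(header + "\n")
            for part in lines:
                out_handle.write(part + "\n")
            kept += 1
    return kept


# In-memory stand-ins for input.fasta and output.fasta (made-up records).
source = io.StringIO(">s1\nAC\nGT\n>s2\nACGTACGT\n>s3\n\n")
result = io.StringIO()
kept = filter_fasta_stream(source, result, min_length=4)
print(kept)
print(result.getvalue(), end="")
```

With real files, the two `io.StringIO` objects would be replaced by `open("input.fasta")` and `open("output.fasta", "w")`. Unlike `SeqIO.write`, this sketch keeps the original line wrapping of each record instead of re-wrapping it.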
4. Using R
R is a powerful language for statistical computing and graphics, and it can also handle FASTA files.
Step-by-Step Instructions:
- Install the Biostrings package if you haven’t already:
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("Biostrings")
- Run the following R script:
    library(Biostrings)

    filter_fasta <- function(input_file, output_file, min_length) {
      # readDNAStringSet() expects nucleotide sequences;
      # for protein FASTA files, use readAAStringSet() instead.
      fasta <- readDNAStringSet(input_file)
      filtered_fasta <- fasta[width(fasta) >= min_length]
      writeXStringSet(filtered_fasta, output_file)
    }

    # Usage
    filter_fasta("input.fasta", "output.fasta", 200)
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
5. Online Tools and Software
a. Bioawk
Bioawk is an extension of AWK designed for bioinformatics. It simplifies parsing of FASTA files.
- Installation:
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin/
- Usage:
bioawk -c fastx -v min_length=200 '{ if(length($seq) >= min_length) { print ">"$name; print $seq }}' input.fasta > output.fasta
b. SeqKit
SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.
- Installation:
wget https://github.com/shenwei356/seqkit/releases/download/v2.0.0/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
sudo mv seqkit /usr/local/bin/
- Usage:
seqkit seq -m 200 input.fasta > output.fasta
c. Galaxy
Galaxy is an open-source platform for data-intensive biomedical research. It provides a web-based interface for filtering FASTA files by length.
- Website: https://usegalaxy.org/
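- Tool: “Filter sequences by length” (search the tool panel for “filter by length”; the section it is listed under, such as “FASTA manipulation,” can vary between Galaxy servers).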
Conclusion
Filtering a multi-FASTA file by sequence length can be done using various tools and programming languages. Choose the method that best fits your workflow and environment. For quick tasks, online tools like Galaxy or command-line tools like Bioawk and SeqKit are highly recommended.