How To Filter Multi-FASTA By Length?

January 3, 2025 By admin

Filtering a multi-FASTA file by sequence length is a common task in bioinformatics. Below are step-by-step instructions using Unix, Perl, Python, and R, along with some online tools and software.

1. Using Unix (AWK)

AWK is a powerful text-processing tool that can be used to filter FASTA files by sequence length.

Step-by-Step Instructions:

Save the following AWK script in a file (e.g., filter_fasta.awk):

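#!/usr/bin/awk -f
# Records are separated by ">" and fields by newlines, so $1 is the
# header line and $2..$NF are the sequence lines of one record.
BEGIN { RS = ">" ; FS = "\n" ; ORS = "" }
NR > 1 {    # record 1 is the empty text before the first ">"
    seq_len = 0
    # Sum every sequence line so wrapped sequences are measured in full
    for (i = 2; i <= NF; i++) seq_len += length($i)
    if (seq_len >= min_length) {
        print ">" $0
    }
}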

Make the script executable:
```
chmod +x filter_fasta.awk
```
Run the script with your FASTA file:
```
awk -v min_length=200 -f filter_fasta.awk input.fasta > output.fasta
```
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
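
The length of a FASTA record is the total over all of its sequence lines, which matters when sequences are wrapped across several lines. As a minimal sketch of that calculation (the function name `record_lengths` and the example data are illustrative, not part of any tool above), the same ">" record separator can be applied to a string in Python:

```python
def record_lengths(fasta_text):
    """Return {header: total sequence length} for FASTA-formatted text."""
    lengths = {}
    # Split on ">" (the record separator the AWK script uses); the chunk
    # before the first ">" is dropped.
    for chunk in fasta_text.split(">")[1:]:
        lines = chunk.splitlines()
        if not lines:
            continue  # a stray ">" with nothing after it
        header, seq_lines = lines[0], lines[1:]
        # Sum over every sequence line so wrapped records count in full.
        lengths[header] = sum(len(line.strip()) for line in seq_lines)
    return lengths


example = ">seq1 wrapped over two lines\nACGTAC\nGTAC\n>seq2\nACG\n"
lengths = record_lengths(example)
print(lengths)  # {'seq1 wrapped over two lines': 10, 'seq2': 3}
```

With a minimum length of 5, only seq1 would be kept, even though neither of its individual lines is long enough to pass on its own once the header has several words.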

2. Using Perl

Perl is another versatile scripting language for bioinformatics tasks.

Step-by-Step Instructions:

Save the following Perl script in a file (e.g., filter_fasta.pl):

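#!/usr/bin/perl
use strict;
use warnings;

# The minimum length is the first argument; the remaining arguments are input files.
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    local $/ = ">";                 # read one FASTA record at a time
    while (<>) {
        chomp;                      # drop the trailing ">" record separator
        next unless /\w/;           # skip the empty chunk before the first header
        my @chunk = split /\n/;
        my $header = shift @chunk;  # the first line is the header
        my $seqlen = length join "", @chunk;  # total over all sequence lines
        print ">$_" if ($seqlen >= $minlen);
    }
}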

Make the script executable:
```
chmod +x filter_fasta.pl
```
Run the script with your FASTA file:
```
perl filter_fasta.pl 200 input.fasta > output.fasta
```
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.

3. Using Python

Python is widely used in bioinformatics due to its readability and extensive libraries.

Step-by-Step Instructions:

Save the following Python script in a file (e.g., filter_fasta.py). It depends on Biopython, which can be installed with pip install biopython:

from Bio import SeqIO

def filter_fasta(input_file, output_file, min_length):
    with open(output_file, "w") as out_handle:
        for record in SeqIO.parse(input_file, "fasta"):
            if len(record.seq) >= min_length:
                SeqIO.write(record, out_handle, "fasta")

if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    min_length = int(sys.argv[3])
    filter_fasta(input_file, output_file, min_length)

Run the script with your FASTA file:
```
python filter_fasta.py input.fasta output.fasta 200
```
- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.
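
Where Biopython cannot be installed, the same filter can be approximated with the standard library alone. The following is a minimal sketch under that assumption, not a replacement for SeqIO: the names `read_fasta` and `filter_fasta_plain` and the 60-column wrap width are choices made for this example, and the demo writes a small made-up file to a temporary directory.

```python
import os
import tempfile


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, seq_parts = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line.strip())
    if header is not None:
        yield header, "".join(seq_parts)


def filter_fasta_plain(input_file, output_file, min_length, width=60):
    """Write records with length >= min_length; return how many were kept."""
    kept = 0
    with open(input_file) as in_handle, open(output_file, "w") as out_handle:
        for header, seq in read_fasta(in_handle):
            if len(seq) >= min_length:
                out_handle.write(f">{header}\n")
                # Re-wrap the joined sequence at a fixed line width.
                for i in range(0, len(seq), width):
                    out_handle.write(seq[i:i + width] + "\n")
                kept += 1
    return kept


# Demo on a tiny made-up file: one 4-base record and one 12-base record.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.fasta")
    dst = os.path.join(tmp, "output.fasta")
    with open(src, "w") as fh:
        fh.write(">short\nACGT\n>long\nACGTAC\nGTACGT\n")
    kept = filter_fasta_plain(src, dst, min_length=10)
    with open(dst) as fh:
        result = fh.read()

print(kept)  # 1
print(result)
```

The parser reads one line at a time, so memory use depends on the longest single record rather than on the size of the whole file.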

4. Using R

R is a powerful language for statistical computing and graphics, and it can also handle FASTA files.

Step-by-Step Instructions:

Install the Biostrings package if you haven’t already:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")

Run the following R script:

library(Biostrings)

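filter_fasta <- function(input_file, output_file, min_length) {
    # readDNAStringSet() expects nucleotide sequences;
    # for protein FASTA files, use readAAStringSet() instead.
    fasta <- readDNAStringSet(input_file)
    # width() returns the length of each sequence in the set
    filtered_fasta <- fasta[width(fasta) >= min_length]
    writeXStringSet(filtered_fasta, output_file)
}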

# Usage
filter_fasta("input.fasta", "output.fasta", 200)

- Replace 200 with your desired minimum sequence length.
- input.fasta is your input FASTA file.
- output.fasta will contain sequences with lengths ≥ 200.

5. Online Tools and Software

a. Bioawk

Bioawk is an extension of AWK designed for bioinformatics. It simplifies parsing of FASTA files.

Installation:

git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin/

Usage:

bioawk -c fastx -v min_length=200 '{ if(length($seq) >= min_length) { print ">"$name; print $seq }}' input.fasta > output.fasta

b. SeqKit

SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

Installation:

wget https://github.com/shenwei356/seqkit/releases/download/v2.0.0/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
sudo mv seqkit /usr/local/bin/

Usage:

seqkit seq -m 200 input.fasta > output.fasta
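
Whichever method produced output.fasta, it can be worth confirming that no record falls below the threshold. The sketch below is one way to do that with the Python standard library; the function name `summarize_lengths` is hypothetical, and the demo uses an in-memory example rather than a real file.

```python
import io


def summarize_lengths(handle):
    """Return (record_count, shortest_length) for an open FASTA handle."""
    lengths = []
    current = None  # running length of the record being read
    for line in handle:
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return len(lengths), (min(lengths) if lengths else None)


# In-memory stand-in for a filtered file: records of 8 and 10 bases.
demo = io.StringIO(">a\nACGTACGT\n>b\nACGT\nACGT\nAC\n")
count, shortest = summarize_lengths(demo)
print(count, shortest)  # 2 8
```

For a real file, pass open("output.fasta") instead of the StringIO object; the shortest length reported should be at least the minimum used for filtering.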

c. Galaxy

Galaxy is an open-source platform for data-intensive biomedical research. It provides a web-based interface for filtering FASTA files by length.

Website: https://usegalaxy.org/
Tool: “Filter by length” under “FASTA manipulation.”

Conclusion

Filtering a multi-FASTA file by sequence length can be done using various tools and programming languages. Choose the method that best fits your workflow and environment. For quick tasks, online tools like Galaxy or command-line tools like Bioawk and SeqKit are highly recommended.