Cutting-Edge Bioinformatics Techniques

How To Filter Multi-FASTA By Length?

January 3, 2025, by admin

Filtering a multi-FASTA file by sequence length is a common task in bioinformatics. Below are step-by-step instructions using AWK, Perl, Python, and R, followed by dedicated tools (Bioawk, SeqKit, and Galaxy).


1. Using Unix (AWK)

AWK is a powerful text-processing tool that can be used to filter FASTA files by sequence length.

Step-by-Step Instructions:

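  1. Save the following AWK script in a file (e.g., filter_fasta.awk):
    #!/usr/bin/awk -f
    # Records are separated by ">" and fields by newlines, so $1 is the
    # header line and $2..$NF are the (possibly wrapped) sequence lines.
    BEGIN { RS = ">" ; FS = "\n" ; ORS = "" }
    NF {    # skip the empty record before the first ">"
        seqlen = 0
        for (i = 2; i <= NF; i++) seqlen += length($i)
        if (seqlen >= min_length) {
            print ">" $0
        }
    }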
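  2. Optionally, make the script executable. This is only needed to run it directly as ./filter_fasta.awk; the command in step 3 passes the script to awk with -f and works either way:
    chmod +x filter_fasta.awk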
  3. Run the script with your FASTA file:
    awk -v min_length=200 -f filter_fasta.awk input.fasta > output.fasta
    • Replace 200 with your desired minimum sequence length.
    • input.fasta is your input FASTA file.
    • output.fasta will contain sequences with lengths ≥ 200.
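
The record-by-record logic behind this kind of filter can be sketched in plain Python, with no extra libraries. This is only an illustration of the idea, not a replacement for the AWK script: the function names read_fasta and filter_by_length and the example records are made up for this sketch. The key point is that a sequence may be wrapped over several lines, so its length is the sum of all lines between one header and the next.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) for each record in a FASTA stream."""
    header = None
    seq_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line[1:], []
        elif header is not None and line:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def filter_by_length(lines: Iterable[str], min_length: int) -> Iterator[str]:
    """Yield the lines of every record whose sequence length is >= min_length."""
    for header, seq_lines in read_fasta(lines):
        # Sum over all sequence lines so wrapped records are measured in full.
        if sum(len(s) for s in seq_lines) >= min_length:
            yield ">" + header
            yield from seq_lines


# Hypothetical example: seq1 has 4 bases, seq2 has 16 (wrapped), seq3 has 2.
example = [">seq1 short", "ACGT", ">seq2 wrapped", "ACGTACGT", "ACGTACGT", ">seq3", "AC"]
kept = list(filter_by_length(example, 8))
print(kept)  # ['>seq2 wrapped', 'ACGTACGT', 'ACGTACGT']
```

Because both functions work on any iterable of lines, the same code can be fed an open file handle instead of the example list.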

2. Using Perl

Perl is another versatile scripting language for bioinformatics tasks.

Step-by-Step Instructions:

  1. Save the following Perl script in a file (e.g., filter_fasta.pl):
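    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # First argument: minimum length; remaining arguments: FASTA file(s).
    my $minlen = shift;
    die "Error: `minlen` parameter not provided\n" unless defined $minlen;
    {
        local $/ = ">";           # read one FASTA record at a time
        while (<>) {
            chomp;                # strip the trailing ">" separator
            next unless /\w/;     # skip the empty chunk before the first ">"
            my @chunk = split /\n/;
            shift @chunk;         # drop the header line
            my $seqlen = length join "", @chunk;
            print ">$_" if ($seqlen >= $minlen);
        }
    }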
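  2. Optionally, make the script executable. This is only needed to run it directly as ./filter_fasta.pl; the command in step 3 calls perl explicitly and works either way:
    chmod +x filter_fasta.pl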
  3. Run the script with your FASTA file:
    perl filter_fasta.pl 200 input.fasta > output.fasta
    • Replace 200 with your desired minimum sequence length.
    • input.fasta is your input FASTA file.
    • output.fasta will contain sequences with lengths ≥ 200.

3. Using Python

Python is widely used in bioinformatics due to its readability and extensive libraries.

Step-by-Step Instructions:

  1. Save the following Python script in a file (e.g., filter_fasta.py). It requires Biopython, which can be installed with pip install biopython:
    from Bio import SeqIO
    
    def filter_fasta(input_file, output_file, min_length):
        with open(output_file, "w") as out_handle:
            for record in SeqIO.parse(input_file, "fasta"):
                if len(record.seq) >= min_length:
                    SeqIO.write(record, out_handle, "fasta")
    
    if __name__ == "__main__":
        import sys
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        min_length = int(sys.argv[3])
        filter_fasta(input_file, output_file, min_length)
  2. Run the script with your FASTA file:
    python filter_fasta.py input.fasta output.fasta 200
    • Replace 200 with your desired minimum sequence length.
    • input.fasta is your input FASTA file.
    • output.fasta will contain sequences with lengths ≥ 200.
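
After filtering, it can be worth confirming that no sequence in the output is shorter than the threshold. The following is a minimal sketch using only the standard library; the helper name sequence_lengths and the demo records are hypothetical, and it assumes record IDs (the first word of each header) are unique within the file.

```python
import os
import tempfile


def sequence_lengths(path):
    """Return {record_id: sequence_length} for a FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the record ID.
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                # Wrapped sequences span several lines; add each one up.
                lengths[current] += len(line)
    return lengths


# Demo on a small temporary file standing in for output.fasta.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.fasta")
    with open(demo_path, "w") as fh:
        fh.write(">a\nACGTACGT\nACGT\n>b desc\nACGTACGTAC\n")
    lengths = sequence_lengths(demo_path)

shortest = min(lengths.values())
print(lengths, shortest)  # {'a': 12, 'b': 10} 10
```

On a real run, pointing sequence_lengths at output.fasta and comparing the shortest value with the chosen minimum gives a quick sanity check.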

4. Using R

R is a powerful language for statistical computing and graphics, and it can also handle FASTA files.

Step-by-Step Instructions:

  1. Install the Biostrings package if you haven’t already:
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("Biostrings")
  2. Run the following R script. It assumes DNA sequences; for protein FASTA files, use readAAStringSet() in place of readDNAStringSet():
    library(Biostrings)
    
    filter_fasta <- function(input_file, output_file, min_length) {
        fasta <- readDNAStringSet(input_file)
        filtered_fasta <- fasta[width(fasta) >= min_length]
        writeXStringSet(filtered_fasta, output_file)
    }
    
    # Usage
    filter_fasta("input.fasta", "output.fasta", 200)
    • Replace 200 with your desired minimum sequence length.
    • input.fasta is your input FASTA file.
    • output.fasta will contain sequences with lengths ≥ 200.

5. Online Tools and Software

a. Bioawk

Bioawk is an extension of AWK designed for bioinformatics. It simplifies parsing of FASTA files.

  • Installation:
    git clone https://github.com/lh3/bioawk
    cd bioawk
    make
    sudo cp bioawk /usr/local/bin/
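  • Usage (this prints only the sequence name, so any description after the first space in the header is dropped, and each sequence is written on a single line):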
    bioawk -c fastx -v min_length=200 '{ if(length($seq) >= min_length) { print ">"$name; print $seq }}' input.fasta > output.fasta

b. SeqKit

SeqKit is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

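  • Installation (the command below downloads release v2.0.0; newer versions are listed on the project's GitHub releases page):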
    wget https://github.com/shenwei356/seqkit/releases/download/v2.0.0/seqkit_linux_amd64.tar.gz
    tar -zxvf seqkit_linux_amd64.tar.gz
    sudo mv seqkit /usr/local/bin/
  • Usage:
    seqkit seq -m 200 input.fasta > output.fasta

c. Galaxy

Galaxy is an open-source platform for data-intensive biomedical research. It provides a web-based interface for filtering FASTA files by length.


Conclusion

Filtering a multi-FASTA file by sequence length can be done using various tools and programming languages. Choose the method that best fits your workflow and environment. For quick tasks, online tools like Galaxy or command-line tools like Bioawk and SeqKit are highly recommended.
