Extracting Multiple FASTA Sequences at a Time from a File Containing Many Sequences

January 9, 2025 Off By admin

Extracting specific sequences from a large FASTA file is a common task in bioinformatics. Below are several methods to achieve this using different tools and programming languages, including Perl, Python (with Biopython), and command-line utilities.

Table of Contents

1. Using Perl

a. Basic Perl Script

This script reads a list of sequence IDs from a file and extracts the corresponding sequences from a FASTA file.

#!/usr/bin/perl -w

use strict;

my $idsfile = "ids.txt";  # File containing IDs, one per line
my $seqfile = "seqs.fa";  # Input FASTA file
my %ids = ();

# Read IDs into a hash
open FILE, $idsfile or die "Cannot open $idsfile: $!";
while (<FILE>) {
  chomp;
  $ids{$_} = 1;
}
close FILE;

# Read FASTA file and extract sequences
local $/ = "\n>";  # Read by FASTA record
open FASTA, $seqfile or die "Cannot open $seqfile: $!";
while (<FASTA>) {
    chomp;
    my $seq = $_;
    my ($id) = $seq =~ /^>*(\S+)/;  # Parse ID as first word in FASTA header
    if (exists($ids{$id})) {
        $seq =~ s/^>*.+\n//;  # Remove FASTA header
        $seq =~ s/\n//g;  # Remove endlines
        print ">$id\n$seq\n";
    }
}
close FASTA;

Usage:

perl extractSeqs.pl ids.txt seqs.fa > extracted_seqs.fa

b. Using BioPerl

BioPerl provides a more robust and efficient way to handle FASTA files, especially for large datasets.

use Bio::DB::Fasta;

my $fastaFile = shift;  # Input FASTA file
my $queryFile = shift;  # File containing IDs, one per line

my $db = Bio::DB::Fasta->new($fastaFile);
open (IN, $queryFile) or die "Cannot open $queryFile: $!";
while (<IN>) {
    chomp;
    my $seq = $_;
    my $sequence = $db->seq($seq);
    if (!defined($sequence)) {
        warn "Sequence $seq not found.\n";
        next;
    }
    print ">$seq\n$sequence\n";
}
close IN;

Usage:

perl extractSeqsBioPerl.pl seqs.fa ids.txt > extracted_seqs.fa

2. Using Python (with Biopython)

a. Basic Python Script

This script uses Biopython to parse the FASTA file and extract sequences based on a list of IDs.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]  # Input FASTA file
id_file = sys.argv[2]     # File containing IDs, one per line
output_file = sys.argv[3] # Output FASTA file

# Read IDs into a set
with open(id_file) as f:
    wanted = set(line.strip() for line in f if line.strip())

# Parse FASTA file and extract sequences
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print(f"Saved {count} records from {input_file} to {output_file}")
if count < len(wanted):
    print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")

Usage:

python extractSeqs.py seqs.fa ids.txt extracted_seqs.fa

b. Using Biopython with Indexing

For very large FASTA files, indexing can be more efficient.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]  # Input FASTA file
id_file = sys.argv[2]     # File containing IDs, one per line
output_file = sys.argv[3] # Output FASTA file

# Read IDs into a set
with open(id_file) as f:
    wanted = set(line.strip() for line in f if line.strip())

# Index FASTA file and extract sequences
index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")

print(f"Saved {count} records from {input_file} to {output_file}")
if count < len(wanted):
    print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")

Usage:

python extractSeqsIndex.py seqs.fa ids.txt extracted_seqs.fa

3. Using Command-Line Tools

a. Using `samtools`

If you have a FASTA file indexed with samtools, you can extract sequences by ID.

samtools faidx seqs.fa
xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa

Usage:

samtools faidx seqs.fa
xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa

b. Using `blastdbcmd`

If you have a BLAST database, you can use blastdbcmd to extract sequences.

blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa

Usage:

blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa

4. Using `seqtk`

seqtk is a fast and lightweight tool for processing sequences in FASTA/Q format.

seqtk subseq seqs.fa ids.txt > extracted_seqs.fa

Usage:

seqtk subseq seqs.fa ids.txt > extracted_seqs.fa

Conclusion

Extracting multiple sequences from a large FASTA file can be efficiently achieved using various tools and programming languages. For small to medium-sized files, Perl or Python scripts are sufficient. For larger files, consider using indexed approaches or specialized tools like samtools, blastdbcmd, or seqtk. Each method has its advantages, so choose the one that best fits your needs and workflow.