insurance-bioinformatics

Extracting Multiple FASTA Sequences at a Time from a File Containing Many Sequences

January 9, 2025 Off By admin
Shares

Extracting specific sequences from a large FASTA file is a common task in bioinformatics. Below are several methods to achieve this using different tools and programming languages, including Perl, Python (with Biopython), and command-line utilities.


1. Using Perl

a. Basic Perl Script

This script reads a list of sequence IDs from a file and extracts the corresponding sequences from a FASTA file.

perl
Copy
#!/usr/bin/perl -w

use strict;

my $idsfile = "ids.txt";  # File containing IDs, one per line
my $seqfile = "seqs.fa";  # Input FASTA file
my %ids = ();

# Read IDs into a hash
open FILE, $idsfile or die "Cannot open $idsfile: $!";
while (<FILE>) {
  chomp;
  $ids{$_} = 1;
}
close FILE;

# Read FASTA file and extract sequences
local $/ = "\n>";  # Read by FASTA record
open FASTA, $seqfile or die "Cannot open $seqfile: $!";
while (<FASTA>) {
    chomp;
    my $seq = $_;
    my ($id) = $seq =~ /^>*(\S+)/;  # Parse ID as first word in FASTA header
    if (exists($ids{$id})) {
        $seq =~ s/^>*.+\n//;  # Remove FASTA header
        $seq =~ s/\n//g;  # Remove endlines
        print ">$id\n$seq\n";
    }
}
close FASTA;

Usage:

bash
Copy
perl extractSeqs.pl ids.txt seqs.fa > extracted_seqs.fa

b. Using BioPerl

BioPerl provides a more robust and efficient way to handle FASTA files, especially for large datasets.

perl
Copy
use Bio::DB::Fasta;

my $fastaFile = shift;  # Input FASTA file
my $queryFile = shift;  # File containing IDs, one per line

my $db = Bio::DB::Fasta->new($fastaFile);
open (IN, $queryFile) or die "Cannot open $queryFile: $!";
while (<IN>) {
    chomp;
    my $seq = $_;
    my $sequence = $db->seq($seq);
    if (!defined($sequence)) {
        warn "Sequence $seq not found.\n";
        next;
    }
    print ">$seq\n$sequence\n";
}
close IN;

Usage:

bash
Copy
perl extractSeqsBioPerl.pl seqs.fa ids.txt > extracted_seqs.fa

2. Using Python (with Biopython)

a. Basic Python Script

This script uses Biopython to parse the FASTA file and extract sequences based on a list of IDs.

python
Copy
#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]  # Input FASTA file
id_file = sys.argv[2]     # File containing IDs, one per line
output_file = sys.argv[3] # Output FASTA file

# Read IDs into a set
with open(id_file) as f:
    wanted = set(line.strip() for line in f if line.strip())

# Parse FASTA file and extract sequences
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print(f"Saved {count} records from {input_file} to {output_file}")
if count < len(wanted):
    print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")

Usage:

bash
Copy
python extractSeqs.py seqs.fa ids.txt extracted_seqs.fa

b. Using Biopython with Indexing

For very large FASTA files, indexing can be more efficient.

python
Copy
#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]  # Input FASTA file
id_file = sys.argv[2]     # File containing IDs, one per line
output_file = sys.argv[3] # Output FASTA file

# Read IDs into a set
with open(id_file) as f:
    wanted = set(line.strip() for line in f if line.strip())

# Index FASTA file and extract sequences
index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")

print(f"Saved {count} records from {input_file} to {output_file}")
if count < len(wanted):
    print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")

Usage:

bash
Copy
python extractSeqsIndex.py seqs.fa ids.txt extracted_seqs.fa

3. Using Command-Line Tools

a. Using samtools

If you have a FASTA file indexed with samtools, you can extract sequences by ID.

bash
Copy
samtools faidx seqs.fa
xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa

Usage:

bash
Copy
samtools faidx seqs.fa
xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa

b. Using blastdbcmd

If you have a BLAST database, you can use blastdbcmd to extract sequences.

bash
Copy
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa

Usage:

bash
Copy
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa

4. Using seqtk

seqtk is a fast and lightweight tool for processing sequences in FASTA/Q format.

bash
Copy
seqtk subseq seqs.fa ids.txt > extracted_seqs.fa

Usage:

bash
Copy
seqtk subseq seqs.fa ids.txt > extracted_seqs.fa

Conclusion

Extracting multiple sequences from a large FASTA file can be efficiently achieved using various tools and programming languages. For small to medium-sized files, Perl or Python scripts are sufficient. For larger files, consider using indexed approaches or specialized tools like samtoolsblastdbcmd, or seqtk. Each method has its advantages, so choose the one that best fits your needs and workflow.

Shares