Extracting Multiple FASTA Sequences at a Time from a File Containing Many Sequences
January 9, 2025Extracting specific sequences from a large FASTA file is a common task in bioinformatics. Below are several methods to achieve this using different tools and programming languages, including Perl, Python (with Biopython), and command-line utilities.
1. Using Perl
a. Basic Perl Script
This script reads a list of sequence IDs from a file and extracts the corresponding sequences from a FASTA file.
#!/usr/bin/perl -w use strict; my $idsfile = "ids.txt"; # File containing IDs, one per line my $seqfile = "seqs.fa"; # Input FASTA file my %ids = (); # Read IDs into a hash open FILE, $idsfile or die "Cannot open $idsfile: $!"; while (<FILE>) { chomp; $ids{$_} = 1; } close FILE; # Read FASTA file and extract sequences local $/ = "\n>"; # Read by FASTA record open FASTA, $seqfile or die "Cannot open $seqfile: $!"; while (<FASTA>) { chomp; my $seq = $_; my ($id) = $seq =~ /^>*(\S+)/; # Parse ID as first word in FASTA header if (exists($ids{$id})) { $seq =~ s/^>*.+\n//; # Remove FASTA header $seq =~ s/\n//g; # Remove endlines print ">$id\n$seq\n"; } } close FASTA;
Usage:
perl extractSeqs.pl ids.txt seqs.fa > extracted_seqs.fa
b. Using BioPerl
BioPerl provides a more robust and efficient way to handle FASTA files, especially for large datasets.
use Bio::DB::Fasta; my $fastaFile = shift; # Input FASTA file my $queryFile = shift; # File containing IDs, one per line my $db = Bio::DB::Fasta->new($fastaFile); open (IN, $queryFile) or die "Cannot open $queryFile: $!"; while (<IN>) { chomp; my $seq = $_; my $sequence = $db->seq($seq); if (!defined($sequence)) { warn "Sequence $seq not found.\n"; next; } print ">$seq\n$sequence\n"; } close IN;
Usage:
perl extractSeqsBioPerl.pl seqs.fa ids.txt > extracted_seqs.fa
2. Using Python (with Biopython)
a. Basic Python Script
This script uses Biopython to parse the FASTA file and extract sequences based on a list of IDs.
#!/usr/bin/env python import sys from Bio import SeqIO input_file = sys.argv[1] # Input FASTA file id_file = sys.argv[2] # File containing IDs, one per line output_file = sys.argv[3] # Output FASTA file # Read IDs into a set with open(id_file) as f: wanted = set(line.strip() for line in f if line.strip()) # Parse FASTA file and extract sequences records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted) count = SeqIO.write(records, output_file, "fasta") print(f"Saved {count} records from {input_file} to {output_file}") if count < len(wanted): print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")
Usage:
python extractSeqs.py seqs.fa ids.txt extracted_seqs.fa
b. Using Biopython with Indexing
For very large FASTA files, indexing can be more efficient.
#!/usr/bin/env python import sys from Bio import SeqIO input_file = sys.argv[1] # Input FASTA file id_file = sys.argv[2] # File containing IDs, one per line output_file = sys.argv[3] # Output FASTA file # Read IDs into a set with open(id_file) as f: wanted = set(line.strip() for line in f if line.strip()) # Index FASTA file and extract sequences index = SeqIO.index(input_file, "fasta") records = (index[r] for r in wanted) count = SeqIO.write(records, output_file, "fasta") print(f"Saved {count} records from {input_file} to {output_file}") if count < len(wanted): print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")
Usage:
python extractSeqsIndex.py seqs.fa ids.txt extracted_seqs.fa
3. Using Command-Line Tools
a. Using samtools
If you have a FASTA file indexed with samtools
, you can extract sequences by ID.
samtools faidx seqs.fa xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa
Usage:
samtools faidx seqs.fa xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa
b. Using blastdbcmd
If you have a BLAST database, you can use blastdbcmd
to extract sequences.
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa
Usage:
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa
4. Using seqtk
seqtk
is a fast and lightweight tool for processing sequences in FASTA/Q format.
seqtk subseq seqs.fa ids.txt > extracted_seqs.fa
Usage:
seqtk subseq seqs.fa ids.txt > extracted_seqs.fa
Conclusion
Extracting multiple sequences from a large FASTA file can be efficiently achieved using various tools and programming languages. For small to medium-sized files, Perl or Python scripts are sufficient. For larger files, consider using indexed approaches or specialized tools like samtools
, blastdbcmd
, or seqtk
. Each method has its advantages, so choose the one that best fits your needs and workflow.