
Extracting Multiple FASTA Sequences at a Time from a File Containing Many Sequences
January 9, 2025Extracting specific sequences from a large FASTA file is a common task in bioinformatics. Below are several methods to achieve this using different tools and programming languages, including Perl, Python (with Biopython), and command-line utilities.
1. Using Perl
a. Basic Perl Script
This script reads a list of sequence IDs from a file and extracts the corresponding sequences from a FASTA file.
#!/usr/bin/perl -w use strict; my $idsfile = "ids.txt"; # File containing IDs, one per line my $seqfile = "seqs.fa"; # Input FASTA file my %ids = (); # Read IDs into a hash open FILE, $idsfile or die "Cannot open $idsfile: $!"; while (<FILE>) { chomp; $ids{$_} = 1; } close FILE; # Read FASTA file and extract sequences local $/ = "\n>"; # Read by FASTA record open FASTA, $seqfile or die "Cannot open $seqfile: $!"; while (<FASTA>) { chomp; my $seq = $_; my ($id) = $seq =~ /^>*(\S+)/; # Parse ID as first word in FASTA header if (exists($ids{$id})) { $seq =~ s/^>*.+\n//; # Remove FASTA header $seq =~ s/\n//g; # Remove endlines print ">$id\n$seq\n"; } } close FASTA;
Usage:
perl extractSeqs.pl ids.txt seqs.fa > extracted_seqs.fab. Using BioPerl
BioPerl provides a more robust and efficient way to handle FASTA files, especially for large datasets.
use Bio::DB::Fasta; my $fastaFile = shift; # Input FASTA file my $queryFile = shift; # File containing IDs, one per line my $db = Bio::DB::Fasta->new($fastaFile); open (IN, $queryFile) or die "Cannot open $queryFile: $!"; while (<IN>) { chomp; my $seq = $_; my $sequence = $db->seq($seq); if (!defined($sequence)) { warn "Sequence $seq not found.\n"; next; } print ">$seq\n$sequence\n"; } close IN;
Usage:
perl extractSeqsBioPerl.pl seqs.fa ids.txt > extracted_seqs.fa2. Using Python (with Biopython)
a. Basic Python Script
This script uses Biopython to parse the FASTA file and extract sequences based on a list of IDs.
#!/usr/bin/env python import sys from Bio import SeqIO input_file = sys.argv[1] # Input FASTA file id_file = sys.argv[2] # File containing IDs, one per line output_file = sys.argv[3] # Output FASTA file # Read IDs into a set with open(id_file) as f: wanted = set(line.strip() for line in f if line.strip()) # Parse FASTA file and extract sequences records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted) count = SeqIO.write(records, output_file, "fasta") print(f"Saved {count} records from {input_file} to {output_file}") if count < len(wanted): print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")
Usage:
python extractSeqs.py seqs.fa ids.txt extracted_seqs.fa
b. Using Biopython with Indexing
For very large FASTA files, indexing can be more efficient.
#!/usr/bin/env python import sys from Bio import SeqIO input_file = sys.argv[1] # Input FASTA file id_file = sys.argv[2] # File containing IDs, one per line output_file = sys.argv[3] # Output FASTA file # Read IDs into a set with open(id_file) as f: wanted = set(line.strip() for line in f if line.strip()) # Index FASTA file and extract sequences index = SeqIO.index(input_file, "fasta") records = (index[r] for r in wanted) count = SeqIO.write(records, output_file, "fasta") print(f"Saved {count} records from {input_file} to {output_file}") if count < len(wanted): print(f"Warning: {len(wanted) - count} IDs not found in {input_file}")
Usage:
python extractSeqsIndex.py seqs.fa ids.txt extracted_seqs.fa
3. Using Command-Line Tools
a. Using samtools
If you have a FASTA file indexed with samtools, you can extract sequences by ID.
samtools faidx seqs.fa xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa
Usage:
samtools faidx seqs.fa xargs -a ids.txt -I{} samtools faidx seqs.fa {} > extracted_seqs.fa
b. Using blastdbcmd
If you have a BLAST database, you can use blastdbcmd to extract sequences.
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa
Usage:
blastdbcmd -db seqs.fa -entry_batch ids.txt -out extracted_seqs.fa
4. Using seqtk
seqtk is a fast and lightweight tool for processing sequences in FASTA/Q format.
seqtk subseq seqs.fa ids.txt > extracted_seqs.faUsage:
seqtk subseq seqs.fa ids.txt > extracted_seqs.faConclusion
Extracting multiple sequences from a large FASTA file can be efficiently achieved using various tools and programming languages. For small to medium-sized files, Perl or Python scripts are sufficient. For larger files, consider using indexed approaches or specialized tools like samtools, blastdbcmd, or seqtk. Each method has its advantages, so choose the one that best fits your needs and workflow.


















