How to extract FASTA sequences from a file using a list of headers provided in another file

December 28, 2024 By admin

Here’s a step-by-step guide to extracting FASTA sequences from a file using a list of headers stored in a second file. It covers Unix commands, the dedicated tools seqkit and seqtk, Perl, and Python, so you can pick the approach that suits your skill level and software setup.

Step-by-Step Guide

Prerequisites

FASTA file: Ensure your FASTA file (sequences.fas) is formatted correctly, with headers starting with > and sequences in one or more lines beneath each header.
Header list file: Create a plain text file (list.txt) with one header per line, without the > symbol.
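
Stray > characters, blank lines, or trailing descriptions in the header list can cause silent mismatches in any of the options below. As a quick sanity check, the list can be normalized first. The following is a minimal sketch in plain Python; the function name clean_header_list is hypothetical, and it assumes that matching is done on the first whitespace-delimited word of each header.

```python
def clean_header_list(lines):
    """Return de-duplicated sequence IDs, preserving first-seen order."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip()
        # Drop a leading '>' if the list was copied straight from a FASTA file
        if text.startswith(">"):
            text = text[1:].strip()
        if not text:
            continue  # skip blank lines
        seq_id = text.split()[0]  # assumption: the ID is the first word
        if seq_id not in seen:
            seen.add(seq_id)
            cleaned.append(seq_id)
    return cleaned


# Made-up example input, as it might be read from list.txt
example = [">seq1 some description\n", "seq2\n", "\n", "seq1\n"]
print(clean_header_list(example))  # prints ['seq1', 'seq2']
```

The cleaned IDs can then be written back to list.txt, one per line, before running any of the commands.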

Option 1: Using `grep` Command (Unix/Linux)

Steps

Run the following command:
bash
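grep -w -A 1 --no-group-separator -Ff list.txt sequences.fas > output.fas
- -w: Match whole words only, so that seq1 does not also match seq10.
- -A 1: Include one line after each matching line (use -A 2 if every sequence spans two lines, and so on).
- -Ff list.txt: Treat each line of list.txt as a fixed string rather than a regular expression.
- --no-group-separator: Suppress the -- lines that grep otherwise prints between groups of matches (a GNU grep option).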
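Handle multiline sequences: If sequences span a varying number of lines, a fixed -A value no longer works. Use awk to print every line from a matching header up to the next header. The comparison uses the first word of each header (the sequence ID), so headers with descriptions still match:
bash
awk 'NR==FNR {h[$1]; next} /^>/ {p = (substr($1, 2) in h)} p' list.txt sequences.fas > output.fas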
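
The flag-based logic of the awk command can also be written in plain Python with no extra packages. The sketch below is illustrative only: the function name filter_fasta is hypothetical, the sample records are made up, and it assumes that each header's ID is its first whitespace-delimited word.

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every record whose ID is in wanted_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # A header line switches the flag on or off for the whole record
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line


# Made-up records with varying numbers of sequence lines
demo = [">seq1 first\n", "ACGT\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq10\n", "TTAA\n"]
print("".join(filter_fasta(demo, {"seq1", "seq10"})), end="")
```

Because the function only iterates over lines, it can be given an open file handle directly and never holds more than one line in memory.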

Advantages:

Simple and fast for moderately sized files.
Built-in Unix tools; no extra installation needed.

Option 2: Using `seqkit` Tool

Installation

Download seqkit from the seqkit website.

Steps

Extract sequences:
bash
seqkit grep -n -f list.txt sequences.fas > output.fas
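- -n: Match against the full header line (ID plus description) instead of the ID alone. Omit -n if list.txt contains only sequence IDs.
- -f: Read the patterns to match from list.txt.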
Handle gzipped files (seqkit can also read a .gz file directly when it is given as an argument):
bash
zcat sequences.fas.gz | seqkit grep -n -f list.txt > output.fas
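
If seqkit is not available, compressed input can be handled in a script as well. As a minimal sketch (the helper name open_fasta is hypothetical), Python's standard gzip module can be combined with a check for the two-byte gzip signature so that plain and compressed files are opened the same way:

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, whether or not it is gzip-compressed."""
    # Gzip files start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned handle can be used in a with statement and iterated line by line, exactly like a handle from the built-in open.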

Advantages:

Handles complex scenarios (e.g., gzip files).
Easy-to-use and efficient for large datasets.

Option 3: Using `seqtk` Tool

Installation

Install seqtk, for example through a package manager such as Bioconda or by building it from its source repository.

Steps

Extract sequences:
bash
seqtk subseq sequences.fas list.txt > output.fas

Advantages:

Lightweight and fast, even on large datasets.
Minimal commands required.

Option 4: Using Perl Script

Steps

Write the script: Save the following as extract_fasta.pl:
perl
#!/usr/bin/perl
use strict;
use warnings;
my $fasta_file  = "sequences.fas";
my $list_file   = "list.txt";
my $output_file = "output.fas";
# Load the wanted headers into a hash for fast lookup
open(my $list, "<", $list_file) or die "Cannot open $list_file: $!";
my %wanted = map { chomp; $_ => 1 } <$list>;
close($list);
open(my $fasta, "<", $fasta_file) or die "Cannot open $fasta_file: $!";
open(my $out, ">", $output_file) or die "Cannot open $output_file: $!";
# Print every record whose ID (the first word after >) is in the list
my $print = 0;
while (<$fasta>) {
    if (/^>(\S+)/) {
        $print = exists $wanted{$1};
    }
    print $out $_ if $print;
}
close($fasta);
close($out);
Run the script:
bash
perl extract_fasta.pl

Advantages:

Fully customizable for complex requirements.

Option 5: Using Python with Biopython

Installation

Install Biopython, for example with pip (pip install biopython).

Steps

Write the script: Save the following as extract_fasta.py:
python
from Bio import SeqIO
fasta_file = "sequences.fas"
list_file = "list.txt"
output_file = "output.fas"
# Read headers into a set
with open(list_file) as f:
    headers = set(line.strip() for line in f)
# Extract matching sequences
with open(fasta_file) as fasta, open(output_file, "w") as output:
    for record in SeqIO.parse(fasta, "fasta"):
        if record.id in headers:
            SeqIO.write(record, output, "fasta")
Run the script:
bash
python extract_fasta.py
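
Whichever option is used, it can be worth checking that every requested header was actually found, since none of the commands above report missing entries. The following is a minimal sketch in plain Python; missing_ids is a hypothetical helper name, the sample data is made up, and IDs are assumed to be the first word of each header.

```python
def missing_ids(wanted_ids, output_lines):
    """Return the requested IDs that have no record in the extracted output."""
    found = set()
    for line in output_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                found.add(fields[0])
    return sorted(set(wanted_ids) - found)


# Made-up example: seq2 was requested but is absent from the output
wanted = ["seq1", "seq2", "seq3"]
extracted = [">seq1 first\n", "ACGT\n", ">seq3\n", "TTAA\n"]
print(missing_ids(wanted, extracted))  # prints ['seq2']
```

In practice, wanted would be read from list.txt and extracted would be the open output.fas handle.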

Advantages:

Robust: Biopython's parser copes with multiline sequences and with headers that include descriptions.
Suitable for beginners familiar with Python.

Conclusion

For simplicity: Use grep or seqkit.
For large files: Use seqtk or seqkit for speed. The Biopython script also keeps memory use low, because SeqIO.parse reads one record at a time.
For custom processing: Perl or Python scripts are more flexible.

These methods provide efficient and accurate solutions to extract FASTA sequences based on a list of headers. Choose the one that best fits your needs and computational expertise!