How to extract FASTA sequences from a file using a list of headers provided in another file
December 28, 2024
Here’s a step-by-step guide to extracting FASTA sequences from a file using a list of headers provided in another file. It covers approaches using Unix commands, Perl, and Python, so you can pick one that suits your skill level and software setup.
Step-by-Step Guide
Prerequisites
- FASTA file: Ensure your FASTA file (sequences.fas) is formatted correctly, with headers starting with > and sequences on one or more lines beneath each header.
- Header list file: Create a plain text file (list.txt) with one header per line, without the > symbol.
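Before running any of the options below, it can help to confirm that both input files match this layout. The following is a minimal sketch in plain Python, not part of any of the tools discussed here. The function names are illustrative, and it assumes that a listed header may be either the full header line or just its first whitespace-delimited word (the sequence ID).

```python
def read_header_list(path):
    """Return the set of headers in the list file (one per line)."""
    wanted = set()
    with open(path) as handle:
        for line in handle:
            name = line.strip()
            if name.startswith(">"):
                name = name[1:].strip()  # tolerate a stray ">" in the list
            if name:
                wanted.add(name)
    return wanted


def find_missing_headers(fasta_path, wanted):
    """Return the listed headers that never appear in the FASTA file."""
    seen = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                seen.add(header)                 # full header line
                if header:
                    seen.add(header.split()[0])  # ID only (assumption)
    return sorted(wanted - seen)
```

Calling find_missing_headers("sequences.fas", read_header_list("list.txt")) returns a sorted list of headers that would not be matched, which is a quick way to catch typos in list.txt before extracting anything.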
Option 1: Using the grep Command (Unix/Linux)
Steps
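- Run grep on sequences.fas with the following options:
  -w: Match whole words.
  -A 1: Include one line after the matching line (use -A 2 if sequences span two lines, etc.).
  -Ff: Use fixed strings from list.txt.
- Handle multiline sequences: If sequences have varying line counts, no single -A value fits every record. One common workaround is to join each record's sequence onto a single line first, so that -A 1 is sufficient.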
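One way to make every record exactly two lines long (header plus sequence), so that a fixed -A 1 works, is to linearize the FASTA file first. The sketch below does this in plain Python. The function names are illustrative, and it assumes that any text before the first header can be ignored.

```python
def linearize_fasta(lines):
    """Yield each FASTA record as two lines: the header, then the joined sequence."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                # Emit the previous record before starting a new one.
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        # Emit the final record.
        yield header
        yield "".join(chunks)


def linearize_file(in_path, out_path):
    """Write a linearized copy of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in linearize_fasta(src):
            dst.write(line + "\n")
```

After something like linearize_file("sequences.fas", "sequences.linear.fas"), each header is followed by exactly one sequence line, so grep can be pointed at the linearized copy.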
Advantages:
- Simple and fast for moderately sized files.
- Built-in Unix tools; no extra installation needed.
Option 2: Using the seqkit Tool
Installation
Download seqkit from the seqkit website.
Steps
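- Extract sequences with seqkit grep, using these options:
  -n: Match by name (header).
  -f: Read the header list from list.txt.
- Handle gzipped files: seqkit reads gzip-compressed input directly, so the same options apply to a gzipped FASTA file.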
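If seqkit is not available, the same idea (filter by header, whether or not the input is gzipped) can be approximated in plain Python. This is not seqkit itself, only a sketch with illustrative function names. It detects gzip input by its two-byte magic number and assumes a header matches if either the full header line or its first word appears in the wanted set.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, decompressing it if it is gzip-compressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def extract_by_header(fasta_path, wanted):
    """Yield the lines of every record whose header is in `wanted`."""
    keep = False
    with open_fasta(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                first_word = header.split()[0] if header else ""
                # Match on the full header or on the ID alone (assumption).
                keep = header in wanted or first_word in wanted
            if keep:
                yield line
```

Because the file is read line by line, memory use stays small even for large inputs. The yielded lines can be written straight to an output file with writelines().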
Advantages:
- Handles complex scenarios (e.g., gzip files).
- Easy-to-use and efficient for large datasets.
Option 3: Using the seqtk Tool
Installation
Install seqtk using:
Steps
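- Extract sequences: Run seqtk subseq with the FASTA file (sequences.fas) and the header list (list.txt) as its two arguments.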
Advantages:
- Lightweight and faster for large datasets.
- Minimal commands required.
Option 4: Using Perl Script
Steps
- Write the script: Save the following as extract_fasta.pl:
- Run the script:
Advantages:
- Fully customizable for complex requirements.
Option 5: Using Python with Biopython
Installation
Install Biopython:
Steps
- Write the script: Save the following as extract_fasta.py:
- Run the script:
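As a rough illustration, extract_fasta.py could look something like the sketch below. The function names and the output filename in the usage note are assumptions, not the original script. It uses Biopython's SeqIO.parse and SeqIO.write, and treats a record as wanted if either its ID (the first word of the header) or its full header appears in list.txt.

```python
def load_wanted(list_path):
    """Read the header list into a set for fast membership checks."""
    wanted = set()
    with open(list_path) as handle:
        for line in handle:
            name = line.strip().lstrip(">").strip()
            if name:
                wanted.add(name)
    return wanted


def extract_records(fasta_path, list_path, out_path):
    """Write the records named in list_path to out_path; return the count written."""
    # Imported here so load_wanted() stays usable without Biopython installed.
    from Bio import SeqIO

    wanted = load_wanted(list_path)
    # SeqIO.parse streams records one at a time, so the whole FASTA file
    # is never held in memory. record.id is the first word of the header;
    # record.description is the full header line without ">".
    selected = (
        record
        for record in SeqIO.parse(fasta_path, "fasta")
        if record.id in wanted or record.description in wanted
    )
    return SeqIO.write(selected, out_path, "fasta")
```

With both input files in the working directory, extract_records("sequences.fas", "list.txt", "extracted.fas") writes the matching records and returns how many were found. Comparing that number with the length of list.txt is a simple sanity check.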
Advantages:
- Robust and handles edge cases.
- Suitable for beginners familiar with Python.
Conclusion
- For simplicity: Use grep or seqkit.
- For large files: Use seqtk or Biopython for better memory management.
- For custom processing: Perl or Python scripts are more flexible.
These methods provide efficient and accurate solutions to extract FASTA sequences based on a list of headers. Choose the one that best fits your needs and computational expertise!