
How to extract FASTA sequences from a file using a list of headers provided in another file
December 28, 2024
Here’s a step-by-step guide on how to extract FASTA sequences from a file using a list of headers provided in another file. It covers approaches using Unix commands, dedicated command-line tools, Perl, and Python, to suit different skill levels and software setups.
Step-by-Step Guide
Prerequisites
- FASTA file: Ensure your FASTA file (sequences.fas) is formatted correctly, with headers starting with > and sequences on one or more lines beneath each header.
- Header list file: Create a plain text file (list.txt) with one header per line, without the > symbol.
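Both prerequisites can be checked before running any of the options. The following is a minimal sketch using only the Python standard library; the function names `read_fasta_headers` and `check_header_list` are hypothetical, and it assumes that list entries are meant to equal the full header line (everything after the >).

```python
def read_fasta_headers(fasta_path):
    """Return the set of header lines in a FASTA file, without the leading '>'."""
    headers = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.add(line[1:].strip())
    return headers


def check_header_list(fasta_path, list_path):
    """Return (cleaned, missing).

    cleaned: list entries with blank lines dropped and any stray '>' removed.
    missing: cleaned entries that match no header in the FASTA file
             (assumption: an entry must equal the full header line).
    """
    known = read_fasta_headers(fasta_path)
    cleaned = []
    with open(list_path) as handle:
        for line in handle:
            name = line.strip().lstrip(">").strip()
            if name:  # skip blank lines
                cleaned.append(name)
    missing = [name for name in cleaned if name not in known]
    return cleaned, missing
```

Calling `check_header_list("sequences.fas", "list.txt")` and inspecting `missing` before extraction shows which requested headers would silently produce no output.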
Option 1: Using grep Command (Unix/Linux)
Steps
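- Run grep on sequences.fas with the following options:
  - -w: Match whole words.
  - -A 1: Include one line after each matching line (use -A 2 if sequences span two lines, etc.).
  - -Ff: Treat each line of list.txt as a fixed string (-F) and read the patterns from that file (-f).
- Handle multiline sequences: If sequences have varying line counts, no single -A value fits every record; put each sequence on a single line first, or use one of the tools below.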
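When records wrap over a varying number of lines, the selection logic can also be written so that it follows each record to the next header instead of counting lines. This is a sketch in plain Python rather than the grep-based approach of this option; `extract_records` is a hypothetical name, and matching on either the full header or its first word is an assumption about how list.txt was written.

```python
def extract_records(fasta_path, list_path, out_path):
    """Copy every record whose header is listed in list_path to out_path.

    A record is its '>' header line plus all lines up to the next header,
    so sequences wrapped over any number of lines are kept intact.
    Returns the number of records written.
    """
    with open(list_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}

    written = 0
    keep = False  # True while the current record is one we want
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                header = line[1:].strip()
                first_word = header.split(None, 1)[0] if header else ""
                # Assumption: an entry may be the full header or just the ID.
                keep = header in wanted or first_word in wanted
                if keep:
                    written += 1
            if keep:
                out.write(line)
    return written
```

The file is read line by line, so memory use depends on the size of the header list rather than on the size of the FASTA file.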
Advantages:
- Simple and fast for moderately sized files.
- Built-in Unix tools; no extra installation needed.
Option 2: Using seqkit Tool
Installation
Download seqkit from the seqkit website.
Steps
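- Extract sequences: Run seqkit grep on sequences.fas with the following options:
  - -n: Match by name (header).
  - -f: Read the header list from list.txt.
- Handle gzipped files: seqkit reads gzip-compressed input directly, so a .gz FASTA file can be passed without decompressing it first.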
Advantages:
- Handles complex scenarios (e.g., gzip files).
- Easy-to-use and efficient for large datasets.
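For comparison, the scripted options later in this guide can accept gzipped input too. The sketch below shows one way to do that with Python's standard `gzip` module; `open_fasta` and `count_records` are hypothetical helper names, and detecting compression from the file's first two bytes (rather than its extension) is a design choice of this example.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file for text reading, whether gzip-compressed or not."""
    # Check the two-byte gzip magic number instead of trusting the extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def count_records(path):
    """Count FASTA records by counting header lines."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Any line-by-line FASTA reader can call `open_fasta` in place of `open` and then treat compressed and uncompressed files the same way.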
Option 3: Using seqtk Tool
Installation
Install seqtk from its GitHub repository (lh3/seqtk) or through a package manager such as Bioconda.
Steps
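- Extract sequences: Run seqtk subseq with the FASTA file (sequences.fas) followed by the header list (list.txt), and redirect the output to a new file.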
Advantages:
- Lightweight and faster for large datasets.
- Minimal commands required.
Option 4: Using Perl Script
Steps
- Write the script: Save the script as extract_fasta.pl.
- Run the script: Run it with the perl interpreter.
Advantages:
- Fully customizable for complex requirements.
Option 5: Using Python with Biopython
Installation
Install Biopython with pip (the package name is biopython).
Steps
- Write the script: Save the script as extract_fasta.py.
- Run the script: Run it with the python interpreter.
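A script along these lines could serve as extract_fasta.py. It is a minimal sketch, not necessarily the original script: the function names and the three command-line arguments are assumptions, as is the choice to accept a list entry that matches either the record ID (the first word of the header) or the full header line. It uses Biopython's `SeqIO.parse` and `SeqIO.write`.

```python
import sys

from Bio import SeqIO


def load_wanted(list_path):
    """Read the header list into a set, ignoring blank lines."""
    with open(list_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def extract(fasta_path, list_path, out_path):
    """Write the records named in list_path to out_path; return how many."""
    wanted = load_wanted(list_path)
    # SeqIO.parse yields one record at a time, so the whole FASTA file
    # is never held in memory.
    selected = (
        record
        for record in SeqIO.parse(fasta_path, "fasta")
        # Assumption: entries may be the ID or the full header line.
        if record.id in wanted or record.description in wanted
    )
    # SeqIO.write accepts an iterator and returns the number of records written.
    return SeqIO.write(selected, out_path, "fasta")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage:
    #   python extract_fasta.py sequences.fas list.txt output.fas
    print(extract(*sys.argv[1:4]), "records written")
```

Note that `SeqIO.write` re-wraps sequence lines in its own FASTA layout, so the output may be wrapped differently from the input even though the sequences are identical.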
Advantages:
- Robust and handles edge cases.
- Suitable for beginners familiar with Python.
Conclusion
- For simplicity: Use grep or seqkit.
- For large files: Use seqtk or Biopython for better memory management.
- For custom processing: Perl or Python scripts are more flexible.
These methods provide efficient and accurate solutions to extract FASTA sequences based on a list of headers. Choose the one that best fits your needs and computational expertise!