Step-by-Step Guide: Removing Duplicate Sequences in FASTA Files
December 28, 2024
This guide will help you remove duplicate sequences from a FASTA file using various methods, including Unix commands, Perl, and Python scripts. Each approach is detailed to ensure clarity for beginners.
1. Preparation
- Install necessary tools:
  - Unix commands (no installation needed for basic tools like sed or sort).
  - Perl (pre-installed on most Unix systems).
  - Python (ensure version 3.x or later is installed).
  - Optional tools: the dedicated tools described in Section 5 (Fastx Toolkit, CD-HIT, GenomeTools).
- Create a test FASTA file:
  - Save the following as test.fasta:
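As a minimal sketch, assuming a small made-up data set, a test file with one duplicated sequence could be written as follows. The record names, the sequences, and the helper write_test_fasta are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Hypothetical test data: seq1 and seq3 carry the same sequence, so one of
# them should disappear after deduplication.
TEST_FASTA = """\
>seq1
ATGCGTACGTTAG
>seq2
GGCTTAACCGGTA
>seq3
ATGCGTACGTTAG
>seq4
TTGACCGATAGCA
"""


def write_test_fasta(path="test.fasta"):
    """Write the hypothetical records to `path` and return the record count."""
    Path(path).write_text(TEST_FASTA)
    return TEST_FASTA.count(">")


if __name__ == "__main__":
    print(write_test_fasta(), "records written to test.fasta")
```

Each sequence is kept on a single line so the line-based methods in the later sections can process the file directly.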
2. Removing Duplicates Using Unix Commands
- Transform the FASTA file for processing: combine each header and its sequence into a single line.
- Sort and remove duplicates: sort by sequence and keep only unique entries.
- Restore the FASTA format: split each line back into a header line and a sequence line.
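The three steps above can be sketched in Python to show what each stage of the pipeline produces. This is an illustration of the idea, not the shell commands themselves. The function names are hypothetical, and the sketch assumes that ">" appears only at the start of header lines.

```python
def linearize(fasta_text):
    """Step 1: turn each record into one 'header<TAB>sequence' line."""
    rows = []
    # Assumes '>' occurs only at the start of header lines.
    for record in fasta_text.split(">")[1:]:
        if not record.strip():
            continue
        header, *seq_lines = record.splitlines()
        rows.append(header + "\t" + "".join(seq_lines))
    return rows


def sort_unique_by_sequence(rows):
    """Step 2: sort by the sequence column and keep one row per sequence."""
    unique = {}
    for row in rows:
        header, seq = row.rsplit("\t", 1)
        unique.setdefault(seq, header)  # the first header seen wins
    return [f"{header}\t{seq}" for seq, header in sorted(unique.items())]


def restore(rows):
    """Step 3: split each row back into a header line and a sequence line."""
    out = []
    for row in rows:
        header, seq = row.rsplit("\t", 1)
        out.append(f">{header}\n{seq}\n")
    return "".join(out)


if __name__ == "__main__":
    text = ">a\nTT\n>b\nAA\n>c\nTT\n"
    print(restore(sort_unique_by_sequence(linearize(text))), end="")
```

Because the rows are sorted by sequence, the output order generally differs from the input order. That is the usual trade-off of a sort-based approach.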
3. Removing Duplicates Using Perl Script
- Save the following Perl script as remove_duplicates.pl:
- Run the script:
4. Removing Duplicates Using Python Script
- Save the following Python script as remove_duplicates.py:
- Run the script:
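A minimal sketch of what such a remove_duplicates.py could look like is given below. It makes several assumptions: the input and output paths are passed as two command-line arguments, the first record carrying a given sequence is kept, and sequences are compared case-insensitively. The function names are illustrative.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(records):
    """Yield only the first record seen for each sequence, in input order."""
    seen = set()
    for header, seq in records:
        key = seq.upper()  # assumption: 'acgt' and 'ACGT' count as duplicates
        if key not in seen:
            seen.add(key)
            yield header, seq


def main(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in unique_records(read_fasta(src)):
            dst.write(f"{header}\n{seq}\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Under these assumptions the script would be invoked as python remove_duplicates.py test.fasta unique.fasta. Only the set of distinct sequences is held in memory, and the records are written in their original order.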
5. Using Dedicated Tools
- Fastx Toolkit:
  - Install via package manager (e.g., apt-get install fastx-toolkit on Ubuntu).
  - Run:
- CD-HIT:
  - Install from the CD-HIT website.
  - Run:
- GenomeTools:
  - Install from the GenomeTools website.
  - Run:
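As a rough sketch, the three tools could be driven from Python as shown below. The helper names are hypothetical. The command-line flags are assumptions based on each tool's commonly documented usage (fastx_collapser with -i and -o, cd-hit with -c 1.0 for 100% identity, gt sequniq with -o), so check them against each tool's help output before relying on them.

```python
import shutil
import subprocess


def build_commands(infile, outfile):
    """Return candidate argument lists per tool (flags are assumptions)."""
    return {
        # Fastx Toolkit: collapse identical sequences into single records.
        "fastx_collapser": ["fastx_collapser", "-i", infile, "-o", outfile],
        # CD-HIT: cluster at an identity threshold of 1.0 (100%).
        "cd-hit": ["cd-hit", "-i", infile, "-o", outfile, "-c", "1.0"],
        # GenomeTools: write unique sequences to the file given by -o.
        "gt": ["gt", "sequniq", "-o", outfile, infile],
    }


def run_first_available(infile, outfile):
    """Run the first tool found on PATH; return its name, or None."""
    for tool, argv in build_commands(infile, outfile).items():
        if shutil.which(tool):
            subprocess.run(argv, check=True)
            return tool
    return None


if __name__ == "__main__":
    for argv in build_commands("test.fasta", "unique.fasta").values():
        print(" ".join(argv))
```

The tools are not interchangeable. cd-hit targets protein sequences, with cd-hit-est as the nucleotide counterpart, and fastx_collapser renames the headers of the collapsed records, so pick the tool that fits your data.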
6. Notes
- Ensure the sequences are not multiline. Use awk or sed to preprocess if required.
- Choose tools based on your sequence type (nucleotide or protein) and file size.
- For very large files, use tools optimized for high performance, like CD-HIT or gt sequniq.
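The first note, unwrapping multiline sequences, can also be handled in Python when awk or sed is inconvenient. The generator below is a minimal sketch with a hypothetical name, unwrap_fasta. It joins the wrapped lines of each record into one sequence line.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record's sequence.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line.strip():
            seq_parts.append(line.strip())
    if seq_parts:
        yield "".join(seq_parts)


if __name__ == "__main__":
    wrapped = [">a\n", "AC\n", "GT\n", ">b\n", "TT\n"]
    print("\n".join(unwrap_fasta(wrapped)))
```

Since it works line by line, the same function can be applied to an open file handle without loading the whole file into memory.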
7. Example Output
For the input file test.fasta, the output unique.fasta will contain:
With these methods, beginners can remove duplicate sequences using whichever approach best fits their tools and data.