
Step-by-Step Guide: Removing Duplicate Sequences in FASTA Files

December 28, 2024

This guide shows how to remove duplicate sequences from a FASTA file using several methods: Unix command-line tools, a Perl script, a Python script, and dedicated deduplication tools. Each approach is explained step by step for beginners.


1. Preparation

  1. Install necessary tools:
    • Unix commands (no installation needed for standard tools such as sed, sort, tr, and awk).
    • Perl (pre-installed on most Unix systems; the script in Section 3 also requires the BioPerl module Bio::SeqIO, installable via CPAN, e.g., cpan Bio::SeqIO).
    • Python (ensure Python 3 is installed).
    • Optional dedicated tools: Fastx Toolkit, CD-HIT, or GenomeTools (see Section 5).
  2. Create a test FASTA file:
    • Save the following as test.fasta (note that seq1 and seq2 share the same sequence; a script that generates the file is sketched after this list):
      fasta
      >seq1
      ATGCATGCATGC
      >seq2
      ATGCATGCATGC
      >seq3
      GGCCTTAAGGCC
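
If you prefer to create the test file from a script rather than by hand, here is a minimal Python sketch (the records are the same three shown above):

python
# Write three test records; seq1 and seq2 deliberately share the same sequence.
records = [("seq1", "ATGCATGCATGC"),
           ("seq2", "ATGCATGCATGC"),
           ("seq3", "GGCCTTAAGGCC")]

with open("test.fasta", "w") as handle:
    for name, sequence in records:
        handle.write(f">{name}\n{sequence}\n")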

2. Removing Duplicates Using Unix Commands

  1. Transform FASTA file for processing:
    • Combine each header and its sequence into a single tab-separated line (# and @ are temporary markers that are converted to newlines and tabs):
      bash
      sed -e '/^>/s/$/@/' -e 's/^>/#/' test.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" > temp.tsv
  2. Sort and remove duplicates:
    • Sort by the sequence column and keep only one entry per unique sequence (note that sort reorders the records):
      bash
      sort -u -t $'\t' -k 2,2 temp.tsv > sorted.tsv
  3. Restore the FASTA format:
    • Convert back to FASTA format (the NF==2 test skips the empty leading line produced by step 1):
      bash
      awk -F'\t' 'NF==2 {print ">"$1"\n"$2}' sorted.tsv > unique.fasta
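
To confirm that duplicates were actually removed, you can compare the number of records (header lines) in the input and output files. Here is a quick Python check, which works for the other methods below as well (file names assume the example above):

python
# Count FASTA records by counting header lines.
def count_records(path):
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

print(count_records("test.fasta"), "records before,",
      count_records("unique.fasta"), "records after")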

3. Removing Duplicates Using Perl Script

  1. Save the following Perl script as remove_duplicates.pl (it requires the BioPerl module Bio::SeqIO):
    perl
    use strict;
    use warnings;
    use Bio::SeqIO;

    # Read test.fasta and keep only the first record seen for each sequence.
    my $file   = "test.fasta";
    my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
    my $outseq = Bio::SeqIO->new(-file => ">unique.fasta", -format => "fasta");
    my %unique;

    while (my $seq = $seqio->next_seq) {
        my $sequence = $seq->seq;
        unless (exists $unique{$sequence}) {
            $unique{$sequence} = 1;       # remember this sequence
            $outseq->write_seq($seq);     # write the record that introduced it
        }
    }

  2. Run the script:
    bash
    perl remove_duplicates.pl

4. Removing Duplicates Using Python Script

  1. Save the following Python script as remove_duplicates.py:
    python
    from itertools import groupby

    def is_header(line):
        return line.startswith(">")

    def remove_duplicates(input_file, output_file):
        sequences = set()
        with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
            header = None
            # groupby yields alternating runs of header lines and sequence lines,
            # so wrapped (multi-line) sequences are joined automatically.
            for is_header_line, group in groupby(infile, key=is_header):
                if is_header_line:
                    header = next(group).strip()
                else:
                    sequence = "".join(line.strip() for line in group)
                    if sequence not in sequences:      # first occurrence of this sequence
                        sequences.add(sequence)
                        outfile.write(f"{header}\n{sequence}\n")

    remove_duplicates("test.fasta", "unique.fasta")

  2. Run the script:
    bash
    python3 remove_duplicates.py
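
If Biopython is installed (pip install biopython), the same deduplication can be written with Bio.SeqIO, which also handles wrapped and protein FASTA files. Here is a minimal sketch, assuming the same file names as above (the function name is illustrative):

python
from Bio import SeqIO  # requires Biopython

def remove_duplicates_biopython(input_file, output_file):
    seen = set()
    unique_records = []
    for record in SeqIO.parse(input_file, "fasta"):
        sequence = str(record.seq)
        if sequence not in seen:          # keep only the first record per sequence
            seen.add(sequence)
            unique_records.append(record)
    SeqIO.write(unique_records, output_file, "fasta")

remove_duplicates_biopython("test.fasta", "unique.fasta")

Note that SeqIO.write wraps output sequences at 60 characters per line, so the result is multi-line FASTA.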

5. Using Dedicated Tools

  1. Fastx Toolkit:
    • Install via package manager (e.g., apt-get install fastx-toolkit on Ubuntu).
    • Run (note that fastx_collapser replaces the original headers with identifiers that encode the number of collapsed copies):
      bash
      fastx_collapser -i test.fasta -o unique.fasta
  2. CD-HIT:
    • Install from the CD-HIT website.
    • Run (cd-hit-est is for nucleotide sequences; use cd-hit for proteins; -c 1.0 sets the identity threshold to 100%):
      bash
      cd-hit-est -i test.fasta -o unique.fasta -c 1.0
  3. GenomeTools:
    • Install the gt suite (e.g., apt-get install genometools on Ubuntu).
    • Run the gt sequniq tool on the FASTA file to filter out repeated sequences (see the GenomeTools documentation for the exact options).

6. Notes

  • If your sequences are wrapped across multiple lines, preprocess the file first with awk or sed, or with the Python sketch after this list.
  • Choose tools based on your sequence type (nucleotide or protein) and file size.
  • For very large files, use tools optimized for high performance, like CD-HIT or gt sequniq.
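
If your FASTA file wraps sequences across several lines, a small preprocessing step can flatten each record to one header line and one sequence line before applying the line-oriented methods above. Here is a minimal Python sketch (file names are placeholders):

python
# Join wrapped sequence lines so each record occupies exactly two lines.
def linearize_fasta(input_file, output_file):
    with open(input_file) as infile, open(output_file, "w") as outfile:
        header, parts = None, []
        for line in infile:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    outfile.write(f"{header}\n{''.join(parts)}\n")
                header, parts = line, []
            elif line:
                parts.append(line)
        if header is not None:            # write the final record
            outfile.write(f"{header}\n{''.join(parts)}\n")

linearize_fasta("wrapped.fasta", "flat.fasta")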

7. Example Output

For the input file test.fasta, the output unique.fasta will contain the following (the Unix approach may list records in a different order for other inputs, since sort reorders them):

fasta
>seq1
ATGCATGCATGC
>seq3
GGCCTTAAGGCC

With these approaches, you can remove duplicate sequences from FASTA files using whichever method best suits your data, file size, and environment.
