
Step-by-Step Guide: Removing Duplicate Sequences in FASTA Files

December 28, 2024

This guide shows how to remove duplicate sequences from a FASTA file using several methods: Unix command-line tools, a Perl script, a Python script, and dedicated deduplication tools. Each approach is explained step by step for beginners.


1. Preparation

  1. Install necessary tools:
    • Unix commands (no installation needed for standard tools such as sed, sort, tr, and awk).
    • Perl (pre-installed on most Unix systems; the script in Section 3 also requires the BioPerl module Bio::SeqIO, installable via CPAN, e.g., cpan Bio::SeqIO).
    • Python (ensure Python 3 is installed).
    • Optional dedicated tools: Fastx Toolkit, CD-HIT, or GenomeTools (see Section 5).
  2. Create a test FASTA file:
    • Save the following as test.fasta (note that seq1 and seq2 share the same sequence; a script that generates the file is sketched after this list):
      fasta
      >seq1
      ATGCATGCATGC
      >seq2
      ATGCATGCATGC
      >seq3
      GGCCTTAAGGCC
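
If you prefer to create the test file from a script rather than by hand, here is a minimal Python sketch (the records are the same three shown above):

python
# Write three test records; seq1 and seq2 deliberately share the same sequence.
records = [("seq1", "ATGCATGCATGC"),
           ("seq2", "ATGCATGCATGC"),
           ("seq3", "GGCCTTAAGGCC")]

with open("test.fasta", "w") as handle:
    for name, sequence in records:
        handle.write(f">{name}\n{sequence}\n")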

2. Removing Duplicates Using Unix Commands

  1. Transform FASTA file for processing:
    • Combine each header and its sequence into a single tab-separated line (# and @ are temporary markers that are converted to newlines and tabs):
      bash
      sed -e '/^>/s/$/@/' -e 's/^>/#/' test.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" > temp.tsv
  2. Sort and remove duplicates:
    • Sort by the sequence column and keep only one entry per unique sequence (note that sort reorders the records):
      bash
      sort -u -t $'\t' -k 2,2 temp.tsv > sorted.tsv
  3. Restore the FASTA format:
    • Convert back to FASTA format (the NF==2 test skips the empty leading line produced by step 1):
      bash
      awk -F'\t' 'NF==2 {print ">"$1"\n"$2}' sorted.tsv > unique.fasta
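
To confirm that duplicates were actually removed, you can compare the number of records (header lines) in the input and output files. Here is a quick Python check, which works for the other methods below as well (file names assume the example above):

python
# Count FASTA records by counting header lines.
def count_records(path):
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

print(count_records("test.fasta"), "records before,",
      count_records("unique.fasta"), "records after")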

3. Removing Duplicates Using Perl Script

  1. Save the following Perl script as remove_duplicates.pl (it requires the BioPerl module Bio::SeqIO):
    perl
    use strict;
    use warnings;
    use Bio::SeqIO;

    # Read test.fasta and keep only the first record seen for each sequence.
    my $file   = "test.fasta";
    my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
    my $outseq = Bio::SeqIO->new(-file => ">unique.fasta", -format => "fasta");
    my %unique;

    while (my $seq = $seqio->next_seq) {
        my $sequence = $seq->seq;
        unless (exists $unique{$sequence}) {
            $unique{$sequence} = 1;       # remember this sequence
            $outseq->write_seq($seq);     # write the record that introduced it
        }
    }

  2. Run the script:
    bash
    perl remove_duplicates.pl

4. Removing Duplicates Using Python Script

  1. Save the following Python script as remove_duplicates.py:
    python
    from itertools import groupby

    def is_header(line):
        return line.startswith(">")

    def remove_duplicates(input_file, output_file):
        sequences = set()
        with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
            header = None
            # groupby yields alternating runs of header lines and sequence lines,
            # so wrapped (multi-line) sequences are joined automatically.
            for is_header_line, group in groupby(infile, key=is_header):
                if is_header_line:
                    header = next(group).strip()
                else:
                    sequence = "".join(line.strip() for line in group)
                    if sequence not in sequences:      # first occurrence of this sequence
                        sequences.add(sequence)
                        outfile.write(f"{header}\n{sequence}\n")

    remove_duplicates("test.fasta", "unique.fasta")

  2. Run the script:
    bash
    python3 remove_duplicates.py
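
If Biopython is installed (pip install biopython), the same deduplication can be written with Bio.SeqIO, which also handles wrapped and protein FASTA files. Here is a minimal sketch, assuming the same file names as above (the function name is illustrative):

python
from Bio import SeqIO  # requires Biopython

def remove_duplicates_biopython(input_file, output_file):
    seen = set()
    unique_records = []
    for record in SeqIO.parse(input_file, "fasta"):
        sequence = str(record.seq)
        if sequence not in seen:          # keep only the first record per sequence
            seen.add(sequence)
            unique_records.append(record)
    SeqIO.write(unique_records, output_file, "fasta")

remove_duplicates_biopython("test.fasta", "unique.fasta")

Note that SeqIO.write wraps output sequences at 60 characters per line, so the result is multi-line FASTA.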

5. Using Dedicated Tools

  1. Fastx Toolkit:
    • Install via package manager (e.g., apt-get install fastx-toolkit on Ubuntu).
    • Run (note that fastx_collapser replaces the original headers with identifiers that encode the number of collapsed copies):
      bash
      fastx_collapser -i test.fasta -o unique.fasta
  2. CD-HIT:
    • Install from the CD-HIT website.
    • Run (cd-hit-est is for nucleotide sequences; use cd-hit for proteins; -c 1.0 sets the identity threshold to 100%):
      bash
      cd-hit-est -i test.fasta -o unique.fasta -c 1.0
  3. GenomeTools:
    • Install the gt suite (e.g., apt-get install genometools on Ubuntu).
    • Run the gt sequniq tool on the FASTA file to filter out repeated sequences (see the GenomeTools documentation for the exact options).

6. Notes

  • If your sequences are wrapped across multiple lines, preprocess the file first with awk or sed, or with the Python sketch after this list.
  • Choose tools based on your sequence type (nucleotide or protein) and file size.
  • For very large files, use tools optimized for high performance, like CD-HIT or gt sequniq.
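
If your FASTA file wraps sequences across several lines, a small preprocessing step can flatten each record to one header line and one sequence line before applying the line-oriented methods above. Here is a minimal Python sketch (file names are placeholders):

python
# Join wrapped sequence lines so each record occupies exactly two lines.
def linearize_fasta(input_file, output_file):
    with open(input_file) as infile, open(output_file, "w") as outfile:
        header, parts = None, []
        for line in infile:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    outfile.write(f"{header}\n{''.join(parts)}\n")
                header, parts = line, []
            elif line:
                parts.append(line)
        if header is not None:            # write the final record
            outfile.write(f"{header}\n{''.join(parts)}\n")

linearize_fasta("wrapped.fasta", "flat.fasta")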

7. Example Output

For the input file test.fasta, the output unique.fasta will contain the following (the Unix approach may list records in a different order for other inputs, since sort reorders them):

fasta
>seq1
ATGCATGCATGC
>seq3
GGCCTTAAGGCC

With these approaches, you can remove duplicate sequences from FASTA files using whichever method best suits your data, file size, and environment.
