
How to extract FASTA sequences from a file using a list of headers provided in another file

December 28, 2024, by admin

Here’s a step-by-step guide to extracting FASTA sequences from a file using a list of headers provided in another file. It covers approaches based on standard Unix commands, the seqkit and seqtk tools, Perl, and Python, so you can pick the one that matches your skill level and software setup.


Step-by-Step Guide

Prerequisites

  1. FASTA file: Ensure your FASTA file (sequences.fas) is formatted correctly, with headers starting with > and sequences in one or more lines beneath each header.
  2. Header list file: Create a plain text file (list.txt) with one header per line, without the > symbol.
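To make the later commands concrete, the snippet below writes a small example pair of input files. The sequence names and sequences are made up purely for illustration:

```shell
# Create a small example FASTA file (hypothetical sequences)
cat > sequences.fas <<'EOF'
>seq1
ATGCGTACGTTAGC
>seq2
GGGCATTACGATCG
>seq3
TTTACGGATCGATC
EOF

# Create a header list naming the sequences we want (no ">")
cat > list.txt <<'EOF'
seq1
seq3
EOF
```

With these two files in place, every option below should produce an output.fas containing the seq1 and seq3 records.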

Option 1: Using grep Command (Unix/Linux)

Steps

  1. Run the following command:
    bash
    grep -w -A 1 -Ff list.txt sequences.fas --no-group-separator > output.fas
    • -w: Match whole words only.
    • -A 1: Print one line after each matching header (use -A 2 if every sequence spans exactly two lines, and so on).
    • -F: Treat the patterns as fixed strings rather than regular expressions.
    • -f: Read the patterns from list.txt.
    • --no-group-separator: Suppress the -- lines grep prints between groups of matches (this option requires GNU grep).
  2. Handle multiline sequences: If sequences span a varying number of lines, use awk instead, which prints every line of a record while its header is in the list:
    bash
    awk 'NR==FNR {h[$1]; next} /^>/ {p = substr($0, 2) in h} p' list.txt sequences.fas > output.fas
    • The first block loads list.txt into the array h; at each FASTA header, p is set to whether the header text (minus the >) is in h, and subsequent lines are printed while p is true. Note that the comparison is against the full header line, so entries in list.txt must match the headers exactly.
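To see the awk approach in action, the snippet below builds a tiny multiline FASTA and a one-entry header list (all names hypothetical) and runs the same command on them:

```shell
# Hypothetical test data: seqB's sequence spans two lines
cat > demo_seqs.fas <<'EOF'
>seqA
ATGCGT
>seqB
GGGCAT
TACGAT
>seqC
TTTACG
EOF

printf 'seqB\n' > demo_list.txt

# Same awk command as above: print every line of a record
# while its header is in the list
awk 'NR==FNR {h[$1]; next} /^>/ {p = substr($0, 2) in h} p' \
    demo_list.txt demo_seqs.fas > demo_out.fas

cat demo_out.fas
# prints the seqB record only (header plus both sequence lines)
```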

Advantages:

  • Simple and fast for moderately sized files.
  • Built-in Unix tools; no extra installation needed.
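One caveat with the commands above: they compare list entries against the full header line. If your FASTA headers carry descriptions (e.g. >seq1 some description) but list.txt contains only the IDs, a small change to the awk command, comparing only the first word of the header, handles that case. A sketch with hypothetical file names:

```shell
# Hypothetical FASTA whose headers include descriptions
cat > desc_seqs.fas <<'EOF'
>seq1 putative kinase
ATGCGT
>seq2 hypothetical protein
GGGCAT
EOF

printf 'seq2\n' > desc_list.txt

# Compare only the first whitespace-delimited token of the header;
# substr($1, 2) strips the leading ">" from that token
awk 'NR==FNR {h[$1]; next} /^>/ {p = substr($1, 2) in h} p' \
    desc_list.txt desc_seqs.fas > desc_out.fas

cat desc_out.fas
# prints the seq2 record, description and all
```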

Option 2: Using seqkit Tool

Installation

Download a precompiled binary from the seqkit website, or install it via Bioconda (conda install -c bioconda seqkit).

Steps

  1. Extract sequences:
    bash
    seqkit grep -n -f list.txt sequences.fas > output.fas
    • -n: Match by full name, i.e., the entire header line rather than just the sequence ID.
    • -f: Read the header list from list.txt.
  2. Handle gzipped files:
    bash
    zcat sequences.fas.gz | seqkit grep -n -f list.txt > output.fas

Advantages:

  • Handles complex scenarios (e.g., gzipped files).
  • Easy to use and efficient for large datasets.

Option 3: Using seqtk Tool

Installation

Install seqtk on Debian/Ubuntu with:

bash
sudo apt install seqtk

Steps

  1. Extract sequences:
    bash
    seqtk subseq sequences.fas list.txt > output.fas

Advantages:

  • Lightweight and fast, even on large datasets.
  • Requires only a single command.

Option 4: Using Perl Script

Steps

  1. Write the script: Save the following as extract_fasta.pl:
    perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $fasta_file  = "sequences.fas";
    my $list_file   = "list.txt";
    my $output_file = "output.fas";

    # Build a lookup of wanted headers
    open(my $list, "<", $list_file) or die "Cannot open $list_file: $!";
    my %wanted = map { chomp; $_ => 1 } <$list>;
    close($list);

    open(my $fasta, "<", $fasta_file) or die "Cannot open $fasta_file: $!";
    open(my $out, ">", $output_file) or die "Cannot open $output_file: $!";

    my $print = 0;
    while (<$fasta>) {
        if (/^>(\S+)/) {
            # (\S+) captures only the first token of the header, so
            # entries in list.txt should be sequence IDs without descriptions
            $print = exists $wanted{$1};
        }
        print $out $_ if $print;
    }
    close($fasta);
    close($out);

  2. Run the script:
    bash
    perl extract_fasta.pl

Advantages:

  • Fully customizable for complex requirements.

Option 5: Using Python with Biopython

Installation

Install Biopython:

bash
pip install biopython

Steps

  1. Write the script: Save the following as extract_fasta.py:
    python
    from Bio import SeqIO

    fasta_file = "sequences.fas"
    list_file = "list.txt"
    output_file = "output.fas"

    # Read headers into a set
    with open(list_file) as f:
        headers = set(line.strip() for line in f)

    # Extract matching sequences; note that record.id is the first
    # whitespace-delimited token of the header line
    with open(fasta_file) as fasta, open(output_file, "w") as output:
        for record in SeqIO.parse(fasta, "fasta"):
            if record.id in headers:
                SeqIO.write(record, output, "fasta")

  2. Run the script:
    bash
    python extract_fasta.py

Advantages:

  • Robust parsing that correctly handles wrapped (multiline) sequences and other edge cases.
  • Easy to read and extend for users familiar with Python.

Conclusion

  • For simplicity: Use grep or seqkit.
  • For large files: Use seqtk or Biopython for better memory management.
  • For custom processing: Perl or Python scripts are more flexible.

These methods provide efficient and accurate solutions to extract FASTA sequences based on a list of headers. Choose the one that best fits your needs and computational expertise!
