
Step-by-step manual to extract user-defined regions from a FASTA file

December 27, 2024, by admin

This guide shows how to extract user-defined regions from a FASTA file using pyfaidx, UNIX commands (AWK), Perl, and samtools. It is aimed at beginners and works in any UNIX-like environment, such as Linux or WSL (Windows Subsystem for Linux) on Windows.


Step 1: Understanding Input Files

You will need:

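  1. FASTA File: A multi-FASTA file (input.fasta in the examples). The sequence name is the first word of each header line, here lyrata:
    >lyrata FEN1
    MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDAS...
  2. Regions File: One region per line as name, start, and end, separated by whitespace (regions.txt in the examples). Coordinates are 1-based and inclusive:
    lyrata 1 108
    lyrata 147 234
    Arabidopsis 1 108
    Brachypodium 1 93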
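
Before running any of the methods below, it can help to check that the regions file is well formed. The following is a minimal sketch in Python, assuming the whitespace-separated name/start/end layout shown above with 1-based, inclusive coordinates; the function name parse_regions is hypothetical and not part of any library.

```python
def parse_regions(lines):
    """Parse 'name start end' lines (1-based, inclusive) into tuples.

    Hypothetical helper for illustration; raises ValueError on bad lines.
    """
    regions = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 1 or end < start:
            raise ValueError(f"line {line_number}: invalid range {start}-{end}")
        regions.append((name, start, end))
    return regions


# The example regions from Step 1, held in memory instead of a file.
example = [
    "lyrata 1 108",
    "lyrata 147 234",
    "Arabidopsis 1 108",
    "Brachypodium 1 93",
]
for name, start, end in parse_regions(example):
    # An inclusive range covers end - start + 1 positions.
    print(name, start, end, end - start + 1)
```

To check a real file, pass an open file handle instead of the list, for example parse_regions(open("regions.txt")).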

Step 2: Install Required Tools

Ensure you have these tools installed:

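  • Python with pyfaidx: For a simple solution.
  • AWK: For a quick command-line solution; it ships with most UNIX-like systems.
  • Perl: For custom scripting.
  • samtools: For indexed extraction (Step 6); it is available from common bioinformatics package repositories.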

Step 3: Using Python with pyfaidx

If you prefer Python, follow these steps:

  1. Install pyfaidx:
    bash
    pip install pyfaidx
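  2. Prepare a BED file: Convert your regions into BED format (tab-delimited). BED coordinates are 0-based and half-open, so subtract 1 from each start and keep the end as it is:
    lyrata 0 108
    lyrata 146 234
    Arabidopsis 0 108
    Brachypodium 0 93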
  3. Run the extraction:
    bash
    faidx --bed regions.bed input.fasta
  4. Save the Output: Redirect the output to a new file:
    bash
    faidx --bed regions.bed input.fasta > extracted_regions.fasta
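
The conversion from the regions file to BED can be scripted instead of done by hand. The sketch below assumes the 1-based, inclusive regions format from Step 1; the function name regions_to_bed is hypothetical.

```python
def regions_to_bed(lines):
    """Convert 'name start end' lines (1-based, inclusive) to BED lines.

    BED is 0-based and half-open: the start drops by 1, the end is unchanged.
    """
    bed_lines = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        bed_lines.append(f"{name}\t{start - 1}\t{end}")
    return bed_lines


regions = ["lyrata 1 108", "lyrata 147 234", "Arabidopsis 1 108"]
for bed_line in regions_to_bed(regions):
    print(bed_line)
```

Writing the returned lines to regions.bed, one per line, gives a file that can be passed to the faidx command shown above.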

Step 4: Using UNIX Commands (AWK)

AWK can extract sequences from the FASTA file based on region specifications:

  1. Prepare the Script: Save this AWK script as extract_regions.awk:
    awk
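    # Read the FASTA file: store each sequence under the first word of its header
    /^>/ { seq_name = substr($1, 2); next }
    { seq[seq_name] = seq[seq_name] $0 }
    END {
        # Read regions.txt: name, start, end (1-based, inclusive), whitespace-separated
        while ((getline line < "regions.txt") > 0) {
            if (split(line, region, " ") < 3) continue
            name = region[1]
            start = region[2]
            stop = region[3]
            if (!(name in seq)) {
                print "No sequence named " name > "/dev/stderr"
                continue
            }
            print ">" name ":" start "-" stop
            print substr(seq[name], start, stop - start + 1)
        }
        close("regions.txt")
    }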
  2. Run the Script:
    bash
    awk -f extract_regions.awk input.fasta > extracted_regions.fasta

Step 5: Using Perl

Perl can handle batch queries efficiently.

  1. Prepare the Script: Save this as extract_regions.pl:
    perl
    use strict;
    use warnings;

    my $fasta_file = 'input.fasta';
    my $region_file = 'regions.txt';
    my %sequences;

    # Read the FASTA file, keyed by the first word of each header
    open my $FASTA, '<', $fasta_file or die "Cannot open $fasta_file: $!";
    my $header;
    while (<$FASTA>) {
        chomp;
        if (/^>(\S+)/) {
            $header = $1;
        } elsif (defined $header) {
            $sequences{$header} .= $_;
        }
    }
    close $FASTA;

    # Read the regions (1-based, inclusive) and extract
    open my $REGIONS, '<', $region_file or die "Cannot open $region_file: $!";
    while (<$REGIONS>) {
        chomp;
        next unless /\S/;    # skip blank lines
        my ($seq_name, $start, $end) = split ' ';
        unless (exists $sequences{$seq_name}) {
            warn "No sequence named $seq_name\n";
            next;
        }
        my $extracted = substr($sequences{$seq_name}, $start-1, $end-$start+1);
        print ">$seq_name:$start-$end\n$extracted\n";
    }
    close $REGIONS;

  2. Run the Script:
    bash
    perl extract_regions.pl > extracted_regions.fasta
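
For readers who prefer Python but do not want to install pyfaidx, the same load-then-slice logic can be sketched with the standard library only. This is an illustrative alternative, not a replacement for the scripts above; it assumes sequences are keyed by the first word of each header and that coordinates are 1-based and inclusive. The function names read_fasta and extract_region are hypothetical.

```python
def read_fasta(lines):
    """Return {name: sequence}, keyed by the first word of each header."""
    parts_by_name = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def extract_region(sequences, name, start, end):
    """Slice a 1-based, inclusive region out of the named sequence."""
    # Python slices are 0-based and half-open, hence start - 1.
    return sequences[name][start - 1:end]


# A tiny in-memory FASTA record with the sequence split over two lines.
fasta = [">lyrata FEN1", "MGIKGLTKLL", "ADNAPSCMKE"]
sequences = read_fasta(fasta)
print(">lyrata:3-12")
print(extract_region(sequences, "lyrata", 3, 12))  # prints IKGLTKLLAD
```

A region that runs past the end of a sequence is silently truncated by the slice, which is one reason to validate output lengths afterwards.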

Step 6: Using samtools

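samtools is mostly used with nucleotide sequences, but samtools faidx looks sequences up by name and position, so the same commands can also be applied to a protein FASTA file: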

  1. Index the FASTA:
    bash
    samtools faidx input.fasta
  2. Query Regions:
    bash
    samtools faidx input.fasta lyrata:1-108 > extracted_regions.fasta
  3. Batch Extraction: Use a loop for multiple regions. The output file is written once, after the loop, so rerunning the command does not append duplicate records:
    bash
    while read -r name start end; do
        [ -n "$name" ] || continue    # skip blank lines
        samtools faidx input.fasta "${name}:${start}-${end}"
    done < regions.txt > extracted_regions.fasta
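
The region strings that samtools expects (name:start-end, 1-based and inclusive) can also be generated ahead of time. The sketch below is a hypothetical helper, assuming the regions format from Step 1. The resulting lines could be saved to a file; recent samtools releases document a --region-file option for faidx that reads such a file, but check the help output of your installed version before relying on it.

```python
def to_samtools_regions(lines):
    """Turn 'name start end' lines into 'name:start-end' region strings."""
    region_strings = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, start, end = fields
        region_strings.append(f"{name}:{int(start)}-{int(end)}")
    return region_strings


regions = ["lyrata 1 108", "lyrata 147 234", "Brachypodium 1 93"]
for region in to_samtools_regions(regions):
    print(region)
```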

Step 7: Handling Protein Sequences

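If working with protein sequences:

  • The AWK and Perl scripts treat each sequence as plain text, so protein sequences need no special handling; start and end then count residues instead of bases.
  • Make sure each name in the regions file matches the sequence identifier that the script or tool uses. For headers that pack several fields into one identifier (for example, separated by |), pyfaidx can split names on a custom delimiter with the --delimiter option of the faidx command.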

Step 8: Final Validation

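Ensure extracted regions are correct by:

  • Manually checking a few extracted FASTA entries against the original sequences.
  • Confirming that each extracted sequence has the expected length (end - start + 1).
  • Using a tool such as seqkit (for example, seqkit stats) to summarize the number and lengths of the extracted sequences.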
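
A length check is easy to automate. The sketch below assumes the output headers have the form >name:start-end, as written by the AWK and Perl scripts in this guide, and reports every record whose sequence length differs from end - start + 1. The function name find_length_mismatches is hypothetical.

```python
import re

# Matches headers such as ">lyrata:147-234".
HEADER = re.compile(r"^>(\S+):(\d+)-(\d+)")


def find_length_mismatches(lines):
    """Return (label, expected, observed) for records with the wrong length."""
    records = []  # each entry: [label, expected_length, observed_length]
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unexpected header: {line}")
            start, end = int(match.group(2)), int(match.group(3))
            records.append([line[1:], end - start + 1, 0])
        elif line and records:
            records[-1][2] += len(line)
    return [
        (label, expected, observed)
        for label, expected, observed in records
        if expected != observed
    ]


extracted = [">lyrata:1-5", "MGIKG", ">lyrata:3-12", "IKGLTK"]
for label, expected, observed in find_length_mismatches(extracted):
    print(f"{label}: expected {expected} residues, found {observed}")
```

A mismatch usually means the region extends past the end of the sequence or the sequence name was not found.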

This guide should help you extract user-defined regions from FASTA files using accessible and flexible tools. Adjust steps for your specific environment and needs.
