Step-by-step manual to extract user-defined regions from a FASTA file

December 27, 2024 By admin

This guide walks through extracting user-defined regions from a FASTA file with Python (pyfaidx), AWK, Perl, and samtools, at a level suitable for beginners. The commands work in a UNIX-like environment such as Linux or WSL (Windows Subsystem for Linux), or on Windows with cross-platform tools.

Step 1: Understanding Input Files

You will need:

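FASTA File: Contains multi-FASTA sequences. Each record is a header line starting with ">" followed by one or more sequence lines.
shell
>lyrata FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDAS...
Regions File: Specifies the regions to extract, one per line, as sequence name, start, and end. Coordinates are 1-based and inclusive.
lyrata 1 108
lyrata 147 234
Arabidopsis 1 108
Brachypodium 1 93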
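
To make the two input formats concrete, here is a minimal Python sketch that reads both into memory. The function names `read_fasta` and `read_regions` are illustrative and not part of any library. The sketch assumes that the sequence name is the first whitespace-delimited word of each header and that region coordinates are 1-based and inclusive. The demo data is made up.

```python
import io

def read_fasta(handle):
    """Return {name: sequence}, keyed by the first word of each header."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            # Sequences may be wrapped over several lines; collect them all
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}

def read_regions(handle):
    """Return a list of (name, start, end) tuples; coordinates stay 1-based."""
    regions = []
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions

# Tiny made-up inputs for illustration only
fasta_text = ">seqA demo protein\nMGIKGLTK\nLLADNAPS\n>seqB\nACDEFGHIK\n"
regions_text = "seqA 1 4\nseqA 9 12\nseqB 2 5\n"

sequences = read_fasta(io.StringIO(fasta_text))
regions = read_regions(io.StringIO(regions_text))
print(sequences)
print(regions)
```

With real files, pass `open("input.fasta")` and `open("regions.txt")` instead of the `io.StringIO` objects.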

Step 2: Install Required Tools

Ensure you have these tools installed:

Python with pyfaidx: For a simple solution.
Perl: For custom scripting.
FASTA Utilities: Such as samtools or seqkit, available from bioinformatics package repositories such as Bioconda.

Step 3: Using Python with `pyfaidx`

If you prefer Python, follow these steps:

Install pyfaidx:
bash
pip install pyfaidx
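Prepare a BED file: Convert your regions into BED format (tab-delimited). BED starts are 0-based and ends are exclusive, so subtract 1 from each start and keep each end unchanged:
lyrata	0	108
lyrata	146	234
Arabidopsis	0	108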
Run the extraction:
bash
faidx --bed regions.bed input.fasta
Save the Output: Redirect the output to a new file:
bash
faidx --bed regions.bed input.fasta > extracted_regions.fasta
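
The BED conversion in this step is easy to get wrong by hand, so it may help to script it. The following is a minimal sketch, assuming the regions file uses 1-based inclusive coordinates separated by whitespace. The helper names `region_to_bed` and `regions_to_bed` are hypothetical.

```python
def region_to_bed(name, start, end):
    """Convert a 1-based inclusive region to a tab-delimited BED line."""
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {name} {start} {end}")
    # BED is 0-based with an exclusive end: shift the start, keep the end
    return f"{name}\t{start - 1}\t{end}"

def regions_to_bed(lines):
    """Convert 'name start end' lines to BED lines, skipping blank ones."""
    bed_lines = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        bed_lines.append(region_to_bed(fields[0], int(fields[1]), int(fields[2])))
    return bed_lines

example = ["lyrata 1 108", "lyrata 147 234", "Arabidopsis 1 108"]
for bed_line in regions_to_bed(example):
    print(bed_line)
```

To produce regions.bed from a real file, read the lines with `open("regions.txt")` and write the returned lines, joined with newlines, to `regions.bed`.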

Step 4: Using UNIX Commands (AWK)

AWK can extract sequences from the FASTA file based on region specifications:

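Prepare the Script: Save this AWK script as extract_regions.awk. It reads regions.txt from the current directory and accepts space- or tab-separated columns:
awk
# Strip Windows carriage returns so names and sequences stay clean
{ sub(/\r$/, "") }
# Header line: the sequence name is the first word after ">"
/^>/ { seq_name = substr($1, 2); next }
# Sequence line: append it to the current sequence
{ seq[seq_name] = seq[seq_name] $0 }
END {
    # Each region line is: name start end (1-based, inclusive)
    while ((getline line < "regions.txt") > 0) {
        sub(/\r$/, "", line)
        if (split(line, region) < 3) continue
        name = region[1]; start = region[2]; end = region[3]
        print ">" name ":" start "-" end
        print substr(seq[name], start, end - start + 1)
    }
    close("regions.txt")
}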
Run the Script:
bash
awk -f extract_regions.awk input.fasta > extracted_regions.fasta

Step 5: Using Perl

Perl can handle batch queries efficiently.

Prepare the Script: Save this as extract_regions.pl:
perl
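use strict;
use warnings;

my $fasta_file  = 'input.fasta';
my $region_file = 'regions.txt';
my %sequences;

# Read the FASTA file, keying each sequence by the first word of its header
open my $FASTA, '<', $fasta_file or die "Cannot open $fasta_file: $!";
my $header;
while (<$FASTA>) {
    s/\r?\n$//;
    if (/^>(\S+)/) {
        $header = $1;
    } elsif (defined $header) {
        $sequences{$header} .= $_;
    }
}
close $FASTA;

# Read the regions (name, start, end; 1-based, inclusive) and extract each one
open my $REGIONS, '<', $region_file or die "Cannot open $region_file: $!";
while (<$REGIONS>) {
    s/\r?\n$//;
    next unless /\S/;
    my ($seq_name, $start, $end) = split ' ';
    unless (exists $sequences{$seq_name}) {
        warn "Sequence '$seq_name' not found in $fasta_file\n";
        next;
    }
    # substr takes a 0-based offset and a length
    my $extracted = substr($sequences{$seq_name}, $start - 1, $end - $start + 1);
    print ">$seq_name:$start-$end\n$extracted\n";
}
close $REGIONS;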
Run the Script:
bash
perl extract_regions.pl > extracted_regions.fasta
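
The AWK and Perl scripts both turn a 1-based inclusive range into an offset and a length, and neither checks whether the range fits inside the sequence. The Python sketch below shows the same arithmetic with explicit bounds checks. The names `extract_region` and `format_record` are illustrative, and the demo sequence is made up.

```python
def extract_region(sequence, start, end):
    """Return residues start..end (1-based, inclusive) of sequence."""
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates: {start}-{end}")
    if end > len(sequence):
        raise ValueError(
            f"region {start}-{end} exceeds sequence length {len(sequence)}"
        )
    # A Python slice is 0-based with an exclusive end, like BED
    return sequence[start - 1:end]

def format_record(name, start, end, sequence):
    """Build a FASTA record with a name:start-end header."""
    return f">{name}:{start}-{end}\n{extract_region(sequence, start, end)}"

demo = "MGIKGLTKLLADNAPS"  # made-up 16-residue sequence
print(format_record("demo", 3, 8, demo))
```

Raising an error on an out-of-range region is one possible design choice. Silently truncating, as `substr` does, is the other, but it can hide typos in the regions file.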

Step 6: Using `samtools`

If your sequences are nucleotide-based:

Index the FASTA:
bash
samtools faidx input.fasta
Query Regions:
bash
samtools faidx input.fasta lyrata:1-108 > extracted_regions.fasta
Batch Extraction: Use a loop for multiple regions:
bash
while read -r seq start end; do
    # Skip blank lines; each line of regions.txt is: name start end
    [ -z "$seq" ] && continue
    samtools faidx input.fasta "${seq}:${start}-${end}"
done < regions.txt > extracted_regions.fasta
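
The batch step comes down to turning each line of the regions file into a `name:start-end` string. As a minimal sketch, assuming whitespace-separated columns with 1-based inclusive coordinates, the conversion can be written as follows. The function names are hypothetical.

```python
def to_region_string(name, start, end):
    """Format a 1-based inclusive region as name:start-end."""
    return f"{name}:{int(start)}-{int(end)}"

def regions_to_strings(lines):
    """Convert 'name start end' lines to region strings, skipping blank ones."""
    region_strings = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        region_strings.append(to_region_string(*fields[:3]))
    return region_strings

example = ["lyrata 1 108", "lyrata 147 234", "", "Brachypodium 1 93"]
print("\n".join(regions_to_strings(example)))
```

Several such strings can be given to one `samtools faidx` call after the FASTA file name. Newer samtools releases also document a `-r`/`--region-file` option that reads them from a file, one per line. Confirm this with `samtools faidx --help` on your installation before relying on it.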

Step 7: Handling Protein Sequences

If working with protein sequences:

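Check identifiers: Region names must match the sequence identifiers in the FASTA headers exactly. Coordinates then count amino-acid residues rather than bases.
Use custom delimiters: For composite headers (for example `sp|P12345|NAME`), tell pyfaidx how to split the name, using the `--delimiter` option of `faidx` or the `key_function` argument of `pyfaidx.Fasta`.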

Step 8: Final Validation

Ensure extracted regions are correct by:

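Manually checking extracted FASTA entries: each sequence should be exactly end - start + 1 residues long.
Using tools like seqkit for sequence validation, for example `seqkit stats extracted_regions.fasta` to summarize record counts and lengths.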
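
The length check can also be automated. The sketch below assumes the output headers follow the `name:start-end` pattern produced by the steps above and reports any record whose length differs from end - start + 1. The names `parse_records` and `find_length_mismatches` are illustrative, and the demo records are made up.

```python
import re

# Header text after ">" is expected to look like name:start-end
HEADER_PATTERN = re.compile(r"^(\S+):(\d+)-(\d+)$")

def parse_records(text):
    """Return a list of [header, sequence] pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return records

def find_length_mismatches(text):
    """Return a list of problem descriptions; empty means all records pass."""
    problems = []
    for header, sequence in parse_records(text):
        match = HEADER_PATTERN.match(header)
        if match is None:
            problems.append(f"{header}: header is not name:start-end")
            continue
        start, end = int(match.group(2)), int(match.group(3))
        expected = end - start + 1
        if len(sequence) != expected:
            problems.append(
                f"{header}: expected {expected} residues, found {len(sequence)}"
            )
    return problems

good = ">demo:3-8\nIKGLTK\n"
bad = ">demo:3-8\nIKGL\n"
print(find_length_mismatches(good))
print(find_length_mismatches(bad))
```

A shorter-than-expected record usually means the requested region runs past the end of the source sequence.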

This guide should help you extract user-defined regions from FASTA files using accessible and flexible tools. Adjust steps for your specific environment and needs.