
Step-by-Step Guide: Customizing BLAST Output

December 27, 2024, by admin

1. Understand BLAST Output Options

BLAST’s -outfmt parameter controls the format and content of the output file. The available formats include pairwise alignments (the default), tabular output, XML, and custom tabular configurations.

Key Points:

  • The tabular format (-outfmt 6) includes 12 standard fields by default, such as query ID, subject ID, and % identity.
  • Custom tabular output can include additional fields: list space-delimited format specifiers (e.g., qseq, sseq) after the format number, inside one quoted string.
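
Because the order of the format specifiers determines the column order in the output, it can help to look up a field's column number rather than count by hand. The following is a minimal sketch; column_of is a hypothetical helper written for this guide, not part of BLAST+.

```bash
#!/bin/bash
# Format string as passed to -outfmt: the format number first, then the field names.
OUTFMT="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sseq"

# Print the 1-based column number of a field in a custom tabular format string.
# column_of is a hypothetical helper, not part of BLAST+.
column_of() {
    local field="$1" outfmt="$2"
    local -a names
    read -r -a names <<< "$outfmt"
    local i
    # names[0] is the format number, so names[i] corresponds to output column i.
    for ((i = 1; i < ${#names[@]}; i++)); do
        if [[ "${names[i]}" == "$field" ]]; then
            echo "$i"
            return 0
        fi
    done
    echo "field '$field' not in format string" >&2
    return 1
}

column_of sseq "$OUTFMT"     # prints 13
column_of pident "$OUTFMT"   # prints 3
```

The number printed is the one to use as $N in awk or as the field number in cut -f.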

2. Basic BLAST Command

A typical blastn command looks like this:

bash
blastn -db BLASTDB -query input.fa -out output_file -word_size 7 -perc_identity 100 -outfmt 6 -max_target_seqs 2

Here:

  • -db BLASTDB: Specifies the database.
  • -query input.fa: Specifies the input query file.
  • -out output_file: Specifies the output file name.
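  • -word_size 7: Sets the length of the initial exact match (seed) that starts an alignment.
  • -perc_identity 100: Keeps only alignments with 100% identity.
  • -outfmt 6: Produces tabular output.
  • -max_target_seqs 2: Keeps at most 2 subject sequences per query. Each subject can still contribute more than one alignment row.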
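
Note that -max_target_seqs caps the number of subject sequences, and one subject can yield several alignments, so the tabular file may hold more than 2 rows per query. If a strict row limit is wanted, one option is to post-filter by bit score. This is a sketch only: top_hits_per_query is a hypothetical helper, and it assumes the default -outfmt 6 column order (query ID in column 1, bit score in column 12).

```bash
#!/bin/bash
# Keep the N highest-scoring rows per query from a tab-delimited BLAST file.
# Hypothetical helper; assumes query ID in column 1 and bit score in column 12.
top_hits_per_query() {
    local file="$1" n="${2:-2}"
    # Sort by query ID, then by bit score (general numeric, descending),
    # and let awk pass through only the first n rows seen for each query.
    LC_ALL=C sort -t $'\t' -k1,1 -k12,12gr "$file" |
        awk -F'\t' -v n="$n" '++seen[$1] <= n'
}

# Usage (output_file is the file written by blastn above):
#   top_hits_per_query output_file 2 > top_hits.txt
```

The sort step reorders queries alphabetically, so the result will not preserve the original query order.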

3. Adding the Target Sequence to the Output

To include the actual aligned subject sequence (sseq), you need to modify the -outfmt parameter.

Updated Command:

bash
blastn -db BLASTDB -query input.fa -out output_file -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sseq" -word_size 7 -perc_identity 100 -max_target_seqs 2

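Explanation of Fields:

  • sseq: The aligned part of the subject (target) sequence, including any gap characters (-). This is the one field added here; the remaining twelve are the standard tabular columns.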
  • qseqid: Query sequence ID.
  • sseqid: Subject sequence ID.
  • pident: Percentage of identical matches.
  • length: Alignment length.
  • mismatch: Number of mismatches.
  • gapopen: Number of gap openings.
  • qstart/qend: Start and end positions in the query.
  • sstart/send: Start and end positions in the subject.
  • evalue: Expect value.
  • bitscore: Bit score.
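
Tabular output (-outfmt 6) does not contain a header row, so it is easy to lose track of which column is which. One possible approach is to build a header from the same format string used for -outfmt. This is a sketch; with_header is a hypothetical helper, not a BLAST+ feature.

```bash
#!/bin/bash
# Print a tab-separated header derived from the -outfmt string, then the file.
# with_header is a hypothetical helper, not part of BLAST+.
with_header() {
    local outfmt="$1" file="$2"
    local -a names
    read -r -a names <<< "$outfmt"
    # Skip names[0] (the format number) and join the field names with tabs.
    local IFS=$'\t'
    echo "${names[*]:1}"
    cat "$file"
}

# Usage:
#   with_header "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sseq" output_file > output_with_header.tsv
```

Keep the headerless file for downstream tools such as awk, since a header line would otherwise be treated as a data row.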

4. Extracting Specific Information

If you only want specific information from the output, use grep, awk, or cut commands in UNIX.

Example: Extract Query ID and Target Sequence

bash
awk -F'\t' -v OFS='\t' '{print $1, $13}' output_file > query_target_sequences.txt

Here:

  • $1 refers to qseqid (Query ID).
  • $13 refers to sseq (Subject Sequence) based on the -outfmt order.
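
A common follow-up is to turn the extracted subject sequences into FASTA records. The sketch below assumes the 13-column layout from step 3 (sstart in column 9, send in column 10, sseq in column 13) and strips the gap characters that sseq can contain. hits_to_fasta and the header layout are illustrative choices, not a BLAST+ convention.

```bash
#!/bin/bash
# Convert 13-column tabular BLAST output to FASTA, one record per alignment row.
# hits_to_fasta is a hypothetical helper; the header format is an arbitrary choice.
hits_to_fasta() {
    awk -F'\t' '{
        seq = $13
        gsub(/-/, "", seq)    # remove alignment gap characters from the sequence
        # Header: query ID, subject ID, and subject coordinates of the alignment.
        printf ">%s|%s:%s-%s\n%s\n", $1, $2, $9, $10, seq
    }' "$1"
}

# Usage:
#   hits_to_fasta output_file > subject_hits.fa
```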

5. Using Perl for Further Customization

If additional post-processing is needed, Perl scripts can help parse and manipulate the BLAST output.

Example: Extract Query ID, Subject ID, and Target Sequence

perl
#!/usr/bin/perl
use strict;
use warnings;

my $blast_output = 'output_file';
open(my $fh, '<', $blast_output) or die "Cannot open $blast_output: $!";

while (my $line = <$fh>) {
    chomp $line;
    next if $line eq '';    # Skip blank lines
    my @fields = split(/\t/, $line);
    my ($qseqid, $sseqid, $sseq) = @fields[0, 1, 12]; # Adjust indices based on -outfmt
    print "$qseqid\t$sseqid\t$sseq\n";
}
close($fh);

Save this script as extract_sequences.pl and run it:

bash
perl extract_sequences.pl > extracted_output.txt

6. Verifying the Results

After running the command or script, inspect the output to ensure the desired columns are present.

Quick Inspection Commands:

bash
head output_file # Displays the first 10 lines of the output
bash
grep -E "desired_pattern" output_file # Filters specific patterns
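
Beyond eyeballing the first lines, the column count can be checked for every row. The following sketch reports any line that does not have the expected number of tab-separated fields (13 for the format string used in this guide). check_columns is a hypothetical helper.

```bash
#!/bin/bash
# Report rows whose number of tab-separated fields differs from the expected count.
# Returns 0 if every row matches, 1 otherwise. Hypothetical helper.
check_columns() {
    local file="$1" expected="${2:-13}"
    awk -F'\t' -v n="$expected" '
        NF != n { printf "line %d: %d columns\n", NR, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$file"
}

# Usage:
#   check_columns output_file 13 && echo "all rows have 13 columns"
```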

7. Automating the Workflow

To streamline the process, create a shell script for the BLAST execution and formatting.

Example: Shell Script (run_blast.sh)

bash
#!/bin/bash
DB="BLASTDB"
QUERY="input.fa"
OUTFILE="output_file"
OUTFMT="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sseq"

blastn -db "$DB" -query "$QUERY" -out "$OUTFILE" -outfmt "$OUTFMT" -word_size 7 -perc_identity 100 -max_target_seqs 2

Make the script executable and run it:

bash
chmod +x run_blast.sh
./run_blast.sh

8. Troubleshooting Tips

  • Ensure BLAST+ is correctly installed and accessible in your PATH.
  • Check the BLAST database with blastdbcmd (for example, blastdbcmd -db BLASTDB -info) if there are issues with sseq.
  • Validate your input FASTA file for proper formatting.
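
Some of these checks can be scripted and run before BLAST itself. The sketch below covers two of them: whether blastn is on the PATH, and whether the query file looks like FASTA. The function names are hypothetical, and the FASTA test is deliberately shallow (it only checks that the first non-blank line starts with ">").

```bash
#!/bin/bash
# Return 0 if blastn can be found on the PATH, 1 otherwise.
check_blast_on_path() {
    command -v blastn >/dev/null 2>&1 || {
        echo "blastn not found in PATH" >&2
        return 1
    }
}

# Shallow FASTA check: the file must be non-empty and its first
# non-blank line must be a header line starting with ">".
check_fasta() {
    local file="$1" first
    [[ -s "$file" ]] || { echo "missing or empty: $file" >&2; return 1; }
    first=$(grep -m 1 -v '^[[:space:]]*$' "$file" || true)
    [[ "$first" == ">"* ]] || { echo "no FASTA header: $file" >&2; return 1; }
}

# Usage:
#   check_blast_on_path && check_fasta input.fa && ./run_blast.sh
```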

By following this guide, you can efficiently customize BLAST output to include specific fields such as the aligned target sequence. Using UNIX tools and scripting enhances reproducibility and automation, making your analysis more robust and scalable.
