Getting Started with Automated Protein Sequence Analysis and Interpretation: A Linux and Bash Scripting Approach to BLAST

September 24, 2023 Off By admin

Starting with Linux and performing bioinformatics analyses can seem daunting at first, but with a few basics, you can begin to analyze protein sequences with ease. Below is a step-by-step guide with a case study to get you started:

Table of Contents

1. Basics of Linux Terminal:

Navigate between directories:
sh
cd directory_name # Changes the current directory to the specified one ls # Lists all files and directories in the current directory pwd # Prints the current working directory
Create and remove directories:
sh
mkdir directory_name # Makes a new directory rmdir directory_name # Removes a directory
Create and remove files:
sh
touch file_name # Creates a new file rm file_name # Removes a file

2. Installing Bioinformatics Software:

Most Linux distributions use a package manager like apt (for Ubuntu/Debian) or yum (for CentOS/Red Hat) to install software. For protein sequence analysis, we’ll use BLAST as an example.

sudo apt-get update
 sudo apt-get install ncbi-blast+

3. Case Study: Protein Sequence Analysis using BLAST

Assume you have a protein sequence in a FASTA file, query.fasta. You want to find similar protein sequences from a protein database using BLAST.

Step 1: Prepare the Database

Download a protein database, like the Swiss-Prot database, or use a pre-formatted one. Here, we are assuming you have a database file swissprot.fasta.

makeblastdb -in swissprot.fasta -dbtype prot

Step 2: Run BLAST

Run BLAST with the query sequence against the prepared database.

blastp -query query.fasta -db swissprot.fasta -out output.txt

Step 3: Analyze Output

Inspect the output.txt file to view the BLAST results, analyze alignments, and infer biological significance. You can use less or more command to view the file.

less output.txt

4. Additional Useful Commands:

Search within a file:
sh
grep "search_term" file_name # Searches for a term within a file
Download Files from the Internet:
sh
wget http://example.com/file.fasta # Downloads a file from the internet
Count Lines in a File:
sh
wc -l file_name # Counts the number of lines in a file

5. Learning Further:

Explore more advanced bioinformatics tools like HMMER for domain analysis, Clustal Omega, or MUSCLE for multiple sequence alignments.
Look into Bioconda for managing bioinformatics software and dependencies.
Consider using BioPython for creating and executing more advanced and custom bioinformatics pipelines in Python.

By practicing and exploring these commands and tools, you can gradually become proficient in performing protein sequence analysis using the Linux terminal.

Step 1: Viewing the Output

You can use terminal-based text editors or viewers to read and search through the BLAST output file:

less output.txt # Use arrow keys to navigate and ‘q’ to quit

nano output.txt # Use arrow keys to navigate, 'CTRL + X' to exit, 'Y' to save changes, and 'N' to discard changes.

Step 2: Parsing BLAST Output

BLAST allows you to format the output. For instance, if you want the output to be in a tabular format, which is easier to parse, you can run BLAST as follows:

blastp -query query.fasta -db swissprot.fasta -out output.txt -outfmt 6

Here, -outfmt 6 generates a tab-separated file with the following columns:

query acc.ver
subject acc.ver
% identity
alignment length
mismatches
gap opens
q. start
q. end
s. start
s. end
evalue
bit score

Extract Specific Columns

You can use awk to extract specific columns. For example, to extract the first two columns (query and subject accession.version):

awk '{print $1 "\t" $2}' output.txt > parsed_output.txt

Step 3: Analyzing Specific Results

Filter by E-value

To filter results with an e-value less than 1e-5 (considered statistically significant), use:

awk '$11 < 1e-5' output.txt > filtered_output.txt

Sort by Bit Score

To sort the results by bit score in descending order, use:

sort -k12,12nr output.txt > sorted_output.txt

Step 4: Visualizing Results

You might want to visualize the alignments using tools like Jalview or even just with the BLAST output. For quick visualization and interpretation, you can use the BLAST default output format or other formats that BLAST provides (e.g., XML) and visualize them using suitable viewers.

Step 5: Further Analysis

Once you have identified sequences with significant similarity to your query sequence, you can perform further analyses such as:

Fetching the full-length sequences of the hits using efetch or similar tools and aligning them with your query using tools like Clustal Omega or MUSCLE.
Analyzing the domains present in the aligned sequences using tools like InterProScan.
Running a phylogenetic analysis to understand the evolutionary relationships between the sequences.

Additional Tip:

To get better at analyzing BLAST results, it is crucial to understand the biology behind it and to get accustomed to reading and interpreting the BLAST output. This involves practicing with different sequences and trying to make biological sense of the BLAST hits.
Regularly refer to the BLAST help documentation or manual pages (man blastp) for detailed and updated information on the various options and output formats available.

Step 1: Create a Bash Script File

Create a new file, for example, blast_script.sh.

touch blast_script.sh
 nano blast_script.sh

Step 2: Write the Bash Script

Paste the following bash script in blast_script.sh. This script runs BLAST, filters results by e-value, sorts by bit score, and extracts specific columns.

bash

#!/bin/bash
# Define variables
 QUERY="query.fasta"
 DB="swissprot.fasta"
 OUTPUT="output.txt"
 FILTERED="filtered_output.txt"
 SORTED="sorted_output.txt"
 PARSED="parsed_output.txt"
# Run BLAST
 blastp -query $QUERY -db $DB -out $OUTPUT -outfmt 6
# Filter by e-value and sort by bit score
 awk '$11 < 1e-5' $OUTPUT > $FILTERED
 sort -k12,12nr $FILTERED > $SORTED
# Extract specific columns
 awk '{print $1 "\t" $2 "\t" $11 "\t" $12}' $SORTED > $PARSED

# Display the top 5 results echo "Top 5 BLAST Hits:" head -n 5 $PARSED

Step 3: Run the Bash Script

Make your script executable and run it:

chmod +x blast_script.sh
 ./blast_script.sh

Step 4: Further Analysis with Bash

Fetching Sequences: Once you’ve parsed the BLAST output, you may want to fetch the sequences of your hits. You can use efetch from the NCBI toolkit or wget to download sequences.
Multiple Sequence Alignment: You can use command-line tools like clustalo for multiple sequence alignments and append those commands to your script.

For Example:

If you want to align the top 5 hits using clustalo, you can append the following to your bash script:

bash

# Extract sequences of the top 5 hits and save them in a new file
 awk '{print $2}' $PARSED | head -n 5 > top_hits.txt
# Fetch sequences and align them using clustalo
 while read -r line; do
 efetch -db protein -id "$line" -format fasta >> top_hits.fasta
 done < top_hits.txt

clustalo -i top_hits.fasta -o aligned_top_hits.fasta

Step 5: Automating Visualization and Additional Analysis

For more advanced visualizations and additional analysis steps like phylogenetic analysis, domain analysis, etc., consider integrating Python scripts (using BioPython) or R scripts (using Bioconductor packages) within your Bash script. These scripts can read the parsed BLAST output or the aligned sequences and generate plots, tables, or other summary statistics.

Final Note

This bash script is very basic and serves as a starting point. Depending on your needs and the complexity of your analysis, you may need to modify it, add error handling, logging, etc., and combine it with other scripts and tools for a complete analysis pipeline.