bioinformatics business service

Getting Started with Automated Protein Sequence Analysis and Interpretation: A Linux and Bash Scripting Approach to BLAST

September 24, 2023 Off By admin
Shares

Starting with Linux and performing bioinformatics analyses can seem daunting at first, but with a few basics, you can begin to analyze protein sequences with ease. Below is a step-by-step guide with a case study to get you started:

1. Basics of Linux Terminal:

  • Navigate between directories:
    sh
    cd directory_name # Changes the current directory to the specified one
    ls # Lists all files and directories in the current directory
    pwd # Prints the current working directory
  • Create and remove directories:
    sh
    mkdir directory_name # Makes a new directory
    rmdir directory_name # Removes a directory
  • Create and remove files:
    sh
    touch file_name # Creates a new file
    rm file_name # Removes a file

2. Installing Bioinformatics Software:

Most Linux distributions use a package manager like apt (for Ubuntu/Debian) or yum (for CentOS/Red Hat) to install software. For protein sequence analysis, we’ll use BLAST as an example.

sh
sudo apt-get update
sudo apt-get install ncbi-blast+

3. Case Study: Protein Sequence Analysis using BLAST

Assume you have a protein sequence in a FASTA file, query.fasta. You want to find similar protein sequences from a protein database using BLAST.

Step 1: Prepare the Database

Download a protein database, like the Swiss-Prot database, or use a pre-formatted one. Here, we are assuming you have a database file swissprot.fasta.

sh
makeblastdb -in swissprot.fasta -dbtype prot

Step 2: Run BLAST

Run BLAST with the query sequence against the prepared database.

sh
blastp -query query.fasta -db swissprot.fasta -out output.txt

Step 3: Analyze Output

Inspect the output.txt file to view the BLAST results, analyze alignments, and infer biological significance. You can use less or more command to view the file.

sh
less output.txt

4. Additional Useful Commands:

  • Search within a file:
    sh
    grep "search_term" file_name # Searches for a term within a file
  • Download Files from the Internet:
    sh
    wget http://example.com/file.fasta # Downloads a file from the internet
  • Count Lines in a File:
    sh
    wc -l file_name # Counts the number of lines in a file

5. Learning Further:

By practicing and exploring these commands and tools, you can gradually become proficient in performing protein sequence analysis using the Linux terminal.

Step 1: Viewing the Output

You can use terminal-based text editors or viewers to read and search through the BLAST output file:

sh
less output.txt # Use arrow keys to navigate and ‘q’ to quit

or

sh
nano output.txt # Use arrow keys to navigate, 'CTRL + X' to exit, 'Y' to save changes, and 'N' to discard changes.

Step 2: Parsing BLAST Output

BLAST allows you to format the output. For instance, if you want the output to be in a tabular format, which is easier to parse, you can run BLAST as follows:

sh
blastp -query query.fasta -db swissprot.fasta -out output.txt -outfmt 6

Here, -outfmt 6 generates a tab-separated file with the following columns:

  1. query acc.ver
  2. subject acc.ver
  3. % identity
  4. alignment length
  5. mismatches
  6. gap opens
  7. q. start
  8. q. end
  9. s. start
  10. s. end
  11. evalue
  12. bit score

Extract Specific Columns

You can use awk to extract specific columns. For example, to extract the first two columns (query and subject accession.version):

sh
awk '{print $1 "\t" $2}' output.txt > parsed_output.txt

Step 3: Analyzing Specific Results

Filter by E-value

To filter results with an e-value less than 1e-5 (considered statistically significant), use:

sh
awk '$11 < 1e-5' output.txt > filtered_output.txt

Sort by Bit Score

To sort the results by bit score in descending order, use:

sh
sort -k12,12nr output.txt > sorted_output.txt

Step 4: Visualizing Results

You might want to visualize the alignments using tools like Jalview or even just with the BLAST output. For quick visualization and interpretation, you can use the BLAST default output format or other formats that BLAST provides (e.g., XML) and visualize them using suitable viewers.

Step 5: Further Analysis

Once you have identified sequences with significant similarity to your query sequence, you can perform further analyses such as:

  • Fetching the full-length sequences of the hits using efetch or similar tools and aligning them with your query using tools like Clustal Omega or MUSCLE.
  • Analyzing the domains present in the aligned sequences using tools like InterProScan.
  • Running a phylogenetic analysis to understand the evolutionary relationships between the sequences.

Additional Tip:

  • To get better at analyzing BLAST results, it is crucial to understand the biology behind it and to get accustomed to reading and interpreting the BLAST output. This involves practicing with different sequences and trying to make biological sense of the BLAST hits.
  • Regularly refer to the BLAST help documentation or manual pages (man blastp) for detailed and updated information on the various options and output formats available.

 

Step 1: Create a Bash Script File

Create a new file, for example, blast_script.sh.

sh
touch blast_script.sh
nano blast_script.sh

Step 2: Write the Bash Script

Paste the following bash script in blast_script.sh. This script runs BLAST, filters results by e-value, sorts by bit score, and extracts specific columns.

bash
#!/bin/bash

# Define variables
QUERY="query.fasta"
DB="swissprot.fasta"
OUTPUT="output.txt"
FILTERED="filtered_output.txt"
SORTED="sorted_output.txt"
PARSED="parsed_output.txt"

# Run BLAST
blastp -query $QUERY -db $DB -out $OUTPUT -outfmt 6

# Filter by e-value and sort by bit score
awk '$11 < 1e-5' $OUTPUT > $FILTERED
sort -k12,12nr $FILTERED > $SORTED

# Extract specific columns
awk '{print $1 "\t" $2 "\t" $11 "\t" $12}' $SORTED > $PARSED

# Display the top 5 results
echo "Top 5 BLAST Hits:"
head -n 5 $PARSED

Step 3: Run the Bash Script

Make your script executable and run it:

sh
chmod +x blast_script.sh
./blast_script.sh

Step 4: Further Analysis with Bash

  • Fetching Sequences: Once you’ve parsed the BLAST output, you may want to fetch the sequences of your hits. You can use efetch from the NCBI toolkit or wget to download sequences.
  • Multiple Sequence Alignment: You can use command-line tools like clustalo for multiple sequence alignments and append those commands to your script.

For Example:

If you want to align the top 5 hits using clustalo, you can append the following to your bash script:

bash
# Extract sequences of the top 5 hits and save them in a new file
awk '{print $2}' $PARSED | head -n 5 > top_hits.txt

# Fetch sequences and align them using clustalo
while read -r line; do
efetch -db protein -id "$line" -format fasta >> top_hits.fasta
done < top_hits.txt

clustalo -i top_hits.fasta -o aligned_top_hits.fasta

Step 5: Automating Visualization and Additional Analysis

For more advanced visualizations and additional analysis steps like phylogenetic analysis, domain analysis, etc., consider integrating Python scripts (using BioPython) or R scripts (using Bioconductor packages) within your Bash script. These scripts can read the parsed BLAST output or the aligned sequences and generate plots, tables, or other summary statistics.

Final Note

This bash script is very basic and serves as a starting point. Depending on your needs and the complexity of your analysis, you may need to modify it, add error handling, logging, etc., and combine it with other scripts and tools for a complete analysis pipeline.

Shares