Getting Started with Automated Protein Sequence Analysis and Interpretation: A Linux and Bash Scripting Approach to BLAST
September 24, 2023Starting with Linux and performing bioinformatics analyses can seem daunting at first, but with a few basics, you can begin to analyze protein sequences with ease. Below is a step-by-step guide with a case study to get you started:
1. Basics of Linux Terminal:
- Navigate between directories:sh
cd directory_name # Changes the current directory to the specified one
ls # Lists all files and directories in the current directory
pwd # Prints the current working directory
- Create and remove directories:sh
mkdir directory_name # Makes a new directory
rmdir directory_name # Removes a directory
- Create and remove files:sh
touch file_name # Creates a new file
rm file_name # Removes a file
2. Installing Bioinformatics Software:
Most Linux distributions use a package manager like apt
(for Ubuntu/Debian) or yum
(for CentOS/Red Hat) to install software. For protein sequence analysis, we’ll use BLAST as an example.
sudo apt-get update
sudo apt-get install ncbi-blast+
3. Case Study: Protein Sequence Analysis using BLAST
Assume you have a protein sequence in a FASTA file, query.fasta
. You want to find similar protein sequences from a protein database using BLAST.
Step 1: Prepare the Database
Download a protein database, like the Swiss-Prot database, or use a pre-formatted one. Here, we are assuming you have a database file swissprot.fasta
.
makeblastdb -in swissprot.fasta -dbtype prot
Step 2: Run BLAST
Run BLAST with the query sequence against the prepared database.
blastp -query query.fasta -db swissprot.fasta -out output.txt
Step 3: Analyze Output
Inspect the output.txt
file to view the BLAST results, analyze alignments, and infer biological significance. You can use less
or more
command to view the file.
less output.txt
4. Additional Useful Commands:
- Search within a file:sh
grep "search_term" file_name # Searches for a term within a file
- Download Files from the Internet:sh
wget http://example.com/file.fasta # Downloads a file from the internet
- Count Lines in a File:sh
wc -l file_name # Counts the number of lines in a file
5. Learning Further:
- Explore more advanced bioinformatics tools like HMMER for domain analysis, Clustal Omega, or MUSCLE for multiple sequence alignments.
- Look into Bioconda for managing bioinformatics software and dependencies.
- Consider using BioPython for creating and executing more advanced and custom bioinformatics pipelines in Python.
By practicing and exploring these commands and tools, you can gradually become proficient in performing protein sequence analysis using the Linux terminal.
Step 1: Viewing the Output
You can use terminal-based text editors or viewers to read and search through the BLAST output file:
less output.txt # Use arrow keys to navigate and ‘q’ to quit
or
nano output.txt # Use arrow keys to navigate, 'CTRL + X' to exit, 'Y' to save changes, and 'N' to discard changes.
Step 2: Parsing BLAST Output
BLAST allows you to format the output. For instance, if you want the output to be in a tabular format, which is easier to parse, you can run BLAST as follows:
blastp -query query.fasta -db swissprot.fasta -out output.txt -outfmt 6
Here, -outfmt 6
generates a tab-separated file with the following columns:
- query acc.ver
- subject acc.ver
- % identity
- alignment length
- mismatches
- gap opens
- q. start
- q. end
- s. start
- s. end
- evalue
- bit score
Extract Specific Columns
You can use awk
to extract specific columns. For example, to extract the first two columns (query and subject accession.version):
awk '{print $1 "\t" $2}' output.txt > parsed_output.txt
Step 3: Analyzing Specific Results
Filter by E-value
To filter results with an e-value less than 1e-5 (considered statistically significant), use:
awk '$11 < 1e-5' output.txt > filtered_output.txt
Sort by Bit Score
To sort the results by bit score in descending order, use:
sort -k12,12nr output.txt > sorted_output.txt
Step 4: Visualizing Results
You might want to visualize the alignments using tools like Jalview or even just with the BLAST output. For quick visualization and interpretation, you can use the BLAST default output format or other formats that BLAST provides (e.g., XML) and visualize them using suitable viewers.
Step 5: Further Analysis
Once you have identified sequences with significant similarity to your query sequence, you can perform further analyses such as:
- Fetching the full-length sequences of the hits using
efetch
or similar tools and aligning them with your query using tools like Clustal Omega or MUSCLE. - Analyzing the domains present in the aligned sequences using tools like InterProScan.
- Running a phylogenetic analysis to understand the evolutionary relationships between the sequences.
Additional Tip:
- To get better at analyzing BLAST results, it is crucial to understand the biology behind it and to get accustomed to reading and interpreting the BLAST output. This involves practicing with different sequences and trying to make biological sense of the BLAST hits.
- Regularly refer to the BLAST help documentation or manual pages (
man blastp
) for detailed and updated information on the various options and output formats available.
Step 1: Create a Bash Script File
Create a new file, for example, blast_script.sh
.
touch blast_script.sh
nano blast_script.sh
Step 2: Write the Bash Script
Paste the following bash script in blast_script.sh
. This script runs BLAST, filters results by e-value, sorts by bit score, and extracts specific columns.
# Define variables
QUERY="query.fasta"
DB="swissprot.fasta"
OUTPUT="output.txt"
FILTERED="filtered_output.txt"
SORTED="sorted_output.txt"
PARSED="parsed_output.txt"
# Run BLAST
blastp -query $QUERY -db $DB -out $OUTPUT -outfmt 6
# Filter by e-value and sort by bit score
awk '$11 < 1e-5' $OUTPUT > $FILTERED
sort -k12,12nr $FILTERED > $SORTED
# Extract specific columns
awk '{print $1 "\t" $2 "\t" $11 "\t" $12}' $SORTED > $PARSED
# Display the top 5 results
echo "Top 5 BLAST Hits:"
head -n 5 $PARSED
Step 3: Run the Bash Script
Make your script executable and run it:
chmod +x blast_script.sh
./blast_script.sh
Step 4: Further Analysis with Bash
- Fetching Sequences: Once you’ve parsed the BLAST output, you may want to fetch the sequences of your hits. You can use
efetch
from the NCBI toolkit orwget
to download sequences. - Multiple Sequence Alignment: You can use command-line tools like
clustalo
for multiple sequence alignments and append those commands to your script.
For Example:
If you want to align the top 5 hits using clustalo
, you can append the following to your bash script:
# Extract sequences of the top 5 hits and save them in a new file
awk '{print $2}' $PARSED | head -n 5 > top_hits.txt# Fetch sequences and align them using clustalo
while read -r line; do
efetch -db protein -id "$line" -format fasta >> top_hits.fasta
done < top_hits.txt
clustalo -i top_hits.fasta -o aligned_top_hits.fasta
Step 5: Automating Visualization and Additional Analysis
For more advanced visualizations and additional analysis steps like phylogenetic analysis, domain analysis, etc., consider integrating Python scripts (using BioPython) or R scripts (using Bioconductor packages) within your Bash script. These scripts can read the parsed BLAST output or the aligned sequences and generate plots, tables, or other summary statistics.
Final Note
This bash script is very basic and serves as a starting point. Depending on your needs and the complexity of your analysis, you may need to modify it, add error handling, logging, etc., and combine it with other scripts and tools for a complete analysis pipeline.