Running and parsing BLAST+ using the Linux command line
March 12, 2024Table of Contents
About the tutorials
The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line, along with useful AWK, Bash, Perl and R code and idioms, including 1-liner programs and fully fledged scripts to parse and analyze results efficiently. Particularly helpful is the advanced compute_RBH_clusters.sh script.
ASSUMPTIONS
This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the [legacy- and] blast+ software suite(s), which are freely available for download from the NCBI FTP server
For further instruction on installing and using the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.
THE BLAST SUITE OF PROGRAMS – OVERVIEW AND PROGRAM SELECTION
BLAST finds regions of similarity between a query and subject sequences. The program compares nucleotide or protein query (problem) sequences to sequence databases and calculates the statistical significance of the resulting query-subject pairwise alignments expressed as an Expect (E-value).
To run a BLAST SEARCH we must select the proper blast program based on the nature of our query sequence(s) and that of the database to search for homologs.
query | database | program | note |
---|---|---|---|
DNA | DNA | blastn | – |
PROT | PROT | blastp | – |
DNA | PROT | blastx | The DNA query is translated in the 6 frames and the products searched against protein DB |
PROT | DNA | tblastn | The DNA DB (genome) is translated in the 6 frames and queried with protein sequences |
DNA | DNA | tblastx | The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level |
Table 1. Five frequently used BLAST programs.
Figure 1. The four standard BLAST programs most frequently used for database searching.
Query sequences are stored in a FASTA-formatted file. In addition to the programs listed in table 1, the blast+ software suite also provides the makeblastdb and blastdbcmd programs to make custom DNA or protein databases (makeblastdb) and retrieve the hits from the corresponding blast database found in a blast search (blastdbcmd).
The complete set of blast+ binaries in the ncbi-blast-2.14.1+-x64-linux.tar.gz release are:
blastdb_aliastool blastdbcheck blastdbcmd blast_formatter blast_formatter_vdb blastn blastn_vdb blastp blast_vdb_cmd blastx cleanup-blastdb-volumes.py convert2blastmask deltablast dustmasker get_species_taxids.sh legacy_blast.pl makeblastdb makembindex makeprofiledb psiblast rpsblast rpstblastn segmasker tblastn tblastn_vdb tblastx update_blastdb.pl windowmasker
PRELIMINARIES – setting up working directories
The following code will set up our working directories for the tutorials and fetch the required sequences.
# i) Download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast
# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz
# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes
# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd)
echo $wkdir
# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA
We are set to start the hands-on tutorials on running BLAST from the command line.
Searching a DNA database with DNA queries (BLASTN)
EXERCISE 1 – classify novel 16S rDNA sequences at the genus level using blastn
The exercise aims to teach the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.
We will use the following data for the exercise.
- Query sequences: 16S_problema.fna, correspond to partial 16S rDNA sequences generated in our lab from environmental microbes recovered from contaminated soils and rivers in Morelos, Mexico. Part of these sequences are reported in Ocha-Sánchez & Vinuesa, 2017.
- Reference sequences to build a blastn-searchable DB: 16S_seqs4_blastDB.fna. This is a tiny portion of the bacterial type strain 16S rRNA sequences fetched from RDPII.
The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.
Exploring the query and reference sequences
Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing.
# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz
# 1) explore the fasta headers:
grep '>' 16S_seqs4_blastDB.fna |less
# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1
# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1
Using Perl 1-liners to generate FASTA headers suitable for database indexing with makeblastdb
# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
# What command would you use to inspect the sequences?
# 1.2.1) generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>'
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed
Running [formatdb]|makeblastdb to generate an indexed blast database
Note: The commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs.
You can display the compact or detailed help formats of any blast+ program using blastprogramx -h
or blastprogramx
, respectively.
-help
#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T
# display the compact help format
makeblastdb -h
# display the detailed help format
makeblastdb -help
# run makeblastdb to generate an indexed database of nucleotide sequences
makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids
Running a default blastn search
The simplest blastn (or any other standard blast search program) invocation requires only two parameters: blastn -query
.
<FASTA_file> -db <database_file>
The following blastn call will search the 16S_seqs4_blastDB.fnaed database we generated in the previous section with the 16S_problema.fna query sequences, using all default values of the search parameters, as defined in the blastn -help
output. This includes the standard output format familiar to users of the Web implementation of NCBI-BLAST.
The output (truncated to save space) has the following sections:
- header
- one-line hit descriptions
- alignments
- footer
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed
BLASTN 2.14.1+ Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14. Database: 16S_seqs4_blastDB.fnaed 604 sequences; 880,434 total letters Query= 13 Length=1359 Score E Sequences producing significant alignments: (Bits) Value 16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 2475 0.0 16S_DB:573 Mycobacterium_arupense_T|AR30097|DQ157760 2453 0.0 16S_DB:501 Mycobacterium_hiberniae_T|ATCC_9874|X67096 2407 0.0 16S_DB:592 Mycobacterium_senuense_T|05-832|DQ536408 2355 0.0 16S_DB:583 Mycobacterium_kumamotonense_T|CST7274_(=_GTC2729)|AB23... 2350 0.0 16S_DB:579 Mycobacterium_pallens_T|czh-8|DQ370008 2252 0.0 16S_DB:531 Mycobacterium_asiaticum_T|ATCC_25276|AF480595 2244 0.0 ... TRUNCATED >16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 Length=1466 Score = 2475 bits (1340), Expect = 0.0 Identities = 1352/1359 (99%), Gaps = 2/1359 (0%) Strand=Plus/Plus Query 1 GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCATCTTGGGATAA 60 ||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 48 GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTTTGGGATAA 107 Query 61 GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 108 GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA 167 Query 121 GCTTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA 180 || ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 168 GC-TTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA 226 Query 181 CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC 240 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 227 CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC 286 ... TRUNCATED >16S_DB:102 Aeromonas_enteropelogenes_T|DSM_6394|X71121 Length=1490 Score = 920 bits (498), Expect = 0.0 Identities = 970/1204 (81%), Gaps = 23/1204 (2%) Strand=Plus/Plus Query 153 ATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGG 212 || |||| ||||||| ||||||||| ||||||||||||| |||| || |||||||| Sbjct 235 ATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGG 294 Query 213 GTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGG 272 || | |||||||||| ||||||| |||| ||||||||||||||||||||||||||||| Sbjct 295 ATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGG 354 Query 273 AATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTC 332 |||||||||||||||| | || ||||||||||| | |||||||| | || || ||||||| Sbjct 355 AATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTC 414 Query 333 GGGTTGTAAACCTCTTTCAGTATCGGCG-AAGCTCCA-AGGT--TTTCT-TTGG-GGTGA 386 |||||||||| | ||||||| | | ||| || || | | ||| ||| |||| Sbjct 415 GGGTTGTAAAGCACTTTCAGCGAGGAGGAAAGGTCAGTAGCTAATATCTGCTGGCTGTGA 474 Query 387 CGGTAGGTACAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGG 446 || || |||||||||||||||| |||| |||||||||||||||||||||||| |||| Sbjct 475 CGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG 534 ... TRUNCATED Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 946715250 Database: 16S_seqs4_blastDB.fnaed Posted date: Aug 6, 2022 3:00 PM Number of letters in database: 880,434 Number of sequences in database: 604 Matrix: blastn matrix 1 -2 Gap Penalties: Existence: 0, Extension: 2.5
Note that the header and footer sections of the standard blast output contain essential information (m, n, k, λ�) required to compute the E-value (see Karlin-Altschul equation E=kmne−λS�=����−��**, as detailed in a previous section).
AWK one-liner to extract key statistical parameters of a blast[npx] run
The following little AWK one-liner helps extract key statistical parameters from the standard output of a blast[npx] run. The AWK program has the following structure: /pattern/,
.
/pattern/; /pattern/, /pattern/{action}{action}
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -num_alignments 1 | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'
Database: 16S_seqs4_blastDB.fnaed 604 sequences; 880,434 total letters Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850
Identify the single best-hit for each query sequence
A standard blastn search output can be lengthy and excessively wordy. For the aim of the exercise, it is more convenient to use a tabular output format reporting only the best hit found in the database, as we want to classify our unknown 16S rDNA sequences based on their best database hit.
This can be easily achieved with the following blastn call:
# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab
# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab
13 16S_DB:480 99.485 1359 5 2 1 1359 48 1404 0.0 2475 18 16S_DB:494 99.033 1344 10 3 1 1344 49 1389 0.0 2416 2 16S_DB:549 99.552 1340 5 1 4 1343 81 1419 0.0 2440 27 16S_DB:546 100.000 1342 0 0 1 1342 78 1419 0.0 2479 36 16S_DB:480 99.705 1358 3 1 1 1358 48 1404 0.0 2490 4 16S_DB:600 99.335 1354 4 4 1 1353 79 1428 0.0 2446 48 16S_DB:484 98.375 1354 20 1 2 1355 48 1399 0.0 2390 53 16S_DB:548 99.925 1342 1 0 1 1342 78 1419 0.0 2473 62 16S_DB:496 99.106 1342 8 4 1 1342 85 1422 0.0 2409 7 16S_DB:600 98.736 1345 14 3 1 1343 79 1422 0.0 2386
As can be seen in the output above, the -outfmt 6
tabular format has no header.
The twelve fields of a standad tabular blast output (-outfmt 6|[-m 8])
The default -outfmt 6
blast output contains the following 12 fields
# The default m6 blast output fields #===================================================================================== # >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields <<< # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore #------------------------------------------------------------------------------------- # 1. qseqid query (e.g., gene) sequence id # 2. sseqid subject (e.g., reference genome) sequence id # 3. pident percentage of identical matches # 4. length alignment length # 5. mismatch number of mismatches # 6. gapopen number of gap openings # 7. qstart start of alignment in query # 8. qend end of alignment in query # 9. sstart start of alignment in subject #10. send end of alignment in subject #11. evalue expect value #12. bitscore bit score #-------------------------------------------------------------------------------------
Adding a header to the standard tabular output: -outfmt 6 [-m 8 in legacy blast]
# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab
head -20 16S_problema_blastN_maln1_m6.tab | column -t
# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c
Retrieve all the hits from the database using blastdbcmd
# Generate the list of hit IDs for each query
# NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt
#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna
Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using Linux tools
# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt
# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab
head -5 classified_16S_problema_sequences.tab
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
Generate a table containing the query sequence IDs, the subject headers, and basic alignment statistics using process substitution to avoid writing intermediate files
A more efficient and idiomatic (“Bashish”) code to achieve the same result, without having to write intermediate files, is the use of “process substitution”, which has this general syntax: <(any comamnd(s) or pipeline here)
. Process substitution replaces the command with its output, treating it as if it were stored in a file. This output can therefore be used by commands like cat���, paste�����, or any other that works on files. Let us see it in action:
paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 Mycobacterium_chitae_T|ATCC_19627|X55603 7 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
Add basic BLAST statistic information to the summary table using Bash’s “process substitution” idiom
The previous result looks great but needs to include critical statistical information about the robustness of the hits. Let us use the “process substitution” trick that we learned in the previous example to add extra fields from the blast output table (pident, length, e-value, bitscore) to our summary table, including a header.
# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t
sid pid len Eval score tax 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 7 98.736 1345 0.0 2386 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
Note: we are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN, and explained in the blast[n|p] -help
.
Filter out high-confidence classifications
It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on tje best blastn������ hit, it should at least satisfy the following conditions:
- alignment length >= 500 bp
- pident >= 95%
AWK is a very handy and efficient tool for filtering BLAST output tables
Let us first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). AWK��� is ideal for this task.
identify hits with < 95% identity OR short alignments (length < 500) using AWK��� patterns
# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore 1367SAA005_14Cip-FD1.b1.ab1_638 16S_DB:86 93.218 634 39 4 1 632 37 668 0.0 929 1367SAA005_A17Cip-FD1.b1.ab1_171 16S_DB:86 95.930 172 6 1 1 171 35 206 2.51e-76 278 1367SAA005_A6BSm-FD1.b1.ab1_477 16S_DB:86 98.745 478 5 1 1 477 37 514 0.0 848 1367SAA005_S13-FD1.b1.ab1_121 16S_DB:534 88.333 120 11 3 3 121 64 181 2.33e-35 141 1367SAA005_S2CbTm-FD1.b1.ab1_84 16S_DB:269 97.647 85 1 1 1 84 47 131 1.15e-36 145 1367SAA005_S4CbTm-FD1.b1.ab1_172 16S_DB:269 100.000 172 0 0 1 172 52 223 1.49e-88 318 1367SAA005_S6Sm-FD1.b1.ab1_347 16S_DB:184 99.143 350 0 3 1 347 40 389 0.0 627 1367SAA005_S9b-FD1.b1.ab1_142 16S_DB:3 97.887 142 3 0 1 142 40 181 1.60e-67 248 1367SAA005_Tm203-FD1.b1.ab1_491 16S_DB:128 97.942 486 9 1 3 487 52 537 0.0 841 1367SAA005_VNP1-FD1.b1.ab1_578 16S_DB:435 94.965 576 28 1 3 578 31 605 0.0 902
generate the final (filtered) results table with highly confident classifications
In order to get a confident blastn-based classification of our problem sequences at the genus level, we should use only hits with pident >= 95
% and a minimal alignment length >= 500
bp.
paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
Display the table’s head and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t
sid pid len Eval score best_hit 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 1367SAA005_S11Sm-FD1.b1.ab1_1049 98.763 1051 0.0 1866 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S12-FD1.b1.ab1_899 98.210 894 0.0 1559 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S15-FD1.b1.ab1_1111 98.115 1114 0.0 1936 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S1CbTm-FD1.b1.ab1_683 99.227 647 0.0 1166 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S1-FD1.b1.ab1_542 99.815 542 0.0 996 Pseudomonas_plecoglossicida_T|FPC951|AB009457 1367SAA005_S7CbTm-FD1.b1.ab1_961 98.954 956 0.0 1709 Citrobacter_freundii_T|DSM_30039|AJ233408 1367SAA005_S9-FD1.b1.ab1_1091 99.451 1092 0.0 1984 Cellulosimicrobium_funkei_T|W6122|AY501364 1367SAA005_VNC15-FD1.b1.ab1_1062 99.052 1055 0.0 1890 Rhizobium_alamii_T|type_strain:GBV016|AM931436 1367SAA005_VRP15_bis-FD1.b1.ab1_1124 99.822 1125 0.0 2065 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 1367SAA005_VRP15-FD1.b1.ab1_1112 99.012 1113 0.0 1993 Ensifer_terangae_T|type_strain:_LMG_7834|X68388
The previous examples illustrated how to efficiently combine Shell�ℎ��� and Linux����� tools to process BLAST output tables. With a bit of practice, you will soon master these power tools.
Note: We are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN “making-use-of-a-customized-tabular-output-with-selected-fields-for-a-slick-and-neat-blastn-based-classification-strategy-of-16s-rdna-sequences”, and explained in the blast[n|p] -help
.
Generate summaries of the results
Finally, let us generate some simple summary statistics of the classified sequences.
# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42
42
# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1
14 Mycobacterium 5 Rhizobium 5 Ensifer 4 Pseudomonas 2 Microbacterium 2 Escherichia/Shigella 2 Beijerinckia 1 Sphingomonas 1 Sinorhizobium 1 Shinella 1 Isoptericola 1 Citrobacter 1 Cellulosimicrobium 1 Brevundimonas 1 Acinetobacter
Making use of a customized tabular output with selected fields for a slick and neat blastn-based classification strategy of 16S rDNA sequences
The strategy shown above works perfectly well but could be more succinct. However, it helped demonstrate interesting Bash idioms, like the process substitution trick we learned to avoid writing intermediate files to disk. This strategy, or similar, was required to add subject title information to the tabular output generated by the legacy blast programs run with -m 8 (tabular output). The current BLAST+ suite is much more powerful and versatile in multiple ways, including the impressive amount of information from HSPs that can be added to tabular outputs (-outfmt 6|7|10
), as can be learned from the blast[npx] -help
output.
We will use this feature to significantly simplify the 16S rDNA sequence classification strategy based on best blastn hits satisfying some minimal alignment criteria. In particular, we will use the stitle (subject title) column in the output, which contains the taxonomic information from the FASTA header. In summary, we will run a new blastn search including selected output fields to avoid retrieving the subject hit FASTA files with fastadbcmd, and grep out their header lines to paste them to the table.
To print a tabular output with custom fields use the following syntax: blastn -outfmt ‘6 keyword1 keyword2 …’
.
Running blastn with a customized tabular outut
The following blastn call with customized tabular output, containing the desired ‘qseqid stitle pident length evalue score’ fields will do the trick. We will also add the query length ‘qlen’ and query coverage ‘qcovs’ fields as the final two columns. We filter the blastn output with AWK to print only HSPs with pident >= 95 && length > 500
and finally format the output with column -t
for a tidy columnar display on the terminal. That’s it, slick and neat.
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt '6 qseqid stitle pident length evalue score qlen qcovs' -num_alignments 1 | awk 'BEGIN{FS=OFS="\t"; print "seqid\tstitle\tpident\tlength\tevalue\tscore\tqlen\tqcovs"}$3 >= 95 && $4 >= 500' | column -t
seqid stitle pident length evalue score qlen qcovs 13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 99.485 1359 0.0 1340 1359 100 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 99.033 1344 0.0 1308 1347 99 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 99.552 1340 0.0 1321 1343 99 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 100.000 1342 0.0 1342 1342 100 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 99.705 1358 0.0 1348 1358 100 4 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 99.335 1354 0.0 1324 1354 99 48 Mycobacterium_gordonae_T|ATCC_14470.|X52923 98.375 1354 0.0 1294 1356 99 53 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 99.925 1342 0.0 1339 1342 100 62 Mycobacterium_chitae_T|ATCC_19627|X55603 99.106 1342 0.0 1304 1342 100 7 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 98.736 1345 0.0 1292 1350 99 8 Mycobacterium_mantenii_T|NLA000401474|FJ042897 99.926 1351 0.0 1347 1350 100 85 Mycobacterium_avium_T|ATCC_49884|EF521891 98.968 1357 0.0 1313 1356 100 95 Mycobacterium_goodii_T|M069|Y12872 99.702 1344 0.0 1334 1344 100 96 Mycobacterium_arupense_T|AR30097|DQ157760 99.705 1358 0.0 1345 1358 100 1367SAA005_2NXC111R-FD1.b1.ab1_579 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.396 576 0.0 530 579 99 1367SAA005_2NXC11R-FD1.b1.ab1_1153 Shinella_yambaruensis_T|MS4|AB285481 96.766 1144 0.0 1031 1153 99 1367SAA005_2NXC14R-FD1.b1.ab1_578 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.751 578 0.0 538 578 100 1367SAA005_A17-FD1.b1.ab1_1072 Sphingomonas_aromaticivorans_T|DSM_12444|CP000248 97.577 1073 0.0 994 1072 100 1367SAA005_A1b-FD1.b1.ab1_1010 Pseudomonas_plecoglossicida_T|FPC951|AB009457 98.811 1009 0.0 972 1010 99 1367SAA005_A2CbTm-FD1.b1.ab1_908 Escherichia/Shigella_flexneri_T|X96963 99.008 907 0.0 880 908 99 1367SAA005_A5-FD1.b1.ab1_597 Isoptericola_variabilis_T|MX5|AJ298873 97.138 594 0.0 541 597 99 1367SAA005_A7-FD1.b1.ab1_763 Brevundimonas_nasdae_T|GTC1043|AB071954 98.949 761 0.0 737 763 99 1367SAA005_A7Sm-FD1.b1.ab1_896 Acinetobacter_baumannii_T|DSM_30007|X81660 96.763 865 0.0 780 896 97 1367SAA005_A9Cip-FD1.b1.ab1_579 Escherichia/Shigella_flexneri_T|X96963 99.481 578 0.0 569 579 99 1367SAA005_ACOP7-FD1.b1.ab1_1110 Rhizobium_hainanense_T|I66|U71078 97.212 1112 0.0 1015 1110 100 1367SAA005_AYOT11_bis-FD1.b1.ab1_1118 Beijerinckia_fluminensis_T|UQM_1685|EU401907 98.749 1119 0.0 1076 1118 100 1367SAA005_AYOT11-FD1.b1.ab1_1069 Beijerinckia_fluminensis_T|UQM_1685|EU401907 99.147 1055 0.0 1027 1069 99 1367SAA005_CP6-FD1.b1.ab1_1117 Ensifer_saheli_T|LMG_7837|X68390 98.743 1114 0.0 1070 1117 99 1367SAA005_CP7_bis-FD1.b1.ab1_819 Ensifer_mexicanus_T|ITTG-R7|DQ411930 98.862 791 0.0 763 819 96 1367SAA005_CP7-FD1.b1.ab1_1035 Sinorhizobium_americanum_T|CFNEI_156|AF506513 98.551 1035 0.0 988 1035 99 1367SAA005_NXC126R-FD1.b1.ab1_1017 Ensifer_adhaerens_T|LMG_20216|AM181733 98.527 1018 0.0 972 1017 99 1367SAA005_PROO4-FD1.b1.ab1_539 Rhizobium_loessense_T|CCBAU_7190B|AF364069 97.407 540 0.0 493 539 100 1367SAA005_S11Sm-FD1.b1.ab1_1049 Pseudomonas_monteilii_T|CIP_104883|AF064458 98.763 1051 0.0 1010 1049 100 1367SAA005_S12-FD1.b1.ab1_899 Microbacterium_testaceum_T|DSM_20166|X77445 98.210 894 0.0 844 899 99 1367SAA005_S15-FD1.b1.ab1_1111 Microbacterium_testaceum_T|DSM_20166|X77445 98.115 1114 0.0 1048 1111 100 1367SAA005_S1CbTm-FD1.b1.ab1_683 Pseudomonas_monteilii_T|CIP_104883|AF064458 99.227 647 0.0 631 683 95 1367SAA005_S1-FD1.b1.ab1_542 Pseudomonas_plecoglossicida_T|FPC951|AB009457 99.815 542 0.0 539 542 100 1367SAA005_S7CbTm-FD1.b1.ab1_961 Citrobacter_freundii_T|DSM_30039|AJ233408 98.954 956 0.0 925 961 99 1367SAA005_S9-FD1.b1.ab1_1091 Cellulosimicrobium_funkei_T|W6122|AY501364 99.451 1092 0.0 1074 1091 99 1367SAA005_VNC15-FD1.b1.ab1_1062 Rhizobium_alamii_T|type_strain:GBV016|AM931436 99.052 1055 0.0 1023 1062 99 1367SAA005_VRP15_bis-FD1.b1.ab1_1124 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 99.822 1125 0.0 1118 1124 100 1367SAA005_VRP15-FD1.b1.ab1_1112 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 99.012 1113 0.0 1079 1112 100
This final blastn example demonstrates the importance of knowing our tools well, meaning that we should always take the time to carefully read the software’s documentation to understand and exploit all its key features.
HOMEWORK 1
Repeat the blastn-based classification exercise based on 16S rDNA sequences using 10 hits per query. Impose the same filters applied in the previous exercise.
- Are the results of both blastn runs (single best-hit vs. 10 hits) consistent?
- How many sequences from the problem dataset remain unclassified, and what is the reason they were not classified?
- Are there sequences that remain unclassified that have a pident >= 90% && < 95%?
- If so, how many?
- Based on the evidence gathered from the 10 hits/query, how would you classify these sequences?
BLAST+ with protein queries against a protein database (BLASTP)
BLASTP is a key program in genomics and phylogenomics analyses, as it is one of the main tools used to:
- search for homologs (orthologs and paralogs) of query proteins across or within genomes for downstream evolutionary/phylogenetic analyses
- annotate the proteomes of newly-sequenced genomes, typically using a reference database of well-annotated proteins, such as SwissProt, to transfer the annotations of the best hit to the query
Prepare working directory
- extract the sequence data from the compressed tar file
# Move into the UniProt_proteomes dir and extract the sequences in the tgz file
cd $wkdir && cd UniProt_proteomes
tar -xvzf UniProt_proteomes.tgz
Format the Acanthamoeba castellanii reference proteome fetched from UNIPROT
UniProt is the world’s leading high-quality, comprehensive, and freely accessible protein sequence and functional information resource. Explore this extraordinary resource and read the following paper from the UniProt Consortium – UniProt: the Universal Protein Knowledgebase in 2023.
makeblastdb -in ACACA.faa -dbtype prot -parse_seqids
ACACA.faa ACACA.faa.pdb ACACA.faa.phr ACACA.faa.pin ACACA.faa.pog ACACA.faa.pos ACACA.faa.pot ACACA.faa.psq ACACA.faa.ptf ACACA.faa.pto DICDI.faa
Default blastp run, and anatomy of its output
blastp requires only two arguments to run: the -query
and the
-db
, running under default parameters.
blastp -query RAB7A_HUMAN.faa -db ACACA.faa
which produces the following (abbreviated/truncated) output, with the following sections:
- header
- one-line hit descriptions
- alignments
- footer
BLASTP 2.14.1+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005. Database: ACACA.faa 14,938 sequences; 6,560,263 total letters Query= sp|P51149|RAB7A_HUMAN Ras-related protein Rab-7a OS=Homo sapiens OX=9606 GN=RAB7A PE=1 SV=1 Length=207 Score E Sequences producing significant alignments: (Bits) Value L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str... 293 2e-102 L8GPH2 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s... 249 4e-85 L8GXM9 GTPbinding protein YPTC5, putative OS=Acanthamoeba castell... 190 7e-62 L8GRY8 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s... 180 4e-58 L8GLL4 RAB family member (Rab7), putative OS=Acanthamoeba castell... 166 2e-52 L8HGU8 Ras family protein OS=Acanthamoeba castellanii str. Neff O... 154 6e-48 L8HEH5 Rab7, putative OS=Acanthamoeba castellanii str. Neff OX=12... 149 8e-46 L8H231 GTPbinding protein OS=Acanthamoeba castellanii str. Neff O... 142 2e-43 L8HDY4 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef... 137 4e-41 L8GSP6 Ras-related protein Rab-2A, putative OS=Acanthamoeba caste... 134 7e-40 L8GKG8 Rasrelated protein Rab5, putative OS=Acanthamoeba castella... 133 2e-39 L8H896 RAB1B, member RAS oncogene family OS=Acanthamoeba castella... 132 6e-39 ... TRUNCATED L8HKA3 HEAT repeat domain containing protein OS=Acanthamoeba cast... 26.6 7.0 L8GYP1 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef... 26.2 7.3 L8GLL9 WD domain, G-beta repeat-containing protein OS=Acanthamoeb... 26.6 7.5 L8H6G5 Uncharacterized protein OS=Acanthamoeba castellanii str. N... 26.2 8.7 L8HK33 GTPase OS=Acanthamoeba castellanii str. Neff OX=1257118 GN... 26.2 9.5 L8H1F4 Myosin VIIa, putative OS=Acanthamoeba castellanii str. Nef... 26.2 9.6 L8GRW1 Rab-like protein 6 OS=Acanthamoeba castellanii str. Neff O... 26.2 9.7 L8HDD1 Cobalamin synthesis protein/P47K OS=Acanthamoeba castellan... 26.2 9.9 L8H2D9 ABC transporter, ATPbinding domain containing protein OS=A... 26.2 9.9 >L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_074580 PE=4 SV=1 Length=186 Score = 293 bits (749), Expect = 2e-102, Method: Compositional matrix adjust. Identities = 149/208 (72%), Positives = 166/208 (80%), Gaps = 23/208 (11%) Query 1 MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQ 60 M++RKKVLLK+IILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDD+LVT+Q Sbjct 1 MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQ 60 Query 61 IWDTAGQERFQSLGVAFYRGADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPF 120 IWDTAGQERFQSLGVAFYRGAD CVLVFDV A PRDPE FPF Sbjct 61 IWDTAGQERFQSLGVAFYRGADACVLVFDVN------------------AGPRDPETFPF 102 Query 121 VVLGNKIDLE-NRQVATKRAQAWCYSKNNIPYFETSAKEAINVEQAFQTIARNALKQETE 179 +VLGNKIDLE +R V+ KRAQ WC SK NIPYFETSAKEAINVEQAFQTIA+NA+K+E + Sbjct 103 IVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEEED 162 Query 180 VELYNEFPEPIKLDKNDRAKASAESCSC 207 V+L + I+LDK++ S CSC Sbjct 163 VDLAG-ITDAIQLDKSE---PSQSGCSC 186 ... TRUNCATED >L8GVA9 Fbox domain containing protein OS=Acanthamoeba castellanii str. Neff OX=1257118 GN=ACA1_042330 PE=4 SV=1 Length=226 Score = 27.7 bits (60), Expect = 2.3, Method: Compositional matrix adjust. Identities = 23/100 (23%), Positives = 40/100 (40%), Gaps = 15/100 (15%) Query 11 VIILGDSGVGKTSL-------MNQYVNKKFSNQYKAT----IGADFLTKEVMVDDRLVTM 59 +++LGD GVGKT+ + KK ++ G LT V + Sbjct 114 ILVLGDVGVGKTAYAERCLAGAQESTTKKKKGRFDMVALRGTGKHHLTLPTNYGP--VNI 171 Query 60 QIWDTAGQERFQSLGV--AFYRGADCCVLVFDVTAPNTFK 97 +WD R S GV G +++FD+ + +++ Sbjct 172 TLWDQPATSREISGGVRDELLAGGQAAIIMFDIASLQSYR 211 Lambda K H a alpha 0.318 0.132 0.386 0.792 4.96 Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6 Effective search space used: 603304980 Database: ACACA.faa Posted date: Oct 7, 2023 9:42 AM Number of letters in database: 6,560,263 Number of sequences in database: 14,938 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Neighboring words threshold: 11 Window for multiple hits: 40
Note: The header and footer sections contain essential information (m, n, k, λ�) required to compute the E-value (see Karlin-Altschul equation E=kmne−λS�=����−��, as detailed in a previous section).
Remember, this critical information can be easily extracted from the standard BLAST output using, for example, an AWK one-liner, such as the following one:
blastp -query RAB7A_HUMAN.faa -db ACACA.faa | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'
Database: ACACA.faa 14,938 sequences; 6,560,263 total letters Lambda K H a alpha 0.318 0.132 0.386 0.792 4.96 Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6
Identify HUMAN vs. Acanthamoeba reciprocal best hits (RBHs) in the Rab-GTPase multigene family
The previous exercise identified a selection of likely Rab-GTPase homologs in the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT and saved in file HUMAN_63Rabs_Ran.faa and the A. castellanii proteome. The resulting RBHs are likely orthologs; therefore, we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.
Overview of the bioinformatic strategy to find RBHs between the A. castellanii proteome and the HUMAN_63Rabs_Ran.faa reference proteins
The bioinformatics strategy to find RBHs:
- run blastp against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as query with the BLOSUM45 matrix
- retrieve the A. castellanii proteome database hits using blastdbcmd����������
- make a blastdb for HUMAN_63Rabs_Ran.faa and query it with the retrieved A. castellanii proteome database hits
- retrieve a non-redundant list of the ACACA IDs from the blastp run and use it to grep out the matching lines from the first blastp run, sorting it by increasing E-values (in column 11)
- filter out unique HUMAN vs. ACACA RBHs using AWK hashes from the sorted blast output table
Prepare a working directory for the RBH searches
Use symlinks to avoid unnecessary file copying to preserve precious disk space whenever possible.
mkdir RBHs
cd RBHs/
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .
Run blastp������ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query with the BLOSUM45 matrix
- Given the huge evolutionary distance between human and amoebae (> 1 billion yrs), we may use the BLOSUM45 matrix to maximize sensitivity.
- We will use a customized tabular output with the following fields: ‘qseqid sseqid pident gaps length qlen slen qcovs evalue score’, saved in a variable
# set BLAST and filtering variables
mat=BLOSUM45
cols='6 qseqid sseqid pident gaps length qlen slen qcovs evalue score'
min_qcov=80
blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out
Retrieve the A. castellanii proteome database hits using blastdbcmd���������� only if qcov > 80.
Note the following syntactic details:
- use of Bash’s process substitution idiom
<(your code here)
awk -F”\t” -v qcov=$min_qcov ‘$8 >= qcov{print $2}’
to filter out Acanthamoeba hits with
HUMAN_63Rabs_vs_ACACA_besthits_m6.outqcov >=$min_qcov
.
blastdbcmd -entry_batch <(awk -F"\t" -v qcov=$min_qcov '$8 > qcov{print $2}' HUMAN_63Rabs_vs_ACACA_besthits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_1hits.faa
Make a blastdb for HUMAN_63Rabs_Ran.faa, and query it with the retrieved A. castellanii proteome database hits
- generate the HUMAN_63Rabs_Ran.faa BLASTdb
- run blastp with the filtered Acanthamoeba hits from the first BLAST as queries against the HUMAN_63Rabs_Ran.faa BLASTdb, retrieving only the best hit
makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_1hits.faa -db HUMAN_63Rabs_Ran.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > ACACA_vs_HUMAN_best_1hit_m6.out
Sort the blastp output table from the preceding search by increasing E-values (on column 11) and filter out the first (best-scoring) hits with AWK hashes
# The following lines extract a sorted list (n=54) of unique Acanthamoeba identifiers
# found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort | uniq -c
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u
# This code greps out the sorted and unique ACACA hits from the ACACA_vs_HUMAN_best_1hit_m6.out table
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done
# Same as previous line, but sorting the BLAST output table by increasing E-values (in column 9; sort -gk9,9)
# output displayed below
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 | column -t
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted
- Sorted output table generated by last line of code, by increasing E-values. Note the presence of repeated Acanthamoeba identifiers (2cnd col)
sp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA 82.524 0 206 216 213 95 7.64e-129 1259 sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA 73.953 2 215 215 213 100 2.88e-117 1154 sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.829 0 211 216 213 98 9.90e-117 1150 sp|Q15907|RB11B_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.463 8 216 218 213 98 2.83e-113 1119 sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA 70.046 2 217 216 216 100 4.48e-110 1090 sp|P61019|RAB2A_HUMAN tr|L8H2N2|L8H2N2_ACACA 69.585 6 217 212 216 100 2.42e-109 1083 sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA 81.287 0 171 208 195 82 1.25e-104 1037 sp|Q9NRW1|RAB6B_HUMAN tr|L8GN15|L8GN15_ACACA 75.843 0 178 208 195 86 1.36e-103 1028 sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA 71.635 23 208 207 186 100 6.25e-103 1021 sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.888 11 214 213 204 100 2.29e-102 1018 sp|P20338|RAB4A_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.421 11 214 218 204 98 9.55e-99 986 sp|Q13637|RAB32_HUMAN tr|L8HAC7|L8HAC7_ACACA 68.681 2 182 225 220 80 9.57e-94 943 sp|P57729|RAB38_HUMAN tr|L8HAC7|L8HAC7_ACACA 69.886 1 176 211 220 83 1.31e-93 940 sp|P57735|RAB25_HUMAN tr|L8HJ43|L8HJ43_ACACA 59.330 4 209 213 213 96 2.16e-91 920 sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 205 210 83 4.80e-91 915 sp|Q9H0U4|RAB1B_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 201 210 85 6.47e-91 914 sp|Q92928|RAB1C_HUMAN tr|L8HD31|L8HD31_ACACA 66.864 0 169 201 210 84 5.35e-85 860 ... TRUNCATED
filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table
Note: if you need help with AWK and Linux for bioinformatics, visit my extensive AWK tutorial for bioinformatics here
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted | column -t
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv
- The final table of fifteen HUMAN vs. ACACA Rab-GTPase RBHs (plus the Ran RBH), which are likely orthologs, filtered by
qcov >=
.
$min_qcovsp|P62826|RAN_HUMAN tr|L8GF16|L8GF16_ACACA 82.524 0 206 216 213 95 7.64e-129 1259 sp|P61106|RAB14_HUMAN tr|L8H946|L8H946_ACACA 73.953 2 215 215 213 100 2.88e-117 1154 sp|P62491|RB11A_HUMAN tr|L8HJ43|L8HJ43_ACACA 75.829 0 211 216 213 98 9.90e-117 1150 sp|Q8WUD1|RAB2B_HUMAN tr|L8H2N2|L8H2N2_ACACA 70.046 2 217 216 216 100 4.48e-110 1090 sp|P20340|RAB6A_HUMAN tr|L8GN15|L8GN15_ACACA 81.287 0 171 208 195 82 1.25e-104 1037 sp|P51149|RAB7A_HUMAN tr|L8HHF3|L8HHF3_ACACA 71.635 23 208 207 186 100 6.25e-103 1021 sp|P61018|RAB4B_HUMAN tr|L8HCK2|L8HCK2_ACACA 65.888 11 214 213 204 100 2.29e-102 1018 sp|Q13637|RAB32_HUMAN tr|L8HAC7|L8HAC7_ACACA 68.681 2 182 225 220 80 9.57e-94 943 sp|P62820|RAB1A_HUMAN tr|L8HD31|L8HD31_ACACA 69.412 0 170 205 210 83 4.80e-91 915 sp|Q9NP72|RAB18_HUMAN tr|L8GYY7|L8GYY7_ACACA 58.163 2 196 206 209 95 4.10e-80 816 sp|P51151|RAB9A_HUMAN tr|L8GPH2|L8GPH2_ACACA 51.961 3 204 201 205 100 1.18e-76 784 sp|P51153|RAB13_HUMAN tr|L8HGZ3|L8HGZ3_ACACA 58.282 6 163 203 208 80 1.16e-71 739 sp|Q14964|RB39A_HUMAN tr|L8GSP6|L8GSP6_ACACA 51.124 6 178 217 200 82 8.44e-61 641 sp|Q13636|RAB31_HUMAN tr|L8GKG8|L8GKG8_ACACA 46.734 9 199 194 218 98 2.73e-58 618 sp|Q969Q5|RAB24_HUMAN tr|L8H2Q9|L8H2Q9_ACACA 50.549 18 182 203 211 83 6.20e-56 597
Filter human-Acanthamoeba blastp RBHs with a single one-liner
This final code runs all the filtering steps presented above for the two blastp output tables using a single neat one-liner
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 | awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print}' | column -t