Running and parsing BLAST+ using the Linux command line

March 12, 2024 Off By admin

Table of Contents

About the tutorials

The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line, along with useful AWK, Bash, Perl and R code and idioms, including 1-liner programs and fully fledged scripts to parse and analyze results efficiently. Particularly helpful is the advanced compute_RBH_clusters.sh script.

ASSUMPTIONS

This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the [legacy- and] blast+ software suite(s), which are freely available for download from the NCBI FTP server

For further instruction on installing and using the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.

THE BLAST SUITE OF PROGRAMS – OVERVIEW AND PROGRAM SELECTION

BLAST finds regions of similarity between a query and subject sequences. The program compares nucleotide or protein query (problem) sequences to sequence databases and calculates the statistical significance of the resulting query-subject pairwise alignments expressed as an Expect (E-value).

To run a BLAST SEARCH we must select the proper blast program based on the nature of our query sequence(s) and that of the database to search for homologs.

query	database	program	note
DNA	DNA	blastn	–
PROT	PROT	blastp	–
DNA	PROT	blastx	The DNA query is translated in the 6 frames and the products searched against protein DB
PROT	DNA	tblastn	The DNA DB (genome) is translated in the 6 frames and queried with protein sequences
DNA	DNA	tblastx	The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level

Table 1. Five frequently used BLAST programs.

Figure 1. The four standard BLAST programs most frequently used for database searching.

Query sequences are stored in a FASTA-formatted file. In addition to the programs listed in table 1, the blast+ software suite also provides the makeblastdb and blastdbcmd programs to make custom DNA or protein databases (makeblastdb) and retrieve the hits from the corresponding blast database found in a blast search (blastdbcmd).

The complete set of blast+ binaries in the ncbi-blast-2.14.1+-x64-linux.tar.gz release are:

blastdb_aliastool
blastdbcheck
blastdbcmd
blast_formatter
blast_formatter_vdb
blastn
blastn_vdb
blastp
blast_vdb_cmd
blastx
cleanup-blastdb-volumes.py
convert2blastmask
deltablast
dustmasker
get_species_taxids.sh
legacy_blast.pl
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastn_vdb
tblastx
update_blastdb.pl
windowmasker

PRELIMINARIES – setting up working directories

The following code will set up our working directories for the tutorials and fetch the required sequences.

# i) Download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast

# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz

# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes

# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd) 
echo $wkdir

# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA

We are set to start the hands-on tutorials on running BLAST from the command line.

Searching a DNA database with DNA queries (BLASTN)

EXERCISE 1 – classify novel 16S rDNA sequences at the genus level using blastn

The exercise aims to teach the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.

We will use the following data for the exercise.

Query sequences: 16S_problema.fna, correspond to partial 16S rDNA sequences generated in our lab from environmental microbes recovered from contaminated soils and rivers in Morelos, Mexico. Part of these sequences are reported in Ocha-Sánchez & Vinuesa, 2017.
Reference sequences to build a blastn-searchable DB: 16S_seqs4_blastDB.fna. This is a tiny portion of the bacterial type strain 16S rRNA sequences fetched from RDPII.

The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.

Exploring the query and reference sequences

Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing.

# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz

# 1) explore the fasta headers: 
grep '>' 16S_seqs4_blastDB.fna |less

# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1

# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1

Using Perl 1-liners to generate FASTA headers suitable for database indexing with makeblastdb

# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
#      What command would you use to inspect the sequences?

# 1.2.1)  generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>' 
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed

Running [formatdb]|makeblastdb to generate an indexed blast database

Note: The commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs.

You can display the compact or detailed help formats of any blast+ program using blastprogramx -h or blastprogramx -help, respectively.

#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T

# display the compact help format
makeblastdb -h

# display the detailed help format
makeblastdb -help

# run makeblastdb to generate an indexed database of nucleotide sequences
makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids

Running a default blastn search

The simplest blastn (or any other standard blast search program) invocation requires only two parameters: blastn -query <FASTA_file> -db <database_file>.

The following blastn call will search the 16S_seqs4_blastDB.fnaed database we generated in the previous section with the 16S_problema.fna query sequences, using all default values of the search parameters, as defined in the blastn -help output. This includes the standard output format familiar to users of the Web implementation of NCBI-BLAST.

The output (truncated to save space) has the following sections:

header
one-line hit descriptions
alignments
footer

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed

BLASTN 2.14.1+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: 16S_seqs4_blastDB.fnaed
           604 sequences; 880,434 total letters



Query= 13

Length=1359
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928        2475    0.0  
16S_DB:573 Mycobacterium_arupense_T|AR30097|DQ157760                  2453    0.0  
16S_DB:501 Mycobacterium_hiberniae_T|ATCC_9874|X67096                 2407    0.0  
16S_DB:592 Mycobacterium_senuense_T|05-832|DQ536408                   2355    0.0  
16S_DB:583 Mycobacterium_kumamotonense_T|CST7274_(=_GTC2729)|AB23...  2350    0.0  
16S_DB:579 Mycobacterium_pallens_T|czh-8|DQ370008                     2252    0.0  
16S_DB:531 Mycobacterium_asiaticum_T|ATCC_25276|AF480595              2244    0.0

... TRUNCATED


>16S_DB:480 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
Length=1466

 Score = 2475 bits (1340),  Expect = 0.0
 Identities = 1352/1359 (99%), Gaps = 2/1359 (0%)
 Strand=Plus/Plus

Query  1     GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCATCTTGGGATAA  60
             |||||||||||||||||||||||||||||||||||||||||||||||||  |||||||||
Sbjct  48    GTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTTTGGGATAA  107

Query  61    GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  108   GCCTGGGAAACTGGGTCTAATACCGAATAGGACCGCATGCTGCATGGTGTGTGGTGGAAA  167

Query  121   GCTTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA  180
             || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  168   GC-TTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTA  226

Query  181   CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC  240
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  227   CCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATAC  286


... TRUNCATED

>16S_DB:102 Aeromonas_enteropelogenes_T|DSM_6394|X71121
Length=1490

 Score = 920 bits (498),  Expect = 0.0
 Identities = 970/1204 (81%), Gaps = 23/1204 (2%)
 Strand=Plus/Plus

Query  153   ATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGG  212
             || |||| ||||||| |||||||||  |||||||||||||    |||| || ||||||||
Sbjct  235   ATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGG  294

Query  213   GTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGG  272
              ||  | |||||||||| ||||||| |||| |||||||||||||||||||||||||||||
Sbjct  295   ATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGG  354

Query  273   AATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTC  332
             |||||||||||||||| | || ||||||||||| | |||||||| | || || |||||||
Sbjct  355   AATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTC  414

Query  333   GGGTTGTAAACCTCTTTCAGTATCGGCG-AAGCTCCA-AGGT--TTTCT-TTGG-GGTGA  386
             |||||||||| | |||||||    |  | ||| ||   || |  | |||  |||  ||||
Sbjct  415   GGGTTGTAAAGCACTTTCAGCGAGGAGGAAAGGTCAGTAGCTAATATCTGCTGGCTGTGA  474

Query  387   CGGTAGGTACAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGG  446
             || ||    |||||||||||||||| |||| |||||||||||||||||||||||| ||||
Sbjct  475   CGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG  534

... TRUNCATED


Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 


Effective search space used: 946715250


  Database: 16S_seqs4_blastDB.fnaed
    Posted date:  Aug 6, 2022  3:00 PM
  Number of letters in database: 880,434
  Number of sequences in database:  604


Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

Note that the header and footer sections of the standard blast output contain essential information (m, n, k, $λ$ ) required to compute the E-value (see Karlin-Altschul equation $E = k m n e^{- λ S}$ **, as detailed in a previous section).

AWK one-liner to extract key statistical parameters of a blast[npx] run

The following little AWK one-liner helps extract key statistical parameters from the standard output of a blast[npx] run. The AWK program has the following structure: /pattern/, /pattern/; /pattern/, /pattern/{action}{action}.

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -num_alignments 1 | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'

Database: 16S_seqs4_blastDB.fnaed
           604 sequences; 880,434 total letters
Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850

Identify the single best-hit for each query sequence

A standard blastn search output can be lengthy and excessively wordy. For the aim of the exercise, it is more convenient to use a tabular output format reporting only the best hit found in the database, as we want to classify our unknown 16S rDNA sequences based on their best database hit.

This can be easily achieved with the following blastn call:

# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab

# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab

13  16S_DB:480  99.485  1359    5   2   1   1359    48  1404    0.0 2475
18  16S_DB:494  99.033  1344    10  3   1   1344    49  1389    0.0 2416
2   16S_DB:549  99.552  1340    5   1   4   1343    81  1419    0.0 2440
27  16S_DB:546  100.000 1342    0   0   1   1342    78  1419    0.0 2479
36  16S_DB:480  99.705  1358    3   1   1   1358    48  1404    0.0 2490
4   16S_DB:600  99.335  1354    4   4   1   1353    79  1428    0.0 2446
48  16S_DB:484  98.375  1354    20  1   2   1355    48  1399    0.0 2390
53  16S_DB:548  99.925  1342    1   0   1   1342    78  1419    0.0 2473
62  16S_DB:496  99.106  1342    8   4   1   1342    85  1422    0.0 2409
7   16S_DB:600  98.736  1345    14  3   1   1343    79  1422    0.0 2386

As can be seen in the output above, the -outfmt 6 tabular format has no header.

The twelve fields of a standad tabular blast output (-outfmt 6|[-m 8])

The default -outfmt 6 blast output contains the following 12 fields

# The default m6 blast output fields
#=====================================================================================
# >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields   <<<     
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
#-------------------------------------------------------------------------------------
# 1.     qseqid      query (e.g., gene) sequence id
# 2.     sseqid      subject (e.g., reference genome) sequence id
# 3.     pident      percentage of identical matches
# 4.     length      alignment length
# 5.     mismatch    number of mismatches
# 6.     gapopen     number of gap openings
# 7.     qstart      start of alignment in query
# 8.     qend        end of alignment in query
# 9.     sstart      start of alignment in subject
#10.     send        end of alignment in subject
#11.     evalue      expect value
#12.     bitscore    bit score
#-------------------------------------------------------------------------------------

Adding a header to the standard tabular output: -outfmt 6 [-m 8 in legacy blast]

# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab

head -20 16S_problema_blastN_maln1_m6.tab | column -t

# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c

Retrieve all the hits from the database using $b l a s t d b c m d$

#   Generate the list of hit IDs for each query
#      NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt

#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna

Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using Linux tools

# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt

# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab

head -5 classified_16S_problema_sequences.tab

13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928

Generate a table containing the query sequence IDs, the subject headers, and basic alignment statistics using process substitution to avoid writing intermediate files

A more efficient and idiomatic (“Bashish”) code to achieve the same result, without having to write intermediate files, is the use of “process substitution”, which has this general syntax: <(any comamnd(s) or pipeline here). Process substitution replaces the command with its output, treating it as if it were stored in a file. This output can therefore be used by commands like $c a t$ , $p a s t e$ , or any other that works on files. Let us see it in action:

paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t

13  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18  Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27  Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36  Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48  Mycobacterium_gordonae_T|ATCC_14470.|X52923
53  Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62  Mycobacterium_chitae_T|ATCC_19627|X55603
7   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

Add basic BLAST statistic information to the summary table using Bash’s “process substitution” idiom

The previous result looks great but needs to include critical statistical information about the robustness of the hits. Let us use the “process substitution” trick that we learned in the previous example to add extra fields from the blast output table (pident, length, e-value, bitscore) to our summary table, including a header.

# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t

sid  pid      len   Eval  score  tax
13   99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18   99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2    99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27   100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36   99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4    99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48   98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53   99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62   99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
7    98.736   1345  0.0   2386   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
... TRUNCATED

Note: we are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN, and explained in the blast[n|p] -help.

Filter out high-confidence classifications

It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on tje best $b l a s t n$ hit, it should at least satisfy the following conditions:

alignment length >= 500 bp
pident >= 95%

AWK is a very handy and efficient tool for filtering BLAST output tables

Let us first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). $A W K$ is ideal for this task.

identify hits with < 95% identity OR short alignments (length < 500) using $A W K$ patterns

# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t

qseqid                            sseqid      pident   length  mismatch  gapopen  qstart  qend  sstart  send  evalue    bitscore
1367SAA005_14Cip-FD1.b1.ab1_638   16S_DB:86   93.218   634     39        4        1       632   37      668   0.0       929
1367SAA005_A17Cip-FD1.b1.ab1_171  16S_DB:86   95.930   172     6         1        1       171   35      206   2.51e-76  278
1367SAA005_A6BSm-FD1.b1.ab1_477   16S_DB:86   98.745   478     5         1        1       477   37      514   0.0       848
1367SAA005_S13-FD1.b1.ab1_121     16S_DB:534  88.333   120     11        3        3       121   64      181   2.33e-35  141
1367SAA005_S2CbTm-FD1.b1.ab1_84   16S_DB:269  97.647   85      1         1        1       84    47      131   1.15e-36  145
1367SAA005_S4CbTm-FD1.b1.ab1_172  16S_DB:269  100.000  172     0         0        1       172   52      223   1.49e-88  318
1367SAA005_S6Sm-FD1.b1.ab1_347    16S_DB:184  99.143   350     0         3        1       347   40      389   0.0       627
1367SAA005_S9b-FD1.b1.ab1_142     16S_DB:3    97.887   142     3         0        1       142   40      181   1.60e-67  248
1367SAA005_Tm203-FD1.b1.ab1_491   16S_DB:128  97.942   486     9         1        3       487   52      537   0.0       841
1367SAA005_VNP1-FD1.b1.ab1_578    16S_DB:435  94.965   576     28        1        3       578   31      605   0.0       902

generate the final (filtered) results table with highly confident classifications

In order to get a confident blastn-based classification of our problem sequences at the genus level, we should use only hits with pident >= 95% and a minimal alignment length >= 500 bp.

paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

Display the table’s head and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv

cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t

sid                                   pid      len   Eval  score  best_hit
13                                    99.485   1359  0.0   2475   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
18                                    99.033   1344  0.0   2416   Mycobacterium_smegmatis_T|ATCC_19420|AJ131761
2                                     99.552   1340  0.0   2440   Mycobacterium_peregrinum_T|CIP_105382T|AY457069
27                                    100.000  1342  0.0   2479   Mycobacterium_fortuitum_T|CIP_104534T|AY457066
36                                    99.705   1358  0.0   2490   Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
4                                     99.335   1354  0.0   2446   Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270
48                                    98.375   1354  0.0   2390   Mycobacterium_gordonae_T|ATCC_14470.|X52923
53                                    99.925   1342  0.0   2473   Mycobacterium_neworleansense_T|ATCC_49404|AY457068
62                                    99.106   1342  0.0   2409   Mycobacterium_chitae_T|ATCC_19627|X55603
1367SAA005_S11Sm-FD1.b1.ab1_1049      98.763   1051  0.0   1866   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S12-FD1.b1.ab1_899         98.210   894   0.0   1559   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S15-FD1.b1.ab1_1111        98.115   1114  0.0   1936   Microbacterium_testaceum_T|DSM_20166|X77445
1367SAA005_S1CbTm-FD1.b1.ab1_683      99.227   647   0.0   1166   Pseudomonas_monteilii_T|CIP_104883|AF064458
1367SAA005_S1-FD1.b1.ab1_542          99.815   542   0.0   996    Pseudomonas_plecoglossicida_T|FPC951|AB009457
1367SAA005_S7CbTm-FD1.b1.ab1_961      98.954   956   0.0   1709   Citrobacter_freundii_T|DSM_30039|AJ233408
1367SAA005_S9-FD1.b1.ab1_1091         99.451   1092  0.0   1984   Cellulosimicrobium_funkei_T|W6122|AY501364
1367SAA005_VNC15-FD1.b1.ab1_1062      99.052   1055  0.0   1890   Rhizobium_alamii_T|type_strain:GBV016|AM931436
1367SAA005_VRP15_bis-FD1.b1.ab1_1124  99.822   1125  0.0   2065   Ensifer_terangae_T|type_strain:_LMG_7834|X68388
1367SAA005_VRP15-FD1.b1.ab1_1112      99.012   1113  0.0   1993   Ensifer_terangae_T|type_strain:_LMG_7834|X68388

The previous examples illustrated how to efficiently combine $S h e l l$ and $L i n u x$ tools to process BLAST output tables. With a bit of practice, you will soon master these power tools.

Note: We are using Linux commands to achieve a task that can be performed directly within BLAST, using a customized tabular output, as we will see in the final section on BLASTN “making-use-of-a-customized-tabular-output-with-selected-fields-for-a-slick-and-neat-blastn-based-classification-strategy-of-16s-rdna-sequences”, and explained in the blast[n|p] -help.

Generate summaries of the results

Finally, let us generate some simple summary statistics of the classified sequences.

# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42

# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1

     14 Mycobacterium
      5 Rhizobium
      5 Ensifer
      4 Pseudomonas
      2 Microbacterium
      2 Escherichia/Shigella
      2 Beijerinckia
      1 Sphingomonas
      1 Sinorhizobium
      1 Shinella
      1 Isoptericola
      1 Citrobacter
      1 Cellulosimicrobium
      1 Brevundimonas
      1 Acinetobacter

Making use of a customized tabular output with selected fields for a slick and neat blastn-based classification strategy of 16S rDNA sequences

The strategy shown above works perfectly well but could be more succinct. However, it helped demonstrate interesting Bash idioms, like the process substitution trick we learned to avoid writing intermediate files to disk. This strategy, or similar, was required to add subject title information to the tabular output generated by the legacy blast programs run with -m 8 (tabular output). The current BLAST+ suite is much more powerful and versatile in multiple ways, including the impressive amount of information from HSPs that can be added to tabular outputs (-outfmt 6|7|10), as can be learned from the blast[npx] -help output.

We will use this feature to significantly simplify the 16S rDNA sequence classification strategy based on best blastn hits satisfying some minimal alignment criteria. In particular, we will use the stitle (subject title) column in the output, which contains the taxonomic information from the FASTA header. In summary, we will run a new blastn search including selected output fields to avoid retrieving the subject hit FASTA files with fastadbcmd, and grep out their header lines to paste them to the table.

To print a tabular output with custom fields use the following syntax: blastn -outfmt ‘6 keyword1 keyword2 …’.

Running blastn with a customized tabular outut

The following blastn call with customized tabular output, containing the desired ‘qseqid stitle pident length evalue score’ fields will do the trick. We will also add the query length ‘qlen’ and query coverage ‘qcovs’ fields as the final two columns. We filter the blastn output with AWK to print only HSPs with pident >= 95 && length > 500 and finally format the output with column -t for a tidy columnar display on the terminal. That’s it, slick and neat.

blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt '6 qseqid stitle pident length evalue score qlen qcovs' -num_alignments 1 | awk 'BEGIN{FS=OFS="\t"; print "seqid\tstitle\tpident\tlength\tevalue\tscore\tqlen\tqcovs"}$3 >= 95 && $4 >= 500' | column -t

seqid                                  stitle                                               pident   length  evalue  score  qlen  qcovs
13                                     Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928  99.485   1359    0.0     1340   1359  100
18                                     Mycobacterium_smegmatis_T|ATCC_19420|AJ131761        99.033   1344    0.0     1308   1347  99
2                                      Mycobacterium_peregrinum_T|CIP_105382T|AY457069      99.552   1340    0.0     1321   1343  99
27                                     Mycobacterium_fortuitum_T|CIP_104534T|AY457066       100.000  1342    0.0     1342   1342  100
36                                     Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928  99.705   1358    0.0     1348   1358  100
4                                      Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270     99.335   1354    0.0     1324   1354  99
48                                     Mycobacterium_gordonae_T|ATCC_14470.|X52923          98.375   1354    0.0     1294   1356  99
53                                     Mycobacterium_neworleansense_T|ATCC_49404|AY457068   99.925   1342    0.0     1339   1342  100
62                                     Mycobacterium_chitae_T|ATCC_19627|X55603             99.106   1342    0.0     1304   1342  100
7                                      Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270     98.736   1345    0.0     1292   1350  99
8                                      Mycobacterium_mantenii_T|NLA000401474|FJ042897       99.926   1351    0.0     1347   1350  100
85                                     Mycobacterium_avium_T|ATCC_49884|EF521891            98.968   1357    0.0     1313   1356  100
95                                     Mycobacterium_goodii_T|M069|Y12872                   99.702   1344    0.0     1334   1344  100
96                                     Mycobacterium_arupense_T|AR30097|DQ157760            99.705   1358    0.0     1345   1358  100
1367SAA005_2NXC111R-FD1.b1.ab1_579     Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.396   576     0.0     530    579   99
1367SAA005_2NXC11R-FD1.b1.ab1_1153     Shinella_yambaruensis_T|MS4|AB285481                 96.766   1144    0.0     1031   1153  99
1367SAA005_2NXC14R-FD1.b1.ab1_578      Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.751   578     0.0     538    578   100
1367SAA005_A17-FD1.b1.ab1_1072         Sphingomonas_aromaticivorans_T|DSM_12444|CP000248    97.577   1073    0.0     994    1072  100
1367SAA005_A1b-FD1.b1.ab1_1010         Pseudomonas_plecoglossicida_T|FPC951|AB009457        98.811   1009    0.0     972    1010  99
1367SAA005_A2CbTm-FD1.b1.ab1_908       Escherichia/Shigella_flexneri_T|X96963               99.008   907     0.0     880    908   99
1367SAA005_A5-FD1.b1.ab1_597           Isoptericola_variabilis_T|MX5|AJ298873               97.138   594     0.0     541    597   99
1367SAA005_A7-FD1.b1.ab1_763           Brevundimonas_nasdae_T|GTC1043|AB071954              98.949   761     0.0     737    763   99
1367SAA005_A7Sm-FD1.b1.ab1_896         Acinetobacter_baumannii_T|DSM_30007|X81660           96.763   865     0.0     780    896   97
1367SAA005_A9Cip-FD1.b1.ab1_579        Escherichia/Shigella_flexneri_T|X96963               99.481   578     0.0     569    579   99
1367SAA005_ACOP7-FD1.b1.ab1_1110       Rhizobium_hainanense_T|I66|U71078                    97.212   1112    0.0     1015   1110  100
1367SAA005_AYOT11_bis-FD1.b1.ab1_1118  Beijerinckia_fluminensis_T|UQM_1685|EU401907         98.749   1119    0.0     1076   1118  100
1367SAA005_AYOT11-FD1.b1.ab1_1069      Beijerinckia_fluminensis_T|UQM_1685|EU401907         99.147   1055    0.0     1027   1069  99
1367SAA005_CP6-FD1.b1.ab1_1117         Ensifer_saheli_T|LMG_7837|X68390                     98.743   1114    0.0     1070   1117  99
1367SAA005_CP7_bis-FD1.b1.ab1_819      Ensifer_mexicanus_T|ITTG-R7|DQ411930                 98.862   791     0.0     763    819   96
1367SAA005_CP7-FD1.b1.ab1_1035         Sinorhizobium_americanum_T|CFNEI_156|AF506513        98.551   1035    0.0     988    1035  99
1367SAA005_NXC126R-FD1.b1.ab1_1017     Ensifer_adhaerens_T|LMG_20216|AM181733               98.527   1018    0.0     972    1017  99
1367SAA005_PROO4-FD1.b1.ab1_539        Rhizobium_loessense_T|CCBAU_7190B|AF364069           97.407   540     0.0     493    539   100
1367SAA005_S11Sm-FD1.b1.ab1_1049       Pseudomonas_monteilii_T|CIP_104883|AF064458          98.763   1051    0.0     1010   1049  100
1367SAA005_S12-FD1.b1.ab1_899          Microbacterium_testaceum_T|DSM_20166|X77445          98.210   894     0.0     844    899   99
1367SAA005_S15-FD1.b1.ab1_1111         Microbacterium_testaceum_T|DSM_20166|X77445          98.115   1114    0.0     1048   1111  100
1367SAA005_S1CbTm-FD1.b1.ab1_683       Pseudomonas_monteilii_T|CIP_104883|AF064458          99.227   647     0.0     631    683   95
1367SAA005_S1-FD1.b1.ab1_542           Pseudomonas_plecoglossicida_T|FPC951|AB009457        99.815   542     0.0     539    542   100
1367SAA005_S7CbTm-FD1.b1.ab1_961       Citrobacter_freundii_T|DSM_30039|AJ233408            98.954   956     0.0     925    961   99
1367SAA005_S9-FD1.b1.ab1_1091          Cellulosimicrobium_funkei_T|W6122|AY501364           99.451   1092    0.0     1074   1091  99
1367SAA005_VNC15-FD1.b1.ab1_1062       Rhizobium_alamii_T|type_strain:GBV016|AM931436       99.052   1055    0.0     1023   1062  99
1367SAA005_VRP15_bis-FD1.b1.ab1_1124   Ensifer_terangae_T|type_strain:_LMG_7834|X68388      99.822   1125    0.0     1118   1124  100
1367SAA005_VRP15-FD1.b1.ab1_1112       Ensifer_terangae_T|type_strain:_LMG_7834|X68388      99.012   1113    0.0     1079   1112  100

This final blastn example demonstrates the importance of knowing our tools well, meaning that we should always take the time to carefully read the software’s documentation to understand and exploit all its key features.

HOMEWORK 1

Repeat the blastn-based classification exercise based on 16S rDNA sequences using 10 hits per query. Impose the same filters applied in the previous exercise.

Are the results of both blastn runs (single best-hit vs. 10 hits) consistent?
How many sequences from the problem dataset remain unclassified, and what is the reason they were not classified?
Are there sequences that remain unclassified that have a pident >= 90% && < 95%?
- If so, how many?
- Based on the evidence gathered from the 10 hits/query, how would you classify these sequences?

BLAST+ with protein queries against a protein database (BLASTP)

BLASTP is a key program in genomics and phylogenomics analyses, as it is one of the main tools used to:

search for homologs (orthologs and paralogs) of query proteins across or within genomes for downstream evolutionary/phylogenetic analyses
annotate the proteomes of newly-sequenced genomes, typically using a reference database of well-annotated proteins, such as SwissProt, to transfer the annotations of the best hit to the query

Prepare working directory

extract the sequence data from the compressed tar file

# Move into the UniProt_proteomes dir and extract the sequences in the tgz file
cd $wkdir && cd UniProt_proteomes 
tar -xvzf UniProt_proteomes.tgz

Format the Acanthamoeba castellanii reference proteome fetched from UNIPROT

UniProt is the world’s leading high-quality, comprehensive, and freely accessible protein sequence and functional information resource. Explore this extraordinary resource and read the following paper from the UniProt Consortium – UniProt: the Universal Protein Knowledgebase in 2023.

makeblastdb -in ACACA.faa -dbtype prot -parse_seqids

ACACA.faa
ACACA.faa.pdb
ACACA.faa.phr
ACACA.faa.pin
ACACA.faa.pog
ACACA.faa.pos
ACACA.faa.pot
ACACA.faa.psq
ACACA.faa.ptf
ACACA.faa.pto
DICDI.faa

Default blastp run, and anatomy of its output

blastp requires only two arguments to run: the -queryand the -db , running under default parameters.

blastp -query RAB7A_HUMAN.faa -db ACACA.faa

which produces the following (abbreviated/truncated) output, with the following sections:

header
one-line hit descriptions
alignments
footer

BLASTP 2.14.1+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.


Database: ACACA.faa
           14,938 sequences; 6,560,263 total letters


Query= sp|P51149|RAB7A_HUMAN Ras-related protein Rab-7a OS=Homo sapiens
OX=9606 GN=RAB7A PE=1 SV=1

Length=207
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str...  293     2e-102
L8GPH2 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s...  249     4e-85 
L8GXM9 GTPbinding protein YPTC5, putative OS=Acanthamoeba castell...  190     7e-62 
L8GRY8 Rab7/RabGfamily small GTPase OS=Acanthamoeba castellanii s...  180     4e-58 
L8GLL4 RAB family member (Rab7), putative OS=Acanthamoeba castell...  166     2e-52 
L8HGU8 Ras family protein OS=Acanthamoeba castellanii str. Neff O...  154     6e-48 
L8HEH5 Rab7, putative OS=Acanthamoeba castellanii str. Neff OX=12...  149     8e-46 
L8H231 GTPbinding protein OS=Acanthamoeba castellanii str. Neff O...  142     2e-43 
L8HDY4 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef...  137     4e-41 
L8GSP6 Ras-related protein Rab-2A, putative OS=Acanthamoeba caste...  134     7e-40 
L8GKG8 Rasrelated protein Rab5, putative OS=Acanthamoeba castella...  133     2e-39 
L8H896 RAB1B, member RAS oncogene family OS=Acanthamoeba castella...  132     6e-39
... TRUNCATED
L8HKA3 HEAT repeat domain containing protein OS=Acanthamoeba cast...  26.6    7.0   
L8GYP1 Ras subfamily protein OS=Acanthamoeba castellanii str. Nef...  26.2    7.3   
L8GLL9 WD domain, G-beta repeat-containing protein OS=Acanthamoeb...  26.6    7.5   
L8H6G5 Uncharacterized protein OS=Acanthamoeba castellanii str. N...  26.2    8.7   
L8HK33 GTPase OS=Acanthamoeba castellanii str. Neff OX=1257118 GN...  26.2    9.5   
L8H1F4 Myosin VIIa, putative OS=Acanthamoeba castellanii str. Nef...  26.2    9.6   
L8GRW1 Rab-like protein 6 OS=Acanthamoeba castellanii str. Neff O...  26.2    9.7   
L8HDD1 Cobalamin synthesis protein/P47K OS=Acanthamoeba castellan...  26.2    9.9   
L8H2D9 ABC transporter, ATPbinding domain containing protein OS=A...  26.2    9.9   


>L8HHF3 Ras-related protein Rab-7A OS=Acanthamoeba castellanii str. Neff 
OX=1257118 GN=ACA1_074580 PE=4 SV=1
Length=186

 Score = 293 bits (749),  Expect = 2e-102, Method: Compositional matrix adjust.
 Identities = 149/208 (72%), Positives = 166/208 (80%), Gaps = 23/208 (11%)

Query  1    MTSRKKVLLKVIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDRLVTMQ  60
            M++RKKVLLK+IILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDD+LVT+Q
Sbjct  1    MSTRKKVLLKIIILGDSGVGKTSLMNQYVNKKFSNQYKATIGADFLTKEVMVDDKLVTLQ  60

Query  61   IWDTAGQERFQSLGVAFYRGADCCVLVFDVTAPNTFKTLDSWRDEFLIQASPRDPENFPF  120
            IWDTAGQERFQSLGVAFYRGAD CVLVFDV                   A PRDPE FPF
Sbjct  61   IWDTAGQERFQSLGVAFYRGADACVLVFDVN------------------AGPRDPETFPF  102

Query  121  VVLGNKIDLE-NRQVATKRAQAWCYSKNNIPYFETSAKEAINVEQAFQTIARNALKQETE  179
            +VLGNKIDLE +R V+ KRAQ WC SK NIPYFETSAKEAINVEQAFQTIA+NA+K+E +
Sbjct  103  IVLGNKIDLESSRVVSQKRAQTWCQSKGNIPYFETSAKEAINVEQAFQTIAKNAMKEEED  162

Query  180  VELYNEFPEPIKLDKNDRAKASAESCSC  207
            V+L     + I+LDK++    S   CSC
Sbjct  163  VDLAG-ITDAIQLDKSE---PSQSGCSC  186

... TRUNCATED

>L8GVA9 Fbox domain containing protein OS=Acanthamoeba castellanii str. 
Neff OX=1257118 GN=ACA1_042330 PE=4 SV=1
Length=226

 Score = 27.7 bits (60),  Expect = 2.3, Method: Compositional matrix adjust.
 Identities = 23/100 (23%), Positives = 40/100 (40%), Gaps = 15/100 (15%)

Query  11   VIILGDSGVGKTSL-------MNQYVNKKFSNQYKAT----IGADFLTKEVMVDDRLVTM  59
            +++LGD GVGKT+          +   KK   ++        G   LT         V +
Sbjct  114  ILVLGDVGVGKTAYAERCLAGAQESTTKKKKGRFDMVALRGTGKHHLTLPTNYGP--VNI  171

Query  60   QIWDTAGQERFQSLGV--AFYRGADCCVLVFDVTAPNTFK  97
             +WD     R  S GV      G    +++FD+ +  +++
Sbjct  172  TLWDQPATSREISGGVRDELLAGGQAAIIMFDIASLQSYR  211



Lambda      K        H        a         alpha
   0.318    0.132    0.386    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6 

Effective search space used: 603304980


  Database: ACACA.faa
    Posted date:  Oct 7, 2023  9:42 AM
  Number of letters in database: 6,560,263
  Number of sequences in database:  14,938


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

Note: The header and footer sections contain essential information (m, n, k, $λ$ ) required to compute the E-value (see Karlin-Altschul equation $E = k m n e^{- λ S}$ , as detailed in a previous section).

Remember, this critical information can be easily extracted from the standard BLAST output using, for example, an AWK one-liner, such as the following one:

blastp -query RAB7A_HUMAN.faa -db ACACA.faa | awk '/Database/, /total letters/; /^Lambda/, /Extension:/{seen++; print}{ if(seen>5) exit}'

Database: ACACA.faa
           14,938 sequences; 6,560,263 total letters
Lambda      K        H        a         alpha
   0.318    0.132    0.386    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Identify HUMAN vs. Acanthamoeba reciprocal best hits (RBHs) in the Rab-GTPase multigene family

The previous exercise identified a selection of likely Rab-GTPase homologs in the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT and saved in file HUMAN_63Rabs_Ran.faa and the A. castellanii proteome. The resulting RBHs are likely orthologs; therefore, we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.

Overview of the bioinformatic strategy to find RBHs between the A. castellanii proteome and the HUMAN_63Rabs_Ran.faa reference proteins

The bioinformatics strategy to find RBHs:

run $b l a s t p$ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as query with the BLOSUM45 matrix
retrieve the A. castellanii proteome database hits using $b l a s t d b c m d$
make a blastdb for HUMAN_63Rabs_Ran.faa and query it with the retrieved A. castellanii proteome database hits
retrieve a non-redundant list of the ACACA IDs from the blastp run and use it to grep out the matching lines from the first blastp run, sorting it by increasing E-values (in column 11)
filter out unique HUMAN vs. ACACA RBHs using AWK hashes from the sorted blast output table

Prepare a working directory for the RBH searches

Use symlinks to avoid unnecessary file copying to preserve precious disk space whenever possible.

mkdir RBHs
cd RBHs/
  
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .

Run $b l a s t p$ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query with the BLOSUM45 matrix

Given the huge evolutionary distance between human and amoebae (> 1 billion yrs), we may use the BLOSUM45 matrix to maximize sensitivity.
We will use a customized tabular output with the following fields: ‘qseqid sseqid pident gaps length qlen slen qcovs evalue score’, saved in a variable

# set BLAST and filtering variables
mat=BLOSUM45
cols='6 qseqid sseqid pident gaps length qlen slen qcovs evalue score'
min_qcov=80

blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out

Retrieve the A. castellanii proteome database hits using $b l a s t d b c m d$ only if qcov > 80.

Note the following syntactic details:

use of Bash’s process substitution idiom <(your code here)
awk -F”\t” -v qcov=$min_qcov ‘$8 >= qcov{print $2}’ HUMAN_63Rabs_vs_ACACA_besthits_m6.out to filter out Acanthamoeba hits with qcov >=$min_qcov.

blastdbcmd -entry_batch <(awk -F"\t" -v qcov=$min_qcov '$8 > qcov{print $2}' HUMAN_63Rabs_vs_ACACA_besthits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_1hits.faa

Make a blastdb for HUMAN_63Rabs_Ran.faa, and query it with the retrieved A. castellanii proteome database hits

generate the HUMAN_63Rabs_Ran.faa BLASTdb
run blastp with the filtered Acanthamoeba hits from the first BLAST as queries against the HUMAN_63Rabs_Ran.faa BLASTdb, retrieving only the best hit

makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_1hits.faa -db HUMAN_63Rabs_Ran.faa -matrix "$mat" -outfmt "$cols" -num_alignments 1 > ACACA_vs_HUMAN_best_1hit_m6.out

Sort the blastp output table from the preceding search by increasing E-values (on column 11) and filter out the first (best-scoring) hits with AWK hashes

# The following lines extract a sorted list (n=54) of unique Acanthamoeba identifiers 
#   found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort | uniq -c
cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u

# This code greps out the sorted and unique ACACA hits from the ACACA_vs_HUMAN_best_1hit_m6.out table
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done

# Same as previous line, but sorting the BLAST output table by increasing E-values (in column 9; sort -gk9,9)
#  output displayed below
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 | column -t
for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 > HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted

Sorted output table generated by last line of code, by increasing E-values. Note the presence of repeated Acanthamoeba identifiers (2cnd col)

sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  82.524  0   206  216  213  95   7.64e-129  1259
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  73.953  2   215  215  213  100  2.88e-117  1154
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.829  0   211  216  213  98   9.90e-117  1150
sp|Q15907|RB11B_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.463  8   216  218  213  98   2.83e-113  1119
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA  70.046  2   217  216  216  100  4.48e-110  1090
sp|P61019|RAB2A_HUMAN  tr|L8H2N2|L8H2N2_ACACA  69.585  6   217  212  216  100  2.42e-109  1083
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  81.287  0   171  208  195  82   1.25e-104  1037
sp|Q9NRW1|RAB6B_HUMAN  tr|L8GN15|L8GN15_ACACA  75.843  0   178  208  195  86   1.36e-103  1028
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  71.635  23  208  207  186  100  6.25e-103  1021
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.888  11  214  213  204  100  2.29e-102  1018
sp|P20338|RAB4A_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.421  11  214  218  204  98   9.55e-99   986
sp|Q13637|RAB32_HUMAN  tr|L8HAC7|L8HAC7_ACACA  68.681  2   182  225  220  80   9.57e-94   943
sp|P57729|RAB38_HUMAN  tr|L8HAC7|L8HAC7_ACACA  69.886  1   176  211  220  83   1.31e-93   940
sp|P57735|RAB25_HUMAN  tr|L8HJ43|L8HJ43_ACACA  59.330  4   209  213  213  96   2.16e-91   920
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  205  210  83   4.80e-91   915
sp|Q9H0U4|RAB1B_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  201  210  85   6.47e-91   914
sp|Q92928|RAB1C_HUMAN  tr|L8HD31|L8HD31_ACACA  66.864  0   169  201  210  84   5.35e-85   860
... TRUNCATED

filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table

Note: if you need help with AWK and Linux for bioinformatics, visit my extensive AWK tutorial for bioinformatics here

awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted | column -t
awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print }' HUMAN_63Rabs_vs_ACACA_besthits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv

The final table of fifteen HUMAN vs. ACACA Rab-GTPase RBHs (plus the Ran RBH), which are likely orthologs, filtered by

qcov >=
 $min_qcov

sp|P62826|RAN_HUMAN    tr|L8GF16|L8GF16_ACACA  82.524  0   206  216  213  95   7.64e-129  1259
sp|P61106|RAB14_HUMAN  tr|L8H946|L8H946_ACACA  73.953  2   215  215  213  100  2.88e-117  1154
sp|P62491|RB11A_HUMAN  tr|L8HJ43|L8HJ43_ACACA  75.829  0   211  216  213  98   9.90e-117  1150
sp|Q8WUD1|RAB2B_HUMAN  tr|L8H2N2|L8H2N2_ACACA  70.046  2   217  216  216  100  4.48e-110  1090
sp|P20340|RAB6A_HUMAN  tr|L8GN15|L8GN15_ACACA  81.287  0   171  208  195  82   1.25e-104  1037
sp|P51149|RAB7A_HUMAN  tr|L8HHF3|L8HHF3_ACACA  71.635  23  208  207  186  100  6.25e-103  1021
sp|P61018|RAB4B_HUMAN  tr|L8HCK2|L8HCK2_ACACA  65.888  11  214  213  204  100  2.29e-102  1018
sp|Q13637|RAB32_HUMAN  tr|L8HAC7|L8HAC7_ACACA  68.681  2   182  225  220  80   9.57e-94   943
sp|P62820|RAB1A_HUMAN  tr|L8HD31|L8HD31_ACACA  69.412  0   170  205  210  83   4.80e-91   915
sp|Q9NP72|RAB18_HUMAN  tr|L8GYY7|L8GYY7_ACACA  58.163  2   196  206  209  95   4.10e-80   816
sp|P51151|RAB9A_HUMAN  tr|L8GPH2|L8GPH2_ACACA  51.961  3   204  201  205  100  1.18e-76   784
sp|P51153|RAB13_HUMAN  tr|L8HGZ3|L8HGZ3_ACACA  58.282  6   163  203  208  80   1.16e-71   739
sp|Q14964|RB39A_HUMAN  tr|L8GSP6|L8GSP6_ACACA  51.124  6   178  217  200  82   8.44e-61   641
sp|Q13636|RAB31_HUMAN  tr|L8GKG8|L8GKG8_ACACA  46.734  9   199  194  218  98   2.73e-58   618
sp|Q969Q5|RAB24_HUMAN  tr|L8H2Q9|L8H2Q9_ACACA  50.549  18  182  203  211  83   6.20e-56   597

Filter human-Acanthamoeba blastp RBHs with a single one-liner

This final code runs all the filtering steps presented above for the two blastp output tables using a single neat one-liner

for ACAid in $(cut -f1 ACACA_vs_HUMAN_best_1hit_m6.out | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_besthits_m6.out; done | sort -gk9,9 | awk -v qcov="$min_qcov" 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1 && $8 >= qcov) print}' | column -t

Categorybioinformatics

A Practical Introduction to Large Language Models in Biomedical Data Science Research

Installing Bioinformatics Software in Linux

Running and parsing BLAST+ using the Linux command line

About the tutorials

ASSUMPTIONS

THE BLAST SUITE OF PROGRAMS – OVERVIEW AND PROGRAM SELECTION

Searching a DNA database with DNA queries (BLASTN)

EXERCISE 1 – classify novel 16S rDNA sequences at the genus level using blastn

Exploring the query and reference sequences

Using Perl 1-liners to generate FASTA headers suitable for database indexing with makeblastdb

Running [formatdb]|makeblastdb to generate an indexed blast database

Running a default blastn search

AWK one-liner to extract key statistical parameters of a blast[npx] run

Identify the single best-hit for each query sequence

The twelve fields of a standad tabular blast output (-outfmt 6|[-m 8])

Adding a header to the standard tabular output: -outfmt 6 [-m 8 in legacy blast]

Retrieve all the hits from the database using blastdbcmd

Generate a table containing the query sequence IDs, the subject headers and basic alignment statistics using Linux tools

Generate a table containing the query sequence IDs, the subject headers, and basic alignment statistics using process substitution to avoid writing intermediate files

Add basic BLAST statistic information to the summary table using Bash’s “process substitution” idiom

Filter out high-confidence classifications

AWK is a very handy and efficient tool for filtering BLAST output tables

identify hits with < 95% identity OR short alignments (length < 500) using AWK��� patterns

generate the final (filtered) results table with highly confident classifications

Generate summaries of the results

Making use of a customized tabular output with selected fields for a slick and neat blastn-based classification strategy of 16S rDNA sequences

Running blastn with a customized tabular outut

HOMEWORK 1

BLAST+ with protein queries against a protein database (BLASTP)

Prepare working directory

Format the Acanthamoeba castellanii reference proteome fetched from UNIPROT

Default blastp run, and anatomy of its output

Identify HUMAN vs. Acanthamoeba reciprocal best hits (RBHs) in the Rab-GTPase multigene family

Overview of the bioinformatic strategy to find RBHs between the A. castellanii proteome and the HUMAN_63Rabs_Ran.faa reference proteins

Prepare a working directory for the RBH searches

Run blastp������ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query with the BLOSUM45 matrix

Retrieve the A. castellanii proteome database hits using blastdbcmd���������� only if qcov > 80.

Make a blastdb for HUMAN_63Rabs_Ran.faa, and query it with the retrieved A. castellanii proteome database hits

Sort the blastp output table from the preceding search by increasing E-values (on column 11) and filter out the first (best-scoring) hits with AWK hashes

filter out unique HUMAN vs ACA RBHs using AWK hashes from the sorted blast output table

Filter human-Acanthamoeba blastp RBHs with a single one-liner

Related posts:

How is high-performance computing (HPC) used in bioinformatics?

Quantum Computing Promises to Overhaul Computational Drug Discovery

Integrating Bioinformatics Across Life Science Curricula: An Assessment of Educational Initiatives

Selecting Random Pairs from FASTQ Files: A Beginner's Guide

Careers in Code: Exciting Opportunities for Software Engineers in Bioinformatics

Best Practices for Applying Machine Learning in Bioinformatics Research

What Drives Personalized Medicine in Bioinformatics?

Deep Learning in Drug Discovery: Predicting Protein-Ligand Binding Affinity

Converting Chemical SMILES to IUPAC Names: Open Source Tools and Scripts

Step-by-Step Guide: How to Draw a Heat Map for Gene Expression Data

DNA Compression Algorithms

A Comprehensive Guide to Data Science and its Expanding Horizon

Step-by-Step Manual: Paired-End Sequencing

Advances in Quantum Computing Applications in Bioinformatics

Top Trending Searches of 2024 in Bioinformatics

Exploring the Lipidome: Insights into Cellular Metabolism and Disease

Share

Copy short link

Retrieve all the hits from the database using $b l a s t d b c m d$

identify hits with < 95% identity OR short alignments (length < 500) using $A W K$ patterns

Run $b l a s t p$ against the A. castellanii proteome database using HUMAN_63Rabs_Ran.faa as the query with the BLOSUM45 matrix

Retrieve the A. castellanii proteome database hits using $b l a s t d b c m d$ only if qcov > 80.