Basic Local Alignment Search Tool (BLAST) for bioinformatics

July 22, 2019 Off By admin

What is BLAST (Basic Local Alignment Search Tool)?
In bioinformatics, BLAST (basic local alignment search tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences.

What do we compare?
The increasing number of genes and genomes deposited in GenBank implies the increasing importance of methods for finding genes by homology search. Indeed, sequence similarity search has been claimed to be among the most effective methods for exploiting the information in the rapidly growing molecular sequence databases.

An important goal of genomics is to determine if a particular sequence is like another sequence. This is accomplished by comparing the new sequence with sequences that have already been reported and stored in a database. This process is principally one that uses alignment procedures to uncover the “like” sequence in the database.

A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

Why do we need to compare with existing sequences?
The process is principally one that uses alignment procedures to uncover the “like” sequence in the database. The alignment process will uncover those regions that are identical or closely similar and those regions with little (or no) similarity.

What is alignment?
Two alignment types are used: global and local.

The global approach compares one whole sequence with other entire sequences. The local method uses a subset of a sequence and attempts to align it to a subset of other sequences. The output of a global alignment is a one-to-one comparison of two sequences.

Local alignments reveal regions that are highly similar, but do not necessarily provide a comparison across the entire two sequences. The global approach is useful when you are comparing a small group of sequences, but becomes computationally expensive as the number of sequences in the comparison increases.

Local alignment searches such as BLAST use heuristic programming methods that are better suited to successfully searching very large databases, but they do not necessarily give the optimal solution. Even given this limitation, local alignments are very important to the field of genomics because they can uncover regions of homology that are related by descent between two otherwise diverse sequences.
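To make the global/local distinction concrete, here is a minimal sketch that scores the same pair of sequences both ways using classic dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). These are the exact methods, not the heuristic shortcuts BLAST uses, and the match, mismatch, and gap values are assumptions chosen only for illustration.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1

    # Global (Needleman-Wunsch): gaps at the sequence ends are charged.
    glob = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    # Local (Smith-Waterman): a cell never drops below zero, so an
    # alignment can start and end anywhere inside the two sequences.
    loc = [[0] * cols for _ in range(rows)]
    best_local = 0

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    return glob[-1][-1], best_local


# Two made-up sequences that share only the core "ACGTACGT".
global_score, local_score = alignment_scores("TTTTACGTACGT", "ACGTACGTCCCC")
print(global_score, local_score)
```

The local score rewards the shared core alone, while the global score is pulled down by the unrelated flanking residues that it is forced to align.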

BLAST algorithm
BLAST is a set of algorithms that attempt to find a short fragment (a “word”) of a query sequence that aligns well with a fragment of a subject sequence found in a database. The score of that initial alignment must reach a neighborhood score threshold (T). For the original BLAST algorithm, the fragment is then used as a seed to extend the alignment in both directions.

The alignment is extended in both directions until the score of the aligned segment no longer increases; in practice, extension stops once the score has fallen a set amount below the best value seen so far. Said another way, BLAST looks for short sequences in the query that match short sequences found in the database.
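The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not NCBI's implementation: it seeds only on exact word matches (protein BLAST also accepts similar words scoring at least T), extends without gaps, and stops an extension once the score falls more than `x_drop` below the best seen. The function name and all parameter values are hypothetical.

```python
def seed_and_extend(query, subject, word_size=3, x_drop=2, match=1, mismatch=-1):
    """Return the best ungapped hit as (score, query_start, subject_start, length)."""
    # Index every word of the subject by its start positions.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    def extend(i, j, step):
        # Walk away from the seed in one direction, remembering the best
        # score gain and how many residues were needed to reach it.
        score = best = 0
        best_len = n = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            score += match if query[i] == subject[j] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break
            i += step
            j += step
        return best, best_len

    best_hit = None
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            right, r_len = extend(i + word_size, j + word_size, +1)
            left, l_len = extend(i - 1, j - 1, -1)
            score = word_size * match + left + right
            hit = (score, i - l_len, j - l_len, word_size + l_len + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


# Made-up sequences sharing the segment "ACGTAC".
print(seed_and_extend("GGACGTACGG", "TTACGTACTT"))
```

Indexing the subject words first is what makes the search fast: each query word is looked up directly instead of being compared against every position in the database.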

Statistical significance
It is important to realise that any database search will extract close matches based on calculated similarity between strings of letters. Biologists are likely to be interested in extracting those sequences that can be assumed from similarity to be evolutionarily related to their test sequence. Such sequences, which will have derived from a common ancestor, are defined to be ‘homologous’. Extracting this information from a purely numerical measure of similarity is difficult. The most practicable simple guide to the likelihood of a ‘hit’ in a database search being evolutionarily related to a test sequence is a statistical measure of how likely the match is to have occurred purely by chance. Such values are calculated and quoted in the results generated by database searching algorithms.

E-Value
The most common measure of significance is the so-called Expect value, or E-value. This is the number of alignments with a given score that would be expected to occur at random in the database that was searched.

Put another way, it is the number of Maximal Segment Pairs (MSPs) with a similar or higher score that one can EXPECT to see by chance alone when searching a database of a particular size.

Thus, an E-value of 1.00 for a match between a database sequence and a test sequence would indicate that, on average, one random sequence in a database of that size would be expected to match the test sequence as well as the current one.

E-values depend on the size of the search space: the same alignment score yields a larger E-value in a larger database or with a longer query. Values as low as 10^-50 are not uncommon in well-conserved families. With large databases, values between about 0.01 and 10 can be said to represent a ‘grey area’; it may be useful to analyse sequences matching at this level in more detail.
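The relationship between score and E-value is usually written with the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length, n is the database length, and S is the raw alignment score. The sketch below applies that formula directly. The λ and K values in the example are illustrative assumptions (they depend on the scoring matrix and gap costs), and the lengths are made up.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score, comparable across different scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative (assumed) parameters and made-up sizes.
LAMBDA, K = 0.267, 0.041
query_len, db_len = 289, 5_000_000

for raw in (40, 80, 160):
    print(raw,
          round(bit_score(raw, LAMBDA, K), 1),
          e_value(raw, query_len, db_len, LAMBDA, K))
```

Because the E-value scales with m·n, rerunning the same search against a database twice the size doubles the E-value of an identical alignment.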

Gap penalties
Clearly, aligning a residue or group of residues in one sequence with a null character (a ‘gap’) in another should be penalised. Since a single mutational event may insert or delete many more than one residue in a sequence, long gaps are usually penalised only slightly more than short ones. This is achieved by using two separate negative scores: a large penalty for introducing a gap and a much smaller one for extending an existing one.
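This two-part scheme is often called an affine gap penalty. A minimal sketch, assuming an opening cost of 11 and an extension cost of 1 per residue (values chosen for illustration):

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for one gap spanning `length` residues."""
    if length <= 0:
        return 0
    # One fixed cost for opening the gap, plus a small cost per residue.
    return gap_open + gap_extend * length


for n in (1, 2, 10):
    print(n, affine_gap_penalty(n))
```

A ten-residue gap costs 21 here, less than twice the cost of a one-residue gap (12), which matches the idea that one long insertion is more plausible than many separate short ones.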

Substitution matrix
Substitution matrices are used to score aligned positions in a sequence alignment procedure, usually of amino acid or nucleotide sequences.

Two commonly used matrices: PAM and BLOSUM
PAM = Percent Accepted Mutations (Margaret Dayhoff)
BLOSUM = Blocks Substitution Matrix (Steven and Jorja Henikoff)

PAM: Derived from small, closely related proteins with ~15% divergence. Higher PAM numbers detect more remote sequence similarities. Errors in PAM 1 are scaled 250X in PAM 250.

BLOSUM: Based on empirical frequencies. Uses a much larger, more diverse set of protein sequences (30-90% identity). Lower BLOSUM numbers detect more remote sequence similarities. Errors in BLOSUM arise from errors in alignment.

Scoring matrices for DNA comparison
For comparisons between DNA sequences, the choice of scoring matrix
is generally trivial. A high score is given for a match between bases and zero, or a negative score, to any mismatch. A match between A and
‘not-A’ is scored as a mismatch under all circumstances.
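As a small illustration of how simple DNA scoring is, the sketch below scores an ungapped alignment of two equal-length DNA strings. The reward of +2 and the penalty of −3 are assumptions picked for the example.

```python
def score_ungapped_dna(seq1, seq2, match=2, mismatch=-3):
    """Sum match/mismatch scores over two aligned, equal-length DNA strings."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    # Every identical pair earns `match`; any other pair earns `mismatch`.
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


print(score_ungapped_dna("ACGT", "ACGA"))  # three matches, one mismatch
```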

Scoring matrices for Protein comparison
Comparisons at the protein level are much more complex. All algorithms comparing protein sequences give matches between amino acids thought of as ‘similar’ – such as leucine and isoleucine, or phenylalanine and tyrosine – intermediate scores between those of identical amino acids and those of amino acids with no similarity.

Researchers have used different criteria to assign scores to each of the 210 possible pairs of amino acids. The most widely used scoring schemes are the BLOSUM and PAM matrices.

A variety of BLOSUM (BLOcks SUBstitution Matrix) matrices are available, whose utility depends on whether the user is comparing more highly divergent or less divergent sequences. The BLOSUM62 matrix is used as the default scoring matrix for BLASTP. The BLOSUM62 matrix was developed by analyzing the frequencies of amino acid substitutions in clusters of related proteins. Within each cluster, or block, the
amino acid sequences were at least 62% identical when two proteins were aligned. Investigators computationally determined the frequencies of all amino acid substitutions that had occurred in these conserved blocks of proteins.
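To see the “intermediate score” idea in numbers, here is a lookup over a handful of entries copied from the BLOSUM62 matrix. Only these few pairs are included, so this is an excerpt for illustration and not a usable matrix; the helper name is hypothetical.

```python
# A few BLOSUM62 entries (excerpt only): L = leucine, I = isoleucine,
# F = phenylalanine, Y = tyrosine, D = aspartate.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("I", "L"): 2,
    ("F", "F"): 6, ("Y", "Y"): 7, ("F", "Y"): 3,
    ("D", "D"): 6, ("D", "L"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score for residues a and b."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


# Identical > similar > dissimilar
print(pair_score("L", "L"), pair_score("L", "I"), pair_score("L", "D"))
```

Leucine paired with isoleucine scores between a leucine identity and a leucine/aspartate mismatch, which is exactly the behaviour described above.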

BLAST
BLAST (basic local alignment search tool) will compare your DNA sequence with other sequences in the database. For this example, let’s see what the closest human gene is, so keep the database on the human genome. There are five different types of BLAST programs. For a protein database use the blastp and blastx programs and for a nucleotide database use the blastn, tblastn, and tblastx programs.

1. Go to http://www.ncbi.nlm.nih.gov/ and click on ‘BLAST’.
2. In the ‘nucleotide’ box, click on ‘nucleotide blast’, and paste your sequence (from step 2, see above) in the box.
3. Next, use the “human genome plus transcript” for the box marked database and press ‘blast’.
4. For kicks, press on the genome viewer button. There you will see the region of the genome that best matches. (One of the ways in which a genome sequence can help with mapping genes to a location.)
5. The results page shows a color key for alignment scores. The matches are color coded, with red being the closest match and black being the furthest from a match.
6. Scroll down the page to alignments and observe the values given. One of the values you see is gaps. Gaps in the alignment are caused by insertions and deletions, while substitutions appear as mismatches. All of these differences affect the final sequence alignment and the final alignment score.
7. While analyzing the results you will notice a set of data beside your sequence comparisons. The ‘E-value’ (expectation value) is the number of different alignments with scores equivalent to or better than S (the score of the alignment) that are expected to occur in a database search by chance. The E-value describes the random background noise that exists for matches between sequences. The lower the E-value, the more significant the match.

Types of BLAST

Nucleotide-nucleotide BLAST (blastn) – This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.

Protein-protein BLAST (blastp) – This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.

Position-Specific Iterative BLAST (PSI-BLAST) (blastpgp) – This program is used to find distant relatives of a protein.

Nucleotide 6-frame translation-protein (blastx) – This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx) – The purpose of tblastx is to find very distant relationships between nucleotide sequences.

Protein-nucleotide 6-frame translation (tblastn) – This program compares a protein query against all six reading frames of a nucleotide sequence database.

Large numbers of query sequences (megablast) – When comparing large numbers of input sequences via the command-line BLAST, “megablast” is much faster than running BLAST multiple times.

Of these programs, BLASTn and BLASTp are the most commonly used because they use direct comparisons and do not require translations. However, since protein sequences are better conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx produce more reliable and accurate results when dealing with coding DNA.

BLAST Program	Further details
nucleotide blast or blastn	Compares a nucleotide query sequence against a nucleotide sequence database.
protein blast or blastp	Compares an amino acid query sequence against a protein sequence database.
blastx	Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn	Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx	Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive
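The table above boils down to a lookup on the query type and the database type. A minimal sketch of that decision (the function and argument names are hypothetical, made up for this example):

```python
# (query type, database type) -> program that needs no extra translation choice
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the query and database types."""
    if translate_both:
        # tblastx translates both the query and the database in six frames.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```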

BLAST analysis- Tutorial

Retrieve a Protein Sequence

This can seem to be trivial. In fact it may not be that simple given the abundance of data. Mainly two databases can be used: (i) NCBI, which we will use here, and (ii) SRS at EBI. The latter can be very convenient as the query form allows making complicated requests. However, it is not as intuitive as NCBI…so:

1/ Go to the NCBI web site. Note that you can also download the NCBI search toolbar for Internet Explorer or Firefox.

2/ Enter your query using the NCBI search field. We will be working with the Yeast gene VPS36. To look for it, you may simply type VPS36 in the search field. Note however that your search won’t be very specific. There will be 113 entries that have “vps36” somewhere in their text, but this includes the annotation, e.g., it may include proteins known to associate with Vps36. If you click on the “Preview/Index” tab below the search field, you can use the associated menus to narrow your search.

From the preview/index page, you can add qualifying terms to narrow the search. For example, to restrict the search to entries having the gene name “Vps36”, pull down the “Field name” tab, select “Gene name”, type Vps36 in the search box then click on “AND”.

This generates a new search command as shown below

Clicking on the Go button generates 16 hits. If pull-down menus annoy you, you could have simply typed “vps36[gene name]” in the search field and gotten the same result.

For more details on how to make specific searches please refer to this link, but for this workshop, [gene name] will suffice.

3/ In the list of Vps36 genes, you will see the protein NP_013521 among the results. Click on it.

4/ You can simply get the sequence (or part of it) in a FASTA format using the Display and Range options on the top of the screen. Save the sequence from position 1 to 289…. done

You might as well copy this sequence to the clipboard, as you’ll need it in the next section.
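If you prefer to script these steps, the sketch below shows two small pieces: building a search URL for NCBI's E-utilities `esearch` endpoint with the same “vps36[gene name]” term, and cutting residues 1 to 289 out of a FASTA record. The code only constructs the URL and does not contact NCBI. The FASTA text in the example is a made-up placeholder, not the real NP_013521 sequence, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="protein"):
    """Build (but do not send) an E-utilities search URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


def subsequence(seq, start, end):
    """Slice using 1-based, inclusive positions, as on the NCBI page."""
    return seq[start - 1:end]


print(esearch_url("vps36[gene name]"))

# Placeholder record for illustration only.
header, seq = read_fasta(">demo placeholder sequence\nMKTAYIAKQR\nQISFVKSHFS\n")
print(header, subsequence(seq, 1, 5))
```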

Do a Blast Search With Your Sequence

1/ Go back to the NCBI protein page.
2/ On the left, below “related resources” click on Blast.
3/ In the “Protein” subdivision, click on “Protein-protein BLAST (blastp)”
4/ Paste your sequence (just the sequence, not the header). Then, there are a number of options. In general, I would:

Look in the “nr” database. The default database for a BLAST is the “nr” database. The “nr” database is the largest database available through NCBI BLAST. Choosing the largest database is not always best. You may want to find a match from a specific organism. The name “nr” is derived from “non-redundant”, but this is historical only, because this database is no longer non-redundant.
Try with “NO CD-search” selected at least you know that the predicted domains are accurate.
Composition adjustments: select composition-based statistics
Word size, select 2

Finally, press the Blast! button. You will have to click on the Format button to get your results. The Blast page also gives you the option of limiting your query by taxonomy by using the “Organism” menu. You can also apply more complicated filters using the general “Entrez search fields”.

You will get a list of pairwise alignments with your query sequence in order from most similar to least similar. The column labeled “E-value” represents approximately how many sequences you would expect to match by chance in a database of the size searched (i.e., the nr database).

The “bit scores” (S values) have been normalized with respect to the scoring system, so that they can be used to compare alignment scores from different searches.
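Once you have results, filtering and ranking hits by E-value is easy to automate. The sketch below assumes the hits were saved in the 12-column tab-separated layout produced by command-line BLAST+ with `-outfmt 6`, which is an assumption about your setup, since the web page shows alignments rather than this table by default. The sample rows are invented for the example.

```python
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular(text):
    """Turn tab-separated hit lines into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


def significant_hits(hits, max_evalue=0.01):
    """Keep hits at or below the E-value cutoff, most significant first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# Invented rows, for illustration only.
sample = (
    "q1\tsubjB\t30.2\t120\t80\t3\t5\t124\t10\t130\t0.5\t35.4\n"
    "q1\tsubjA\t98.5\t289\t4\t0\t1\t289\t1\t289\t1e-150\t520\n"
)
for hit in significant_hits(parse_tabular(sample)):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

The 0.01 cutoff mirrors the lower edge of the ‘grey area’ mentioned earlier; loosen it if you want to inspect borderline matches by hand.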

Types of Protein & Nucleotide databases

Do iterative Blast Searches: PSI-BLAST

The evolutionary pressure is not equivalent on all residues of a protein. For example, buried residues, residues in a secondary structure, at an active site or at a binding site are generally more conserved than residues in loops. When you compare two sequences, you do not take into account these differences in conservation that can be very informative. However, when you have a set of similar sequences you can compare them to each other and identify which regions are variable and which regions are not. This is what PSI-BLAST does. It identifies regions of importance (not variable) and it gives them more weight in subsequent comparisons.

So, PSI-BLAST is a kind of hybrid program in between BLAST and HMMs (explained in the next section): it starts by looking for sequences similar to yours. Once it finds some, it asks which sequences you want to keep for the next search iteration. You have to very carefully select those sequences that you think are relevant. Then you start a new iteration. The sequences you selected are used to define a sort of motif (with some statistics), which will help to detect previously undetected sequences and also discard previously ambiguous sequences. In brief, it will increase the specificity and sensitivity of the search.
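The “motif with some statistics” is a position-specific scoring profile. The sketch below builds a very simplified one from a set of already aligned, gap-free sequences: each column gets a log-odds score per residue, so conserved columns reward the expected residue strongly. The uniform background frequency and the pseudocount are simplifying assumptions, the toy sequences are made up, and real PSI-BLAST uses more elaborate sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        profile.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_with_profile(profile, seq):
    """Score a sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Toy alignment: column 1 is fully conserved, columns 2 and 3 vary.
profile = build_profile(["MKV", "MRV", "MKL"])
print(score_with_profile(profile, "MKV"), score_with_profile(profile, "AKV"))
```

Changing the conserved first residue costs more than changing a variable one, which is how the profile gives extra weight to the regions that matter.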

So … let’s try!
1/ Go to the NCBI Blast web page,
2/ In the “Protein” sub-division, click on “Position-specific iterated and pattern-hit initiated BLAST” (now you understand what it means)
3/ Paste in the Vps36 1-289 region in the sequence box and select the same parameters as for the previous BLAST
4/ Click on BLAST! button.
5/ You will have to press the format button to see the results. When they are ready, they will appear in another browser window. Keep pressing on Format periodically until the first iteration appears.
6/ At the first iteration, there are a lot of sequences. The top set matches the query sequence (the one you submitted) closely throughout the entire range of the sequence (the red and purple hits). However, these are all closely related. They are all Vps36 genes from other fungi.

Select the red, purple and green hits for the next iteration. Click on the Run Psi-Blast Iteration 2 button.

Again, you will have to click on the Format button in the original window periodically to get the results.

The list from the second iteration shows the same top entries as in the first iteration, but some new sequences have appeared in blue that show similarity to the query sequence at both ends, but not in a region in the middle. There are many “hypothetical” and “unnamed” sequences among them. There are also some proteins that contain protease related (calpain link) domains. These might be important, but be conservative at first. For the next iteration choose only those proteins with an annotated function of “vacuolar protein sorting”. This annotation could be wrong, but it is useful for a start. Uncheck all other sequences.

Run blast iteration 3.

At the third iteration, a clear pattern has begun to emerge

There is now a large group of sequences related to Vps36 that show no similarity in the region from about 100-200. These seem to be the metazoan Vps36 genes. Don’t bother doing it now, but by choosing only the top sequences and these metazoan sequences with a split region of similarity and iterating further, at iteration 5, this pattern is very distinct. Below you can see the results from iteration 8.

Note that genes can have lots of synonyms in various organisms. For example, for Vps36, you will see EAP45, and the locus CGI-145.

The PSI-BLAST exercise has helped get a clearer picture of the organization of this N-terminal region of Vps36. There seems to be a yeast-specific insertion consisting of about 150 residues.

USES OF BLAST
BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison. When working with a DNA sequence from an unknown species, BLAST can help identify the species or find homologous species.
To locate domains, a protein sequence can be submitted to BLAST, which finds known domains within the sequence of interest. The results received through BLAST can also be used to create a phylogenetic tree on the BLAST web page. Phylogenies based on BLAST alone are less reliable than those from purpose-built computational phylogenetic methods, so they should only be relied upon for first-pass phylogenetic analyses.
For DNA mapping, when working with a known species and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest to relevant sequences in the databases. Last but not least, for comparison, BLAST can locate common genes in two related species and can be used to map annotations from one organism to another.

CONCLUSION

In conclusion, BLAST has become an essential tool for biologists. Its speed and sensitivity allow scientists to compare nucleotide and protein sequences to both single sequences and large databases. Most importantly, BLAST has helped democratize bioinformatics analysis and make it accessible to any researcher over the Internet, and it is rare to read a modern molecular biology paper that does not refer to a BLAST alignment. We should be able to compare BLAST results from different databases and convert values if they are reported differently.
Moreover, we should also know why a BLAST result might change from one day to the next, even on the same server. Last but not least, BLAST and its descendant applications have permitted scientists to estimate the functions of genes and proteins in whole genomes, answering questions that could never be answered at a lab bench or in a class.