Interpreting dot plot-bioinformatics with an example
July 10, 2019
In bioinformatics, a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity between them.
A dot plot is a simple, yet intuitive way of comparing two sequences, either DNA or protein, and is probably the oldest way of comparing two sequences [Maizel and Lenk, 1981].
Principle
Dot plots are two-dimensional graphs that show a comparison of two sequences. The principle used to generate a dot plot is as follows: the top (x) and left (y) axes of a rectangular array represent the two sequences to be compared.
Calculation: Matrix
• Columns = residues of sequence 1
• Rows = residues of sequence 2.
A dot is plotted at every coordinate where the two residues (bases or amino acids) are similar.
Dot plot algorithm:
As an initial example of a dot plot, imagine the same sequence written onto two strips of chequered paper. Every symbol of the sequence is written consecutively into one chequer, with its index number next to it. By overlaying a frame containing a window that shows exactly one symbol of each strip at a time, symbols are compared in pairs. Whenever the symbols in the viewing window match, a bright dot is placed in a grid at the respective indices. The resulting rectangular graphical representation is a dot plot. It thus represents all possible comparisons of characters in the two sequences and is colour-coded with two colours indicating a match or mismatch between any two characters.
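The pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular tool: the function names `dot_matrix` and `render` are hypothetical, and the layout (columns for sequence 1, rows for sequence 2) follows the convention given earlier.

```python
def dot_matrix(seq1, seq2):
    """Return a grid with one row per residue of seq2 and one column per residue of seq1.

    Cell [i][j] is True when seq2[i] == seq1[j] (a residue-by-residue comparison).
    """
    return [[a == b for a in seq1] for b in seq2]


def render(matrix, seq1, seq2):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    lines = ["  " + seq1]
    for residue, row in zip(seq2, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Comparing a sequence with itself always fills the main diagonal.
seq = "GATTACA"
print(render(dot_matrix(seq, seq), seq, seq))
```

Even in this short example, dots appear off the diagonal wherever the same letter occurs twice, which is the background noise discussed next.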
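Problem: The plot becomes too noisy when we compare large and similar sequences.
Solution: Use a sliding window (size = w) and a cut-off (value = v), so that a dot is drawn only when the match score within a window of w residues reaches the cut-off v.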
Window size: A residue-by-residue comparison (window size = 1) would undoubtedly result in a very noisy background, because many positions in the two sequences of interest match by chance. For DNA sequences the background noise will be even more dominant, because with only four nucleotides a chance match is very likely. Moreover, a residue-by-residue comparison (window size = 1) can be very time consuming and computationally demanding. Increasing the window size will make the dot plot smoother (default window size = 10).
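A sliding-window filter of this kind might look like the sketch below. The parameter names `w` and `v` follow the text; the function name `windowed_dot_matrix`, the use of a plain identity count as the score, and the default `v=8` are assumptions made for illustration.

```python
def windowed_dot_matrix(seq1, seq2, w=10, v=8):
    """Place a dot at (i, j) when at least v of the w aligned residues
    starting at seq2[i] and seq1[j] are identical.

    Rows follow seq2 and columns follow seq1, as in the matrix layout above.
    """
    rows = len(seq2) - w + 1
    cols = len(seq1) - w + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Score of this window pair = number of identical positions.
            score = sum(seq2[i + k] == seq1[j + k] for k in range(w))
            row.append(score >= v)
        matrix.append(row)
    return matrix


# Compare the number of dots with and without a window.
a = "ACGTACGTTAGCAGT"
b = "ACGTACGATAGCAGT"
for w, v in [(1, 1), (4, 3)]:
    dots = sum(map(sum, windowed_dot_matrix(a, b, w, v)))
    print(f"w={w}, v={v}: {dots} dots")
```

With `w=1, v=1` the function reduces to the plain residue-by-residue comparison. Raising `w` and `v` keeps only runs of consecutive agreement.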
How do we choose a window size?
The window size depends on the goal of the analysis, for example:
– size of average exon
– size of average protein structural element
– size of gene promoter
– size of enzyme active site
How do we choose a threshold value?
• Threshold based on statistics
– using shuffled actual sequence
• find average (m) and s.d. (σ) of match scores of shuffled sequence
• convert original (unshuffled) scores (x) to Z scores
– Z = (x - m)/σ
• use threshold Z of 3 to 6
– using analysis of other sets of sequences
• provides “objective” standard of significance
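The shuffling procedure above can be sketched as follows. Several details are assumptions not stated in the text: only the second sequence is shuffled, the shuffle is repeated `rounds` times, the window score is an identity count, and the population standard deviation is used for σ. All function names are hypothetical.

```python
import random
import statistics


def window_scores(seq1, seq2, w):
    """Identity count for every pair of length-w windows."""
    return [
        sum(seq2[i + k] == seq1[j + k] for k in range(w))
        for i in range(len(seq2) - w + 1)
        for j in range(len(seq1) - w + 1)
    ]


def shuffled_background(seq1, seq2, w, rounds=20, seed=0):
    """Return mean (m) and s.d. (sigma) of window scores after shuffling seq2."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(rounds):
        shuffled = list(seq2)
        rng.shuffle(shuffled)
        scores.extend(window_scores(seq1, "".join(shuffled), w))
    return statistics.mean(scores), statistics.pstdev(scores)


def z_score(x, m, sigma):
    """Z = (x - m) / sigma, as in the formula above."""
    return (x - m) / sigma


seq1 = "ACGTTGCAAGCTTAGGCATCGA"
seq2 = seq1
w = 5
m, sigma = shuffled_background(seq1, seq2, w)
# Z score of a perfect window (all w positions identical).
print(z_score(w, m, sigma))
```

A window pair would then be plotted only if its Z score exceeds the chosen threshold (3 to 6 in the text). Note that σ is zero for a sequence made of a single repeated letter, so such input needs separate handling.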
How is a dot plot created?
Dot plots compare two sequences by placing one sequence on the x-axis and the other on the y-axis of a plot. Wherever the residue at a position on one axis matches the residue at a position on the other, a dot is drawn at the corresponding coordinate. Note that the sequences can be written backwards or forwards, but the sequences on both axes must be written in the same direction. Also note that the direction of the sequences on the axes determines the direction of the line on the dot plot. Once the dots have been plotted, runs of adjacent dots combine to form lines. The more similar the two sequences are, the closer the plot comes to a single unbroken diagonal line, which is what two identical sequences produce. This pattern is affected by certain sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts are treated here as covering insertions, deletions, and mutations. The presence of one or more of these features causes multiple lines to appear, in a configuration that depends on which features are present in the sequences. A feature that produces a very different result on the dot plot is a low-complexity region. Low-complexity regions are regions of the sequence made up of only a few different residue types, which makes that region highly redundant. These regions are typically found around the diagonal, and may or may not appear as a square in the middle of the dot plot.
What is a dot plot used for?
A dot plot is a two-dimensional matrix where each axis of the plot represents one sequence. By sliding a fixed-size window over the sequences and marking each sequence match with a dot in the matrix, a diagonal line will emerge if two identical (or highly similar) sequences are plotted against each other. Dot plots can also be used to visually inspect sequences for direct or inverted repeats or regions with low sequence complexity. Various smoothing algorithms can be applied to the dot plot calculation to avoid a noisy background in the plot. Moreover, various substitution matrices can be applied in order to take the evolutionary distance between the two sequences into account.
Using a dot plot graphic, we can identify the following features and differences between the sequences:
1. Matches
A match between sequences looks like a diagonal line on the dotplot graphic, representing the continuous match (or repeat).
2. Frame shifts
What is a frameshift mutation?
A frameshift mutation (also called a framing error or a reading frame shift) is a genetic mutation caused by the insertion or deletion (indel) of a number of nucleotides that is not divisible by three.
a. Mutations
Mutations are point differences between the sequences. On the graphic they appear as gaps in a diagonal line, where they interrupt a match.
b. Insertions
Insertions are parts of one sequence that are missing from the other, while the surrounding parts match. In other words, an insertion is a subsequence that was inserted into a sequence.
Graphically, an insertion is represented by a gap in the diagonal that lies along only one axis. A small additional shift along the other axis indicates that a mutation is also involved.
c. Deletions
A deletion is a subsequence that was deleted from a sequence.
A deletion in sequence A relative to sequence B can equally be viewed as an insertion in sequence B relative to sequence A.
Figure: frameshift features on a dot plot. (1) Match (2a) Mutation (2b) Insertion (2c) Deletion
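One way to see how an insertion moves the diagonal is to count window matches per diagonal offset (column index minus row index). The sketch below is illustrative only: the function name `diagonal_offsets`, the window size, and the example sequences are made up for this purpose.

```python
from collections import Counter


def diagonal_offsets(seq1, seq2, w=4):
    """Count exact length-w window matches per diagonal offset (j - i),
    where j indexes seq1 (columns) and i indexes seq2 (rows)."""
    counts = Counter()
    for i in range(len(seq2) - w + 1):
        for j in range(len(seq1) - w + 1):
            if seq2[i:i + w] == seq1[j:j + w]:
                counts[j - i] += 1
    return counts


original = "ATGGCCATTGTAATGGGCCGCTG"
# Hypothetical variant carrying a 3-nucleotide insertion after position 10.
with_insertion = original[:10] + "TTT" + original[10:]

counts = diagonal_offsets(with_insertion, original)
print(counts.most_common(2))
```

The matches before the insertion sit on offset 0 and those after it sit on offset 3, so the diagonal is displaced along one axis by the length of the insertion. Swapping the two arguments shows the same event as a deletion, with the offset changing sign.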
3. Inverted repeats
What are inverted repeat sequences?
Inverted repeats are copies of a nucleic acid sequence that are arranged in opposing orientation. They may lie adjacent to each other (tandem) or be separated by some sequence that is not part of the repeat (hyphenated). They may be true palindromic repeats, i.e. they read the same backwards as forwards, or complementary repeats, which read as the base complement in the opposite orientation. Complementary inverted repeats have the potential to form hairpin-loop or stem-loop structures, which result in cruciform structures (such as cruciform DNA) when the complementary inverted repeats occur in double-stranded regions.
How a dot plot shows inverted repeat sequences
A dot plot can be used to search for inverted repeats as well. Inverted repeats appear as diagonal lines that run in the opposite direction to those of direct repeats.
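For complementary inverted repeats in DNA, the opposite direction of the line can be illustrated by matching each window against the reverse complement of every other window. This is a sketch under assumptions: the function names, the window size, and the example sequence are hypothetical, and only the four unambiguous bases are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def inverted_repeat_hits(dna, w=4):
    """Return (i, j) pairs where dna[i:i+w] equals the reverse complement of dna[j:j+w]."""
    hits = []
    n = len(dna) - w + 1
    for i in range(n):
        for j in range(n):
            if dna[i:i + w] == reverse_complement(dna[j:j + w]):
                hits.append((i, j))
    return hits


# Hypothetical sequence: a 6-base arm, a spacer, then the arm's reverse complement.
arm = "ACGGTC"
dna = arm + "AAAA" + reverse_complement(arm)
print(inverted_repeat_hits(dna))
```

In the hits, `i` increases while `j` decreases, so `i + j` stays constant. On the plot this is a line running across the direction of the main diagonal.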
4. Low-complexity regions
What are low-complexity regions?
Regions within sequences that are composed of a lower diversity of residues (nucleotides or amino acids) compared to other areas can be defined as low complexity regions (LCRs). These regions can have many different configurations, from repetitive single amino acids (also known as homorepeats, single amino acid repeats, or homopolymeric regions; here abbreviated as HPRs) to aperiodic motifs of multiple residues, and are present in coding and non-coding areas of a genome.
How a dot plot shows low-complexity regions
A low-complexity region is a region produced by redundancy in a particular part of the sequence. It is represented on a plot as a rectangular area filled with the matches.
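The filled rectangle can be quantified as the fraction of matching cells inside a block of the self-comparison. The sketch below uses a hypothetical helper `block_density` and a made-up protein fragment containing a poly-glutamine stretch.

```python
def block_density(seq, start, end):
    """Fraction of matching cells in the square seq[start:end] x seq[start:end]
    of a residue-by-residue self-comparison. Assumes start < end."""
    region = seq[start:end]
    cells = [a == b for a in region for b in region]
    return sum(cells) / len(cells)


# Hypothetical fragment: two flanks of distinct residues around a poly-Q stretch.
seq = "MKVLAT" + "QQQQQQQQ" + "GHWERP"
print(block_density(seq, 6, 14))  # poly-Q block: every cell is a match
print(block_density(seq, 0, 6))   # flank: only the diagonal cells match
```

A homopolymer block has density 1, while a stretch of all-different residues has matches only on its diagonal.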
Application of dot plot
Dot plot applications are particularly useful in the identification of interspersed repeats such as transposons and tandem-repeat motifs such as microsatellites. Furthermore, loss or gain of whole motifs can easily be spotted in different types of domains, a trait useful in characterising the evolution of certain protein families. Dot plots are also employed in the investigation of properties of protein coding sequences by predicting secondary structures, like stem-loop formation or structural RNA domains. They can also be used to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed.
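For tandem repeats such as microsatellites, a self-comparison shows lines parallel to the main diagonal, spaced by the repeat period. A rough sketch of reading the period from those lines is given below. The function name `self_offsets`, the window size, and the example sequence are assumptions for illustration.

```python
from collections import Counter


def self_offsets(seq, w=3):
    """Count exact length-w window matches of seq against itself per offset (j - i),
    looking only above the main diagonal (j > i)."""
    counts = Counter()
    n = len(seq) - w + 1
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i:i + w] == seq[j:j + w]:
                counts[j - i] += 1
    return counts


# Hypothetical sequence with six copies of the trinucleotide CAG.
seq = "GATTC" + "CAG" * 6 + "TTGAC"
counts = self_offsets(seq)
print(counts.most_common(3))
```

The most populated off-diagonal offset corresponds to the repeat period (3 in this example), and further lines appear at multiples of it.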
Advantage
Readily reveals the presence of insertions/deletions and of direct and inverted repeats that are more difficult to find by other methods.
Disadvantage
Most dot matrix computer programs do not show an actual alignment. They also do not return a score to indicate how ‘optimal’ a given alignment is (so there is no statistical significance that could be tested).
Software to create dot plots
- ANACON – Contact analysis of dot plots.
- D-Genies – Specializes in interactive whole genome dotplots of large genomes.
- Dotlet – Provides a program allowing you to construct a dot plot with your own sequences.
- dotmatcher – Web tool to generate dot plots (and part of the EMBOSS suite).
- Dotplot – easy (educational) HTML5 tool to generate dot plots from RNA sequences.
- dotplot – R package to rapidly generate dot plots as either traditional or ggplot graphics.
- Dotter – Stand-alone program to generate dot plots.
- JDotter – Java version of Dotter.
- Flexidot – Customizable and ambiguity-aware dotplot suite for aesthetics, batch analyses and printing (implemented in Python).
- Gepard – Dot plot tool suitable for even genome scale.
- Genomdiff – An open source Java dot plot program for viruses.
- LAST – For whole-genome “split-alignment”.
- lastz and laj – Programs to prepare and visualize genomic alignments.
- yass – Web-based tool to generate (both forward and reverse complement) dot plots from genomic alignments.
- seqinr – R package to generate dot plots.
- SynMap – An easy to use, web-based tool to generate dotplots for many species with access to an extensive genome database. Offered by the comparative genomics platform CoGe.
- UGENE Dot Plot viewer – Opensource dot plot visualizer.