Homologue, orthologue and Paralogue-Bioinformatics
July 22, 2019
Both concepts are illustrated by the gene trees above (a gene tree is a type of phylogenetic tree depicting the evolutionary history of genes through time). Initially, there is a single gene (black) in a single lineage, the last common ancestor of species 1 and 2.
In the scenario on the left (1), this gene first undergoes a gene duplication event, and both copies (A and B in red and blue, respectively), while gradually evolving apart, persist in the last common ancestor until a subsequent speciation event splits the lineage into two new species. Both species inherit both copies of the duplicated gene, where they continue to diverge until the present day. Genes of different colors are paralogs because they are related through the initial gene duplication, regardless of whether they are found in the same (like 1A and 1B) or different species (like 1B and 2A). In contrast, genes sharing a color are more closely related through the speciation event and are therefore orthologs. By definition, orthologs can never be found in the same species.
The scenario on the right (2) shows what happens if the order of events is reversed: first, the ancestral lineage harboring the gene splits by speciation into two new species, both of which inherit the gene. Then independent gene duplication events create copies of the gene in each species. In this case, the genes found within each species are paralogous to each other. Between species, however, they are orthologous to both copies in the other species, because both are related through speciation first and then through gene duplication.
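The rule used in both scenarios (look at the event at which two genes last shared an ancestor) can be sketched in a few lines of Python. This is a minimal illustration, not a phylogenetics tool: the `Node` class, the `relationship` function, the internal node names and the leaf names used for scenario 2 (1A, 1B, 2A, 2B) are assumptions made for this example, and the event labels are entered by hand rather than inferred.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A gene tree node; internal nodes carry the event that split them."""
    name: str
    event: Optional[str] = None  # "duplication" or "speciation"; None for leaves
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> List[Node]:
    """Return the nodes from the root down to the named leaf (empty if absent)."""
    if root.name == leaf_name:
        return [root]
    for child in root.children:
        sub_path = path_to(child, leaf_name)
        if sub_path:
            return [root] + sub_path
    return []


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two different genes by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if not path_a or not path_b:
        raise ValueError("gene not found in tree")
    last_common = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        last_common = node_a
    return "orthologs" if last_common.event == "speciation" else "paralogs"


# Scenario 1: duplication first, then speciation (hand-labelled toy tree).
scenario_1 = Node("root", "duplication", [
    Node("copy A", "speciation", [Node("1A"), Node("2A")]),
    Node("copy B", "speciation", [Node("1B"), Node("2B")]),
])

# Scenario 2: speciation first, then one duplication in each species.
scenario_2 = Node("root", "speciation", [
    Node("species 1", "duplication", [Node("1A"), Node("1B")]),
    Node("species 2", "duplication", [Node("2A"), Node("2B")]),
])

for a, b in [("1A", "2A"), ("1A", "1B"), ("1B", "2A")]:
    print("scenario 1:", a, b, relationship(scenario_1, a, b))
for a, b in [("1A", "1B"), ("1A", "2A"), ("1A", "2B")]:
    print("scenario 2:", a, b, relationship(scenario_2, a, b))
```

In scenario 1 the pair 1B and 2A comes out as paralogs even though the genes sit in different species, while in scenario 2 every between-species pair comes out as orthologs, matching the description above.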
Whether two genes are orthologs or paralogs has important implications. In phylogenetics, orthologs can be used to trace the relatedness of organisms because orthologous gene trees are a reflection of the species tree. Orthologous genes are also often assumed to fulfill similar or identical roles in two organisms. While this is not necessarily true, establishing orthology can often provide a first hint at the function of a newly discovered gene by comparing it to its orthologs from well-studied species.
Homolog or homologue
A gene related to a second gene by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).
Ortholog or orthologue
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
Orthologues
Homologues which diverged by a speciation event. There are four types of orthologues:
1-to-1 orthologues (ortholog_one2one)
1-to-many orthologues (ortholog_one2many)
many-to-many orthologues (ortholog_many2many)
between-species paralogues – only as exceptions
Genes in different species and related by a speciation event are defined as orthologues. Depending on the number of genes found in each species, we differentiate among 1:1, 1:many and many:many relationships. Please, refer to the figure where there are examples of the three kinds.
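As a small illustration of how the three relationship types follow from gene counts, the sketch below maps the number of orthologous genes found in each of two species to the labels listed above. The function name `orthology_type` and its parameters are assumptions made for this example; only the three label strings come from the list above.

```python
def orthology_type(genes_in_species_a: int, genes_in_species_b: int) -> str:
    """Name the orthology relationship from the gene count in each species."""
    if genes_in_species_a < 1 or genes_in_species_b < 1:
        raise ValueError("each species needs at least one gene in the group")
    if genes_in_species_a == 1 and genes_in_species_b == 1:
        return "ortholog_one2one"
    if genes_in_species_a == 1 or genes_in_species_b == 1:
        return "ortholog_one2many"
    return "ortholog_many2many"


print(orthology_type(1, 1))  # ortholog_one2one
print(orthology_type(1, 3))  # ortholog_one2many
print(orthology_type(2, 2))  # ortholog_many2many
```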
Speciation
Speciation is the origin of a new species capable of making a living in a new way from the species from which it arose. As part of this process, the new species has also acquired some barrier to genetic exchange with the parent species.
Paralog or paralogue
Paralogs are genes related by duplication within a genome. Orthologs normally retain the same function in the course of evolution, whereas paralogs tend to evolve new functions, even if these are related to the original one.
Paralogues
Homologues which diverged by a duplication event. There are two types of paralogues:
same-species paralogues (within_species_paralog)
fragments of the same ‐predicted‐ gene (gene_split)
Genes of the same species and related by a duplication event are defined as paralogues. In the previous figure, Hsap2 and Hsap2′, and Mmus2 and Mmus2′ are two examples of within species paralogues. The duplication event relating the paralogues does not need to affect this species only. For example, Mmus2′ and Mmus3′ are also within species paralogues but the duplication event has occurred in the common ancestor between species Hsap (human) and species Mmus (mouse). The taxonomy level “times” the duplication event to the ancestor of Euarchontoglires.
Between species paralogues
A between species paralogue corresponds to a relation between genes of different species where the ancestor node has been labelled as a duplication node, e.g. Mmus1:Hsap2 or Mmus1:Hsap3. Currently, we only annotate between species paralogues when there is no better match for any of the genes, and the duplication is weakly supported (duplication confidence score ≤ 0.25).
Such cases can be the result of real duplications followed by gene losses (as shown in the picture below), but most of the time they result from a wrong gene tree topology with a spurious duplication node. Assembly errors are often behind these problems. It is not clear whether these genes are real orthologues or not, but they are the best available candidates (given the data), and we bend the definition of orthology to tag them as orthologues. They are flagged as “non-compliant with the gene tree”. People interested in phylogenetic analysis mixing the orthologies and the trees should probably use the set of tree-compliant orthologies.
Identifying homologs (paralogs and orthologs) through a bioinformatics approach
The most commonly used method to establish homology is through sequence similarity (sometimes, though not quite accurately, called sequence homology). In the absence of gene duplication events, it is reasonable to assume that orthologous genes fill an equivalent functional niche, and are more similar in sequence to each other than to any other gene. In a pairwise comparison, orthologs are therefore each other’s best match. This method, called the reciprocal best hit method, is easily implemented using BLAST. While one-to-one orthologs are normally reciprocal best BLAST hits, the reverse is not necessarily true: gene duplications and gene loss can lead to scenarios in which reciprocal best BLAST hits are actually paralogs. However, in simple cases, this method is very accurate, and in more complex cases it is still useful for acquiring a set of candidate orthologs.
To see how it works, consider the following exercise:
- Starting with the D. melanogaster Cytoplasmic Ribosomal Protein RpL30, find the best match in P. barbatus, using the BLAST tool on HGD’s Ant Portal as described in the chapter on BLAST searches (blastp against the Official Gene Set).
- Take the best hit (PB22887; you can acquire the protein sequence by clicking on the gene identifier) and BLAST it against the annotated protein dataset of D. melanogaster on FlyBase.
- Coming full circle, the best reciprocal hit turns out to be RpL30 (note that there are several alternative transcripts in the fly) — there seems to be only a single gene for RpL30 in both species, and they can be assumed to be orthologous to each other.
In the example above, the lack of paralogs is evident from the fact that neither genome harbors another gene that comes even close in similarity. The following examples show what happens if there is one. First, repeat the exercise above with RpL10Aa:
While P. barbatus provides only one single significant BLAST hit, PB25666, running the reciprocal BLAST search in D. melanogaster nets two genes, RpL10Aa and RpL10Ab. Moreover, the best reciprocal BLAST hit is RpL10Ab, not our starting gene RpL10Aa. This suggests that there are two paralogs in the fruit fly genome, and that RpL10Aa has diverged more from the ancestral gene than RpL10Ab. While it is possible that there were originally two copies in P. barbatus as well, one of which was subsequently lost, it is more parsimonious to assume that the fly paralogs came into being after the speciation event that split flies from ants. Both fly genes are therefore most likely orthologous to the P. barbatus gene.
The opposite case, a recent gene duplication leading to two paralogs in P. barbatus, both of which are orthologs to a single fly gene, is exemplified by RpLP0: There are two significant BLAST hits in P. barbatus, PB13254 and PB16486, both of which match the same gene in the fly.
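The logic of the reciprocal best hit exercises can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `best_hit` and `reciprocal_best_hit` are made up for this example, and the bit scores in the two tables are invented placeholders chosen only to reproduce the pattern described above (they are not real BLAST output). In practice the tables could be filled from BLAST tabular output, in which the bit score is the last of the twelve default columns.

```python
from typing import Dict, Optional

# query -> {subject -> bit score}
HitTable = Dict[str, Dict[str, float]]


def best_hit(query: str, hits: HitTable) -> Optional[str]:
    """Return the highest-scoring subject for a query, or None if it has no hits.

    Ties are broken arbitrarily (the first subject listed wins).
    """
    candidates = hits.get(query, {})
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def reciprocal_best_hit(query: str, forward: HitTable, reverse: HitTable) -> Optional[str]:
    """Return the query's best hit if that hit's own best hit is the query."""
    first = best_hit(query, forward)
    if first is None:
        return None
    return first if best_hit(first, reverse) == query else None


# Invented placeholder scores (NOT real data): fly proteins against the ant gene set.
fly_vs_ant: HitTable = {
    "RpL30": {"PB22887": 200.0},
    "RpL10Aa": {"PB25666": 300.0},
    "RpL10Ab": {"PB25666": 350.0},
}
# Invented placeholder scores (NOT real data): ant proteins against the fly gene set.
ant_vs_fly: HitTable = {
    "PB22887": {"RpL30": 200.0},
    "PB25666": {"RpL10Ab": 350.0, "RpL10Aa": 300.0},
}

for gene in ("RpL30", "RpL10Aa", "RpL10Ab"):
    print(gene, "->", reciprocal_best_hit(gene, fly_vs_ant, ant_vs_fly))
```

With these placeholder scores, RpL30 and RpL10Ab each recover a reciprocal best hit, whereas RpL10Aa returns `None` even though it is most likely orthologous to PB25666. This is why a missing reciprocal hit should be read as a prompt to look for paralogs rather than as evidence against orthology.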
Tree-based homology assessment
Multiple instances of speciation and gene duplication in several species can create a complex web of homology relations that is impossible to resolve with the reciprocal best hit method, and requires the reconstruction of a gene tree. After all, orthology and paralogy are defined in phylogenetic terms. We will discuss phylogenetic reconstruction methods in the next chapter, but the following figure may illustrate the principle:
In this gene tree of RpS7 from select Hymenopteran species (bees, wasps and ants), some species are represented by a single gene, others by two genes (species can be distinguished by the four-letter prefixes of the gene identifiers). The duplicate genes fall into two groups, each forming a subtree with almost identical topology (or shape), which mostly reflects the known evolutionary relationships of the represented ant species (Aech, Acep, Pbar, Sinv, Cflo).
This pattern suggests that a gene duplication occurred in the lineage leading to the last common ancestor of these five species, all of which therefore inherited two copies. Since the gene in Hsal is more closely related to one set of genes, the duplication might be even older, with the Hsal copy belonging to the other set having since been lost from its genome.
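The kind of reasoning applied to this tree can be imitated with a simple species-overlap heuristic: an internal node is called a duplication if the same species appears on both sides of it, and a speciation otherwise. The sketch below is one common heuristic and not necessarily the procedure behind the figure. The two toy trees and the gene identifiers (`Pbar_1`, `Sinv_2` and so on) are hypothetical and contain only three of the species; the one convention taken from the text is that the species is read from the four-letter prefix of the identifier.

```python
def species_of(gene_id: str) -> str:
    """Species code = four-letter prefix of the gene identifier."""
    return gene_id[:4]


def annotate(tree, events=None):
    """Walk a nested-tuple binary tree and label each internal node.

    Returns (species found under the tree, list of (event, species) tuples
    in post-order, so the root is the last entry).
    """
    if events is None:
        events = []
    if isinstance(tree, str):  # leaf: a single gene identifier
        return {species_of(tree)}, events
    left, right = tree
    left_species, _ = annotate(left, events)
    right_species, _ = annotate(right, events)
    # Species present on both sides imply two gene copies in their ancestor.
    event = "duplication" if left_species & right_species else "speciation"
    all_species = left_species | right_species
    events.append((event, sorted(all_species)))
    return all_species, events


# Hypothetical tree A: the Hsal gene groups with one of the two copies.
tree_a = (("Hsal_1", ("Pbar_1", "Sinv_1")), ("Pbar_2", "Sinv_2"))
# Hypothetical tree B: the Hsal gene branches off before the two copies split.
tree_b = ("Hsal_1", (("Pbar_1", "Sinv_1"), ("Pbar_2", "Sinv_2")))

for name, tree in (("A", tree_a), ("B", tree_b)):
    _, events = annotate(tree)
    print(name, events)
```

In tree A the duplication is placed at the root, which predates the split from Hsal and implies that one Hsal copy was lost. In tree B the root is a speciation and the duplication is confined to the Pbar and Sinv lineage. Moving a single gene in the tree therefore changes the inferred timing of the duplication.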
Gene trees are useful to visualize the homology relations of a large number of genes, but are time-consuming to reconstruct and can be prone to artifacts in the tree reconstruction process. Phylogenies are not definite results of an infallible method, but hypotheses. For example, if the position of the Hsal gene in the gene tree is inaccurate, we would misinterpret the timing of the gene duplication event. Finally, to make sense of gene trees, the underlying species tree has to be known, which is not always the case.