Phylogeny and Evolutionary Analysis Tutorial

March 17, 2024 By admin

Course Description: This course provides an introduction to phylogeny and its applications in evolutionary analysis. Students will learn about the principles of phylogenetics, various methods used in phylogenetic analysis, and how phylogeny is utilized to study evolutionary relationships among organisms. The course will also cover practical aspects of using bioinformatics tools for phylogenetic analysis.

Course Objectives:

Understand the principles of phylogeny and evolutionary relationships.
Learn about different methods used in phylogenetic analysis.
Gain practical skills in using bioinformatics tools for phylogenetic analysis.
Apply phylogenetic analysis to study evolutionary patterns and processes.

Introduction to Phylogeny

Overview of evolutionary biology and phylogenetics

Evolutionary biology is the study of how organisms have evolved and diversified over time. It encompasses various subfields, including phylogenetics, which is the study of evolutionary relationships among organisms. Phylogenetics uses genetic, morphological, and behavioral data to reconstruct the evolutionary history, or phylogeny, of organisms. Here’s an overview of evolutionary biology and phylogenetics:

Evolutionary Theory: Evolution is the change in heritable traits of biological populations over successive generations. Charles Darwin’s theory of evolution holds that all species have descended from common ancestors, with natural selection as the main driver of adaptation; later work added genetic drift, mutation, and gene flow as further mechanisms of evolutionary change.
Phylogenetics: Phylogenetics aims to reconstruct the evolutionary relationships among species or groups of organisms. It uses molecular data, such as DNA sequences, as well as morphological and behavioral traits, to infer the evolutionary tree or phylogeny.
Evolutionary Trees: An evolutionary tree represents the evolutionary relationships among a group of organisms. It shows the common ancestor and the branching points where species diverged from each other.
Taxonomy: Taxonomy is the science of classifying and naming organisms. Phylogenetics plays a crucial role in taxonomy by providing a scientific basis for classifying organisms based on their evolutionary relationships.
Applications: Phylogenetics has diverse applications, including understanding the evolutionary history of species, identifying new species, studying the spread of diseases, and conservation biology.
Methods: Phylogenetic analysis involves several methods, such as maximum likelihood, Bayesian inference, and distance-based methods, to reconstruct evolutionary trees and infer ancestral relationships.
Tools: There are many software tools available for phylogenetic analysis, including MEGA, PAUP*, MrBayes, and PhyML, which help researchers analyze molecular data and reconstruct phylogenetic trees.
Challenges: Phylogenetics faces challenges such as incomplete data, horizontal gene transfer, and the need for improved models of molecular evolution to accurately reconstruct evolutionary histories.

Overall, evolutionary biology and phylogenetics are fundamental to understanding the diversity of life on Earth and the processes that have shaped it over billions of years.

Historical development of phylogenetic theory

The historical development of phylogenetic theory can be traced back to the early ideas about the relationships among living organisms. Here’s an overview of the key developments in the history of phylogenetics:

Early Concepts: The concept of a “natural classification” based on similarities among organisms was proposed by early naturalists like Aristotle and later by Carl Linnaeus. However, these classifications were based on morphological similarities rather than evolutionary relationships.
Darwin’s Theory of Evolution: Charles Darwin’s theory of evolution, published in his book “On the Origin of Species” in 1859, provided a scientific explanation for the diversity of life on Earth. Darwin proposed that species evolved from common ancestors through a process of natural selection, which became a foundational concept for phylogenetics.
Cladistics: In the 20th century, German entomologist Willi Hennig developed the principles of cladistics, a method for reconstructing evolutionary trees based on shared derived characteristics, or synapomorphies, among taxa. Cladistics revolutionized phylogenetic analysis by focusing on evolutionary relationships rather than overall similarity.
Molecular Phylogenetics: With the advancement of molecular biology from the 1960s onward, researchers began using molecular data, first protein sequences and later DNA sequences, to infer evolutionary relationships. This led to the development of molecular phylogenetics, which provided a new perspective on phylogenetic relationships and helped resolve some long-standing evolutionary questions.
Computational Phylogenetics: The development of computational methods in the late 20th century, such as maximum likelihood and Bayesian inference, revolutionized phylogenetic analysis. These methods allowed researchers to analyze large datasets and complex evolutionary relationships more efficiently.
Phylogenomics and Big Data: Recent advances in sequencing technologies have led to the accumulation of vast amounts of genomic data, enabling phylogenetic analysis at a scale never before possible. Phylogenomics, the study of evolutionary relationships using genomic data, has become a major area of research in modern phylogenetics.
Current Challenges: Despite the advancements, phylogenetics still faces challenges, such as the accurate modeling of molecular evolution, dealing with incomplete or biased data, and integrating morphological and molecular data to reconstruct accurate phylogenies.

Overall, the historical development of phylogenetic theory reflects the evolution of our understanding of evolutionary relationships among organisms, from early morphological classifications to the modern integration of molecular data and computational methods.

Importance of phylogeny in understanding evolutionary relationships

Phylogeny is crucial for understanding evolutionary relationships among organisms. Here are some key reasons why phylogeny is important in the study of evolution:

Reconstructing Evolutionary History: Phylogenetics helps reconstruct the evolutionary history of species, showing how they are related through common ancestors. This information is essential for understanding the diversification and adaptation of life forms over time.
Classifying Organisms: Phylogenetic trees provide a framework for classifying organisms based on their evolutionary relationships. This helps in organizing and naming species in a systematic way, known as taxonomy.
Predicting Traits and Behaviors: By studying the evolutionary relationships among organisms, researchers can make predictions about the traits and behaviors of species. This is useful for understanding the ecological roles of organisms and predicting their responses to environmental changes.
Biogeography: Phylogenetics helps explain the distribution of organisms in different geographical regions. By studying the evolutionary history of species, researchers can infer how and when organisms migrated to different areas.
Conservation Biology: Understanding the evolutionary relationships among species is essential for conservation efforts. Phylogenetics helps delimit species and identify evolutionarily distinct lineages, prioritize conservation efforts, and understand the potential impacts of extinction on ecosystems.
Medical and Agricultural Applications: Phylogenetics is used in medical and agricultural research to study the evolution of pathogens, pests, and crop species. This information is valuable for developing strategies to control diseases and improve crop yields.
Understanding Biodiversity: By revealing the evolutionary relationships among organisms, phylogenetics provides insights into the origin and maintenance of biodiversity. This information is crucial for conservation and management of ecosystems.

Overall, phylogeny is a fundamental tool in evolutionary biology, providing a framework for understanding the history of life on Earth and its diversity.

Principles of Phylogenetics

Concepts of homology, analogy, and synapomorphy

Homology, analogy, and synapomorphy are concepts used in phylogenetics and evolutionary biology to describe different types of similarities and differences among organisms. Here’s a brief overview of each concept:

Homology: Homology refers to similarities between organisms that are due to shared ancestry. These similarities can be in terms of structures, such as organs or body parts, or in terms of genes and DNA sequences. Homologous traits are similar because they are inherited from a common ancestor, even if they have different functions in different organisms. For example, the forelimbs of vertebrates (such as humans, bats, and whales) are considered homologous structures, as they share a common ancestral structure despite their different functions (grasping, flying, and swimming, respectively).
Analogy: Analogy refers to similarities between organisms that are not due to shared ancestry but rather to convergent or parallel evolution; it is one form of homoplasy, a broader term that also covers evolutionary reversals. Analogous traits evolve independently in different lineages, often in response to similar environmental pressures. These traits may serve similar functions but do not share a common evolutionary origin. For example, the wings of birds and bats are analogous as wings: powered flight evolved independently in the two groups, even though the forelimb bones that support the wings are homologous.
Synapomorphy: Synapomorphy is a shared derived trait that is unique to a specific group of organisms and their common ancestor. Synapomorphies are used in phylogenetic analysis to infer evolutionary relationships and define monophyletic groups, or clades. For example, the presence of feathers is a synapomorphy of birds and their closest dinosaurian relatives, indicating their common ancestry.

In summary, homology refers to similarities due to shared ancestry, analogy refers to similarities due to convergent evolution, and synapomorphy refers to shared derived traits that define evolutionary relationships among organisms. Understanding these concepts is essential for reconstructing evolutionary relationships and studying the patterns and processes of evolution.

Construction of phylogenetic trees

Phylogenetic trees, also called evolutionary trees, depict the evolutionary relationships among organisms. (A cladogram is a tree that shows only the branching order, without meaningful branch lengths.) Constructing a phylogenetic tree involves several steps:

Data Collection: Gather molecular (DNA, RNA) or morphological data (physical characteristics) from the organisms of interest. The choice of data depends on the organisms and the evolutionary relationships being studied.
Alignment: For molecular data, align the sequences to ensure that homologous positions are correctly matched. This step is crucial for accurate phylogenetic inference.
Model Selection: Choose an appropriate evolutionary model that describes the substitution pattern of nucleotides or amino acids in the aligned sequences. Common models include the Jukes-Cantor, Kimura, and GTR models for nucleotide sequences, and the JTT and WAG models for protein sequences.
Phylogenetic Inference: Use a phylogenetic inference method to construct the tree. Common methods include distance-based methods (e.g., Neighbor Joining), maximum parsimony, maximum likelihood, and Bayesian inference. Maximum likelihood and Bayesian inference use the chosen evolutionary model directly, distance-based methods use it to correct the pairwise distances, and maximum parsimony uses no explicit substitution model; each method then searches for the tree that is best under its own criterion.
Tree Evaluation: Assess the robustness of the tree using bootstrap analysis or posterior probability values. Bootstrap analysis resamples the alignment columns with replacement to generate many pseudo-replicate datasets, builds a tree from each, and reports how often each branch of the original tree is recovered.
Tree Visualization: Visualize the phylogenetic tree using tree-drawing software. The tree is typically represented as a branching diagram, with branches indicating evolutionary relationships and branch lengths representing the amount of evolutionary change.
Interpretation: Interpret the tree to understand the evolutionary relationships among the organisms. This involves identifying clades (groups of organisms that share a common ancestor) and inferring evolutionary events such as speciation and gene duplication events.
Tree Editing: Sometimes, the tree may need to be edited or pruned to improve clarity or highlight specific relationships. This can be done using tree-editing software.
Tree Annotation: Add labels, annotations, and scale bars to the tree to make it more informative and easier to interpret.
Publication: Finally, the phylogenetic tree can be included in scientific publications or presentations to communicate the evolutionary relationships among the organisms studied.

Constructing phylogenetic trees is a complex process that requires careful consideration of the data, evolutionary models, and inference methods. The resulting trees provide valuable insights into the evolutionary history and relationships among organisms.
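Two of the steps above, model-based distance correction and bootstrap resampling, can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated package: the taxon names and sequences are made up, gaps are simply skipped, and the only model shown is Jukes-Cantor, whose corrected distance is d = -(3/4) ln(1 - 4p/3) for an observed proportion p of differing sites.

```python
import math
import random

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(1 for a, b in pairs if a != b) / len(pairs)

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap pseudo-replicate)."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Hypothetical toy alignment, for illustration only.
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACGTTT",
}

print(p_distance(alignment["taxon_A"], alignment["taxon_B"]))
print(jc69_distance(alignment["taxon_A"], alignment["taxon_B"]))
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

The corrected distance is always at least as large as the raw proportion of differences, because the correction accounts for multiple substitutions at the same site. In a real bootstrap analysis the tree would be rebuilt from each replicate, typically hundreds of times.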

Tree-thinking and interpretation of phylogenetic trees

Tree-thinking refers to the ability to interpret and understand phylogenetic trees as representations of evolutionary relationships among organisms. Here are key concepts and skills involved in tree-thinking and interpreting phylogenetic trees:

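Branches: Branches in a phylogenetic tree represent evolutionary lineages. In trees drawn with scaled branch lengths (phylograms), longer branches indicate more evolutionary change and shorter branches less; in cladograms, branch lengths carry no meaning and only the branching order is informative.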
Nodes: Nodes, or branch points, represent points in evolutionary history where a common ancestor split into two or more lineages. Nodes are important for defining monophyletic groups, or clades, which include an ancestor and all its descendants.
Root: The root of the tree represents the most recent common ancestor of all the organisms in the tree. The root is often inferred based on outgroup analysis, where an external group is used to root the tree.
Clades: Clades are monophyletic groups that include an ancestor and all its descendants. Clades are identified based on shared derived traits, or synapomorphies, that are unique to the group.
Sister Taxa: Sister taxa are two clades that are each other’s closest relatives, sharing a common ancestor not shared with any other taxa in the tree.
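Polytomy: A polytomy is a node in a tree where more than two branches emerge. It usually indicates uncertainty about the exact branching order (a soft polytomy), but it can also represent a hypothesis that several lineages diverged at essentially the same time (a hard polytomy).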
Homoplasy: Homoplasy refers to similarities between organisms that are not due to shared ancestry (analogous traits). Homoplasy can lead to misleading interpretations of phylogenetic trees and needs to be carefully considered in tree interpretation.
Character Evolution: Phylogenetic trees can be used to study the evolution of characters (traits) along branches. This can help infer ancestral states and understand the patterns of character evolution within a group.
Divergence Times: Some phylogenetic trees include information about the estimated divergence times between lineages. This information can provide insights into the timing of evolutionary events.
Tree Confidence: The confidence in the branching pattern of a tree can be assessed using statistical measures such as bootstrap support values or posterior probabilities. Higher values indicate greater confidence in the inferred relationships.

Tree-thinking is a fundamental skill in evolutionary biology and is essential for interpreting and communicating the complex relationships among organisms depicted in phylogenetic trees.
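The notions of clade and monophyly can be made concrete with a small Python sketch. It assumes a rooted tree written as nested tuples, where a string is a tip and a tuple is an internal node; this representation and the taxon names are illustrative choices, not a standard format.

```python
def leaves(node):
    """Set of tip names below a node (a tip is a string, an internal node a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clades(node):
    """Leaf sets of every node in the tree, tips included."""
    if isinstance(node, str):
        return {frozenset([node])}
    found = {leaves(node)}
    for child in node:
        found |= clades(child)
    return found

def is_monophyletic(tree, taxa):
    """True if the taxa are exactly the descendants of some node in the tree."""
    return frozenset(taxa) in clades(tree)

# Hypothetical rooted tree: ((human, chimp), gorilla) is sister to (mouse, rat).
tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))

print(is_monophyletic(tree, {"human", "chimp"}))             # True
print(is_monophyletic(tree, {"human", "gorilla"}))           # False
print(is_monophyletic(tree, {"human", "chimp", "gorilla"}))  # True
```

The group {human, gorilla} is not monophyletic in this tree because their most recent common ancestor also gave rise to chimp, so any clade containing both must contain all three.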

Methods in Phylogenetic Analysis

Molecular vs. morphological phylogenetics

Molecular phylogenetics and morphological phylogenetics are two approaches used to reconstruct evolutionary relationships among organisms. Here’s a comparison of the two approaches:

Data Type:
- Molecular Phylogenetics: Uses molecular data, such as DNA sequences, RNA sequences, or protein sequences, to infer evolutionary relationships. This data is often more directly related to the genetic history of organisms.
- Morphological Phylogenetics: Uses morphological data, such as physical characteristics (e.g., anatomy, morphology), to infer evolutionary relationships. This data is based on observable traits of organisms.
Evolutionary Rate:
- Molecular Phylogenetics: Molecular data can evolve at different rates, depending on the gene or region being studied. Some molecular data evolve rapidly, making them useful for studying recent evolutionary events, while others evolve more slowly, providing insights into deeper evolutionary relationships.
- Morphological Phylogenetics: Morphological traits also evolve at different rates, but these rates are harder to quantify and to compare across characters than molecular rates. Slowly changing, conserved structures can be informative about ancient relationships, and morphology is usually the only data available for fossil taxa.
Accuracy:
- Molecular Phylogenetics: Molecular data can provide more accurate estimates of evolutionary relationships, especially at deeper nodes in the tree where morphological data may be more variable or subject to convergent evolution.
- Morphological Phylogenetics: Morphological data can be less accurate, especially for distantly related species or when morphological traits are subject to convergence or homoplasy.
Data Availability:
- Molecular Phylogenetics: Molecular data are more readily available for a wide range of organisms due to advances in sequencing technologies. This has made molecular phylogenetics a popular choice for studying evolutionary relationships.
- Morphological Phylogenetics: Morphological data are often more limited, especially for extinct or poorly studied species. However, advances in imaging techniques have improved the availability of morphological data.
Complementary Approaches:
- Combined Approach: In many cases, molecular and morphological data are used together to reconstruct phylogenetic trees. This approach can provide more robust and comprehensive phylogenies, as it combines the strengths of both data types.

In summary, molecular phylogenetics and morphological phylogenetics each have their strengths and limitations. The choice of approach depends on the research question, the availability of data, and the evolutionary scale being studied.

Distance-based methods: Neighbor Joining

Distance-based methods, such as Neighbor Joining (NJ), are used in phylogenetics to construct phylogenetic trees based on genetic distances between sequences. Here’s an overview of the Neighbor Joining method:

Distance Matrix: The first step in Neighbor Joining is to calculate a pairwise distance matrix that represents the genetic distances between all pairs of sequences in the dataset. The distance can be based on genetic differences, such as nucleotide or amino acid substitutions, or on other measures of dissimilarity.
Tree Construction:
- Initialization: Start with a star-like tree, where all sequences are connected to a central node.
- Iteration:
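  1. For every pair of nodes i and j, compute the rate-corrected criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where n is the current number of nodes and r(i) is the sum of the distances from i to all other nodes. Select the pair with the smallest Q(i,j); this is not necessarily the pair with the smallest raw distance in the distance matrix.
  2. Join these two nodes to a new internal node u in the tree, creating a bifurcation (branching point). The branch length from i to u is d(i,u) = d(i,j)/2 + [r(i) - r(j)] / [2(n - 2)], and the branch length from j to u is d(j,u) = d(i,j) - d(i,u).
  3. Calculate the distance from the new internal node u to every other node k in the tree using the formula d(u,k) = [d(i,k) + d(j,k) - d(i,j)] / 2.
  4. Update the distance matrix by removing the rows and columns corresponding to the joined nodes and adding a new row and column for the new internal node.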
- Termination: Repeat the iteration until all sequences are joined, resulting in a fully resolved phylogenetic tree.
Tree Rooting: The Neighbor Joining method does not provide a rooted tree, so the tree needs to be rooted using an outgroup or by other methods.
Advantages:
- Neighbor Joining is computationally efficient and can handle large datasets.
- It does not assume a molecular clock (i.e., that genetic change occurs at a constant rate over time), so it tolerates rate differences among lineages.
Limitations:
- Neighbor Joining can be sensitive to the choice of distance metric and may not always produce the most accurate tree.
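- It reduces the alignment to a matrix of pairwise distances, so site-level information is discarded; a model of evolution enters only through the distance correction, not through the tree-building step itself.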

Despite its limitations, Neighbor Joining is a widely used method for reconstructing phylogenetic trees, especially for preliminary analyses or when computational resources are limited.
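The Neighbor Joining procedure can be sketched compactly in Python. The version below is a teaching sketch under stated assumptions: at each step it joins the pair that minimizes the standard criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), it expects at least three taxa, the internal node labels (node0, node1, ...) are arbitrary, and the five-taxon distance matrix is a made-up additive example rather than real data.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return (child, parent, branch_length) edges of an unrooted NJ tree.

    names: list of taxon labels (at least three).
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    """
    d = dict(dist)          # working copy of the pairwise distances
    active = list(names)
    edges = []
    next_id = 0

    def get(x, y):
        return 0.0 if x == y else d[frozenset((x, y))]

    while len(active) > 3:
        n = len(active)
        totals = {x: sum(get(x, y) for y in active) for x in active}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(active, 2),
                   key=lambda p: (n - 2) * get(*p) - totals[p[0]] - totals[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * get(i, j) + (totals[i] - totals[j]) / (2 * (n - 2))
        lj = get(i, j) - li
        edges += [(i, u, li), (j, u, lj)]
        # Distances from u to every remaining node k.
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (get(i, k) + get(j, k) - get(i, j))
        active = [k for k in active if k not in (i, j)] + [u]

    # Connect the last three nodes to a central node.
    x, y, z = active
    centre = f"node{next_id}"
    edges.append((x, centre, 0.5 * (get(x, y) + get(x, z) - get(y, z))))
    edges.append((y, centre, 0.5 * (get(x, y) + get(y, z) - get(x, z))))
    edges.append((z, centre, 0.5 * (get(x, z) + get(y, z) - get(x, y))))
    return edges

# Hypothetical additive distance matrix for five taxa.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): float(value) for pair, value in raw.items()}

for child, parent, length in neighbor_joining(["a", "b", "c", "d", "e"], dist):
    print(f"{child} -> {parent}: {length:.2f}")
```

In this example the closest pair by raw distance is d and e (distance 3), yet the first pair joined is a and b, which illustrates why the criterion Q is used instead of the smallest raw distance. The result is an unrooted tree, so it still has to be rooted separately, for instance with an outgroup.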

Character-based methods: Maximum Parsimony

Maximum Parsimony (MP) is a character-based method used in phylogenetics to reconstruct phylogenetic trees based on the principle of parsimony, which states that the tree requiring the fewest evolutionary changes is preferred. Here’s an overview of the Maximum Parsimony method:

Character Data: Maximum Parsimony uses discrete character data, such as nucleotide or amino acid sequences, morphological traits, or binary data (presence/absence of traits), to reconstruct phylogenetic trees. Each character takes one of a finite set of states (for example, 0 or 1 for a binary trait, or A, C, G, T for a nucleotide site).
Optimization Criterion: The goal of Maximum Parsimony is to find the tree topology that minimizes the total number of evolutionary changes (or character state changes) required to explain the observed data. This is done by finding the tree that requires the fewest evolutionary steps to explain the observed character states at the tips of the tree.
Tree Construction:
- Initialization: Start with an unrooted tree or an arbitrary root.
- Iteration:
  1. For each possible tree topology, calculate the number of evolutionary changes required to explain the observed character states.
  2. Select the tree topology with the fewest number of evolutionary changes as the best tree.
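- Termination: Repeat the iteration until all possible tree topologies have been evaluated or until a stopping criterion is met. Because the number of topologies grows super-exponentially with the number of taxa, exhaustive evaluation is feasible only for small datasets; larger analyses rely on heuristic searches that may miss the optimal tree.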
Tree Evaluation: The best tree is the one that requires the fewest evolutionary changes (i.e., has the lowest parsimony score). The parsimony score is the total number of evolutionary changes required to explain the observed data.
Advantages:
- Maximum Parsimony is conceptually simple and easy to understand.
- It is computationally efficient for small to moderate-sized datasets.
- The parsimony principle can also be extended to other kinds of events; for example, reconciliation methods count gene duplications, losses, and horizontal gene transfer events.
Limitations:
- Maximum Parsimony does not take into account the evolutionary model or the rate of evolution, which can lead to incorrect tree inference, especially when the evolutionary rate varies among sites or lineages (for example, long-branch attraction can group fast-evolving lineages together incorrectly).
- It can be sensitive to homoplasy (convergence or parallel evolution), which can result in misleading tree topologies.

Despite its limitations, Maximum Parsimony is still widely used, especially for preliminary analyses or when the assumptions of other methods are not met. It can provide valuable insights into evolutionary relationships, especially when combined with other phylogenetic methods or when used to test specific hypotheses.
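Scoring a single tree under parsimony can be illustrated with Fitch's algorithm, which counts the minimum number of state changes for each character on a given binary tree. The sketch below assumes unordered characters with equal cost for every change, represents trees as nested two-element tuples, and uses a made-up four-taxon alignment; with four taxa there are only three unrooted topologies, so all of them can be scored exhaustively.

```python
def fitch(node, states):
    """Return (candidate_state_set, change_count) for one character on a binary tree."""
    if isinstance(node, str):           # a tip: its observed state, no changes yet
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                          # children agree: no extra change needed
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all sites (lower is better)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical toy alignment, for illustration only.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}

# The three possible topologies for four taxa (the root position does not
# affect the Fitch score).
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(tree, alignment) for name, tree in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

Here the tree grouping A with B and C with D needs 3 changes, while the other two topologies need 5 each, so it is the most parsimonious of the three.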

Model-based methods: Maximum Likelihood and Bayesian Inference

Model-based methods, such as Maximum Likelihood (ML) and Bayesian Inference (BI), are widely used in phylogenetics to reconstruct phylogenetic trees based on probabilistic models of sequence evolution. Here’s an overview of these two methods:

Maximum Likelihood (ML):
- Model Selection: ML begins with selecting an appropriate model of sequence evolution that best fits the data. Common models include the General Time Reversible (GTR) model for nucleotide sequences and the Jones-Taylor-Thornton (JTT) model for protein sequences.
- Likelihood Calculation: For a given tree topology and model of evolution, ML calculates the likelihood of the observed sequence data. The likelihood is a measure of how well the model explains the observed data.
- Tree Search: ML searches for the tree topology that maximizes the likelihood of the data. This is done by exploring different tree topologies and branch lengths to find the one that best fits the data.
- Branch Support: ML can also calculate branch support values, such as bootstrap values, which indicate the confidence in the inferred branches of the tree.
Bayesian Inference (BI):
- Prior and Posterior Probability: BI starts with specifying a prior probability distribution over the space of possible trees. The prior reflects our initial beliefs about the tree topology before seeing the data. BI then calculates the posterior probability of each tree given the data, which combines the prior and the likelihood of the data under the model of evolution.
- Markov Chain Monte Carlo (MCMC): BI uses MCMC algorithms to sample trees from the posterior distribution. MCMC explores the space of possible trees, converging towards the posterior distribution of trees that best explain the data.
- Tree Reconstruction: The sampled trees, usually after an initial burn-in portion is discarded, are summarized as a consensus tree (or a single representative tree), with the posterior probability of each clade given by the fraction of sampled trees that contain it.
Advantages:
- ML and BI are based on explicit models of sequence evolution, allowing for more realistic and flexible modeling of evolutionary processes.
- These methods can account for heterogeneity in evolutionary rates among sites, as well as other complexities such as substitution rate variation among lineages.
Limitations:
- ML and BI can be computationally intensive, especially for large datasets or complex models.
- They rely on the accuracy of the selected model of sequence evolution, which can affect the accuracy of the inferred tree.

Both Maximum Likelihood and Bayesian Inference are powerful methods for phylogenetic tree reconstruction and are widely used in evolutionary biology due to their ability to provide well-supported and statistically rigorous estimates of phylogenetic relationships.
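The likelihood calculation at the heart of both methods can be sketched with Felsenstein's pruning algorithm. The Python below is a simplified illustration under several assumptions: the Jukes-Cantor model with equal base frequencies, no rate variation among sites, fixed (not optimized) branch lengths, and a made-up alignment. A tip is a taxon name and an internal node is a list of (child, branch_length) pairs; that representation is an arbitrary choice for this sketch.

```python
import math

BASES = "ACGT"

def jc69_prob(a, b, t):
    """P(base b at the end of a branch | base a at its start), branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partials(node, site, seqs):
    """Conditional likelihoods P(tip data below node | node has base x), for each x."""
    if isinstance(node, str):
        return {x: 1.0 if seqs[node][site] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        child_l = partials(child, site, seqs)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, t) * child_l[y] for y in BASES)
    return result

def log_likelihood(tree, seqs):
    """Log-likelihood of the alignment, assuming sites evolve independently."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(n_sites):
        root = partials(tree, site, seqs)
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total

# Hypothetical toy data; every branch is given the same arbitrary length.
seqs = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}
t = 0.1
tree_ab_cd = [([("A", t), ("B", t)], t), ([("C", t), ("D", t)], t)]
tree_ac_bd = [([("A", t), ("C", t)], t), ([("B", t), ("D", t)], t)]

print("lnL ((A,B),(C,D)):", log_likelihood(tree_ab_cd, seqs))
print("lnL ((A,C),(B,D)):", log_likelihood(tree_ac_bd, seqs))
```

A full maximum likelihood analysis would also optimize the branch lengths and model parameters for each candidate topology before comparing likelihoods, and a Bayesian analysis would combine this likelihood with priors inside an MCMC sampler.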

Using Phylogeny in Evolutionary Analysis

Comparative methods in phylogenetics

Comparative methods in phylogenetics are used to study evolutionary processes and patterns across different species. These methods leverage phylogenetic trees to infer ancestral states, test hypotheses about trait evolution, and understand the mechanisms driving evolutionary change. Here are some common comparative methods used in phylogenetics:

Ancestral State Reconstruction: Ancestral state reconstruction infers the ancestral states of traits or characters at internal nodes of a phylogenetic tree. This can help understand how traits have evolved over time and provide insights into the evolutionary history of traits.
Character Evolution Models: Character evolution models are used to describe the process of trait evolution along a phylogenetic tree. These models can incorporate parameters such as the rate of change, the direction of change, and the likelihood of different types of changes (e.g., gains, losses, or reversals).
Comparative Phylogenetic Methods: Comparative phylogenetic methods, such as phylogenetic independent contrasts (PICs) and phylogenetic generalized least squares (PGLS), are used to account for the non-independence of species due to shared evolutionary history when analyzing trait data. These methods are important for testing hypotheses about trait evolution while accounting for phylogenetic relationships.
Trait Evolutionary Rate Estimation: Some comparative methods aim to estimate the rate of evolution of traits across a phylogenetic tree. This can help identify traits that have evolved rapidly or slowly compared to other traits, providing insights into the factors driving trait evolution.
Diversification Analysis: Diversification analysis uses phylogenetic trees to study patterns of speciation and extinction rates across different lineages. These analyses can help understand the processes that have shaped the diversity of life on Earth.
Phylogenetic Comparative Methods in Ecology: In ecological studies, comparative methods are used to understand how ecological traits (e.g., body size, habitat preference) have evolved in relation to environmental factors and to predict how species might respond to environmental change.
Model Selection and Hypothesis Testing: Comparative methods often involve comparing different evolutionary models to test hypotheses about trait evolution. Model selection criteria, such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC), are used to determine the best-fitting model.

Overall, comparative methods in phylogenetics provide powerful tools for studying trait evolution, understanding the processes driving evolutionary change, and testing hypotheses about the evolution of life on Earth.
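Model selection with AIC and BIC reduces to simple arithmetic once each model's maximized log-likelihood is known: AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the number of observations. The sketch below uses invented log-likelihood values and hypothetical model names purely to show the comparison; in practice the values would come from fitting each model to the trait data and tree.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion for n observations (lower is better)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits of two trait-evolution models to the same 40 species.
n_species = 40
fits = {
    "simple_model": {"log_lik": -52.3, "k": 2},
    "complex_model": {"log_lik": -49.1, "k": 3},
}

for name, fit in fits.items():
    print(name,
          "AIC =", round(aic(fit["log_lik"], fit["k"]), 2),
          "BIC =", round(bic(fit["log_lik"], fit["k"], n_species), 2))

best_by_aic = min(fits, key=lambda m: aic(fits[m]["log_lik"], fits[m]["k"]))
print("preferred by AIC:", best_by_aic)
```

With these made-up numbers the extra parameter improves the log-likelihood enough to lower both criteria, so the more complex model is preferred; with a smaller improvement the penalty term would favor the simpler one.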

Molecular clock and divergence time estimation

The molecular clock is a concept in molecular evolution that suggests that substitutions accumulate in a given gene or genomic region at a relatively constant rate over time. This concept forms the basis for estimating divergence times between species or lineages based on molecular data, such as DNA or protein sequences. Here’s an overview of molecular clock and divergence time estimation:

Assumptions of the Molecular Clock:
- The molecular clock assumes that the rate of molecular evolution is constant across lineages and over time. However, this assumption may not always hold true, as the rate of molecular evolution can vary among different genes, species, and evolutionary timescales.
- Violations of the molecular clock assumption, such as changes in the evolutionary rate, can occur due to factors such as differences in generation time, mutation rate, and population size, as well as natural selection.
Types of Molecular Clocks:
- Strict Clock: Assumes a constant rate of molecular evolution across all lineages and over time.
- Relaxed Clock: Allows for variation in the evolutionary rate among lineages or genes. This approach is more realistic and can account for rate variation in different parts of the tree.
Methods for Divergence Time Estimation:
- Distance-Based Methods: Use genetic distances between sequences to estimate divergence times. These methods assume a constant rate of molecular evolution and are often used for rough estimates of divergence times.
- Likelihood and Bayesian Methods: Use phylogenetic trees and models of molecular evolution to estimate divergence times. These methods can account for rate variation among lineages and provide more accurate estimates of divergence times.
- Calibration Points: Divergence time estimation often relies on calibration points, which are known divergence times based on fossil evidence or biogeographic events. These calibration points are used to calibrate the molecular clock and estimate divergence times for other parts of the tree.
Factors Affecting Divergence Time Estimates:
- The choice of molecular clock model and evolutionary model can affect divergence time estimates.
- The quality and availability of calibration points can also impact the accuracy of divergence time estimates.
- Rate variation among lineages and genes, as well as the presence of molecular evolutionary processes (e.g., selection), can complicate divergence time estimation.
Applications of Divergence Time Estimation:
- Divergence time estimation is used to study the timing of evolutionary events, such as speciation events, and to infer the evolutionary history of organisms.
- It is also used in fields such as biogeography and comparative genomics to understand patterns and processes of evolution.

Overall, molecular clock and divergence time estimation provide valuable tools for studying the evolutionary history of life on Earth and can help answer fundamental questions about the origins and diversification of species.
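Under a strict clock, the arithmetic of distance-based dating is straightforward. Two lineages that split t time units ago have each accumulated substitutions along their own branch, so the pairwise distance is d = 2rt, where r is the rate per lineage. A calibration point with a known divergence time gives r, which can then be applied to other pairs. The function names and all numbers in this sketch are hypothetical.

```python
def calibrate_rate(distance, known_time):
    """Substitutions per site per unit time along one lineage, from d = 2 * r * t."""
    return distance / (2.0 * known_time)

def divergence_time(distance, rate):
    """Time since two lineages split, assuming a strict clock with the given rate."""
    return distance / (2.0 * rate)

# Hypothetical calibration: a pair with distance 0.10 substitutions per site
# whose split is dated to 10 million years ago by a fossil.
rate = calibrate_rate(distance=0.10, known_time=10.0)

# Apply the calibrated rate to another pair with distance 0.04.
estimate = divergence_time(distance=0.04, rate=rate)

print("rate per lineage per Myr:", rate)
print("estimated divergence (Myr):", round(estimate, 3))
```

This estimate is only as good as the strict-clock assumption and the calibration; relaxed-clock methods replace the single rate with a distribution of rates across branches.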

Phylogeography and biogeography

Phylogeography and biogeography are related fields in biology that study the distribution of species and the processes that have shaped their geographic ranges over time. While they share some similarities, they differ in their focus and methods:

Phylogeography:
- Phylogeography focuses on the geographic distribution of genetic lineages within a species and how historical processes, such as population divergence, migration, and isolation, have influenced the genetic structure of populations.
- It uses molecular data, such as DNA sequences, to reconstruct the evolutionary history of populations and infer patterns of migration, colonization, and population expansion.
- Phylogeography can provide insights into the evolutionary processes that have shaped genetic diversity within species, including the impact of past climate change, geological events, and human activities.
Biogeography:
- Biogeography focuses on the distribution of species and the factors that have influenced their geographic ranges, including historical events, ecological factors, and evolutionary processes.
- It considers patterns at the level of species or higher taxa and aims to understand the processes driving the distribution of biodiversity across different regions and habitats.
- Biogeography integrates data from various disciplines, including geology, ecology, paleontology, and genetics, to study patterns of species distribution and diversity.
Relationship between Phylogeography and Biogeography:
- Phylogeography provides a molecular perspective on biogeographic patterns, helping to understand how historical processes have influenced the genetic diversity of species within regions.
- Biogeography provides a broader context for understanding phylogeographic patterns, considering factors such as dispersal barriers, environmental gradients, and historical biogeographic events that have shaped the distribution of species over evolutionary time scales.
Applications:
- Phylogeography is used to study the evolutionary history of species, track the spread of pathogens, and inform conservation efforts by identifying genetically distinct populations.
- Biogeography is used to study patterns of biodiversity, assess the impact of climate change on species distributions, and inform conservation planning and habitat restoration efforts.

Overall, phylogeography and biogeography are complementary fields that provide important insights into the processes driving the distribution of species and genetic diversity across the planet. They both contribute to our understanding of how species have evolved and adapted to their environments over time.

Phylogenetic comparative methods in ecology and evolution

Phylogenetic comparative methods (PCMs) are used in ecology and evolutionary biology to study how the evolutionary relationships among species influence ecological and evolutionary processes. Because closely related species tend to resemble each other, they are not statistically independent data points. PCMs use phylogenetic trees to account for this non-independence, allowing researchers to test hypotheses and make inferences about trait evolution, ecological interactions, and evolutionary patterns. Here are some common applications of PCMs in ecology and evolution:

Trait Evolution: PCMs are used to study the evolution of traits, such as body size, morphology, behavior, and physiological characteristics, across different lineages. By incorporating phylogenetic information, researchers can estimate ancestral states, infer the direction and rate of trait evolution, and identify factors driving trait diversification.
Adaptation and Natural Selection: PCMs can be used to test hypotheses about adaptation and natural selection by comparing the fit of evolutionary models that incorporate selection pressures with models that do not. This approach can help identify traits that have evolved in response to specific environmental factors or selective pressures.
Biogeography and Speciation: PCMs are used to study patterns of species distribution and diversification in relation to historical biogeographic events and environmental factors. By integrating phylogenetic and geographic data, researchers can infer historical migration patterns, identify dispersal barriers, and assess the impact of past climate change on species distributions.
Community Ecology: PCMs are used to study ecological interactions among species, such as competition, predation, and mutualism. By incorporating phylogenetic information, researchers can test hypotheses about the role of evolutionary history in shaping species interactions and community structure.
Macroevolutionary Patterns: PCMs are used to study broad-scale patterns of evolution, such as the tempo and mode of diversification, species turnover rates, and evolutionary convergence. These analyses can provide insights into the processes driving biodiversity at large spatial and temporal scales.
Conservation Biology: PCMs are used to inform conservation strategies by identifying evolutionarily distinct populations, assessing the genetic diversity of endangered species, and predicting the impact of habitat loss and fragmentation on species evolution.

Overall, PCMs are powerful tools for integrating evolutionary history into ecological and evolutionary studies, allowing researchers to uncover the underlying processes driving biodiversity and species interactions in natural ecosystems.
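One classic way of accounting for shared evolutionary history is Felsenstein's method of phylogenetically independent contrasts. The sketch below is a minimal version for a fully bifurcating tree, assuming the trait evolves by Brownian motion; the tuple-based tree layout, the tip names and the trait values are all made up for illustration.

```python
import math


def contrasts(node, trait, out):
    """Return (value, branch_length) for `node`, appending contrasts to `out`.

    A tip is (name, branch_length); an internal node is
    (left_child, right_child, branch_length).
    """
    if len(node) == 2:                      # tip: look up the observed trait
        name, length = node
        return trait[name], length
    left, right, length = node
    x1, v1 = contrasts(left, trait, out)
    x2, v2 = contrasts(right, trait, out)
    # Standardised contrast: difference scaled by its expected standard deviation
    out.append((x1 - x2) / math.sqrt(v1 + v2))
    # Ancestral estimate: weighted mean in which shorter branches count more
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    # Lengthen the branch below this node to reflect the estimation error
    return value, length + v1 * v2 / (v1 + v2)


# Hypothetical rooted tree ((A:1,B:1):1,C:2) with made-up trait values
tree = ((("A", 1.0), ("B", 1.0), 1.0), ("C", 2.0), 0.0)
trait = {"A": 1.0, "B": 3.0, "C": 6.0}
found = []
root_value, _ = contrasts(tree, trait, found)
print(found, root_value)
```

A tree with n tips yields n - 1 contrasts. Contrasts computed for two traits on the same tree can then be compared with each other in place of the raw species values.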

Practical Session: Bioinformatics Tools for Phylogenetic Analysis

Before you start: please install the FigTree viewer (http://tree.bio.ed.ac.uk/software/figtree/) on your computer.

The Phylogeny of HIV

In this exercise you will analyze the evolutionary relationships between HIV-related viruses from humans and non-human primates:

Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, …) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.

The “Pol” gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 21 different Pol polyprotein sequences from HIV-1, HIV-2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1.

>HTLV P03362 POL_HTL1A POL polyprotein (HTLV-I).

GKKAACNLANTGASRPWARTPPKAPRNQPVPFKPERLQALQHLVRKALEAGHIEPYTGPGNNPVFPVKKANG

TWRFIHDLRATNSLTIDLSSSSPGPPDLSSLPTTLAHLQTIDLRDAFFQIPLPKQFQPYFAFTVPQQCNYGP

GTRYAWKVLPQGFKNSPTLFEMQLAHILQPIRQAFPQCTILQYMDDILLASPSHEDLLLLSEATMASLISHG

LPVSENKTQQTPGTIKFLGQIISPNHLTYDAVPTVPIRSRWALPELQALLGEIQWVSKGTPTLRQPLHSLYC

ALQRHTDPRDQIYLNPSQVQSLVQLRQALSQNCRSRLVQTLPLLGAIMLTLTGTTTVVFQSKEQWPLVWLHA

PLPHTSQCPWGQLLASAVLLLDKYTLQSYGLLCQTIHHNISTQTFNQFIQTSDHPSVPILLHHSHRFKNLGA

QTGELWNTFLKTAAPLAPVKALMPVFTLSPVIINTAPCLFSDGSTSRAAYILWDKQILSQRSFPLPPPHKSA

QRAELLGLLHGLSSARSWRCLNIFLDSKYLYHYLRTLALGTFQGRSSQAPFQALLPRLLSRKVVYLHHVRSH

TNLPDPISRLNALTDALLITPVLQLSPAELHSFTHCGQTALTLQGATTTEASNILRSCHACRGGNPQHQMPR

GHIRRGLLPNHIWQGDITHFKYKNTLYRLHVWVDTFSGAISATQKRKETSSEAISSLLQAIAHLGKPSYINT

DNGPAYISQDFLNMCTSLAIRHTTHVPYNPTSSGLVERSNGILKTLLYKYFTDKPDLPMDNALSIALWTINH

LNVLTNCHKTRWQLHHSPRLQPIPETRSLSNKQTHWYYFKLPGLNSRQWKGPQEALQEAAGAALIPVSASSA

QWIPWRLLKRAACPRPVGGPADPKEKDLQHHG

>HIV1B5 P04587 POL polyprotein [Contains: Protease (Retro

FFREDLAFLQGKAREFSSEQTRANSPTISSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQIT

LWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVL

VGPTPVNIIGRNLLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISK

IGPENPYNTPVFAIKKKDSTKWRKLVDFRELNRRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLD

EDFRKYTAFTIPSINNETPGSGYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDL

EIGQHRTKIEELRQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTIQPIVLPEKDSWTVNDIQKLVGKLN

WASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQW

TYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQAT

WIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAASRETKLGKAGYVTNRGRQKVVTLTHTTNQKTELQA

IHLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVS

AGIRKILFLDGIDKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDC

THLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQ

EFGIPYNPQSQGVVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQ

TKELQKQITKIQNFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCV

ASRQDED

>HIV1H2 P04585 POL polyprotein [Contains: Protease (Retro

FFREDLAFLQGKAREFSSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQVTLWQRPLVTIKIG

GQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRN

LLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVF

AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTIP

SINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL

RQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVR

QLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNL

KTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQATWIPEWEFVNTPP

LVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVTNRGRQKVVTLTDTTNQKTELQAIYLALQDSGLEV

NIVTDSQYALGIIQAQPDQSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFLDGI

DKAQDEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAV

HVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTGATVRAACWWAGIKQEFGIPYNPQSQG

VVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ

NFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED

>HIV1MN P05961 POL polyprotein [Contains: Protease (Retro

FFREDLAFLQGKAEFSSEQNRANSPTRRELQVWGRDNNSLSEAGEEAGDDRQGPVSFSFPQITLWQRPIVTI

KIGGQLKEALLDTGADDTVLGEMNLPRRWKPKMIGGIGGFIKVRQYDQITIGICGHKAIGTVLVGPTPVNII

GRNLLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALIEICTEMEKEGKISKIGPENPYNT

PVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAF

TIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRAKI

EELRRHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYAGI

KVKQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEVQKQGQGQWTYQIYQEPF

KNLKTGKYARMRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFRLPIQKETWETWWTEYTXATWIPEWEVVN

TPPLVKLWYQLEKEPIVGAETFYVDGAANRETKKGKAGYVTNRGRQKVVSLTDTTNQKTELQAIHLALQDSG

LEVNIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFL

DGIDKAQEDHEKYHSNWRAMASDFNLPPIVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVIL

VAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGPNFTSTTVKAACWWTGIKQEFGIPYNPQ

SQGVIESMNKELKKIIGQVRDQAEHLKRAVQMAVFIHNFKRKGGIGGYSAGERIVGIIATDIQTKELQKQIT

KIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNNDIKVVPRRKAKVIRDYGKQTAGDDCVASRQDED

>HIV1N5 P12497 POL polyprotein [Contains: Protease (Retro

FFREDLAFPQGKAREFSSEQTRANSPTRRELQVWGRDNNSLSEAGADRQGTVSFSFPQITLWQRPLVTIKIG

GQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVGQYDQILIEICGHKAIGTVLVGPTPVNIIGRN

LLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVF

AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKQKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIP

SINNETPGIRYQYNVLPQGWKGSPAIFQCSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL

RQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVR

QLCKLLRGTKALTEVVPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNL

KTGKYARMKGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWEAWWTEYWQATWIPEWEFVNTPP

LVKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPLTDTTNQKTELQAIHLALQDSGLEV

NIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDGLVSAGIRKVLFLDGI

DKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAV

HVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTVHTDNGSNFTSTTVKAACWWAGIKQEFGIPYNPQSQG

VIESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ

NFRVYYRDSRDPVWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED

>HIV1ND P18802 POL polyprotein [Contains: Protease (Retro

FFREDLAFPQGKAGEFSSEQTRANSPTSRELRVWGGDNPLSETGAERQGTVSFSFPQITLWQRPLVTIKIGG

QLKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQILIEICGYKAMGTVLVGPTPVNIIGRNL

LTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTEICTEMEKEGKISRIGPENPYNTPIFA

IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPS

INNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPEIVIYQYMDDLYVGSDLEIGQHRTKIEELR

EHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPINLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQ

LCKLLRGTKALTEVVPLTEEAELELAENREILKEPVHGVYYDPSKDLIAELQKQGDGQWTYQIYQEPFKNLK

TGKYARTRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWETWWIEYWQATWIPEWEFVNTPPL

VKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPFTDTTNQKTELQAINLALQDSGLEVN

IVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSQGIRKVLFLDGID

KAQEEHEKYHNNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH

VASGYIEAEVIPAETGQETAYFLLKLAGRWPVKVVHTDNGSNFTSATVKAACWWAGIKQEFGIPYNPQSQGV

VESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTRELQKQIIKIQN

FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKVKIIRDYGKQMAGDDCVASRQDED

>HIV1OY P20892 POL polyprotein [Contains: Protease (Retro

FFREDLAFPQGKAREFSSEQTRANSPTSRELRVWGRDNNSPSEAGADRQGTVSFNLPQITLWQRPIVTIKIG

GQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRN

LLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKVLIEICTEMEKEGKISKVGPENPYNTPVF

AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIP

SINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL

RQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIMLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVK

NLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLVAELQKQGQGQWTYQIYQEPFKNL

KTGKYARMRGAHTNDVKQLTEAVQKITQESIVIWGKTPKFKLPIQKETWEAWWTEYWQATWIPEWEFVNTPP

LVKLWYQLEKDPIVGAETFYVDGAANRETKLGKAGYVTDRGRQKVVSLTDTTNQKTELQAIHLALQDSGLEV

NIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFLDGI

DKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKIILVAV

HVASGYIEAEVIPAETGQETAYFILKLAGRWPVKTIHTDNGSNFTSTTVKAACWWAGIKQEFGIPYNPQSQG

VVESMNNELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ

NFRVYYRDSREPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED

>HIV1PV P03368 POL polyprotein [Contains: Protease (Retro

FFREDLAFLQGKAREFSSEQTRANSPTISSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQIT

LWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVL

VGPTPVNIIGRNLLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISK

IGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLD

EDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDL

EIGQHRTKIEELRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLN

WASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQW

TYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQAT

WIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETRLGKAGYLTNKGRQKVVPLTNTTNQKTELQA

IYLALQDSGLEVNIVTDSQYALGIIQAQPDQSESELVNQIIEQLIKKQKVYLAWVPAHKGIGGNEQVDKLVS

AGIRKILFLDGIDKAQDEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDC

THLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQ

EFGIPYNPQSQGVVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQ

TKELQKQITKIQNFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCV

ASRQDED

>HIV1U4 P24740 POL polyprotein [Contains: Protease (Retro

FFRENLAFQQGEAREFSSEQTRANSPTSRNLWDGGKDDLPCETGAERQGTDSFSFPQITLWQRPLVTVKIGG

QLIEALLDTGADDTVLEDINLPGKWKPKIIGGIGGFIKVRQYDQILIEICGKKTIGTVLVGPTPVNIIGRNM

LTQIGCTLNFPISPIETVPVKLKPEMDGPKVKQWPLTEEKIKALTEICNEMEKEGKISKIGPENPYNTPVFA

IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHTAGLKKKKSVTVLDVGDAYFSVPLDESFRKYTAFTIPS

INNETPGVRYQYNVLPQGWKGSPSIFQSSMTKILEPFRSQHPDIVIYQYMDDLYVGSDLEIGQHRAKIEELR

AHLLSWGFITPDKKHQKEPPFLWMGYELHPDKWTVQPIQLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVKQ

LCKLLRGAKALTDIVTLTEEAELELAENREILKDPVHGVYYDPSKDLVAEIQKQGQDQWTYQIYQEPFKNLK

TGKYARKRSAHTNDVKQLTEVVQKVSTESIVIWGKIPKFRLPIQKETWEAWWMEYWQATWIPEWEFVNTPPL

VKLWYQLEKDPIAGAETFYVDGAANRETKLGKAGYVTDRGRQKVVSLTETTNQKTELHAIHLALQDSGSEVN

IVTDSQYALGIIQAQPDRSESEIVNQIIEKLIEKEKVYLSWVPAHKGIGGNEQVDKLVSSGIRKVLFLDGID

KAQEDHEKYHCNWRAMASDFNLPPVVAKEIVASCNKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH

VASGYIEAEVIPAETGQETAYFILKLAGRWPVKVIHTDNGSNFTSAAVKAVCWWANIQQEFGIPYNPQSQGV

VESMNKELKKIIGQVREQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTKELQKQISKIQN

FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCMAGRQDED

>HIV1Z2 P12499 POL polyprotein [Contains: Protease (Retro

FFREDLAFPQGKAGELSSEQTRANSPTSRELRVWGRDNPLSETGAERQGTVSFNCPQITLWQRPLVTIKIGG

QLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNL

LTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTEICTEMEKEGKISRVGPENPYNTPIFA

IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIPS

INNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPEIVIYQYMDDLYVGSDLEIGQHRTKIEELR

EHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQSIKLPEKESWTVNDIQKLVGKLNWASQIYPGIKVRQ

LCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGHGQWTYQIYQEPFKNLK

TGKYARMRGAHTNDVKQLAEVVQKISTESIVIWGKTPKFRLPIQKETWETWWVEYWQATWIPEWEFVNTPPL

VKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPFTDTTNQKTELQAINLALQDSGLEVN

IVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSQGIRKVLFLDGID

KAQEEHEKYHNNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH

VASGYIEAEVIPAETGQETAYFILKLAGRWPVKIVHTDNGSNFTSAAVKAACWWAGIKQEFGIPYNPQSQGV

VESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTKELQKQITKIQN

FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKVKIIRDYGKQMAGDDCVASRQDED

>HIV2CA P24107 POL polyprotein [Contains: Protease (Retro

TGGFFRDWPLGKEAPQFPRGPSSTGANTNSTPIGSSSGSTGEIYAAREKAEGAETETIQRGDRGLTAPRTRR

GPMQGDNRGLAAPQFSLWKRPVVTAHIEGQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK

NVEIEVLGKRVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPIKIMLKPGKDGPRLRQWPLTKEKIEA

LKEICEKMEKEGQLEEAPPTNPYNTPTFAIRKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKRR

ITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLEPFRKANSD

VIIIQYMDDILIASDRTDLEHDKVVLQLKELLNNLGFSTPDEKFQKDPPYRWMGYELWPTKWKLQKIQLPQK

EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENRIILSQEQEGHYYQE

EKELEATVQKDQDNQWTYKIHQEEKILKVGKYAKIKHTHTNGVKLLAQVVQKIGKEALVIGRIPKFHLPVER

EVWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVGDPIPGTETFYTDGSCNRQSKEGKAGYVTDRGRDKVK

ILEQTTNQQAELEAFAMALTDSGPKANIIVDSQYVMGIVAGQPTESENRIVNQIIEEMIKKEAIYVAWVPAH

KGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHTNVKELCHKFDIPQLVARQIVNTCAQYQQKGEAIH

GQVNAEVGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANFTS

QEVKMVAWWVGIEQTFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTVETIVLMAVHCMNFKRRGGIGDMT

PSERLINMITTEQEIQFLQAKNSKLKNFRVYFREGRDQLWKGPGELLWKGDGAVIVKVGTDIKIIPRRKAKI

IRDYGGRQELDSSSHLEGARENGEVA

>HIV2D1 P17757 POL polyprotein [Contains: Protease (Retro

VLELWKGGTLGETVPSTQKTGLLEVWQVRTHHGKLPGKTGRFFRDGPTGKAAPQLPRGPSSSGADTNSTPNR

SSSGPVGEIYAAREKAERAEGETIQGGDGGLTAPRAGRDAPQRGDRGLATPQFSLWKRPVVTAFIEDQPVEV

LLDTGADDSIVAGIELGDNYTPKIVGGIGGFINTKEYKNVEIKVLNKRVRATIMTGDTPINIFGRNILATLG

MSLNLPVAKLDPIKVTLKPGKDGPRLKQWPLTKEKIEALKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKD

KNKWRMLIDFRELNRVTQDFTEIQLGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAE

PEKRYVYKVLPQGWKGSPAIFQFMMRQILEPFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNG

LGFSTPDEKFQKDPPFQWMGYELWPTKWKLQKIQLPQKEIWTVNDIQKLVGVLNWAAQIYPGIKTKHLCKLI

RGKMTLTEEVQWTELAEAELEENKIILSQEQEGSYYQEEEELEATVIKSQDNQWAYKIHQGERVLKVGKYAK

IKNTHTNGVRLLAQVVQKIGKEALVIWGRVPKFHLPVERDTWEQWWDNYWQVTWVPEWDFVSTPPLVRLTFN

LVGDPIPGTETFYTDGSCNRQSKEGKAGYVTDRGRDRVRVLEQTSNQQAELEAFAMALADSGPKVNIIVDSQ

YVMGIVAGQPTESENRIVNQIIEDMIKKEAVYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEH

EKYHSNIKELTHKFGIPQLVARQIVNTCAQCQQKGEAIHGQVNAEIGVWQMDCTHLEGKIIIVAVHVASGFI

EAEVIPQESGRQTALFLLKLASRWPITHLHTDNGPNFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNH

HLKNQISRIREQANTIETIVLMAVHCMNFKRRGGIGDMTPAERLINMITTEQEIQFLQRKNSNFKKFQVYYR

EGRDQLWKGPGELLWKGDGAVIVKVGADIKVVPRRKAKIIRDYGGRQELDSSSHLEGAREDGEVA

>HIV2G1 P18042 POL polyprotein [Contains: Protease (Retro

MWQDRTRHGKMPRKTGRFFRDGSMGKEAPQLPRGPSSSGADTNSTPSRSSSGSIGKIYAAGERAEGAEGETI

QRGDGRLTAPRAGKSTSQRGDRGLAAPQFSLWKRPVVTAYIEVQPVEVLLDTGADDSIVAGIQLGDNYVPKI

VGGIGGFINTKEIKNIEIKVLNKRVRATIMTGDTPINIFGRNILTALGMSLNLPIAKIEPIKVTLKPGKDGP

RLRQWPLTKEKIEALREICEKMEKEGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNRVTQDFTEIQ

LGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHT

MRQVLEPFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPLQWMGYELW

PTKWKLQKLQLPQKEIWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIKGKMTLTEEVQWTELAEAELEENK

IILSQEQEGYYYQEEKELEATIQKNQDNQWTYKIHQEEKILKVGKYAKIKNTHTNGVRLLAQVVQKIGKEAL

VIWGRIPKFHLPVERETWEQWWDNYWQVTWIPEWDFVSTPPLVRLTFNLVGDPIPGAETFYTDGSCNRQSKE

GKARYVTDRGRDKVRVLERTTNQQAELEAFAMTLTDSGPKVNIIVDSQYVMGIVVGQPTESESRIVNQIIED

MIKKEAVYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLERIEPAQEEHEKYHSNMKELTHKFGIPQLVARQI

VNTCAQCQQKGEAIHGQVNAEIGVWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRW

PITHLHTDNGSNFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTIETIVLMAV

HCMNFKRRGGIGDMTPAERLINMITTEQEIQFLQRKNSNFKNFQVYYREGRDQLWKGPGELLWKGDGAVIVK

VGADIKVIPRRKAKIIRDYGGRQELDSSHLEGAREEDGEVA

>HIV2KR Q74120 POL polyprotein [Contains: Protease (Retro

TGWFFRDWPMGKEASQLPRDPSPAGADTNSTPSRPSSRPAREVLAAREEAERAENETIQGGDRGLTAPRTRR

DTTQRGDRGFAAPQFSLWKRPVVTAYVEGQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK

NVEIKVLNKKVKATIMTGDTPINIFGRNILTALGMSLNLPVAKVDPIKVILKPGKDGPKVRQWPLTKEKIEA

LKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQEFTEIQLGIPHPAGLAKKRR

ITVLDIGDAYFSIPLHEDFRQYTAFTLPTVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRQVLEPFRKANPD

VILVQYMDDILIASDRTDLEHDRTVLQLKELLNGLGFSTPDEKFQKDPPYKWMGYELWPTKWKLQKIQLPQK

EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENKIILSQEQEGCYYQE

EKELEATVQKDQDNQWTYKIHQGEKILKVGKYAKIKNTHTNGVRLLAHVVQKIGKEALVIWGRIPKFHLPVE

RETWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVKDPIPGEETFYTDGSCNRQSKEGKAGYITDRGRDKV

RILEQTTNQQAELEAFAMALTDSGPKANIIVDSQYVMGIVAGQPTESESKLVNQIIEEMIKKETLYVAWVPA

HKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGLPKLVARQIVNTCAQCQQKGEAI

HGQVDAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLASRWPITHLHTDNGANFT

SQEVKMVAWWTGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTMETIVLMAVHCMNFKRRGGIGDM

TPAERLINMITTEQEIQFLHAKNSKLKNFRVYFREGRDQLWKGPGELLWKGDGAVIVKVGTDIKIVPRRKAK

IIRDYGGRREVDSSSHLEGTREDGEVA

>HIV2RO P04584 POL polyprotein [Contains: Protease (Retro

TGRFFRTGPLGKEAPQLPRGPSSAGADTNSTPSGSSSGSTGEIYAAREKTERAERETIQGSDRGLTAPRAGG

DTIQGATNRGLAAPQFSLWKRPVVTAYIEGQPVEVLLDTGADDSIVAGIELGNNYSPKIVGGIGGFINTKEY

KNVEIEVLNKKVRATIMTGDTPINIFGRNILTALGMSLNLPVAKVEPIKIMLKPGKDGPKLRQWPLTKEKIE

ALKEICEKMEKEGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKR

RITVLDVGDAYFSIPLHEDFRPYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRQVLEPFRKANK

DVIIIQYMDDILIASDRTDLEHDRVVLQLKELLNGLGFSTPDEKFQKDPPYHWMGYELWPTKWKLQKIQLPQ

KEIWTVNDIQKLVGVLNWAAQLYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENRIILSQEQEGHYYQ

EEKELEATVQKDQENQWTYKIHQEEKILKVGKYAKVKNTHTNGIRLLAQVVQKIGKEALVIWGRIPKFHLPV

EREIWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVGDPIPGAETFYTDGSCNRQSKEGKAGYVTDRGKDK

VKKLEQTTNQQAELEAFAMALTDSGPKVNIIVDSQYVMGISASQPTESESKIVNQIIEEMIKKEAIYVAWVP

AHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGIPNLVARQIVNSCAQCQQKGEA

IHGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANF

TSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTIETIVLMAIHCMNFKRRGGIGD

MTPSERLINMITTEQEIQFLQAKNSKLKDFRVYFREGRDQLWKGPGELLWKGEGAVLVKVGTDIKIIPRRKA

KIIRDYGGRQEMDSGSHLEGAREDGEMA

>HIV2SB P12451 POL polyprotein [Contains: Protease (Retro

TGWFFRAWTMGKEAPQLPRGPKFAGANTNSTPNGSSSGPTGEVHAAREKTERAETKTIQRSDRGLAASRARR

DTTQRDDRGLAAPQFSLWKRPVVTAYIEDQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK

DVEIRVLNKKVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPVKVTLKPGKDGPKQRQWPLTREKIEA

LREICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKKRR

ITVLDVGDAYFSIPLYEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLEPFRKANPD

VIIVQYMDDILIASDRTDLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPYQWMGYELWPTKWKLQKIQLPQK

EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCKLIRGKMTPTEEVQWTELAEAELEENKIILSQEQEGHYYQE

EKELEATVQKDQDNQWTYKVHQGEKILKVGKYAKIKNTHTNGVRLLAQVVQKIGKEALVIWGRIPKFHLPVE

RETWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVKDPIPGAETFYTDGSCNRQSKEGKAGYITDRGKDKV

RILEQTTNQQAELEAFAMAVTDSGPKVNIVVDSQYVMGIVTGQPAESESRIVNKIIEEMIKKEAIYVAWVPA

HKGIGGNQEIDHLVSQGIRQVLFLERIEPAQEEHGKYHSNVKELAHKFGLPNLVARQIVNTCAQCQQKGEAI

HGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANFT

SQEVKMVAWWVGIEQSFGVPYNPQSQGVVEAMNHHLKNQIERIREQANTMETIVLMAVHCMNFKRRGGIGDM

TPVERLVNMITTEQEIQFLQAKNSKLKNFRVYFREGRNQLWQGPGELLWKGDGAVIVKVGTDIKVIPRRKAK

IIRDYGPRQEMDSGSHLEGAREDGEMA

>HIV2ST P20876 POL polyprotein [Contains: Protease (Retro

KTRLLEMWQGRTHHGKMPRKTGGFFRVGPMGKEAPQFPCGPNPAGADTNSTPDRPSRGPTREVHAAREKAER

AEREAIQRSDRGLPAARETRDTMQRDDRGLAAPQFSLWKRPVVTAHVEGQPVEVLLDTGADDSIVAGVELGS

NYSPKIVGGIGGFINTKEYKNVEIRVLNKRVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPIKIMLK

PGKDGPKLRQWPLTKEKIEALKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQ

DFTEIQLGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSINNAEPGKRYIYKVSPQGWKGSP

AIFQYTMRQVLEPFRKANPDIILIQYMDDILIASDRTDLEHDRVVLQLKELLNGLGFSTPDEKFQKDPPYQW

MGYELWPTKWKLQRIQLPQKEVWTVNDIQKLVGVLNWAAQIYPGIKTRNLCRLIRGKMTLTEEVQWTELAEA

ELEENKIILSQEQEGCYYQEEKELEATVQKDQDNQWTYKIHQGGKILKVGKYAKVKNTHTNGVRLLAQVVQK

IGKEALVIWGRIPKFHLPVERDTWEQWWDNYWQVTWIPDWDFISTPPLVRLVFNLVKDPILGAETFYTDGSC

NKQSREGKAGYITDRGRDKVRLLEQTTNQQAELEAFAMAVTDSGPKANIIVDSQYVMGIVAGQPTESESKIV

NQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGLPK

LVARQIVNTCTQCQQKGEAIHGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLL

KLASRWPITHLHTDNGANFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTVET

IVLMAVHCMNFKRRGGIGDMTPAERLINMVTAEQEIQFLQAKNSKLQNFRVYFREGRDQLWKGPGELLWKGD

GAVIVKVGADIKIIPRRKAKIIKDYGGRQEMDSGSNLEGAREDGEVA

>SIVCZ P17283 POL polyprotein [Contains: Protease (Retro

STKKKRLLAVWARGTPNERLHRKTGEFFRERLAFPQREARQLCAEQNRTNGPTDRELWVPGGREEPGEERGR

EQSISTNLPQITLWQRPLIPVKVEGQLCEALLDTGADDTVIERIQLQGLWKPKMIGGIGGFIKVKQFDNVHI

EIEGRKVVGTVLVGPTPVNIIGRNILTQLGCTLVFPISSIETVPVKLKPGMDGPKVKQWPLSAEKIKALTEI

CQEMEKEGKISKIGPENPYNTPIFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVL

DVGDAYFSCPLDKDFRKYTAFTIPSINNETPGVRYQYNVLPQGWKGSPSIFQSSMTKILEPFREKNPDITIY

QYMDDLYVGSDLEIDQHRKKVEELRQHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIQLPEKEVWT

VNDIQKLIGKLNWASQIYPGIKIKQLCKLIRGTKKLTDVVPLTPEAELELAENREIVSTPVHGVYYDPDKEL

IAEIQKQGNCQWTYQIFQEPHKNLKTGKYARQRSAHTNDIRQLAEAVQKIATESIVIWGKTPKFRLPVQKES

WEAWWAEYWQATWIPEWEFINTPPLVKLWYSLETEPIPTTDTYYVDGAANRETKTGKAGYVTDKGKQKIISL

ENTTNQQAELKALLLALQDSDQQVNIVTDSQYVLGIIQSQPDHSESELVNQIIEELIKKEKIYLSWVPAHKG

IGGNEQVDKLVSAGIRKVLFLDGIDRAQEEHERYHSNWKAMASDFNLPPIVAKEIVAHCDKCQVKGEAMHGQ

VDCSPGIWQVDCTHLEGKVIIVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGPNFTSAA

VKAACWWADIKQEFGIPYNPQSQGVVESLNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYTAG

ERIIDIIATDIQTSELQKQILKVQKFRVYYRDSRDPIWKGPATLLWKGEGAVVIQDQGELKVVPRRKAKIIR

DYGKQMAGDDCVASRQNED

>Smanga_S4 P12502 POL polyprotein [Contains: Protease (Retro

KTGGFFRAWPMGKEAPQFPHGPDASGADTNCSPRGSSCGSTEELHEDGQKAEGEQRETLQGGDRGFAAPQFS

LWRRPVVTAYIEEQPVEVLLDTGADDSIVAGIELGPNYTPKIVGGIGGFINTKEYKDVKIKVLGKVIKGTIM

TGDTPINIFGRNLLTAMGMSLNLPIAKVEPIKVTLKPGKEGPKLRQWPLSKEKIIALREICEKMEKDGQLEE

APPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKRRRITVLDVGDAYFSIPLD

EEFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRNVLEPFRKANPDVTLIQYMDDILIASDR

TDLEHDRVVLQLKELLNGIGFSTPEEKFQKDPPFQWMGYELWPTKWKLQKIELPQRETWTVNDIQKLVGVLN

WAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTEMAEAEYEENKIILSQEQEGCYYQEGKPIEATVIKSQDNQW

SYKIHQEDKVLKVGKFAKVKNTHTNGVRLLAHVVQKIGKEALVIWGEVPKFHLPVEREIWEQWWTDYWQVTW

IPDWDFVSTPPLVRLVFNLVKEPIQGAETFYVDGSCNRQSREGKAGYVTDRGRDKAKLLEQTTNQQAELEAF

YLALADSGPKANIIVDSQYVMGIIAGQPTESESRLVNQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHLVSQ

GIRQVLFLKKIEPAQEEHEKYHSNVKELVFKFGLPRLVAKQIVDTCDKCHQKGEAIHGQVNAELGTWQMDCT

HLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLAGRWPITHLHTDNGANFTSQEVKMVAWWAGIEQT

FGVPYNPQSQGVVEAMNHHLKTQIDRIREQANSIETIVLMAVHCMNFKRRGGIGDMTPAERLVNMITTEQEI

QFQQSKNSKFKNFRVYYREGRDQLWKGPGELLWKGEGAVILKVGTEIKVVPRRKAKIIKDYGGGKELDSGSH

LEDTGEAREVA

>Smanga_SP P19505 POL polyprotein [Contains: Protease (Retro

MPRKTSGFFRAWPMGKEAPQFPHGPDASGADTNCSPRGSSCGSTEELHEDGQKAEGEQRETLQGGNGGFAAP

QFSLWRRPIVTAYIEEQPVEVLLDTGADDSIVAGIELGPNYTPKIVGGIGGFINTKEYKDVKIKVLGKVIKG

TIMTGDTPINIFGRNLLTAMGMSLNLPIAKVEPIKVTLKPGKDGPKLRQWPLSKEKIIALREICEKMEKDGQ

LEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKRRRITVLDVGDAYFSI

PLDEEFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRNVLEPFRKANPDVTLIQYMDDILIA

SDRTDLEHDRVVLQLKELLNSIGFSTPEEKFQKDPPFQWMGYELWPTKWKLQKIELPQRETWTVNDIQKLVG

VLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTEMAEAEYEENKIILSQEQEGCYYQEGKPLEATVIKSQD

NQWSYKIHQEDKILKVGKFAKIKNTHTNGVRLLAHVVQKIGKEAIVIWGQVPRFHLPVEREIWEQWWTDYWQ

VTWIPEWDFVSTPPLVRLVFNLVKEPIQGAETFYVDGSCNRQSREGKAGYVTDRGRDKAKLLEQTTNQQAEL

EAFYLALADSGPKANIIVDSQYVMGIVAGQPTESESRLVNQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHL

VSQGIRQVLFLEKIEPAQEEHEKYHSNVKELVFKFGLPRLVAKQIVDTCDKCHQKGEAIHGQVNAELGTWQM

DCTHLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLASRWPITHLHTDNGANFTSQEVKMVAWWAGI

EQTFGVPYNPQSQGVVEAMNHHLKTQIDRIREQANSIETIVLMAVHCMNFKRRGGIGDMTPAERLVNMITTE

QEIQFQQSKNSKFKNFRVYYREGRDQLWKGPGELLWKGEGAVILKVGTEIKVVPRRKAKIIKDYGGGKELDS

GSHLEDTGEAREVA
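Before uploading the sequences, it can be worth checking that the file was copied completely. The following is a minimal sketch of a FASTA reader that skips blank lines and uses the first word after ">" as the identifier. The two records in the example are truncated fragments of the data above and serve only as test input.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict {identifier: sequence}.

    Blank lines are skipped; the identifier is the first word after '>'.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line       # sequence may span several lines
    return records


# Truncated toy input, not the full sequences
example = """>HTLV P03362 POL_HTL1A POL polyprotein (HTLV-I).

GKKAACNLAN

TGASRPWART

>SIVCZ P17283 POL polyprotein
STKKKRLLAV
"""

records = read_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

To run it on the real data, pass an open file handle instead of `example.splitlines()` and compare the number of records and their lengths with what you expect.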

Step 1

Align the Pol sequences using the MAFFT server (http://www.ebi.ac.uk/Tools/msa/mafft/) at EBI with default settings. Set the output format to “Pearson/FASTA”.

Once the alignment is done, save the resulting alignment as a FASTA file: right-click the “Download alignment file” button on the MAFFT output page, and then save the file using “Save linked file as” (or whatever it is called in your particular browser). Make sure you can find the file again!
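A quick sanity check on the saved alignment is that every row has the same length, since the aligner pads the sequences with gap characters ("-"). The sketch below assumes the alignment has already been read into a dictionary of identifier to aligned sequence; the three short rows are made-up toy data.

```python
def check_alignment(records):
    """Return the alignment length, or raise ValueError if rows differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return lengths.pop()


def gap_fraction(seq):
    """Fraction of positions in an aligned sequence that are gaps."""
    return seq.count("-") / len(seq)


# Made-up toy alignment
aligned = {"seqA": "MK-LV", "seqB": "MKALV", "seqC": "M--LV"}

print("alignment length:", check_alignment(aligned))
for name, seq in aligned.items():
    print(name, gap_fraction(seq))
```

A row with an unusually high gap fraction is often a fragment or a wrongly included sequence and deserves a second look before building a tree.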

Step 2

Open the TreeHugger web server (https://services.healthtech.dtu.dk/service.php?TreeHugger). The TreeHugger server constructs a neighbor-joining tree from an aligned set of sequences.
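To make the idea of neighbor joining concrete, here is a minimal sketch of the textbook algorithm, together with a simple p-distance (fraction of differing sites) that could be used to fill the distance matrix from an alignment. This is not TreeHugger's implementation, and the distance measure the server uses is not stated here; the five-taxon matrix is a made-up additive example.

```python
def p_distance(a, b):
    """Fraction of differing sites, ignoring columns with a gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def neighbor_joining(names, dist):
    """Return an unrooted tree as a Newick string (needs at least 3 names).

    dist[i][j] is the distance between names[i] and names[j].
    """
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b)
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and b
        la = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = d[a][b] - la
        new = f"({a}:{la:.4f},{b}:{lb:.4f})"
        # Distances from the new node to every remaining node
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Join the last three nodes at a central trifurcation
    a, b, c = nodes
    la = 0.5 * (d[a][b] + d[a][c] - d[b][c])
    lb = 0.5 * (d[a][b] + d[b][c] - d[a][c])
    lc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
    return f"({a}:{la:.4f},{b}:{lb:.4f},{c}:{lc:.4f});"


# Made-up additive distance matrix for five taxa
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(names, dist))
```

Note that the result is unrooted: the three-way split at the top of the Newick string is just a convenient place to start writing the tree, not an inferred root.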

Step 3

Select the option to upload a file (see figure below), then choose the Pol protein alignment file you just saved on your hard disk, and finally click “Submit Query” to construct the neighbor-joining tree:

Step 4

When the run is done, right-click the “Download data in Newick/Phylip format” link to save the tree file as a text file on your hard disk (again make sure you can find it later). You will notice that the tree file is in the parenthesis-based format we discussed previously in the lecture:
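As a small illustration of the parenthesis-based (Newick) format, the sketch below lists the tip names found in a Newick string. It assumes simple, unquoted labels without spaces or special characters, and the example tree is made up. Each pair of parentheses encloses one group of descendants, and the number after a colon is a branch length.

```python
import re


def leaf_names(newick):
    """Return the tip labels of a Newick string, in order of appearance.

    A tip label directly follows '(' or ','; labels that follow ')' belong
    to internal nodes and are ignored. Quoted labels are not supported.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Made-up example tree with branch lengths
example_tree = "(d:2.0,e:1.0,(c:4.0,(a:2.0,b:3.0):3.0):2.0);"
print(leaf_names(example_tree))
```

Comparing the list of tips with the identifiers in your alignment is a quick way to confirm that no sequence was dropped along the way.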

Step 5

Open the FigTree tree viewer that you have previously installed on your own computer and use File -> Open to open the tree file you just saved.

Step 6

The view that you will see first is presumably a rooted view similar to the one below. However, it is important to realize that we have not explicitly rooted the tree yet, so the position of the root in this view is arbitrary. A more faithful view can be seen by clicking the unrooted view button (see figures below):

Step 7

The last figure above shows the unrooted tree. For now, however, go back to the (pseudo)rooted view you started out with. We will now place the root by using the HTLV Pol sequence as a so-called outgroup. Click the branch leading to the HTLV sequence such that it gets selected (see figure below). Then click the “Reroot” button, which will subsequently root the tree on the selected outgroup:

The rationale for using an outgroup to place the root of the tree is as follows: our data set consists of sequences from HIV-1, HIV-2, SIV and HTLV. We know from other evidence that the lineage leading to HTLV branched off before any of the remaining viruses diverged from each other. The root of the tree connecting the viruses investigated here must therefore be located on the branch between the HTLV sequence (the “outgroup”) and the rest (the “ingroup”). This way of finding a root is called “outgroup rooting”.
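The logic of outgroup rooting can be sketched in a few lines. The example below assumes the unrooted tree is given as a list of edges with branch lengths, places the root on the branch leading to the chosen outgroup tip, and writes the result as a Newick string. The node names and lengths are made up, and putting the root at the midpoint of the outgroup branch is an arbitrary choice for this sketch, not a statement about what FigTree does.

```python
from collections import defaultdict


def root_on_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to the tip `outgroup`.

    `edges` is a list of (node_a, node_b, branch_length) tuples.
    Returns a Newick string for the rooted tree.
    """
    neighbours = defaultdict(dict)
    for a, b, length in edges:
        neighbours[a][b] = length
        neighbours[b][a] = length
    if len(neighbours[outgroup]) != 1:
        raise ValueError("the outgroup must be a tip of the tree")

    def subtree(node, parent):
        # Walk away from `parent`, so the direction of the tree follows the root
        children = [n for n in neighbours[node] if n != parent]
        if not children:
            return node                         # a tip: just its name
        parts = [f"{subtree(c, node)}:{neighbours[node][c]:g}" for c in children]
        return "(" + ",".join(parts) + ")"

    (attach, length), = neighbours[outgroup].items()
    half = length / 2.0     # arbitrary: root at the midpoint of the outgroup branch
    return f"({outgroup}:{half:g},{subtree(attach, outgroup)}:{half:g});"


# Made-up unrooted tree: tips OUT, A, B, C and internal nodes x, y
edges = [("OUT", "x", 4), ("A", "y", 1), ("B", "y", 1), ("y", "x", 2), ("C", "x", 3)]
print(root_on_outgroup(edges, "OUT"))
```

The branching pattern among the ingroup tips is unchanged by rooting; only the direction in which the tree is read, from ancestor to descendant, is fixed.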

Step 8

Inspect the rooted tree that you get as a result of rerooting and consider what this tells you about the origin of HIV viruses.

Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences? The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?

With these groupings in consideration, what can you say about the origin of the two HIV viruses?
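The terms clade and sister group can be made precise with a short sketch. It assumes a rooted tree written as nested tuples in which each tip is a string; the tree and tip names are invented and deliberately do not correspond to the viruses in this exercise.

```python
def tips(node):
    """Set of tip names below a node; a tip is a string, an internal node a tuple."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def sister_of(tree, group):
    """Return the tips of the sister group of `group`.

    Returns None if `group` is not a clade (or is the whole tree).
    """
    group = set(group)
    if isinstance(tree, str):
        return None
    for child in tree:
        if tips(child) == group:
            # Everything else descending from the same parent is the sister group
            others = [tips(c) for c in tree if c is not child]
            return set().union(*others)
    for child in tree:
        found = sister_of(child, group)
        if found is not None:
            return found
    return None


# Made-up rooted tree: (OUT,(((A1,A2),X),((B1,B2),(Y1,Y2))))
tree = ("OUT", ((("A1", "A2"), "X"), (("B1", "B2"), ("Y1", "Y2"))))

print(sister_of(tree, {"A1", "A2"}))
print(sister_of(tree, {"B1", "B2"}))
print(sister_of(tree, {"A1", "X"}))   # not a clade
```

A group is a clade only if some node has exactly those tips as descendants, which is why the last query returns None even though A1 and X are close to each other in the tree.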

Now you can save your tree as a picture by choosing File -> Export Graphics. Choose a suitable location and file format (e.g. .png) and hand it in along with your answers.

Time to try on your own!

For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequence, but lack the first 90 nucleotides or so). The sequences can be found via the following link:

L18_CDS.fasta

Step 9

Your answers should include the following:

How did you construct the tree? (alignment method, construction of tree, outgroup etc.).

A picture of the tree. Note: It is easy to increase the font size of the sequence names in FigTree. Just click the small arrow next to Tip Labels and enter a new font size.

A comparison of your tree with NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy). Are there any taxa that are not placed correctly on your tree?

Mitochondrial versus nuclear proteins

In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion’s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use UniProt (http://www.uniprot.org/) to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyse the phylogeny of the dataset.

Step 10

Find all proteins named “ribosomal protein L3” from as many eukaryotes (Eukaryota) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).
How many of these have a Subcellular location of “mitochondrion” and “cytoplasm”, respectively? Download the results of these two searches in FASTA format.
Now combine the two data sets from the previous question into one FASTA file (using jEdit or another plain text editor). Note that the entry names start with “RL3” (cytoplasmic) or “RM03”/“RK3” (mitochondrial), which makes it easy to tell them apart. If you have any names that do not begin with “RL3”, “RK3” or “RM03”, revisit your UniProt search criteria! Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). Describe all the steps you did to make it.
Visualize the tree using FigTree. Reroot the tree so that the cytoplasmic and the mitochondrial sequences are in two monophyletic groups (if possible). Include a picture of the rerooted tree in your answer.
Consider your rerooted tree. Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?
Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?
Where has evolution been faster (where are there most mutations per time unit) – among the cytoplasmic or the mitochondrial proteins?
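Combining the two downloads and checking the name prefixes can also be scripted instead of done by hand in a text editor. The following is a minimal sketch, assuming the files use UniProt-style headers of the form `>sp|ACCESSION|ENTRY_NAME description`; the function names, the accessions (X00001 and so on) and the short sequences are placeholders invented for this example.

```python
EXPECTED_PREFIXES = ("RL3", "RK3", "RM03")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def entry_name(header):
    """Return ENTRY_NAME from 'sp|ACCESSION|ENTRY_NAME ...' (assumed format)."""
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[2].split()[0]
    return header.split()[0]


def combine_fasta(*fasta_texts):
    """Merge FASTA texts, skip duplicate entries, report unexpected names."""
    combined, unexpected, seen = [], [], set()
    for text in fasta_texts:
        for header, sequence in read_fasta(text):
            name = entry_name(header)
            if name in seen:
                continue
            seen.add(name)
            combined.append((header, sequence))
            if not name.startswith(EXPECTED_PREFIXES):
                unexpected.append(name)
    return combined, unexpected


def format_fasta(records, width=60):
    """Write records back out as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Placeholder records, invented for illustration only.
cytoplasmic = ">sp|X00001|RL3_HUMAN placeholder\nMSHRK\nFSAPR\n"
mitochondrial = ">sp|X00002|RM03_HUMAN placeholder\nMPGWR\n"
records, odd_names = combine_fasta(cytoplasmic, mitochondrial)
print(format_fasta(records), odd_names)
```

If the list of unexpected names is not empty, that is the cue to revisit the search criteria, as suggested in the exercise.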

Exercise: Phylogeny-Answers

Answers to the Phylogeny exercise

Step 8

Answers to “The Phylogeny of HIV” can be found here (https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html).

Step 9

How did you construct the tree? (alignment method, construction of tree, outgroup etc.)

To start with, you need to make a multiple alignment of your sequences. A number of different alignment methods can be used (e.g. MAFFT or RevTrans). Here you can see an example of a MAFFT alignment.

>Yeast

TACACTTTCTT—AGCTCGTCGTACTGATGCTCCATTCAA CAAGGTTGTCTTG

AAGGCTTTGTTCTTGTCTAAGATCAACAGACCACCTGTTTCTGTCTCTAGAATTGCTAGA GCTTTGAAGCAAGAAGGTGC TGCTAACAAGACTGTTGTCGTT

GTTGGTACTGTTACTGACGATGCCAGAATCTTTGAATTCCCAAAGACCACTGTTGCTGCT TTGAGATTCACTGCTGGTGCCAGAGCCAAGATTGTTAAGGCTGGTGGTGAATGTATCACT TTGGATCAATTAGCTGTCAGAGCTCCAAAGGGTCAAAACACTTTGATCTTGAGAGGTCCA AGAAACTCCAGAGAAGCTGTCAGACACTTCGGTATGGGTCC ACACAAGGGT

AAGGCTCCAAGAATCTTGTCCACCGGTAGAAAGTTCGAAAGAGCTAGAGGTAGAAGAAGA TCTAAGGGTTTCAAGGTG

>African_frog

TATCGATTCTT—GGCTCGTCGTACCAACTCCAGTTTCAA CCGGGTGGTTCTG

AAGCGTCTGTTCATGAGCCGAACCAACAGGCCACCCCTCTCTATGTCCCGTCTTATTCGC AAAATGAAATTGCAAGGACG TGAAAACAAGACTGCAGTGGTT

GTGGGCTGTATCACAGATGATGTCAGGATCCATGATATCCCCAAACTGAAGGTGTGCGCA CTTAAAATAACCAGCGGAGCACGTAGCCGAATCCTGAAGTCTGGAGGTCAGATTATGACG TTTGATCAGCTCGCCCTTGCGGCCCCTAAAGGCCAGAACACTGTTCTTCTTTCAGGACCT CGTAAGGCCCGTGAAGTATACAGACACTTTGGGAAGGCACCTGGTACTCCACACAGTCGC ACTAAGCCTTATGTGCTCTCCAAGGGTAGAAAGTTTGAGCGCGCCAGAGGACGCAGAGCC AGCAGAGGATACAAGAAC

>Pig

TACAGGTTTCT—GGCCAGACGAACCAACTCCACCTTCAA TCAAGTTGTGCTG

AAGAGGTTGTTCATGAGTCGCACCAACCGGCCACCCCTGTCGCTTTCCCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAGGCAAGACCGCTGTGGTC

GTAGGGACTATAACCGATGACGTGCGTGTCCAGGAGGTGCCCAAATTGAAGGTGTGCGCT CTGCGCGTGAGCAGCCGTGCCCGGAGCCGCATTCTCAAGGCCGGGGGCAAAATCCTCACC TTCGACCAGTTGGCCCTGGACTCCCCCAAAGGCTGTGGCACTGTCCTCCTCTCTGGGCCT CGCAAGGGCCGCGAGGTGTACAGGCATTTCGGCAAGGCCCCAGGGACCCCGCACAGCCAC ACCAAACCCTATGTTCGCTCCAAGGGCCGGAAGTTCGAGCGCGCCAGAGGCCGACGTGCC AGCCGCGGCTACAAAAAC

>Fin_whale

TACAGGTTTCT—GGCCAGGCGAACCAACTCCACCTTCAA TCAAGTTGTGCTG

AAGAGGTTGTTCATGAGTCGCACCAACCGGCCACCTCTGTCCCTTTCCCGGATGATTCGG AAGATGAAGCTTCCCGGCCG GGAAGGCAAAACGGCCGTGGTG

GTGGGGACAGTGACTGATGACGTGCGAGTCCAGGAGGTGCCCAAGCTGAAGGTGTGTGCT CTCCGGGTGAGCAGCCGCGCCCGGAGCCGCATCCTCAAGGCCGGGGGCAAGATCCTCACC TTCGACCAGCTGGCCCTGGACTCCCCCAAGGGCTGTGGCACCGTGCTCCTGTCTGGTCCT CGCAAGGGCCGAGAGGTGTACAGGCATTTCGGCAAGGCCCCAGGAACCCCGCATAGCCAC ACCAAACCCTATGTACGCTCCAAGGGCCGGAAGTTCGAGCGCGCCAGAGGCCGACGGGCC AGCCGTGGCTACAAA—

>Human

TACAGGTTTCT—GGCCAGAAGAACCAACTCCACATTCAA CCAGGTTGTGTTG

AAGAGGTTGTTTATGAGTCGCACCAACCGGCCGCCTCTGTCCCTTTCCCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAAACAAGACGGCCGTGGTT

GTGGGGACCATAACTGATGATGTGCGGGTTCAGGAGGTACCCAAACTGAAGGTATGTGCA CTGCGCGTGACCAGCCGGGCCCGCAGCCGCATCCTCAGGGCAGGGGGCAAGATCCTCACT TTCGACCAGCTGGCCCTGGACTCCCCTAAGGGCTGTGGCACTGTCCTGCTCTCCGGTCCT CGCAAGGGCCGAGAGGTGTACCGGCATTTCGGCAAGGCCCCAGGAACCCCGCACAGCCAC ACCAAACCCTACGTCCGCTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGACGGGCC AGCCGAGGCTACAAAAAC

>Monkey_macaque

TACAGGTTTCT—GGCCAGAAGAACCAATTCCACATTCAA CCAGGTTGTGCTG

AAGAGGTTGTTTATGAGTCGCACCAACCGGCCTCCTCTGTCCCTTTCTCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAAACAAAACGGCCGTGGTT

GTGGGGACCATAACGGACGACGTGCGGGTTCAGGAGGTGCCCAAACTGAAGGTATGTGCA CTGCGCGTAACCAGCCGGGCCCGCAGCCGCATCCTCAGGGCAGGGGGCAAGATCCTCACT TTCGACCAGCTGGCCCTGGACTCCCCCAAGGGCTGCGGCACTGTTCTGCTCTCCGGTCCT CGCAAGGGCCGAGAGGTGTACCGGCATTTCGGCAAGGCCCCAGGAACCCCGCACAGCCAC ACCAAACCCTACGTCCGCTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGACGGGCC AGTCGAGGCTACAAAAAC

>Rat

TACAGGTTTCT—GGCCAGACGGACCAACTCCACCTTCAA CCAGGTTGTGCTG

AAAAGGTTATTTATGAGCCGAACTAACCGGCCACCTCTGTCCCTGTCCCGAATGATCCGG AAGATGAAGCTTCCTGGTCG GGAGAACAAAACTGCTGTGGTT

GTGGGGACGATCACAGATGATGTGCGGATTCTGGAAGTGCCCAAGCTGAAGGTGTGTGCA CTGAGGGTGAGCAGCCGGGCCCGAAGTCGGATCCTCAAGGCTGGGGGTAAGATCCTGACC TTCGACCAGCTGGCCCTGGAGTCTCCCAAGGGCAGGGGCACTGTGCTCTTGTCTGGTCCT CGGAAGGGCCGAGAGGTGTACCGACACTTTGGCAAGGCCCCAGGAACTCCACACAGCCAC ACCAAACCCTATGTCCGTTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGAAGGGCC AGCCGAGGCTACAAAAAC

>Mouse

TACAGGTTTCT—GGCCAGACGGACCAACTCCACCTTCAA TCAGGTTGTGCTG

AAGAGGTTGTTCATGAGCCGAACCAACCGGCCACCTCTGTCCCTGTCCCGCATGATCCGA AAGATGAAGCTTCCTGGCCG CGAGAACAAGACTGCCGTGGTT

GTGGGGACGGTCACAGATGATGTGCGGATTCTGGAAGTTCCCAAGCTGAAGGTGTGTGCA CTGCGGGTGAGCAGCCGGGCCCGGAGTCGCATCCTCAAGGCTGGGGGTAAGATCCTCACC TTTGACCAGCTGGCCCTGGAGTCTCCCAAGGGCCGGGGCACTGTGCTCCTGTCTGGTCCT CGGAAGGGCCGAGAGGTGTACCGACATTTTGGCAAGGCCCCAGGAACCCCACACAGCCAT

ATGGCG

ACCAAACCCTATGTCCGTTCCAAGGGCCGGAAGTTTGAGCGCGCCAGAGGCCGAAGGGCC
AGCAGAGGCTACAAAAAC
>Salmon
TATCGTTTACCTGGAAGCAAATGCTCCACTGCTCCCTTCAA CAAGGTGGTCCTC
AGGAGGCTCTTCATGAGCAGGACCCACAGGCCTCCGATGTCAGTGTCCCGCATGATCCGT
AAGATGAAATTGCCTGGACG TGAGAACAGAACCGCAGTTGTC
GTGGGAACCGTCACTGATGATGTCAGAATTCATGAAATCCCTAATCTGAAGGTCTCGGCA
CTTAAAATAACCAGGCGAAATCGGACGCGAATTCTGAAGTTTGTG—CAGATTATGAGG
TTCGTTGGGCTCGCACTTGCTGCTCCTAATCGGCAGAAGAGTGTTCTTCTTTCCGCCCCC
CGTAACGCGCGTGATGTATCCAGGCACTTTGCCAACGCCCCCAGTATTCC TCAC
ACTAAGCCTTACGTGCTTTCCAA——CAAGTTACGGCG—CAGAGGCAGCAAGCTC
ACT TACAACAAC
>Fruit_fly
TACCGCTTCCT—TCAGCGCCGCACCAACAAGAAGTTCAA CCGCATCATCCTG
AAGCGTTTGTTCATGAGCAAGATCAACAGGCCGCCGCTATCGCTTCAGCGCATCGCTCGC
TTCTTCAAGGCCGCCAACCA GCCGGAGTCTACCATCGTGGTC
GTCGGCACCGTCACCGACGATGCCCGCCTCCTGGTGGTGCCCAAGCTCACCGTGTGCGCC
CTGCACGTCACGCAGACCGCCAGGGAGCGCATCCTGAAGGCCGGCGGTGAGGTCCTGACC
TTCGATCAACTGGCTCTCCGATCGCCCACCGGCAAGAACACGCTGCTGCTGCAGGGCAGG
CGTACCGCCCGCACCGCCTGCAAGCACTTCGGCAAGGCTCCCGGTGTGCCCCACTCGCAC
ACCCGCCCCTATGTCCGCTCTAAGGGACGCAAGTTCGAGCGTGCTCGTGGTCGTCGCTCC
AGCTGCGGCTACAAGAAG
>Arabidopsis
TACCGGTTTCT—GGTAAGGAGAACTAATAGCAAGTTCAA TGGTGTGATATTG
AAGAGGCTTTTCATGAGCAAAGTCAACAAAGCTCCTCTTTCTCTATCTAGGCTTGTGGAG
TTCATGACTGGCAA GGAAGATAAGATTGCCGTCTTG
GTTGGAACTATAACTGATGATTTGAGGGTACACGAGATTCCAGCCATGAAAGTGACTGCC
TTGAGGTTCACAGAGAGAGCAAGGGCTCGCATTGAGAAAGCTGGAGGTGAATGCTTAACC
TTTGACCAGCTCGCTCTCAGAGCTCCATTGGGCCAGAACACGGTTCTTCTTAGAGGACCT
AAGAATTCACGTGAAGCAGTGAAGCATTTCGGACCTGCTCCTGGTGTGCCACACAGTCAC
TCCAAGCCATATGTTCGGGCCAAGGGAAGGAAGTTCGAGAAGGCCAGAGGAAAGAGGAAG
AGTCGTGGATTCAAGGTT
>Soy
TATCGCTTCCT—TGTTCGGAGAACTGGCAGCAACTTCAA TGCTGTTATACTT
AAGAGATTGTTCATGAGCAAGGTTAACAAACCCCCATTGTCTTTGTCAAGGTTGATTAAG
TATACGAAGGGGAA GGAAGATAAGATTGCAGTGGTG
GTGGGGTCTATAACCGATGATATTCGTGTTTATGAAGTTCCACCATTGAAAGTTACAGCA
CTCAGGTTTACAGAGACTGCCCGTGCAAGAATTGAGAAGGCAGGCGGTGAATGCTTGACG
TTTGATCAGTTGGCTCTCAGGGCTCCTCTGGGACAGAACACGGTCCTTCTTAGAGGCCCA
AAGAATGCTCGCGAAGCTGTGAAGCACTTTGGTCCTGCTCCTGGTGTCCCTCACAGCCAC
ACCAAGCCTTATGTTCGAGCAAAGGGAAGGAAGTTTGAGAGGGCTAGAGGAAGGAGGAAC
AGCCGAGGATTTAGGGTT
>Rice
TACCGCTTCCT—GGTGCGGAGGACCAAGAGCCACTTCAA CGCCGTGATCCTG
AAGCGGCTCTTCATGAGCAAGACCAACCGCCCGCCGCTCTCGATGCGCCGTCTCGTCAGG
TTCATGGAGGGGAAGGTACCTGATCGCCATGCCATTTCGGGGGACCAGATCGCCGTGATC
GTGGGCACCGTCACAGATGACAAGAGGATCTATGAGGTGCCGGCGATGAAGGTGGCTGCT
CTCAGGTTCACCGAGACCGCGAGAGCACGGATCATCAATGCCGGTGGCGAGTGCCTCACG
TTCGACCAGCTCGCTCTCCGCGCCCCGCTTGGCCAGAACACGGTCCTCCTGAGGGGTCCC
AAGAACGCTAGGGAAGCTGTTAAGCACTTTGGCCCTGCTCCAGGAGTTCCCCACAGCAAC
ACTAAGCCATATGTTCGCTCAAAGGGAAGGAAATTTGAGAAGGCAAGAGGAAGAAGGAAC
AGCAAGGGCTTCAAGGTA
>Methanocaldococcus_jannaschii
ATTGAGATATTAAAGCAGGAAAGTTATAAAAATCAGGCAAAGATTTGGAAGGATATTGCA
AGAAGGTTAGCAAAACCAAGAAGAAGGAGAGCAGAGGTAAATTTAAGTAAGATAAACAGA
TACACAAA AGAAGGAGATGTTGTTTTAGTT
CCTGGTAAAGTTTTAGGAGCTGGGAAGTT AGAGCACAAGGTTGTCGTTGCTGCA
TTTGCATTCTCAGAAACAGCTAAAAAATTAATTAAAGAAGCTGGAGGAGAAGCAATAACA
ATTGAAGAGCTAATAAAAAGAAATCCAAAAGGTTCAAATGTTAAAATT————


>Pyrococcus
ATTCGTTACCTCAGGAAAAAGTCTAATGAAGAGAAAGTTAAGATATGGAAGGACATAGCT
TGGAGACTTGAAAGACCAAGGAGGCAGAGGGCCGAAGTAAACGTCAGCAGGATAAACAGG
TACGCGAA GGATGGAGACATGATAGTGGTT
CCAGGGAGCGTTCTTGGGGCCGGCAAGAT AGAGAAGAAGGTCATTGTAGCTGCT
TGGAAGTTCAGTGAAACTGCAAGGAGAAAAATCGAGGAGGCCGGTGGGGAGGCCATAACG
ATTGAAGAGCTAATTAAGAGGAATCCAAAGGGAAGTGGAGTAATAATT————

	ATGGAG

Next you need to construct the tree. This can be done using the Treehugger tool: http://www.cbs.dtu.dk/services/TreeHugger/web/ where you simply paste in your multiple alignment.
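How TreeHugger builds its tree is not described here, so the following is not a reimplementation of it. As a generic illustration of how an alignment can be turned into the pairwise numbers that distance-based tree methods work from, this sketch computes the p-distance (the fraction of compared alignment columns where two sequences differ), skipping columns with a gap in either sequence. The function names and the tiny alignment are invented for the example.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def distance_matrix(alignment):
    """Return (names, matrix) of pairwise p-distances for a dict of sequences."""
    names = list(alignment)
    matrix = [[p_distance(alignment[a], alignment[b]) for b in names]
              for a in names]
    return names, matrix


# Toy alignment, invented for illustration only.
toy_alignment = {"x": "ACGT-A", "y": "ACCTTA", "z": "ACGTTA"}
names, matrix = distance_matrix(toy_alignment)
for name, row in zip(names, matrix):
    print(name, row)
```

The p-distance underestimates the true number of substitutions for divergent sequences, which is why tree-building programs usually apply a substitution model on top of it.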

A picture of the tree

Now the tree is ready to be opened in FigTree. All the sequences except two are from eukaryotes. The last two (Pyrococcus and Methanocaldococcus jannaschii) are both archaea, so we choose those two as our outgroup. (Note that you can easily use more than one sequence as the outgroup: select the branch that connects both organisms to the rest of the tree and press “Reroot”.)

A comparison of your tree with NCBI taxonomy. Are there any taxa that are not placed correctly on your tree?

On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out below frog, which would branch out below the group of mammals (see illustration below).

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy’s “Common Tree” function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group Euarchontoglires.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group Opisthokonta.

It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are a number of reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species for entirely random reasons. This phenomenon tends to disappear as more sequence data are included in the analysis (the law of large numbers).

Step 10

52 results.

Search string: (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)

8 and 26 results, respectively.

Search strings: (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)

and (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)

Under the Download tab in UniProt, select “Download all”, “FASTA (canonical)” and “Uncompressed”.

Then use jEdit (or another text editor) to combine them. The combined FASTA file is Ribosomal_proteins_34.fasta.txt.
Go to EBI’s MAFFT server, choose “Protein”, upload the combined FASTA file, and let all other options be default. When the alignment is done, click “Download Alignment File” and save the file. Then upload the result to TreeHugger and save the resulting Newick file.
Yes, it is possible to reroot the tree so that all the cytoplasmic sequences (RL3_*) and all the mitochondrial sequences (RM03_*) are in two separate monophyletic groups. After increasing the font size of the tip labels, the tree looks like this:

The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
There is one difference: in the mitochondrial subtree, Bovine (cow) is the sister group to Human, while in the cytoplasmic subtree, Mouse+Rat form the sister group to Human+Macaque. The cytoplasmic subtree agrees better with the accepted species phylogeny.
There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the mitochondrial branches being longer (the mitochondrial tips are further away from the root).
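The comparison of branch lengths in the last answer can be made explicit by summing branch lengths from the root to each tip and averaging per group. The sketch below assumes the rooted tree is already available as nested (name, branch_length, children) tuples; the tip names and every branch length are invented, so the printed numbers say nothing about the real L3 tree.

```python
def root_to_tip(tree, depth=0.0):
    """Map each tip name to the summed branch length from the root."""
    name, length, children = tree
    depth += length
    if not children:
        return {name: depth}
    distances = {}
    for child in children:
        distances.update(root_to_tip(child, depth))
    return distances


def mean_distance_by_group(tree, classify):
    """Average root-to-tip distance for each group returned by classify(name)."""
    grouped = {}
    for name, distance in root_to_tip(tree).items():
        grouped.setdefault(classify(name), []).append(distance)
    return {group: sum(values) / len(values) for group, values in grouped.items()}


def location(name):
    """Classify tips by the entry-name prefixes used in this exercise."""
    return "cytoplasmic" if name.startswith("RL3") else "mitochondrial"


# Toy rooted tree, invented for illustration only.
toy_tree = ("root", 0.0, [
    ("cyto", 0.5, [("RL3_A", 0.25, []), ("RL3_B", 0.25, [])]),
    ("mito", 0.5, [("RM03_A", 1.0, []), ("RM03_B", 1.5, [])]),
])
print(mean_distance_by_group(toy_tree, location))
```

Because all sequences come from present-day organisms, the same amount of time separates every tip from the root, so a larger average root-to-tip distance in one group points to more substitutions per unit of time in that group. The comparison is only as reliable as the placement of the root.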

Applications of Phylogenetic Analysis

Phylogenomics and comparative genomics

Phylogenomics and comparative genomics are two closely related fields in genomics that use genomic data to study evolutionary relationships and genome evolution across different species. While they share some similarities, they differ in their focus and methods:

Phylogenomics:
- Phylogenomics integrates large-scale genomic data, such as whole-genome sequences or large sets of orthologous genes, to reconstruct phylogenetic trees and infer evolutionary relationships among species.
- It combines principles of phylogenetics with genomic data to resolve complex evolutionary relationships, especially among closely related species or in cases where traditional phylogenetic markers (e.g., single genes) are insufficient.
- Phylogenomics can provide insights into the evolutionary history of organisms, including the timing of speciation events, patterns of gene duplication and loss, and the evolution of novel traits.
Comparative Genomics:
- Comparative genomics compares the genomes of different species to identify similarities and differences in gene content, gene order, and genomic organization.
- It aims to understand the evolutionary processes that have shaped genomes, such as gene duplication, gene loss, horizontal gene transfer, and genome rearrangements.
- Comparative genomics can reveal the genetic basis of species-specific traits, evolutionary innovations, and adaptations to different environments.
Relationship between Phylogenomics and Comparative Genomics:
- Phylogenomics often relies on comparative genomics approaches to compare genomic features across species and infer evolutionary relationships based on shared genetic similarities.
- Comparative genomics provides the data and methods needed for phylogenomic analyses, such as identifying orthologous genes, constructing multiple sequence alignments, and building phylogenetic trees from genomic data.
Applications:
- Phylogenomics is used to reconstruct the Tree of Life, resolve phylogenetic relationships among organisms, and study the evolution of key traits, such as the origin of complex multicellularity or the evolution of vertebrate immunity.
- Comparative genomics is used to study genome evolution, identify genetic factors underlying disease, understand the genetic basis of adaptation, and compare the genomic features of model organisms to other species.

Overall, phylogenomics and comparative genomics are powerful approaches that leverage genomic data to study evolutionary relationships and genome evolution. They provide valuable insights into the genetic basis of biodiversity and the evolutionary processes that have shaped life on Earth.

Evolutionary developmental biology (Evo-Devo)

Evolutionary developmental biology, often abbreviated as Evo-Devo, is a field of biology that studies how changes in developmental processes contribute to the evolution of phenotypic diversity in organisms. Evo-Devo seeks to understand the genetic and molecular mechanisms underlying the development of various organisms and how these mechanisms have evolved over time to produce the diversity of forms seen in nature.

Key concepts and approaches in Evo-Devo include:

Heterochrony: Changes in the timing of developmental events. For example, changes in the rate of growth or the timing of gene expression can lead to differences in body size or shape.
Heterotopy: Changes in the location of developmental events. This can result in the evolution of novel structures or changes in the relative size or shape of existing structures.
Heterometry: Changes in the amount or degree of developmental processes. This can lead to changes in the size, shape, or complexity of structures.
Gene Regulation: Evo-Devo studies how changes in gene expression patterns and regulatory networks can lead to morphological differences between species. This includes the evolution of cis-regulatory elements that control gene expression.
Comparative Developmental Biology: By comparing the developmental processes of different species, Evo-Devo researchers can identify conserved developmental pathways as well as differences that contribute to evolutionary change.
Evolution of Novelty: Evo-Devo seeks to understand how new traits and structures evolve, including the role of gene duplication, gene loss, and co-option of existing developmental pathways.
Experimental Evolution: Some Evo-Devo research involves experimental evolution, where researchers manipulate the developmental processes of organisms in the lab to observe how new traits can evolve under controlled conditions.
Paleontological and Morphological Data: Evo-Devo researchers often use data from the fossil record and comparative morphology to infer how developmental processes have evolved over time and how they relate to morphological diversity.

Evo-Devo has provided insights into a wide range of biological questions, including the evolution of animal body plans, the origin of key innovations in evolution, and the genetic basis of morphological diversity. It has also contributed to our understanding of human evolution and development, shedding light on the genetic and developmental basis of human traits and diseases.

Phylogenetic signal and trait evolution

Phylogenetic signal refers to the tendency of closely related species to resemble each other more than they resemble species drawn at random from the same phylogenetic tree. In other words, it is the degree to which the evolutionary history of a group of species is reflected in their trait similarities or differences.

Phylogenetic signal can be quantified using various metrics, such as Blomberg’s K, Pagel’s lambda, or phylogenetic correlograms. Blomberg’s K and Pagel’s lambda compare the observed trait similarity among species with the similarity expected if the trait had evolved along the phylogeny by Brownian motion (a random walk): values near 1 match that expectation, while values near 0 indicate that trait values are distributed independently of the phylogeny. A high phylogenetic signal means that closely related species are more similar in their traits than expected by chance, suggesting that the trait has evolved in a phylogenetically conserved manner. Conversely, a low phylogenetic signal suggests that trait evolution has been more labile and is not strongly structured by the phylogenetic relationships among species.
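To make Pagel’s lambda more concrete, the sketch below builds the trait covariance matrix expected under Brownian motion (each entry is the branch length two tips share on their way from the root) and then applies the lambda transformation, which multiplies the off-diagonal entries by lambda and leaves the diagonal unchanged. With lambda = 1 the matrix is unchanged; with lambda = 0 all covariance between species is removed, which corresponds to no phylogenetic signal. The tree, its branch lengths and the function names are invented for illustration, and estimating lambda from real trait data (normally done by maximum likelihood) is not shown.

```python
def brownian_covariance(tree):
    """Return (tip_names, matrix) of shared root-to-ancestor path lengths.

    tree is a nested (name, branch_length, children) tuple.
    """
    shared = {}

    def visit(node, depth):
        name, length, children = node
        depth += length
        if not children:
            shared[(name, name)] = depth  # variance: root-to-tip distance
            return [name]
        groups = [visit(child, depth) for child in children]
        # Tips in different child subtrees share the path down to this node.
        for i, left in enumerate(groups):
            for right in groups[i + 1:]:
                for a in left:
                    for b in right:
                        shared[(a, b)] = shared[(b, a)] = depth
        return [tip for group in groups for tip in group]

    tips = visit(tree, 0.0)
    matrix = [[shared[(a, b)] for b in tips] for a in tips]
    return tips, matrix


def pagel_lambda_transform(matrix, lam):
    """Scale off-diagonal covariances by lam, keeping the diagonal as is."""
    return [[value if i == j else lam * value
             for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]


# Toy tree, invented for illustration only: A and B are sister species.
toy_tree = ("root", 0.0, [
    ("n1", 1.0, [("A", 1.0, []), ("B", 1.0, [])]),
    ("C", 2.0, []),
])
tips, covariance = brownian_covariance(toy_tree)
print(tips)
print(covariance)
print(pagel_lambda_transform(covariance, 0.5))
```

In this toy tree only A and B share any history after the root, so theirs is the only covariance that lambda rescales.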

Phylogenetic signal is important in evolutionary biology because it provides insights into the evolutionary processes that have shaped trait variation among species. Traits with a strong phylogenetic signal are likely to be influenced by genetic constraints, shared evolutionary history, or adaptation to similar environments, while traits with a weak phylogenetic signal may be more influenced by ecological or environmental factors.

Understanding phylogenetic signal is also important for comparative analyses, such as phylogenetic comparative methods (PCMs), which rely on the assumption of phylogenetic signal to make valid inferences about trait evolution. By quantifying phylogenetic signal, researchers can assess the appropriateness of using phylogenetic comparative methods to study trait evolution in a particular group of species.

In summary, phylogenetic signal is a fundamental concept in evolutionary biology that helps us understand how traits evolve and diversify across the tree of life. It provides a framework for studying trait evolution in a phylogenetic context and can help us unravel the processes driving the remarkable diversity of life on Earth.

Phylogeny-guided drug discovery and biotechnology

Phylogeny-guided drug discovery and biotechnology involve using evolutionary relationships among organisms to identify novel compounds, genes, or biochemical pathways with potential applications in medicine, agriculture, or industry. This approach leverages the diversity of life on Earth to discover new biological resources that can be used to develop new drugs, biotechnological products, or sustainable solutions. Here are some key aspects of phylogeny-guided drug discovery and biotechnology:

Biodiversity as a Resource: The vast diversity of life forms on Earth represents a rich source of potentially valuable compounds and genes. By studying the evolutionary relationships among organisms, researchers can identify groups of organisms that are likely to produce bioactive compounds or possess unique biochemical pathways.
Bioinformatics and Comparative Genomics: Phylogeny-guided drug discovery often relies on bioinformatics and comparative genomics to analyze genetic sequences and predict the functions of genes and proteins. By comparing the genomes of different organisms, researchers can identify genes that are unique to certain groups and may encode novel bioactive compounds or enzymes.
Natural Products Discovery: Many drugs and biotechnological products are derived from natural sources, such as plants, microbes, and marine organisms. Phylogeny-guided approaches can help researchers identify new sources of natural products and screen them for bioactivity.
Metagenomics and Environmental Sampling: Metagenomics involves studying the genetic material recovered directly from environmental samples, such as soil or water. By analyzing the genetic diversity in environmental samples, researchers can identify novel genes and biochemical pathways that may have biotechnological or medical applications.
Bioprospecting and Traditional Knowledge: Phylogeny-guided approaches can also involve working with indigenous communities and traditional healers to identify plants or organisms with medicinal properties. This approach combines traditional knowledge with modern scientific methods to discover new drugs and biotechnological products.
Drug Development and Optimization: Once potential drug candidates or biotechnological products are identified, they undergo further testing and development to optimize their efficacy, safety, and stability. This process may involve chemical modification, formulation development, and preclinical and clinical testing.

Overall, phylogeny-guided drug discovery and biotechnology offer a promising approach to discovering new drugs, biotechnological products, and sustainable solutions by tapping into the rich diversity of life on Earth. By integrating evolutionary biology with modern biotechnology and bioinformatics, researchers can uncover valuable resources that have the potential to benefit society and improve human health.