Phylogeny and Evolutionary Analysis Tutorial
March 17, 2024
Course Description: This course provides an introduction to phylogeny and its applications in evolutionary analysis. Students will learn about the principles of phylogenetics, various methods used in phylogenetic analysis, and how phylogeny is utilized to study evolutionary relationships among organisms. The course will also cover practical aspects of using bioinformatics tools for phylogenetic analysis.
Course Objectives:
- Understand the principles of phylogeny and evolutionary relationships.
- Learn about different methods used in phylogenetic analysis.
- Gain practical skills in using bioinformatics tools for phylogenetic analysis.
- Apply phylogenetic analysis to study evolutionary patterns and processes.
Introduction to Phylogeny
Overview of evolutionary biology and phylogenetics
Evolutionary biology is the study of how organisms have evolved and diversified over time. It encompasses various subfields, including phylogenetics, which is the study of evolutionary relationships among organisms. Phylogenetics uses genetic, morphological, and behavioral data to reconstruct the evolutionary history, or phylogeny, of organisms. Here’s an overview of evolutionary biology and phylogenetics:
- Evolutionary Theory: Evolution is the change in heritable traits of biological populations over successive generations. The theory of evolution by natural selection, proposed by Charles Darwin, holds that all species of life have descended from common ancestors. Later work added further mechanisms, such as genetic drift, to natural selection.
- Phylogenetics: Phylogenetics aims to reconstruct the evolutionary relationships among species or groups of organisms. It uses molecular data, such as DNA sequences, as well as morphological and behavioral traits, to infer the evolutionary tree or phylogeny.
- Evolutionary Trees: An evolutionary tree represents the evolutionary relationships among a group of organisms. It shows the common ancestor and the branching points where species diverged from each other.
- Taxonomy: Taxonomy is the science of classifying and naming organisms. Phylogenetics plays a crucial role in taxonomy by providing a scientific basis for classifying organisms based on their evolutionary relationships.
- Applications: Phylogenetics has diverse applications, including understanding the evolutionary history of species, identifying new species, studying the spread of diseases, and conservation biology.
- Methods: Phylogenetic analysis involves several methods, such as maximum likelihood, Bayesian inference, and distance-based methods, to reconstruct evolutionary trees and infer ancestral relationships.
- Tools: There are many software tools available for phylogenetic analysis, including MEGA, PAUP*, MrBayes, and PhyML, which help researchers analyze molecular data and reconstruct phylogenetic trees.
- Challenges: Phylogenetics faces challenges such as incomplete data, horizontal gene transfer, and the need for improved models of molecular evolution to accurately reconstruct evolutionary histories.
Overall, evolutionary biology and phylogenetics are fundamental to understanding the diversity of life on Earth and the processes that have shaped it over billions of years.
Historical development of phylogenetic theory
The historical development of phylogenetic theory can be traced back to the early ideas about the relationships among living organisms. Here’s an overview of the key developments in the history of phylogenetics:
- Early Concepts: The concept of a “natural classification” based on similarities among organisms was proposed by early naturalists like Aristotle and later by Carl Linnaeus. However, these classifications were based on morphological similarities rather than evolutionary relationships.
- Darwin’s Theory of Evolution: Charles Darwin’s theory of evolution, published in his book “On the Origin of Species” in 1859, provided a scientific explanation for the diversity of life on Earth. Darwin proposed that species evolved from common ancestors through a process of natural selection, which became a foundational concept for phylogenetics.
- Cladistics: In the 20th century, German entomologist Willi Hennig developed the principles of cladistics, a method for reconstructing evolutionary trees based on shared derived characteristics, or synapomorphies, among taxa. Cladistics revolutionized phylogenetic analysis by focusing on evolutionary relationships rather than overall similarity.
- Molecular Phylogenetics: With the advancement of molecular biology in the mid-20th century, researchers began using molecular data, such as DNA sequences, to infer evolutionary relationships. This led to the development of molecular phylogenetics, which provided a new perspective on phylogenetic relationships and helped resolve some long-standing evolutionary questions.
- Computational Phylogenetics: The development of computational methods in the late 20th century, such as maximum likelihood and Bayesian inference, revolutionized phylogenetic analysis. These methods allowed researchers to analyze large datasets and complex evolutionary relationships more efficiently.
- Phylogenomics and Big Data: Recent advances in sequencing technologies have led to the accumulation of vast amounts of genomic data, enabling phylogenetic analysis at a scale never before possible. Phylogenomics, the study of evolutionary relationships using genomic data, has become a major area of research in modern phylogenetics.
- Current Challenges: Despite the advancements, phylogenetics still faces challenges, such as the accurate modeling of molecular evolution, dealing with incomplete or biased data, and integrating morphological and molecular data to reconstruct accurate phylogenies.
Overall, the historical development of phylogenetic theory reflects the evolution of our understanding of evolutionary relationships among organisms, from early morphological classifications to the modern integration of molecular data and computational methods.
Importance of phylogeny in understanding evolutionary relationships
Phylogeny is crucial for understanding evolutionary relationships among organisms. Here are some key reasons why phylogeny is important in the study of evolution:
- Reconstructing Evolutionary History: Phylogenetics helps reconstruct the evolutionary history of species, showing how they are related through common ancestors. This information is essential for understanding the diversification and adaptation of life forms over time.
- Classifying Organisms: Phylogenetic trees provide a framework for classifying organisms based on their evolutionary relationships. This helps in organizing and naming species in a systematic way, known as taxonomy.
- Predicting Traits and Behaviors: By studying the evolutionary relationships among organisms, researchers can make predictions about the traits and behaviors of species. This is useful for understanding the ecological roles of organisms and predicting their responses to environmental changes.
- Biogeography: Phylogenetics helps explain the distribution of organisms in different geographical regions. By studying the evolutionary history of species, researchers can infer how and when organisms migrated to different areas.
- Conservation Biology: Understanding the evolutionary relationships among species is essential for conservation efforts. Phylogenetics helps identify endangered species, prioritize conservation efforts, and understand the potential impacts of extinction on ecosystems.
- Medical and Agricultural Applications: Phylogenetics is used in medical and agricultural research to study the evolution of pathogens, pests, and crop species. This information is valuable for developing strategies to control diseases and improve crop yields.
- Understanding Biodiversity: By revealing the evolutionary relationships among organisms, phylogenetics provides insights into the origin and maintenance of biodiversity. This information is crucial for conservation and management of ecosystems.
Overall, phylogeny is a fundamental tool in evolutionary biology, providing a framework for understanding the history of life on Earth and its diversity.
Principles of Phylogenetics
Concepts of homology, analogy, and synapomorphy
Homology, analogy, and synapomorphy are concepts used in phylogenetics and evolutionary biology to describe different types of similarities and differences among organisms. Here’s a brief overview of each concept:
- Homology: Homology refers to similarities between organisms that are due to shared ancestry. These similarities can be in terms of structures, such as organs or body parts, or in terms of genes and DNA sequences. Homologous traits are similar because they are inherited from a common ancestor, even if they have different functions in different organisms. For example, the forelimbs of vertebrates (such as humans, bats, and whales) are considered homologous structures, as they share a common ancestral structure despite their different functions (e.g., flying, swimming, or grasping).
- Analogy: Analogy, also known as homoplasy, refers to similarities between organisms that are not due to shared ancestry but rather to convergent evolution or evolutionary parallelism. Analogous traits evolve independently in different lineages in response to similar environmental pressures. These traits may serve similar functions but do not share a common evolutionary origin. For example, the wings of birds and bats are analogous as wings: powered flight evolved independently in the two groups, and their common ancestor had no wings. The underlying forelimb bones, by contrast, are homologous.
- Synapomorphy: Synapomorphy is a shared derived trait that is unique to a specific group of organisms and their common ancestor. Synapomorphies are used in phylogenetic analysis to infer evolutionary relationships and define monophyletic groups, or clades. For example, the presence of feathers is a synapomorphy of birds and their closest dinosaurian relatives, indicating their common ancestry.
In summary, homology refers to similarities due to shared ancestry, analogy refers to similarities due to convergent evolution, and synapomorphy refers to shared derived traits that define evolutionary relationships among organisms. Understanding these concepts is essential for reconstructing evolutionary relationships and studying the patterns and processes of evolution.
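The distinction between a synapomorphy and a homoplastic trait can be illustrated with a small character matrix. The sketch below is a minimal illustration, not a phylogenetic method: the taxa, the characters, and their 0/1 coding (0 = ancestral, 1 = derived) are assumptions made up for the example, and the function name `candidate_synapomorphies` is hypothetical. It reports the characters whose derived state is present in every member of a proposed group and absent everywhere else.

```python
# Hypothetical character matrix: 1 = derived state, 0 = ancestral state.
# Taxa, characters, and codings are illustrative assumptions, not real data.
MATRIX = {
    "frog":    {"feathers": 0, "wings": 0, "amnion": 0},
    "lizard":  {"feathers": 0, "wings": 0, "amnion": 1},
    "sparrow": {"feathers": 1, "wings": 1, "amnion": 1},
    "eagle":   {"feathers": 1, "wings": 1, "amnion": 1},
    "bat":     {"feathers": 0, "wings": 1, "amnion": 1},
}


def candidate_synapomorphies(matrix, group):
    """Return characters derived in all members of `group` and in no other taxon."""
    group = set(group)
    characters = next(iter(matrix.values())).keys()
    result = []
    for character in characters:
        derived_inside = all(matrix[t][character] == 1 for t in group)
        ancestral_outside = all(
            matrix[t][character] == 0 for t in matrix if t not in group
        )
        if derived_inside and ancestral_outside:
            result.append(character)
    return sorted(result)


birds = candidate_synapomorphies(MATRIX, ["sparrow", "eagle"])
amniotes = candidate_synapomorphies(MATRIX, ["lizard", "sparrow", "eagle", "bat"])
print(birds)     # feathers only; wings also occur in the bat (homoplasy)
print(amniotes)  # amnion only
```

In this toy matrix, wings fail the test for the bird group because the bat also has them, which is the pattern expected of an analogous trait.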
Construction of phylogenetic trees
Phylogenetic trees, also called evolutionary trees, depict the evolutionary relationships among organisms. (A cladogram is a phylogenetic tree that shows branching order only, without meaningful branch lengths.) Constructing a phylogenetic tree involves several steps:
- Data Collection: Gather molecular (DNA, RNA) or morphological data (physical characteristics) from the organisms of interest. The choice of data depends on the organisms and the evolutionary relationships being studied.
- Alignment: For molecular data, align the sequences to ensure that homologous positions are correctly matched. This step is crucial for accurate phylogenetic inference.
- Model Selection: Choose an appropriate evolutionary model that describes the substitution pattern of nucleotides or amino acids in the aligned sequences. Common models include the Jukes-Cantor, Kimura, and GTR models for nucleotide sequences, and the JTT and WAG models for protein sequences.
- Phylogenetic Inference: Use a phylogenetic inference method to construct the tree. Common methods include distance-based methods (e.g., Neighbor Joining), maximum parsimony, maximum likelihood, and Bayesian inference. These methods use the data and evolutionary model to estimate the most likely tree topology.
- Tree Evaluation: Assess the robustness of the tree using bootstrap analysis or posterior probability values. Bootstrap analysis involves resampling the data to generate multiple datasets and constructing trees from each dataset to evaluate the support for each branch.
- Tree Visualization: Visualize the phylogenetic tree using tree-drawing software. The tree is typically represented as a branching diagram, with branches indicating evolutionary relationships and branch lengths representing the amount of evolutionary change.
- Interpretation: Interpret the tree to understand the evolutionary relationships among the organisms. This involves identifying clades (groups of organisms that share a common ancestor) and inferring evolutionary events such as speciation and gene duplication events.
- Tree Editing: Sometimes, the tree may need to be edited or pruned to improve clarity or highlight specific relationships. This can be done using tree-editing software.
- Tree Annotation: Add labels, annotations, and scale bars to the tree to make it more informative and easier to interpret.
- Publication: Finally, the phylogenetic tree can be included in scientific publications or presentations to communicate the evolutionary relationships among the organisms studied.
Constructing phylogenetic trees is a complex process that requires careful consideration of the data, evolutionary models, and inference methods. The resulting trees provide valuable insights into the evolutionary history and relationships among organisms.
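As a small illustration of what an evolutionary model does, the sketch below computes the observed proportion of differing sites (the p-distance) between two aligned nucleotide sequences and corrects it with the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3). The function names and the toy sequences are assumptions for this example; positions with a gap (`-`) in either sequence are simply skipped, which is one common but not the only convention.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p-distance too large: sequences are saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy aligned sequences (hypothetical, for illustration only).
p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as the p-distance, because the model accounts for multiple substitutions at the same site that a simple count of differences cannot see.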
Tree-thinking and interpretation of phylogenetic trees
Tree-thinking refers to the ability to interpret and understand phylogenetic trees as representations of evolutionary relationships among organisms. Here are key concepts and skills involved in tree-thinking and interpreting phylogenetic trees:
- Branches: Branches in a phylogenetic tree represent evolutionary lineages. When a tree is drawn with scaled branch lengths, longer branches indicate more evolutionary change and shorter branches less; in a cladogram, branch lengths carry no meaning.
- Nodes: Nodes, or branch points, represent points in evolutionary history where a common ancestor split into two or more lineages. Nodes are important for defining monophyletic groups, or clades, which include an ancestor and all its descendants.
- Root: The root of the tree represents the most recent common ancestor of all the organisms in the tree. The root is often inferred based on outgroup analysis, where an external group is used to root the tree.
- Clades: Clades are monophyletic groups that include an ancestor and all its descendants. Clades are identified based on shared derived traits, or synapomorphies, that are unique to the group.
- Sister Taxa: Sister taxa are two clades that are each other’s closest relatives, sharing a common ancestor not shared with any other taxa in the tree.
- Polytomy: A polytomy is a node in a tree where more than two branches emerge, indicating uncertainty about the exact branching order.
- Homoplasy: Homoplasy refers to similarities between organisms that are not due to shared ancestry (analogous traits). Homoplasy can lead to misleading interpretations of phylogenetic trees and needs to be carefully considered in tree interpretation.
- Character Evolution: Phylogenetic trees can be used to study the evolution of characters (traits) along branches. This can help infer ancestral states and understand the patterns of character evolution within a group.
- Divergence Times: Some phylogenetic trees include information about the estimated divergence times between lineages. This information can provide insights into the timing of evolutionary events.
- Tree Confidence: The confidence in the branching pattern of a tree can be assessed using statistical measures such as bootstrap support values or posterior probabilities. Higher values indicate greater confidence in the inferred relationships.
Tree-thinking is a fundamental skill in evolutionary biology and is essential for interpreting and communicating the complex relationships among organisms depicted in phylogenetic trees.
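The idea of a clade can be made concrete with a tiny rooted tree written as nested tuples. This is a minimal sketch under stated assumptions: the tuple representation and the function names (`leaves`, `clades`, `is_monophyletic`) are invented for the example, and the four-taxon ape tree is used only as a familiar illustration. Every internal node defines one clade, namely the set of all tips descended from it.

```python
# A rooted tree as nested tuples; strings are tips (illustrative example).
TREE = ((("human", "chimp"), "gorilla"), "orangutan")


def leaves(node):
    """Return the set of tip names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Return the tip set of every internal node at or below `node`."""
    if isinstance(node, str):
        return set()
    found = {leaves(node)}
    for child in node:
        found |= clades(child)
    return found


def is_monophyletic(tree, taxa):
    """True if `taxa` is exactly the set of tips under some internal node."""
    return frozenset(taxa) in clades(tree)


print(is_monophyletic(TREE, ["human", "chimp"]))    # True: sister taxa
print(is_monophyletic(TREE, ["human", "gorilla"]))  # False: chimp is excluded
```

A group such as human plus gorilla is not a clade in this tree, because their most recent common ancestor also gave rise to the chimp.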
Methods in Phylogenetic Analysis
Molecular vs. morphological phylogenetics
Molecular phylogenetics and morphological phylogenetics are two approaches used to reconstruct evolutionary relationships among organisms. Here’s a comparison of the two approaches:
- Data Type:
- Molecular Phylogenetics: Uses molecular data, such as DNA sequences, RNA sequences, or protein sequences, to infer evolutionary relationships. This data is often more directly related to the genetic history of organisms.
- Morphological Phylogenetics: Uses morphological data, such as physical characteristics (e.g., anatomy, morphology), to infer evolutionary relationships. This data is based on observable traits of organisms.
- Evolutionary Rate:
- Molecular Phylogenetics: Molecular data can evolve at different rates, depending on the gene or region being studied. Some molecular data evolve rapidly, making them useful for studying recent evolutionary events, while others evolve more slowly, providing insights into deeper evolutionary relationships.
- Morphological Phylogenetics: Morphological traits can also evolve at different rates, and those rates vary widely among characters and lineages. Morphological data are especially important for ancient relationships that involve fossils, for which molecular data are usually unavailable.
- Accuracy:
- Molecular Phylogenetics: Molecular data can provide more accurate estimates of evolutionary relationships, especially at deeper nodes in the tree where morphological data may be more variable or subject to convergent evolution.
- Morphological Phylogenetics: Morphological data can be less accurate, especially for distantly related species or when morphological traits are subject to convergence or homoplasy.
- Data Availability:
- Molecular Phylogenetics: Molecular data are more readily available for a wide range of organisms due to advances in sequencing technologies. This has made molecular phylogenetics a popular choice for studying evolutionary relationships.
- Morphological Phylogenetics: Morphological data are often more limited, especially for extinct or poorly studied species. However, advances in imaging techniques have improved the availability of morphological data.
- Complementary Approaches:
- Combined Approach: In many cases, molecular and morphological data are used together to reconstruct phylogenetic trees. This approach can provide more robust and comprehensive phylogenies, as it combines the strengths of both data types.
In summary, molecular phylogenetics and morphological phylogenetics each have their strengths and limitations. The choice of approach depends on the research question, the availability of data, and the evolutionary scale being studied.
Distance-based methods: Neighbor Joining
Distance-based methods, such as Neighbor Joining (NJ), are used in phylogenetics to construct phylogenetic trees based on genetic distances between sequences. Here’s an overview of the Neighbor Joining method:
- Distance Matrix: The first step in Neighbor Joining is to calculate a pairwise distance matrix that represents the genetic distances between all pairs of sequences in the dataset. The distance can be based on genetic differences, such as nucleotide or amino acid substitutions, or on other measures of dissimilarity.
- Tree Construction:
- Initialization: Start with a star-like tree, where all sequences are connected to a central node.
- Iteration:
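- Compute the net divergence of each node, $r_i = \sum_k d_{ik}$, and find the pair $(i, j)$ that minimizes $Q_{ij} = (n - 2)\,d_{ij} - r_i - r_j$, where $n$ is the current number of nodes. Using $Q$ rather than the smallest raw distance is what keeps Neighbor Joining from being misled by unequal evolutionary rates.
- Join these two sequences to a new internal node $u$ in the tree, creating a bifurcation (branching point).
- Calculate the distances from the new internal node to all other nodes $k$ in the tree using the formula: $d_{uk} = \tfrac{1}{2}\,(d_{ik} + d_{jk} - d_{ij})$.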
- Update the distance matrix by removing the rows and columns corresponding to the joined sequences and adding a new row and column for the new internal node.
- Termination: Repeat the iteration until all sequences are joined, resulting in a fully resolved phylogenetic tree.
- Tree Rooting: The Neighbor Joining method does not provide a rooted tree, so the tree needs to be rooted using an outgroup or by other methods.
- Advantages:
- Neighbor Joining is computationally efficient and can handle large datasets.
- It is robust to violations of the molecular clock assumption (i.e., that genetic change occurs at a constant rate over time).
- Limitations:
- Neighbor Joining can be sensitive to the choice of distance metric and may not always produce the most accurate tree.
- The tree-building step uses only pairwise distances, so the information in individual alignment sites is lost; an evolutionary model enters only through the way the distances are corrected.
Despite its limitations, Neighbor Joining is a widely used method for reconstructing phylogenetic trees, especially for preliminary analyses or when computational resources are limited.
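The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names, the `node0`, `node1`, … labels for internal nodes, and the four-taxon distance matrix are assumptions for the example. The sketch uses the standard Neighbor Joining selection rule, which picks the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of a node's distances to all other nodes, and it returns an unrooted tree as a list of edges with branch lengths.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): distance} into a lookup that works in both orders."""
    d = {}
    for (a, b), value in pairs.items():
        d[a, b] = d[b, a] = value
    return d


def neighbor_joining(names, distances):
    """Return a list of (node, node, branch_length) edges of the NJ tree."""
    d = dict(distances)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r(i): sum of distances from i to every other node.
        r = {i: sum(d[i, k] for k in nodes if k != i) for i in nodes}
        # Pick the pair with the smallest Q value.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        length_i = 0.5 * d[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        length_j = d[i, j] - length_i
        edges.append((i, u, length_i))
        edges.append((j, u, length_j))
        # Distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[u, k] = d[k, u] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges


# Hypothetical additive distances generated from a known tree:
# A and B attach to one internal node (lengths 1 and 2), C and D to another
# (lengths 4 and 5), and the two internal nodes are 3 apart.
D = symmetric({("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
               ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9})
for edge in neighbor_joining(["A", "B", "C", "D"], D):
    print(edge)
```

Because the example distances fit a tree exactly, the recovered branch lengths match the ones used to generate them; with real data the fit is only approximate, and negative branch lengths can occasionally appear.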
Character-based methods: Maximum Parsimony
Before you start: please install the FigTree viewer (http://tree.bio.ed.ac.uk/software/figtree/) on your computer.
The Phylogeny of HIV
In this exercise you will analyze the evolutionary relationships among HIV-related viruses from humans and non-human primates:
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, …) and have been named Simian Immunodeficiency Virus (SIV). HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.
The “Pol” gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 21 different Pol polyprotein sequences from HIV-1, HIV-2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1.
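The data set that follows is in FASTA format: each record begins with a header line starting with `>`, followed by one or more lines of sequence. If you want to inspect such a file programmatically before aligning it, a minimal parser could look like the sketch below. The function name `parse_fasta` and the short toy records are assumptions for illustration; the sketch keeps only the first word of each header as the record name.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


# Toy input (hypothetical records, not taken from the exercise data).
SAMPLE = """>seqA toy record
MKV
LLA
>seqB
MKI
"""

sequences = parse_fasta(SAMPLE)
for name, sequence in sequences.items():
    print(name, len(sequence))
```

Counting the records and checking their lengths this way is a quick sanity check that a file was downloaded or pasted completely.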
>HTLV P03362 POL_HTL1A POL polyprotein (HTLV-I).
GKKAACNLANTGASRPWARTPPKAPRNQPVPFKPERLQALQHLVRKALEAGHIEPYTGPGNNPVFPVKKANG
TWRFIHDLRATNSLTIDLSSSSPGPPDLSSLPTTLAHLQTIDLRDAFFQIPLPKQFQPYFAFTVPQQCNYGP
GTRYAWKVLPQGFKNSPTLFEMQLAHILQPIRQAFPQCTILQYMDDILLASPSHEDLLLLSEATMASLISHG
LPVSENKTQQTPGTIKFLGQIISPNHLTYDAVPTVPIRSRWALPELQALLGEIQWVSKGTPTLRQPLHSLYC
ALQRHTDPRDQIYLNPSQVQSLVQLRQALSQNCRSRLVQTLPLLGAIMLTLTGTTTVVFQSKEQWPLVWLHA
PLPHTSQCPWGQLLASAVLLLDKYTLQSYGLLCQTIHHNISTQTFNQFIQTSDHPSVPILLHHSHRFKNLGA
QTGELWNTFLKTAAPLAPVKALMPVFTLSPVIINTAPCLFSDGSTSRAAYILWDKQILSQRSFPLPPPHKSA
QRAELLGLLHGLSSARSWRCLNIFLDSKYLYHYLRTLALGTFQGRSSQAPFQALLPRLLSRKVVYLHHVRSH
TNLPDPISRLNALTDALLITPVLQLSPAELHSFTHCGQTALTLQGATTTEASNILRSCHACRGGNPQHQMPR
GHIRRGLLPNHIWQGDITHFKYKNTLYRLHVWVDTFSGAISATQKRKETSSEAISSLLQAIAHLGKPSYINT
DNGPAYISQDFLNMCTSLAIRHTTHVPYNPTSSGLVERSNGILKTLLYKYFTDKPDLPMDNALSIALWTINH
LNVLTNCHKTRWQLHHSPRLQPIPETRSLSNKQTHWYYFKLPGLNSRQWKGPQEALQEAAGAALIPVSASSA
QWIPWRLLKRAACPRPVGGPADPKEKDLQHHG
>HIV1B5 P04587 POL polyprotein [Contains: Protease (Retro
FFREDLAFLQGKAREFSSEQTRANSPTISSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQIT
LWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVL
VGPTPVNIIGRNLLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISK
IGPENPYNTPVFAIKKKDSTKWRKLVDFRELNRRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLD
EDFRKYTAFTIPSINNETPGSGYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDL
EIGQHRTKIEELRQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTIQPIVLPEKDSWTVNDIQKLVGKLN
WASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQW
TYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQAT
WIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAASRETKLGKAGYVTNRGRQKVVTLTHTTNQKTELQA
IHLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVS
AGIRKILFLDGIDKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDC
THLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQ
EFGIPYNPQSQGVVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQ
TKELQKQITKIQNFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCV
ASRQDED
>HIV1H2 P04585 POL polyprotein [Contains: Protease (Retro
FFREDLAFLQGKAREFSSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQVTLWQRPLVTIKIG
GQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRN
LLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVF
AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTIP
SINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL
RQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVR
QLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNL
KTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQATWIPEWEFVNTPP
LVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVTNRGRQKVVTLTDTTNQKTELQAIYLALQDSGLEV
NIVTDSQYALGIIQAQPDQSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFLDGI
DKAQDEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAV
HVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTGATVRAACWWAGIKQEFGIPYNPQSQG
VVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ
NFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED
>HIV1MN P05961 POL polyprotein [Contains: Protease (Retro
FFREDLAFLQGKAEFSSEQNRANSPTRRELQVWGRDNNSLSEAGEEAGDDRQGPVSFSFPQITLWQRPIVTI
KIGGQLKEALLDTGADDTVLGEMNLPRRWKPKMIGGIGGFIKVRQYDQITIGICGHKAIGTVLVGPTPVNII
GRNLLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALIEICTEMEKEGKISKIGPENPYNT
PVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAF
TIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRAKI
EELRRHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYAGI
KVKQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEVQKQGQGQWTYQIYQEPF
KNLKTGKYARMRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFRLPIQKETWETWWTEYTXATWIPEWEVVN
TPPLVKLWYQLEKEPIVGAETFYVDGAANRETKKGKAGYVTNRGRQKVVSLTDTTNQKTELQAIHLALQDSG
LEVNIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFL
DGIDKAQEDHEKYHSNWRAMASDFNLPPIVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVIL
VAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGPNFTSTTVKAACWWTGIKQEFGIPYNPQ
SQGVIESMNKELKKIIGQVRDQAEHLKRAVQMAVFIHNFKRKGGIGGYSAGERIVGIIATDIQTKELQKQIT
KIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNNDIKVVPRRKAKVIRDYGKQTAGDDCVASRQDED
>HIV1N5 P12497 POL polyprotein [Contains: Protease (Retro
FFREDLAFPQGKAREFSSEQTRANSPTRRELQVWGRDNNSLSEAGADRQGTVSFSFPQITLWQRPLVTIKIG
GQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVGQYDQILIEICGHKAIGTVLVGPTPVNIIGRN
LLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIGPENPYNTPVF
AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKQKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIP
SINNETPGIRYQYNVLPQGWKGSPAIFQCSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL
RQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVR
QLCKLLRGTKALTEVVPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNL
KTGKYARMKGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWEAWWTEYWQATWIPEWEFVNTPP
LVKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPLTDTTNQKTELQAIHLALQDSGLEV
NIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDGLVSAGIRKVLFLDGI
DKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAV
HVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTVHTDNGSNFTSTTVKAACWWAGIKQEFGIPYNPQSQG
VIESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ
NFRVYYRDSRDPVWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED
>HIV1ND P18802 POL polyprotein [Contains: Protease (Retro
FFREDLAFPQGKAGEFSSEQTRANSPTSRELRVWGGDNPLSETGAERQGTVSFSFPQITLWQRPLVTIKIGG
QLKEALLDTGADDTVLEEINLPGKWKPKMIGGIGGFIKVRQYDQILIEICGYKAMGTVLVGPTPVNIIGRNL
LTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTEICTEMEKEGKISRIGPENPYNTPIFA
IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPS
INNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPEIVIYQYMDDLYVGSDLEIGQHRTKIEELR
EHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPINLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQ
LCKLLRGTKALTEVVPLTEEAELELAENREILKEPVHGVYYDPSKDLIAELQKQGDGQWTYQIYQEPFKNLK
TGKYARTRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWETWWIEYWQATWIPEWEFVNTPPL
VKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPFTDTTNQKTELQAINLALQDSGLEVN
IVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSQGIRKVLFLDGID
KAQEEHEKYHNNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH
VASGYIEAEVIPAETGQETAYFLLKLAGRWPVKVVHTDNGSNFTSATVKAACWWAGIKQEFGIPYNPQSQGV
VESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTRELQKQIIKIQN
FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKVKIIRDYGKQMAGDDCVASRQDED
>HIV1OY P20892 POL polyprotein [Contains: Protease (Retro
FFREDLAFPQGKAREFSSEQTRANSPTSRELRVWGRDNNSPSEAGADRQGTVSFNLPQITLWQRPIVTIKIG
GQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRN
LLTQLGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKVLIEICTEMEKEGKISKVGPENPYNTPVF
AIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIP
SINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEEL
RQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIMLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVK
NLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLVAELQKQGQGQWTYQIYQEPFKNL
KTGKYARMRGAHTNDVKQLTEAVQKITQESIVIWGKTPKFKLPIQKETWEAWWTEYWQATWIPEWEFVNTPP
LVKLWYQLEKDPIVGAETFYVDGAANRETKLGKAGYVTDRGRQKVVSLTDTTNQKTELQAIHLALQDSGLEV
NIVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFLDGI
DKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKIILVAV
HVASGYIEAEVIPAETGQETAYFILKLAGRWPVKTIHTDNGSNFTSTTVKAACWWAGIKQEFGIPYNPQSQG
VVESMNNELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQTKELQKQITKIQ
NFRVYYRDSREPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED
>HIV1PV P03368 POL polyprotein [Contains: Protease (Retro
FFREDLAFLQGKAREFSSEQTRANSPTISSEQTRANSPTRRELQVWGRDNNSPSEAGADRQGTVSFNFPQIT
LWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVL
VGPTPVNIIGRNLLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISK
IGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLD
EDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDL
EIGQHRTKIEELRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLN
WASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQW
TYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQAT
WIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETRLGKAGYLTNKGRQKVVPLTNTTNQKTELQA
IYLALQDSGLEVNIVTDSQYALGIIQAQPDQSESELVNQIIEQLIKKQKVYLAWVPAHKGIGGNEQVDKLVS
AGIRKILFLDGIDKAQDEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDC
THLEGKVILVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQ
EFGIPYNPQSQGVVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDIQ
TKELQKQITKIQNFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCV
ASRQDED
>HIV1U4 P24740 POL polyprotein [Contains: Protease (Retro
FFRENLAFQQGEAREFSSEQTRANSPTSRNLWDGGKDDLPCETGAERQGTDSFSFPQITLWQRPLVTVKIGG
QLIEALLDTGADDTVLEDINLPGKWKPKIIGGIGGFIKVRQYDQILIEICGKKTIGTVLVGPTPVNIIGRNM
LTQIGCTLNFPISPIETVPVKLKPEMDGPKVKQWPLTEEKIKALTEICNEMEKEGKISKIGPENPYNTPVFA
IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHTAGLKKKKSVTVLDVGDAYFSVPLDESFRKYTAFTIPS
INNETPGVRYQYNVLPQGWKGSPSIFQSSMTKILEPFRSQHPDIVIYQYMDDLYVGSDLEIGQHRAKIEELR
AHLLSWGFITPDKKHQKEPPFLWMGYELHPDKWTVQPIQLPEKDSWTVNDIQKLVGKLNWASQIYAGIKVKQ
LCKLLRGAKALTDIVTLTEEAELELAENREILKDPVHGVYYDPSKDLVAEIQKQGQDQWTYQIYQEPFKNLK
TGKYARKRSAHTNDVKQLTEVVQKVSTESIVIWGKIPKFRLPIQKETWEAWWMEYWQATWIPEWEFVNTPPL
VKLWYQLEKDPIAGAETFYVDGAANRETKLGKAGYVTDRGRQKVVSLTETTNQKTELHAIHLALQDSGSEVN
IVTDSQYALGIIQAQPDRSESEIVNQIIEKLIEKEKVYLSWVPAHKGIGGNEQVDKLVSSGIRKVLFLDGID
KAQEDHEKYHCNWRAMASDFNLPPVVAKEIVASCNKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH
VASGYIEAEVIPAETGQETAYFILKLAGRWPVKVIHTDNGSNFTSAAVKAVCWWANIQQEFGIPYNPQSQGV
VESMNKELKKIIGQVREQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTKELQKQISKIQN
FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRDYGKQMAGDDCMAGRQDED
>HIV1Z2 P12499 POL polyprotein [Contains: Protease (Retro
FFREDLAFPQGKAGELSSEQTRANSPTSRELRVWGRDNPLSETGAERQGTVSFNCPQITLWQRPLVTIKIGG
QLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNL
LTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALTEICTEMEKEGKISRVGPENPYNTPIFA
IKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIPS
INNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPEIVIYQYMDDLYVGSDLEIGQHRTKIEELR
EHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQSIKLPEKESWTVNDIQKLVGKLNWASQIYPGIKVRQ
LCKLLRGTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGHGQWTYQIYQEPFKNLK
TGKYARMRGAHTNDVKQLAEVVQKISTESIVIWGKTPKFRLPIQKETWETWWVEYWQATWIPEWEFVNTPPL
VKLWYQLEKEPIIGAETFYVDGAANRETKLGKAGYVTDRGRQKVVPFTDTTNQKTELQAINLALQDSGLEVN
IVTDSQYALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSQGIRKVLFLDGID
KAQEEHEKYHNNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH
VASGYIEAEVIPAETGQETAYFILKLAGRWPVKIVHTDNGSNFTSAAVKAACWWAGIKQEFGIPYNPQSQGV
VESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIIDIIATDIQTKELQKQITKIQN
FRVYYRDSRDPIWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKVKIIRDYGKQMAGDDCVASRQDED
>HIV2CA P24107 POL polyprotein [Contains: Protease (Retro
TGGFFRDWPLGKEAPQFPRGPSSTGANTNSTPIGSSSGSTGEIYAAREKAEGAETETIQRGDRGLTAPRTRR
GPMQGDNRGLAAPQFSLWKRPVVTAHIEGQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK
NVEIEVLGKRVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPIKIMLKPGKDGPRLRQWPLTKEKIEA
LKEICEKMEKEGQLEEAPPTNPYNTPTFAIRKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKRR
ITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLEPFRKANSD
VIIIQYMDDILIASDRTDLEHDKVVLQLKELLNNLGFSTPDEKFQKDPPYRWMGYELWPTKWKLQKIQLPQK
EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENRIILSQEQEGHYYQE
EKELEATVQKDQDNQWTYKIHQEEKILKVGKYAKIKHTHTNGVKLLAQVVQKIGKEALVIGRIPKFHLPVER
EVWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVGDPIPGTETFYTDGSCNRQSKEGKAGYVTDRGRDKVK
ILEQTTNQQAELEAFAMALTDSGPKANIIVDSQYVMGIVAGQPTESENRIVNQIIEEMIKKEAIYVAWVPAH
KGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHTNVKELCHKFDIPQLVARQIVNTCAQYQQKGEAIH
GQVNAEVGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANFTS
QEVKMVAWWVGIEQTFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTVETIVLMAVHCMNFKRRGGIGDMT
PSERLINMITTEQEIQFLQAKNSKLKNFRVYFREGRDQLWKGPGELLWKGDGAVIVKVGTDIKIIPRRKAKI
IRDYGGRQELDSSSHLEGARENGEVA
>HIV2D1 P17757 POL polyprotein [Contains: Protease (Retro
VLELWKGGTLGETVPSTQKTGLLEVWQVRTHHGKLPGKTGRFFRDGPTGKAAPQLPRGPSSSGADTNSTPNR
SSSGPVGEIYAAREKAERAEGETIQGGDGGLTAPRAGRDAPQRGDRGLATPQFSLWKRPVVTAFIEDQPVEV
LLDTGADDSIVAGIELGDNYTPKIVGGIGGFINTKEYKNVEIKVLNKRVRATIMTGDTPINIFGRNILATLG
MSLNLPVAKLDPIKVTLKPGKDGPRLKQWPLTKEKIEALKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKD
KNKWRMLIDFRELNRVTQDFTEIQLGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAE
PEKRYVYKVLPQGWKGSPAIFQFMMRQILEPFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNG
LGFSTPDEKFQKDPPFQWMGYELWPTKWKLQKIQLPQKEIWTVNDIQKLVGVLNWAAQIYPGIKTKHLCKLI
RGKMTLTEEVQWTELAEAELEENKIILSQEQEGSYYQEEEELEATVIKSQDNQWAYKIHQGERVLKVGKYAK
IKNTHTNGVRLLAQVVQKIGKEALVIWGRVPKFHLPVERDTWEQWWDNYWQVTWVPEWDFVSTPPLVRLTFN
LVGDPIPGTETFYTDGSCNRQSKEGKAGYVTDRGRDRVRVLEQTSNQQAELEAFAMALADSGPKVNIIVDSQ
YVMGIVAGQPTESENRIVNQIIEDMIKKEAVYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEH
EKYHSNIKELTHKFGIPQLVARQIVNTCAQCQQKGEAIHGQVNAEIGVWQMDCTHLEGKIIIVAVHVASGFI
EAEVIPQESGRQTALFLLKLASRWPITHLHTDNGPNFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNH
HLKNQISRIREQANTIETIVLMAVHCMNFKRRGGIGDMTPAERLINMITTEQEIQFLQRKNSNFKKFQVYYR
EGRDQLWKGPGELLWKGDGAVIVKVGADIKVVPRRKAKIIRDYGGRQELDSSSHLEGAREDGEVA
>HIV2G1 P18042 POL polyprotein [Contains: Protease (Retro
MWQDRTRHGKMPRKTGRFFRDGSMGKEAPQLPRGPSSSGADTNSTPSRSSSGSIGKIYAAGERAEGAEGETI
QRGDGRLTAPRAGKSTSQRGDRGLAAPQFSLWKRPVVTAYIEVQPVEVLLDTGADDSIVAGIQLGDNYVPKI
VGGIGGFINTKEIKNIEIKVLNKRVRATIMTGDTPINIFGRNILTALGMSLNLPIAKIEPIKVTLKPGKDGP
RLRQWPLTKEKIEALREICEKMEKEGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNRVTQDFTEIQ
LGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHT
MRQVLEPFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPLQWMGYELW
PTKWKLQKLQLPQKEIWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIKGKMTLTEEVQWTELAEAELEENK
IILSQEQEGYYYQEEKELEATIQKNQDNQWTYKIHQEEKILKVGKYAKIKNTHTNGVRLLAQVVQKIGKEAL
VIWGRIPKFHLPVERETWEQWWDNYWQVTWIPEWDFVSTPPLVRLTFNLVGDPIPGAETFYTDGSCNRQSKE
GKARYVTDRGRDKVRVLERTTNQQAELEAFAMTLTDSGPKVNIIVDSQYVMGIVVGQPTESESRIVNQIIED
MIKKEAVYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLERIEPAQEEHEKYHSNMKELTHKFGIPQLVARQI
VNTCAQCQQKGEAIHGQVNAEIGVWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRW
PITHLHTDNGSNFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTIETIVLMAV
HCMNFKRRGGIGDMTPAERLINMITTEQEIQFLQRKNSNFKNFQVYYREGRDQLWKGPGELLWKGDGAVIVK
VGADIKVIPRRKAKIIRDYGGRQELDSSHLEGAREEDGEVA
>HIV2KR Q74120 POL polyprotein [Contains: Protease (Retro
TGWFFRDWPMGKEASQLPRDPSPAGADTNSTPSRPSSRPAREVLAAREEAERAENETIQGGDRGLTAPRTRR
DTTQRGDRGFAAPQFSLWKRPVVTAYVEGQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK
NVEIKVLNKKVKATIMTGDTPINIFGRNILTALGMSLNLPVAKVDPIKVILKPGKDGPKVRQWPLTKEKIEA
LKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQEFTEIQLGIPHPAGLAKKRR
ITVLDIGDAYFSIPLHEDFRQYTAFTLPTVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRQVLEPFRKANPD
VILVQYMDDILIASDRTDLEHDRTVLQLKELLNGLGFSTPDEKFQKDPPYKWMGYELWPTKWKLQKIQLPQK
EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENKIILSQEQEGCYYQE
EKELEATVQKDQDNQWTYKIHQGEKILKVGKYAKIKNTHTNGVRLLAHVVQKIGKEALVIWGRIPKFHLPVE
RETWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVKDPIPGEETFYTDGSCNRQSKEGKAGYITDRGRDKV
RILEQTTNQQAELEAFAMALTDSGPKANIIVDSQYVMGIVAGQPTESESKLVNQIIEEMIKKETLYVAWVPA
HKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGLPKLVARQIVNTCAQCQQKGEAI
HGQVDAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLASRWPITHLHTDNGANFT
SQEVKMVAWWTGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTMETIVLMAVHCMNFKRRGGIGDM
TPAERLINMITTEQEIQFLHAKNSKLKNFRVYFREGRDQLWKGPGELLWKGDGAVIVKVGTDIKIVPRRKAK
IIRDYGGRREVDSSSHLEGTREDGEVA
>HIV2RO P04584 POL polyprotein [Contains: Protease (Retro
TGRFFRTGPLGKEAPQLPRGPSSAGADTNSTPSGSSSGSTGEIYAAREKTERAERETIQGSDRGLTAPRAGG
DTIQGATNRGLAAPQFSLWKRPVVTAYIEGQPVEVLLDTGADDSIVAGIELGNNYSPKIVGGIGGFINTKEY
KNVEIEVLNKKVRATIMTGDTPINIFGRNILTALGMSLNLPVAKVEPIKIMLKPGKDGPKLRQWPLTKEKIE
ALKEICEKMEKEGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKR
RITVLDVGDAYFSIPLHEDFRPYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRQVLEPFRKANK
DVIIIQYMDDILIASDRTDLEHDRVVLQLKELLNGLGFSTPDEKFQKDPPYHWMGYELWPTKWKLQKIQLPQ
KEIWTVNDIQKLVGVLNWAAQLYPGIKTKHLCRLIRGKMTLTEEVQWTELAEAELEENRIILSQEQEGHYYQ
EEKELEATVQKDQENQWTYKIHQEEKILKVGKYAKVKNTHTNGIRLLAQVVQKIGKEALVIWGRIPKFHLPV
EREIWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVGDPIPGAETFYTDGSCNRQSKEGKAGYVTDRGKDK
VKKLEQTTNQQAELEAFAMALTDSGPKVNIIVDSQYVMGISASQPTESESKIVNQIIEEMIKKEAIYVAWVP
AHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGIPNLVARQIVNSCAQCQQKGEA
IHGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANF
TSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTIETIVLMAIHCMNFKRRGGIGD
MTPSERLINMITTEQEIQFLQAKNSKLKDFRVYFREGRDQLWKGPGELLWKGEGAVLVKVGTDIKIIPRRKA
KIIRDYGGRQEMDSGSHLEGAREDGEMA
>HIV2SB P12451 POL polyprotein [Contains: Protease (Retro
TGWFFRAWTMGKEAPQLPRGPKFAGANTNSTPNGSSSGPTGEVHAAREKTERAETKTIQRSDRGLAASRARR
DTTQRDDRGLAAPQFSLWKRPVVTAYIEDQPVEVLLDTGADDSIVAGIELGSNYSPKIVGGIGGFINTKEYK
DVEIRVLNKKVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPVKVTLKPGKDGPKQRQWPLTREKIEA
LREICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKKRR
ITVLDVGDAYFSIPLYEDFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLEPFRKANPD
VIIVQYMDDILIASDRTDLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPYQWMGYELWPTKWKLQKIQLPQK
EVWTVNDIQKLVGVLNWAAQIYPGIKTKHLCKLIRGKMTPTEEVQWTELAEAELEENKIILSQEQEGHYYQE
EKELEATVQKDQDNQWTYKVHQGEKILKVGKYAKIKNTHTNGVRLLAQVVQKIGKEALVIWGRIPKFHLPVE
RETWEQWWDNYWQVTWIPDWDFVSTPPLVRLAFNLVKDPIPGAETFYTDGSCNRQSKEGKAGYITDRGKDKV
RILEQTTNQQAELEAFAMAVTDSGPKVNIVVDSQYVMGIVTGQPAESESRIVNKIIEEMIKKEAIYVAWVPA
HKGIGGNQEIDHLVSQGIRQVLFLERIEPAQEEHGKYHSNVKELAHKFGLPNLVARQIVNTCAQCQQKGEAI
HGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLLKLASRWPITHLHTDNGANFT
SQEVKMVAWWVGIEQSFGVPYNPQSQGVVEAMNHHLKNQIERIREQANTMETIVLMAVHCMNFKRRGGIGDM
TPVERLVNMITTEQEIQFLQAKNSKLKNFRVYFREGRNQLWQGPGELLWKGDGAVIVKVGTDIKVIPRRKAK
IIRDYGPRQEMDSGSHLEGAREDGEMA
>HIV2ST P20876 POL polyprotein [Contains: Protease (Retro
KTRLLEMWQGRTHHGKMPRKTGGFFRVGPMGKEAPQFPCGPNPAGADTNSTPDRPSRGPTREVHAAREKAER
AEREAIQRSDRGLPAARETRDTMQRDDRGLAAPQFSLWKRPVVTAHVEGQPVEVLLDTGADDSIVAGVELGS
NYSPKIVGGIGGFINTKEYKNVEIRVLNKRVRATIMTGDTPINIFGRNILTALGMSLNLPVAKIEPIKIMLK
PGKDGPKLRQWPLTKEKIEALKEICEKMEREGQLEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQ
DFTEIQLGIPHPAGLAKKKRITVLDVGDAYFSIPLHEDFRQYTAFTLPSINNAEPGKRYIYKVSPQGWKGSP
AIFQYTMRQVLEPFRKANPDIILIQYMDDILIASDRTDLEHDRVVLQLKELLNGLGFSTPDEKFQKDPPYQW
MGYELWPTKWKLQRIQLPQKEVWTVNDIQKLVGVLNWAAQIYPGIKTRNLCRLIRGKMTLTEEVQWTELAEA
ELEENKIILSQEQEGCYYQEEKELEATVQKDQDNQWTYKIHQGGKILKVGKYAKVKNTHTNGVRLLAQVVQK
IGKEALVIWGRIPKFHLPVERDTWEQWWDNYWQVTWIPDWDFISTPPLVRLVFNLVKDPILGAETFYTDGSC
NKQSREGKAGYITDRGRDKVRLLEQTTNQQAELEAFAMAVTDSGPKANIIVDSQYVMGIVAGQPTESESKIV
NQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSNVKELSHKFGLPK
LVARQIVNTCTQCQQKGEAIHGQVNAELGTWQMDCTHLEGKIIIVAVHVASGFIEAEVIPQESGRQTALFLL
KLASRWPITHLHTDNGANFTSQEVKMVAWWIGIEQSFGVPYNPQSQGVVEAMNHHLKNQISRIREQANTVET
IVLMAVHCMNFKRRGGIGDMTPAERLINMVTAEQEIQFLQAKNSKLQNFRVYFREGRDQLWKGPGELLWKGD
GAVIVKVGADIKIIPRRKAKIIKDYGGRQEMDSGSNLEGAREDGEVA
>SIVCZ P17283 POL polyprotein [Contains: Protease (Retro
STKKKRLLAVWARGTPNERLHRKTGEFFRERLAFPQREARQLCAEQNRTNGPTDRELWVPGGREEPGEERGR
EQSISTNLPQITLWQRPLIPVKVEGQLCEALLDTGADDTVIERIQLQGLWKPKMIGGIGGFIKVKQFDNVHI
EIEGRKVVGTVLVGPTPVNIIGRNILTQLGCTLVFPISSIETVPVKLKPGMDGPKVKQWPLSAEKIKALTEI
CQEMEKEGKISKIGPENPYNTPIFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVL
DVGDAYFSCPLDKDFRKYTAFTIPSINNETPGVRYQYNVLPQGWKGSPSIFQSSMTKILEPFREKNPDITIY
QYMDDLYVGSDLEIDQHRKKVEELRQHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIQLPEKEVWT
VNDIQKLIGKLNWASQIYPGIKIKQLCKLIRGTKKLTDVVPLTPEAELELAENREIVSTPVHGVYYDPDKEL
IAEIQKQGNCQWTYQIFQEPHKNLKTGKYARQRSAHTNDIRQLAEAVQKIATESIVIWGKTPKFRLPVQKES
WEAWWAEYWQATWIPEWEFINTPPLVKLWYSLETEPIPTTDTYYVDGAANRETKTGKAGYVTDKGKQKIISL
ENTTNQQAELKALLLALQDSDQQVNIVTDSQYVLGIIQSQPDHSESELVNQIIEELIKKEKIYLSWVPAHKG
IGGNEQVDKLVSAGIRKVLFLDGIDRAQEEHERYHSNWKAMASDFNLPPIVAKEIVAHCDKCQVKGEAMHGQ
VDCSPGIWQVDCTHLEGKVIIVAVHVASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGPNFTSAA
VKAACWWADIKQEFGIPYNPQSQGVVESLNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYTAG
ERIIDIIATDIQTSELQKQILKVQKFRVYYRDSRDPIWKGPATLLWKGEGAVVIQDQGELKVVPRRKAKIIR
DYGKQMAGDDCVASRQNED
>Smanga_S4 P12502 POL polyprotein [Contains: Protease (Retro
KTGGFFRAWPMGKEAPQFPHGPDASGADTNCSPRGSSCGSTEELHEDGQKAEGEQRETLQGGDRGFAAPQFS
LWRRPVVTAYIEEQPVEVLLDTGADDSIVAGIELGPNYTPKIVGGIGGFINTKEYKDVKIKVLGKVIKGTIM
TGDTPINIFGRNLLTAMGMSLNLPIAKVEPIKVTLKPGKEGPKLRQWPLSKEKIIALREICEKMEKDGQLEE
APPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKRRRITVLDVGDAYFSIPLD
EEFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQYTMRNVLEPFRKANPDVTLIQYMDDILIASDR
TDLEHDRVVLQLKELLNGIGFSTPEEKFQKDPPFQWMGYELWPTKWKLQKIELPQRETWTVNDIQKLVGVLN
WAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTEMAEAEYEENKIILSQEQEGCYYQEGKPIEATVIKSQDNQW
SYKIHQEDKVLKVGKFAKVKNTHTNGVRLLAHVVQKIGKEALVIWGEVPKFHLPVEREIWEQWWTDYWQVTW
IPDWDFVSTPPLVRLVFNLVKEPIQGAETFYVDGSCNRQSREGKAGYVTDRGRDKAKLLEQTTNQQAELEAF
YLALADSGPKANIIVDSQYVMGIIAGQPTESESRLVNQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHLVSQ
GIRQVLFLKKIEPAQEEHEKYHSNVKELVFKFGLPRLVAKQIVDTCDKCHQKGEAIHGQVNAELGTWQMDCT
HLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLAGRWPITHLHTDNGANFTSQEVKMVAWWAGIEQT
FGVPYNPQSQGVVEAMNHHLKTQIDRIREQANSIETIVLMAVHCMNFKRRGGIGDMTPAERLVNMITTEQEI
QFQQSKNSKFKNFRVYYREGRDQLWKGPGELLWKGEGAVILKVGTEIKVVPRRKAKIIKDYGGGKELDSGSH
LEDTGEAREVA
>Smanga_SP P19505 POL polyprotein [Contains: Protease (Retro
MPRKTSGFFRAWPMGKEAPQFPHGPDASGADTNCSPRGSSCGSTEELHEDGQKAEGEQRETLQGGNGGFAAP
QFSLWRRPIVTAYIEEQPVEVLLDTGADDSIVAGIELGPNYTPKIVGGIGGFINTKEYKDVKIKVLGKVIKG
TIMTGDTPINIFGRNLLTAMGMSLNLPIAKVEPIKVTLKPGKDGPKLRQWPLSKEKIIALREICEKMEKDGQ
LEEAPPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEVQLGIPHPAGLAKRRRITVLDVGDAYFSI
PLDEEFRQYTAFTLPSVNNAEPGKRYIYKVLPQGWKGSPAIFQHTMRNVLEPFRKANPDVTLIQYMDDILIA
SDRTDLEHDRVVLQLKELLNSIGFSTPEEKFQKDPPFQWMGYELWPTKWKLQKIELPQRETWTVNDIQKLVG
VLNWAAQIYPGIKTKHLCRLIRGKMTLTEEVQWTEMAEAEYEENKIILSQEQEGCYYQEGKPLEATVIKSQD
NQWSYKIHQEDKILKVGKFAKIKNTHTNGVRLLAHVVQKIGKEAIVIWGQVPRFHLPVEREIWEQWWTDYWQ
VTWIPEWDFVSTPPLVRLVFNLVKEPIQGAETFYVDGSCNRQSREGKAGYVTDRGRDKAKLLEQTTNQQAEL
EAFYLALADSGPKANIIVDSQYVMGIVAGQPTESESRLVNQIIEEMIKKEAIYVAWVPAHKGIGGNQEVDHL
VSQGIRQVLFLEKIEPAQEEHEKYHSNVKELVFKFGLPRLVAKQIVDTCDKCHQKGEAIHGQVNAELGTWQM
DCTHLEGKIIIVAVHVASGFIEAEVIPQETGRQTALFLLKLASRWPITHLHTDNGANFTSQEVKMVAWWAGI
EQTFGVPYNPQSQGVVEAMNHHLKTQIDRIREQANSIETIVLMAVHCMNFKRRGGIGDMTPAERLVNMITTE
QEIQFQQSKNSKFKNFRVYYREGRDQLWKGPGELLWKGEGAVILKVGTEIKVVPRRKAKIIKDYGGGKELDS
GSHLEDTGEAREVA
Step 1
Align the Pol sequences using the MAFFT server (http://www.ebi.ac.uk/Tools/msa/mafft/) at EBI with default settings, and set the output format to “Pearson/FASTA”.
Once the alignment is done, save it as a FASTA file: right-click the “Download alignment file” button on the MAFFT output page, then save the file using “Save linked file as” (or the equivalent in your browser). Make sure you can find the file again!
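Before moving on, it can be useful to confirm that the saved file really is an alignment, meaning that every sequence has the same length once gaps are counted. The following is a minimal sketch of such a check; the function names and the three short demo records are made up for illustration and are not part of the exercise data.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records):
    """True if all sequences have the same length (gap characters included)."""
    lengths = {len(seq) for seq in records.values()}
    return len(lengths) == 1


# Made-up demo alignment, not taken from the exercise.
example = """>seqA demo
MKV-LT
>seqB demo
MKVALT
>seqC demo
MR--LT
"""
records = read_fasta(example)
print(sorted(records), is_aligned(records))
```

To check your own file, read its contents with open(path).read() and pass the text to read_fasta.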
Step 2
Open the TreeHugger web server (https://services.healthtech.dtu.dk/service.php?TreeHugger). The TreeHugger server constructs a neighbor-joining tree from a set of aligned sequences.
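To give an idea of what a neighbor-joining method does with a distance matrix, here is a compact sketch of the textbook algorithm. It is not TreeHugger's implementation (the source does not describe its internals), and the five taxa and their distances are invented for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted neighbor-joining tree and return it as a Newick string.

    names: list of at least three taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    nodes = list(names)
    d = dict(dist)

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        total = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) * d(a, b) - total[a] - total[b].
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(*p) - total[p[0]] - total[p[1]],
        )
        dab = get(a, b)
        len_a = dab / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = (get(a, c) + get(b, c) - dab) / 2
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Resolve the last three nodes as a star around one internal node.
    a, b, c = nodes
    len_a = (get(a, b) + get(a, c) - get(b, c)) / 2
    len_b = (get(a, b) + get(b, c) - get(a, c)) / 2
    len_c = (get(a, c) + get(b, c) - get(a, b)) / 2
    return f"({a}:{len_a:.3f},{b}:{len_b:.3f},{c}:{len_c:.3f});"


# Invented pairwise distances between five hypothetical taxa.
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
distances = {frozenset(pair): value for pair, value in raw.items()}
print(neighbor_joining(["a", "b", "c", "d", "e"], distances))
```

The result is an unrooted tree written with a three-way split at the top level, which is why a separate rooting step is still needed afterwards.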
Step 3
Select the option to upload a file (see figure below), choose the Pol protein alignment file you just saved on your hard disk, and click “Submit Query” to construct the neighbor-joining tree:
Step 4
When the run is done, right-click the “Download data in Newick/Phylip format” link to save the tree as a text file on your hard disk (again, make sure you can find it later). You will notice that the tree file is in the parenthesis-based format discussed previously in the lecture.
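The parenthesis-based format can be read with a few lines of code. The sketch below parses a Newick string into nested tuples of tip names and drops branch lengths and internal labels; it assumes a well-formed string without quoted names, and the example string is made up rather than copied from TreeHugger output.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples of tip names.

    Branch lengths and internal node labels are discarded.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        # Keep only the name part in front of an optional ":length".
        return text[start:pos].split(":")[0].strip()

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            read_label()  # consume label/length of the internal node
            return tuple(children)
        return read_label()

    return parse_node()


def tip_names(tree):
    """List all tip names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tip_names(child)]


# Made-up example tree with three tips.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
print(tree, tip_names(tree))
```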
Step 5
Open the FigTree treeviewer that you have previously installed on your own computer and use File->Open to open the treefile you just saved.
Step 6
The view you see first is presumably a rooted view similar to the one below. However, the tree has not been explicitly rooted yet, so the position of the root in this view is arbitrary. A more realistic view can be seen by clicking the unrooted view button (see figures below):
Step 7
The last figure above shows the unrooted tree. For now, however, go back to the (pseudo)rooted view you started out with. We will now place the root by using the HTLV Pol sequence as a so-called outgroup. Click the branch leading to the HTLV sequence so that it is selected (see figure below). Then click the “Reroot” button, which roots the tree on the selected outgroup:
The rationale for using an outgroup to place the root of the tree is as follows: our data set consists of sequences from HIV-1, HIV-2, SIV and HTLV. We know from other evidence that the lineage leading to HTLV branched off before any of the remaining viruses diverged from each other. The root of the tree connecting the viruses investigated here must therefore be located between the HTLV sequence (the “outgroup”) and the rest (the “ingroup”). This way of finding a root is called “outgroup rooting”.
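Outgroup rooting can also be expressed as a small graph operation: an unrooted tree has no direction, and rooting it on the branch to the outgroup simply decides which way is "down". The sketch below assumes the unrooted tree is given as an adjacency dictionary; the node labels (A to D, Out, and the internal nodes x, y, z) are hypothetical and do not represent the HIV data.

```python
def root_on_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to one outgroup tip.

    adjacency: dict mapping each node to a list of its neighbours
    Returns nested tuples of the form (outgroup, ingroup_subtree).
    """
    (attach,) = adjacency[outgroup]  # a tip has exactly one neighbour

    def build(node, parent):
        # Everything except the node we came from lies below this node.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node
        return tuple(build(child, node) for child in children)

    return (outgroup, build(attach, outgroup))


# Hypothetical unrooted tree: x joins A and B, y joins C and D,
# and z connects x, y and the outgroup tip.
adjacency = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"], "Out": ["z"],
    "x": ["A", "B", "z"], "y": ["C", "D", "z"], "z": ["x", "y", "Out"],
}
print(root_on_outgroup(adjacency, "Out"))
```

Choosing a different tip as the outgroup changes the nesting but never the underlying unrooted topology, which is the point made in Step 6.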
Step 8
Inspect the rooted tree that you get as a result of rerooting and consider what this tells you about the origin of HIV viruses.
Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences? The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?
With these groupings in consideration, what can you say about the origin of the two HIV viruses?
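The terms "clade" and "sister group" used above have a precise meaning that is easy to state in code. The sketch below works on a rooted tree written as nested tuples; the tip labels are placeholders and deliberately do not reveal the answer to the HIV questions.

```python
def leaves(tree):
    """Set of tip names below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def sister_group(tree, clade):
    """Return the tips of the sister group of `clade`.

    Returns None if `clade` is not a clade of `tree`.
    """
    clade = set(clade)
    if isinstance(tree, str):
        return None
    for i, child in enumerate(tree):
        if leaves(child) == clade:
            # The sister group is everything else under the same parent.
            siblings = tree[:i] + tree[i + 1:]
            return set().union(*(leaves(s) for s in siblings))
    for child in tree:
        if clade <= leaves(child):
            return sister_group(child, clade)
    return None


# Placeholder rooted tree, not the exercise result.
tree = ("Out", (("A1", "A2"), ("B", ("C1", "C2"))))
print(sister_group(tree, {"C1", "C2"}))
print(sister_group(tree, {"A1", "A2"}))
print(sister_group(tree, {"A1", "B"}))
```

The last call returns None because A1 and B do not form a clade in this tree: the smallest group containing both also contains other tips.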
Now you can save your tree as a picture by choosing File -> Export Graphics. Choose a suitable location and file format (e.g. .png) and hand it in along with your answers.
Time to try on your own!
For the next part of the exercise, the task is to create a rooted phylogenetic tree from a dataset of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequence, but lack the first 90 nucleotides or so.) The sequences can be found via the following link:
Step 9
Your answers should include the following:
How did you construct the tree? (alignment method, construction of tree, outgroup etc.).
A picture of the tree. Note: It is easy to increase the font size of the sequence names in FigTree. Just click the small arrow next to Tip Labels and enter a new font size.
A comparison of your tree with NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy). Are there any taxa that are not placed correctly on your tree?
Mitochondrial versus nuclear proteins
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion’s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use UniProt (http://www.uniprot.org/) to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyse the phylogeny of the dataset.
Step 10
- Find all proteins named “ribosomal protein L3” from as many eukaryotes (Eukaryota) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).
- How many of these have a Subcellular location of “mitochondrion” and “cytoplasm”, respectively? Download the results of these two searches in FASTA format.
- Now combine the two data sets from the previous question into one FASTA file (using jEdit or another plain text editor). Note that their names start with “RL3” (cytoplasmic) or “RM03”/“RK3” (mitochondrial), which is very convenient for telling the difference between them. If you have any names that do not begin with “RL3”, “RK3” or “RM03”, revisit your UniProt search criteria! Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).
- Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). Describe all the steps you did to make it.
- Visualize the tree using FigTree. Reroot the tree so that the cytoplasmic and the mitochondrial sequences are in two monophyletic groups (if possible). Include a picture of the rerooted tree in your answer.
- Consider your rerooted tree. Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?
- Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?
- Where has evolution been faster (where are there most mutations per time unit) – among the cytoplasmic or the mitochondrial proteins?
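Combining the two downloads and checking the name prefixes, as asked in the third bullet above, can also be done programmatically instead of in a text editor. The sketch below assumes UniProt-style headers of the form ">sp|ACCESSION|ENTRYNAME description"; the accessions, entry names and sequences in the demo are invented for illustration.

```python
def entry_name(header):
    """Extract the entry name (e.g. RL3_XXXX) from a '>sp|ACC|NAME ...' header."""
    return header[1:].split()[0].split("|")[-1]


def classify(name):
    """Classify a record by the name prefixes mentioned in the exercise."""
    if name.startswith("RL3"):
        return "cytoplasmic"
    if name.startswith(("RM03", "RK3")):
        return "mitochondrial"
    return "unexpected"


def combine_fasta(*texts):
    """Concatenate several FASTA texts into one."""
    return "\n".join(text.strip() for text in texts) + "\n"


def classify_records(fasta_text):
    """Map every entry name in a FASTA text to its class."""
    headers = [line for line in fasta_text.splitlines() if line.startswith(">")]
    return {entry_name(h): classify(entry_name(h)) for h in headers}


# Invented demo records standing in for the two UniProt downloads.
cytoplasmic = ">sp|X00001|RL3_DEMO1 made-up record\nMSHRKF\n"
mitochondrial = ">sp|X00002|RM03_DEMO1 made-up record\nMLRLST\n"
combined = combine_fasta(cytoplasmic, mitochondrial)
print(classify_records(combined))
```

Any record reported as "unexpected" is a signal to revisit the search criteria, as the exercise text advises.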
Exercise: Phylogeny-Answers
Answers to the Phylogeny exercise
Step 8
Answers to “The Phylogeny of HIV” can be found here (https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html).
Step 9
How did you construct the tree? (alignment method, construction of tree, outgroup etc.)
To start, you need to make a multiple alignment of your sequences. A number of different alignment methods can be used (e.g. MAFFT or RevTrans). Here you can see an example of a MAFFT alignment.
>Yeast
TACACTTTCTT---AGCTCGTCGTACTGATGCTCCATTCAA CAAGGTTGTCTTG
AAGGCTTTGTTCTTGTCTAAGATCAACAGACCACCTGTTTCTGTCTCTAGAATTGCTAGA GCTTTGAAGCAAGAAGGTGC TGCTAACAAGACTGTTGTCGTT
GTTGGTACTGTTACTGACGATGCCAGAATCTTTGAATTCCCAAAGACCACTGTTGCTGCT TTGAGATTCACTGCTGGTGCCAGAGCCAAGATTGTTAAGGCTGGTGGTGAATGTATCACT TTGGATCAATTAGCTGTCAGAGCTCCAAAGGGTCAAAACACTTTGATCTTGAGAGGTCCA AGAAACTCCAGAGAAGCTGTCAGACACTTCGGTATGGGTCC ACACAAGGGT
AAGGCTCCAAGAATCTTGTCCACCGGTAGAAAGTTCGAAAGAGCTAGAGGTAGAAGAAGA TCTAAGGGTTTCAAGGTG
>African_frog
TATCGATTCTT---GGCTCGTCGTACCAACTCCAGTTTCAA CCGGGTGGTTCTG
AAGCGTCTGTTCATGAGCCGAACCAACAGGCCACCCCTCTCTATGTCCCGTCTTATTCGC AAAATGAAATTGCAAGGACG TGAAAACAAGACTGCAGTGGTT
GTGGGCTGTATCACAGATGATGTCAGGATCCATGATATCCCCAAACTGAAGGTGTGCGCA CTTAAAATAACCAGCGGAGCACGTAGCCGAATCCTGAAGTCTGGAGGTCAGATTATGACG TTTGATCAGCTCGCCCTTGCGGCCCCTAAAGGCCAGAACACTGTTCTTCTTTCAGGACCT CGTAAGGCCCGTGAAGTATACAGACACTTTGGGAAGGCACCTGGTACTCCACACAGTCGC ACTAAGCCTTATGTGCTCTCCAAGGGTAGAAAGTTTGAGCGCGCCAGAGGACGCAGAGCC AGCAGAGGATACAAGAAC
>Pig
TACAGGTTTCT---GGCCAGACGAACCAACTCCACCTTCAA TCAAGTTGTGCTG
AAGAGGTTGTTCATGAGTCGCACCAACCGGCCACCCCTGTCGCTTTCCCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAGGCAAGACCGCTGTGGTC
GTAGGGACTATAACCGATGACGTGCGTGTCCAGGAGGTGCCCAAATTGAAGGTGTGCGCT CTGCGCGTGAGCAGCCGTGCCCGGAGCCGCATTCTCAAGGCCGGGGGCAAAATCCTCACC TTCGACCAGTTGGCCCTGGACTCCCCCAAAGGCTGTGGCACTGTCCTCCTCTCTGGGCCT CGCAAGGGCCGCGAGGTGTACAGGCATTTCGGCAAGGCCCCAGGGACCCCGCACAGCCAC ACCAAACCCTATGTTCGCTCCAAGGGCCGGAAGTTCGAGCGCGCCAGAGGCCGACGTGCC AGCCGCGGCTACAAAAAC
>Fin_whale
TACAGGTTTCT---GGCCAGGCGAACCAACTCCACCTTCAA TCAAGTTGTGCTG
AAGAGGTTGTTCATGAGTCGCACCAACCGGCCACCTCTGTCCCTTTCCCGGATGATTCGG AAGATGAAGCTTCCCGGCCG GGAAGGCAAAACGGCCGTGGTG
GTGGGGACAGTGACTGATGACGTGCGAGTCCAGGAGGTGCCCAAGCTGAAGGTGTGTGCT CTCCGGGTGAGCAGCCGCGCCCGGAGCCGCATCCTCAAGGCCGGGGGCAAGATCCTCACC TTCGACCAGCTGGCCCTGGACTCCCCCAAGGGCTGTGGCACCGTGCTCCTGTCTGGTCCT CGCAAGGGCCGAGAGGTGTACAGGCATTTCGGCAAGGCCCCAGGAACCCCGCATAGCCAC ACCAAACCCTATGTACGCTCCAAGGGCCGGAAGTTCGAGCGCGCCAGAGGCCGACGGGCC AGCCGTGGCTACAAA---
>Human
TACAGGTTTCT---GGCCAGAAGAACCAACTCCACATTCAA CCAGGTTGTGTTG
AAGAGGTTGTTTATGAGTCGCACCAACCGGCCGCCTCTGTCCCTTTCCCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAAACAAGACGGCCGTGGTT
GTGGGGACCATAACTGATGATGTGCGGGTTCAGGAGGTACCCAAACTGAAGGTATGTGCA CTGCGCGTGACCAGCCGGGCCCGCAGCCGCATCCTCAGGGCAGGGGGCAAGATCCTCACT TTCGACCAGCTGGCCCTGGACTCCCCTAAGGGCTGTGGCACTGTCCTGCTCTCCGGTCCT CGCAAGGGCCGAGAGGTGTACCGGCATTTCGGCAAGGCCCCAGGAACCCCGCACAGCCAC ACCAAACCCTACGTCCGCTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGACGGGCC AGCCGAGGCTACAAAAAC
>Monkey_macaque
TACAGGTTTCT---GGCCAGAAGAACCAATTCCACATTCAA CCAGGTTGTGCTG
AAGAGGTTGTTTATGAGTCGCACCAACCGGCCTCCTCTGTCCCTTTCTCGGATGATCCGG AAGATGAAGCTTCCTGGCCG GGAAAACAAAACGGCCGTGGTT
GTGGGGACCATAACGGACGACGTGCGGGTTCAGGAGGTGCCCAAACTGAAGGTATGTGCA CTGCGCGTAACCAGCCGGGCCCGCAGCCGCATCCTCAGGGCAGGGGGCAAGATCCTCACT TTCGACCAGCTGGCCCTGGACTCCCCCAAGGGCTGCGGCACTGTTCTGCTCTCCGGTCCT CGCAAGGGCCGAGAGGTGTACCGGCATTTCGGCAAGGCCCCAGGAACCCCGCACAGCCAC ACCAAACCCTACGTCCGCTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGACGGGCC AGTCGAGGCTACAAAAAC
>Rat
TACAGGTTTCT---GGCCAGACGGACCAACTCCACCTTCAA CCAGGTTGTGCTG
AAAAGGTTATTTATGAGCCGAACTAACCGGCCACCTCTGTCCCTGTCCCGAATGATCCGG AAGATGAAGCTTCCTGGTCG GGAGAACAAAACTGCTGTGGTT
GTGGGGACGATCACAGATGATGTGCGGATTCTGGAAGTGCCCAAGCTGAAGGTGTGTGCA CTGAGGGTGAGCAGCCGGGCCCGAAGTCGGATCCTCAAGGCTGGGGGTAAGATCCTGACC TTCGACCAGCTGGCCCTGGAGTCTCCCAAGGGCAGGGGCACTGTGCTCTTGTCTGGTCCT CGGAAGGGCCGAGAGGTGTACCGACACTTTGGCAAGGCCCCAGGAACTCCACACAGCCAC ACCAAACCCTATGTCCGTTCCAAGGGCCGGAAGTTCGAGCGTGCCAGAGGCCGAAGGGCC AGCCGAGGCTACAAAAAC
>Mouse
TACAGGTTTCT---GGCCAGACGGACCAACTCCACCTTCAA TCAGGTTGTGCTG
AAGAGGTTGTTCATGAGCCGAACCAACCGGCCACCTCTGTCCCTGTCCCGCATGATCCGA AAGATGAAGCTTCCTGGCCG CGAGAACAAGACTGCCGTGGTT
GTGGGGACGGTCACAGATGATGTGCGGATTCTGGAAGTTCCCAAGCTGAAGGTGTGTGCA CTGCGGGTGAGCAGCCGGGCCCGGAGTCGCATCCTCAAGGCTGGGGGTAAGATCCTCACC TTTGACCAGCTGGCCCTGGAGTCTCCCAAGGGCCGGGGCACTGTGCTCCTGTCTGGTCCT CGGAAGGGCCGAGAGGTGTACCGACATTTTGGCAAGGCCCCAGGAACCCCACACAGCCAT
ATGGCG
ACCAAACCCTATGTCCGTTCCAAGGGCCGGAAGTTTGAGCGCGCCAGAGGCCGAAGGGCC
AGCAGAGGCTACAAAAAC
>Salmon
TATCGTTTACCTGGAAGCAAATGCTCCACTGCTCCCTTCAA CAAGGTGGTCCTC
AGGAGGCTCTTCATGAGCAGGACCCACAGGCCTCCGATGTCAGTGTCCCGCATGATCCGT
AAGATGAAATTGCCTGGACG TGAGAACAGAACCGCAGTTGTC
GTGGGAACCGTCACTGATGATGTCAGAATTCATGAAATCCCTAATCTGAAGGTCTCGGCA
CTTAAAATAACCAGGCGAAATCGGACGCGAATTCTGAAGTTTGTG---CAGATTATGAGG
TTCGTTGGGCTCGCACTTGCTGCTCCTAATCGGCAGAAGAGTGTTCTTCTTTCCGCCCCC
CGTAACGCGCGTGATGTATCCAGGCACTTTGCCAACGCCCCCAGTATTCC TCAC
ACTAAGCCTTACGTGCTTTCCAA------CAAGTTACGGCG---CAGAGGCAGCAAGCTC
ACT TACAACAAC
>Fruit_fly
TACCGCTTCCT---TCAGCGCCGCACCAACAAGAAGTTCAA CCGCATCATCCTG
AAGCGTTTGTTCATGAGCAAGATCAACAGGCCGCCGCTATCGCTTCAGCGCATCGCTCGC
TTCTTCAAGGCCGCCAACCA GCCGGAGTCTACCATCGTGGTC
GTCGGCACCGTCACCGACGATGCCCGCCTCCTGGTGGTGCCCAAGCTCACCGTGTGCGCC
CTGCACGTCACGCAGACCGCCAGGGAGCGCATCCTGAAGGCCGGCGGTGAGGTCCTGACC
TTCGATCAACTGGCTCTCCGATCGCCCACCGGCAAGAACACGCTGCTGCTGCAGGGCAGG
CGTACCGCCCGCACCGCCTGCAAGCACTTCGGCAAGGCTCCCGGTGTGCCCCACTCGCAC
ACCCGCCCCTATGTCCGCTCTAAGGGACGCAAGTTCGAGCGTGCTCGTGGTCGTCGCTCC
AGCTGCGGCTACAAGAAG
>Arabidopsis
TACCGGTTTCT---GGTAAGGAGAACTAATAGCAAGTTCAA TGGTGTGATATTG
AAGAGGCTTTTCATGAGCAAAGTCAACAAAGCTCCTCTTTCTCTATCTAGGCTTGTGGAG
TTCATGACTGGCAA GGAAGATAAGATTGCCGTCTTG
GTTGGAACTATAACTGATGATTTGAGGGTACACGAGATTCCAGCCATGAAAGTGACTGCC
TTGAGGTTCACAGAGAGAGCAAGGGCTCGCATTGAGAAAGCTGGAGGTGAATGCTTAACC
TTTGACCAGCTCGCTCTCAGAGCTCCATTGGGCCAGAACACGGTTCTTCTTAGAGGACCT
AAGAATTCACGTGAAGCAGTGAAGCATTTCGGACCTGCTCCTGGTGTGCCACACAGTCAC
TCCAAGCCATATGTTCGGGCCAAGGGAAGGAAGTTCGAGAAGGCCAGAGGAAAGAGGAAG
AGTCGTGGATTCAAGGTT
>Soy
TATCGCTTCCT---TGTTCGGAGAACTGGCAGCAACTTCAA TGCTGTTATACTT
AAGAGATTGTTCATGAGCAAGGTTAACAAACCCCCATTGTCTTTGTCAAGGTTGATTAAG
TATACGAAGGGGAA GGAAGATAAGATTGCAGTGGTG
GTGGGGTCTATAACCGATGATATTCGTGTTTATGAAGTTCCACCATTGAAAGTTACAGCA
CTCAGGTTTACAGAGACTGCCCGTGCAAGAATTGAGAAGGCAGGCGGTGAATGCTTGACG
TTTGATCAGTTGGCTCTCAGGGCTCCTCTGGGACAGAACACGGTCCTTCTTAGAGGCCCA
AAGAATGCTCGCGAAGCTGTGAAGCACTTTGGTCCTGCTCCTGGTGTCCCTCACAGCCAC
ACCAAGCCTTATGTTCGAGCAAAGGGAAGGAAGTTTGAGAGGGCTAGAGGAAGGAGGAAC
AGCCGAGGATTTAGGGTT
>Rice
TACCGCTTCCT---GGTGCGGAGGACCAAGAGCCACTTCAA CGCCGTGATCCTG
AAGCGGCTCTTCATGAGCAAGACCAACCGCCCGCCGCTCTCGATGCGCCGTCTCGTCAGG
TTCATGGAGGGGAAGGTACCTGATCGCCATGCCATTTCGGGGGACCAGATCGCCGTGATC
GTGGGCACCGTCACAGATGACAAGAGGATCTATGAGGTGCCGGCGATGAAGGTGGCTGCT
CTCAGGTTCACCGAGACCGCGAGAGCACGGATCATCAATGCCGGTGGCGAGTGCCTCACG
TTCGACCAGCTCGCTCTCCGCGCCCCGCTTGGCCAGAACACGGTCCTCCTGAGGGGTCCC
AAGAACGCTAGGGAAGCTGTTAAGCACTTTGGCCCTGCTCCAGGAGTTCCCCACAGCAAC
ACTAAGCCATATGTTCGCTCAAAGGGAAGGAAATTTGAGAAGGCAAGAGGAAGAAGGAAC
AGCAAGGGCTTCAAGGTA
>Methanocaldococcus_jannaschii
ATTGAGATATTAAAGCAGGAAAGTTATAAAAATCAGGCAAAGATTTGGAAGGATATTGCA
AGAAGGTTAGCAAAACCAAGAAGAAGGAGAGCAGAGGTAAATTTAAGTAAGATAAACAGA
TACACAAA AGAAGGAGATGTTGTTTTAGTT
CCTGGTAAAGTTTTAGGAGCTGGGAAGTT AGAGCACAAGGTTGTCGTTGCTGCA
TTTGCATTCTCAGAAACAGCTAAAAAATTAATTAAAGAAGCTGGAGGAGAAGCAATAACA
ATTGAAGAGCTAATAAAAAGAAATCCAAAAGGTTCAAATGTTAAAATT------------
>Pyrococcus
ATTCGTTACCTCAGGAAAAAGTCTAATGAAGAGAAAGTTAAGATATGGAAGGACATAGCT
TGGAGACTTGAAAGACCAAGGAGGCAGAGGGCCGAAGTAAACGTCAGCAGGATAAACAGG
TACGCGAA GGATGGAGACATGATAGTGGTT
CCAGGGAGCGTTCTTGGGGCCGGCAAGAT AGAGAAGAAGGTCATTGTAGCTGCT
TGGAAGTTCAGTGAAACTGCAAGGAGAAAAATCGAGGAGGCCGGTGGGGAGGCCATAACG
ATTGAAGAGCTAATTAAGAGGAATCCAAAGGGAAGTGGAGTAATAATT------------
ATGGAG
Next you need to construct the tree. This can be done using the TreeHugger tool (http://www.cbs.dtu.dk/services/TreeHugger/web/), where you simply paste in your multiple alignment.
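A distance-based method such as neighbor joining starts from pairwise distances computed from the alignment. One of the simplest measures is the p-distance: the fraction of aligned positions at which two sequences differ. The sketch below is only an illustration of that idea; the source does not say which distance measure TreeHugger actually uses, and the short sequences are invented.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of aligned sequences.

    Returns a dict keyed by (name_a, name_b) for every unordered pair.
    """
    names = list(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented toy alignment with one gapped region.
alignment = {
    "seq1": "TACAGG---TTT",
    "seq2": "TATCGGAAATTC",
    "seq3": "TACAGGAAATTT",
}
for pair, value in distance_matrix(alignment).items():
    print(pair, round(value, 3))
```

Ignoring gapped columns pair by pair is one common convention; other tools remove every column that contains a gap in any sequence, which can give slightly different distances.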
A picture of the tree
Now the tree is ready to be opened in FigTree. All the sequences except two are from eukaryotes. The last two (Pyrococcus and Methanocaldococcus jannaschii) are both archaea, and we therefore choose those two as our outgroup. (Notice that you can easily choose more than one sequence as outgroup: just select the branch that connects both organisms to the rest of the tree and press “Reroot”.)
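Rooting on a branch that leads to several outgroup sequences only works when those sequences group together in the tree. The sketch below checks that condition on a tree written as nested tuples; the tip labels are placeholders, and the topology is a toy example rather than the result of the exercise.

```python
def leaves(tree):
    """Set of tip names below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, taxa):
    """True if some node of `tree` has exactly `taxa` as its tips."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Toy tree with two placeholder archaea and three placeholder eukaryotes.
tree = ((("Euk1", "Euk2"), "Euk3"), ("Arch1", "Arch2"))
print(is_monophyletic(tree, {"Arch1", "Arch2"}))
print(is_monophyletic(tree, {"Euk2", "Euk3"}))
```

If the check fails for the intended outgroup, there is no single branch separating it from the ingroup, and the choice of outgroup (or the tree itself) deserves a second look.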
A comparison of your tree with NCBI taxonomy. Are there any taxa that are not placed correctly on your tree?
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out below frog, which would branch out below the group of mammals (see illustration below).
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy’s “Common Tree” function (see illustration below).
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group Euarchontoglires.
Second, yeast is placed further from the animals than the plants are, which is also not correct. Yeast (and indeed all Fungi) actually belongs together with the animals in the group Opisthokonta.
A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data are included in the analysis (the law of large numbers).
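The comparison with the reference taxonomy described above amounts to asking which groupings in the gene tree are absent from the reference tree. The sketch below does this for the mammal part of the example only, using the two arrangements stated in the text; the nested-tuple encoding and the function names are assumptions made for illustration.

```python
def clades(tree):
    """Set of frozensets, one per internal node, listing the tips below it."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found


def conflicting_clades(inferred, reference):
    """Clades present in the inferred tree but absent from the reference tree."""
    return clades(inferred) - clades(reference)


# Gene tree as described in the text: Human+Macaque sister to Pig+Whale.
inferred = ((("Human", "Monkey_macaque"), ("Pig", "Fin_whale")), ("Rat", "Mouse"))
# Reference: Human+Macaque sister to Rat+Mouse (Euarchontoglires).
reference = ((("Human", "Monkey_macaque"), ("Rat", "Mouse")), ("Pig", "Fin_whale"))

for clade in conflicting_clades(inferred, reference):
    print(sorted(clade))
```

Only the primate plus pig/whale grouping is reported, since every other clade in this small example is shared by both trees.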
Step 10
- 52 results.
Search string: (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)
- 8 and 26 results, respectively.
Search strings: (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)
and (protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)
Under the Download tab in UniProt, select “Download all”, “FASTA (canonical)” and “Uncompressed”.
- Then use jEdit (or another text editor) to combine them. The combined FASTA file is here: Media:Ribosomal_proteins_34.fasta.txt
- Go to EBI’s MAFFT server, choose “Protein”, upload the combined FASTA file, and let all other options be default. When the alignment is done, click “Download Alignment File” and save the file. Then upload the result to TreeHugger and save the resulting Newick file.
- Yes, it is possible to reroot the tree so that all the cytoplasmic sequences (RL3_*) and all the mitochondrial sequences (RM03_*) are in two separate monophyletic groups. After increasing the font size of the tip labels, the tree looks like this:
- The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
- There is one difference: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. The cytoplasmic tree is more correct.
- There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the mitochondrial branches being longer (the mitochondrial tips are further away from the root).
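The branch-length argument in the last answer can be made quantitative by comparing root-to-tip distances for the two groups. The sketch below assumes the tree has already been rooted between the cytoplasmic and mitochondrial groups and is written as nested (node, branch_length) pairs; the tip names and all branch lengths are invented and are not taken from the real result.

```python
def root_to_tip(node, depth=0.0):
    """Yield (tip_name, distance_from_root) for every tip below `node`.

    A node is a pair (label_or_children, branch_length): tips carry a string,
    internal nodes carry a list of child nodes.
    """
    content, length = node
    depth += length
    if isinstance(content, str):
        yield content, depth
    else:
        for child in content:
            yield from root_to_tip(child, depth)


def mean_depth_by_prefix(tree, prefixes):
    """Average root-to-tip distance for tips whose name starts with each prefix."""
    depths = {prefix: [] for prefix in prefixes}
    for name, depth in root_to_tip(tree):
        for prefix in prefixes:
            if name.startswith(prefix):
                depths[prefix].append(depth)
    return {p: sum(v) / len(v) for p, v in depths.items() if v}


# Invented toy tree: two cytoplasmic and two mitochondrial tips.
tree = (
    [
        ([("RL3_A", 0.10), ("RL3_B", 0.12)], 0.05),
        ([("RM03_A", 0.40), ("RM03_B", 0.35)], 0.20),
    ],
    0.0,
)
print(mean_depth_by_prefix(tree, ["RL3", "RM03"]))
```

A larger mean distance for one group suggests more change along those lineages, under the assumption that both groups have been evolving for the same amount of time since the root.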
Applications of Phylogenetic Analysis
Phylogenomics and comparative genomics
Phylogenomics and comparative genomics are two closely related fields in genomics that use genomic data to study evolutionary relationships and genome evolution across different species. While they share some similarities, they differ in their focus and methods:
- Phylogenomics:
  - Phylogenomics uses large-scale genomic data, such as whole-genome sequences or large sets of orthologous genes, to reconstruct phylogenetic trees and infer evolutionary relationships among species.
  - It combines the principles of phylogenetics with genomic data to resolve difficult evolutionary relationships, such as rapid radiations, deep divergences, and other cases where traditional phylogenetic markers (e.g., single genes) carry too little signal.
  - Phylogenomics can provide insights into the evolutionary history of organisms, including the timing of speciation events, patterns of gene duplication and loss, and the evolution of novel traits.
- Comparative Genomics:
  - Comparative genomics compares the genomes of different species to identify similarities and differences in gene content, gene order, and genomic organization.
  - It aims to understand the evolutionary processes that have shaped genomes, such as gene duplication, gene loss, horizontal gene transfer, and genome rearrangements.
  - Comparative genomics can reveal the genetic basis of species-specific traits, evolutionary innovations, and adaptations to different environments.
- Relationship between Phylogenomics and Comparative Genomics:
  - Phylogenomics often relies on comparative genomics approaches to compare genomic features across species and infer evolutionary relationships from shared genetic similarities.
  - Comparative genomics provides the data and methods needed for phylogenomic analyses, such as identifying orthologous genes, constructing multiple sequence alignments, and building phylogenetic trees from genomic data.
- Applications:
  - Phylogenomics is used to reconstruct the Tree of Life, resolve phylogenetic relationships among organisms, and study the evolution of key traits, such as the origin of complex multicellularity or the evolution of vertebrate immunity.
  - Comparative genomics is used to study genome evolution, identify genetic factors underlying disease, understand the genetic basis of adaptation, and compare the genomic features of model organisms with those of other species.
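One common phylogenomic step mentioned above, combining many orthologous gene alignments into a single dataset, can be sketched in a few lines of Python. This is a minimal illustration of the concatenation (supermatrix) idea, not a full pipeline: the function name `concatenate_alignments`, the taxon names, and the toy sequences are all assumptions made for the example, and real analyses rely on dedicated software for ortholog detection, alignment, and tree inference.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa absent from a gene are padded with `missing` characters.
    Returns (supermatrix, partitions), where partitions holds the
    1-based (start, end) columns occupied by each gene.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each gene alignment needs sequences of equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Pad with gap characters when a taxon lacks this gene.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy data (hypothetical): two already-aligned orthologous genes.
gene_a = {"human": "ATGC", "mouse": "ATGA", "chicken": "ATTA"}
gene_b = {"human": "GGC", "mouse": "GGT"}  # chicken sequence missing

matrix, partitions = concatenate_alignments([gene_a, gene_b])
for taxon, seq in matrix.items():
    print(taxon, seq)
print(partitions)
```

The recorded partition coordinates matter because many tree-inference programs can fit a separate substitution model to each gene when told where one gene ends and the next begins.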
Overall, phylogenomics and comparative genomics are powerful approaches that leverage genomic data to study evolutionary relationships and genome evolution. They provide valuable insights into the genetic basis of biodiversity and the evolutionary processes that have shaped life on Earth.
Evolutionary developmental biology (Evo-Devo)
Evolutionary developmental biology, often abbreviated as Evo-Devo, is a field of biology that studies how changes in developmental processes contribute to the evolution of phenotypic diversity in organisms. Evo-Devo seeks to understand the genetic and molecular mechanisms underlying the development of various organisms and how these mechanisms have evolved over time to produce the diversity of forms seen in nature.
Key concepts and approaches in Evo-Devo include:
- Heterochrony: Changes in the timing of developmental events. For example, changes in the rate of growth or the timing of gene expression can lead to differences in body size or shape.
- Heterotopy: Changes in the location of developmental events. This can result in the evolution of novel structures or changes in the relative size or shape of existing structures.
- Heterometry: Changes in the amount of a developmental gene product, for example the level at which a gene is expressed. This can lead to changes in the size, shape, or complexity of structures.
- Gene Regulation: Evo-Devo studies how changes in gene expression patterns and regulatory networks can lead to morphological differences between species. This includes the evolution of cis-regulatory elements that control gene expression.
- Comparative Developmental Biology: By comparing the developmental processes of different species, Evo-Devo researchers can identify conserved developmental pathways as well as differences that contribute to evolutionary change.
- Evolution of Novelty: Evo-Devo seeks to understand how new traits and structures evolve, including the role of gene duplication, gene loss, and co-option of existing developmental pathways.
- Experimental Evolution: Some Evo-Devo research uses experimental evolution, in which laboratory populations are maintained under controlled conditions for many generations so that researchers can observe how developmental traits change in response to selection.
- Paleontological and Morphological Data: Evo-Devo researchers often use data from the fossil record and comparative morphology to infer how developmental processes have evolved over time and how they relate to morphological diversity.
Evo-Devo has provided insights into a wide range of biological questions, including the evolution of animal body plans, the origin of key innovations in evolution, and the genetic basis of morphological diversity. It has also contributed to our understanding of human evolution and development, shedding light on the genetic and developmental basis of human traits and diseases.
Phylogenetic signal and trait evolution
Phylogenetic signal refers to the tendency of closely related species to resemble each other more than they resemble species drawn at random from the same phylogenetic tree. In other words, it is the degree to which the evolutionary history of a group of species is reflected in their trait similarities or differences.
Phylogenetic signal can be quantified using various metrics, such as Blomberg’s K, Pagel’s lambda, or phylogenetic correlograms. Blomberg’s K and Pagel’s lambda compare the observed trait similarity among species to the similarity expected if the trait had evolved along the phylogeny by Brownian motion (a random walk). Values near 1 match that expectation, while values near 0 indicate that trait values are largely independent of the phylogeny. A high phylogenetic signal indicates that closely related species resemble each other more than species drawn at random from the tree, suggesting that the trait has evolved in a phylogenetically conserved manner. Conversely, a low phylogenetic signal suggests that trait evolution has been more labile and is not strongly structured by the phylogenetic relationships among species.
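To make one of these metrics concrete, the sketch below computes Blomberg’s K in plain Python for a toy four-species tree. It is an illustration under stated assumptions: the covariance matrix `cov` encodes a hypothetical tree ((A:1,B:1):1,(C:1,D:1):1), where each entry is the branch length two species share from the root, the trait values are invented, and the helper names `invert` and `blomberg_k` are illustrative. For real data, an established package (for example phytools in R) is the safer choice.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def blomberg_k(traits, cov):
    """Blomberg's K: observed MSE ratio divided by its Brownian-motion expectation."""
    n = len(traits)
    inv = invert(cov)
    inv_total = sum(sum(row) for row in inv)
    # Phylogenetically weighted mean (estimate of the root state).
    root = sum(inv[i][j] * traits[j] for i in range(n) for j in range(n)) / inv_total
    resid = [x - root for x in traits]
    ss_plain = sum(r * r for r in resid)
    ss_phylo = sum(resid[i] * inv[i][j] * resid[j]
                   for i in range(n) for j in range(n))
    observed = ss_plain / ss_phylo
    expected = (sum(cov[i][i] for i in range(n)) - n / inv_total) / (n - 1)
    return observed / expected


# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1); species order is A, B, C, D.
cov = [[2.0, 1.0, 0.0, 0.0],
       [1.0, 2.0, 0.0, 0.0],
       [0.0, 0.0, 2.0, 1.0],
       [0.0, 0.0, 1.0, 2.0]]

k_clustered = blomberg_k([1.0, 1.2, 3.0, 3.1], cov)  # sister species alike
k_scrambled = blomberg_k([1.0, 3.0, 1.2, 3.1], cov)  # sister species differ
print(round(k_clustered, 2), round(k_scrambled, 2))
```

In this toy example the trait that tracks the two clades gives K above 1, and the same values shuffled across clades give K below 1. With only four species the numbers say nothing about significance; in practice K is usually paired with a permutation test.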
Phylogenetic signal is important in evolutionary biology because it describes how trait variation is distributed across a phylogeny. Traits with a strong phylogenetic signal are often interpreted as reflecting genetic constraints, shared evolutionary history, or adaptation to similar environments, while traits with a weak phylogenetic signal may be more influenced by ecological or environmental factors. Because different evolutionary processes can produce similar levels of signal, these interpretations are best treated as hypotheses to test rather than as direct evidence of a particular process.
Understanding phylogenetic signal is also important for comparative analyses. Phylogenetic comparative methods (PCMs) exist because species that share ancestry are not statistically independent data points, and many of these methods assume a specific model of trait evolution, such as Brownian motion. By quantifying phylogenetic signal, researchers can assess how strongly a trait is structured by phylogeny and whether the assumed model is reasonable for a particular group of species.
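Pagel’s lambda connects signal estimation to comparative methods in a simple way: it multiplies the off-diagonal entries of the phylogenetic covariance matrix, so lambda = 1 keeps the covariances implied by the tree and lambda = 0 treats species as independent. The sketch below shows only that transformation. The function name `lambda_transform` and the two-species matrix are assumptions for illustration, and restricting lambda to the range 0 to 1 is a simplification.

```python
def lambda_transform(cov, lam):
    """Scale the off-diagonal covariances by lam, leaving variances unchanged."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("this sketch assumes 0 <= lambda <= 1")
    n = len(cov)
    return [[cov[i][j] if i == j else lam * cov[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical covariance matrix for two sister species.
sister_cov = [[2.0, 1.0],
              [1.0, 2.0]]

print(lambda_transform(sister_cov, 1.0))  # tree structure kept as is
print(lambda_transform(sister_cov, 0.5))  # shared history down-weighted
print(lambda_transform(sister_cov, 0.0))  # species treated as independent
```

In model-fitting software, lambda is typically estimated by maximum likelihood, and the transformed matrix is then used in place of the original one.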
In summary, phylogenetic signal is a fundamental concept in evolutionary biology that helps us understand how traits evolve and diversify across the tree of life. It provides a framework for studying trait evolution in a phylogenetic context and can help us unravel the processes driving the remarkable diversity of life on Earth.
Phylogeny-guided drug discovery and biotechnology
Phylogeny-guided drug discovery and biotechnology involve using evolutionary relationships among organisms to identify novel compounds, genes, or biochemical pathways with potential applications in medicine, agriculture, or industry. This approach leverages the diversity of life on Earth to discover new biological resources that can be used to develop new drugs, biotechnological products, or sustainable solutions. Here are some key aspects of phylogeny-guided drug discovery and biotechnology:
- Biodiversity as a Resource: The vast diversity of life forms on Earth represents a rich source of potentially valuable compounds and genes. Because related species often share biosynthetic pathways, the evolutionary relationships among organisms can point researchers to groups that are likely to produce bioactive compounds or possess unique biochemical pathways, such as close relatives of a known producer.
- Bioinformatics and Comparative Genomics: Phylogeny-guided drug discovery often relies on bioinformatics and comparative genomics to analyze genetic sequences and predict the functions of genes and proteins. By comparing the genomes of different organisms, researchers can identify genes that are unique to certain groups and may encode novel bioactive compounds or enzymes.
- Natural Products Discovery: Many drugs and biotechnological products are derived from natural sources, such as plants, microbes, and marine organisms. Phylogeny-guided approaches can help researchers identify new sources of natural products and screen them for bioactivity.
- Metagenomics and Environmental Sampling: Metagenomics involves studying the genetic material recovered directly from environmental samples, such as soil or water. By analyzing the genetic diversity in environmental samples, researchers can identify novel genes and biochemical pathways that may have biotechnological or medical applications.
- Bioprospecting and Traditional Knowledge: Phylogeny-guided approaches can also involve working with indigenous communities and traditional healers to identify plants or organisms with medicinal properties. This approach combines traditional knowledge with modern scientific methods to discover new drugs and biotechnological products.
- Drug Development and Optimization: Once potential drug candidates or biotechnological products are identified, they undergo further testing and development to optimize their efficacy, safety, and stability. This process may involve chemical modification, formulation development, and preclinical and clinical testing.
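The comparative-genomics step described above, finding genes that are restricted to one group of organisms, can be sketched as a set operation on a gene family presence/absence table. Everything named here is hypothetical: the species labels, the gene family identifiers, and the function `clade_specific_families` are invented for the example, and in a real study the table would come from ortholog clustering of annotated genomes.

```python
def clade_specific_families(presence, clade):
    """Return gene families found in every clade member and in no other species.

    presence: dict mapping species name -> set of gene family identifiers.
    clade: set of species names that form the group of interest.
    """
    inside = [families for species, families in presence.items() if species in clade]
    outside = [families for species, families in presence.items() if species not in clade]
    if not inside:
        return set()
    shared_in_clade = set.intersection(*inside)
    seen_elsewhere = set().union(*outside)
    return shared_in_clade - seen_elsewhere


# Hypothetical presence/absence table for four genomes.
presence = {
    "actino_1": {"FAM_RIBO", "FAM_PKS", "FAM_ABC"},
    "actino_2": {"FAM_RIBO", "FAM_PKS"},
    "firmi_1": {"FAM_RIBO", "FAM_ABC"},
    "proteo_1": {"FAM_RIBO"},
}

candidates = clade_specific_families(presence, {"actino_1", "actino_2"})
print(candidates)
```

Requiring a family in every clade member is a strict choice; a study tolerant of incomplete genome assemblies might instead accept families present in most members.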
Overall, phylogeny-guided drug discovery and biotechnology offer a promising approach to discovering new drugs, biotechnological products, and sustainable solutions by tapping into the rich diversity of life on Earth. By integrating evolutionary biology with modern biotechnology and bioinformatics, researchers can uncover valuable resources that have the potential to benefit society and improve human health.