Exploring Precision Medicine with Bioinformatics – A High School Student’s Guide

March 1, 2024, by admin

Precision medicine is an innovative approach to healthcare that takes into account individual differences in people’s genes, environments, and lifestyles. It aims to customize medical treatment and healthcare practices to the unique characteristics of each patient, rather than adopting a one-size-fits-all approach.

At its core, precision medicine recognizes that individuals respond differently to various treatments due to genetic, environmental, and lifestyle factors. By understanding these differences, healthcare providers can tailor treatments to maximize effectiveness and minimize side effects. This approach contrasts with traditional medicine, which often relies on generalized treatment plans based on population averages.

One of the key components of precision medicine is the use of genetic information. While many doctors already consider factors such as a patient’s age, gender, and medical history when making treatment decisions, precision medicine goes a step further by incorporating genetic data. This can involve analyzing a patient’s entire genome to identify specific genetic markers that may influence their response to certain medications or susceptibility to particular diseases.

Precision medicine has the potential to revolutionize healthcare by enabling more targeted and personalized treatments. It can lead to better outcomes for patients, as treatments are tailored to their specific needs and characteristics. Additionally, by identifying genetic predispositions to diseases, precision medicine can also help in disease prevention and early detection.

While precision medicine holds great promise, there are also challenges to its widespread adoption. These include issues related to data privacy, the cost of genetic testing, and the need for healthcare providers to have the necessary training and infrastructure to implement precision medicine approaches effectively.

Overall, precision medicine represents a significant advancement in healthcare, offering the potential to improve outcomes and quality of life for many patients. As research in genetics and personalized medicine continues to advance, precision medicine is likely to play an increasingly important role in the future of healthcare.

What can we learn from one’s genome?

The human genome is the complete set of genetic information in a human. It contains all the instructions needed to build and maintain an organism. Understanding an individual’s genome can provide valuable insights into various aspects of their health and biology. Here are some key aspects we can learn from analyzing someone’s genome:

Genetic Variants: By comparing an individual’s genome to the reference genome (the standard sequence used as a basis for comparison), we can identify differences known as genetic variants. These variants can include single nucleotide variants (SNVs), insertions, deletions, and structural variants.
Ethnic Background: Genetic studies have identified certain genetic variants that are more common in specific ethnic groups. By comparing an individual’s genome to a typical genome from their ethnic background, we can identify genetic variants that are unique or more prevalent in their population.
Family Relationships: Comparing the genomes of family members, such as parents, siblings, and other relatives, can reveal patterns of inheritance and genetic similarities. This can help in understanding genetic traits, diseases, and ancestry.
Disease Risk: Certain genetic variants are associated with an increased risk of developing specific diseases. By analyzing an individual’s genome, we can assess their risk for certain conditions and potentially take preventive measures.
Drug Response: Genetic differences can influence how individuals respond to medications. Pharmacogenomic studies aim to identify genetic variants that affect drug metabolism and efficacy, allowing for personalized medicine approaches.
Ancestry and Evolution: Studying genetic variants can provide insights into human evolution and migration patterns. By comparing genomes from different populations, researchers can trace genetic lineages and migration routes.
Non-Coding Regions: While genes make up only a small part of the genome, non-coding regions play crucial roles in gene regulation and expression. Studying these regions can provide insights into gene regulation and disease mechanisms.

Overall, analyzing an individual’s genome can provide valuable information about their health, ancestry, and genetic traits. However, it’s essential to consider the ethical and privacy implications of genomic analysis, as it involves sensitive personal information.

Cancer genome

The “cancer genome” is a term used to describe the genetic alterations that occur in cancer cells. Cancer is not a single disease but rather a collection of diseases, each with its own set of genetic mutations. These mutations can affect various genes involved in cell growth, division, and death, leading to uncontrolled cell growth and tumor formation.

Precision medicine offers a promising approach to cancer treatment by leveraging the unique genetic profile of a patient’s tumor to identify the most effective treatment strategies. This approach, known as precision cancer medicine, involves sequencing the genome of a tumor to identify specific mutations that may be driving its growth. Based on this information, oncologists can tailor treatment plans to target these specific mutations, potentially improving treatment outcomes and reducing side effects.

However, precision cancer medicine also presents challenges. Tumor genomes can be complex, with multiple mutations occurring across different genes. Additionally, tumors can evolve over time, acquiring new mutations that may affect their response to treatment. This complexity underscores the need for ongoing research and innovation in the field of precision cancer medicine.

While a full discussion of precision cancer medicine is beyond the scope of this guide, it highlights the potential of genomic analysis in personalized medicine and the importance of understanding the genetic basis of cancer for developing more effective treatments.

Approaches to genome sequencing

There are several approaches to genome sequencing, each with its own advantages and applications:

Targeted Sequencing: This approach sequences only specific regions of interest in the genome, such as known functional regions or regions associated with particular diseases. This method is often used in studies where only certain areas of the genome are of interest, such as in targeted cancer sequencing or in studies of specific genetic disorders.
Whole Exome Sequencing (WES): WES involves sequencing only the protein-coding regions of the genome, known as the exome. While the exome makes up only about 1-2% of the genome, it contains the vast majority of disease-causing mutations. WES is often used in clinical settings to diagnose genetic disorders and in research to study the genetic basis of diseases.
Whole Genome Sequencing (WGS): WGS involves sequencing the entire genome, including both coding and non-coding regions. WGS provides the most comprehensive view of an individual’s genetic makeup and can reveal a wide range of genetic variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations. WGS is used in research to study complex diseases, population genetics, and evolutionary biology. It is also increasingly being used in clinical settings for diagnosing genetic disorders, predicting disease risk, and guiding personalized treatment decisions.

Each of these approaches has its own strengths and limitations, and the choice of sequencing method depends on the specific goals of the study or clinical application. Targeted sequencing is often more cost-effective and efficient for studying specific regions of interest, while WGS provides the most comprehensive and detailed genetic information.

Sampling genome sequence

Sampling the genome sequence involves focusing on the regions that are the most different between individuals, as the vast majority of the genome (about 99.9%) is identical between individuals. By concentrating on these variable regions, researchers can identify genetic differences that may be associated with traits, diseases, or other characteristics of interest.

One approach to sampling the genome is to read just the approximately one million most variable locations, known as single nucleotide polymorphisms (SNPs). This approach provides a snapshot of genetic variation across the genome and is commonly used in studies of human population genetics and complex diseases.

Another approach is to sequence only the genes, which make up about 1-2% of the genome. This method, known as whole exome sequencing, focuses on the protein-coding regions of the genome, where most disease-causing mutations are located. Whole exome sequencing is often used in clinical settings to diagnose genetic disorders and in research to study the genetic basis of diseases.

For a more comprehensive view, researchers can sequence the entire genome, including both coding and non-coding regions. This approach, known as whole genome sequencing, provides a detailed map of an individual’s genetic makeup and can reveal a wide range of genetic variations, including SNPs, insertions, deletions, and structural variations. Whole genome sequencing is used in research to study complex diseases, population genetics, and evolutionary biology, as well as in clinical settings for diagnosing genetic disorders and guiding personalized treatment decisions.
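To get a feel for the scale of each sampling strategy, the percentages above can be turned into rough counts of DNA letters. The short sketch below is only back-of-the-envelope arithmetic: it assumes a rounded genome size of 3 billion letters, and the names `GENOME_SIZE` and `sampling_sizes` are made up for this illustration.

```python
# Rough arithmetic only: the genome size is rounded to 3 billion DNA letters.
GENOME_SIZE = 3_000_000_000


def sampling_sizes():
    """Return the approximate number of DNA letters each strategy looks at."""
    return {
        # About 0.1% of positions differ between two people (99.9% identical).
        "positions_that_differ": GENOME_SIZE // 1000,
        # A SNP panel reads roughly one million variable positions.
        "snp_panel": 1_000_000,
        # The exome is about 1-2% of the genome.
        "exome_low": GENOME_SIZE // 100,
        "exome_high": 2 * (GENOME_SIZE // 100),
        # Whole genome sequencing reads everything.
        "whole_genome": GENOME_SIZE,
    }


if __name__ == "__main__":
    for name, size in sampling_sizes().items():
        print(f"{name:>22}: {size:>13,} letters")
```

Seen this way, a SNP panel reads about one letter in every 3,000, the exome covers 30 to 60 million letters, and the 0.1% difference between two people still amounts to about 3 million positions.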

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) are variations in a single nucleotide (DNA letter) that occur at a specific position in the genome. They are called “single” because they involve a single nucleotide change, “nucleotide” because the change occurs at the level of individual DNA letters (A, C, G, or T), and “polymorphism” because the variation is common in the population (occurring in at least 1% of individuals).

SNPs can occur within a gene, where they can influence traits such as disease susceptibility or drug response, or they can occur between genes, where they may have less direct effects on phenotype but can still be useful as genetic markers for mapping traits of interest. SNPs are the most common type of genetic variation in the human genome and are the basis for much of the genetic diversity seen among individuals. They are used in a wide range of genetic studies, including association studies to identify genetic variants associated with diseases or traits, population genetics studies to understand human evolutionary history and migration patterns, and pharmacogenomics studies to predict how individuals will respond to drugs based on their genetic makeup.

Humans, like many other organisms, are diploid, meaning they have two sets of chromosomes, one inherited from each parent. This means that at each SNP position an individual carries two alleles, and so can have zero, one, or two copies of the variant allele.

When both copies carry the same allele, the genotype is called homozygous (e.g., CC). If the two copies carry different alleles, it is called heterozygous (e.g., CT).
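The homozygous and heterozygous labels can be written as a tiny function. This is a minimal sketch, assuming a genotype is given as a two-letter string such as "CT"; the function names are hypothetical and chosen only for this example.

```python
def classify_genotype(genotype, reference_allele):
    """Label a two-letter genotype such as "CC" or "CT"."""
    if len(genotype) != 2:
        raise ValueError("a diploid genotype needs exactly two alleles")
    first, second = genotype
    if first != second:
        return "heterozygous"
    if first == reference_allele:
        return "homozygous reference"
    return "homozygous variant"


def variant_copies(genotype, variant_allele):
    """Count how many of the two copies carry the variant allele (0, 1, or 2)."""
    return genotype.count(variant_allele)


if __name__ == "__main__":
    for genotype in ("CC", "CT", "TT"):
        label = classify_genotype(genotype, reference_allele="C")
        copies = variant_copies(genotype, variant_allele="T")
        print(genotype, label, copies)
```

With C as the reference allele and T as the variant, CC has zero variant copies, CT has one, and TT has two.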

SNPs that are physically close to each other on a chromosome tend to be inherited together as a block, a phenomenon known as linkage disequilibrium. This means that if one SNP in a block is known, other SNPs in the block can be predicted with high accuracy, which is useful for genetic studies.

A “tag SNP” is a SNP that is used to represent a set of other SNPs that are in linkage disequilibrium with it. By genotyping a small number of tag SNPs, researchers can indirectly infer the genotypes of other SNPs in the same block, reducing the number of SNPs that need to be directly genotyped in a study.
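One common way to measure how strongly two SNPs travel together is the statistic r², which ranges from 0 (independent) to 1 (one SNP perfectly predicts the other). The sketch below computes it from a list of haplotypes, each written as two letters: the allele at the first SNP followed by the allele at the second. The haplotypes are toy data invented for illustration, and `r_squared` is a hypothetical helper name.

```python
def r_squared(haplotypes, allele_a, allele_b):
    """Linkage disequilibrium (r squared) between two SNPs.

    haplotypes: list of two-letter strings, e.g. "AG" means allele A at
    SNP 1 and allele G at SNP 2 on the same chromosome copy.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == allele_a + allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # how far the pair is from independence
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must vary in the sample")
    return d * d / denominator


if __name__ == "__main__":
    # Toy data: A always travels with G, and C always travels with T.
    linked = ["AG", "AG", "AG", "CT", "CT", "CT"]
    # Toy data: every combination is equally common.
    unlinked = ["AG", "AT", "CG", "CT"]
    print(r_squared(linked, "A", "G"))
    print(r_squared(unlinked, "A", "G"))
```

In the linked sample, knowing the first SNP tells you the second, so the first SNP could serve as a tag for both. In the unlinked sample it tells you nothing.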

Ethnicity plays a significant role in genetic variation, as different populations can have distinct allele frequencies for certain genetic variants. Large-scale projects, such as HapMap and the 1000 Genomes Project, have collected genotypic data from diverse populations to better understand human genetic variation.

The HapMap project, which stands for Haplotype Map of the Human Genome, was launched to identify and catalog genetic similarities and differences in human populations worldwide. It focused on studying common patterns of genetic variation and haplotype structure, particularly in populations with African, Asian, and European ancestry. The HapMap project provided valuable data for understanding the genetic basis of complex diseases and traits.

The 1000 Genomes Project aimed to build a comprehensive map of human genetic variation by sequencing the genomes of individuals from diverse populations around the world. This project significantly expanded our knowledge of human genetic variation and provided a valuable resource for studying the genetic basis of diseases and population history.

Some of the populations included in these projects are:

Yoruba in Ibadan, Nigeria (YRI)
Japanese in Tokyo, Japan (JPT)
Han Chinese in Beijing, China (CHB)
Utah residents with ancestry from northern and western Europe (CEU)

These populations, or ethnic groups, were chosen to represent a broad range of human genetic diversity. Researchers can use data from these projects to study how genetic variation relates to traits and diseases in different populations, leading to a better understanding of human biology and evolution.

For example, let’s consider the SNP rs1834640. This SNP is a common genetic variant that has been studied in various populations. By examining its allele frequencies in different populations, researchers can gain insights into its evolutionary history and potential associations with traits or diseases.
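Comparing allele frequencies between populations comes down to counting. The sketch below shows the calculation on invented genotypes for two placeholder groups. These are not real data for rs1834640 or for any HapMap population, and the names `allele_frequency`, `group_1`, and `group_2` are hypothetical.

```python
def allele_frequency(genotypes, allele):
    """Fraction of all chromosome copies in a sample that carry `allele`."""
    total_copies = 2 * len(genotypes)  # each person has two copies
    if total_copies == 0:
        raise ValueError("need at least one genotype")
    carrying = sum(genotype.count(allele) for genotype in genotypes)
    return carrying / total_copies


if __name__ == "__main__":
    # Invented toy genotypes, four people per group (not real study data).
    toy_samples = {
        "group_1": ["AA", "AG", "GG", "AG"],
        "group_2": ["GG", "GG", "AG", "GG"],
    }
    for group, genotypes in toy_samples.items():
        print(group, allele_frequency(genotypes, "A"))
```

Real projects apply the same counting to hundreds or thousands of people per population, which is what makes the resulting frequencies reliable enough to compare.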

Precision medicine on Dr. Watson

Precision medicine applied to an individual, such as Dr. Watson, involves analyzing their genome sequence to understand various aspects of their health and potential responses to treatments. Here’s how it could be approached:

Potential Genetic Risk for Disease: By analyzing Dr. Watson’s genome, researchers can identify genetic variants associated with an increased risk of certain diseases. For example, specific SNPs may indicate a higher risk for conditions like cardiovascular disease, cancer, or Alzheimer’s disease.
Expected Drug Response: Genetic variations can also influence how individuals respond to medications. Pharmacogenomic studies can identify genetic markers that affect drug metabolism, efficacy, and potential side effects. This information can help tailor medication choices and dosages to maximize effectiveness and minimize adverse reactions.
Optimal Disease Treatment: With a detailed understanding of Dr. Watson’s genetic makeup, healthcare providers can design personalized treatment plans. This may involve selecting medications that are most likely to be effective based on his genetic profile or adjusting treatment regimens to account for his unique genetic factors.

Challenges in precision medicine include the vast amount of data involved in analyzing a person’s genome. Even though humans are about 99.9% genetically identical, the remaining 0.1% difference translates to millions of genetic variations across the 3 billion DNA letters in the genome. Determining which of these variations are clinically relevant and how they impact health and treatment outcomes is a complex task that requires sophisticated analysis and interpretation.

In Dr. Watson’s case, analyzing his genome would require identifying relevant genetic variations, interpreting their significance in relation to disease risk and drug response, and integrating this information into his healthcare plan. Despite the challenges, precision medicine offers the potential to greatly improve health outcomes by providing personalized and targeted interventions based on an individual’s unique genetic makeup.

Computational challenges

The computational challenges involved in analyzing a large number of genome sequences are significant and require specialized algorithms and computing resources. Here’s how each step would be addressed:

Alignment to the Reference Genome: To align each piece of the genome sequence to the reference genome, bioinformatics tools use algorithms such as the Burrows-Wheeler Transform (BWT) and the FM-index, which enable rapid and efficient alignment of short DNA sequences to a reference genome. These algorithms are implemented in software packages like BWA and Bowtie.
Processing 100 Million DNA Pieces: Processing a large number of DNA sequences requires high-performance computing resources. Parallel processing techniques can be used to distribute the workload across multiple processors or compute nodes, speeding up the analysis. Cloud computing platforms also offer scalable resources for processing large genomic datasets.
Identifying Differences from the Reference Genome: Once the sequences are aligned, identifying the differences (variants) from the reference genome involves comparing the aligned sequences to the reference and identifying positions where the sequences differ. This process, known as variant calling, is performed using algorithms that consider factors such as sequencing errors and alignment quality to accurately identify variants.
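The Burrows-Wheeler idea behind the alignment step can be shown on a toy scale. The sketch below builds the transform of a short reference and then uses "backward search" to find every exact occurrence of a read. It is a teaching example under simplifying assumptions, not how BWA or Bowtie are implemented: real aligners use compressed tables, tolerate mismatches and gaps, and never sort whole suffixes this way. The function names are invented for this illustration.

```python
def bwt_and_suffix_array(text):
    """Return the Burrows-Wheeler transform and suffix array of text + "$"."""
    text += "$"  # end marker that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the letter just before each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    return bwt, suffix_array


def find_read(reference, read):
    """Return the sorted start positions of exact matches of read in reference."""
    if not read:
        raise ValueError("read must not be empty")
    bwt, suffix_array = bwt_and_suffix_array(reference)
    first_column = sorted(bwt)
    # For each letter, the row where its block starts in the sorted first column.
    block_start = {letter: first_column.index(letter) for letter in set(bwt)}
    low, high = 0, len(bwt)  # half-open range of rows that still match
    for letter in reversed(read):  # extend the match one letter to the left
        if letter not in block_start:
            return []
        low = block_start[letter] + bwt[:low].count(letter)
        high = block_start[letter] + bwt[:high].count(letter)
        if low >= high:
            return []
    return sorted(suffix_array[low:high])


if __name__ == "__main__":
    print(find_read("GATTACAGATTACA", "TTAC"))
    print(find_read("GATTACAGATTACA", "GGG"))
```

The search only ever touches one letter of the read at a time, which is why an index of this kind can place millions of short reads quickly once it has been built for the reference.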

Overall, analyzing large genomic datasets requires a combination of specialized algorithms, high-performance computing resources, and efficient data management techniques. Advances in these areas have greatly improved our ability to analyze complex genomic data and extract meaningful insights into genetic variation and its impact on health and disease.
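The variant calling step described above can also be sketched in a few lines. The example below stacks aligned reads on top of the reference, counts the letters seen at each position, and reports a position only when enough reads cover it and a large enough fraction of them disagree with the reference. The thresholds, the toy reads, and the name `call_snvs` are all assumptions made for illustration. Real variant callers also weigh base quality and alignment quality, and handle insertions and deletions.

```python
from collections import Counter, defaultdict


def call_snvs(reference, aligned_reads, min_depth=4, min_alt_fraction=0.3):
    """Report (position, ref_base, alt_base, alt_count, depth) for likely SNVs.

    aligned_reads: list of (start_position, read_sequence) pairs, already
    aligned to the reference without gaps.
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    calls = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        ref_base = reference[position]
        for alt_base, alt_count in counts.items():
            if alt_base != ref_base and alt_count / depth >= min_alt_fraction:
                calls.append((position, ref_base, alt_base, alt_count, depth))
    return calls


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    reads = [
        (0, "ACGTACGT"),
        (0, "ACGAACGT"),  # carries A instead of T at position 3
        (0, "ACGAACGT"),  # same difference again
        (0, "ACGTACGT"),
        (0, "ACGTACTT"),  # a one-off difference at position 6
    ]
    print(call_snvs(reference, reads))
```

Position 3 is reported because two of five reads agree on the same difference, which is the pattern expected for a heterozygous site. The single odd letter at position 6 falls below the threshold and is treated as a likely sequencing error.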

When comparing common DNA variants to those found in Genome-Wide Association Studies (GWAS), it’s essential to understand the context and implications of these findings. Here’s how one might react to such findings:

Understanding the Association: It’s crucial to recognize that an association between a genetic variant and a trait or disease does not imply causation. GWAS findings provide statistical associations, and further research is needed to understand the biological mechanisms underlying these associations.
Personal Health Considerations: If an individual’s genetic test indicates a variant associated with a specific trait or disease, it does not necessarily mean they will develop that trait or disease. Genetics is just one factor, and environmental and lifestyle factors also play a significant role.
Medical Consultation: If genetic testing reveals variants associated with a higher risk of certain conditions, individuals may consider discussing the results with a genetic counselor or healthcare provider. This can help them understand the implications of the findings and make informed decisions about their health.
Behavioral Changes: Some genetic findings may suggest lifestyle changes that could reduce the risk of developing certain conditions. For example, individuals with a genetic predisposition to type 2 diabetes may benefit from maintaining a healthy diet and regular exercise routine.
Emotional Response: Learning about genetic risk factors can sometimes evoke strong emotional responses. It’s important to approach these findings with a balanced perspective and seek support if needed.

In summary, reacting to genetic findings from GWAS requires a nuanced understanding of genetics and its implications for health. Consulting with healthcare professionals and genetic counselors can help individuals make informed decisions about their health based on genetic information.
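To see what a statistical association means in practice, consider the simplest possible test: count how often an allele appears in people with a condition (cases) and without it (controls), then compare. The sketch below computes an odds ratio and a chi-square statistic from such a table. The counts are invented and `allelic_association` is a hypothetical name. Real GWAS analyses test hundreds of thousands of variants, adjust for ancestry and other factors, and correct for the number of tests performed.

```python
def allelic_association(case_risk, case_other, control_risk, control_other):
    """Return (odds_ratio, chi_square) for a 2x2 table of allele counts."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    if b * c == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    row_and_column_totals = (a + b) * (c + d) * (a + c) * (b + d)
    if row_and_column_totals == 0:
        raise ValueError("every row and column needs at least one count")
    chi_square = n * (a * d - b * c) ** 2 / row_and_column_totals
    return odds_ratio, chi_square


if __name__ == "__main__":
    # Invented allele counts: 100 alleles from cases, 100 from controls.
    odds_ratio, chi_square = allelic_association(60, 40, 40, 60)
    print(odds_ratio, chi_square)
```

An odds ratio above 1 means the allele is more common among cases, but as noted above this is a correlation in a group. Many carriers of the allele are healthy, and many people with the condition do not carry it.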

Predicting the effect of rare DNA variants is challenging, especially when they are less well-studied. Here’s how researchers might approach this task:

Functional Impact Prediction: Computational tools can predict the functional impact of genetic variants based on their location in the genome and the predicted effect on protein structure or function. These tools use algorithms that assess the likelihood of a variant affecting gene expression, protein function, or regulatory elements.
Population Frequency: Rare variants are often defined as those with a frequency of less than 1% in the population. However, the definition of “rare” can vary depending on the specific population being studied. Understanding the frequency of a variant in different populations can provide insights into its potential impact.
Clinical Correlation: Rare variants that are known to cause severe diseases, such as sickle-cell anemia, Tay-Sachs disease, or muscular dystrophy, have been extensively studied and characterized. For newly discovered rare variants, researchers may look for similarities to known disease-causing variants and assess their potential impact based on this information.
Functional Studies: Experimental studies, such as functional assays or animal models, can provide direct evidence of the impact of a rare variant. These studies can help elucidate the molecular mechanisms underlying the variant’s effects and its potential role in disease.
Integration of Multiple Data Sources: To predict the effect of rare variants, researchers often integrate multiple sources of data, including genetic, genomic, and functional data. This integrative approach can provide a more comprehensive understanding of the potential impact of rare variants on health and disease.

In summary, predicting the effect of rare DNA variants requires a combination of computational, experimental, and clinical approaches. Integrating these approaches can help researchers understand the functional significance of rare variants and their potential role in disease.
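The population frequency point can be made concrete with a small sketch. It labels a variant as rare or common in each population using the 1% cutoff mentioned above. The frequencies and population labels are invented for illustration, and `classify_by_frequency` is a hypothetical name.

```python
def classify_by_frequency(frequencies, rare_threshold=0.01):
    """Label a variant "rare" or "common" in each population.

    frequencies: dict mapping a population label to the allele frequency
    (a number between 0 and 1) observed in that population.
    """
    labels = {}
    for population, frequency in frequencies.items():
        if not 0 <= frequency <= 1:
            raise ValueError(f"invalid frequency for {population}: {frequency}")
        labels[population] = "rare" if frequency < rare_threshold else "common"
    return labels


if __name__ == "__main__":
    # Invented frequencies for one variant in two placeholder populations.
    print(classify_by_frequency({"population_1": 0.002, "population_2": 0.08}))
```

The same variant can land on different sides of the cutoff in different populations, which is one reason frequency alone does not settle whether a variant matters.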

Dr. Watson’s genome contains a total of 3.3 million genetic variants. Of these, 2.7 million are common variants, meaning they are present in a significant portion of the population, while 600,000 are rare variants, which are less common.

Among these variants, 10,500 result in changes in the protein sequence. This includes 9,000 common variants and 1,500 rare variants. Common variants are more likely to have been studied and understood in terms of their potential effects on protein function.

Of all the variants, approximately 7% were predicted to be “probably damaging” to protein function. This prediction is based on computational algorithms that assess the potential impact of a variant on protein structure or function. However, it’s important to note that these predictions are not definitive and would need to be validated through experimental studies to confirm their effects.

Understanding the distribution and potential impact of genetic variants in Dr. Watson’s genome provides valuable insights into his genetic makeup and potential predisposition to certain traits or diseases. Further research and analysis would be needed to fully understand the implications of these variants for his health and well-being.
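The counts quoted above can be checked with a few lines of arithmetic, which also shows how small the protein-changing fraction is. The sketch uses only the numbers stated in the text; the dictionary layout and function names are made up for this example.

```python
# Variant counts for Dr. Watson's genome as quoted in the text above.
WATSON_VARIANTS = {
    "all": {"common": 2_700_000, "rare": 600_000},
    "protein_changing": {"common": 9_000, "rare": 1_500},
}


def total(category):
    """Total number of variants in a category (common plus rare)."""
    return sum(WATSON_VARIANTS[category].values())


def rare_share(category):
    """Fraction of the variants in a category that are rare."""
    return WATSON_VARIANTS[category]["rare"] / total(category)


def protein_changing_share():
    """Fraction of all variants that change a protein sequence."""
    return total("protein_changing") / total("all")


if __name__ == "__main__":
    print(f"all variants:              {total('all'):,}")
    print(f"protein-changing variants: {total('protein_changing'):,}")
    print(f"share changing a protein:  {protein_changing_share():.2%}")
    print(f"rare share, all variants:  {rare_share('all'):.1%}")
    print(f"rare share, protein-chg.:  {rare_share('protein_changing'):.1%}")
```

The totals match the text (3.3 million and 10,500), and only about 0.3% of the variants change a protein sequence at all.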

Generating protein structures

Generating protein structures is both a science and an art, relying on a combination of experimental techniques and computational modeling. Here’s an overview of how protein structures are generated and studied:

Experimental Techniques: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are the two primary experimental methods used to determine protein structures. X-ray crystallography involves crystallizing a protein and then using X-rays to determine the positions of atoms in the crystal. NMR spectroscopy, on the other hand, involves analyzing the interactions between atomic nuclei in a protein to determine its structure in solution.
Protein Data Bank (PDB): The Protein Data Bank is a repository of experimentally determined protein structures. It contains more than 200,000 structures and is a valuable resource for researchers studying protein structure and function.
Computational Prediction: For proteins whose structures have not been experimentally determined, computational methods can be used to predict their structures. One common approach is homology modeling, which predicts a protein’s structure based on its similarity to proteins with known structures. Other methods, such as ab initio modeling, predict protein structures based on physical principles and statistical potentials.
Unknown Structures: Despite advances in experimental and computational methods, there are still many proteins for which the structure is unknown. Determining these structures remains a significant challenge in structural biology.

Overall, generating and studying protein structures is a complex and interdisciplinary field that relies on a combination of experimental and computational techniques. The ability to predict and understand protein structures is crucial for advancing our understanding of biological processes and developing new treatments for diseases.

What does a mutation do?

A mutation is a change in the DNA sequence of a gene, which can lead to changes in the protein that the gene codes for. The effects of a mutation depend on several factors, including the protein’s function, structure, the specific amino acid change, and any interactions the protein has with other molecules. Here’s how mutations can affect proteins, metabolites, and drugs:

Proteins: Mutations can alter the structure and function of proteins in various ways. For example, a mutation may change a single amino acid in the protein sequence, leading to a change in protein folding, stability, or activity. Mutations can also result in the production of a truncated protein that is shorter than the normal protein, affecting its function. Additionally, mutations can disrupt protein-protein interactions or protein-DNA interactions, impacting the protein’s role in cellular processes.
Metabolites: Proteins play essential roles in metabolism, including enzyme-catalyzed reactions that convert metabolites into different forms. Mutations in genes encoding enzymes can alter the activity or specificity of these enzymes, leading to changes in metabolite levels or the production of abnormal metabolites. These changes can affect cellular metabolism and may contribute to disease states.
Drugs: Proteins are the targets of many drugs, and mutations in the genes encoding these proteins can affect drug binding and efficacy. For example, a mutation in a drug target protein may alter the protein’s structure at the drug-binding site, reducing the affinity of the drug for the protein. This can lead to decreased drug efficacy or the development of drug resistance.

In summary, mutations can have a wide range of effects on proteins, metabolites, and drug interactions. Understanding the effects of mutations is crucial for predicting their impact on biological systems and for developing strategies to treat genetic diseases and combat drug resistance.
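The link between a DNA change and a protein change can be traced with the standard genetic code, which maps each three-letter codon to an amino acid. The sketch below classifies a single codon change as synonymous (same amino acid), missense (different amino acid), or nonsense (a new stop signal that truncates the protein). The function name is hypothetical. The first example is the well-known sickle-cell change in the beta-globin gene, where GAG becomes GTG.

```python
# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return (kind, ref_amino_acid, alt_amino_acid); "*" marks a stop codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"  # premature stop, giving a truncated protein
    elif ref_aa == "*":
        kind = "stop-loss"  # the normal stop signal is lost
    else:
        kind = "missense"  # one amino acid is swapped for another
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_codon_change("GAG", "GTG"))  # sickle-cell change: E to V
    print(classify_codon_change("TGG", "TGA"))  # tryptophan to stop
    print(classify_codon_change("GAA", "GAG"))  # both codons encode E
```

The classification says what changed, not how much it matters. Whether a missense change is harmless or damaging depends on the protein’s structure and function, as described above.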

Effects of variation

Variations in protein structure, such as those caused by mutations, can have various effects on protein function. Understanding these effects is crucial for predicting how mutations may impact health and disease. Here are some key points:

Changes in Protein Function: Mutations can alter the structure of a protein, affecting its function. For example, a mutation may change the active site of an enzyme, preventing it from binding to its substrate and carrying out its normal function.
Drug Interactions: Mutations can also affect how proteins interact with drugs. A mutation may alter the binding site of a drug, making it less effective or ineffective. Conversely, a mutation may create a new binding site, allowing a different drug to bind and potentially have a therapeutic effect.
Balance Between Protein Forms: Proteins often exist in multiple forms, such as active and inactive states. Mutations can shift the balance between these forms, leading to changes in protein function. For example, a mutation may stabilize the active form of a protein, leading to increased activity.
Drug Response: Understanding how mutations affect protein function is essential for predicting how individuals will respond to drug treatment. Some mutations may render a drug ineffective, while others may enhance its effects or lead to unexpected side effects.
Research and Therapeutics: Studying the effects of mutations on protein structure and function is a major area of research. This knowledge is essential for developing new therapeutic strategies, such as personalized medicine approaches that target specific mutations or protein variants.

In summary, variations in protein structure can have profound effects on protein function, drug interactions, and disease susceptibility. Understanding these effects is crucial for advancing our understanding of biology and developing new treatments for a wide range of diseases.

The Integrative Genomics Viewer (IGV) is a versatile tool used for visualizing and exploring genomic data. It supports various types of genomic data and can be used in different forms:

IGV Desktop: The original IGV is a Java-based desktop application. It provides a user-friendly interface for loading and visualizing genomic data from various sources, including local files and remote servers. Users can explore genomic data interactively, zooming in and out of specific regions, viewing gene annotations, and comparing multiple datasets.
IGV-Web: IGV-Web is a web-based version of the tool that allows users to visualize genomic data directly in a web browser. It provides similar functionality to the desktop version but can be accessed and used remotely, making it convenient for collaborative research and data sharing.
igv.js: igv.js is a JavaScript component that developers can use to embed IGV functionality into web pages or web applications. It provides a flexible and customizable way to integrate genomic data visualization into online resources.

Overall, IGV is a powerful tool for visualizing and exploring genomic data, supporting a wide range of genomic data types and providing interactive features for data analysis and interpretation. Its availability in multiple forms makes it accessible to a broad range of users, from individual researchers to large collaborative projects.

Cn3D (“see in 3D”) is a helper application for web browsers that allows users to view 3-dimensional structures from NCBI’s Entrez Structure database. It provides a way to visualize structures in a 3D space, along with sequence and alignment information. However, support for Cn3D on platforms other than Windows is no longer available, and the current version is expected to be the last publicly distributed version, with support potentially ending by 2024.

For users looking for an alternative, iCn3D (“I see in 3D”) is recommended. iCn3D is a web-based tool that allows users to view 3D structures directly in a web browser, without the need to install a separate application. It provides similar functionality to Cn3D, including the simultaneous display of structure, sequence, and alignment, as well as powerful annotation and alignment editing features.

Overall, both Cn3D and iCn3D are useful tools for visualizing and exploring 3D structures, with iCn3D being the recommended option for users seeking a web-based solution.

Exercise 1: Browsing James Watson’s genome

We’ll use the free IGV Genome Browser (http://www.broadinstitute.org/igv/) to look at his
sequenced genome to get a peek at the first steps of precision medicine.
1. On the top left pulldown bar, select the version of the human genome to be “Human
(hg38)”. If you use another version of the genome, the coordinates won’t match up and
the genome-mapped DNA sequences won’t make sense.
2. Note that the numbers across the top refer to chromosomes, and the blue graph (one
set of data called a “track”) at the bottom shows where the genes are. Since we start
out looking at the whole genome, the “Gene” track doesn’t make much sense.
3. Click on 6 to go to chromosome 6. The top now shows a graphical representation of
chr6.
4. On the top right, zoom in by clicking on the “+” in the box. The red box on the chr6
figure now indicates the zoomed-in region. Individual genes should start appearing at
the bottom. The blue boxes are the gene exons, connected by lines (spanning the
introns) with arrowheads indicating the gene direction (“forward” if pointing right or
“reverse” if pointing left). If you keep zooming in, the actual genome DNA sequence
appears. This is the “reference genome” sequence assembled from several different
people.
5. To load James Watson’s genome in the browser, go to File > “Load from File…” and
navigate to the file Watson_chr6.bwa.bam. (IGV also needs a file called
Watson_chr6.bwa.bam.bai, which should be in the same folder.) If it loaded OK, you
should see some gray lines in a new track. These are sequenced DNA fragments of his
genome. If you can’t see anything, zoom in more.
6. Let’s concentrate on how Watson’s genome differs from the reference genome. This
can be due to
a. deletions (black horizontal bars indicating missing DNA letters),
b. insertions (purple vertical bars [with a dot in the middle] indicating extra DNA
letters in his genome), and
c. single-nucleotide variants (SNVs, indicated by colored vertical bars, which
change to letters when you zoom in)
7. Many of these differences are due to technical reasons, such as sequencing errors. This is
typically the case if only a few of the overlapping reads at a position show the difference.
8. Try to find a genome location with a consistent difference between Watson’s genome
and the reference genome. Here are some examples. You can enter them into the box
before the “Go” button.
chr6:41,670,335-41,670,375
chr6:41,674,433-41,674,473
chr6:85,624,634-85,624,674
chr6:41,671,536-41,671,576
Why are some sites only partially different from the reference genome?
9. We’d go crazy if we had to find all of the SNVs manually, so we ran a computer program
to identify them automatically. We’ll load that set of SNVs as a new track. Go to File >
“Load from File…” and load Watson.DP_5.SNVs.bwa.vcf.sorted.bed. Zoom out and
select some of these automatically found SNVs. Did the computer program do a good
job?
10. Are these positions where Watson’s genome is different from the reference genome
expected or unexpected? To help answer this question we can load sites of known
genetic variation (SNPs, defined as SNVs where at least 1% of the population has a
different DNA letter). Go to File > “Load from File…” and load SNP144.sorted.bed. This
track also displays the rs ID for each common variant (SNP), and in some cases, these
can be used to mine information about predicted/expected effect of the SNP. Note that
most of Watson’s variants are known variants.
11. Let’s zoom in on just one variant (rs1799971) that brings up some interesting issues. It’s
in the region of chr6:154,039,642-154,039,682. Searching the rs ID on the web should
bring you to SNPedia, which describes the expected effect of having different alleles
(DNA letters) of the site. Remember that this is only one of 3 million SNVs in Watson’s
genome. How would you react to a finding like this in your genome? Would you want
to publicly share this information? Why or why not? Watson’s genome is one of few
that are publicly available. Most sequenced genomes – and there are lots of them – are
available only to doctors and/or specific researchers. People have very different
opinions on privacy of genomic information! Another interesting SNP (rs2802292) is in
the region chr6:108,587,295-108,587,335.
12. [Just to think about] Is there anything in his genome that Dr. Watson should be worried
about? Precision medicine researchers and companies can use computers to analyze all
of his variants and compare them to variants that are known to cause disease.
Sometimes this is very informative – but we still have a lot to learn!
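
Step 8 asks why some sites are only partially different from the reference genome. One way to think about it: each person carries two copies of chromosome 6, so a position where roughly half of the reads differ is probably heterozygous, while a position where only one or two reads differ is more likely a sequencing error. The Python sketch below shows that reasoning in miniature. It is not the program that was used to make the SNV track; the function name and the cutoffs (5 reads, 15%, 85%) are illustrative assumptions, and real variant callers use more careful statistics.

```python
from collections import Counter


def classify_site(ref_base, read_bases, min_depth=5,
                  error_cutoff=0.15, hom_cutoff=0.85):
    """Classify one genome position from the bases seen in the overlapping reads.

    ref_base   -- the reference genome letter at this position
    read_bases -- one letter per read covering the position, e.g. "AAGAG"
    The three cutoffs are assumed values chosen only for illustration.
    """
    ref_base = ref_base.upper()
    depth = len(read_bases)
    if depth < min_depth:
        return "too few reads"

    # Count how often each DNA letter appears among the reads.
    counts = Counter(base.upper() for base in read_bases)
    non_ref = {base: n for base, n in counts.items() if base != ref_base}
    if not non_ref:
        return "matches reference"

    # Look at the most common letter that differs from the reference.
    alt_base, alt_count = max(non_ref.items(), key=lambda item: item[1])
    fraction = alt_count / depth

    if fraction < error_cutoff:
        return "likely sequencing error"
    if fraction < hom_cutoff:
        return f"heterozygous {ref_base}/{alt_base}"
    return f"homozygous {alt_base}"


if __name__ == "__main__":
    # Made-up read pileups, not taken from Watson's data.
    print(classify_site("A", "AAAAAAAAAG"))  # 1 of 10 reads differs
    print(classify_site("A", "AAAAAGGGGG"))  # 5 of 10 reads differ
    print(classify_site("A", "GGGGGGGGGG"))  # all 10 reads differ
```

When you look at a site in IGV, you are doing the same thing by eye: counting how many of the gray reads show a colored letter at that position.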

Exercise 2: Examining the protein structure of the Bcr-Abl fusion protein

In this exercise we’ll examine how an amino acid change at a single residue may lead to serious
consequences. In this case, the change affects the binding of a drug to the protein. We’ll look at a
protein (Bcr-Abl tyrosine kinase), and the interaction with a drug, an inhibitor (PHA-739358). This
protein is the result of the BCR-ABL fusion gene (two genes that are abnormally merged) that causes chronic
myeloid leukemia (CML). The mutation, T315I, a threonine (T) to isoleucine (I) change at position 315, affects the
binding of a drug (Gleevec) given to treat CML. As a result, someone who has this particular mutation
may not benefit from this drug and would need a different one.
1. Go to NCBI’s Molecular Modeling Database (MMDB) at
http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml and search using our structure’s
PDB ID: 2V7A
2. The search should bring you to the page
http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?uid=59785. Multiple software
tools can be used to view protein structures. We’re going to use Cn3D (abbreviation for “see in
3-D”), which is already installed on the BaRC laptops. Under “View or Save 3D Structure”, click
on the “View Structure” button. Open the file using “Cn3D”.
• You should see two windows: i) protein 3D structure and ii) Sequence/Alignment
Viewer
• Enlarge the two windows to make more use of your display/screen.
3. Get a feel for the structure by rotating (using your mouse) and also zoom in/out (View > Zoom
In or Zoom Out; you can use the shortcut as well)
• Can you identify the drug that is bound to this protein?
• You’ll also see a grey sphere. Can you identify what it represents? [Hint: This is listed
in the bottom of the NCBI MMDB page you visited to download the structure.]
4. The default rendering is “worms” and in purple color. Try changing the rendering (Style >
Rendering Shortcuts), and color (Style > Coloring Shortcuts).
5. The second window has the amino acid sequence (in single-letter abbreviations) that makes up
this protein. Highlight certain residues and see where they are in the protein 3D structure.
Examine how the amino acid sequence, which can be considered one-dimensional, is quite
different from the actual 3D protein structure.
6. The mutation of interest, T315I, is at the position described as “PDB 315”. In the typical Bcr-Abl,
this position is the amino acid threonine (T), but in this mutated variant, the amino acid is
isoleucine (I). Identify this position on the sequence and check if it’s indeed an isoleucine (I).
We’ll highlight this residue. First we want to see the side chains of the amino acids, not just the
backbone (Style > Rendering Shortcuts > Toggle Sidechains). Select residue 315 (it should be
highlighted in yellow). Can you see the residue on the 3D structure? You may have to rotate the
protein.
7. We’ll now highlight this residue differently from the default. Go to annotate (Style > Annotate >
New) and give a name (“mutation”) and description (“T315I mutation”). Click on “Edit Style”, in
“Protein backbone”, choose “Space Fill” and change Color Scheme to “User Selection”. Choose
a color of your choice under “User Color”, and select Apply. Do the same for “Protein
sidechains” and choose a different color but for “Rendering” use “Ball and Stick”. Select Done
for Style Options once complete, and click on OK in Edit Annotation.
[Note: Your selection will still be in yellow since the residue “I” is selected. Simply click
anywhere on the protein sequence to un-highlight it and see the coloring you chose.] Zoom into
the residue and see how close it is to the drug.
8. Precision medicine application: Patients with chronic myelogenous leukemia (CML) are often
treated with a drug that binds inside a “pocket” of the Bcr-Abl fusion protein (right next to
amino acid 315) and reduces the effect of this oncoprotein. But if amino acid 315 is mutated
from threonine to isoleucine, the side chain of the amino acid sticks out more into the
drug-binding pocket. The drug no longer binds very well, so the patient often develops resistance to
the drug, and the doctor needs to try another drug.
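
The name “T315I” packs three pieces of information into one short code: the usual amino acid (T), the position (315), and the new amino acid (I). The Python sketch below unpacks such a code and applies it to a sequence. The function names are hypothetical, the five-letter sequence is made up for illustration (it is not the real Bcr-Abl sequence), and the `first_position` argument is an assumption that covers the case where a sequence fragment does not start its numbering at 1.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def parse_mutation(code):
    """Split a code like 'T315I' into (original, position, replacement)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a simple substitution code: {code!r}")
    original, position, replacement = match.groups()
    if original not in AMINO_ACIDS or replacement not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid letter in {code!r}")
    return original, int(position), replacement


def describe(code):
    """Spell out a substitution code in words."""
    original, position, replacement = parse_mutation(code)
    return (f"{AMINO_ACIDS[original]} to {AMINO_ACIDS[replacement]} "
            f"at position {position}")


def apply_mutation(sequence, code, first_position=1):
    """Return the sequence with the substitution applied.

    first_position is the residue number of the first letter in `sequence`.
    """
    original, position, replacement = parse_mutation(code)
    index = position - first_position
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != original:
        raise ValueError(
            f"expected {original} at position {position}, found {sequence[index]}")
    return sequence[:index] + replacement + sequence[index + 1:]


if __name__ == "__main__":
    print(describe("T315I"))
    # Made-up fragment whose first letter is numbered 313, so 'T' is residue 315.
    print(apply_mutation("MKTAY", "T315I", first_position=313))
```

The check inside `apply_mutation` mirrors what you did in step 6: before trusting a mutation label, confirm which amino acid is actually at that position in the sequence you are looking at.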