Molecular Biology for Computer Scientists
February 29, 2024
For computer scientists entering the field of molecular biology, a major challenge lies in mastering the intricate complexities of existing biological knowledge and its extensive technical vocabulary. Questions regarding the origin, function, and structure of living systems have been explored by cultures throughout history, with the last two generations producing particularly fruitful work. The resulting knowledge of living systems is too detailed and complex for any individual to fully comprehend, with some dedicating entire scientific careers to the study of a single biomolecule. In the following pages, I aim to provide enough background for computer scientists to grasp much of the biology discussed in this book. This chapter offers a brief overview, only scratching the surface of the depth, variety, complexity, and stunning beauty of the universe of living things.
A significant portion of the content that follows does not exclusively focus on molecular biology. To elucidate the actions of molecules, it is often necessary to delve into concepts involving cells, embryological development, or evolution. Biology is notably holistic, where events at one level can impact and be influenced by events at vastly different scales or over time. Absorbing a survey of the basic background material is essential for grasping the significance of the molecular biology discussed elsewhere in this book. Context is paramount in life, as it is in cognition.
As you read, remember this guiding principle: for every generalization I make about biology, there may be thousands of exceptions. The world is filled with diverse living organisms, and only a few generalizations hold true universally. I will strive to cover the fundamental principles, but remain aware of the many exceptions. Understanding biology also involves learning its language. Biologists, like other scientists, use technical terms for precise communication. Mastering this terminology will allow you to access a wealth of biological literature. The notes provide information on terminology and other essential aspects. Let’s begin our journey from the beginning.
What Is Life?
No simple definition can fully capture our intuitive understanding of what distinguishes living things from non-living ones. While the ability to reproduce is a central feature of life, it alone is not enough to define it; computer programs can replicate endlessly without being considered alive. Similarly, crystals can influence their surroundings to form structures like themselves, yet they are not considered alive. Most living organisms take in materials from their environment and utilize energy to transform these materials into components of themselves or their offspring. Viruses, however, do not exhibit this characteristic; they consist mainly of genetic material wrapped in a protective coat, relying on host cells to replicate. The question of whether viruses qualify as a form of life is a subject of debate.
Another approach to defining life is to emphasize its interconnectedness. All living organisms share a common ancestry, no matter how different they may appear. The diversity and complexity of organisms have arisen through the process of evolution, which comprises three key components: inheritance (the transmission of characteristics from parents to offspring), variation (the mechanisms leading to offspring differing from their parents), and selection (the process by which certain organisms and their characteristics are favored for reproduction over others). These components collectively define the evolutionary process, which is arguably the best way to define life as it exists on Earth. Evolution not only helps define life but also provides insights into how living systems operate.
Evolution is a cumulative process where inheritance plays a crucial role in determining the structure and function of organisms. While there is some variation from one generation to the next, many aspects of organisms, such as the molecules carrying energy or genetic information, have remained relatively unchanged since the common ancestor billions of years ago. However, perfect inheritance alone would lead to populations of identical organisms, which is not observed.
For evolution to occur, there must be a source of variation in inheritance. In biology, several sources contribute to this variation. Mutation, which involves random changes in inherited material, is one such source. However, sexual recombination and various genetic rearrangements also play roles in generating variation. Even viruses can contribute to genetic variation by leaving permanent traces in the genes of their hosts. These sources of variation modify the message passed from parent to offspring, exploring a vast space of possible characteristics. It is widely accepted in evolutionary biology that the majority of variations are either neutral or deleterious. Similar to how small changes in a complex computer system can have far-reaching and damaging consequences (despite programmers making these changes intentionally to improve the code), biological variations can have significant effects over time. However, given enough time, the exploration of this space has led to the emergence of many viable organisms.
Living organisms have successfully adapted to a wide range of challenges and continue to thrive. Selection is the process that determines which variants will persist, thereby shaping which parts of the space of possible variations will be explored. Natural selection is based on the reproductive fitness of each individual, which measures how many surviving offspring an organism can produce. Organisms that are well adapted to their environment are more likely to produce successful offspring. Due to competition for limited resources, only organisms with high fitness levels will survive, while those less well adapted will die out.
I have likened evolution to a search through a vast space of possible organism characteristics, which can be precisely defined. All of an organism’s inherited characteristics are contained in a single messenger molecule: deoxyribonucleic acid, or DNA. These characteristics are represented in a simple, linear, four-element code. The translation of this code into all the inherited characteristics of an organism, such as its body plan or the wiring of its nervous system, is complex. The specific genetic encoding for an organism is called its genotype, while the resulting physical characteristics are called its phenotype. In the search space metaphor, every point in the space is a genotype. Evolutionary variations, such as mutation, sexual recombination, and genetic rearrangements, represent the legal moves in this space. Selection acts as an evaluation function that determines how many other points a point can generate and how long each point persists.
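For readers who think more easily in code, the search metaphor can be written down as a toy program. This is only a sketch under loud assumptions: the target string, the population size, the mutation rate, and the rule that the fitter half of the population survives are all hypothetical choices made for illustration. Real selection has no fixed target, and nothing here models the translation from genotype to phenotype.

```python
import random

ALPHABET = "ACGT"  # the four-element code


def mutate(genotype, rng, rate=0.01):
    """Point mutation: with probability `rate`, replace an element by a
    randomly chosen letter (which may happen to be the same letter)."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base
        for base in genotype
    )


def recombine(parent_a, parent_b, rng):
    """One-point crossover: a prefix of one parent joined to the suffix of
    the other. Assumes both parents have the same length, at least 2."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def fitness(genotype, target):
    """Toy evaluation function: positions that match an arbitrary target."""
    return sum(a == b for a, b in zip(genotype, target))


def evolve(target, population_size=50, generations=200, seed=0):
    """Search genotype space with mutation and recombination as the moves
    and truncation selection as the evaluation step."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(len(target)))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: the fitter half of the population persists as parents.
        population.sort(key=lambda g: fitness(g, target), reverse=True)
        parents = population[: population_size // 2]
        # Variation: children are recombined, then mutated, copies of parents.
        children = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda g: fitness(g, target))


if __name__ == "__main__":
    goal = "ACGTACGTACGT"
    best = evolve(goal)
    print(best, fitness(best, goal))
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. That is a property of this particular sketch, not of natural selection.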
The distinction between genotype and phenotype is significant because small steps in genotype space can lead to significant changes in phenotype space. Additionally, while the search occurs in genotype space, selection operates on phenotypes. It is challenging to quantify the size of phenotype space, but for organisms with a large amount of genetic material, such as the Lily flower, there are roughly 10^70,000,000,000 possible genotypes of that size or less, illustrating the vastness of this space. Reproductive events (moves) occur asynchronously, both with each other and with the selection process, adding non-deterministic elements to the process. This search process, running for billions of iterations and examining trillions of points in parallel at each iteration, contributes to the wondrous abilities and tremendous diversity of living things.
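The size of genotype space follows from simple counting: a genome of n elements over a four-letter code can take 4^n forms. The sketch below works with logarithms, since the numbers themselves are far too large to write out. The function names are my own, and the genome length passed in is an assumed round figure rather than a measured one.

```python
import math


def log10_genotype_count(genome_length, alphabet_size=4):
    """log10 of the number of sequences of exactly `genome_length` elements."""
    return genome_length * math.log10(alphabet_size)


def log10_genotypes_up_to(genome_length, alphabet_size=4):
    """log10 of the number of sequences of length 0 through `genome_length`.

    The geometric sum (k^(n+1) - 1) / (k - 1) is approximated by
    k^(n+1) / (k - 1); the error is negligible once n is more than a few.
    """
    return ((genome_length + 1) * math.log10(alphabet_size)
            - math.log10(alphabet_size - 1))


if __name__ == "__main__":
    n = 10**11  # assumed genome length, in elements
    print(f"exactly n elements: 10^{log10_genotype_count(n):.3e}")
    print(f"n elements or less: 10^{log10_genotypes_up_to(n):.3e}")
```

With a genome of exactly 10^11 elements the exponent comes out near 6 × 10^10. An exponent of 7 × 10^10 corresponds to a genome of roughly 1.16 × 10^11 elements, so the figure quoted above is of the right order. Counting shorter genomes as well changes the exponent only negligibly.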
The Unity and the Diversity of Living Things
Life exhibits extraordinary diversity. The differences between a tiny archaebacterium living in a superheated sulfur vent at the ocean’s bottom and a two-ton polar bear roaming the Arctic span orders of magnitude in many dimensions. Many organisms consist of a single cell, while a Sperm Whale has more than 10^15 cells. Although environments that are very acidic, alkaline, or salty are generally deadly, living things can be found in all of them. Whether hot or cold, wet or dry, oxygen-rich or anaerobic, nearly every niche on the planet has been invaded by life. The diversity of approaches to gathering nutrients, detecting danger, moving around, finding mates (or other forms of reproduction), raising offspring, and other activities of living creatures is truly awe-inspiring.
Although our understanding of life at the molecular level is less detailed, it appears that this diversity is reflected there as well. For example, proteins with very similar shapes and identical functions can have radically different chemical compositions. Similarly, organisms that look quite similar to each other may have very different genetic blueprints. All of the genetic material in an organism is called its genome. Genetic material is discrete and hence has a particular size, although the size of the genome is not directly related to the complexity of the organism. The size of genomes varies from about 5,000 elements in a very simple organism (e.g., the viruses SV40 or φX) to more than 10^11 elements in some higher plants; people have about 3×10^9 elements in their genome.
*Evolution has also become an inspiration to a group of researchers interested in designing computer algorithms, e.g. Langton (1989).
Despite this incredible diversity, nearly all of the same basic mechanisms are present in all organisms. All living things are made of cells*: membrane-enclosed sacks of chemicals carrying out finely tuned sequences of reactions. The thousand or so substances that make up the basic reactions going on inside the cell (the core metabolic pathways) are remarkably similar across all living things. Every species has some variations, but the same basic materials are found from bacteria to humans. The genetic material that codes for all of these substances is written in more or less the same molecular language in every organism. The developmental pathways for nearly all multicellular organisms unfold in very similar ways. It is this underlying unity that offers the hope of developing predictive models of biological activity. It is the process of evolution that is responsible both for the diversity of living things and for their underlying similarities. The unity arises through inheritance from common ancestors; the diversity from the power of variation and selection to search a vast space of possible living forms.
Prokaryotes & Eukaryotes, Yeasts & People
Non-biologists often fail to appreciate the tremendous number of different kinds of organisms in the world. Although no one really knows, estimates of the number of currently extant species range from 5 million to 50 million (May, 1988).† There are at least 300,000 different kinds of beetles alone, and probably 50,000 species of tropical trees. Familiar kinds of plants and animals make up a relatively small proportion of the kinds of living things, perhaps only 20%. Vertebrates (animals with backbones: fish, reptiles, amphibians, birds, mammals) make up only about 3% of the species in the world.
Since Aristotle, scholars have tried to group these myriad species into meaningful classes. This pursuit remains active, and the classifications are, to some degree, still controversial. Traditionally, these classifications have been based on the morphology of organisms. Literally, morphology means shape, but it is generally taken to include internal structure as well. Morphology is only part of phenotype, however; other parts include physiology, or the functioning of living structures, and development. Structure, development and function all influence each other, so the dividing lines are not entirely clear.
In recent years, these traditional taxonomies have been shaken by information gained from analyzing genes directly, as well as by the discovery of an entirely new class of organisms that live in hot, sulphurous environments in the deep sea.
Here, I will follow the classification proposed by Woese, Kandler & Wheelis (1990), although some aspects of their taxonomy remain controversial. They developed their classification of organisms by using distances based on sequence divergence in a ubiquitous piece of genetic sequence. As depicted in Figure 1, there are three primary divisions: the Archaea, the Bacteria, and the Eucarya.
Eucarya (also known as eukaryotes) are the organisms with which we are most familiar. They have cells that contain nuclei, specialized areas in the cell that hold genetic material. Eukaryotic cells also have other specialized cellular compartments called organelles. Examples of organelles include mitochondria, where respiration takes place—a process by which cells use oxygen to improve their efficiency at turning food into energy—and chloroplasts, which are organelles found in plants that capture energy from sunlight. All multicellular organisms (e.g., humans, mosquitoes, and maple trees) are Eucarya, as are many single-celled organisms, such as yeasts and paramecia.
Even within Eucarya, there is more diversity than many non-biologists expect. Within the domain of eukaryotes, there are generally considered to be at least four kingdoms: animals, green plants, fungi, and protists. From a genetic perspective, protists, typically defined as single-celled organisms other than fungi, appear to encompass several kingdoms, including ciliates (cells with many external hairs, or cilia), flagellates (cells with a single, long external fiber), and microsporidia. The taxonomic tree continues down about a dozen levels, ending with particular species at the leaves. All of these eukaryotic life forms share many similarities with human beings, which is why studying them can yield valuable insights into ourselves.
Bacteria (sometimes also referred to as eubacteria or prokaryotes) are ubiquitous single-celled organisms. They are found everywhere—on this page, in the air you breathe, and in your gut, for example. The membranes that enclose these cells are typically made of different materials than those surrounding eukaryotes, and they lack nuclei or other organelles (though they do possess ribosomes, which are sometimes considered organelles; see below). Bacteria primarily reproduce by dividing, and when food is abundant, the survival of the fittest in bacteria often means the survival of those that can divide the fastest (Alberts et al., 1989). Bacteria include not only disease-causing “germs” but also many types of algae and a wide variety of symbiotic organisms, including soil bacteria that fix nitrogen for plants and Escherichia coli, a bacterium that resides in human intestines and is necessary for normal digestion. E. coli is commonly found in laboratories because it is easy to culture and extensively studied.
Evolutionary Time and Relatedness
There are so many different kinds of life, and they live in so many different ways. It is amazing that their underlying functioning is so similar. The reason that there is unity within all of that diversity is that all organisms appear to have evolved from a common ancestor. This fundamental claim underpins nearly all biological theorizing, and there is substantial evidence for it.
All evolutionary theories hold that the diversity of life arose by inherited variation through an unbroken line of descent. This common tree of descent is the basis for the taxonomy described above, and pervades the character of all biological explanation. There is a great deal of argument over the detailed functioning of evolution (e.g. whether it happens continuously or in bursts), but practically every biologist agrees with that basic idea.
There are several methods to estimate the divergence time between two organisms, indicating the last time they shared a common ancestor. The more closely related two species are, the more recently they diverged. Phenotypic similarity can indicate genotypic similarity, allowing organisms to be classified based on their structure, which is the traditional method. Advancements in DNA sequencing have enabled direct estimates of genetic divergence by comparing genetic sequences. If the rate of genetic change can be quantified and standardized, these differences can be translated into a “molecular clock.” Li & Graur (1991) provide a good introduction to this method. The underlying and somewhat controversial assumption is that in some parts of the genome, the rate of mutation is relatively constant. Various methods are used to identify these areas, estimate the rate of change, and calibrate the clock. This technique has largely confirmed estimates made with other methods and is considered potentially reliable, although not yet fully established. Most of the dates discussed below were derived from traditional (archaeological) dating methods.
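The arithmetic behind a molecular clock can be sketched in a few lines. This is a deliberately naive version, assuming two sequences that are already aligned and of equal length, a single constant rate, and no correction for repeated changes at the same position. The function names and the rate used in the example are hypothetical, not values taken from Li & Graur.

```python
def proportion_different(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def divergence_time(seq_a, seq_b, rate_per_site_per_year):
    """Years since the last common ancestor under a constant-rate clock.

    Both lineages accumulate changes independently after they split, so
    the observed difference d is spread over twice the divergence time:
    t = d / (2 * r).
    """
    d = proportion_different(seq_a, seq_b)
    return d / (2.0 * rate_per_site_per_year)


if __name__ == "__main__":
    # Hypothetical data: 2 differences in 10 sites, with an assumed rate
    # of 1e-9 changes per site per year.
    years = divergence_time("ACGTACGTAC", "ACGTTCGTAA", 1e-9)
    print(f"estimated divergence: {years:.3g} years ago")
```

With these made-up inputs the estimate is about 100 million years. The factor of two is the step most often forgotten, and calibrating the rate is where the real difficulty and controversy lie.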
Understanding the basic timeline of life on Earth can provide a rough idea of the degrees of relatedness among creatures. The oldest known fossils, stromatolites found in Australia, suggest that life began at least 3.8 billion years ago. Geological evidence indicates that a major meteor impact around 4 billion years ago vaporized all the oceans, effectively destroying any existing life. Life on Earth began almost as soon as it could have. Early life forms likely resembled modern bacteria in important ways, as simple, single-celled organisms without nuclei or other organelles. Life remained like this for nearly 2 billion years.
About halfway through the history of life, a significant change occurred: Eucarya emerged. There is evidence that Eucarya originated as symbiotic collections of simpler cells that were eventually assimilated and became organelles (see, e.g., Margulis, 1981). The advantages of these specialized cellular organelles made early Eucarya very successful. Single-celled Eucarya became very complex, developing mechanisms for movement, prey detection, paralysis, and engulfment.
The next major change in the history of life was the invention of sex. Evolution, as you recall, is a mechanism based on the inheritance of variation. Where do these variations come from? Before the advent of sex, variations arose solely through individual, random changes in genetic material. A mutation might arise, changing one element in the genome, or a longer piece of a genome might be duplicated or moved. If the changed organism had an advantage, the change would propagate itself through the population. Most mutations are neutral or deleterious, and evolutionary change by mutation is a very slow, random search of a vast space. The ability of two successful organisms to combine bits of their genomes into an offspring produced variants with a much higher probability of success. Those moves in the search space are more likely to produce an advantageous variation than random ones. Although you wouldn’t necessarily recognize it as sex when looking under a microscope, even some Bacteria exchange genetic material. How and when sexual recombination first evolved is not clear, but it is quite ancient. Some have argued that sexual reproduction was a necessary precursor to the development of multicellular organisms with specialized cells (Buss, 1987). The advent of sex dramatically changed the course of evolution. The new mechanism for the generation of variation focused nature’s search through the space of possible genomes, leading to an increase in the proportion of advantageous variations, and an increase in the rate of evolutionary change.
This is probably a good place to correct a common misperception, namely that some organisms are more “primitive” than others. Every existing organism has, tautologically, made it into the modern era. Simple modern organisms are not primitive. The environment of the modern world is completely unlike that of earth when life began, and even the simplest existing creatures have evolved to survive in the present. It is possible to use groups of very distantly related creatures (e.g. people and bacteria) to make inferences about ancient organisms; whatever people and bacteria have in common are characteristics that were most likely shared by their last common ancestor, many eons ago. Aspects of bacteria which are not shared with people may have evolved as recently as any human characteristic not shared with bacteria. This applies to the relation between people and apes, too: apes are not any more like ancestral primates than we are. It is what we have in common with other organisms that tells us what our ancestors were like; the differences between us and other organisms are much less informative.
The emergence of multicellular organisms, whether or not it was a direct result of sexual reproduction, marked a significant milestone in the history of life, leading to a remarkable diversification of organisms and an increase in complexity. This event occurred approximately one billion years ago, around three-quarters of the way through the history of life.
Most organisms visible to the naked eye are multicellular, although blue-green algae in ponds and swimming pools are actually a type of bacteria. Multicellular organisms derive their primary evolutionary advantage from cellular specialization. By having specialized cells, these organisms can occupy environmental niches that single-celled organisms cannot exploit. In multicellular organisms, cells that are far apart can exchange matter, energy, or information for their mutual benefit. For example, in higher plants, cells in the roots and leaves exist in vastly different environments and exchange resources not available locally.
One crucial distinction between multicellular organisms and colonies of unicellular organisms (e.g., coral) is the separation of germ line (reproductive) cells from somatic (all other) cells. Germ cells, such as sperm and eggs, are responsible for producing new organisms, while somatic cells are specialized for various tasks, such as skin, nerve, or blood cells. Somatic cells divide and produce more of the same type of cell, a process that ends with mitosis, resulting in the creation of two identical daughter cells. This entire process is known as the cell cycle.
Only changes in germ cells are inherited by an organism’s offspring. A variation that arises in a somatic cell will affect all of the cell’s descendants, but it will not affect any of the organism’s descendants. Germ cells divide in a process called meiosis; part of this process is the production of sperm and egg cells, each of which has only half the usual genetic material. The advent of this distinction involved a complex and intricate balance between somatic cells becoming evolutionary dead ends and the improved competitive ability of a symbiotic collection of closely related cells.
Multicellular organisms all originate from a single cell, the fertilized egg. Through a process called cellular differentiation, this single cell gives rise to all the specialized cells in the body. The journey from a fertilized egg to a fully developed adult is incredibly intricate, involving not only cellular differentiation but also cell migration and arrangement, dynamic changes in gene expression, and even programmed cell death to remove certain cell groups that serve as temporary scaffolding during development. The transition from single-celled to multicellular lifeforms required numerous dramatic innovations, fundamentally shifting the focus of natural selection from individual cells to the collective organism. The reproductive success of a cell line within a multicellular organism may not necessarily align with the success of the entire organism. Embryology and development are complex topics that are briefly touched upon in this discussion.
While much of the discussion has focused on organisms seemingly simple and distantly related to humans, on a biochemical level, humans share many similarities with other eukaryotes, especially multicellular ones. Genetic and biochemical distances do not always correlate with morphological differences. For example, two species of frogs that appear similar may be more genetically distant from each other than humans are from cows. Much of human biochemistry was established by the time multicellular organisms emerged on Earth, highlighting the importance of understanding simpler organisms like yeasts in elucidating human biology.
This brief overview has covered the diversity of life forms and key evolutionary events leading up to the origin of multicellular organisms. In the next section, we will delve deeper into the workings of these complex organisms and explore the intricacies of eukaryotic cells in more detail.
Living Parts: Tissues, Cells, Compartments and Organelles
The key advantage of multicellular organisms over their single-celled counterparts lies in the specialization of cells. In larger organisms, not every cell needs to perform all functions such as extracting nutrients, protecting itself, sensing the environment, moving, or reproducing. Instead, these complex tasks can be divided among different classes of cells, allowing them to work together and achieve feats that single cells cannot. Cells specialized for specific functions form tissues, and this process is known as cellular differentiation. However, differentiated cells, except for reproductive cells, cannot reproduce an entire organism.
In humans and most other multicellular animals, there are fourteen major tissue types. Each tissue type serves a specific function. Some familiar tissue types include bones, muscles, cardiovascular tissue, nerves, and connective tissue (e.g., tendons and ligaments). Other tissues make up the digestive, respiratory, urinary, and reproductive systems. Skin and blood are unique tissue types, composed of highly specialized cells. The lymphatic tissue, found in organs like the spleen and lymph nodes, makes up the immune system. Endocrine tissue consists of hormone-producing glands (e.g., the adrenal gland, which produces adrenaline) that control various aspects of the body. Epithelium, the most basic tissue type, lines the body’s cavities, secreting substances like mucus and absorbing water and nutrients, particularly in the intestines.
*Cancer is an example where a single cell line within a multicellular organism reproduces to the detriment of the whole.
In a typical vertebrate, there are more than 200 different specialized cell types. These cells vary in size and function. For instance, a single nerve cell can connect your foot to your spinal cord, and a drop of blood contains more than 10,000 cells. Some cells divide rapidly, such as bone marrow cells dividing every few hours, while adult nerve cells can live for 100 years without dividing. Once a cell has differentiated, it cannot change into another type of cell. Despite the diversity of cell types, all cells in a multicellular organism have the same genetic code. The differences between them arise from variations in gene expression—whether or not a gene’s product is produced and in what quantity.
Control of gene expression is a complex process involving many biological substances that bind to DNA or other biomolecules that bind to DNA. Genes code for products that can turn other genes on or off, which in turn regulate additional genes, creating a highly interconnected network. Development, the process by which cells differentiate into specific cell types at specific times and locations, is a key area of research in biology. Understanding how cells differentiate and the regulatory processes involved is a fundamental aspect of understanding complex biological systems.
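One common way to get a feel for such networks is to treat each gene as simply on or off and to update all genes in lockstep. That is a strong simplification and an assumption of this sketch, since real expression is graded and asynchronous. The three genes and their wiring below are hypothetical and do not describe any real circuit.

```python
def step(state, rules):
    """Synchronously compute every gene's next value from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


# A hypothetical three-gene loop: A turns on B, B turns on C, C turns off A.
RULES = {
    "A": lambda s: not s["C"],  # A is expressed unless C represses it
    "B": lambda s: s["A"],      # B is expressed when A is on
    "C": lambda s: s["B"],      # C is expressed when B is on
}


def trajectory(initial, rules, steps):
    """Return the list of states visited, starting with `initial`."""
    states = [initial]
    for _ in range(steps):
        states.append(step(states[-1], rules))
    return states


if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    for state in trajectory(start, RULES, 6):
        print("".join(gene if on else "-" for gene, on in state.items()))
```

Even this tiny loop never settles: starting with only A on, it cycles through six states and returns to where it began. Feedback of this kind is one reason regulatory networks are hard to reason about informally.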
The Composition of Cells
Despite their differences, most cells have a great deal in common with each other. Every cell, whether an archaeon at the bottom of the ocean or a cell in a hair follicle on the top of your head, has certain basic qualities: it contains cytoplasm and genetic material, is enclosed in a membrane, and has the basic mechanisms for translating genetic messages into the main type of biological molecule, the protein. All eukaryotic cells share additional components. Each of these basic parts of a cell is described briefly below:
Membranes are the boundaries between the cell and the outside world. Although there is no one moment that one can say life came into being, the origin of the first cell membrane is a reasonable starting point. At that moment, self-reproducing systems of molecules were individuated, and cells came into being. All present day cells have a phospholipid cell membrane. Phospholipids are lipids (oils or fats) with a phosphate group attached. The end with the phosphate group is hydrophilic (attracted to water) and the lipid end is hydrophobic (repelled by water). Cell membranes consist of two layers of these molecules, with the hydrophobic ends facing in, and the hydrophilic ends facing out. This keeps water and other materials from getting through the membrane, except through special pores or channels.
A lot of the action in cells happens at the membrane. For single-celled organisms, the membrane contains molecules that sense the environment, and in some cells it can surround and engulf food, or attach and detach parts of itself in order to move. In Bacteria and Archaea, the membrane plays a crucial role in energy production by maintaining a large acidity difference between the inside and the outside of the cell. In multicellular organisms, the membranes contain all sorts of signal transduction mechanisms, adhesion molecules, and other machinery for working together with other cells.
Proteins are the molecules that accomplish most of the functions of the living cell. The number of different structures and functions that proteins take on in a single organism is staggering. They make possible all of the chemical reactions in the cell by acting as enzymes that promote specific chemical reactions, which would otherwise occur so slowly as to be negligible. The action of promoting chemical reactions is called catalysis, and enzymes are sometimes referred to as catalysts, which is a more general term. Proteins also provide structural support, and are the keys to how the immune system distinguishes self from invaders. They provide the mechanism for acquiring and transforming energy, as well as translating it into physical work in the muscles. They underlie sensors and the transmission of information as well.
All proteins are constructed from linear sequences of smaller molecules called amino acids. There are twenty naturally occurring amino acids. Long proteins may contain as many as 4500 amino acids, so the space of possible proteins is very large: 20^4500, or roughly 10^5850. Proteins also fold up to form particular three dimensional shapes, which give them their specific chemical functionality. Although it is easily demonstrable that the linear amino acid sequence completely specifies the three dimensional structure of most proteins, the details of that mapping are one of the most important open questions of biology. In addition, a protein’s three dimensional structure is not fixed; many proteins move and flex in constrained ways, and that can have a significant role in their biochemical function. Also, some proteins bind to other groups of atoms that are required for them to function. These other structures are called prosthetic groups. An example of a prosthetic group is heme, which binds oxygen in the protein hemoglobin. I will discuss proteins in more detail again below.
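The size of protein sequence space quoted above (twenty amino acids raised to the power 4500) is easy to check, because Python integers have arbitrary precision. The snippet is only an arithmetic check; the constant names are my own.

```python
import math

AMINO_ACIDS = 20    # naturally occurring amino acids, as stated in the text
MAX_LENGTH = 4500   # length of a long protein, as stated in the text

# Exact count of sequences of length 4500 over a 20-letter alphabet.
exact_count = AMINO_ACIDS ** MAX_LENGTH

# math.log10 accepts integers far too large to convert to a float.
exponent = math.log10(exact_count)

if __name__ == "__main__":
    print(f"20^4500 is about 10^{exponent:.0f}")
```

The exponent comes out a little under 5855, which matches the rounded power of ten given in the text. Counting shorter proteins as well would not change the exponent noticeably.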
Genetic material codes for all the other constituents of the cell. This information is generally stored in long strands of DNA. In Bacteria, the DNA is generally circular. In eukaryotes, it is linear. During cell division, eukaryotic DNA is grouped into X shaped structures called chromosomes. Some viruses (like the AIDS virus) store their genetic material in RNA. This genetic material contains the blueprint for all the proteins the cell can produce. I’ll have much more to say about DNA below.
Nuclei are the defining feature of Eucaryotic cells. The nucleus contains the genetic material of the cell in the form of chromatin. Chromatin contains long stretches of DNA in a variety of conformations,* surrounded by nuclear proteins. The nucleus is separated from the rest of the cell by a nuclear membrane. Nuclei show up quite clearly under the light microscope; they are perhaps the most visible feature of most cells.
Cytoplasm is the name for the gel-like collection of substances inside the cell. All cells have cytoplasm. The cytoplasm contains a wide variety of different substances and structures. In Bacteria and Archaea, the cytoplasm contains all of the materials in the cell. In Eucarya, the genetic material is segregated into the cell nucleus.
Ribosomes are large molecular complexes, composed of several proteins and RNA molecules. The function of ribosomes is to assemble proteins. All cells, including Bacteria and Archaea, have ribosomes. The process of translating genetic information into proteins is described in detail below. Ribosomes are where that process occurs, and are a key part of the mechanism for accomplishing that most basic of tasks.
Mitochondria and Chloroplasts are cellular organelles involved in the production of the energy that powers the cell. Mitochondria are found in all eucaryotic cells, and their job is respiration: using oxygen to efficiently turn food into energy the cell can use. Some bacteria and archaea get their energy by a process called glycolysis, from glyco- (sugar) and -lysis (cleavage or destruction). This process creates two energy-carrying molecules for every molecule of sugar consumed. As oxygen became more abundant†, some organisms found a method for using it (called oxidative phosphorylation) to make an order of magnitude increase in their ability to extract energy from food, getting 36 energy-carrying molecules for every sugar.
These originally free living organisms were engulfed by early eucaryotes. This symbiosis gradually became obligatory as eucaryotes came to depend on their mitochondria for energy, and the mitochondria came to depend on the surrounding cell for many vital functions and materials. Mitochondria still have their own genetic material however, and, in sexually reproducing organisms, are inherited only via the cytoplasm of the egg cell. As a consequence, all mitochondria are maternally inherited.
Like the mitochondria, chloroplasts appear to have originated as free-living bacteria that eventually became obligatory symbionts, and then parts of eucaryotic plant cells. Their task is to convert sunlight into energy-carrying molecules.
*Conformation means shape, connoting one of several possible shapes. DNA conformations include the traditional double helix, a supercoiled state where certain parts of the molecule are deeply hidden, a reverse coiled state called Z-DNA, and several others.
†There was very little oxygen in the early atmosphere. Oxygen is a waste product of photosynthesis, and it eventually became a significant component of the atmosphere. Although many modern organisms depend on oxygen to live, it is a very corrosive substance, and living systems had to evolve quite complex biochemical processes for dealing with it.
Other Parts of Cells. There are other organelles found in eucaryotic cells. The endoplasmic reticulum (there are two kinds, rough and smooth) is involved in the production of the cell membrane itself, as well as in the production of materials that will eventually be exported from the cell. The Golgi apparatus consists of elongated sacs that are involved in the packaging of materials that will be exported from the cell, as well as segregating materials in the cell into the correct intracellular compartment. Lysosomes contain substances that are used to digest proteins; they are kept separate to prevent damage to other cellular components. Some cells have other structures, such as vacuoles of lipids for storage (like the ones often found around the abdomen of middle-aged men).
Now that you have a sense of the different components of the cell, we can proceed to examine the activities of these components. Life is a dynamical system, far from equilibrium. Biology is not only the study of living things, but living actions.
Life as a Biochemical Process
Beginning with the highest levels of taxonomy, we have taken a quick tour of the varieties of organisms, and have briefly seen some of their important parts. So far, this account has been entirely descriptive. Because of the tremendous diversity of living systems, descriptive accounts are a crucial underpinning to any more explanatory theories. In order to understand how biological systems work, one has to know what they are.
Knowledge of cells and tissues makes possible the functional accounts of physiology. For example, knowing that the cells in the biceps and in the heart are both kinds of muscle helps explain how the blood circulates. However, at this level of description, the work that individual cells are able to do remains mysterious. The revolution in biology over the last three decades resulted from understanding cells in terms of their chemistry. These insights began with descriptions of the molecules involved in living processes, and now increasingly provide an understanding of the molecular structures and functions that are the fundamental objects and actions of living material.
More and more of the functions of life (e.g. cell division, immune reaction, neural transmission) are coming to be understood as the interactions of complicated, self-regulating networks of chemical reactions. The substances that carry out and regulate these activities are generally referred to as biomolecules. Biomolecules include proteins, carbohydrates and lipids (all called macromolecules because they are relatively large) and a variety of small molecules. The genetic material of the cell specifies how to create proteins, as well as when and how much to create. These proteins, in turn, control the flow of energy and materials through the cell, including the creation and transformation of carbohydrates, lipids and other molecules, ultimately accomplishing all of the functions that the cell carries out. The genetic material itself is also now known to be a particular macromolecule: DNA.
In even the simplest cell, there are more than a thousand kinds of biomolecules interacting with each other; in human beings there are likely to be more than 100,000 different kinds of proteins specified in the genome (it is unlikely that all of them are present in any particular cell). Both the amount of each molecule and its concentration in various compartments of the cell determine what influence it will have. These concentrations vary over time, on scales of seconds to decades. Interactions among biomolecules are highly non-linear, as are the interactions between biomolecules and other molecules from outside the cell. All of these interactions take place in parallel among large numbers of instances of each particular type. Despite this daunting complexity, insights into the structure and function of these molecules, and into their interactions, are emerging very rapidly.
One of the reasons for that progress is the conception of life as a kind of information processing. The processes that transform matter and energy in living systems do so under the direction of a set of symbolically encoded instructions. The “machine” language that describes the objects and processes of living systems contains four letters, and the text that describes a person has about as many characters as three years’ worth of the New York Times (about 3×10^9). In the next section, we will delve more deeply into the chemistry of living systems.
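For a computer scientist, the natural next question is how much storage that text would need. The sketch below is a back-of-envelope estimate, assuming a naive fixed-width encoding of the four-letter alphabet and the figure of about 3×10^9 characters quoted above; it ignores compression and any information outside the raw sequence.

```python
import math

GENOME_LENGTH = 3e9   # characters, the figure quoted in the text
ALPHABET_SIZE = 4     # letters in the "machine" language

# A four-letter alphabet needs log2(4) = 2 bits per character.
bits_per_base = math.log2(ALPHABET_SIZE)
total_bits = GENOME_LENGTH * bits_per_base
total_megabytes = total_bits / 8 / 1e6

print(f"{bits_per_base:.0f} bits per base, about {total_megabytes:.0f} MB in total")
```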
The Molecular Building Blocks of Life
Living systems process matter, energy and information. The basic principle of life, reproduction, is the transformation of materials found in the environment of an organism into another organism. Raw materials from the local environment are broken down, and then reassembled following the instructions in the genome. The offspring will contain instructions similar to the parent. The matter, energy and information processing abilities of living systems are very general; one of the hallmarks of life is its adaptability to changing circumstances. Some aspects of living systems have, however, stayed the same over the years. Despite nearly 4 billion years of evolution, the basic molecular objects for carrying matter, energy and information have changed very little. The basic units of matter are proteins, which subserve all of the structural and many of the functional roles in the cell; the basic unit of energy is a phosphate bond in the molecule adenosine triphosphate (ATP); and the units of information are four nucleotides, which are assembled together into DNA and RNA.
The chemical composition of living things is fairly constant across the entire range of life forms. About 70% of any cell is water. About 4% are small molecules like sugars and inorganic ions*. One of these small molecules is ATP, the energy carrier. Proteins make up between 15% and 20% of the cell; DNA and RNA range from 2% to 7% of the weight. The cell membranes, lipids and other, similar molecules make up the remaining 4% to 7% (Alberts, et al., 1989).
Energy
Living things obey all the laws of chemistry and physics, including the second law of thermodynamics, which states that the amount of entropy (disorder) in the universe is always increasing. The consumption of energy is the only way to create order in the face of entropy. Life doesn’t violate the second law; living things capture energy in a variety of forms, use it to create internal order, and then transfer energy back to the environment as heat. An increase in organization within a cell is coupled with a greater increase in disorder outside the cell.
Living things must capture energy, either from sunlight through photosynthesis or from nutrients by respiration. The variety of chemicals that can be oxidized by various species to obtain energy through respiration is immense, ranging from simple sugars to complex oils and even sulfur compounds from deep sea vents (in the case of Archaea).
In many cases, the energy is first available to the cell as an electrochemical gradient across the cell membrane. The cell can tap into this electrochemical gradient by coupling the energy that results from moving ions across the membrane to other processes. There are many constraints on the flow of energy through a living system. Most of the chemical reactions that organisms need to survive require an input of a minimum amount of energy to take place at a reasonable rate; efficient use of energy dictates that this must be delivered in quanta exceeding the minimum requirement only slightly.
The energy provided for biochemical reactions has to be useable by many different processes. It must be possible to provide energy where it is needed, and to store it until it is consumed. The uses of energy throughout living systems are very diverse. It is needed to synthesize and transport biomolecules, to create mechanical action through the muscle proteins actin and myosin, and to create and maintain electrical gradients, such as the ones that neurons use to communicate and compute.
*An inorganic ion is a charged atom, or a charged small group of atoms, not involving carbon. These substances, like iron and zinc, play small but vital roles. For example, changing the balance of calcium and sodium ions across a cell membrane is the basic method for exciting neurons.
The individual building blocks of the larger molecules, i.e. amino acids and nucleotides, are also considered small molecules when not part of a larger structure. Some of these molecules play roles in the cell other than as components of large molecules. For example, the base adenine is at the core of the energy carrying molecule adenosine triphosphate (ATP).
Storing and transporting energy in complex biochemical systems runs the risk of disrupting chemical bonds other than the target ones, so the unit of energy has to be small enough not to do harm, but large enough to be useful. The most common carrier of energy for storage and transport is the outermost phosphate bond in the molecule adenosine triphosphate, or ATP. This molecule plays a central role in every living system: it is the carrier of energy. Energy is taken out of ATP by the process of hydrolysis, which removes the outermost phosphate group, producing the molecule adenosine diphosphate (ADP). This process generates about 12 kcal per mole* of ATP, a quantity appropriate for performing many cellular tasks. The energy “charge” of a cell is expressed in the ratio of ATP/ADP and the electrochemical difference between the inside and the outside of the cell (which is called the transmembrane potential). If ATP is depleted, the movement of ions caused by the transmembrane potential will result in the synthesis of additional ATP. If the transmembrane potential has been reduced (for example, after a neuron fires), ATP will be consumed to pump ions back across the gradient and restore the potential.
ATP is involved in most cellular processes, so it is sometimes called a currency metabolite. ATP can also be converted to other high energy phosphate compounds such as creatine phosphate, or other nucleotide triphosphates. In turn, these molecules provide the higher levels of energy necessary to transcribe genes and replicate chromosomes. Energy can also be stored in different chemical forms. Carbohydrates like glycogen provide a moderate density, moderately accessible form of energy storage. Fats have very high energy storage density, but the energy stored in them takes longer to retrieve.
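To get a feel for how small the unit of energy really is, the figure of about 12 kcal per mole of ATP can be converted into the energy released by a single molecule. This is a rough sketch; the conversion constants (joules per kilocalorie and Avogadro's number) are standard physical values supplied here rather than taken from the text, and the function name is illustrative.

```python
KCAL_PER_MOLE_ATP = 12.0   # figure quoted in the text
JOULES_PER_KCAL = 4184.0   # standard conversion, not from the text
AVOGADRO = 6.022e23        # molecules per mole, standard value


def energy_per_molecule_joules(kcal_per_mole):
    """Convert an energy per mole (kcal) into joules per single molecule."""
    return kcal_per_mole * JOULES_PER_KCAL / AVOGADRO


per_molecule = energy_per_molecule_joules(KCAL_PER_MOLE_ATP)
print(f"about {per_molecule:.2e} J per ATP hydrolysed")
```

The result is on the order of 10^-19 joules, which gives some intuition for why a cell must hydrolyse enormous numbers of ATP molecules to do macroscopic work.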
Proteins
Proteins are essential components of living organisms, serving various functions. They provide structural support, facilitate chemical reactions as enzymes, regulate gene expression, enable sensory perception, and participate in immune responses. Understanding the composition and function of proteins is fundamental to molecular biology.
Despite their diverse functions, all proteins are composed of amino acids. Amino acids share a common structure, consisting of a central carbon atom (C), an amino group (NH2), a carboxyl group (COOH), and a side chain (R group). There are 20 different amino acids commonly found in proteins, each with a unique side chain that determines its properties.
The sequence of amino acids in a protein is determined by the genetic code encoded in an organism’s DNA. This sequence dictates the protein’s three-dimensional structure, which is crucial for its function. Proteins can fold into complex shapes, allowing them to interact with other molecules in specific ways.
*A kilocalorie (kcal) is the amount of energy needed to raise the temperature of one liter of water by one degree Celsius; it is the unit in which the energy content of food is usually measured. A mole is a specific amount of a substance, measured by the number of molecules (approximately 6 x 10^23 molecules per mole).
Every amino acid has an amino group at one end, a carboxyl group (COOH) at the other, and a variable sidechain (R), as shown in Figure 2. These chemical groups determine how the molecule functions, as Mavrovouniotis’s chapter in this volume explains. For example, under biological conditions the amino end of the molecule is positively charged, and the carboxyl end is negatively charged. Chains of amino acids are assembled by a reaction that occurs between the nitrogen atom at the amino end of one amino acid and the carbon atom at the carboxyl end of another, bonding the two amino acids and releasing a molecule of water. The linkage is called a peptide bond, and long chains of amino acids can be strung together into polymers*, called polypeptides, in this manner. All proteins are polypeptides, although the term polypeptide generally refers to chains that are shorter than whole proteins.
When a peptide bond is formed, the amino acid is changed (losing two hydrogen atoms and an oxygen atom), so the portion of the original molecule integrated into the polypeptide is often called a residue. The sequence of amino acid residues that make up a protein is called the protein’s primary structure. The primary structure is directly coded for in the genetic material: The individual elements of a DNA molecule form triples which unambiguously specify an amino acid. A genetic sequence maps directly into a sequence of amino acids. This process is discussed in greater detail below.
*Polymers are long strings of similar elements; -mer means “element,” as in monomer, dimer, etc. Homopolymer is a term that refers to polymers made up of all the same element; heteropolymers are made of several different units. Proteins and DNA are both heteropolymers. Glycogen, a substance used for the medium-term storage of excess energy, is an example of a homopolymer.
It is interesting to note that only a small proportion of the very many possible polypeptide chains are naturally occurring proteins. Computationally, this is unsurprising. Many proteins contain more than 100 amino acids (some have more than 4000). The number of possible polypeptide chains of length 100 is 20^100, or more than 10^130. Even if we take the high estimates of the number of species (5×10^7) and assume that they all have as many different proteins as there are in the most complex organism (<10^7) and that no two organisms share a single protein, the ratio of actual proteins to possible polypeptides is much less than 1:10^100, a very small proportion indeed.
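The bound in this argument can be reproduced in a few lines. The sketch below works in base-10 logarithms and uses only the upper-bound figures quoted above (chains of length 100, 5×10^7 species, 10^7 proteins per species); the variable names are illustrative.

```python
import math

# Possible polypeptides of length 100 over a 20-letter alphabet: 20^100.
log10_possible = 100 * math.log10(20)

# Generous upper bound on distinct natural proteins:
# (number of species) x (proteins per species), with no sharing between species.
log10_actual = math.log10(5e7) + math.log10(1e7)

# log10 of the ratio actual / possible.
log10_ratio = log10_actual - log10_possible

print(f"possible: about 10^{log10_possible:.1f}")
print(f"actual:   at most 10^{log10_actual:.1f}")
print(f"ratio:    about 10^{log10_ratio:.1f}")
```

Even with every estimate pushed in favor of natural proteins, the exponent of the ratio stays well below -100.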
The twenty naturally occurring amino acids all have the common elements shown in Figure 2. The varying parts are called sidechains; the two carbons and the nitrogen in the core are sometimes called the backbone. Peptide bonds link together the backbones of a sequence of amino acids. That link can be characterized as having two degrees of rotational freedom, the phi (φ) and psi (ψ) angles (although from the point of view of physics this is a drastic simplification, in most biological contexts it is valid). The conformation of a protein backbone (i.e. its shape when folded) can be adequately described as a series of φ/ψ angles, although it is also possible to represent the shape using the Cartesian coordinates of the central backbone atom (the alpha carbon, written Cα), or using various other representational schemes (see, e.g., Hunter or Zhang & Waltz in this volume).
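The angle-based and coordinate-based representations are related by a standard piece of geometry: a torsion (dihedral) angle is defined by four consecutive atoms. By the usual convention, which is not spelled out in the text, φ is measured over the atoms C, N, Cα, C of adjacent residues and ψ over N, Cα, C, N. The following is a minimal sketch of the conversion from four Cartesian points to one torsion angle; the function names are illustrative, and the test points are made up rather than taken from a real protein.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle about the p1-p2 bond, in degrees, in [-180, 180]."""
    b0 = sub(p1, p0)          # first bond vector
    b1 = sub(p2, p1)          # central bond (rotation axis)
    b2 = sub(p3, p2)          # last bond vector
    n1 = cross(b0, b1)        # normal to the plane of p0, p1, p2
    n2 = cross(b1, b2)        # normal to the plane of p1, p2, p3
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))


# Four made-up points with the central bond along the z axis.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

A backbone of n residues then reduces to a list of n (φ, ψ) pairs, which is a much smaller search space than 3n atomic coordinates per atom.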
The dimensions along which amino acids vary are quite important for a number of reasons. One of the major unsolved problems in molecular biology is to be able to predict the structure and function of a protein from its amino acid sequence. It was demonstrated more than two decades ago that the amino acid sequence of a protein determines its ultimate conformation and, therefore, its biological activity and function. Exactly how the properties of the amino acids in the primary structure of a protein interact to determine the protein’s ultimate conformation remains unknown. The chemical properties of the individual amino acids, however, are known with great precision. These properties form the basis for many representations of amino acids, e.g. in programs attempting to predict structure from sequence. Here is a brief summary of some of them.
Glycine is the simplest amino acid; its sidechain is a single hydrogen atom. It is nonpolar, and does not ionize easily. The polarity of a molecule refers to the degree that its electrons are distributed asymmetrically. A non- polar molecule has a relatively even distribution of charge. Ionization is the process that causes a molecule to gain or lose an electron, and hence become charged overall. The distribution of charge has a strong effect on the behavior of a molecule (e.g. like charges repel). Another important characteristic of glycine is that as a result of having no heavy (i.e. non-hydrogen) atoms in its sidechain, it is very flexible. That flexibility can give rise to unusual kinks in the folded protein.
Alanine is also small and simple; its sidechain is just a methyl group (consisting of a carbon and three hydrogen atoms). Alanine is one of the most commonly appearing amino acids. Glycine and alanine’s sidechains are aliphatic, which means that they are straight chains (no loops) containing only carbon and hydrogen atoms. There are three other aliphatic amino acids: valine, leucine and isoleucine. The longer aliphatic sidechains are hydrophobic. Hydrophobicity is one of the key factors that determines how the chain of amino acids will fold up into an active protein. Hydrophobic residues tend to come together to form a compact core that excludes water. Because the environment inside cells is aqueous (primarily water), these hydrophobic residues will tend to be on the inside of a protein, rather than on its surface.
In contrast to alanine and glycine, the sidechains of amino acids phenylalanine, tyrosine and tryptophan are quite large. Size matters in protein folding because atoms resist being too close to one another, so it is hard to pack many large sidechains closely. These sidechains are also aromatic, meaning that they form closed rings of carbon atoms with alternating double bonds (like the simple molecule benzene). These rings are large and inflexible. Phenylalanine and tryptophan are also hydrophobic. Tyrosine has a hydroxyl group (an OH at the end of the ring), and is therefore more reactive than the other sidechains mentioned so far, and less hydrophobic. These large amino acids appear less often than would be expected if proteins were composed randomly. Serine and threonine also contain hydroxyl groups, but do not have rings.
Another feature of importance in amino acids is whether they ionize to form charged groups. Residues that ionize are characterized by their pK, which indicates at what pH (level of acidity) half of the molecules of that amino acid will have ionized. Arginine and lysine have high pK’s (that is, they ionize in basic environments) and histidine, glutamic acid and aspartic acid have low pK’s (they ionize in acidic ones). Since like charges repel and opposites attract, charge is an important feature in predicting protein conformation. Most of the charged residues in a protein will be found at its surface, although some will form bonds with each other on the inside of the molecule (called salt-bridges) which can provide strong constraints on the ultimate folded form.
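The statement that half of the molecules are ionized when the pH equals the pK is one point on a smooth curve. The sketch below uses the standard Henderson-Hasselbalch relation, which the text does not name, to compute the fraction of an ionizable group that has given up its proton at a given pH. Whether the deprotonated form is the charged one depends on the group, so the function reports deprotonation only. The numeric inputs are illustrative, not measured values.

```python
def fraction_deprotonated(pH, pK):
    """Fraction of an ionizable group that has lost its proton at this pH.

    Henderson-Hasselbalch: [deprotonated] / [protonated] = 10 ** (pH - pK).
    """
    return 1.0 / (1.0 + 10.0 ** (pK - pH))


# Illustrative values only: a group whose pK sits near a hypothetical cellular pH.
for pH in (6.0, 7.0, 8.0):
    print(pH, round(fraction_deprotonated(pH, pK=7.0), 3))
```

One pH unit on either side of the pK moves the balance from roughly 9% to roughly 91%, so a group whose pK lies near the ambient pH is very sensitive to small local changes.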
Cysteine and methionine have hydrophobic sidechains that contain a sulphur atom, and each plays an important role in protein structure. The sulphurs make the amino acids’ sidechains very reactive. Cysteines can form disulphide bonds with each other; disulphide bonds often hold distant parts of a polypeptide chain near each other, constraining the folded conformation like salt bridges. For that reason, cysteines have a special role in determining the three dimensional structure of proteins. The chapter by Holbrook, Muskal and Kim in this volume discusses the prediction of this and other folding constraints. Methionine is also important because all eucaryotic proteins, when originally synthesized in the ribosome, start with a methionine. It is a kind of “start” signal in the genetic code. This methionine is generally removed before the protein is released into the cell, however.
Histidine is a relatively rare amino acid, but often appears in the active site of an enzyme. The active site is the small portion of an enzyme that effects the target reaction, and it is the key to understanding the chemistry involved. The rest of the enzyme provides the necessary scaffolding to bring the active site to bear in the right place, and to keep it away from bonds that it might do harm to. Other regions of enzymes can also act as a switch, turning the active site on and off in a process called allosteric control. Because histidine’s pK is near the typical pH of a cell, it is possible for small, local changes in the chemical environment to flip it back and forth between being charged and not charged. This ability to flip between states makes it useful for catalyzing chemical reactions. Other charged residues also sometimes play a similar role in catalysis.
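The groupings in this summary are the kind of thing a program might use to represent amino acids as feature vectors. The following is a minimal sketch of one such encoding, built only from the groups named above and the standard one-letter amino acid codes; the group names, the binary encoding, and the function names are illustrative choices, not a scheme used by any particular chapter in this volume.

```python
# Groups named in the summary above, keyed by standard one-letter codes.
GROUPS = {
    "aliphatic": set("GAVLI"),  # glycine, alanine, valine, leucine, isoleucine
    "aromatic": set("FYW"),     # phenylalanine, tyrosine, tryptophan
    "hydroxyl": set("YST"),     # tyrosine, serine, threonine
    "high_pK": set("RK"),       # arginine, lysine
    "low_pK": set("HED"),       # histidine, glutamic acid, aspartic acid
    "sulphur": set("CM"),       # cysteine, methionine
}
FEATURES = sorted(GROUPS)       # fixed feature order for the vectors


def encode(residue):
    """Binary feature vector for one residue, in FEATURES order."""
    return [1 if residue in GROUPS[name] else 0 for name in FEATURES]


def encode_sequence(sequence):
    """Encode a primary structure as a list of feature vectors."""
    return [encode(residue) for residue in sequence]


print(FEATURES)
print(encode_sequence("GYC"))
```

Residues the summary does not discuss (for example proline) come out as all zeros here, which is a limitation of the sketch rather than a statement about their chemistry.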
With this background, it is now possible to understand the basics of the protein folding problem, which is the target of many of the AI methods applied in this volume. The genetic code specifies only the amino acid sequence of a protein. As a new protein comes off the ribosome, it folds up into the shape that gives it its biochemical function, sometimes called its active conformation (the same protein unfolded into some other shape is said to be denatured, which is what happens, e.g., to the white of an egg when you cook it). In the cell, this process takes a few seconds, which is a very long time for a chemical reaction. The complex structure of the ribosome may play a role in protein folding, and a few proteins need helper molecules, termed chaperones, to fold properly. However, these few seconds are a very short time compared to how long it takes people to figure out how a protein will fold. In raw terms, the folding problem involves finding the mapping from primary sequence (a sequence of from dozens to several thousand symbols, drawn from a 20 letter alphabet) to the real-numbered locations of the thousands of constituent atoms in three-space.
Although all of the above features of amino acids play some role in protein folding, there are few absolute rules. The conformation a protein finally assumes will minimize the total “free” energy of the molecule. Going against the tendencies described above (e.g. packing several large sidechains near each other) increases the local free energy, but may reduce the energy elsewhere in the molecule. Each one of the tendencies described can be traded off against some other contribution to the total free energy of the folded protein. Given any conformation of atoms, it is possible in principle to compute its free energy. Ideally, one could examine all the possible conformations of a protein, calculate the free energy by applying quantum mechanical rules, and select the minimum energy conformation as a prediction of the folded structure. Unfortunately, there are very many possible conformations to test, and each energy calculation itself is prohibitively complex. A wide variety of approaches have been taken to making this problem tractable, and, given a few hours of supercomputer time, it is currently possible to evaluate several thousand possible conformations. These techniques are well surveyed in Karplus & Petsko (1990). An alternative to the pure physical simulations are the various AI approaches which a significant portion of this volume is dedicated to describing.
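A rough calculation shows why exhaustive search is hopeless. The sketch below rests on loudly stated assumptions: three coarse backbone states per residue is a hypothetical simplification chosen for illustration, the chain length of 100 is arbitrary, and the evaluation rate is one reading of "several thousand conformations in a few hours" (5000 in 3 hours). None of these numbers come from a real simulation.

```python
import math

STATES_PER_RESIDUE = 3        # hypothetical coarse discretization per residue
CHAIN_LENGTH = 100            # residues, chosen for illustration
EVALS_PER_HOUR = 5000 / 3     # assumed reading of the rate quoted in the text
HOURS_PER_YEAR = 24 * 365.25

# Number of conformations is STATES_PER_RESIDUE ** CHAIN_LENGTH; work in log10.
log10_conformations = CHAIN_LENGTH * math.log10(STATES_PER_RESIDUE)
log10_years = (log10_conformations
               - math.log10(EVALS_PER_HOUR)
               - math.log10(HOURS_PER_YEAR))

print(f"conformations: about 10^{log10_conformations:.1f}")
print(f"years to enumerate: about 10^{log10_years:.1f}")
```

Changing any of the assumed constants by a few orders of magnitude does not alter the conclusion, which is why the approaches described here prune or approximate the search instead of enumerating it.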
The position of the atoms in a folded protein is called its tertiary structure. The primary structure is the amino acid sequence. Secondary structure refers to local arrangements of a few to a few dozen amino acid residues that take on particular conformations that are seen repeatedly in many different proteins. These shapes are stabilized by hydrogen bonds (a hydrogen bond is a relatively weak bond that also plays a role in holding the two strands of the DNA molecule together). There are two main kinds of secondary structure: corkscrew-shaped conformations where the amino acids are packed tightly together, called α-helices, and long flat sheets made up of two or more adjacent strands of the molecule, extended so that the amino acids are stretched out as far from each other as they can be. Each extended chain is called a β-strand, and two or more β-strands held together by hydrogen bonds are called a β-sheet. β-sheets can be composed of strands running in the same direction (called a parallel β-sheet) or running in the opposite direction (antiparallel). Other kinds of secondary structure include structures that are even more tightly packed than α-helices called 3-10 helices, and a variety of small structures that link other structures, called β-turns. Some local combinations of secondary structures have been observed in a variety of different proteins. For example, two α-helices linked by a turn with an approximately 60° angle have been observed in a variety of proteins that bind to DNA. This pattern is called the helix-turn-helix motif, and is an example of what is known as super-secondary structure. Finally, some proteins only become functional when assembled with other molecules. Some proteins bind to copies of themselves; for example, some DNA-binding proteins only function as dimers (linked pairs). Other proteins require prosthetic groups such as heme or chlorophyll. Additions necessary to make the folded protein active are termed the protein’s quaternary structure.
Nucleic Acids
If proteins are the workhorses of the biochemical world, nucleic acids are their drivers; they control the action. All of the genetic information in any living creature is stored in deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), which are polymers of four simple nucleic acid units, called nucleotides. There are four nucleotides found in DNA. Each nucleotide consists of three parts: one of two kinds of base molecule (a purine or a pyrimidine), plus a sugar (ribose in RNA and deoxyribose in DNA), and one or more phosphate groups. The purine nucleotides are adenine (A) and guanine (G), and the pyrimidines are cytosine (C) and thymine (T). Nucleotides are sometimes called bases, and, since DNA consists of two complementary strands bonded together, these units are often called base-pairs. The length of a DNA sequence is often measured in thousands of bases, abbreviated kb. Nucleotides are generally abbreviated by their first letter, and appended into sequences, written, e.g., CCTATAG. The nucleotides are linked to each other in the polymer by phosphodiester bonds. This bond is directional; a strand of DNA has a head (called the 5’ end) and a tail (the 3’ end).
One well known fact about DNA is that it forms a double helix; that is, two helical (spiral-shaped) strands of the polynucleotide, running in opposite directions, held together by hydrogen bonds. Adenines bond exclusively with the thymines (A-T) and guanines bond exclusively with cytosines (G-C). Although the sequence in one strand of DNA is completely unrestricted, because of these bonding rules the sequence in the complementary strand is completely determined. It is this feature that makes it possible to make high fidelity copies of the information stored in the DNA. It is also exploited when DNA is transcribed into complementary strands of RNA, which direct the synthesis of protein. The only difference is that in RNA, uracil (U) takes the place of thymine; that is, it bonds to adenine.
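Because one strand fully determines the other, computing the partner strand is a simple string operation. The sketch below applies the pairing rules just described (A-T, G-C, and U in place of T for RNA) and reverses the result, since the two strands run in opposite directions and sequences are conventionally written from the 5’ end to the 3’ end. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand):
    """Partner DNA strand, written 5' to 3' like the input.

    The strands are antiparallel, so the complement is read in reverse.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def transcribe(template):
    """RNA complementary to a DNA template strand: same rules, U replaces T."""
    return complementary_strand(template).replace("T", "U")


print(complementary_strand("CCTATAG"))
print(transcribe("CCTATAG"))
```

Applying `complementary_strand` twice returns the original sequence, which is the string-level counterpart of high fidelity copying.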
DNA molecules take a variety of conformations (shapes) in living systems. In most biological circumstances, the DNA forms a classic double helix, called B-DNA; in certain circumstances, however, it can become supercoiled or even reverse the direction of its twist (this form is called Z-DNA). These alternative forms may play a role in turning particular genes on and off (see below). There is some evidence that the geometry of the B-DNA form (for example, differing twist angles between adjacent base pairs) may also be exploited by cell mechanisms. The fact that the conformation of the DNA can have a biological effect over and above the sequence it encodes highlights an important lesson for computer scientists: there is more information available to a cell than appears in the sequence databases. This lesson also applies to protein sequences, as we will see in the discussion of post-translational modification.
Now that we have covered the basic structure and function of proteins and nucleic acids, we can begin to put together a picture of the molecular processing that goes on in every cell.
Genetic Expression: From Blueprint to Finished Product
Genes, the Genome and the Genetic Code
The genetic information of an organism can be stored in one or more distinct DNA molecules; each is called a chromosome. In some sexually reproducing organisms, called diploids, each chromosome contains two similar DNA molecules physically bound together, one from each parent. Sexually reproducing organisms with single DNA molecules in their chromosomes are called haploid. Human beings are diploid with 23 pairs of linear chromosomes. In bacteria, it is common for the ends of the DNA molecule to bind together, forming a circular chromosome. All of the genetic information of an organism, taken together as a whole, is referred to as its genome.
The primary role of nucleic acids is to carry the encoding of the primary structure of proteins. Each non-overlapping triplet of nucleotides, called a codon, corresponds to a particular amino acid (see Table 1). Four nucleotides can form 4^3 = 64 possible triplets, which is more than the 20 needed to code for each amino acid (pairs would provide only 4^2 = 16 codons). Three of these codons are used to designate the end of a protein sequence, and are called stop codons. The others all code for a particular amino acid. That means that most amino acids are encoded by more than one codon. For example, alanine is represented in DNA by the codons GCT, GCC, GCA and GCG. Notice that the first two nucleotides of these codons are all identical, and that the third is redundant. Although this is not true for all of the amino acids, most codon synonyms differ only in the last nucleotide. This phenomenon is called the degeneracy of the code. Whether it is an artifact of evolution, or serves a purpose such as allowing general changes in the global composition of DNA (e.g. increasing the proportion of purines) without changing the coded amino acids, is still unknown.
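The counting argument can be checked directly. The sketch below builds the standard genetic code from a compact 64-letter string (one amino acid letter per codon, with the bases taken in the order T, C, A, G); the variable names are my own, and the table is the standard code only, not the variant codes discussed next.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (TTT, TTC, TTA, ...) to its amino acid letter.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(len(CODON_TABLE))                                             # 64 = 4^3
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))      # ['TAA', 'TAG', 'TGA']
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "A"))      # ['GCA', 'GCC', 'GCG', 'GCT']

# Number of synonymous codons per amino acid (stop codons excluded).
synonyms = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
print(len(synonyms))                                                # 20 amino acids
print(sorted(aa for aa, n in synonyms.items() if n == 1))           # ['M', 'W']
```

Only methionine (M) and tryptophan (W) have a single codon; every other amino acid has at least two synonyms, which is the degeneracy described above.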
There are some small variations in the translation of codons into amino acids from organism to organism. Since the code is so central to the functioning of the cell, it is very strongly conserved over evolution. However, there are a few systems that use a slightly different code. An important example is found in mitochondria. Mitochondria have their own DNA, and probably represent previously free living organisms that were enveloped by eucaryotes. Mitochondrial DNA is translated using a slightly different code, which is more degenerate (has less information in the third nucleotide) than the standard code. Other organisms that diverged very early in evolution, such as the ciliates, also use different codes.
The basic process of synthesizing proteins maps from a sequence of codons to a sequence of amino acids. However, there are a variety of important complications. Since codons come in triples, there are three possible places to start parsing a segment of DNA. For example, the chain …AATGCGATAAG… could be read …AAT-GCG-ATA… or …ATG-CGA-TAA… or …TGC-GAT-AAG. This problem is similar to decoding an asynchronous serial bit stream into bytes. Each of these parsings is called a reading frame. A parsing with a long enough string of codons with no intervening stop codons is called an open reading frame, or ORF, and could be translated into a protein. Organisms sometimes code different proteins with overlapping reading frames, so that if the reading process shifts by one character, a completely different, but still functional protein results! More often, frame shifts, which can be introduced by insertions and deletions in the DNA sequence or transcriptional “stuttering,” produce nonsense.
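The three parsings of the example chain can be reproduced with a short Python sketch. The helper names are my own, and the "openness" test here only checks for the three standard stop codons; a real ORF finder would also require a minimum length, which this sketch does not attempt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three stop codons of the standard code


def codons_in_frame(dna, offset):
    """Split dna into complete, non-overlapping triplets starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def is_open(codons):
    """True if none of the codons is a stop codon."""
    return not any(codon in STOP_CODONS for codon in codons)


dna = "AATGCGATAAG"  # example chain from the text
for offset in range(3):
    codons = codons_in_frame(dna, offset)
    status = "open" if is_open(codons) else "contains a stop codon"
    print(offset, "-".join(codons), status)
# 0 AAT-GCG-ATA open
# 1 ATG-CGA-TAA contains a stop codon
# 2 TGC-GAT-AAG open
```

Trailing bases that do not fill a complete triplet are dropped, just as the ellipses in the text leave the ends of the chain unparsed.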
Not only are there three possible reading frames in a DNA sequence, it is possible to read off either strand of the double helix. Recall that the second strand is the complement of the first, so that our example above (AATGCGATAAG) can also be read complemented and in the opposite direction, e.g. CTTATCGCATT. This is sometimes called reading from the antisense or complementary strand. An antisense message can also be parsed three ways, making a total of 6 possible reading frames for every DNA sequence. There are known examples of DNA sequences that code for proteins in both directions with several overlapping reading frames: quite a feat of compact encoding.
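Reading the other strand amounts to taking the reverse complement and then parsing it in the same three ways. A minimal Python sketch, with frame labels ("+1" to "-3") that are my own convention rather than anything defined in this chapter:

```python
# Translation table implementing the pairing rules A-T and G-C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse, giving the other strand read in its own direction."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the codon lists for all six reading frames, keyed by strand sign and offset."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)
            ]
    return frames


print(reverse_complement("AATGCGATAAG"))  # CTTATCGCATT
for label, codons in six_frames("AATGCGATAAG").items():
    print(label, "-".join(codons))
```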
And there’s more. DNA sequences coding for a single protein in most eucaryotes have noncoding sequences, called introns, inserted into them. These introns are spliced out before the sequence is mapped into amino acids. Different eucaryotes have a variety of different systems for recognizing and removing these introns. Most bacteria don’t have introns. It is not known whether introns evolved only after the origin of eucaryotes, or whether selective pressure has caused bacteria to lose theirs. The segments of DNA that actually end up coding for a protein are called exons. You can keep these straight by remembering that introns are insertions, and that exons are expressed.
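From a data structure point of view, splicing is just the concatenation of the exon segments. The Python sketch below assumes the exon boundaries are already known; the sequences and coordinates are invented for illustration, and nothing here models how a cell actually recognizes an intron.

```python
# Hypothetical gene: two exons separated by one intron (all sequences are made up).
EXON_1, INTRON, EXON_2 = "ATGGCG", "GTAAGTTTCAG", "ATAAGTTAA"
gene = EXON_1 + INTRON + EXON_2

# Exon positions as half-open (start, end) intervals, in order along the gene.
exon_spans = [(0, 6), (17, 26)]


def splice(sequence, spans):
    """Join the exon segments, discarding everything between them."""
    return "".join(sequence[start:end] for start, end in spans)


def intron_spans(spans):
    """Return the gaps between consecutive exons, i.e. the spliced-out regions."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])]


print(splice(gene, exon_spans))    # ATGGCGATAAGTTAA
print(intron_spans(exon_spans))    # [(6, 17)]
```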
DNA contains a large amount of information in addition to the coding sequences of proteins. Every cell in the body has the same DNA, but each cell type has to generate a different set of proteins, and even within a single cell type, its needs change throughout its life. An increasing number of DNA signals that appear to play a role in the control of expression are being characterized. There are a variety of signals identifying where proteins begin and end, where splices should occur, and an exquisitely detailed set of mechanisms for controlling which proteins should be synthesized and in what quantities. Large scale features of a DNA molecule, such as a region rich in Cs and Gs, can play a biologically important role, too.
Finally, some exceptions to the rules I mentioned above should be noted. DNA is sometimes found in single strands, particularly in some viruses. Viruses also play other tricks with nucleic acids, such as transcribing RNA into DNA, going against the normal flow of information in the cell. Even non-standard base-pairings sometimes play an important role, such as in the structure of transfer RNA (see below).
RNA: Transcription, Translation, Splicing & RNA Structure
The process of mapping from DNA sequence to folded protein in eucaryotes involves many steps (see Figure 3). The first step is the transcription of a portion of DNA into an RNA molecule, called a messenger RNA (mRNA). This process begins with the binding of a molecule called RNA polymerase to a location on the DNA molecule. Exactly where that polymerase binds determines which strand of the DNA will be read and in which direction. Parts of the DNA near the beginning of a protein coding region contain signals which can be recognized by the polymerase; these regions are called promoters. (Promoters and other control signals are discussed further below.) The polymerase catalyzes a reaction which causes the DNA to be used as a template to create a complementary strand of RNA, called the primary transcript. This transcript contains introns as well as exons. At the end of the transcript, 250 or more extra adenosines, called a poly-A tail, are often added to the RNA. The role of these nucleotides is not known, but the distinctive signature is sometimes used to detect the presence of mRNAs.
The next step is the splicing of the exons together. This operation takes place in a ribosome-like assembly called a spliceosome. The RNA remaining after the introns have been spliced out is called a mature mRNA. It is then transported out of the nucleus to the cytoplasm, where it binds to a ribosome.
A ribosome is a very complex combination of RNA and protein, and its operation has yet to be completely understood. It is at the ribosome that the mRNA is used as a blueprint for the production of a protein; this process is called translation. The reading frame that the translation will use is determined by the ribosome. The translation process depends on the presence of molecules which make the mapping from codons in the mRNA to amino acids; these molecules are called transfer-RNA or tRNAs. Each tRNA has an anti-codon (that binds to its corresponding codon) near one end and the corresponding amino acid on the other end. The anti-codon end of the tRNA binds to the mRNA, bringing the amino acids corresponding to the mRNA sequence into physical proximity, where they form peptide bonds with each other. How the tRNAs find only the correct amino acid was a mystery until quite recently. This process depends on the three dimensional structure of the RNA molecule, which is discussed in Steeg’s chapter of this volume. As the protein comes off the ribosome, it folds up into its native conformation. This process may involve help from the ribosome itself or from chaperone molecules, as was described above.
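Seen as a computation, translation is a table lookup over successive codons. The Python sketch below is a deliberate simplification: it uses the standard genetic code written in RNA letters, it assumes (my assumption, not a claim of this chapter) that reading starts at the first AUG it finds, and it stops at the first stop codon. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon(codon):
    """Return the tRNA triplet that pairs with a codon, written in its own reading direction."""
    return "".join(RNA_PAIRS[base] for base in reversed(codon))


def translate(mrna):
    """Translate from the first AUG (simplifying assumption) up to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(anticodon("AUG"))                # CAU
print(translate("AAUGCGAUAAG"))        # MR
print(translate("AUGGCGAUAAGUUAA"))    # MAIS
```

The one-letter output uses the usual amino acid abbreviations (M for methionine, R for arginine, and so on).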
Once the protein has folded, other transformations can occur. Various kinds of chemical groups can be bound to different places on the proteins, including sugars, phosphate, acetyl or methyl groups. These additions can change the hydrogen bonding proclivity or shape of the protein, and may be necessary to make the protein active, or may keep it from having an effect before it is needed. The general term for these transformations is post-translational modifications. Once this process is complete, the protein is then transported to the part of the cell where it will accomplish its function. The transport process may be merely passive diffusion through the cytoplasm, or there may be an active transport mechanism that moves the protein across membranes or into the appropriate cellular compartment.
Genetic Regulation
Every cell has the same DNA. Yet the DNA in some cells codes for the proteins needed to function as, say, a muscle, and in others codes for the proteins to make the lens of the eye. The difference lies in the regulation of the genetic machinery. At any particular time, a particular cell is producing only a small fraction of the proteins coded for in its DNA. And the amount of each protein produced must be precisely regulated in order for the cell to function properly. The cell will change the proteins it synthesizes in response to the environment or other cues. The mechanisms that regulate this process constitute a finely tuned, highly parallel system with extensive multifactorial feedback and elaborate control structure. It is also not yet well understood.
Genes are generally said to be on or off (or expressed/not expressed), although the amount of protein produced is also important. The production process is controlled by a complex collection of proteins in the nucleus of eucaryotic cells that influence which genes are expressed. Perhaps the most important of these proteins are the histones, which are tightly bound to the DNA in the chromosomes of eucaryotes. Histones are some of the most conserved proteins in all of life. There are almost no differences in the sequence of plant and mammalian histones, despite more than a billion years of divergence in their evolution. Other proteins swarm around the DNA, some influencing the production of a single gene (either encouraging or inhibiting it), while others can influence the production of large numbers of genes at once. An important group of these proteins are called topoisomerases; they rearrange and untangle the DNA in various ways, and are the next most prevalent proteins in the chromosome.
Many regulatory proteins recognize and bind to very specific sequences in the DNA. The sequences that these proteins recognize tend to border the protein coding regions of genes, and are known generally as control regions. Sequences that occur just upstream (towards the 5′ end) of the coding region that encourage the production of the protein are called promoters. Similar regions either downstream of the coding region or relatively far upstream are called enhancers. Sequences that tend to prevent the production of a protein are called repressors. Karp’s chapter in this volume discusses how this complex set of interactions can be modeled in knowledge-based systems.
Cells need to turn entire suites of genes on and off in response to many different events, ranging from normal development to trying to repair damage to the cell. The control mechanisms are responsive to the level of a product already in the cell (for homeostatic control) as well as to a tremendous variety of extracellular signals. Perhaps the most amazing activities in gene regulation occur during development; not only are genes turned on and off with precise timing, but the control can extend to producing alternative splicings of the nascent primary transcripts (as is the case in the transition from fetal to normal hemoglobin).
Catalysis & Metabolic Pathways
The translation of genes into proteins, crucial as it is, is only a small portion of the biochemical activity in a cell. Proteins do most of the work of managing the flow of energy, synthesizing, degrading and transporting materials, sending and receiving signals, exerting forces on the world, and providing structural support. Systems of interacting proteins form the basis for nearly every process of living things, from moving around and digesting food to thinking and reproducing. Somewhat surprisingly, a large proportion of the chemical processes that underlie all of these activities are shared across a very wide range of organisms. These shared processes are collectively referred to as intermediary metabolism. These include the catabolic processes for breaking down proteins, fats and carbohydrates (such as those found in food) and the anabolic processes for building new materials. Similar collections of reactions that are more specialized to particular organisms are called secondary metabolism. The substances that these reactions produce and consume are called metabolites.
The biochemical processes in intermediary metabolism are almost all catalyzed reactions. That is, these reactions would barely take place at all at normal temperatures and pressures; they require special compounds that facilitate the reaction, and these compounds are called catalysts or enzymes. (It is only partially in jest that many biochemistry courses open with the professor saying that the reactions that take place in living systems are ones you were taught were impossible in organic chemistry.) Catalysts are usually named after the reaction they facilitate, usually with the added suffix -ase. For example, alcohol dehydrogenase is the enzyme that turns ethyl alcohol into acetaldehyde by removing two hydrogen atoms. Common classes of enzymes include dehydrogenases, synthetases, proteases (for breaking down proteins), decarboxylases (removing carboxyl groups), transferases (moving a chemical group from one place to another), kinases and phosphatases (adding or removing phosphate groups, respectively) and so on. The materials transformed by catalysts are called substrates. Unlike the substrates, catalysts themselves are not changed by the reactions they participate in. A final point to note about enzymatic reactions is that in many cases the reactions can proceed in either direction. That is, an enzyme that transforms substance A into substance B can often also facilitate the transformation of B into A. The direction of the transformation depends on the concentrations of the substrates and on the energetics of the reaction (see Mavrovouniotis’ chapter in this volume for further discussion of this topic).
Even the basic transformations of intermediary metabolism can involve dozens or hundreds of catalyzed reactions. These combinations of reactions, which accomplish tasks like turning foods into useable energy or compounds, are called metabolic pathways. Because of the many steps in these pathways and the widespread presence of direct and indirect feedback loops, they can exhibit many counterintuitive behaviors. Also, all of these chemical reactions are going on in parallel. Mavrovouniotis’s chapter in this volume describes an efficient system for making inferences about these complex systems.
In addition to the feedback loops among the substrates in the pathways, the presence or absence of substrates can affect the behavior of the enzymes themselves, through what is called allosteric regulation. These interactions occur when a substance binds to an enzyme someplace other than its usual active site (the atoms in the molecule that have the enzymatic effect). Binding at this other site changes the shape of the enzyme, thereby changing its activity. Another method of controlling enzymes is called competitive inhibition. In this form of regulation, a substance other than the usual substrate of the enzyme binds to the active site of the enzyme, preventing it from having an effect on its substrate.
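This chapter gives no rate law for these effects, so the following Python sketch borrows the standard Michaelis-Menten model from textbook enzyme kinetics to illustrate competitive inhibition only (allosteric regulation is not modeled). The function name and all parameter values are hypothetical and chosen purely for illustration.

```python
def reaction_rate(substrate, vmax, km, inhibitor=0.0, ki=1.0):
    """Michaelis-Menten rate with an optional competitive inhibitor.

    A competitive inhibitor occupies the active site, which in this model
    raises the apparent Km by the factor (1 + [I]/Ki) and leaves Vmax unchanged.
    """
    apparent_km = km * (1.0 + inhibitor / ki)
    return vmax * substrate / (apparent_km + substrate)


# Hypothetical enzyme: Vmax = 10, Km = 2 (arbitrary units).
print(reaction_rate(2.0, vmax=10.0, km=2.0))                            # 5.0
print(reaction_rate(2.0, vmax=10.0, km=2.0, inhibitor=3.0, ki=1.0))     # 2.0
# With a large excess of substrate, the inhibitor is mostly outcompeted.
print(reaction_rate(2000.0, vmax=10.0, km=2.0, inhibitor=3.0, ki=1.0))
```

The last line shows the characteristic signature of competition for the active site: adding enough substrate restores a rate close to Vmax.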
These are the basic mechanisms underlying eucaryotic cells (and much of this applies to bacterial and archaeal ones as well). Of course, each particular activity of a living system, from the capture of energy to immune response, has its own complex network of biochemical reactions that provides the mechanism underlying the function. Some of these mechanisms, such as the secondary messenger system involving cyclic adenosine monophosphate (cAMP) are widely shared by many different systems. Others are exquisitely specialized for a particular task in a single species: my favorite example of this is the evidence that perfect pitch in humans (being able to identify musical notes absolutely, rather than relative to each other) is mediated by a single protein. The functioning of these biochemical networks is being unravelled at an ever increasing rate, and the need for sophisticated methods to analyze relevant data and build suitable models is growing rapidly.
Genetic Mechanisms of Evolution
In the beginning of this chapter, I discussed the central role that evolution plays in understanding living systems. The mechanisms of evolution at the molecular level are increasingly well understood. The similarities and differences among molecules that are closely related provide important information about the structure and function of those molecules. Molecules (or their sequences) which are related to one another are said to be homologous. Although genes or proteins that have similar sequences are often assumed to be homologous, there are well known counterexamples due to convergent evolution. In these cases, aspects of very distantly related organisms come to resemble one another through very different evolutionary pathways. Unless there is evidence to the contrary, it is usually safe to assume that macromolecular sequences that are similar to each other are homologous.
The sources of variation at the molecular level are very important to understanding how molecules come to differ from each other (or diverge). Perhaps the best known mechanism of molecular evolution is the point mutation, or the change of a single nucleotide in a genetic sequence. The change can be to insert a new nucleotide, to delete an existing one, or to change one nucleotide into another. Other mechanisms include large scale chromosomal rearrangements and inversions. An important kind of rearrangement is the gene duplication, in which additional copies of a gene are inserted into the genome. These copies can then diverge, so that, for example, the original functionality may be preserved at the same time as related new genes evolve. These duplication events can lead to the presence of pseudogenes, which are quite similar to actual genes, but are not expressed. These pseudogenes present challenges for gene recognition algorithms, such as the one proposed in Searls’ chapter in this volume. Sexual reproduction adds another dimension to the exchange of genetic material. DNA from the two parents of a sexually reproducing organism undergoes a process called crossover, which forms a kind of mosaic that is passed on to the offspring.
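The three kinds of point mutation correspond to the familiar string edit operations. In the Python sketch below (function names and the example sequence are my own, chosen for illustration), a substitution changes at most one codon, while an insertion or deletion shifts every codon downstream of it.

```python
def substitute(dna, position, base):
    """Change the nucleotide at position into base."""
    return dna[:position] + base + dna[position + 1:]


def insert(dna, position, base):
    """Insert a new nucleotide before position."""
    return dna[:position] + base + dna[position:]


def delete(dna, position):
    """Delete the nucleotide at position."""
    return dna[:position] + dna[position + 1:]


def codons(dna):
    """Parse into complete triplets from the first base."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


original = "ATGGCGATAAGT"                       # hypothetical coding sequence
print(codons(original))                         # ['ATG', 'GCG', 'ATA', 'AGT']
print(codons(substitute(original, 5, "A")))     # ['ATG', 'GCA', 'ATA', 'AGT']
print(codons(insert(original, 3, "C")))         # ['ATG', 'CGC', 'GAT', 'AAG']
print(codons(delete(original, 3)))              # ['ATG', 'CGA', 'TAA']
```

In this example the deletion happens to create a stop codon (TAA) in the shifted frame, which is one way a single lost nucleotide can truncate a protein.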
Most mutations have relatively little effect. Mutations in the middle of introns generally have no effect at all (although mutations at the ends of an intron can affect the splicing process). Mutations in the third position of most codons have little effect at the protein level because of the redundancy of the genetic code. Even mutations that cause changes in the sequence of a protein are often neutral, as demonstrated by Sauer et al. (1989). Their experimental method involved saturation mutagenesis, which explores a relatively large proportion of the space of possible mutations in parallel. Neutral mutations are the basis of genetic drift, which is the phenomenon that accounts for the differences between the DNA that codes for functionally identical proteins in different organisms. This drift is also the basis for the molecular clock, described above. Of course, some point mutations are lethal, and others lead to diseases such as cystic fibrosis. Very rarely, a mutation will be advantageous; it will then rapidly get fixed in the population, as the organisms with the conferred advantage out-reproduce the ones without it. Diploid sexually reproducing organisms have two copies of each gene (one from each parent), resulting in an added layer of complexity in the effect of mutations. Sometimes the extra copy can compensate (or partially compensate) for a mutation.
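The claim about third-position mutations can be quantified for the standard genetic code. The Python sketch below counts, for each codon position, how many single-nucleotide substitutions in amino acid codons leave the encoded amino acid unchanged. It treats all substitutions as equally likely, which is my simplifying assumption and not a statement about real mutation rates.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_changes(position):
    """Count substitutions at a codon position (0, 1 or 2) that keep the amino acid.

    Returns (synonymous, total) over all non-stop codons.
    """
    synonymous = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            synonymous += CODON_TABLE[mutant] == amino_acid
    return synonymous, total


for position in range(3):
    same, total = synonymous_changes(position)
    print(f"codon position {position + 1}: {same} of {total} substitutions are synonymous")
# For the third position this prints 126 of 183, roughly two thirds.
```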
Molecular evolution also involves issues of selection and inheritance. Inheritance requires that the genes from the parent be passed to the offspring. DNA itself is replicated by splitting the double helix into two complementary strands and then extending a primer by attaching complementary nucleotides. This process is modelled in detail in Brutlag et al.’s chapter in this volume. The molecular mechanisms underlying the whole complex process of cell division (i.e. the cell cycle) are strikingly conserved in eucaryotes, and knowledge about this process is growing rapidly (see, e.g., Hartwell (1991) for a review). Selection also occurs on factors that are only apparent on the molecular level, such as the efficiency of certain reaction pathways (see, e.g. Hochachka & Somero [1984]).
Sources of Biological Knowledge
The information in this chapter has been presented textbook style, with little discussion of how the knowledge arose, or where errors might have crept in. The purpose of this section is to describe some of the basic experimental methods of molecular biology. These methods are important not only in understanding the source of possible errors in the data, but also because computational methods for managing laboratory activities and analyzing raw data are another area where AI can play a role (see the chapters by Edwards, et al and Glasgow, et al, in this volume). I will also describe some of the many online information resources relevant to computational molecular biology that are available.
Model Organisms: Germs, Worms, Weeds, Bugs & Rodents
The investigation of the workings of even a single organism is so complex as to take many dedicated scientists many careers’ worth of time. Trying to study all organisms in great depth is simply beyond the abilities of modern biology. Furthermore, the techniques of biological experimentation are often complex, time consuming and difficult. Some of the most valuable methods in biological research are invasive, or require organisms to be sacrificed, or require many generations of observation, or observations on large populations. Much of this work is impractical or unethical to carry out on humans. For these reasons, biologists have selected a variety of model organisms for experimentation. These creatures have qualities that make possible controlled laboratory experiments at reasonable cost and difficulty with results that can often be extrapolated to people or other organisms of interest.
Of course, research involving humans can be done ethically, and in some areas of biomedical research, such as final drug testing, it is obligatory. Other research methods involve kinds of human cells that can be grown successfully in the laboratory. Not many human cell types thrive outside of the body. Some kinds of human cancer cells do grow well in the laboratory, and these cells are an important vehicle for research.
Sometimes the selection of a new model organism can lead to great advances in a field. For example, the use of a particular kind of squid made possible the understanding of the functioning of neurons because it contained a motor neuron that is more than 10 times the size of most neural cells, and hence easy to find and use in experiments. There are experimentally useful correlates of nearly every aspect of human biology found in some organism or another, but the following six organisms form the main collection of models used in molecular biology:
E. coli The ubiquitous intestinal bacterium Escherichia coli is a workhorse in biological laboratories. Because it is a relatively simple organism with fast reproduction time and is safe and easy to work with, E. coli has been the focus of a great deal of research in genetics and molecular biology of the cell. Although it is a bacterium, many of the basic biochemical mechanisms of E. coli are shared by humans. For example, the first understanding of how genes can be turned on and off came from the study of a virus that infects these bacteria (Ptashne, 1987). E. coli is a common target for genetic engineering, where genes from other organisms are inserted into the bacterial genome so that their products can be produced in quantity. E. coli is now the basis of the international biotechnology industry, churning out buckets full of human insulin, the heart attack drug TPA, and a wide variety of other substances.
Saccharomyces Saccharomyces cerevisiae is better known as brewer’s yeast, and it is another safe, easy to grow, short generation time organism. Other yeasts, such as Schizosaccharomyces pombe, are also used extensively. Surprisingly, yeasts are very much like people in many ways. Unlike the bacterium E. coli, yeasts are eucaryotes, with a cell nucleus, mitochondria, a eucaryotic cell membrane, and many of the other cellular components and processes found in most other eucaryotes, including people. Because these yeasts are so easy to grow and manipulate, and because they are so biochemically similar to people, many insights about the molecular processes involved in metabolism, biosynthesis, cell division, and other crucial areas of biology have come from the investigation of Saccharomyces (Saccharomyces is a genus name, which, when used alone, refers to all species that are within that genus). Yeasts play another important role in molecular biology. One of the crucial steps in sequencing large amounts of DNA is to be able to prepare many copies of moderate sized pieces of DNA. A widely used method for doing this is the yeast artificial chromosome (or YAC), which is discussed below.
Arabidopsis The most important application of increased biological understanding is generally thought to be in medicine, and increased understanding of human biology has indeed led to dramatic improvements in health care. However, in terms of effect on human life, agriculture is just as significant. A great deal of research into genetics and biochemistry has been motivated by the desire to better understand various aspects of plant biology. An important model organism for plants is Arabidopsis thaliana, a common weed. Arabidopsis makes a good model because it undergoes the same processes of growth, development, flowering and reproduction as most higher plants, but its genome has 30 times less DNA than corn, and very little repetitive DNA. It also produces lots of seeds, and takes only about six weeks to grow to maturity. There are several other model organisms used to investigate botanical questions, including tomatoes, tobacco, carrots and corn.
C. elegans One of the most exciting model organisms to emerge recently has been the nematode worm Caenorhabditis elegans. This tiny creature, thousands of which can be found in a spadeful of dirt, has already been used to generate tremendous insight about cellular development and physiology. The adult organism has exactly 959 cells, and every normal worm consists of exactly the same collection of cells in the same places doing the same thing. It is one of the simplest creatures with a nervous system (which involves about a third of its cells). Not only is the complete anatomy of the organism known, but a complete cell fate map has been generated, tracing the developmental lineage of each cell throughout the lifespan of the organism. This map allows researchers to relate behaviors to particular cells, to trace the effects of genetic mutations very specifically, and perhaps to gain insight into the mechanisms of aging as well as development. A large, highly integrated picture and text database of information about the cell fates, genetic maps and sequences, mutation effects and other relevant information about C. elegans is currently under construction at the University of Arizona.
D. melanogaster Drosophila melanogaster, a common fruit fly, has long been a staple of classical genetics research. These flies have short generation times, and many different genetically determined morphological characteristics (e.g. eye color) that can readily be determined by visual inspection. Drosophila were used for decades in exploring patterns of inheritance; now that molecular methods can be applied, they have proven invaluable for a variety of studies of genetic expression and control. An important class of genetic elements that regulate many other genes, in effect specifying complex genetic programs, were first discovered in Drosophila; these areas are called homeoboxes. Molecular genetics in Drosophila is also providing great insights into how complex body plans are generated.
M. musculus Mus musculus is the basic laboratory mouse. Mice are mammals, and, as far as biochemistry is concerned, are practically identical to people. Many questions about physiology, reproduction, functioning of the immune and nervous systems and other areas of interest can only be addressed by examining creatures that are very similar to humans; mice nearly always fit the bill. The similarities between mice and people mean also that the mouse is a very complicated creature; it has a relatively large, complex genome, and mouse development and physiology are not as regular or consistent as those of C. elegans or Drosophila. Although our depth of understanding of the mouse will lag behind understanding of simpler organisms, the comparison of the mouse genome to the human genome is likely to be a key step, both in understanding their vast commonalities, and in seeing the aspects of our genes that make us uniquely human.
Experimental Methods
Molecular biologists have developed a tremendous variety of tools to address questions of biological function. This chapter can only touch briefly on a few of the most widely used methods, but the terminology and a sense of the kinds of efforts required to produce the data used by computer scientists can be important for understanding the strengths and limitations of various sources of data.
Imaging. The first understanding of the cellular nature of life came shortly after the invention of the light microscope, and microscopy remains central to research in biology. The tools for creating images have expanded tremendously. Not only are there computer controlled light microscopes with a wide variety of imaging modalities, but there are now many other methods of generating images of the very small. The electron microscope offers extremely high resolution, although it requires exposing the imaged sample to high vacuum and other harsh treatments. New technologies including the Atomic Force Microscope (AFM) and the Scanning Tunnelling Microscope (STM) offer the potential to create images of individual molecules. Biologists use these tools extensively.
Gel Electrophoresis. A charged molecule, when placed in an electric field, will be accelerated; positively charged molecules will move toward negative electrodes and vice versa. When a mixture of molecules of interest is placed in a medium and subjected to an electric field, the molecules migrate through the medium and separate from each other. How fast the molecules move depends on their charge and their size; bigger molecules see more resistance from the medium. The procedure, called electrophoresis, involves putting a spot of the mixture to be analyzed at the top of a polyacrylamide or agarose gel, and applying an electric field for a period of time. Then the gel is stained so that the molecules become visible; the stains appear as stripes along the gel, and are called bands. The location of each band on the gel depends on the charge and size of the molecules it contains (see Figure 4 for an example). The intensity of the stain is an indication of the amount of a particular molecule in the mixture. If the molecules are all the same charge, or have charge proportional to their size (as, for example, DNA does) then electrophoresis separates them purely by size.
Often, several mixtures are run simultaneously on a single gel. This allows for easy calibration to standards, or comparison of the contents of different mixtures, showing, for example, the absence of a particular molecular component in one. The adjacent, parallel runs are sometimes called lanes. A variation on this technique allows the sorting of molecules by a chemical property called the isoelectric point, which is related to the molecule's pK. A combination of the two methods, called 2D electrophoresis, is capable of very fine distinctions, for example, mapping each protein in a cell to a unique spot in two-space, the size of the spot indicating the amount of the protein. Although there are still some difficulties in calibration and repeatability, this method is potentially a very powerful tool for monitoring the activities of large biochemical systems. In addition, if a desired molecule can be separated from the mixture this way, individual spots or bands can be removed from the gel for further processing, in a procedure called blotting.
Figure 4. This is an example of a gel electrophoresis run. Each column was loaded with a different mixture. The mixtures are then separated vertically by their charge and size. The gel is then stained, producing dark bands where a molecule of a given size or charge is present in a mixture. In this gel, the columns alternate between – and +; the columns marked with a – are a control group. The band marked with an arrow is filled only in the + columns.
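The calibration to standards mentioned above can be sketched in code. For DNA in an agarose gel, migration distance is roughly linear in the logarithm of fragment size over a moderate range, so the size of an unknown band can be estimated by interpolating between the bands of a standards lane. This is a minimal sketch under that assumption; the ladder sizes and distances are invented for illustration, and `estimate_size` is a hypothetical helper, not part of any laboratory package.

```python
import math

# Hypothetical standards lane: (fragment size in bp, distance migrated in mm).
# These numbers are made up for illustration only.
LADDER = [(10000, 12.0), (5000, 20.0), (2000, 31.0), (1000, 39.0), (500, 47.0)]


def estimate_size(distance_mm, ladder=LADDER):
    """Estimate fragment size by interpolating log10(size) against distance."""
    pts = sorted(ladder, key=lambda p: p[1])  # order standards by distance
    for (s1, d1), (s2, d2) in zip(pts, pts[1:]):
        if d1 <= distance_mm <= d2:
            frac = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("band lies outside the range of the standards")


# A band halfway between the 2000 bp and 1000 bp standards.
print(round(estimate_size(35.0)))
```

Bands outside the range of the standards are rejected rather than extrapolated, since the log-linear approximation tends to break down for very large and very small fragments.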
Cloning. A group of cells with identical genomes are said to be clones of one another. Unless there are mutations, a single cell that reproduces asexually will produce identical offspring; these clones are sometimes called a cell line, and certain standardized cell lines, for example the HeLa cell line, play an important role in biological research.
This concept has been generalized to cloning individual genes. In this case, a piece of DNA containing a gene of interest is inserted into the genome of a target cell line, and the cells are screened so that all of the resulting cells have an identical copy of the desired genetic sequence. The DNA in these cells is said to be recombinant, and the cell will produce the protein coded for by the inserted gene.
Cloning a gene requires some sophisticated technology. In order for a cloned gene to be expressed, it must contain the appropriate transcription signals for the target cell line. One way biologists ensure that this will happen is to put the new gene into a bacteriophage (a virus that infects bacteria), or a plasmid (a circular piece of DNA found outside of the chromosome of bacteria that replicates independently of the bacteria’s chromosomal DNA). These devices for inserting foreign DNA into cells are called vectors.
In order to cut and paste desired DNA fragments into vectors, biologists use restriction enzymes, which cut DNA at precisely specified points. These enzymes are produced naturally by bacteria as a way of attacking foreign DNA. For example, the commonly used enzyme EcoRI (from E. coli) cuts DNA between the G and the A in the sequence GAATTC; these target sequences are called restriction sites. Everywhere a restriction site occurs in a DNA molecule treated with EcoRI, the DNA will be broken. Restriction enzymes play many roles in biology in addition to making gene cloning possible; a few others will be described below.
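For a computer scientist, a restriction digest is essentially a pattern match followed by a split. The sketch below locates EcoRI sites (GAATTC) in a DNA string and cuts between the G and the first A, as described above. It models only one strand of a linear molecule; the function names and the example sequence are assumptions made for illustration.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cuts between the G and the first A of its site


def find_cut_positions(dna, site=ECORI_SITE, offset=CUT_OFFSET):
    """Return the indices at which the strand is cut (the cut falls just before each index)."""
    dna = dna.upper()
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = dna.find(site, start + 1)
    return cuts


def digest(dna, cuts):
    """Split a linear DNA string at the given cut positions."""
    bounds = [0] + sorted(cuts) + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical sequence containing two EcoRI sites.
example = "AAGAATTCCCGGGAATTCTT"
print(digest(example, find_cut_positions(example)))
# prints ['AAG', 'AATTCCCGGG', 'AATTCTT']
```

Because GAATTC reads the same on both strands (its reverse complement is itself), the opposite strand is cut at the mirror-image position, which leaves short single-stranded overhangs that this one-strand sketch does not model.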
Both the insertion of the desired gene into the vector and the uptake of the vector by the target cells are effective only a fraction of the time. Fortunately, cells and vectors are small and it is relatively easy to grow a lot of them. The process is applied to a population of target cells, and then the resulting population is screened to identify the cells where the gene was successfully inserted. This can be difficult, so many vectors are designed to facilitate screening. One popular vector, pBR322, contains a naturally occurring transcription start signal and some antibiotic resistance genes, designed with conveniently placed restriction sites. If this vector is taken up by the target cells, it will make them resistant to certain antibiotics. By applying the antibiotic to the whole colony, the researcher can kill all the cells that did not get the cloned gene. More sophisticated manipulations involving multiple antibiotic resistances and carefully placed restriction sites can also be used to ensure that the gene was correctly taken up by the vector.
There are many variations on these techniques for inserting foreign genes. It is now possible to use simple bacteria to produce large amounts of almost any isolated protein, including, for example, human insulin. Although it is a more complex process, it is also possible to insert foreign genes into plants and animals, even people. A variety of efforts are underway to use these techniques to engineer organisms for agriculture, medicine and other applications. Not all of these applications are benign. One of the most successful early efforts was to increase the resistance of tobacco plants to pesticides, and there are clear military applications. On the other hand, these methods also promise new approaches to producing important rare biological compounds inexpensively (e.g. for novel cancer treatments or cleaning up toxic waste) and improving the nutritional value or hardiness of agricultural products. The entire field of genetic engineering is controversial, and there are a variety of controls on what experiments can be done and how they can be done.
Hybridization and Immunological Staining. Biological compounds can show remarkable specificity, for example, binding very selectively only to one particular compound. This ability plays an important role in the laboratory, where researchers can identify the presence or absence of a particular molecule (or even a region of a molecule) in vanishingly small amounts.
Antibodies are the molecules that the immune system uses to identify and fight off invaders. Antibodies are extremely specific, recognizing and binding to only one kind of molecule. Dyes can be attached to the antibody, forming a very specific system for identifying the presence (and possibly quantifying the amount) of a target molecule that is present in a system.
There is a conceptually related method for identifying very specifically the presence of a particular nucleotide sequence in a macromolecule. The complement to a single-stranded DNA sequence will bind quite specifically to that sequence. One technique measures how similar two related DNA sequences are by testing how strongly the single-stranded versions of the molecules stick to each other, or hybridize. The more easily they come apart, the more differences there are between their sequences. It is also possible to attach a dye or other marker to a specific piece of DNA (called a probe) and then hybridize it to a longer strand of DNA. The location along the strand that is complementary to the probe will then be marked. There are many variations on hybridization and immunological staining that are customized to the needs of a particular experiment.
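Hybridization has a simple string-processing analogue. The sketch below computes the reverse complement of a strand, finds where a probe would pair perfectly along a longer strand, and counts mismatched positions between two strands as a very crude stand-in for how easily they would come apart. Real hybridization strength depends on base composition, temperature and salt concentration, none of which is modeled here; all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with seq, written in the conventional direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_sites(target, probe):
    """Positions on the target strand where the probe would pair perfectly."""
    site = reverse_complement(probe)
    target = target.upper()
    width = len(site)
    return [i for i in range(len(target) - width + 1) if target[i:i + width] == site]


def mismatch_count(strand_a, strand_b):
    """Count positions at which strand_b fails to pair with strand_a (equal lengths assumed)."""
    partner = reverse_complement(strand_b)
    if len(partner) != len(strand_a):
        raise ValueError("strands must have the same length")
    return sum(x != y for x, y in zip(strand_a.upper(), partner))


print(probe_sites("CCGATTACAGG", "TGTAATC"))   # prints [2]
print(mismatch_count("GATTACA", "TGTTATC"))    # prints 1
```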
Gene Mapping and Sequencing. The Human Genome Project is the effort to produce a map and then the sequence of the human genome. The purpose of a genetic map is to identify the location and size of all of the genes of an organism on its chromosomes. This information is important for a variety of reasons. First, because crossover is an important component of inheritance in sexually reproducing organisms, genes that are near each other on the chromosome will tend to be inherited together. In fact, this forms the basis for linkage analysis, which is a technique that looks at the relationships between genes (or phenotypes) in large numbers of matings (in this context, often called crosses) to identify which genes tend to be inherited together, and are therefore likely to be near each other. Second, it is possible to clone genes of known locations, opening up a wide range of possible experimental manipulations. Finally, it is currently possible to determine the sequence of moderate size pieces of DNA, so if an important gene has been mapped, it is possible to find the sequence of that area, and discover the protein that is responsible for the genetic characteristic. This is especially important for understanding the basis of inherited diseases.
The existence of several different kinds of restriction enzymes makes possible a molecular method of creating genetic maps. The application of each restriction enzyme (the process is called a digest) creates a different collection of restriction fragments (the cut up pieces of DNA). By using gel electrophoresis, it is possible to determine the size of these fragments. Using multiple enzymes, together and separately, results in sets of fragments which can be (partially) ordered with respect to each other, resulting in a genetic map. AI techniques for reasoning about partial orders have been effectively applied to the problem of assembling the fragments into a map (Letovsky & Berlyn, 1992). These physical maps divide a large piece of DNA (like a chromosome) into parts, and there is an associated method for obtaining any desired part.
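To see what data a map-building program has to work with, it helps to run the experiment forward. The sketch below takes a hypothetical 10,000 bp linear molecule with invented cut sites for two enzymes, A and B, and computes the fragment sizes that each single digest and the double digest would produce. Mapping is the inverse problem: only the sorted sizes are observed, and the site positions must be inferred from them.

```python
def fragment_sizes(length, cut_sites):
    """Sorted fragment sizes from cutting a linear molecule at the given sites."""
    bounds = [0] + sorted(set(cut_sites)) + [length]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


# Hypothetical molecule and cut sites, chosen only for illustration.
LENGTH = 10000
SITES_A = [2000, 7000]
SITES_B = [4500]

digest_a = fragment_sizes(LENGTH, SITES_A)             # enzyme A alone
digest_b = fragment_sizes(LENGTH, SITES_B)             # enzyme B alone
digest_ab = fragment_sizes(LENGTH, SITES_A + SITES_B)  # both enzymes together

print(digest_a)   # prints [2000, 3000, 5000]
print(digest_b)   # prints [4500, 5500]
print(digest_ab)  # prints [2000, 2500, 2500, 3000]
```

Comparing the lists shows that the 5000 bp fragment from enzyme A disappears in the double digest, so the B site must lie inside it. This is the kind of partial-order constraint a mapping program reasons about. The two 2500 bp fragments would appear as a single band on a gel, a small instance of why fragments of similar size make maps ambiguous.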
Restriction fragment mapping becomes problematic when applied to large stretches of DNA, because the enzymes can produce many pieces of about the same size, making the map ambiguous. The use of different enzymes can help address this problem to a limited degree, but a variety of other techniques are now also used.
Being able to divide the genome into moderate-sized chunks is a prerequisite to determining its sequence. Although there are several clever methods for determining the sequence of a DNA molecule, all of them are limited to a resolution of well under a thousand base pairs at a time. In order to take this sequencing ability and determine the sequence of large pieces of DNA, many different overlapping chunks must be sequenced, and then these sequences must be assembled. In order to accomplish this task, it is necessary to break the DNA in an entire genome down into a set of more manageably sized pieces. The ordering of these pieces must be known (so they can be reassembled into a complete sequence), taken together the pieces must cover the entire genome, and the same set of pieces must be accessible to many different laboratories. This process is usually accomplished in several stages. The first stage generates relatively large pieces called contigs. Contigs are maintained in cloned cell lines so that they can be reproduced and distributed. Often, these pieces of DNA are made into yeast artificial chromosomes, or YACs, which can hold up to about a million base pairs of sequence each, requiring on the order of 10,000 clones to adequately cover the entire human genome. Each of these is then broken down into sets of smaller pieces, often in the form of cosmids. A cosmid is a vector derived from a bacteriophage (a virus that infects bacteria) that is capable of accepting inserts of 30,000 or so base pairs. The difficulties in generating and maintaining collections of clones that large have led to alternative technologies for large scale sequencing.
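The assembly step described above, in which overlapping sequenced chunks are merged into one longer sequence, can be illustrated with a toy greedy algorithm: repeatedly merge the pair of fragments with the longest suffix-to-prefix overlap. This is a sketch of the general idea, not the method used by any particular sequencing project. It assumes error-free reads from a single strand and no long repeats, and the function names and example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise by largest overlap until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # the remaining fragments do not overlap
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three hypothetical overlapping reads.
print(greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"]))
# prints ['ATGGCGTACGTTAGC']
```

Greedy merging can go wrong when a genome contains repeated sequences longer than the overlaps, which is one reason the ordering of the pieces needs to be known independently.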
One alternative involves a new technology based on the polymerase chain reaction, or PCR. This mechanism was revolutionary because it made it possible to rapidly produce huge amounts of a specific region of DNA, simply by knowing a little bit of the sequence around the desired region. PCR exponentially amplifies (makes copies of) a segment of a DNA molecule, given a unique pair of sequences that bracket the desired piece. First, short sequences of DNA (called oligonucleotides, or oligos) complementary to each of the bracketing sequences are synthesized. Creating short pieces of DNA with a specific sequence is routine technology, now often performed by laboratory robots. These pieces are called primers. The primers, the target DNA and the enzyme DNA polymerase are then combined. The mixture is heated, so that the hydrogen bonds in the DNA break and the molecule splits into two single strands. When the mixture cools sufficiently, the primers bond to the regions around the area of interest, and the DNA polymerase replicates the DNA downstream of the primers. By using a heat-resistant polymerase from a microorganism that lives at high temperatures, it is possible to rapidly cycle this process, doubling the amount of the desired segment of DNA each time. This technology makes possible the exponential amplification of entire DNA molecules or any specific region of DNA for which bracketing primers can be generated.*
*There are many interesting uses of this technology. For example, it gives law enforcement the ability to generate enough DNA for identification from vanishingly small samples of tissue. A more amusing application is the rumored use of PCR to spy on what academic competitors are doing in their research. Almost any correspondence from a competitor’s lab will contain traces of DNA which can be amplified by PCR to identify the specific clones the lab is working with.
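The logic of PCR can be mimicked on strings. The sketch below finds the segment of a template strand that is bracketed by a pair of primers, and computes how many copies an idealized reaction would yield after a given number of cycles. It assumes perfect primer matching and, by default, perfect doubling in every cycle; real reactions fall short of both. The function names, the `efficiency` parameter and the example sequences are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with seq, written in the conventional direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def amplicon(template, forward_primer, reverse_primer):
    """Return the region of the template bracketed by the two primers, or None."""
    template = template.upper()
    forward = forward_primer.upper()
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer pairs with the template strand, so on the template
    # it appears as the primer's reverse complement.
    far_site = reverse_complement(reverse_primer)
    end = template.find(far_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(far_site)]


def copies_after(cycles, initial=1, efficiency=1.0):
    """Copies after some cycles; efficiency=1.0 means perfect doubling per cycle."""
    return initial * (1 + efficiency) ** cycles


# Hypothetical template and primers.
template = "TTACGTGGCATCATCCCGGAAA"
print(amplicon(template, "ACGTGG", "TCCGGG"))  # prints ACGTGGCATCATCCCGGA
print(copies_after(30))                        # about a billion copies from one molecule
```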
In order to use PCR for genome mapping and sequencing, a collection of unique (short) sequences spread throughout the genome must be identified for use as primers. The sequences must be unique in the genome so that the source of amplified DNA is unambiguous, and they have to be relatively short so that they are easy to synthesize. The sites in the genome that correspond to these sequences are called sequence tagged sites or STSs. The more STSs that are known, the finer grained the map of the genome they provide. Finding short, unique sequences even in 3×10^9 bp of DNA is not that difficult; a simple calculation shows that most sequences of length 16 or so can reasonably be expected to be unique in a genome of that size. An early goal of the Human Genome Project is to generate a list of STSs spaced at approximately 100kbp intervals over the entire human genome. If it is possible to find STSs that adequately cover the genome, it will not be necessary to build and maintain libraries of 10,000 YACs and ten times as many cosmids. Any region of DNA of interest can be identified by two STSs that bracket it. Instead of having to maintain large clone collections, these STSs can be stored in a database, and any researcher who needs a particular section of DNA can synthesize the appropriate primers and use PCR to produce many copies of just that section.
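The simple calculation mentioned above can be written out. If the four bases are treated as equally likely and independent (an assumption that real genomes, with their repeats and skewed composition, violate), there are 4^k possible sequences of length k, so a given sequence is expected to match by chance at about G / 4^k positions in a genome of G base pairs. The sketch below tabulates this for a few lengths, using a Poisson approximation for the chance of no chance match at all; the function names are hypothetical and only one strand is considered.

```python
import math

GENOME_SIZE = 3e9  # approximate size of the human genome in base pairs


def expected_chance_matches(k, genome_size=GENOME_SIZE):
    """Expected number of positions matching a given k-mer in a random genome."""
    return genome_size / 4 ** k


def prob_unique(k, genome_size=GENOME_SIZE):
    """Poisson approximation to the chance that a k-mer has no chance match elsewhere."""
    return math.exp(-expected_chance_matches(k, genome_size))


for k in (12, 14, 16, 18, 20):
    print(k, round(expected_chance_matches(k), 4), round(prob_unique(k), 4))
```

Under these assumptions the expected number of chance matches drops below one at a length of 16 (about 0.7) and below 0.05 at a length of 18, which is why sequences "of length 16 or so" are about where uniqueness becomes a reasonable expectation.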
Another issue that has been raised about the project to sequence the genome is the need to know the sequences of all of the introns and other non-coding regions of DNA. One way to address this issue is to target only coding regions for sequencing. The ability to find the sequences that a particular cell is using to produce proteins at a particular point in time is also useful in a variety of other areas. This information can be gleaned by gathering the mRNAs present in the cytoplasm of the cell, and sequencing them. Instead of sequencing the mRNAs directly, biologists use an enzyme called reverse transcriptase to make DNA molecules complementary to the mRNAs (called cDNAs) and then sequence that DNA. Using PCR and other technology, it is possible to capture at least portions of most of the mRNAs a cell is producing. By sequencing these cDNAs, researchers can focus their attention on just the parts of the genome that code for expressed proteins.
Large scale attempts to sequence at least part of all of the cDNAs that can be produced from brain tissue have resulted in partial sequences for more than 2500 new proteins in a very short period of time (Adams, et al, 1992). These sequences, called ESTs (for expressed sequence tags), can be used as PCR primers for future, more detailed experiments. This work has created controversy because of the ensuing attempt by the National Institutes of Health to patent the EST sequences.
Crystallography and NMR. Until the relationship between protein sequence and structure is more fully understood, the sequences produced by genome projects will provide only part of the biochemical story. Additional information about protein structure is necessary to understand how the proteins function. This structural information is at present gathered primarily by X-ray crystallography. In order to determine the structure of a protein in this manner, a very large, pure crystal of the protein must be grown (this process can take years, and may never succeed for certain proteins). Then the X-ray diffraction pattern of the crystal is measured, and this information can be used indirectly to determine the positions of the atoms in the molecule. Glasgow, et al’s chapter in this volume describes this process in more detail. Because of the difficulties in crystallography, relatively few structures are known, but the number of new structures is growing exponentially, with a doubling time of a bit over two years.
A promising alternative to crystallography for determining protein structure is multi-dimensional nuclear magnetic resonance, or NMR. Although this process does not require the crystallization of the protein, there are technical difficulties in analyzing the data associated with large molecules like proteins. Edwards, et al’s chapter in this volume describes some of the challenges. Both crystallography and NMR techniques result in static protein structures, which are to some degree misleading. Proteins are flexible, and the patterns of their movement are likely to play an important role in their function. Although NMR has the potential to provide information about this facet of protein activity, there is very little data available currently.
Computational Biology
In the last five years, biologists have come to understand that sharing the results of experiments now takes more than simple journal publication. In the 1980s, many journals were overwhelmed with papers reporting novel sequences and other biological data. Paper publications of sequences are hard to analyze, prone to typographical errors, and take up valuable journal space. Databases were established, journals began to require deposition into the databases before publication, and various tools began to appear for managing and analyzing the databases.
When Doolittle, et al (1983) used the nascent genetic sequence database to prove that a cancer-causing gene was a close relative of a normal growth factor, molecular biology labs all over the world began installing computers or linking up to networks to do database searches. Since then, a bewildering variety of computational resources for biology have arisen. These databases and other resources are a valuable service not only to the biological community, but also to the computer scientist in search of domain information.
There is a database of databases listing these resources, which is maintained at Los Alamos National Laboratory. It is called LiMB (Lawton, Burks & Martinez, 1989), and contains descriptions, contacts and access methods for more than 100 molecular biology related databases. It is a very valuable tool for tracking down information. Another general source for databases and information about them is the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine. Many databases are available via anonymous ftp from the NCBI server, ncbi.nlm.nih.gov.*
*Researchers without internet access can contact NCBI by writing to NCBI/National Library of Medicine/Bethesda, MD 20894 USA or calling +1 (301) 496-2475.
A few of the databases that may be of particular interest to computer scientists are described here. There are several databases that maintain genetic sequences, and they are increasingly coordinated. They are GenBank (Moore, Benton & Burks, 1990), the European Molecular Biology Laboratory nucleotide sequence database (EMBL) (Hamm & Cameron, 1986), and the DNA Data Bank of Japan (DDBJ) (Miyazawa, 1990). NCBI will also provide a sequence database beginning in 1992. The main protein sequence database is the Protein Identification Resource (PIR) (George, Barker & Hunt, 1986). NCBI also provides a non-redundant combination of protein sequences from various sources (including translations of genetic sequences) in its NRDB.
Several databases contain information about three-dimensional structures of molecules. The Protein Data Bank (PDB), maintained by Brookhaven National Laboratory, contains protein structure data, primarily from crystallographic data. BioMagRes (BMR) is a database of NMR-derived data about proteins, including three-dimensional coordinates, that is maintained at the University of Wisconsin, Madison (Ulrich, Markley & Kyogoku, 1989). CARBBANK contains structural information for complex carbohydrates (Doubet, Bock, Smith, Albersheim & Darvill, 1989). Chemical Abstracts Service (CAS) Online Registry File is a commercial database that contains more than 10 million chemical substances, many with three-dimensional coordinates and other useful information. The Cambridge Structural Database contains small molecule structures, and is available to researchers at moderate charge.
Genetic map databases (GDB), as well as a database of inherited human diseases and characteristics (OMIM) are maintained at the Welch Medical
There is a database of information about compounds involved in intermediary metabolism called CompoundKB, developed by Peter Karp, that is available from NCBI. This database is available in KEE knowledge base form as well as several others, and there is associated LISP code which makes it attractive for artificial intelligence researchers; see Karp’s and Mavrovouniotis’s chapters in this volume for possible applications of the knowledge base.
Finally, one of the most important computer-based assets for a computer scientist interested in molecular biology information is the bulletin board system called bionet. This bboard is available through usenet as well as by electronic mail. The discussion groups include computational biology, information theory and software, as well as more than 40 other areas. Bionet is an excellent source for information and contacts with computationally sophisticated biologists.
Conclusion
AI researchers have often had unusual relationships with their collaborators. “Experts” are somehow “knowledge engineered” so that what they know can be put into programs. Biology has a long history of collaborative research, and it does not match this AI model. Computer scientists and biologists often have differing expectations about collaboration, education, conferences and many other seemingly mundane aspects of research. In order to work with biologists, AI researchers must understand a good deal about the domain and find ways to bridge the gap between these rather different science cultures.
This brief survey of biology is intended to help the computer scientist get oriented and understand some of the commonly used terms in the domain. Many more detailed, but still accessible books are listed in the references. I find this material fascinating. Not only is it interesting as a domain for AI research, but it provides a rich set of metaphors for thinking about intelligence: genetic algorithms, neural networks and Darwinian automata are but a few of the computational approaches to behavior based on biological ideas. There will, no doubt, be many more.