History of Bioinformatics

November 5, 2018 By admin
Origin & History of Bioinformatics:

By 1981, 579 human genes had been mapped, and mapping by in situ hybridization had become a standard method. Marvin Caruthers and Leroy Hood made a huge leap in bioinformatics when they invented a method for automated DNA sequencing.

In 1988, the Human Genome Organization (HUGO) was founded. This is an international organization of scientists involved in the Human Genome Project. In 1989, the first complete genome map of the bacterium Haemophilus influenzae was published.

The following year, the Human Genome Project was started. By 1991, a total of 1,879 human genes had been mapped. In 1993, Genethon, a human genome research center in France, produced a physical map of the human genome. Three years later, Genethon published the final version of the Human Genetic Map. This marked the end of the first phase of the Human Genome Project.

Bioinformatics was fuelled by the need to create huge databases, such as GenBank, EMBL and the DNA Data Bank of Japan, to store and compare the DNA sequence data erupting from the human genome and other genome sequencing projects.
Today, bioinformatics embraces protein structure analysis, gene and protein functional information, data from patients, pre-clinical and clinical trials, and the metabolic pathways of numerous species.

Origin of bioinformatic/biological databases:
The first bioinformatic/biological databases were constructed a few years after the first protein sequences began to become available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database.
The Protein Data Bank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISS-PROT protein sequence database began in 1987. A huge variety of divergent data resources of different types and sizes is now available, either in the public domain or, more recently, from commercial third parties. All of the original databases were organised in a very simple way, with data entries being stored in flat files, either one per entry or as a single large text file. Later on, lookup indexes were added to allow convenient keyword searching of header information.
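The flat-file-plus-index arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-letter line codes, the "//" record terminator, the two records and the function names build_index, fetch and search are all made up for the example and do not reproduce the format of any particular historical database.

```python
import io
from collections import defaultdict

# Made-up records in a hypothetical flat-file layout: two-letter line codes,
# with "//" closing each entry. Names and sequences are placeholders.
FLAT_FILE = (
    "ID   EXAMPLE_1\n"
    "DE   Hypothetical kinase fragment\n"
    "SQ   MKTAYIAK\n"
    "//\n"
    "ID   EXAMPLE_2\n"
    "DE   Hypothetical transfer RNA\n"
    "SQ   GGGCGUGU\n"
    "//\n"
)


def build_index(handle):
    """Map each lower-cased header word to the file offsets of its records."""
    index = defaultdict(set)
    record_start = handle.tell()
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith("//"):
            # The next record begins right after the terminator line.
            record_start = handle.tell()
        elif line[:2] in ("ID", "DE"):
            # Only header lines are indexed, not the sequence itself.
            for word in line[2:].replace(",", " ").split():
                index[word.lower()].add(record_start)
    return index


def fetch(handle, offset):
    """Read one record starting at the given offset, up to its terminator."""
    handle.seek(offset)
    lines = []
    while True:
        line = handle.readline()
        if not line or line.startswith("//"):
            break
        lines.append(line.rstrip("\n"))
    return lines


def search(handle, index, keyword):
    """Return every record whose header contains the keyword."""
    offsets = sorted(index.get(keyword.lower(), ()))
    return [fetch(handle, offset) for offset in offsets]


handle = io.StringIO(FLAT_FILE)
index = build_index(handle)
for record in search(handle, index, "kinase"):
    print("\n".join(record))
```

The index stores only offsets, so the flat file itself stays untouched and a keyword lookup jumps straight to the matching entries instead of scanning the whole file.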

Origin of tools: 
After the formation of the databases, tools became available to search sequence databases, at first in a very simple way, looking for keyword matches and short sequence words, and then with more sophisticated pattern-matching and alignment-based methods. The rapid but less rigorous BLAST algorithm has been the mainstay of sequence database searching since its introduction a decade ago, complemented by the more rigorous and slower FASTA and Smith-Waterman algorithms. Suites of analysis algorithms, written by leading academic researchers at Stanford, CA, Cambridge, UK and Madison, WI for their in-house projects, began to become more widely available for basic sequence analysis. These algorithms were typically single-function black boxes that took input and produced output in the form of formatted files. UNIX-style commands were used to operate the algorithms, with some suites having hundreds of possible commands, each taking different command options and input formats.
Since these early efforts, significant advances have been made in automating the collection of sequence information. Rapid innovation in biochemistry and instrumentation has brought us to the point where the entire genomic sequences of at least 20 organisms, mainly microbial pathogens, are known, and projects to elucidate at least 100 more prokaryotic and eukaryotic genomes are currently under way. Groups are now even competing to finish the sequence of the entire human genome. With new technologies we can directly examine the changes in expression levels of both mRNA and proteins in living cells, either in a disease state or following an external challenge. We can go on to identify patterns of response in cells that lead us to an understanding of the mechanism of action of an agent on a tissue.
The volume of data arising from projects of this nature is unprecedented in the pharma industry, and will have a profound effect on the ways in which data are used and experiments performed in drug discovery and development projects. This is true not least because, with much of the available interesting data being in the hands of commercial genomics companies, pharma companies are unable to get exclusive access to many gene sequences or their expression profiles. The competition between co-licensees of a genomic database is effectively a race to establish a mechanistic role or other utility for a gene in a disease state in order to secure a patent position on that gene. Much of this work is carried out by informatics tools.
Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly larger volume of data held in public, private and commercial databases, the tools used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago. Many are simple extensions of the original academic systems, which have served the needs of both academic and commercial users for many years. These systems are now beginning to fall behind as they struggle to keep up with the pace of change in the pharma industry. Databases are still gathered, organised, disseminated and searched using flat files. Relational databases are still few and far between, and object-relational or fully object-oriented systems are rarer still in mainstream applications. Interfaces still rely on command lines, fat client interfaces that must be installed on every desktop, or HTML/CGI forms. Whilst the tools were in the hands of bioinformatics specialists, pharma companies were relatively undemanding of them. Now that the problems have expanded to cover the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharma R&D informatics requirements.
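As a concrete illustration of the rigorous alignment methods named in this section, the following is a minimal Python sketch of Smith-Waterman local alignment. It rests on assumptions made for the example: the scores (match +2, mismatch -1, gap -2) are arbitrary, a simple linear gap penalty is used, and the function name and test sequences are invented. It is not the implementation used by any of the suites mentioned above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best local alignment score ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at zero is what makes the alignment local.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTACGTTT", "GGACGGG"))
```

Filling the full score matrix costs time proportional to the product of the two sequence lengths, which is why the text contrasts this kind of search with faster, less rigorous heuristics such as BLAST.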

There are different views of the origin of bioinformatics. From T K Attwood and D J Parry-Smith’s “Introduction to Bioinformatics”, Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: “The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data.”
From Mark S. Boguski’s article in the “Trends Guide to Bioinformatics”, Elsevier, Trends Supplement 1998, p1: “The term “bioinformatics” is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing. The National Center for Biotechnology Information (NCBI) is celebrating its 10th anniversary this year, having been written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So bioinformatics has, in fact, been in existence for more than 30 years and is now middle-aged.”

A Chronological History of events:

1951 – Pauling and Corey propose the structure for the alpha-helix and beta-sheet (Proc. Natl. Acad. Sci. USA, 37: 205-211, 1951; Proc. Natl. Acad. Sci. USA, 37: 729-740, 1951).
1953 – Watson & Crick propose the double helix model for DNA based on X-ray data obtained by Franklin & Wilkins (Nature, 171: 737-738, 1953).
1954 – Perutz’s group develops heavy atom methods to solve the phase problem in protein crystallography.
1955 – The sequence of the first protein to be analysed, bovine insulin, is announced by F. Sanger.
1958 – The first integrated circuit is constructed by Jack Kilby at Texas Instruments.
The Advanced Research Projects Agency (ARPA) is formed in the US
1962 – Pauling’s theory of molecular evolution
1965 – Margaret Dayhoff’s Atlas of Protein Sequences
1968 – Packet-switching network protocols are presented to ARPA
1969 – The ARPANET is created by linking computers at Stanford, UCSB, The University of Utah and UCLA.
1970 – The details of the Needleman-Wunsch algorithm for sequence comparison are published.
1971 – Ray Tomlinson (BBN) invents the email program.
1972 – The first recombinant DNA molecule is created by Paul Berg and his group.
1973 – The Brookhaven Protein DataBank is announced (Acta Cryst. B, 1973, 29: 1764). Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes Ethernet.
1974 – Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an “internet” and develop the Transmission Control Protocol (TCP).
1975 – Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide gel is combined with separation according to isoelectric points, is announced by P. H. O’Farrell (J. Biol. Chem., 250: 4007-4021, 1975).
1976 – The Unix-To-Unix Copy Protocol (UUCP) is developed at Bell Labs.
E. M. Southern publishes the experimental details for the Southern blot technique for detecting specific sequences of DNA (J. Mol. Biol., 98: 503-517, 1975).
1977 – The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is published (Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M.; J. Mol. Biol., 1977, 112: 535).
Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) report methods for sequencing DNA.
DNA sequencing and software to analyze it ( Staden )
1978 – The first Usenet connection is established between Duke and the University of North Carolina at Chapel Hill by Tom Truscott, Jim Ellis and Steve Bellovin.
1980 – The first complete genome sequence for an organism (ΦX174) is published. The genome consists of 5,386 base pairs, which code for nine proteins.
Wüthrich et al. publish a paper detailing the use of multi-dimensional NMR for protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys. Res. Comm., 1980, 95: 1).
IntelliGenetics, Inc. founded in California. Their primary product is the IntelliGenetics Suite of programs for DNA and protein sequence analysis.
1981 – The Smith-Waterman algorithm for sequence alignment is published.
IBM introduces its Personal Computer to the market.
The concept of a sequence motif ( Doolittle )
1982 – Genetics Computer Group (GCG) created as a part of the University of Wisconsin Biotechnology Center. The company’s primary product is The Wisconsin Suite of molecular biology tools.
GenBank Release 3 made public
Phage lambda genome sequenced
1983 – The Compact Disk (CD) is launched. Name servers are developed at the University of Wisconsin.
Sequence database searching algorithm ( Wilbur-Lipman )
LANL (Los Alamos National Laboratory) and LLNL (Lawrence Livermore National Laboratory) begin production of DNA clone (cosmid) libraries representing single chromosomes.
DNA analysis becomes viable with the invention of the polymerase chain reaction (PCR). It allows small samples of DNA to be multiplied to produce a sample large enough to analyse.
1984 – Jon Postel’s Domain Name System (DNS) is placed on-line.
The Macintosh is announced by Apple Computer.
1985 – The FASTP/FASTN algorithm is published.
Robert Sinsheimer holds a meeting on human genome sequencing at the University of California, Santa Cruz.
At OHER, Charles DeLisi and David A. Smith commission the first Santa Fe conference to assess the feasibility of a Human Genome Initiative
1986 – Following the Santa Fe conference, DOE OHER announces Human Genome Initiative. With $5.3 million, pilot projects begin at DOE national laboratories to develop critical resources and technologies.
The term “Genomics” appeared for the first time to describe the scientific discipline of mapping, sequencing, and analyzing genes. The term was coined by Thomas Roderick as a name for the new journal.
Amoco Technology Corporation acquires IntelliGenetics.
The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL).
The PCR reaction is described by Kary Mullis and co-workers.
1987 – The use of yeast artificial chromosomes (YAC) is described (David T. Burke et al., Science, 236: 806-812).
The physical map of E. coli is published (Y. Kohara et al., Cell 51: 319-337).
Perl (Practical Extraction and Report Language) is released by Larry Wall.
Congressionally chartered DOE advisory committee, HERAC, recommends a 15-year, multidisciplinary, scientific, and technological undertaking to map and sequence the human genome. DOE designates multidisciplinary human genome centers.
NIH NIGMS begins funding of genome projects
1988 – National Center for Biotechnology Information (NCBI) created at NIH/NLM
EMBnet network for database distribution
The Human Genome Initiative is started (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988).
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
A new program, an Internet computer virus designed by a student, infects 6,000 military computers in the US.
Reports by congressional OTA and NAS NRC committees recommend concerted genome research program.
HUGO founded by scientists to coordinate efforts internationally
First annual Cold Spring Harbor Laboratory meeting on human genome mapping and sequencing.
DOE and NIH sign MOU outlining plans for cooperation on genome research.
Telomere (chromosome end) sequence having implications for aging and cancer research is identified at LANL
1989 – The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) founded in the UK by Anthony Marchington, David Ricketts, James Hiddleston, Anthony Rees, and W. Graham Richards. Primary products: Anaconda, Asp, Cameleon and others (molecular modeling, drug design, protein design).
DNA STSs recommended to correlate diverse types of DNA clones.
DOE and NIH establish Joint ELSI Working Group
1990 – The BLAST program (Altschul et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Their primary products are Look and SegMod, which are used for molecular modeling and protein design.
InforMax is founded in Bethesda, MD. The company’s products address sequence analysis, database and data management, searching, publication graphics, clone construction, mapping and primer design.
DOE and NIH present joint 5-year U.S. HGP plan to Congress. The 15-year project formally begins.
Projects begun to mark gene sites on chromosome maps as sites of mRNA expression.
Research and development begun for efficient production of more stable, large-insert BACs
1991 – The research institute in Geneva (CERN) announces the creation of the protocols which make up the World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described.
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California, is formed.
Myriad Genetics, Inc. is founded in Utah. The company’s goal is to lead in the discovery of major common human disease genes and their related pathways. The company has discovered and sequenced, with its academic collaborators, the following major genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2, CtIP, p16, p19 and MTS2.
Human chromosome mapping data repository, GDB, established
1992 – Low-resolution genetic linkage map of entire human genome published.
Guidelines for data release and resource sharing announced by DOE and NIH
1993 – The Sanger Centre is established in Hinxton, UK.
CuraGen Corporation is formed in New Haven, CT.
Affymetrix begins independent operations in Santa Clara, California.
International IMAGE Consortium established to coordinate efficient mapping and sequencing of gene-representing cDNAs.
DOE-NIH ELSI Working Group’s Task Force on Genetic and Insurance Information releases recommendations.
DOE and NIH revise 5-year goals [Science 262, 43-46 (Oct. 1, 1993)]
IOM releases U.S. HGP-funded report, “Assessing Genetic Risks.”
LBNL implements novel transposon-mediated chromosome-sequencing system.
GRAIL sequence-interpretation service provides Internet access at ORNL
1994 – Netscape Communications Corporation founded and releases Navigator, the commercial version of NCSA’s Mosaic.
Gene Logic is formed in Maryland.
The PRINTS database of protein motifs is published by Attwood and Beck.
Oxford Molecular Group acquires IntelliGenetics.
The EMBL European Bioinformatics Institute is established in Hinxton, UK.
Genetic-mapping 5-year goal achieved 1 year ahead of schedule.
Completion of second-generation DNA clone libraries representing each human chromosome by LLNL and LBNL
1995 – The Haemophilus influenzae genome (1.8 Mb) is sequenced (Fleischmann et al., Science 269: 496-512 (1995)).
LANL and LLNL announce high-resolution physical maps of chromosome 16 and chromosome 19, respectively
The Mycoplasma genitalium genome is sequenced (Fraser et al., Science 270: 397-403 (1995)).
Moderate-resolution maps of chromosomes 3, 11, 12, and 22 published.
Physical map with over 15,000 STS markers published.
First (nonviral) whole genome sequenced (for the bacterium Haemophilus influenzae).
Sequence of smallest bacterium, Mycoplasma genitalium, completed; provides a model of the minimum number of genes needed for independent existence
1996 – The genome for Saccharomyces cerevisiae (baker’s yeast, 12.1 Mb) is sequenced.
The PROSITE database is reported by Bairoch et al.
Methanococcus jannaschii genome sequenced; confirms existence of third major branch of life on earth.
DOE initiates 6 pilot projects on BAC end sequencing.
Saccharomyces cerevisiae (yeast) genome sequence completed by international consortium
Affymetrix produces the first commercial DNA chips.
Sequence of the human T-cell receptor region completed
1997 – The genome for E. coli (4.7 Mbp) is published.
Oxford Molecular Group acquires the Genetics Computer Group.
LION bioscience AG founded as an integrated genomics company with a strong focus on bioinformatics. The company is built from IP out of the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), the German Cancer Research Center (DKFZ), and the University of Heidelberg.
Paradigm Genetics Inc., a company focussed on the application of genomic technologies to enhance worldwide food and fiber production, is founded in Research Triangle Park, NC.
deCODE genetics publishes a paper describing the location of the FET1 gene, which is responsible for familial essential tremor, on chromosome 13 (Nature Genetics).
NIH NCHGR becomes National Human Genome Research Institute (NHGRI).
Second large-scale sequencing strategy meeting held in Bermuda
High-resolution physical maps of chromosomes X and 7 completed.
DOE-NIH Task Force on Genetic Testing releases final report and recommendations.
DOE forms Joint Genome Institute for implementing high-throughput activities at DOE human genome centers, initially in sequencing and functional genomics
1998 – The genomes for Caenorhabditis elegans and baker’s yeast are published.
The Swiss Institute of Bioinformatics is established as a non-profit foundation.
Craig Venter forms Celera in Rockville, Maryland.
PE Informatics was formed as a Center of Excellence within PE Biosystems. This center brings together and leverages the complementary expertise of PE Nelson and Molecular Informatics, to further complement the genetic instrumentation expertise of Applied Biosystems.
Inpharmatica, a new Genomics and Bioinformatics company, is established by University College London, the Wolfson Institute for Biomedical Research, five leading scientists from major British academic centres and Unibio Limited.
GeneFormatics, a company dedicated to the analysis and prediction of protein structure and function, is formed in San Diego.
Molecular Simulations Inc. is acquired by Pharmacopeia.
1999 – deCode genetics maps the gene linked to pre-eclampsia as a locus on chromosome 2p13.
First Human Chromosome Completely Sequenced! On December 1, researchers in the Human Genome Project announced the complete sequencing of the DNA making up human chromosome 22.
Joint Genome Institute sequencing facility opens in Walnut Creek, CA.
Major Drug Firms Create Public SNP Consortium.
HGP advances goal for obtaining a draft sequence of the entire human genome from 2001 to 2000
2000 – The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
The A. thaliana genome (100 Mb) is sequenced.
The D. melanogaster genome (180 Mb) is sequenced.
Pharmacopeia acquires Oxford Molecular Group.
HGP leaders and President Clinton announce the completion of a “working draft” DNA sequence of the human genome.
International research consortium publishes chromosome 21 genome, the smallest human chromosome and the second to be completely sequenced.
DOE researchers announce completion of chromosomes 5, 16, and 19 draft sequence.
International collaborators publish genome of fruit fly Drosophila melanogaster
2001 – The human genome (3,000 Mbp) is published.
Human Chromosome 20 Finished – Chromosome 20 is the third chromosome completely sequenced to the high quality specified by the Human Genome Project
2002 – Structural Bioinformatics and GeneFormatics merge
An international sequencing consortium published the full genome sequence of the common house mouse (2.5 Gb). Whitehead Institute researcher Kerstin Lindblad-Toh is the lead author on the paper; her institution led the project and contributed about half of the sequence. Washington University School of Medicine delivered about 30 percent of the sequence, and created the mouse BAC-based physical map. The Wellcome Trust Sanger Institute in the UK was the third major partner. Other institutes in the International Mouse Genome Sequencing Consortium included the University of California at Santa Cruz, the Institute for Systems Biology, and the University of Geneva.
Mouse Genome Sequencing Consortium publishes its draft sequence of mouse genome in the December 5, 2002, issue of Nature
International consortium led by the DOE Joint Genome Institute publishes draft sequence of Fugu rubripes.
2003 – Human Genome Project Completion, April 2003.
Human Chromosome 14 Finished – Chromosome 14 is the fourth chromosome to be completely sequenced
2004 – The draft genome sequence of the Brown Norway laboratory rat, Rattus norvegicus, was completed by the Rat Genome Sequencing Project Consortium.
