Biological Databases: Past, Present and Future
June 18, 2019
The Database Industry
Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. Public databases and the data services that support them are important resources in bioinformatics, and will soon be essential sources of information for all the molecular biosciences. However, successful public data services suffer from continually escalating demands from the biological community. It is important to realize from the outset that the databases will never completely satisfy a very large percentage of the user community. The range of interests within biology itself suggests the difficulty of constructing a database that will satisfy all the potential demands on it. There is virtually no end to the depth and breadth of information of interest and use to the biological community.
EMBL and GenBank are the two major nucleotide databases. EMBL is the European version and GenBank is the American. EMBL and GenBank collaborate and synchronize their databases so that both contain the same information. The growth of DNA databases has followed an exponential trend, with a doubling time now estimated at 9-12 months. In January 1998, EMBL contained more than a million entries, representing more than 15,500 species, although most of the data come from model organisms such as Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana. These databases are updated daily, but you may still find that a sequence referred to in the latest issue of a journal is not accessible. This is most often because the release date of the entry did not coincide with the publication date, or because the authors did not notify the databases that the sequences had been published.
The principal requirements on the public data services are:
1. Data quality – data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
2. Supporting data – database users will need to examine the primary experimental data, either in the database itself or by following cross-references back to network-accessible laboratory databases.
3. Deep annotation – deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
4. Timeliness – the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
5. Integration – each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
The Creation of Sequence Databases
Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof), respectively. Sequences are represented in shorthand, using single-letter designations. This decreases the space necessary to store information and increases processing speed for analysis.
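To make the shorthand concrete, the sketch below (plain Python, with made-up example sequences) shows how a nucleotide or protein sequence reduces to a simple character string that is cheap to store and fast to scan.

```python
# Minimal illustration of single-letter sequence shorthand (no external
# libraries; the sequences below are made-up examples).
dna = "ATGGCGTACCTGA"          # nucleotides: A, T, G, C (U replaces T in RNA)
protein = "MAYL"               # amino acids in one-letter code (Met-Ala-Tyr-Leu)

# Because each residue is a single character, a sequence is just a string,
# which keeps storage compact and makes scanning/counting operations fast.
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
reverse_complement = "".join(complement[base] for base in reversed(dna))

print(len(dna), reverse_complement)
```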
While most biological databases contain nucleotide and protein sequence information, there are also databases which include taxonomic information such as the structural and biochemical characteristics of organisms. The power and ease of using sequence information have, however, made it the method of choice in modern analysis.
In the last three decades, contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria. In this way, rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, became possible. Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence. Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences.
With these techniques in place, progress in biological research increased exponentially.
For researchers to benefit from all this information, however, two additional things were required:
1) ready access to the collected pool of sequence information and
2) a way to extract from this pool only those sequences of interest to a given researcher.
Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins. Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created.
Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theory. These comparisons become the basis for determining gene function, inferring phylogenetic relationships and simulating protein models.
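As a concrete illustration of such comparison algorithms, the sketch below implements a bare-bones global alignment score by dynamic programming (Needleman-Wunsch style). The scoring values and the two short sequences are arbitrary choices for illustration, not the parameters of any particular database search tool.

```python
# A minimal sketch of global pairwise alignment scoring via dynamic programming.
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                 # gap penalties along the edges
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                      # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[rows - 1][cols - 1]

print(alignment_score("GATTACA", "GCATGCA"))
```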
The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it. Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.
Acquisition of sequence data
Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained, labelled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences from previously investigated material.
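As an illustration of the second route, the hedged sketch below pulls a single record from a public repository using NCBI's E-utilities efetch endpoint; the accession number is only an example, and real scripts should follow NCBI's usage guidelines (rate limits, identifying your tool and email).

```python
# A hedged sketch of fetching a sequence from a public repository over the Web.
# Requires network access; the accession below is a placeholder example.
from urllib.request import urlopen
from urllib.parse import urlencode

params = urlencode({
    "db": "nucleotide",       # which database to query
    "id": "NM_000546",        # example accession of a record of interest
    "rettype": "fasta",       # return the record as FASTA-formatted text
    "retmode": "text",
})
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + params

with urlopen(url) as response:
    fasta_text = response.read().decode("utf-8")

print(fasta_text.splitlines()[0])   # the FASTA header line of the record
```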
Analysis of data
Both types of sequence can then be analysed in many ways with bioinformatics tools. They can be assembled. Note that this is one of the occasions when the meaning of a biological term differs markedly from a computational one. Computer scientists, banish from your mind any thought of assembly language. Sequencing can only be performed for relatively short stretches of a biomolecule and finished sequences are therefore prepared by arranging overlapping “reads” of monomers (single beads on a molecular chain) into a single continuous passage of “code”. This is the bioinformatic sense of assembly. They can be mapped (that is, their sequences can be parsed to find sites where so-called “restriction enzymes” will cut them). They can be compared, usually by aligning corresponding segments and looking for matching and mismatching letters in their sequences.
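As a small illustration of mapping in this sense, the sketch below scans a made-up DNA string for the recognition sites of two restriction enzymes (EcoRI cuts at GAATTC, BamHI at GGATCC); the sequence itself is invented purely for the example.

```python
# A minimal sketch of "mapping" in the bioinformatic sense described above:
# scanning a sequence for the recognition sites of restriction enzymes.
def restriction_sites(sequence, site):
    """Return the 0-based positions where a recognition site occurs."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

dna = "TTGAATTCAGGATCCTTGAATTCAA"   # made-up example sequence
for enzyme, site in {"EcoRI": "GAATTC", "BamHI": "GGATCC"}.items():
    print(enzyme, restriction_sites(dna, site))
```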
Genes or proteins which are sufficiently similar are likely to be related and are therefore said to be “homologous” to each other—the whole truth is rather more complicated than this. Such cousins are called “homologues”. If a homologue (a related molecule) exists, then a newly discovered protein may be modelled—that is, the three-dimensional structure of the gene product can be predicted without doing laboratory experiments.
Bioinformatics is used in primer design. Primers are short sequences needed to make many copies of (amplify) a piece of DNA as used in PCR (the Polymerase Chain Reaction).
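The sketch below shows the kind of quick arithmetic that goes into screening candidate primers: GC content and the rough "Wallace rule" melting-temperature estimate (Tm ≈ 2·(A+T) + 4·(G+C) degrees C), which only applies to short oligonucleotides. The primer sequence is an arbitrary example.

```python
# Simple primer-screening arithmetic: GC fraction and the Wallace-rule Tm
# approximation for short oligonucleotides.
def gc_content(primer):
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc   # rough estimate in degrees Celsius

primer = "AGCGTACCTTGACAGCAT"   # arbitrary example primer
print(f"GC fraction: {gc_content(primer):.2f}, estimated Tm: {wallace_tm(primer)} C")
```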
Bioinformatics is used to attempt to predict the function of actual gene products. Information about the similarity, and, by implication, the relatedness of proteins is used to trace the “family trees” of different molecules through evolutionary time. There are various other applications of computer analysis to sequence data, but, with so much raw data being generated by the Human Genome Project and other initiatives in biology, computers are presently essential for many biologists just to manage their day-to-day results.
Molecular modelling/structural biology is a growing field which can be considered part of bioinformatics. There are, for example, tools which allow you (often via the Net) to make pretty good predictions of the secondary structure of proteins arising from a given amino acid sequence, often based on known “solved” structures and other sequenced molecules acquired by structural biologists. Structural biologists use bioinformatics to handle the vast and complex data from X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy investigations and to create the 3-D models of molecules that seem to be everywhere in the media.
Databases and bioinformatics
In the early 1990s, Tim Berners-Lee’s work as a researcher at the Conseil Européen pour la Recherche Nucléaire (CERN) initiated the World Wide Web, a global information system made of interlinked documents. Since the mid-1990s, the Web has revolutionized culture, commerce and technology, and enabled near-instant communication for the first time in the history of mankind.
This technology also led to the creation of many bioinformatics resources accessible throughout the world. For example, the world’s first nucleotide sequence database, the EMBL Nucleotide Sequence Data Library (which included several other databases such as SWISS-PROT and REBASE), was made available on the Web in 1993. At almost the same time, in 1992, the GenBank database became the responsibility of the NCBI (previously it had been maintained under contract by Los Alamos National Laboratory). However, GenBank was very different from today's database: in its first incarnation it was distributed in print and on CD-ROM.
In addition, the well-known NCBI website went online in 1994 (including the tool BLAST, which makes it possible to perform pairwise alignments efficiently). Then came the establishment of several major databases still used today: Genomes (1995), PubMed (1997) and Human Genome (1999).
The rise of Web resources also broadened and simplified access to bioinformatics tools, mainly through Web servers with a user-friendly graphical user interface.
Indeed, bioinformatics software often (i) requires prior knowledge of UNIX-like operating systems, (ii) requires the use of command lines (for both installation and usage) and (iii) requires the installation of several software libraries (dependencies) before being usable, which can be unintuitive even for skilled bioinformaticians.
Fortunately, more and more developers are making their tools available to the scientific community through easy-to-use graphical Web servers, allowing data to be analyzed without tedious installation procedures. Web servers are now so prevalent in modern science that the journal Nucleic Acids Research publishes a special issue on these tools each year (https://academic.oup.com/nar).
The Internet was used not only to analyze data but also to share scientific studies through publications, which are the cornerstone of the scientific community. Since the creation of the first scientific journal, The Philosophical Transactions of the Royal Society, in 1665, and until recently, scientists shared their findings through print or oral media.
In the early 1980s, several projects emerged to investigate the possibilities, advantages and disadvantages of using the Internet for scientific publications (including submission, revision and reading of articles). One of the first initiatives, BLEND, a 3-year study involving a cohort of around 50 scientists, shed light on the possibilities and challenges of such projects. These studies pioneered the use of the Internet for both data set storage and publishing, a usage exemplified by preprint servers such as Cornell University’s arXiv (est. 1991) and Cold Spring Harbor’s bioRxiv (est. 2013), which perform both tasks simultaneously.
Evolution of Biological big data
The volume of data held in storage systems is increasing day by day, and traditional storage systems are proving unable to handle the abrupt increase in the size, complexity and variety of data produced by modern bioinformatics tools. A major challenge for bioinformatics researchers and scientists is how to acquire this big data, execute queries against it, and perform analysis, interpretation, visualization and dissemination. Computer scientists, working together with biologists, are incorporating modern, state-of-the-art technologies and have been largely successful in handling these big data sets.
A high-level system architecture for big data in bioinformatics has five major parts: 1. Bioinformatics Data Sources, 2. Hadoop Distributed File System, 3. MapReduce Framework, 4. Bioinformatics Analysis Tools and 5. Visualization Tools. Each component of the architecture performs its own specific task, and the components complement one another.
Hadoop Distributed File System
One good alternative for handling big bioinformatics data is Hadoop, which accommodates voluminous data in a Distributed File System (DFS) and also provides the computation capability for processing that data using the MapReduce framework. HDFS is based on a master/slave architecture. An HDFS cluster comprises a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, together with a number of DataNodes, which handle storage. HDFS exposes a file system namespace and permits client data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode performs file system tasks such as opening, closing and renaming files and directories, and it also governs the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from client nodes; they also create, delete and replicate blocks as instructed by the NameNode. The most noteworthy characteristic of HDFS is that it supports many reads but only one write of the data, so the write procedure can only append to existing data.
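As a hedged sketch of how data lands in HDFS in practice, the snippet below shells out to the standard hdfs dfs command-line client from Python; it assumes a running Hadoop cluster, and the local and HDFS paths are made up for illustration.

```python
# Minimal sketch of loading a sequence file into HDFS via the `hdfs dfs` CLI.
# Assumes a configured Hadoop client on the PATH and a running cluster.
import subprocess

local_file = "reads.fastq"            # hypothetical local sequence file
hdfs_dir = "/user/bioinfo/input"      # hypothetical HDFS target directory

# Create the target directory (parents included) and copy the file into HDFS;
# the NameNode records the file's block locations, the DataNodes store the blocks.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)

# List the directory to confirm the file (and its replicated blocks) is in place.
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```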
MapReduce Framework
MapReduce is a programming framework for processing large data sets in a distributed fashion over many nodes of an HDFS cluster. Jobs submitted to MapReduce are broken into a group of map tasks and reduce tasks that execute in a distributed manner on a cluster of computers. The tasks typically load, parse, transform and filter the data. A set of files is given as input to MapReduce.
Each map task goes through the following steps:
1. Record reader
2. Mapper
3. Combiner
4. Partitioner
The map tasks generate intermediate key/value pairs as their output. Each reduce task then goes through the following steps:
1. Shuffle
2. Sort
3. Reducer
4. Output format
The elegance and strength of MapReduce is that it runs the code on the nodes where the data already resides, so only code is moved across the network, not the data sets.
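To make the mapper/reducer division concrete, the sketch below is a Hadoop Streaming style job written in Python that counts nucleotide frequencies in raw sequence lines: the mapper emits (base, 1) pairs and the reducer sums them per base. File layout and the exact streaming invocation depend on the Hadoop installation, so the launch command in the comment is only indicative.

```python
# Hadoop Streaming sketch: nucleotide counting over sequence lines.
# Typically the mapper and reducer live in separate scripts and are launched
# roughly like:
#   hadoop jar hadoop-streaming.jar -input /user/bioinfo/input \
#          -output /user/bioinfo/output -mapper mapper.py -reducer reducer.py
import sys
from itertools import groupby

def run_mapper():
    # Mapper: read raw sequence lines from stdin, emit "base<TAB>1" per base.
    for line in sys.stdin:
        for base in line.strip().upper():
            if base in "ACGT":
                print(f"{base}\t1")

def run_reducer():
    # Reducer: input arrives sorted by key after the shuffle/sort phase,
    # so consecutive lines with the same base can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for base, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{base}\t{total}")

if __name__ == "__main__":
    # Single-file sketch: pass "reduce" as an argument to run the reducer.
    run_reducer() if sys.argv[1:] == ["reduce"] else run_mapper()
```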
Types of Big Data in Bioinformatics
There are generally five data types that are massive in size and most used in bioinformatics research: (i) gene expression data, (ii) DNA, RNA, and protein sequence data, (iii) protein–protein interaction (PPI) data, (iv) pathway data, and (v) gene ontology.
In gene expression analysis, the expression levels of thousands of genes are measured and evaluated across various conditions (e.g., different developmental stages, treatments and/or diseases). The expression levels used for analysis are recorded using microarray-based gene expression profiling. Gene-sample, gene-time, and gene-sample-time are three types of microarray data. In sequence analysis, DNA, RNA, or peptide sequences are processed using several analytical methods, through which their characteristics, functions, structures, and evolution are understood. DNA sequencing can be applied for purposes such as the study of genomes and proteins, evolutionary biology, identification of micro-species, and forensic identification. PPIs offer essential information about virtually all biological processes; hence, protein functions can be properly characterized by constructing and analyzing PPI networks. Anomalous PPIs underlie various diseases (e.g., Alzheimer’s disease and cancer). Pathway analysis is used to understand the molecular basis of a disease: it identifies the genes and proteins related to the etiology of a disease, and it can also suggest drug targets and guide targeted literature searches. Gene ontology offers dynamic, structured, and species-independent ontologies covering three domains: associated biological processes, cellular components, and molecular functions. It uses controlled vocabularies to facilitate querying data at different levels.
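As a toy illustration of the gene-sample layout described above, the sketch below builds a tiny expression matrix with pandas; the gene names, sample labels and values are entirely made up.

```python
# A minimal gene-sample expression matrix: rows are genes, columns are samples,
# and each cell holds an expression level. All names and values are invented.
import pandas as pd

expression = pd.DataFrame(
    [[5.1, 7.9, 6.3],     # expression of geneA across the three samples
     [2.4, 2.2, 8.8],
     [9.0, 1.3, 4.5]],
    index=["geneA", "geneB", "geneC"],          # genes
    columns=["control", "treated", "diseased"], # samples (conditions)
)

# A typical first step: rank genes by how much their expression varies across samples.
print(expression.var(axis=1).sort_values(ascending=False))
```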
A selection of big data projects in the life sciences:
Cancer Genome Atlas: This effort to map the genome of more than 25 types of cancers has generated 1 petabyte of data to date, representing 7,000 cases of cancer. Scientists expect 2.5 petabytes by completion.
Encyclopedia of DNA Elements (ENCODE): This map of the functional elements in the human genome — regions that turn genes on and off — contains more than 15 terabytes of raw data.
Human Microbiome Project: One of a number of projects characterizing the microbiome at different parts of the body, this effort has generated 18 terabytes of data — about 5,000 times more data than the original human genome project.
Earth Microbiome Project: A plan to characterize microbial communities across the globe, which has created 340 gigabytes of sequence data to date, representing 1.7 billion sequences from more than 20,000 samples and 42 biomes. Scientists expect 15 terabytes of sequence and other data by completion.
Genome 10K: The total raw data for this effort to sequence and assemble the DNA of 10,000 vertebrate species and analyze their evolutionary relationships will exceed 1 petabyte.
The Future
Because of high-performance computational platforms, these databases have become important in providing the infrastructure needed for biological research, from data preparation to data extraction. The simulation of biological systems also requires computational platforms, which further underscores the need for biological databases. The future of biological databases looks bright, in part due to the digital world.
In terms of research, bioinformatics tools should be streamlined for analyzing the growing amount of data generated from genomics, metabolomics, proteomics, and metagenomics. Another future trend will be the annotation of existing data and better integration of databases.
With a large number of biological databases available, the need for integration, advancements, and improvements in bioinformatics is paramount. Bioinformatics will steadily advance when problems about nomenclature and standardization are addressed. The growth of biological databases will pave the way for further studies on proteins and nucleic acids, impacting therapeutics, biomedical, and related fields.