Data errors in Bioinformatics databases
June 28, 2019
Limitations of Bioinformatics databases
Based on their contents, biological databases can be roughly divided into three categories: primary databases, secondary databases, and specialized databases. Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community.
One of the problems associated with biological databases is overreliance on sequence information and related annotations without an understanding of how reliable that information is.
What is often ignored is the fact that there are many errors in sequence databases. There are also high levels of redundancy in the primary sequence databases.
Annotations of genes can also occasionally be false or incomplete. All these types of errors can be passed on to other databases, causing propagation of errors.
Most common errors in bioinformatics databases
1. Sequencing error
Most errors in nucleotide sequences are caused by sequencing errors. Some of these errors cause frameshifts that make identification of the whole gene difficult or protein translation impossible.
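To make the frameshift problem concrete, here is a minimal sketch that translates a short coding sequence before and after a single base is dropped. The sequence is hypothetical and chosen only for illustration, and the helper name `translate` is an assumption, not part of any database tool.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

# Hypothetical 18-base coding sequence, used only for illustration.
correct = "ATGGCCAAGCTTGATTAA"
# Simulated sequencing error: the base at index 5 is dropped.
miscalled = correct[:5] + correct[6:]

print(translate(correct))    # MAKLD*
print(translate(miscalled))  # MASLI
```

After the deletion, every downstream codon is read out of frame: the amino acids change and the stop codon is lost, so the end of the gene can no longer be found from the sequence alone.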
2. Cloning vector contamination
Sometimes, gene sequences are contaminated with sequences from cloning vectors.
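One rough way to picture a contamination screen is to look for stretches of a submitted sequence that also occur in a known vector. The sketch below uses exact k-mer matching, which is a deliberate simplification of the alignment-based screening that real pipelines rely on. The vector and insert sequences, the function names, and the choice of `k=12` are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def vector_hits(read, vector, k=12):
    """Return start positions in read whose k-mer also occurs in vector."""
    vector_kmers = kmers(vector.upper(), k)
    read = read.upper()
    return [i for i in range(len(read) - k + 1) if read[i:i + k] in vector_kmers]

# Hypothetical sequences for illustration only; not a real cloning vector.
VECTOR = "GGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTT"
insert = "ATGAAACGTATTAGCACCACCATTACC"
# Simulated contamination: 16 bases of vector left attached to the insert.
read = VECTOR[-16:] + insert

print(vector_hits(read, VECTOR))    # [0, 1, 2, 3, 4]
print(vector_hits(insert, VECTOR))  # []
```

The hits cluster at the start of the read, which is where the leftover vector sits. Exact matching misses vector fragments that carry their own sequencing errors, so this is only a first-pass illustration.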
3. Redundancy of data
Redundancy is another major problem affecting primary databases. There is tremendous duplication of information in the databases, for various reasons. The causes of redundancy include repeated submission of identical or overlapping sequences by the same or different authors, revision of annotations, dumping of expressed sequence tags (EST) data, and poor database management that fails to detect the redundancy. This makes some primary databases excessively large and unwieldy for information retrieval.
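The simplest case of redundancy, repeated submission of an identical sequence, can be detected by hashing each sequence after normalizing case and whitespace. The following is a minimal sketch under that assumption; the entry IDs and sequences are hypothetical, and overlapping or near-identical sequences would need alignment or clustering instead.

```python
import hashlib
from collections import defaultdict

def group_identical(records):
    """Group record IDs whose sequences are identical after normalization."""
    groups = defaultdict(list)
    for record_id, seq in records:
        # Remove whitespace and ignore case so formatting differences do not matter.
        normalized = "".join(seq.split()).upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        groups[digest].append(record_id)
    # Report only the groups that contain more than one entry.
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical entry IDs and sequences, for illustration only.
records = [
    ("entry_A", "ATGGCCAAGCTTGATTAA"),
    ("entry_B", "atggccaagcttgattaa"),  # same sequence, resubmitted in lower case
    ("entry_C", "ATGGCCAAGCTTGACTAA"),  # differs by one base, so kept separate
]

print(group_identical(records))  # [['entry_A', 'entry_B']]
```

Note that `entry_C` is not grouped with the others even though it differs by a single base, which is exactly the kind of near-duplicate that exact matching cannot resolve.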
4. Human error
The other common problem is erroneous annotation. Often, the same gene sequence is found under different names, resulting in multiple entries and confusion about the data. There are also some errors that are simply caused by omissions or typing mistakes.
Steps taken to reduce errors in biological databases
1. The National Center for Biotechnology Information (NCBI) has now created a nonredundant database, called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry. Protein sequences derived from the same DNA sequences are explicitly linked as related entries. Sequence variants from the same organism with very minor differences, which may well be caused by sequencing errors, are treated as distinctly related entries.
2. Another way to address the redundancy problem is to create sequence-cluster databases such as UniGene that coalesce EST sequences that are derived from the same gene.
3. To alleviate the problem of naming genes, genes and proteins need to be reannotated using a common, controlled vocabulary. The goal is to provide a consistent and unambiguous naming system for all genes and proteins. A prominent example of such systems is Gene Ontology.
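The idea behind a controlled vocabulary can be sketched as a lookup that maps every known alias to one approved name. The vocabulary below is entirely hypothetical (the symbols `GENE1` and `GENE2` and their aliases are made up for illustration), and the function names are assumptions; this is not how Gene Ontology itself is structured or queried.

```python
# Hypothetical controlled vocabulary: approved symbol -> known aliases.
VOCABULARY = {
    "GENE1": ["gene-1", "g1", "protein one"],
    "GENE2": ["gene-2", "g2"],
}

def build_lookup(vocabulary):
    """Map every approved symbol and alias (case-insensitive) to its approved symbol."""
    lookup = {}
    for approved, aliases in vocabulary.items():
        for name in [approved, *aliases]:
            lookup[name.strip().lower()] = approved
    return lookup

def normalize_name(name, lookup):
    """Return the approved symbol, or None when the name is not in the vocabulary."""
    return lookup.get(name.strip().lower())

lookup = build_lookup(VOCABULARY)
entries = ["G1", "gene-1", "Protein One", "g2", "unknown-gene"]

print([normalize_name(e, lookup) for e in entries])
# ['GENE1', 'GENE1', 'GENE1', 'GENE2', None]
```

Names that fall outside the vocabulary come back as `None` rather than being guessed, so they can be flagged for manual curation instead of silently creating another duplicate entry.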