Introduction to Machine Learning in Bioinformatics
July 28, 2019
Introduction
The field of machine learning evolved from the broad field of artificial intelligence, which aims to mimic the intelligent abilities of humans with machines. A dictionary definition of learning includes phrases such as “to gain knowledge, or understanding of, or skill in, by study, instruction, or experience,” and “modification of a behavioral tendency by experience”.
Machine learning usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc. The “changes” might be either enhancements to already performing systems or ab initio synthesis of new systems.
Importance of machine learning
• Some tasks cannot be defined well except by example; that is, we might be able to specify input/output pairs but not a concise relationship between inputs and desired outputs. We would like machines to be able to adjust their internal structure to produce correct outputs for a large number of sample inputs and thus suitably constrain their input/output function to approximate the relationship implicit in the examples.
• It is possible that hidden among large piles of data are important relationships and correlations. Machine learning methods can often be used to extract these relationships (data mining).
• Human designers often produce machines that do not work as well as desired in the environments in which they are used. In fact, certain characteristics of the working environment might not be completely known at design time. Machine learning methods can be used for on-the-job improvement of existing machine designs.
• The amount of knowledge available about certain tasks might be too large for explicit encoding by humans. Machines that learn this knowledge gradually might be able to capture more of it than humans would want to write down.
• Environments change over time. Machines that can adapt to a changing environment would reduce the need for constant redesign.
• New knowledge about tasks is constantly being discovered by humans. Vocabulary changes. There is a constant stream of new events in the world. Continuing redesign of AI systems to conform to new knowledge is impractical, but machine learning methods might be able to track much of it.
Types of machine learning
Machine learning is not only about classification. The following main classes of problems exist:
(i) Classification learning: learn to put instances into pre-defined classes
(ii) Association learning: learn relationships between the attributes
(iii) Clustering: discover classes of instances that belong together
(iv) Numeric prediction: learn to predict a numeric quantity instead of a class
Supervised and Unsupervised Learning
Supervised learning is the type of learning that takes place when the training instances are labelled with the correct result, which gives feedback about how learning is progressing. In unsupervised learning, the goal is harder because there are no pre-determined categorizations.
Supervised Learning
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. Supervised learning is the most common technique for training neural networks and decision trees. Both of these techniques are highly dependent on the information given by the pre-determined classifications. In the case of neural networks, the classification is used to determine the error of the network and then adjust the network to minimize it, and in decision trees, the classifications are used to determine what attributes provide the most information that can be used to solve the classification puzzle.
Unsupervised learning
Unsupervised learning seems much harder: the goal is to have the computer learn how to do something that we don’t tell it how to do! There are actually two approaches to unsupervised learning. The first approach is to teach the agent not by giving explicit categorizations, but by using some sort of reward system to indicate success; this approach is more commonly classified separately as reinforcement learning. A second type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function, but simply to find similarities in the training data. The assumption is often that the clusters discovered will match reasonably well with an intuitive classification. For instance, clustering individuals based on demographics might result in a clustering of the wealthy in one group and the poor in another.
Semisupervised learning
Semi-supervised learning uses a combination of labeled and unlabeled data. The objective is to build a function that correctly predicts the output for inputs whose output is not known. The data set typically contains a small amount of labeled data and a large amount of unlabeled data.
Apart from supervised and unsupervised learning, there are a few other learning paradigms, such as reinforcement learning. However, supervised and unsupervised learning are the two most widely used in real-world applications across fields such as computational biology and pattern recognition.
Reinforcement learning
These algorithms are aimed at finding a policy that maps states of the world to actions. The actions are chosen among the options that an agent can take in those states, with the aim of maximizing some notion of long-term reward. The main difference from the previous types of machine learning is that input–output pairs are not present in a database, and the goal lies in online performance.
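The idea of learning a policy from rewards can be illustrated with a minimal sketch of tabular Q-learning, one common reinforcement learning algorithm. Everything in it is an assumption made for illustration: the five-state "corridor" environment, the `step` function, the reward of 1 for reaching the last state, and the learning parameters are hypothetical and do not come from the text above.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the rewarding terminal state
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)   # explore, or break ties at random
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Target: immediate reward plus discounted value of the best next action.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q

def greedy_policy(q):
    """Policy for the non-terminal states: the action with the larger value."""
    return [0 if row[0] > row[1] else 1 for row in q[:-1]]

q_table = q_learning()
policy = greedy_policy(q_table)   # one action per non-terminal state
```

Note that no labeled input/output pairs appear anywhere: the agent only observes rewards, and the policy is read off the learned value table.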
Optimization
This can be defined as the task of searching for an optimal solution in a space of multiple possible solutions. As the process of learning from data can be regarded as searching for the model that best fits the data, optimization methods can be considered an ingredient in modeling. A broad collection of exact and heuristic optimization algorithms has been proposed in the last decade.
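To make the distinction between exact and heuristic optimization concrete, the following sketch fits the one-parameter model y = w·x to a few points in two ways: with the closed-form least-squares solution (exact) and with a simple hill-climbing search (heuristic). The data values, function names, and step-halving scheme are assumptions chosen for illustration only.

```python
def sse(w, xs, ys):
    """Sum of squared errors of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def exact_slope(xs, ys):
    """Exact optimum: closed-form least-squares solution for y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def hill_climb(xs, ys, w=0.0, step=1.0, tol=1e-6):
    """Heuristic search: move to the better neighbour, halve the step when stuck."""
    while step > tol:
        best = min((w - step, w, w + step), key=lambda c: sse(c, xs, ys))
        if best == w:
            step /= 2      # no neighbour improves the fit: refine the search
        else:
            w = best
    return w

# Hypothetical data points lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w_exact = exact_slope(xs, ys)
w_search = hill_climb(xs, ys)
```

For this convex toy problem both approaches agree closely; heuristic search becomes useful when no closed-form solution is available, at the cost of losing any guarantee of finding the global optimum.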
Machine learning and statistics
Statistics has traditionally been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all. Most machine learning algorithms use statistical tests when constructing rules or trees and for correcting models that are “overfitted” in that they depend too strongly on the details of the particular examples used to produce them. Statistical tests are also used to validate machine learning models and to evaluate machine learning algorithms.
Selecting the Right Algorithm
Choosing the right algorithm can seem overwhelming—there are dozens of supervised and unsupervised machine learning algorithms, and each takes a different approach to learning. There is no best method or one size fits all. Finding the right algorithm is partly based on trial and error—even highly experienced data scientists cannot tell whether an algorithm will work without trying it out. Highly flexible models tend to overfit data by modeling minor variations that could be noise. Simple models are easier to interpret but might have lower accuracy. Therefore, choosing the right algorithm requires trading off one benefit against another, including model speed, accuracy, and complexity. Trial and error is at the core of machine learning—if one approach or algorithm does not work, you try another.
Machine learning algorithms in the omics field
As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. The bioinformatics field is increasingly relying on machine learning (ML) algorithms to conduct predictive analytics and gain greater insights into the complex biological processes of the human body. Machine learning has been applied to six biological domains: genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Genomics
There is an increasing need for the development of machine learning systems that can automatically determine the location of protein-encoding genes within a given DNA sequence. This is a problem in computational biology known as gene prediction. Machine learning has also been used for the problem of multiple sequence alignment which involves aligning many DNA or amino acid sequences in order to determine regions of similarity that could indicate a shared evolutionary history. It can also be used to detect and visualize genome rearrangements.
Proteomics
Machine learning is used to classify the amino acids of a protein sequence into one of three structural classes (helix, sheet, or coil), a task known as secondary structure prediction. The current state of the art in secondary structure prediction uses a system called DeepCNF (deep convolutional neural fields), which relies on the machine learning model of artificial neural networks to achieve an accuracy of approximately 84%. The theoretical limit for three-state protein secondary structure prediction is 88–90%. Machine learning has also been applied to proteomics problems such as protein side-chain prediction, protein loop modeling, and protein contact map prediction.
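As a small illustration of how such an accuracy figure can be computed, the sketch below measures per-residue three-state accuracy, assuming (this is an assumption, not stated above) that accuracy means the fraction of residues whose predicted class matches the observed one. The function name and the two ten-residue strings are hypothetical, with H, E, and C standing for helix, sheet, and coil.

```python
def three_state_accuracy(predicted, observed):
    """Fraction of residues whose predicted class (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical 10-residue example: 8 of the 10 positions agree.
observed = "HHHHEEECCC"
predicted = "HHHEEEECCH"
accuracy = three_state_accuracy(predicted, observed)   # 0.8
```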
Microarrays
One of the main problems in this field is identifying which genes are expressed based on the collected data. In addition, because a microarray collects data on a huge number of genes, much of the data is irrelevant to the task of identifying expressed genes, which further complicates the problem. Machine learning presents a potential solution, as various classification methods can be used to perform this identification. The most commonly used methods are radial basis function networks, deep learning, Bayesian classification, decision trees, and random forests.
Systems biology
Machine learning has been used to aid in modelling the complex interactions in biological systems in domains such as genetic networks, signal transduction networks, and metabolic pathways. Probabilistic graphical models, a machine learning technique for determining the structure between different variables, are one of the most commonly used methods for modeling genetic networks. In addition, machine learning has been applied to systems biology problems such as identifying transcription factor binding sites using a technique known as Markov chain optimization. Genetic algorithms, machine learning techniques based on the natural process of evolution, have been used to model genetic networks and regulatory structures.
Other systems biology applications of machine learning include enzyme function prediction, high-throughput microarray data analysis, analysis of genome-wide association studies to better understand markers of disease, and protein function prediction.
Text mining
Machine learning can be used for knowledge extraction from text, using techniques such as natural language processing to extract useful information from human-generated reports in a database. This technique has been applied to the search for novel drug targets, as this task requires the examination of information stored in biological databases and journals. Annotations of proteins in protein databases often do not reflect the complete set of knowledge about each protein, so additional information must be extracted from the biomedical literature. Machine learning has been applied to automatic annotation of the function of genes and proteins, determination of the subcellular localization of a protein, analysis of DNA-expression arrays, large-scale protein interaction analysis, and molecule interaction analysis.
Another application of text mining is the detection and visualization of distinct DNA regions given sufficient reference data.
Commonly used machine learning algorithms in bioinformatics
Some of the most widely used learning algorithms are support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, the k-nearest neighbor algorithm, and neural networks (multilayer perceptron).
Commonly used supervised machine learning algorithms
Decision Tree Classifier
Decision tree classifiers are among the most widely used classifiers because they are simple, fast, and effective, and because they have an informative graphical representation. The decision tree model is built by a recursive top-down process and is easy to understand and check. The top node of the tree is called the root, and the other non-terminal nodes are called internal nodes. The tree is built recursively from the root by considering one feature at a time, so that every node corresponds to one input parameter. At each node, the samples are divided according to a question about that parameter, and the procedure is repeated on each resulting subset. The leaf nodes are the prediction nodes.
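One common way to decide which feature to ask about at a node is to pick the one with the highest information gain, that is, the largest reduction in class entropy after the split. The sketch below shows that single step under stated assumptions: the feature names (`gc_rich`, `long`), the class labels, and the four records are hypothetical, and a full tree builder would apply the same selection recursively to each subset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the samples on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Attribute to test at the current node: the one with the highest gain."""
    return max(rows[0], key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data: two binary features describing four sequences.
rows = [
    {"gc_rich": "yes", "long": "yes"},
    {"gc_rich": "yes", "long": "no"},
    {"gc_rich": "no", "long": "yes"},
    {"gc_rich": "no", "long": "no"},
]
labels = ["coding", "coding", "noncoding", "noncoding"]
root_feature = best_attribute(rows, labels)   # "gc_rich" separates the classes perfectly
```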
Naïve Bayes Classifier
The naive Bayes classifier performs classification under the assumption that the parameters (features) are independent of one another. The naive Bayes classifier can be described by equation (1).
P(C1 | P1, P2) = [P(P1 | C1) · P(P2 | C1) · P(C1)] / [P(P1) · P(P2)]    (1)
Equation (1) gives the probability that an input with parameters P1 and P2 belongs to class C1.
The numerator is the product of the probability of observing P1 given class C1, the probability of observing P2 given class C1, and the probability of class C1. The denominator is the product of the probability of observing parameter P1 and the probability of observing parameter P2. Thus we can see that the classifier is based on Bayes' formula.
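A worked example may help here. The sketch below multiplies the class prior by the per-parameter likelihoods, which is the numerator of equation (1) with every parameter conditioned on the same class. As a deliberate substitution, it then normalizes the scores across classes instead of dividing by P(P1)·P(P2); this is a common practical choice that guarantees the posteriors sum to one. All probabilities, class names, and parameter values are hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior probability of each class given the observed parameter values."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            # Independence assumption: multiply the per-feature likelihoods.
            score *= likelihoods[cls][feature][value]
        scores[cls] = score
    total = sum(scores.values())          # normalization across classes
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical probabilities for two classes and two binary parameters.
priors = {"C1": 0.5, "C2": 0.5}
likelihoods = {
    "C1": {"P1": {"high": 0.8, "low": 0.2}, "P2": {"high": 0.5, "low": 0.5}},
    "C2": {"P1": {"high": 0.2, "low": 0.8}, "P2": {"high": 0.5, "low": 0.5}},
}
posterior = naive_bayes_posterior(priors, likelihoods, {"P1": "high", "P2": "high"})
# Scores are 0.5*0.8*0.5 = 0.2 for C1 and 0.5*0.2*0.5 = 0.05 for C2,
# so the posterior for C1 is 0.2 / 0.25 = 0.8.
```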
Support Vector Machines
Support vector machines (SVMs) are one of the most popular classification techniques in use today. Their robust mathematical basis and the good accuracy they show in many real tasks have placed them among practitioners' favorites. SVMs map samples into a higher-dimensional space where a maximal separating hyperplane between the instances of different classes is constructed. The method works by constructing two further parallel hyperplanes, one on each side of this hyperplane. The SVM method tries to find the separating hyperplane that maximizes the area of separation between the two parallel hyperplanes. It is assumed that a larger separation between these parallel hyperplanes implies a better predictive accuracy of the classifier. As the widest area of separation is, in fact, determined by the few samples that lie close to the two parallel hyperplanes, these samples are called support vectors. They are also the most difficult samples to classify correctly. As in many cases it is not possible to perfectly separate all the training points of different classes, the permitted distance between these misclassified points and the far side of the separation area is limited. Although SVM classifiers are popular because of the notable accuracy levels achieved in many bioinformatics problems, they are also criticized for the lack of expressiveness and comprehensibility of their mathematical concepts.
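The geometry described above can be sketched for the linear case. The code below does not train an SVM; it takes a hyperplane w·x + b = 0 whose weights are hand-picked for a hypothetical four-point data set, and then computes the width of the separation area (2 / ||w||) and the samples lying on the two parallel hyperplanes w·x + b = ±1, which play the role of support vectors. The function names, points, and weights are all assumptions for illustration.

```python
from math import isclose, sqrt

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the parallel hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))

def support_vectors(w, b, points):
    """Samples lying exactly on one of the two parallel hyperplanes."""
    return [x for x in points if isclose(abs(decision_value(w, b, x)), 1.0)]

# Hypothetical 2-D samples: the first two belong to class +1, the last two to class -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
w, b = (0.5, 0.5), -1.0      # hand-picked separating hyperplane, not learned

width = margin_width(w)                  # about 2.83, the distance from (0, 0) to (2, 2)
vectors = support_vectors(w, b, points)  # the two samples closest to the boundary
```

In this toy setting only (2, 2) and (0, 0) determine the separation area; moving the other two samples further away would leave the hyperplane unchanged.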
Commonly used unsupervised machine learning algorithms
Partitional Clustering
Clustering algorithms that belong to this family assign each sample to a unique cluster, thus providing a partition of the set of points. In order to apply a partitional clustering algorithm, the user has to fix in advance the number of clusters in the partition. Although there are several heuristic methods for supporting the decision on the number of clusters, this problem still remains open. The k-means algorithm is the prototypical and best-known partitional clustering method. Its objective is to partition the set of samples into K clusters so that the within-group sum of squares is minimized. In its basic form, the algorithm is based on the alternation of two intuitive and fast steps. Before the iteration of these two steps starts, a random assignment of samples to K initial clusters is performed. In the first step, the samples are assigned to clusters, commonly to the cluster whose centroid is the closest by Euclidean distance. In the second step, the new cluster centroids are recalculated. The iteration of the two steps is halted when no movement of a sample to a different group will reduce the within-group sum of squares. The literature provides a high diversity of variations of the k-means algorithm, especially focused on improving computing times. Its main drawback is that it does not necessarily return the same results in two different runs, since the final configuration of clusters depends on the initial random assignment of points to the K initial clusters. In fuzzy and probabilistic clustering, the samples are not forced to belong completely to one cluster. With these approaches, each point has a degree of belonging to each of the clusters. Guided by the minimization of intracluster variance, the literature shows interesting fuzzy and probabilistic clustering methods, and the field is still open for further publication opportunities.
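The two alternating steps of k-means can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a production implementation: the function names are hypothetical, the data are invented expression values of a single gene across six samples, and re-seeding an empty cluster with a randomly chosen sample is an assumption that the description above does not cover.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random initial assignment of the samples to K clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Recalculate the centroid of every cluster.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption: an empty cluster is re-seeded with a random sample.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Assign every sample to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # no sample changed cluster: stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical expression values of one gene in six samples (1-D points).
samples = [(1.0,), (1.2,), (0.8,), (8.0,), (8.2,), (7.9,)]
labels, centroids = k_means(samples, k=2)
```

Which cluster receives index 0 or 1 depends on the random initial assignment, which mirrors the run-to-run variability mentioned above; for data as well separated as this toy example, the grouping itself is stable.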
Hierarchical Clustering
This is the most widely used clustering paradigm in bioinformatics. The output of a hierarchical clustering algorithm is a nested and hierarchical set of partitions/clusters represented by a tree diagram or dendrogram, with individual samples at one end (bottom) and a single cluster containing every element at the other (top). Agglomerative algorithms begin at the bottom of the tree, whereas divisive algorithms begin at the top. Agglomerative methods build the dendrogram from the individual samples by iteratively merging pairs of clusters. Divisive methods are rarely applied because of their inefficiency. Because of the transparency and highly intuitive nature of the dendrogram, the expert can produce a partition into a desired number of disjoint groups by cutting the dendrogram at a given level. This ability to decide the number of final clusters to be studied has popularized the use of hierarchical clustering among bio-experts. A dissimilarity matrix with the distance between pairs of clusters is used to guide each step of the agglomerative merging process. A variety of distance measures between clusters is available in the literature. The most common measures are single linkage (the distance between two groups is the distance between their closest members), complete linkage (defined as the distance between the two farthest points), Ward's hierarchical clustering method (at each stage of the algorithm, the two groups that produce the smallest increase in the total within-group sum of squares are amalgamated), centroid distance (defined as the distance between the cluster means or centroids), median distance (distance between the medians of the clusters), and group average linkage (average of the dissimilarities between all pairs of individuals, one from each group).
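The agglomerative merging process can be sketched as follows. The code is a deliberately naive illustration: it recomputes all pairwise cluster distances at every step instead of maintaining a dissimilarity matrix, it supports only single, complete, and group average linkage, and it "cuts the dendrogram" simply by stopping once the requested number of clusters remains. The function names and the five one-dimensional points are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Each linkage reduces the list of pairwise distances between two clusters to one number.
LINKAGES = {
    "single": min,                                # closest members
    "complete": max,                              # farthest members
    "average": lambda ds: sum(ds) / len(ds),      # group average
}

def cluster_distance(c1, c2, linkage):
    return LINKAGES[linkage]([euclidean(a, b) for a in c1 for b in c2])

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]              # start: every sample is its own cluster
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(
            clusters[ij[0]], clusters[ij[1]], linkage))
        clusters[i].extend(clusters.pop(j))       # j > i, so index i stays valid
    return clusters

# Hypothetical 1-D samples.
data = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
three = agglomerative(data, 3, linkage="single")
two = agglomerative(data, 2, linkage="complete")
```

Cutting at three clusters groups the two nearby pairs and leaves the outlying sample alone; cutting at two merges the pairs while the outlier still forms its own cluster.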
Open source Machine learning software tools
WEKA (Waikato Environment for Knowledge Analysis)
R project and Bioconductor
RapidMiner
Orange