Revolutionizing Protein Function Prediction: The Introduction of DeepGO-SE
July 18, 2024
Protein function prediction is one of the key challenges in modern biology and bioinformatics, enabling a deeper understanding of the roles and interactions of proteins within living systems. Accurate functional descriptions of proteins are essential for tasks such as drug target identification, understanding disease mechanisms, and improving biotechnological applications in industry. Despite advances in predicting protein structures, predicting protein function remains challenging because functions are complex and relatively few have been characterized.
The Gene Ontology (GO) is a vital tool in this domain, providing a structured vocabulary for molecular functions, biological processes, and cellular components. Researchers identify protein functions through experiments and add this data to knowledge bases like the UniProtKB/Swiss-Prot database, which contains manually curated GO annotations for over 550,000 proteins.
Recent advancements in protein function prediction methods leverage various data sources, such as sequence, interactions, tertiary structure, literature, coexpression, phylogenetic analysis, and GO information. These methods use advanced machine learning techniques, including deep convolutional neural networks (CNNs), language models like long short-term memory (LSTM) networks and transformers, and pretrained protein language models to represent amino acid sequences. Some methods incorporate protein-protein interactions using knowledge graph embeddings and graph convolutional neural networks, while others apply natural language models to scientific literature for automated function prediction.
However, many prediction methods still rely heavily on sequence similarity, which can be unreliable for proteins with little or no sequence similarity to known functional domains. This limitation highlights the need for methods that use diverse information sources to predict functions accurately, especially for biological processes and cellular components, which require knowledge of protein interactions within an organism’s proteome.
Ontologies, which are formal theories that specify the meaning of classes in a logic-based language, offer an underutilized information source for predicting protein functions. By incorporating formal axioms from GO into machine learning models, researchers can use this background knowledge to improve prediction accuracy and efficiency, an approach known as knowledge-enhanced machine learning.
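To make the idea of "background knowledge" concrete, the sketch below shows the simplest kind of inference a subclass axiom supports: if a protein has a function, it also has every more general function that the ontology places above it. The hierarchy here is a toy fragment written for illustration, not copied from GO, and the function name `entailed_classes` is hypothetical.

```python
from collections import deque

# Toy, simplified hierarchy for illustration only (not actual GO content):
# each class maps to the set of its direct superclasses.
SUBCLASS_OF = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"transferase activity"},
    "transferase activity": {"catalytic activity"},
    "catalytic activity": {"molecular_function"},
}


def entailed_classes(annotations, subclass_of):
    """Return the given annotations plus every superclass they entail."""
    closed = set(annotations)
    queue = deque(annotations)
    while queue:
        term = queue.popleft()
        for parent in subclass_of.get(term, ()):
            if parent not in closed:
                closed.add(parent)
                queue.append(parent)
    return closed


if __name__ == "__main__":
    # A single specific annotation implies the whole chain above it.
    print(sorted(entailed_classes({"protein kinase activity"}, SUBCLASS_OF)))
```

A model that ignores such axioms can predict a specific function while assigning a low score to its parent, which is logically inconsistent. Building the axioms into the model is one way to rule that out.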
Enter DeepGO-SE, a groundbreaking protein function prediction method that combines a pretrained large protein language model with a neuro-symbolic model performing function prediction as approximate semantic entailment. Using the ESM2 protein language model, DeepGO-SE generates protein representations and projects them into an embedding space (ELEmbeddings) created from GO axioms. This approach allows for the use of ontology axioms to enhance prediction accuracy.
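As a rough geometric intuition for how an embedding space can encode ontology axioms, the sketch below follows the general idea behind ELEmbeddings, in which each class is represented as a ball (a center and a radius) and a subclass axiom holds when one ball lies inside another. The coordinates, the `membership_score` function, and all names are assumptions made for illustration. This is not the scoring function or training procedure used by DeepGO-SE.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subclass_loss(ball_c, ball_d, margin=0.0):
    """Penalty for the axiom 'C SubClassOf D'; zero when ball C lies inside ball D."""
    (center_c, radius_c), (center_d, radius_d) = ball_c, ball_d
    return max(0.0, distance(center_c, center_d) + radius_c - radius_d - margin)


def membership_score(point, ball):
    """Illustrative score: sigmoid of (radius - distance), above 0.5 inside the ball."""
    center, radius = ball
    return 1.0 / (1.0 + math.exp(-(radius - distance(point, center))))


# Hypothetical 2-D embeddings: a specific class nested inside a general one.
KINASE = ((1.0, 0.0), 1.0)
CATALYTIC = ((0.0, 0.0), 3.0)
PROTEIN = (1.2, 0.3)  # stands in for a projected protein representation

if __name__ == "__main__":
    print(subclass_loss(KINASE, CATALYTIC))      # 0.0, the axiom is satisfied
    print(membership_score(PROTEIN, KINASE))     # above 0.5
    print(membership_score(PROTEIN, CATALYTIC))  # above 0.5, follows from nesting
```

Because the smaller ball sits inside the larger one, any protein placed in the specific class automatically falls inside the general class as well, so consistency with the axiom comes from the geometry rather than from a separate post-processing step.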
DeepGO-SE also improves predictions for complex biological processes and cellular components by incorporating protein-protein interaction networks. This method demonstrates that while molecular functions can be predicted from single proteins, predicting biological processes and cellular components requires information about multiple protein interactions.
In rigorous evaluations using the UniProtKB/Swiss-Prot dataset, DeepGO-SE significantly outperformed baseline methods across all GO subontologies. The model achieved a maximum F-measure (Fmax) of 0.554 for molecular functions, 0.432 for biological processes, and 0.721 for cellular components.
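For readers unfamiliar with the metric, the sketch below computes a maximum F-measure under the protein-centric definition commonly used in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at a given threshold, recall is averaged over all proteins, and the best F-score over all thresholds is reported. The article does not spell out the exact evaluation protocol, so this definition, the function name `f_max`, and the toy data are assumptions.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure.

    predictions: {protein: {term: score in [0, 1]}}
    truth: {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with at least one prediction.
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Toy scores for two hypothetical proteins and three hypothetical terms.
    toy_predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"A": 0.8, "C": 0.3}}
    toy_truth = {"P1": {"A"}, "P2": {"A", "C"}}
    print(round(f_max(toy_predictions, toy_truth), 3))
```

Because the threshold is chosen to maximize the score, Fmax summarizes the best achievable trade-off between precision and recall rather than performance at a fixed cutoff.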
Furthermore, DeepGO-SE was validated using the neXtProt dataset, where it achieved the best maximum F measure for molecular functions and biological processes, demonstrating its robustness across different datasets. The model’s ability to predict functions based on structural homologues further confirms its reliability.
DeepGO-SE’s innovative approach represents a significant advancement in protein function prediction, offering improved accuracy and the ability to predict novel protein functions. This breakthrough paves the way for better understanding of protein roles in living systems, with far-reaching implications for drug discovery, disease research, and biotechnology.
The DeepGO-SE Advantage:
- Knowledge-Enhanced Learning: DeepGO-SE integrates formal axioms from GO into machine learning models, improving prediction accuracy through knowledge-enhanced learning.
- Diverse Information Sources: The method combines protein sequence representations from the ESM2 language model, GO axioms, and protein-protein interaction networks, offering a comprehensive approach to function prediction.
- Improved Accuracy: DeepGO-SE significantly outperforms existing methods, especially for complex biological processes and cellular components.
- Validation Across Datasets: The model’s robustness is confirmed through evaluations on both the UniProtKB/Swiss-Prot and neXtProt datasets.
- Structural Homologue Analysis: DeepGO-SE reliably predicts functions based on structural similarities, even for proteins with low sequence identity.
In summary, DeepGO-SE represents a revolutionary step forward in protein function prediction, leveraging advanced machine learning techniques and diverse data sources to provide more accurate and comprehensive functional descriptions of proteins. This development has the potential to transform various fields, from drug discovery to biotechnology, by enhancing our understanding of protein functions and interactions.
Journal reference:
Kulmanov, M., et al. (2024). Protein function prediction as approximate semantic entailment. Nature Machine Intelligence. doi.org/10.1038/s42256-024-00795-w.