Large Language Models

ChatGPT in Bioinformatics and Biomedical Informatics

December 18, 2024

The rapid evolution of artificial intelligence (AI) has brought transformative possibilities to scientific research, with large language models (LLMs) like ChatGPT leading the charge. Since its public debut in late 2022, ChatGPT has sparked immense interest across disciplines, including bioinformatics and biomedical informatics. A recent review offers a comprehensive look at how ChatGPT has been utilized in these fields during its first year, highlighting its potential, limitations, and future directions.


Introduction

In 2023, the scientific community embraced ChatGPT for its potential to address complex challenges through natural language understanding and response generation. The model’s ability to analyze biomedical literature, generate code, and perform computational tasks positions it as a versatile tool in bioinformatics and biomedical informatics. With over 2,000 PubMed-indexed articles referencing ChatGPT in 2023, the enthusiasm is clear. However, only about 30 publications explicitly explored its role in bioinformatics, underscoring the nascent stage of its adoption in this field.


Key Areas of Application

1. Omics Analysis

  • Cell Type Annotation:
    ChatGPT-4 has shown promise in annotating cell types in single-cell RNA sequencing (scRNA-Seq) data, aligning closely with manual annotations and surpassing traditional methods. For instance, it can predict cell types based on tissue names and marker genes, streamlining workflows for researchers without coding expertise. However, human validation remains essential to ensure accuracy.
  • Genomics:
    ChatGPT performs well in extracting gene names and identifying protein-coding genes but struggles with tasks like analyzing single nucleotide polymorphisms (SNPs) and alignments. Tools such as GeneGPT, which integrate external databases like NCBI, help bridge these gaps.
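The annotation workflow described above can be illustrated with a short sketch that assembles a cell-type annotation prompt from a tissue name and per-cluster marker genes. The tissue, marker genes, and prompt wording here are illustrative assumptions, not the exact protocol of the cited studies; the resulting prompt would be sent to a chat model and the reply checked by a human expert, as noted above.

```python
# Sketch: build a cell-type annotation prompt for scRNA-Seq clusters.
# Tissue, markers, and wording are assumptions for illustration only.

def build_annotation_prompt(tissue: str, markers_by_cluster: dict) -> str:
    """Assemble one prompt asking the model to name each cluster's cell type."""
    lines = [
        f"Identify the most likely cell type for each cluster from {tissue}.",
        "Answer with one cell type per cluster.",
    ]
    for cluster, markers in markers_by_cluster.items():
        lines.append(f"Cluster {cluster} marker genes: {', '.join(markers)}")
    return "\n".join(lines)

prompt = build_annotation_prompt(
    "human peripheral blood",
    {0: ["CD3D", "CD3E", "IL7R"],   # canonical T-cell markers
     1: ["MS4A1", "CD79A"]},        # canonical B-cell markers
)
print(prompt)
```

In practice the prompt string would be passed to a chat-completion endpoint; the value of this framing is that researchers without coding expertise only supply tissue and marker lists.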

2. Biomedical Text Mining

  • Strengths:
    ChatGPT excels in tasks like question answering and summarization, often rivaling state-of-the-art (SOTA) domain-specific models when task-specific training data is limited.
  • Limitations:
    Its performance in named entity recognition and relation extraction lags behind SOTA models. Furthermore, when applied to biomedical knowledge graphs, ChatGPT tends to suggest known associations rather than uncover novel insights.
  • Improving Performance:
    Techniques like In-context Learning (ICL) and Chain-of-Thought (CoT) prompting enhance the model’s efficacy. Additionally, incorporating protein dictionaries and iterative prompt optimization, such as the tree-of-thought method, has shown incremental improvements.
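The two prompting techniques just mentioned can be combined in a single prompt: In-context Learning supplies worked examples, and a Chain-of-Thought cue asks the model to reason step by step. The sketch below shows this for a relation-extraction query; the example sentences and labels are invented for illustration, not drawn from a benchmark.

```python
# Sketch: In-context Learning (few-shot examples) plus a Chain-of-Thought
# trigger for a biomedical relation-extraction prompt. Examples are invented.

FEW_SHOT_EXAMPLES = [
    ("Aspirin inhibits COX-1.", "drug-target inhibition: aspirin -> COX-1"),
    ("Metformin activates AMPK.", "drug-target activation: metformin -> AMPK"),
]

def build_icl_cot_prompt(sentence: str) -> str:
    parts = ["Extract the drug-target relation from each sentence."]
    for text, label in FEW_SHOT_EXAMPLES:  # in-context (few-shot) examples
        parts.append(f"Sentence: {text}\nRelation: {label}")
    parts.append(f"Sentence: {sentence}")
    # Chain-of-Thought cue: ask the model to reason before answering.
    parts.append("Let's think step by step before giving the relation.")
    return "\n\n".join(parts)

print(build_icl_cot_prompt("Imatinib inhibits BCR-ABL."))
```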

3. Drug Discovery

  • Drug-Disease Associations:
    ChatGPT has demonstrated potential in identifying drug-disease relationships and generating molecular captions.
  • Challenges:
    Complex tasks like interpreting chemical structures and predicting drug-drug interactions (DDIs) reveal its limitations compared to SOTA models. Human oversight remains critical, as tools like ChatDrug illustrate the value of human-in-the-loop workflows.
  • Advancements:
    Instruction fine-tuning and task-specific tuning have improved performance, with open-source LLMs offering promising avenues for domain-specific applications.
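Molecular captioning, one of the drug-discovery tasks above, typically starts from a SMILES string. The sketch below builds such a captioning prompt; the SMILES shown is aspirin (a real structure), but the prompt wording is an illustrative assumption rather than the protocol of any cited study.

```python
# Sketch: a molecule-captioning prompt from a SMILES string.
# Prompt wording is an assumption; the SMILES is aspirin.

def build_caption_prompt(smiles: str) -> str:
    return ("You are a medicinal chemistry assistant.\n"
            f"Describe the molecule with SMILES {smiles}: "
            "name it if possible and summarize its key functional groups.")

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # acetylsalicylic acid
print(build_caption_prompt(aspirin))
```

As the review notes, model output for such prompts still needs expert review, since LLMs handle textual chemistry better than structural reasoning.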

4. Biomedical Image Understanding

  • Multimodal Models:
    The emergence of GPT-4 Vision (GPT-4V) enables analysis of medical images, such as performing visual question answering or assisting in classification tasks. Techniques like Visual Referring Prompting (VRP) aim to refine image-text interaction further.

5. Bioinformatics Programming

  • Code Generation:
    ChatGPT facilitates programming tasks by generating scripts and translating natural language queries into SQL statements, making bioinformatics tools more accessible to non-coders. However, its capabilities for creating complex, multi-step workflows are still limited.
  • Automation Tools:
    Platforms like RTutor.AI and Chatlize.AI leverage LLMs for automated bioinformatics workflows, while tools such as AutoBA improve error feedback and code robustness.
  • Benchmarks:
    Benchmarks like BIOCODER assess the quality of LLM-generated code, highlighting gaps in readability and execution success.
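To make the natural-language-to-SQL use case concrete, the sketch below executes the kind of query an LLM might generate for the plain-language request "list genes with mean expression above 5". The schema, data, and generated query are assumptions for demonstration; as the benchmarks above suggest, real LLM-generated SQL should be reviewed before execution.

```python
# Sketch: running a hypothetical LLM-generated SQL query on a toy
# gene-expression table. Schema and query are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, sample TEXT, value REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("TP53", "s1", 8.2), ("TP53", "s2", 7.9),
     ("GAPDH", "s1", 3.1), ("GAPDH", "s2", 2.8)],
)

# A plausible LLM answer to: "list genes with mean expression above 5"
generated_sql = """
SELECT gene, AVG(value) AS mean_expr
FROM expression
GROUP BY gene
HAVING AVG(value) > 5
"""
for gene, mean_expr in conn.execute(generated_sql):
    print(gene, round(mean_expr, 2))
# -> TP53 8.05
```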

6. Bioinformatics Education

  • Potential:
    ChatGPT offers educational benefits, from explaining error messages to generating readable code. However, effective prompt engineering is a critical skill for maximizing its utility.
  • Risks:
    Over-reliance on AI could lead to superficial understanding, potentially undermining students’ grasp of core concepts. Developing repositories of high-quality prompts may help mitigate this risk.

Timeline

  • Late 2022: Chat Generative Pre-trained Transformer (ChatGPT) is launched to the public, marking a new era in AI and attracting interest from the biomedical research community.
  • Early 2023: Widespread exploration begins, applying ChatGPT across disciplines like bioinformatics and biomedical informatics.
  • Throughout 2023:
    – Over 2,000 manuscripts indexed in PubMed with the keyword “ChatGPT”.
    – Research focuses on ChatGPT’s performance in bioinformatics tasks such as omics, genomics, text mining, drug discovery, image understanding, and bioinformatics programming.
    – Development of tools like ChatDrug, DrugChat, and DrugAssist for enhancing drug discovery workflows.
    – Novel prompting techniques like few-shot learning and Chain-of-Thought (CoT) prompting are introduced.
    – Retrieval-Augmented Generation (RAG) techniques are used to improve chatbot accuracy and relevance.
    – Early studies highlight ChatGPT’s capabilities in bioinformatics coding tasks like sequence alignment and evolutionary tree construction.
  • Mid-2023: Code Interpreter is integrated into ChatGPT-4; performance degradation is observed in some text mining tasks, mitigated by iterative prompt optimization.
  • Late September 2023: GPT-4V (Vision) is released, initiating studies on biomedical image understanding capabilities.
  • End of 2023:
    – Focus turns to addressing ChatGPT’s limitations like “hallucinations” and enhancing performance through fine-tuning, prompt engineering, and human-in-the-loop workflows.
    – Benchmark development begins for evaluating LLM capabilities in bioinformatics.
  • Early 2024:
    – Release of the open-source Gemma LLM by Google.
    – Publication of the GeneGPT tool, linking Codex LLMs to the NCBI database.
  • May 2024: GPT-4o is released with Code Interpreter as the default option.

Key Challenges and Limitations

  1. Hallucinations:
    ChatGPT often fabricates plausible but incorrect information, posing risks for non-experts who may not detect these errors.
  2. Reasoning and Quantitative Analysis:
    The model struggles with complex reasoning and mathematical tasks, necessitating supplementary human expertise.
  3. Variability:
    Model updates can alter its behavior, reducing the reliability of previously effective prompts.
  4. Ethical and Transparency Concerns:
    Security and transparency remain critical, particularly for sensitive biomedical data. Open-source models and local deployments offer alternatives to proprietary cloud-based solutions.

Future Directions

To fully harness ChatGPT’s potential in bioinformatics, several advancements are necessary:

  1. Enhanced Prompt Engineering: Iterative refinement and incorporation of domain-specific context.
  2. Retrieval-Augmented Generation (RAG): Integration with external knowledge bases for improved robustness.
  3. Domain-Specific Fine-Tuning: Training models with targeted datasets for specific bioinformatics tasks.
  4. Benchmark Development: Creating comprehensive metrics and datasets to evaluate performance across diverse applications.
  5. Human-AI Collaboration: Emphasizing human oversight to correct errors and ensure reliability.
  6. Ethical Considerations: Prioritizing responsible use and transparency in deploying LLMs.
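The RAG pattern in the list above can be sketched end to end: retrieve the passages most relevant to a question, then prepend them to the prompt so the model answers from grounded text. The corpus and scoring function (plain word overlap) below are deliberately simplistic assumptions; production systems use embedding-based retrieval over curated knowledge bases.

```python
# Toy sketch of Retrieval-Augmented Generation: keyword-overlap retrieval
# followed by prompt assembly. Corpus and scoring are illustrative only.

CORPUS = [
    "BRCA1 is a tumor suppressor gene involved in DNA repair.",
    "SMILES strings encode molecular structures as text.",
    "scRNA-Seq measures gene expression in individual cells.",
]

def retrieve(question: str, k: int = 1) -> list:
    """Return the k corpus passages sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(CORPUS,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (f"Context:\n{context}\n\n"
            f"Answer using only the context.\nQuestion: {question}")

print(build_rag_prompt("What does scRNA-Seq measure?"))
```

The "answer using only the context" instruction is what ties retrieval to the hallucination problem discussed earlier: the model is steered toward the retrieved evidence rather than its parametric memory.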

Conclusion

ChatGPT has emerged as a promising tool for bioinformatics and biomedical informatics, demonstrating significant potential in text mining, omics, drug discovery, and education. However, its limitations—ranging from hallucinations to challenges in reasoning—highlight the need for human oversight, effective prompt engineering, and continuous research. By addressing these challenges, ChatGPT and similar LLMs can drive innovation in bioinformatics while safeguarding scientific rigor.

FAQ: ChatGPT in Bioinformatics and Biomedical Informatics

  • What are the primary areas where ChatGPT has been applied within bioinformatics and biomedical informatics?
  • ChatGPT’s applications have been explored across a wide range of areas, including omics data analysis (like single-cell RNA sequencing), genomics, biomedical text mining, drug discovery, biomedical image analysis, bioinformatics programming, and bioinformatics education. Specifically, it has been used for tasks like cell type annotation, gene identification, text-based question answering, and generating code for data analysis.
  • How well does ChatGPT perform in typical biomedical text mining tasks compared to specialized models?
  • While ChatGPT has shown impressive capabilities in natural language processing, it doesn’t consistently outperform specialized models that have been fine-tuned for specific biomedical tasks, such as named entity recognition, relation extraction, and sentence classification. However, in scenarios where task-specific training data is limited, ChatGPT and similar models may perform better than fine-tuned models due to their ability to work in a “zero-shot” setting. Additionally, strategic prompt engineering, such as providing examples or asking the chatbot to think step by step (Chain-of-Thought), has been shown to improve ChatGPT’s performance.
  • What are some limitations of ChatGPT when applied to bioinformatics and biomedical informatics?
  • Despite its strengths, ChatGPT has several key limitations. It may struggle with tasks requiring deep reasoning, quantitative analysis, or that involve areas outside of its training data. ChatGPT also faces challenges in producing accurate and complex code, accurately relating genes and pathways, and may sometimes generate “hallucinated” information. Furthermore, its performance can vary between updates, potentially affecting the reliability of prompts over time. Data privacy and concerns around dependence on AI are also issues that are under consideration.
  • How can prompt engineering improve the performance of ChatGPT in bioinformatics and related fields?
  • Prompt engineering is crucial for enhancing ChatGPT’s responses. This includes strategies like “in-context learning” (providing examples of the desired output), using “Chain-of-Thought” prompting, specifying output formats and including detailed context. Iterative prompt refinement – repeatedly adjusting prompts based on the chatbot’s response – can also improve results. Furthermore, a tree-of-thought methodology has also been used to improve the chatbot’s performance in specific scenarios.
  • What is ‘prompt bioinformatics’ and how does it differ from prior methods?
  • “Prompt bioinformatics” refers to using natural language instructions to guide chatbots for bioinformatics analysis by generating executable code. It differs from older bioinformatics chatbots because the code is generated dynamically based on user instructions and can vary between sessions. This approach allows scientists with limited coding expertise to perform complex data analyses, but it also raises challenges in ensuring reproducibility of results.
  • How is ChatGPT being used in drug discovery and what are its strengths and limitations in this domain?
  • In drug discovery, ChatGPT is being used for tasks such as identifying drug-disease associations, generating molecular captions, and predicting drug-drug interactions. It demonstrates proficiency in textual chemistry but faces limitations when it comes to accurately interpreting molecular structure data or predicting drug interactions based on underlying mechanisms. While helpful, ChatGPT requires human oversight and expert knowledge, and some studies show it is unable to pass pharmacist licensing examinations.
  • What role does human-in-the-loop play when using ChatGPT in bioinformatics and drug discovery applications?
  • Human involvement is critical for reliable use of ChatGPT: the human-in-the-loop approach involves the user critically evaluating the chatbot’s responses, providing feedback, and using tools to refine the results. This iterative process is essential to ensure data accuracy, address limitations, and improve the model’s results. For example, some tools allow for conversational refinement of molecular structures or integration of external knowledge databases. In coding applications, error-message feedback is an important part of the loop.
  • What are some advancements in integrating large language models in bioinformatics?
  • Several advancements have been seen in this field including methods of task-tuning or instruction-tuning to adapt LLMs for specific tasks like molecular property prediction, drug-drug interaction prediction, and biomedical text mining. Fine-tuning on domain-specific datasets significantly improves the models’ ability to process and generate information accurately, even when there is limited training data. Another area of advancement is retrieval-augmented generation (RAG) to ground the responses of chatbots through external knowledge and literature.

Glossary of Key Terms

  • Bioinformatics: The application of computer science and information technology to the field of biology, often involving the analysis of large biological datasets.
  • Biomedical Informatics: The interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem-solving, and decision making, motivated by efforts to improve human health.
  • ChatGPT (Chat Generative Pre-trained Transformer): A large language model chatbot developed by OpenAI that is trained to generate human-like text.
  • Large Language Model (LLM): An AI model trained on a massive dataset of text and code that can generate and understand natural language.
  • Omics: A field of biology focused on the comprehensive analysis of large biological datasets, such as genomics (study of genes), proteomics (study of proteins), and transcriptomics (study of RNA transcripts).
  • Single-Cell RNA Sequencing (scRNA-Seq): A technique used to measure the gene expression levels in individual cells, providing a high-resolution view of cellular heterogeneity.
  • Prompt Engineering: The process of designing and refining input instructions to elicit desired responses from a large language model.
  • Zero-Shot Learning: The ability of a machine learning model to perform a task it has not been explicitly trained on, by applying existing knowledge.
  • One-Shot Learning: A learning paradigm where a model learns from a single example of a new task.
  • Few-Shot Learning: A learning paradigm where a model learns from a small number of examples of a new task.
  • In-Context Learning (ICL): A method of using examples within a prompt to guide an LLM’s response, without the need for explicit retraining.
  • Chain-of-Thought (CoT) Prompting: A prompting technique that encourages an LLM to break down a problem into intermediate steps to improve its ability to solve complex problems.
  • Tree of Thought (ToT): An extension of the CoT approach, where the model generates a tree-like structure of reasoning steps instead of a linear chain.
  • Retrieval-Augmented Generation (RAG): A technique that combines a retriever model, which fetches relevant data, with a generator model, which uses the retrieved information to generate responses.
  • Fine-Tuning: The process of further training a pre-trained model on a specific dataset or task to improve its performance in that area.
  • Task Tuning: The process of fine-tuning a pre-trained model on a specific task to enhance its performance on that task.
  • Instruction Tuning: The process of fine-tuning a pre-trained model to better understand and follow natural language instructions, improving its applicability across different tasks.
  • AI Hallucination: The phenomenon where a generative AI model produces false or misleading information not supported by its training data.
  • Biomedical Text Mining: The use of computational techniques to extract useful information and knowledge from biomedical text data, such as scientific articles and clinical notes.
  • Biological Pathway: A series of interactions between molecules in a cell that leads to a certain product or a change in the cell.
  • Protein-Protein Interaction (PPI): The physical contact between two or more proteins that form a complex, often performing a biological function.
  • Drug Discovery: The process of identifying new chemical compounds or biologics with therapeutic potential.
  • Drug-Drug Interaction (DDI): Occurs when the effect of one drug is altered by the presence of another drug.
  • Simplified Molecular-Input Line-Entry System (SMILES): A notation system for describing the structure of chemical molecules using a string of characters.
  • Prompt Bioinformatics: The use of natural language instructions (prompts) to guide chatbots for reliable and reproducible bioinformatics data analysis through code generation.
  • Structured Query Language (SQL): A programming language used to manage and query databases.

Reference

Wang, J., Cheng, Z., Yao, Q., Liu, L., Xu, D., & Hu, G. (2024). Bioinformatics and biomedical informatics with ChatGPT: Year one review. Quantitative Biology, 12(4), 345–359.
