A Practical Introduction to Large Language Models in Biomedical Data Science Research
March 12, 2024Description Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable proficiency in understanding and generating language across a wide range of disciplines. In the field of biomedical data science and computational biology, LLMs can play a significant role in enhancing information accessibility, facilitating data analysis, and fostering knowledge discovery. This tutorial provides an introductory, hands-on approach to understanding and leveraging LLMs in the context of biomedical data science. The tutorial begins by establishing a foundational understanding of LLMs and Biomedical Data Science. It then explores key applications of LLMs in biomedical data science and computational biology, including retrieval-augmented generation, database functionalities, and code generation. Through relevant case studies, participants will learn how to effectively utilize LLMs to bridge the gap between technical feasibility and practical utility in biomedical data science. The tutorial also includes hands-on exercises to allow participants to apply their knowledge in real-time. Additionally, participants will become familiar with OpenAI’s ChatGPT and other open-source LLMs, gaining insights into their design, use cases, limitations, and future prospects.
Learning Objectives:
- Understanding the key aspects of large-scale biomedical data.
- Using LLMs to handle and interpret vast amounts of biomedical data.
- Exploring cutting-edge research topics through two invited talks.
- Utilizing OpenAI APIs for GPTs and open-source LLMs in Python.
- Enhancing coding efficiency in bioinformatics by integrating LLMs.
- Deploying LLMs for biomedical question-answering and academic literature exploration.
Intended Audience and Level:
This tutorial is aimed at graduate students, researchers, data analysts, and practitioners in bioinformatics, computational biology, and biomedical informatics. It is beneficial for those looking to leverage Large Language Models (LLMs) to enhance their analytical skills. The content is suitable for individuals interested in broadening and deepening their understanding of LLMs.
Introduction to LLMs with a focus on Biomedical Data Science
Large Language Models (LLMs) are a type of artificial intelligence that uses vast amounts of text data to understand and generate human-like language. They have gained significant attention in recent years for their ability to perform a wide range of natural language processing tasks, including text generation, translation, summarization, and question answering. In the context of biomedical data science, LLMs hold great promise for advancing research and innovation in healthcare.
Understanding LLMs: LLMs are typically based on deep learning architectures, such as transformers, which allow them to process and generate text with a high degree of complexity and accuracy. These models are trained on large-scale datasets, such as books, articles, and websites, to learn the statistical patterns and semantic relationships of language.
Applications in Biomedical Data Science: In biomedical data science, LLMs can be used for a variety of tasks, including:
- Text Mining: Extracting information from biomedical literature, electronic health records (EHRs), and other textual sources.
- Drug Discovery: Identifying potential drug candidates and predicting their properties based on chemical structures and biological data.
- Clinical Decision Support: Analyzing patient data to assist healthcare providers in making informed decisions about diagnosis and treatment.
- Genomics and Proteomics: Analyzing genetic sequences and protein structures to understand their functions and interactions.
Challenges and Considerations: While LLMs offer many benefits, there are also challenges and considerations to be aware of, including:
- Data Bias: LLMs can reflect and amplify biases present in the training data, which can lead to unfair or inaccurate outcomes, especially in sensitive areas like healthcare.
- Ethical and Privacy Concerns: The use of LLMs in healthcare raises concerns about patient privacy and the ethical implications of using AI to make decisions that affect people’s lives.
- Interpretability: LLMs are often considered “black boxes” because their internal workings are complex and not easily interpretable, which can make it difficult to understand how they arrive at their conclusions.
LLMs have the potential to revolutionize biomedical data science by enabling researchers and healthcare professionals to analyze and interpret large volumes of text data more effectively. However, it is essential to address challenges related to bias, privacy, and interpretability to ensure that LLMs are used responsibly and ethically in healthcare and other domains.
How to use GPT-3.5 and GPT-4 with Python
Using GPT-3.5 and GPT-4 with Python involves using OpenAI’s API. Here’s a general outline of the steps you would typically follow:
- Get API Access: Sign up for an API key from OpenAI to access GPT-3.5 and GPT-4. You can do this on the OpenAI website.
- Install OpenAI Python Package: Install the OpenAI Python package using pip:bash
pip install openai
- Set up Environment Variable: Set your API key as an environment variable to authenticate API requests. You can do this by adding the following line to your shell startup file (e.g.,
.bashrc
,.zshrc
):bashexport OPENAI_API_KEY="your-api-key-here"
Then, restart your shell or run
source ~/.bashrc
(orsource ~/.zshrc
). - Use OpenAI API in Python: Use the
openai
Python package to interact with the GPT models. Here’s an example of using GPT-3.5 and GPT-4 for text generation:pythonimport openai
# Set the API key
openai.api_key = "your-api-key-here"# Example prompt
prompt = "Once upon a time"# Generate text using GPT-3.5
response_3_5 = openai.Completion.create(
model="text-davinci-003", # GPT-3.5 model
prompt=prompt,
max_tokens=50
)# Generate text using GPT-4
response_4 = openai.Completion.create(
model="text-davinci-004", # GPT-4 model
prompt=prompt,
max_tokens=50
)print("GPT-3.5:", response_3_5["choices"][0]["text"])
print("GPT-4:", response_4["choices"][0]["text"])
Replace
"your-api-key-here"
with your actual API key. You can adjustprompt
andmax_tokens
to suit your needs.
Remember to handle API responses appropriately and refer to the OpenAI API documentation for more details on available models, parameters, and best practices.
How to use Open-source LLMs with Python
Using open-source Large Language Models (LLMs) like GPT-3.5 or GPT-4 with Python typically involves leveraging libraries such as Hugging Face’s Transformers or OpenAI’s own API. Here’s a general guide to get you started:
- Install the necessary libraries:bash
pip install transformers
- Load the model: You can load a pre-trained model using
AutoModelForCausalLM
for causal language modeling tasks orAutoModelForSequenceClassification
for text classification tasks, and specify the model and tokenizer name (e.g., “gpt2”, “gpt2-medium”, “gpt2-large”, “gpt2-xl”, etc.):pythonfrom transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "gpt2" # Example model name, replace with your desired model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
- Generate text: Use the model to generate text based on a prompt:python
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
- Fine-tuning (Optional): If you want to fine-tune the model on your own dataset, you can do so by using the
Trainer
class from thetransformers
library. This involves defining a training dataset, a validation dataset, and specifying training parameters. - Using GPT-4 (or any future models): As of my last update, GPT-4 was not publicly released. If and when it becomes available, you would use it similarly to how GPT-3.5 is used, by specifying the appropriate model name in the
AutoModelForCausalLM
andAutoTokenizer
classes.
Remember to always respect the terms of service and usage limits of the model you are using, especially for models like GPT-3.5 and GPT-4 which may have usage restrictions.
Database Query Generation with LLMs
Database query generation using Large Language Models (LLMs) involves creating natural language queries to retrieve information from databases. This process can be achieved using Python libraries and LLMs like GPT-3.5 or GPT-4. Here’s a general approach to generating database queries with LLMs in Python:
- Install Required Libraries: Install the libraries required for interfacing with the LLM and the database. For LLMs, you might use the OpenAI Python package, and for databases, libraries like SQLAlchemy for SQL databases or pymongo for MongoDB.bash
pip install openai sqlalchemy pymongo
- Initialize the LLM: Initialize the LLM using the OpenAI API. You’ll need an API key from OpenAI.python
import openai
openai.api_key = 'YOUR_API_KEY'
- Provide Database Schema: Provide the schema of your database to the LLM so it understands the structure of the database and can generate relevant queries. This can be done through a prompt.
- Generate Queries: Use the LLM to generate queries based on user input or predefined scenarios.python
prompt = "Generate a SQL query to retrieve all customers from the database."
response = openai.Completion.create(
model="text-davinci-003",
prompt=prompt,
max_tokens=50
)query = response.choices[0].text.strip()
- Execute Queries: Execute the generated query on the database using the appropriate library.python
from sqlalchemy import create_engine
import pandas as pd# Example for SQL database
engine = create_engine('sqlite:///example.db')
df = pd.read_sql_query(query, engine)
- Process and Display Results: Process the query results and display them to the user.python
print(df)
- Iterate and Refine: Iterate on the process, refining the prompts and the way queries are generated based on the LLM’s responses and the effectiveness of the queries.
Note: Depending on the complexity and specific requirements of your project, you may need to adjust and customize the steps above. Also, ensure you handle sensitive information securely, especially when interacting with databases using LLMs.
Retrieval-augmented Generation with Large Language Models
Retrieval-augmented generation (RAG) is a technique that combines the capabilities of large language models (LLMs) with external knowledge sources, such as a document database, to improve the quality and relevance of generated text. This technique is particularly useful for generating informative and contextually relevant responses to queries. Here’s a general approach to implementing retrieval-augmented generation with LLMs like GPT-3.5 or GPT-4 in Python:
- Install Required Libraries: Install the libraries required for interfacing with the LLM and the document database.bash
pip install openai transformers
- Initialize the LLM: Initialize the LLM using the Hugging Face Transformers library. You’ll need to choose a specific model to use, such as GPT-3.5 or GPT-4, and load the tokenizer and model.python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = 'gpt2' # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
- Provide Context and Query: Provide the context and query to the model. The context can be the prompt or initial text, and the query can be the information you want to retrieve from the document database.python
context = "In a recent study, researchers found that"
query = "the impact of climate change on biodiversity."
input_text = context + " " + query
input_ids = tokenizer.encode(input_text, return_tensors='pt')
- Retrieve Relevant Documents: Use the query to retrieve relevant documents from the document database.python
relevant_documents = retrieve_documents_from_database(query)
- Generate Text: Generate text using the LLM, incorporating information from the retrieved documents.python
generated_text = model.generate(
input_ids=input_ids,
max_length=1000,
num_return_sequences=1,
top_k=50,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
no_repeat_ngram_size=3,
num_beams=5,
context=relevant_documents
)[0]
- Process and Display Results: Process the generated text and display it to the user.python
print(tokenizer.decode(generated_text, skip_special_tokens=True))
- Iterate and Refine: Iterate on the process, refining the query, document retrieval, and text generation steps based on the quality and relevance of the generated text.
Note: Retrieval-augmented generation can be complex and resource-intensive, depending on the size of the document database and the specific requirements of your application. Consider using caching or other optimization techniques to improve performance.
Code generation in Bioinformatics
Code generation in bioinformatics involves using large language models (LLMs) to automatically generate code for various bioinformatics tasks, such as data preprocessing, analysis, and visualization. This can help automate repetitive tasks and speed up the development process. Here’s a general approach to implementing code generation in bioinformatics using LLMs in Python:
- Install Required Libraries: Install the libraries required for interfacing with the LLM.bash
pip install openai transformers
- Initialize the LLM: Initialize the LLM using the Hugging Face Transformers library. You’ll need to choose a specific model to use, such as GPT-3.5 or GPT-4, and load the tokenizer and model.python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = 'gpt2' # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
- Provide Task Description: Provide a description of the bioinformatics task for which you want to generate code. This description should include details such as the input data format, the desired output, and any specific requirements or constraints.python
task_description = "Parse a FASTA file and calculate the GC content of each sequence."
- Generate Code: Use the LLM to generate code based on the task description.python
input_text = "Generate code to " + task_description
input_ids = tokenizer.encode(input_text, return_tensors='pt')
generated_code = model.generate(
input_ids=input_ids,
max_length=1000,
num_return_sequences=1,
top_k=50,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
no_repeat_ngram_size=3,
num_beams=5
)[0]
- Process and Use Generated Code: Process the generated code and use it in your bioinformatics workflow.python
generated_code_str = tokenizer.decode(generated_code, skip_special_tokens=True)
print(generated_code_str)
- Iterate and Refine: Iterate on the process, refining the task description and the way code is generated based on the quality and relevance of the generated code.
Note: Code generation with LLMs in bioinformatics can be challenging due to the complexity of bioinformatics tasks and the need for accurate and efficient code. It’s important to carefully review and test the generated code to ensure it meets your requirements and produces the desired results.
Large Language Models for Biomedicine: from PubMed Search to Gene Set Analysis
Large Language Models (LLMs) like GPT-3.5 and GPT-4 can be valuable tools in bioinformatics for tasks ranging from text mining PubMed articles to generating code for gene set analysis. Here’s a general approach to using LLMs for these tasks:
- Text Mining PubMed Articles:
- Retrieve Relevant Articles: Use the PubMed API or a Python library like Biopython to retrieve relevant articles based on keywords or specific queries.
- Preprocess Text: Clean and preprocess the text from the articles, removing noise and irrelevant information.
- Use LLM for Summarization: Utilize the LLM to generate summaries of the articles, extracting key information such as genes, proteins, diseases, and treatments.
pythonfrom transformers import pipeline
summarization_pipeline = pipeline("summarization", model="t5-base", tokenizer="t5-base")
summary = summarization_pipeline(article_text, max_length=150, min_length=30, do_sample=False)[0]['summary_text']
- Gene Set Analysis:
- Define Gene Sets: Define sets of genes or proteins related to a particular biological process or disease.
- Generate Code for Analysis: Use the LLM to generate code for performing gene set analysis, such as enrichment analysis or pathway analysis, using libraries like Biopython or pandas.
pythonanalysis_code = """
# Code for gene set analysis using Biopython or other libraries
"""
- Integration and Automation:
- Combine Steps: Integrate the text mining and gene set analysis steps into a single workflow, using the LLM to bridge the gap between retrieving information from articles and analyzing gene sets.
- Automation: Use the LLM to generate scripts or workflows that automate the entire process, from searching PubMed to performing gene set analysis.
- Iterate and Improve:
- Refinement: Continuously refine the process based on the quality of the generated summaries and analysis code.
- Feedback Loop: Incorporate feedback from domain experts to improve the relevance and accuracy of the generated results.
By leveraging the capabilities of LLMs, researchers in bioinformatics can streamline the process of extracting knowledge from biomedical literature and performing analyses on gene sets, ultimately accelerating research in the field.
Large language models (LLMs) have shown great promise in various biomedical applications, including PubMed search and gene set analysis. In this tutorial, we will demonstrate how to leverage LLMs, such as GPT-3.5 or GPT-4, for these tasks using Python. We’ll use the Hugging Face Transformers library for interacting with the LLM and other Python libraries for data processing.
Part 1: PubMed Search
Step 1: Install Required Libraries
First, install the necessary libraries:
pip install transformers requests
Step 2: Initialize the LLM
Initialize the LLM using the Hugging Face Transformers library:
from transformers import pipelinepubmed_search = pipeline("text-generation", model="GPT-3.5-turbo", device=0)
Step 3: Perform a PubMed Search
Perform a PubMed search using the LLM:
query = "Recent advances in cancer immunotherapy"
result = pubmed_search(query, max_length=200, do_sample=True, temperature=0.7, top_p=0.9, top_k=50, num_return_sequences=1)
Step 4: Display Search Results
Display the search results:
for i, res in enumerate(result):
print(f"Result {i+1}: {res['generated_text']}\n")
Part 2: Gene Set Analysis
Step 1: Install Required Libraries
Install the necessary libraries:
pip install gseapy pandas
Step 2: Perform Gene Set Analysis
Perform gene set analysis using the Gene Set Enrichment Analysis (GSEA) tool:
import gseapy as gpgeneset = "KEGG_2019_Human"
enr = gp.enrichr(gene_list=['BRCA1', 'TP53', 'EGFR'], description='test_gene_set', gene_sets=geneset)
Step 3: Display Gene Set Analysis Results
Display the gene set analysis results:
print(enr.res2d)
In this tutorial, we’ve demonstrated how to use LLMs for PubMed search and gene set analysis. These are just two examples of how LLMs can be applied in biomedicine. LLMs have the potential to revolutionize biomedical research by enabling researchers to perform complex tasks more efficiently and accurately.
AI in Biomedicine: Developing Representations of Disease-Relevant Molecules
Artificial Intelligence (AI) has revolutionized the field of biomedicine, particularly in the development of representations for disease-relevant molecules. This cutting-edge research area focuses on leveraging AI techniques to model and understand the complex interactions of molecules in biological systems, leading to significant advancements in drug discovery, personalized medicine, and disease diagnosis.
One of the key challenges in biomedicine is the vast and intricate nature of molecular data. Traditional methods struggle to efficiently analyze and interpret this data due to its complexity. AI offers a promising solution by providing powerful tools to process and extract meaningful insights from large-scale molecular datasets.
Machine learning algorithms, particularly deep learning models, have been instrumental in developing representations of disease-relevant molecules. These models can learn complex patterns and relationships within molecular data, enabling them to predict molecular properties, identify potential drug candidates, and uncover novel biological mechanisms.
For example, deep learning models such as graph neural networks (GNNs) have been successfully applied to represent molecules as graphs, where nodes represent atoms and edges represent chemical bonds. GNNs can learn to capture the structural and chemical properties of molecules, allowing them to predict molecular properties such as bioactivity, solubility, and toxicity.
Another approach is the use of generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), to generate novel molecular structures with desired properties. These models can be used to design new drug candidates or optimize existing molecules for specific therapeutic purposes.
In addition to molecular representation, AI has also been applied to integrate molecular data with other types of biomedical data, such as genomic, clinical, and imaging data. This integrative approach enables researchers to gain a more comprehensive understanding of disease mechanisms and develop more effective treatments.
Overall, AI has significantly advanced the field of biomedicine by providing powerful tools to develop representations of disease-relevant molecules. These representations have the potential to revolutionize drug discovery and personalized medicine, leading to improved patient outcomes and better health outcomes for society as a whole.
Integrating Biomedical Data Database Development with LLMs
Integrating Biomedical Data Database Development with Large Language Models (LLMs) involves leveraging the capabilities of LLMs to facilitate the development and management of biomedical databases. Here’s a general approach to integrating LLMs into database development in the biomedical domain:
- Data Schema Design: Use LLMs to assist in designing the schema of the biomedical database. LLMs can help generate data models and relationships based on the desired functionalities and data types.
- Query Language Generation: LLMs can be used to generate database query languages (e.g., SQL) for creating, updating, and querying the database. This can simplify the process of writing complex queries and ensure compatibility with the database schema.
- Data Annotation and Curation: LLMs can assist in annotating and curating biomedical data. For example, LLMs can be used to automatically extract relevant information from text or images and annotate the data with appropriate metadata.
- Database Population: LLMs can help populate the database with initial data by generating synthetic data or transforming existing data into the required format. This can be particularly useful for testing and development purposes.
- Natural Language Interface: Develop a natural language interface for querying the biomedical database using LLMs. This can improve the accessibility of the database, especially for users who are not familiar with query languages.
- Data Integration: Use LLMs to integrate data from multiple sources into the biomedical database. LLMs can help match and merge data records, resolve conflicts, and ensure data consistency.
- Database Maintenance: LLMs can assist in database maintenance tasks, such as data cleaning, updating, and monitoring. LLMs can also be used to automate routine tasks, such as backups and security checks.
- Knowledge Discovery: LLMs can aid in knowledge discovery by analyzing the data in the biomedical database and identifying patterns, trends, and associations that may not be apparent through manual inspection.
Integrating LLMs into biomedical database development can streamline the process, improve data quality, and enhance the usability of the database. However, it’s important to consider the limitations and ethical implications of using LLMs in this context, such as ensuring data privacy and avoiding bias in data annotations and queries.
Querying PubMed with RAG to answer biomedical questions with GPT-4
To query PubMed using retrieval-augmented generation (RAG) to answer biomedical questions with GPT-4, you would follow a process that combines information retrieval from PubMed with the text generation capabilities of GPT-4. Here’s a general approach:
- Install Required Libraries: Install the libraries required for interfacing with PubMed and GPT-4.bash
pip install biopython transformers
- Initialize the GPT-4 Model: Initialize the GPT-4 model using the Hugging Face Transformers library.python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
- Retrieve Relevant PubMed Documents: Use the Biopython library to search PubMed and retrieve relevant documents based on the query.python
from Bio import Entrez
def search_pubmed(query, num_results=5):
Entrez.email = 'your_email@example.com' # Set your email
handle = Entrez.esearch(db='pubmed', term=query, retmax=num_results)
record = Entrez.read(handle)
return record['IdList']query = "your biomedical question"
pmids = search_pubmed(query)
- Retrieve Abstracts or Full Texts: Retrieve abstracts or full texts of the relevant PubMed documents.python
def retrieve_pubmed_abstract(pmid):
handle = Entrez.efetch(db='pubmed', id=pmid, retmode='text', rettype='abstract')
return handle.read()documents = [retrieve_pubmed_abstract(pmid) for pmid in pmids]
- Generate Response: Generate a response using GPT-4, incorporating information from the retrieved PubMed documents.python
context = "\n\n".join(documents)
input_text = context + "\nQuestion: " + query
input_ids = tokenizer.encode(input_text, return_tensors='pt')response = model.generate(input_ids, max_length=500, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(response[0], skip_special_tokens=True)
- Display or Process Results: Process the generated response and display it to the user.python
print(generated_text)
- Iterate and Refine: Iterate on the process, refining the query, document retrieval, and text generation steps based on the quality and relevance of the generated response.
Note: The above code provides a general framework and may need to be adapted based on specific requirements and limitations of the PubMed API and GPT-4. Additionally, ensure compliance with the PubMed terms of use and data privacy policies when using the PubMed API.
Code generation in Bioinformatics with Opensource LLMs
Large Language Models (LLMs), such as GPT-3, have shown promising capabilities in code generation tasks, including those in bioinformatics. These models can generate code snippets based on natural language descriptions, which can be useful for automating repetitive tasks, prototyping, and even generating complex algorithms. Here’s an overview of how LLMs can be used for code generation in bioinformatics:
- Natural Language Input: Users can describe their bioinformatics task or problem in natural language, specifying the desired functionality, input data, and expected output.
- Code Generation: LLMs can then generate code snippets based on the input description. For example, given a request to process a FASTA file containing DNA sequences, an LLM could generate Python code to read the file, extract the sequences, and perform sequence analysis tasks.
- Syntax and Semantics: LLMs are trained on a vast amount of text data, including code snippets, which helps them understand both the syntax and semantics of programming languages. This allows them to generate code that is not only syntactically correct but also functionally relevant.
- Integration with Bioinformatics Tools: LLM-generated code can be integrated with existing bioinformatics tools and libraries, such as Biopython or Bioconductor, to perform complex analyses and computations.
- Automation and Efficiency: By automating the code generation process, LLMs can help bioinformaticians save time and effort, especially for repetitive or routine tasks.
- Quality and Validation: While LLMs can generate code, it’s important to validate the output to ensure its correctness and efficiency. Manual review and testing are often necessary, especially for critical applications.
It’s worth noting that while LLMs show great promise in code generation tasks, they are not without limitations. They may struggle with context understanding, especially in complex or ambiguous scenarios, and may generate code that needs further refinement or adaptation. As the field of LLMs continues to evolve, we can expect to see more advancements and improvements in code generation capabilities, benefiting the bioinformatics community and beyond.