Functional Annotation and Enrichment Analysis in Bioinformatics

March 18, 2024

Course Description:

This course introduces students to the principles and methods of functional annotation and enrichment analysis in bioinformatics. Students will learn how to interpret large-scale omics data, such as gene expression, proteomics, and metabolomics, using various annotation databases and enrichment analysis tools. The course will cover the theoretical foundations as well as hands-on practical sessions to analyze real-world datasets.

Course Objectives:

Understand the concept of functional annotation and enrichment analysis
Learn to use bioinformatics tools and databases for functional annotation
Gain practical experience in performing enrichment analysis on omics data
Interpret and visualize enrichment analysis results

Introduction to Functional Annotation

Introduction to functional annotation and its importance

Functional annotation is a crucial process in bioinformatics that involves assigning biological information to genomic elements, such as genes or proteins, to understand their function, role in biological processes, and potential interactions with other molecules. This information is essential for interpreting the vast amounts of data generated by high-throughput sequencing and other omics technologies.

Functional annotation provides insights into the biological significance of genomic sequences, helping researchers understand how genes and proteins contribute to various biological processes, such as metabolism, development, and disease. It also aids in the identification of genes associated with specific traits or diseases, facilitating biomedical research and drug discovery.

There are several methods used for functional annotation, including sequence similarity searches against databases, domain prediction, and pathway analysis. These methods help researchers infer the function of unknown genes or proteins based on similarities to known sequences or conserved domains.

Overall, functional annotation plays a crucial role in advancing our understanding of genomics and molecular biology, enabling researchers to extract meaningful information from genomic data and accelerate discoveries in various fields, including medicine, agriculture, and environmental science.

Overview of annotation databases (GO, KEGG, Reactome, etc.)

Annotation databases play a crucial role in functional genomics by providing curated information about genes, proteins, pathways, and biological processes. Here’s an overview of some widely used annotation databases:

Gene Ontology (GO): GO provides a structured vocabulary to describe gene and protein attributes in terms of biological processes, cellular components, and molecular functions. It helps standardize the representation of gene and gene product attributes across species.
Kyoto Encyclopedia of Genes and Genomes (KEGG): KEGG is a comprehensive database that integrates genomic, chemical, and systemic functional information. It includes information on pathways, diseases, drugs, and organisms, making it valuable for understanding biological systems at a molecular level.
Reactome: Reactome is a curated database of biological pathways, focusing on human biology. It provides detailed information on molecular events, such as signaling pathways, metabolic pathways, and regulatory processes, helping researchers understand the biological context of genes and proteins.
UniProt: UniProt is a comprehensive resource for protein sequence and functional information, providing data on protein function, structure, domains, and post-translational modifications. It was formed by combining the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases; its knowledgebase consists of manually reviewed Swiss-Prot entries and automatically annotated TrEMBL entries.
PANTHER: PANTHER (Protein ANalysis THrough Evolutionary Relationships) is a classification system that classifies proteins by their functions, using evolutionary relationships to infer function. It provides information on protein families, subfamilies, and pathways.
InterPro: InterPro is an integrated resource for protein families, domains, and functional sites. It integrates data from several protein signature databases, such as Pfam, PRINTS, PROSITE, and SMART, to provide comprehensive functional annotations for proteins.
STRING: STRING is a database of known and predicted protein-protein interactions, providing information on both physical and functional interactions. It helps researchers understand the networks of proteins that underlie biological processes.

These databases, among others, are valuable resources for functional annotation, pathway analysis, and understanding the biological context of genes and proteins in various organisms.

Hands-on: Introduction to using annotation databases

To get started with using annotation databases, let’s focus on using the Gene Ontology (GO) database as an example. The GO database provides a standardized vocabulary to describe the function of genes and proteins. Here’s a basic hands-on guide to using GO for functional annotation:

Accessing the GO Database:
- Go to the Gene Ontology Consortium website: http://geneontology.org/.
- Navigate to the “AmiGO 2” tab, which provides access to the GO database for searching and browsing.
Searching for Gene Functions:
- Use the search bar to enter the name or identifier of a gene or protein you’re interested in.
- Explore the search results to find relevant information about the gene’s function, including its involvement in biological processes, cellular components, and molecular functions.
Viewing Gene Annotations:
- Click on a gene or protein name in the search results to view detailed annotations.
- Explore the annotations to understand the gene’s role in biological processes, its cellular localization, and its molecular function.
Browsing Ontology Terms:
- Use the “Browse” tab in AmiGO 2 to explore the GO hierarchy.
- Navigate through the biological process, cellular component, and molecular function hierarchies to discover related terms and their definitions.
Downloading Annotations:
- The GO Consortium provides various ways to download annotation data for further analysis.
- Explore the “Downloads” section of the GO website to find relevant files, such as the GO annotation file (GAF) containing functional annotations for genes and proteins.
Using Annotations in Analysis:
- Once you have downloaded GO annotations, you can use them in bioinformatics analysis.
- For example, you can use GO annotations to perform enrichment analysis to identify biological processes or pathways that are overrepresented in a set of genes of interest.

By following these steps, you can start exploring and using annotation databases like GO to annotate genes and proteins, understand their functions, and perform functional analysis in your research.
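
As a rough illustration of the download and analysis steps, the sketch below reads GAF-formatted lines into a gene-to-GO mapping using only the Python standard library. The column positions follow the GAF 2.x layout (gene symbol in column 3, qualifier in column 4, GO ID in column 5, aspect in column 9). The function names parse_gaf and make_line, the placeholder gene symbols, and the choice to skip NOT-qualified annotations are assumptions made for this example, not part of any official GO tooling.

```python
from collections import defaultdict


def parse_gaf(lines, aspect=None):
    """Map each gene symbol to the set of GO IDs annotated to it.

    `aspect` may be "P", "F" or "C" to keep a single ontology branch.
    """
    gene2go = defaultdict(set)
    for line in lines:
        # Header and comment lines in GAF files start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete annotation row
        symbol, qualifier, go_id, term_aspect = cols[2], cols[3], cols[4], cols[8]
        # Assumption for this sketch: ignore negative ("NOT") annotations.
        if "NOT" in qualifier.split("|"):
            continue
        if aspect is not None and term_aspect != aspect:
            continue
        gene2go[symbol].add(go_id)
    return dict(gene2go)


def make_line(symbol, qualifier, go_id, aspect):
    """Build a 17-column GAF-like row; unused columns hold placeholders."""
    cols = ["DB", "ID_" + symbol, symbol, qualifier, go_id, "REF", "IEA", "",
            aspect, "", "", "protein", "taxon:9606", "20240101", "SRC", "", ""]
    return "\t".join(cols)


# Illustrative rows only: GENE1 and GENE2 are placeholder symbols.
sample = [
    "!gaf-version: 2.2",
    make_line("GENE1", "involved_in", "GO:0006915", "P"),
    make_line("GENE1", "located_in", "GO:0005634", "C"),
    make_line("GENE2", "NOT|enables", "GO:0003677", "F"),
    make_line("GENE2", "involved_in", "GO:0006915", "P"),
]

gene2go = parse_gaf(sample)
bp_only = parse_gaf(sample, aspect="P")
for gene in sorted(gene2go):
    print(gene, sorted(gene2go[gene]))
```

With a real download, the same function could be given an open file handle instead of the sample list, since it only iterates over lines.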

Functional Annotation Tools and Software

Overview of popular annotation tools (DAVID, Enrichr, g:Profiler, etc.)

Popular annotation tools are invaluable for extracting biological insights from large-scale omics datasets. Here’s an overview of some widely used annotation tools:

DAVID (Database for Annotation, Visualization, and Integrated Discovery): DAVID provides a comprehensive set of functional annotation tools for understanding the biological meaning behind large lists of genes or proteins. It offers a range of functional annotation, clustering, and enrichment tools to help researchers identify biological themes within their datasets.
Enrichr: Enrichr is a web-based tool for gene set enrichment analysis and functional annotation. It allows users to input gene lists and obtain enrichment analysis results for a variety of gene sets, pathways, and functional annotations.
g:Profiler: g:Profiler is a toolset for functional enrichment analysis and conversions of gene lists. It provides access to a wide range of biological databases and allows users to analyze their data in the context of various functional categories, such as GO terms, pathways, and protein-protein interactions.
Metascape: Metascape is a web-based portal for gene annotation and analysis. It integrates functional enrichment, interactome analysis, gene annotation, and membership search into a single platform, making it a comprehensive tool for interpreting omics data.
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins): While primarily known for its protein-protein interaction network analysis capabilities, STRING also offers functional enrichment analysis tools. It allows users to explore the functional enrichments of their gene/protein lists in the context of known and predicted protein interactions.
GOATOOLS: GOATOOLS is a Python library for performing Gene Ontology (GO) enrichment analysis. It provides functions for conducting GO term enrichment analysis on gene sets and visualizing the results.
clusterProfiler: clusterProfiler is an R package for functional enrichment analysis of gene clusters. It provides a suite of functions for performing enrichment analysis using various annotation databases and visualizing the results.

These tools play a crucial role in interpreting omics data, helping researchers uncover biological insights, and generating hypotheses for further investigation.

Hands-on: Using functional annotation tools for gene set enrichment analysis

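For this hands-on exercise, let’s use the Enrichr tool for enrichment analysis of a gene list. Note that Enrichr performs over-representation analysis on an unranked list of genes; this is distinct from the rank-based GSEA method, which takes a ranked list of all measured genes as input. Enrichr is a user-friendly web-based tool that provides enrichment analysis for a variety of gene sets and functional annotations. Here’s a step-by-step guide: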

Accessing Enrichr:
- Go to the Enrichr website: https://maayanlab.cloud/Enrichr/.
Input Gene List:
- Click on the “Upload” tab.
- Either upload a file containing your gene list or paste the gene symbols directly into the text box.
Select Library:
- Choose a library for enrichment analysis. Enrichr provides a variety of libraries, including Gene Ontology (GO) terms, pathways, and other functional annotations.
- For example, you can select a “GO Biological Process” library (such as “GO Biological Process 2018”; newer releases may also be listed) for GO term enrichment analysis.
Run Analysis:
- Click on the “Submit” button to run the analysis.
- Enrichr will process your gene list and provide enrichment analysis results.
View Enrichment Results:
- The results will be displayed as a bar graph showing the top enriched terms or pathways.
- You can click on individual bars to view more details about the enrichment analysis, including p-values and adjusted p-values.
Explore Additional Analyses:
- Enrichr also provides other tools for exploring your gene list, such as overlap analysis with existing gene sets and network visualization of enriched terms.
Download Results:
- You can download the enrichment analysis results in various formats for further analysis or visualization.

By following these steps, you can use Enrichr to perform gene set enrichment analysis on your gene list and gain insights into the biological processes, pathways, and functional annotations associated with your genes.

Gene Ontology (GO) Analysis

Introduction to Gene Ontology (GO)

Gene Ontology (GO) is a widely used bioinformatics resource that provides a structured vocabulary to describe the functions of genes and gene products in a consistent and computable manner. The GO project aims to standardize the representation of gene and gene product attributes across species, facilitating the comparison of functional information between different organisms.

The GO vocabulary is organized into three main categories, or ontologies, which describe different aspects of gene function:

Biological Process (BP): Describes the biological goals accomplished by one or more ordered assemblies of molecular functions. Examples include “cellular process” and “metabolic process.”
Molecular Function (MF): Describes the elemental activities of gene products at the molecular level, such as “catalytic activity” or “binding.”
Cellular Component (CC): Describes the places in the cell where a gene product is active, such as “nucleus” or “mitochondrion.”

Each term in the GO vocabulary is associated with a unique identifier and a definition. Terms are organized as a directed acyclic graph rather than a strict tree: more specific terms are children of more general terms, and a term can have more than one parent. Because an annotation to a term also implies annotation to all of its ancestor terms, this structure lets users navigate the ontology at different levels of detail and group genes under broader functional categories.

GO annotations are used to associate genes and gene products with GO terms, indicating their functional properties. These annotations are typically based on experimental data, computational predictions, or literature curation. GO annotations are widely used in bioinformatics to analyze gene expression data, identify enriched functional categories in gene sets, and infer biological functions from genomic data.
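
To make the parent-child structure concrete, here is a minimal Python sketch that walks a small term graph upward and extends a gene's direct annotations with every ancestor term. The four-term graph, the gene name GENE1, and the function names ancestors and propagate are made up for illustration; a real analysis would load the relationships from the ontology file distributed by the GO Consortium.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(gene2terms, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    propagated = {}
    for gene, terms in gene2terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[gene] = full
    return propagated


# Hypothetical mini-ontology: one term has two parents, so this is a
# directed acyclic graph, not a tree.
parents = {
    "process": [],
    "metabolic process": ["process"],
    "cellular process": ["process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
}

direct = {"GENE1": {"cellular metabolic process"}}
propagated = propagate(direct, parents)
print(sorted(propagated["GENE1"]))
```

The `seen` set ensures that a term reached through two different parents is visited only once.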

Overall, GO provides a powerful framework for systematically describing gene function and is an essential resource for functional genomics research.

GO term enrichment analysis

GO term enrichment analysis is a statistical method used to determine whether a particular set of genes or proteins is enriched for specific GO terms compared to what would be expected by chance. This analysis helps researchers identify biological processes, molecular functions, and cellular components that are overrepresented in their gene or protein list, providing insights into the functional significance of the genes or proteins under study.

Here’s a general overview of how GO term enrichment analysis is performed:

Input Data: The analysis starts with a list of genes or proteins of interest, typically identified through experimental or computational methods.
Background Set: A background set is defined, representing all genes or proteins that could potentially be included in the analysis. This set is used to calculate the expected frequency of GO terms by chance.
Statistical Test: A statistical test, such as Fisher’s exact test or the hypergeometric test, is applied to determine whether the frequency of a GO term in the input gene list is significantly higher than expected by chance, given the background set.
Multiple Testing Correction: Since multiple GO terms are typically tested simultaneously, adjustments for multiple testing, such as the Bonferroni correction or the false discovery rate (FDR) correction, are applied to control for the increased risk of false positives.
Result Interpretation: Enriched GO terms are identified based on their adjusted p-values. Terms with an adjusted p-value below a specified threshold (e.g., 0.05) are considered statistically significant and are often further investigated to gain biological insights.
Visualization: The results of the enrichment analysis are often visualized using bar graphs or other graphical representations to highlight the most significantly enriched GO terms.

GO term enrichment analysis is a powerful tool for functional interpretation of high-throughput omics data, such as gene expression or proteomics data. It helps researchers identify the biological processes, molecular functions, and cellular components that are most relevant to their study, providing valuable insights into the underlying biology of the genes or proteins being analyzed.
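
The steps above can be sketched in a few lines of Python using only the standard library. This is a simplified illustration, not the implementation used by any particular tool: it computes a one-sided hypergeometric p-value per term and applies the Benjamini-Hochberg FDR correction. The gene names, term names, and the function names hypergeom_sf, benjamini_hochberg, and enrich are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(study, background, term2genes):
    """Test every term for over-representation in the study list."""
    background = set(background)
    study = set(study) & background
    rows = []
    for term, genes in term2genes.items():
        annotated = set(genes) & background
        overlap = study & annotated
        p = hypergeom_sf(len(overlap), len(background), len(annotated), len(study))
        rows.append((term, len(overlap), p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    results = [(t, k, p, q) for (t, k, p), q in zip(rows, adjusted)]
    return sorted(results, key=lambda r: r[2])


# Toy data: 20 background genes, two hypothetical terms, 4 study genes.
background = [f"g{i}" for i in range(1, 21)]
term2genes = {
    "term_A": [f"g{i}" for i in range(1, 6)],
    "term_B": [f"g{i}" for i in range(6, 16)],
}
study = ["g1", "g2", "g3", "g6"]

results = enrich(study, background, term2genes)
for term, k, p, q in results:
    print(f"{term}\toverlap={k}\tp={p:.4f}\tadjusted_p={q:.4f}")
```

Note that the background set matters: the same overlap yields a different p-value if the universe of testable genes changes.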

Hands-on: GO analysis using R/Bioconductor

To perform Gene Ontology (GO) analysis using R and Bioconductor, you can use the topGO package, which provides functions for conducting enrichment analysis of GO terms. Here’s a basic example of how to perform GO analysis using topGO:

Install and load the topGO package (if not already installed):

if (!requireNamespace("BiocManager", quietly = TRUE))
 install.packages("BiocManager")

BiocManager::install("topGO")
library(topGO)

Prepare your data:
- Create a list of genes of interest, where each gene is represented by its Entrez gene ID or another identifier that can be mapped to GO terms.
- Create a mapping of genes to GO terms. Gene-to-GO mappings come from organism annotation packages such as org.Hs.eg.db (the GO.db package describes the GO terms themselves), or from a custom mapping file read with readMappings.

# Example gene list (placeholder Entrez IDs) and gene-to-GO mapping
genes <- c("100","200","300","400","500")
# gene2GO.txt is assumed to hold one gene per line: gene ID, a tab, then comma-separated GO IDs
gene2GO <- readMappings(file = "gene2GO.txt")
# The gene universe is every gene with a mapping; genes of interest are marked 1, all others 0
geneUniverse <- names(gene2GO)
geneList <- factor(as.integer(geneUniverse %in% genes), levels = c(0, 1))
names(geneList) <- geneUniverse

Create a topGOdata object:
- Use the new function to create a topGOdata object, specifying the ontology (e.g., “BP” for biological process, “MF” for molecular function, “CC” for cellular component), the gene universe as a named factor (allGenes), and the gene-to-GO mapping (gene2GO).

godata <- new("topGOdata",
 ontology = "BP",
 allGenes = geneList,
 annot = annFUN.gene2GO,
 gene2GO = gene2GO,
 nodeSize = 10)

Run the enrichment analysis:
- Use the runTest function to perform the enrichment analysis, specifying the algorithm (e.g., “classic” or “elim”) and the test statistic (e.g., “fisher”).

result <- runTest(godata, algorithm = "classic", statistic = "fisher")

Retrieve and visualize the results:
- Use the GenTable function to retrieve the enriched GO terms and associated p-values.
- Use plotting functions (e.g., showSigOfNodes) to visualize the results.

# Named go_table rather than table to avoid masking base R's table() function
go_table <- GenTable(godata, classicFisher = result, topNodes = length(score(result)), numChar = 100)
print(go_table)

This is a basic example of how to perform GO analysis using R and Bioconductor. The topGO package offers more advanced features for customization and visualization, so be sure to refer to the package documentation for more information.

Pathway Enrichment Analysis

Introduction to pathway databases (KEGG, Reactome, etc.)

Pathway databases are resources that provide curated information about biological pathways, including their components (such as genes, proteins, and metabolites) and their interactions. These databases help researchers understand the complex networks of molecular interactions that underlie biological processes. Here are some popular pathway databases:

KEGG (Kyoto Encyclopedia of Genes and Genomes): KEGG is a comprehensive database that integrates genomic, chemical, and systemic functional information. It provides information on pathways, diseases, drugs, and organisms, making it a valuable resource for understanding biological systems at a molecular level.
Reactome: Reactome is a curated database of biological pathways, focusing on human biology. It provides detailed information on molecular events, such as signaling pathways, metabolic pathways, and regulatory processes, helping researchers understand the biological context of genes and proteins.
WikiPathways: WikiPathways is a community-curated database of biological pathways. It allows researchers to contribute and edit pathway information, making it a dynamic and collaborative resource for pathway analysis.
BioCyc: BioCyc is a collection of Pathway/Genome Databases (PGDBs) that provide curated information on metabolic pathways and other biological processes for a wide range of organisms.
PID (Pathway Interaction Database): PID is a database of curated pathways and interactions in human cells. It provides information on signaling pathways, protein-protein interactions, and post-translational modifications.
BioCarta: BioCarta is a collection of curated pathway maps covering a wide range of biological processes, including cell signaling, metabolism, and disease pathways.

These pathway databases play a crucial role in bioinformatics and systems biology, providing researchers with valuable resources for understanding the complex interactions that govern biological processes.

Pathway enrichment analysis

Pathway enrichment analysis is a method used in bioinformatics to identify biological pathways that are significantly enriched in a set of genes or proteins of interest. It helps researchers understand the underlying biological processes associated with a particular gene set. Here’s a general overview of how pathway enrichment analysis works:

Input Data: The analysis starts with a list of genes or proteins that are of interest, typically derived from experimental data such as gene expression studies, proteomics, or genome-wide association studies (GWAS).
Background Set: A background set of genes or proteins is defined, which represents all genes or proteins that could potentially be included in the analysis. This set is often derived from the entire genome or proteome.
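Pathway Database: A database of biological pathways, such as KEGG, Reactome, or WikiPathways, supplies the gene sets to test; each pathway is treated as the set of genes or proteins that participate in it. GO (Gene Ontology) terms can be tested in the same way, although GO is an ontology of biological processes, molecular functions, and cellular components rather than a pathway database.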
Statistical Analysis: Statistical tests, such as Fisher’s exact test or hypergeometric test, are used to determine whether the input gene set is significantly overrepresented in any of the pathways compared to the background set.
Multiple Testing Correction: Since multiple pathways are typically tested simultaneously, adjustments for multiple testing, such as the Bonferroni correction or false discovery rate (FDR) correction, are applied to control for false positives.
Results Interpretation: Pathways that are found to be significantly enriched are candidates for biological relevance to the input gene set. Researchers can then further investigate these pathways to gain insights into the underlying biological mechanisms.

Pathway enrichment analysis is commonly used in functional genomics and systems biology to interpret high-throughput data and gain a better understanding of complex biological systems.
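
Pathway gene sets are commonly distributed as GMT text files, with one pathway per line: a name, a description field, and then the member genes, all separated by tabs. The Python sketch below reads such lines and tabulates how many genes from an input list fall into each pathway, which is the overlap count that the statistical test then evaluates. The pathway names, their memberships, and the function names parse_gmt and overlap_table are placeholders invented for this example.

```python
def parse_gmt(lines):
    """Read GMT-style lines into a {pathway name: set of genes} mapping."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name, genes = fields[0], fields[2:]
        pathways[name] = {g.strip() for g in genes if g.strip()}
    return pathways


def overlap_table(gene_list, pathways):
    """List pathways hit by the gene list, largest overlap first."""
    query = {g.strip() for g in gene_list if g.strip()}
    rows = []
    for name, members in pathways.items():
        hits = sorted(query & members)
        if hits:
            rows.append((name, len(hits), len(members), hits))
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows


# Hypothetical library: the pathway names and memberships are illustrative.
gmt_lines = [
    "Pathway_X\texample description\tTP53\tMDM2\tCDKN1A",
    "Pathway_Y\t\tAKT1\tMTOR",
    "Pathway_Z\t\tEGFR\tKRAS",
]
pathways = parse_gmt(gmt_lines)
rows = overlap_table(["TP53", "CDKN1A", "AKT1", "BRCA1"], pathways)
for name, n_hits, size, hits in rows:
    print(f"{name}\t{n_hits}/{size}\t{','.join(hits)}")
```

Gene identifiers must use the same naming scheme in the input list and the library; otherwise the overlaps will silently come out empty.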

Hands-on: Pathway analysis using Enrichr or similar tools

To perform pathway analysis using Enrichr, you can follow these steps:

Go to the Enrichr website: Navigate to the Enrichr website at https://maayanlab.cloud/Enrichr/.
Input your gene list: Click on the “Upload” tab and upload a file containing your list of genes or paste them directly into the text box. You can also choose a preloaded gene set if you don’t have your own.
Select a library: Choose a gene set library for pathway analysis. Enrichr offers several libraries, such as KEGG, Reactome, and WikiPathways.
Run the analysis: Click on the “Submit” button to run the analysis. Enrichr will perform the pathway enrichment analysis and provide you with a list of enriched pathways along with statistical information.
Explore the results: Review the results to identify pathways that are significantly enriched in your gene set. You can click on the pathway names to view more details and visualize the enrichment results.
Download the results: You can download the results in various formats for further analysis and interpretation.

Enrichr is a powerful tool for pathway analysis and can provide valuable insights into the biological significance of your gene set.

Functional Annotation of Non-Coding RNAs

Functional annotation of microRNAs and long non-coding RNAs

Functional annotation of microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) involves identifying and characterizing their biological functions, targets, and roles in cellular processes. Here’s an overview of how you can perform functional annotation for miRNAs and lncRNAs:

Prediction of Targets: Use bioinformatics tools such as TargetScan, miRanda, or DIANA-microT to predict potential target genes of miRNAs. For lncRNAs, tools like LncBase Predicted v.2, LncTar, and StarBase can be used to predict interactions with miRNAs and other RNA molecules.
Functional Enrichment Analysis: Perform functional enrichment analysis of predicted target genes to identify biological processes, molecular functions, and pathways enriched with target genes. Tools like Enrichr, DAVID, and g:Profiler can be used for this purpose.
Experimental Validation: Validate the predicted interactions and functions of miRNAs and lncRNAs experimentally using techniques such as luciferase reporter assays, RNA immunoprecipitation (RIP), and RNA pull-down assays.
Network Analysis: Construct regulatory networks involving miRNAs, lncRNAs, and target genes to understand their interactions and regulatory roles in cellular processes. Tools like Cytoscape can be used for network visualization and analysis.
Functional Studies: Perform functional studies, such as knockdown or overexpression experiments, to elucidate the roles of miRNAs and lncRNAs in specific biological processes or diseases.
Integration of Data: Integrate data from various sources, including expression profiles, sequence features, and epigenetic modifications, to gain a comprehensive understanding of the functions of miRNAs and lncRNAs.

Functional annotation of miRNAs and lncRNAs is a dynamic field, and ongoing research is continually expanding our understanding of their roles in gene regulation and cellular processes.
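
As one possible way to organize predicted interactions for the network analysis and enrichment steps, the Python sketch below builds a miRNA-to-target mapping, finds genes targeted by more than one miRNA, and writes the edges in the simple three-column interaction format (source, interaction type, target) that Cytoscape can import. The miRNA names, gene names, the "targets" interaction label, and the function names build_network and to_sif are all hypothetical.

```python
from collections import defaultdict


def build_network(pairs):
    """Index predicted (miRNA, target gene) pairs in both directions."""
    mirna2targets = defaultdict(set)
    target2mirnas = defaultdict(set)
    for mirna, target in pairs:
        mirna2targets[mirna].add(target)
        target2mirnas[target].add(mirna)
    return dict(mirna2targets), dict(target2mirnas)


def to_sif(pairs, interaction="targets"):
    """Format unique edges as tab-separated source/interaction/target lines."""
    return [f"{m}\t{interaction}\t{t}" for m, t in sorted(set(pairs))]


# Placeholder predictions; real pairs would come from a target prediction tool.
pairs = [
    ("miR-A", "GENE1"),
    ("miR-A", "GENE2"),
    ("miR-B", "GENE2"),
    ("miR-B", "GENE3"),
]

mirna2targets, target2mirnas = build_network(pairs)
shared_targets = sorted(g for g, m in target2mirnas.items() if len(m) > 1)
sif_lines = to_sif(pairs)

print("Targets shared by several miRNAs:", shared_targets)
print("\n".join(sif_lines))
```

The pooled target genes (the keys of `target2mirnas`) are what would be passed on to an enrichment tool.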

Hands-on: Functional annotation of non-coding RNAs using databases like miRBase

To perform functional annotation of non-coding RNAs (ncRNAs) using databases like miRBase, which primarily focuses on microRNAs (miRNAs), you can follow these steps:

Access miRBase: Go to the miRBase website at http://www.mirbase.org/.
Search for ncRNA: Use the search bar on the miRBase homepage to search for the non-coding RNA of interest. If you are specifically looking for miRNAs, you can enter the miRNA name or ID.
Retrieve information: Once you have found the ncRNA of interest, click on its entry to view detailed information about the ncRNA, including its sequence, genomic location, and related publications.
Functional annotation: Explore the functional annotations provided for the ncRNA, which may include information about its targets, expression patterns, and known functions. Note that miRBase primarily focuses on miRNAs, so the functional annotations may be limited to miRNA-specific functions.
Download data: You can download the sequence and annotation data for the ncRNA from miRBase for further analysis and use in your research.
Explore related resources: miRBase provides links to related resources and databases that may contain additional information about the ncRNA and its functions. Explore these resources to gather comprehensive information about the ncRNA’s functional annotations.

Keep in mind that miRBase is primarily focused on miRNAs, so if you are interested in other types of ncRNAs, you may need to explore other databases and resources that specialize in those types of ncRNAs.
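
Sequence downloads of this kind are usually FASTA files. The Python sketch below parses FASTA records and filters them by species prefix; it assumes header lines of the form "name accession description", where the name starts with a three-letter species code such as hsa. The three records are included as examples only and should be checked against the database before use; the function name parse_mirna_fasta is hypothetical.

```python
def parse_mirna_fasta(lines):
    """Parse FASTA lines into a list of record dictionaries."""
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far, if any.
        if header is None:
            return
        parts = header.split()
        name = parts[0]
        records.append({
            "name": name,
            "accession": parts[1] if len(parts) > 1 else "",
            "species": name.split("-")[0],
            "sequence": "".join(chunks),
        })

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    flush()
    return records


# Example records (verify against the current database release).
fasta = """\
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-let-7a-5p MIMAT0000062 Homo sapiens let-7a-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-miR-21-5p MIMAT0000076 Homo sapiens miR-21-5p
UAGCUUAUCAGACUGAUGUUGA
""".splitlines()

records = parse_mirna_fasta(fasta)
human = [r for r in records if r["species"] == "hsa"]
for r in human:
    print(r["name"], r["accession"], len(r["sequence"]))
```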

Integration of Functional Annotation and Omics Data

Integrating functional annotation with transcriptomics, proteomics, and metabolomics data

Integrating functional annotation with transcriptomics, proteomics, and metabolomics data can provide a comprehensive view of the biological processes and pathways associated with non-coding RNAs (ncRNAs) and other molecules. Here’s how you can integrate these data types:

Transcriptomics Data: Analyze transcriptomics data to identify differentially expressed genes (DEGs) or ncRNAs under different conditions or treatments. Use functional annotation tools such as Enrichr, DAVID, or g:Profiler to perform functional enrichment analysis of DEGs or ncRNAs to identify enriched biological processes, molecular functions, and pathways.
Proteomics Data: Analyze proteomics data to identify differentially expressed proteins (DEPs) or protein-protein interactions (PPIs) associated with ncRNAs or other molecules of interest. Use tools like STRING or Cytoscape for PPI network analysis and functional annotation of proteins.
Metabolomics Data: Analyze metabolomics data to identify differentially regulated metabolites or metabolic pathways associated with ncRNAs or other molecules. Use pathway analysis tools like MetaboAnalyst or KEGG Mapper for functional annotation of metabolites and pathway enrichment analysis.
Integration: Integrate transcriptomics, proteomics, and metabolomics data using systems biology approaches to identify key regulatory networks and pathways associated with ncRNAs and other molecules. Use tools like OmicsNet or BioCyc for pathway and network integration analysis.
Visualization: Visualize the integrated data using network visualization tools like Cytoscape or pathway enrichment tools to gain insights into the interactions and relationships between ncRNAs, genes, proteins, and metabolites.

By integrating functional annotation with transcriptomics, proteomics, and metabolomics data, you can gain a more comprehensive understanding of the biological processes and pathways regulated by ncRNAs and other molecules, providing valuable insights into their roles in health and disease.

Hands-on: Integrative analysis of omics data using R/Bioconductor

To perform integrative analysis of omics data using R/Bioconductor, you can follow these general steps:

Load and preprocess data: Load your omics data (transcriptomics, proteomics, metabolomics) into R and preprocess it as needed (e.g., normalization, filtering).
Integrate data: Integrate omics data sets using appropriate methods, such as correlation analysis, co-expression analysis, or network-based integration. For example, you can use the merge function to combine different omics datasets based on common identifiers.
Perform statistical analysis: Perform statistical analysis to identify differentially expressed genes, proteins, or metabolites across conditions or groups of interest. Use methods such as limma, DESeq2, or edgeR for transcriptomics data, and similar packages for other omics data types.
Functional annotation: Use R packages such as clusterProfiler and pathview (both from Bioconductor) or enrichR (from CRAN) to perform functional annotation and enrichment analysis of differentially expressed genes, proteins, or metabolites. This step will help you understand the biological processes and pathways associated with your omics data.
Integrate annotation with data: Integrate the functional annotation results with your omics data to gain insights into the biological relevance of the identified features. You can visualize the integrated results using tools like ggplot2 or pheatmap.
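Network analysis: Perform network analysis to identify regulatory networks or interactions between different omics data types. Use the igraph package for network analysis within R, or export the network to Cytoscape (a standalone application that can be controlled from R through the RCy3 package) for visualization and analysis.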
Interpretation and visualization: Interpret the integrated results in the context of your research question and visualize the results using appropriate plots or graphs to communicate your findings effectively.

Here’s a basic example of how you might integrate transcriptomics and proteomics data:

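# Load required libraries
library("limma")
library("clusterProfiler")

# Load transcriptomics data; assumes a gene_id column (Entrez IDs) followed by one column per sample
transcriptomics_data <- read.table("transcriptomics_data.txt", header = TRUE)
# Preprocess transcriptomics data as needed (e.g., normalization, log2 transformation)

# Load proteomics data in the same layout, with the same samples in the same order
proteomics_data <- read.table("proteomics_data.txt", header = TRUE)
# Preprocess proteomics data as needed

# Helper: convert a data frame into a numeric matrix with gene IDs as row names
to_matrix <- function(df) {
  mat <- as.matrix(df[, setdiff(colnames(df), "gene_id")])
  rownames(mat) <- df$gene_id
  mat
}

# Design matrix; this example assumes 3 control and 3 treated samples, so adjust it to your experiment
group <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)

# Perform statistical analysis on transcriptomics data
transcriptomics_fit <- lmFit(to_matrix(transcriptomics_data), design)
transcriptomics_ebayes <- eBayes(transcriptomics_fit)
transcriptomics_results <- decideTests(transcriptomics_ebayes)

# Perform statistical analysis on proteomics data
proteomics_fit <- lmFit(to_matrix(proteomics_data), design)
proteomics_ebayes <- eBayes(proteomics_fit)
proteomics_results <- decideTests(proteomics_ebayes)

# Integrate transcriptomics and proteomics data on shared gene IDs
merged_data <- merge(transcriptomics_data, proteomics_data, by = "gene_id", suffixes = c(".rna", ".prot"))

# Keep shared genes that are significant for the group effect (column 2) in both data types
sig_rna <- rownames(transcriptomics_results)[transcriptomics_results[, 2] != 0]
sig_prot <- rownames(proteomics_results)[proteomics_results[, 2] != 0]
gene_list <- intersect(as.character(merged_data$gene_id), intersect(sig_rna, sig_prot))

# Perform KEGG enrichment analysis (expects Entrez gene IDs; "hsa" selects human)
enrichment_results <- enrichKEGG(gene = gene_list, organism = "hsa")

# Visualize the results
barplot(enrichment_results)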

This is a simplified example, and you may need to adapt it based on the specifics of your data and research question.
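
The correlation-based integration mentioned in the steps above can also be illustrated without any packages. The Python sketch below matches transcript-level and protein-level log2 fold changes by gene ID and computes their Pearson correlation. The gene names and fold-change values are invented for illustration, and the function names match_by_gene and pearson are hypothetical.

```python
from math import sqrt


def match_by_gene(rna, protein):
    """Return shared gene IDs plus their paired values from both datasets."""
    shared = sorted(set(rna) & set(protein))
    return shared, [rna[g] for g in shared], [protein[g] for g in shared]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Invented log2 fold changes keyed by gene ID.
rna_lfc = {"GENE1": 2.0, "GENE2": -1.0, "GENE3": 0.5, "GENE4": 1.5}
protein_lfc = {"GENE1": 1.2, "GENE2": -0.4, "GENE3": 0.1, "GENE5": 3.0}

shared, x, y = match_by_gene(rna_lfc, protein_lfc)
r = pearson(x, y)
print(f"{len(shared)} shared genes, Pearson r = {r:.3f}")
```

Genes measured in only one dataset (GENE4 and GENE5 here) drop out of the comparison, which is worth reporting alongside the coefficient.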

Visualization of Enrichment Analysis Results

Visualization techniques for enrichment analysis results

Visualization of enrichment analysis results is crucial for interpreting and presenting the biological significance of the findings. Here are some common visualization techniques used for enrichment analysis results:

Bar Charts: Bar charts are simple and effective for displaying enriched terms or pathways along with their corresponding p-values or enrichment scores. They provide a quick overview of the most significant terms.
Heatmaps: Heatmaps can be used to visualize the enrichment scores or p-values of multiple enriched terms across different samples or conditions. They help identify patterns and clusters of enriched terms.
Dot Plots: Dot plots are similar to bar charts but use dots instead of bars to represent enriched terms. They are useful when there are many terms to display and can be organized horizontally or vertically.
Volcano Plots: Volcano plots display an effect size on the x-axis (the log2 fold change for genes, or the log2 enrichment ratio for terms) and the -log10 p-value on the y-axis. Enriched terms are often highlighted to show their significance.
Networks: Network visualizations can be used to display relationships between enriched terms, genes, proteins, or metabolites. They help understand the interconnectedness of biological processes.
UpSet Plots: UpSet plots visualize the overlap and intersections between different sets of enriched terms or pathways. They are useful for comparing results from multiple enrichment analyses.
Enrichment Maps: Enrichment maps are a type of network visualization that shows relationships between enriched terms based on their overlap in gene sets. They help identify common themes or pathways.
Enrichment Landscapes: Enrichment landscapes visualize the enrichment scores of multiple terms across different samples or conditions. They provide a comprehensive view of the enrichment results.
Interactive Visualizations: Interactive visualizations, such as those created using tools like Plotly or Shiny, allow users to explore enrichment results dynamically by zooming, filtering, and hovering over data points.

These visualization techniques can help researchers gain insights into the biological significance of enrichment analysis results and communicate their findings effectively.

Hands-on: Visualization of enrichment analysis results using tools like Cytoscape

To visualize enrichment analysis results using Cytoscape, you can follow these steps:

Prepare your enrichment analysis results: Ensure that you have the results of your enrichment analysis in a format that can be imported into Cytoscape. This typically includes the enriched terms or pathways along with their corresponding p-values or enrichment scores.
Install Cytoscape: If you haven’t already, download and install Cytoscape from the official website (https://cytoscape.org/).
Load your data into Cytoscape: Open Cytoscape and import your enrichment analysis results. You can do this by going to File > Import > Table from File and selecting your file containing the enrichment results. Make sure to select the appropriate delimiter and column headers.
Create a network: If your enrichment analysis results include relationships between enriched terms (e.g., based on shared genes), you can create a network in Cytoscape to visualize these relationships. Go to File > Import > Network from Table to import your network data.
Customize the network: Once you have imported your network, you can customize it using Cytoscape’s layout options, styling features, and annotations to make it more visually appealing and informative.
Visualize enrichment results: Use Cytoscape’s built-in features or plugins to visualize the enrichment results on the network. For example, you can use the EnrichmentMap plugin to create an enrichment map that shows relationships between enriched terms based on their overlap in gene sets.
Export the visualization: Once you are satisfied with your visualization, you can export it from Cytoscape in a variety of formats for publication or further analysis.

Cytoscape offers a wide range of customization options and plugins for visualizing and analyzing enrichment analysis results, making it a powerful tool for exploring biological data.
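The network-building step above assumes that you already have a table of relationships between enriched terms. As a minimal sketch, such an edge table could be derived from the genes behind each enriched term, here using the Jaccard index as the similarity measure. This is an illustration only and not the algorithm used by EnrichmentMap; the term names, gene lists, cut-off and column names are assumptions.

```python
import csv
import sys
from itertools import combinations

# Hypothetical enrichment output: enriched term -> genes from the input list
enriched_terms = {
    "apoptosis": {"CASP3", "CASP7", "BAX", "BCL2"},
    "execution phase of apoptosis": {"CASP3", "CASP7", "DFFA"},
    "cell cycle": {"CDK1", "CCNB1", "BCL2"},
}

def jaccard(a, b):
    """Number of shared genes divided by the number of genes in either set."""
    return len(a & b) / len(a | b)

def build_edges(terms, min_similarity=0.25):
    """Return (term1, term2, similarity, shared genes) for sufficiently similar pairs."""
    edges = []
    for (name1, genes1), (name2, genes2) in combinations(sorted(terms.items()), 2):
        similarity = jaccard(genes1, genes2)
        if similarity >= min_similarity:
            shared = ";".join(sorted(genes1 & genes2))
            edges.append((name1, name2, round(similarity, 3), shared))
    return edges

def write_edge_table(edges, handle):
    """Write a tab-delimited edge table with a header row to an open file handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "jaccard", "shared_genes"])
    writer.writerows(edges)

edges = build_edges(enriched_terms)
write_edge_table(edges, sys.stdout)
```

Passing an open file instead of `sys.stdout` saves the table to disk. A file like this can then be loaded through the Network from Table import, where you indicate which columns hold the source and target nodes.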

Case Studies and Practical Applications

Case studies demonstrating the application of functional annotation and enrichment analysis in different research areas

Functional annotation and enrichment analysis are widely used in various research areas to gain insights into the biological significance of gene sets or molecular pathways. Here are some case studies demonstrating their application in different research areas:

Cancer Research:
- Study: Researchers conducted an enrichment analysis of gene expression data from cancer patients to identify pathways associated with drug resistance.
- Findings: They found that the PI3K/AKT/mTOR pathway was significantly enriched in drug-resistant tumors, suggesting that targeting this pathway could improve treatment outcomes.
Neuroscience:
- Study: Scientists performed functional annotation of genes differentially expressed in the brains of Alzheimer’s disease patients compared to healthy controls.
- Findings: The analysis revealed enrichment of genes involved in synaptic transmission and neuroinflammation, providing insights into the molecular mechanisms of Alzheimer’s disease.
Plant Biology:
- Study: Researchers conducted functional annotation of genes involved in drought stress response in plants.
- Findings: Enrichment analysis identified pathways related to osmotic stress, hormone signaling, and antioxidant defense, highlighting key processes involved in plant adaptation to drought.
Infectious Diseases:
- Study: Scientists performed functional annotation of genes differentially expressed in response to viral infection.
- Findings: The analysis revealed enrichment of immune response pathways, cytokine signaling, and viral replication pathways, providing insights into host-virus interactions.
Developmental Biology:
- Study: Researchers conducted functional annotation of genes involved in embryonic development.
- Findings: Enrichment analysis identified pathways related to cell differentiation, morphogenesis, and signaling cascades, shedding light on the molecular mechanisms of development.

These case studies demonstrate how functional annotation and enrichment analysis can be applied across different research areas to uncover biological insights and inform further experimental studies.

Student presentations: Analyzing and presenting results from a real-world dataset

Analyzing and presenting results from a real-world dataset can be a valuable learning experience for students, providing them with hands-on practice in data analysis and presentation skills. Here’s a guide for students on how to analyze and present results from a real-world dataset:

Dataset Selection: Choose a dataset that is relevant to your field of study and research question. Ensure that the dataset is well-documented and contains the necessary information for your analysis.
Data Preparation: Clean and preprocess the dataset as needed. This may include removing missing values, standardizing data formats, and creating new variables or features.
Exploratory Data Analysis (EDA): Perform EDA to gain insights into the dataset. This may include summary statistics, data visualization (e.g., histograms, scatter plots), and identifying patterns or trends in the data.
Hypothesis Testing: Formulate hypotheses based on your research question and dataset. Use statistical tests to assess the significance of your findings.
Data Analysis: Conduct the main analysis of your dataset, which may involve statistical modeling, machine learning, or other analytical techniques. Ensure that your analysis is appropriate for the type of data and research question.
Results Interpretation: Interpret the results of your analysis in the context of your research question. Discuss the implications of your findings and any limitations of your study.
Visualization: Create visualizations to present your results effectively. Use charts, graphs, and tables to illustrate key findings.
Presentation: Prepare a clear and concise presentation of your results. Use slides or posters to highlight key findings, methodology, and conclusions.
Q&A Session: Be prepared to answer questions from your audience about your analysis and findings.
Feedback and Reflection: Seek feedback from your peers and instructors on your presentation. Reflect on the strengths and weaknesses of your analysis and presentation skills.

By following these steps, students can effectively analyze and present results from a real-world dataset, gaining valuable experience in data analysis and communication.

Practical exercise and case studies

Functional annotation and enrichment analysis

Gene Ontology

The Gene Ontology (GO) is a collaboration between several databases. The purpose of GO is to provide a set of standardized terms and descriptions (agreed on and used by everyone in the field) for biological processes, molecular functions and cellular locations. A collection of such standardized terms and the relations that exist between them is called an “ontology”. GO has a hierarchical structure with IS-A and PART-OF relations; strictly speaking it is a directed acyclic graph rather than a tree, since a term can have more than one parent. You can access GO via dedicated tools (QuickGO, AmiGO…) or via other databases (UniProt…). It is used to interpret the results of high-throughput analysis (clusters, lists of coregulated genes…). By looking at the GO terms associated with the genes in a list, you can detect GO terms that are overrepresented in the list compared to the complete genome and infer the likely function of the genes in the list.

Searching GO using dedicated tools

The two main dedicated tools for accessing GO are QuickGO from EBI and AmiGO from GO itself.

Via QuickGO

Go to http://www.ebi.ac.uk/QuickGO/. Search for the GO biological process apoptosis. On the page containing the summary of the search results you see that the ontologies are continuously changing since a number of listed terms are obsolete.

Click the GO accession number of the first obsolete GO term.

To find out which term replaced the obsolete one, scroll down to the Replaced by section.

Which term replaced the obsolete term ?

Go to the GO record of execution phase of apoptosis.

Scroll down to the Child Terms section. These are all more specific terms that are linked to execution phase of apoptosis.

What are the relationships that exist between execution phase of apoptosis and its child terms ?

There are 3 main relationships in GO:

IS-A, e.g. a term is a more specific form of another term.

PART-OF, e.g. phosphatidylserine exposure on apoptotic cell surface is a part of the complete execution phase of apoptosis.

REGULATES, to indicate that one process regulates another, e.g. a process that controls the execution phase of apoptosis. This relationship is further subdivided into “positively regulates” and “negatively regulates”.
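To make these relationships more concrete, the sketch below stores a toy fragment of an ontology as a graph and walks upwards from a term to collect its ancestors. Only the PART-OF link of phosphatidylserine exposure on apoptotic cell surface is taken from the text above; the other links and the function name are illustrative assumptions and are not copied from GO.

```python
# Toy ontology fragment: child term -> list of (relation, parent term).
# Apart from the first PART-OF link, the links are illustrative assumptions.
PARENTS = {
    "phosphatidylserine exposure on apoptotic cell surface": [
        ("part_of", "execution phase of apoptosis"),
    ],
    "execution phase of apoptosis": [("part_of", "apoptotic process")],
    "apoptotic process": [("is_a", "programmed cell death")],
    "programmed cell death": [("is_a", "cell death")],
}

def ancestors(term, relations=("is_a", "part_of")):
    """Collect every term reachable by following the given relations upwards."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

term = "phosphatidylserine exposure on apoptotic cell surface"
# Following both relation types reaches all four more general terms
print(sorted(ancestors(term)))
# Following IS-A only finds nothing, because the first link is PART-OF
print(sorted(ancestors(term, relations=("is_a",))))
```

This upward walk is the reason why a protein annotated to a specific term also counts for the more general terms above it.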

On the top of the page you see that there are 4422 annotations (proteins) linked to apoptosis and its child terms. Click the link. GO provides evidence for linking proteins to terms and references to where it has found this evidence.

How many proteins are linked to execution phase of apoptosis and its descendant terms ?

Fortunately, you can filter the list of proteins.

Click the Taxon button

select Human and click Apply

You now see the number of human annotations and the number of human proteins linked to execution phase of apoptosis and its child terms. There should be around 130 proteins linked to execution phase of apoptosis.

How many human proteins are linked to execution phase of apoptosis and its descendant terms ?

Look at the first hit in the results table.

To see what it means click the code.

This takes you to the GO Evidence codes page. Search for IEA.

You will see that it stands for Inferred from Electronic Annotation (IEA), which means that the annotation was assigned automatically and has not been reviewed by a curator.

What does its evidence code mean ?

Click the Evidence button

Select all options containing the term manual assertion

Click Apply

Retrieve only manually asserted results ?

You now see the number of human annotations supported by manually asserted evidence. There should be around 90 proteins now.

Click the Export button. This opens a file where you can select the format you want to save in.

Download these final results ?
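Once you have exported the annotations, the same evidence filter can be repeated offline. The sketch below assumes a tab-delimited export with a column holding the evidence code; the column names, identifiers and rows are made up for illustration, so check the header of your own file.

```python
import csv
import io

# Hypothetical excerpt of an exported annotation table (tab-delimited).
# Column names, identifiers and evidence codes are placeholders.
EXPORT = """\
gene_product\tsymbol\tgo_term\tevidence_code
PROT_A\tGENE_A\texecution phase of apoptosis\tIDA
PROT_B\tGENE_B\texecution phase of apoptosis\tIEA
PROT_C\tGENE_C\texecution phase of apoptosis\tTAS
"""

# IEA marks annotations that were assigned automatically
AUTOMATIC_CODES = {"IEA"}

def manually_asserted(rows):
    """Keep only the annotations whose evidence code is not automatic."""
    return [row for row in rows if row["evidence_code"] not in AUTOMATIC_CODES]

rows = list(csv.DictReader(io.StringIO(EXPORT), delimiter="\t"))
kept = manually_asserted(rows)
print([row["symbol"] for row in kept])
```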

Via AmiGO

Go to AmiGO (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi). Here you can:

enter a gene or protein name and look for the corresponding GO terms

enter a GO term and search for the proteins that are linked to that term

We will first search for the GO terms of a protein.

In the search box type PCL5 and click Search.

Search for cyclin PCL5

Genes and gene products links to a list of GO terms and protein descriptions that contain the search term

Annotations links to a list of GO annotations of these proteins

Click the Annotations button.

Find the GO annotations of cyclin PCL5

You can filter the results.

Expand the Organism select box and click the + sign after Saccharomyces cerevisiae.

Find the GO annotations of yeast cyclin PCL5

You see the 9 term associations of yeast PCL5 in the center window.

Now we’ll do the reverse search: we will search for proteins that are involved in a GO term.

Go to AmiGO (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi).

In the search box type execution phase of apoptosis and click Search

Click the Annotations button to retrieve the proteins that are associated with execution phase of apoptosis.

Retrieve all gene products related to execution phase of apoptosis

Searching GO via other databases

Via UniProt

We will search for all Uniprot entries with electron transfer activity.

Go to AmiGO (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi) and fetch the GO accession number of electron transfer activity:

Type electron transfer activity in the search box. The GO accession number of electron transfer activity (called electron carrier activity in older GO releases and on some database pages) is GO:0009055.

Get the GO accession number of electron transfer activity

Now go to UniProt (http://www.uniprot.org/) and find all related proteins.

Type GO:0009055 in the search text box

Click Search

Search for all Uniprot entries with electron transfer activity.

This returns an enormous amount of proteins with electron carrier activity. This high number is because:

You also see the results from the highly redundant TrEMBL database

There are many electron carrier proteins

You get results from all species

The keyword is recognized by Uniprot as a GO term and not only this term but also all child terms are included in the search

Click the UniProt ID in the search results to go to the Uniprot record of this protein.

Go to the Function section

Go to Gene Ontology (GO) – Molecular_Function

Click electron carrier activity

You are redirected to the QuickGO record for this function.

Go to the UniProt record of the first hit and check if it has electron carrier activity

Fortunately, there are many ways to filter the UniProt results. The first way is to distinguish the manually reviewed, high-quality entries from Swiss-Prot from the entries obtained by automatically translating all coding sequences in the EMBL nucleotide database (TrEMBL, translated EMBL).

In the Filter section you can choose to only view the Reviewed proteins which are obtained from SwissProt.

Search all high quality Uniprot-SwissProt entries with electron carrier activity

In the Filter section you can also choose to only view the Human proteins. Note that the Reviewed filter is still active.

This will return around 100 proteins.

Search for all human high quality Uniprot-SwissProt entries with electron carrier activity

In the previous exercise you have seen that Uniprot records contain GO annotation, which means that you can find the GO terms of proteins via Uniprot.

Via InterPro

There are also cross references between the InterPro database of protein domains and GO: Interpro records contain GO annotation.

Go to http://www.ebi.ac.uk/interpro/

Type IPR002449 in the search text box and click Search

Scroll down to the GO terms section

The domain is involved in transport and can bind to retinoid.

Get the terms assigned to the retinol-binding domain (IPR002449).

Via Ensembl

Go to Ensembl (http://www.ensembl.org). Search the human F9 (Coagulation factor IX Precursor) gene and go to the gene page (see exercises on Ensembl).

On the gene page, go to the left menu and click GO: Biological process in the Ontologies section. This opens the GO annotation on the gene page. When you scroll down you see that the protein is involved in proteolysis and blood coagulation.

What are the biological processes that the protein encoded by this gene is involved in ?

Click GO: Molecular function in the Ontologies section. You see that the protein can bind other proteins and calcium ions and has endopeptidase activity.

What are the molecular functions of the protein encoded by this gene ?

Pathway information

KEGG

KEGG (http://www.genome.jp/kegg/) is a set of 16 linked databases. The basic building blocks of KEGG are proteins and chemical substances. These building blocks are combined into modules (e.g. protein complexes) and pathways. Components, modules and pathways are linked to diseases and to the drugs used to treat them.

Look at a map of the prion disease pathway from KEGG.

On the KEGG home page (http://www.genome.jp/kegg/), you see a simple text search box: type prion in the box and click Search.

You are redirected to a results overview page (http://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max_hit=1000&dbkey=kegg&keywords=prion) showing the results of your search in each of KEGG’s 16 databases. Click on the result of the KEGG Pathways database.

You are redirected to the KEGG Pathway record on the prion diseases pathway (http://www.genome.jp/dbget-bin/www_bget?map05020).

In the Pathway map section you find a graphical representation of the pathway. Click on the map to see an enlarged version.

Go back to the KEGG Pathway record and click “KEGG COMPOUND (3)” in the right menu.

You are redirected to a summary page showing the three links in the Compound database.

What are the names of the chemical compounds that are related to the prion disease pathway according to KEGG ?

STRING

STRING is a database containing known and predicted protein-protein interactions. It visualizes interaction networks clearly and provides the evidence behind predicted interactions.

Go to the STRING website (http://string-db.org/).

Click the Search link

In the Protein Name box type the name of the protein

In the Organism box type the name of the organism

Click the SEARCH button

How to find the interaction network of a protein ?

At the top of the results page, the interaction network is visualized. The network nodes are proteins. The edges represent the predicted functional associations. The color of the edges reflects the evidence:

Red line – fusion evidence

Green line – neighborhood evidence

Blue line – cooccurrence evidence

Purple line – experimental evidence

Yellow line – textmining evidence

Light blue line – database evidence

Black line – coexpression evidence

When you scroll down you see the evidence table. You can click the dots in the table to get more details.

Initially, only the top 10 interactors will be shown. You can increase this via the Settings. Confidence scores can be interpreted as follows:

low confidence – 0.15 (or better)

medium confidence – 0.4

high confidence – 0.7

highest confidence – 0.9
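If you download an interaction list, the confidence classes listed above can be applied programmatically. The sketch below uses the four lower bounds from the list; the protein names, the scores and the function names are hypothetical, and the scores are assumed to be on a 0 to 1 scale.

```python
# Lower bounds of the confidence classes listed above, highest first
CONFIDENCE_CLASSES = [
    (0.9, "highest"),
    (0.7, "high"),
    (0.4, "medium"),
    (0.15, "low"),
]

def confidence_class(score):
    """Map a score between 0 and 1 to its confidence class."""
    for lower_bound, label in CONFIDENCE_CLASSES:
        if score >= lower_bound:
            return label
    return "below low confidence"

def filter_interactions(rows, minimum=0.7):
    """Keep the interactions whose score reaches the chosen cut-off."""
    return [row for row in rows if row[2] >= minimum]

# Hypothetical interactions: (protein A, protein B, score)
interactions = [("P1", "P2", 0.95), ("P1", "P3", 0.62), ("P2", "P4", 0.10)]

for protein_a, protein_b, score in interactions:
    print(protein_a, protein_b, score, confidence_class(score))

high_confidence = filter_interactions(interactions)
```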

There are options to set how many interactors are shown: the 1st shell setting controls the number of proteins that interact directly with your input, and the 2nd shell setting controls the number of proteins that interact indirectly, via a protein in the first shell.

PSICQUIC

We want to know the interaction partners of human BRCA2. Searching for BRCA2 as such would result in an overview of interactions for all BRCA2, not only from human but also from mouse, rat… To make the search specific we first go to UniProt and fetch the UniProt ID of human BRCA2.

The best place to search is PSICQUIC, since it gives access to a multitude of interaction databases. It is not really a database itself: it provides on-the-fly access to the interaction databases via a web service, so you always get the most up-to-date results. The interaction databases install the web service and thereby make their content accessible via PSICQUIC. Most PSICQUIC providers use UniProt IDs to describe their proteins, which makes the results from different databases easy to combine. This is another reason to use UniProt IDs.

Search for BRCA2 in UniProt: the identifier of the human copy is P51587.

What is the UniProt identifier of human BRCA2?

Go to PSICQUIC (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml). You can see the list of databases you will search and their current status (online or offline).

As you can see in the help file (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/help.xhtml) PSICQUIC offers the opportunity to search in specific fields of the interaction database records.

Search for id: P51587 in PSICQUIC.

Search for records in all online databases with BRCA2 in one of the two identifier fields.

This returns many interactions but since you search many databases at the same time many of them will be identical. This is why PSICQUIC can cluster the results, removing the redundancy. Details on the clustering can be found on the help page (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/help.xhtml).

On the bottom of the results page click the Cluster this query button. Clustering will take a few moments, once it is done you can click the view link.

View the unique interactions.
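The idea behind removing this redundancy can be sketched in a few lines: interactions reported by several databases, or with the two partners in reverse order, are grouped into one entry. This is a simplified illustration and not PSICQUIC's actual clustering procedure. P51587 is the identifier of human BRCA2 from the text; the partner names and database names are placeholders.

```python
from collections import defaultdict

# Hypothetical hits: (interactor A, interactor B, source database)
hits = [
    ("P51587", "PARTNER_1", "db_one"),
    ("PARTNER_1", "P51587", "db_two"),   # same pair, reversed order
    ("P51587", "PARTNER_2", "db_one"),
    ("P51587", "PARTNER_1", "db_one"),   # exact duplicate
]

def cluster_interactions(records):
    """Group records by unordered interactor pair and collect their sources."""
    clusters = defaultdict(set)
    for a, b, source in records:
        pair = tuple(sorted((a, b)))   # A-B and B-A become the same key
        clusters[pair].add(source)
    return {pair: sorted(sources) for pair, sources in clusters.items()}

unique = cluster_interactions(hits)
for pair, sources in sorted(unique.items()):
    print(pair, sources)
```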

You now see a list of interactions in which human BRCA2 participates: you can download this list and visualize the interaction network in Cytoscape.

What’s the biology behind a list of genes ?

Omics experiments typically generate lists of hundreds of interesting genes:

up- or downregulated genes identified in an RNA-Seq experiment

somatically mutated genes in a tumor identified by exome sequencing

proteins that interact with a bait identified in a proteomics experiment

Over-representation analysis

Since it’s impossible to evaluate each gene individually, the most meaningful approach is to see what functional annotations the genes in the list have in common e.g. are many of them involved in the same pathway ?

Functional characterization of a gene list involves the following steps:

Add functional annotations to the genes in the list
Define a background: typically the full set of all genes that were detected in the experiment
Perform a statistical test to identify enriched functions, diseases, pathways

Enriched means over-represented, occurring more frequently in the list than expected by chance based on the background data.
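A common choice for the statistical test in the last step is the hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below is a minimal illustration with made-up numbers and is not the exact implementation of any of the tools discussed here; the function name and its parameter names are assumptions.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes
    drawn from a background of N genes, K of which carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical numbers: 20,000 background genes, 100 of them in the pathway,
# 200 genes in the list, 8 of which are in the pathway
p = hypergeom_pvalue(k=8, n=200, K=100, N=20000)
expected = 200 * 100 / 20000   # number of pathway genes expected by chance
print(f"expected by chance: {expected}, observed: 8, p-value: {p:.2e}")
```

Because the background size N enters the calculation directly, the choice of background changes the p-values, which is why defining it carefully matters.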

It is recommended to characterize up- and downregulated genes separately.

Important: thousands of pathways are tested for enrichment, which could lead to false positives. Multiple testing correction is used to adjust the p-values from the individual enrichment tests and so reduce the chance of false positives.
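One widely used correction is the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR). The sketch below shows the calculation on four made-up p-values; it is meant as an illustration, and the tools discussed in this section may offer other correction methods as well.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

raw = [0.001, 0.04, 0.03, 0.20]   # hypothetical p-values of four pathways
adjusted = benjamini_hochberg(raw)
for p_raw, p_adj in zip(raw, adjusted):
    print(f"raw {p_raw:.3f} -> adjusted {p_adj:.3f}")
```

In this example three raw p-values are below 0.05, but only one adjusted p-value is, which shows how the correction removes borderline results.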

ToppGene: a very up-to-date tool for functional enrichment analysis

Yays for ToppGene

ToppGene (https://toppgene.cchmc.org/) is the most up-to-date portal for gene list functional enrichment. See this overview (https://toppgene.cchmc.org/navigation/database.jsp) of their resources of functional annotations and their last update date.

The ToppFun tool returns enriched terms from GO, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains, transcription factor binding sites, miRNA-target genes, disease-gene associations, drug-gene interactions, and gene expression sets, compiled from various data sources…

It supports gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from human. However, since most gene symbols are shared between human, mouse and rat orthologs (apart from capitalization), the tool can often also be used for mouse and rat gene lists.

Nays for ToppGene

In ToppGene the background is by default the full set of all annotated genes in the genome. As a result the analysis will not control for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that you are studying) and you will see enriched pathways… that are just related to the tissue or cell type you work in and not to the conditions you study.

How to do functional enrichment analysis with ToppFun ?

On the ToppGene (https://toppgene.cchmc.org/) page click the first link ToppFun (https://toppgene.cchmc.org/enrichment.jsp): Transcriptome, ontology, phenotype, proteome…

Enter gene symbols or Ensembl IDs in the box Training Gene Set

Click Submit Query

If the gene list contains non-approved symbols or duplicates, they are listed under Genes Not found.

In the section Calculations select the functional annotation types you want to test (all in our case) and select the multiple correction method (default FDR is ok) and the significance cut-off level (default 0.05 is ok)

Click Start

Input Parameters summarizes the input parameters of the search. Click the Show Detail (red) link to see them.

Training results contains the enrichment analysis results. Download all (blue) will download the analysis results as a text file.

Click the Display chart (green) link to visualize the results

If you want to see which genes from your list belong to a certain annotation click the number in the Genes from Input column

DAVID: most outdated but allows to set custom background

Yays for DAVID

In DAVID (https://david.ncifcrf.gov/home.jsp) the background can be specified by the user. As a result the user can submit the genes that show some evidence of expression in her experiment thereby controlling for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that was studied). So you will not see enriched pathways… that are just related to the tissue or cell type you work in but only those that were affected by the conditions you study.

It supports various IDs: gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from an extensive list of organisms (they claim to support 65,000 species).

Nays for DAVID

DAVID has long been the least up-to-date portal for gene list functional enrichment. See this overview (https://david.ncifcrf.gov/content.jsp?file=release.html) to see when the last update was performed.

How to do enrichment analysis in DAVID

We will use the Ensembl IDs of the upregulated genes

As a background we will use the Ensembl IDs of the genes that showed at least minimal expression in our experiment, obtained by removing genes with fewer than 10 counts over all samples.
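A background list of this kind can be produced with a simple filter on the count table. The sketch below reads the threshold as the sum of the counts over all samples, which is one possible interpretation; the gene IDs, the counts and the function name are hypothetical.

```python
# Hypothetical count table: gene ID -> raw read counts per sample
counts = {
    "GENE_A": [120, 98, 143, 110],
    "GENE_B": [0, 1, 0, 2],
    "GENE_C": [3, 4, 2, 5],
    "GENE_D": [0, 0, 0, 0],
}

def expressed_genes(count_table, min_total=10):
    """Keep the genes whose counts summed over all samples reach min_total."""
    return sorted(
        gene for gene, values in count_table.items() if sum(values) >= min_total
    )

background = expressed_genes(counts)
print(background)   # one gene ID per line in a text file is what the tools expect
```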

How to do functional enrichment analysis with DAVID?

On the DAVID (https://david.ncifcrf.gov/home.jsp) home page click the first link Functional annotation (https://david.ncifcrf.gov/summary.jsp)

Upload the file with the Ensembl IDs by clicking the Choose file button

Specify that you are uploading Ensembl gene IDs

Specify that this is a Gene list you’re uploading

Submit list

Do the same thing for the background list but this time specify it’s a background list

If DAVID complains that you are either not sure which identifier type your list contains, or that less than 80% of your list has mapped to your chosen identifier type, ignore the warning and click Continue to use the IDs that DAVID could map

Select to use the Functional annotation tool

You will be redirected to the results page

Expand Gene_Ontology on the results page

For each root ontology of GO there are multiple categories to choose from:

1 uses the GO terms at the root of the ontology tree (general terms like metabolism, development and localization)

5 includes the terms at the ends of the branches (specific terms like striated muscle development). The specific terms are much more informative than the general terms. Since proteins are automatically associated to all parent (more general) terms in GO the percentage of associated proteins is expected to become smaller for higher (more specific) levels

direct results are filtered to remove redundant GO entries, the most specific is retained

all are the unfiltered results from the complete GO tree (specific and general terms combined)

The percentage and the number define how many genes of the uploaded list could be linked to annotation on this level of GO. Click Chart to view the enriched terms.

DAVID not only uses GO but also pathway annotation, annotation from Swiss-Prot, protein domains and motifs… You can get a summary of all the results by clicking the Functional annotation clustering button at the bottom of the page.

Results are now combined into groups of related terms (regardless of the ontology they come from).

If you want to view the genes from your list that are associated with one of the enriched annotations, simply click the blue bar.

WebGestalt: all organisms but one resource at a time

Yays for WebGestalt

In WebGestalt (http://www.webgestalt.org/) the background can be specified by the user. As a result the user can submit the genes that show some evidence of expression in her experiment thereby controlling for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that was studied). So you will not see enriched pathways… that are just related to the tissue or cell type you work in but only those that were affected by the conditions you study.

This tool largely overlaps in data sources with DAVID but updates them more regularly; the last update was done in 2019.

It supports various IDs: gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from 12 different model organisms. For other organisms it allows you to upload your own functional annotation database (see section 3.1 of the manual (http://www.webgestalt.org/WebGestalt_2019_Manual.pdf) of this tool).

How to do enrichment analysis in WebGestalt

We will use the upregulated genes

As a background we will use the Ensembl IDs of the genes that showed at least minimal expression (http://data.bits.vib.be/pub/trainingen/summer/EnsemblTissue.txt) in our experiment, obtained by removing genes with fewer than 10 counts over all samples.

How to calculate enrichment of KEGG pathways in a list of genes?

In the Organism box select the correct organism

In the Method box select Over-Representation Analysis

In the Functional Database boxes select pathway and KEGG

In the Gene ID type box select the correct ID type

Upload the list of IDs

In the Reference Gene List section do the same thing for the background list in the Upload User Reference Set File and Select ID type box

Click the Submit button

The analysis will take a few minutes.

The Enrichment results can be visualized as a table, a bar chart or a Volcano plot. Dark blue bars are considered significantly enriched.

Clicking a bar shows the details on the bottom half of the page:

FDR is the corrected p-value (red)

Mapped input represents your gene list (green)

gene set is the total group of genes in the background with this annotation (green)

overlap is the number of genes from your list with this annotation. They are listed in the table (green)

Repeat the enrichment analysis on Wiki pathways. Many more tables can be generated in WebGestalt and you should choose the type of enrichment that fits your experimental needs. Data can be saved back to disk for further use.

g:Profiler: many organisms but limited resources

Yays for g:Profiler

g:Profiler (https://biit.cs.ut.ee/gprofiler/) supports a long list of organisms (https://biit.cs.ut.ee/gprofiler/page/organism-list) and allows you to upload your own background file.

It is very regularly updated (https://biit.cs.ut.ee/gprofiler/page/news).

Nays for g:Profiler

It has fewer resources than the other tools since it retrieves its functional annotations (GO terms, pathways, networks, regulatory motifs, and disease phenotypes) from Ensembl.

How to do enrichment analysis in g:Profiler

We will use the upregulated genes of our RNA-Seq training for enrichment analysis.

How to calculate enrichment in a list of genes?

For Enrichment analysis you need to use the g:GOSt tool (green).

Upload query: a file with gene IDs (in our case Ensembl IDs, one per line) (red)

Select the Organism you need (blue)

Expand Advanced options and select Custom in the Statistical domain scope (purple)

Upload the background list (purple)

Click the Run query button

This tool produces visually attractive results. Every dot in the graph represents a functional annotation. Hover your mouse over a dot to show details like the name of the annotation and the corrected p-value.

Also the detailed results are very visual.

Gene set enrichment analysis

Some omics experiments generate a ranked list of genes:

genes ranked by differential expression score from an RNA-Seq experiment

genes ranked by sensitivity in a genome-wide CRISPR screen

mutated genes ranked by a score from a cancer driver prediction method

…

To analyze these lists, the following steps are taken:

The genes are divided into groups based on functional annotation (gene sets)
For every group enrichment of high or low scores is calculated

Groups of related genes are called gene sets: a pathway gene set includes all genes in a pathway.

This is why this type of analysis is called GSEA, Gene Set Enrichment Analysis. It assumes a whole ranked list (after filtering genes with very low counts) as input.
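The second step, calculating enrichment of high or low scores for one gene set, can be sketched as a running sum over the ranked list. This is a simplified illustration in the spirit of the weighted Kolmogorov-Smirnov-like statistic described for GSEA; it leaves out the permutation test and the normalization that real tools apply, and the function name and toy data are assumptions made for this example.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes  gene IDs sorted from highest to lowest score
    scores        ranking scores in the same order
    gene_set      set of gene IDs that share an annotation
    weight        exponent applied to |score| for genes in the set
    """
    hits = [abs(s) ** weight if g in gene_set else 0.0
            for g, s in zip(ranked_genes, scores)]
    hit_total = sum(hits)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for g, h in zip(ranked_genes, hits):
        if g in gene_set:
            running += h / hit_total      # step up at a gene in the set
        else:
            running -= 1.0 / n_miss       # step down at any other gene
        if abs(running) > abs(best):
            best = running                # keep the largest deviation from zero
    return best


# Toy ranked list: g1 is the most upregulated, g6 the most downregulated.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, scores, {"g1", "g2"})
es_bottom = enrichment_score(genes, scores, {"g5", "g6"})
print(es_top, es_bottom)
```

A gene set concentrated at the top of the list gets a positive score, one concentrated at the bottom a negative score.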

GSEA

GSEA is most often done in R or via software that you install on your computer like GSEA (http://software.broadinstitute.org/gsea/) from the Broad Institute.

GSEA is recommended when ranks are available for most of the genes in the genome (e.g. RNA-Seq data). It is not suitable when only a small portion of genes have ranks available (e.g. an experiment that identifies mutated cancer genes).

You have to install the tool on your computer (https://software.broadinstitute.org/gsea/index.jsp). An icon will appear on your desktop.

Input files

The format of the input file is very important. It should be a tab-delimited text file where:

column 1 should contain gene IDs

column 2 should contain descriptions but may be NAs

next columns should contain normalized counts (one column/sample).

Columns must have headers:

NAME for column 1

Description for column 2

Sample names for the next columns

The first line of the file should be: #1.2

The second line should be: number_of_genes tab number_of_samples

Save the file with the .gct extension.
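The layout described above can be produced with a few lines of Python. This is a minimal sketch assuming the normalized counts are already available as lists; the function name, gene IDs and sample names are hypothetical.

```python
def format_gct(gene_ids, descriptions, sample_names, values):
    """Build the text of a GCT v1.2 file from a table of normalized counts."""
    lines = ["#1.2", f"{len(gene_ids)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description"] + list(sample_names)))
    for gene, desc, row in zip(gene_ids, descriptions, values):
        if len(row) != len(sample_names):
            raise ValueError(f"{gene}: expected {len(sample_names)} values")
        lines.append("\t".join([gene, desc] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


# Hypothetical toy table: two genes, three samples.
gct_text = format_gct(
    gene_ids=["ENSG00000000002", "ENSG00000000003"],
    descriptions=["NA", "NA"],
    sample_names=["S1", "S2", "S3"],
    values=[[5.1, 4.0, 3.3], [120.0, 98.5, 143.2]],
)
print(gct_text)
```

Writing gct_text to a file whose name ends in .gct gives a file with the structure listed above.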

Apart from these data you also need a .cls file with the metadata (grouping info of the samples). This is a space delimited text file:

line 1: number_of_samples space number_of_groups space 1

line 2: # space class0_name space class1_name

line 3: for every sample 0 or 1 separated by spaces
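The three lines of the .cls file can be generated from the group name of each sample. A minimal sketch, assuming the groups are listed in the same order as the sample columns of the .gct file; the function name and group names are hypothetical.

```python
def format_cls(sample_groups):
    """Build the text of a categorical CLS file.

    sample_groups lists the group name of every sample, in the same order
    as the sample columns of the .gct file.
    """
    class_names = []
    for name in sample_groups:
        if name not in class_names:
            class_names.append(name)  # classes are numbered by first appearance
    labels = [str(class_names.index(name)) for name in sample_groups]
    header = f"{len(sample_groups)} {len(class_names)} 1"
    return "\n".join([header, "# " + " ".join(class_names), " ".join(labels)]) + "\n"


cls_text = format_cls(["control", "control", "treated", "treated"])
print(cls_text)
```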

Analysis

Originally GSEA was created to analyze microarray results but you can use it for analyzing RNA-Seq data, albeit with some tweaking of the parameter settings.

We will use the filtered genes of our RNASeq training for gene set enrichment analysis. You also need the corresponding .cls file.

How to perform GSEA on a full list of genes with normalized counts?

Load the data into GSEA. Load both the .gct and the .cls file.

Run GSEA: fill in the parameter settings. Click the question mark (red) on the bottom of the page to view descriptions of these parameters.

Use the GO: all (green) gene set

Use 10 permutations (green). If all goes well, repeat the analysis with 1000 permutations.

Select Collapse (blue): this is necessary for RNA-Seq data to associate the gene IDs of your list to the probes of the chip platform

Use gene set permutations (blue). Broad advises phenotype permutations (group labels will be shuffled to create random data to compare with) but they will only work when you have at least 7 samples per group.

Choose Human_ENSEMBL_Gene_ID_MSigDB.vX.chip as Chip platform (blue). Although you didn’t actually run chips (microarrays), GSEA needs this file to map the Ensembl Gene IDs in the data file to functional annotations.

Click the Run button at the bottom of the page.

In the left lower corner of the user interface there’s a section called GSEA reports. It shows the status of analyses run in this session, including the currently running analysis:

Click the green text to display the results in a browser.

g:Profiler

If you only have scores for a subset of the genome you should analyze the data using g:Profiler (https://biit.cs.ut.ee/gprofiler/) with the Ordered query option.

Your list should consist of gene IDs ordered according to decreasing importance (in this case increasing corrected p-value for differential expression).

g:Profiler performs enrichment analysis with increasingly larger numbers of genes starting from the top of the list. This procedure identifies functional annotations that associate to the most dramatic changes, as well as broader terms that characterize the gene set as a whole.
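The idea of testing increasingly larger parts of the list can be sketched as follows. This is only an illustration of the principle with a plain hypergeometric test; it is not g:Profiler's implementation and it applies no multiple-testing correction. The function names and toy data are assumptions, and every listed gene is assumed to be part of the background.

```python
from math import comb


def hypergeom_tail(overlap, list_size, set_size, background_size):
    """P(X >= overlap) for a hypergeometric draw of list_size genes."""
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, min(list_size, set_size) + 1)
    )
    return tail / total


def best_prefix_enrichment(ordered_genes, gene_set, background_size):
    """Test every top-k prefix of an ordered list; return (best p-value, k)."""
    set_size = len(gene_set)
    best_p, best_k, overlap = 1.0, 0, 0
    for k, gene in enumerate(ordered_genes, start=1):
        if gene in gene_set:
            overlap += 1
        p = hypergeom_tail(overlap, k, set_size, background_size)
        if p < best_p:
            best_p, best_k = p, k
    return best_p, best_k


# Toy data: the three most important genes all belong to the annotation.
ordered = ["a", "b", "c", "d", "e", "f", "g", "h"]
annotation = {"a", "b", "c", "x", "y"}
best_p, best_k = best_prefix_enrichment(ordered, annotation, background_size=100)
print(best_p, best_k)
```

In this toy case the strongest signal is found in the top three genes, after which adding more genes only dilutes it.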

Repeat the enrichment analysis with the ranked gene list.

Resources of functional annotation

Functional annotations can be very diverse: molecular functions, pathways (genes that work together to carry out a biological process), interactions, gene regulation, involvement in disease…

Online enrichment analysis tools often have functional annotation built-in for a limited set of organisms but some tools like WebGestalt also allow you to upload your own annotation.

Pathguide (http://www.pathguide.org/) contains info about hundreds of pathway and molecular interaction related resources. It allows organism-based searches to find resources that contain functional info on the organism you work on.

MSigDB (http://www.msigdb.org), maintained by the GSEA team, contains gene sets based on GO, pathways, omics studies, sequence motifs, chromosomal position, oncogenic and immunological expression signatures, and various computational analyses. The GSEA tool from Broad will use this database by default.

Choosing the right background

Functional enrichment methods require the definition of background genes for comparison. All annotated protein-coding genes are often used as default. This leads to false positives if the experiment measured only a subset of all genes. So you should use a custom background if the tool allows it: e.g. the filtered genes from an RNASeq experiment or all proteins that were detected in a proteomics experiment.
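A small calculation shows how the choice of background changes the apparent enrichment. The numbers are hypothetical and the function name is an assumption made for this sketch; it only compares the observed overlap with the overlap expected by chance.

```python
def enrichment_ratio(overlap, list_size, set_size, background_size):
    """Observed overlap divided by the overlap expected by chance."""
    expected = list_size * set_size / background_size
    return overlap / expected


# Hypothetical experiment: 400 genes of interest, 15 of them in one pathway.
# Measured background: 8000 expressed genes, 150 of them in the pathway.
ratio_measured = enrichment_ratio(15, 400, 150, 8000)
# Default background: 20000 annotated genes, 200 of them in the pathway.
ratio_genome = enrichment_ratio(15, 400, 200, 20000)
print(ratio_measured, ratio_genome)
```

With the whole genome as background the same overlap looks almost twice as enriched, simply because most pathway genes that were expressed at all already ended up in the measured set.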

IPA: network analysis using highly curated data

Only for VIB members and holders of an IPA license.

Follow the instructions on our wiki page on IPA (http://wiki.bits.vib.be/index.php/PubMA_Exercise.7)

Cytoscape: free tool for network and pathway analysis

Cytoscape (http://www.cytoscape.org/) is a free tool for visualizing and analyzing interaction networks and pathways.

In Cytoscape, the two basic elements of a network are nodes and edges. A node represents an individual entity in a network, such as a protein. An edge represents a relationship between two nodes. Each edge has a source and a target: the source is the node from which the edge originates, and the target is the node where the edge terminates.
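The node and edge model can be made concrete with a few lines of Python. This is a minimal sketch of the data structure only, not of Cytoscape's internals; the function name is an assumption and the example edges are a toy list, not taken from the interaction file used in the exercise.

```python
from collections import defaultdict


def build_network(edges):
    """Return (nodes, targets_by_source) for a list of (source, target) pairs."""
    nodes = set()
    targets_by_source = defaultdict(set)
    for source, target in edges:
        nodes.update((source, target))
        targets_by_source[source].add(target)
    return nodes, dict(targets_by_source)


# Toy edge list: each pair is (source node, target node).
toy_edges = [("BRCA2", "RAD51"), ("BRCA2", "PALB2"), ("PALB2", "BRCA1")]
nodes, targets = build_network(toy_edges)
print(sorted(nodes))
print(targets)
```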

Visualizing the BRCA2 interaction network

Open Cytoscape and import the file of the BRCA2 interactions that you downloaded from PSICQUIC. To load interaction data in Cytoscape you need to indicate which column contains the source nodes and which column contains the target nodes.

In the top menu expand File

Select Import -> Network -> File

Browse to the file containing the BRCA2 interactions that you downloaded from PSICQUIC and click Open

Click the first column and click the Source node button

Click the second column and click the Target node button

Click OK

Import the data of the BRCA2 interactions that you downloaded from PSICQUIC into Cytoscape

The data is opened and visualized as a network

iRegulon detects regulatory networks in a set of genes

Additional features are available as apps and one of the apps we’re going to explore is iRegulon (see slides).

As an input, iRegulon needs a set of coregulated genes. We are going to use genes upregulated under hypoxia (http://www.ncbi.nlm.nih.gov/pubmed?CrntRpt=DocSum&cmd=search&term=16565084). You can download the set of genes from the Broad website (http://software.broadinstitute.org/gsea/msigdb/geneset_page.jsp?geneSetName=ELVIDGE_HYPOXIA_UP).

To use the file in Cytoscape you have to remove the header lines (e.g. in WordPad) so that what remains is just a list of gene symbols.
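Removing the header lines can also be scripted. The sketch below assumes the downloaded text file starts with the gene set name followed by a description line beginning with ">"; check your own download, since this layout is an assumption. The function name is hypothetical and the gene symbols in the toy text are placeholders.

```python
def extract_gene_symbols(text, set_name):
    """Drop blank lines, the set name and '>' description lines; keep the symbols."""
    symbols = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == set_name or line.startswith(">"):
            continue
        symbols.append(line)
    return symbols


# Toy file content with placeholder gene symbols.
toy_text = "ELVIDGE_HYPOXIA_UP\n> description of the gene set\nGENE1\nGENE2\nGENE3\n"
symbols = extract_gene_symbols(toy_text, "ELVIDGE_HYPOXIA_UP")
print("\n".join(symbols))
```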

In the top menu expand File

Select Import -> Network -> File

Select the file containing the gene set and click Open

Select Column 1 for Source Interaction

Select Show Text File Import Options in the Advanced section

Deselect Transfer first line as column names in the Column Names section. Otherwise the first row is used as a column name and you lose one of the genes in the set.

Click OK

Open the data in Cytoscape

A network of 171 unconnected nodes is created.

Select all the nodes in the network: in the top menu expand Select

Select Nodes -> Select all nodes. You can see that the nodes are selected because they change color to yellow.

In the top menu expand Apps

Select iRegulon -> Predict regulators and targets

Leave parameters at their default settings and click Submit

Predict regulators and their targets

The results are found in the Results panel: a list of enriched motifs.

Similar motifs are represented in the same color. You can view details of the motif by clicking a row in the table.

The Transcription factors tab contains a list of candidate TFs that can bind to the enriched motifs.