
Adding Gene Names to a Volcano Plot from DESeq2

January 10, 2025 By admin

In this guide, we will walk through the process of adding gene names to a volcano plot generated from DESeq2 results. The volcano plot is a useful visualization tool for differential expression analysis, and adding gene names to the plot can help identify significant genes of interest.


Step 1: Install and Load Required Libraries

First, install and load the necessary R packages. We will use ggplot2 for plotting and ggrepel to avoid overlapping labels.

R
# Install necessary packages only if they are not already installed
for (pkg in c("ggplot2", "ggrepel")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

# Load libraries
library(ggplot2)
library(ggrepel)

Step 2: Prepare DESeq2 Results

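This guide assumes you have already run DESeq2 and have the results object res. If not, here is a quick example of how to generate it. In this example, count_data (a matrix of raw counts with genes as rows and samples as columns) and col_data (a sample table with a condition column) are placeholders for your own data: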

R
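# Example DESeq2 workflow
# DESeq2 is distributed through Bioconductor; if needed, install it with BiocManager::install("DESeq2")
library(DESeq2)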

# Create a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ condition)

# Run DESeq2
dds <- DESeq(dds)

# Get results
res <- results(dds)
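
If you are unsure what count_data and col_data should look like, the following is a minimal sketch of the expected shapes. The gene names, sample names, and counts are invented for illustration only; your real inputs would come from your own quantification pipeline.

```r
# Toy inputs for illustration only; all names and values are made up.
set.seed(1)

# Raw integer counts: one row per gene, one column per sample
count_data <- matrix(
  rpois(6 * 4, lambda = 50),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Sample table: one row per sample, with the factor used in the design formula
col_data <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  row.names = colnames(count_data)
)

# The columns of the count matrix and the rows of the sample table
# should describe the same samples in the same order.
stopifnot(identical(colnames(count_data), rownames(col_data)))
```

The row names of count_data are what later become the gene labels on the plot, so it is worth checking that they are the identifiers you want to display.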

Step 3: Convert DESeq2 Results to a Data Frame

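The res object returned by results() is a DESeqResults object (an S4 DataFrame), not a base R data frame, so convert it before passing it to ggplot2. The row names hold the gene identifiers from your count matrix; copy them into a column so they can be used as labels.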

R
# Convert DESeq2 results to a data frame
res_df <- as.data.frame(res)

# Add a column for gene names
res_df$gene <- rownames(res_df)

Step 4: Create a Volcano Plot with Gene Names

Now, create a volcano plot using ggplot2 and add gene names for significantly differentially expressed genes.

R
# Define significance thresholds
padj_threshold <- 0.05
log2FC_threshold <- 1

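# Add a column to indicate significance (genes with an NA padj, e.g. those
# removed by DESeq2's independent filtering, are treated as not significant)
res_df$sig <- ifelse(!is.na(res_df$padj) & res_df$padj < padj_threshold & abs(res_df$log2FoldChange) > log2FC_threshold, "Significant", "Not Significant")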

# Create the volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = sig)) +
  geom_point(size = 1) +
  scale_color_manual(values = c("Not Significant" = "black", "Significant" = "red")) +
  theme_minimal() +
  ggtitle("Volcano Plot of Differential Expression") +
  xlab("log2 Fold Change") +
  ylab("-log10 Adjusted p-value")

# Add gene names for significant genes
# (max.overlaps = Inf forces every label to be drawn; lower it if the plot gets crowded)
volcano_plot <- volcano_plot +
  geom_text_repel(
    data = subset(res_df, padj < padj_threshold & abs(log2FoldChange) > log2FC_threshold),
    aes(label = gene),
    box.padding = 0.5,
    max.overlaps = Inf,
    size = 3
  )

# Display the plot
print(volcano_plot)
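
Before plotting, it can help to check how many genes pass the thresholds, since every one of them will get a label. The sketch below uses a small invented stand-in for res_df (the values are not real results) and counts the labeled genes, split by direction of change.

```r
# Toy stand-in for res_df; values are invented for illustration.
res_df <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2, -3.0, 0.9),
  padj = c(0.001, 0.02, 0.6, NA, 0.0004, 0.04),
  stringsAsFactors = FALSE
)
padj_threshold <- 0.05
log2FC_threshold <- 1

# Treat NA adjusted p-values as not significant
is_sig <- !is.na(res_df$padj) &
  res_df$padj < padj_threshold &
  abs(res_df$log2FoldChange) > log2FC_threshold

# Number of labels the plot would draw, and the split by direction
n_labels <- sum(is_sig)
n_up <- sum(is_sig & res_df$log2FoldChange > 0)
n_down <- sum(is_sig & res_df$log2FoldChange < 0)
print(c(labels = n_labels, up = n_up, down = n_down))
```

If the count runs into the hundreds, consider tightening the thresholds or labeling only a subset of genes.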

Step 5: Customize the Volcano Plot (Optional)

You can further customize the volcano plot by adjusting the number of labeled genes, changing colors, or modifying the plot theme.

Example: Label Top 20 Genes

R
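# Sort by adjusted p-value and select the 20 most significant genes
# (head() avoids padding with NA rows when fewer than 20 genes are available)
top_genes <- head(res_df[order(res_df$padj), ], 20)

# Add labels for the top 20 genes
# Note: this layer is drawn on top of the Step 4 labels; to show only the
# top 20, add it to a plot built without the Step 4 geom_text_repel() layer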
volcano_plot <- volcano_plot +
  geom_text_repel(
    data = top_genes,
    aes(label = gene),
    box.padding = 0.5,
    max.overlaps = Inf,
    size = 3,
    color = "blue"
  )

# Display the plot
print(volcano_plot)
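
Sorting by adjusted p-value alone can select genes from only one side of the plot. One possible variation, sketched here on invented toy data, picks the top genes separately for up- and down-regulated genes. The name n_top is a hypothetical parameter introduced for this example.

```r
# Toy stand-in for res_df; values are invented for illustration.
res_df <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2, -3.0, 0.9),
  padj = c(0.001, 0.02, 0.6, NA, 0.0004, 0.04),
  stringsAsFactors = FALSE
)
n_top <- 2  # hypothetical: number of genes to label in each direction

# Split into up- and down-regulated genes, dropping NA adjusted p-values
up <- res_df[!is.na(res_df$padj) & res_df$log2FoldChange > 0, ]
down <- res_df[!is.na(res_df$padj) & res_df$log2FoldChange < 0, ]

# Take the n_top smallest adjusted p-values from each side
top_up <- head(up[order(up$padj), ], n_top)
top_down <- head(down[order(down$padj), ], n_top)
top_genes <- rbind(top_up, top_down)
print(top_genes$gene)
```

The resulting top_genes data frame can be passed as the data argument of geom_text_repel() in the same way as in the example above.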

Step 6: Save the Volcano Plot

Save the volcano plot to a file for later use or publication.

R
# Save the plot as a PDF
ggsave("volcano_plot.pdf", plot = volcano_plot, width = 10, height = 8)

# Save the plot as a PNG
ggsave("volcano_plot.png", plot = volcano_plot, width = 10, height = 8, dpi = 300)

Tips and Tricks

  1. Avoid Overlapping Labels: Use ggrepel::geom_text_repel to prevent overlapping gene labels.
  2. Adjust Significance Thresholds: Modify padj_threshold and log2FC_threshold to focus on genes of interest.
  3. Highlight Specific Genes: Manually add labels for specific genes by filtering the data frame.
  4. Customize Colors: Use scale_color_manual to customize the colors for significant and non-significant genes.
  5. Interactive Plots: Use plotly::ggplotly to create an interactive volcano plot.
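
As a sketch of tip 3, the snippet below filters a results data frame down to a hand-picked list of genes. The data and the gene names in genes_of_interest are invented for illustration, and one name is deliberately absent to show how to spot identifiers that do not match.

```r
# Toy stand-in for res_df; values are invented for illustration.
res_df <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2, -3.0, 0.9),
  padj = c(0.001, 0.02, 0.6, NA, 0.0004, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical list of genes to highlight; "geneX" is not in the results
genes_of_interest <- c("gene2", "gene5", "geneX")

# Keep only the rows for the requested genes
highlight_df <- res_df[res_df$gene %in% genes_of_interest, ]

# Report requested genes that were not found (often an ID format mismatch)
missing_genes <- setdiff(genes_of_interest, res_df$gene)
print(highlight_df$gene)
print(missing_genes)
```

Passing highlight_df as the data argument of geom_text_repel(), with aes(label = gene), labels only these genes.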

By following this guide, you can create a volcano plot with gene names from DESeq2 results, making it easier to interpret and share your differential expression analysis.
