Utilizing ggplot2: 50 Essential Visualizations for Bioinformatics Analysis
March 31, 2024
Which type of visualization suits which kind of problem? This tutorial helps you choose the right chart for your specific objective and shows how to implement it in R using ggplot2.
This tutorial is primarily geared towards readers who have some basic knowledge of the R programming language and want to make complex, good-looking charts with ggplot2.
Top 50 ggplot2 Visualizations – The Master List
An effective chart is one that:
- Conveys the right information without distorting facts.
- Is simple but elegant. It should not force you to think much in order to understand it.
- Has aesthetics that support the information rather than overshadow it.
- Is not overloaded with information.
The list below sorts the visualizations by their primary purpose. Broadly, there are 8 types of objectives for which you may construct plots. So, before you actually make a plot, try to figure out what findings and relationships you would like to convey or examine through the visualization.
Correlation
The following plots help to examine how well correlated two variables are.
Scatterplot
The most frequently used plot for data analysis is undoubtedly the scatterplot. Whenever you want to understand the nature of the relationship between two variables, the scatterplot is invariably the first choice. In bioinformatics, it is commonly used to explore relationships between two numerical variables, for example to examine correlations or patterns in gene expression levels, protein interactions, or sequence similarities. Scatter plots can also be used to identify outliers or clusters within the data, providing insights into underlying biological processes.
It can be drawn using geom_point(). Additionally, geom_smooth(), which draws a smoothing line (based on loess by default for datasets with fewer than 1,000 observations), can be tweaked to draw the line of best fit by setting method="lm".
# install.packages("ggplot2")
# load package and data
options(scipen=999)  # turn off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.
data("midwest", package = "ggplot2")
# midwest <- read.csv("http://goo.gl/G1K41K")  # bkup data source
# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +
  geom_smooth(method="loess", se=FALSE) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
       y="Population",
       x="Area",
       title="Scatterplot",
       caption = "Source: midwest")
plot(gg)
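The text above mentions switching the smoother to a line of best fit. As a minimal sketch, assuming the same midwest data and axis limits as the example, only the method argument changes. The object name gg_lm is my own and is not part of the original tutorial.

```r
library(ggplot2)

# Assumption: same midwest data and axis limits as the scatterplot example
data("midwest", package = "ggplot2")

gg_lm <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state, size = popdensity)) +
  geom_smooth(method = "lm", se = FALSE) +  # straight line of best fit instead of loess
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(title = "Scatterplot with line of best fit",
       x = "Area", y = "Population")

# plot(gg_lm)  # uncomment to draw
```

Keep in mind that xlim() and ylim() drop the points outside the limits before the line is fitted. If you only want to zoom without changing the fit, coord_cartesian(xlim = ..., ylim = ...) is the usual alternative.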
Scatterplot With Encircling
When presenting results, I sometimes encircle a special group of points or a region in the chart to draw attention to those peculiar cases. This can be conveniently done using geom_encircle() from the ggalt package. In bioinformatics, a scatter plot with encircling is a visualization technique used to highlight clusters or groups of data points within a scatter plot. This can be particularly useful for identifying subpopulations or patterns in biological data. The encircling is typically done by drawing a polygon or ellipse around the relevant data points, making it easier to visually distinguish them from the rest of the data.
Within geom_encircle(), set the data to a new dataframe that contains only the points (rows) of interest. Moreover, you can expand the curve so that it passes just outside the points. The color and size (thickness) of the curve can be modified as well. See the example below.
# install 'ggalt' pkg
# devtools::install_github("hrbrmstr/ggalt")
options(scipen = 999)
library(ggplot2)
library(ggalt)
midwest_select <- midwest[midwest$poptotal > 350000 &
                          midwest$poptotal <= 500000 &
                          midwest$area > 0.01 &
                          midwest$area < 0.1, ]
# Plot
ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +  # draw points
  geom_smooth(method="loess", se=FALSE) +  # draw smoothing line
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  geom_encircle(aes(x=area, y=poptotal),
                data=midwest_select,
                color="red",
                size=2,
                expand=0.08) +  # encircle
  labs(subtitle="Area Vs Population",
       y="Population",
       x="Area",
       title="Scatterplot + Encircle",
       caption="Source: midwest")
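If ggalt is not available, a rough substitute for the encircling idea is to draw the convex hull of the selected points. This is a sketch of a different technique, not what geom_encircle() does internally: it uses base R's chull() and geom_polygon(), so the outline is angular rather than smooth. The names hull_idx, hull and gg_hull are my own.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Same subset of interest as in the encircling example
midwest_select <- midwest[midwest$poptotal > 350000 &
                          midwest$poptotal <= 500000 &
                          midwest$area > 0.01 &
                          midwest$area < 0.1, ]

# chull() returns the row indices of the points lying on the convex hull
hull_idx <- chull(midwest_select$area, midwest_select$poptotal)
hull <- midwest_select[hull_idx, ]

gg_hull <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state, size = popdensity)) +
  geom_polygon(data = hull, fill = NA, colour = "red") +  # outline only
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(title = "Scatterplot + Convex Hull", x = "Area", y = "Population")

# plot(gg_hull)  # uncomment to draw
```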
Jitter Plot
In bioinformatics, a jitter plot is a type of visualization used to display individual data points along with their distribution. It is particularly useful when dealing with datasets that have many overlapping points, such as in gene expression data or sequence alignment scores. By adding a small amount of random noise (jitter) to the data points, the plot can reveal underlying patterns or clusters that would not be apparent in a standard scatter plot. Jitter plots are often used in conjunction with other plots, such as box plots or violin plots, to provide a more comprehensive view of the data.
Let’s use a new dataset to draw the scatterplot. This time, I will use the mpg dataset to plot city mileage (cty) vs highway mileage (hwy).
# load package and data
library(ggplot2)
data(mpg, package="ggplot2")  # alternate source: "http://goo.gl/uEeRGu"
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg, aes(cty, hwy))
# Scatterplot
g + geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(subtitle="mpg: city vs highway mileage",
       y="hwy",
       x="cty",
       title="Scatterplot with overlapping points",
       caption="Source: mpg")
What we have here is a scatterplot of city and highway mileage in the mpg dataset. We have seen a similar scatterplot before; this one looks neat and gives a clear idea of how well city mileage (cty) and highway mileage (hwy) are correlated.
But this innocent-looking plot is hiding something. Can you find out what?
dim(mpg)
The original data has 234 data points, but the chart seems to display fewer points. What has happened? Many overlapping points appear as a single dot. The fact that both cty and hwy are integers in the source dataset made it all the more convenient to hide this detail. So be extra careful the next time you make a scatterplot with integers.
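To check the overlap claim yourself, you can compare the number of rows with the number of distinct (cty, hwy) pairs. This is a small sketch in base R; the variable names n_total and n_unique are my own.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

n_total  <- nrow(mpg)                             # rows in the dataset
n_unique <- nrow(unique(mpg[, c("cty", "hwy")]))  # distinct (cty, hwy) positions

n_total              # total points that should be on the chart
n_unique             # dots you can actually tell apart
n_total - n_unique   # points hidden behind another dot
```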
So how do we handle this? There are a few options. We can make a jitter plot with geom_jitter(). As the name suggests, the overlapping points are randomly jittered around their original positions, by an amount controlled by the width argument.
# load package and data
library(ggplot2)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg, aes(cty, hwy))
g + geom_jitter(width = .5, size=1) +
  labs(subtitle="mpg: city vs highway mileage",
       y="hwy",
       x="cty",
       title="Jittered Points")
More points are revealed now. The larger the width, the further the points are jittered from their original positions.
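Because the jitter is random, the plot looks slightly different every time it is drawn. One way to make it reproducible, sketched here, is to pass a position_jitter() object with a fixed seed to geom_point(). The seed value 42 and the choice height = 0 (jitter only along x) are my own assumptions for illustration.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# A fixed seed makes the random offsets identical on every render
jit <- position_jitter(width = 0.5, height = 0, seed = 42)

p_jit <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(position = jit, size = 1) +
  labs(title = "Jittered Points (reproducible)")

# Building the plot twice yields the same jittered x positions
x1 <- ggplot_build(p_jit)$data[[1]]$x
x2 <- ggplot_build(p_jit)$data[[1]]$x
isTRUE(all.equal(x1, x2))
```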
Counts Chart
In bioinformatics, a counts chart is a type of visualization used to display the distribution of counts for different categories or features in a dataset. This type of chart is commonly used in analyses such as differential gene expression, where the counts represent the number of reads or occurrences of a particular gene or feature in different samples or conditions. Counts charts can take various forms, such as bar charts, histograms, or violin plots, depending on the nature of the data and the specific analysis being performed.
The second option to overcome the problem of overlapping data points is to use what is called a counts chart. Wherever more points overlap, the circle gets bigger.
# load package and data
library(ggplot2)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg, aes(cty, hwy))
g + geom_count(col="tomato3", show.legend=FALSE) +
  labs(subtitle="mpg: city vs highway mileage",
       y="hwy",
       x="cty",
       title="Counts Plot")
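To see the numbers behind the circle sizes, you can tabulate how many rows share each (cty, hwy) position. The sketch below does this in base R; it only approximates what the chart encodes and does not call any geom_count() internals. The name counts is my own.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# Tabulate how many rows share each (cty, hwy) combination
counts <- as.data.frame(table(cty = mpg$cty, hwy = mpg$hwy))
counts <- counts[counts$Freq > 0, ]      # drop combinations that never occur
counts <- counts[order(-counts$Freq), ]  # most crowded positions first

head(counts)
```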
Bubble plot
In bioinformatics, a bubble plot is a type of visualization used to display three-dimensional data in a two-dimensional format. It is similar to a scatter plot but adds a third dimension by varying the size of the points (bubbles) to represent the magnitude of a third variable. This can be useful for visualizing relationships between three variables, such as gene expression levels, sample conditions, and gene functions. The position of the bubbles on the plot corresponds to the values of the two variables being compared, while the size of the bubbles indicates the magnitude of the third variable.
While a scatterplot lets you compare the relationship between 2 continuous variables, a bubble chart serves well if you want to understand the relationship within the underlying groups based on:
- A categorical variable (by changing the color) and
- Another continuous variable (by changing the size of points).
In simpler words, bubble charts are more suitable if you have 4-dimensional data where two variables are numeric (X and Y), one is categorical (color) and another is numeric (size).
In the example below, the bubble chart clearly distinguishes the range of displ between the manufacturers and shows how the slope of the lines of best fit varies, providing a better visual comparison between the groups.
# load package and data
library(ggplot2)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]
# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) +
  labs(subtitle="mpg: Displacement vs City Mileage",
       title="Bubble chart")
g + geom_jitter(aes(col=manufacturer, size=hwy)) +
  geom_smooth(aes(col=manufacturer), method="lm", se=FALSE)
Animated Bubble chart
In bioinformatics, an animated bubble chart is a dynamic visualization that uses animation to show changes in data over time or other relevant dimensions. This type of chart typically displays bubbles that vary in size and possibly color, representing different categories or values, while the animation sequence highlights how these bubbles change position, size, or color over time or in response to other variables. Animated bubble charts can be useful for visualizing temporal patterns in biological data, such as changes in gene expression levels across different time points or conditions.
An animated bubble chart can be implemented using the gganimate package. It is the same as the bubble chart, except that you show how the values change over a fifth dimension (typically time).
The key thing to do is to set aes(frame) to the column on which you want to animate. The rest of the plot construction is the same. Once the plot is constructed, you can animate it using gganimate() with a chosen interval. Note that aes(frame) and gganimate() belong to the original gganimate API from the dgrtwo/gganimate repository referenced in the code; later releases of the package replaced them with transition_*() functions and animate().
# Source: https://github.com/dgrtwo/gganimate
# install.packages("cowplot")  # a gganimate dependency
# devtools::install_github("dgrtwo/gganimate")
library(ggplot2)
library(gganimate)
library(gapminder)
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, frame = year)) +
  geom_point() +
  geom_smooth(aes(group = year), method = "lm", show.legend = FALSE) +
  facet_wrap(~continent, scales = "free") +
  scale_x_log10()  # convert to log scale
gganimate(g, interval=0.2)
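If you have a current CRAN release of gganimate installed, the frame aesthetic and the gganimate() call above will not work. A rough equivalent under the newer API, as I understand it, uses transition_time() and animate(). This is a sketch assuming gganimate 1.0 or later; I dropped the per-year smoothing line to keep it short, and the name p_anim is my own.

```r
library(ggplot2)
library(gganimate)  # assumption: gganimate >= 1.0
library(gapminder)

p_anim <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~continent, scales = "free") +
  scale_x_log10() +                      # convert to log scale
  labs(title = "Year: {frame_time}") +   # frame_time is provided by transition_time()
  transition_time(year)                  # animate over the year column

# Rendering needs a renderer backend (for example the gifski package):
# animate(p_anim, fps = 5)
```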
Marginal Histogram / Boxplot
In bioinformatics, a marginal histogram or boxplot is a type of visualization that combines a scatter plot with histograms or boxplots along the axes. This allows for the simultaneous visualization of the marginal distributions of two variables along with their joint distribution.
For example, in a scatter plot with marginal histograms, the main plot displays the relationship between two variables with points, while the histograms on the top and right margins show the distributions of each variable individually. This can be useful for exploring correlations between variables and identifying patterns in the data.
Similarly, in a scatter plot with marginal boxplots, the main plot shows the relationship between two variables with points, while the boxplots on the top and right margins display summary statistics (such as median, quartiles, and outliers) for each variable. This can provide additional insights into the data distribution and help identify potential outliers or trends.
If you want to show the relationship as well as the distribution in the same chart, use the marginal histogram. It has a histogram of the X and Y variables at the margins of the scatterplot.
This can be implemented using the ggMarginal() function from the ‘ggExtra’ package. Apart from a histogram, you could choose to draw a marginal boxplot or density plot by setting the respective type option.
# load package and data
library(ggplot2)
library(ggExtra)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
mpg_select <- mpg[mpg$hwy >= 35 & mpg$cty > 27, ]
g <- ggplot(mpg, aes(cty, hwy)) +
  geom_count() +
  geom_smooth(method="lm", se=FALSE)
ggMarginal(g, type = "histogram", fill="transparent")
ggMarginal(g, type = "boxplot", fill="transparent")
# ggMarginal(g, type = "density", fill="transparent")
Correlogram
In bioinformatics, a correlogram is a visual representation of the correlation matrix between variables. It is a matrix of plots, where each plot shows the correlation between two variables as a colored square or circle. The color or size of the square/circle indicates the strength and direction of the correlation, with different colors or sizes representing different correlation values (e.g., positive, negative, or no correlation).
Correlograms are useful for identifying patterns of correlation between variables in large datasets, such as gene expression data or clinical measurements. They can help researchers identify potential relationships between variables and guide further analysis.
A correlogram lets you examine the correlation of multiple continuous variables present in the same dataframe. This is conveniently implemented using the ggcorrplot package.
# devtools::install_github("kassambara/ggcorrplot")
library(ggplot2)
library(ggcorrplot)
# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE, lab_size = 3, method="circle",
           colors = c("tomato2", "white", "springgreen3"),
           title="Correlogram of mtcars", ggtheme=theme_bw)
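A correlogram shows every coefficient, whether or not it is statistically distinguishable from zero. As a hedged sketch, ggcorrplot also ships a cor_pmat() helper that computes pairwise p-values, which can be passed through the p.mat argument to blank out non-significant cells. The 0.05 threshold is my own choice for illustration, not a recommendation from the original tutorial.

```r
library(ggplot2)
library(ggcorrplot)

data(mtcars)
corr  <- round(cor(mtcars), 1)
p_mat <- cor_pmat(mtcars)  # matrix of p-values from pairwise correlation tests

gg_corr <- ggcorrplot(corr, hc.order = TRUE, type = "lower",
                      p.mat = p_mat,
                      sig.level = 0.05,   # assumed threshold
                      insig = "blank",    # hide cells above the threshold
                      title = "Correlogram of mtcars (significant cells only)")

# print(gg_corr)  # uncomment to draw
```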
Deviation
Compare variation in values between a small number of items (or categories) with respect to a fixed reference.
Diverging bars
In bioinformatics, diverging bars are a type of bar chart used to visualize the differences or changes between two groups or conditions. The chart consists of bars that extend from a central axis, with each bar representing the magnitude of a variable for each group. The bars on one side of the axis typically represent one group or condition, while the bars on the other side represent the other group or condition.
Diverging bars are useful for comparing the relative sizes of values between two groups, highlighting differences, and showing the direction of the change (e.g., increase or decrease). They are commonly used in bioinformatics to visualize various types of data, such as gene expression levels in different tissues or conditions, mutation frequencies in cancer samples, or metabolite concentrations in different biological samples.
A diverging bar chart is a bar chart that can handle both negative and positive values. It can be implemented with a smart tweak to geom_bar(). But the usage of geom_bar() can be quite confusing, because it can be used to make a bar chart of values as well as a histogram-like chart of counts. Let me explain.
By default, geom_bar() has the stat set to count. That means that when you provide just an X variable (and no Y variable), it counts the rows at each X value and draws those counts, much like a histogram.
To make geom_bar() draw bars whose heights come from the data instead, you need to do two things.
- Set stat="identity".
- Provide both x and y inside aes(), where x is either character or factor and y is numeric.
To get diverging bars instead of ordinary bars, make sure your categorical variable has 2 categories that switch at a certain threshold of the continuous variable. In the example below, mpg from the mtcars dataset is normalised by computing the z-score. Vehicles with a z-score above zero are marked green and those below are marked red.
library(ggplot2)
theme_set(theme_bw())
# Data Prep
data("mtcars")  # load data
mtcars$`car name` <- rownames(mtcars)  # create new column for car names
mtcars$mpg_z <- round((mtcars$mpg - mean(mtcars$mpg))/sd(mtcars$mpg), 2)  # compute normalized mpg
mtcars$mpg_type <- ifelse(mtcars$mpg_z < 0, "below", "above")  # above / below avg flag
mtcars <- mtcars[order(mtcars$mpg_z), ]  # sort
mtcars$`car name` <- factor(mtcars$`car name`, levels = mtcars$`car name`)  # convert to factor to retain sorted order in plot.
# Diverging Barcharts
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) +
  geom_bar(stat='identity', aes(fill=mpg_type), width=.5) +
  scale_fill_manual(name="Mileage",
                    labels = c("Above Average", "Below Average"),
                    values = c("above"="#00ba38", "below"="#f8766d")) +
  labs(subtitle="Normalised mileage from 'mtcars'",
       title= "Diverging Bars") +
  coord_flip()
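The two modes of geom_bar() described above can be compared side by side. The sketch below uses the mpg dataset rather than mtcars so that it runs on its own; class_means and the plot names are my own. It also shows geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# 1) Default stat = "count": only x is mapped, bar height = number of rows
p_count <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# 2) stat = "identity": bar height is read from a y column
class_means <- aggregate(cty ~ class, data = mpg, FUN = mean)
p_identity <- ggplot(class_means, aes(x = class, y = cty)) +
  geom_bar(stat = "identity")

# 3) geom_col() draws the same bars as geom_bar(stat = "identity")
p_col <- ggplot(class_means, aes(x = class, y = cty)) +
  geom_col()
```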
Diverging Lollipop Chart
In bioinformatics, a diverging lollipop chart is a variation of the traditional lollipop chart that is used to visualize the differences or changes between two groups or conditions. Like a standard lollipop chart, it consists of a vertical line (the “lollipop”) for each data point, with the length of the line representing the magnitude of the data point. However, in a diverging lollipop chart, the lines are divided into two segments, with one segment representing one group or condition and the other segment representing the other group or condition. The chart typically includes a central axis to help compare the lengths of the segments and visualize the differences between the two groups. Diverging lollipop charts are useful for comparing the relative sizes of values between two groups and highlighting differences or trends in the data. They are commonly used in bioinformatics to visualize gene expression data, mutation frequencies, or other types of comparative data.
A lollipop chart conveys the same information as a bar chart or a diverging bar chart, except that it looks more modern. Instead of geom_bar, I use geom_point and geom_segment to get the lollipops right. Let’s draw a lollipop chart using the same data I prepared in the previous example of diverging bars.
library(ggplot2)
theme_set(theme_bw())
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) +
  geom_point(stat='identity', fill="black", size=6) +
  geom_segment(aes(y = 0,
                   x = `car name`,
                   yend = mpg_z,
                   xend = `car name`),
               color = "black") +
  geom_text(color="white", size=2) +
  labs(title="Diverging Lollipop Chart",
       subtitle="Normalized mileage from 'mtcars': Lollipop") +
  ylim(-2.5, 2.5) +
  coord_flip()
Diverging Dot Plot
In bioinformatics, a diverging dot plot is a variation of the dot plot that is used to compare the distribution of data points between two groups or conditions. In a diverging dot plot, each data point is represented by a dot, and the dots are grouped by their respective groups or conditions. The plot typically includes a central axis that separates the two groups, with the dots on one side representing one group and the dots on the other side representing the other group.
Diverging dot plots are useful for visualizing differences in the distribution of data points between two groups and identifying patterns or trends in the data. They can be particularly effective for comparing gene expression levels, mutation frequencies, or other types of data between different biological samples or conditions.
A dot plot conveys similar information. The principles are the same as in diverging bars, except that only points are used. The example below uses the same data prepared in the diverging bars example.
library(ggplot2)
theme_set(theme_bw())
# Plot
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) +
  geom_point(stat='identity', aes(col=mpg_type), size=6) +
  scale_color_manual(name="Mileage",
                     labels = c("Above Average", "Below Average"),
                     values = c("above"="#00ba38", "below"="#f8766d")) +
  geom_text(color="white", size=2) +
  labs(title="Diverging Dot Plot",
       subtitle="Normalized mileage from 'mtcars': Dotplot") +
  ylim(-2.5, 2.5) +
  coord_flip()
Area Chart
In bioinformatics, an area chart is a type of visualization that displays data as a series of data points connected by lines and the area between the lines and the x-axis is filled with color or shading. This chart is useful for showing trends over time or other ordered categories.
In the context of bioinformatics, area charts can be used to represent various types of data, such as gene expression levels over time, the abundance of different species in a microbial community over time, or the distribution of protein domains along a protein sequence. Area charts are particularly effective for visualizing cumulative data or stacked proportions, where the total value at each point is of interest.
Area charts are typically used to visualize how a particular metric (such as % returns from a stock) performed compared to a certain baseline. Other kinds of % returns or % change data are also commonly plotted this way. geom_area() implements this.
library(ggplot2)
library(quantmod)
data("economics", package = "ggplot2")
# Compute % Returns
economics$returns_perc <- c(0, diff(economics$psavert)/economics$psavert[-length(economics$psavert)])
# Create break points and labels for axis ticks
brks <- economics$date[seq(1, length(economics$date), 12)]
lbls <- lubridate::year(economics$date[seq(1, length(economics$date), 12)])
# Plot
ggplot(economics[1:100, ], aes(date, returns_perc)) +
  geom_area() +
  scale_x_date(breaks=brks, labels=lbls) +
  theme(axis.text.x = element_text(angle=90)) +
  labs(title="Area Chart",
       subtitle = "Perc Returns for Personal Savings",
       y="% Returns for Personal savings",
       caption="Source: economics")
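The returns computation in the example is easier to follow on a tiny series. The sketch below applies the same formula to three made-up values; note that the formula yields a fractional change, so I multiply by 100 here to express it in percent. The vector x and the names returns and returns_pct are hypothetical.

```r
# Hypothetical series, chosen only to make the arithmetic easy to follow
x <- c(10, 11, 9.9)

# (x[t] - x[t-1]) / x[t-1]; the first element has no predecessor, so it is set to 0
returns <- c(0, diff(x) / x[-length(x)])

# Multiply by 100 to express the change in percent
returns_pct <- 100 * returns

round(returns_pct, 2)
# 0 10 -10
```

The second value rises from 10 to 11 (a 10% gain) and the third falls from 11 to 9.9 (a 10% loss), which is why the area dips below the baseline whenever the series declines.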
Ranking
Used to compare the position or performance of multiple items with respect to each other. Actual values matter somewhat less than the ranking.
Ordered Bar Chart
In bioinformatics, an ordered bar chart is a type of bar chart where the bars are arranged in a specific order based on the values they represent. The bars are typically sorted either in ascending or descending order of the values they represent, making it easier to compare the values and identify trends in the data.
Ordered bar charts are commonly used in bioinformatics to visualize various types of data, such as gene expression levels, protein abundances, or mutation frequencies. They can be particularly useful for identifying outliers, comparing the relative sizes of values, or highlighting patterns in the data.
An ordered bar chart is a bar chart that is ordered by the Y axis variable. Just sorting the dataframe by the variable of interest isn’t enough to order the bar chart. For the bar chart to retain the order of the rows, the X axis variable (i.e. the categories) has to be converted into a factor.
Let’s plot the mean city mileage for each manufacturer from the mpg dataset. First, aggregate the data and sort it before you draw the plot. Finally, the X variable is converted to a factor.
Let’s see how that is done.
library(ggplot2)  # provides the mpg dataset
# Prepare data: group mean city mileage by manufacturer.
cty_mpg <- aggregate(mpg$cty, by=list(mpg$manufacturer), FUN=mean)  # aggregate
colnames(cty_mpg) <- c("make", "mileage")  # change column names
cty_mpg <- cty_mpg[order(cty_mpg$mileage), ]  # sort
cty_mpg$make <- factor(cty_mpg$make, levels = cty_mpg$make)  # to retain the order in plot.
head(cty_mpg, 4)
#>          make  mileage
#> 9     lincoln 11.33333
#> 8  land rover 11.50000
#> 3       dodge 13.13514
#> 10    mercury 13.25000
The X variable is now a factor, let’s plot.
library(ggplot2)
theme_set(theme_bw())
# Draw plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
  geom_bar(stat="identity", width=.5, fill="tomato3") +
  labs(title="Ordered Bar Chart",
       subtitle="Make Vs Avg. Mileage",
       caption="source: mpg") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
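An alternative to sorting the dataframe and rebuilding the factor by hand is to let reorder() set the factor levels inside aes(). This is a sketch of that shortcut, not the approach used in the tutorial's code; cty_mpg2 and p_ordered are names I made up so the original cty_mpg object is left untouched.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# Mean city mileage per manufacturer, left unsorted on purpose
cty_mpg2 <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)
names(cty_mpg2) <- c("make", "mileage")

# reorder() returns a factor whose levels are sorted by mileage
p_ordered <- ggplot(cty_mpg2, aes(x = reorder(make, mileage), y = mileage)) +
  geom_col(width = 0.5, fill = "tomato3") +
  labs(x = "make", title = "Ordered Bar Chart via reorder()") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

# The first few levels are the manufacturers with the lowest mean mileage
head(levels(reorder(cty_mpg2$make, cty_mpg2$mileage)), 4)
```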
Lollipop Chart
In bioinformatics, a lollipop chart is a type of chart that is used to visualize data points as markers along a line. It is similar to a scatter plot but with a line connecting each data point to a horizontal axis, resembling a series of lollipops.
Lollipop charts are often used to display gene expression data, where each data point represents the expression level of a gene. The length of the line (the “stick” of the lollipop) represents the magnitude of the expression level, while the data point (the “lollipop head”) indicates the exact value. Lollipop charts are useful for comparing multiple data points and identifying trends or patterns in the data.
Lollipop charts convey the same information as bar charts. By reducing the thick bars to thin lines, they reduce clutter and lay more emphasis on the value. They look nice and modern.
library(ggplot2)
theme_set(theme_bw())
# Plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
  geom_point(size=3) +
  geom_segment(aes(x=make,
                   xend=make,
                   y=0,
                   yend=mileage)) +
  labs(title="Lollipop Chart",
       subtitle="Make Vs Avg. Mileage",
       caption="source: mpg") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
Dot Plot
In bioinformatics, a dot plot is a graphical method used to compare two biological sequences. It is a simple and intuitive way to visualize the similarity between sequences by plotting one sequence along the x-axis and the other sequence along the y-axis.
In a dot plot, a dot is placed at each position where the corresponding residues in the two sequences are identical. By examining the pattern of dots, researchers can identify regions of similarity, such as conserved domains or regions of sequence homology. Dot plots can also reveal insertions, deletions, or other mutations that have occurred between the sequences.
Dot plots can be especially useful for analyzing pairwise sequence alignments, identifying repeats or patterns within a single sequence, or comparing sequences to detect similarities or differences.
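The dot plot drawn in this section is the ranking chart (often called a Cleveland dot plot), not the sequence-comparison dot plot described above. It is very similar to a lollipop chart, but without the line, and it is flipped to a horizontal position. It puts more emphasis on the rank ordering of items with respect to actual values and on how far apart the entities are from each other.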
library(ggplot2)
library(scales)
theme_set(theme_classic())
# Plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
  geom_point(col="tomato2", size=3) +  # Draw points
  geom_segment(aes(x=make,
                   xend=make,
                   y=min(mileage),
                   yend=max(mileage)),
               linetype="dashed",
               size=0.1) +  # Draw dashed lines
  labs(title="Dot Plot",
       subtitle="Make Vs Avg. Mileage",
       caption="source: mpg") +
  coord_flip()
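For the sequence-comparison kind of dot plot described at the start of this section, a minimal sketch in ggplot2 could look like the following. The two sequences are made up for illustration, and the approach (marking every pair of positions with identical residues) is the plain textbook version without any window or scoring filter. All object names are my own.

```r
library(ggplot2)

# Two short, made-up DNA sequences (illustrative only)
seq_a <- strsplit("GATTACAGATTACA", "")[[1]]
seq_b <- strsplit("GATTACATTACA", "")[[1]]

# Every (i, j) pair of positions; keep those where the residues are identical
grid <- expand.grid(i = seq_along(seq_a), j = seq_along(seq_b))
dots <- grid[seq_a[grid$i] == seq_b[grid$j], ]

p_dot <- ggplot(dots, aes(x = i, y = j)) +
  geom_point(size = 1.5) +
  scale_y_reverse() +  # position 1 at the top, a common convention for dot plots
  coord_fixed() +
  labs(x = "Sequence A position", y = "Sequence B position",
       title = "Sequence dot plot (exact matches)")

# plot(p_dot)  # uncomment to draw
```

Runs of dots along a diagonal indicate stretches where the two sequences agree; a shifted diagonal hints at an insertion or deletion.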
Slope Chart
In bioinformatics, a slope chart is a type of chart used to visualize changes in data over time or between different conditions. It consists of a series of lines that connect data points representing the same entity (e.g., gene, protein) across different time points or conditions. Each line represents a different entity, and the slope of the line indicates the rate or direction of change for that entity.
Slope charts are useful for comparing the trends or changes in multiple entities simultaneously. They can be particularly effective for visualizing gene expression data, where each line represents the expression levels of a different gene across different experimental conditions or time points. Slope charts can help researchers identify genes that show similar or opposite trends in expression and understand the overall patterns in the data.
Slope charts are an excellent way of comparing the positional placements between 2 points in time. At the moment, there is no built-in function to construct this. The following code serves as a pointer for how you may approach it.
library(ggplot2)
library(scales)
theme_set(theme_classic())
# prep data
df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")
colnames(df) <- c("continent", "1952", "1957")
left_label <- paste(df$continent, round(df$`1952`), sep=", ")
right_label <- paste(df$continent, round(df$`1957`), sep=", ")
df$class <- ifelse((df$`1957` - df$`1952`) < 0, "red", "green")
# Plot
p <- ggplot(df) +
  geom_segment(aes(x=1, xend=2, y=`1952`, yend=`1957`, col=class), size=.75, show.legend=FALSE) +
  geom_vline(xintercept=1, linetype="dashed", size=.1) +
  geom_vline(xintercept=2, linetype="dashed", size=.1) +
  scale_color_manual(labels = c("Up", "Down"),
                     values = c("green"="#00ba38", "red"="#f8766d")) +  # color of lines
  labs(x="", y="Mean GdpPerCap") +  # Axis labels
  xlim(.5, 2.5) +
  ylim(0, (1.1*(max(df$`1952`, df$`1957`))))  # X and Y axis limits
# Add texts
p <- p + geom_text(label=left_label, y=df$`1952`, x=rep(1, NROW(df)), hjust=1.1, size=3.5)
p <- p + geom_text(label=right_label, y=df$`1957`, x=rep(2, NROW(df)), hjust=-0.1, size=3.5)
p <- p + geom_text(label="Time 1", x=1, y=1.1*(max(df$`1952`, df$`1957`)), hjust=1.2, size=5)  # title
p <- p + geom_text(label="Time 2", x=2, y=1.1*(max(df$`1952`, df$`1957`)), hjust=-0.1, size=5)  # title
# Minify theme
p + theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.text.x = element_blank(),
          panel.border = element_blank(),
          plot.margin = unit(c(1,2,1,2), "cm"))
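If your data is in long format (one row per entity and time point), a slope chart can also be sketched with geom_line() grouped by entity, which avoids placing each text layer by hand. The data below is entirely hypothetical and the column names are my own; this is an alternative layout, not a rewrite of the code above.

```r
library(ggplot2)

# Hypothetical values for three entities at two time points (not real data)
df_long <- data.frame(
  entity = rep(c("Gene A", "Gene B", "Gene C"), times = 2),
  time   = factor(rep(c("Time 1", "Time 2"), each = 3)),
  value  = c(10, 25, 40, 18, 20, 55)
)

# Flag each entity as rising or falling between the two time points
first <- df_long$value[df_long$time == "Time 1"]
last  <- df_long$value[df_long$time == "Time 2"]
df_long$trend <- rep(ifelse(last - first < 0, "Down", "Up"), times = 2)

p_slope <- ggplot(df_long, aes(x = time, y = value, group = entity, colour = trend)) +
  geom_line() +
  geom_point() +
  geom_text(aes(label = entity),
            data = df_long[df_long$time == "Time 1", ],
            hjust = 1.2, show.legend = FALSE) +
  scale_colour_manual(values = c(Up = "#00ba38", Down = "#f8766d")) +
  labs(x = NULL, y = "Value", title = "Slope chart from long-format data")

# plot(p_slope)  # uncomment to draw
```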
Dumbbell Plot
In bioinformatics, a dumbbell plot is a type of chart used to compare the changes in values between two related data points. It consists of a series of lines (resembling dumbbells) that connect two data points, one representing the initial value and the other representing the final value.
Dumbbell plots are particularly useful for visualizing changes in gene expression, protein abundance, or other biological measurements between two conditions or time points. Each line in the plot represents a different gene, protein, or biological feature, and the length of the line indicates the magnitude of the change. Dumbbell plots can help researchers quickly identify trends, outliers, or patterns in the data.
Dumbbell charts are a great tool if you wish to:
1. Visualize relative positions (like growth and decline) between two points in time.
2. Compare the distance between two categories.
In order to get the correct ordering of the dumbbells, the Y variable should be a factor, and the levels of the factor should be in the same order as they should appear in the plot.
# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
library(scales)  # provides percent() for the axis labels
theme_set(theme_classic())
health <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
health$Area <- factor(health$Area, levels=as.character(health$Area))  # for right ordering of the dumbbells
# health$Area <- factor(health$Area)
gg <- ggplot(health, aes(x=pct_2013, xend=pct_2014, y=Area, group=Area)) +
  geom_dumbbell(color="#a3c4dc", size=0.75, point.colour.l="#0e668b") +
  scale_x_continuous(labels=percent) +
  labs(x=NULL, y=NULL,
       title="Dumbbell Chart",
       subtitle="Pct Change: 2013 vs 2014",
       caption="Source: https://github.com/hrbrmstr/ggalt") +
  theme(plot.title = element_text(hjust=0.5, face="bold"),
        plot.background=element_rect(fill="#f7f7f7"),
        panel.background=element_rect(fill="#f7f7f7"),
        panel.grid.minor=element_blank(), panel.grid.major.y=element_blank(),
        panel.grid.major.x=element_line(), axis.ticks=element_blank(),
        legend.position="top", panel.border=element_blank())
plot(gg)
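If you prefer not to depend on ggalt, the same shape can be approximated with plain ggplot2 by combining geom_segment() with two geom_point() layers. The sketch below uses a made-up three-row dataset (not health.csv) and also shows the factor-ordering step mentioned above. The names toy and p_dumbbell are hypothetical.

```r
library(ggplot2)

# Hypothetical before/after proportions (not the health.csv data)
toy <- data.frame(
  area     = c("City A", "City B", "City C"),
  pct_2013 = c(0.10, 0.22, 0.15),
  pct_2014 = c(0.14, 0.18, 0.21)
)

# Order the factor levels by the 2013 value so the rows plot in that order
toy$area <- factor(toy$area, levels = toy$area[order(toy$pct_2013)])

p_dumbbell <- ggplot(toy, aes(y = area)) +
  geom_segment(aes(x = pct_2013, xend = pct_2014, yend = area),
               colour = "#a3c4dc") +                              # the bar
  geom_point(aes(x = pct_2013), colour = "#a3c4dc", size = 3) +   # start point
  geom_point(aes(x = pct_2014), colour = "#0e668b", size = 3) +   # end point
  scale_x_continuous(labels = scales::percent) +
  labs(x = NULL, y = NULL,
       title = "Dumbbell chart from geom_segment() + geom_point()")

# plot(p_dumbbell)  # uncomment to draw
```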
Distribution
When you have lots and lots of data points and want to study where and how the data points are distributed.
Histogram
By default, if only one variable is supplied, geom_bar() calculates the count. For it to behave like a bar chart of supplied values, the stat="identity" option has to be set and both x and y values must be provided.
Histogram on a continuous variable
In bioinformatics, histograms are often used to visualize the distribution of continuous variables, such as gene expression levels, protein abundances, or sequence lengths.
A histogram represents the frequency or count of data points that fall within specific ranges, or “bins,” of the continuous variable. The x-axis of the histogram represents the range of values for the variable, divided into bins, while the y-axis represents the frequency or count of data points in each bin.
Histograms are useful for understanding the distribution of a variable and can help identify patterns, outliers, or other features in the data. For example, a histogram of gene expression levels could reveal whether the data is normally distributed, skewed, or has multiple peaks, providing insights into the underlying biology.
A histogram on a continuous variable can be drawn using either geom_bar() or geom_histogram(). With geom_histogram(), you can control the number of bars using the bins option, or set the range covered by each bin using binwidth. The value of binwidth is on the same scale as the continuous variable on which the histogram is built. Since geom_histogram() lets you control both the number of bins and the bin width, it is the preferred option for histograms on continuous variables.
library(ggplot2)
theme_set(theme_classic())

# Histogram on a Continuous (Numeric) Variable
g <- ggplot(mpg, aes(displ)) + scale_fill_brewer(palette = "Spectral")

g + geom_histogram(aes(fill=class),
                   binwidth = .1,  # change binwidth
                   col="black",
                   size=.1) +
  labs(title="Histogram with Auto Binning",
       subtitle="Engine Displacement across Vehicle Classes")

g + geom_histogram(aes(fill=class),
                   bins=5,  # change number of bins
                   col="black",
                   size=.1) +
  labs(title="Histogram with Fixed Bins",
       subtitle="Engine Displacement across Vehicle Classes")
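To see what the two options mean numerically, the following base R sketch bins a small hypothetical vector twice: once with a fixed bin width and once with a fixed number of bins. It only illustrates the idea; ggplot2 places its bin edges by its own rules, so the exact break points of geom_histogram() may differ.

```r
# Hypothetical measurements (e.g. engine displacement in litres)
x <- c(1.6, 1.8, 2.0, 2.0, 2.4, 2.5, 3.1, 3.3, 4.0, 4.6, 5.3, 5.7, 6.2, 7.0)

# Option 1: fix the bin width (same units as x)
binwidth <- 1
breaks_w <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
counts_w <- table(cut(x, breaks = breaks_w, include.lowest = TRUE))

# Option 2: fix the number of bins; the width follows from the data range
bins <- 5
breaks_n <- seq(min(x), max(x), length.out = bins + 1)
counts_n <- table(cut(x, breaks = breaks_n, include.lowest = TRUE))

counts_w  # six bins, each 1 unit wide
counts_n  # five bins, each (7.0 - 1.6) / 5 units wide
```

Either way every observation lands in exactly one bin, so the counts always add up to the number of observations.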
Histogram on a categorical variable
In bioinformatics, a histogram is typically used to visualize the distribution of a continuous variable. However, it is possible to create a histogram-like plot for a categorical variable by plotting the frequency or count of each category.
For example, if you have a dataset containing the types of mutations found in a set of genes, you could create a histogram-like plot where each bar represents the frequency of each mutation type. The x-axis would represent the different mutation types, and the y-axis would represent the frequency or count of each mutation type.
While not a traditional histogram, this type of plot can still be useful for visualizing the distribution of a categorical variable and identifying the most common categories or patterns in the data.
Histogram on a categorical variable would result in a frequency chart showing bars for each category. By adjusting width, you can adjust the thickness of the bars.
library(ggplot2)
theme_set(theme_classic())

# Histogram on a Categorical variable
g <- ggplot(mpg, aes(manufacturer))
g + geom_bar(aes(fill=class), width = 0.5) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Histogram on Categorical Variable",
       subtitle="Manufacturer across Vehicle Classes")
Density plot
In bioinformatics, a density plot is a type of plot used to visualize the distribution of a continuous variable. It is similar to a histogram but uses a smoothed line (the density curve) to represent the distribution of the data instead of bars.
A density plot is useful for visualizing the shape of the distribution, including whether it is symmetric, skewed, or multimodal. It can also help identify outliers and visualize the overlap between different groups or conditions.
In a density plot, the x-axis represents the values of the continuous variable, and the y-axis represents the estimated probability density at each value. The height of the curve is the density itself; the area under the curve over an interval is the estimated probability that a value falls in that interval, and the total area under the curve is 1.
library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
  labs(title="Density plot",
       subtitle="City Mileage Grouped by Number of cylinders",
       caption="Source: mpg",
       x="City Mileage",
       fill="# Cylinders")
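The relationship between curve height, area and probability can be checked numerically. The sketch below uses simulated data (an assumption made purely for illustration) and the density() function from the stats package, with a small hypothetical helper, area_under(), that applies the trapezoidal rule.

```r
set.seed(42)
# Hypothetical data: 200 simulated city-mileage values
x <- rnorm(200, mean = 17, sd = 4)

d <- density(x)  # Gaussian kernel density estimate (stats package)

# Trapezoidal rule for the area under the curve between two x values
area_under <- function(d, from = -Inf, to = Inf) {
  keep <- d$x >= from & d$x <= to
  xs <- d$x[keep]
  ys <- d$y[keep]
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}

total_area <- area_under(d)          # close to 1
p_15_to_20 <- area_under(d, 15, 20)  # estimated P(15 <= x <= 20)
max_height <- max(d$y)               # a density value, not a probability

c(total_area = total_area, p_15_to_20 = p_15_to_20, max_height = max_height)
```

The total area comes out slightly below 1 because the curve is evaluated on a finite grid that cuts off the far tails.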
Box Plot
In bioinformatics, a box plot (or box-and-whisker plot) is a type of chart used to visualize the distribution of a continuous variable through its quartiles. It is particularly useful for comparing the distribution of a variable across different groups or conditions.
A box plot consists of a box that spans the interquartile range (IQR) of the data, with a line indicating the median value. “Whiskers” extend from the box to the minimum and maximum values within a certain distance from the quartiles, often defined as 1.5 times the IQR. Points outside this range are often considered outliers and are plotted individually.
Box plots provide a visual summary of the central tendency, spread, and variability of a dataset, making them useful for identifying differences in data distributions between groups or conditions.
Box plot is an excellent tool to study the distribution. It can also show the distributions within multiple groups, along with the median, range and outliers if any.
The dark line inside the box represents the median. The top of the box is the 75th percentile and the bottom of the box is the 25th percentile. Each whisker extends to the most extreme data point that lies within 1.5*IQR of the box edge, where IQR (interquartile range) is the distance between the 25th and 75th percentiles. Points beyond the whiskers are drawn as dots and are normally treated as extreme points.
Setting varwidth=TRUE makes the width of each box proportional to the square root of the number of observations it contains.
library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(class, cty))
g + geom_boxplot(varwidth=TRUE, fill="plum") +
  labs(title="Box plot",
       subtitle="City Mileage grouped by Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

library(ggthemes)
g <- ggplot(mpg, aes(class, cty))
g + geom_boxplot(aes(fill=factor(cyl))) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Box plot",
       subtitle="City Mileage grouped by Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")
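The quantities a box plot draws can be reproduced by hand. The following base R sketch works through them for a small hypothetical vector; it mirrors the description above (quartiles, 1.5*IQR fences, whiskers ending at the most extreme points inside the fences) and is not a copy of ggplot2's internal code.

```r
# Hypothetical city-mileage values for one group
x <- c(9, 11, 12, 13, 14, 15, 16, 18, 35)

q   <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences: 1.5 * IQR beyond the box edges
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything beyond the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q[1], median = q[2], q3 = q[3], low = whisker_low, high = whisker_high)
outliers
```

Here the box spans 12 to 16, the whiskers end at 9 and 18 (actual data points, not the fences at 6 and 22), and 35 is flagged as an extreme point.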
Dot + Box Plot
In bioinformatics, a dot-and-box plot (or dot-and-whisker plot) is a combination of a dot plot and a box plot. It is used to visualize the distribution of a continuous variable across different categories or groups.
In a dot-and-box plot, the central box represents the interquartile range (IQR) of the data, with a line indicating the median value. Dots or points are then overlaid on the plot to represent individual data points. These points can provide a clearer picture of the distribution of the data, especially when the sample size is small.
The combination of the box plot and the dot plot allows for a more comprehensive visualization of the data, providing information about both the central tendency and the spread of the data, as well as the distribution of individual data points.
On top of the information provided by a box plot, the dot plot can provide more clear information in the form of summary statistics by each group. The dots are staggered such that each dot represents one observation. So, in below chart, the number of dots for a given manufacturer will match the number of rows of that manufacturer in source data.
library(ggplot2)
theme_set(theme_bw())

# plot
g <- ggplot(mpg, aes(manufacturer, cty))
g + geom_boxplot() +
  geom_dotplot(binaxis='y',
               stackdir='center',
               dotsize = .5,
               fill="red") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Box plot + Dot plot",
       subtitle="City Mileage vs Manufacturer: Each dot represents 1 row in source data",
       caption="Source: mpg",
       x="Manufacturer",
       y="City Mileage")
Tufte Boxplot
A Tufte boxplot, named after the statistician and data visualization expert Edward Tufte, is a minimalistic variation of the traditional box plot. It emphasizes simplicity and clarity, aiming to provide a clear representation of the data distribution without unnecessary elements.
In a Tufte boxplot, the box itself is not drawn. The median is marked by a point (or a small gap), the interquartile range is implied by the empty space around it, and the whiskers are drawn as plain lines extending outward from the quartiles toward the extremes of the data.
Tufte boxplots are designed to highlight the key features of the data distribution while minimizing clutter and maximizing data-ink ratio, which is a concept introduced by Tufte to emphasize the importance of maximizing the proportion of ink on the plot that represents data.
In bioinformatics, a Tufte boxplot can be used to visualize the distribution of a continuous variable, such as gene expression levels or protein abundances, across different biological conditions or groups. The Tufte boxplot emphasizes simplicity and clarity, making it a useful tool for quickly comparing data distributions and identifying outliers or trends.
To create a Tufte boxplot in bioinformatics, you would typically use software or programming libraries that support this type of visualization, such as R with the ggplot2 package. The plot would display a minimalistic representation of the data, with a line representing the median value and whiskers extending to show the range of the data within a certain distance from the median. Points outside this range would be plotted as individual outliers.
Tufte box plot, provided by ggthemes package is inspired by the works of Edward Tufte. Tufte’s Box plot is just a box plot made minimal and visually appealing.
library(ggthemes)
library(ggplot2)
theme_set(theme_tufte())  # from ggthemes

# plot
g <- ggplot(mpg, aes(manufacturer, cty))
g + geom_tufteboxplot() +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Tufte Styled Boxplot",
       subtitle="City Mileage grouped by Manufacturer",
       caption="Source: mpg",
       x="Manufacturer",
       y="City Mileage")
Violin Plot
In bioinformatics, a violin plot is a type of chart used to visualize the distribution of a continuous variable or the comparison of multiple continuous variables. It is similar to a box plot but provides a more detailed representation of the data distribution, including the probability density of the data at different values.
A violin plot consists of a mirrored density plot on each side of a central box plot. The central box plot represents the median and quartiles of the data, while the mirrored density plots show the kernel density estimation of the data distribution. The width of the violin at each point represents the density of data points at that value.
Violin plots are particularly useful for comparing the distribution of a variable across different groups or conditions in bioinformatics. They provide a more informative and visually appealing alternative to traditional box plots, especially when dealing with complex datasets or small sample sizes.
A violin plot is similar to a box plot but shows the density within groups. On its own, geom_violin() does not display the summary statistics (median, quartiles, extreme points) that a box plot does. It can be drawn using geom_violin().
library(ggplot2)
theme_set(theme_bw())

# plot
g <- ggplot(mpg, aes(class, cty))
g + geom_violin() +
  labs(title="Violin plot",
       subtitle="City Mileage vs Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")
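If the median and quartiles are wanted alongside the density shape, one common approach is to layer a narrow box plot inside the violin. The sketch below is one possible way to do it; the width, fill colour and the choice to hide the box plot's outlier dots are illustrative assumptions rather than part of the original example.

```r
library(ggplot2)

# Violin for the density shape, narrow box plot inside for median and quartiles
p <- ggplot(mpg, aes(class, cty)) +
  geom_violin(fill = "grey90") +
  geom_boxplot(width = 0.1, outlier.shape = NA) +  # hide the box plot's outlier dots
  labs(title = "Violin plot with embedded box plot",
       x = "Class of Vehicle",
       y = "City Mileage")

print(p)
```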
Population Pyramid
In bioinformatics, a population pyramid is a type of chart used to visualize the distribution of a population across different age groups or categories. While population pyramids are more commonly used in demographics and social sciences, they can also be applied in bioinformatics to represent population structures, such as the distribution of different cell types in a tissue sample or the abundance of different species in a microbial community.
A population pyramid typically consists of two bar charts, one on either side of a central axis. The bars on each side represent different categories or groups, such as age groups or species, and the length of each bar represents the proportion or abundance of that category. The central axis represents the total population or total number of individuals in the sample.
Population pyramids can provide valuable insights into the composition and structure of populations, helping researchers identify patterns, trends, or imbalances in the data. They can be especially useful for comparing populations across different conditions or time points in bioinformatics studies.
Population pyramids offer a unique way of visualizing how much of a population, or what percentage of it, falls under a certain category. The pyramid below shows how many users are retained at each stage of an email marketing campaign funnel.
library(ggplot2)
library(ggthemes)
options(scipen = 999)  # turns off scientific notation like 1e+40

# Read data
email_campaign_funnel <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls <- paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")

# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) +  # Fill column
  geom_bar(stat = "identity", width = .6) +  # draw the bars
  scale_y_continuous(breaks = brks,  # Breaks
                     labels = lbls) +  # Labels
  coord_flip() +  # Flip axes
  labs(title="Email Campaign Funnel") +
  theme_tufte() +  # Tufte theme from ggthemes
  theme(plot.title = element_text(hjust = .5),  # Centre plot title
        axis.ticks = element_blank()) +
  scale_fill_brewer(palette = "Dark2")  # Color palette
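The two-sided layout relies on one group having negative values, so that its bars extend to the left of zero, while the axis labels show absolute values. The symmetric breaks from -15m to 15m suggest the CSV above is already stored that way, but that is an assumption. A minimal sketch of the data preparation, using a hypothetical data frame with made-up counts:

```r
# Hypothetical counts per funnel stage (illustrative numbers only)
funnel <- data.frame(
  Stage  = rep(c("Sent", "Opened", "Clicked"), times = 2),
  Gender = rep(c("Male", "Female"), each = 3),
  Users  = c(1200, 800, 300, 1100, 900, 350)
)

# Flip the sign for one group so its bars extend to the left of zero
funnel$Users_signed <- ifelse(funnel$Gender == "Male", -funnel$Users, funnel$Users)

# Axis breaks are symmetric around zero; labels show absolute values
brks <- seq(-1500, 1500, by = 500)
lbls <- as.character(abs(brks))

funnel
lbls
```

Users_signed would then be mapped to y, with brks and lbls passed to scale_y_continuous() as in the plot above.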
Composition
Waffle Chart
In bioinformatics, a waffle chart is a type of visualization that is used to represent data as a grid of squares, where each square represents a unit of the data. Waffle charts are particularly useful for showing the relative proportions of different categories within a dataset.
In a waffle chart, the total number of squares in the grid represents the total number of units in the dataset. Each category is represented by a portion of the grid that corresponds to its proportion of the total. For example, if one category represents 25% of the total, it would be represented by a quarter of the grid.
Waffle charts can be effective for visualizing proportions and making comparisons between categories, especially when the number of categories is small. However, they can be challenging to interpret accurately when the number of categories is large or when the proportions are very different from each other.
Waffle charts are a nice way of showing the categorical composition of the total population. Though ggplot2 has no dedicated function for them, one can be built with geom_tile(). The template below should help you create your own waffle.
library(ggplot2)
var <- mpg$class  # the categorical data

## Prep data (nothing to change here)
nrows <- 10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(var) * ((nrows*nrows)/(length(var))))
categ_table
#>    2seater    compact    midsize    minivan     pickup subcompact        suv
#>          2         20         18          5         14         15         26
df$category <- factor(rep(names(categ_table), categ_table))
# NOTE: if sum(categ_table) is not 100 (i.e. nrows^2), it will need adjustment to make the sum 100.

## Plot
ggplot(df, aes(x = x, y = y, fill = category)) +
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
  scale_fill_brewer(palette = "Set3") +
  labs(title="Waffle Chart", subtitle="'Class' of vehicles", caption="Source: mpg") +
  theme(panel.border = element_rect(size = 2, fill = NA),  # fill = NA keeps the border from covering the tiles
        plot.title = element_text(size = rel(1.2)),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_blank(),
        legend.position = "right")
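When plain rounding does not add up to the number of tiles, one option is the largest-remainder method: round everything down, then hand the leftover tiles to the categories with the largest fractional parts. The sketch below shows this for hypothetical counts; it is one possible adjustment, not something the template above prescribes.

```r
# Hypothetical category counts (illustrative only)
counts  <- c(a = 1, b = 1, c = 1)
n_tiles <- 100

exact <- counts / sum(counts) * n_tiles  # ideal, possibly fractional, tile counts
sum(round(exact))                        # plain rounding gives 99, one tile short

tiles    <- floor(exact)                 # start by rounding everything down
leftover <- n_tiles - sum(tiles)         # tiles still to hand out

# Give the remaining tiles to the categories with the largest fractional parts
if (leftover > 0) {
  bump <- order(exact - tiles, decreasing = TRUE)[seq_len(leftover)]
  tiles[bump] <- tiles[bump] + 1
}

tiles
sum(tiles)
```

The adjusted vector can be used in place of categ_table when building df$category.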
Pie Chart
In bioinformatics, a pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions. The size of each slice is proportional to the quantity it represents. Pie charts are often used to show the composition of something, such as the different categories of a dataset.
In bioinformatics, pie charts can be used to visualize the distribution of data, such as the relative abundance of different species in a microbial community, the proportion of different functional categories in a gene set, or the distribution of mutations in a cancer sample. Pie charts can be a useful tool for quickly conveying the overall composition of a dataset, but they can be less effective than other types of charts, such as bar charts or box plots, for comparing the sizes of individual categories or showing trends in the data.
The pie chart, a classic way of showing composition, conveys the same information as the waffle chart. It is slightly tricky to implement in ggplot2: a stacked bar is drawn first and then wrapped into a circle with coord_polar().
library(ggplot2)
theme_set(theme_classic())

# Source: Frequency table
df <- as.data.frame(table(mpg$class))
colnames(df) <- c("class", "freq")
pie <- ggplot(df, aes(x = "", y=freq, fill = factor(class))) +
  geom_bar(width = 1, stat = "identity") +
  theme(axis.line = element_blank(),
        plot.title = element_text(hjust=0.5)) +
  labs(fill="class", x=NULL, y=NULL,
       title="Pie Chart of class",
       caption="Source: mpg")
pie + coord_polar(theta = "y", start=0)

# Source: Categorical variable.
# mpg$class
pie <- ggplot(mpg, aes(x = "", fill = factor(class))) +
  geom_bar(width = 1) +
  theme(axis.line = element_blank(),
        plot.title = element_text(hjust=0.5)) +
  labs(fill="class", x=NULL, y=NULL,
       title="Pie Chart of class",
       caption="Source: mpg")
pie + coord_polar(theta = "y", start=0)
# http://www.r-graph-gallery.com/128-ring-or-donut-plot/
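With theta = "y", each category's share of the stacked bar becomes its share of the circle. The base R sketch below computes those shares, along with percentage labels and slice angles, for hypothetical counts; the category names and numbers are illustrative only.

```r
# Hypothetical class counts (illustrative only)
freq <- c(compact = 47, midsize = 41, suv = 62, pickup = 33, other = 51)

share <- prop.table(freq)       # proportions that sum to 1
pct   <- round(100 * share, 1)  # percentages, e.g. for slice labels
angle <- 360 * share            # angle of each slice in degrees

data.frame(class = names(freq),
           freq  = as.vector(freq),
           pct   = as.vector(pct),
           angle = round(as.vector(angle), 1))
```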
Treemap
In bioinformatics, a treemap is a visualization technique used to represent hierarchical data in a nested, rectangular form. The hierarchy is represented by nested rectangles, with each rectangle’s size proportional to a specific value. Treemaps are useful for visualizing the relative sizes of different categories or subcategories within a dataset.
In bioinformatics, treemaps can be used to visualize various types of data, such as the distribution of functional categories in a gene set, the composition of microbial communities at different taxonomic levels, or the allocation of resources in a biological system. Treemaps can provide a clear and intuitive way to explore hierarchical data structures and identify patterns or trends within the data.
Treemap is a nice way of displaying hierarchical data by using nested rectangles. The treemapify package provides the functions needed to compute the layout and draw it. Older releases of the package did this in two steps, treemapify() to compute the tile coordinates followed by ggplotify() to draw them; ggplotify() is not available in recent releases, which provide geom_treemap() and related geoms instead. The code below assumes a recent release.
In order to create a treemap, your data must have one variable each that describes the area of the tiles, the fill color, the tile's label and finally the parent group. These are mapped inside aes() as area, fill, label and subgroup.
Once the mapping is in place, add geom_treemap() to draw the tiles and geom_treemap_text() to label them.
library(ggplot2)
library(treemapify)
proglangs <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/proglanguages.csv")

# plot
treeMapPlot <- ggplot(proglangs, aes(area = value, fill = parent, label = id, subgroup = parent)) +
  geom_treemap() +                                         # tiles sized by 'value'
  geom_treemap_subgroup_border() +                         # outline each parent group
  geom_treemap_text(colour = "white", place = "centre") +  # tile labels
  scale_fill_brewer(palette = "Dark2")
print(treeMapPlot)
Bar Chart
In bioinformatics, a bar chart is a common visualization tool used to represent categorical data. It consists of rectangular bars with lengths proportional to the values they represent. Bar charts are useful for comparing the sizes of different categories or groups within a dataset.
In bioinformatics, bar charts can be used to visualize various types of data, such as the abundance of different species in a microbial community, the distribution of gene functions in a functional annotation dataset, or the frequency of different mutations in a cancer sample. Bar charts can help researchers quickly identify patterns, trends, or outliers in the data and make comparisons between different categories or groups.
By default, geom_bar() has the stat set to count. That means, when you provide just a continuous X variable (and no Y variable), it tries to make a histogram out of the data.
To make geom_bar() draw bars from values you supply, rather than from counts it computes itself, you need to do two things.
- Set stat=identity
- Provide both x and y inside aes() where, x is either character or factor and y is numeric.
A bar chart can be drawn from a categorical column variable or from a separate frequency table. By adjusting width, you can adjust the thickness of the bars. If your data source is a frequency table, that is, if you don’t want ggplot to compute the counts, you need to set the stat=identity inside the geom_bar().
library(ggplot2)  # loaded first: the mpg dataset ships with ggplot2
theme_set(theme_classic())

# prep frequency table
freqtable <- table(mpg$manufacturer)
df <- as.data.frame.table(freqtable)
head(df)
#>        Var1 Freq
#> 1      audi   18
#> 2 chevrolet   19
#> 3     dodge   37
#> 4      ford   25
#> 5     honda    9
#> 6   hyundai   14

# Plot
g <- ggplot(df, aes(Var1, Freq))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") +
  labs(title="Bar Chart",
       subtitle="Manufacturer of vehicles",
       caption="Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
It can be computed directly from a column variable as well. In this case, only X is provided and stat=identity is not set.
# From a categorical column variable
g <- ggplot(mpg, aes(manufacturer))
g + geom_bar(aes(fill=class), width = 0.5) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Categorywise Bar Chart",
       subtitle="Manufacturer of vehicles",
       caption="Source: Manufacturers from 'mpg' dataset")
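The two routes, letting ggplot2 count the rows versus supplying precomputed counts, should give identical bar heights. The sketch below checks this on the mpg data. It uses geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = "identity"), and layer_data() to read back the computed heights; the object names are illustrative.

```r
library(ggplot2)

# Route 1: raw rows, let ggplot2 count them (stat = "count" is the default)
p_count <- ggplot(mpg, aes(manufacturer)) +
  geom_bar()

# Route 2: count first, then plot the precomputed heights
freq <- as.data.frame(table(manufacturer = mpg$manufacturer))
p_identity <- ggplot(freq, aes(manufacturer, Freq)) +
  geom_col()  # shorthand for geom_bar(stat = "identity")

# Both routes produce the same bar heights
heights_count    <- layer_data(p_count)$count
heights_identity <- freq$Freq
all(heights_count == heights_identity)
```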
Change
Time Series Plot From a Time Series Object (ts)
In bioinformatics, time series analysis is often used to study data collected over time, such as gene expression levels, protein abundances, or environmental variables. R provides a convenient way to work with time series data through the ts (time series) object and various plotting functions.
The ggfortify package allows autoplot to automatically plot directly from a time series object (ts).
## From Timeseries object (ts)
library(ggplot2)
library(ggfortify)
theme_set(theme_classic())

# Plot
autoplot(AirPassengers) +
  labs(title="AirPassengers") +
  theme(plot.title = element_text(hjust=0.5))
Time Series Plot From a Data Frame
Using geom_line(), a time series (or line chart) can be drawn from a data.frame as well. The X axis breaks are generated by default. In below example, the breaks are formed once every 10 years.
Default X Axis Labels
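library(ggplot2)
theme_set(theme_classic())

# Assumption: returns_perc is not part of ggplot2::economics; if it does not exist
# yet, derive it here as the month-over-month relative change in psavert.
if (!"returns_perc" %in% names(economics)) {
  economics$returns_perc <- c(0, diff(economics$psavert) / head(economics$psavert, -1))
}

# Allow Default X Axis Labels
ggplot(economics, aes(x=date)) +
  geom_line(aes(y=returns_perc)) +
  labs(title="Time Series Chart",
       subtitle="Returns Percentage from 'Economics' Dataset",
       caption="Source: Economics",
       y="Returns %")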
Time Series Plot For a Monthly Time Series
If you want to set your own time intervals (breaks) in X axis, you need to set the breaks and labels using scale_x_date().
library(ggplot2)
library(lubridate)
theme_set(theme_bw())

# Assumption: returns_perc is not part of ggplot2::economics; if it does not exist
# yet, derive it here as the month-over-month relative change in psavert.
if (!"returns_perc" %in% names(economics)) {
  economics$returns_perc <- c(0, diff(economics$psavert) / head(economics$psavert, -1))
}
economics_m <- economics[1:24, ]

# labels and breaks for X axis text
lbls <- paste0(month.abb[month(economics_m$date)], " ", lubridate::year(economics_m$date))
brks <- economics_m$date

# plot
ggplot(economics_m, aes(x=date)) +
  geom_line(aes(y=returns_perc)) +
  labs(title="Monthly Time Series",
       subtitle="Returns Percentage from Economics Dataset",
       caption="Source: Economics",
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
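The breaks and labels passed to scale_x_date() are just a vector of dates and a matching character vector, so any interval can be produced in the same way. The base R sketch below builds breaks every six months over a hypothetical two-year range; the dates and the interval are illustrative assumptions.

```r
# Hypothetical monthly dates covering two years
dates <- seq(as.Date("1967-07-01"), by = "month", length.out = 24)

# One break every 6 months, starting at the first date
brks <- seq(min(dates), max(dates), by = "6 months")

# Labels built from month.abb so they do not depend on the system locale
lbls <- paste(month.abb[as.integer(format(brks, "%m"))], format(brks, "%Y"))

brks
lbls
```

These vectors can be passed as scale_x_date(breaks = brks, labels = lbls). ggplot2 also accepts scale_x_date(date_breaks = "6 months", date_labels = "%b %Y"), which generates both from the data range.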
Time Series Plot For a Yearly Time Series
library(ggplot2)
library(lubridate)
theme_set(theme_bw())

# Assumption: returns_perc is not part of ggplot2::economics; if it does not exist
# yet, derive it here as the month-over-month relative change in psavert.
if (!"returns_perc" %in% names(economics)) {
  economics$returns_perc <- c(0, diff(economics$psavert) / head(economics$psavert, -1))
}
economics_y <- economics[1:90, ]

# labels and breaks for X axis text
brks <- economics_y$date[seq(1, length(economics_y$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(economics_y, aes(x=date)) +
  geom_line(aes(y=returns_perc)) +
  labs(title="Yearly Time Series",
       subtitle="Returns Percentage from Economics Dataset",
       caption="Source: Economics",
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls,
               breaks = brks) +  # change to yearly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Time Series Plot From Long Data Format: Multiple Time Series in Same Dataframe Column
In this example, I construct the ggplot from a long data format. That means, the column names and respective values of all the columns are stacked in just 2 variables (variable and value respectively). If you were to convert this data to wide format, it would look like the economics dataset.
In below example, the geom_line is drawn for value column and the aes(col) is set to variable. This way, with just one call to geom_line, multiple colored lines are drawn, one each for each unique value in variable column. The scale_x_date() changes the X axis breaks and labels, and scale_color_manual changes the color of the lines.
data(economics_long, package = “ggplot2”)
head(economics_long)
#>         date variable value      value01
#>       <date>   <fctr> <dbl>        <dbl>
#> 1 1967-07-01      pce 507.4 0.0000000000
#> 2 1967-08-01      pce 510.5 0.0002660008
#> 3 1967-09-01      pce 516.3 0.0007636797
#> 4 1967-10-01      pce 512.9 0.0004719369
#> 5 1967-11-01      pce 518.1 0.0009181318
#> 6 1967-12-01      pce 525.8 0.0015788435
library(ggplot2)
library(lubridate)
theme_set(theme_bw())

df <- economics_long[economics_long$variable %in% c("psavert", "uempmed"), ]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_line(aes(y=value, col=variable)) +
  labs(title="Time Series of Returns Percentage",
       subtitle="Drawn from Long Data format",
       caption="Source: Economics",
       y="Returns %",
       color=NULL) +  # title and caption
  scale_x_date(labels = lbls, breaks = brks) +  # change to yearly ticks and labels
  scale_color_manual(labels = c("psavert", "uempmed"),
                     values = c("psavert"="#00ba38", "uempmed"="#f8766d")) +  # line color
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid
Time Series Plot From Wide Data Format: Data in Multiple Columns of Dataframe
As noted in part 2 (Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html#2.%20Modifying%20Legend) of this tutorial, whenever your plot's geom (like points, lines, bars, etc.) changes the fill, size, col, shape or stroke based on another column, a legend is automatically drawn.
But if you are creating a time series (or even other types of plots) from a wide data format, you have to draw each line manually by calling geom_line() once for every line. So, a legend will not be drawn by default.
However, having a legend would still be nice. This can be done using the scale_aesthetic_manual() format of functions (like, scale_color_manual() if only the color of your lines change). Using this function, you can give a legend title with the name argument, tell what color the legend should take with the values argument and also set the legend labels.
Even though the below plot looks exactly like the previous one, the approach to construct this is different.
You might wonder why I used this function in previous example for long data format as well. Note that, in previous example, it was used to change the color of the line only. Without scale_color_manual(), you would still have got a legend, but the lines would be of a different (default) color. But in current example, without scale_color_manual(), you wouldn’t even have a legend. Try it out!
library(ggplot2)
library(lubridate)
theme_set(theme_bw())

df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_line(aes(y=psavert, col="psavert")) +
  geom_line(aes(y=uempmed, col="uempmed")) +
  labs(title="Time Series of Returns Percentage",
       subtitle="Drawn From Wide Data format",
       caption="Source: Economics",
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls, breaks = brks) +  # change to yearly ticks and labels
  scale_color_manual(name="",
                     values = c("psavert"="#00ba38", "uempmed"="#f8766d")) +  # line color
  theme(panel.grid.minor = element_blank())  # turn off minor grid
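An alternative to calling geom_line() once per column is to reshape the wide data to long format first, after which a single geom_line(aes(y=value, col=variable)) draws every series and the legend appears automatically. A minimal base R sketch of that reshape, using a hypothetical three-row data frame with illustrative values:

```r
# Hypothetical wide data: one column per series (values are illustrative)
wide <- data.frame(
  date    = as.Date(c("1967-07-01", "1967-08-01", "1967-09-01")),
  psavert = c(12.6, 12.6, 11.9),
  uempmed = c(4.5, 4.7, 4.6)
)

series <- c("psavert", "uempmed")

# Stack the series columns into variable/value pairs, repeating the dates
long <- data.frame(
  date     = rep(wide$date, times = length(series)),
  variable = rep(series, each = nrow(wide)),
  value    = unlist(wide[series], use.names = FALSE)
)

long
```

Packages such as tidyr offer the same operation through pivot_longer(); the base R version is shown here only to keep the sketch free of extra dependencies.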
Stacked Area Chart
In bioinformatics, a stacked area chart is a type of graph that is used to visualize the change in composition of a data series over time or across different categories. It is similar to a regular area chart, but instead of overlapping areas, the areas are stacked on top of each other, with each segment representing a different category or subgroup.
Stacked area charts are useful for showing the total size of a dataset while also illustrating the relative contribution of each category or subgroup to the total. They can be used to visualize various types of data in bioinformatics, such as the abundance of different species in a microbial community over time, the distribution of gene expression levels across different experimental conditions, or the composition of different functional categories in a gene set.
Stacked area chart is just like a line chart, except that the region below the plot is all colored. This is typically used when:
- You want to describe how a quantity or volume (rather than something like price) changed over time
- You have many data points. For very few data points, consider plotting a bar chart.
- You want to show the contribution from individual components.
This can be plotted using geom_area which works very much like geom_line. But there is an important point to note. By default, each geom_area() starts from the bottom of Y axis (which is typically 0), but, if you want to show the contribution from individual components, you want the geom_area to be stacked over the top of previous component, rather than the floor of the plot itself. So, you have to add all the bottom layers while setting the y of geom_area.
In below example, I have set it as y=psavert+uempmed for the topmost geom_area().
However nice the plot looks, the caveat is that it can easily become complicated and uninterpretable if there are too many components.
library(ggplot2)
library(lubridate)
theme_set(theme_bw())

df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) +
  geom_area(aes(y=psavert+uempmed, fill="psavert")) +
  geom_area(aes(y=uempmed, fill="uempmed")) +
  labs(title="Area Chart of Returns Percentage",
       subtitle="From Wide Data format",
       caption="Source: Economics",
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls, breaks = brks) +  # change to yearly ticks and labels
  scale_fill_manual(name="",
                    values = c("psavert"="#00ba38", "uempmed"="#f8766d")) +  # fill color
  theme(panel.grid.minor = element_blank())  # turn off minor grid
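With more than two components, adding up the lower layers by hand gets tedious. The running totals can be computed in one step with cumsum(), as in the base R sketch below; the component names and values are hypothetical.

```r
# Hypothetical components measured at four time points (illustrative values)
comp <- data.frame(
  a = c(2, 3, 4, 3),
  b = c(5, 5, 6, 7),
  c = c(1, 2, 2, 1)
)

# Running total across columns: each column becomes the top edge of its band
tops <- as.data.frame(t(apply(comp, 1, cumsum)))

# Bottom edge of each band is its top edge minus its own height
bottoms <- tops - comp

tops
bottoms
```

Each column of tops is the y value to give the corresponding geom_area(), drawn from the largest total down to the smallest so that lower bands stay visible. When the data is in long format, geom_area() stacks groups by default, so this manual step is only needed for the wide-format approach.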
Calendar Heatmap
In bioinformatics, a calendar heatmap is a visualization technique used to represent time-based data, such as gene expression levels, protein abundances, or environmental variables, in a calendar-like format. Each cell in the calendar represents a day, week, or month, depending on the granularity of the data, and the color of the cell represents the value of the data for that time period.
Calendar heatmaps are useful for identifying patterns and trends in time-based data, such as seasonal variations or weekly cycles. They can also be used to compare multiple time series data sets or to visualize the distribution of events over time.
When you want to see the variation, especially the highs and lows, of a metric like stock price, on an actual calendar itself, the calendar heat map is a great tool. It emphasizes the variation visually over time rather than the actual value itself.
This can be implemented using geom_tile(). But getting the data into the right format has more to do with data preparation than with the plotting itself.
# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)
library(scales)
library(zoo)

df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date)  # format date
df <- df[df$year >= 2012, ]  # filter reqd years

# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df, .(yearmonthf), transform, monthweek=1+week-min(week))  # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek", "weekdayf", "VIX.Close")]
head(df)
#>   year yearmonthf monthf week monthweek weekdayf VIX.Close
#> 1 2012   Jan 2012    Jan    1         1      Tue     22.97
#> 2 2012   Jan 2012    Jan    1         1      Wed     22.22
#> 3 2012   Jan 2012    Jan    1         1      Thu     21.48
#> 4 2012   Jan 2012    Jan    1         1      Fri     20.63
#> 5 2012   Jan 2012    Jan    2         2      Mon     21.07
#> 6 2012   Jan 2012    Jan    2         2      Tue     20.69
# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) +
  geom_tile(colour = "white") +
  facet_grid(year~monthf) +
  scale_fill_gradient(low="red", high="green") +
  labs(x="Week of Month",
       y="",
       title = "Time-Series Calendar Heatmap",
       subtitle="Yahoo Closing Price",
       fill="Close")
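The example above relies on a CSV that already contains the calendar columns. When only a date and a value are available, those columns can be derived along the following lines. This is a rough sketch: the daily series is made up, and the column names simply mirror the ones used above (year, monthf, week, weekdayf, monthweek).

```r
library(ggplot2)

# Hypothetical daily series: dates and values are made up for illustration.
set.seed(42)
cal <- data.frame(date = seq(as.Date("2023-01-01"), as.Date("2023-03-31"), by = "day"))
cal$value <- rnorm(nrow(cal))

# Calendar columns, named after the columns used in the example above.
cal$year   <- as.integer(format(cal$date, "%Y"))
cal$monthf <- factor(month.abb[as.integer(format(cal$date, "%m"))], levels = month.abb)
cal$week   <- as.integer(format(cal$date, "%W"))  # week of year, Monday as first day

# Weekday labels built from POSIXlt (0 = Sunday), so they do not depend on locale.
day_names    <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
cal$weekdayf <- factor(day_names[as.POSIXlt(cal$date)$wday + 1],
                       levels = rev(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Week of month: week number relative to the first week seen in that month.
ym <- format(cal$date, "%Y-%m")
cal$monthweek <- 1L + cal$week - ave(cal$week, ym, FUN = min)

p_cal <- ggplot(cal, aes(monthweek, weekdayf, fill = value)) +
  geom_tile(colour = "white") +
  facet_grid(year ~ monthf) +
  labs(x = "Week of Month", y = "", fill = "Value")
p_cal
```

The weekday levels are reversed so that Monday ends up at the top of each panel, as on a printed calendar.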
Slope Chart
In bioinformatics, a slope chart can be used to visualize changes in data values between two time points, conditions, or groups. It consists of a series of lines that connect data points representing the same entity (e.g., gene, protein) across different conditions or time points. Each line represents a different entity, and the slope of the line indicates the rate or direction of change for that entity.
Slope charts are useful for comparing trends or changes in multiple entities simultaneously, such as changes in gene expression levels between two experimental conditions or changes in protein abundances over time. They can help identify genes, proteins, or other biological features that show similar or opposite trends and provide insights into the overall patterns in the data.
A slope chart is a great tool if you want to visualize changes in value and ranking between categories. It is more suitable than a time series plot when there are very few time points.
library(ggplot2)
library(dplyr)
theme_set(theme_classic())
source_df <- read.csv("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv")
# Define functions. Source: https://github.com/jkeirstead/r-slopegraph
tufte_sort <- function(df, x="year", y="value", group="group", method="tufte", min.space=0.05) {
  ## First rename the columns for consistency
  ids <- match(c(x, y, group), names(df))
  df <- df[,ids]
  names(df) <- c("x", "y", "group")

  ## Expand grid to ensure every combination has a defined value
  tmp <- expand.grid(x=unique(df$x), group=unique(df$group))
  tmp <- merge(df, tmp, all.y=TRUE)
  df <- mutate(tmp, y=ifelse(is.na(y), 0, y))

  ## Cast into a matrix shape and arrange by first column
  require(reshape2)
  tmp <- dcast(df, group ~ x, value.var="y")
  ord <- order(tmp[,2])
  tmp <- tmp[ord,]

  min.space <- min.space*diff(range(tmp[,-1]))
  yshift <- numeric(nrow(tmp))
  ## Start at "bottom" row
  ## Repeat for rest of the rows until you hit the top
  for (i in 2:nrow(tmp)) {
    ## Shift subsequent row up by equal space so gap between
    ## two entries is >= minimum
    mat <- as.matrix(tmp[(i-1):i, -1])
    d.min <- min(diff(mat))
    yshift[i] <- ifelse(d.min < min.space, min.space - d.min, 0)
  }

  tmp <- cbind(tmp, yshift=cumsum(yshift))

  scale <- 1
  tmp <- melt(tmp, id=c("group", "yshift"), variable.name="x", value.name="y")
  ## Store these gaps in a separate variable so that they can be scaled: ypos = a*yshift + y
  tmp <- transform(tmp, ypos=y + scale*yshift)
  return(tmp)
}
plot_slopegraph <- function(df) {
  ylabs <- subset(df, x==head(x,1))$group
  yvals <- subset(df, x==head(x,1))$ypos
  fontSize <- 3
  gg <- ggplot(df, aes(x=x, y=ypos)) +
    geom_line(aes(group=group), colour="grey80") +
    geom_point(colour="white", size=8) +
    geom_text(aes(label=y), size=fontSize, family="American Typewriter") +
    scale_y_continuous(name="", breaks=yvals, labels=ylabs)
  return(gg)
}
## Prepare data
df <- tufte_sort(source_df,
                 x="year",
                 y="value",
                 group="group",
                 method="tufte",
                 min.space=0.05)

df <- transform(df,
                x=factor(x, levels=c(5,10,15,20),
                         labels=c("5 years","10 years","15 years","20 years")),
                y=round(y))

## Plot
plot_slopegraph(df) + labs(title="Estimates of % survival rates") +
  theme(axis.title=element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust=0.5,
                                  family = "American Typewriter",
                                  face="bold"),
        axis.text = element_text(family = "American Typewriter",
                                 face="bold"))
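For the simpler two-condition case described at the start of this section, a slope chart does not need the sorting helper at all. The sketch below uses plain ggplot2 on made-up data; the gene names, expression values and the "up"/"down" colouring are assumptions chosen purely for illustration.

```r
library(ggplot2)

# Hypothetical expression values for four genes under two conditions
# (gene names and numbers are made up for illustration).
expr <- data.frame(
  gene      = rep(c("geneA", "geneB", "geneC", "geneD"), times = 2),
  condition = factor(rep(c("control", "treated"), each = 4),
                     levels = c("control", "treated")),
  value     = c(5.1, 7.3, 2.0, 9.4,
                6.8, 4.9, 2.2, 9.9)
)

# Direction of change per gene, used to colour the lines.
ctrl      <- expr$value[expr$condition == "control"]
trt       <- expr$value[expr$condition == "treated"]
direction <- ifelse(trt > ctrl, "up", "down")
expr$direction <- rep(direction, times = 2)

# One line per gene connecting its two condition values.
p_slope <- ggplot(expr, aes(x = condition, y = value, group = gene)) +
  geom_line(aes(colour = direction)) +
  geom_point() +
  geom_text(data = expr[expr$condition == "treated", ],
            aes(label = gene), hjust = -0.3, size = 3) +
  scale_colour_manual(values = c("up" = "#00ba38", "down" = "#f8766d")) +
  labs(title = "Minimal slope chart", x = "", y = "Expression")
p_slope
```

Colouring by direction makes entities with opposite trends stand out; with many entities, labels will start to overlap, which is the problem the tufte_sort() helper above addresses.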
Seasonal Plot
In bioinformatics, a seasonal plot is a type of visualization used to explore and analyze data that exhibit seasonal patterns or trends. It is particularly useful for time series data, where the data points are collected at regular intervals over time.
A seasonal plot typically consists of a line or a set of points representing the data values plotted against time, with the x-axis representing the time intervals (e.g., months, quarters) and the y-axis representing the data values. The plot may also include additional elements such as trend lines, seasonal averages, or confidence intervals to help identify and interpret seasonal patterns in the data.
Seasonal plots are commonly used in bioinformatics to analyze various types of data, such as gene expression levels, protein abundances, or environmental variables, to identify recurring patterns or cycles that may be related to biological processes or external factors.
If you are working with a time series object of class ts or xts, you can view the seasonal fluctuations through a seasonal plot drawn using forecast::ggseasonplot. Below is an example using the native AirPassengers and nottem time series.
You can see air passenger traffic increase over the years along with the repetitive seasonal patterns in traffic. Nottingham, on the other hand, does not show an increase in overall temperatures over the years, but the temperatures definitely follow a seasonal pattern.
library(ggplot2)
library(forecast)
theme_set(theme_classic())

# Subset data
nottem_small <- window(nottem, start=c(1920, 1), end=c(1925, 12))  # subset a smaller time window

# Plot
ggseasonplot(AirPassengers) + labs(title="Seasonal plot: International Airline Passengers")
ggseasonplot(nottem_small) + labs(title="Seasonal plot: Air temperatures at Nottingham Castle")
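If the forecast package is not available, a similar view can be approximated with plain ggplot2 by unpacking the ts object into a data frame with time() and cycle(). This is a sketch of that idea, not a reimplementation of ggseasonplot(); the data frame ap and its column names are assumptions made for illustration.

```r
library(ggplot2)

# Unpack the monthly ts object into a data frame: one row per month.
# The small offset guards against floating-point error before floor().
ap <- data.frame(
  year       = as.integer(floor(time(AirPassengers) + 1e-6)),
  month      = factor(month.abb[as.integer(cycle(AirPassengers))], levels = month.abb),
  passengers = as.numeric(AirPassengers)
)

# One line per year, with the months along the x axis.
p_season <- ggplot(ap, aes(x = month, y = passengers,
                           group = year, colour = factor(year))) +
  geom_line() +
  labs(title = "Seasonal plot: International Airline Passengers",
       x = "Month", y = "Passengers", colour = "Year")
p_season
```

Each line is one year, so the upward shift from line to line shows the trend, while the shared shape of the lines shows the seasonal pattern.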
Groups
Hierarchical Dendrogram
In bioinformatics, a hierarchical dendrogram is a tree-like diagram used to visualize the results of hierarchical clustering, a method used to group similar objects into clusters based on their similarity.
In a hierarchical dendrogram, each leaf node represents an individual object (e.g., gene, protein, sample), and the branches of the tree represent the clusters formed by grouping similar objects together. The length of the branches indicates the degree of similarity between clusters or individual objects, with shorter branches indicating greater similarity.
Hierarchical dendrograms are useful for visualizing relationships between objects in a dataset and identifying clusters or groups of objects that share similar characteristics. They are commonly used in bioinformatics for clustering gene expression data, protein sequences, or other types of biological data to identify patterns or relationships between biological entities.
# install.packages("ggdendro")
library(ggplot2)
library(ggdendro)
theme_set(theme_bw())

hc <- hclust(dist(USArrests), "ave")  # hierarchical clustering

# plot
ggdendrogram(hc, rotate = TRUE, size = 2)
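To make the groups in the tree explicit, the clustering can be cut into a fixed number of clusters with cutree() and the leaf labels coloured accordingly. The sketch below assumes k = 4 clusters, an arbitrary choice for illustration, and builds the plot from the pieces returned by ggdendro's dendro_data() instead of calling ggdendrogram().

```r
library(ggplot2)
library(ggdendro)

hc <- hclust(dist(USArrests), "ave")   # same clustering as above
k  <- 4                                # number of clusters: an assumption
clusters <- cutree(hc, k = k)          # named vector: state -> cluster id

# Extract segment and label coordinates from the clustering.
dd      <- dendro_data(hc, type = "rectangle")
labs_df <- ggdendro::label(dd)
labs_df$cluster <- factor(clusters[as.character(labs_df$label)])

# Draw the tree from its segments and colour each leaf label by cluster.
p_dend <- ggplot(ggdendro::segment(dd)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = labs_df,
            aes(x = x, y = y, label = label, colour = cluster),
            hjust = 0, size = 3) +
  coord_flip() +
  scale_y_reverse(expand = c(0.3, 0)) +
  labs(x = "", y = "Height", colour = "Cluster")
p_dend
```

Choosing k is a judgment call; in practice it is worth trying a few values and checking whether the resulting groups make sense for the data.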
Clusters
In bioinformatics, clusters refer to groups of objects that are similar to each other based on certain criteria. Clustering is a common technique used to group biological entities such as genes, proteins, or samples into clusters based on their similarities in expression patterns, sequence similarities, or other biological characteristics.
It is possible to show the distinct clusters or groups using geom_encircle(). If the dataset has multiple weak features, you can compute the principal components and draw a scatterplot using PC1 and PC2 as X and Y axis.
geom_encircle() can be used to encircle the desired groups. The only thing to note is the data argument to geom_encircle(). You need to provide a subsetted dataframe that contains only the observations (rows) that belong to the group as the data argument.
# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
library(ggfortify)
theme_set(theme_classic())

# Compute data with principal components ------------------
df <- iris[c(1, 2, 3, 4)]
pca_mod <- prcomp(df)  # compute principal components

# Data frame of principal components ----------------------
df_pc <- data.frame(pca_mod$x, Species=iris$Species)  # dataframe of principal components
df_pc_vir <- df_pc[df_pc$Species == "virginica", ]  # df for 'virginica'
df_pc_set <- df_pc[df_pc$Species == "setosa", ]  # df for 'setosa'
df_pc_ver <- df_pc[df_pc$Species == "versicolor", ]  # df for 'versicolor'

# Plot
ggplot(df_pc, aes(PC1, PC2, col=Species)) +
  geom_point(aes(shape=Species), size=2) +  # draw points
  labs(title="Iris Clustering",
       subtitle="With principal components PC1 and PC2 as X and Y axis",
       caption="Source: Iris") +
  coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)),
                  ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2))) +  # change axis limits
  geom_encircle(data = df_pc_vir, aes(x=PC1, y=PC2)) +  # draw circles
  geom_encircle(data = df_pc_set, aes(x=PC1, y=PC2)) +
  geom_encircle(data = df_pc_ver, aes(x=PC1, y=PC2))
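If ggalt is not installed, a comparable effect can be sketched with base R's chull() and geom_polygon(). Note that this swaps in a different technique: it draws a straight-edged convex hull around each group rather than the smooth outline produced by geom_encircle(). The names hulls and p_hull are illustrative.

```r
library(ggplot2)

# Principal components, as in the example above.
pca_mod <- prcomp(iris[, 1:4])
df_pc   <- data.frame(pca_mod$x, Species = iris$Species)

# For each species, keep only the rows on the convex hull in the PC1/PC2 plane.
hulls <- do.call(rbind, lapply(split(df_pc, df_pc$Species), function(d) {
  d[chull(d$PC1, d$PC2), ]
}))

# Points plus one translucent hull polygon per species.
p_hull <- ggplot(df_pc, aes(PC1, PC2, colour = Species)) +
  geom_point(aes(shape = Species), size = 2) +
  geom_polygon(data = hulls, aes(fill = Species), alpha = 0.15) +
  labs(title = "Iris Clustering",
       subtitle = "Convex hulls around each species in PC1/PC2 space")
p_hull
```

Because the hull rows are computed per group in one step, this version does not need a separate subsetted data frame for each species.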
Spatial
In bioinformatics, a spatial chart is a type of visualization used to represent spatial data, such as the location of genes on a chromosome, the distribution of mutations in a cancer sample, or the spatial arrangement of cells in a tissue sample. Spatial charts can help researchers visualize and analyze complex spatial relationships in biological data.
The ggmap package provides facilities to interact with the Google Maps API and get the coordinates (latitude and longitude) of places you want to plot. The example below shows satellite, road and hybrid maps of the city of Chennai, encircling some of the places. I used the geocode() function to get the coordinates of these places and qmap() to get the maps. The type of map to fetch is determined by the value you set for maptype.
You can also zoom into the map by setting the zoom argument. The default is 10 (suitable for large cities). Reduce this number (down to 3) if you want to zoom out. It can be increased up to 21, which is suitable for buildings.
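# Better install the dev versions ----------
# devtools::install_github("dkahle/ggmap")
# devtools::install_github("hrbrmstr/ggalt")

# load packages
library(ggplot2)
library(ggmap)
library(ggalt)

# Note: current ggmap versions need a Google API key for the Google sources;
# register it first with register_google(key = "...").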
# Get Chennai's Coordinates --------------------------------
chennai <- geocode("Chennai")  # get longitude and latitude

# Get the Map ----------------------------------------------
# Google Satellite Map
chennai_ggl_sat_map <- qmap("chennai", zoom=12, source = "google", maptype="satellite")

# Google Road Map
chennai_ggl_road_map <- qmap("chennai", zoom=12, source = "google", maptype="roadmap")

# Google Hybrid Map
chennai_ggl_hybrid_map <- qmap("chennai", zoom=12, source = "google", maptype="hybrid")

# Open Street Map
chennai_osm_map <- qmap("chennai", zoom=12, source = "osm")

# Get Coordinates for Chennai's Places ---------------------
chennai_places <- c("Kolathur",
                    "Washermanpet",
                    "Royapettah",
                    "Adyar",
                    "Guindy")
places_loc <- geocode(chennai_places)  # get longitudes and latitudes
# Plot Open Street Map
chennai_osm_map +
  geom_point(aes(x=lon, y=lat),
             data = places_loc,
             alpha = 0.7,
             size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat),
                data = places_loc, size = 2, color = "blue")

# Plot Google Road Map
chennai_ggl_road_map +
  geom_point(aes(x=lon, y=lat),
             data = places_loc,
             alpha = 0.7,
             size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat),
                data = places_loc, size = 2, color = "blue")

# Plot Google Hybrid Map
chennai_ggl_hybrid_map +
  geom_point(aes(x=lon, y=lat),
             data = places_loc,
             alpha = 0.7,
             size = 7,
             color = "tomato") +
  geom_encircle(aes(x=lon, y=lat),
                data = places_loc, size = 2, color = "blue")
[Figures: Open Street Map, Google Road Map and Google Hybrid Map of Chennai]