microarray analysis

Mastering Microarray Data Analysis: A Step-by-Step R/Bioconductor Tutorial

March 18, 2024 Off By admin
Shares

Table of Contents

Installation

Install R (and RStudio)

R is available at the CRAN website. Upon choosing a CRAN mirror, you can download R.
R is available for Linux, Mac and Windows.

You can download RStudio from the RStudio website.

Install R packages

Check out this video tutorial on installing packages in R.

Part of the reason that R is so popular is the enormous diversity of packages that are available for R. Packages are collections of R programs that are able to perform a certain analysis, e.g. Matrix is a package that contains all the R-code you need for creating and working with matrices. R packages are available at the CRAN and Bioconductor websites.

Issues with installation of R/R packages

Mac: error message: Setting LC_CTYPE failed, using “C” during package installation

  • Close RStudio
  • Open Terminal
  • Type:
    defaults write org.R-project.R force.LANG en_US.UTF-8
  • Close Terminal
  • Start RStudio and retry installation

Creating a project in RStudio

An R project is a folder in RStudio where all your work on one project (e.g. a chapter in your PhD dissertation) is gathered. Projects help you to stay organized. When you create a project R will load data and scripts from and save results to this folder.

Creating and running scripts

A script is a text file that contains all the commands you will use. You cannot only write and run scripts but you can also save them so next time you need to do a similar analysis you can change and re-run the script with minimal effort. An R project can contain multiple scripts.

Commands

Scripts consist of a list of commands. Commands in R have a certain format:

output <- method(list of arguments)

Alternatively, you may use the following format:

output = method(list of arguments)

For example:

p = ggplot(mtcars,(aes(wt,mpg))

In this example ggplot() is the method. It generates a plot so the plot p is the output of the method. Before a function can start the actions and calculations that it encodes, it needs prior information: input data and parameter settings. These are called the arguments of the function. In this example the arguments are:
mtcars: a data frame consisting of 9 columns containing the input data
aes(wt,mpg): defines the columns you want to plot (wt and mpg) and how you want to plot them (wt on the X- axis and mpg on the Y-axis)

Creating a new script

Loading packages in R

You only need to install a package once. But each time you want to use a package you have to load it (activate its functions).

Getting help / documentation in R

You can find a lot of documentation online:

  • The documentation section of the R website
    Unfortunately this section is nor easily accessible nor well-structured and it can be quite a challenge to consult the help files of different R packages and functions online. By far the most user-friendly interface for searching the R documentation is the Rdocumentation website.
  • Documentation of RStudio
  • Quick R: for those who would like to make the transition to R (from SAS, SPSS, Stata)
  • R-bloggers: R-news and tutorials contributed by bloggers
  • Inside-R: a website created by people who are using R containing examples, how-to’s, packages…

Some R commands allow you to consult the documentation:

You can also ask for information on specific topics, e.g. find out what arguments a method needs…

Data types in R

A data frame contains several elements. It essentially has a matrix structure but some of the columns of the data frame might be matrices themselves. The different elements are allowed to contain different data types.

All columns in a matrix must have the same data type and length.

Install the required R packages

To analyze microarray data, you need a specific R package, called Bioconductor.
However, Bioconductor uses functions and object from various other R packages, so you need to install these R packages too:

  • Matrix
  • lattice
  • fdrtool
  • rpart

So Bioconductor will only work after you have installed these packages.

 

Additionally, you will need an R-package for making graphs of the data, called ggplot2

Install Bioconductor packages from Bioconductor repository

You need the following Bioconductor packages for Affymetrix array analysis:

    • affy, affyPLM and simpleaffy if you want with older arrays (3′ arrays)
    • oligo if you work with newer arrays (HTA, Gene ST…)

It is important to realize that it is best to pick one of these two choices. If you load multiple packages with similar functionality, e.g. both affy and oligo or both simpleaffy and oligo, R will become confused. The affy-based packages and oligo contain methods with the same name (e.g. intensity(), MAplot(), rma()) but with slightly different code. So R will not know if it has to use the intensity() method from the affy packages or from oligo. By default it chooses affy over oligo. So you always have to specify the packagename for the oligo methods

  • affydata if you need example data sets
  • ArrayExpress if you want to obtain data from ArrayExpress
  • limma
  • Biobase
  • Biostrings
  • genefilter
  • Bioconductor packages for the annotation of the spots on the arrays

How to install Bioconductor packages in R

To install the annotation packages for the arrays that you used in your experiment, go to the list of annotation packages generated by Affymetrix.
In this list you can find the name of the cdf file that corresponds to the array that you have used (in our example we used ATH1 arrays so the cdf file is called ath1121501cdf).

cdfFile.png

To install custom annotation packages (not from Affymetrix), go to BrainArray’s list of custom annotation packages.

Installing Bioconductor packages from source

Some Bioconductor packages will not install by running the biocLite() command. In that case you can always install them from source. In this case you go to the Bioconductor page of the package you wish to instal, as an example we take the Biostrings package (allthough it installs fine using the biocLite() command.

 

Updating installed Bioconductor packages

After some time, Bioconductor packages might become outdated. Bioconductor packages are updated fairly regularly.

Loading packages in R

Before you can do any programming in RStudio, you need to create a new project and a new script.
To use the installed R and BioConductor packages in R, you have to load them first.

Using a method from a specific package

If you have loaded multiple packages with similar functionality, e.g. both affy and oligo or both affyPLM and oligo, R might become confused. Affy or affyPLM and oligo contain methods with the same name e.g. intensity(), MAplot(), rma()… but with slightly different code. If you now simply use the command

intensity(data)

R will not know if it has to use the intensity() method from affy or from oligo. By default it chooses affy over oligo. To tackle this problem you have to specify the packagename in front of the name of the method for oligo, e.g.

oligo::intensity(data)

and R will know that it has to use the intensity() method of the oligo package.

Getting help / documentation in R

 

Bioconductor is object-oriented R. It means that a package consists of classes. The classes define the behaviour and characteristics of a set of similar objects that belong to the class. The characteristics that objects of a class can have are called slots while the behaviour of the objects (the actions they can do) is described by the methods of a class.
For instance, there is a class AffyBatch, consisting of containers that hold microarray data in a structured way. You can create an AffyBatch object to hold your data. AffyBatches have many slots, one for the intensities, one for the sample annotation, one for the probe set annotation… There are numerous methods that operate on AffyBatches: you can retrieve perfect match intensities, you can create a boxplot, you can normalize the data…

The documentation of a class shows:

  • a short description of the class
  • how to create objects of the class
  • a list of slots of the class
  • a list of methods that can operate on objects of the class
  • some examples

Open the CEL-files in R

If you use Affymetrix chips your microarray data will consist of a series of CEL files containing raw intensities for each probe on the array. Save all CEL files you want to analyze in a single directory. They may be gzipped (*.gz) – you do not need to gunzip them.

 

Open CEL files from 3′ Affymetrix Arrays (older ones) using affy

Data from older Affymetrix arrays should be analysed using the affy package.

If you are using a custom cdf file you need an additional line of code telling R that you are not using the standard Affymetrix cdf but a custom one.

Open CEL files from newer Affymetrix Arrays (HTA, Gene ST…) using oligo

For these arrays the old affy package doesn’t work – instead you’ll have to use the oligo package

Looking at the data

  • To view the full content of a simple object (vector, list, matrix or data frame): type the name of the object in the console.
  • To view specific items of a simple object: refer to their location using [ ] e.g.
    object[1,2] will retrieve the item on the first row and second column
    object[1:2,] will retrieve all columns of the first two rows
  • To view an entire column of a data frame: use $ e.g. dataframe$columnname
  • To access slots of complex objects: use object@ followed by the name of the slot.

How to retrieve intensities using affy

Although AffyBatches have a slot assayData to hold the raw intensities, this slot is never directly accessed. Instead two methods give access to the raw intensities in an AffyBatch: exprs() and intensity(). These methods extract the intensities of all probes (both PM and MM probes) from the AffyBatch.

Since we will only work with PM probes, you might want to look at the intensities of the PM probes only using the pm() method.

It is of course much more useful to retrieve intensities based on probe set ID then on location in the CEL file. Often you’ll want to retrieve the data of all the probes in the probe set that represents your favourite gene.

To plot the intensities of all the probes of a probe set, we will use ggplot(). It’s not the easiest way to go but ggplot() generates the nicest plots. To use ggplot() data has to be in the correct format (a data frame with informative column names) and categorical variables need to be transformed into factors.

BioC6.png

Retrieving annotation

Retrieving sample annotation using affy

Apart from the expression data itself, microarray data sets need to include information about the samples that were hybridized to the arrays, e.g. sample labels. AffyBatch objects have several slots (characteristics). One of them is called phenoData. It contains labels for the samples. Despite the name, there is no implication that the labels should be phenotypic, in fact they often indicate genotypes such as wild-type or knockout. They can also contain other types of information about the samples e.g. which samples are replicates, which samples were scanned by the same scanner, the amount of RNA that was hybridized to the arrays…
However, for most data sets the phenoData has not been defined.

If phenoData contains no information: CEL-files are given an index 1-n (n = total number of files). You can give the samples more accurate names so these can be used in the plots that we are going to create later on. We will see how to do this when we create the plots.

When you import CEL-files from GEO or ArrayExpress the phenoData should normally already contain informative names but many submitters skip this step so many data sets are not well annotated.

Retrieving probe annotation using affy

Microarray data sets should include information on the probes. AffyBatches have a slot called featureData, a data frame that contains labels for the probes. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

Retrieving experiment annotation using affy

Microarray data sets should also include information on the experiment. AffyBatches have a slot for this called experimentData. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

Retrieving annotation using oligo

All commands described above also work for the oligo package.

Retrieving other types of information

On Affymetrix arrays you should see an average of 10-20 probes per gene.

Quality control of microarray data

Giving the samples informative names

The code in this section works for both affy and oligo

If the phenoData object, that was created in the step where we retrieved the sample annotation, does not contain any information, Bioconductor will just give the CEL-files an index 1-6. However, the phenoData will be used as labels in plots. Give the samples more accurate names so they can be used in the plots that we are going to create.

 

You have to adjust this code according the number of groups and the number of replicates you have and change the sample names to names that are relevant for you. The order of the samples in the AffyBatch is determined by the CEL-file name of the sample (the CEL files are alphabetically ordered).

Create plots to assess the quality of the data

Instead of printing these plots in RStudio or the R editor, we will save the plots to our hard drive.

How to create microarray pictures

The code on this page works for both affy and oligo

The figure below shows all microarray images on the same figure. This works for small arrays but gives problems for human arrays since they are so big that it’s hard to fit them on a single figure. That’s why the code above generates a separate figure for each array.

How to create chip pseudo-images

Chip pseudo images in affy

Chip pseudo-images are very useful for detecting spatial differences (artifacts) on the invidual arrays (so not for comparing between arrays).

Pseudo-images are generated by fitting a probe-level model (PLM) to the data that assumes that all probes of a probe set behave the same in the different samples: probes that bind well to their target should do so on all arrays, probes that bind with low affinity should do so on all arrays…

You can create pseudo-images based on the residuals or the weights that result from a comparison of the model (the ideal data, without any noise) to the actual data. These weights or residuals may be graphically displayed using the image() function in Bioconductor.

The model consists of a probe level (assuming that each probe should behave the same on all arrays) and an array level (taking into account that a gene can have different expression levels in different samples) parameter.

The method to compute the model is quite intensive, requiring a lot of resources. If you have a large data set, you might receive an out-of-memory error when fitting the model.

Based on weights

Weights represent how much the original data contribute to the model: outliers are strongly downweighted because they are so different from the ideal data. Weights have values between 0 and 1. So the smaller the weight of a probe
-> the more the probe is not showing the typical behavior that it shows on the other arrays
-> the more its intensity can be considered an outlier

In the figure below, all pseudoimages were plotted on a single figure. Again, this only works for small arrays like the ATH arrays.

BioC10.png

 

Small weights (outliers) are indicated in green on the figure.

If you have small arrays and you want to plot all pseudoimages on a single figure, use the following code:

op = par(mfrow = c(2,3))
for (i in 1:6){image(Pset,which=i,main=ph@data$sample[i])}

 

Based on residuals

Residuals are the second quantity used for chip pseudo-images. They represent the difference between the original data and the ideal data according to the model. So the more a residual deviates from 0, the larger the difference between the data and the model. Residuals can be

  • positive: the intensity of the probe is larger than the ideal value according to the model
  • negative: the intensity of the probe is smaller than the ideal value according to the model

 

If you have small arrays, you might fit all six pseudoimages on one plot as shown below:

op = par(mfrow = c(2,3))
for (i in 1:6){image(Pset,which=i,type='resids',main=ph@data$sample[i])}
BioC11.png

 

Positive residuals are plotted in red, negative residuals in blue.

Chip pseudo images in oligo

The fitPLM() method is a medthod from the affyPLM package and does not work on FeatureSets so in oligo we need to use the alternative fitProbeLevelModel() method. This method fits robust probe level linear models to all the probe sets in an FeatureSet.

Pset = fitProbeLevelModel(data)

Creating pseudoimages based on weights is done exactly the same as in affy.
Creating pseudoimages based on residuals is slightly different from affy:

for (i in 1:6)
{
name = paste("pseudoimage",i,".jpg",sep="")
jpeg(name)
image(Pset,which=i,type="residuals",main=ph@data$sample[i])
dev.off()
}

How to create histograms of microarray data

The code on this page works for both affy and oligo

You might want to fit all histograms on the same figure:

BioC12.png

 

The code for this is as follows:

op = par(mfrow = c(2,3))
for(i in 1:6){hist(data[,i],lwd=2,which='pm',ylab='Density',xlab='Log2 intensities',main=ph@data$sample[i])}

You can further play with the appearance of the plot.

BioC13.png

 

Although the hist() method is very useful and easy to use, you might want to generate the histogram with ggplot(), since this gives you more freedom to make more changes to the plot. Check out the ggplot documentation.

Box plots

Boxplots and histograms show the same differences in probe intensity behavior between arrays. In order to perform meaningful statistical analysis and inferences from the data, you need to ensure that all the samples are comparable. To examine and compare the overall distribution of log transformed PM intensities between the samples you can use a histogram but you will get a clearer view with a box plot. Box plots show:

  • the median: center value, half of the intensities are lower than this value, half of the intensities are higher (= line in the box)
  • the upper quartile: a quarter of the values are higher than this quartile (= upper border of the box)
  • the lower quartile: a quarter of the values are lower than this quartile (= lower border of the box)
  • the range: minimum and maximum value (= borders of the whiskers)
  • individual extreme values (= points outside the whiskers)

The code on this page works for both affy and oligo

BioC15.png

 

Box plots allow you to assess if the scale and distribution of the data on different arrays is comparable. Differences in shape or center of the boxes indicate that normalization of the data is required.

You can also create box plots using ggplot.

BioC17.png

 

BioC20.png

 

After normalization, none of the samples should stand out from the rest. The different arrays should have the same (or at least a very comparable) median expression level. Also the scale of the boxes should be almost the same indicating that also the spread of the intensity values on the different arrays is comparable.

MA plots

Originally these MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. This is why more and more people are now also using them for Affymetrix arrays but on Affymetrix only use a single color label. So people started using them to compare each Affymetrix array to a pseudo-array. The pseudo array consists of the median intensity of each probe over all arrays.

The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA-plot, A is plotted versus M:

  • M is the difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
    M = logPMInt_array – logPMInt_medianarray
  • A is the average of the intensity of a probe on that array and the median intesity of that probe over all arrays
    A = (logPMInt_array + logPMInt_medianarray)/2

The code on this page works for both affy and oligo

BioC18.png

 

Ideally, the cloud of data points should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar for different A values (average intensities). You see that the spread of the cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of) this dependency, we will normalize the data.

 

BioC22.png

 

When you compare this plot to the one created for the raw intensities, you see a much more symmetric and even spread of the data indicating that the dependence of the variability on the average expression level is not as strong as it was before normalization.

Ideally, the cloud of data points in the MA-plot should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar across different array-medianarray combinations. You see that the spread of the point cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of this) dependency, we will normalize the data.

Calculate quality measures to assess the quality of the data

Calculate quality measures in affy

Apart from images, you can also calculate simple quality metrics to assess the quality of the arrays.

 

Calculate quality measures in oligo

The qc() method is implemented in the simpleaffy package but not in the oligo package. This means that you either have to

  • load both simpleaffy and oligo but then you always have the problem that you have to explicitly write oligo:: in front of every method you want to use from the oligo package
  • first load simpleaffy, perform the qc() analysis, restart R, load oligo and perform the rest of the analysis

Normalization of microarray data

There are many sources of noise in microarray experiments:

  • different amounts of RNA used for labeling and hybridization
  • imperfections on the array surface
  • imperfect synthesis of the probes
  • differences in hybridization conditions

Systematic differences between the samples that are due to noise rather than true biological variability should be removed in order to make biologically meaningfull conclusions about the data.

 

Normalization using RMA

The code in this section works for both affy and oligo

The standard method for normalization is RMA. RMA is one of the few normalization methods that only uses the PM probes:

  • Background correction to correct for spatial variation within individual arrays: a background-corrected intensity is calculated for each PM probe in such a way that all background corrected intensities are positive
  • Log transformation to improve the distribution of the data: the base-2 logarithm of the background corrected intensity is calculated for each probe. The log transformation will make the data less skewed and more normally distributed and provide an equal spread of up- and downregulated expression ratios
  • Quantile normalization to correct for variation between the arrays: equalizes the data distributions of the arrays and make the samples completely comparable
  • Probe normalization to correct for variation within probe sets: equalizes the behavior of the probes between the arrays and combines normalized data values of probes from a probe set into a single value for the whole probe set

 

 

Normalization using GCRMA

GCRMA in affy

GCRMA is based on RMA, having all the good sides of RMA. The difference lies in the background correction, all other steps are the same. GCRMA corrects for non-specific binding to the probes in contrast to RMA which completely ignores the issue of non-specific binding.

GCRMA uses probe sequence information to estimate probe affinity to non-specific binding. Each nucleotide of the probe contributes to the affinity of the probe. The contributions of each nucleotide on each position of the probes were estimated in a pilot experiment with artificial DNAs that mismatch the probes, done by the people who developed the algorithm. In this experiment there was no specific binding so the only thing was measured was non-specific binding. The data of this experiment allowed to estimate the affinities of the probes.

GCRMA allows you to use the affinities calculated based on this reference experiment or you can let GCRMA compute probe affinities based on the signals of the negative control probes of your own microarray experiment.

The gcrma package contains all available methods to perform GCRMA normalization.

GCRMA in oligo

GCRMA makes use of the IndexProbes() method a lot, which is a method that works on AffyBatches and is not implemented for FeatureSets. As a result, there is no easy way to do GCRMA normalization in the oligo package.

How to compare raw and back

ground-corrected microarray data

Comparing raw and background-corrected data after RMA
BioC19.png

 

If there is no difference between raw and background corrected data, the data points should end up on the diagonal. This is of course not the case. Only low intensities are strongly affected by the background subtraction: this is normal since the background intensity is a small value. Subtracting a small value from a small value has a much bigger impact than subtracting a small value from a high value.

Comparing raw and background corrected data after GCRMA

Checking the effect of the normalization

Comparing raw and background-corrected data

Comparing raw and background-corrected data after RMA
BioC19.png

 

If there is no difference between raw and background corrected data, the data points should end up on the diagonal. This is of course not the case. Only low intensities are strongly affected by the background subtraction: this is normal since the background intensity is a small value. Subtracting a small value from a small value has a much bigger impact than subtracting a small value from a high value.

Comparing raw and background corrected data after GCRMA

Comparing raw and normalized data

Comparing raw and normalized data in affy

After normalization you can compare raw and normalized data. You can do this for individual genes or for all genes.
We’ll start by making the comparison for individual genes.

Comparing raw and normalized data in oligo

After normalization oligo does use probe set IDs as row names in the data.matrix object so you can retrieve normalized data for a specific probe set e.g. for HTA 2.0 arrays:

data.matrix["2824546_st",]

However, retrieving raw data by specifying a probe set ID in the pm() method does not work in oligo.

Comparing raw and normalized data using boxplots

The code on this page works for both affy and oligo

BioC15.png

 

Box plots allow you to assess if the scale and distribution of the data on different arrays is comparable. Differences in shape or center of the boxes indicate that normalization of the data is required.

You can also create box plots using ggplot.

BioC17.png

 

BioC20.png

 

After normalization, none of the samples should stand out from the rest. The different arrays should have the same (or at least a very comparable) median expression level. Also the scale of the boxes should be almost the same indicating that also the spread of the intensity values on the different arrays is comparable.

How to create MA plot of microarray data

The code on this page works for both affy and oligo

BioC18.png

 

Ideally, the cloud of data points should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar for different A values (average intensities). You see that the spread of the cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of) this dependency, we will normalize the data.

 

BioC22.png

 

When you compare this plot to the one created for the raw intensities, you see a much more symmetric and even spread of the data indicating that the dependence of the variability on the average expression level is not as strong as it was before normalization.

How to create a PCA plot of microarray data

The code on this page works for both affy and oligo

PCA performs a transformation of the data into principal components. Principal components (PCs) are linear combinations of probe sets (or genes), ranked in such a way that the first PC is the linear combination of probe sets that captures the direction of the largest variation in the data set. The second PC captures the second largest variation in the data set and so on…

The first PC is an axis in space, so you can project each sample on this axis and the variance between the projected samples will be the highest among all possible choices of first axis. The second PC is another axis in space, perpendicular to the first. Projecting the samples on these two axes generates the plot that you see below in which the variation between the samples is as large as possible.

BioC40.png

 

When you have groups in your data this should lead to a clear distinction between the groups.

Identification of DE genes

Identification of DE genes is not done by the affy nor the oligo package but by the limma package. Limma uses the output of the rma() method (data.rma) as input. Since the output of the rma() method is the same in the affy and in the oligo package, limma works well with both packages. This means that all following code is valid for all normalized Affymetrix data regardless of the package that was used for normalization.

Two groups of samples

The limma package contains functions for using a t-test or an ANOVA to identify differential expression in microarray data. These functions can be used for all array platforms and work even for microarray data with complex designs of multiple samples. The central idea is to fit a linear model to the expression data of each gene. The expression data can be log-ratios or log-intensities. Computational methods are used to borrow information across genes making the analyses stable even for experiments with a small number of arrays. Limma is designed to be used in conjunction with the affy package.

In the above example we compared samples from two groups but in many microarray experiments, there are more than two groups. In this case we have to perform a moderated one-way ANOVA for each gene.

Three groups of samples

Paired data

Up to now we have always assumed that the groups are independent, meaning that samples of one group have no relation whatsoever with samples from another group. In independent data sets you have two groups of subjects (patients, mice, plants,….). You use one group as controls and you give the other group a treatment.
Paired or dependent data means that there exists a relationship between the samples of different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
It is important to tell limma if your data is paired or not since you need to use a different type of statistical test on paired compared to independent data:

  • paired t-test instead of regular two-samples t-test
  • repeated measures ANOVA instead of regular one-way ANOVA

Two grouping variables

Sometimes you have two grouping variables, e.g.

  • gender and treatment if you think a drug has a different effect on men and women
  • strain and treatment if you think a drug has a different effect on different mice strains

How to compare microarray data grouped according to two variables

As an example we will compare 4 groups of plants:
1. healthy control plants
2. plants infected with a pathogen
3. plants inhabited by beneficial microorganisms
4. plants inhabited by beneficial microorganisms and infected with a pathogen
It means that we have two grouping variables: pathogen and beneficial MO. All four combinations of pathogen and beneficial MO are observed, so this is a two-factor design.

It is especially important with a multi-factor design to decide what are the comparisons of interest. We want to know which genes respond to:

  • which genes respond to the pathogen infection in healthy plants: compare 2 and 1
  • which genes respond to the benifical MO in healthy plants: compare 3 and 1
  • which genes respond to the pathogen infection in beneficial MO inhabited plants: compare 4 and 3

This approach for analyzing two-factor deigns is recommended for most users, because the contrasts that are being tested are defined explicitly. It will work for simple comparisons.

However, if you want to see if the genes that respond to pathogen infection are different in BMO inhabited and control plants, you need to use an interaction term in the design matrix. To do this, you use a model formula.
Model formulas in R offer lots of extras for setting up the design matrix. However, they also require a high level of statistical understanding in order to use reliably, and they are not well described in the main R documentation.

The two approaches considered are equivalent and should yield identical results for corresponding comparisons.

More info on creating design matrices can be found in the limma user guide

Generating a Volcano plot

The best way to decide on the number of DE genes you are going to select is via a Volcano plot.

BioC26.png

 

Volcano plots arrange genes along biological and statistical significance. The X-axis gives the log fold change between the two groups (log: so that up and down regulation appear symmetric), and the Y-axis represents the p-value of a t-test comparing samples (on a negative log scale so smaller p-values appear higher up). The first axis indicates biological impact of the change; the second indicates the statistical evidence of the change.
To decide on the number of DE genes that you’re going to proceed with, you can make Volcano plots highlighting different numbers of genes. As you increase the number of genes highlighted on the plot, you can see at what point you begin to select genes with rather low fold changes. Make a conservative decision about the number of genes you want to use for follow up.

Adjusting for multiple testing and defining DE genes

You are doing a t-test on each gene, meaning that you will be doing more than 20000 t-tests on the data set. You have to adjust the p-values of the t-tests for multiple testing or you will generate too many false positives.

Of course your final aim is to generate a table containing the genes that you consider DE (the genes with the lowest adjusted p-values and the most extreme log fold changes). You can then use this table to search for functional relations between these genes…

Creating lists of probe set IDs of DE genes for functional analysis

We go back to our simple comparison of mutant and wild-type samples. We have created a list of upregulated genes called topups and a list of downregulated genes (topdowns).

What if you used decideTests() to identify DE genes ? This method generates a matrix containing labels (0, 1 or -1) for each gene in each contrast.

You’ll probably want to write the list of IDs to your computer so that you can use it in tools like DAVID.

Creating a heat map of the DE genes

BioC30.png

Creating a Venn diagram to compare results of multiple comparisons

This can be easily done on the output of the decideTests() method.

Shares