
Introduction to R for Genomic Data Analysis

March 29, 2024

Introduction to Computational Genomics

Computational genomics is a field of study that uses computational approaches to analyze and interpret genomic data. It involves the development and application of algorithms, statistical models, and computational tools to study various aspects of genomics, including genome sequencing, assembly, annotation, variation, expression, and evolution. Computational genomics plays a crucial role in advancing our understanding of biological processes, disease mechanisms, and evolutionary history encoded in genomes.

Importance of Computational Genomics:

  1. Genomic Data Analysis: With the advent of high-throughput sequencing technologies, the volume of genomic data has increased exponentially. Computational genomics provides the tools and methods to analyze and interpret this vast amount of data efficiently.
  2. Biological Discovery: Computational genomics helps in identifying genes, regulatory elements, and genetic variants associated with traits and diseases. It enables researchers to uncover biological insights that would be challenging or impossible to discover through traditional laboratory methods alone.
  3. Personalized Medicine: Computational genomics contributes to the field of personalized medicine by analyzing individual genomes to predict disease risks, choose optimal treatments, and understand drug responses based on genetic information.
  4. Evolutionary Studies: Computational genomics helps in studying the evolutionary history of species by comparing genomes across different organisms. It provides insights into evolutionary processes, adaptation, and speciation.
  5. Biotechnology and Agriculture: Computational genomics is applied in biotechnology and agriculture to improve crop yields, develop genetically modified organisms, and understand the genetic basis of traits relevant to agriculture and food security.
  6. Drug Discovery and Development: Computational genomics is used in drug discovery to identify drug targets, predict drug efficacy, and understand the genetic basis of drug resistance.
  7. Data Integration and Systems Biology: Computational genomics integrates genomic data with other omics data (e.g., transcriptomics, proteomics) to understand complex biological systems at a holistic level. It contributes to the field of systems biology, which aims to model and understand biological systems as a whole.

In conclusion, computational genomics is a multidisciplinary field that plays a crucial role in advancing biological research, personalized medicine, biotechnology, and various other areas by providing the tools and insights to analyze and understand genomic data.

Overview of high-dimensional genomics data

High-dimensional genomics data refers to datasets that contain a large number of variables (features) relative to the number of samples. These datasets arise from various high-throughput genomic technologies, such as next-generation sequencing (NGS), microarrays, and mass spectrometry, which can generate vast amounts of data for each biological sample. Here’s an overview of high-dimensional genomics data:

Types of High-Dimensional Genomics Data:

  1. Genomic Sequencing Data: Includes whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted sequencing data, which provide information about the genetic variation (e.g., single nucleotide polymorphisms, insertions, deletions) in an individual’s genome.
  2. Transcriptomic Data: Includes RNA sequencing (RNA-seq) data, which measures the expression levels of genes in a sample. It can also include microarray data, which provides a snapshot of gene expression levels.
  3. Epigenomic Data: Includes data on epigenetic modifications, such as DNA methylation and histone modifications, which can regulate gene expression without changing the underlying DNA sequence.
  4. Proteomic Data: Includes data on the proteins present in a sample, often obtained through mass spectrometry-based techniques.
  5. Metabolomic Data: Includes data on small molecules (metabolites) present in a sample, which can provide insights into metabolic processes.

Challenges of High-Dimensional Genomics Data:

  1. Curse of Dimensionality: As the number of features increases, the sparsity of the data increases, making it more challenging to analyze and interpret.
  2. Data Preprocessing: High-dimensional data often requires extensive preprocessing steps, such as normalization, transformation, and feature selection, to remove noise and irrelevant features.
  3. Statistical Analysis: Traditional statistical methods may not be suitable for high-dimensional data due to issues like multiple testing and overfitting. Specialized methods, such as machine learning algorithms, are often used.
  4. Interpretation: Interpreting results from high-dimensional data can be challenging, as the sheer number of features can lead to complex and potentially misleading conclusions.

Approaches for Analyzing High-Dimensional Genomics Data:

  1. Dimensionality Reduction: Techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) can reduce the dimensionality of the data while preserving important information. A minimal PCA sketch follows this list.
  2. Feature Selection: Identifying a subset of relevant features can help reduce noise and improve the performance of predictive models. Methods include filter, wrapper, and embedded approaches.
  3. Machine Learning: Supervised and unsupervised machine learning algorithms can be used for tasks such as classification, clustering, and regression on high-dimensional genomics data.
  4. Integration of Multiple Data Types: Integrating data from multiple omics layers (e.g., genomics, transcriptomics, proteomics) can provide a more comprehensive view of biological systems.
  5. Visualization: Effective visualization techniques can help explore and understand high-dimensional genomics data, allowing researchers to identify patterns and trends.
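
To make the PCA item concrete, here is a minimal sketch using base R's prcomp(); the expression matrix below is simulated purely for illustration, and real data would come from your own experiment.

set.seed(1)
expr <- matrix(rnorm(20 * 100), nrow = 20, ncol = 100)  # 20 samples, 100 genes
pca <- prcomp(expr, scale. = TRUE)      # center and scale features before PCA
summary(pca)$importance[, 1:3]          # variance explained by the first 3 PCs
plot(pca$x[, 1], pca$x[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Samples in PC space")      # samples plotted in the space of the first two PCs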

In summary, high-dimensional genomics data presents unique challenges and opportunities for researchers. By leveraging advanced computational and statistical methods, researchers can extract valuable insights from these complex datasets to further our understanding of biology and disease.

Role of computational analysis in genomic research

Computational analysis plays a critical role in genomic research by enabling the processing, analysis, and interpretation of large-scale genomic data. Here are some key roles of computational analysis in genomic research:

  1. Data Processing: Computational tools are used to preprocess raw genomic data generated by high-throughput sequencing technologies, such as read alignment, quality control, and error correction.
  2. Variant Calling: Computational algorithms are used to identify genetic variants, such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), from genomic sequencing data.
  3. Genome Assembly: Computational methods are used to assemble short DNA sequences (reads) into longer contiguous sequences (contigs) or complete genomes, particularly in de novo sequencing projects.
  4. Gene Prediction and Annotation: Computational tools are used to predict the location and structure of genes in a genome, as well as annotate genes with functional information based on sequence similarity and other features.
  5. Comparative Genomics: Computational analysis is used to compare genomes across different species to identify evolutionary relationships, conserved regions, and species-specific features.
  6. Functional Genomics: Computational approaches are used to analyze gene expression (e.g., RNA-seq data), protein-protein interactions, and regulatory networks to understand gene function and regulation.
  7. Metagenomics: Computational tools are used to analyze microbial communities from environmental samples, allowing researchers to study the genetic diversity and functional potential of microbial ecosystems.
  8. Structural Genomics: Computational methods are used to predict the three-dimensional structures of proteins and nucleic acids, providing insights into their functions and interactions.
  9. Phylogenetics: Computational analysis is used to reconstruct evolutionary relationships among species based on genetic data, providing insights into the history of life on Earth.
  10. Clinical Genomics: Computational analysis is used in clinical genomics to identify genetic variants associated with diseases, predict patient outcomes, and guide personalized treatment strategies (precision medicine).

Overall, computational analysis is essential for extracting meaningful information from genomic data, advancing our understanding of biology, and driving discoveries in genetics, evolution, and personalized medicine.

Data Analysis Steps in Genomics

Data acquisition and preprocessing

Data acquisition and preprocessing are crucial steps in genomic research, ensuring that data is collected, formatted, and cleaned for downstream analysis. Here’s an overview of these processes:

Data Acquisition:

  1. Experimental Design: Determine the research questions and hypotheses, as well as the experimental design (e.g., sample selection, treatments).
  2. Data Generation: Use high-throughput sequencing technologies (e.g., NGS, microarrays) to generate genomic data (e.g., DNA sequences, gene expression levels).
  3. Data Sources: Access public repositories (e.g., NCBI, ENSEMBL) or collaborate with institutions to obtain existing genomic datasets relevant to your research.
  4. Quality Control: Perform quality control checks on raw data to ensure data integrity and reliability (e.g., check sequencing quality scores, remove low-quality reads).

Data Preprocessing:

  1. Read Alignment: Align sequencing reads to a reference genome or transcriptome to determine their genomic locations and identify genetic variants.
  2. Variant Calling: Identify genetic variants (e.g., SNPs, indels) by comparing aligned reads to the reference genome and filtering out false positives.
  3. Gene Expression Quantification: For RNA-seq data, quantify gene expression levels using tools like Salmon, featureCounts, or HTSeq.
  4. Normalization: Normalize gene expression data to account for differences in sequencing depth and other technical biases. A minimal sketch follows this list.
  5. Data Transformation: Transform data if necessary (e.g., log transformation for gene expression data) to meet assumptions of downstream analyses.
  6. Batch Effect Correction: Correct for batch effects if data was generated in different batches or experimental conditions to avoid confounding effects in downstream analysis.
  7. Missing Data Handling: Address missing data by imputation or exclusion, ensuring that missing values do not bias the analysis.
  8. Data Integration: Integrate multiple omics datasets (e.g., genomics, transcriptomics, proteomics) if applicable, to gain a more comprehensive understanding of biological systems.
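
As referenced in the normalization step above, here is a minimal sketch of a manual counts-per-million (CPM) normalization followed by a log transformation. The count matrix is simulated, and in practice dedicated packages such as DESeq2 or edgeR are preferred over this hand-rolled calculation.

set.seed(2)
counts <- matrix(rpois(6 * 1000, lambda = 10), nrow = 1000, ncol = 6)  # genes x samples
lib.size <- colSums(counts)              # sequencing depth per sample
cpm <- t(t(counts) / lib.size) * 1e6     # scale each sample to counts per million
log.cpm <- log2(cpm + 1)                 # log transform with a pseudocount
boxplot(log.cpm, main = "log2(CPM + 1) per sample")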

Software and Tools:

  • Bioinformatics Tools: Use tools such as BWA, Bowtie, or HISAT2 for read alignment, and GATK, SAMtools, or VarScan for variant calling.
  • Data Visualization Tools: Use tools like R (ggplot2, shiny) or Python (matplotlib, seaborn) for visualizing genomic data.

Challenges:

  • Computational Resources: High-throughput sequencing generates large datasets, requiring substantial computational resources for storage and analysis.
  • Data Quality: Ensuring data quality is crucial, as low-quality data can lead to incorrect conclusions.
  • Standardization: Different data formats and platforms can make data integration and comparison challenging.

By carefully acquiring and preprocessing genomic data, researchers can ensure the reliability and quality of their data, setting the stage for meaningful analyses and discoveries in genomic research.

Quality control and data cleaning

Quality control (QC) and data cleaning are essential steps in genomic data analysis to ensure that the data is accurate, reliable, and suitable for downstream analysis. Here’s an overview of these processes:

Quality Control (QC):

  1. Sequence Quality: Assess the quality of sequencing reads using tools like FastQC to check for issues such as base call accuracy, sequence duplication, and adapter contamination.
  2. Alignment Quality: Evaluate the quality of read alignments to the reference genome using tools like SAMtools or Picard to check for proper mapping, mapping quality scores, and coverage uniformity.
  3. Variant Quality: Assess the quality of called variants using tools like GATK or bcftools to filter out variants based on quality scores, read depth, and allele frequency.
  4. Batch Effects: Check for batch effects in the data if samples were processed in different batches, as these can lead to spurious associations. Use tools like ComBat or Surrogate Variable Analysis (SVA) to correct for batch effects.
  5. Duplicate Removal: Identify and remove duplicate reads to avoid bias in downstream analyses, as duplicate reads can skew variant calling and expression quantification.

Data Cleaning:

  1. Filtering: Remove low-quality reads, ambiguous bases, and reads with low mapping quality to improve the accuracy of downstream analyses.
  2. Normalization: Normalize gene expression data to correct for differences in sequencing depth and other technical biases using methods like TPM (Transcripts Per Million) or DESeq2 normalization.
  3. Outlier Detection: Identify and remove outliers in the data that may be due to experimental artifacts or sample processing errors.
  4. Missing Data Handling: Address missing data by imputing missing values or excluding samples with high levels of missing data, ensuring that missing values do not bias the analysis.
  5. Data Transformation: Transform data if necessary (e.g., log transformation for gene expression data) to meet assumptions of downstream analyses.
  6. Correcting Batch Effects: If batch effects are present, apply appropriate correction methods to remove batch effects and ensure that biological signals are not confounded by technical factors.

Software and Tools:

  • FastQC: For quality assessment of sequencing reads.
  • SAMtools and Picard: For processing and quality control of aligned reads.
  • GATK and bcftools: For variant calling and quality control of called variants.
  • DESeq2 and edgeR: For normalization and differential expression analysis of RNA-seq data.
  • PCA and t-SNE: For visualization of sample clustering and outlier detection.

By performing thorough quality control and data cleaning, researchers can ensure that their genomic data is reliable and suitable for downstream analysis, leading to more accurate and meaningful biological insights.

Exploratory data analysis (EDA) techniques

Exploratory Data Analysis (EDA) is a crucial step in genomic data analysis to understand the underlying patterns, distributions, and relationships within the data. Here are some common EDA techniques used in genomic research:

  1. Summary Statistics: Calculate basic statistics such as mean, median, variance, and range to summarize the central tendency and spread of the data.
  2. Histograms: Plot histograms to visualize the distribution of continuous variables (e.g., gene expression levels, variant allele frequencies) and identify patterns such as skewness or multimodality.
  3. Box Plots: Use box plots to visualize the distribution of a continuous variable across different groups (e.g., different sample types, treatment groups) and identify outliers and differences in central tendency.
  4. Scatter Plots: Create scatter plots to explore the relationship between two continuous variables (e.g., gene expression levels vs. gene length) and identify patterns such as correlations or clusters.
  5. Correlation Analysis: Calculate correlation coefficients (e.g., Pearson, Spearman) to quantify the strength and direction of the relationship between pairs of continuous variables.
  6. Principal Component Analysis (PCA): Perform PCA to reduce the dimensionality of high-dimensional data and visualize the overall structure and clustering of samples based on their genomic profiles.
  7. Heatmaps: Generate heatmaps to visualize patterns of gene expression, genetic variation, or other genomic features across samples or genomic regions, highlighting similarities and differences.
  8. Clustering Analysis: Use clustering algorithms (e.g., hierarchical clustering, k-means clustering) to group samples or genes based on their similarities in genomic profiles and visualize the clustering results.
  9. Genomic Data Visualization: Use specialized tools (e.g., genome browsers) to visualize genomic data such as read alignments, gene annotations, and genetic variants in the context of the reference genome.
  10. Interactive Visualization: Use interactive visualization tools (e.g., Plotly, Shiny) to explore genomic data dynamically, allowing for interactive exploration of patterns and trends.

By applying these EDA techniques, researchers can gain valuable insights into the structure, variability, and relationships within genomic datasets, guiding further analyses and hypothesis generation in genomic research.
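
As a minimal illustration of a few of these techniques, the sketch below computes summary statistics and a sample-sample correlation heatmap on a simulated expression matrix; the values are random and serve only to show the function calls.

set.seed(3)
expr <- matrix(rnorm(10 * 50), nrow = 50, ncol = 10)  # 50 genes, 10 samples
colnames(expr) <- paste0("sample", 1:10)
summary(expr[, 1])                            # summary statistics for one sample
cor.mat <- cor(expr, method = "spearman")     # sample-sample correlation matrix
heatmap(cor.mat, symm = TRUE, scale = "none", # quick base R heatmap of sample correlations
        main = "Sample correlation")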

Statistical analysis methods in genomics

Statistical analysis plays a crucial role in genomics, helping researchers interpret complex biological data and draw meaningful conclusions. Here are some key statistical analysis methods used in genomics:

  1. Differential Expression Analysis: Identify genes that are differentially expressed between two or more conditions (e.g., disease vs. control) using methods like DESeq2, edgeR, and limma.
  2. Variant Analysis: Identify genetic variants (e.g., SNPs, indels) associated with traits or diseases using association studies (e.g., GWAS) and methods like PLINK, SNPTEST, and SKAT.
  3. Survival Analysis: Analyze time-to-event data (e.g., survival times, relapse times) to identify factors associated with outcomes using methods like Kaplan-Meier analysis and Cox proportional hazards models.
  4. Clustering Analysis: Group samples or genes based on similarity in genomic profiles using hierarchical clustering, k-means clustering, or other clustering algorithms.
  5. Principal Component Analysis (PCA): Reduce the dimensionality of high-dimensional genomic data to visualize patterns and relationships among samples.
  6. Gene Set Enrichment Analysis (GSEA): Identify sets of genes that are overrepresented in a list of differentially expressed genes, providing insights into biological pathways and functions.
  7. Network Analysis: Analyze biological networks (e.g., protein-protein interaction networks, gene regulatory networks) to understand the relationships among genes and proteins in cellular processes.
  8. Bayesian Methods: Incorporate prior knowledge or assumptions into the analysis using Bayesian methods, which can be useful in small-sample genomic studies or when incorporating external data sources.
  9. Machine Learning: Apply machine learning algorithms (e.g., random forests, support vector machines) to predict outcomes, classify samples, or identify patterns in genomic data.
  10. Multi-Omics Integration: Integrate data from multiple omics layers (e.g., genomics, transcriptomics, proteomics) to gain a more comprehensive understanding of biological systems and disease mechanisms.

These statistical analysis methods, among others, help researchers extract valuable insights from genomic data, leading to discoveries in genetics, biology, and medicine.

Introduction to R for Genomic Data Analysis

The aim of computational genomics is to provide biological interpretation and insights from high-dimensional genomics data. Generally speaking, it is similar to any other kind of data analysis endeavor but oftentimes doing computational genomics will require domain-specific knowledge and tools.

As new high-throughput experimental techniques are on the rise, data analysis capabilities are sought-after features for researchers. The aim of this chapter is to first familiarize readers with data analysis steps and then provide basics of R programming within the context of genomic data analysis. R is a free statistical programming language that is popular among researchers and data miners to build software and analyze data. Although basic R programming tutorials are easily accessible, we are aiming to introduce the subject with the genomic context in the background. The examples and narrative will always be from real-life situations when you try to analyze genomic data with R. We believe tailoring material to the context of genomics makes a difference when learning this programming language for the sake of analyzing genomic data.

Steps of (genomic) data analysis

Regardless of the analysis type, data analysis has a common pattern. We will discuss this general pattern and how it applies to genomics problems. The data analysis steps typically include data collection, quality check and cleaning, processing, modeling, visualization, and reporting. Although one expects to go through these steps in a linear fashion, it is normal to go back and repeat the steps with different parameters or tools. In practice, data analysis requires going through the same steps over and over again in order to be able to do a combination of the following: a) answer other related questions, b) deal with data quality issues that are later realized, and c) include new data sets to the analysis.

We will now go through a brief explanation of the steps within the context of genomic data analysis.

 Data collection

Data collection refers to any source, experiment or survey that provides data for the data analysis question you have. In genomics, data collection is done by high-throughput assays, introduced in Chapter 1. One can also use publicly available data sets and specialized databases, also mentioned in Chapter 1. How much data and what type of data you should collect depends on the question you are trying to answer and the technical and biological variability of the system you are studying.

 Data quality check and cleaning

In general, data analysis almost always deals with imperfect data. It is common to have missing values or measurements that are noisy. Data quality check and cleaning aims to identify any data quality issue and clean it from the dataset.

High-throughput genomics data is produced by technologies that could embed technical biases into the data. If we were to give an example from sequencing, the sequenced reads do not have the same quality of bases called. Towards the ends of the reads, you could have bases that might be called incorrectly. Identifying those low-quality bases and removing them will improve the read mapping step.

 Data processing

This step refers to processing the data into a format that is suitable for exploratory analysis and modeling. Oftentimes, the data will not come in a ready-to-analyze format. You may need to convert it to other formats by transforming data points (such as log transforming, normalizing, etc.), or subset the data set with some arbitrary or pre-defined condition. In terms of genomics, processing includes multiple steps. Following the sequencing analysis example above, processing will include aligning reads to the genome and quantification over genes or regions of interest. This is simply counting how many reads are covering your regions of interest. This quantity can give you ideas about how much a gene is expressed if your experimental protocol was RNA sequencing. This can be followed by some normalization to aid the next step.
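
As a toy sketch of the quantification idea described above, the snippet below counts how many "reads" overlap two "genes" using the Bioconductor GenomicRanges package. All coordinates are invented for illustration; real pipelines typically use dedicated counting tools.

library(GenomicRanges)

# two hypothetical "genes" on chr1
genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = c(100, 500), end = c(200, 600)))

# five hypothetical "aligned reads"
reads <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = c(110, 120, 180, 510, 900), width = 50))

countOverlaps(genes, reads)  # number of reads overlapping each gene
## [1] 3 1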

 Exploratory data analysis and modeling

This phase usually takes in the processed or semi-processed data and applies machine learning or statistical methods to explore the data. Typically, one needs to see the relationships between the variables measured, and the relationships between samples based on those variables. At this point, we might be checking whether the samples group as expected by the experimental design, or whether there are outliers or other anomalies. After this step you might want to do additional cleanup or re-processing to deal with these anomalies.

Another related step is modeling. This generally refers to modeling your variable of interest based on other variables you measured. In the context of genomics, it could be that you are trying to predict disease status of the patients from expression of genes you measured from their tissue samples. Then your variable of interest is the disease status. This kind of approach is generally called “predictive modeling”, and could be solved with regression-based machine learning methods.

Statistical modeling would also be a part of this modeling step. This can cover predictive modeling as well, where we use statistical methods such as linear regression. Other analyses such as hypothesis testing, where we have an expectation and we are trying to confirm that expectation, is also related to statistical modeling. A good example of this in genomics is the differential gene expression analysis. This can be formulated as comparing two data sets, in this case expression values from condition A and condition B, with the expectation that condition A and condition B have similar expression values. You will see more on this in Chapter 3.
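
As a toy illustration of the hypothesis-testing idea, the snippet below compares simulated expression values of a single gene between condition A and condition B with a two-sample t-test; real differential expression analysis uses specialized methods discussed later.

set.seed(4)
exprA <- rnorm(6, mean = 10, sd = 1)  # 6 samples from condition A
exprB <- rnorm(6, mean = 12, sd = 1)  # 6 samples from condition B
t.test(exprA, exprB)                  # two-sample t-test for a difference in means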

Visualization and reporting

Visualization is necessary for all the previous steps more or less. But in the final phase, we need final figures, tables, and text that describe the outcome of your analysis. This will be your report. In genomics, we use common data visualization methods as well as specific visualization methods developed or popularized by genomic data analysis. You will see many popular visualization methods in Chapters 3 and 6.

Why use R for genomics?

R, with its statistical analysis heritage, plotting features, and rich user-contributed packages is one of the best languages for the task of analyzing genomic data. High-dimensional genomics datasets are usually suitable to be analyzed with core R packages and functions. On top of that, Bioconductor and CRAN have an array of specialized tools for doing genomics-specific analysis. Here is a list of computational genomics tasks that can be completed using R.

Data cleanup and processing

Most general data cleanup, such as removing incomplete columns and values, and reorganizing and transforming data, can be achieved using R. In addition, with the help of packages, R can connect to databases in various formats such as mySQL, mongoDB, etc., and query and pull the data into the R environment using database-specific tools.

On top of these, genomic data-specific processing and quality check can be achieved via R/Bioconductor packages. For example, sequencing read quality checks and even HT-read alignments can be achieved via R packages.

General data analysis and exploration

Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. You will see popular data analysis methods in Chapters 3, 4, and 5.

  • Unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization (PCA, ICA, etc.)
  • Supervised data analysis: generalized linear models, support vector machines, random forests

Genomics-specific data analysis methods

R/Bioconductor gives you access to a multitude of other bioinformatics-specific algorithms. Here are some of the things you can do. We will touch upon many of the following methods in Chapter 6 and onwards.

  • Sequence analysis: TF binding motifs, GC content and CpG counts of a given DNA sequence
  • Differential expression analysis (for array- and sequencing-based measurements)
  • Gene set/pathway analysis: What kind of genes are enriched in my gene set?
  • Genomic interval operations such as overlapping CpG islands with transcription start sites, and filtering based on overlaps
  • Overlapping aligned reads with exons and counting aligned reads per gene

Visualization

Visualization is an important part of all data analysis techniques including computational genomics. Again, you can use core visualization techniques in R and also genomics-specific ones with the help of specific packages. Here are some of the things you can do with R.

  • Basic plots: Histograms, scatter plots, bar plots, box plots, heatmaps
  • Ideograms and circos plots for genomics provide visualization of different features over the whole genome.
  • Meta-profiles of genomic features, such as read enrichment over all promoters
  • Visualization of quantitative assays for given locus in the genome

Getting started with R

Download and install R (http://cran.r-project.org/) and RStudio (http://www.rstudio.com/) if you do not have them already. RStudio is optional, but it is a great tool if you are just starting to learn R. You will need specific data sets to run the code snippets in this book; we have explained how to install and use the data in the Data for the book section in the Preface. If you have not used RStudio before, we recommend running it and familiarizing yourself with it first. To put it simply, this interface combines multiple features you will need while analyzing data. You can see your code, how it is executed, the plots you make, and your data all in one interface.

Installing packages

R packages are add-ons to base R that help you achieve additional tasks that are not directly supported by base R. It is this extra functionality that makes R excel as a tool for computational genomics. The Bioconductor project (http://bioconductor.org/) is a dedicated package repository for computational biology-related packages. However, the main package repository of R, called CRAN, also has computational biology-related packages. In addition, R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), and Bitbucket (http://www.bitbucket.org) are some of the other locations where R packages might be hosted. The packages needed for the code snippets in this book and how to install them are explained in the Packages needed to run the book code section in the Preface of the book.

You can install CRAN packages using install.packages() (# is the comment character in R).

# install the package named "randomForest" from CRAN
install.packages("randomForest")

You can install bioconductor packages with a specific installer script.

# get the installer package if you don't have it
install.packages("BiocManager")

# install the Bioconductor package "rtracklayer"
BiocManager::install("rtracklayer")

You can install packages from GitHub using the install_github() function from the devtools package.

library(devtools)
install_github("hadley/stringr")

Another way to install packages is from the source.

# download the source file
download.file(
  "https://github.com/al2na/methylKit/releases/download/v0.99.2/methylKit_0.99.2.tar.gz",
  destfile="methylKit_0.99.2.tar.gz")

# install the package from the source file
install.packages("methylKit_0.99.2.tar.gz",
                 repos=NULL, type="source")

# delete the source file
unlink("methylKit_0.99.2.tar.gz")

You can also update CRAN and Bioconductor packages.

# updating CRAN packages
update.packages()

# updating Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

Installing packages in custom locations

If you will be using R on servers or computing clusters rather than your personal computer, it is unlikely that you will have administrator access to install packages. In that case, you can install packages in custom locations by telling R where to look for additional packages. This is done by setting up a .Renviron file in your home directory and adding the following line:

R_LIBS=~/Rlibs

This tells R that the "Rlibs" directory in your home directory will be the first choice of locations to look for and install packages (the directory name and location are up to you; the above is just an example). You should go and create that directory now. After that, start a fresh R session and start installing packages. From now on, packages will be installed to your local directory, where you have read-write access.
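
A minimal sketch of this setup, assuming you have added the R_LIBS line to ~/.Renviron as described above (the directory name is just an example):

dir.create("~/Rlibs", showWarnings = FALSE)  # create the custom library directory
# in a fresh R session, the custom directory should now be listed first
.libPaths()
# subsequent installs go to that directory; lib= makes the location explicit
install.packages("randomForest", lib = "~/Rlibs")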

Getting help on functions and packages

You can get help on functions with the help() and help.search() functions. You can list the functions in a package with the ls() function.

library(MASS)
ls("package:MASS")  # functions in the package
ls()                # objects in your R environment

# get help on the hist() function
?hist
help("hist")

# search the word "hist" in help pages
help.search("hist")
??hist

Computations in R

R can be used as an ordinary calculator, and some say it is an over-grown calculator. Here are some examples. Remember that # is the comment character. The comments give details about the operations in case they are not clear.

2 + 3 * 5  # Note the order of operations.
log(10)    # Natural logarithm with base e
5^2        # 5 raised to the second power
3/2        # Division
sqrt(16)   # Square root
abs(3-7)   # Absolute value of 3-7
pi         # The constant pi
exp(2)     # Exponential function
# This is a comment line

Data structures

R has multiple data structures. If you are familiar with Excel, you can think of a single Excel sheet as a table and data structures as building blocks of that table. Most of the time you will deal with tabular data sets or you will want to transform your raw data to a tabular data set, and you will try to manipulate this tabular data set in some way. For example, you may want to take sub-sections of the table or extract all the values in a column. For these and similar purposes, it is essential to know the common data structures in R and how they can be used. R deals with named data structures, which means you can give names to data structures and manipulate or operate on them using those names. It will be clear soon what we mean by this if “named data structures” does not ring a bell.

Vectors

Vectors are one of the core R data structures. A vector is basically an ordered collection of elements of the same type (numeric, character, or logical). Later you will see that every column of a table will be represented as a vector. R handles vectors easily and intuitively. You can create vectors with the c() function, although that is not the only way. Operations on vectors propagate to all the elements of the vectors.

x <- c(1,3,2,10,5)  # create a vector named x with 5 components
x = c(1,3,2,10,5)   # = can also be used for assignment
x
## [1]  1  3  2 10  5

y <- 1:5  # create a vector of consecutive integers y
y + 2     # scalar addition
## [1] 3 4 5 6 7

2 * y     # scalar multiplication
## [1]  2  4  6  8 10

y^2       # raise each component to the second power
## [1]  1  4  9 16 25

2^y       # raise 2 to the first through fifth power
## [1]  2  4  8 16 32

y         # y itself has not been changed
## [1] 1 2 3 4 5

y <- y * 2
y         # it is now changed
## [1]  2  4  6  8 10

r1 <- rep(1,3)  # create a vector of 1s, length 3
length(r1)      # length of the vector
## [1] 3

class(r1)       # class of the vector
## [1] "numeric"

a <- 1  # this is actually a vector of length one

The standard assignment operator in R is <-. This operator is preferentially used in books and documentation. However, it is also possible to use the = operator for the assignment. We have an example in the above code snippet and throughout the book we use <- and = interchangeably for assignment.

Matrices

A matrix refers to a numeric array of rows and columns. You can think of it as a stacked version of vectors where each row or column is a vector. One of the easiest ways to create a matrix is to combine vectors of equal length using cbind(), meaning ‘column bind’.

x <- c(1,2,3,4)
y <- c(4,5,6,7)
m1 <- cbind(x,y); m1
##      x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [4,] 4 7

t(m1)  # transpose of m1
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y    4    5    6    7

dim(m1)  # dimensions of m1: 4 rows by 2 columns
## [1] 4 2

You can also directly list the elements and specify the matrix:

m2 <- matrix(c(1,3,2,5,-1,2,2,3,9), nrow=3)
m2
##      [,1] [,2] [,3]
## [1,]    1    5    2
## [2,]    3   -1    3
## [3,]    2    2    9

Matrices and the next data structure, data frames, are tabular data structures. You can subset them using [] and providing desired rows and columns to subset. Figure 2.1 shows how that works conceptually.

FIGURE 2.1: Slicing/subsetting of a matrix and a data frame.
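
For example, using the matrix m2 defined above, [] subsetting works as follows:

m2[1,]         # first row of m2
## [1] 1 5 2

m2[,2]         # second column of m2
## [1]  5 -1  2

m2[1:2, 2:3]   # rows 1-2 and columns 2-3
##      [,1] [,2]
## [1,]    5    2
## [2,]   -1    3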

Data frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). A data frame can be constructed by the data.frame() function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.

chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-", "-", "+", "+")
start <- c(200, 4000, 100, 400)
end <- c(250, 410, 200, 450)
mydata <- data.frame(chr, start, end, strand)
# change column names
names(mydata) <- c("chr", "start", "end", "strand")
mydata
##    chr start end strand
## 1 chr1   200 250      -
## 2 chr1  4000 410      -
## 3 chr2   100 200      +
## 4 chr2   400 450      +

# OR this will work too: name the columns directly
mydata <- data.frame(chr=chr, start=start, end=end, strand=strand)
mydata
##    chr start end strand
## 1 chr1   200 250      -
## 2 chr1  4000 410      -
## 3 chr2   100 200      +
## 4 chr2   400 450      +

There are a variety of ways to extract the elements of a data frame. You can extract certain columns using column numbers or names, or you can extract certain rows by using row numbers. You can also extract data using logical arguments, such as extracting all rows that have a value in a column larger than your threshold.

mydata[,2:4]  # columns 2, 3, and 4 of the data frame
##   start end strand
## 1   200 250      -
## 2  4000 410      -
## 3   100 200      +
## 4   400 450      +

mydata[,c("chr","start")]  # columns chr and start from the data frame
##    chr start
## 1 chr1   200
## 2 chr1  4000
## 3 chr2   100
## 4 chr2   400

mydata$start  # variable start in the data frame
## [1]  200 4000  100  400

mydata[c(1,3),]  # get 1st and 3rd rows
##    chr start end strand
## 1 chr1   200 250      -
## 3 chr2   100 200      +

mydata[mydata$start > 400,]  # get all rows where start > 400
##    chr start end strand
## 2 chr1  4000 410      -

Lists

A list in R is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name. You can create a list with the list() function. Each object or element in the list has a numbered position and can have names. Below we show a few examples of how to create lists.

# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
          mynumbers=c(1,2,3),
          mymatrix=matrix(1:4, ncol=2),
          age=5.3)
w
## $name
## [1] "Fred"
##
## $mynumbers
## [1] 1 2 3
##
## $mymatrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## $age
## [1] 5.3

You can extract elements of a list using the double square-bracket convention [[ ]], with either the element's position in the list or its name.

w[[3]]  # 3rd component of the list
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

w[["mynumbers"]]  # component named mynumbers in the list
## [1] 1 2 3

w$age
## [1] 5.3

 Factors

Factors are used to store categorical data. They are important for statistical modeling since categorical variables are treated differently in statistical models than continuous variables; storing them as factors ensures they are handled accordingly.

features <- c("promoter", "exon", "intron")
f.feat <- factor(features)
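
A quick look at the factor created above shows its levels (the categories) and how factors are stored internally:

f.feat                # values are printed together with the levels
## [1] promoter exon     intron
## Levels: exon intron promoter

levels(f.feat)        # categories, sorted alphabetically by default
## [1] "exon"     "intron"   "promoter"

as.integer(f.feat)    # factors are stored internally as integer codes
## [1] 3 1 2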

An important thing to note is that when you read a data frame with read.table() or create one with the data.frame() function, character columns may be stored as factors by default; to change this behavior you need to set stringsAsFactors=FALSE in the read.table() and/or data.frame() function arguments. (Since R 4.0.0, the default for stringsAsFactors is FALSE, so recent R versions no longer perform this conversion automatically.)

Data types

There are four common data types in R: numeric, logical, character, and integer. All of these data types can be used to create vectors natively.

# create a numeric vector x with 5 components
x <- c(1,3,2,10,5)
x
## [1]  1  3  2 10  5

# create a logical vector x
x <- c(TRUE, FALSE, TRUE)
x
## [1]  TRUE FALSE  TRUE

# create a character vector
x <- c("sds", "sd", "as")
x
## [1] "sds" "sd"  "as"
class(x)
## [1] "character"

# create an integer vector
x <- c(1L, 2L, 3L)
x
## [1] 1 2 3
class(x)
## [1] "integer"

Reading and writing data

Most of the genomics data sets are in the form of genomic intervals associated with a score. That means the data will mostly be in table format, with columns denoting chromosome, start position, end position, strand, and score. One of the most popular formats is the BED format, which is used primarily by the UCSC genome browser, but most other genome browsers and tools support the BED file format. We have all the annotation data in BED format. You will read more about data formats in Chapter 6. In R, you can easily read tabular-format data with the read.table() function.

enhancerFilePath <- system.file("extdata",
                                "subset.enhancers.hg18.bed",
                                package="compGenomRData")
cpgiFilePath <- system.file("extdata",
                            "subset.cpgi.hg18.bed",
                            package="compGenomRData")

# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE)

# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE)

# check the first lines to see how the data looks
head(enh.df)
##      V1     V2     V3 V4   V5 V6    V7      V8 V9
## 1 chr20 266275 267925  . 1000  .  9.11 13.1693 -1
## 2 chr20 287400 294500  . 1000  . 10.53 13.0231 -1
## 3 chr20 300500 302500  . 1000  .  9.10 13.3935 -1
## 4 chr20 330400 331800  . 1000  .  6.39 13.5105 -1
## 5 chr20 341425 343400  . 1000  .  6.20 12.9852 -1
## 6 chr20 437975 439900  . 1000  .  6.31 13.5184 -1

head(cpgi.df)
##      V1     V2     V3       V4
## 1 chr20 195575 195851  CpG:_28
## 2 chr20 207789 208148  CpG:_32
## 3 chr20 219055 219437  CpG:_33
## 4 chr20 225831 227155 CpG:_135
## 5 chr20 252826 256323 CpG:_286
## 6 chr20 275376 276977 CpG:_116

You can save your data by writing it to disk as a text file. A data frame or matrix can be written out by using the write.table() function. Now let us write out cpgi.df. We will write it out as a tab-separated file; pay attention to the arguments.

write.table(cpgi.df, file="cpgi.txt", quote=FALSE,
            row.names=FALSE, col.names=FALSE, sep="\t")

You can save your R objects directly into a file using save() and saveRDS() and load them back in with load() and readRDS(). By using these functions you can save any R object whether or not it is in data frame or matrix classes.

save(cpgi.df, enh.df, file="mydata.RData")
load("mydata.RData")

# saveRDS() can save one object at a time
saveRDS(cpgi.df, file="cpgi.rds")
x <- readRDS("cpgi.rds")
head(x)

One important thing is that with save() you can save many objects at a time, and when they are loaded into memory with load() they retain their variable names. For example, in the above code, when you use load("mydata.RData") in a fresh R session, objects named cpgi.df and enh.df will be created. That means you have to remember what names you gave the objects before saving them. Conversely, when you save an object with saveRDS() and read it back with readRDS(), the name of the object is not retained, and you need to assign the output of readRDS() to a new variable (x in the above code chunk).

Reading large files

Reading large files that contain tables with the base R function read.table() might take a very long time. Therefore, there are additional packages that provide faster functions to read files. The data.table and readr packages provide this functionality. Below, we show how to use them. With the parameters provided, these functions return output equivalent to that of read.table().

library(data.table)
df.f <- fread(enhancerFilePath, header = FALSE, data.table = FALSE)

library(readr)
df.f2 <- read_table(enhancerFilePath, col_names = FALSE)

Plotting in R with base graphics

R has great support for plotting and customizing plots by default. This basic capability for plotting in R is referred to as “base graphics” or “R base graphics”. We will show only a few below. Let us sample 50 values from the normal distribution and plot them as a histogram. A histogram is an approximate representation of a distribution. Bars show how frequently we observe certain values in our sample. The resulting histogram from the code chunk below is shown in Figure 2.2.

# sample 50 values from the normal distribution
# and store them in vector x
x <- rnorm(50)
hist(x)  # plot the histogram of those values

FIGURE 2.2: Histogram of values sampled from the normal distribution.

We can modify all the plots by providing certain arguments to the plotting function. Now let’s give a title to the plot using the main argument. We can also change the color of the bars using the col argument. You can simply provide the name of the color. Below, we are using ‘red’ for the color. See Figure 2.3 for the result of this code chunk.

hist(x, main="Hello histogram!!!", col="red")

FIGURE 2.3: Histogram in red color.

Next, we will make a scatter plot. Scatter plots are one of the most common plots you will encounter in data analysis. We will sample another set of 50 values and plot those against the ones we sampled earlier. A scatter plot shows the values of two variables for a set of data points. It is useful for visualizing relationships between two variables and is frequently used in connection with correlation and linear regression. There are other variants of scatter plots that show the density of points with different colors; we will show examples of those in later chapters. The scatter plot from our sampling experiment is shown in Figure 2.4. Notice that, in addition to the main argument, we used the xlab and ylab arguments to label the axes. You can customize the plots even more than this. See ?plot and ?par for more arguments that can help you customize the plots.

# randomly sample 50 points from the normal distribution
y <- rnorm(50)

# plot a scatter plot
# control x-axis and y-axis labels
plot(x, y, main="scatterplot of random samples",
     ylab="y values", xlab="x values")

FIGURE 2.4: Scatter plot example.

We can also plot boxplots for the vectors x and y. Boxplots depict groups of numerical data through their quartiles. The edges of the box denote the 1st and 3rd quartiles, and the line that crosses the box is the median. The distance between the 1st and the 3rd quartiles is called the interquartile range (IQR). The whiskers (lines extending from the boxes) are usually defined using the interquartile range; for symmetric distributions, lowerWhisker = Q1 - 1.5*IQR and upperWhisker = Q3 + 1.5*IQR.

In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers. The resulting plot from the code snippet below is shown in Figure 2.5.

boxplot(x, y, main="boxplots of random samples")

FIGURE 2.5: Boxplot example.
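
The whisker limits described above can also be computed directly from the data; note that boxplot() actually draws whiskers only up to the most extreme data point within these limits.

# whisker limits for x, following lowerWhisker = Q1 - 1.5*IQR and
# upperWhisker = Q3 + 1.5*IQR
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- IQR(x)   # same as q3 - q1
c(lowerWhisker = q1 - 1.5 * iqr,
  upperWhisker = q3 + 1.5 * iqr)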

Next up is the bar plot, which you can draw with the barplot() function. We are going to plot four imaginary percentage values and color them with two colors, and this time we will also show how to draw a legend on the plot using the legend() function. The resulting plot is in Figure 2.6.

perc <- c(50,70,35,25)
barplot(height=perc,
        names.arg=c("CpGi","exon","CpGi","exon"),
        ylab="percentages", main="imagine %s",
        col=c("red","red","blue","blue"))
legend("topright", legend=c("test","control"),
       fill=c("red","blue"))

FIGURE 2.6: Bar plot example.

Combining multiple plots

In R, we can combine multiple plots in the same graphic. For this purpose, we use the par() function for simple combinations. More complicated arrangements with different sizes of sub-plots can be created with the layout() function. Below we will show how to combine two plots side by side using par(mfrow=c(1,2)). The mfrow=c(nrows, ncols) construct will create a matrix of nrows x ncols plots that are filled in by row. The following code will produce a histogram and a scatter plot side by side. The result is shown in Figure 2.7. If you want to see the plots on top of each other, simply change mfrow=c(1,2) to mfrow=c(2,1).

par(mfrow=c(1,2))  # set up a 1-row, 2-column plotting layout

# make the plots
hist(x, main="Hello histogram!!!", col="red")
plot(x, y, main="scatterplot",
     ylab="y values", xlab="x values")

FIGURE 2.7: Combining two plots, a histogram and a scatter plot, with the par() function.
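
For arrangements that mfrow cannot express, layout() takes a matrix describing which plot goes where. Below is a small sketch with one wide plot on top and two plots underneath.

# 2x2 grid: plot 1 spans the whole top row, plots 2 and 3 share the bottom row
layout(matrix(c(1, 1,
                2, 3), nrow = 2, byrow = TRUE))
hist(x, main = "histogram", col = "red")
plot(x, y, main = "scatterplot")
boxplot(x, y, main = "boxplots")
layout(1)  # reset to a single-plot layout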

 Saving plots

If you want to save your plots to an image file, there are a couple of ways to do that. Normally, you will have to do the following:

  1. Open a graphics device.
  2. Create the plot.
  3. Close the graphics device.

pdf("mygraphs/myplot.pdf", width=5, height=5)
plot(x, y)
dev.off()

Alternatively, you can first create the plot then copy the plot to a graphics device.

plot(x, y)
dev.copy(pdf, "mygraphs/myplot.pdf", width=7, height=5)
dev.off()
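
The same open-plot-close pattern works for other graphics devices, such as bitmap formats; the file name and pixel dimensions below are arbitrary.

png("myplot.png", width = 600, height = 600)  # open a PNG device
plot(x, y)
dev.off()                                     # close it to write the file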

Plotting in R with ggplot2

In R, there are other plotting systems besides "base graphics", which is what we have shown until now. There is another popular plotting system called ggplot2 which implements a different logic when constructing plots. This system or logic is known as the "grammar of graphics". This system defines a plot or graphic as a combination of different components. For example, in the scatter plot in Figure 2.4, we have the points, which are geometric shapes, and we have the coordinate system and scales of the data. In addition, data transformations are also part of a plot. In Figure 2.3, the histogram has a binning operation: it puts the data into bins before displaying it as geometric shapes, the bars. The ggplot2 system and its implementation of the "grammar of graphics" allow us to build a plot layer by layer using predefined components.

Next we will see how this works in practice. Let’s start with a simple scatter plot using ggplot2. In order to make basic plots in ggplot2, one needs to combine different components. First, we need the data and its transformation to a geometric object; for a scatter plot this would be mapping data to points, for histograms it would be binning the data and making bars. Second, we need the scales and coordinate system, which generates axes and legends so that we can see the values on the plot. And the last component is the plot annotation such as plot title and the background.

The main ggplot2 function, called ggplot(), requires a data frame to work with, and this data frame is its first argument, as shown in the code snippet below. The second thing you will notice is the aes() function inside ggplot(). This function defines which columns in the data frame map to the x and y coordinates and whether points should be colored or shaped differently based on the values in another column. These are the "aesthetic" mappings: what we observe in the plot. The last line in the code represents the geometric object to be plotted. Geometric objects define the type of the plot; in this case, the object is a point, indicated by the geom_point() function. Another peculiar thing in the code is the + operation. In ggplot2, this operation is used to add layers and modify the plot. The resulting scatter plot from the code snippet below can be seen in Figure 2.8.

library(ggplot2)
myData <- data.frame(col1=x, col2=y)

# the data is myData, and columns col1 and col2
# are mapped to the x and y axes
ggplot(myData, aes(x=col1, y=col2)) +
  geom_point()  # draw the mapped values as points

FIGURE 2.8: Scatter plot with ggplot2.

Now, let’s re-create the histogram we created before. For this, we will start again with the ggplot() function. We are interested only in the x-axis in the histogram, so we will only use one column of the data frame. Then, we will add the histogram layer with the geom_histogram() function. In addition, we will be showing how to modify your plot further by adding an additional layer with the labs() function, which controls the axis labels and titles. The resulting plot from the code chunk below is shown in Figure 2.9.

ggplot(myData, aes(x=col1)) +
  geom_histogram() +  # add the histogram layer
  labs(title="Histogram for a random variable", x="my variable", y="Count")

FIGURE 2.9: Histogram made with ggplot2, with additional modifications introduced by the labs() function.

We can also plot boxplots using ggplot2. Let’s re-create the boxplot we did in Figure 2.5. This time we will have to put all our data into a single data frame with extra columns denoting the group of our values. In the base graphics case, we could just input variables containing different vectors. However, ggplot2 does not work like that and we need to create a data frame with the right format to use the ggplot() function. Below, we first concatenate the x and y vectors and create a second column denoting the group for the vectors. In this case, the x-axis will be the “group” variable which is just a character denoting the group, and the y-axis will be the numeric “values” for the x and y vectors. You can see how this is passed to the aes() function below. The resulting plot is shown in Figure 2.10.

# data frame with a group column showing which
# group the values from vectors x and y belong to
myData2 <- rbind(data.frame(values=x, group="x"),
                 data.frame(values=y, group="y"))

# x-axis will be group and y-axis will be values
ggplot(myData2, aes(x=group, y=values)) +
  geom_boxplot()

FIGURE 2.10: Boxplots using ggplot2.

Combining multiple plots

There are different options for combining multiple plots. If we are trying to make similar plots for the subsets of the same data set, we can use faceting. This is a built-in and very useful feature of ggplot2. This feature is frequently used when investigating whether patterns are the same or different in different conditions or subsets of the data. It can be used via the facet_grid() function. Below, we will make two histograms faceted by the group variable in the input data frame. We will be using the same data frame we created for the boxplot in the previous section. The resulting plot is in Figure 2.11.

ggplot(myData2, aes(x=values)) +
  geom_histogram() + facet_grid(. ~ group)

FIGURE 2.11: Combining two plots using the ggplot2::facet_grid() function.

Faceting only works when you are using the subsets of the same data set. However, you may want to combine different types of plots from different data sets. The base R functions such as par() and layout() will not work with ggplot2 because it uses a different graphics system and this system does not recognize base R functionality for plotting. However, there are multiple ways you can combine plots from ggplot2. One way is using the cowplot package. This package aligns the individual plots in a grid and will help you create publication-ready compound plots. Below, we will show how to combine a histogram and a scatter plot side by side. The resulting plot is shown in Figure 2.12.

library(cowplot)

# histogram
p1 <- ggplot(myData2, aes(x=values, fill=group)) +
  geom_histogram()

# scatterplot
p2 <- ggplot(myData, aes(x=col1, y=col2)) +
  geom_point()

# plot the two plots in a grid and label them as A and B
plot_grid(p1, p2, labels = c("A", "B"), label_size = 12)

FIGURE 2.12: Combining a histogram and scatter plot using the cowplot package. The plots are labeled as A and B using the arguments in the plot_grid() function.

ggplot2 and tidyverse

ggplot2 is actually part of a larger ecosystem. You will need packages from this ecosystem when you want to use ggplot2 in a more sophisticated manner or when you need additional functionality that is not readily available in base R or other packages. For example, when you want to make more complicated plots using ggplot2, you will need to reshape your data frames into the formats required by the ggplot() function, and you will need to learn about the dplyr and tidyr packages for data formatting purposes (a brief sketch follows below). If you are working with character strings, the stringr package might have functionality that is not available in base R. There are many more packages that users find useful in the tidyverse, and it is worth knowing about this ecosystem of R packages.
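
For example, the long-format data frame myData2 built earlier with rbind() could also be produced with tidyr (assuming the tidyr package is installed); in that case the group labels will simply be the original column names col1 and col2.

library(tidyr)

# reshape the wide myData data frame into the long format ggplot() expects
myData2.tidy <- pivot_longer(myData,
                             cols = c(col1, col2),
                             names_to = "group",
                             values_to = "values")
head(myData2.tidy)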

Functions and control structures (for, if/else etc.)

 User-defined functions

Functions are useful for transforming larger chunks of code to re-usable pieces of code. Generally, if you need to execute certain tasks with variable parameters, then it is time you write a function. A function in R takes different arguments and returns a definite output, much like mathematical functions. Here is a simple function that takes two arguments, x and y, and returns the sum of their squares.

sqSum<-function(x,y){

result=x^2+y^2

return(result)

}

# now try the function out

sqSum(2,3)

## [1] 13

Functions can also output plots and/or messages to the terminal. Here is a function that prints a message to the terminal:

sqSumPrint<-function(x,y){

result=x^2+y^2

cat("here is the result:",result,"\n")

}

# now try the function out

sqSumPrint(2,3)

## here is the result: 13

Sometimes we want to execute a certain part of the code only if a certain condition is satisfied. This condition can be as simple as the type of an object (e.g., execute certain code only if the object is a matrix) or more complicated, such as whether the object's value falls between certain thresholds. Let us see how these if statements can be used. They can be used anywhere in your code; here we will use one in a function to decide whether a CpG island is large, normal length, or short.

cpgi.df <- read.table("intro2R_data/data/subset.cpgi.hg18.bed", header = FALSE)

# function takes input one row

# of CpGi data frame

largeCpGi<-function(bedRow){

cpglen=bedRow[3]-bedRow[2]+1

if(cpglen>1500){

cat("this is large\n")

}

else if(cpglen<=1500 & cpglen>700){

cat("this is normal\n")

}

else{

cat("this is short\n")

}

}

largeCpGi(cpgi.df[10,])

largeCpGi(cpgi.df[100,])

largeCpGi(cpgi.df[1000,])

Loops and looping structures in R

When you need to repeat a certain task or execute a function multiple times, you can do that with the help of loops. A loop will execute the task until a certain condition is reached. The loop below is called a "for-loop" and it executes the task sequentially 10 times.

for(i in 1:10){ # number of repetitions

cat("This is iteration") # the task to be repeated

print(i)

}

## This is iteration[1] 1

## This is iteration[1] 2

## This is iteration[1] 3

## This is iteration[1] 4

## This is iteration[1] 5

## This is iteration[1] 6

## This is iteration[1] 7

## This is iteration[1] 8

## This is iteration[1] 9

## This is iteration[1] 10

The task above is a bit pointless; normally in a loop, you would want to do something meaningful. Let us calculate the lengths of the CpG islands we read in earlier. Although this is not the most efficient way of doing that particular task, it serves as a good example of looping. The code below will execute one hundred times, calculating the length of each of the first 100 CpG islands in the data frame (by subtracting the start coordinate from the end coordinate and adding 1).

Note: If you are going to run a loop that has a lot of repetitions, it is smart to try the loop with a few repetitions first and check the results. This will help you make sure the code in the loop works before executing it thousands of times.

# this is where we will keep the lengths

# for now it is an empty vector

result=c()

# start the loop

for(i in 1:100){

#calculate the length

len=cpgi.df[i,3]-cpgi.df[i,2]+1

#append the length to the result

result=c(result,len)

}

# check the results

head(result)

## [1] 277 360 383 1325 3498 1602

Apply family functions instead of loops

R has other ways of repeating tasks, which tend to be more efficient than using loops. They are known collectively as the "apply" family of functions, which includes apply, lapply, mapply and tapply (and some other variants). All of these functions apply a given function to a set of instances and return the results for each instance. The difference between them is that they take different types of inputs. For example, apply works on data frames or matrices and applies the function to each row or column of the data structure, while lapply works on lists or vectors and applies a function that takes a list element as an argument. Next we will demonstrate how to use apply() on a matrix. The example applies the sum function to the rows of a matrix; it basically sums up the values in each row, which is conceptualized in Figure 2.13.

FIGURE 2.13: apply() concept in R.

mat=cbind(c(3,0,3,3),c(3,0,0,0),c(3,0,0,3),c(1,1,0,0),c(1,1,1,0),c(1,1,1,0))

result<-apply(mat,1,sum)

result

## [1] 12 3 5 6

# OR you can define the function as an argument to apply()

result<-apply(mat,1,function(x) sum(x))

result

## [1] 12 3 5 6

Notice that we used 1 as the second argument, which indicates that the rows of the matrix/data frame will be the input for the function. If we change the second argument to 2, the columns will be the input for the applied function instead. See Figure 2.14 for a visualization of apply() on columns.

FIGURE 2.14: apply() function on columns

result<-apply(mat,2,sum)

result

## [1] 9 3 6 2 3 3

Next, we will use lapply(), which applies a function on a list or a vector. The function that will be applied is a simple function that takes the square of a given number.

input=c(1,2,3)

lapply(input,function(x) x^2)

## [[1]]

## [1] 1

##

## [[2]]

## [1] 4

##

## [[3]]

## [1] 9
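
The apply family also includes tapply(), which was mentioned above but not demonstrated. As a minimal sketch, tapply() applies a function to subsets of a vector defined by a grouping factor; here it computes the mean of the values column within each group of the myData2 data frame used for the boxplots earlier, assuming it is still in your workspace.

# mean of `values` within each level of `group`
tapply(myData2$values, myData2$group, mean)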

mapply() is another member of the apply family. It can apply a function over an arbitrary number of vectors/lists; it is essentially a version of lapply() that can handle multiple vectors as arguments. In this case, the arguments to mapply() are the function to be applied and the sets of parameters to be supplied as arguments to that function. As shown in the conceptualized Figure 2.15, the function to be applied takes two arguments and sums them up. The arguments to be summed are supplied as the vectors Xs and Ys, and mapply() applies the summation function to each pair of elements in Xs and Ys. Notice that the order of the input function and the extra arguments is different for mapply() than for the other apply functions.

FIGURE 2.15: mapply() concept.

Xs=0:5

Ys=c(2,2,2,3,3,3)

result<-mapply(function(x,y) sum(x,y),Xs,Ys)

result

## [1] 2 3 4 6 7 8

Apply family functions on multiple cores

If you have large data sets, apply family functions can be slow (although probably still better than for loops). If that is the case, you can easily use the parallel versions of those functions from the parallel package. These functions essentially divide your tasks into smaller chunks, run them on separate CPU cores, and merge the results of those parallel operations. This concept is visualized in Figure 2.16 below, where mcmapply(), the parallel version of mapply(), runs the summation function on three different processors. Each processor executes the summation function on a part of the data set, and the results are merged and returned as a single vector in the same order as the input parameters Xs and Ys.

FIGURE 2.16: mcmapply() concept.
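
As a minimal sketch of how this looks in practice, the code below reruns the earlier mapply() example with parallel::mcmapply() and the lapply() example with parallel::mclapply(). The choice of two cores is arbitrary, and note that mc.cores greater than 1 is not supported on Windows.

library(parallel)

# detectCores() reports how many cores are available on this machine
detectCores()

# parallel version of mapply(): sums each pair of Xs and Ys,
# distributing the work over two cores (mc.cores = 2 is an arbitrary choice)
result <- mcmapply(function(x, y) sum(x, y), Xs, Ys, mc.cores = 2)
result

# parallel version of lapply(): squares each element using two cores
mclapply(1:6, function(x) x^2, mc.cores = 2)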

Vectorized functions in R

The examples above were chosen to illustrate functions and loops in R because functions built around sum() are simple and easy to understand; in practice, you will mostly need loops and looping structures for more complicated functions. In fact, most of the operations we used do not require loops at all, because R already provides vectorized functions that achieve the same outcomes: if the input arguments are R vectors, the output will be a vector as well, so there is no need for explicit loops or apply calls.

For example, instead of using mapply() and sum() functions, we can just use the + operator and sum up Xs and Ys.

result=Xs+Ys

result

## [1] 2 3 4 6 7 8

In order to get the column or row sums, we can use the vectorized functions colSums() and rowSums().

colSums(mat)

## [1] 9 3 6 2 3 3

rowSums(mat)

## [1] 12 3 5 6

However, remember that not every function in R is vectorized, so use the vectorized ones when they exist. Sooner or later, though, the apply family of functions will come in handy.
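
For instance, ifelse() is a vectorized counterpart of the if/else construct used in the largeCpGi() function above. As a minimal sketch (assuming cpgi.df is still loaded), it can classify all CpG island lengths in a single call, without a loop.

# compute all island lengths at once (end - start + 1)
cpglen <- cpgi.df[,3] - cpgi.df[,2] + 1

# nested ifelse() calls classify every length in one vectorized step
category <- ifelse(cpglen > 1500, "large",
                   ifelse(cpglen > 700, "normal", "short"))
head(category)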
