Bioinformatics Tutorials by Santhilal :: R

<< Previous: UNIX
Next: RNA-seq >>


Basic R Introduction

R is a UNIX like platform where you make use of different statistical functions. There are also bioinformatics tools developed on R platform. Imagine UNIX as a platform and AWK as tool developed on UNIX platform. Likewise you have R and set of tools developed by different people for different bioinformatics analysis purpose. There are group of people involved in developing bioinformatics tools/packages using R platform called as Bioconductor. You must have notices in my last sentence I was mentioning a word 'package'. we call it package because it uses set of different commands from R in order to perform certain task (example, sequence matching or sequence alignment) and deliver it to the users as a complete package where users do not have to build the package from the scratch by themselves.

In this tutorial we are going to learn very minimal about R which is necessary for the course. If you like to learn complete R from very basic to advanced level, please click here. Some bioinformaticians manage all their work just by using R programming along with basic UNIX without need for learning complex programming languages like PERL, Python, JAVA etc,. If you become master in UNIX and R programming, you can do any kind of bioinformatics analysis and also you will able to develop tools on your own.

Installing R and bioconductor

Installing and opening R:
$$ sudo apt-get install r-base
$$ R
Installing Bioconductor inside R:
## try "http://", if "https://" URLs are not supported
> source("https://bioconductor.org/biocLite.R")
> biocLite()
Installing Bioconductor package edgeR:
EdgeR is an bioconductor package (User guide) used for differential gene expression analysis of RNA-seq samples. There are many packages available for this purpose. You have to choose the package by reading the user guide of the particular package. This used guide will tell you about the type of statistics it uses and in some case algorithms. It is always good habit for a bioinformatician to read the documentation/user guide of particular tool before using it. In the edgeR user guide you will find section 1.4 (Quick start) where you can choose these optimal steps needed for complete differential expression analysis. If you would like to alter the normalization procedure or different statistics, you can read the guide and make changes to the steps accordingly.
> source("https://bioconductor.org/biocLite.R")
> biocLite("edgeR")
Installing packages other than Bioconductor:
Lets install a package designed to generate heatmap. There is default heatmap package in R, but it is very primitive and gives low quality boring heatmaps (Image Left). Were are going to install a package called pretty heatmap 'pheatmap' (Image Right) which generates good quality heatmap and offers lot of customization.
> install.packages(pkgs="pheatmap")
To load installed package:
> library(pheatmap)
> library(edgeR)
To check the usage or documentation:
> help(pheatmap)
> ?pheatmap # Alternative to help
> ?edgeR



Matrices in R

Creating matrix in R:
> x <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
> x # Check created matrix
> t(x) # Reverse or transpose matrix
> y <- t(x) # Save transposed matrix to another variable
>
Load expression matrix (tab-separated table) in R:
> mymatrix <- read.table("http://www.kandurilab.org/users/santhilal/courses/R/hg38_ensembl_GRCh38.87_counts.matrix", header=TRUE, sep="\t",row.names=1)
> dim(mymatrix) # Checknumber of rows and columns in the matrix (not counting the header and rownames)
> head(mymatrix) # First 10 entries
> mymatrix[1,1] # accessing first row first column
> mymatrix[2,1] # accessing second row first column
> mymatrix[1:10,] # accessing first 10 rows and all columns
> mymatrix[1:10,1:2] # accessing first 10 rows and first two columns
> mymatrix[1:10,c(1,3,5)] # accessing first 10 rows and 1,3 and 5th columns
Question R1: Atleast two different commands/ways to display first 20 entries from above 'mymatrix'.

Correlations and plots

Correlation:
> mymatrix_cor <- cor(mymatrix)
> pheatmap(mymatrix_cor)
Saving the plot:
> png("./correlation.png") # Save the plot in current directory. To check directory you are in, getwd()
> pheatmap(mymatrix_cor)
> dev.off()
Question R2: Above correlation plot was generated using default 'pearson' method for correlation. Please generate same plot using 'spearman' correlation and save as JPEG file (complete code from loading the matrix till saving the plot has to be reported for the answer).

Subsetting & saving table in R

Load annotation table from 'practice':
> myannot <- read.table("practice/annotation/Homo_sapiens.GRCh38.87_annotation.table", header=TRUE, sep="\t", row.names=1) # Importing a text file into R
> colnames(myannot) # Check column names of loaded table
> summary(myannot$Class) # Summarize data from one column
Subsetting:
> myannot_lnc <- subset(myannot,myannot$Class=="lincRNA") # Extract only 'lincRNA' entries
> myannot_allNC <- subset(myannot,!(myannot$Class=="protein_coding")) # Extract all non-coding RNAs entries
> myannot_misc <- subset(myannot,myannot$Class=="lincRNA" | myannot$Class=="antisense") # Extract entries of class 'lincRNA' and 'antisense'
Question R3: How many genes are having length greater than or equal to 600 nts from 'myannot_misc' (just report the code you used to get the numbers)?

Saving the generated results as table on your local system:
> write.table(myannot_lnc,"./myannot_lnc_res.table", sep="\t", quote=FALSE) # quote, FALSE will remove the double-quotes generated during write.table function around each entries in the result
Since the generated results will have row names in the first place without any column name/id. We are going to add one more column to the result using row names and then assign the column names as "Geneid" to that new column.
> myannot_lnc_new <- cbind(rownames(myannot_lnc),myannot_lnc) # Merging/adding one new column using row names of the same table
> colnames(myannot_lnc_new)[1] <- "Geneid" # Add column name 'Geneid' to newly added/first column
> write.table(myannot_lnc_new,"./myannot_lnc_new_res.table", sep="\t", quote=FALSE)
Question R4: Like above task save the results 'myannot_allNC' and 'myannot_misc' to a file.

Visualizations in R

Bar Graph:
> barplot(myannot_lnc$Length)
More bar plots examples.
Boxplot: (Tip)
> boxplot(myannot_lnc$Length)
> boxplot(myannot_lnc$Length, outline=FALSE)
> boxplot(myannot_misc$Length~myannot_misc$Class) # Boxplot of length by assigning 'Class' as different groups
> boxplot(myannot_misc$Length~droplevels(myannot_misc$Class)) # Same as above except that we are dropping the other classes having 'zero' values
> boxplot(myannot_misc$Length~droplevels(myannot_misc$Class), outline=FALSE) # Removing outliers
More boxplots examples.
Histogram: (Tip)
> hist(myannot_lnc$Length) # Plot the length distribution for all the genes
> hist(myannot_lnc$Length[myannot_lnc$Length<1000]) # Plot the length distribution for genes shorter than 1000 nts
More histogram and density plots examples.
Pie Chart: (Tip)
> pie(summary(myannot$Class))
More pie chart examples.
Scatter Plot: (Tip)
> plot(mymatrix$Control1,mymatrix$Control2)
> abline(h=70000) # Adding random horizontal line
> abline(v=80000) # Adding random verticle line. To make it in one step abline(h=70000,v=80000).
> mylm <- lm(mymatrix$Control1~mymatrix$Control2) # Calculating regression line
> abline(mylm, col="red") # Adding regression line to the plot
Question R5: Scatter plot Log2 transformed expression values of 'Control1' and 'Control2' from above 'mymatrix'. Report similar code as above with your answer to the question.

More information on Regression lines

Introduction to remote server (cloud)

This section is the lecture with one assignment (Main assignment 4). The course participants will get a email with all lecture slides and the assignment questions.
<< Previous: UNIX
Next: RNA-seq >>

This documentation was prepared by Santhilal Subhash. If it was useful, please write a recommendation on my Linkedin profile. Enjoy! :)