1. Introduction to R

Introduction

What is R ?

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

Note that R is available as Free Software and runs on a wide variety of platforms UNIX, Linux, Windows or MacOS.

R is made of 5 components:

an effective data handling and storage facility (list, vectors, matrices)
a suite of operators for calculations on vector and matrices
a large collection of intermediate tools for data analysis
graphical facilities for data analysis
a simple and effective programming language

All of this makes R very difficult to master all the more that the functions produce different results in function of the parameters provided.

How can I use R ?

You can either :

use the text interface with the program R in a terminal
use the Rscript program to execute scripts
use the graphical interface called R-studio

Terminal

For example if you execute R in a terminal you will get somehting like this and you will be in interactive mode: you have to type your calculation and press enter to get the result

richer@richer:~/dev/R/code$ R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> q()
Save workspace image? [y/n/c]: y

You will have to type q() ↵ (where ↵ is the ENTER key on the keyboard) to quit R.

RStudio

Here is what the interface of the free version of RStudio looks like under Ubuntu 16.04:

To install RStudio under Linux Ubuntu 16.04, follow this link.

Packages

Sometimes you need to install packages that contain certain functions. Use the install.packages() command for this purpose:

install.packages('stringi')

How to define data ?

You can define different kinds of data like boolean, string, numeric value, vector, matrix and data frames (a matrix of different value types).

Boolean variables

In R, T represents TRUE and F FALSE. You can use the operators of the C language to create a boolean expression.

> a <- T    # a is TRUE
> b <- T    # b is TRUE
> a & b     # a AND b
[1] TRUE
> a | b     # A OR b
[1] TRUE
> a | !b    # A OR NOT(b)
[1] TRUE
> a & !b    # A AND NOT(b)
[1] FALSE

Strings

There are many functions that operate on strings, here are some examples.

> m <- "hello world!"
> nchar(m)   # size of the string
[1] 12
> toupper(m)    # convert to uppercase
[1] "HELLO WORLD!"
> substr(m, 7,4)  ## extract substring(string, from, to), so here it will 
                  ## not work because 4<7
[1] ""
> substr(m, 7,11) ## extract substring from characters 7 to 11
[1] "world"
> paste(rep("=*=", 6))
[1] "=*=" "=*=" "=*=" "=*=" "=*=" "=*="
> stringi::stri_dup("=*=",6)
[1] "=*==*==*==*==*==*="

The function strsplit can cut a string in function of some pattern:

Let's consider the following string: "A␣␣␣␣text␣␣with␣spaces␣␣␣␣"

## use space to separate words
> strsplit("A    text  with spaces     ", " ") 
[[1]]
 [1] "A"      ""       ""       ""       "text"   ""       "with"   "spaces"
 [9] ""       ""       ""       ""      
 
## use regular expression: the separator is represented by several spaces
> strsplit("A    text  with spaces     ", "[ ]+", perl = T) 
[[1]]
[1] "A"      "text"   "with"   "spaces"

Numbers

Numeric values are defined naturally:

> a <- 3.1415
> b <- a * 2.3 - 7.55
> a
[1] 3.1415
> b
[1] -0.32455

Vectors

You can create a list (also called a vector) using different operators :

use c(...) for unordered data : c(1,5,-7,3)
use x:y to create ordered vector of integers: 1:5
use seq() for oredered sequences of numbers
use numeric() to define a vector of numbers with initial value of 0 for each element

> x <- c(1.2, 3.5, 4, -7.2)
> x
[1]  1.2  3.5  4.0 -7.2> 
y <- 1:5
> y
[1] 1 2 3 4 5
##/* create vector of 10 values initialized with 0 */
> x <- numeric(10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

When you use seq(), you can specify to = or lengh.out = :

> seq(from = 1, to = 10, by = 0.5)
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0
> seq(from = 10, to = 0, by = -2)
[1] 10  8  6  4  2  0
> seq(from = 1, by = 2, length.out=3)
[1] 1 3 5

You can access and modify the contents of a vector:

> x[1]
[1] 1.2
> x[4]
[1] -7.2
> x[4] <- 333.33
> x
[1]   1.20   3.50   4.00 333.33

Variables

To get the list of variables you can use ls() and to remove a variable you can use rm()

> x <- c(1,3,5,7)
> pi_div_2 = pi / 2
> pi = 3.1415
> pi_div_2 = pi / 2
> rasp_pi = 6.28
> ls()
[1] "pi"       "pi_div_2" "rasp_pi"  "x"       
> ls(pattern="pi")
[1] "pi"       "pi_div_2" "rasp_pi" 
> rm(rasp_pi)
> ls()
[1] "pi"       "pi_div_2" "x"       
> rm(list=ls())
> ls()
character(0)

Save and restore environment

> getwd()
[1] "/home/richer/public_html"
> setwd("/home/richer/dev/R/code")
> getwd()
[1] "/home/richer/dev/R/code"
> save.image()
> save.image('my_env.Renv')
> ls()
 [1] "m"             "m1"            "m2"            "model"        
 [5] "mydata"        "my_data"       "my_data2"      "my_data_frame"
 [9] "my_model"      "my_subtable"   "my_table"      "x"            
[13] "y"            
> rm(list=ls())
> ls()
character(0)
> load('my_env.Renv')
> ls()
 [1] "m"             "m1"            "m2"            "model"        
 [5] "mydata"        "my_data"       "my_data2"      "my_data_frame"
 [9] "my_model"      "my_subtable"   "my_table"      "x"            
[13] "y"

Input from code, keyboard, CSV or Table

Imagine you have the following data that you want to process:

Persons
age	gender	weight
25	male	160
30	female	110
56	male	220

Define data from R code

You can define them writing R code like this:

# create a data frame from scratch 
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220) 
mydata <- data.frame(age,gender,weight)

Enter data from keyboard

Or you can define a data frame that you can later fill with the edit() function:

# enter data using editor 
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
# make assignment to save data
mydata <- edit(mydata)
# save in a file
> save(mydata,file="mydata.Rdata")

From file

You can put the data in a file either a table or a CSV file as follows.

Define data as a table

If you define the data as a table use the space as a separator and let's call the file mydata.Rtable:

age gender weight
25 male   160
30 female 110
56 male   220

Then you can read the table with the read.table() method, don't forget to set the header = T in order for the first line of the file to be considered as the names of the columns and not part of the data.

mydata <- read.table("mydata.Rtable", header = T)
> mydata
  age gender weight
1  25   male    160
2  30 female    110
3  56   male    220

Define data as a CSV (Comma Separated Values) file

age, gender, weight
25, male,   160
30, female, 110
56, male,   220

Then you can read the file using the read.csv() method and the first line will be considered as the header.

mydata <- read.csv("mydata.csv")
> mydata
  age gender weight
1  25   male    160
2  30 female    110
3  56   male    220

Exercises

Exercice 1.1

Define the following data as a table recorded in the file

Output

Operations on vectors

Select elements

> persons <- c("John", "Jack", "Bill", "Kim", "Jason", "Michael", "Britney", "Julia")
> length(persons)
[1] 8
 
> persons[c(4:6)]
[1] "Kim"     "Jason"   "Michael"
> persons[c(rep(TRUE,2),rep(FALSE,3),rep(T,5))]
[1] "John"    "Jack"    "Michael" "Britney" "Julia"   NA        NA

Name elements

> weights <- c(77, 58, 66, 82)
> names(weights) <- c("John", "Jack", "Bill", "Kim")
> weights
John Jack Bill  Kim 
  77   58   66   82 
> weights["Bill"]
Bill 
  66 
> weights[c("John","Bill")]
John Bill 
  77   66

Arithmetic operations and comparisons

> weights <- c(77, 58, 66, 82)
> names(weights) <- c("John", "Jack", "Bill", "Kim")
> weights
John Jack Bill  Kim 
  77   58   66   82 
> weights+1
John Jack Bill  Kim 
  78   59   67   83 
> weights+10
John Jack Bill  Kim 
  87   68   76   92 
> weights+c(10,20)
John Jack Bill  Kim 
  87   78   76  102 
> weights > 68
 John  Jack  Bill   Kim 
 TRUE FALSE FALSE  TRUE
> 77 %in% weights        # use operator x %in% v (is x inside vector v)
[1] TRUE

Sort, order

> weights <- c(85,78,54,98,66,78,77,62,89,92,76,77,55,68);
> sort(weights)
 [1] 54 55 62 66 68 76 77 77 78 78 85 89 92 98
> sort(weights[3:8])
[1] 54 62 66 77 78 98
> order(weights)
 [1]  3 13  8  5 14 11  7 12  2  6  1  9 10  4

Sum, mean, median, min, max, quantile, var, sd, summary

> weights <- c(85,78,54,98,66,78,77,62,89,92,76,77,55,68);
> sum(weights)
[1] 1055
> mean(weights)
[1] 75.35714
> var(weights)     # variance
[1] 175.3242
> sd(weights)      # standard deviation
[1] 13.241
> median(weights)
[1] 77
> min(weights)
[1] 54
> max(weights)
[1] 98
> quantile(weights)
   0%   25%   50%   75%  100% 
54.00 66.50 77.00 83.25 98.00 
> summary(weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54.00   66.50   77.00   75.36   83.25   98.00

Operations on matrices

Define a matrix

You can use:

diag() to define an identity matrix
or the matrix() function for which you can specify the number or rows and columns

> id <- diag(4)  # create identity matrix of size 4x4
> id
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
 
> m1 <- matrix(1:10, nrow=2, ncol=5)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> m2 <- matrix(c(1,-1,2,-2,3,-3), nrow=2, ncol=3)
> m2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   -1   -2   -3
 
> print(readLines("matrix.Rdata"))
[1] "1  2  3  4  5  6  7  8  9 10 11 12"
> m3 <- matrix(scan("matrix.Rdata"), nrow=4, ncol=3)
Read 12 items
> m3
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

cbind, rbind

Combine matrices horizontally (cbind) or vertically (rbind)

> x1 <- c(1,2,3)
> x2 <- c(4,5,6)
> cbind(x1,x2)
     x1 x2
[1,]  1  4
[2,]  2  5
[3,]  3  6
> rbind(x1,x2)
   [,1] [,2] [,3]
x1    1    2    3
x2    4    5    6
 
> m <- matrix(10:15, nrow=3, ncol=2)
> m
     [,1] [,2]
[1,]   10   13
[2,]   11   14
[3,]   12   15
> cbind(m,x1)
           x1
[1,] 10 13  1
[2,] 11 14  2
[3,] 12 15  3

Sum, difference, product, division

Use +, -, %*% and %/%.

> m1 <- matrix(1:16, nrow=4, ncol=4)
> m2 <- matrix(16:1, nrow=4, ncol=4)
> m1 + m2
     [,1] [,2] [,3] [,4]
[1,]   17   17   17   17
[2,]   17   17   17   17
[3,]   17   17   17   17
[4,]   17   17   17   17
> m1 - m2
     [,1] [,2] [,3] [,4]
[1,]  -15   -7    1    9
[2,]  -13   -5    3   11
[3,]  -11   -3    5   13
[4,]   -9   -1    7   15
> m1 %*% m2
     [,1] [,2] [,3] [,4]
[1,]  386  274  162   50
[2,]  444  316  188   60
[3,]  502  358  214   70
[4,]  560  400  240   80
> m1 %/% m2
     [,1] [,2] [,3] [,4]
[1,]    0    0    1    3
[2,]    0    0    1    4
[3,]    0    0    1    7
[4,]    0    0    2   16

Transpose

Use the t() function.

> m1 <- matrix(1:15, nrow=3, ncol=5)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
> t(m1)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
[5,]   13   14   15

rowSums, rowMeans, colSums, colMeans

> m1 <- matrix(1:15, nrow=3, ncol=5)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
> rowSums(m1)
[1] 35 40 45
> colSums(m1)
[1]  6 15 24 33 42
> rowMeans(m1)
[1] 7 8 9

Equations

We want to solve the following equation $y = m * x$ where $x$ and $y$ are vectors and $m$ is a matrix. We only know $m$ and $y$ and we want to find $x$. For this we can use the solve() function:

$[\table -8;13;31;] = [\table 1, 2, -1; 3, 4, 7; -5, 1, 6 ] × [\table x_1; x_2; x_3]$

> m <- matrix(c(1, 2, -1, 3, 4, 7, -5, 1, 6), nrow=3, ncol=3)
> y <- c(-8,13,31)
> x <- solve(m,y)
     [,1]
[1,]    1
[2,]    2
[3,]    3
 
> x
[1] 1 2 3

or we could compute $x = m^{-1} × y$, where $m^{-1}$ is the inverse of $m$. To do this you will need the ginv function from the MASS package.

# do this once !
> install.packages('MASS')
...
> inv_m <- MASS::ginv(m)  # compute inverse of matrix m
> inv_m
           [,1]         [,2]        [,3]
[1,] -0.1517857  0.473214286 -0.20535714
[2,]  0.1160714 -0.008928571  0.09821429
[3,] -0.1607143  0.089285714  0.01785714
>  x <- inv_m %*% y
> x
     [,1]
[1,]    1
[2,]    2
[3,]    3

Plot

Plotting with R (free chapter)

curve(sin(x)/x,-10,10,main="\frac(sin(x)/x)") x <- c(10,20,30,40,50) y <- c(100, 205, 297, 410, 501) png("graph1.png", width=400, height=300) plot(x,z,type="b") dev.off()

Statistics

Linear regression

Consider three series of data that we want to compare, we want to know if the data of $x$ are related to $y$ or $z$ via a linear relationship.

> x <- c(10,20,30,40,50)
> y <- c(105, 210, 302, 415, 506)
> z <- c(10,60,80,40,25)
> cor(x,y)               # x and y are strongly related (value close to 1.0)
[1] 0.9995255
> cor(x,z)
[1] 0.05698029           # x and z are not related linearly (value close to 0.0)

Simple linear regression

Let us determine if $x$ is related to $y$, by performing a linear regression that will try to find a formula to predict $x$ from $y$ of the form $x' = c_0 + c_1 × y$ :

> x <- c(10,20,30,40,50)
> y <- c(105, 210, 302, 415, 506)
> z<-c(60, 16, 1, 18, 64)
# apply linear regression
> fit <- lm(x ~ y)
> summary(fit)
 
Call:
lm(formula = x ~ y)
 
Residuals:
      1       2       3       4       5 
 0.1001 -0.3170  0.5556 -0.6552  0.3166 
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.517196   0.598378  -0.864    0.451    
y            0.099211   0.001765  56.205 1.24e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.5624 on 3 degrees of freedom
Multiple R-squared:  0.9991,	Adjusted R-squared:  0.9987 
F-statistic:  3159 on 1 and 3 DF,  p-value: 1.241e-05

From the previous calculation we see that the p-value is close to 0 and the $R^2$ value is close to 1.0 which means that the data are correlated by a linear relation.

The p-value helps you determine the significance of your results, it is a probability (between 0 and 1.0) that the difference observed between data is due to randomness. In other words:

if the p-value is less than a threshold $α$ (generally 5%), so if $p < α = 0.05$, it is said that you reject the null hypothesis $H_0$, in plain language terms the null hypothesis is usually an hypothesis of no difference, so you accept the alternative hypothesis $H_1$ which is the opposite of the null hypothesis
if $p > α$ then you can consider that the data are not related

By obtaining the coefficients (see below) we can determine that $x = -0.51719586 + 0.09921065 * y$.

> c = coefficients(fit)
> c
(Intercept)           y 
-0.51719586  0.09921065
# define function f
>  f <- function(y) {
+ return (-0.51719586 + 0.09921065 * y)
+ }
# apply function f to each element of y
> sapply(y,f)
[1]  9.899922 20.317041 29.444420 40.655224 49.683393

Finally the formula $f$ gives use values close to the ones of $x$.

If you apply the same reasoning between $x$ and $z$ then you will find:

> fit <- lm(x ~ z)
> summary(fit)
 
Call:
lm(formula = x ~ z)
 
Residuals:
       1        2        3        4        5 
-20.8756  -9.5094   0.9563  10.4285  19.0002 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.01267   13.06874   2.220    0.113
z            0.03105    0.32120   0.097    0.929
 
Residual standard error: 18.23 on 3 degrees of freedom
Multiple R-squared:  0.003105,	Adjusted R-squared:  -0.3292 
F-statistic: 0.009343 on 1 and 3 DF,  p-value: 0.9291

Here the p-values are reater than 0.05 and $R^2$ is far from 1.0, so the data are not related by a linear function.

Multiple linear regression

Multiple non linear regression

> x
[1] 10 20 30 40 50
> z
[1] 60 16  1 18 64
 
fit <- nls(z ~ a +  b * ((x -c)^2), start=list(a=-10, b=-50,c=0))
> summary(fit)
 
Formula: z ~ a + b * ((x - c)^2)
 
Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a  1.497776   0.353335   4.239   0.0514 .  
b  0.151429   0.001355 111.734 8.01e-05 ***
c 29.669811   0.053030 559.487 3.19e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.5071 on 2 degrees of freedom
 
Number of iterations to convergence: 3 
Achieved convergence tolerance: 4.058e-06
 
> coefficients(fit)
         a          b          c 
 1.4977757  0.1514286 29.6698114 
> f <- function(x) 1.4977757 + 0.1514286 *((x - 29.6698114)^2) 
> sapply(x, f)
[1] 60.085725 15.657145  1.514285 17.657145 64.085725
 
> c <- coefficients(fit)
> f <- function(x) c[1] + c[2] *((x - c[3])^2) 
> sapply(x, f)
        a         a         a         a         a 
60.085713 15.657142  1.514285 17.657142 64.085712

Page de Jean-Michel Richer