According to the R-project site :
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
Note that R is available as Free Software and runs on a wide variety of platforms UNIX, Linux, Windows or MacOS.
R is made of 5 components:
All of this makes R very difficult to master all the more that the functions produce different results in function of the parameters provided.
You can either :
For example if you execute R in a terminal you will get somehting like this and you will be in interactive mode: you have to type your calculation and press enter to get the result
richer@richer:~/dev/R/code$ R
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
> q()
Save workspace image? [y/n/c]: y
You will have to type q() ↵ (where ↵ is the ENTER key on the keyboard) to quit R.
Here is what the interface of the free version of RStudio looks like under Ubuntu 16.04:
To install RStudio under Linux Ubuntu 16.04, follow this link.
Sometimes you need to install packages that contain certain functions. Use the install.packages() command for this purpose:
install.packages('stringi')
You can define different kinds of data like boolean, string, numeric value, vector, matrix and data frames (a matrix of different value types).
In R, T represents TRUE and F FALSE. You can use the operators of the C language to create a boolean expression.
> a <- T # a is TRUE > b <- T # b is TRUE > a & b # a AND b [1] TRUE > a | b # A OR b [1] TRUE > a | !b # A OR NOT(b) [1] TRUE > a & !b # A AND NOT(b) [1] FALSE
There are many functions that operate on strings, here are some examples.
> m <- "hello world!" > nchar(m) # size of the string [1] 12 > toupper(m) # convert to uppercase [1] "HELLO WORLD!" > substr(m, 7,4) ## extract substring(string, from, to), so here it will ## not work because 4<7 [1] "" > substr(m, 7,11) ## extract substring from characters 7 to 11 [1] "world" > paste(rep("=*=", 6)) [1] "=*=" "=*=" "=*=" "=*=" "=*=" "=*=" > stringi::stri_dup("=*=",6) [1] "=*==*==*==*==*==*="
The function strsplit can cut a string in function of some pattern:
Let's consider the following string: "A␣␣␣␣text␣␣with␣spaces␣␣␣␣"
## use space to separate words > strsplit("A text with spaces ", " ") [[1]] [1] "A" "" "" "" "text" "" "with" "spaces" [9] "" "" "" "" ## use regular expression: the separator is represented by several spaces > strsplit("A text with spaces ", "[ ]+", perl = T) [[1]] [1] "A" "text" "with" "spaces"
Numeric values are defined naturally:
> a <- 3.1415 > b <- a * 2.3 - 7.55 > a [1] 3.1415 > b [1] -0.32455
You can create a list (also called a vector) using different operators :
> x <- c(1.2, 3.5, 4, -7.2) > x [1] 1.2 3.5 4.0 -7.2> y <- 1:5 > y [1] 1 2 3 4 5 ##/* create vector of 10 values initialized with 0 */ > x <- numeric(10) > x [1] 0 0 0 0 0 0 0 0 0 0
When you use seq(), you can specify to = or lengh.out = :
> seq(from = 1, to = 10, by = 0.5) [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 [16] 8.5 9.0 9.5 10.0 > seq(from = 10, to = 0, by = -2) [1] 10 8 6 4 2 0 > seq(from = 1, by = 2, length.out=3) [1] 1 3 5
You can access and modify the contents of a vector:
> x[1] [1] 1.2 > x[4] [1] -7.2 > x[4] <- 333.33 > x [1] 1.20 3.50 4.00 333.33
To get the list of variables you can use ls() and to remove a variable you can use rm()
> x <- c(1,3,5,7) > pi_div_2 = pi / 2 > pi = 3.1415 > pi_div_2 = pi / 2 > rasp_pi = 6.28 > ls() [1] "pi" "pi_div_2" "rasp_pi" "x" > ls(pattern="pi") [1] "pi" "pi_div_2" "rasp_pi" > rm(rasp_pi) > ls() [1] "pi" "pi_div_2" "x" > rm(list=ls()) > ls() character(0)
> getwd() [1] "/home/richer/public_html" > setwd("/home/richer/dev/R/code") > getwd() [1] "/home/richer/dev/R/code" > save.image() > save.image('my_env.Renv') > ls() [1] "m" "m1" "m2" "model" [5] "mydata" "my_data" "my_data2" "my_data_frame" [9] "my_model" "my_subtable" "my_table" "x" [13] "y" > rm(list=ls()) > ls() character(0) > load('my_env.Renv') > ls() [1] "m" "m1" "m2" "model" [5] "mydata" "my_data" "my_data2" "my_data_frame" [9] "my_model" "my_subtable" "my_table" "x" [13] "y"
Imagine you have the following data that you want to process:
age | gender | weight |
25 | male | 160 |
30 | female | 110 |
56 | male | 220 |
You can define them writing R code like this:
# create a data frame from scratch age <- c(25, 30, 56) gender <- c("male", "female", "male") weight <- c(160, 110, 220) mydata <- data.frame(age,gender,weight)
Or you can define a data frame that you can later fill with the edit() function:
# enter data using editor mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0)) # make assignment to save data mydata <- edit(mydata) # save in a file > save(mydata,file="mydata.Rdata")
You can put the data in a file either a table or a CSV file as follows.
If you define the data as a table use the space as a separator and let's call the file mydata.Rtable:
age gender weight
25 male 160
30 female 110
56 male 220
Then you can read the table with the read.table() method, don't forget to set the header = T in order for the first line of the file to be considered as the names of the columns and not part of the data.
mydata <- read.table("mydata.Rtable", header = T) > mydata age gender weight 1 25 male 160 2 30 female 110 3 56 male 220
age, gender, weight
25, male, 160
30, female, 110
56, male, 220
Then you can read the file using the read.csv() method and the first line will be considered as the header.
mydata <- read.csv("mydata.csv") > mydata age gender weight 1 25 male 160 2 30 female 110 3 56 male 220
Exercice 1.1
Define the following data as a table recorded in the file
> persons <- c("John", "Jack", "Bill", "Kim", "Jason", "Michael", "Britney", "Julia") > length(persons) [1] 8 > persons[c(4:6)] [1] "Kim" "Jason" "Michael" > persons[c(rep(TRUE,2),rep(FALSE,3),rep(T,5))] [1] "John" "Jack" "Michael" "Britney" "Julia" NA NA
> weights <- c(77, 58, 66, 82) > names(weights) <- c("John", "Jack", "Bill", "Kim") > weights John Jack Bill Kim 77 58 66 82 > weights["Bill"] Bill 66 > weights[c("John","Bill")] John Bill 77 66
> weights <- c(77, 58, 66, 82) > names(weights) <- c("John", "Jack", "Bill", "Kim") > weights John Jack Bill Kim 77 58 66 82 > weights+1 John Jack Bill Kim 78 59 67 83 > weights+10 John Jack Bill Kim 87 68 76 92 > weights+c(10,20) John Jack Bill Kim 87 78 76 102 > weights > 68 John Jack Bill Kim TRUE FALSE FALSE TRUE > 77 %in% weights # use operator x %in% v (is x inside vector v) [1] TRUE
> weights <- c(85,78,54,98,66,78,77,62,89,92,76,77,55,68); > sort(weights) [1] 54 55 62 66 68 76 77 77 78 78 85 89 92 98 > sort(weights[3:8]) [1] 54 62 66 77 78 98 > order(weights) [1] 3 13 8 5 14 11 7 12 2 6 1 9 10 4
> weights <- c(85,78,54,98,66,78,77,62,89,92,76,77,55,68); > sum(weights) [1] 1055 > mean(weights) [1] 75.35714 > var(weights) # variance [1] 175.3242 > sd(weights) # standard deviation [1] 13.241 > median(weights) [1] 77 > min(weights) [1] 54 > max(weights) [1] 98 > quantile(weights) 0% 25% 50% 75% 100% 54.00 66.50 77.00 83.25 98.00 > summary(weights) Min. 1st Qu. Median Mean 3rd Qu. Max. 54.00 66.50 77.00 75.36 83.25 98.00
You can use:
> id <- diag(4) # create identity matrix of size 4x4 > id [,1] [,2] [,3] [,4] [1,] 1 0 0 0 [2,] 0 1 0 0 [3,] 0 0 1 0 [4,] 0 0 0 1 > m1 <- matrix(1:10, nrow=2, ncol=5) > m1 [,1] [,2] [,3] [,4] [,5] [1,] 1 3 5 7 9 [2,] 2 4 6 8 10 > m2 <- matrix(c(1,-1,2,-2,3,-3), nrow=2, ncol=3) > m2 [,1] [,2] [,3] [1,] 1 2 3 [2,] -1 -2 -3 > print(readLines("matrix.Rdata")) [1] "1 2 3 4 5 6 7 8 9 10 11 12" > m3 <- matrix(scan("matrix.Rdata"), nrow=4, ncol=3) Read 12 items > m3 [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12
Combine matrices horizontally (cbind) or vertically (rbind)
> x1 <- c(1,2,3) > x2 <- c(4,5,6) > cbind(x1,x2) x1 x2 [1,] 1 4 [2,] 2 5 [3,] 3 6 > rbind(x1,x2) [,1] [,2] [,3] x1 1 2 3 x2 4 5 6 > m <- matrix(10:15, nrow=3, ncol=2) > m [,1] [,2] [1,] 10 13 [2,] 11 14 [3,] 12 15 > cbind(m,x1) x1 [1,] 10 13 1 [2,] 11 14 2 [3,] 12 15 3
Use +, -, %*% and %/%.
> m1 <- matrix(1:16, nrow=4, ncol=4) > m2 <- matrix(16:1, nrow=4, ncol=4) > m1 + m2 [,1] [,2] [,3] [,4] [1,] 17 17 17 17 [2,] 17 17 17 17 [3,] 17 17 17 17 [4,] 17 17 17 17 > m1 - m2 [,1] [,2] [,3] [,4] [1,] -15 -7 1 9 [2,] -13 -5 3 11 [3,] -11 -3 5 13 [4,] -9 -1 7 15 > m1 %*% m2 [,1] [,2] [,3] [,4] [1,] 386 274 162 50 [2,] 444 316 188 60 [3,] 502 358 214 70 [4,] 560 400 240 80 > m1 %/% m2 [,1] [,2] [,3] [,4] [1,] 0 0 1 3 [2,] 0 0 1 4 [3,] 0 0 1 7 [4,] 0 0 2 16
Use the t() function.
> m1 <- matrix(1:15, nrow=3, ncol=5) > m1 [,1] [,2] [,3] [,4] [,5] [1,] 1 4 7 10 13 [2,] 2 5 8 11 14 [3,] 3 6 9 12 15 > t(m1) [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 [4,] 10 11 12 [5,] 13 14 15
> m1 <- matrix(1:15, nrow=3, ncol=5) > m1 [,1] [,2] [,3] [,4] [,5] [1,] 1 4 7 10 13 [2,] 2 5 8 11 14 [3,] 3 6 9 12 15 > rowSums(m1) [1] 35 40 45 > colSums(m1) [1] 6 15 24 33 42 > rowMeans(m1) [1] 7 8 9
We want to solve the following equation $y = m * x$ where $x$ and $y$ are vectors and $m$ is a matrix. We only know $m$ and $y$ and we want to find $x$. For this we can use the solve() function:
$[\table -8;13;31;] = [\table 1, 2, -1; 3, 4, 7; -5, 1, 6 ] × [\table x_1; x_2; x_3]$
> m <- matrix(c(1, 2, -1, 3, 4, 7, -5, 1, 6), nrow=3, ncol=3) > y <- c(-8,13,31) > x <- solve(m,y) [,1] [1,] 1 [2,] 2 [3,] 3 > x [1] 1 2 3
or we could compute $x = m^{-1} × y$, where $m^{-1}$ is the inverse of $m$. To do this you will need the ginv function from the MASS package.
# do this once ! > install.packages('MASS') ... > inv_m <- MASS::ginv(m) # compute inverse of matrix m > inv_m [,1] [,2] [,3] [1,] -0.1517857 0.473214286 -0.20535714 [2,] 0.1160714 -0.008928571 0.09821429 [3,] -0.1607143 0.089285714 0.01785714 > x <- inv_m %*% y > x [,1] [1,] 1 [2,] 2 [3,] 3
Plotting with R (free chapter)
curve(sin(x)/x,-10,10,main="\frac(sin(x)/x)") x <- c(10,20,30,40,50) y <- c(100, 205, 297, 410, 501) png("graph1.png", width=400, height=300) plot(x,z,type="b") dev.off()Consider three series of data that we want to compare, we want to know if the data of $x$ are related to $y$ or $z$ via a linear relationship.
> x <- c(10,20,30,40,50) > y <- c(105, 210, 302, 415, 506) > z <- c(10,60,80,40,25) > cor(x,y) # x and y are strongly related (value close to 1.0) [1] 0.9995255 > cor(x,z) [1] 0.05698029 # x and z are not related linearly (value close to 0.0)
Let us determine if $x$ is related to $y$, by performing a linear regression that will try to find a formula to predict $x$ from $y$ of the form $x' = c_0 + c_1 × y$ :
> x <- c(10,20,30,40,50) > y <- c(105, 210, 302, 415, 506) > z<-c(60, 16, 1, 18, 64) # apply linear regression > fit <- lm(x ~ y) > summary(fit) Call: lm(formula = x ~ y) Residuals: 1 2 3 4 5 0.1001 -0.3170 0.5556 -0.6552 0.3166 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.517196 0.598378 -0.864 0.451 y 0.099211 0.001765 56.205 1.24e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5624 on 3 degrees of freedom Multiple R-squared: 0.9991, Adjusted R-squared: 0.9987 F-statistic: 3159 on 1 and 3 DF, p-value: 1.241e-05
From the previous calculation we see that the p-value is close to 0 and the $R^2$ value is close to 1.0 which means that the data are correlated by a linear relation.
The p-value helps you determine the significance of your results, it is a probability (between 0 and 1.0) that the difference observed between data is due to randomness. In other words:
By obtaining the coefficients (see below) we can determine that $x = -0.51719586 + 0.09921065 * y$.
> c = coefficients(fit) > c (Intercept) y -0.51719586 0.09921065 # define function f > f <- function(y) { + return (-0.51719586 + 0.09921065 * y) + } # apply function f to each element of y > sapply(y,f) [1] 9.899922 20.317041 29.444420 40.655224 49.683393
Finally the formula $f$ gives use values close to the ones of $x$.
If you apply the same reasoning between $x$ and $z$ then you will find:
> fit <- lm(x ~ z) > summary(fit) Call: lm(formula = x ~ z) Residuals: 1 2 3 4 5 -20.8756 -9.5094 0.9563 10.4285 19.0002 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 29.01267 13.06874 2.220 0.113 z 0.03105 0.32120 0.097 0.929 Residual standard error: 18.23 on 3 degrees of freedom Multiple R-squared: 0.003105, Adjusted R-squared: -0.3292 F-statistic: 0.009343 on 1 and 3 DF, p-value: 0.9291
Here the p-values are reater than 0.05 and $R^2$ is far from 1.0, so the data are not related by a linear function.
> x [1] 10 20 30 40 50 > z [1] 60 16 1 18 64 fit <- nls(z ~ a + b * ((x -c)^2), start=list(a=-10, b=-50,c=0)) > summary(fit) Formula: z ~ a + b * ((x - c)^2) Parameters: Estimate Std. Error t value Pr(>|t|) a 1.497776 0.353335 4.239 0.0514 . b 0.151429 0.001355 111.734 8.01e-05 *** c 29.669811 0.053030 559.487 3.19e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5071 on 2 degrees of freedom Number of iterations to convergence: 3 Achieved convergence tolerance: 4.058e-06 > coefficients(fit) a b c 1.4977757 0.1514286 29.6698114 > f <- function(x) 1.4977757 + 0.1514286 *((x - 29.6698114)^2) > sapply(x, f) [1] 60.085725 15.657145 1.514285 17.657145 64.085725 > c <- coefficients(fit) > f <- function(x) c[1] + c[2] *((x - c[3])^2) > sapply(x, f) a a a a a 60.085713 15.657142 1.514285 17.657142 64.085712