3.1. How to define data?

R lets you define several kinds of data: booleans, strings, numeric values, vectors, matrices and data frames (tables whose columns may hold different value types).

3.1.1. Boolean variables

In R, T and F are predefined shortcuts for TRUE and FALSE. Boolean expressions are built with operators borrowed from the C language: & (AND), | (OR) and ! (NOT).

> a <- T    # a is TRUE
> b <- T    # b is TRUE
> a & b     # a AND b
[1] TRUE
> a | b     # a OR b
[1] TRUE
> a | !b    # a OR NOT(b)
[1] TRUE
> a & !b    # a AND NOT(b)
[1] FALSE
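
The same operators also work element by element on logical vectors. The sketch below illustrates this; the vectors and result names are made up for the illustration, and xor(), any() and all() are base R functions that are not used in the example above.

```r
# Two logical vectors of the same length
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

both      <- a & b      # element-wise AND: TRUE FALSE FALSE
either    <- a | b      # element-wise OR:  TRUE TRUE TRUE
exclusive <- xor(a, b)  # TRUE where exactly one side is TRUE: FALSE TRUE TRUE

# any() and all() reduce a logical vector to a single value
any_true <- any(a)      # TRUE: at least one element is TRUE
all_true <- all(a)      # FALSE: not every element is TRUE

print(both)
print(either)
print(exclusive)
```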

3.1.2. Strings

Many functions operate on strings. Here are some examples:

> m <- "hello world!"
> nchar(m)              # number of characters in the string
[1] 12
> toupper(m)            # convert to uppercase
[1] "HELLO WORLD!"
> substr(m, 7, 4)       # substr(string, start, stop): returns "" because stop < start
[1] ""
> substr(m, 7, 11)      # extract characters 7 to 11
[1] "world"
> paste(rep("=*=", 6))  # without collapse, paste() keeps the 6 copies separate
[1] "=*=" "=*=" "=*=" "=*=" "=*=" "=*="
> stringi::stri_dup("=*=", 6)  # repeat and concatenate (requires the stringi package)
[1] "=*==*==*==*==*==*="
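
If the stringi package is not installed, the same repeated string can probably be obtained with base R alone. The following sketch assumes a reasonably recent version of R (strrep() is not available in very old releases); the variable names are only illustrative.

```r
sep <- "=*="

# paste() with collapse = "" glues the six copies into one string
line1 <- paste(rep(sep, 6), collapse = "")

# strrep() repeats a string directly
line2 <- strrep(sep, 6)

# paste() can also join several values with a chosen separator
greeting <- paste("hello", "world", sep = ", ")

# sprintf() builds a string from a format and values
label <- sprintf("%s has %d characters", "hello", nchar("hello"))

print(line1)
print(greeting)
print(label)
```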

The function strsplit() splits a string according to a pattern.

Consider the following string, in which ␣ stands for a space: "A␣␣␣␣text␣␣with␣spaces␣␣␣␣␣"

## use a single space as separator: consecutive spaces produce empty strings
> strsplit("A    text  with spaces     ", " ") 
[[1]]
 [1] "A"      ""       ""       ""       "text"   ""       "with"   "spaces"
 [9] ""       ""       ""       ""      
 
## use a regular expression: the separator is one or more spaces
> strsplit("A    text  with spaces     ", "[ ]+", perl = T) 
[[1]]
[1] "A"      "text"   "with"   "spaces"
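
strsplit() returns a list with one element per input string, so the words themselves are usually taken with [[1]]. The sketch below shows two ways to get rid of the empty strings; trimws() and the variable names are additions for this illustration.

```r
s <- "A    text  with spaces     "

# Option 1: trim the ends, then split on runs of spaces
words <- strsplit(trimws(s), " +")[[1]]

# Option 2: split on every single space, then drop the empty pieces
pieces    <- strsplit(s, " ")[[1]]
non_empty <- pieces[pieces != ""]

print(words)
print(length(words))
```

Trimming first matters when the string starts with spaces, because a leading separator would otherwise produce an empty first element.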

3.1.3. Numbers

Numeric values are defined naturally:

> a <- 3.1415
> b <- a * 2.3 - 7.55
> a
[1] 3.1415
> b
[1] -0.32455
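
By default a number typed at the prompt is stored as a double-precision value; an integer needs the L suffix. The following sketch illustrates this together with integer division and a safe comparison of floating-point values. All names are illustrative.

```r
a <- 3.1415
n <- 3L

class(a)   # "numeric" (stored as a double)
class(n)   # "integer"

quotient  <- 7 %/% 2   # integer division: 3
remainder <- 7 %% 2    # remainder: 1

b <- round(a * 2.3 - 7.55, 2)   # round to 2 decimals: -0.32

# Floating-point results should not be compared with ==
naive <- (0.1 + 0.2) == 0.3                 # FALSE because of rounding errors
safe  <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal within a tolerance

print(c(quotient, remainder, b))
```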

3.1.4. Vectors

You can create a vector (an ordered collection of values of the same type) in several ways:

use c(...) to combine arbitrary values: c(1,5,-7,3)
use x:y to create a vector of consecutive integers: 1:5
use seq() for regularly spaced sequences of numbers
use numeric() to define a vector of numbers with each element initialized to 0
use rep() to replicate values

> x <- c(1.2, 3.5, 4, -7.2)
> x
[1]  1.2  3.5  4.0 -7.2
> y <- 1:5
> y
[1] 1 2 3 4 5
# create a vector of 10 values initialized to 0
# (named z so that x keeps the values defined above)
> z <- numeric(10)
> z
 [1] 0 0 0 0 0 0 0 0 0 0

When you use seq(), you can specify either to = or length.out =:

> seq(from = 1, to = 10, by = 0.5)
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0
> seq(from = 10, to = 0, by = -2)
[1] 10  8  6  4  2  0
> seq(from = 1, by = 2, length.out=3)
[1] 1 3 5
 
# repeat 'F' four times
> rep('F', 4)
[1] "F" "F" "F" "F"
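
A few more variations of seq() and rep() are sketched below. The variable names are illustrative; seq_len() and the times and each arguments belong to base R.

```r
# 5 evenly spaced values between 0 and 1
evenly <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# integers from 1 to n
first4 <- seq_len(4)                              # 1 2 3 4

# repeat the whole vector, or each element in turn
cycled  <- rep(c(1, 2), times = 3)                # 1 2 1 2 1 2
grouped <- rep(c(1, 2), each = 3)                 # 1 1 1 2 2 2

print(evenly)
print(cycled)
print(grouped)
```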

You can access and modify the elements of a vector. Note that indexing starts at 1. The examples below use x <- c(1.2, 3.5, 4, -7.2):

> x[1]
[1] 1.2
> x[4]
[1] -7.2
> x[4] <- 333.33
> x
[1]   1.20   3.50   4.00 333.33
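
Indexing is not limited to a single position. The sketch below shows several positions at once, negative indices, a logical condition, and what happens when you assign past the end of the vector. The result names are illustrative.

```r
x <- c(1.2, 3.5, 4, -7.2)

some    <- x[c(1, 3)]   # elements 1 and 3: 1.2 4.0
but_one <- x[-1]        # everything except element 1: 3.5 4.0 -7.2
large   <- x[x > 2]     # elements greater than 2: 3.5 4.0
n       <- length(x)    # number of elements: 4

# Assigning beyond the end extends the vector; the gap is filled with NA
x[6] <- 9
print(x)
```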

3.1.5. Variables

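To list the variables defined in the current environment, use ls(); to remove a variable, use rm(). See the section "Manage, save and restore environment" for details.

> x <- c(1,3,5,7)
> pi_div_2 = pi / 2    # uses the built-in constant pi
> pi = 3.1415          # masks the built-in pi with a user variable
> pi_div_2 = pi / 2    # recomputed with the new value of pi
> rasp_pi = 6.28
 
# a dot and an underscore are different characters: these are two different variables
> my.long.variable=1
> my_long_variable=2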
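
The effect of ls() and rm() can be sketched as follows. The name demo_value is hypothetical, and exists() is a base R function used here only to check the result.

```r
demo_value <- 42

listed <- "demo_value" %in% ls()   # TRUE: the variable appears in ls()
before <- exists("demo_value")     # TRUE

rm(demo_value)                     # remove the variable
after <- exists("demo_value")      # FALSE

# Redefining pi only masks the built-in constant;
# removing the user variable makes the built-in value visible again
pi <- 3.1415
rm(pi)
restored <- isTRUE(all.equal(pi, base::pi))   # TRUE

print(c(listed, before, after, restored))
```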

3.1.6. Data frame

A data frame stores a data table. It is a list of vectors of equal length, one vector per column. The following example shows how to handle a data frame that describes Intel CPUs with these columns:

name of the CPU
launch date: Intel gives it as a quarter, for example Q2'14, which we translate into the first day of that quarter (1 April 2014)
thermal dissipation power in watts
lithography in nm
number of cores
number of threads

cpus <- c("i7-4790", "i5-3570K", "i5-7400", "i7-2600")
launch.date <- c("2014-04-01", "2012-04-01", "2017-01-01", "2011-01-01")
tdp <- c(84,77,64,95)
litho <- c(22,22,14,32)
cores <- c(4,4,4,4)
threads <- c(8,4,4,8)
mydata <- data.frame(cpus, launch.date, tdp, litho, cores, threads, stringsAsFactors = FALSE)
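
In the definition above the launch dates are stored as plain strings. One possible refinement, sketched below on a reduced copy of the table, is to convert them with as.Date() so that they can be compared as dates. The conversion and the names launch_year and oldest are additions for this illustration.

```r
cpus <- c("i7-4790", "i5-3570K", "i5-7400", "i7-2600")
launch.date <- c("2014-04-01", "2012-04-01", "2017-01-01", "2011-01-01")
tdp <- c(84, 77, 64, 95)
mydata <- data.frame(cpus, launch.date, tdp, stringsAsFactors = FALSE)

# str() shows the type of each column
str(mydata)

# Convert the strings (ISO format yyyy-mm-dd) to Date values
mydata$launch.date <- as.Date(mydata$launch.date)

# Dates can be formatted and compared
launch_year <- format(mydata$launch.date, "%Y")
oldest <- mydata$cpus[mydata$launch.date == min(mydata$launch.date)]

print(launch_year)
print(oldest)
```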

3.1.6.a Basic operations

# load the data frame definition (assuming the code above is saved in the file dfdef.rs)
> source("dfdef.rs")
> mydata
/*
      cpus launch.date tdp litho cores threads
1  i7-4790  2014-04-01  84    22     4       8
2 i5-3570K  2012-04-01  77    22     4       4
3  i5-7400  2017-01-01  64    14     4       4
4  i7-2600  2011-01-01  95    32     4       8
*/
 
# show number of rows
> nrow(mydata)
[1] 4
# show number of columns
> ncol(mydata)
[1] 6
 
# get the first column as a one-column data frame, using its numeric index
> mydata[1] 
/*
      cpus
1  i7-4790
2 i5-3570K
3  i5-7400
4  i7-2600
*/
# or get it as a vector
> mydata[,1]
[1] "i7-4790"  "i5-3570K" "i5-7400"  "i7-2600"
 
# get the first column by its name
> mydata$cpus
[1] "i7-4790"  "i5-3570K" "i5-7400"  "i7-2600"
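
The difference between the access forms can be sketched on a reduced copy of the table: single brackets with one index keep a data frame, whereas double brackets, $ and [row, column] return the values themselves. The result names are illustrative.

```r
mydata <- data.frame(
  cpus = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp  = c(84, 77, 64, 95),
  stringsAsFactors = FALSE
)

size     <- dim(mydata)         # number of rows and columns: 4 2
tdp_df   <- mydata["tdp"]       # still a data frame with one column
tdp_vec  <- mydata[["tdp"]]     # a plain numeric vector
one_cell <- mydata[2, "tdp"]    # value at row 2, column tdp: 77
avg      <- mean(mydata$tdp)    # mean of the column: 80

print(size)
print(class(tdp_df))
print(avg)
```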

Note that you can set the column names when you define the data frame, or change them afterwards with colnames():

# during definition we use tdp.in.W (Thermal Dissipation Power in Watts)
> mydata <- data.frame(cpus, launch.date, tdp.in.W = tdp, litho, cores, threads)
 
# get the names of the columns
> colnames(mydata)
[1] "cpus"        "launch.date" "tdp.in.W"    "litho"       "cores"      
[6] "threads"
# modify the names of the columns
> colnames(mydata) <- c("CPUS", "launch", "TDP.W", "LITHO.NM", "CORES", "TH")
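
To rename a single column without retyping all the names, one possibility is to index colnames() with a condition. This is a sketch on a hypothetical two-column data frame named df.

```r
df <- data.frame(
  cpus = c("i7-4790", "i5-3570K"),
  tdp  = c(84, 77),
  stringsAsFactors = FALSE
)

# Replace only the name that is currently "tdp"
colnames(df)[colnames(df) == "tdp"] <- "tdp.in.W"

print(colnames(df))
```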

3.1.6.b Insertion and deletion of rows

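# insert a new row after the second one
# (the new row is a one-row data frame so that the numeric columns stay numeric;
#  a c(...) vector mixing strings and numbers would turn every column into text)
> newrow <- data.frame(cpus = "Pentium-M-760", launch.date = "2014-04-01", tdp = 27, litho = 90, cores = 1, threads = 1, stringsAsFactors = FALSE)
> mydata <- rbind(mydata[1:2,], newrow, mydata[-(1:2),])
> mydata
            cpus launch.date tdp litho cores threads
1        i7-4790  2014-04-01  84    22     4       8
2       i5-3570K  2012-04-01  77    22     4       4
3  Pentium-M-760  2014-04-01  27    90     1       1
31       i5-7400  2017-01-01  64    14     4       4
4        i7-2600  2011-01-01  95    32     4       8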
 
# remove the 4th row (by position)
> mydata <- mydata[-c(4), ]
> mydata
           cpus launch.date tdp litho cores threads
1       i7-4790  2014-04-01  84    22     4       8
2      i5-3570K  2012-04-01  77    22     4       4
3 Pentium-M-760  2014-04-01  27    90     1       1
4       i7-2600  2011-01-01  95    32     4       8
 
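# or, instead of the positional removal above, remove the row by its name
# (!(...) also works when no name matches, whereas -which(...) would then drop every row)
> mydata <- mydata[!(rownames(mydata) %in% c("31")),]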
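
The insertion steps can be wrapped in a small helper. The function insert_row() below is hypothetical (it is not part of R); it is a minimal sketch that assumes the new row is a one-row data frame with the same column names, and it renumbers the rows afterwards so that no name such as "31" appears.

```r
# Hypothetical helper: insert `row` so that it becomes row number `pos`
insert_row <- function(df, row, pos) {
  top    <- df[seq_len(pos - 1), , drop = FALSE]           # rows before pos
  bottom <- df[seq_len(nrow(df)) >= pos, , drop = FALSE]   # rows from pos on
  out <- rbind(top, row, bottom)
  rownames(out) <- NULL                                    # renumber rows 1..n
  out
}

mydata <- data.frame(
  cpus = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp  = c(84, 77, 64, 95),
  stringsAsFactors = FALSE
)
newrow <- data.frame(cpus = "Pentium-M-760", tdp = 27, stringsAsFactors = FALSE)

mydata <- insert_row(mydata, newrow, 3)
print(mydata)

# Remove the row again, this time selecting it by value
mydata2 <- mydata[mydata$cpus != "Pentium-M-760", ]
```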

3.1.6.c Selection

You can select information in a data frame by rows, by columns or using a filter. The examples below use the original data frame with four rows.

# selection by columns
> mydata[,c("cpus","tdp")]
/*
      cpus tdp
1  i7-4790  84
2 i5-3570K  77
3  i5-7400  64
4  i7-2600  95
*/
 
# selection by rows
> mydata[c(2:3),]
/*
      cpus launch.date tdp litho cores threads
2 i5-3570K  2012-04-01  77    22     4       4
3  i5-7400  2017-01-01  64    14     4       4
*/
 
# selection by rows and columns
> mydata[c(2:3),c("cpus","tdp")]
/*
      cpus tdp
2 i5-3570K  77
3  i5-7400  64
*/
 
# selection using a filter
# we want the cpus that have a thermal dissipation power
# greater than 80
> mydata[mydata$tdp > 80,]
/*
     cpus launch.date tdp litho cores threads
1 i7-4790  2014-04-01  84    22     4       8
4 i7-2600  2011-01-01  95    32     4       8
*/
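
Filters can combine several conditions with & and |. The sketch below, on a reduced copy of the table, also uses the base R function subset(), which is not mentioned above and is shown only as an alternative notation. The result names are illustrative.

```r
mydata <- data.frame(
  cpus    = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp     = c(84, 77, 64, 95),
  threads = c(8, 4, 4, 8),
  stringsAsFactors = FALSE
)

# CPUs with a TDP above 80 W and 8 threads
hot <- mydata[mydata$tdp > 80 & mydata$threads == 8, ]

# The same filter written with subset(): column names are used directly
same <- subset(mydata, tdp > 80 & threads == 8)

# subset() can also restrict the columns
cool <- subset(mydata, tdp < 80, select = c(cpus, tdp))

print(hot)
print(cool)
```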

3.1.6.d Sort

# order using one column
# (mydata$litho refers to the column itself rather than to the separate litho vector)
> mydata[order(mydata$litho),]
/*
      cpus launch.date tdp litho cores threads
3  i5-7400  2017-01-01  64    14     4       4
1  i7-4790  2014-04-01  84    22     4       8
2 i5-3570K  2012-04-01  77    22     4       4
4  i7-2600  2011-01-01  95    32     4       8
*/
 
# order using two columns: litho first, then tdp to break ties
> mydata[order(mydata$litho, mydata$tdp),]
/*
      cpus launch.date tdp litho cores threads
3  i5-7400  2017-01-01  64    14     4       4
2 i5-3570K  2012-04-01  77    22     4       4
1  i7-4790  2014-04-01  84    22     4       8
4  i7-2600  2011-01-01  95    32     4       8
*/
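
order() sorts in increasing order by default. A descending sort can be sketched as follows on a reduced copy of the table, either with decreasing = TRUE or by negating a numeric column. The result names are illustrative.

```r
mydata <- data.frame(
  cpus  = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp   = c(84, 77, 64, 95),
  litho = c(22, 22, 14, 32),
  stringsAsFactors = FALSE
)

# highest TDP first
by_tdp_desc <- mydata[order(mydata$tdp, decreasing = TRUE), ]

# litho ascending, then TDP descending within equal litho values
mixed <- mydata[order(mydata$litho, -mydata$tdp), ]

print(by_tdp_desc)
print(mixed)
```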