Jean-Michel RICHER's website

Associate Professor of Computer Science at the Université d'Angers

3.1. How to define data?

You can define different kinds of data: booleans, strings, numeric values, vectors, matrices and data frames (tables whose columns can hold different value types).

3.1.1. Boolean variables

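In R, T is a shorthand for TRUE and F for FALSE. TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be reassigned, so the long forms are safer in scripts. You can build boolean expressions with operators similar to those of the C language: & (and), | (or) and ! (not).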

> a <- T    # a is TRUE
> b <- T    # b is TRUE
> a & b     # a AND b
[1] TRUE
> a | b     # a OR b
[1] TRUE
> a | !b    # a OR NOT(b)
[1] TRUE
> a & !b    # a AND NOT(b)
[1] FALSE
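
The operators above also work element by element on whole vectors, and R additionally provides && and ||, which test a single condition and stop evaluating as soon as the result is known. The following is a minimal sketch of these behaviours; the variable names (u, v, both, ...) are illustrative and do not come from the original text.

```r
# Element-wise operators work on whole logical vectors
u <- c(TRUE, FALSE, TRUE)
v <- c(TRUE, TRUE, FALSE)
both   <- u & v       # TRUE FALSE FALSE
either <- u | v       # TRUE TRUE TRUE
differ <- xor(u, v)   # FALSE TRUE TRUE (exclusive or)

# && and || expect single values and short-circuit:
# the right-hand side is not evaluated when the left side decides the result
a <- TRUE
short <- a || stop("never evaluated")   # TRUE, stop() is not reached

# any() and all() summarise a logical vector into one value
any_true <- any(both)     # TRUE
all_true <- all(either)   # TRUE

print(both)
print(short)
```

As a rule of thumb, & and | are used to filter vectors, while && and || are meant for conditions in if statements.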

3.1.2. Strings

Many functions operate on strings; here are some examples.

> m <- "hello world!"
> nchar(m)   # number of characters in the string
[1] 12
> toupper(m)    # convert to uppercase
[1] "HELLO WORLD!"
> substr(m, 7,4)  ## substr(string, from, to): here the result is an empty
                  ## string because to (4) is smaller than from (7)
[1] ""
> substr(m, 7,11) ## extract substring from characters 7 to 11
[1] "world"
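> paste(rep("=*=", 6))   # without collapse, paste() keeps the 6 copies separate
[1] "=*=" "=*=" "=*=" "=*=" "=*=" "=*="
> stringi::stri_dup("=*=",6)   # requires the stringi package; returns a single string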
[1] "=*==*==*==*==*==*="
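
The same repeated string can be obtained without an extra package. The sketch below uses only base R functions (paste() with its collapse argument, strrep() and paste0()); the variable names are illustrative.

```r
pieces <- rep("=*=", 6)                  # vector of 6 identical strings
line1 <- paste(pieces, collapse = "")    # join the pieces into one string
line2 <- strrep("=*=", 6)                # base R string repetition

# sep controls what is put between the arguments of paste()
joined <- paste("hello", "world", sep = "-")   # "hello-world"

# paste0() is paste() with sep = ""
labels <- paste0("x", 1:3)                     # "x1" "x2" "x3"

print(line1)
print(labels)
```

Here collapse merges the elements of one vector, while sep separates the different arguments passed to paste().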

The function strsplit() splits a string according to a pattern.

Let's consider the following string, where ␣ stands for a space: "A␣␣␣␣text␣␣with␣spaces␣␣␣␣␣"

## use space to separate words
> strsplit("A    text  with spaces     ", " ") 
[[1]]
 [1] "A"      ""       ""       ""       "text"   ""       "with"   "spaces"
 [9] ""       ""       ""       ""      
 
## use a regular expression: the separator is a run of one or more spaces
> strsplit("A    text  with spaces     ", "[ ]+", perl = T) 
[[1]]
[1] "A"      "text"   "with"   "spaces"
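
Note that strsplit() returns a list with one element per input string, which is why the output starts with [[1]]. The sketch below shows one way to get a plain character vector of words from it; the variable names are illustrative.

```r
s <- "A    text  with spaces     "

# [[1]] extracts the vector of pieces for the first (and only) input string
parts <- strsplit(s, " ")[[1]]

# drop the empty strings produced by consecutive spaces
words <- parts[parts != ""]

# alternative: trim leading/trailing spaces, then split on runs of spaces
words2 <- strsplit(trimws(s), " +")[[1]]

print(words)
print(length(parts))
```

Trimming first with trimws() also avoids an empty first element when the string starts with spaces.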

3.1.3. Numbers

Numeric values are defined naturally:

> a <- 3.1415
> b <- a * 2.3 - 7.55
> a
[1] 3.1415
> b
[1] -0.32455

3.1.4. Vectors

You can create a vector (an ordered collection of values of the same type) in several ways:

  • use c(...) to combine arbitrary values: c(1,5,-7,3)
  • use x:y to create a vector of consecutive integers: 1:5
  • use seq() for regularly spaced sequences of numbers
  • use numeric() to define a vector of numbers in which each element is initialized to 0
  • use rep() to replicate values

> x <- c(1.2, 3.5, 4, -7.2)
> x
[1]  1.2  3.5  4.0 -7.2
> y <- 1:5
> y
[1] 1 2 3 4 5
## create a vector of 10 values initialized with 0
> x <- numeric(10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

When you use seq(), you can specify either the end value with to = or the number of elements with length.out = :

> seq(from = 1, to = 10, by = 0.5)
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0
> seq(from = 10, to = 0, by = -2)
[1] 10  8  6  4  2  0
> seq(from = 1, by = 2, length.out=3)
[1] 1 3 5
 
# repeat 'F' four times
> rep('F', 4)
[1] "F" "F" "F" "F"
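
A few more variations of rep() and seq() are often useful. The sketch below is illustrative only (the variable names are not from the original) and uses the base R helpers seq_len() and seq_along().

```r
# times repeats the whole vector, each repeats every element
r_times <- rep(c(1, 2), times = 3)   # 1 2 1 2 1 2
r_each  <- rep(c(1, 2), each = 3)    # 1 1 1 2 2 2

# seq() with length.out computes the step for you
s5 <- seq(0, 1, length.out = 5)      # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) gives 1..n and is empty when n is 0, whereas 1:0 counts down
idx   <- seq_len(5)                  # 1 2 3 4 5
empty <- seq_len(0)                  # integer(0)
down  <- 1:0                         # 1 0

# seq_along(v) gives the indices of v
pos <- seq_along(c("a", "b", "c"))   # 1 2 3

print(r_each)
print(s5)
```

Using seq_len() or seq_along() instead of 1:n is a common habit in loops, because it behaves sensibly when the length is zero.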

You can access and modify the elements of a vector. Note that indices start at 1:

> x <- c(1.2, 3.5, 4, -7.2)   # redefine x, which was overwritten by numeric(10) above
> x[1]
[1] 1.2
> x[4]
[1] -7.2
> x[4] <- 333.33
> x
[1]   1.20   3.50   4.00 333.33
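
Indexing is not limited to a single position. The following sketch, with illustrative variable names, shows a few other forms of indexing on the same vector.

```r
x <- c(1.2, 3.5, 4, -7.2)

first_third <- x[c(1, 3)]   # several positions at once: 1.2 4.0
without_1   <- x[-1]        # negative index removes an element: 3.5 4.0 -7.2
large       <- x[x > 2]     # logical filter: 3.5 4.0
n           <- length(x)    # 4

# assigning beyond the end extends the vector; the gap is filled with NA
y <- x
y[6] <- 9                   # 1.2 3.5 4.0 -7.2 NA 9.0

print(large)
print(y)
```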

3.1.5. Variables

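To list the variables you have defined, use ls(); to remove a variable, use rm(). See the section "Manage, save and restore environment" for details. Variable names are case-sensitive and may contain letters, digits, dots and underscores; both <- and = can be used for assignment.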

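> x <- c(1,3,5,7)
> pi_div_2 = pi / 2   # uses the built-in constant pi
> pi = 3.1415         # masks the built-in constant with your own value
> pi_div_2 = pi / 2   # now computed from the redefined pi
> rasp_pi = 6.28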
 
# these are two different variables: . and _ are distinct characters
> my.long.variable=1
> my_long_variable=2
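
As a short illustration of ls() and rm(), the sketch below creates a variable, checks that it is listed, removes it, and shows that removing a redefined pi gives access to the built-in constant again. The name demo_value is hypothetical.

```r
demo_value <- 42                       # hypothetical variable
listed <- "demo_value" %in% ls()       # TRUE: ls() returns the names as strings
rm(demo_value)
still_there <- exists("demo_value")    # FALSE once removed

pi <- 3.1415        # masks the built-in constant
masked <- pi
rm(pi)              # removes only our own copy
restored <- pi      # the built-in value (3.141593) is visible again

print(listed)
print(restored)
```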

3.1.6. Data frame

A data frame is used for storing data tables. It is a list of vectors of equal length, one vector per column. The following example shows how to handle a data frame. We define information about Intel CPUs:

  • name of the CPU
  • launch date: Intel gives it as a quarter, for example Q2'14, which we translate into the first day of that quarter (April 1st, 2014)
  • thermal design power (TDP) in watts
  • lithography in nm
  • number of cores
  • number of threads

cpus <- c("i7-4790", "i5-3570K", "i5-7400", "i7-2600")
launch.date <- c("2014-04-01", "2012-04-01", "2017-01-01", "2011-01-01")
tdp <- c(84,77,64,95)
litho <- c(22,22,14,32)
cores <- c(4,4,4,4)
threads <- c(8,4,4,8)
mydata <- data.frame(cpus, launch.date, tdp, litho, cores, threads, stringsAsFactors = FALSE)
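
Once the data frame is built, it can be useful to check the type of each column and to turn the launch dates into real dates. The sketch below repeats the definition so that it can be run on its own; converting launch.date with as.Date() is a suggestion, not something the original text does, and the names col_types, newest and mean_tdp are illustrative.

```r
cpus <- c("i7-4790", "i5-3570K", "i5-7400", "i7-2600")
launch.date <- c("2014-04-01", "2012-04-01", "2017-01-01", "2011-01-01")
tdp <- c(84, 77, 64, 95)
litho <- c(22, 22, 14, 32)
cores <- c(4, 4, 4, 4)
threads <- c(8, 4, 4, 8)
mydata <- data.frame(cpus, launch.date, tdp, litho, cores, threads,
                     stringsAsFactors = FALSE)

str(mydata)                          # compact description of the columns
col_types <- sapply(mydata, class)   # class of each column, by name

# convert the text dates (YYYY-MM-DD) into Date objects so they can be compared
mydata$launch.date <- as.Date(mydata$launch.date)
newest <- mydata$cpus[mydata$launch.date == max(mydata$launch.date)]

mean_tdp <- mean(mydata$tdp)         # average of a numeric column

print(col_types)
print(newest)
```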

3.1.6.a  Basic operations

# read the data frame definition (the code above, saved in the file dfdef.rs)
> source("dfdef.rs")
> mydata
/*
      cpus launch.date tdp litho cores threads
1  i7-4790  2014-04-01  84    22     4       8
2 i5-3570K  2012-04-01  77    22     4       4
3  i5-7400  2017-01-01  64    14     4       4
4  i7-2600  2011-01-01  95    32     4       8
*/
 
# show number of rows
> nrow(mydata)
[1] 4
# show number of columns
> ncol(mydata)
[1] 6
 
# get information about first column using numeric index
> mydata[1] 
/*
      cpus
1  i7-4790
2 i5-3570K
3  i5-7400
4  i7-2600
*/
# or use [,1], which returns a plain vector instead of a one-column data frame
# (a character vector here, because of stringsAsFactors = FALSE)
> mydata[,1]
[1] "i7-4790"  "i5-3570K" "i5-7400"  "i7-2600"
 
# get the first column by its name
> mydata$cpus
[1] "i7-4790"  "i5-3570K" "i5-7400"  "i7-2600"

Note that you can set the column names when you define the data frame, or change them afterwards with colnames():

# during definition we use tdp.in.W (Thermal Design Power in watts)
> mydata <- data.frame(cpus, launch.date, tdp.in.W = tdp, litho, cores, threads)
 
# get the names of the columns
> colnames(mydata)
[1] "cpus"        "launch.date" "tdp.in.W"    "litho"       "cores"      
[6] "threads"
# modify the names of the columns
> colnames(mydata) <- c("CPUS", "launch", "TDP.W", "LITHO.NM", "CORES", "TH")
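
It is also possible to rename a single column without retyping all the names. The sketch below uses a small hypothetical data frame df to keep the example short.

```r
df <- data.frame(cpus = c("i7-4790", "i5-3570K"), tdp = c(84, 77),
                 stringsAsFactors = FALSE)

# rename one column, selected by its current name
colnames(df)[colnames(df) == "tdp"] <- "tdp.in.W"

# rename one column, selected by its position
colnames(df)[1] <- "CPUS"

# for a data frame, names() returns the same thing as colnames()
current <- names(df)
print(current)
```

Selecting the column by its current name keeps the code valid even if the column order changes later.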
 

3.1.6.b  Insertion and deletion of rows

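# insert a new row after the second one
# (list() keeps each value's type; c() would turn every value into a string)
> newrow <- list("Pentium-M-760","2014-04-01",27,90,1,1)
> mydata <- rbind(mydata[1:2,], newrow, mydata[-(1:2),])
> mydata
            cpus launch.date tdp litho cores threads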
1        i7-4790  2014-04-01  84    22     4       8
2       i5-3570K  2012-04-01  77    22     4       4
3  Pentium-M-760  2014-04-01  27    90     1       1
31       i5-7400  2017-01-01  64    14     4       4
4        i7-2600  2011-01-01  95    32     4       8
 
# remove the fourth row (the one named 31)
> mydata <- mydata[-c(4), ]
> mydata
           cpus launch.date tdp litho cores threads
1       i7-4790  2014-04-01  84    22     4       8
2      i5-3570K  2012-04-01  77    22     4       4
3 Pentium-M-760  2014-04-01  27    90     1       1
4       i7-2600  2011-01-01  95    32     4       8
 
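# or select the row to drop by its name (a logical filter is safer than
# -which(...), which would drop every row if the name were not found)
> mydata <- mydata[!(rownames(mydata) %in% c("31")),]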
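
When adding rows, keep in mind that a vector built with c() holds a single type, so mixing strings and numbers turns everything into strings, and the numeric columns of the data frame become character columns after rbind(). The sketch below illustrates this on a small hypothetical data frame df; the names bad and good are illustrative.

```r
df <- data.frame(name = c("a", "b"), value = c(1, 2),
                 stringsAsFactors = FALSE)

# c() coerces 3 to the string "3"
row_chr <- c("c", 3)
bad <- rbind(df, row_chr)
bad_class <- class(bad$value)     # "character": the column is no longer numeric

# a one-row data frame keeps the type of each column
good <- rbind(df, data.frame(name = "c", value = 3,
                             stringsAsFactors = FALSE))
good_class <- class(good$value)   # "numeric"

# reset the row names to 1, 2, 3, ... after inserting or deleting rows
rownames(good) <- NULL

print(bad_class)
print(good_class)
```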

3.1.6.c  Selection

You can select information in a data frame by rows, by columns or using a filter.

# selection by columns
> mydata[,c("cpus","tdp")]
/*
      cpus tdp
1  i7-4790  84
2 i5-3570K  77
3  i5-7400  64
4  i7-2600  95
*/
 
# selection by rows
> mydata[c(2:3),]
/*
      cpus launch.date tdp litho cores threads
2 i5-3570K  2012-04-01  77    22     4       4
3  i5-7400  2017-01-01  64    14     4       4
*/
 
# selection by rows and columns
> mydata[c(2:3),c("cpus","tdp")]
/*
      cpus tdp
2 i5-3570K  77
3  i5-7400  64
*/
 
# selection using a filter
# we want the cpus that have a thermal design power (TDP)
# greater than 80
> mydata[mydata$tdp > 80,]
/*
     cpus launch.date tdp litho cores threads
1 i7-4790  2014-04-01  84    22     4       8
4 i7-2600  2011-01-01  95    32     4       8
*/
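
Filters can combine several conditions, and the base R function subset() offers a shorter notation for the same kind of selection. The sketch below rebuilds a reduced version of the data frame (only some of the columns) so that it can be run on its own; the names hot and mid22 are illustrative.

```r
mydata <- data.frame(
  cpus  = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp   = c(84, 77, 64, 95),
  litho = c(22, 22, 14, 32),
  stringsAsFactors = FALSE
)

# subset(): the condition and the columns are written without mydata$
hot <- subset(mydata, tdp > 80, select = c(cpus, tdp))

# combine conditions with & (and) or | (or)
mid22 <- mydata[mydata$tdp > 70 & mydata$litho == 22, ]

print(hot)
print(mid22)
```

subset() is convenient at the console; the bracket notation is more explicit and is usually preferred inside functions.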

3.1.6.d  Sort

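# order using 1 column (ascending lithography); mydata$litho refers to the
# column of the data frame rather than to the separate vector litho
> mydata[order(mydata$litho),]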
/*
      cpus launch.date tdp litho cores threads
3  i5-7400  2017-01-01  64    14     4       4
1  i7-4790  2014-04-01  84    22     4       8
2 i5-3570K  2012-04-01  77    22     4       4
4  i7-2600  2011-01-01  95    32     4       8
*/
 
# order using two columns: lithography first, then TDP to break ties
> mydata[order(mydata$litho, mydata$tdp),]
/*
      cpus launch.date tdp litho cores threads
3  i5-7400  2017-01-01  64    14     4       4
2 i5-3570K  2012-04-01  77    22     4       4
1  i7-4790  2014-04-01  84    22     4       8
4  i7-2600  2011-01-01  95    32     4       8
*/
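
order() sorts in ascending order by default. The sketch below shows two ways to sort in descending order, again on a reduced copy of the data frame so that it can be run on its own; the variable names are illustrative.

```r
mydata <- data.frame(
  cpus  = c("i7-4790", "i5-3570K", "i5-7400", "i7-2600"),
  tdp   = c(84, 77, 64, 95),
  litho = c(22, 22, 14, 32),
  stringsAsFactors = FALSE
)

# descending order on one column
by_tdp_desc <- mydata[order(mydata$tdp, decreasing = TRUE), ]

# ascending lithography, then descending TDP (negate the numeric column)
mixed <- mydata[order(mydata$litho, -mydata$tdp), ]

# the original row names are kept; reset them if you want 1, 2, 3, ...
renumbered <- by_tdp_desc
rownames(renumbered) <- NULL

print(by_tdp_desc)
print(mixed)
```

Negating a column only works for numeric data; for character columns, decreasing = TRUE is the simpler choice.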