10.1. Statistics

The $χ^2$ test is used to compare observed data to theoretical (i.e. expected) data. There are several $χ^2$ tests; here we focus on the goodness-of-fit test.

Note that we call it:

chi square test in English
test du khi deux in French
prueba ji dos de Pearson in Spanish

10.1.1. Khi / Chi square test

Input data

Suppose you want to buy a restaurant and you ask the owner how many clients come to eat each day. He gives you the following information:

Percentage of clients per day, as given by the owner
day	monday	tuesday	wednesday	thursday	friday	saturday
percentage of clients	10	10	15	20	30	15

To check this information, you come every day for one week and count the customers.

Number of clients that you counted
day	monday	tuesday	wednesday	thursday	friday	saturday
number of clients	30	14	34	45	57	20

Hypothesis

Given these data, you want to check whether the owner is right. In statistical terms, we want to know whether or not we must reject the null hypothesis:

$H_0$ or null hypothesis: the data of the owner are correct
$H_1$ or alternative hypothesis: the data of the owner are not correct

We will consider a significance level of 5%.

The significance level, also denoted as alpha or $α$, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
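To make the meaning of $α$ concrete, here is a minimal simulation sketch. It assumes the owner's percentages are exactly true, draws many hypothetical weeks of 200 clients, and counts how often R's built-in `chisq.test` function rejects $H_0$ at the 5% level. The number of simulated weeks, the seed and the variable names are assumptions chosen for illustration.

```r
# Simulate weeks in which the null hypothesis is true by construction
set.seed(1)                                   # arbitrary seed, for reproducibility
p <- c(0.10, 0.10, 0.15, 0.20, 0.30, 0.15)    # owner's percentages
n.clients <- 200                              # clients per simulated week
n.weeks <- 2000                               # number of simulated weeks (assumption)

# One column per simulated week, one row per day
weeks <- rmultinom(n.weeks, size = n.clients, prob = p)

# p-value of the goodness-of-fit test for every simulated week
p.values <- apply(weeks, 2, function(counts) chisq.test(x = counts, p = p)$p.value)

# Share of weeks in which a true null hypothesis is rejected
type1.rate <- mean(p.values < 0.05)
print(type1.rate)
```

The printed share should be close to 0.05, although the exact value depends on the seed and on the number of simulated weeks.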

Khi / Chi square computation

If we consider our data, we have $30+14+34+45+57+20 = 200$ clients for one specific week.

If we use the data of the owner, we should have in theory:

Number of clients in theory
day	monday	tuesday	wednesday	thursday	friday	saturday
percentage of clients	10	10	15	20	30	15
number (theory)	20	20	30	40	60	30

For example, for Monday: $10\% × 200 = 20$.
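The same arithmetic can be sketched in R for all days at once. The variable names below are chosen for illustration only.

```r
# Owner's percentages, expressed as proportions
clients.percentage <- c(monday = 0.10, tuesday = 0.10, wednesday = 0.15,
                        thursday = 0.20, friday = 0.30, saturday = 0.15)

# Counts observed during one week
clients.observed <- c(30, 14, 34, 45, 57, 20)

# Total number of clients counted during the week
total <- sum(clients.observed)

# Theoretical number of clients per day: proportion times total
clients.expected <- clients.percentage * total
print(clients.expected)
```

The proportions must sum to 1, otherwise the theoretical counts would not add up to the observed total.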

We now have theoretical and observed values:

Number of clients: theory and observed
day	monday	tuesday	wednesday	thursday	friday	saturday
number of clients (theory)	20	20	30	40	60	30
number of clients (observed)	30	14	34	45	57	20

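The value of $χ^2$ is the sum, over all days, of the squared difference between the observed count $O_i$ and the theoretical count $E_i$, divided by the theoretical count:

$$ χ^2 = \sum_i (O_i - E_i)^2 / E_i = (30-20)^2 / 20 + (14-20)^2 / 20 + ... + (20-30)^2 / 30 = 11.44$$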
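As a check on the sum above, a minimal R sketch can compute each day's contribution by hand. The variable names are illustrative.

```r
# Observed and theoretical counts, monday to saturday
observed <- c(30, 14, 34, 45, 57, 20)
expected <- c(20, 20, 30, 40, 60, 30)

# Contribution of each day: squared difference divided by the theoretical count
contributions <- (observed - expected)^2 / expected
print(round(contributions, 3))

# The khi square value is the sum of the contributions
khi2.value <- sum(contributions)
print(round(khi2.value, 2))
```

Monday and Saturday contribute the most to the total (5 and about 3.33), so these are the days where the counts differ most from the owner's figures.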

Let's now look up, in the khi square table, the value that corresponds to our degrees of freedom and to the chosen significance level.

In our case the number of degrees of freedom is 5, because it is equal to the number of categories minus 1: 6 days minus 1 equals 5. Remember that we chose a significance level of 5%. The corresponding value in the khi square table is 11.07.

As $11.44 > 11.07$, we must reject the null hypothesis, which means we must consider that the percentages given by the owner are not correct.

This is consistent with the fact (see below) that if the p-value is below $0.05$ then we reject $H_0$.

To do this with R, you would write something like this:

> clients.percentage <- c(0.1,0.1,0.15,0.20,0.3,0.15)
> clients.observed <- c(30,14,34,45,57,20)
> khi2 <- chisq.test(x = clients.observed, p = clients.percentage)
> khi2
 
	Chi-squared test for given probabilities
 
data:  clients.observed
X-squared = 11.442, df = 5, p-value = 0.04329
 
> khi2$p.value
[1] 0.04329313
> khi2$parameter
df 
 5 
> khi2$statistic
X-squared 
 11.44167
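The object returned by `chisq.test` also stores the theoretical counts and the Pearson residuals, which makes it possible to cross-check the hand computation. The following sketch is written as a script rather than a console session and redefines the input vectors so that it runs on its own.

```r
# Input data: owner's proportions and counted clients
clients.percentage <- c(0.10, 0.10, 0.15, 0.20, 0.30, 0.15)
clients.observed <- c(30, 14, 34, 45, 57, 20)

khi2 <- chisq.test(x = clients.observed, p = clients.percentage)

# Theoretical counts used by the test (proportion times total)
print(khi2$expected)

# Pearson residuals: (observed - expected) / sqrt(expected)
print(round(khi2$residuals, 3))

# The squared residuals sum to the khi square statistic
print(sum(khi2$residuals^2))
```

A residual far from zero points to a day on which the counted clients differ strongly from the owner's figures.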

Note that in R you can also get the p-value and the critical khi square value directly:

# get the p-value for a given khi square value of 11.442
# and 5 degrees of freedom -> 0.0433 = 4.33%
> pchisq(11.442, df=5, lower.tail = FALSE)
[1] 0.04328751
 
# get the critical khi square value for a significance level of 0.05 = 5%
# and 5 degrees of freedom -> 11.07
> qchisq(0.05, df=5, lower.tail = FALSE)
[1] 11.0705
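The two calls above are inverses of each other, so comparing the khi square value to the critical value and comparing the p-value to the significance level lead to the same decision. A minimal sketch, with illustrative variable names:

```r
khi2.value <- 11.442     # khi square value computed from the data
deg.freedom <- 5         # 6 days minus 1
alpha <- 0.05            # significance level

# Upper-tail probability of the observed khi square value
p.value <- pchisq(khi2.value, df = deg.freedom, lower.tail = FALSE)

# Khi square value whose upper-tail probability equals alpha
critical.value <- qchisq(alpha, df = deg.freedom, lower.tail = FALSE)

# Both decision rules agree
reject.by.p.value <- p.value < alpha
reject.by.critical.value <- khi2.value > critical.value
print(c(reject.by.p.value, reject.by.critical.value))

# Feeding the critical value back into pchisq returns alpha
round.trip <- pchisq(critical.value, df = deg.freedom, lower.tail = FALSE)
print(round.trip)
```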