Website of Jean-Michel RICHER

Associate Professor of Computer Science at the Université d'Angers


10.1. Statistics

The $χ^2$ test compares observed data with theoretical (i.e. expected) data. There are several $χ^2$ tests; here we focus on the goodness-of-fit test.

Note that we call it:

  • chi square test in English
  • test du khi deux in French
  • prueba ji dos de Pearson in Spanish

10.1.1. Khi / Chi square test

Input data

Suppose you want to buy a restaurant and you ask the owner how the clients are distributed over the days of the week. He gives you the following information:

 day                     monday   tuesday   wednesday   thursday   friday   saturday
 percentage of clients       10        10          15         20       30         15

Percentage of clients each day, as given by the owner

To check this information, you come every day for one week and count the customers.

 day                 monday   tuesday   wednesday   thursday   friday   saturday
 number of clients       30        14          34         45       57         20

Number of clients that you counted

Hypothesis

Given these data, you want to check whether the owner is right. In statistical terms, we want to know whether we must reject the null hypothesis or fail to reject it:

  • $H_0$ or null hypothesis: the data of the owner are correct
  • $H_1$ or alternative hypothesis: the data of the owner are not correct

We will consider a significance level of 5%.

The significance level, also denoted as alpha or $α$, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
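The meaning of $α$ can be checked by simulation. The following R sketch assumes that $H_0$ is true, draws many hypothetical weeks from the owner's percentages, and measures how often a test at the 5% level rejects $H_0$ anyway. The seed, the weekly total of 200 clients, the number of simulated weeks and the names `p.owner`, `n.weeks`, `weeks`, `p.values` and `rejection.rate` are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed (assumption), only to make the run reproducible

# percentages announced by the owner, written as proportions
p.owner <- c(0.10, 0.10, 0.15, 0.20, 0.30, 0.15)
alpha   <- 0.05   # significance level
n.weeks <- 2000   # number of simulated weeks (assumption)

# each column is one simulated week of 200 clients, drawn with H0 true
weeks <- rmultinom(n.weeks, size = 200, prob = p.owner)

# p-value of the goodness-of-fit test for every simulated week
p.values <- apply(weeks, 2, function(counts) {
  chisq.test(x = counts, p = p.owner)$p.value
})

# share of weeks in which H0 is rejected although it is true
rejection.rate <- mean(p.values < alpha)
print(rejection.rate)
```

The printed rate should be close to 0.05, up to simulation noise: this is the 5% risk described above.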

Khi / Chi square computation

If we consider our data, we have $30+14+34+45+57+20 = 200$ clients for one specific week.

If we use the data of the owner, we should have in theory:

 day                     monday   tuesday   wednesday   thursday   friday   saturday
 percentage of clients       10        10          15         20       30         15
 number (theory)             20        20          30         40       60         30

Number of clients in theory

For example, for Monday: $10\% × 200 = 20$.
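The same computation can be done for all days at once. This is a minimal R sketch; the names `total` and `clients.expected` are assumptions introduced for illustration.

```r
# percentages announced by the owner, written as proportions
clients.percentage <- c(0.10, 0.10, 0.15, 0.20, 0.30, 0.15)
# clients counted from Monday to Saturday
clients.observed <- c(30, 14, 34, 45, 57, 20)

total <- sum(clients.observed)                  # 200 clients in the week
clients.expected <- total * clients.percentage  # theoretical count per day
print(clients.expected)
# [1] 20 20 30 40 60 30
```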

We now have theoretical and observed values:

 day                            monday   tuesday   wednesday   thursday   friday   saturday
 number of clients (theory)         20        20          30         40       60         30
 number of clients (observed)       30        14          34         45       57         20

Number of clients: theory and observed

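The value of $χ^2$ is the sum, over all days, of the squared difference between the observed count $O_i$ and the theoretical count $E_i$, divided by $E_i$:

$$ χ^2 = \sum_i (O_i - E_i)^2 / E_i = (30-20)^2 / 20 + (14-20)^2 / 20 + (34-30)^2 / 30 + (45-40)^2 / 40 + (57-60)^2 / 60 + (20-30)^2 / 30 ≈ 11.44 $$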
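As a check, the sum can be computed by hand in R. This is a minimal sketch; the names `clients.expected`, `contributions` and `chi2` are assumptions introduced for illustration.

```r
clients.observed <- c(30, 14, 34, 45, 57, 20)  # counted clients
clients.expected <- c(20, 20, 30, 40, 60, 30)  # theoretical counts

# contribution of each day: (observed - expected)^2 / expected
contributions <- (clients.observed - clients.expected)^2 / clients.expected
print(round(contributions, 3))
# [1] 5.000 1.800 0.533 0.625 0.150 3.333

# the statistic is the sum of the contributions
chi2 <- sum(contributions)
print(chi2)
# [1] 11.44167
```

Monday and Saturday contribute the most to the total, which shows where the observed counts differ most from the owner's figures.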

So let's look up, in the chi-square table, the value that corresponds to the degrees of freedom and the significance level we chose.

In our case the number of degrees of freedom is 5, because it equals the number of categories minus 1: 6 days minus 1 equals 5. Remember that we chose a significance level of 5%. The corresponding critical value in the chi-square table is 11.07.

As $11.44 > 11.07$, we must reject the null hypothesis: at the 5% significance level, we conclude that the percentages given by the owner are not correct.
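The decision rule can also be written as a short R sketch, where `qchisq` replaces the lookup in the table. The names `n.categories`, `deg.freedom`, `critical.value` and `reject.h0` are assumptions introduced for illustration.

```r
alpha <- 0.05                    # significance level
n.categories <- 6                # Monday to Saturday
deg.freedom <- n.categories - 1  # degrees of freedom
chi2 <- 11.44                    # statistic computed from the two rows of counts

# value exceeded with probability alpha when H0 is true
critical.value <- qchisq(1 - alpha, df = deg.freedom)
print(critical.value)
# [1] 11.0705

# H0 is rejected when the statistic exceeds the critical value
reject.h0 <- chi2 > critical.value
print(reject.h0)
# [1] TRUE
```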

This is consistent with the rule (see below) that we reject $H_0$ when the p-value is below $0.05$.

To do this with R, you would write something like this:

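> # daily proportions announced by the owner (they must sum to 1)
> clients.percentage <- c(0.1,0.1,0.15,0.20,0.3,0.15)
> # number of clients counted each day
> clients.observed <- c(30,14,34,45,57,20)
> # goodness-of-fit test of the counts against the owner's proportions
> khi2 <- chisq.test(x = clients.observed, p = clients.percentage)
> khi2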
 
	Chi-squared test for given probabilities
 
data:  clients.observed
X-squared = 11.442, df = 5, p-value = 0.04329
 
> khi2$p.value
[1] 0.04329313
> khi2$parameter
df 
 5 
> khi2$statistic
X-squared 
 11.44167 

Note that in R you can also get the p-value and the critical khi square value directly:

# get the p-value for a khi square value of 11.442
# and 5 degrees of freedom -> 0.0433 = 4.33%
> pchisq(11.442, df=5, lower.tail = FALSE)
[1] 0.04328751
 
# get the critical value for a significance level of 0.05 = 5%
# and 5 degrees of freedom -> 11.07
> qchisq(0.05, df=5, lower.tail = FALSE)
[1] 11.0705