Associate Professor (Maître de Conférences) in Computer Science at the Université d'Angers
The $χ^2$ test is used to compare observed data to theoretical (i.e., expected) data. There are several $χ^2$ tests; here we focus on the goodness-of-fit test.
Note that we call it:
Suppose you want to buy a restaurant and you ask the owner how many clients come to eat each day. He gives you the following information:

| Day | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|
| Percentage of clients | 10 | 10 | 15 | 20 | 30 | 15 |
To check this information, you come every day for one week and count the customers:

| Day | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|
| Number of clients | 30 | 14 | 34 | 45 | 57 | 20 |
Given these data, you want to check whether the owner is right. In statistical terms, we want to know whether we must accept or reject the null hypothesis $H_0$: the percentages given by the owner are correct.
We will consider a significance level of 5%.
The significance level, also denoted as alpha or $α$, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
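The meaning of $α$ can be illustrated with a small simulation. The sketch below assumes that the owner's percentages are true and that each simulated week has 200 clients; the names `p0`, `n.clients`, `n.sim` and the seed are arbitrary choices made for this illustration. It draws many weeks under the null hypothesis and counts how often the test rejects at the 5% level.

```r
# Sketch: estimate how often the test wrongly rejects a true null hypothesis.
set.seed(42)                                  # arbitrary seed, for reproducibility
p0 <- c(0.10, 0.10, 0.15, 0.20, 0.30, 0.15)   # owner's percentages, assumed true here
n.clients <- 200                              # assumed number of clients per week
n.sim <- 5000                                 # number of simulated weeks (arbitrary)

# For each simulated week, draw daily counts under H0 and run the test.
rejected <- replicate(n.sim, {
  counts <- as.vector(rmultinom(1, size = n.clients, prob = p0))
  chisq.test(x = counts, p = p0)$p.value < 0.05
})

# Proportion of simulated weeks where H0 is rejected although it is true.
rejection.rate <- mean(rejected)
rejection.rate
```

The proportion should land close to 0.05, which is what a significance level of 5% means in practice.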
If we consider our data, we have $30+14+34+45+57+20 = 200$ clients for one specific week.
If we use the data of the owner, we should have in theory:
| Day | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|
| Percentage of clients | 10 | 10 | 15 | 20 | 30 | 15 |
| Number (theory) | 20 | 20 | 30 | 40 | 60 | 30 |
For example, for Monday: $10\% \times 200 = 20$.
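The same computation can be done for all days at once in R. This is a minimal sketch; the variable names `total` and `clients.expected` are chosen here for illustration.

```r
# Owner's percentages, expressed as proportions, with the day as name.
clients.percentage <- c(monday = 0.10, tuesday = 0.10, wednesday = 0.15,
                        thursday = 0.20, friday = 0.30, saturday = 0.15)

# Counts observed during one week.
clients.observed <- c(30, 14, 34, 45, 57, 20)

# Total number of clients in the observed week.
total <- sum(clients.observed)

# Theoretical (expected) counts: each proportion times the total.
clients.expected <- clients.percentage * total
clients.expected
```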
We now have theoretical and observed values:
| Day | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|
| Number of clients (theory) | 20 | 20 | 30 | 40 | 60 | 30 |
| Number of clients (observed) | 30 | 14 | 34 | 45 | 57 | 20 |
The value of $χ^2$ is then defined as:
$$ χ^2 = \frac{(30-20)^2}{20} + \frac{(14-20)^2}{20} + \dots + \frac{(20-30)^2}{30} \approx 11.44 $$
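The sum can be checked by hand in R. The sketch below uses illustrative variable names (`contributions`, `chi2`) and shows the contribution of each day before adding them up.

```r
# Observed and theoretical counts, Monday to Saturday.
clients.observed <- c(30, 14, 34, 45, 57, 20)
clients.expected <- c(20, 20, 30, 40, 60, 30)

# Contribution of each day: (observed - expected)^2 / expected.
contributions <- (clients.observed - clients.expected)^2 / clients.expected
round(contributions, 3)

# The chi-square statistic is the sum of the contributions.
chi2 <- sum(contributions)
chi2
```

Monday and Saturday contribute the most, since they are the days where the observed counts are furthest from the theoretical ones relative to the expected value.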
Let's now look up, in the chi-square table, the value that corresponds to the chosen number of degrees of freedom and level of significance.
In our case the number of degrees of freedom is 5, because it is equal to the number of categories minus 1: we have 6 days, and 6 minus 1 equals 5. Remember that we chose a level of significance of 5%. The value in the chi-square table is 11.07.
As $11.44 > 11.07$, we must reject the null hypothesis, which means we must consider that the percentages given by the owner are not correct.
This is consistent with the fact (see below) that if the p-value is below $0.05$, then we reject $H_0$.
To do this with R, you would write something like this:
    > clients.percentage <- c(0.1, 0.1, 0.15, 0.20, 0.3, 0.15)
    > clients.observed <- c(30, 14, 34, 45, 57, 20)
    > khi2 <- chisq.test(x = clients.observed, p = clients.percentage)
    > khi2

        Chi-squared test for given probabilities

    data:  clients.observed
    X-squared = 11.442, df = 5, p-value = 0.04329

    > khi2$p.value
    [1] 0.04329313
    > khi2$parameter
    df
     5
    > khi2$statistic
    X-squared
     11.44167
Note that in R you can also get the p-value and the critical chi-square value directly:

    # p-value for a chi-square value of 11.442
    # with 5 degrees of freedom -> 0.0432 = 4.32%
    > pchisq(11.442, df = 5, lower.tail = FALSE)
    [1] 0.04328751
    # critical chi-square value for a significance level of 0.05 = 5%
    # with 5 degrees of freedom -> 11.07
    > qchisq(0.05, df = 5, lower.tail = FALSE)
    [1] 11.0705