Chi-Square Test

Last Updated : 14 Apr, 2026

The Chi-squared (χ²) test is a statistical method used to determine whether there is a significant association between two categorical variables or whether observed data fits an expected distribution. In categorical data analysis, the chi-square test compares observed frequencies with expected frequencies under a given hypothesis.

The test is widely used in market research, healthcare, the social sciences, and other fields to analyze relationships between categorical variables.

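For example, suppose a survey records two categorical variables for each respondent: favorite color and preferred ice cream flavor. The null hypothesis (H₀) is that the two variables are independent.

The Chi-Square test compares the observed survey counts with the counts expected if no relationship existed, and summarizes the difference in a test statistic (χ²). If this value is large enough, we reject H₀ and conclude that color preference and ice cream choice are associated. The test shows association only; it does not show that one preference causes the other.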

Formula For Chi-Square Test

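\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

Here O_i is the observed frequency in category i, E_i is the frequency expected in that category under the null hypothesis, and the sum runs over all categories (or over all cells of a contingency table).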
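The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `chi_square_statistic` is an illustrative choice, and the counts are the coin-flip counts used in Example 2 of this article (55 heads and 45 tails observed, 50 of each expected).

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 55 heads and 45 tails observed; 50 of each expected for a fair coin.
stat = chi_square_statistic([55, 45], [50, 50])
print(stat)  # 1.0
```

Each term is non-negative, so the statistic grows as the observed counts move further from the expected counts.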

**Categorical Variables**

Categorical variables classify data into distinct, non-numeric groups (e.g., colors, fruit types).

**Key Characteristics:**

  1. **Distinct Groups:** No overlap (e.g., hair color: blonde, brunette).
  2. **Non-Numerical:** No arithmetic meaning (e.g., "apple" ≠ "orange" numerically).
  3. **Limited Options:** Fixed categories (e.g., traffic lights: red, yellow, green).

**Example:** "Do you prefer tea, coffee, or juice?" → Categories: tea, coffee, juice.

Steps for Chi-Square Test

The steps below walk through a chi-square test using an example: does gender influence which ice cream flavor a person chooses?

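Step 1: Define Hypotheses

H₀: Gender and ice cream flavor preference are independent
H₁: Gender and ice cream flavor preference are associated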

Step 2: Gather and Organize Data

Before performing a chi-square test, collect data on the two categorical variables you want to compare, here gender and ice cream flavor.

Once the data is collected, organize it into a contingency table, in which each cell counts the respondents who fall into one combination of categories.

Suppose the claim under investigation is that men prefer vanilla while women prefer chocolate. To test it, we record how many male and how many female respondents chose each flavor.

Here's an example of what a contingency table might look like:

| Gender | Chocolate | Vanilla | Strawberry | Total |
|---|---|---|---|---|
| Male | 20 | 15 | 10 | 45 |
| Female | 25 | 20 | 30 | 75 |
| Total | 45 | 35 | 40 | 120 |

Step 3: Calculate Expected Frequencies

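The observed frequencies (O) are the counts in the table above. Under H₀, the expected frequency (E) for the cell in row i and column j is:

E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}

For example, the expected count for the Male/Chocolate cell is (45 × 45) / 120 = 16.875.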
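The expected counts for every cell can be computed the same way. The sketch below applies the row-total × column-total / grand-total rule to the contingency table from Step 2; the helper name `expected_frequencies` is an illustrative assumption.

```python
# Observed counts from the contingency table in Step 2.
# Rows: Male, Female. Columns: Chocolate, Vanilla, Strawberry.
observed = [
    [20, 15, 10],
    [25, 20, 30],
]


def expected_frequencies(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, row)
# Male [16.875, 13.125, 15.0]
# Female [28.125, 21.875, 25.0]
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick sanity check on the calculation.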

Step 4: Perform Chi-Square Test

Use Chi-Square Formula:

\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(20 - 16.875)^2}{16.875} + \frac{(15 - 13.125)^2}{13.125} + \frac{(10 - 15)^2}{15} + \frac{(25 - 28.125)^2}{28.125} + \frac{(20 - 21.875)^2}{21.875} + \frac{(30 - 25)^2}{25} \approx 4.02

Step 5: Determine Degrees of Freedom (df)

df = (number of rows - 1) × (number of columns - 1)

df = (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2

Step 6: Find p-value

Compare the calculated χ² value with the critical value from the Chi-Square distribution table for the given degrees of freedom and significance level (α).

Here, χ² ≈ 4.02 with df = 2:

Critical value at α = 0.05 is 5.991.

Since 4.02 < 5.991, p > 0.05 (p ≈ 0.134), so we fail to reject H₀.
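Steps 3 to 6 can be checked with a short script. This is a sketch using only the standard library; the function name `chi_square_independence` is illustrative. The p-value line relies on an assumption that holds only for df = 2, where the upper-tail probability of the chi-square distribution is exactly exp(−χ²/2). For other tables, a library routine such as SciPy's `scipy.stats.chi2_contingency` is the more usual choice.

```python
import math

observed = [
    [20, 15, 10],  # Male: Chocolate, Vanilla, Strawberry
    [25, 20, 30],  # Female
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_square_independence(observed)

# Assumption: valid only because df == 2 here.
p_value = math.exp(-stat / 2)

print(round(stat, 2), df, round(p_value, 3))  # 4.02 2 0.134
print("reject H0" if stat > 5.991 else "fail to reject H0")  # fail to reject H0
```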

Step 7: Interpret Results

There is no significant evidence of an association between gender and ice cream flavor (p > 0.05), so the data do not support the claim that men prefer vanilla or that women prefer chocolate.

Addressing Assumptions and Considerations

Goodness-Of-Fit

A goodness-of-fit test checks whether observed data match a hypothesized distribution. For example, it can test whether a die is fair by comparing the observed count of each face with the equal counts expected from a fair die.

**Key Aspects:**

  1. **Purpose:** Check how well observed data fits an expected distribution.
  2. **Data Types:** Categorical data only.
  3. **Applications:** Compare observed vs. expected frequencies.
  4. **Benefits:** Identifies mismatches between the model and the data.
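The die example can be sketched as follows. The roll counts are hypothetical and for illustration only, not real data; `goodness_of_fit` and `CRITICAL_05` are illustrative names. The critical values are the standard α = 0.05 table entries for df = 1 to 5, and df = k − 1 assumes that no parameters were estimated from the data.

```python
# Critical values of the chi-square distribution at alpha = 0.05 for df = 1..5,
# as listed in standard chi-square tables.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def goodness_of_fit(observed, expected_probs):
    """Return (statistic, df) comparing observed counts with hypothesized probabilities."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # df = number of categories - 1 (assumes no parameters estimated from data)
    return stat, len(observed) - 1


# Hypothetical counts for faces 1-6 over 60 rolls (illustration only).
rolls = [8, 12, 9, 11, 10, 10]
stat, df = goodness_of_fit(rolls, [1 / 6] * 6)
reject = stat > CRITICAL_05[df]
print(round(stat, 2), df, reject)  # 1.0 5 False
```

With these made-up counts the statistic is far below the critical value of 11.070, so the hypothesis of a fair die would not be rejected.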

Applications of Chi-Square Test in Computer Science

**A/B Testing & Feature Evaluation**

**Machine Learning (Feature Selection)**

**Database Query Optimization**

**Natural Language Processing (NLP)**

Solved Examples

**Example 1:** A study investigates the relationship between eye color (blue, brown, green) and hair color (blonde, brunette, redhead). The following data is collected:

| Eye Color | Blonde | Brunette | Redhead | Total |
|---|---|---|---|---|
| Blue | 30 | 50 | 20 | 100 |
| Brown | 40 | 30 | 10 | 80 |
| Green | 20 | 10 | 10 | 40 |
| Total | 90 | 90 | 40 | 220 |

**Step 1: Hypotheses**

H₀: Eye color and hair color are independent
H₁: They are associated

**Step 2: Expected Frequencies**

Using E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}}

| Eye Color | Blonde | Brunette | Redhead |
|---|---|---|---|
| Blue | 40.91 | 40.91 | 18.18 |
| Brown | 32.73 | 32.73 | 14.55 |
| Green | 16.36 | 16.36 | 7.27 |

**Step 3: Chi-Square Calculation**

\chi^2 = \sum \frac{(O - E)^2}{E} \approx 12.68

**Step 4: Degrees of Freedom**

df = (3 − 1)(3 − 1) = 4

**Step 5: Decision**

Critical value (α = 0.05, df = 4) = 9.488

Since 12.68 > 9.488 → Reject H₀

There is a significant association between eye color and hair color.
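The arithmetic in this example can be reproduced with a short script. This is a sketch, not a general-purpose routine: the helper `chi2_sf_even_df` is an illustrative name and uses a closed form for the upper-tail probability that is valid only when df is even (here df = 4).

```python
import math

observed = [
    [30, 50, 20],  # Blue: Blonde, Brunette, Redhead
    [40, 30, 10],  # Brown
    [20, 10, 10],  # Green
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all nine cells.
stat = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        stat += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(stat, df)
print(round(stat, 2), df, round(p_value, 3))  # 12.68 4 0.013
```

A p-value of about 0.013 is below 0.05, which agrees with the decision reached by comparing the statistic with the critical value 9.488.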

**Example 2:** A coin is flipped 100 times, producing 55 heads and 45 tails. Under the null hypothesis the coin is fair, so heads and tails are equally likely.

**Step 1: Hypotheses**

H₀: Coin is fair
H₁: Coin is not fair

**Step 2: Expected Values**

Heads = 50, Tails = 50

**Step 3: Chi-Square Calculation**

\chi^2 = \frac{(55-50)^2}{50} + \frac{(45-50)^2}{50} = 1

**Step 4: Degrees of Freedom**

df = 2 − 1 = 1

**Step 5: Decision**

Critical value (α = 0.05, df = 1) = 3.841

Since 1 < 3.841 → Fail to reject H₀

There is no significant evidence that the coin is unfair.
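The same decision can be expressed as a p-value. The sketch below uses only Python's `math` module and relies on a fact specific to df = 1: a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(√(χ²/2)). It does not generalize to other degrees of freedom.

```python
import math

observed = [55, 45]  # heads, tails
expected = [50, 50]  # fair coin over 100 flips

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Valid for df == 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(stat, df, round(p_value, 4))  # 1.0 1 0.3173
```

A p-value of roughly 0.32 means that a split at least as uneven as 55/45 is quite common for a fair coin.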

Practice Problems

**Q1. Market Research on Beverages**

A company conducts a survey to determine whether there's a relationship between age groups and preferred beverages. The data collected is as follows:

| Age Group | Coffee | Tea | Soft Drinks | Water |
|---|---|---|---|---|
| 18-25 | 30 | 20 | 25 | 15 |
| 26-35 | 25 | 30 | 20 | 25 |
| 36-45 | 20 | 25 | 30 | 25 |
| 46-55 | 15 | 20 | 25 | 40 |

Use a chi-square test to determine if there is an association between age groups and preferred beverages.

**Q2. Student Performance**

A teacher wants to find out if there is a relationship between study habits and grades. The data collected is as follows:

| Study Habits | A | B | C | D | F |
|---|---|---|---|---|---|
| Regular | 15 | 20 | 25 | 10 | 5 |
| Occasional | 10 | 15 | 20 | 15 | 10 |
| Rare | 5 | 10 | 15 | 20 | 25 |

Perform a chi-square test to determine if study habits and grades are associated.

**Q3. Gender and Major**

A university wants to see if there is an association between gender and chosen major. The data collected is:

| Major | Male | Female |
|---|---|---|
| Engineering | 60 | 30 |
| Business | 40 | 50 |
| Arts | 20 | 40 |
| Sciences | 30 | 30 |

Conduct a chi-square test to examine if gender and chosen major are related.

**Q4. Voting Preferences**

A political analyst wants to know if there is a relationship between gender and voting preference. The data is:

| Preference | Male | Female |
|---|---|---|
| Candidate A | 80 | 90 |
| Candidate B | 70 | 60 |
| Undecided | 50 | 40 |

Test the hypothesis that gender and voting preference are independent.