Chi-Square Test #872

Merged
merged 4 commits into from
Jul 8, 2024
262 changes: 262 additions & 0 deletions Chi-Square Test/Chi_Square_Test.ipynb

Large diffs are not rendered by default.

201 changes: 201 additions & 0 deletions Chi-Square Test/dataset.csv
@@ -0,0 +1,201 @@
Gender,Preference
Male,Sports
Female,Reading
Male,Sports
Male,Sports
Male,Reading
Female,Sports
Male,Reading
Male,Reading
Male,Reading
Female,Sports
Male,Sports
Male,Sports
Male,Sports
Male,Sports
Female,Sports
Male,Sports
Female,Reading
Female,Sports
Female,Reading
Male,Sports
Female,Sports
Male,Sports
Female,Reading
Female,Reading
Female,Reading
Female,Reading
Female,Sports
Female,Reading
Female,Sports
Female,Sports
Male,Reading
Male,Reading
Female,Reading
Female,Reading
Female,Reading
Male,Reading
Female,Reading
Male,Reading
Male,Sports
Male,Reading
Male,Reading
Male,Sports
Female,Reading
Female,Sports
Female,Sports
Female,Reading
Female,Sports
Male,Sports
Female,Sports
Female,Sports
Male,Reading
Female,Sports
Male,Reading
Female,Sports
Male,Sports
Female,Sports
Female,Sports
Male,Reading
Male,Reading
Male,Sports
Male,Sports
Male,Reading
Male,Sports
Male,Sports
Male,Sports
Female,Reading
Female,Reading
Male,Reading
Female,Sports
Female,Sports
Female,Reading
Female,Reading
Male,Reading
Female,Reading
Male,Sports
Female,Reading
Female,Sports
Female,Reading
Male,Sports
Female,Reading
Male,Reading
Female,Reading
Male,Reading
Female,Sports
Male,Reading
Male,Sports
Female,Sports
Male,Sports
Female,Sports
Female,Reading
Female,Sports
Female,Sports
Female,Sports
Female,Reading
Female,Reading
Female,Reading
Female,Reading
Female,Sports
Female,Sports
Male,Reading
Male,Sports
Female,Sports
Female,Sports
Female,Reading
Female,Reading
Female,Sports
Female,Reading
Female,Reading
Female,Reading
Male,Reading
Female,Reading
Male,Sports
Female,Reading
Female,Sports
Male,Sports
Female,Reading
Male,Sports
Female,Sports
Female,Sports
Male,Reading
Female,Reading
Male,Sports
Female,Reading
Male,Reading
Male,Reading
Female,Reading
Female,Sports
Male,Sports
Female,Reading
Female,Reading
Female,Sports
Male,Sports
Male,Reading
Male,Sports
Male,Reading
Male,Reading
Male,Sports
Male,Sports
Male,Reading
Male,Reading
Female,Sports
Male,Reading
Female,Sports
Female,Reading
Female,Sports
Male,Sports
Male,Sports
Male,Reading
Male,Reading
Female,Sports
Male,Reading
Male,Sports
Male,Sports
Male,Reading
Male,Reading
Female,Sports
Male,Reading
Female,Reading
Male,Sports
Female,Sports
Male,Reading
Male,Sports
Female,Sports
Female,Reading
Female,Sports
Male,Sports
Female,Reading
Male,Reading
Male,Sports
Female,Sports
Female,Reading
Male,Reading
Male,Reading
Female,Reading
Female,Sports
Female,Reading
Male,Reading
Male,Sports
Male,Reading
Male,Sports
Male,Reading
Male,Reading
Female,Reading
Male,Sports
Male,Sports
Male,Reading
Female,Sports
Male,Sports
Male,Sports
Female,Sports
Male,Sports
Male,Sports
Male,Reading
Male,Reading
Male,Sports
Female,Reading
Female,Reading
Female,Reading
Male,Reading
Male,Sports
99 changes: 99 additions & 0 deletions Chi-Square Test/readme.md
@@ -0,0 +1,99 @@
# Chi-Square Test for Independence

## Overview
The Chi-Square Test for Independence is a statistical test used to determine whether there is a significant association between two categorical variables. It is based on the comparison of observed frequencies in the data with the frequencies that would be expected if the variables were independent.

## Mathematical Logic
The Chi-Square statistic (χ²) is calculated using the following formula:

$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
$$

Where:
- **O_i**: Observed frequency in the i-th cell of the contingency table
- **E_i**: Expected frequency in the i-th cell of the contingency table

The expected frequency (E_i) for each cell is calculated under the assumption of independence between the variables, using the totals of the row and column that contain the cell:

$$
E_i = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}
$$
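
As a minimal sketch of the two formulas above, the expected frequencies for every cell can be computed at once as the outer product of the row and column totals divided by the grand total. The 2x3 table of counts below is hypothetical and chosen only for illustration; the variable names are not part of the original notebook.

```python
import numpy as np

# Hypothetical observed counts (2 rows x 3 columns), for illustration only
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# E = (row total * column total) / grand total, evaluated for every cell
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_stat)
```

Using `np.outer` avoids an explicit loop over cells and works unchanged for tables of any size.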

## Steps to Perform Chi-Square Test
1. **Create a Contingency Table**: Summarize the frequencies of the two categorical variables in a matrix format.
2. **Calculate Expected Frequencies**: Compute the expected frequencies for each cell of the table assuming the variables are independent.
3. **Compute the Chi-Square Statistic**: Use the formula to calculate the χ² value.
4. **Determine the Degrees of Freedom (df)**: Calculated as (number of rows - 1) * (number of columns - 1).
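5. **Find the P-Value**: Compare the χ² value against the Chi-Square distribution with the computed degrees of freedom (using a table or software). The p-value is the probability of observing a statistic at least this large if the variables are independent.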
6. **Interpret the Result**: If the p-value is less than the significance level (typically 0.05), reject the null hypothesis of independence.
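
The six steps above can be sketched with pandas and SciPy. This is an illustrative sketch, not the notebook's code: the small DataFrame is hypothetical, and `correction=False` is passed so that `scipy.stats.chi2_contingency` returns the plain Pearson statistic described here (by default SciPy applies Yates' continuity correction when df = 1).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw data: one row per respondent
df = pd.DataFrame({
    "Gender": ["Male"] * 50 + ["Female"] * 50,
    "Preference": ["Sports"] * 40 + ["Reading"] * 10
                  + ["Sports"] * 25 + ["Reading"] * 25,
})

# Step 1: contingency table of observed counts
table = pd.crosstab(df["Gender"], df["Preference"])

# Steps 2-5: expected frequencies, statistic, degrees of freedom, p-value
chi2_stat, p_value, dof, expected = chi2_contingency(
    table.to_numpy(), correction=False
)

# Step 6: compare the p-value with the chosen significance level
alpha = 0.05
reject_independence = p_value < alpha

print(table)
print(f"chi2 = {chi2_stat:.4f}, df = {dof}, p = {p_value:.4f}")
print("Reject H0" if reject_independence else "Fail to reject H0")
```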

## Example Calculation
Consider a dataset with two variables: Gender (Male, Female) and Preference (Sports, Reading). Suppose we have the following observed frequencies in a contingency table:

| | Sports | Reading | Row Total |
|-------------|--------|---------|-----------|
| Male | 30 | 20 | 50 |
| Female | 20 | 30 | 50 |
| Column Total| 50 | 50 | 100 |

1. **Expected Frequencies**:
- E(Male, Sports) = (50 * 50) / 100 = 25
- E(Male, Reading) = (50 * 50) / 100 = 25
- E(Female, Sports) = (50 * 50) / 100 = 25
- E(Female, Reading) = (50 * 50) / 100 = 25

2. **Chi-Square Statistic**:
$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(30 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25}
$$

$$
\chi^2 = \frac{5^2}{25} + \frac{(-5)^2}{25} + \frac{(-5)^2}{25} + \frac{5^2}{25} = 4
$$

3. **Degrees of Freedom**:
- df = (2-1)(2-1) = 1

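4. **P-Value**: Using the Chi-Square distribution table or a calculator, the p-value corresponding to χ² = 4 and df = 1 is approximately 0.0455. Because this is below 0.05, we reject the null hypothesis of independence at the 5% significance level.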
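
The lookup in step 4 can also be done in software. The sketch below assumes SciPy is available (it is listed in the requirements file) and uses the survival function of the chi-square distribution. It also runs `chi2_contingency` on the example table twice, because SciPy applies Yates' continuity correction by default when df = 1, which gives a smaller statistic than the hand calculation above.

```python
from scipy.stats import chi2, chi2_contingency

# p-value for the hand-computed statistic: P(X >= 4) with 1 degree of freedom
p_value = chi2.sf(4.0, df=1)

# Observed counts from the example table (rows: Male, Female; columns: Sports, Reading)
observed = [[30, 20],
            [20, 30]]

# Plain Pearson chi-square, matching the hand calculation
stat_plain, p_plain, dof, _ = chi2_contingency(observed, correction=False)

# SciPy default: Yates' continuity correction is applied for 2x2 tables
stat_yates, p_yates, _, _ = chi2_contingency(observed)

print(f"hand calculation: p = {p_value:.4f}")
print(f"uncorrected:      chi2 = {stat_plain:.2f}, p = {p_plain:.4f}")
print(f"Yates-corrected:  chi2 = {stat_yates:.2f}, p = {p_yates:.4f}")
```

With the correction the p-value for this table rises above 0.05, so the conclusion depends on which variant is used. It is worth stating explicitly which one a report relies on.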

## Uses of Chi-Square Test
The Chi-Square Test for Independence is widely used in various fields, including:
1. **Biology**: To determine if there is an association between different genetic traits.
2. **Social Sciences**: To analyze survey data and examine relationships between demographic variables.
3. **Market Research**: To evaluate consumer preferences and behaviors based on categorical variables like gender, age group, etc.
4. **Medical Research**: To investigate the relationship between risk factors and health outcomes.

## Interpretation
- **P-Value < 0.05**: Reject the null hypothesis; there is a significant association between the variables.
- **P-Value ≥ 0.05**: Fail to reject the null hypothesis; the data do not provide sufficient evidence of an association between the variables. This does not prove that the variables are independent.

This test allows researchers and analysts to make informed decisions based on the relationships between categorical variables in their data.

## Coding Implementation
### 1. Generate Dataset
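First, we generate a dataset with 200 samples, each having two categorical variables: Gender and Preference. The dataset is saved to a CSV file for further analysis. Because both columns are drawn independently and uniformly at random, the variables are independent by construction, so the test should usually fail to reject the null hypothesis on this data.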

```python
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 200
genders = np.random.choice(['Male', 'Female'], size=n_samples)
preferences = np.random.choice(['Sports', 'Reading'], size=n_samples)

# Create a DataFrame
data = {
    'Gender': genders,
    'Preference': preferences
}
df = pd.DataFrame(data)

# Save DataFrame to CSV in the current environment
csv_file_path = 'large_sample_data.csv'
df.to_csv(csv_file_path, index=False)
print(f"CSV file created at: {csv_file_path}")
```
5 changes: 5 additions & 0 deletions Chi-Square Test/requirement.txt
@@ -0,0 +1,5 @@
pandas==1.5.3
numpy==1.24.2
scipy==1.10.0
matplotlib==3.7.1
seaborn==0.12.2