Chi-squared test

Dataset

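The dataset used in this post is categorical, which is the kind of data the chi-squared test is designed for: the test lets us determine whether an observed set of categorical counts is statistically significant.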

  • Sample rows:

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K

  • Column descriptions:

age – how old the person is.
workclass – the type of sector the person is employed in.
race – the race of the person.
sex – the gender of the person, either Male or Female.
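
Before any test can be run, the categorical values have to be turned into counts. The following is a minimal sketch using only the standard library, assuming the file is comma-separated with a space after each comma, as in the sample rows above. The helper name `count_categories` is made up for illustration, and the three sample rows are embedded so the snippet runs on its own.

```python
import csv
import io
from collections import Counter

# The three sample rows from the text, embedded so the sketch is self-contained.
SAMPLE = """age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
"""


def count_categories(csv_text, column):
    """Count how often each value of `column` occurs (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in this file.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return Counter(row[column] for row in reader)


sex_counts = count_categories(SAMPLE, "sex")
workclass_counts = count_categories(SAMPLE, "workclass")
print(sex_counts)
print(workclass_counts)
```

Run over the full file instead of the three-row sample, the same counting step would produce the observed counts that the rest of this post compares against expected counts.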

Proportional difference

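  • The proportional difference is computed as follows:

$$\text{proportional difference} = \frac{\text{observed} - \text{expected}}{\text{expected}}$$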

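  • The full US census shows a male-to-female ratio of about 1:1, so with 32561 people we expect 16280.5 of each sex. The counts in this dataset are:

    Sex      Observed   Expected
    Female   10771      16280.5
    Male     21790      16280.5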

  • So we compute the proportional difference for each sex:

female_diff = (10771 - 16280.5) / 16280.5
male_diff = (21790 - 16280.5) / 16280.5
'''
-0.33841098246368356
0.33841098246368356
'''
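  • These two proportional differences sum to 0, which tells us nothing useful. What we want to know is how far the observed values deviate from the expected values overall, so we modify the formula by squaring the difference, which stops the terms from cancelling when they are summed:

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

  • Summing these terms over all categories gives the chi-squared value $\chi^2$: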

female_diff = (10771 - 16280.5) ** 2 / 16280.5
male_diff = (21790 - 16280.5) ** 2 / 16280.5
gender_chisq = female_diff + male_diff
''' gender_chisq :3728.950615767329 '''
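
The same calculation can be written once for any number of categories. The sketch below is one way to do it; the function name `chi_squared` is an illustrative choice, not something from the original post, and the inputs are the gender counts used above.

```python
def chi_squared(observed, expected):
    """Sum (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Female and male counts from the text, each with an expected count of 16280.5.
gender_chisq = chi_squared([10771, 21790], [16280.5, 16280.5])
print(gender_chisq)
```

This should reproduce the value of roughly 3728.95 computed by hand above.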
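  • We run 1000 trials. In each one we randomly sample 32561 sexes according to the expected 1:1 proportions and compute the chi-squared value of that sample, which gives 1000 simulated chi-squared values that we plot as a histogram. Our observed value of 3728.950615767329 is far larger than every simulated value, so the p-value (the proportion of simulated values at least as large as the observed one) is 0, or more precisely below 1/1000. In other words, if the population really were split 1:1, there would be essentially no chance of drawing a sample that deviates this much. The result is therefore statistically significant (by the usual convention, a p-value below 0.05 counts as significant): the observed counts differ from the expected counts by more than chance can explain.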
from numpy.random import random
import matplotlib.pyplot as plt

chi_squared_values = []
for i in range(1000):
    sequence = random((32561,))
    sequence[sequence < .5] = 0
    sequence[sequence >= .5] = 1
    male_count = len(sequence[sequence == 0])
    female_count = len(sequence[sequence == 1])
    male_diff = (male_count - 16280.5) ** 2 / 16280.5
    female_diff = (female_count - 16280.5) ** 2 / 16280.5
    chi_squared = male_diff + female_diff
    chi_squared_values.append(chi_squared)

plt.hist(chi_squared_values)
plt.show()  # needed when running as a script rather than in a notebook
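
The histogram shows the comparison visually, and the p-value itself can also be computed directly from the simulated values. The following is a sketch of that step using only the standard library rather than numpy. It assumes the same setup as above (32561 draws per trial, 1:1 expected split, 1000 trials); the seed value and the function name `simulate_chisq` are arbitrary choices made for this example.

```python
import random

N = 32561                             # sample size from the text
OBSERVED_CHISQ = 3728.950615767329    # gender chi-squared value from the text


def simulate_chisq(rng, n=N):
    """Chi-squared value of one random sample drawn under a 1:1 split."""
    # Each of the n random bits is an independent fair coin flip, so the
    # number of 1-bits is the count for one sex in a simulated sample.
    ones = bin(rng.getrandbits(n)).count("1")
    zeros = n - ones
    expected = n / 2
    return (ones - expected) ** 2 / expected + (zeros - expected) ** 2 / expected


rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
simulated = [simulate_chisq(rng) for _ in range(1000)]

# Empirical p-value: share of simulated values at least as extreme as observed.
p_value = sum(value >= OBSERVED_CHISQ for value in simulated) / len(simulated)
print(max(simulated), p_value)
```

Because the simulated values stay far below the observed one, the empirical p-value comes out as 0, which with 1000 trials only means "smaller than 1 in 1000".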

Degrees of freedom

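  • In this data the total is known: the male and female counts sum to 32561. The degrees of freedom are therefore 1, because once the male count is fixed the female count follows by subtraction. Degrees of freedom are an important concept, so let us look at an attribute with more of them, race. With the five race categories used below, the degrees of freedom are 5 - 1 = 4.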

  • Compute the chi-squared value:
diffs = []
observed = [27816, 3124, 1039, 311, 271]
expected = [26146.5, 3939.9, 944.3, 260.5, 1269.8]

for i, obs in enumerate(observed):
    exp = expected[i]
    diff = (obs - exp) ** 2 / exp
    diffs.append(diff)

race_chisq = sum(diffs)
''' 1080.485936593381 '''
  • scipy has a dedicated function for computing the chi-squared value:
from scipy.stats import chisquare
import numpy as np
observed = np.array([27816, 3124, 1039, 311, 271])
expected = np.array([26146.5, 3939.9, 944.3, 260.5, 1269.8])

chisquare_value, race_pvalue = chisquare(observed, expected)
'''
chisquare_value:1080.485936593381 race_pvalue :1.2848494674873035e-232
'''
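
To see where the degrees of freedom enter, the p-value for race can be checked by hand. For an even number of degrees of freedom the upper tail of the chi-squared distribution has a finite closed form, P(X >= x) = exp(-x/2) * sum over k from 0 to df/2 - 1 of (x/2)^k / k!. The sketch below applies it with df = 5 - 1 = 4; the function name `chi2_sf_even_df` is made up for this example, and the formula only covers even df, so it is not a general replacement for scipy.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable.

    Uses the closed form that holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0    # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Race counts and expected counts from the text.
observed = [27816, 3124, 1039, 311, 271]
expected = [26146.5, 3939.9, 944.3, 260.5, 1269.8]

race_chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # five categories with a fixed total
race_pvalue = chi2_sf_even_df(race_chisq, df)
print(df, race_chisq, race_pvalue)
```

The result should agree with the `race_pvalue` that `chisquare` reported above, of the order of 1.28e-232.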
