Chi-Square Test Principles: What Is the Chi-Square Test and How Does It Work?

As a data science engineer, you need the sample data set you draw from your data to be reliable, clean, and well tested for its usability in machine learning model building.

So how do you do that?

We have several statistical techniques to draw on. Descriptive statistics measure the central value of the data and how the data is spread around the mean or median: is it normally distributed, or is there a skew in the spread? Please refer to my previous article on the topic for more detail.

The first thing we do is visualize the data, using various data visualization techniques, to get an early sense of any skewness or discrepancies and to identify relationships between the variables in the data set.

Data has so much to say, and we data engineers give it a voice to express and describe itself using descriptive statistical techniques.

But to make predictions, or to infer something beyond the given data, we rely on inferential statistical techniques.

Inferential statistics is concerned with making inferences from relations found in the sample to relations in the population. It helps us decide, for example, whether the differences between groups that we see in our data are strong enough to support our hypothesis that group differences exist in general, in the entire population.

Today we will cover one inferential technique, the popular Chi-Square test, to understand the concept of hypothesis testing.

What is the Chi-Square Test?

Do remember that:

It is an inferential statistical test that works on categorical data.

The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for that variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.

We test how likely the sample data is under the null hypothesis, to find out whether the observed distribution of the data set could be a statistical fluke (due to chance) or not. The “goodness of fit” statistic in the chi-square test measures how well the observed distribution of data fits the distribution that is expected under the null hypothesis, for example, that the variables are independent.

How Does Chi-Square Work?

Generally, in this test we try to establish whether there is a relationship between the given categorical variables. When chi-square evaluates whether the given variables in a data set (sample) are independent, it is called the Test of Independence. Chi-square tests are used for testing hypotheses about one or two categorical variables and are appropriate when the data can be summarized by counts in a table. The variables can have multiple categories.

Types of Chi-Square Test:

For one categorical variable, we perform

  • Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test begins by hypothesizing that the distribution of a variable behaves in a particular manner. For example, in order to determine the daily staffing needs of a retail store, the manager may wish to know whether there is an equal number of customers each day of the week.
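To make the retail store example concrete, here is a minimal sketch of a goodness-of-fit calculation. The daily customer counts are hypothetical numbers invented for illustration, not data from the article, and the critical value is the standard table value for 6 degrees of freedom at the 0.05 level. The statistic and the critical-value comparison are explained in the sections that follow.

```python
# Hypothetical daily customer counts, Monday through Sunday (illustrative only).
observed = [120, 95, 100, 105, 110, 130, 180]

# Null hypothesis: customers are spread equally across the 7 days.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For one categorical variable, df = number of categories - 1.
df = len(observed) - 1

# Critical value for df = 6 at the 0.05 level, taken from a standard chi-square table.
CRITICAL_VALUE_DF6_ALPHA_05 = 12.592

reject_null = chi2_stat > CRITICAL_VALUE_DF6_ALPHA_05
print(f"chi-square = {chi2_stat:.2f}, df = {df}, reject H0: {reject_null}")
```

With these made-up counts the statistic is far above the critical value, so the manager would reject the idea that every day is equally busy. With real counts the conclusion could of course go the other way.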

For two categorical variables, we perform

  • Chi-Square Test for Association

Another way to describe the Chi-square test is:

It tests the null hypothesis that the variables are independent.

The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. The more the observed data departs from that model, the stronger the evidence that the variables are dependent, and so the stronger the evidence against the null hypothesis.

Hypotheses in Chi-Square:

As a data engineer, the first thing you need to do before performing any inferential statistical test like Chi-Square is to establish

  • H0: Null Hypothesis

  • H1: Alternate Hypothesis

For One Categorical Variable:

  • Null hypothesis: The proportions match an assumed set of proportions

  • Alternative hypothesis: At least one category has a different proportion

For Two Categorical Variables:

  • Null hypothesis: There is no association between the two variables

  • Alternative hypothesis: There is an association between the two variables

Before we work through an example of how Chi-square works, we need to understand what the Chi-square distribution is, along with some related concepts. This Chi-squared distribution is what we will use going forward in the chi-square, or χ2, test.

What Is the Chi-Square Distribution?

The chi-square distribution (also chi-squared or χ2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.

It is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing and in the construction of confidence intervals.

The primary reason that the chi-square distribution is used extensively in hypothesis testing is its relationship to the normal distribution. An additional reason is that it arises as the large-sample distribution of likelihood ratio tests (LRTs). LRTs have several desirable properties; in particular, they commonly provide the highest power to reject the null hypothesis.

Degrees of Freedom in the Chi-Squared Distribution:

The degrees of freedom of a Chi-Squared distribution equal the number of standard normal deviates being summed. The mean of a Chi-square distribution is its degrees of freedom. A chi-square distribution constructed by squaring a single standard normal variable is said to have 1 degree of freedom.
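The definition above can be checked with a small simulation. The sketch below is illustrative only: the function names, the number of draws, and the seed are my own choices. It builds chi-square values by summing k squared standard normal draws and compares the simulated mean with k.

```python
import random

def sample_chi_square(k, rng):
    """Draw one value: the sum of squares of k independent standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

def simulated_mean(k, n_draws=20000, seed=0):
    """Average many simulated chi-square values with k degrees of freedom."""
    rng = random.Random(seed)
    return sum(sample_chi_square(k, rng) for _ in range(n_draws)) / n_draws

# The simulated mean should land close to the degrees of freedom k.
for k in (1, 2, 5):
    print(f"k = {k}: simulated mean = {simulated_mean(k):.2f}")
```

The printed means should sit near 1, 2, and 5, which is the "mean equals degrees of freedom" property stated above.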

The degrees of freedom (df or d) tell you how many numbers in your grid are actually independent. For a Chi-square grid, the degrees of freedom can be thought of as the number of cells you need to fill in before, given the totals in the margins, you can fill in the rest of the grid using a formula.

The degrees of freedom for a Chi-square grid equal the number of rows minus one, times the number of columns minus one: that is, (R-1)*(C-1).
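As a rough illustration of why (R-1)*(C-1) cells are enough, the sketch below fills in a 2 x 3 grid from its margins and only two known cells. The totals and the two known counts match the voting example worked through later in this article. The helper name fill_grid is hypothetical, and it assumes the known cells form the top-left (R-1) x (C-1) block.

```python
def fill_grid(row_totals, col_totals, known):
    """Complete an R x C table from its margins and its top-left (R-1) x (C-1) block."""
    n_rows, n_cols = len(row_totals), len(col_totals)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - 1):
        for j in range(n_cols - 1):
            grid[i][j] = known[i][j]
        # The last cell of each known row follows from the row total.
        grid[i][n_cols - 1] = row_totals[i] - sum(known[i])
    # Every cell of the last row follows from its column total.
    for j in range(n_cols):
        grid[n_rows - 1][j] = col_totals[j] - sum(grid[i][j] for i in range(n_rows - 1))
    return grid

row_totals = [400, 600]
col_totals = [450, 450, 100]
known = [[200, 150]]  # only (2-1)*(3-1) = 2 cells are supplied

df = (len(row_totals) - 1) * (len(col_totals) - 1)
grid = fill_grid(row_totals, col_totals, known)
print(df, grid)
```

Two free cells determine the other four, which is what df = 2 means for this table.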

Remember!

As the degrees of freedom (df) increase, the Chi-square distribution approaches a normal distribution.

Chi-Square Statistic:

The formula for the chi-square statistic used in the chi-square test is:

Χ²c = Σ [ (Oi - Ei)² / Ei ]

The subscript “c” is the degrees of freedom, “O” is your observed value, and “E” is your expected value. The summation symbol means that you have to perform the calculation for every category (every cell of the table) in your data set and add up the results.

For a table of two variables, the expected value of each cell is

E = (row total × column total) / sample size

The Chi-square statistic can only be computed from counts. It can't be used on percentages, proportions, means, or similar statistical values. For example, if you have 10 percent of 200 people, you would need to convert that to a count (20) before you can compute the test statistic.

The Chi-Square test involves calculating the Chi-square statistic described above, which, when the null hypothesis is true, approximately follows the Chi-square distribution.

Before working through an example that brings together the topics covered above, we need one more concept:

P-Value:

The null hypothesis provides a probability framework against which to compare our data. Specifically, through the proposed statistical model, the null hypothesis can be represented by a probability distribution, which gives the probability of all possible outcomes if the null hypothesis is true. It is a probabilistic representation of our expectations under the null hypothesis.

The P-value is the probability, computed from that distribution, of obtaining a result at least as extreme as the one actually observed if the null hypothesis is true.

Chi-Square Test Explained With Example:

We will cover the following steps in our walk through the Chi-square test for independence of two variables.

  • State The Hypothesis

  • Formulate Data Analysis Plan

  • Analyze The Sample Data

  • Interpret The Outcome

Problem: This problem has been sourced from Stat Trek.

A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). The results are shown in the contingency table below.

                Republican   Democrat   Independent   Row total
Male               200          150          50          400
Female             250          300          50          600
Column total       450          450         100         1000

We have to infer: is there a gender gap? Do the men's voting preferences differ significantly from the women's preferences? Use a 0.05 level of significance.

Let's try to solve this problem using the Chi-Square test to find the P-value.

The test type we will employ is:

Chi-square test for independence.

So let's get started by first stating our hypothesis.

Step 1: State The Hypothesis:

Here we need to start by establishing a null hypothesis and a counter hypothesis (alternative hypothesis), as given below.

Null Hypothesis:

H0: Gender and voting preferences are independent.

Alternate Hypothesis:

H1: Gender and voting preferences are not independent.

Step 2: Let's Build Our Data Analysis Plan:

Here we will find the P-value and compare it with the significance level. Let's take the standard, widely accepted significance level of 0.05. Given the sample data in the table above, we will employ the Chi-Square test for independence and compute the P-value.

Step 3: Let's Do Sample Analysis:

Here we will analyze the given sample data to compute

  • Degrees of freedom

  • Expected frequency count for each cell

  • Chi-Square test statistic value

All the above values will help us find the P-value.

Degrees Of Freedom Calculation: Let's calculate df = (r - 1) * (c - 1). In the given table, we have r (rows) = 2 and c (columns) = 3, so

df = (2-1)*(3-1) = 1*2 = 2

Expected Frequency Count Calculation:

Let Eij represent the expected count for the cell in row i and column j if the two variables are independent of one another.

Eij = (ith row total × jth column total) / grand total

Let's calculate the expected value for each cell using the formula above.

Here, the row 1 total = 400, the column 1 total = 450, and the total sample size = 1000.

So,

E1,1 = (400 * 450) / 1000 = 180000/1000 = 180

Similarly, let's calculate the other expected values, as shown below:

E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60
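The same expected counts can be reproduced in a few lines of code. This is a minimal sketch using only the observed counts from the example; the function name expected_counts is my own.

```python
# Observed counts from the example: rows are the two gender groups,
# columns are Republican, Democrat, Independent.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]

def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

Each printed row should agree with the hand calculations above (180, 180, 40 and 270, 270, 60).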

Time to calculate the chi-square contribution for each of the expected values calculated above.

Calculating Chi-Square:

As already discussed above, the formula for calculating the chi-square statistic is

Χ² = Σ [ (Oi,j - Ei,j)² / Ei,j ]

Here “O” is your observed value (the actual values given in the table above) and “E” is your expected value (which we just calculated). The summation symbol means that you have to perform the calculation for every cell of the table and add up the results.

Using the above formula, our chi-square value comes out as given below:

Χ² = (200-180)²/180 + (150-180)²/180 + (50-40)²/40 + (250-270)²/270 + (300-270)²/270 + (50-60)²/60
Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60

So our final chi-square statistic value is

Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
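As a cross-check on the arithmetic, the sum can be computed directly from the observed and expected counts. This is a small sketch; the function name chi_square_statistic is illustrative.

```python
observed = [[200, 150, 50], [250, 300, 50]]
expected = [[180, 180, 40], [270, 270, 60]]

def chi_square_statistic(obs, exp):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))
```

Rounded to one decimal place, the result matches the 16.2 obtained by hand.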

Having calculated the chi-square value and the degrees of freedom, we can check whether the chi-square statistic of 16.2 exceeds the critical value for the Chi-square distribution. The intent is to find the P-value, which is the probability that a chi-square statistic with 2 degrees of freedom is more extreme than 16.2.

How to calculate the P-value?

Given degrees of freedom = 2 and a Chi-square statistic value of 16.2, we can easily find the P-value using a Chi-Square calculator: simply enter the Chi-square statistic value and the degrees of freedom as input, keep the significance level at 0.05, and you will get the result given below:

The P-value is 0.000304. The result is significant at p < .05.
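If no calculator is at hand, the P-value for this example can also be computed directly. The sketch below relies on a known closed form: for an even number of degrees of freedom df = 2m, the upper-tail probability is exp(-x/2) times the sum of (x/2)^i / i! for i from 0 to m-1, which for df = 2 reduces to exp(-x/2). The function name is my own, and the restriction to even df is a simplification of this sketch; statistical libraries such as SciPy provide a general chi-square survival function.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

p_value = chi2_sf_even_df(16.2, 2)
print(f"P-value = {p_value:.6f}")

alpha = 0.05
print("significant" if p_value < alpha else "not significant")
```

The value agrees with the calculator result quoted above to six decimal places.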

You can also reach the same conclusion using a Chi-Square table of critical values.

Having calculated the chi-square value to be 16.2 and the degrees of freedom to be 2, we consult a chi-square table to check whether the chi-square statistic of 16.2 exceeds the critical value for the Chi-square distribution. The critical value for an alpha of .05 (95% confidence) with df = 2 comes out to be 5.99.
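The table value of 5.99 can be reproduced for this particular case. With 2 degrees of freedom the upper-tail probability is exp(-x/2), so solving exp(-x/2) = alpha gives x = -2 ln(alpha). The sketch below uses that shortcut; it is specific to df = 2, and the function name is illustrative.

```python
import math

def chi2_critical_df2(alpha):
    """Critical value for a chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

critical = chi2_critical_df2(0.05)
chi2_statistic = 16.2

print(f"critical value = {critical:.2f}")
print("reject H0" if chi2_statistic > critical else "fail to reject H0")
```

Comparing the statistic with the critical value is equivalent to comparing the P-value with alpha, so both routes lead to the same decision.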

Step 4: Interpreting the result

A: Inference From The P-value:

Since the P-value (0.000304) is less than the significance level (0.05), we reject the null hypothesis, which says that gender and voting preferences are independent, and accept the alternate hypothesis, which says that gender and voting preferences are not independent.

Hence we can conclude that:

There is a relationship between gender and voting preference.

B: Interpreting from Chi-Square Table:

Since the critical value for an alpha of .05 (95% confidence) with df = 2 is 5.99, and our chi-square statistic value of 16.2 is much larger than 5.99, we have sufficient evidence to reject the null hypothesis stated above.

So we accept the alternate hypothesis, which says that gender and voting preferences are not independent.

Hence we conclude that:

There is a relationship between gender and voting preference.

What's Next?

In part 2 of this series, Inferential Statistic: Hypothesis testing Using Chi-Square, we will see how to perform the Chi-Square test using Python and a Jupyter notebook, and will further explore

  • Normal Deviate Z Test

  • Two-Sample T-Test

  • ANOVA Test

We will also introduce one of the key topics: “Power of a Statistical Test”.

The power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis.

Summing up this part, with a very helpful infographic which guides you to choose your hypothesis test type:

[Infographic: choosing a hypothesis test type]

So choose your test data wisely and make sure you are interpreting the sample data correctly, so that you can go on to design your ML models with the required accuracy and confidence.

You will be an effective data scientist only if you know how to analyze the given sample data with minimum deviation. The more precisely you treat and clean the data in the preliminary stage of EDA, the more reliable and productive your model-building effort will become.

Translated from: https://medium.com/swlh/what-is-chi-square-test-how-does-it-work-3b7f22c03b01
