Determining Sample Size
Congratulations, your experiment has yielded significant results! You can be sure (well, 95% sure) that the independent variable influenced your dependent variable. I guess all you have left to do is write up your discussion and submit your results to a scholarly journal. Right…?
Obtaining significant results is a tremendous accomplishment in itself, but it does not tell the entire story behind your results. I want to take this time to discuss statistical significance, sample size, statistical power, and effect size, all of which have an enormous impact on how we interpret our results.
Significance (p = 0.05)
First and foremost, let’s discuss statistical significance, as it forms the cornerstone of inferential statistics. We’ll discuss significance in the context of true experiments, where it is most relevant and most easily understood. A true experiment is used to test one or more specific hypotheses about the causal relationship between two or more variables. Specifically, we hypothesize that one or more variables (i.e. independent variables) produce a change in another variable (i.e. the dependent variable). That change is our inferred causality. If you would like to learn more about the various research design types visit my article (LINK).
For example, we want to test the hypothesis that an authoritative teaching style will produce higher test scores in students. To test this hypothesis accurately, we randomly select students and randomly assign each of them to one of two classrooms. One classroom is taught by an authoritarian teacher and the other by an authoritative teacher. Throughout the semester, we collect all the test scores from both classrooms. At the end of the semester, we average all the scores to produce a grand average for each classroom. Let’s assume the average test score for the authoritarian classroom was 80%, and for the authoritative classroom 88%. It would seem your hypothesis was correct: the students taught by the authoritative teacher scored on average 8 percentage points higher on their tests than the students taught by the authoritarian teacher. However, if we ran this experiment 100 times, each time with different groups of students, do you think we would obtain similar results? How likely is it that this apparent effect of teaching style on test scores occurred by chance, or was produced by another latent (i.e. unmeasured) variable? Last but not least, is a difference of 8 percentage points large enough for 88% to count as meaningfully different from 80%?
Null Hypothesis: The default hypothesis, which states there are no differences between groups. In our teaching style example, the null hypothesis would predict no difference in student test scores between teaching styles.
Alternative or Research Hypothesis: Our original hypothesis, which predicts the authoritative teaching style will produce the higher average student test scores.
Now that we have set the stage, let’s define what a p-value is and what it means for your results to be significant.
The p-value is the probability of obtaining results at least as extreme as ours if the null hypothesis were true. It is not the same thing as alpha (α), which is the significance threshold we choose before running the test. Obtaining a significant result simply means the p-value produced by your statistical test was equal to or less than your alpha, which in most cases is 0.05.
An alpha of 0.05 is a common standard used in many areas of research.
A significant p-value (i.e. less than 0.05) indicates that, if the null hypothesis were true, there would be a less than 5% chance of observing a difference at least as large as the one we found. If this is the case, we reject the null hypothesis in favor of our alternative hypothesis and conclude the student test scores are significantly different from each other. Notice we didn’t say the different teaching styles caused the significant differences in student test scores. The p-value only tells us whether the groups differ by more than chance alone would readily explain; we still need to make the inferential leap and assume that teaching style is what made the groups different.
Another way of looking at an alpha of 0.05: if the null hypothesis were true and we ran this experiment 100 times, we would expect about 5 of those runs to show a significant difference in test scores purely by chance.
If we set our alpha to 0.01, we would need our resulting p-value to be equal to or less than 0.01 (i.e. 1%) in order to consider our results significant. Of course, this imposes a stricter criterion: a significant result would mean that, if the null hypothesis were true, there would be a less than 1% chance of seeing a difference this large.
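To make the "100 experiments" reading of alpha concrete, here is a minimal simulation sketch. It assumes test scores are normally distributed with a known standard deviation of 10 and uses a simple z-test rather than the t-test a real analysis would likely use. The helper names (`two_sided_p`, `false_positive_rate`) and all the numbers are illustrative assumptions, not part of the original study.

```python
import math
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sigma):
    """Two-sided p-value of a z-test for a difference in means (sigma assumed known)."""
    se = sigma * math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=2000, n_per_group=30, alpha=0.05, seed=1):
    """Share of simulated experiments reaching p <= alpha although the null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both classrooms are drawn from the SAME distribution (mean 80, SD 10),
        # so every "significant" result here is a false positive.
        a = [rng.gauss(80, 10) for _ in range(n_per_group)]
        b = [rng.gauss(80, 10) for _ in range(n_per_group)]
        if two_sided_p(a, b, sigma=10) <= alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate()
print(f"Share of null experiments with p <= 0.05: {rate:.3f}")
```

The printed share should land close to 0.05, which is what alpha promises: roughly 5 in 100 experiments look significant even though teaching style has no effect in the simulation.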
Statistical Power
The sample size, or the number of participants in your study, has an enormous influence on whether or not your results are significant. The larger the actual difference between the groups (i.e. in student test scores), the smaller the sample we’ll need to find a significant difference (i.e. p ≤ 0.05). Theoretically, we can find a significant difference in most experiments given a large enough sample size. However, extremely large samples make for expensive studies and are extremely difficult to obtain.
Type I error (α), or a false positive, is the probability of concluding the groups are significantly different when in reality they are not. With an alpha of 0.05, we are willing to concede a 5% chance that we incorrectly reject the null hypothesis.
Type II error (β), or a false negative, is the probability of concluding the groups are not significantly different when in fact they are different. We can decrease the probability of committing a Type II error by making sure our statistical test has an appropriate amount of power.
Power is defined as 1 − β, where β is the probability of a Type II error. In other words, it is the probability of detecting a difference between the groups when the difference actually exists (i.e. the probability of correctly rejecting the null hypothesis). Therefore, as we increase the power of a statistical test, we increase its ability to detect a significant (i.e. p ≤ 0.05) difference between the groups.
It is generally accepted that we should aim for a power of 0.8 or greater.
At that level, we have an 80% chance of finding a statistically significant difference when one really exists. That said, we still have a 20% chance of failing to detect a real difference between the groups.
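The relationship between power, β, and sample size can be sketched numerically. The snippet below is a rough illustration only: it uses a normal approximation for a two-sided comparison of two equally sized groups, takes the true standardized difference `d` as given, and the function name `approx_power` is a hypothetical helper. Exact t-based calculations give slightly different values.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (equal group sizes)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true standardized difference is d.
    shift = d * math.sqrt(n_per_group / 2)
    # Probability that the statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


for n in (10, 30, 64, 100):
    power = approx_power(d=0.5, n_per_group=n)
    print(f"n per group = {n:3d}  power = {power:.2f}  beta = {1 - power:.2f}")
```

Under these assumptions power rises steadily with the number of participants per group, and β (the chance of missing a real difference) shrinks accordingly.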
Effect Size
If you recall our teaching style example, we found a significant difference between the two groups of students. The average test score in the authoritarian classroom was 80%, and in the authoritative classroom it was 88%. Effect size tries to answer the question, “These differences are statistically significant, but are they large enough to be meaningful?”
Effect size addresses the concept of “minimal important difference”, which states that below a certain point a significant difference (i.e. p ≤ 0.05) is so small that it wouldn’t provide any benefit in the real world. Therefore, effect size tries to determine whether or not the 8-point increase in student test scores between authoritarian and authoritative teachers is large enough to be considered important. Keep in mind that by “small” we do not mean a small p-value.
A different way to look at effect size is as a quantitative measure of how much the independent variable (IV) affected the dependent variable (DV). A large effect size suggests an important result, because the manipulation of the IV produced a large change in the DV.
Effect size is typically expressed as Cohen’s d. Cohen described d = 0.2 as a small effect, d = 0.5 as a medium effect, and d = 0.8 as a large effect.
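As a rough illustration of how Cohen’s d is computed, the sketch below divides the difference in group means by the pooled standard deviation. The two score lists are made-up numbers chosen only so that the means match the 88% and 80% from the teaching example. The spread of the scores is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def cohen_label(d):
    """Map |d| onto Cohen's conventional small / medium / large benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below the small benchmark"


# Hypothetical scores: means are 88 and 80, the spread is invented for illustration.
authoritative = [88, 92, 80, 96, 84]
authoritarian = [80, 84, 72, 88, 76]

d = cohens_d(authoritative, authoritarian)
print(f"Cohen's d = {d:.2f} ({cohen_label(d)})")
```

The same 8-point gap would yield a smaller d if scores within each classroom were more spread out, which is why the raw difference alone does not tell us the effect size.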
Small p-values (0.05 and below) are not evidence of large or important effects, nor do large p-values (above 0.05) imply that an effect is unimportant or small. Given a large enough sample size, even very small effect sizes can produce significant p-values (0.05 and below). In other words, statistical significance addresses how readily chance alone could have produced our results, while effect size describes how large, and therefore how important, those results are.
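A quick way to see how sample size alone can push a tiny effect below p = 0.05 is to hold the effect size fixed and grow the groups. The sketch below assumes a two-group z-test in which the observed standardized difference happens to equal d exactly. The value d = 0.05 and the helper name `expected_p_value` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def expected_p_value(d, n_per_group):
    """Two-sided p-value when the observed standardized difference equals d exactly."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A very small effect (d = 0.05) evaluated at increasingly large group sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"d = 0.05  n per group = {n:>7,}  p = {expected_p_value(0.05, n):.4f}")
```

The effect size never changes in this loop, only the number of participants does, yet the p-value eventually drops below 0.05.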
Putting It All Together (Power Analysis)
We can calculate the minimum sample size our experiment requires to achieve a specific statistical power for a given effect size. This analysis should be conducted a priori, that is, before actually conducting the experiment.
Power analysis is a critical procedure to conduct during the design phase of your study. This way you will have a good idea of the number of participants needed in each experimental group (including the control group) to find a significant difference, if there is one to be found.
G*Power is a great free program for quickly calculating the required sample size based on your power and effect size parameters.
G*Power
1. Select the “Test Family” appropriate for your analysis
- We’ll select t-tests
2. Select the “Statistical Test” you are using for your analysis
- We will use Means: Difference between two independent means (two groups)
3. Select the “Type of Power Analysis”
- We will select “A priori” to determine the required sample size for the power and effect size you wish to achieve.
4. Select the number of tails
- Use one tail only if you wish to detect a significant difference between groups in one direction. Typically, we select a two-tailed test.
- We will select a two-tailed test
5. Select the desired effect size, or “Effect size d”
- We’ll go through a range of effect sizes
6. Select “α err prob”, or alpha: the probability of rejecting the null hypothesis when there is actually no difference between the groups.
- We’ll use 0.05
7. Select the power you wish to achieve.
- We’ll select 0.8 (80% power) and 0.9 (90% power)
8. Select “Allocation Ratio N2/N1”
- If you expect to have an equal number of participants in each group (treatment and control), select 1. If one group will have twice as many participants as the other, select 2.
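The settings walked through above (two independent groups, two-tailed, α = 0.05, power of 0.8 or 0.9, allocation ratio of 1) can be approximated in a few lines of code. This is a sketch based on the normal approximation, not a reimplementation of G*Power: exact t-based calculations typically come out a participant or two larger per group. The function name `n_per_group` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.8, alpha=0.05, tails=2):
    """Approximate participants needed per group for two equally sized groups."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / tails)  # critical value for the chosen alpha
    z_power = std_normal.inv_cdf(power)              # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Cohen's small, medium, and large benchmarks at 80% and 90% power.
for power in (0.8, 0.9):
    for d in (0.2, 0.5, 0.8):
        n = n_per_group(d, power=power)
        print(f"power = {power:.1f}  d = {d:.1f}  n per group = {n:4d}  total = {2 * n:4d}")
```

Even as an approximation, the output shows how quickly the required sample grows as the target effect size shrinks or the target power rises.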
In general, large effect sizes require smaller sample sizes because they are “obvious” enough for the analysis to find. As the effect size decreases, we require larger sample sizes, because smaller effects are harder to detect. This works in our favor: the larger the effect size, the more important our results and the fewer participants we need to recruit for our study.
Last but not least, pay attention to whether a reported sample size is per group or in total. For example, for the two-group t-test configured above, finding a large effect size (d = 0.8) with a power of 80% and a two-tailed alpha of 0.05 requires about 26 participants per group, or 52 in total. An experiment with one IV that has more than two groups/levels needs its own power analysis for that design rather than a multiple of the two-group figure.
Translated from: https://towardsdatascience.com/the-relationship-between-significance-power-sample-size-effect-size-899fcf95a76d