cumei1658

Repeated measures ANOVA using Python


Within-subjects designs are common in experimental psychology. One way to analyze data collected with such a design is a repeated measures ANOVA. I recently wrote a post on how to conduct a repeated measures ANOVA using Python and rpy2. I wrote that post because the otherwise great Python package statsmodels does not include repeated measures ANOVA. However, the rpy2 approach requires that the R statistical environment is installed. Recently, I found a Python library called pyvttbl with which you can run within-subjects ANOVAs. Pyvttbl lets you create multidimensional pivot tables, process data, and carry out statistical tests. Using the anova method on pyvttbl's DataFrame, we can carry out a repeated measures ANOVA using only Python.

Why within-subjects designs?

A within-subjects design has at least two advantages. First, more information is obtained from each subject than in a between-subjects design: each subject is measured in all conditions, whereas in a between-subjects design each subject is typically measured in one or more, but not all, conditions. A within-subjects design therefore requires fewer subjects to reach a given level of statistical power. When recruiting subjects is costly, this kind of design is clearly preferable to a between-subjects design. Second, the variability due to individual differences between subjects is removed from the error term. That is, each subject serves as his or her own control, and extraneous error variance is reduced.
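To make the second advantage concrete, here is a minimal sketch in plain Python (it does not use pyvttbl) of how a one-way repeated measures ANOVA partitions the sums of squares. The function name rm_anova_oneway and the small data set are made up for illustration. The point is that the variability between subjects is taken out before the error term is formed, which is not the case if the same numbers are treated as independent groups.

```python
def rm_anova_oneway(scores):
    """One-way repeated measures ANOVA on a subjects x conditions table.

    scores: list of rows, one row per subject, one value per condition.
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of conditions
    grand = sum(sum(row) for row in scores) / float(n * k)
    cond_means = [sum(row[j] for row in scores) / float(n) for j in range(k)]
    subj_means = [sum(row) / float(k) for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subjects = k * sum((m - grand) ** 2 for m in subj_means)
    # Within-subjects error: subject variability has been removed
    ss_error = ss_total - ss_cond - ss_subjects

    df_cond = k - 1
    df_error = (n - 1) * (k - 1)
    f_within = (ss_cond / df_cond) / (ss_error / df_error)

    # For comparison only: error term if the groups were independent
    f_between = (ss_cond / df_cond) / ((ss_total - ss_cond) / (k * (n - 1)))
    return {'ss_cond': ss_cond, 'ss_subjects': ss_subjects,
            'ss_error': ss_error, 'F_within': f_within,
            'F_between': f_between}


# Made-up response times (ms): 4 subjects x 2 conditions, with large
# individual differences but a consistent condition effect.
scores = [[400, 450], [500, 560], [600, 640], [700, 750]]
result = rm_anova_oneway(scores)
for name in sorted(result):
    print(name, round(result[name], 3))
```

With these numbers almost all of the variability sits between subjects, so the within-subjects F ratio is far larger than the one obtained when subject variability is left in the error term.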

Repeated measures ANOVA in Python

Installing pyvttbl

pyvttbl can be installed using pip:

pip install pyvttbl

If you are using Linux, you may need to add 'sudo' before the pip command. This installs pyvttbl and, hopefully, any missing dependencies.

Python script

Next, I simulate a response time data set. If you want to run the analysis on your own data, you can use the method "read_tbl" to load it from a CSV file.
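The simulation code itself is not shown here, so the following is only a sketch of what such a data set could look like, using the standard library alone. The column names Sub_id, condition and rt match the anova call used in this post, and 40 subjects matches the number of observations in the results table. The condition labels, means and standard deviations are assumptions chosen for illustration.

```python
import csv
import os
import random
import tempfile


def simulate_rt(n_subjects=40, seed=1):
    """Return long-format rows: one row per subject and condition."""
    rng = random.Random(seed)
    # Hypothetical condition labels and mean response times in ms
    condition_means = {'congruent': 500.0, 'incongruent': 650.0}
    rows = []
    for sub_id in range(1, n_subjects + 1):
        # Stable individual difference shared by both conditions
        subject_offset = rng.gauss(0, 80)
        for condition in sorted(condition_means):
            rt = condition_means[condition] + subject_offset + rng.gauss(0, 50)
            rows.append({'Sub_id': sub_id,
                         'condition': condition,
                         'rt': round(rt, 1)})
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, 'w', newline='') as handle:
        writer = csv.DictWriter(handle,
                                fieldnames=['Sub_id', 'condition', 'rt'])
        writer.writeheader()
        writer.writerows(rows)


rows = simulate_rt()
csv_path = os.path.join(tempfile.gettempdir(), 'rt_simulated.csv')
write_csv(rows, csv_path)
print(len(rows), 'rows written to', csv_path)
```

The data are in long format (one row per subject and condition), which is the layout the sub and wfactors arguments described in this post refer to.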

Conducting the repeated measures ANOVA with pyvttbl is pretty straightforward. You take the pyvttbl DataFrame object and call its anova method. The first argument is your dependent variable (e.g., response time), and sub specifies the column that holds the subject IDs (e.g., sub='Sub_id'). Finally, you add your within-subjects factor(s) with wfactors, which takes a list of the column names containing your within-subjects factors. In my simulated data there is only one ('condition').

aov = df.anova('rt', sub='Sub_id', wfactors=['condition'])
print(aov)

Tests of Within-Subjects Effects

Measure: rt

Source		Type III Sum of Squares	ε	df	MS	F	Sig.	η²_G	Obs.	SE of x̄	±95% CI	λ	Obs. Power
condition	Sphericity Assumed	4209536.428	–	1.000	4209536.428	309.093	0.000	4.165	40.000	19.042	37.323	317.019	1.000
	Greenhouse-Geisser	4209536.428	1.000	1.000	4209536.428	309.093	0.000	4.165	40.000	19.042	37.323	317.019	1.000
	Huynh-Feldt	4209536.428	1.000	1.000	4209536.428	309.093	0.000	4.165	40.000	19.042	37.323	317.019	1.000
	Box	4209536.428	1.000	1.000	4209536.428	309.093	0.000	4.165	40.000	19.042	37.323	317.019	1.000
Error(condition)	Sphericity Assumed	531140.646	–	39.000	13618.991
	Greenhouse-Geisser	531140.646	1.000	39.000	13618.991
	Huynh-Feldt	531140.646	1.000	39.000	13618.991
	Box	531140.646	1.000	39.000	13618.991

As the output table shows, the sum of squares used is Type III, which is what common statistical software uses when calculating the ANOVA F-statistic (e.g., SPSS, or R packages such as 'afex' or 'ez'). The table also contains corrections in case the data violate the assumption of sphericity (which is nothing to worry about when the factor has only two levels, as in the simulated data). We also get generalized eta squared as an effect size measure, along with 95% confidence intervals. The docstring for the class Anova states that standard errors and 95% confidence intervals are calculated according to Loftus and Masson (1994). Furthermore, generalized eta squared allows comparison across between-subjects and within-subjects designs (see Olejnik & Algina, 2003).
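As a quick sanity check on how the columns of the table relate to each other, the sketch below recomputes the mean squares and the F ratio from the sums of squares and degrees of freedom reported in the "Sphericity Assumed" rows. It uses only the numbers printed in the table and the textbook definitions MS = SS / df and F = MS_effect / MS_error; it does not call pyvttbl.

```python
# Values copied from the "Sphericity Assumed" rows of the table
ss_condition = 4209536.428
df_condition = 1.0
ss_error = 531140.646
df_error = 39.0

# Mean square = sum of squares divided by its degrees of freedom
ms_condition = ss_condition / df_condition
ms_error = ss_error / df_error

# The F ratio compares the effect mean square with the error mean square
f_ratio = ms_condition / ms_error

print('MS condition: %.3f' % ms_condition)
print('MS error:     %.3f' % ms_error)
print('F(%d, %d) = %.3f' % (df_condition, df_error, f_ratio))
```

The recomputed values agree with the MS and F columns of the table to three decimals.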

Conveniently, if you ever want to transform your data, you can add the argument transform. There are several options: log or log10, reciprocal or inverse, square-root or sqrt, arcsine or arcsin, and windsor10. For instance, to use a log transformation you add the argument transform='log' (any of the options above can be passed in string form).
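Why transform at all? Response times are typically right-skewed, and a log transform pulls in the long right tail. The sketch below illustrates this on made-up numbers in plain Python. It uses the natural logarithm and a simple moment-based skewness; it is not meant to reproduce what pyvttbl does internally when transform is given, and the helper name skewness is hypothetical.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = float(len(values))
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Made-up response times (ms) with a long right tail
rts = [400.0, 420.0, 450.0, 500.0, 600.0, 800.0, 1200.0]
log_rts = [math.log(x) for x in rts]

print('skewness raw: %.2f' % skewness(rts))
print('skewness log: %.2f' % skewness(log_rts))
```

The log-transformed values are still positively skewed in this example, but less so than the raw response times.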

Using pyvttbl we can also analyse mixed-design/split-plot (within-between) data. Running a split-plot analysis is easy: just add the argument bfactors with a list of your between-subjects factors. If you are interested in one-way ANOVA for independent measures, see my newer post: Four ways to conduct one-way ANOVAs with Python.

Finally, I created a function that extracts the F-statistic, mean square error, generalized eta squared, and the p-value from the results obtained with the anova method. It takes a factor as a string, an ANOVA object, and the values you want to extract. Keys for your different factors can be found using the keys method (e.g., aov.keys()).

def extract_for_apa(factor, aov, values=('F', 'mse', 'eta', 'p')):
    """Return the requested statistics for one factor of an anova result."""
    results = {}
    # Results are keyed by tuples of factor names, e.g. ('condition',)
    for key, result in aov[(factor,)].items():  # .items() works on Python 2 and 3
        if key in values:
            results[key] = result

    return results
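To show how the function is meant to be called without needing pyvttbl installed, the sketch below runs it against a stand-in for the anova result: a plain dict keyed by a tuple of factor names, as the aov[(factor,)] lookup implies. The stand-in and its extra 'df' entry are assumptions for illustration; the numbers are taken from the results table in this post. The function is repeated here with .items() so the block runs on its own under Python 3.

```python
def extract_for_apa(factor, aov, values=('F', 'mse', 'eta', 'p')):
    """Return the requested statistics for one factor of an anova result."""
    results = {}
    for key, result in aov[(factor,)].items():
        if key in values:
            results[key] = result
    return results


# Stand-in for the object returned by df.anova(...): NOT a real pyvttbl
# Anova instance, just a dict with the same lookup structure.
fake_aov = {
    ('condition',): {'F': 309.093, 'mse': 13618.991,
                     'eta': 4.165, 'p': 0.000, 'df': 1.0},
}

# Extract everything in the default list
all_stats = extract_for_apa('condition', fake_aov)
print(sorted(all_stats))

# Extract only the F-statistic and the p-value
f_and_p = extract_for_apa('condition', fake_aov, values=('F', 'p'))
print(f_and_p)
```

Entries that are not listed in values (here 'df') are simply left out of the returned dict.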

Note that the results table in this post was created with the private method _within_html. To create an HTML table you will have to import SimpleHTML.

That was all. There is at least one downside to using pyvttbl for within-subjects ANOVA in Python: Pyvttbl is not compatible with the commonly used Pandas DataFrame. However, this may not be a problem since pyvttbl, as we have seen, has its own DataFrame class. Pyvttbl also offers some ways to aggregate and visualize data. Another downside is that Pyvttbl no longer seems to be maintained. You can find the Pyvttbl documentation here.

References

Translated from: https://www.pybloggers.com/2016/02/repeated-measures-anova-using-python/