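Multiple Testing, Post-Hoc Analysis, and Repeated-Measures ANOVA
For background on the purpose, principles, and methods of multiple testing, first see this short article: Computing the FDR in Multiple Testing.
Multiple testing matters because, in real experiments, a single phenomenon is often tested hundreds or thousands of times. Even with a p-value threshold of 0.01, Type I errors accumulate as the number of tests grows, so the testing procedure and the p-values need to be corrected.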
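To make the accumulation of Type I errors concrete, here is a minimal sketch in base R. It assumes the tests are independent, in which case the chance of at least one false positive among m tests at level alpha is 1 - (1 - alpha)^m. The raw p-values are made up for illustration and do not come from any dataset in this article.

```r
# Family-wise error rate for m independent tests at level alpha
alpha <- 0.01
m <- c(1, 10, 100, 1000)
fwer <- 1 - (1 - alpha)^m
print(round(fwer, 3))

# Hypothetical raw p-values, for illustration only
p_raw <- c(0.001, 0.008, 0.012, 0.030, 0.200)

# Bonferroni controls the family-wise error rate;
# Benjamini-Hochberg (BH) controls the false discovery rate
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bh <- p.adjust(p_raw, method = "BH")
print(data.frame(p_raw, p_bonf, p_bh))
```

The BH-adjusted values are never larger than the Bonferroni-adjusted ones, which reflects that FDR control is the less conservative of the two corrections.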
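The post-hoc analysis for ANOVA is illustrated with a borrowed example:
It uses the lendingclub dataset. LendingClub is a US peer-to-peer lending company headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the U.S. Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market.
Lending Club lets borrowers take out unsecured personal loans of between $1,000 and $40,000. The standard loan term is three years. Investors can search and browse the loan listings on the Lending Club website and choose the loans they want to invest in based on the information provided about the borrower, the loan amount, the loan grade, and the loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and investors a service fee.
The official website contains all data from 2007 to the present; interested readers can download it to study on their own:
LendingClub Statistics
LendingClub official website
Because the full dataset is very large, we analyze only a subset here.
Download the Lendingclub data subset from Datacamp
lendingclub <- read.csv(url("https://assets.datacamp.com/production/repositories/1793/datasets/e14dbe91a0840393e86e4fb9a7ec1b958842ae39/lendclub.csv"))
Use glimpse() from the dplyr package to get an overview of the data
As the name suggests, glimpse() gives a quick look at the data.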
library(dplyr)
glimpse(lendingclub)
Observations: 1,500
Variables: 12
$ member_id           55096114, 1555332, 1009151, 69524202, 72128084, 53906707, 610285, 48822267, 62159245, 1688398, 37887752, 69880...
$ loan_amnt           11000, 10000, 13000, 5000, 18000, 14000, 8000, 5000, 7500, 6900, 12000, 8000, 18250, 9325, 28000, 16000, 15000...
$ funded_amnt         11000, 10000, 13000, 5000, 18000, 14000, 8000, 5000, 7500, 6900, 12000, 8000, 18250, 9325, 28000, 16000, 15000...
$ term                36 months, 36 months, 60 months, 36 months, 36 months, 60 months, 36 months, 36 months, 36 months, 36 months, ...
$ int_rate            12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.89, 16.55, 10.16, 12.39, 10.16, 19.89, 15.61, 6.24, 14.99, 16...
$ emp_length          10+ years, 10+ years, 3 years, 10+ years, 10+ years, 3 years, 10+ years, 10+ years, < 1 year, n/a, 10+ years, ...
$ home_ownership      RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, RENT, MORTGAGE, MORTGAGE, RENT, OWN, OWN, MO...
$ annual_inc          51000, 40000, 78204, 51000, 96000, 47000, 40000, 33000, 50000, 70000, 65000, 50000, 36500, 70000, 176000, 6950...
$ verification_status Not Verified, Verified, Not Verified, Not Verified, Not Verified, Not Verified, Not Verified, Source Verified,...
$ loan_status         Current, Fully Paid, Fully Paid, Current, Current, Current, Fully Paid, Current, Current, Fully Paid, Current,...
$ purpose             debt_consolidation, debt_consolidation, home_improvement, home_improvement, credit_card, home_improvement, deb...
$ grade               C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D, B, C, D, D, A, B, D, B, B, C, A, B, B, B, A, C, A, B, C, B,...
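The dataset contains 12 variables and 1,500 rows of borrower data.
Use summarise() to compute the median loan amount (loan_amnt) and the means of the interest rate (int_rate) and annual income (annual_inc).
If the meaning of any variable is unclear, download the Excel file that describes the variables from the DATA DICTIONARY section at the bottom of the official website.
This step uses the pipe operator %>%, which passes the result of the previous step to the next function as its first argument. The operator comes from the magrittr package and is re-exported by dplyr, which is already loaded above.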
lendingclub %>% summarise(median(loan_amnt), mean(int_rate), mean(annual_inc))
median(loan_amnt) mean(int_rate) mean(annual_inc)
1 13000 13.31472 75736.03
The median loan amount is $13,000, the mean interest rate is 13.31%, and the mean annual income of borrowers is $75,736.03.
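The pipe is only a different way of writing a function call. The sketch below shows a piped summarise() next to the equivalent call without the pipe, and names the summary columns so the output is easier to read. The data frame loans is a small made-up example, not the real lendingclub values.

```r
library(dplyr)

# Hypothetical toy data, not the real lendingclub values
loans <- data.frame(
  loan_amnt  = c(5000, 12000, 20000, 8000),
  int_rate   = c(7.5, 12.0, 18.5, 10.0),
  annual_inc = c(40000, 65000, 90000, 52000)
)

# Piped form: loans is passed as the first argument of summarise()
piped <- loans %>%
  summarise(
    median_loan = median(loan_amnt),
    mean_rate   = mean(int_rate),
    mean_income = mean(annual_inc)
  )

# The same call written without the pipe
direct <- summarise(
  loans,
  median_loan = median(loan_amnt),
  mean_rate   = mean(int_rate),
  mean_income = mean(annual_inc)
)

print(piped)
```

Both forms return the same one-row data frame; the pipe mainly helps readability once several steps are chained.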
Use ggplot to draw a bar chart of the loan purpose (purpose)
The long x-axis labels would crowd together, so flip the axes with coord_flip()
library(ggplot2)
ggplot(data = lendingclub, aes(x = purpose)) + geom_bar() + coord_flip()
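The chart shows that debt consolidation is the most common loan purpose, followed by credit card repayment.
Note: debt consolidation means taking out a single loan to pay off other loans and credit card balances.
Loan purposes in the data:
wedding
vacation
small business
renewable energy
other
moving
medical
major purchase
house (home purchase)
home improvement
debt consolidation
credit card
car
Because the categories are numerous and scattered, we can merge them. For example, credit card, debt consolidation, and medical all relate to debt, while car, major purchase, and vacation all relate to big purchases.
Use recode() to regroup the loan purposes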
lendingclub$purpose_recode <- lendingclub$purpose %>% recode(
"credit_card" = "debt_related", "debt_consolidation" = "debt_related", "medical" = "debt_related",
"car" = "big_purchase", "major_purchase" = "big_purchase", "vacation" = "big_purchase",
"moving" = "life_change", "small_business" = "life_change", "wedding" = "life_change",
"house" = "home_related", "home_improvement" = "home_related")
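A quick way to check a recoding like the one above is to apply it to a small vector and tabulate the result. The sketch below uses a made-up character vector rather than the real purpose column. Values that are not named in the mapping (here "other") are left unchanged, which is why other and renewable_energy survive as their own groups in the recoded variable.

```r
library(dplyr)

# Hypothetical purposes, for illustration only
purpose <- c("credit_card", "car", "medical", "other", "vacation", "credit_card")

purpose_recode <- recode(
  purpose,
  "credit_card" = "debt_related",
  "medical"     = "debt_related",
  "car"         = "big_purchase",
  "vacation"    = "big_purchase"
)

# Count how many values fall into each merged group
counts <- table(purpose_recode)
print(counts)
```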
Look at the bar chart again
ggplot(data=lendingclub, aes(x = purpose_recode)) + geom_bar()
Fit a regression model with loan purpose (purpose_recode) as the independent variable and funded amount (funded_amnt) as the dependent variable.
purpose_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)
summary(purpose_model)
Call:
lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
Residuals:
Min 1Q Median 3Q Max
-14472 -6251 -1322 4678 25761
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9888.1 1248.9 7.917 4.69e-15 ***
purpose_recodedebt_related 5433.5 1270.5 4.277 2.02e-05 ***
purpose_recodehome_related 4845.0 1501.0 3.228 0.00127 **
purpose_recodelife_change 4095.3 2197.2 1.864 0.06254 .
purpose_recodeother -649.3 1598.3 -0.406 0.68461
purpose_recoderenewable_energy -1796.4 4943.3 -0.363 0.71636
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8284 on 1494 degrees of freedom
Multiple R-squared: 0.03473, Adjusted R-squared: 0.0315
F-statistic: 10.75 on 5 and 1494 DF, p-value: 3.598e-10
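In the coefficient table above, big_purchase has no row of its own. With R's default treatment contrasts, the first factor level acts as the reference: the intercept is that group's mean, and every other coefficient is a difference from it. The sketch below checks this on the built-in PlantGrowth data, which is used here only as a stand-in because it needs no download.

```r
# One-way model on a built-in dataset (stand-in for lendingclub)
fit <- lm(weight ~ group, data = PlantGrowth)
coefs <- coef(fit)

# Plain group means for comparison
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Intercept equals the mean of the reference level (ctrl);
# the other coefficients equal each group's mean minus the ctrl mean
print(coefs)
print(group_means - group_means["ctrl"])
```

Read this way, the intercept of 9888.1 in the lendingclub model is the mean funded amount for big_purchase, and 5433.5 is how much higher the debt_related mean is.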
Run an analysis of variance on the regression model
Use R's built-in anova() function directly
anova(purpose_model)
Analysis of Variance Table
Response: funded_amnt
Df Sum Sq Mean Sq F value Pr(>F)
purpose_recode 5 3.6888e+09 737756668 10.75 3.598e-10 ***
Residuals 1494 1.0253e+11 68629950
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
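The ANOVA shows that the mean funded amount differs across loan purposes, but it does not tell us which purposes differ from one another, so a post-hoc test is needed.
Use Tukey's HSD test
Note: HSD stands for Honest Significant Difference
Usage of TukeyHSD():
TukeyHSD(aov_model, "group_var", conf.level = 0.9)
aov_model: a fitted ANOVA model from aov(), whose formula joins the dependent and independent variables with "~";
"group_var": the name of the grouping factor whose levels are compared, i.e., the independent variable (this is the which argument);
conf.level: the confidence level of the family-wise intervals (default 0.95)
Above we ran the ANOVA by applying anova() to the regression model; we can also use aov(). In the aov() formula, the dependent variable goes to the left of ~ and the independent variable to the right. If that is hard to remember, think of an equation of the form y = ax + b: y on the left is the dependent variable and x on the right is the independent variable.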
purpose_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)
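For a one-way design, anova() on an lm() fit and summary() on an aov() fit report the same F test, so either route can feed the post-hoc step. The sketch below checks this on the built-in PlantGrowth data, used here as a stand-in for lendingclub.

```r
# Same formula fitted two ways
lm_fit <- lm(weight ~ group, data = PlantGrowth)
aov_fit <- aov(weight ~ group, data = PlantGrowth)

# F statistic and p-value from anova() on the lm fit
lm_table <- anova(lm_fit)
f_lm <- lm_table[["F value"]][1]
p_lm <- lm_table[["Pr(>F)"]][1]

# F statistic and p-value from summary() on the aov fit
aov_table <- summary(aov_fit)[[1]]
f_aov <- aov_table[["F value"]][1]
p_aov <- aov_table[["Pr(>F)"]][1]

print(c(f_lm = f_lm, f_aov = f_aov))
```

The practical difference is that TukeyHSD() expects an aov object, which is why the model is refitted with aov() before the post-hoc test.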
Run the post-hoc test with the confidence level set to 95%
TukeyHSD(purpose_aov, "purpose_recode", conf.level = 0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = funded_amnt ~ purpose_recode, data = lendingclub)
$`purpose_recode`
diff lwr upr p adj
debt_related-big_purchase 5433.5497 1808.4015 9058.698 0.0002905
home_related-big_purchase 4845.0126 562.0345 9127.991 0.0160698
life_change-big_purchase 4095.2652 -2174.3728 10364.903 0.4250779
other-big_purchase -649.3001 -5209.7754 3911.175 0.9985823
renewable_energy-big_purchase -1796.4015 -15901.7471 12308.944 0.9991732
home_related-debt_related -588.5371 -3055.5905 1878.516 0.9840871
life_change-debt_related -1338.2845 -6539.3240 3862.755 0.9776960
other-debt_related -6082.8498 -9005.2437 -3160.456 0.0000001
renewable_energy-debt_related -7229.9512 -20893.8901 6433.988 0.6580011
life_change-home_related -749.7475 -6428.9211 4929.426 0.9990158
other-home_related -5494.3127 -9201.4124 -1787.213 0.0003576
renewable_energy-home_related -6641.4141 -20494.4076 7211.579 0.7462798
other-life_change -4744.5652 -10635.8339 1146.703 0.1953886
renewable_energy-life_change -5891.6667 -20481.7279 8698.395 0.8592034
renewable_energy-other -1147.1014 -15088.3877 12794.185 0.9999029
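The results show (adjusted p-values):
debt related and big purchase differ significantly, p < 0.001;
home related and big purchase differ significantly, p < 0.05;
other and debt related differ significantly, p < 0.001;
other and home related differ significantly, p < 0.001.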
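Tukey's HSD is one way to keep the family-wise error rate under control across all pairwise comparisons. As an alternative sketch, and not the method used in the example above, base R's pairwise.t.test() runs pairwise t tests and adjusts the p-values with any method supported by p.adjust(). The built-in PlantGrowth data stands in for lendingclub here.

```r
# Pairwise t tests with two different p-value corrections
holm <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "holm")
bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "bonferroni")

# Each result holds a lower-triangular matrix of adjusted p-values
print(holm$p.value)
print(bonf$p.value)
```

Holm-adjusted p-values are never larger than Bonferroni-adjusted ones, so Holm is usually preferred when a simple correction is wanted.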
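Plot the group means with their confidence intervals
plotmeans() from the gplots package draws group means with confidence intervals.
library(gplots)
plotmeans(funded_amnt ~ purpose_recode, xlab = "Loan purpose", ylab = "Funded amount", data = lendingclub, main = "Mean Plot with 95% CI")
Figure: group means plot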
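Plot the post-hoc test results
Simply call plot()
tukey_output <- TukeyHSD(purpose_aov, "purpose_recode", conf.level = 0.95)
par(las = 2) # rotate the axis labels
par(mar = c(10,13,9,7)) # enlarge the margins, especially the left one, so the labels fit
plot(tukey_output)
Figure: post-hoc comparison plot
A confidence interval that contains 0 indicates that the difference between the two loan purposes is not significant.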
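The link between the intervals and the adjusted p-values can be checked directly: at a 95% family-wise confidence level, an interval excludes 0 exactly when the adjusted p-value is below 0.05. The sketch below verifies this on the built-in PlantGrowth data, used here as a stand-in for lendingclub.

```r
fit <- aov(weight ~ group, data = PlantGrowth)

# Matrix with columns diff, lwr, upr, and "p adj", one row per pair
tk <- TukeyHSD(fit, "group", conf.level = 0.95)$group

# Does the interval lie entirely above or entirely below zero?
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0

# Is the adjusted p-value below 1 - conf.level?
significant <- tk[, "p adj"] < 0.05

# Logical columns are shown as 0/1 in the combined matrix
print(cbind(tk, excludes_zero, significant))
```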
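Now for a repeated-measures example:
Repeated-measures ANOVA
In repeated-measures ANOVA, subjects are measured more than once. This section focuses on repeated-measures ANOVA with one within-groups factor and one between-groups factor (a common design). The example comes from physiological ecology, which studies how the physiological and biochemical processes of living systems respond to variation in environmental factors (an important area of research for responding to global warming). The CO2 dataset in the base installation contains the results of a study of cold tolerance in northern and southern plants of the grass species Echinochloa crus-galli (Potvin, Lechowicz, & Tardif, 1990). The photosynthetic rates of chilled plants were compared with those of nonchilled plants at several ambient CO2 concentrations. Half of the plants came from Quebec, Canada, and the other half from Mississippi, USA.
First, we focus on the chilled plants. The dependent variable is carbon dioxide uptake (uptake), in umol/m^2 sec, and the independent variables are plant type Type (Quebec vs. Mississippi) and ambient carbon dioxide concentration (conc) with seven levels (95 to 1000 mL/L). Type is the between-groups factor and conc is the within-groups factor. Type is already stored as a factor, but conc must first be converted to a factor. The analysis is shown in Listing 9-7.
Listing 9-7 Repeated-measures ANOVA with one between-groups factor and one within-groups factor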
> CO2$conc <- factor(CO2$conc)
> w1b1 <- subset(CO2, Treatment=='chilled')
> fit <- aov(uptake ~ conc*Type + Error(Plant/(conc)), w1b1)
> summary(fit)
Error: Plant
Df Sum Sq Mean Sq F value Pr(>F)
Type 1 2667 2667 60.4 0.0015 **
Residuals 4 177 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Error: Plant:conc
Df Sum Sq Mean Sq F value Pr(>F)
conc 6 1472 245.4 52.5 1.3e-12 ***
conc:Type 6 429 71.5 15.3 3.7e-07 ***
Residuals 24 112 4.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The CO2 data looks like this:
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
7 Qn1 Quebec nonchilled 1000 39.7
8 Qn2 Quebec nonchilled 95 13.6
9 Qn2 Quebec nonchilled 175 27.3
10 Qn2 Quebec nonchilled 250 37.1
11 Qn2 Quebec nonchilled 350 41.8
12 Qn2 Quebec nonchilled 500 40.6
13 Qn2 Quebec nonchilled 675 41.4
14 Qn2 Quebec nonchilled 1000 44.3
15 Qn3 Quebec nonchilled 95 16.2
16 Qn3 Quebec nonchilled 175 32.4
17 Qn3 Quebec nonchilled 250 40.3
18 Qn3 Quebec nonchilled 350 42.1
19 Qn3 Quebec nonchilled 500 42.9
20 Qn3 Quebec nonchilled 675 43.9
21 Qn3 Quebec nonchilled 1000 45.5
22 Qc1 Quebec chilled 95 14.2
23 Qc1 Quebec chilled 175 24.1
24 Qc1 Quebec chilled 250 30.3
25 Qc1 Quebec chilled 350 34.6
26 Qc1 Quebec chilled 500 32.5
27 Qc1 Quebec chilled 675 35.4
28 Qc1 Quebec chilled 1000 38.7
29 Qc2 Quebec chilled 95 9.3
30 Qc2 Quebec chilled 175 27.3
31 Qc2 Quebec chilled 250 35.0
32 Qc2 Quebec chilled 350 38.8
33 Qc2 Quebec chilled 500 38.6
34 Qc2 Quebec chilled 675 37.5
35 Qc2 Quebec chilled 1000 42.4
36 Qc3 Quebec chilled 95 15.1
37 Qc3 Quebec chilled 175 21.0
38 Qc3 Quebec chilled 250 38.1
39 Qc3 Quebec chilled 350 34.0
40 Qc3 Quebec chilled 500 38.9
41 Qc3 Quebec chilled 675 39.6
42 Qc3 Quebec chilled 1000 41.4
43 Mn1 Mississippi nonchilled 95 10.6
44 Mn1 Mississippi nonchilled 175 19.2
45 Mn1 Mississippi nonchilled 250 26.2
46 Mn1 Mississippi nonchilled 350 30.0
47 Mn1 Mississippi nonchilled 500 30.9
48 Mn1 Mississippi nonchilled 675 32.4
49 Mn1 Mississippi nonchilled 1000 35.5
50 Mn2 Mississippi nonchilled 95 12.0
51 Mn2 Mississippi nonchilled 175 22.0
52 Mn2 Mississippi nonchilled 250 30.6
53 Mn2 Mississippi nonchilled 350 31.8
54 Mn2 Mississippi nonchilled 500 32.4
55 Mn2 Mississippi nonchilled 675 31.1
56 Mn2 Mississippi nonchilled 1000 31.5
57 Mn3 Mississippi nonchilled 95 11.3
58 Mn3 Mississippi nonchilled 175 19.4
59 Mn3 Mississippi nonchilled 250 25.8
60 Mn3 Mississippi nonchilled 350 27.9
61 Mn3 Mississippi nonchilled 500 28.5
62 Mn3 Mississippi nonchilled 675 28.1
63 Mn3 Mississippi nonchilled 1000 27.8
64 Mc1 Mississippi chilled 95 10.5
65 Mc1 Mississippi chilled 175 14.9
66 Mc1 Mississippi chilled 250 18.1
67 Mc1 Mississippi chilled 350 18.9
68 Mc1 Mississippi chilled 500 19.5
69 Mc1 Mississippi chilled 675 22.2
70 Mc1 Mississippi chilled 1000 21.9
71 Mc2 Mississippi chilled 95 7.7
72 Mc2 Mississippi chilled 175 11.4
73 Mc2 Mississippi chilled 250 12.3
74 Mc2 Mississippi chilled 350 13.0
75 Mc2 Mississippi chilled 500 12.5
76 Mc2 Mississippi chilled 675 13.7
77 Mc2 Mississippi chilled 1000 14.4
78 Mc3 Mississippi chilled 95 10.6
79 Mc3 Mississippi chilled 175 18.0
80 Mc3 Mississippi chilled 250 17.9
81 Mc3 Mississippi chilled 350 17.9
82 Mc3 Mississippi chilled 500 17.9
83 Mc3 Mississippi chilled 675 18.9
84 Mc3 Mississippi chilled 1000 19.9
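In this example, the repeated-measures two-way ANOVA is fitted with aov(uptake ~ conc*Type + Error(Plant/conc)), where uptake is the CO2 uptake, conc is the concentration, Type is the region the plant comes from (Quebec or Mississippi), and Plant is the plant identifier. The same plant appears more than once in Plant, which is what makes this a repeated-measures design.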
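Before fitting a model with an Error() term, it helps to confirm that the data really have the repeated-measures layout described above. The sketch below, using only base R and the built-in CO2 data, counts the measurements per plant in the chilled subset and computes the mean uptake for each Type and conc combination. The names per_plant and cell_means are introduced here for illustration.

```r
# Chilled plants only, as in the analysis above
w1b1 <- as.data.frame(subset(CO2, Treatment == "chilled"))
w1b1$conc <- factor(w1b1$conc)
w1b1$Plant <- droplevels(w1b1$Plant)

# Each plant should appear once per concentration level
per_plant <- table(w1b1$Plant)
print(per_plant)

# Mean uptake for every Type x conc cell
cell_means <- aggregate(uptake ~ Type + conc, data = w1b1, FUN = mean)
print(cell_means)

# Overall mean uptake by origin
type_means <- tapply(w1b1$uptake, w1b1$Type, mean)
print(type_means)
```

Six plants with seven measurements each give the 42 rows used in the fit, and the higher mean uptake of the Quebec plants is in line with the significant Type effect in the ANOVA table.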