Let's look at an example to get an intuitive feel for dummy variables and the contrast matrix.
> library(datasets)
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
.. ..- attr(*, ".Environment")=
- attr(*, "outer")=Class 'formula' length 2 ~Diet
.. ..- attr(*, ".Environment")=
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"
> summary(lm(weight~Diet, data=ChickWeight))
Call:
lm(formula = weight ~ Diet, data = ChickWeight)
Residuals:
Min 1Q Median 3Q Max
-103.95 -53.65 -13.64 40.38 230.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.645 4.674 21.961 < 2e-16 ***
Diet2 19.971 7.867 2.538 0.0114 *
Diet3 40.305 7.867 5.123 4.11e-07 ***
Diet4 32.617 7.910 4.123 4.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.33 on 574 degrees of freedom
Multiple R-squared: 0.05348, Adjusted R-squared: 0.04853
F-statistic: 10.81 on 3 and 574 DF, p-value: 6.433e-07
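The `contrasts` function shows the contrast matrix of a factor. In general, for a factor with K levels, R creates K-1 dummy variables; the remaining level needs no variable of its own, because it is implied when all K-1 dummy variables are 0. As shown below, Diet has 4 levels (1, 2, 3, 4), so R creates three variables: Diet2, Diet3, and Diet4. R assigns the dummy variable values according to this contrast matrix. When Diet=1, (Diet2, Diet3, Diet4) = (0, 0, 0); when Diet=2, (Diet2, Diet3, Diet4) = (1, 0, 0); when Diet=3, (Diet2, Diet3, Diet4) = (0, 1, 0); and when Diet=4, (Diet2, Diet3, Diet4) = (0, 0, 1).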
> contrasts(ChickWeight$Diet)
2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
| Diet | Diet2 | Diet3 | Diet4 |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
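
To see these dummy variables as they enter the regression, one option is to inspect the design matrix with `model.matrix`. The following is a minimal sketch, assuming R's default treatment coding for unordered factors (`contr.treatment`); the names `X`, `idx`, and `dummy_rows` are only illustrative.

```r
library(datasets)

# Design matrix for a model with Diet as the only predictor:
# an intercept column plus the K-1 = 3 dummy columns.
X <- model.matrix(~ Diet, data = ChickWeight)

# Index of the first observation for each Diet level.
idx <- match(levels(ChickWeight$Diet), ChickWeight$Diet)

# One row of dummy values per Diet level, labelled by level.
dummy_rows <- X[idx, c("Diet2", "Diet3", "Diet4")]
rownames(dummy_rows) <- levels(ChickWeight$Diet)
print(dummy_rows)
```

Each row of `dummy_rows` should match the corresponding row of `contrasts(ChickWeight$Diet)`.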
> tapply(ChickWeight$weight, ChickWeight$Diet, mean)
1 2 3 4
102.6455 122.6167 142.9500 135.2627
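
The group means connect directly to the regression output above: under treatment coding, the intercept is the mean weight of the baseline group (Diet 1), and each of the Diet2, Diet3, and Diet4 coefficients is the difference between that group's mean and the baseline mean (for example, 122.6167 - 102.6455 ≈ 19.971). A small sketch to check this, assuming the default contrasts are in effect; `fit`, `group_means`, and `reconstructed` are illustrative names.

```r
library(datasets)

fit <- lm(weight ~ Diet, data = ChickWeight)
group_means <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)

# Intercept = baseline mean; other coefficients = mean difference from baseline.
reconstructed <- c(group_means[1], group_means[-1] - group_means[1])

# Compare the fitted coefficients with the values rebuilt from the group means.
print(cbind(coef = coef(fit), from_means = reconstructed))
```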