1. Base installation: the summary() function provides descriptive statistics
#7-1 The summary() function
> vars <- c("mpg", "hp", "wt")
> head(mtcars[vars])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
> summary(mtcars[vars])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
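2. Define a custom function and use sapply() to apply it to each column
#7-2 Computing statistics with sapply(), using a user-written function that adds skewness and kurtosis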
> mystats <- function(x, na.omit=FALSE){
+ if (na.omit)
+ x <- x[!is.na(x)]
+ m <- mean(x)
+ n <- length(x)
+ s <- sd(x)
+ skew <- sum((x-m)^3/s^3)/n
+ kurt <- sum((x-m)^4/s^4)/n - 3
+ return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
+ }
> sapply(mtcars[vars], mystats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
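3. Extensions
The describe() functions in the Hmisc and psych packages and the stat.desc() function in the pastecs package compute similar descriptive statistics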
1. Grouping with aggregate()
#7-6 Grouping with the aggregate() function
> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
am mpg hp wt
1 0 3.833966 53.90820 0.7774001
2 1 6.166504 84.06232 0.6169816
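Note:
aggregate() is easiest to use with functions that return a single value, such as mean or sd. To get several statistics per group, the by() function is a convenient alternative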
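The note on by() can be illustrated with a minimal sketch. The helper name dstats is an assumption made for this example (it is not defined in these notes), and the data are the same mtcars columns used above.

```r
# Hypothetical helper: returns several statistics for one numeric vector
dstats <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

vars <- c("mpg", "hp", "wt")

# by() splits the data frame by am; sapply() then applies dstats
# to every column of each piece, giving one small matrix per group
res <- by(mtcars[vars], mtcars$am, function(d) sapply(d, dstats))
print(res)

# aggregate() also accepts a multi-value function, but each result
# column is then a matrix, which is harder to read and to index
agg <- aggregate(mtcars[vars], by = list(am = mtcars$am), FUN = dstats)
str(agg)
```

The group means and standard deviations in res should agree with the two aggregate() calls shown above.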
Data used in the examples:
> library(vcd)
> head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
1. One-way tables
The table() function produces simple frequency counts
#One-way table
> options(digits = 3)
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
None Some Marked
42 14 28
#Convert to proportions
> prop.table(mytable)
Improved
None Some Marked
0.500 0.167 0.333
#Convert to percentages
> prop.table(mytable)*100
Improved
None Some Marked
50.0 16.7 33.3
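2. Two-way tables
Method 1: the table() function
mytable <- table(A, B)
where A is the row variable and B is the column variable
Method 2: the xtabs() function
mytable <- xtabs(~ A + B, data=mydata)
where mydata is a matrix or data frame.
The variables to be cross-classified (here A and B) go on the right-hand side of the ~
If the data are already aggregated, a vector of counts goes on the left-hand side; otherwise the left-hand side is left empty and each row counts once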
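To make the role of the left-hand side of the formula concrete, here is a small sketch that rebuilds a two-way table from aggregated counts. The data frame counts is constructed only for this illustration; its values are the Treatment by Improved counts of the Arthritis data as reported in these notes.

```r
# Aggregated counts (one row per cell), typed in by hand for illustration
counts <- data.frame(
  Treatment = rep(c("Placebo", "Treated"), times = 3),
  Improved  = factor(rep(c("None", "Some", "Marked"), each = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 13, 7, 7, 7, 21)
)

# Counts on the left of ~, classifying variables on the right
tab <- xtabs(Freq ~ Treatment + Improved, data = counts)
print(tab)
```

With raw, one-row-per-case data the left-hand side stays empty, as in xtabs(~ A + B, data=mydata).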
#Two-way table
> mytable <- xtabs(~ Treatment+Improved, data = Arthritis)
> #Treatment is the row variable and Improved is the column variable; both are cross-classified
> mytable
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
#Marginal frequencies and proportions; the index says which variable of the table is meant
#Row sums and row proportions
> margin.table(mytable, 1)
Treatment
Placebo Treated
43 41
> prop.table(mytable, 1)
Improved
Treatment None Some Marked
Placebo 0.674 0.163 0.163
Treated 0.317 0.171 0.512
#Column sums and column proportions
> margin.table(mytable, 2)
Improved
None Some Marked
42 14 28
> prop.table(mytable, 2)
Improved
Treatment None Some Marked
Placebo 0.69 0.50 0.25
Treated 0.31 0.50 0.75
#Cell proportions
> prop.table(mytable)
Improved
Treatment None Some Marked
Placebo 0.3452 0.0833 0.0833
Treated 0.1548 0.0833 0.2500
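As a rough check on what prop.table() does, the same proportions can be computed by hand. This is only a sketch: the matrix tab is typed in from the counts shown above rather than built from the Arthritis data, so that the block runs without the vcd package.

```r
# Treatment x Improved counts, filled column by column
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))

row_prop  <- sweep(tab, 1, rowSums(tab), "/")  # divide each row by its row sum
col_prop  <- sweep(tab, 2, colSums(tab), "/")  # divide each column by its column sum
cell_prop <- tab / sum(tab)                    # divide every cell by the grand total

print(round(row_prop, 3))
```

Each of the three results should match prop.table(tab, 1), prop.table(tab, 2), and prop.table(tab) respectively.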
Use the addmargins() function to add marginal sums to a table
#Add marginal sums
> addmargins(mytable) #by default, sums are added for every variable in the table
Improved
Treatment None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
> addmargins(prop.table(mytable))
Improved
Treatment None Some Marked Sum
Placebo 0.3452 0.0833 0.0833 0.5119
Treated 0.1548 0.0833 0.2500 0.4881
Sum 0.5000 0.1667 0.3333 1.0000
#Add a Sum column only (each row of row proportions sums to 1)
> addmargins(prop.table(mytable, 1), 2)
Improved
Treatment None Some Marked Sum
Placebo 0.674 0.163 0.163 1.000
Treated 0.317 0.171 0.512 1.000
#Add a Sum row only (each column of column proportions sums to 1)
> addmargins(prop.table(mytable, 2), 1)
Improved
Treatment None Some Marked
Placebo 0.69 0.50 0.25
Treated 0.31 0.50 0.75
Sum 1.00 1.00 1.00
Method 3: the CrossTable() function in the gmodels package
#7-11 Two-way table using CrossTable
> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 84
| Arthritis$Improved
Arthritis$Treatment | None | Some | Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
Placebo | 29 | 7 | 7 | 43 |
| 2.616 | 0.004 | 3.752 | |
| 0.674 | 0.163 | 0.163 | 0.512 |
| 0.690 | 0.500 | 0.250 | |
| 0.345 | 0.083 | 0.083 | |
--------------------|-----------|-----------|-----------|-----------|
Treated | 13 | 7 | 21 | 41 |
| 2.744 | 0.004 | 3.935 | |
| 0.317 | 0.171 | 0.512 | 0.488 |
| 0.310 | 0.500 | 0.750 | |
| 0.155 | 0.083 | 0.250 | |
--------------------|-----------|-----------|-----------|-----------|
Column Total | 42 | 14 | 28 | 84 |
| 0.500 | 0.167 | 0.333 | |
--------------------|-----------|-----------|-----------|-----------|
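3. Multidimensional tables
The table(), xtabs(), margin.table(), prop.table(), and addmargins() functions used above all extend to more than two dimensions.
The ftable() function can also be used; it prints multiway tables in a compact form
#7-12 Three-way contingency table
#Build the three-way table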
> mytable <- xtabs(~ Treatment+Sex+Improved, data = Arthritis)
> mytable
, , Improved = None
Sex
Treatment Female Male
Placebo 19 10
Treated 6 7
, , Improved = Some
Sex
Treatment Female Male
Placebo 7 0
Treated 5 2
, , Improved = Marked
Sex
Treatment Female Male
Placebo 6 1
Treated 16 5
> ftable(mytable)
Improved None Some Marked
Treatment Sex
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
Male 7 2 5
#Marginal frequencies
> margin.table(mytable, 1)
Treatment
Placebo Treated
43 41
> margin.table(mytable, 2)
Sex
Female Male
59 25
> margin.table(mytable, 3)
Improved
None Some Marked
42 14 28
#Marginal frequencies for pairs of variables (summing over the third)
> margin.table(mytable, c(1,3))
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
> margin.table(mytable, c(2,3))
Improved
Sex None Some Marked
Female 25 12 22
Male 17 2 6
> margin.table(mytable, c(1,2))
Sex
Treatment Female Male
Placebo 32 11
Treated 27 14
#Proportions of Improved within each Treatment x Sex combination
> ftable(prop.table(mytable, c(1, 2)))
Improved None Some Marked
Treatment Sex
Placebo Female 0.5938 0.2188 0.1875
Male 0.9091 0.0000 0.0909
Treated Female 0.2222 0.1852 0.5926
Male 0.5000 0.1429 0.3571
#Add a Sum margin over the third variable (Improved)
> ftable(addmargins(prop.table(mytable, c(1,2)), 3))
Improved None Some Marked Sum
Treatment Sex
Placebo Female 0.5938 0.2188 0.1875 1.0000
Male 0.9091 0.0000 0.0909 1.0000
Treated Female 0.2222 0.1852 0.5926 1.0000
Male 0.5000 0.1429 0.3571 1.0000
#Percentages instead of proportions
> ftable(addmargins(prop.table(mytable, c(1,2)), 3)) * 100
Improved None Some Marked Sum
Treatment Sex
Placebo Female 59.38 21.88 18.75 100.00
Male 90.91 0.00 9.09 100.00
Treated Female 22.22 18.52 59.26 100.00
Male 50.00 14.29 35.71 100.00
1. Chi-square test of independence
Use the chisq.test() function
#7-13 Chi-square test of independence
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 13.1, df = 2, p-value = 0.0015
#p < 0.05: reject independence; treatment and improvement appear to be related
> mytable <- xtabs(~Improved+Sex, data=Arthritis)
> chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 4.84, df = 2, p-value = 0.089
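#p > 0.05: independence cannot be rejected; there is no evidence that improvement and sex are related
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the row and column variables were independent in the population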
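To show where the statistic and p-value come from, the Pearson chi-square can be recomputed by hand. This is a sketch under the assumption that the Treatment by Improved counts shown earlier are typed in directly; the variable names are chosen for the example.

```r
# Observed Treatment x Improved counts, filled column by column
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))

n        <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n   # counts expected under independence
x2       <- sum((tab - expected)^2 / expected)      # Pearson chi-square statistic
df       <- (nrow(tab) - 1) * (ncol(tab) - 1)
p        <- pchisq(x2, df, lower.tail = FALSE)      # upper-tail probability

cat("X-squared =", round(x2, 3), " df =", df, " p-value =", signif(p, 3), "\n")
```

The result should agree with chisq.test(tab): a small p-value means counts this far from the expected ones would be rare if the two variables were independent.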
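2. Fisher's exact test
Use the fisher.test() function
Null hypothesis: in a contingency table with fixed marginals, the rows and columns are independent.
#Fisher's exact test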
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)
Fisher's Exact Test for Count Data
data: mytable
p-value = 0.0014
alternative hypothesis: two.sided
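3. Cochran-Mantel-Haenszel test
Use the mantelhaen.test() function
Null hypothesis: two nominal variables are conditionally independent in each stratum of a third variable
The code below tests whether treatment and improvement are independent within each level of sex
#Cochran-Mantel-Haenszel test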
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)
Cochran-Mantel-Haenszel test
data: mytable
Cochran-Mantel-Haenszel M^2 = 14.6, df = 2, p-value = 0.00066
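p < 0.05, so treatment and improvement are not independent within the levels of sex
If the variables are not independent, the strength of the association can be measured
Use the assocstats() function in the vcd package
#7-14 Measures of association for a two-way table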
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> assocstats(mytable)
X^2 df P(> X^2)
Likelihood Ratio 13.530 2 0.0011536
Pearson 13.055 2 0.0014626
Phi-Coefficient : NA
Contingency Coeff.: 0.367
Cramer's V : 0.394
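The contingency coefficient and Cramer's V reported above are simple functions of the Pearson chi-square. The sketch below recomputes them in base R from the hand-typed counts, using the standard textbook formulas; it does not show how vcd implements them.

```r
# Treatment x Improved counts, filled column by column
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))

x2 <- unname(chisq.test(tab)$statistic)   # Pearson chi-square (no correction beyond 2 x 2)
n  <- sum(tab)

contingency_coef <- sqrt(x2 / (x2 + n))
cramers_v        <- sqrt(x2 / (n * min(dim(tab) - 1)))

cat("Contingency coefficient:", round(contingency_coef, 3), "\n")
cat("Cramer's V:", round(cramers_v, 3), "\n")
```

Larger values of either measure indicate a stronger association.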
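Use the user-written function table2flat to convert a table to a flat file
#7-15 User-written function table2flat: converts a table to flat format
> table2flat <- function(mytable){
+ df <- as.data.frame(mytable)
+ rows <- dim(df)[1]
+ cols <- dim(df)[2]
+ x <- NULL
+ for (i in 1:rows){
+ for (j in seq_len(df$Freq[i])) { #repeat row i Freq[i] times; seq_len() skips zero counts
+ row <- df[i, c(1:(cols-1))] #all columns except the last one (Freq)
+ x <- rbind(x, row)
+ }
+ }
+ row.names(x) <- c(1:dim(x)[1])
+ return(x)
+ }
#7-16 Using table2flat to convert published data
> treatment <- rep(c("Placebo", "Treated"), times=3)
> improved <- rep(c("None", "Some", "Marked"), each=2)
> Freq <- c(29,13,7,7,7,21)
> mytable <- data.frame(treatment, improved, Freq) #data.frame() keeps Freq numeric; cbind() would turn it into character
> mydata <- table2flat(mytable)
> head(mydata)
Note: writing rows <- df[i, c(1:(cols-1))] instead of row <- df[i, c(1:(cols-1))] inside the loop produces
Error in rbind(x, row) :
cannot coerce type 'closure' to vector of type 'list'
because no variable named row exists, so rbind(x, row) picks up the base R function row(), which is a closure.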
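Correlation coefficients describe relationships among quantitative variables.
The sign of the coefficient indicates the direction of the relationship (positive or negative), and its magnitude indicates the strength (0 for no relationship, 1 for a perfectly predictable relationship)
1. Pearson, Spearman, and Kendall correlations
The Pearson product-moment correlation measures the degree of linear relationship between two quantitative variables
The Spearman rank-order correlation measures the degree of relationship between two rank-ordered variables
Kendall's tau is also a nonparametric measure of rank correlation
The cor() function computes all three correlation coefficients
The cov() function computes covariances
#7.3 Correlations
> #7-17 Covariances and correlations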
> states <- state.x77[,1:6]
> cov(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714 -3551.509551
Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286 3076.768980
Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776 -3.235469
Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480 6.312685
Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465 -14.549616
HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616 65.237894
> cor(states) #Pearson correlations by default
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
> cor(states, method = "spearman")
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
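The relationship between these matrices can be checked directly. In this sketch the Pearson matrix is rebuilt from the covariance matrix, and the Spearman matrix is rebuilt as a Pearson correlation of ranks; the variable names are chosen for the example.

```r
states <- state.x77[, 1:6]

# Pearson r = covariance divided by the product of the standard deviations
s        <- sqrt(diag(cov(states)))
r_manual <- cov(states) / outer(s, s)

# Spearman rho = Pearson r computed on the ranks of each variable
r_spearman <- cor(apply(states, 2, rank))

print(all.equal(r_manual, cor(states)))
print(all.equal(r_spearman, cor(states, method = "spearman")))
```

Both comparisons should report TRUE, which is why the correlation matrix is scale-free while the covariance matrix depends on the units of each variable.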
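By default you get a square correlation matrix (every variable crossed with every other); non-square matrices can also be computed.
#Compute a non-square matrix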
> x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[, c("Life Exp", "Murder")]
> cor(x,y)
Life Exp Murder
Population -0.06805195 0.3436428
Income 0.34025534 -0.2300776
Illiteracy -0.58847793 0.7029752
HS Grad 0.58221620 -0.4879710
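2. Partial correlations
A partial correlation is the correlation between two quantitative variables while controlling for one or more other quantitative variables. The pcor() function in the ggm package computes partial correlation coefficients
#Compute a partial correlation coefficient
> library(ggm)
> #Partial correlation between Population (1) and Murder (5), controlling for Income (2), Illiteracy (3), and HS Grad (6)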
> pcor(c(1,5,2,3,6), cov(states))
[1] 0.3462724
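For readers without the ggm package, the same quantity can be obtained in base R. The sketch below uses two standard routes (correlating regression residuals, and inverting the correlation matrix); it is not a description of how pcor() is implemented, and the variable names are chosen for the example.

```r
states <- state.x77[, 1:6]

# Route 1: regress both variables on the controls, then correlate the residuals
Z       <- cbind(1, states[, c("Income", "Illiteracy", "HS Grad")])  # intercept + controls
res_pop <- lm.fit(Z, states[, "Population"])$residuals
res_mur <- lm.fit(Z, states[, "Murder"])$residuals
pc_resid <- cor(res_pop, res_mur)

# Route 2: invert the correlation matrix of the five variables involved
idx <- c(1, 5, 2, 3, 6)                  # the two variables of interest come first
P   <- solve(cor(states)[idx, idx])
pc_inverse <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

cat("residual route:", round(pc_resid, 7), " inverse route:", round(pc_inverse, 7), "\n")
```

Both routes should reproduce the value returned by pcor() above.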
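3. Other types of correlations
The hetcor() function in the polycor package computes a heterogeneous correlation matrix
The usual null hypothesis: the variables are uncorrelated (that is, the population correlation is 0)
1. Use cor.test() to test an individual correlation coefficient
cor.test(x, y, alternative = , method = )
where x and y are the variables to be correlated; alternative specifies a two-sided or one-sided test (the default is "two.sided"; use alternative = "less" when the research hypothesis is that the population correlation is less than 0, and alternative = "greater" when it is that the population correlation is greater than 0); method specifies the type of correlation ("pearson", "kendall", or "spearman")
2. corr.test() in the psych package computes a correlation matrix together with significance tests
The argument use = "complete" applies listwise deletion of missing values; use = "pairwise" applies pairwise deletion
The argument method can be pearson (the default), spearman, or kendall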
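A short base R sketch of cor.test() on the states data used above; the choice of Illiteracy and Murder as the pair to test is made only for illustration.

```r
states <- state.x77[, 1:6]

# Two-sided Pearson test (the defaults) of Illiteracy against Murder
ct <- cor.test(states[, "Illiteracy"], states[, "Murder"])
print(ct)

# One-sided test: research hypothesis that the population correlation is > 0
ct_greater <- cor.test(states[, "Illiteracy"], states[, "Murder"],
                       alternative = "greater")

# Same pair, rank-based coefficient; exact = FALSE avoids the ties warning
ct_spearman <- cor.test(states[, "Illiteracy"], states[, "Murder"],
                        method = "spearman", exact = FALSE)

cat("r =", round(ct$estimate, 3), " two-sided p =", signif(ct$p.value, 3), "\n")
```

The estimate matches the Illiteracy/Murder entry of cor(states); cor.test() handles one pair at a time, which is where corr.test() from psych becomes useful.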
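When comparing two groups, the outcome variable is assumed to be continuous and normally distributed
A two-group independent-samples t-test can be used to test the hypothesis that the two population means are equal
Call format 1: y is a numeric variable and x is a dichotomous variable
t.test(y ~ x, data)
Call format 2: y1 and y2 are numeric vectors (the outcome variable for each group)
t.test(y1, y2)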
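The two call formats are interchangeable, as this sketch suggests. It uses mtcars (mpg by transmission type am) as a stand-in data set so that it runs without the MASS package; the variable names are chosen for the example.

```r
# Format 1: formula interface
fit_formula <- t.test(mpg ~ am, data = mtcars)

# Format 2: one numeric vector per group
mpg_auto   <- mtcars$mpg[mtcars$am == 0]
mpg_manual <- mtcars$mpg[mtcars$am == 1]
fit_vectors <- t.test(mpg_auto, mpg_manual)

# By default t.test() does not assume equal variances (Welch test);
# var.equal = TRUE gives the classical pooled-variance t-test
fit_pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

cat("Welch  t =", round(fit_formula$statistic, 3), " df =", round(fit_formula$parameter, 2), "\n")
cat("Pooled t =", round(fit_pooled$statistic, 3), " df =", fit_pooled$parameter, "\n")
```

The first two fits give the same statistic and p-value; only the labels in the printed output differ.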
#7.4.1 Independent-samples t-test
> library(MASS)
> t.test(Prob ~ So, data = UScrime)
Welch Two Sample t-test
data: Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
0.03851265 0.06371269
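The dependent-samples t-test assumes that the differences between the groups are normally distributed
Call format: y1 and y2 are numeric vectors for the two dependent groups
t.test(y1, y2, paired=TRUE)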
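A paired t-test is the same as a one-sample t-test on the pairwise differences, which is why the normality assumption concerns the differences. The sketch below uses the built-in sleep data (two drugs given to the same ten subjects) as a stand-in example; the variable names are chosen for the illustration.

```r
# Extra hours of sleep for the same 10 subjects under two drugs
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

paired_fit     <- t.test(drug1, drug2, paired = TRUE)
one_sample_fit <- t.test(drug1 - drug2)   # tests whether the mean difference is 0

cat("paired:     t =", round(paired_fit$statistic, 4), " p =", signif(paired_fit$p.value, 4), "\n")
cat("one-sample: t =", round(one_sample_fit$statistic, 4), " p =", signif(one_sample_fit$p.value, 4), "\n")
```

Both calls report the same t statistic, degrees of freedom (number of pairs minus one), and p-value.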
#7.4.2 Dependent-samples t-test
> library(MASS)
> sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x), sd=sd(x))))
U1 U2
mean 95.46809 33.97872
sd 18.02878 8.44545
> with(UScrime, t.test(U1, U2, paired=TRUE))
Paired t-test
data: U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
57.67003 65.30870
sample estimates:
mean of the differences
61.48936
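To compare more than two groups, use analysis of variance (ANOVA); see ch9
If the outcome variable is severely skewed or ordinal in nature, nonparametric tests can be used
1. If the two groups are independent, the Wilcoxon rank sum test can assess whether the observations are sampled from the same probability distribution
Call format 1: y is a numeric variable and x is a dichotomous variable
wilcox.test(y ~ x, data)
Call format 2: y1 and y2 are the outcome variables for each group
wilcox.test(y1, y2)
The default is a two-sided test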
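The W statistic reported by wilcox.test() can be reproduced from ranks, which shows that the test depends only on the ordering of the observations. This sketch uses mtcars (mpg by am) as a stand-in data set; suppressWarnings() hides the note R emits because mpg contains ties.

```r
y <- mtcars$mpg
g <- mtcars$am

r  <- rank(y)              # ranks over the pooled sample (ties get average ranks)
n0 <- sum(g == 0)          # size of the first group (am == 0)

# W = rank sum of the first group minus its smallest possible value
w_manual <- sum(r[g == 0]) - n0 * (n0 + 1) / 2

w_test <- suppressWarnings(wilcox.test(mpg ~ am, data = mtcars))

cat("manual W =", w_manual, " wilcox.test W =", w_test$statistic, "\n")
```

A W far from n0 * n1 / 2 (half the number of between-group pairs) indicates that one group tends to have larger values than the other.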
#Wilcoxon rank sum test
> with(UScrime, by(Prob, So, median))
So: 0
[1] 0.038201
----------------------------------------------------------------------------------
So: 1
[1] 0.055552
> wilcox.test(Prob ~ So, data=UScrime)
Wilcoxon rank sum exact test
data: Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0
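2. The Wilcoxon signed rank test is a nonparametric alternative to the dependent-samples t-test; it suits paired data where the normality assumption cannot be guaranteed
#Wilcoxon signed rank test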
> sapply(UScrime[c("U1","U2")], median)
U1 U2
92 34
> with(UScrime, wilcox.test(U1, U2, paired=TRUE))
Wilcoxon signed rank test with continuity correction
data: U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0
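When comparing more than two groups: if the groups are independent, use the Kruskal-Wallis test
If the groups are not independent (for example, repeated measures or a randomized block design), use the Friedman test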
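Both tests are available in base R as kruskal.test() and friedman.test(). The sketch below is illustrative only: the Kruskal-Wallis part compares illiteracy rates across the four US regions using built-in data, and the Friedman part uses a small made-up matrix of scores (rows are subjects, columns are conditions) that is not taken from these notes.

```r
# Kruskal-Wallis: independent groups (states grouped by region)
states <- data.frame(state.region, state.x77)
kw <- kruskal.test(Illiteracy ~ state.region, data = states)
print(kw)

# Friedman: dependent groups; each row is one subject measured under
# three conditions (made-up numbers for illustration only)
scores <- matrix(c(5, 7,  9,
                   4, 6,  8,
                   6, 7, 10,
                   5, 8,  9),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), c("A", "B", "C")))
fr <- friedman.test(scores)
print(fr)
```

In both cases a small p-value only says that at least one group differs; it does not say which ones.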