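R in Action notes
1. Descriptive statistics
The data are R's built-in mtcars: mpg is miles per gallon, hp is horsepower, wt is car weight (1000 lbs)
1) Descriptive statistics for continuous variables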
myvars<-c("mpg","hp","wt")
①summary():
>summary(mtcars[myvars])
mpg hp wt
Min. :10 Min. : 52 Min. :1.5
1st Qu.:15 1st Qu.: 96 1st Qu.:2.6
Median :19 Median :123 Median :3.3
Mean :20 Mean :147 Mean :3.2
3rd Qu.:23 3rd Qu.:180 3rd Qu.:3.6
Max. :34 Max. :335 Max. :5.4
②describe() from the psych package:
>describe(mtcars[myvars])
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.1 6.03 19.2 19.7 5.41 10.4 33.9 23.5 0.61 -0.37 1.07
hp 2 32 146.7 68.56 123.0 141.2 77.10 52.0 335.0 283.0 0.73 -0.14 12.12
wt 3 32 3.2 0.98 3.3 3.1 0.77 1.5 5.4 3.9 0.42 -0.02 0.17
Note: trimmed is the trimmed mean (default trim = 0.1), skew is skewness, kurtosis is kurtosis
③stat.desc() from the pastecs package:
>stat.desc(mtcars[myvars])
mpg hp wt
nbr.val 32.0 32.00 32.00
nbr.null 0.0 0.00 0.00
nbr.na 0.0 0.00 0.00
min 10.4 52.00 1.51
max 33.9 335.00 5.42
range 23.5 283.00 3.91
sum 642.9 4694.00 102.95
median 19.2 123.00 3.33
mean 20.1 146.69 3.22
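SE.mean 1.1 12.12 0.17 #standard error of the mean
CI.mean.0.95 2.2 24.72 0.35 #half-width of the 95% confidence interval for the mean
var 36.3 4700.87 0.96
std.dev 6.0 68.56 0.98 #standard deviation
coef.var 0.3 0.47 0.30 #coefficient of variation (std.dev / mean)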
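The three functions above report overlapping statistics. As a minimal base-R sketch of the same idea, the block below applies a user-written summary function to each column with sapply(). The helper name mystats and the particular skewness and kurtosis formulas are illustrative assumptions, not the definitions used by psych or pastecs.

```r
# Hypothetical helper: n, mean, sd, skewness and excess kurtosis of a numeric vector.
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n       # third standardized moment
  kurt <- sum((x - m)^4 / s^4) / n - 3   # fourth standardized moment minus 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

myvars <- c("mpg", "hp", "wt")
res <- sapply(mtcars[myvars], mystats)   # one column of statistics per variable
print(round(res, 2))
```

sapply() returns a matrix with one row per statistic, so single values can be picked out by name, for example res["mean", "mpg"].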
2) Descriptive statistics by group
myvars<-c("mpg","hp","wt")
①aggregate():
>aggregate(mtcars[myvars],by=list(am=mtcars$am),mean)
am mpg hp wt
1 0 17 160 3.8
2 1 24 127 2.4
Note: with list(mtcars$am) the grouping column is named Group.1 instead of am
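aggregate() also has a formula interface, which keeps the grouping column's name without the list(am=...) wrapper. The sketch below assumes the same mtcars variables; the object names by_am and by_am_cyl are made up for illustration.

```r
# Group means by transmission type (am) using the formula interface.
by_am <- aggregate(cbind(mpg, hp, wt) ~ am, data = mtcars, FUN = mean)
print(by_am)

# Two grouping variables: transmission type and number of cylinders.
by_am_cyl <- aggregate(cbind(mpg, hp, wt) ~ am + cyl, data = mtcars, FUN = mean)
print(by_am_cyl)
```

Both forms accept only functions that return a single value per group, such as mean or sd.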
②describeBy() from the psych package:
>describeBy(mtcars[myvars],list(am=mtcars$am))
Descriptive statistics by group
am: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 19 17.1 3.83 17.3 17.1 3.11 10.4 24.4 14 0.01 -0.80 0.88
hp 2 19 160.3 53.91 175.0 161.1 77.10 62.0 245.0 183 -0.01 -1.21 12.37
wt 3 19 3.8 0.78 3.5 3.8 0.45 2.5 5.4 3 0.98 0.14 0.18
----------------------------------------------------------------------------------
am: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 13 24.4 6.17 22.8 24.4 6.67 15.0 33.9 18.9 0.05 -1.46 1.71
hp 2 13 126.8 84.06 109.0 114.7 63.75 52.0 335.0 283.0 1.36 0.56 23.31
wt 3 13 2.4 0.62 2.3 2.4 0.68 1.5 3.6 2.1 0.21 -1.17 0.17
2. Frequency tables
The data are Arthritis from the vcd package
1) One-way
>table(Arthritis$Improved)
None Some Marked
42 14 28
2) Two-way
①xtabs:
>mytable<-xtabs(~Treatment+Improved,data=Arthritis) #~A+B: A is the row variable, B is the column variable
>mytable
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
#addmargins(mytable) and addmargins(prop.table(mytable)) add marginal frequencies and proportions
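A short sketch of the margin functions mentioned above. To run without the vcd package, the Treatment by Improved counts printed above are typed in by hand; that shortcut and the object names with_totals, row_props and row_props_t are assumptions of this example.

```r
# Rebuild the Treatment x Improved counts shown above as a base-R table.
mytable <- as.table(matrix(
  c(29, 7,  7,
    13, 7, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

with_totals <- addmargins(mytable)        # adds a Sum row and a Sum column
row_props   <- prop.table(mytable, 1)     # proportions within each row
row_props_t <- addmargins(row_props, 2)   # adds a Sum column only
print(with_totals)
print(round(row_props_t, 3))
```

The second argument of prop.table() and addmargins() selects the dimension: 1 for rows, 2 for columns.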
②CrossTable() from the gmodels package:
>CrossTable(Arthritis$Treatment,Arthritis$Improved)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 84
| Arthritis$Improved
Arthritis$Treatment | None | Some | Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
Placebo | 29 | 7 | 7 | 43 |
| 2.616 | 0.004 | 3.752 | |
| 0.674 | 0.163 | 0.163 | 0.512 |
| 0.690 | 0.500 | 0.250 | |
| 0.345 | 0.083 | 0.083 | |
--------------------|-----------|-----------|-----------|-----------|
Treated | 13 | 7 | 21 | 41 |
| 2.744 | 0.004 | 3.935 | |
| 0.317 | 0.171 | 0.512 | 0.488 |
| 0.310 | 0.500 | 0.750 | |
| 0.155 | 0.083 | 0.250 | |
--------------------|-----------|-----------|-----------|-----------|
Column Total | 42 | 14 | 28 | 84 |
| 0.500 | 0.167 | 0.333 | |
--------------------|-----------|-----------|-----------|-----------|
3) Multiway
mytable<-xtabs(~Treatment+Improved+Sex, data=Arthritis) #~A+B+C: A, B, C are dimensions 1, 2, 3; in ftable() output A and B label the rows and C labels the columns
①ftable():
>ftable(mytable)
Sex Female Male
Treatment Improved
Placebo None 19 10
Some 7 0
Marked 6 1
Treated None 6 7
Some 5 2
Marked 16 5
margin.table(mytable,x) works here as well #x is a dimension index or a vector such as c(x,y); 1, 2, 3 refer to A, B, C
>margin.table(mytable,c(1,3))
Sex
Treatment Female Male
Placebo 32 11
Treated 27 14
Or use ftable(addmargins(prop.table(mytable,c(1,2)),3)), multiplied by 100, to get the corresponding percentages
> ftable(addmargins(prop.table(mytable,c(1,2)),3))*100
Sex Female Male Sum
Treatment Improved
Placebo None 65.51724 34.48276 100.00000
Some 100.00000 0.00000 100.00000
Marked 85.71429 14.28571 100.00000
Treated None 46.15385 53.84615 100.00000
Some 71.42857 28.57143 100.00000
Marked 76.19048 23.80952 100.00000
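The dimension indices are easier to see on a small reproducible object. The sketch below types the three-way counts from the ftable() output above into a base-R array, so vcd is not required; the names counts, mytable3, trt_by_sex and sex_share are illustrative.

```r
# Treatment x Improved x Sex counts, filled with Treatment varying fastest.
counts <- array(
  c(19, 6, 7, 5, 6, 16,   # Female: (Placebo, Treated) for None, Some, Marked
    10, 7, 0, 2, 1,  5),  # Male:   same order
  dim = c(2, 3, 2),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"),
                  Sex       = c("Female", "Male"))
)
mytable3 <- as.table(counts)

trt_by_sex <- margin.table(mytable3, c(1, 3))   # keep dimensions 1 and 3, sum over Improved
sex_share  <- prop.table(mytable3, c(1, 2))     # share of each Sex within Treatment x Improved
print(trt_by_sex)
print(ftable(round(sex_share * 100, 1)))
```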
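4) Tests of independence (are the variables related or independent?)
①Chi-square test, chisq.test() #two-way tables
First build a table of the variables of interest; here we test the relationship between treatment and improvement
mytable<-xtabs(~Treatment+Improved,data=Arthritis)
Then run chisq.test(mytable) on it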
>chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 13.055, df = 2, p-value = 0.001463
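#In statistics the "=" always goes in the null hypothesis; H0: the two variables are independent
The p-value is the probability of seeing data at least this extreme if H0 were true; p<0.01 means such data would be very unlikely under independence
so H0 is rejected at the 0.01 significance level: treatment and improvement are related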
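To make the test statistic concrete, the sketch below recomputes Pearson's chi-square by hand and compares it with chisq.test(). The counts are copied from the Treatment by Improved table above so that the block runs without vcd; the variable names expected, x2, df and p are illustrative.

```r
# Treatment x Improved counts from the table above.
mytable <- as.table(matrix(
  c(29, 7,  7,
    13, 7, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(mytable), colSums(mytable)) / sum(mytable)

x2 <- sum((mytable - expected)^2 / expected)      # Pearson chi-square statistic
df <- (nrow(mytable) - 1) * (ncol(mytable) - 1)   # degrees of freedom
p  <- pchisq(x2, df, lower.tail = FALSE)          # upper-tail probability under H0

res <- chisq.test(mytable)
print(c(by_hand = x2, chisq_test = unname(res$statistic), p_value = p))
```

Each term of the sum is the "Chi-square contribution" that CrossTable() prints per cell.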
Likewise, to examine the relationship between treatment and sex, first build the table
mytable<-xtabs(~Treatment+Sex,data=Arthritis)
>chisq.test(mytable)
Pearson's Chi-squared test with Yates' continuity correction
data: mytable
X-squared = 0.38378, df = 1, p-value = 0.5356
#By the same reasoning, p = 0.54 gives no evidence against H0, so treatment and sex are treated as independent
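The output header mentions Yates' continuity correction because chisq.test() applies it by default to 2 x 2 tables. A small sketch of the difference, using the Treatment by Sex counts from the margin.table() output earlier (typed in by hand; sex_table, with_yates and without_yates are illustrative names):

```r
# Treatment x Sex counts.
sex_table <- as.table(matrix(
  c(32, 11,
    27, 14),
  nrow = 2, byrow = TRUE,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"))
))

with_yates    <- chisq.test(sex_table)                   # correct = TRUE is the default
without_yates <- chisq.test(sex_table, correct = FALSE)  # plain Pearson statistic

print(c(with_yates    = unname(with_yates$statistic),
        without_yates = unname(without_yates$statistic)))
```

The correction shrinks each |observed - expected| by 0.5 before squaring, so the corrected statistic is the smaller of the two.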
②Fisher's exact test
First build a table of the variables of interest; here treatment and improvement are used as the example
mytable<-xtabs(~Treatment+Improved,data=Arthritis)
> fisher.test(mytable)
Fisher's Exact Test for Count Data
data: mytable
p-value = 0.001393
alternative hypothesis: two.sided
#Again H0: the two variables are independent; the small p-value is evidence against H0
③Cochran-Mantel-Haenszel test
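The notes give no example under this heading. As a hedged sketch only: base R's mantelhaen.test() takes a three-way table and tests whether the first two variables are conditionally independent within each level of the third. Below, the three-way counts from the ftable() output above are entered by hand, and the test asks whether Treatment and Improved are independent within each Sex; the object names are illustrative.

```r
# Treatment x Improved x Sex counts, filled with Treatment varying fastest.
mytable3 <- as.table(array(
  c(19, 6, 7, 5, 6, 16,   # Female
    10, 7, 0, 2, 1,  5),  # Male
  dim = c(2, 3, 2),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"),
                  Sex       = c("Female", "Male"))
))

# H0: Treatment and Improved are independent within each stratum of Sex.
cmh <- mantelhaen.test(mytable3)
print(cmh)
```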
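3. Correlation
The data are columns 1-6 of state.x77
1) Types of correlation and how to compute them
①cor(x,use= ,method= ) #x: matrix or data frame, use: how missing data are handled, method: type of correlation
Pearson product-moment correlation: the default; degree of linear relationship between two quantitative variables
Spearman rank-order correlation: degree of relationship between two rank-ordered variables
Kendall's tau: a nonparametric measure of rank correlation
By default the result is a square matrix (all pairwise correlations); a non-square correlation matrix can be computed by defining x and y and calling cor(x,y)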
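A brief sketch of the calls described above on the state.x77 columns. The split into x and y is one arbitrary choice for illustrating a non-square result, and the object names are made up.

```r
states <- state.x77[, 1:6]

r_pearson  <- cor(states)                        # default: Pearson, square 6 x 6 matrix
r_spearman <- cor(states, method = "spearman")   # rank-based alternative

# Non-square matrix: rows come from x, columns from y.
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
r_xy <- cor(x, y)

print(round(r_pearson, 2))
print(round(r_xy, 2))
```

When the data contain missing values, use = "pairwise.complete.obs" or use = "complete.obs" controls which rows enter each correlation.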
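②pcor(u,s) from the ggm package #partial correlation: the correlation between two variables controlling for one or more others; u is a vector of column indices (the first two are the variables of interest, the rest are the conditioning variables), s is the covariance matrix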
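For readers without ggm installed, a partial correlation can be sketched in base R. The helper partial_cor below is hypothetical (it is not ggm's code); it mimics the (u, s) argument convention and is cross-checked against the textbook definition, the correlation of regression residuals.

```r
states <- state.x77[, 1:6]

# Hypothetical helper: partial correlation of variables u[1] and u[2]
# given the remaining variables in u, from a covariance matrix s.
partial_cor <- function(u, s) {
  r <- cov2cor(s[u, u])   # rescale to correlations for numerical stability
  p <- solve(r)           # inverse (precision) matrix
  -p[1, 2] / sqrt(p[1, 1] * p[2, 2])
}

# Population (1) and Murder (5), controlling for Income (2), Illiteracy (3), HS Grad (6).
pc <- partial_cor(c(1, 5, 2, 3, 6), cov(states))

# Cross-check: correlate the residuals after regressing each variable on the controls.
z  <- states[, c(2, 3, 6)]
e1 <- resid(lm(states[, 1] ~ z))
e5 <- resid(lm(states[, 5] ~ z))
pc_resid <- cor(e1, e5)

print(c(precision = pc, residuals = pc_resid))
```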
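2) Testing correlations for significance
cor.test(x,y,alternative= ,method= ) #defaults: two-sided test, method="pearson"
H0: the correlation between the two variables is 0; a small p-value is evidence that the correlation differs from 0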
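A minimal sketch of cor.test() on two of the state.x77 columns. The pair Life Exp and Murder is chosen here only for illustration, and ct and ct_less are made-up names.

```r
states <- state.x77[, 1:6]

# Two-sided Pearson test (the defaults).
ct <- cor.test(states[, "Life Exp"], states[, "Murder"])

# One-sided test: H1 is that the correlation is less than 0.
ct_less <- cor.test(states[, "Life Exp"], states[, "Murder"],
                    alternative = "less", method = "pearson")

print(ct)
print(ct_less$p.value)
```

cor.test() handles one pair of variables at a time; the estimate it reports is the same value cor() returns for that pair.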