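[R in Action] Ch7 Basic Statistics

Ch7 Basic Statistics

  • 7.1 Descriptive statistics
    • 7.1.1 A menagerie of methods
    • 7.1.2 Descriptive statistics by group
  • 7.2 Frequency and contingency tables
    • 7.2.1 Generating frequency tables
    • 7.2.2 Tests of independence
    • 7.2.3 Measures of association
    • 7.2.5 Converting tables to flat files
  • 7.3 Correlations
    • 7.3.1 Types of correlations
    • 7.3.2 Testing correlations for significance
  • 7.4 t-tests
    • 7.4.1 Independent t-test
    • 7.4.2 Dependent t-test
    • 7.4.3 When there are more than two groups
  • 7.5 Nonparametric tests of group differences
    • 7.5.1 Comparing two groups
    • 7.5.2 Comparing more than two groups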

7.1 Descriptive statistics

7.1.1 A menagerie of methods

1. Base installation: the summary() function provides descriptive statistics

#7-1 The summary() function
> vars <- c("mpg", "hp", "wt")
> head(mtcars[vars])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

> summary(mtcars[vars])
      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424  

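2. Define a custom function and apply it to each column with sapply()

#7-2 Using sapply() with a user-defined function that also computes skewness and kurtosis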
> mystats <- function(x, na.omit=FALSE){
+   if (na.omit)
+     x <- x[!is.na(x)]
+   m <- mean(x)
+   n <- length(x)
+   s <- sd(x)
+   skew <- sum((x-m)^3/s^3)/n
+   kurt <- sum((x-m)^4/s^4)/n - 3
+   return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
+ }
> sapply(mtcars[vars], mystats)
               mpg          hp          wt
n        32.000000  32.0000000 32.00000000
mean     20.090625 146.6875000  3.21725000
stdev     6.026948  68.5628685  0.97845744
skew      0.610655   0.7260237  0.42314646
kurtosis -0.372766  -0.1355511 -0.02271075

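3. Extensions
Similar descriptive statistics are available from describe() in the Hmisc package, stat.desc() in the pastecs package, and describe() in the psych package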
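
As a small illustration of the na.omit argument of mystats() from item 2, the sketch below runs the function on a vector that contains a missing value. The vector x is made up for this example; the function body follows listing 7-2.

```r
# mystats() as in listing 7-2
mystats <- function(x, na.omit = FALSE) {
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

x <- c(2, 4, 4, 5, NA, 9)                 # hypothetical data with one missing value

with_na    <- mystats(x)                  # NA propagates through mean() and sd()
without_na <- mystats(x, na.omit = TRUE)  # statistics on the 5 observed values

print(with_na)
print(without_na)
```

With sapply(), the extra argument can be passed along, for example sapply(mydata, mystats, na.omit = TRUE), where mydata stands for any data frame of numeric columns.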

7.1.2 Descriptive statistics by group

1. Grouping with aggregate()

#7-6 Grouping with aggregate()
> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
  am      mpg       hp       wt
1  0 17.14737 160.2632 3.768895
2  1 24.39231 126.8462 2.411000
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
  am      mpg       hp        wt
1  0 3.833966 53.90820 0.7774001
2  1 6.166504 84.06232 0.6169816

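Note:
As called here, aggregate() is best suited to functions that return a single value, such as mean and sd. To return several statistics per group, use the by() function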
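
A minimal sketch of the by() approach mentioned in the note, using the same mtcars columns. The helper dstats is a name introduced here for illustration only.

```r
vars <- c("mpg", "hp", "wt")

# Hypothetical helper that returns two statistics for one column
dstats <- function(x) c(mean = mean(x), sd = sd(x))

# Split mtcars by transmission type (am) and apply dstats to every column
res <- by(mtcars[vars], mtcars$am, function(d) sapply(d, dstats))
print(res)
```

Each element of res is a small matrix with one row per statistic and one column per variable, so res[[1]] holds the results for am = 0.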

7.2 Frequency and contingency tables

Data used in the examples:

> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

7.2.1 Generating frequency tables

1. One-way tables

The table() function generates simple frequency counts

#One-way table
> options(digits = 3)
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked 
    42     14     28 

#Convert to proportions
> prop.table(mytable)
Improved
  None   Some Marked 
 0.500  0.167  0.333 

#Convert to percentages
> prop.table(mytable)*100
Improved
  None   Some Marked 
  50.0   16.7   33.3 

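2. Two-way tables

Method 1: the table() function

mytable <- table(A, B)

where A is the row variable and B is the column variable

Method 2: the xtabs() function

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame.
The variables to be cross-classified appear on the right side of ~, separated by + signs
If a variable appears on the left side of ~, it is treated as a vector of frequencies (useful when the data have already been tabulated)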
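
When the data are already aggregated, with one row per combination and a column of counts, the count column goes on the left side of the xtabs() formula. The sketch below uses a small made-up data frame; the names counts, group, and outcome are hypothetical.

```r
# Hypothetical pre-tabulated data: one row per group/outcome combination
counts <- data.frame(
  group   = rep(c("A", "B"), times = 2),
  outcome = rep(c("No", "Yes"), each = 2),
  Freq    = c(10, 4, 6, 12)
)

# Freq on the left of ~ supplies the cell counts;
# group and outcome on the right are cross-classified
tab <- xtabs(Freq ~ group + outcome, data = counts)
print(tab)
```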

#Two-way table
> mytable <- xtabs(~ Treatment+Improved, data = Arthritis)
> #Treatment is the row variable and Improved is the column variable
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

#Generate marginal frequencies and proportions; the index refers to the nth variable in the table
#Row sums and row proportions
> margin.table(mytable, 1)
Treatment
Placebo Treated 
     43      41 
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512

#Column sums and column proportions
> margin.table(mytable, 2)
Improved
  None   Some Marked 
    42     14     28 
> prop.table(mytable, 2)
         Improved
Treatment None Some Marked
  Placebo 0.69 0.50   0.25
  Treated 0.31 0.50   0.75

#Cell proportions
> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500

Use the addmargins() function to add marginal sums to a table

#Add marginal sums
> addmargins(mytable) #by default, sums are added for all variables in the table
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000

#Add a Sum column only (the sum of each row)
> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000

#Add a Sum row only (the sum of each column)
> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment None Some Marked
  Placebo 0.69 0.50   0.25
  Treated 0.31 0.50   0.75
  Sum     1.00 1.00   1.00

Method 3: the CrossTable() function in the gmodels package

#7-11 Two-way table using CrossTable
> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |           | 
                    |     0.317 |     0.171 |     0.512 |     0.488 | 
                    |     0.310 |     0.500 |     0.750 |           | 
                    |     0.155 |     0.083 |     0.250 |           | 
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 | 
                    |     0.500 |     0.167 |     0.333 |           | 
--------------------|-----------|-----------|-----------|-----------|
 

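3. Multidimensional tables

The table(), xtabs(), margin.table(), prop.table(), and addmargins() functions used above all extend to more than two dimensions.
The ftable() function prints multidimensional tables in a compact form

#7-12 Three-way contingency table
#Generate the three-way table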
> mytable <- xtabs(~ Treatment+Sex+Improved, data = Arthritis)
> mytable
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5


#Marginal frequencies
> margin.table(mytable, 1)
Treatment
Placebo Treated 
     43      41 
> margin.table(mytable, 2)
Sex
Female   Male 
    59     25 
> margin.table(mytable, 3)
Improved
  None   Some Marked 
    42     14     28 


#Marginal frequencies for pairs of variables
> margin.table(mytable, c(1,3))
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> margin.table(mytable, c(2,3))
        Improved
Sex      None Some Marked
  Female   25   12     22
  Male     17    2      6
> margin.table(mytable, c(1,2))
         Sex
Treatment Female Male
  Placebo     32   11
  Treated     27   14


#Proportions of Improved within each Treatment x Sex combination
> ftable(prop.table(mytable, c(1, 2)))
                 Improved   None   Some Marked
Treatment Sex                                 
Placebo   Female          0.5938 0.2188 0.1875
          Male            0.9091 0.0000 0.0909
Treated   Female          0.2222 0.1852 0.5926
          Male            0.5000 0.1429 0.3571


#Add a Sum margin over the third variable (Improved)
> ftable(addmargins(prop.table(mytable, c(1,2)), 3))
                 Improved   None   Some Marked    Sum
Treatment Sex                                        
Placebo   Female          0.5938 0.2188 0.1875 1.0000
          Male            0.9091 0.0000 0.0909 1.0000
Treated   Female          0.2222 0.1852 0.5926 1.0000
          Male            0.5000 0.1429 0.3571 1.0000


#Percentages instead of proportions
> ftable(addmargins(prop.table(mytable, c(1,2)), 3)) * 100
                 Improved   None   Some Marked    Sum
Treatment Sex                                        
Placebo   Female           59.38  21.88  18.75 100.00
          Male             90.91   0.00   9.09 100.00
Treated   Female           22.22  18.52  59.26 100.00
          Male             50.00  14.29  35.71 100.00

7.2.2 Tests of independence

1. Chi-square test of independence
Use the chisq.test() function

#7-13 Chi-square test of independence
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)

	Pearson's Chi-squared test

data:  mytable
X-squared = 13.1, df = 2, p-value = 0.0015
#p < 0.05: reject the hypothesis that Treatment and Improved are independent


> mytable <- xtabs(~Improved+Sex, data=Arthritis)
> chisq.test(mytable)

	Pearson's Chi-squared test

data:  mytable
X-squared = 4.84, df = 2, p-value = 0.089
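#p > 0.05: no evidence against the independence of Improved and Sex

The p-value is the probability of obtaining results at least as extreme as those sampled, assuming that the row and column variables are independent in the population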
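
To make the null hypothesis concrete, the sketch below rebuilds the Treatment x Improved table from the counts printed in section 7.2.1 (so the vcd package is not needed) and looks at the counts expected under independence. The object names tab and res are chosen here for illustration.

```r
# Treatment x Improved counts copied from the two-way table in section 7.2.1
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))

res <- chisq.test(tab)

print(res$expected)   # counts expected if Treatment and Improved were independent
print(res$statistic)  # Pearson chi-square statistic
print(res$p.value)
```

Each expected count is the row total times the column total divided by the grand total; the statistic summarizes how far the observed counts are from these values.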

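2. Fisher's exact test
Use the fisher.test() function
Null hypothesis: the rows and columns of a contingency table with fixed marginals are independent.

#Fisher's exact test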
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

	Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.0014
alternative hypothesis: two.sided

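3. Cochran-Mantel-Haenszel test
Use the mantelhaen.test() function
Null hypothesis: two nominal variables are conditionally independent in each stratum of a third variable
The following code tests whether Treatment and Improved are independent within each level of Sex

#Cochran-Mantel-Haenszel test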
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)

	Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.6, df = 2, p-value = 0.00066

p < 0.05: the hypothesis that Treatment and Improved are independent within each level of Sex is rejected

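7.2.3 Measures of association

If the variables are not independent, the strength of the association can be measured
Use the assocstats() function in the vcd package

#7-14 Measures of association for a two-way table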
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> assocstats(mytable)
                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : NA 
Contingency Coeff.: 0.367 
Cramer's V        : 0.394 
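
For reference, the contingency coefficient and Cramer's V reported above can be recomputed from the Pearson chi-square statistic with the usual textbook formulas. This is a base R sketch, not the internals of assocstats(); the table is rebuilt from the counts printed in section 7.2.1.

```r
# Treatment x Improved counts copied from the two-way table in section 7.2.1
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))

chi2 <- unname(chisq.test(tab)$statistic)  # Pearson chi-square
n    <- sum(tab)                           # total number of observations
k    <- min(dim(tab)) - 1                  # min(rows, columns) - 1

contingency_coef <- sqrt(chi2 / (chi2 + n))
cramers_v        <- sqrt(chi2 / (n * k))

print(round(c(contingency = contingency_coef, cramers_v = cramers_v), 3))
```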

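7.2.5 Converting tables to flat files

The user-defined function table2flat performs the conversion

#7-15 User-defined function table2flat: convert a table to a flat file
> table2flat <- function(mytable){
+   df <- as.data.frame(mytable)
+   rows <- dim(df)[1]
+   cols <- dim(df)[2]
+   x <- NULL
+   for (i in 1:rows){
+     for (j in 1:df$Freq[i]) {       #repeat row i Freq[i] times
+       row <- df[i, c(1:(cols-1))]   #all columns except Freq
+       x <- rbind(x, row)
+     }
+   }
+   row.names(x) <- c(1:dim(x)[1])
+   return(x)
+ }


#7-16 Using table2flat to convert published data
> treatment <- rep(c("Placebo", "Treated"), times=3)
> improved <- rep(c("None", "Some", "Marked"), each=2)
> Freq <- c(29,13,7,17,7,21)
> mytable <- data.frame(treatment, improved, Freq) #data.frame() keeps Freq numeric

> mydata <- table2flat(mytable)
> head(mydata)
  treatment improved
1   Placebo     None
2   Placebo     None
3   Placebo     None
4   Placebo     None
5   Placebo     None
6   Placebo     None

Note: an earlier version of this function failed with

Error in rbind(x, row) : 
  cannot coerce type 'closure' to vector of type 'list'

The cause was a typo: the inner loop assigned to rows instead of row, so rbind() received the base function row() (a closure) rather than a data frame row. Two further typos, rows.names and dims, should read row.names and dim.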
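
As an alternative to the nested loops, the same expansion can be sketched with rep() on the row indices. This is a different technique from table2flat, shown under the assumption that the counts sit in a column named Freq; the names counts, idx, and flat are introduced here.

```r
treatment <- rep(c("Placebo", "Treated"), times = 3)
improved  <- rep(c("None", "Some", "Marked"), each = 2)
Freq      <- c(29, 13, 7, 17, 7, 21)
counts    <- data.frame(treatment, improved, Freq)

# Repeat each row index Freq times, then keep every column except Freq
idx  <- rep(seq_len(nrow(counts)), times = counts$Freq)
flat <- counts[idx, c("treatment", "improved")]
row.names(flat) <- NULL   # reset row names to 1..n

print(nrow(flat))         # one row per observation
print(head(flat))
```

Tabulating flat again with table(flat$treatment, flat$improved) recovers the original counts, which is a quick way to check the conversion.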

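7.3 Correlations

Correlation coefficients describe relationships among quantitative variables.
The sign of the coefficient indicates the direction of the relationship (positive or negative), and its magnitude indicates the strength (0 for no relationship, 1 for a perfectly predictable relationship)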

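7.3.1 Types of correlations

1. Pearson, Spearman, and Kendall correlations

The Pearson product-moment correlation measures the degree of linear relationship between two quantitative variables
Spearman's rank-order correlation measures the degree of relationship between two rank-ordered variables
Kendall's tau is also a nonparametric measure of rank correlation
The cor() function computes all three correlation coefficients
The cov() function computes covariances

#7.3 Correlations
> #7-17 Covariances and correlations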
> states <- state.x77[,1:6]

> cov(states)
              Population      Income   Illiteracy     Life Exp      Murder      HS Grad
Population 19931683.7588 571229.7796  292.8679592 -407.8424612 5663.523714 -3551.509551
Income       571229.7796 377573.3061 -163.7020408  280.6631837 -521.894286  3076.768980
Illiteracy      292.8680   -163.7020    0.3715306   -0.4815122    1.581776    -3.235469
Life Exp       -407.8425    280.6632   -0.4815122    1.8020204   -3.869480     6.312685
Murder         5663.5237   -521.8943    1.5817755   -3.8694804   13.627465   -14.549616
HS Grad       -3551.5096   3076.7690   -3.2354694    6.3126849  -14.549616    65.237894

> cor(states) #Pearson correlation by default
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000

> cor(states, method = "spearman")
           Population     Income Illiteracy   Life Exp     Murder    HS Grad
Population  1.0000000  0.1246098  0.3130496 -0.1040171  0.3457401 -0.3833649
Income      0.1246098  1.0000000 -0.3145948  0.3241050 -0.2174623  0.5104809
Illiteracy  0.3130496 -0.3145948  1.0000000 -0.5553735  0.6723592 -0.6545396
Life Exp   -0.1040171  0.3241050 -0.5553735  1.0000000 -0.7802406  0.5239410
Murder      0.3457401 -0.2174623  0.6723592 -0.7802406  1.0000000 -0.4367330
HS Grad    -0.3833649  0.5104809 -0.6545396  0.5239410 -0.4367330  1.0000000

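By default the result is a square matrix (every variable crossed with every other variable); nonsquare matrices can also be computed.

#Nonsquare correlation matrix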
> x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[, c("Life Exp", "Murder")]
> cor(x,y)
              Life Exp     Murder
Population -0.06805195  0.3436428
Income      0.34025534 -0.2300776
Illiteracy -0.58847793  0.7029752
HS Grad     0.58221620 -0.4879710

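2. Partial correlations
A partial correlation is the correlation between two quantitative variables while controlling for one or more other quantitative variables. The pcor() function in the ggm package computes partial correlation coefficients

#Partial correlation
> library(ggm)
> #Partial correlation of population (1) and murder rate (5), controlling for income (2), illiteracy rate (3), and high school graduation rate (6)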
> pcor(c(1,5,2,3,6), cov(states))
[1] 0.3462724
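
One way to see what "controlling for" means: a partial correlation equals the ordinary correlation between the residuals left after regressing each of the two variables on the control variables. The base R sketch below should reproduce the pcor() value above without the ggm package. Column names are passed through make.names() so that HS Grad can be used in a formula as HS.Grad.

```r
states <- as.data.frame(state.x77[, 1:6])
names(states) <- make.names(names(states))   # "HS Grad" becomes "HS.Grad"

# Residuals after removing the linear effect of the three control variables
r_pop    <- resid(lm(Population ~ Income + Illiteracy + HS.Grad, data = states))
r_murder <- resid(lm(Murder     ~ Income + Illiteracy + HS.Grad, data = states))

partial_r <- cor(r_pop, r_murder)
print(partial_r)
```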

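3. Other types of correlations
The hetcor() function in the polycor package computes a heterogeneous correlation matrix (Pearson, polyserial, or polychoric correlations, depending on the variable types)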

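7.3.2 Testing correlations for significance

Typical null hypothesis: no relationship (the population correlation is 0)

1. Use cor.test() to test an individual correlation coefficient

cor.test(x, y, alternative = , method = )

where x and y are the variables to be correlated, and alternative specifies a two-tailed or one-tailed test (the default is "two.sided"; use alternative = "less" when the research hypothesis is that the population correlation is less than 0, and alternative = "greater" when it is that the population correlation is greater than 0). The method option specifies the type of correlation ("pearson", "kendall", or "spearman")

2. The corr.test() function in the psych package computes a correlation matrix together with significance levels
The option use = "complete" applies listwise deletion of missing values; use = "pairwise" applies pairwise deletion
The option method can be "pearson" (the default), "spearman", or "kendall"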
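
A short sketch of the cor.test() call format from item 1, applied to life expectancy and murder rate in the states data used earlier. The choice of this pair and the object names ct and ct_less are assumptions made for illustration.

```r
states <- state.x77[, 1:6]

# Two-sided test of the Pearson correlation (the defaults)
ct <- cor.test(states[, "Life Exp"], states[, "Murder"])
print(ct$estimate)
print(ct$p.value)

# One-sided test: research hypothesis that the population correlation is below 0
ct_less <- cor.test(states[, "Life Exp"], states[, "Murder"],
                    alternative = "less")
print(ct_less$p.value)
```

The estimate matches the Life Exp / Murder entry of the Pearson matrix printed in section 7.3.1.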

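7.4 t-tests

This section compares two groups on a continuous outcome variable that is assumed to be normally distributed

7.4.1 Independent t-test

A two-group independent t-test can be used to test the hypothesis that the two population means are equal
Call format 1: y is a numeric variable and x is a dichotomous variable

t.test(y ~ x, data)

Call format 2: y1 and y2 are numeric vectors (the outcome variable for each group)

t.test(y1, y2)

#7.4.1 Independent t-test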
> library(MASS)
> t.test(Prob ~ So, data = UScrime)

	Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1 
     0.03851265      0.06371269 
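
The two call formats give the same test. The sketch below checks this on the built-in mtcars data (chosen here so that no extra package is needed) and also shows the var.equal = TRUE option, which requests the pooled-variance test instead of the default Welch test seen in the output above.

```r
# Format 1: formula interface, mpg by transmission type
fit_formula <- t.test(mpg ~ am, data = mtcars)

# Format 2: two numeric vectors, one per group
mpg_am0 <- mtcars$mpg[mtcars$am == 0]
mpg_am1 <- mtcars$mpg[mtcars$am == 1]
fit_vectors <- t.test(mpg_am0, mpg_am1)

# Pooled-variance version (assumes equal variances in the two groups)
fit_pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

print(c(welch_df  = unname(fit_formula$parameter),
        pooled_df = unname(fit_pooled$parameter)))
```

The Welch test adjusts the degrees of freedom for unequal variances, which is why its df is generally not a whole number.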

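7.4.2 Dependent t-test

A dependent (paired) t-test assumes that the differences between the groups are normally distributed
Call format: y1 and y2 are the numeric vectors for the two dependent groups

t.test(y1, y2, paired=TRUE)

#7.4.2 Dependent t-test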
> library(MASS)
> sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x), sd=sd(x))))
           U1       U2
mean 95.46809 33.97872
sd   18.02878  8.44545
> with(UScrime, t.test(U1, U2, paired=TRUE))

	Paired t-test

data:  U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 57.67003 65.30870
sample estimates:
mean of the differences 
               61.48936 
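
A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which is why the output reports the "mean of the differences". The sketch below illustrates this with the built-in sleep data (an assumption made to keep the example package-free), where the same 10 subjects were measured under two drugs.

```r
# sleep: extra hours of sleep for the same 10 subjects under two drugs
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]

paired_fit <- t.test(x1, x2, paired = TRUE)
diff_fit   <- t.test(x1 - x2)   # one-sample test that the mean difference is 0

print(c(paired = unname(paired_fit$statistic),
        diffs  = unname(diff_fit$statistic)))
```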

7.4.3 When there are more than two groups

Use analysis of variance (ANOVA); see Ch9

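7.5 Nonparametric tests of group differences

If the outcome variables are severely skewed or ordinal in nature, nonparametric tests can be used

7.5.1 Comparing two groups

1. If the two groups are independent, the Wilcoxon rank sum test (also known as the Mann-Whitney U test) assesses whether the observations were sampled from the same probability distribution

Call format 1: y is a numeric variable and x is a dichotomous variable

wilcox.test(y ~ x, data)

Call format 2: y1 and y2 are the outcome variables for each group

wilcox.test(y1, y2)

The default is a two-tailed test

#Wilcoxon rank sum test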
> with(UScrime, by(Prob, So, median))
So: 0
[1] 0.038201
---------------------------------------------------------------------------------- 
So: 1
[1] 0.055552
> wilcox.test(Prob ~ So, data=UScrime)

	Wilcoxon rank sum exact test

data:  Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0

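2. The Wilcoxon signed rank test is a nonparametric alternative to the dependent t-test. It is appropriate when the groups are paired and the assumption of normality is unwarranted

#Wilcoxon signed rank test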
> sapply(UScrime[c("U1","U2")], median)
U1 U2 
92 34 
> with(UScrime, wilcox.test(U1, U2, paired=TRUE))

	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0

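7.5.2 Comparing more than two groups

If the groups are independent, use the Kruskal-Wallis test (kruskal.test())
If the groups are dependent, use the Friedman test (friedman.test())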
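
A minimal sketch of the Kruskal-Wallis case, comparing illiteracy rates across the four US regions with the built-in state.x77 and state.region data. The choice of variables and the object name kw are assumptions made for illustration.

```r
# Combine the region factor with the state statistics
states <- data.frame(state.region, state.x77)

# Are illiteracy rates distributed the same way in all four regions?
kw <- kruskal.test(Illiteracy ~ state.region, data = states)
print(kw)
```

A small p-value indicates that at least one region differs from the others; the test itself does not say which ones.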
