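R provides functions for computing the mean, length, standard deviation, minimum, maximum, variance, and similar statistics for a single variable, for several variables, or for subsets of observations.
1 The tapply function
First, load the example data: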
> Veg <- read.table(file = "Vegetation2.txt", header = TRUE)
> names(Veg)
 [1] "TransectName" "Samples"      "Transect"     "Time"
 [5] "R"            "ROCK"         "LITTER"       "ML"
 [9] "BARESOIL"     "FallPrec"     "SprPrec"      "SumPrec"
[13] "WinPrec"      "FallTmax"     "SprTmax"      "SumTmax"
[17] "WinTmax"      "FallTmin"     "SprTmin"      "SumTmin"
[21] "WinTmin"      "PCTSAND"      "PCTSILT"      "PCTOrgC"
> str(Veg)
'data.frame':	58 obs. of  24 variables:
 $ TransectName: Factor w/ 58 levels "A_22_02","A_22_58",..: 2 3 4 5 6 7 1 9 10 11 ...
 $ Samples     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Transect    : int  1 1 1 1 1 1 1 2 2 2 ...
 $ Time        : int  1958 1962 1967 1974 1981 1994 2002 1958 1962 1967 ...
 $ R           : int  8 6 8 8 10 7 6 5 8 6 ...
 $ ROCK        : num  27 26 30 18 23 26 39 25 24 21 ...
 $ LITTER      : num  30 20 24 35 22 26 19 26 24 16 ...
 $ ML          : int  0 0 0 0 4 0 4 0 2 1 ...
 $ BARESOIL    : num  26 28 30 16 9 23 19 33 29 41 ...
 $ FallPrec    : num  30.2 99.6 43.4 54.9 24.4 ...
 $ SprPrec     : num  75.4 56.1 65 58.7 87.6 ...
 $ SumPrec     : num  125.5 95 112.3 70.3 81.8 ...
 $ WinPrec     : num  39.6 107.4 76.7 90.7 46 ...
 $ FallTmax    : num  17 14.6 18.4 17.2 18.5 ...
 $ SprTmax     : num  15.8 15.2 12.8 14 14.3 ...
 $ SumTmax     : num  25.2 24.9 25.5 26.7 26 ...
 $ WinTmax     : num  3.47 1.16 3.09 2.46 5.72 ...
 $ FallTmin    : num  0.49 -0.18 1.23 1.43 1.09 ...
 $ SprTmin     : num  0.36 0.18 -1.86 -0.53 0.75 ...
 $ SumTmin     : num  6.97 6.4 7.12 7.2 6.9 ...
 $ WinTmin     : num  -8.54 -10.76 -8.5 -8.28 -7.56 ...
 $ PCTSAND     : int  24 24 24 24 24 24 24 20 20 20 ...
 $ PCTSILT     : int  30 30 30 30 30 30 30 34 34 34 ...
 $ PCTOrgC     : num  0.0346 0.0346 0.0346 0.0346 0.0346 ...
1.1 Computing the mean for each transect
> m <- mean(Veg$R)
> m1 <- mean(Veg$R[Veg$Transect == 1])
> m2 <- mean(Veg$R[Veg$Transect == 2])
> m3 <- mean(Veg$R[Veg$Transect == 3])
> m4 <- mean(Veg$R[Veg$Transect == 4])
> m5 <- mean(Veg$R[Veg$Transect == 5])
> m6 <- mean(Veg$R[Veg$Transect == 6])
> m7 <- mean(Veg$R[Veg$Transect == 7])
> m8 <- mean(Veg$R[Veg$Transect == 8])
> c(m, m1, m2, m3, m4, m5, m6, m7, m8)
[1]  9.965517  7.571429  6.142857 10.375000  9.250000 12.375000 11.500000
[8] 10.500000 11.833333
The variable m is the mean richness over all observations, and m1 to m8 are the mean richness values for transects 1 to 8. The mean command is applied here to the vector Veg$R, which is not a matrix, so no comma is needed inside the square brackets.
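The remark about the comma can be checked on a small made-up data frame. This is only a sketch: the name toy and its values are hypothetical and are not taken from the Vegetation data.

```r
# Hypothetical toy data, not the Vegetation data set.
toy <- data.frame(Transect = c(1, 1, 2, 2), R = c(8, 6, 5, 9))

# A vector has one dimension, so the condition is the only index (no comma).
from_vector <- toy$R[toy$Transect == 1]

# A data frame has rows and columns, so a comma separates the two indices.
from_frame <- toy[toy$Transect == 1, "R"]

identical(from_vector, from_frame)
mean(from_vector)
```

Both forms select the same values; the comma is needed only when the object being indexed has two dimensions.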
1.2 Computing the mean for each transect more efficiently
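> tapply(Veg$R, Veg$Transect, mean)
        1         2         3         4         5         6         7
 7.571429  6.142857 10.375000  9.250000 12.375000 11.500000 10.500000
        8
11.833333
The tapply function computes the mean of the first variable (R) for each level of the second variable (Transect). The command can also be written with named arguments: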
> tapply(X = Veg$R, INDEX = Veg$Transect, FUN = mean)
        1         2         3         4         5         6         7
 7.571429  6.142857 10.375000  9.250000 12.375000 11.500000 10.500000
        8
11.833333
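To see that tapply reproduces the one-subset-at-a-time approach, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical; they only stand in for Veg so the example can run without Vegetation2.txt.

```r
# Hypothetical toy data with three transects of unequal size.
toy <- data.frame(
  Transect = c(1, 1, 1, 2, 2, 3, 3, 3),
  R        = c(8, 6, 10, 5, 7, 12, 9, 15)
)

# One mean per transect, computed by subsetting by hand.
manual <- c(
  mean(toy$R[toy$Transect == 1]),
  mean(toy$R[toy$Transect == 2]),
  mean(toy$R[toy$Transect == 3])
)

# The same means in a single call; the result is named by transect.
grouped <- tapply(X = toy$R, INDEX = toy$Transect, FUN = mean)

grouped
all.equal(as.vector(grouped), manual)
```

The tapply version does not need to know in advance how many transects there are, which is the main practical gain over writing one line per group.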
Besides the mean, tapply can also compute the standard deviation (sd), the variance (var), the length (length), and so on:
> Me <- tapply(Veg$R, Veg$Transect, mean)
> Sd <- tapply(Veg$R, Veg$Transect, sd)
> Le <- tapply(Veg$R, Veg$Transect, length)
> cbind(Me, Sd, Le)
         Me        Sd Le
1  7.571429 1.3972763  7
2  6.142857 0.8997354  7
3 10.375000 3.5831949  8
4  9.250000 2.3145502  8
5 12.375000 2.1339099  8
6 11.500000 2.2677868  8
7 10.500000 3.1464265  6
8 11.833333 2.7141604  6
2 The sapply and lapply functions
> sapply(Veg[, 5:9], FUN = mean)
        R      ROCK    LITTER        ML  BARESOIL
 9.965517 20.991379 22.853448  1.086207 17.594828
> lapply(Veg[, 5:9], FUN = mean)
$R
[1] 9.965517

$ROCK
[1] 20.99138

$LITTER
[1] 22.85345

$ML
[1] 1.086207

$BARESOIL
[1] 17.59483

Here sapply returns a named vector, whereas lapply returns a list.
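The tapply function computes the mean of subsets of observations of one variable, whereas sapply and lapply compute the mean of all observations for one or more variables.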
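The difference in return type can be checked directly. The sketch below uses a made-up two-column data frame; the name toy and its values are hypothetical and not taken from the Vegetation data.

```r
# Hypothetical toy data with two numeric variables.
toy <- data.frame(R = c(8, 6, 10, 5), ROCK = c(27, 26, 30, 18))

# sapply simplifies the per-column results to a named numeric vector.
as_vector <- sapply(toy, FUN = mean)

# lapply always keeps the per-column results in a list.
as_list <- lapply(toy, FUN = mean)

is.numeric(as_vector)
is.list(as_list)

# The same value, extracted from each kind of result.
as_vector[["ROCK"]]
as_list$ROCK
```

A vector is usually more convenient for further arithmetic or for cbind; a list is the safer choice when the function may return results of different lengths for different columns.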
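In addition, to get one result per variable, the data passed to sapply and lapply should be a data frame (or a list). The command below returns one long vector instead, because cbind returns a matrix rather than a data frame, so mean is applied to each element of the matrix separately: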
> sapply(cbind(Veg$R, Veg$LITTER, Veg$ROCK, Veg$ML, Veg$BARESOIL), FUN = mean)
  [1]  8.0  6.0  8.0  8.0 10.0  7.0  6.0  5.0  8.0  6.0  6.0  6.0  6.0  6.0
 [15]  7.0 10.0  8.0 18.0 12.0 11.0  7.0 10.0  8.0  9.0  6.0 12.0 13.0 10.0
 [29]  8.0  8.0 13.0 16.0  9.0 14.0 11.0 13.0 11.0 12.0  9.0 10.0 14.0 14.0
 [43] 10.0 14.0  9.0 12.0 11.0 12.0 14.0  9.0  5.0 12.0  9.0 10.0 16.0 12.0
 [57] 10.0 14.0 30.0 20.0 24.0 35.0 22.0 26.0 19.0 26.0 24.0 16.0 25.0 28.0
 [71] 41.5 18.0 17.0  7.0 14.0 15.0 37.0 17.0 14.0 19.0 10.0  5.0  9.0 12.0
 [85] 24.0 10.0 18.0  9.0 23.0 21.0 51.0 34.0 28.0 30.0 32.0 29.0 32.0 20.0
 [99] 29.0 19.0 23.0 32.0 22.5 28.0 26.0 29.0 23.0 40.0 14.5 21.0 24.0 15.0
Converting the matrix to a data frame first makes the command work as intended:
> sapply(data.frame(cbind(Veg$R, Veg$LITTER, Veg$ROCK, Veg$ML, Veg$BARESOIL)), FUN = mean)
       X1        X2        X3        X4        X5
 9.965517 22.853448 20.991379  1.086207 17.594828
3 The summary function
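The summary function provides basic information about variables. Its argument can be a single variable, the output of cbind, or a data frame.
For example: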
> Z <- cbind(Veg$R, Veg$ROCK, Veg$LITTER)
> colnames(Z) <- c("R", "ROCK", "LITTER")
> summary(Z)
       R               ROCK           LITTER
 Min.   : 5.000   Min.   : 0.00   Min.   : 5.00
 1st Qu.: 8.000   1st Qu.: 7.25   1st Qu.:17.00
 Median :10.000   Median :18.50   Median :23.00
 Mean   : 9.966   Mean   :20.99   Mean   :22.85
 3rd Qu.:12.000   3rd Qu.:27.00   3rd Qu.:28.75
 Max.   :18.000   Max.   :59.00   Max.   :51.00
The summary command reports the minimum, first quartile, median, mean, third quartile, and maximum of each variable.
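The six numbers reported by summary can also be computed one at a time, which makes clear what each entry means. This is a sketch on a made-up vector; the name x and its values are hypothetical.

```r
# Hypothetical toy vector.
x <- c(5, 8, 10, 12, 18)

# All six statistics in one call.
s <- summary(x)
s

# The same six statistics computed individually, in the same order.
by_hand <- c(
  min(x),
  quantile(x, 0.25),
  median(x),
  mean(x),
  quantile(x, 0.75),
  max(x)
)

all.equal(as.numeric(s), unname(by_hand))
```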
The following two commands give the same result as summary(Z):
> summary(Veg[, c("R", "ROCK", "LITTER")])
       R               ROCK           LITTER
 Min.   : 5.000   Min.   : 0.00   Min.   : 5.00
 1st Qu.: 8.000   1st Qu.: 7.25   1st Qu.:17.00
 Median :10.000   Median :18.50   Median :23.00
 Mean   : 9.966   Mean   :20.99   Mean   :22.85
 3rd Qu.:12.000   3rd Qu.:27.00   3rd Qu.:28.75
 Max.   :18.000   Max.   :59.00   Max.   :51.00
> summary(Veg[, c(5, 6, 7)])
       R               ROCK           LITTER
 Min.   : 5.000   Min.   : 0.00   Min.   : 5.00
 1st Qu.: 8.000   1st Qu.: 7.25   1st Qu.:17.00
 Median :10.000   Median :18.50   Median :23.00
 Mean   : 9.966   Mean   :20.99   Mean   :22.85
 3rd Qu.:12.000   3rd Qu.:27.00   3rd Qu.:28.75
 Max.   :18.000   Max.   :59.00   Max.   :51.00
4 The table function
Load the data:
> Deer <- read.table(file = "Deer.txt", header = TRUE, fill = TRUE)
> names(Deer)
[1] "Farm"    "Month"   "Year"    "Sex"     "clas1_4" "LCT"     "KFI"
[8] "Ecervi"  "Tb"
Note the fill = TRUE argument. Without it, read.table stops with an error:
> Deer <- read.table(file = "Deer.txt", header = TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 657 did not have 9 elements
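This error occurs when a row has fewer fields than the header. The behaviour can be reproduced without Deer.txt; the sketch below feeds read.table a made-up ragged table through its text argument. The variable names and the data are hypothetical.

```r
# Hypothetical ragged input: the second data row has only two fields.
txt <- "a b c
1 2 3
4 5
6 7 8"

# Without fill = TRUE, read.table stops with an error; try() captures it.
strict <- try(read.table(text = txt, header = TRUE), silent = TRUE)
inherits(strict, "try-error")

# With fill = TRUE, short rows are padded with NA on the right.
padded <- read.table(text = txt, header = TRUE, fill = TRUE)
padded

# Rows that contain at least one NA, including the padded one.
which(!complete.cases(padded))
```

Because fill = TRUE pads silently, it may be worth inspecting the rows with missing values afterwards to confirm that the padding landed in the columns one expects.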
> str(Deer)
'data.frame':	1182 obs. of  9 variables:
 $ Farm   : Factor w/ 28 levels "R\xd1\t02","R\xd1\t12",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Month  : int  10 10 10 10 10 10 10 10 10 10 ...
 $ Year   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Sex    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ clas1_4: num  4 4 3 4 4 4 4 4 4 4 ...
 $ LCT    : num  191 180 192 196 204 190 196 200 197 208 ...
 $ KFI    : num  20.4 16.4 15.9 17.3 NA ...
 $ Ecervi : num  0 0 2.38 0 0 0 1.21 0 0.8 0 ...
 $ Tb     : int  0 0 0 0 NA 0 NA 1 0 0 ...
> table(Deer$Farm)

R\xd1\t02 R\xd1\t12        AL        AU        BA        BE        CB
       10        15        15        37        98        19        93
      CRC        HB       LCV        LN       MAN        MB        MO
       16        35         2        34        76        41       278
       NC        NV        PA        PN        QM        RF        RO
       32        35        11        45        75        34        44
      SAL       SAU        SE        TI        TN      VISO        VY
        1         3        26        21        31        15        40
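The reason for the two odd levels R\xd1\t02 and R\xd1\t12 is not clear; \xd1 is the Latin-1 byte for Ñ, so an encoding mismatch when reading the file is one possible cause. The table shows that the number of observations per farm varies widely: AL has 15, for example, while BA has 98.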
> table(Deer$Sex, Deer$Year)

      0   1   2   3   4   5  99
  1 100  88 157  72  78  34  21
  2  76  41 198 116  60  35   0
  3   0   9   1   0   0   0   0
  4   0   5   2   0   0   0   0
5 Summary
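Function  Purpose                                                 Example
tapply    Applies FUN to y for each level of x                    tapply(y, x, FUN = mean)
sapply    Applies FUN to each variable in y; returns a vector     sapply(y, FUN = mean)
lapply    Applies FUN to each variable in y; returns a list       lapply(y, FUN = mean)
sd        Computes the standard deviation of y                    sd(y)
length    Returns the length of y                                 length(y)
summary   Computes basic summary statistics of y                  summary(y)
table     Builds a contingency table of x and y                   table(x, y)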
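The calls in this summary can be tried together on a small made-up data frame. This is only a sketch: the name toy, its columns, and its values are hypothetical, chosen so that the results are easy to check by hand.

```r
# Hypothetical toy data: two grouping variables (x, g) and two measurements.
toy <- data.frame(
  x = c("a", "a", "b", "b", "b"),
  g = c(1, 2, 1, 1, 2),
  y = c(2, 4, 6, 8, 10),
  z = c(1, 1, 2, 3, 5)
)

# Mean of y for each level of x.
group_means <- tapply(toy$y, toy$x, FUN = mean)

# Mean of every measurement column, as a vector and as a list.
col_means_vec  <- sapply(toy[, c("y", "z")], FUN = mean)
col_means_list <- lapply(toy[, c("y", "z")], FUN = mean)

# Single-variable statistics.
y_sd  <- sd(toy$y)
y_len <- length(toy$y)
summary(toy$y)

# Contingency table: counts for each combination of x and g.
counts <- table(toy$x, toy$g)

group_means
col_means_vec
counts
```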