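Summary: notes recorded while learning R.
Overview:
Functions for summarising data: the apply family (apply(), lapply(), sapply(), tapply(), mapply()); ave(), by(), aggregate(), and sweep()
Data preprocessing packages: plyr, dplyr, data.table
Main text:
Data summary statistics: the apply family
■ apply() function: works on matrices, arrays, and data frames. Arguments: X is the object (for example a matrix); MARGIN = 1 operates on rows and MARGIN = 2 on columns; FUN is the function to apply, such as sum or mean.
◆ Example: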
> mat <- matrix(1:24,nrow = 4,ncol = 6)
> apply(mat,1,sum)
[1] 66 72 78 84
■ lapply() function: besides summarising, it iterates over every element of its input, and it returns a list
◆ Example: lapply(X = c(1:5), FUN = log) # the return value here is a list
lapply(iris[, 1:3], function(x) lm(x ~ iris$Petal.Width))
■ sapply() function: returns a vector or a matrix when the results can be simplified, otherwise a list
◆ Example:
> sapply(1:5,log)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
> sapply(1:5,function(x) x+3)
[1] 4 5 6 7 8
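The difference between the two return types can be checked directly. The following is a minimal sketch; the names res_list, res_vec and res_mat are only illustrative.

```r
# lapply() always returns a list, one element per input value
res_list <- lapply(1:3, function(x) x^2)
str(res_list)

# sapply() simplifies results of length 1 into a plain vector
res_vec <- sapply(1:3, function(x) x^2)
res_vec

# When every result has the same length > 1, sapply() returns a matrix
# with one column per input value
res_mat <- sapply(1:3, function(x) c(value = x, square = x^2))
res_mat["square", ]
```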
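■ tapply() function: mainly used with data frames. It splits one numeric variable by a categorical variable and summarises each group. tapply() summarises only one variable at a time; to summarise several at once, dcast() can be used instead.
◆ Example: tapply(iris$Petal.Width, INDEX = iris$Species, FUN = mean)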
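As a small sketch of the splitting behaviour, the code below uses the built-in iris and mtcars data sets. The choice of columns is an assumption made for illustration and is not taken from the notes.

```r
# Mean sepal length for each species: one value per factor level
means <- tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = mean)
means

# INDEX may also be a list of factors; the result then has one
# dimension per factor (here 3 cylinder levels x 2 transmission levels)
counts <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), length)
dim(counts)
```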
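■ mapply() function: applies a function element by element over several vectors, so that a scalar if statement can be used in a vectorized way
◆ Example:
myfun <- function(x, y) {
  if (x > 4) return(y)
  else return(x + y)
}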
mapply(myfun,1:5,2:6)
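For a condition this simple, the same result can probably be obtained with the vectorized ifelse(). The comparison below is a sketch, with res_mapply and res_ifelse as illustrative names.

```r
# Scalar function: 'if' only looks at a single value of x
myfun <- function(x, y) {
  if (x > 4) y else x + y
}

# mapply() calls myfun once per pair (x[i], y[i])
res_mapply <- mapply(myfun, 1:5, 2:6)

# ifelse() evaluates the condition on whole vectors at once
res_ifelse <- ifelse(1:5 > 4, 2:6, 1:5 + 2:6)

res_mapply  # 3 5 7 9 6
```

mapply() remains the more general tool when the function body cannot be written as a single vectorized expression.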
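Data summary functions
■ ave() function: computes a statistic within each group (the mean by default) and returns a vector as long as the input
◆ Example: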
survival <- data.frame(id =1:10,cancer=sample(c('lung','liver','colon'),
10,replace = TRUE),
treatment = sample(c('Surg','Chemo'),10,replace = TRUE),
sur_days = sample(100:1000,10))
survival
ave(survival$sur_days,survival$cancer)
A custom FUN can also be passed to ave(), for example ave(survival$sur_days, survival$cancer, FUN = sd)
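Because the survival data above is sampled at random, its output changes on every run. A deterministic sketch with a small hypothetical data frame (df, group and value are made-up names) shows what ave() returns:

```r
# Hypothetical toy data with fixed values
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 3, 10, 20, 30))

# ave() repeats each group's mean for every row of that group
df$group_mean <- ave(df$value, df$group)   # 2 2 20 20 20

# A typical use: center each value on its own group mean
df$centered <- df$value - df$group_mean
df
```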
■ by() function: can summarise by several grouping variables
◆ Example:
> by(data = survival$sur_days,INDICES = survival$cancer,FUN =mean)
survival$cancer: colon
[1] 768
-----------------------------------------------------------------------------------------------------------------
survival$cancer: liver
[1] 387.6667
-----------------------------------------------------------------------------------------------------------------
survival$cancer: lung
[1] 772.5
◆ Example 2:
> by(data = survival$sur_days,INDICES = list(survival$cancer,survival$treatment)
+ ,FUN = mean)
: colon
: Chemo
[1] 931
-----------------------------------------------------------------------------------------------------------------
: liver
: Chemo
[1] 509.5
-----------------------------------------------------------------------------------------------------------------
: lung
: Chemo
[1] NA
-----------------------------------------------------------------------------------------------------------------
: colon
: Surg
[1] 605
-----------------------------------------------------------------------------------------------------------------
: liver
: Surg
[1] 326.75
-----------------------------------------------------------------------------------------------------------------
: lung
: Surg
[1] 772.5
◆ Example 3: a custom function
by(mtcars,mtcars$cyl,function(x)lm(mpg~disp+hp,data = x))
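■ aggregate() function: very powerful; it summarises a data frame by groups, and the grouping variables can be derived from the existing columns
◆ Example 1: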
data(mtcars)
View(mtcars)
aggregate(x=mtcars,by =list(VS = mtcars$vs==1,high = mtcars$mpg >22),mean)
VS high mpg cyl disp hp drat
1 FALSE FALSE 16.06471 7.647059 318.1412 195.5294 3.331176
2 TRUE FALSE 19.90000 5.333333 176.5500 111.1667 3.581667
3 FALSE TRUE 26.00000 4.000000 120.3000 91.0000 4.430000
4 TRUE TRUE 28.05000 4.000000 99.3875 76.5000 4.067500
wt qsec vs am gear carb
1 3.779647 16.69353 0 0.2941176 3.470588 3.705882
2 3.133333 19.24500 1 0.1666667 3.500000 2.166667
3 2.140000 16.70000 0 1.0000000 5.000000 2.000000
4 2.219750 19.40000 1 0.7500000 4.125000 1.500000
◆ Example 2: this achieves the same result as dcast()
aggregate(. ~ Species, data = iris, mean) # means of every iris column except Species, by species
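The formula interface also accepts several grouping variables and several response variables. A short sketch on mtcars follows; the column choice is only an illustration.

```r
# cbind() on the left-hand side summarises several variables at once;
# the right-hand side lists the grouping variables
agg <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
agg   # one row per (cyl, am) combination present in the data
```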
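■ sweep() function: mainly for arrays. Arguments: STATS is the summary statistic to sweep out; FUN defaults to subtraction, and any other operation has to be passed explicitly
◆ Example: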
> my_array <- array(1:24,dim= c(2,4,2))
> my_array
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
, , 2
[,1] [,2] [,3] [,4]
[1,] 9 11 13 15
[2,] 10 12 14 16
> sweep(x = my_array,MARGIN = 1,STATS = 1,FUN = '+')
, , 1
[,1] [,2] [,3] [,4]
[1,] 2 4 6 8
[2,] 3 5 7 9
, , 2
[,1] [,2] [,3] [,4]
[1,] 10 12 14 16
[2,] 11 13 15 17
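A common use of sweep() is to pass a statistic computed per row or per column as STATS. The sketch below assumes a small matrix m made up for the illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# Center every column: subtract each column's mean (FUN defaults to "-")
centered <- sweep(m, MARGIN = 2, STATS = colMeans(m))
colMeans(centered)           # all zero after centering

# Divide every row by its own total so that each row sums to 1
props <- sweep(m, MARGIN = 1, STATS = rowSums(m), FUN = "/")
rowSums(props)
```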
Data preprocessing packages: the plyr package
Input \ Output | Array | Data frame | List  | Discarded
Array          | aaply | adply      | alply | a_ply
Data frame     | daply | ddply      | dlply | d_ply
List           | laply | ldply      | llply | l_ply
■ aaply() function
◆ Example:
> library(plyr)
> my_matrix <- matrix(1:24,nrow = 3,ncol = 8)
> aaply(.data = my_matrix,.margins = 2,.fun = mean)
1 2 3 4 5 6 7 8
2 5 8 11 14 17 20 23 # returns an array
> apply(my_matrix,2,mean)
[1] 2 5 8 11 14 17 20 23 # returns a vector
■ adply() function
◆ Example:
> adply(my_matrix,.margins = 2,.fun = mean)
X1 V1 # returns a data frame
1 1 2
2 2 5
3 3 8
4 4 11
5 5 14
6 6 17
7 7 20
8 8 23
■ laply() function:
◆ Example:
> my_list <- list(1:10,2:8,rep(c(T,F),times =5))
> laply(my_list,.fun = mean)
[1] 5.5 5.0 0.5
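■ ddply() function: can compute several summaries of one object at once (such as mean and sd), and can split a data frame by several categorical variables.
◆ Example 1: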
> my_df <- data.frame(name = c('a','b','c','d','e'),
+ height = c(178,176,175,167,190),
+ gender = c('M','F','F','M','M'))
> ddply(.data = my_df,.variables = .(gender),summarise,mean_h = mean(height))
gender mean_h
1 F 175.5000
2 M 178.3333
> ddply(.data = my_df,.variables = .(gender),summarise,mean_h = mean(height),sd_h = sd(height) )
gender mean_h sd_h
1 F 175.5000 0.7071068
2 M 178.3333 11.5036226
> tapply(my_df$height,my_df$gender,mean)
F M
175.5000 178.3333
◆ Example 2:
my_df <- data.frame(name = c('a','b','c','d','e'),
height = c(178,176,175,167,190),
gender = c('M','F','F','M','M'),
age = c('old','young','old','old','young'))
> ddply(.data = my_df,.variables = .(gender,age),.fun = summarise,mean_h = mean(height))
gender age mean_h
1 F old 175.0
2 F young 176.0
3 M old 172.5
4 M young 190.0
◆ Example 3: a custom function; two ways to write several grouping variables
library(reshape2)
View(tips)
tips
> ddply(tips, .(sex,smoker), function(x) sum(x$tip)/sum(x$total_bill))
sex smoker V1
1 Female No 0.1531892
2 Female Yes 0.1630623
3 Male No 0.1573122
4 Male Yes 0.1369188
> ddply(tips, ~ sex + smoker, function(x) sum(x$tip)/sum(x$total_bill))
sex smoker V1
1 Female No 0.1531892
2 Female Yes 0.1630623
3 Male No 0.1573122
4 Male Yes 0.1369188
◆ Three ways to compute the means
ddply(iris,~Species,colwise(mean,c('Sepal.Length','Sepal.Width')))
ddply(iris,~Species,colwise(mean,.(Sepal.Length,Sepal.Width)))
ddply(iris,~Species,colwise(mean, ~ Sepal.Length + Sepal.Width)) # the recommended form: it is the most concise and the least error-prone
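■ dlply() function: useful for regression analysis, fitting one model per group and returning the models as a list
◆ Example:
my_model <- function(x) lm(Sepal.Length ~ Sepal.Width, data = x)
dlply(iris, ~Species, my_model)
■ each() function: applies several functions to the same object in one call
◆ Example: each(mean, sd, median)(iris$Sepal.Length) # computes the mean, sd and median in one call
■ colwise() function: applies a function to each column
◆ Example: colwise(mean)(iris)
■ numcolwise() function: applies a function to the numeric columns only
◆ Example: numcolwise(mean)(iris)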
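For comparison, a rough base R counterpart of combining each() with numcolwise() can be sketched with sapply(). This is an assumption about an equivalent result, not plyr's own implementation.

```r
# Keep only the numeric columns (what numcolwise() does implicitly)
num_cols <- iris[sapply(iris, is.numeric)]

# Apply several summary functions to every column in one pass
stats <- sapply(num_cols, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
})
stats   # rows: mean, sd, median; columns: the four numeric variables
```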
Data preprocessing packages: the dplyr package
■ filter() function: selects rows that satisfy the given conditions
◆ Example:
> library(dplyr)
> sub1 <- filter(tips, smoker == 'No', day == 'Sun')
> head(sub1)
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.50 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
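■ slice() function: works on row positions rather than on variables. Arguments: the data set and the row numbers to keep
◆ Example: sub2 <- slice(tips,1:5)
■ select() function: mainly operates on columns. Arguments: the data set and the names of the columns to keep
◆ Example:
sub3 <- select(tips, tip, sex, smoker) # keeps the tip, sex and smoker columns of tips
sub4 <- select(tips, tip:time) # keeps the columns from tip through time
sub5 <- select(tips, 2:5) # keeps columns 2 to 5 of tips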
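■ arrange() function: sorting. Arguments: the data set and the columns to sort by. The order is ascending by default; wrap a column in desc() to sort it in descending order.
◆ Example: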
> new_tips <- arrange(tips,total_bill,tip)
> head(new_tips)
total_bill tip sex smoker day time size
1 3.07 1.00 Female Yes Sat Dinner 1
2 5.75 1.00 Female Yes Fri Dinner 2
3 7.25 1.00 Female No Sat Dinner 1
4 7.25 5.15 Male Yes Sun Dinner 2
5 7.51 2.00 Male No Thur Lunch 2
6 7.56 1.44 Male No Thur Lunch 2
> new_tips2 <- arrange(tips,desc(total_bill),tip)
> head(new_tips2)
total_bill tip sex smoker day time size
1 50.81 10.00 Male Yes Sat Dinner 3
2 48.33 9.00 Male No Sat Dinner 4
3 48.27 6.73 Male No Sat Dinner 4
4 48.17 5.00 Male No Sun Dinner 6
5 45.35 3.50 Male Yes Sun Dinner 3
6 44.30 2.50 Female Yes Sat Dinner 3
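■ rename() function: renames columns. Arguments: the data set, then new name = old name, new name = old name, ...
◆ Example: new_tips <- rename(tips, bill = total_bill, tipp = tip)
■ distinct() function: returns the unique values of the selected columns (comparable to the levels of a factor); several columns can be given at once
◆ Example: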
> distinct(tips,sex)
sex
1 Female
2 Male
> levels(tips$sex)
[1] "Female" "Male"
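■ mutate() function: creates new variables. Its distinctive feature is that a variable created earlier in the same call can be used to define a later one. transform() cannot do this: new_rate could not refer to rate in a single transform() call.
◆ Example: new_tips3 <- mutate(tips, rate = tip/total_bill, new_rate = rate*100)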
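The contrast with transform() can be sketched on a small hypothetical data frame (bills is a made-up name, so the example does not depend on the tips data). With transform(), one workaround is to nest two calls.

```r
library(dplyr)

# Hypothetical toy data with the same column names as tips
bills <- data.frame(total_bill = c(10, 20), tip = c(1, 3))

# mutate() evaluates its arguments in order, so new_rate can use rate
res_mutate <- mutate(bills, rate = tip / total_bill, new_rate = rate * 100)

# transform() evaluates every argument against the original columns,
# so rate has to exist before new_rate is computed: nest two calls
res_transform <- transform(transform(bills, rate = tip / total_bill),
                           new_rate = rate * 100)

res_mutate$new_rate      # 10 15
res_transform$new_rate   # 10 15
```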
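■ Random sampling functions: sample_n() and sample_frac() draw random rows from a data frame.
◆ Arguments of sample_n(): the data object and the number of rows to draw; no separate step for generating random indices is needed
● Example: sample_n(iris, size = 10) # draws 10 random rows from iris
◆ Arguments of sample_frac(): the data object and the fraction of rows to draw
● Example: sample_frac(iris, 0.1) # draws a random 10% of the rows of iris
■ Grouping function: group_by(), usually combined with summarise()
◆ Example: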
> group <- group_by(tips,smoker)
> summarise(group,count = n(),mean_tips = mean(tip),sd_bill = sd(total_bill))
# A tibble: 2 x 4
smoker count mean_tips sd_bill
1 No 151 2.99 8.26
2 Yes 93 3.01 9.83
■ Pipe operator: %>% passes the result of one function on to the next, which shortens the code
◆ Example:
> result <- tips %>% group_by(smoker,sex) %>% summarise(count = n(),mean_tips = mean(tip),sd_bill = sd(total_bill))
> result
# A tibble: 4 x 5
# Groups: smoker [?]
smoker sex count mean_tips sd_bill
1 No Female 54 2.77 7.29
2 No Male 97 3.11 8.73
3 Yes Female 33 2.93 9.19
4 Yes Male 60 3.05 9.91
■ join() family: used to merge data frames
◆ inner_join() function: merges two data frames, keeping only the keys that appear in both
● Example:
> df_a <- data.frame(x = c('a','b','c','a','c','b','c'),y =1:7)
> df_a
x y
1 a 1
2 b 2
3 c 3
4 a 4
5 c 5
6 b 6
7 c 7
> df_b <- data.frame(x = c('a','b','a'),z = 10:12)
> df_b
x z
1 a 10
2 b 11
3 a 12
> inner_join(df_a,df_b,by = 'x')
x y z
1 a 1 10
2 a 1 12
3 b 2 11
4 a 4 10
5 a 4 12
6 b 6 11
◆ semi_join() function: returns the rows of the first data frame whose key also appears in the second data frame
● Example:
> semi_join(df_a,df_b,by = 'x')
x y
1 a 1
2 b 2
3 a 4
4 b 6
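◆ anti_join() function: the opposite of semi_join(); returns the rows of the first data frame whose key does not appear in the second, such as the c rows in the example below
● Example: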
> anti_join(df_a,df_b,by = 'x')
x y
1 c 3
2 c 5
3 c 7
◆ left_join() function: keeps every row of the first data frame; columns from the second are NA where there is no match
● Example:
> left_join(df_a,df_b,by = 'x')
x y z
1 a 1 10
2 a 1 12
3 b 2 11
4 c 3 NA
5 a 4 10
6 a 4 12
7 c 5 NA
8 b 6 11
9 c 7 NA
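◆ right_join() function: driven by the second data frame. It keeps every row of b and returns all three columns; where a has no matching observation, the columns from a are filled with NA.
● Example: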
> right_join(df_a,df_b,by = 'x')
x y z
1 a 1 10
2 a 4 10
3 b 2 11
4 b 6 11
5 a 1 12
6 a 4 12
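The row counts of the five joins above can be cross-checked in base R. The sketch below uses merge() and %in% as stand-ins for the dplyr functions; merge() sorts the result by key, so the row order may differ from the dplyr output.

```r
df_a <- data.frame(x = c('a', 'b', 'c', 'a', 'c', 'b', 'c'), y = 1:7)
df_b <- data.frame(x = c('a', 'b', 'a'), z = 10:12)

inner <- merge(df_a, df_b, by = "x")                 # like inner_join()
left  <- merge(df_a, df_b, by = "x", all.x = TRUE)   # like left_join()
right <- merge(df_a, df_b, by = "x", all.y = TRUE)   # like right_join()
semi  <- df_a[df_a$x %in% df_b$x, ]                  # like semi_join()
anti  <- df_a[!(df_a$x %in% df_b$x), ]               # like anti_join()

# Number of rows returned by each kind of join
sapply(list(inner = inner, left = left, right = right,
            semi = semi, anti = anti), nrow)
```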
Data preprocessing packages: the data.table package
■ Example:
> library(data.table)
> dt <- data.table(v1=c(1,2),v2 = LETTERS[1:3],v3= rnorm(12,2,2),v4 = sample(1:20,12))
> dt
v1 v2 v3 v4
1: 1 A 1.0590425 13
2: 2 B 0.4570105 14
3: 1 C -0.2072561 11
4: 2 A -0.4769700 17
5: 1 B 0.9041292 15
6: 2 C 1.2926636 19
7: 1 A 1.9194111 20
8: 2 B 4.3148126 18
9: 1 C 0.5666122 12
10: 2 A 2.3982736 3
11: 1 B 2.6283114 2
12: 2 C 3.1367615 7
> dt[3:6,] # select rows
v1 v2 v3 v4
1: 1 C -0.2072561 11
2: 2 A -0.4769700 17
3: 1 B 0.9041292 15
4: 2 C 1.2926636 19
> dt[v2=='B']
v1 v2 v3 v4
1: 2 B 0.4570105 14
2: 1 B 0.9041292 15
3: 2 B 4.3148126 18
4: 1 B 2.6283114 2
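◆ Note 1: rnorm() generates normally distributed data; rnorm(12,3,2) draws 12 values with mean 3 and standard deviation 2
◆ Note 2: round() rounds numbers
◆ Note 3: the columns may be given with different lengths; shorter ones are recycled to fill the table
◆ Note 4: elements can be extracted with [ ]
■ %in%: tests whether v2 is one of the values in c('A','B'); the rows where it is (those with A or B) are returned
◆ Example: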
> dt[v2 %in% c('A','B')]
v1 v2 v3 v4
1: 1 A 1.0590425 13
2: 2 B 0.4570105 14
3: 2 A -0.4769700 17
4: 1 B 0.9041292 15
5: 1 A 1.9194111 20
6: 2 B 4.3148126 18
7: 2 A 2.3982736 3
8: 1 B 2.6283114 2
■ Selecting columns: use [ ], passing the column names
◆ Example:
> dt[,list(v1,v2)]
v1 v2
1: 1 A
2: 2 B
3: 1 C
4: 2 A
5: 1 B
6: 2 C
7: 1 A
8: 2 B
9: 1 C
10: 2 A
11: 1 B
12: 2 C
> dt[,v3]
[1] -1 2 3 4 4 4 0 3 3 3 5 1
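■ Computing statistics such as sums and means: several operations on several variables can be computed and named in one call, provided they are wrapped in list(); a dot .() can be used in place of list()
◆ Example: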
> dt[,sum(v4)]
[1] 114
> dt[,list(sum_v4 = sum(v4),mean_v4 = mean(v4))]
sum_v4 mean_v4
1: 114 9.5
> dt[,list(sum_v3 = sum(v3),mean_v4 = mean(v4))]
sum_v3 mean_v4
1: 31 9.5
■ Creating new variables
◆ Example:
> dt[,list(v5 = v4+1,v6 = v3-1)] # creates v5 and v6
v5 v6
1: 20 -2
2: 10 1
3: 15 2
4: 19 3
5: 3 3
6: 17 3
7: 7 -1
8: 8 2
9: 16 2
10: 4 2
11: 5 4
12: 2 0
■ Printing and plotting: list() cannot be used here; use { } instead, and separate the commands with semicolons (;)
◆ Example:
dt[,{print(v2);plot(1:12,v3,col = 'red')}]
[1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
NULL
(Figure not included)
■ by argument: applies the operation in front of by separately to each level of the variable given in by
◆ Example:
> dt[,list(sum_v3 = sum(v3)),by = v2 ] # sums v3 for each level of v2
v2 sum_v3
1: A 6
2: B 5
3: C 10
> dt[,list(sum_v3 = sum(v3),mean_v4 = mean(v4)),by = v2 ] # operates on v3 and v4 at the same time
v2 sum_v3 mean_v4
1: A 6 9.75
2: B 5 10.75
3: C 10 7.50
> dt[,list(sum_v3 = sum(v3),mean_v4 = mean(v4)),by = .(v2,v1)] # summarises v3 and v4 by v2 and v1, much like ddply() combined with summarise()
v2 v1 sum_v3 mean_v4
1: A 1 2 10.0
2: B 2 2 17.0
3: C 1 6 9.0
4: A 2 4 9.5
5: B 1 3 4.5
6: C 2 4 6.0
> dt[1:8,list(sum_v3 = sum(v3),mean_v4 = mean(v4)),by = v2 ] # the same operation on the first 8 rows only
v2 sum_v3 mean_v4
1: A 5 7.333333
2: B 0 11.666667
3: C 3 8.000000
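Since dt above is built from random numbers, a reproducible sketch of grouped aggregation is given below on a hypothetical table (toy, g and v are made-up names with fixed values).

```r
library(data.table)

# Hypothetical toy table with fixed values
toy <- data.table(g = c("A", "B", "A", "B"), v = c(1, 2, 3, 4))

# j is evaluated once per group of g; .() is shorthand for list()
res <- toy[, .(sum_v = sum(v), mean_v = mean(v)), by = g]
res   # A: sum 4, mean 2; B: sum 6, mean 3
```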
■ .N: frequency counts; write .N in the second position and give the grouping variable in by
◆ Example:
> dt[,.N,by =v2] # note: the comma in front of .N is required
v2 N
1: A 4
2: B 4
3: C 4
> dt[,.N,by = list(v1,v2)] # counts for each combination of v1 and v2
v1 v2 N
1: 1 A 2
2: 2 B 2
3: 1 C 2
4: 2 A 2
5: 1 B 2
6: 2 C 2
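■ := (colon equals): adds new columns to the existing dt in place, without assigning the result back; by convention it is written with a space on each side
◆ Example: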
> dt[,v5 := v4+1]
> dt
v1 v2 v3 v4 v5
1: 1 A 0 4 5
2: 2 B 0 20 21
3: 1 C 2 13 14
4: 2 A 3 2 3
5: 1 B -2 1 2
6: 2 C 1 3 4
7: 1 A 2 16 17
8: 2 B 2 14 15
9: 1 C 4 5 6
10: 2 A 1 17 18
11: 1 B 5 8 9
12: 2 C 3 9 10
> dt[,c('V5','V6') := list(v3 +1, v4-1)] # the form for adding several columns at once
> dt
v1 v2 v3 v4 v5 V5 V6
1: 1 A 0 4 5 1 3
2: 2 B 0 20 21 1 19
3: 1 C 2 13 14 3 12
4: 2 A 3 2 3 4 1
5: 1 B -2 1 2 -1 0
6: 2 C 1 3 4 2 2
7: 1 A 2 16 17 3 15
8: 2 B 2 14 15 3 13
9: 1 C 4 5 6 5 4
10: 2 A 1 17 18 2 16
11: 1 B 5 8 9 6 7
12: 2 C 3 9 10 4 8
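Because := modifies the table in place, any other name bound to the same table sees the new column as well. The sketch below uses hypothetical names (toy, alias, snapshot) and data.table's copy() to keep an independent version.

```r
library(data.table)

toy <- data.table(v = 1:3)   # hypothetical toy table
alias <- toy                 # a second name for the same table, not a copy
snapshot <- copy(toy)        # an independent copy

toy[, w := v * 2L]           # adds column w by reference

names(alias)                 # "v" "w": alias sees the change
names(snapshot)              # "v": the copy is unaffected
```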
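■ setkey() function: sets the key variable
◆ attach() function: attaches a data frame so that its columns can be referred to by name, e.g. attach(iris)
◆ setkey(dt,v2) works in a similar spirit: it makes v2 the key, so rows can be looked up by their v2 value directly
◆ Example: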
> setkey(dt,v2)
> dt[c('A','C')]
v1 v2 v3 v4 v5 V5 V6
1: 1 A 0 4 5 1 3
2: 2 A 3 2 3 4 1
3: 1 A 2 16 17 3 15
4: 2 A 1 17 18 2 16
5: 1 C 2 13 14 3 12
6: 2 C 1 3 4 2 2
7: 1 C 4 5 6 5 4
8: 2 C 3 9 10 4 8
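■ nomatch argument: by default, a key value with no match is returned as a row of NA
◆ Example:
> dt[c('A','D'),nomatch = 0] # with nomatch = 0, unmatched keys return no row instead of NA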
v1 v2 v3 v4 v5 V5 V6
1: 1 A 0 4 5 1 3
2: 2 A 3 2 3 4 1
3: 1 A 2 16 17 3 15
4: 2 A 1 17 18 2 16
■ by = .EACHI argument: evaluates the expression separately for each key value looked up in the first position
◆ Example:
> dt[c('A','B'),sum(v4),by = .EACHI]
v2 V1
1: A 39
2: B 43
■ Chaining: write one [ ] directly after another, with nothing in between
◆ Example:
> dt[,.(v4_sum = sum(v4)), by = v2][v4_sum >40]
v2 v4_sum
1: B 43