Grouped computation in R: more than group_by


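I have been looking into Excel pivot tables recently, and it occurred to me that I am not very fluent with grouped operations in R, so I took the chance to study them and share my notes. R ships with plenty of built-in datasets; today I will use mtcars, one I know relatively well, to walk through grouped computation (grouped operations) in R.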

 

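Contents

1 group_by combined with summarize in the dplyr package

1.1 group_by syntax

1.2 summarise syntax

1.3 group_by and summarise: grouping by a single variable

1.4 group_by and summarise: grouping by several variables

2 ddply

2.1 ddply syntax

2.2 ddply grouped computation example

3 aggregate

3.1 aggregate syntax

3.2 aggregate grouped computation example

3.3 aggregate grouped computation, continued (formula interface)

4 split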

 


 

Main text

 

First, a quick look at the mtcars dataset: it is a data.frame with 32 observations and 11 variables.

 

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

 

 

1 group_by combined with summarize in the dplyr package

 

1.1 group_by syntax

 

data is the dataset
... are the grouping variables: one or several, separated by commas when there are several
group_by(mtcars, vs, am)
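On its own, group_by does not compute anything; it only records which variables define the groups. The sketch below is my own illustration (the object names by_vs_am and plain are made up for this example) and uses the dplyr helpers group_vars(), n_groups() and ungroup() to inspect and remove that grouping.

```r
library(dplyr)

# Grouping only attaches metadata; rows and columns are unchanged
by_vs_am <- group_by(mtcars, vs, am)

group_vars(by_vs_am)   # names of the grouping variables
n_groups(by_vs_am)     # number of distinct vs/am combinations
dim(by_vs_am)          # same dimensions as mtcars

# ungroup() removes the grouping again
plain <- ungroup(by_vs_am)
group_vars(plain)
```

Because the grouping travels with the object, any later summarise call on by_vs_am is computed per vs/am combination.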

 

1.2 summarise syntax

 

data is the dataset; if it has been grouped with group_by, the computation is done separately for each group
... are the summary expressions: one or several, separated by commas when there are several
summarise(data, disp = mean(disp), hp = mean(hp))
Useful functions for summarise:
Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max(), quantile()
Position: first(), last(), nth()
Count: n(), n_distinct()
Logical: any(), all()

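Note: the category names in the Useful functions list (Center, Spread, Range and so on) are not explained further here; they should be self-explanatory.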
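To make the Useful functions list a little more concrete, here is a small sketch that uses one or two functions from several of the categories in a single summarise call. The column names on the left-hand side (n_cars, mpg_median and so on) are names I picked for the illustration, not anything prescribed by dplyr.

```r
library(dplyr)

stats_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars     = n(),               # Count: rows per group
    mpg_median = median(mpg),       # Center
    mpg_sd     = sd(mpg),           # Spread
    hp_min     = min(hp),           # Range
    hp_max     = max(hp),
    n_gear     = n_distinct(gear),  # Count: distinct gear values per group
    any_manual = any(am == 1)       # Logical: does any row have am equal to 1?
  )
stats_by_cyl
```

Each expression yields one value per group, so the result has one row per cyl value.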

 

1.3 group_by and summarise: grouping by a single variable

 

> library(dplyr) # load the dplyr package
> by_cyl <- group_by(mtcars, cyl) # group mtcars by cyl; note the "# Groups: cyl [3]" line in the output
> by_cyl
# A tibble: 32 x 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ... with 22 more rows
# apply summary functions to variables of the grouped data
> summarise(by_cyl, disp = mean(disp), hp = mean(hp))
# A tibble: 3 x 3
    cyl  disp    hp
  <dbl> <dbl> <dbl>
1     4  105.  82.6
2     6  183. 122.
3     8  353. 209.
# ----- Same computation with the %>% pipe; equivalent to the step-by-step version above -----
> library(dplyr) # load the dplyr package
> mtcars %>% group_by(cyl) %>% summarise(disp = mean(disp), hp = mean(hp))
# A tibble: 3 x 3
    cyl  disp    hp
  <dbl> <dbl> <dbl>
1     4  105.  82.6
2     6  183. 122.
3     8  353. 209.

 

1.4 group_by and summarise: grouping by several variables

 

> mtcars %>% group_by(vs, am) %>% summarise(n = n())
# A tibble: 4 x 3
# Groups:   vs [2]
     vs    am     n
  <dbl> <dbl> <int>
1     0     0    12
2     0     1     6
3     1     0     7
4     1     1     7
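The "# Groups: vs [2]" line in that output is worth a second look: after summarise, the result is still grouped by vs, because only the last grouping variable (am) is dropped. The sketch below assumes dplyr 1.0.0 or later, where summarise has a .groups argument; the object names are my own. It also shows count(), which gives the same counts in one call.

```r
library(dplyr)

# Default behaviour made explicit: drop only the last grouping variable (am)
still_grouped <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n(), .groups = "drop_last")

# Drop all grouping in the result
flat <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n(), .groups = "drop")

# count() is shorthand for group_by() followed by summarise(n = n())
counted <- count(mtcars, vs, am)

group_vars(still_grouped)
group_vars(flat)
counted
```

Dropping the grouping explicitly avoids surprises when the summarised table is passed on to further mutate or summarise steps.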

 

 

2 ddply

 

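After getting to know Hadley Wickham's excellent tidyverse, data manipulation feels remarkably simple. Here is another way to do grouped computation and grouped operations: the split-apply-combine approach of the plyr package.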

 

2.1 ddply syntax

 

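ddply(.data, .variables, .fun, ...)
.data is the dataset
.variables are the grouping variables, usually written in "dot + parentheses" form, e.g. .(sex) or .(group, sex)
.fun is the function applied to each group (summarize in the example below)
... are the arguments passed on to .fun; with summarize these are the summary expressions, one or several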

 

2.2 ddply grouped computation example

> library(plyr); library(dplyr)
> dfx <- data.frame(
+   group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
+   sex = sample(c("M", "F"), size = 29, replace = TRUE),
+   age = runif(n = 29, min = 18, max = 54)
+ )
> 
> ddply(dfx, .(group, sex), summarize,
+       mean = round(mean(age), 2),
+       sd = round(sd(age), 2))
  group sex  mean    sd
1     A   F 31.46  8.70
2     A   M 28.49  2.78
3     B   F 28.75  9.19
4     B   M 40.90  8.13
5     C   F 32.24  7.37
6     C   M 40.77 13.22
> 
> 
> ddply(dfx, .(sex), summarize,
+       mean = round(mean(age), 2),
+       sd = round(sd(age), 2))
  sex  mean   sd
1   F 30.46 8.10
2   M 38.68 9.72

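Note: in ddply the grouping variables are usually written in "dot + parentheses" form, e.g. .(sex) or .(group, sex); a formula or a character vector of column names is also accepted.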
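As a quick check of the different ways to write the grouping variables, here is a minimal sketch on mtcars. It calls everything through plyr:: so that nothing from dplyr is masked, and the result names (by_quoted, by_formula, by_name) and the mean_mpg column are made up for the illustration.

```r
# Grouping variable written as .(cyl)
by_quoted <- plyr::ddply(mtcars, plyr::.(cyl), plyr::summarise,
                         mean_mpg = mean(mpg))

# Same grouping written as a one-sided formula
by_formula <- plyr::ddply(mtcars, ~ cyl, plyr::summarise,
                          mean_mpg = mean(mpg))

# Same grouping written as a character vector of column names
by_name <- plyr::ddply(mtcars, "cyl", plyr::summarise,
                       mean_mpg = mean(mpg))

by_quoted
```

The character-vector form is handy when the grouping columns are only known at run time, for example when they arrive as a function argument.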

 

3 aggregate

 

3.1 aggregate syntax

aggregate(x, by, FUN)
x is the dataset
by is a list of grouping variables
FUN is the function to compute
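For comparison with section 1.3, the same per-cylinder means of disp and hp can be sketched with aggregate on mtcars. This is my own example, not one from the aggregate help page; the point to notice is that by has to be a list, and naming its element (cyl = ...) gives the grouping column a readable name.

```r
# Mean of disp and hp for each value of cyl
agg <- aggregate(mtcars[c("disp", "hp")],
                 by = list(cyl = mtcars$cyl),
                 FUN = mean)
agg
```

Passing a bare vector such as by = mtcars$cyl stops with an error, so wrapping it in list() is required even for a single grouping variable.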

 

3.2 aggregate grouped computation example

 

> aggregate(state.x77, list(Region = state.region), mean)
         Region Population   Income Illiteracy Life Exp    Murder  HS Grad    Frost
1     Northeast   5495.111 4570.222   1.000000 71.26444  4.722222 53.96667 132.7778
2         South   4208.125 4011.938   1.737500 69.70625 10.581250 44.34375  64.6250
3 North Central   4803.000 4611.083   0.700000 71.76667  5.275000 54.51667 138.8333
4          West   2915.308 4702.615   1.023077 71.23462  7.215385 62.00000 102.1538
       Area
1  18141.00
2  54605.12
3  62652.00
4 134463.00
--------------------------------------------------------------------
> aggregate(state.x77,list(
+ Region = state.region,
+ Cold = state.x77[,"Frost"] > 130),
+ mean)
         Region  Cold Population   Income Illiteracy Life Exp    Murder  HS Grad
1     Northeast FALSE  8802.8000 4780.400  1.1800000 71.12800  5.580000 52.06000
2         South FALSE  4208.1250 4011.938  1.7375000 69.70625 10.581250 44.34375
3 North Central FALSE  7233.8333 4633.333  0.7833333 70.95667  8.283333 53.36667
4          West FALSE  4582.5714 4550.143  1.2571429 71.70000  6.828571 60.11429
5     Northeast  TRUE  1360.5000 4307.500  0.7750000 71.43500  3.650000 56.35000
6 North Central  TRUE  2372.1667 4588.833  0.6166667 72.57667  2.266667 55.66667
7          West  TRUE   970.1667 4880.500  0.7500000 70.69167  7.666667 64.20000
     Frost      Area
1 110.6000  21838.60
2  64.6250  54605.12
3 120.0000  56736.50
4  51.0000  91863.71
5 160.5000  13519.00
6 157.6667  68567.50
7 161.8333 184162.17

 

3.3 aggregate grouped computation, continued (formula interface)

 

aggregate(formula, data, FUN)
# Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
> aggregate(weight ~ feed, data = chickwts, mean)
       feed   weight
1    casein 323.5833
2 horsebean 160.2000
3   linseed 218.7500
4  meatmeal 276.9091
5   soybean 246.4286
6 sunflower 328.9167
> aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
  wool tension   breaks
1    A       L 44.55556
2    B       L 28.22222
3    A       M 24.00000
4    B       M 28.77778
5    A       H 24.55556
6    B       H 18.77778
> aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655
> aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
       alcgp    tobgp ncases ncontrols
1  0-39g/day 0-9g/day      9       261
2      40-79 0-9g/day     34       179
3     80-119 0-9g/day     19        61
4       120+ 0-9g/day     16        24
5  0-39g/day    10-19     10        84
6      40-79    10-19     17        85
7     80-119    10-19     19        49
8       120+    10-19     12        18
9  0-39g/day    20-29      5        42
10     40-79    20-29     15        62
11    80-119    20-29      6        16
12      120+    20-29      7        12
13 0-39g/day      30+      5        28
14     40-79      30+      9        29
15    80-119      30+      7        12
16      120+      30+     10        13
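One detail of the formula interface is easy to miss in the airquality example: by default (na.action = na.omit) every row with a missing value in any variable of the formula is dropped before grouping. Ozone has missing values, so the Temp means in the cbind(Ozone, Temp) call are computed on fewer rows than Temp alone would use. The sketch below illustrates this; the object names are my own.

```r
# Temp alone: Temp has no NAs, so every row is used
temp_only <- aggregate(Temp ~ Month, data = airquality, mean)

# Ozone and Temp together: rows with missing Ozone are dropped first,
# so the Temp means are based on fewer rows
both <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Keep all rows and let mean() skip NAs column by column instead
keep_all <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality,
                      FUN = mean, na.rm = TRUE, na.action = na.pass)

temp_only
both
keep_all
```

With na.action = na.pass plus na.rm = TRUE, each column is averaged over its own non-missing values, which is often what is wanted when several variables are summarised at once.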

 

 

4 split

 

There is not much to explain about split, so let's go straight to an example and get a feel for it.

> require(stats); require(graphics)
> n <- 10; nn <- 100
> g <- factor(round(n * runif(n * nn)))
> x <- rnorm(n * nn) + sqrt(as.numeric(g))
> 
> xg_group_length <- split(x, g) %>% sapply(length)
> xg_group_length
  0   1   2   3   4   5   6   7   8   9  10 
 42 105 103  93 119 120  80  88  97 101  52 
> xg_group_mean <- split(x, g) %>% sapply(mean)
> xg_group_mean
        0         1         2         3         4         5         6         7 
0.9776091 1.3270451 1.6645178 1.7567653 2.2137027 2.4426637 2.5394288 2.6557613 
        8         9        10 
2.8258368 3.0948452 3.1845892 
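Since the example above uses random data, here is the same idea sketched on mtcars so the result is reproducible. It uses only base R (no pipe), and the object names mpg_by_cyl and pieces are my own. It also shows tapply, which does the split and the apply in one call, and split on a whole data frame.

```r
# Split one column by a grouping vector, then apply a function to each piece
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
sapply(mpg_by_cyl, length)
sapply(mpg_by_cyl, mean)

# tapply() does the split and the apply in one call
tapply(mtcars$mpg, mtcars$cyl, mean)

# Splitting a whole data frame gives a named list of data frames
pieces <- split(mtcars, mtcars$cyl)
sapply(pieces, nrow)
```

The list returned by split is the "split" step of split-apply-combine; sapply then covers both the apply and the combine steps.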

 

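"R for Data Science" (《R数据科学》) is a book devoted to the tidyverse packages, mainly dplyr, tidyr, ggplot2 and purrr. It is well worth studying; this one book answers most questions that come up in data processing.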

 
