tidyverse
系列应该算是R语言数据分析中的瑞士军刀了,统一的格式,简洁的代码,管道符便于阅读的形式,都能让大家快速上手。R数据科学就是专门讲这个系列的,但是对于很多函数的用法和细节问题,都没有说,所以在使用时还是会经常遇到各种问题。
我根据R数据科学和tidyverse官网的教程,整理了几篇笔记,主要是对tidyverse的各种函数的用法进行详细的演示。
前面已经介绍过了forcats
包处理因子型数据,lubridate
包处理日期时间格式数据。
下面介绍dplyr
包。
在处理数据时,要明确以下几个问题:
dplyr
包可以帮你又快又简单地处理这些问题。tidyr
包主要聚焦于把数据变成整洁数据,dplyr
包主要功能在于对整洁数据进行各种操作,比如新增、筛选、汇总、合并等。
install.packages("tidyverse")
下面使用*星战(starwars)*数据集演示基本的dplyr
用法。
starwars数据集共有87行,14列,记录了星战里面的87个人物(机器人、外星人等等)的14个特点,比如姓名、身高、体重、头发颜色、眼睛颜色、种族等。
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex "male", "none", "none", "male", "female", "male", "female",~
## $ gender "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
今天介绍的都是主要针对单个数据集进行操作的各种函数,也是最常见的类型。
根据作用方式不同,大致可以分为以下3类:
filter()
slice()
arrange()
select()
rename()
mutate()
relocate()
summarise()
filter()
函数用于筛选符合条件的行,可以用各种表达式进行筛选,比如筛选眼睛颜色是brown并且皮肤颜色是light的行,注意这里不需要使用 & 符号:
starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Leia Or~ 150 49 brown light brown 19 female femin~
## 2 Biggs D~ 183 84 black light brown 24 male mascu~
## 3 Cordé 157 NA brown light brown NA female femin~
## 4 Dormé 165 NA brown light brown NA female femin~
## 5 Raymus ~ 188 79 brown light brown NA male mascu~
## 6 Poe Dam~ NA NA brown light brown NA male mascu~
## 7 Padmé A~ 165 45 brown light brown 46 female femin~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
但是需要注意,filter()
函数不支持直接使用行号进行筛选,比如说你想选择第1行到第3行,下面这种写法是错误的:
starwars %>% filter(1:3)
这种情况应该使用slice()
函数:
starwars %>% slice(1:3)
## # A tibble: 3 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, blue red 33 none mascu~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
arrange()
函数是用来排序的,根据某一列进行排序。
starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Yoda 66 17 white green brown 896 male mascu~
## 2 Ratts T~ 79 15 none grey, blue unknown NA male mascu~
## 3 Wicket ~ 88 20 brown brown brown 8 male mascu~
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu~
## 5 R2-D2 96 32 white, bl~ red 33 none mascu~
## 6 R4-P17 96 NA none silver, r~ red, blue NA none femin~
## 7 R5-D4 97 32 white, red red NA none mascu~
## 8 Sebulba 112 40 none grey, red orange NA male mascu~
## 9 Gasgano 122 NA none white, bl~ black NA male mascu~
## 10 Watto 137 NA black blue, grey yellow NA male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
desc()
函数可以进行倒序:
starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Yarael~ 264 NA none white yellow NA male mascu~
## 2 Tarfful 234 136 brown brown blue NA male mascu~
## 3 Lama Su 229 88 none grey black NA male mascu~
## 4 Chewba~ 228 112 brown unknown blue 200 male mascu~
## 5 Roos T~ 224 82 none grey orange NA male mascu~
## 6 Grievo~ 216 159 none brown, whi~ green, y~ NA male mascu~
## 7 Taun We 213 NA none grey black NA fema~ femin~
## 8 Rugor ~ 206 NA none green orange NA male mascu~
## 9 Tion M~ 206 80 none grey black NA male mascu~
## 10 Darth ~ 202 136 none white yellow 41.9 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
选择第5-10行的数据:
starwars %>% slice(5:10)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 2 Owen La~ 178 120 brown, grey light blue 52 male mascu~
## 3 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 4 R5-D4 97 32 white, red red NA none mascu~
## 5 Biggs D~ 183 84 black light brown 24 male mascu~
## 6 Obi-Wan~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
这其实是一组函数,还有各种变体,比如下面这个,选择前4行:
starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, blue red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
选择最后面的10%的行:
starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Sly Moo~ 178 48 none pale white NA
## 2 Tion Me~ 206 80 none grey black NA male mascu~
## 3 Finn NA NA black dark dark NA male mascu~
## 4 Rey NA NA brown light hazel NA female femin~
## 5 Poe Dam~ NA NA brown light brown NA male mascu~
## 6 BB8 NA NA none none black NA none mascu~
## 7 Captain~ NA NA unknown unknown unknown NA
## 8 Padmé A~ 165 45 brown light brown 46 female femin~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
随机选择10%的行:
starwars %>% slice_sample(prop = 0.1) # n=2
## # A tibble: 8 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Lobot 175 79 none light blue 37 male mascu~
## 2 Zam We~ 168 55 blonde fair, green~ yellow NA fema~ femin~
## 3 Ric Ol~ 183 NA brown fair blue NA
## 4 R4-P17 96 NA none silver, red red, blue NA none femin~
## 5 Lando ~ 177 79 black dark brown 31 male mascu~
## 6 Greedo 173 74 green black 44 male mascu~
## 7 Ackbar 180 83 none brown mottle orange 41 male mascu~
## 8 Rugor ~ 206 NA none green orange NA male mascu~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
随机选择10%的行,可以重复:
starwars %>% slice_sample(n=10, replace = T)
## # A tibble: 10 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Jek To~ 180 110 brown fair blue NA male mascu~
## 2 Quarsh~ 183 NA black dark brown 62
## 3 Arvel ~ NA NA brown fair brown NA male mascu~
## 4 Darth ~ 175 80 none red yellow 54 male mascu~
## 5 Finn NA NA black dark dark NA male mascu~
## 6 Ric Ol~ 183 NA brown fair blue NA
## 7 Mace W~ 188 84 none dark brown 72 male mascu~
## 8 Jango ~ 183 79 black tan brown 66 male mascu~
## 9 San Hi~ 191 NA none grey gold NA male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
选择某一列中最大或者最小的几个值所在的行,注意不能有NA值:
starwars %>% filter(!is.na(height)) %>%
slice_max(height, n = 5) # 选择的这列不能有NA
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Yarael ~ 264 NA none white yellow NA male mascul~
## 2 Tarfful 234 136 brown brown blue NA male mascul~
## 3 Lama Su 229 88 none grey black NA male mascul~
## 4 Chewbac~ 228 112 brown unknown blue 200 male mascul~
## 5 Roos Ta~ 224 82 none grey orange NA male mascul~
## # ... with 5 more variables: homeworld , species , films ,
## # vehicles , starships
直接根据列名选择列,列名不需要使用引号:
starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
## hair_color skin_color eye_color
##
## 1 blond fair blue
## 2 gold yellow
## 3 white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
选择列名中以color结尾的列:
starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
## hair_color skin_color eye_color
##
## 1 blond fair blue
## 2 gold yellow
## 3 white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
选择列名中包含color字样的列:
starwars %>% select(contains("color"))
## # A tibble: 87 x 3
## hair_color skin_color eye_color
##
## 1 blond fair blue
## 2 gold yellow
## 3 white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
重命名列:
starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: home_world , species ,
## # films , vehicles , starships
starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
## height_m name height mass hair_color skin_color eye_color birth_year sex
##
## 1 1.72 Luke ~ 172 77 blond fair blue 19 male
## 2 1.67 C-3PO 167 75 gold yellow 112 none
## 3 0.96 R2-D2 96 32 white, bl~ red 33 none
## 4 2.02 Darth~ 202 136 none white yellow 41.9 male
## 5 1.5 Leia ~ 150 49 brown light brown 19 fema~
## 6 1.78 Owen ~ 178 120 brown, gr~ light blue 52 male
## 7 1.65 Beru ~ 165 75 brown light blue 47 fema~
## 8 0.97 R5-D4 97 32 white, red red NA none
## 9 1.83 Biggs~ 183 84 black light brown 24 male
## 10 1.82 Obi-W~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: gender , homeworld ,
## # species , films , vehicles , starships
这个效果和上面是一样的,都是把新建的列放在最前面:
starwars %>%
mutate(height_m = height/100) %>%
select(height_m, everything())
## # A tibble: 87 x 15
## height_m name height mass hair_color skin_color eye_color birth_year sex
##
## 1 1.72 Luke ~ 172 77 blond fair blue 19 male
## 2 1.67 C-3PO 167 75 gold yellow 112 none
## 3 0.96 R2-D2 96 32 white, bl~ red 33 none
## 4 2.02 Darth~ 202 136 none white yellow 41.9 male
## 5 1.5 Leia ~ 150 49 brown light brown 19 fema~
## 6 1.78 Owen ~ 178 120 brown, gr~ light blue 52 male
## 7 1.65 Beru ~ 165 75 brown light blue 47 fema~
## 8 0.97 R5-D4 97 32 white, red red NA none
## 9 1.83 Biggs~ 183 84 black light brown 24 male
## 10 1.82 Obi-W~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: gender , homeworld ,
## # species , films , vehicles , starships
新建的列可以直接被使用:
starwars %>%
mutate(
height_m = height / 100,
BMI = mass / (height_m^2)
) %>%
select(BMI, everything())
## # A tibble: 87 x 16
## BMI name height mass hair_color skin_color eye_color birth_year sex
##
## 1 26.0 Luke Sk~ 172 77 blond fair blue 19 male
## 2 26.9 C-3PO 167 75 gold yellow 112 none
## 3 34.7 R2-D2 96 32 white, bl~ red 33 none
## 4 33.3 Darth V~ 202 136 none white yellow 41.9 male
## 5 21.8 Leia Or~ 150 49 brown light brown 19 fema~
## 6 37.9 Owen La~ 178 120 brown, grey light blue 52 male
## 7 27.5 Beru Wh~ 165 75 brown light blue 47 fema~
## 8 34.0 R5-D4 97 32 white, red red NA none
## 9 25.1 Biggs D~ 183 84 black light brown 24 male
## 10 23.2 Obi-Wan~ 182 77 auburn, wh~ fair blue-gray 57 male
## # ... with 77 more rows, and 7 more variables: gender , homeworld ,
## # species , films , vehicles , starships ,
## # height_m
只保留新建的列,其他列不要了:
starwars %>%
transmute(
height_m = height / 100,
BMI = mass / (height_m^2)
)
## # A tibble: 87 x 2
## height_m BMI
##
## 1 1.72 26.0
## 2 1.67 26.9
## 3 0.96 34.7
## 4 2.02 33.3
## 5 1.5 21.8
## 6 1.78 37.9
## 7 1.65 27.5
## 8 0.97 34.0
## 9 1.83 25.1
## 10 1.82 23.2
## # ... with 77 more rows
主要是使用.before
和.after
参数,控制位置:
starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
## name sex gender homeworld height mass hair_color skin_color eye_color
##
## 1 Luke Sk~ male mascu~ Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu~ Tatooine 167 75 gold yellow
## 3 R2-D2 none mascu~ Naboo 96 32 white, bl~ red
## 4 Darth V~ male mascu~ Tatooine 202 136 none white yellow
## 5 Leia Or~ female femin~ Alderaan 150 49 brown light brown
## 6 Owen La~ male mascu~ Tatooine 178 120 brown, gr~ light blue
## 7 Beru Wh~ female femin~ Tatooine 165 75 brown light blue
## 8 R5-D4 none mascu~ Tatooine 97 32 white, red red
## 9 Biggs D~ male mascu~ Tatooine 183 84 black light brown
## 10 Obi-Wan~ male mascu~ Stewjon 182 77 auburn, w~ fair blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year , species ,
## # films , vehicles , starships
一般来说和group_by()
连用才能发挥威力。
starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
## height
##
## 1 174.
今天主要是对dplyr
有一个大致的认识,熟悉下最常见的操作,后面会根据不同的应用场景继续介绍更多的内容。
在现实生活中我们经常会遇到非常多需要分组汇总的情况,单个的汇总价值不大,只有分组之后,才能看出差异,才能表现出数据的价值。
dplyr
为我们提供了group_by()
函数,主要使用group_by()
对数据进行分组,然后再进行各种计算,通过和其他操作进行连接,发挥更加强大的作用。
先建立2个分组数据进行演示,还是使用星战数据集。
by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)
看看这两个对象有什么不同,可以看出和原数据集没什么不同,但是都被分组了!
by_species
## # A tibble: 87 x 14
## # Groups: species [38]
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
by_sex_gender
## # A tibble: 87 x 14
## # Groups: sex, gender [6]
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
使用tally()
函数进行计数:
by_species %>% tally(sort = T)
## # A tibble: 38 x 2
## species n
##
## 1 Human 35
## 2 Droid 6
## 3 4
## 4 Gungan 3
## 5 Kaminoan 2
## 6 Mirialan 2
## 7 Twi'lek 2
## 8 Wookiee 2
## 9 Zabrak 2
## 10 Aleena 1
## # ... with 28 more rows
和下面这个操作是一样的效果:
by_species %>% summarise(n=n())
## # A tibble: 38 x 2
## species n
##
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # ... with 28 more rows
除了根据现有的变量进行分组外,还可以根据现有变量的函数进行分组,这样做类似于先mutate()
再group_by()
。
bmi_breaks <- c(0,18.5,25,30,Inf)
starwars %>%
group_by(bmi_cat = cut(mass/(height/100)^2,breaks = bmi_breaks)) %>%
tally(sort = T)
## # A tibble: 5 x 2
## bmi_cat n
##
## 1 28
## 2 (18.5,25] 24
## 3 (25,30] 13
## 4 (30,Inf] 12
## 5 (0,18.5] 10
是不是很神奇?
group_keys()
查看用于分组的组内有哪些类别,可以看到species有38种:
by_species %>% group_keys()
## # A tibble: 38 x 1
## species
##
## 1 Aleena
## 2 Besalisk
## 3 Cerean
## 4 Chagrian
## 5 Clawdite
## 6 Droid
## 7 Dug
## 8 Ewok
## 9 Geonosian
## 10 Gungan
## # ... with 28 more rows
by_sex_gender %>% group_keys()
## # A tibble: 6 x 2
## sex gender
##
## 1 female feminine
## 2 hermaphroditic masculine
## 3 male masculine
## 4 none feminine
## 5 none masculine
## 6
group_indices()
查看每一行属于哪个组:
by_species %>% group_indices() # which group each row belongs to
## [1] 11 6 6 11 11 11 11 6 11 11 11 11 34 11 24 12 11 11 36 11 11 6 31 11 11
## [26] 18 11 11 8 26 11 21 11 10 10 10 38 30 7 38 11 37 32 32 33 35 29 11 3 20
## [51] 37 27 13 23 16 4 11 11 11 9 17 17 11 11 11 11 5 2 15 15 11 1 6 25 19
## [76] 28 14 34 11 38 22 11 11 11 6 38 11
group_rows()
查看每个组包括哪些行:
by_species %>% group_rows() # which rows each group contains with
## [38]>
## [[1]]
## [1] 72
##
## [[2]]
## [1] 68
##
## [[3]]
## [1] 49
##
## [[4]]
## [1] 56
##
## [[5]]
## [1] 67
##
## [[6]]
## [1] 2 3 8 22 73 85
##
## [[7]]
## [1] 39
##
## [[8]]
## [1] 29
##
## [[9]]
## [1] 60
##
## [[10]]
## [1] 34 35 36
##
## [[11]]
## [1] 1 4 5 6 7 9 10 11 12 14 17 18 20 21 24 25 27 28 31 33 41 48 57 58 59
## [26] 63 64 65 66 71 79 82 83 84 87
##
## [[12]]
## [1] 16
##
## [[13]]
## [1] 53
##
## [[14]]
## [1] 77
##
## [[15]]
## [1] 69 70
##
## [[16]]
## [1] 55
##
## [[17]]
## [1] 61 62
##
## [[18]]
## [1] 26
##
## [[19]]
## [1] 75
##
## [[20]]
## [1] 50
##
## [[21]]
## [1] 32
##
## [[22]]
## [1] 81
##
## [[23]]
## [1] 54
##
## [[24]]
## [1] 15
##
## [[25]]
## [1] 74
##
## [[26]]
## [1] 30
##
## [[27]]
## [1] 52
##
## [[28]]
## [1] 76
##
## [[29]]
## [1] 47
##
## [[30]]
## [1] 38
##
## [[31]]
## [1] 23
##
## [[32]]
## [1] 43 44
##
## [[33]]
## [1] 45
##
## [[34]]
## [1] 13 78
##
## [[35]]
## [1] 46
##
## [[36]]
## [1] 19
##
## [[37]]
## [1] 42 51
##
## [[38]]
## [1] 37 40 80 86
group_vars()
查看用于聚合的变量名字:
by_sex_gender %>% group_vars() # the name of the grouping variable
## [1] "sex" "gender"
如果把group_by()
作用于已经聚合的变量,那数据会被覆盖,比如下面这个,by_species
已经被species
聚合了,再通过homeworld
聚合,那结果只是homeworld
的结果:
by_species %>%
group_by(homeworld) %>%
tally()
## # A tibble: 49 x 2
## homeworld n
##
## 1 Alderaan 3
## 2 Aleen Minor 1
## 3 Bespin 1
## 4 Bestine IV 1
## 5 Cato Neimoidia 1
## 6 Cerea 1
## 7 Champala 1
## 8 Chandrila 1
## 9 Concord Dawn 1
## 10 Corellia 2
## # ... with 39 more rows
是不是之前没注意过这些小问题?
通过使用一个参数可以避免这个问题:
by_species %>%
group_by(homeworld, .add = T) %>%
tally()
## # A tibble: 58 x 3
## # Groups: species [38]
## species homeworld n
##
## 1 Aleena Aleen Minor 1
## 2 Besalisk Ojom 1
## 3 Cerean Cerea 1
## 4 Chagrian Champala 1
## 5 Clawdite Zolan 1
## 6 Droid Naboo 1
## 7 Droid Tatooine 2
## 8 Droid 3
## 9 Dug Malastare 1
## 10 Ewok Endor 1
## # ... with 48 more rows
一个被聚合的数据如果不解除聚合,那么后面的操作都会以聚合后的结果呈现出来,所以聚合之后一定要记得解除聚合!
by_species %>%
ungroup() %>%
tally()
## # A tibble: 1 x 1
## n
##
## 1 87
by_sex_gender %>%
ungroup(sex) %>%
tally()
## # A tibble: 3 x 2
## gender n
##
## 1 feminine 17
## 2 masculine 66
## 3 4
下面这部分主要介绍group_by和其他函数的联合使用:
by_species %>%
summarise(
n = n(),
height = mean(height, na.rm = TRUE)
)
## # A tibble: 38 x 3
## species n height
##
## 1 Aleena 1 79
## 2 Besalisk 1 198
## 3 Cerean 1 198
## 4 Chagrian 1 196
## 5 Clawdite 1 168
## 6 Droid 6 131.
## 7 Dug 1 112
## 8 Ewok 1 88
## 9 Geonosian 1 183
## 10 Gungan 3 209.
## # ... with 28 more rows
control the grouping variables
通过.groups
参数控制聚合变量:
by_sex_gender %>%
summarise(n = n()) %>%
group_vars()
## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.
## [1] "sex"
# 只通过sex进行聚合
by_sex_gender %>%
summarise(n = n(), .groups = "drop_last") %>%
group_vars()
## [1] "sex"
by_sex_gender %>%
summarise(n = n(), .groups = "keep") %>%
group_vars()
## [1] "sex" "gender"
# 不聚合了
by_sex_gender %>%
summarise(n = n(), .groups = "drop") %>%
group_vars()
## character(0)
select()
/rename()
/relocate()
by_species %>% select(mass) # grouped by species
## Adding missing grouping variables: `species`
## # A tibble: 87 x 2
## # Groups: species [38]
## species mass
##
## 1 Human 77
## 2 Droid 75
## 3 Droid 32
## 4 Human 136
## 5 Human 49
## 6 Human 120
## 7 Human 75
## 8 Droid 32
## 9 Human 84
## 10 Human 77
## # ... with 77 more rows
by_species %>%
ungroup() %>%
select(mass)
## # A tibble: 87 x 1
## mass
##
## 1 77
## 2 75
## 3 32
## 4 136
## 5 49
## 6 120
## 7 75
## 8 32
## 9 84
## 10 77
## # ... with 77 more rows
by_species %>%
arrange(desc(mass)) %>%
relocate(species, mass)
## # A tibble: 87 x 14
## # Groups: species [38]
## species mass name height hair_color skin_color eye_color birth_year sex
##
## 1 Hutt 1358 Jabba~ 175 green-tan~ orange 600 herm~
## 2 Kaleesh 159 Griev~ 216 none brown, wh~ green, y~ NA male
## 3 Droid 140 IG-88 200 none metal red 15 none
## 4 Human 136 Darth~ 202 none white yellow 41.9 male
## 5 Wookiee 136 Tarff~ 234 brown brown blue NA male
## 6 Human 120 Owen ~ 178 brown, gr~ light blue 52 male
## 7 Trandos~ 113 Bossk 190 none green red 53 male
## 8 Wookiee 112 Chewb~ 228 brown unknown blue 200 male
## 9 Human 110 Jek T~ 180 brown fair blue NA male
## 10 Besalisk 102 Dexte~ 198 none brown yellow NA male
## # ... with 77 more rows, and 5 more variables: gender , homeworld ,
## # films , vehicles , starships
通过.by_group
参数控制进行排序的先后位置,下面这个例子就是先根据species进行排序,再根据mass进行排序,和上面的不一样哦!
by_species %>%
arrange(desc(mass), .by_group = T) %>%
relocate(species, mass)
## # A tibble: 87 x 14
## # Groups: species [38]
## species mass name height hair_color skin_color eye_color birth_year sex
##
## 1 Aleena 15 Ratts~ 79 none grey, blue unknown NA male
## 2 Besalisk 102 Dexte~ 198 none brown yellow NA male
## 3 Cerean 82 Ki-Ad~ 198 white pale yellow 92 male
## 4 Chagrian NA Mas A~ 196 none blue blue NA male
## 5 Clawdite 55 Zam W~ 168 blonde fair, gre~ yellow NA fema~
## 6 Droid 140 IG-88 200 none metal red 15 none
## 7 Droid 75 C-3PO 167 gold yellow 112 none
## 8 Droid 32 R2-D2 96 white, bl~ red 33 none
## 9 Droid 32 R5-D4 97 white, red red NA none
## 10 Droid NA R4-P17 96 none silver, r~ red, blue NA none
## # ... with 77 more rows, and 5 more variables: gender , homeworld ,
## # films , vehicles , starships
muatate()
and transmutate()
starwars %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(means = mean(mass, na.rm = T),
standard_mass = mass - mean(mass, na.rm = T))
## # A tibble: 87 x 5
## # Groups: homeworld [49]
## name homeworld mass means standard_mass
##
## 1 Luke Skywalker Tatooine 77 85.4 -8.38
## 2 C-3PO Tatooine 75 85.4 -10.4
## 3 R2-D2 Naboo 32 64.2 -32.2
## 4 Darth Vader Tatooine 136 85.4 50.6
## 5 Leia Organa Alderaan 49 64 -15
## 6 Owen Lars Tatooine 120 85.4 34.6
## 7 Beru Whitesun lars Tatooine 75 85.4 -10.4
## 8 R5-D4 Tatooine 32 85.4 -53.4
## 9 Biggs Darklighter Tatooine 84 85.4 -1.38
## 10 Obi-Wan Kenobi Stewjon 77 77 0
## # ... with 77 more rows
min_rank()
函数返回顺序(秩次):
# Overall rank
starwars %>%
select(name, homeworld, height) %>%
mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## name homeworld height rank
##
## 1 Luke Skywalker Tatooine 172 29
## 2 C-3PO Tatooine 167 21
## 3 R2-D2 Naboo 96 5
## 4 Darth Vader Tatooine 202 72
## 5 Leia Organa Alderaan 150 11
## 6 Owen Lars Tatooine 178 35
## 7 Beru Whitesun lars Tatooine 165 17
## 8 R5-D4 Tatooine 97 7
## 9 Biggs Darklighter Tatooine 183 45
## 10 Obi-Wan Kenobi Stewjon 182 44
## # ... with 77 more rows
先根据homeworld进行分组,再新建列:
# Rank per homeworld
starwars %>%
select(name, homeworld, height) %>%
group_by(homeworld) %>%
mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## # Groups: homeworld [49]
## name homeworld height rank
##
## 1 Luke Skywalker Tatooine 172 5
## 2 C-3PO Tatooine 167 4
## 3 R2-D2 Naboo 96 1
## 4 Darth Vader Tatooine 202 10
## 5 Leia Organa Alderaan 150 1
## 6 Owen Lars Tatooine 178 6
## 7 Beru Whitesun lars Tatooine 165 3
## 8 R5-D4 Tatooine 97 1
## 9 Biggs Darklighter Tatooine 183 7
## 10 Obi-Wan Kenobi Stewjon 182 1
## # ... with 77 more rows
筛选每个物种(species)中最高(height)的那一个:
by_species %>%
select(name, species, height) %>%
filter(height == max(height))
## # A tibble: 35 x 3
## # Groups: species [35]
## name species height
##
## 1 Greedo Rodian 173
## 2 Jabba Desilijic Tiure Hutt 175
## 3 Yoda Yoda's species 66
## 4 Bossk Trandoshan 190
## 5 Ackbar Mon Calamari 180
## 6 Wicket Systri Warrick Ewok 88
## 7 Nien Nunb Sullustan 160
## 8 Nute Gunray Neimodian 191
## 9 Roos Tarpals Gungan 224
## 10 Watto Toydarian 137
## # ... with 25 more rows
去掉只有1个成员的物种:
by_species %>%
filter(n() != 1) %>%
tally()
## # A tibble: 9 x 2
## species n
##
## 1 Droid 6
## 2 Gungan 3
## 3 Human 35
## 4 Kaminoan 2
## 5 Mirialan 2
## 6 Twi'lek 2
## 7 Wookiee 2
## 8 Zabrak 2
## 9 4
在dplyr verbs内部,可以使用带有cur前缀的函数族访问当前组的各种属性。
cur_data() returns the current group, excluding grouping variables. It’s useful to feed to functions that take a whole data frame. For example, the following code fits a linear model of mass ~ height to each species:
by_species %>%
filter(n() > 1) %>%
mutate(mod = list(lm(mass ~ height, data = cur_data())))
## # A tibble: 58 x 15
## # Groups: species [9]
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 48 more rows, and 6 more variables: homeworld , species ,
## # films , vehicles , starships , mod
by_species %>%
arrange(species) %>%
select(name, species, homeworld) %>%
mutate(id = cur_group_id())
## # A tibble: 87 x 4
## # Groups: species [38]
## name species homeworld id
##
## 1 Ratts Tyerell Aleena Aleen Minor 1
## 2 Dexter Jettster Besalisk Ojom 2
## 3 Ki-Adi-Mundi Cerean Cerea 3
## 4 Mas Amedda Chagrian Champala 4
## 5 Zam Wesell Clawdite Zolan 5
## 6 C-3PO Droid Tatooine 6
## 7 R2-D2 Droid Naboo 6
## 8 R5-D4 Droid Tatooine 6
## 9 IG-88 Droid 6
## 10 R4-P17 Droid 6
## # ... with 77 more rows
可以参考R数据科学这本书中的介绍,非常详细。
处理两个数据集的函数。
内连接
外连接
library(nycflights13)
library(dplyr)
# 选择部分数据方便演示
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
glimpse(flights2)
## Rows: 336,776
## Columns: 8
## $ year 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20~
## $ month 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6,~
## $ origin "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK",~
## $ dest "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO",~
## $ tailnum "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N~
## $ carrier "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B~
glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O~
## $ name "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline~
有一列共有的,carrier
flights2 %>%
left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier name
##
## 1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
## 4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
by = NULL
是默认。
glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW~
## $ year 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,~
## $ month 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, ~
## $ temp 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.~
## $ dewp 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.~
## $ humid 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.~
## $ wind_dir 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,~
## $ wind_speed 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, ~
## $ wind_gust NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.~
## $ precip 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ pressure 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101~
## $ visib 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ time_hour 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:00~
flights2 %>% left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
##
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir ,
## # wind_speed , wind_gust , precip , pressure ,
## # visib , time_hour
glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW~
## $ year 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199~
## $ type "Fixed wing multi engine", "Fixed wing multi engine", "Fi~
## $ manufacturer "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU~
## $ model "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145~
## $ engines 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ seats 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5~
## $ speed NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb~
flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
##
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA
## # ... with 336,766 more rows, and 6 more variables: manufacturer ,
## # model , engines , seats , speed , engine
glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "~
## $ name "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur~
## $ lat 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4~
## $ lon -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342~
## $ alt 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10~
## $ tz -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, ~
## $ dst "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",~
## $ tzone "America/New_York", "America/Chicago", "America/Chicago", "Ameri~
如果两个数据集中列名不一样也可以进行合并:
flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
##
## 1 2013 1 1 5 EWR IAH N14228 UA Georg~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Georg~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Harts~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chica~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort ~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Washi~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orlan~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chica~ 42.0 -87.9 668
## # ... with 336,766 more rows, and 3 more variables: tz , dst ,
## # tzone
下面是一个简单的左连接的例子:
df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")
df1
## # A tibble: 2 x 2
## x y
##
## 1 1 2
## 2 2 1
df2
## # A tibble: 2 x 3
## x a b
##
## 1 3 10 a
## 2 1 10 a
df1 %>% inner_join(df2)
## Joining, by = "x"
## # A tibble: 1 x 4
## x y a b
##
## 1 1 2 10 a
df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
##
## 1 1 2 10 a
## 2 2 1 NA
右连接:
df1 %>% right_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
##
## 1 1 2 10 a
## 2 3 NA 10 a
df2 %>% left_join(df1)
## Joining, by = "x"
## # A tibble: 2 x 4
## x a b y
##
## 1 3 10 a NA
## 2 1 10 a 2
全连接:
df1 %>% full_join(df2)
## Joining, by = "x"
## # A tibble: 3 x 4
## x y a b
##
## 1 1 2 10 a
## 2 2 1 NA
## 3 3 NA 10 a
df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))
df1
## # A tibble: 4 x 2
## x y
##
## 1 1 1
## 2 1 2
## 3 3 3
## 4 4 4
df2
## # A tibble: 3 x 2
## x z
##
## 1 1 a
## 2 1 b
## 3 2 a
df1 %>% nrow()
## [1] 4
df1 %>% inner_join(df2, by = "x")
## # A tibble: 4 x 3
## x y z
##
## 1 1 1 a
## 2 1 1 b
## 3 1 2 a
## 4 1 2 b
df1 %>% semi_join(df2, by = "x")
## # A tibble: 2 x 2
## x y
##
## 1 1 1
## 2 1 2
df1 %>% anti_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 2
## x y
##
## 1 3 3
## 2 4 4
(df1 <- tibble(x = 1:2, y = c(1L, 1L)))
## # A tibble: 2 x 2
## x y
##
## 1 1 1
## 2 2 1
(df2 <- tibble(x = 1:2, y = 1:2))
## # A tibble: 2 x 2
## x y
##
## 1 1 1
## 2 2 2
intersect(df1, df2) # 取交集
## # A tibble: 1 x 2
## x y
##
## 1 1 1
union(df1, df2) # 并集
## # A tibble: 3 x 2
## x y
##
## 1 1 1
## 2 2 1
## 3 2 2
setdiff(df1, df2)
## # A tibble: 1 x 2
## x y
##
## 1 2 1
setdiff(df2, df1)
## # A tibble: 1 x 2
## x y
##
## 1 2 2
下面是一些集合操作的示意图:
主要是介绍across
函数的用法,这是dplyr
1.0才出来的一个函数,大大简化了代码
可用于对多列做同一个操作。
library(dplyr, warn.conflicts = FALSE)
across()
有两个基本参数:
可以替代_if(),at_(),all_()
starwars %>%
summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 8
## name hair_color skin_color eye_color sex gender homeworld species
##
## 1 87 13 31 15 5 3 49 38
可以直接写列名:
starwars %>%
group_by(species) %>%
filter(n() > 1) %>%
summarise(across(c(sex, gender, homeworld), n_distinct))
## # A tibble: 9 x 4
## species sex gender homeworld
##
## 1 Droid 1 2 3
## 2 Gungan 1 1 1
## 3 Human 2 2 16
## 4 Kaminoan 2 2 1
## 5 Mirialan 1 1 1
## 6 Twi'lek 2 2 1
## 7 Wookiee 1 1 1
## 8 Zabrak 1 1 2
## 9 1 1 3
也可以和where
函数连用,省时省力:
starwars %>%
group_by(homeworld) %>%
filter(n() > 1) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 10 x 4
## homeworld height mass birth_year
##
## 1 Alderaan 176. 64 43
## 2 Corellia 175 78.5 25
## 3 Coruscant 174. 50 91
## 4 Kamino 208. 83.1 31.5
## 5 Kashyyyk 231 124 200
## 6 Mirial 168 53.1 49
## 7 Naboo 175. 64.2 55
## 8 Ryloth 179 55 48
## 9 Tatooine 170. 85.4 54.6
## 10 139. 82 334.
如果没有缺失值,可以直接写mean,
library(tidyr)
starwars %>% drop_na() %>%
group_by(homeworld) %>%
filter(n() > 1) %>%
summarise(across(where(is.numeric), mean))
## # A tibble: 4 x 4
## homeworld height mass birth_year
##
## 1 Corellia 175 78.5 25
## 2 Mirial 168 53.1 49
## 3 Naboo 177 62 60
## 4 Tatooine 181. 96 37.6
acorss
支持多个函数同时使用,只要放入列表中即可:
min_max <- list(
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
## # A tibble: 1 x 6
## height_min height_max mass_min mass_max birth_year_min birth_year_max
##
## 1 66 264 15 1358 8 896
starwars %>% summarise(across(c(height, mass, birth_year), min_max))
## # A tibble: 1 x 6
## height_min height_max mass_min mass_max birth_year_min birth_year_max
##
## 1 66 264 15 1358 8 896
当然也是支持glue
的:
starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
## min.height max.height min.mass max.mass min.birth_year max.birth_year
##
## 1 66 264 15 1358 8 896
starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
## min.height max.height min.mass max.mass min.birth_year max.birth_year
##
## 1 66 264 15 1358 8 896
分开写也是可以的:
starwars %>% summarise(
across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
## # A tibble: 1 x 6
## min_height min_mass min_birth_year max_height max_mass max_birth_year
##
## 1 66 15 8 264 1358 896
这种情况不能使用where(is.numeric)
,因为第2个across
会使用新创建的列(“min_height”, “min_mass” and “min_birth_year”)。
可以放在tibble里解决:
starwars %>% summarise(
tibble(
across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
)
## # A tibble: 1 x 6
## min_height min_mass min_birth_year max_height max_mass max_birth_year
##
## 1 66 15 8 264 1358 896
在使用where(is.numeric)
,要注意下面这种情况:
df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
df %>%
summarise(n = n(), across(where(is.numeric), sd))
## n x y
## 1 NA 1 4.041452
n这里是3,是一个常数,所以它的sd变成了NA,可以通过换一下顺序解决:
df %>% summarise(across(where(is.numeric), sd),
n = n()
)
## x y n
## 1 1 4.041452 3
或者通过下面两种方法解决:
df %>%
summarise(n = n(), across(where(is.numeric) & !n, sd))
## n x y
## 1 3 1 4.041452
df %>%
summarise(
tibble(n = n(), across(where(is.numeric), sd))
)
## n x y
## 1 3 1 4.041452
还可以和group_by()
/count()
/distinct()
连用。
across()
不能直接和filter()
连用,和filter()
连用的是if_any()
和if_all()
。
starwars %>%
filter(if_any(everything(), ~ !is.na(.x)))
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 gold yellow 112 none mascu~
## 3 R2-D2 96 32 white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
starwars %>%
filter(if_all(everything(), ~ !is.na(.x)))
## # A tibble: 29 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
##
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 3 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 4 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 5 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 6 Biggs ~ 183 84 black light brown 24 male mascu~
## 7 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## 8 Anakin~ 188 84 blond fair blue 41.9 male mascu~
## 9 Chewba~ 228 112 brown unknown blue 200 male mascu~
## 10 Han So~ 180 80 brown fair brown 29 male mascu~
## # ... with 19 more rows, and 5 more variables: homeworld , species ,
## # films , vehicles , starships
在tidyverse中,整洁数据一般都是每一行是一个观测,每一列是一个变量,基本上所有操作都是基于整洁的数据进行的,都是对某列做什么操作。但有时候我们也需要对某行做一些操作,dplyr中现在提供了rowwise()
函数快速执行对行的操作。
library(dplyr, warn.conflicts = FALSE)
rowwise()
和group_by()
很像,本身不做任何操作,但是使用了rowwise之后,再和mutate()
等函数连用时,就会变成按照行进行操作!
df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
## # A tibble: 2 x 3
## # Rowwise:
## x y z
##
## 1 1 3 5
## 2 2 4 6
假如你想分别计算每行的均值(只是一个例子),不使用rowwise()
函数,得到的结果是所有数据的均值,很明显不是想要的:
df %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## x y z m
##
## 1 1 3 5 3.5
## 2 2 4 6 3.5
使用rowwise()
之后,神奇的事情发生了,变成了按行操作!
df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## # Rowwise:
## x y z m
##
## 1 1 3 5 3
## 2 2 4 6 4
df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)
df
## # A tibble: 2 x 4
## name x y z
##
## 1 Mara 1 3 5
## 2 Hadley 2 4 6
按照行计算均值:
df %>%
rowwise() %>%
summarise(m = mean(c(x, y, z)))
## # A tibble: 2 x 1
## m
##
## 1 3
## 2 4
根据name这一列按照行计算均值:
df %>%
rowwise(name) %>%
summarise(m = mean(c(x, y, z)))
## `summarise()` has grouped output by 'name'. You can override using
## the `.groups` argument.
## # A tibble: 2 x 2
## # Groups: name [2]
## name m
##
## 1 Mara 3
## 2 Hadley 4
rowwise()
可以看做是group_by()
的特殊形式,本身也是对数据先进行聚合操作,所以如果要解除聚合,也要使用ungroup()
函数。
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 x 5
## id w x y z
##
## 1 1 10 20 30 40
## 2 2 11 21 31 41
## 3 3 12 22 32 42
## 4 4 13 23 33 43
## 5 5 14 24 34 44
## 6 6 15 25 35 45
接下来要进行按行操作了!
rf <- df %>% rowwise(id)
计算加和:
rf %>% mutate(total = sum(c(w, x, y, z)))
## # A tibble: 6 x 6
## # Rowwise: id
## id w x y z total
##
## 1 1 10 20 30 40 100
## 2 2 11 21 31 41 104
## 3 3 12 22 32 42 108
## 4 4 13 23 33 43 112
## 5 5 14 24 34 44 116
## 6 6 15 25 35 45 120
rf %>% summarise(total = sum(c(w, x, y, z)))
## `summarise()` has grouped output by 'id'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 2
## # Groups: id [6]
## id total
##
## 1 1 100
## 2 2 104
## 3 3 108
## 4 4 112
## 5 5 116
## 6 6 120
rf %>% mutate(total = sum(c_across(w:z)))
## # A tibble: 6 x 6
## # Rowwise: id
## id w x y z total
##
## 1 1 10 20 30 40 100
## 2 2 11 21 31 41 104
## 3 3 12 22 32 42 108
## 4 4 13 23 33 43 112
## 5 5 14 24 34 44 116
## 6 6 15 25 35 45 120
rf %>% mutate(total = sum(c_across(where(is.numeric))))
## # A tibble: 6 x 6
## # Rowwise: id
## id w x y z total
##
## 1 1 10 20 30 40 100
## 2 2 11 21 31 41 104
## 3 3 12 22 32 42 108
## 4 4 13 23 33 43 112
## 5 5 14 24 34 44 116
## 6 6 15 25 35 45 120
可以和列操作联合使用:
rf %>%
mutate(total = sum(c_across(w:z))) %>%
ungroup() %>%
mutate(across(w:z, ~ . / total))
## # A tibble: 6 x 6
## id w x y z total
##
## 1 1 0.1 0.2 0.3 0.4 100
## 2 2 0.106 0.202 0.298 0.394 104
## 3 3 0.111 0.204 0.296 0.389 108
## 4 4 0.116 0.205 0.295 0.384 112
## 5 5 0.121 0.207 0.293 0.379 116
## 6 6 0.125 0.208 0.292 0.375 120
可以和``rowSums()函数和
rowMeans()`等函数联合使用。
df <- tibble(
x = list(1, 2:3, 4:6)
)
df
## # A tibble: 3 x 1
## x
##
## 1
## 2
## 3
df %>% mutate(l = length(x))
## # A tibble: 3 x 2
## x l
##
## 1 3
## 2 3
## 3 3
df %>% mutate(l = lengths(x))
## # A tibble: 3 x 2
## x l
##
## 1 1
## 2 2
## 3 3
df %>% mutate(l = sapply(x, length))
## # A tibble: 3 x 2
## x l
##
## 1 1
## 2 2
## 3 3
df %>% mutate(l = purrr::map_int(x, length))
## # A tibble: 3 x 2
## x l
##
## 1 1
## 2 2
## 3 3
df %>%
rowwise() %>%
mutate(l = length(x))
## # A tibble: 3 x 2
## # Rowwise:
## x l
##
## 1 1
## 2 2
## 3 3
df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)
gf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Groups: g [2]
## g y type length
##
## 1 1 list 1
## 2 2 list 1
rf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Rowwise: g
## g y type length
##
## 1 1 integer 3
## 2 2 character 1
# grouped
out1 <- integer(2)
for (i in 1:2) {
out1[[i]] <- length(df$y[i])
}
out1
## [1] 1 1
# rowwise
out2 <- integer(2)
for (i in 1:2) {
out2[[i]] <- length(df$y[[i]])
}
out2
## [1] 3 1
gf %>% mutate(y2 = y)
## # A tibble: 2 x 3
## # Groups: g [2]
## g y y2
##
## 1 1
## 2 2
rf %>% mutate(y2 = y)
## Error in `mutate()`:
## ! Problem while computing `y2 = y`.
## x `y2` must be size 1, not 3.
## i Did you mean: `y2 = list(y)` ?
## i The error occurred in row 1.
rf %>% mutate(y2 = list(y))
## # A tibble: 2 x 3
## # Rowwise: g
## g y y2
##
## 1 1
## 2 2
by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
## # A tibble: 3 x 2
## # Rowwise: cyl
## cyl data
## >
## 1 4 [11 x 10]
## 2 6 [7 x 10]
## 3 8 [14 x 10]
mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
## # A tibble: 3 x 3
## # Rowwise: cyl
## cyl data mod
## >
## 1 4 [11 x 10]
## 2 6 [7 x 10]
## 3 8 [14 x 10]
mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
## # A tibble: 3 x 4
## # Rowwise: cyl
## cyl data mod pred
## >
## 1 4 [11 x 10]
## 2 6 [7 x 10]
## 3 8 [14 x 10]
mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups: cyl [3]
## cyl rmse
##
## 1 4 3.01
## 2 6 0.985
## 3 8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups: cyl [3]
## cyl rsq
##
## 1 4 0.509
## 2 6 0.465
## 3 8 0.423
mods %>% summarise(broom::glance(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 13
## # Groups: cyl [3]
## cyl r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
##
## 1 4 0.509 0.454 3.33 9.32 0.0137 1 -27.7 61.5 62.7
## 2 6 0.465 0.357 1.17 4.34 0.0918 1 -9.83 25.7 25.5
## 3 8 0.423 0.375 2.02 8.80 0.0118 1 -28.7 63.3 65.2
## # ... with 3 more variables: deviance , df.residual , nobs
mods %>% summarise(broom::tidy(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 6
## # Groups: cyl [3]
## cyl term estimate std.error statistic p.value
##
## 1 4 (Intercept) 39.6 4.35 9.10 0.00000777
## 2 4 wt -5.65 1.85 -3.05 0.0137
## 3 6 (Intercept) 28.4 4.18 6.79 0.00105
## 4 6 wt -2.78 1.33 -2.08 0.0918
## 5 8 (Intercept) 23.9 3.01 7.94 0.00000405
## 6 8 wt -2.19 0.739 -2.97 0.0118
df <- tribble(
~ n, ~ min, ~ max,
1, 0, 1,
2, 10, 100,
3, 100, 1000,
)
df %>%
rowwise() %>%
mutate(data = list(runif(n, min, max)))
## # A tibble: 3 x 4
## # Rowwise:
## n min max data
##
## 1 1 0 1
## 2 2 10 100
## 3 3 100 1000
df %>%
rowwise() %>%
mutate(data = runif(n, min, max))
## Error in `mutate()`:
## ! Problem while computing `data = runif(n, min, max)`.
## x `data` must be size 1, not 2.
## i Did you mean: `data = list(runif(n, min, max))` ?
## i The error occurred in row 2.
df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))
df %>%
rowwise() %>%
mutate(data = list(rnorm(10, mean, sd)))
## # A tibble: 9 x 3
## # Rowwise:
## mean sd data
##
## 1 -1 1
## 2 0 1
## 3 1 1
## 4 -1 10
## 5 0 10
## 6 1 10
## 7 -1 100
## 8 0 100
## 9 1 100
df <- tribble(
~rng, ~params,
"runif", list(n = 10),
"rnorm", list(n = 20),
"rpois", list(n = 10, lambda = 5),
) %>%
rowwise()
df %>%
mutate(data = list(do.call(rng, params)))
## # A tibble: 3 x 3
## # Rowwise:
## rng params data
##
## 1 runif
## 2 rnorm
## 3 rpois