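Some things I want to share now that dplyr 1.0.0 is out

On 2020-05-29 the long-awaited dplyr 1.0.0 was finally released (well, half a month late).

Shortly before the 1.0.0 release, Hadley announced on Twitter that dplyr would be delayed by half a month, to the 29th. At the same time the package finally swapped its yellowish logo for a flashier one, and the new logo looks pretty good.

Still, I prefer the chalk-drawing version. Anyway, enough chit-chat; what stuck with me is that half-month delay.

My notes on dplyr 1.0.0:

dplyr 1.0: what you need to know
dplyr 1.0.0: key points
dplyr 1.0.0: column operations
dplyr 1.0.0: rowwise
dplyr 1.0.0: select_rename_relocate

Now that dplyr 1.0.0 is out, it is time to recommend some related resources.

Books built around R for Data Science that I recommend

Tidy evaluation (advanced): https://tidyeval.tidyverse.org/
Modern R with the tidyverse: https://b-rodrigues.github.io/modern_R/
Statistical Inference via Data Science: A ModernDive into R and the Tidyverse: https://moderndive.netlify.com/index.html
The tidyverse style guide: https://style.tidyverse.org/
R Data Analysis Guide and Cheat Sheet (R 数据分析指南与速查手册, in Chinese): https://bookdown.org/xiao/RAnalysisBook/
Data Science and the R Language (数据科学与 R 语言, in Chinese): https://bookdown.org/xiangyun/RGraphics/
Sichuan Normal University graduate elective course, R for Data Science (数据科学中的 R 语言, in Chinese): https://bookdown.org/wangminjie/R4DS/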

dplyr posts and resources I recommend:

Tidyverse learning materials: https://www.stat.cmu.edu/~ryantibs/statcomp/lectures/
Tidyverse Q&A community: https://community.rstudio.com/c/tidyverse
Tidyverse package release news: https://www.tidyverse.org/blog/
data.table and dplyr (side-by-side comparison): https://atrebas.github.io/post/2019-03-03-datatable-dplyr/
TidyTuesday (data wrangling and visualization examples): https://github.com/rfordatascience/tidytuesday/blob/master/README.md
TidyTuesday Twitter online Shiny app: https://nsgrantham.shinyapps.io/tidytuesdayrocks/
50 dplyr examples (strongly recommended to work through): https://www.listendata.com/2016/08/dplyr-tutorial.html
Hot questions for Dplyr (strongly recommended): https://www.thetopsites.net/projects/dplyr/ (a collection of questions about handling data with dplyr).
Zhang Jingxin's series on Zhihu, 120 Data Processing Exercises (R tidyverse edition):

120 Data Processing Exercises, P1-P20 (R tidyverse edition)

120 Data Processing Exercises, P21-P50 (R tidyverse edition)

120 Data Processing Exercises, P51-P80 (R tidyverse edition)

120 Data Processing Exercises, P81-P100 (R tidyverse edition)

120 Data Processing Exercises, P101-P120 (R tidyverse edition)

References:

Official Tidyverse release blog: this is really all you need to read; everything else is derived from it.
- 2020-0309-dplyr 1.0.0 is coming soon: a few words about dplyr 1.0
- 2020-0320-dplyr 1.0.0: new summarise() features
- 2020-0327-dplyr 1.0.0: select, rename, relocate
- 2020-0403-dplyr 1.0.0: working across columns
- 2020-0410-dplyr 1.0.0: working within rows
- 2020-04-27- dplyr 1.0.0 and vctrs
2020-0414-Dplyr across: First look at a new Tidyverse function
2020-0415-The Seven Key Things You Need To Know About dplyr 1.0.0
- Twitter link: https://twitter.com/dr_keithmcnulty/status/1250404270027026432
- 1. Built in tidyselect
- 2. relocate()
- 3. Superpowered summarise()
- 4. colwise using across()
- 5. new rowwise() grammar
- 6. easy modeling inside dataframes
- 7. nest_by()
2020-04-11-dplyr 1.0 code examples: I suggest skipping these; the official examples are enough
The #dplyr hashtag on Twitter
Nick Merlino 2020/05/27-My Favorite dplyr 1.0.0 Features
Tidyverse Case Study: Anscombe’s quartet
Zhang Jingxin on Zhihu: [R] A walkthrough of the new features in dplyr 1.0.0
2020-0602-dplyr 1.0.0 (a 58-slide deck) that more or less tells the history of the dplyr package (strongly recommended).
- Twitter link: https://twitter.com/rdataberlin/status/1268266145909551106
- GitHub code (R Markdown) link: https://github.com/courtiol/Rcourses/tree/master/dplyr_1_0_0

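dplyr 1.0.0 summary

So what is new in dplyr 1.0.0, and what does it make more convenient? Let me go through it piece by piece.

What are the core functions in dplyr?

select(): column operation; picks columns
rename(): renames columns
mutate(): creates new columns
filter(): row operation; keeps the rows that meet a condition
summarise(): summary function
arrange(): sorts rows
*_join(): operations across multiple tables (data frames)
relocate(): a more convenient way to change column positions
slice(): similar to head() but more powerful; it can return specific rows, the rows with the largest or smallest values, or a random sample given as a row count or a proportion
across(): used inside summarise(), mutate() and similar verbs to simplify data processing; it supersedes the earlier *_if(), *_at() and *_all() variants and lets you apply several functions to several columns at once.
rowwise(): lets you compute row by row, for example a statistic across the columns of interest within each row.
c_across(): usually paired with rowwise(); the row-wise counterpart of across()
...

Let's go through them one by one.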

select()

By position:
- df %>% select(1, 5, 10)
- df %>% select(1:4)
By name:
- df %>% select(a, e, j)
- df %>% select(c(a, e, j))
- df %>% select(a:d)
By helper function:
- df %>% select(starts_with("x")): columns whose names start with "x"
- df %>% select(ends_with("s")): columns whose names end with "s"
- df %>% select(num_range("x", 1:3)): columns named x1, x2, x3
- df %>% select(contains("ijk")): columns whose names contain "ijk"
- df %>% select(matches("(.)\\1")): columns whose names match a regular expression (here, a repeated character)
- contains() and matches() can also be combined with functions such as str_c()
By data type:
- df %>% select(where(is.numeric))
- df %>% select(where(is.factor))
- df %>% select(where(~is.numeric(.x) && mean(.x, na.rm = TRUE) > 1))
Combining selections with Boolean operators:
- df %>% select(!where(is.factor))
- df %>% select(where(is.numeric) & starts_with("x"))
- df %>% select(starts_with("a") | ends_with("z"))
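
To make the selection styles above concrete, here is a minimal sketch on a small made-up tibble. The data and the column names (id, x1, x2, label) are assumptions for illustration only; they are not from the original post.

```r
library(dplyr)

# Hypothetical data; column names are invented for this example.
df <- tibble(
  id    = 1:3,
  x1    = c(0.5, 1.5, 2.5),
  x2    = c(10, 20, 30),
  label = c("a", "b", "c")
)

by_position <- df %>% select(1, 4)                  # first and fourth column
by_name     <- df %>% select(x1:x2)                 # a range of names
by_helper   <- df %>% select(starts_with("x"))      # name-based helper
by_type     <- df %>% select(where(is.character))   # type-based predicate
combined    <- df %>% select(where(is.numeric) & !starts_with("x"))

names(combined)
```

Only `id` survives the last call, because it is numeric and its name does not start with "x".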

rename()

Directly:
- df1 %>% rename(b = 2): b is the new column name and 2 refers to the second column
With a function:
- df2 %>% rename_with(toupper)
- df2 %>% rename_with(toupper, !col1)
- df2 %>% rename_with(toupper, starts_with("x"))
- df2 %>% rename_with(toupper, where(is.numeric))
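
A short sketch of both renaming styles, assuming a made-up tibble df2 (its columns col1, x_val and note are hypothetical):

```r
library(dplyr)

# Hypothetical data for illustration.
df2 <- tibble(col1 = 1:2, x_val = c(1.5, 2.5), note = c("a", "b"))

renamed_one  <- df2 %>% rename(id = col1)                        # new_name = old_name
renamed_num  <- df2 %>% rename_with(toupper, where(is.numeric))  # only numeric columns
renamed_rest <- df2 %>% rename_with(toupper, !col1)              # everything except col1

names(renamed_num)
names(renamed_rest)
```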

mutate()

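Adding columns is easy, and a newly created column can be used right away to create further columns.
- df %>% mutate(new_col = col1 + col2, new_col1 = new_col/2)
The .keep argument
- .keep = "all": keep everything, the same behaviour as before dplyr 1.0.0
- .keep = "used": keep the new columns plus only the columns used to compute them
- .keep = "unused": keep the new columns plus only the columns not used to compute them
- .keep = "none": keep only the new columns, equivalent to transmute()
The .before argument places the new columns before a given column
The .after argument places the new columns after a given column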
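
The effect of .keep and .before is easiest to see from the resulting column names. A minimal sketch, assuming a made-up tibble with columns col1, col2 and other:

```r
library(dplyr)

# Hypothetical data for illustration.
df <- tibble(col1 = 1:3, col2 = 4:6, other = c("a", "b", "c"))

used   <- df %>% mutate(new_col = col1 + col2, .keep = "used")
unused <- df %>% mutate(new_col = col1 + col2, .keep = "unused")
none   <- df %>% mutate(new_col = col1 + col2, .keep = "none")
front  <- df %>% mutate(new_col = col1 + col2, .before = col1)

names(used)    # col1, col2, new_col
names(unused)  # other, new_col
names(none)    # new_col
names(front)   # new_col, col1, col2, other
```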

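filter()

Rows that meet a condition can be picked with Boolean operators

df %>% filter(col > 1 & col2 == "A")
df %>% filter(col1 == 1 | col1 == 2)
df %>% filter(col %in% c("A", "B"))
The between() helper tests whether a value lies in a closed range, for example df %>% filter(between(col1, 1, 10))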
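
A small runnable sketch of these filters, on a made-up tibble (the values are assumptions for illustration). Note that a condition such as col1 == 1 & col1 == 2 can never be true for a single row, so "or" conditions need | or %in%.

```r
library(dplyr)

# Hypothetical data for illustration.
df <- tibble(col1 = c(1, 2, 3, 4), col2 = c("A", "B", "A", "C"))

both     <- df %>% filter(col1 > 1 & col2 == "A")   # both conditions must hold
either   <- df %>% filter(col1 == 1 | col1 == 2)    # either condition may hold
in_set   <- df %>% filter(col2 %in% c("A", "B"))    # membership test
in_range <- df %>% filter(between(col1, 2, 3))      # 2 <= col1 <= 3

both
```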

summarise()

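Summary function. It is usually combined with group_by(), across(), statistical functions, and user-defined functions. In the new version a summary expression may return several values per group, and an extra column (here the quantile levels q) can be created alongside them, which makes the result easier to read.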

mtcars %>%
  group_by(carb) %>%
  summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75)),
            q = c(0.25, 0.50, 0.75))
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 18 x 3
# Groups:   carb [6]
    carb disp_q     q
   <dbl>  <dbl> <dbl>
 1     1   78.8  0.25
 2     1  108    0.5 
 3     1  173.   0.75
 4     2  120.   0.25
 5     2  144.   0.5 
 6     2  314.   0.75
 7     3  276.   0.25
 8     3  276.   0.5 
 9     3  276.   0.75
10     4  168.   0.25
11     4  350.   0.5 
12     4  420    0.75
13     6  145    0.25
14     6  145    0.5 
15     6  145    0.75
16     8  301    0.25
17     8  301    0.5 
18     8  301    0.75


mtcars %>%
  group_by(carb) %>%
  summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75)),
            q = c(0.25, 0.50, 0.75)) %>%
  slice_head()
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 6 x 3
# Groups:   carb [6]
   carb disp_q     q
  <dbl>  <dbl> <dbl>
1     1   78.8  0.25
2     2  120.   0.25
3     3  276.   0.25
4     4  168.   0.25
5     6  145    0.25
6     8  301    0.25


mtcars %>%
  group_by(carb) %>%
  summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75))) %>%
  slice_head()

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
√ ggplot2 3.3.0.9000     √ purrr   0.3.3     
√ tibble  3.0.1          √ dplyr   1.0.0     
√ tidyr   1.0.2          √ stringr 1.4.0     
√ readr   1.3.1          √ forcats 0.4.0     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning message:
package ‘tibble’ was built under R version 3.6.3 
> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
> mtcars %>%
    group_by(card) %>%
    summarise(disp_q  = quantile(disp),
              q = c(0.25, 0.50, 0.75))
Error: Must group by variables found in `.data`.
* Column `card` is not found.
Run `rlang::last_error()` to see where the error occurred.
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp),
              q = c(0.25, 0.50, 0.75))
Error: Problem with `summarise()` input `q`.
x Input `q` must be size 5 or 1, not 3.
i Input `q` is `c(0.25, 0.5, 0.75)`.
i An earlier column had size 5.
i The error occured in group 1: carb = 1.
Run `rlang::last_error()` to see where the error occurred.
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75)),
              q = c(0.25, 0.50, 0.75))
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 18 x 3
# Groups:   carb [6]
    carb disp_q     q
   <dbl>  <dbl> <dbl>
 1     1   78.8  0.25
 2     1  108    0.5 
 3     1  173.   0.75
 4     2  120.   0.25
 5     2  144.   0.5 
 6     2  314.   0.75
 7     3  276.   0.25
 8     3  276.   0.5 
 9     3  276.   0.75
10     4  168.   0.25
11     4  350.   0.5 
12     4  420    0.75
13     6  145    0.25
14     6  145    0.5 
15     6  145    0.75
16     8  301    0.25
17     8  301    0.5 
18     8  301    0.75
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75))) %>%
    slice(5)
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 0 x 2
# Groups:   carb [0]
# ... with 2 variables: carb <dbl>, disp_q <dbl>
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75))) %>%
    slice()
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 18 x 2
# Groups:   carb [6]
    carb disp_q
   <dbl>  <dbl>
 1     1   78.8
 2     1  108  
 3     1  173. 
 4     2  120. 
 5     2  144. 
 6     2  314. 
 7     3  276. 
 8     3  276. 
 9     3  276. 
10     4  168. 
11     4  350. 
12     4  420  
13     6  145  
14     6  145  
15     6  145  
16     8  301  
17     8  301  
18     8  301  
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75))) %>%
    slice_head()
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 6 x 2
# Groups:   carb [6]
   carb disp_q
  <dbl>  <dbl>
1     1   78.8
2     2  120. 
3     3  276. 
4     4  168. 
5     6  145  
6     8  301  
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75)),
              q = c(0.25, 0.50, 0.75)) %>%
    slice_head()
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 6 x 3
# Groups:   carb [6]
   carb disp_q     q
  <dbl>  <dbl> <dbl>
1     1   78.8  0.25
2     2  120.   0.25
3     3  276.   0.25
4     4  168.   0.25
5     6  145    0.25
6     8  301    0.25
> mtcars %>%
    group_by(carb) %>%
    summarise(disp_q  = quantile(disp, c(0.25, 0.50, 0.75))) %>%
    slice_head()
`summarise()` regrouping output by 'carb' (override with `.groups` argument)
# A tibble: 6 x 2
# Groups:   carb [6]
   carb disp_q
  <dbl>  <dbl>
1     1   78.8
2     2  120. 
3     3  276. 
4     4  168. 
5     6  145  
6     8  301
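
The messages above mention a `.groups` argument. As a hedged sketch of what it controls (grouping by cyl and gear here is my own choice for illustration, not from the session above): with two grouping variables the default drops only the last one, while `.groups = "drop"` returns an ungrouped result.

```r
library(dplyr)

# Default behaviour: the last grouping variable (gear) is dropped,
# so the result is still grouped by cyl.
default <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_disp = mean(disp))

# Explicitly ask for an ungrouped result.
dropped <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_disp = mean(disp), .groups = "drop")

group_vars(default)
group_vars(dropped)
```

Setting `.groups` explicitly also silences the informational message.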

arrange

df %>% arrange(col1, col2): ascending order by default
df %>% arrange(desc(col1)): desc() sorts in descending order
df %>% arrange(col1 - col2): sort by a computed expression
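
A minimal sketch of the three forms, using a made-up tibble (values chosen only so the ordering is easy to check by hand):

```r
library(dplyr)

# Hypothetical data for illustration.
df <- tibble(col1 = c(2, 1, 2), col2 = c(5, 9, 1))

ascending  <- df %>% arrange(col1, col2)    # ties in col1 are broken by col2
descending <- df %>% arrange(desc(col1))    # largest col1 first
by_expr    <- df %>% arrange(col1 - col2)   # differences are -3, -8, 1

by_expr
```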

*_join()

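inner_join(): inner join; keeps only the observations that match in both tables; by names the key columns shared by the two tables
left_join(): left join; keeps all observations in x
full_join(): full join; keeps all observations in x and y
right_join(): right join; keeps all observations in y
semi_join(x, y): keeps all observations in x that have a match in y
anti_join(x, y): drops all observations in x that have a match in y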
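
The differences between the joins show up in which keys survive. A small sketch on two made-up tables x and y (the key and val_* columns are assumptions for illustration):

```r
library(dplyr)

# Hypothetical tables sharing the key column "key".
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "key")   # keys present in both: 1, 2
left  <- left_join(x, y, by = "key")    # all keys of x: 1, 2, 3
right <- right_join(x, y, by = "key")   # all keys of y: 1, 2, 4
full  <- full_join(x, y, by = "key")    # all keys: 1, 2, 3, 4
semi  <- semi_join(x, y, by = "key")    # rows of x with a match; only x's columns
anti  <- anti_join(x, y, by = "key")    # rows of x without a match

full
```

semi_join() and anti_join() only filter x; they never add columns from y.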

relocate()

df3 %>% relocate(y, z): move columns y and z to the front
df3 %>% relocate(where(is.character)): move all character columns to the front
df3 %>% relocate(w, .after = y): move column w to just after column y
df3 %>% relocate(w, .before = y): move column w to just before column y
df3 %>% relocate(w, .after = last_col()): move column w to the end
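
These calls can be checked on a tiny tibble. The df3 below is an assumed example (two numeric columns w and x, two character columns y and z), since the post does not show how df3 was built:

```r
library(dplyr)

# Assumed definition of df3 for illustration.
df3 <- tibble(w = 0, x = 1, y = "a", z = "b")

to_front    <- df3 %>% relocate(y, z)
chr_first   <- df3 %>% relocate(where(is.character))
after_y     <- df3 %>% relocate(w, .after = y)
before_y    <- df3 %>% relocate(w, .before = y)
to_the_end  <- df3 %>% relocate(w, .after = last_col())

names(after_y)
```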

slice()

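top_n(), sample_n() and sample_frac() have been superseded by the new slice_*() helpers

slice_head(): returns only the first row by default; on grouped data, the first row of each group
- df %>% slice_head(prop = 0.1): the first 10% of the rows
- df %>% slice_head(n = 10): the first 10 rows
slice_tail(): returns only the last row by default; other arguments as in slice_head()
slice_sample(): returns one randomly chosen row by default
slice_min(): the rows with the smallest values of a variable
slice_max(): the rows with the largest values of a variable
slice(): selects rows by position

slice_head() and slice_sample() accept the arguments n = and prop =, where n is the number of rows and prop is the proportion of rows to return. For slice_sample() these correspond to sample_n() and sample_frac().

top_n() is superseded by slice_min() and slice_max()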
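
A compact sketch of the slice_*() family on a made-up six-row tibble (the data are an assumption for illustration; the random samples are only checked for size):

```r
library(dplyr)

# Hypothetical data: two groups of three rows each.
df <- tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  v = c(3, 1, 6, 5, 2, 4)
)

first_two <- df %>% slice_head(n = 2)           # first two rows
last_half <- df %>% slice_tail(prop = 0.5)      # last 50% of the rows
largest   <- df %>% slice_max(v, n = 1)         # in place of top_n(1, v)
smallest_per_group <- df %>%
  group_by(g) %>%
  slice_min(v, n = 1)                           # one row per group
random_three <- df %>% slice_sample(n = 3)      # in place of sample_n(3)
random_half  <- df %>% slice_sample(prop = 0.5) # in place of sample_frac(0.5)

smallest_per_group
```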

across

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

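The first argument, .cols, selects the columns to operate on (as in select()); columns can be chosen by position, name, or data type.
The second argument, .fns, is the function (or functions) to apply to each column; it can be a purrr-style formula such as ~ .x/2

Why should we use across() more?

across() makes it easy to apply several operations to columns at once
across() reduces the number of functions dplyr needs to provide, which makes dplyr easier to use and easier to learn
across() unifies what the _if and _at variants did, so columns can be chosen by position, name, or data type
across() does not need vars(); the _at() functions were the only place in dplyr where variable names had to be quoted by hand.

Note: across() cannot be used with select() or rename(), because they already use selection syntax; to change column names with a function, use rename_with()

This is the most important function in this release. All the *_if(), *_at() and *_all() variants are superseded by across(), which makes applying the same operation to many columns more convenient.

How to convert existing code based on the _at, _if and _all suffixes to across()

Drop the _at, _if or _all suffix
Wrap the column selection in across()
- for the _if() variants, wrap the predicate in where()
- for the _at() variants, simply drop vars()
- for the _all() variants, use everything()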
across() with other functions

across() with mutate()

df %>% mutate_if(is.numeric, log)
df %>% mutate(across(where(is.numeric), log))

rescale01 <- function(x){
  rng <- range(x, na.rm = T)
  (x - rng[1])/(rng[2] - rng[1])
}

df <- tibble(x = 1:4, y = rnorm(4))

df %>%
  mutate(across(where(is.numeric), rescale01))
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1 0     0    
## 2 0.333 0.291
## 3 0.667 0.207
## 4 1     1

across(where()) with summarise()

# For each character column, count the number of distinct values
starwars %>%
  summarise(across(where(is.character), ~length(unique(.x))))

# Take the mean of the numeric columns
starwars %>%
  group_by(homeworld) %>%
  filter(n() > 1) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

across(everything()) replaces mutate_all()
across() with count()

starwars %>%
  count(across(contains("color")), sort = TRUE)

across() with distinct()

starwars %>%
  distinct(across(contains("color")))

across() with filter()

# Keep the rows that have no missing values (NA) in any column
starwars %>%
  filter(across(everything(), ~ !is.na(.x)))

Applying several functions to columns at once with across()

min_max <- list(
  min = ~min(.x, na.rm = TRUE),
  max = ~max(.x, na.rm = TRUE)
)

starwars %>%
  summarise(across(where(is.numeric), min_max))


# How do we control the names of the output columns?
# The .names argument takes a glue specification
# {fn} is the name of the function, {col} is the name of the column
starwars %>%
  summarise(across(where(is.numeric), min_max, .names = "{fn}.{col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

# To keep the results of the same function next to each other, split the functions into separate across() calls
# The result is odd, though: the second across() also picks up the min.* columns that the first one just created.
starwars %>%
  summarise(across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min.{col}"),
            across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max.{col}"))
## # A tibble: 1 x 9
##   min.height min.mass min.birth_year max.height max.mass max.birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896
## # ... with 3 more variables: max.min.height <int>, max.min.mass <dbl>,
## #   max.min.birth_year <dbl>

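In short, this is a very important function, but watch out for the following case:

When across() is used inside summarise(), it also sees columns computed earlier in the same call, such as n = n(), and overwrites the n() result.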

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
df %>%
  summarise(n = n(), across(where(is.numeric), sd))
##    n x        y
## 1 NA 1 4.041452

# Here n comes out as NA: n is numeric, so the across() that follows also computes its sd, and the sd of the single value 3 is NA. To avoid this, put the n() summary after across()
df %>%
  summarise(across(where(is.numeric), sd), n = n())
##   x        y n
## 1 1 4.041452 3

# Another option is to exclude n inside across() by adding the condition !n
df %>%
  summarise(n = n(), across(where(is.numeric) & !n, sd))
##   n x        y
## 1 3 1 4.041452

rowwise()

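dplyr, like R in general, mostly operates on columns, and working row by row is harder. rowwise() handles data row by row and is often used together with c_across().

This section covers three common cases:

Row-level computations (for example, the mean of x, y and z)
Calling the same function with different arguments
Working with list-columns

Of course these problems can be solved with for loops and the like, but a pipeline makes them more convenient. The author quotes a classic line here: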

Of course, someone has to write loops. It doesn’t have to be you. — Jenny Bryan

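rowwise() groups the data by row. Like group_by(), it does not change the contents of the data; it only sets up the grouping:

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
# Note the extra marker in the output below: Rowwise
## # A tibble: 2 x 3
## # Rowwise: 
##       x     y     z
##   <int> <int> <int>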
## 1     1     3     5
## 2     2     4     6

# Without rowwise(), this is the mean over all values of x, y and z
df %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5   3.5
## 2     2     4     6   3.5

# The mean of each column
df %>% mutate(across(everything(), ~mean(.x, na.rm = TRUE)))
## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1   1.5   3.5   5.5
## 2   1.5   3.5   5.5

# With rowwise(), this is the mean of each row
df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## # Rowwise: 
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5     3
## 2     2     4     6     4

rowwise() with summarise()

df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)

# The result contains only the computed values
df %>% 
  rowwise() %>% 
  summarise(m = mean(c(x, y, z)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 1
##       m
##   <dbl>
## 1     3
## 2     4


# To keep an identifier column in the summarise() output, pass it to rowwise(): `rowwise(name)` keeps the `name` column
df %>% 
  rowwise(name) %>% 
  summarise(m = mean(c(x, y, z)))
## `summarise()` regrouping output by 'name' (override with `.groups` argument)
## # A tibble: 2 x 2
## # Groups:   name [2]
##   name       m
##   <chr>  <dbl>
## 1 Mara       3
## 2 Hadley     4


df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 x 5
##      id     w     x     y     z
##   <int> <int> <int> <int> <int>
## 1     1    10    20    30    40
## 2     2    11    21    31    41
## 3     3    12    22    32    42
## 4     4    13    23    33    43
## 5     5    14    24    34    44
## 6     6    15    25    35    45
# Use `rowwise()` to group the data by row
rf <- df %>% rowwise(id)

rf %>% mutate(total = sum(c(w, x, y, z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120
rf %>% summarise(total = sum(c(w, x, y, z)))
## `summarise()` regrouping output by 'id' (override with `.groups` argument)
## # A tibble: 6 x 2
## # Groups:   id [6]
##      id total
##   <int> <int>
## 1     1   100
## 2     2   104
## 3     3   108
## 4     4   112
## 5     5   116
## 6     6   120

c_across

Usually paired with rowwise(); the row-wise counterpart of across()

rf <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45) %>% rowwise(id)

rf %>% mutate(total = sum(c_across(w:z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120
rf %>% mutate(total = sum(c_across(where(is.numeric))))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

rowwise(), c_across() and across() together

ungroup() removes the grouping; here it removes the row-wise grouping

rf %>% 
  mutate(total = sum(c_across(w:z))) %>% 
  ungroup() %>% 
  mutate(across(w:z, ~ . / total))
## # A tibble: 6 x 6
##      id     w     x     y     z total
##   <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1     1 0.1   0.2   0.3   0.4     100
## 2     2 0.106 0.202 0.298 0.394   104
## 3     3 0.111 0.204 0.296 0.389   108
## 4     4 0.116 0.205 0.295 0.384   112
## 5     5 0.121 0.207 0.293 0.379   116
## 6     6 0.125 0.208 0.292 0.375   120

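Row-wise summary functions: rowSums() and rowMeans()

The built-in row-wise functions are faster because they operate on the rows as a whole, instead of splitting the data into rows, computing on each row, and joining the results back together. Note that where(is.numeric) also selects id here, so id is included in the totals and means below.

df %>% mutate(total = rowSums(across(where(is.numeric))))
## # A tibble: 6 x 6
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <dbl>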
## 1     1    10    20    30    40   101
## 2     2    11    21    31    41   106
## 3     3    12    22    32    42   111
## 4     4    13    23    33    43   116
## 5     5    14    24    34    44   121
## 6     6    15    25    35    45   126

df %>% mutate(mean = rowMeans(across(where(is.numeric))))
## # A tibble: 6 x 6
##      id     w     x     y     z  mean
##   <int> <int> <int> <int> <int> <dbl>
## 1     1    10    20    30    40  20.2
## 2     2    11    21    31    41  21.2
## 3     3    12    22    32    42  22.2
## 4     4    13    23    33    43  23.2
## 5     5    14    24    34    44  24.2
## 6     6    15    25    35    45  25.2

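Repeated function calls: passing arguments row by row

rowwise() is not limited to functions that return a vector of length 1; it works with any function as long as the result is wrapped in a list. This means rowwise() and mutate() give an elegant way to call a function many times with different arguments and store the outputs next to the inputs.

Be sure to wrap the call in list(), for example list(runif(n, min, max)) rather than runif(n, min, max)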

df <- tribble(
  ~ n, ~ min, ~ max,
    1,     0,     1,
    2,    10,   100,
    3,   100,  1000,
)

df %>% 
  rowwise() %>% 
  mutate(data = list(runif(n, min, max)))
## # A tibble: 3 x 4
## # Rowwise: 
##       n   min   max data     
##   <dbl> <dbl> <dbl> <list>   
## 1     1     0     1 <dbl [1]>
## 2     2    10   100 <dbl [2]>
## 3     3   100  1000 <dbl [3]>

All pairwise combinations: expand.grid() (or tidyr::expand_grid())

# This yields 3 * 3 = 9 combinations
df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

df %>% 
  rowwise() %>% 
  mutate(data = list(rnorm(10, mean, sd)))

Varying the function itself: combine with do.call()

df <- tribble(
   ~rng,     ~params,
   "runif",  list(n = 10), 
   "rnorm",  list(n = 20),
   "rpois",  list(n = 10, lambda = 5),
) %>%
  rowwise()

df %>% 
  mutate(data = list(do.call(rng, params)))
## # A tibble: 3 x 3
## # Rowwise: 
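##   rng   params           data      
##   <chr> <list>           <list>    
## 1 runif <named list [1]> <dbl [10]>
## 2 rnorm <named list [1]> <dbl [20]>
## 3 rpois <named list [2]> <int [10]>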

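Most importantly, this can be used for modeling

nest_by() groups the data and stores each group's rows in a list-column

by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
## # A tibble: 3 x 2
## # Rowwise:  cyl
##     cyl                data
##   <dbl> <list<tbl_df[,10]>>
## 1     4           [11 x 10]
## 2     6            [7 x 10]
## 3     8           [14 x 10]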

Fitting a linear model per row

mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
## # A tibble: 3 x 3
## # Rowwise:  cyl
##     cyl                data mod   
##   <dbl> <list<tbl_df[,10]>> <list>
## 1     4           [11 x 10] <lm>  
## 2     6            [7 x 10] <lm>  
## 3     8           [14 x 10] <lm>  
mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
## # A tibble: 3 x 4
## # Rowwise:  cyl
##     cyl                data mod    pred      
##   <dbl> <list<tbl_df[,10]>> <list> <list>    
## 1     4           [11 x 10] <lm>   <dbl [11]>
## 2     6            [7 x 10] <lm>   <dbl [7]> 
## 3     8           [14 x 10] <lm>   <dbl [14]>
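
Once the models sit in a list-column, per-model summaries can be pulled out the same way. The sketch below is one possible follow-up, not part of the original post: the column name rsq and the choice of R-squared as the summary are my own assumptions.

```r
library(dplyr)

# Rebuild the per-cylinder models so the example is self-contained.
mods <- mtcars %>%
  nest_by(cyl) %>%
  mutate(mod = list(lm(mpg ~ wt, data = data)))

# In a row-wise data frame, `mod` refers to the single lm object in each row,
# so ordinary model helpers can be applied to it directly.
fit_quality <- mods %>%
  summarise(rsq = summary(mod)$r.squared, .groups = "drop")

fit_quality
```

The result has one row per value of cyl, with the grouping column kept alongside the summary.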

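Introduction to dplyr

This release also came with substantially updated reference documentation, organised into the parts listed below so that it can be studied systematically (most of the examples in this post come from it).

The Introduction to dplyr is the best place to learn the main features of the package, bar none. The documentation covers:

dplyr equivalents of base R operations

Column-wise operations
Compatibility
dplyr
Grouped data
Programming with dplyr
Row-wise operations
Operations on two tables: the *_join() family (apologies for any imperfect translation)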