Medicine and Bioinformatics Notes

dplyr in R: From Basics to Advanced

Contents

Introduction to dplyr
- Installation
- Dataset: starwars
- Single-table verbs
  - filter(): select rows by condition
  - arrange(): sort rows
  - slice(): select rows by position
  - select(): select columns
  - mutate(): create columns
  - relocate(): reorder columns
  - summarise(): summarise
grouped data
- group_by()
- Inspecting grouping information
  - Adding or changing grouping variables
  - Removing grouping variables
- Combining with other verbs
  - summarise()
  - `select()`/`rename()`/`relocate()`
  - arrange()
  - `mutate()` and `transmute()`
  - filter()
- Computing on grouping information
  - cur_data()
  - cur_group() and cur_group_id()
two-table verbs
- Mutating joins
- Filtering joins
- Set operations
column-wise operations
- Pitfalls
- Using across() with other verbs
- Using with filter()
row-wise operations
- Introduction
- Row-wise summaries
- list columns
  - motivation
  - subsetting
  - modeling
- repeated function calls
  - simulations
  - multiple combinations
  - varying functions
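Introduction to dplyr

The tidyverse is arguably the Swiss Army knife of data analysis in R: a consistent interface, concise code, and pipe-based syntax that reads naturally make it quick to pick up. The book R for Data Science is devoted to this family of packages, but it leaves out the usage details of many functions, so you still run into problems in practice.

Drawing on R for Data Science and the tutorials on the tidyverse website, I have put together a few sets of notes that demonstrate in detail how the tidyverse functions are used.

Earlier notes covered the forcats package for factors and the lubridate package for dates and times.

This one covers the dplyr package.

When working with data, you need to:

- Figure out what you want to do
- Describe the task in the form of a computer program
- Execute the program

dplyr makes these steps fast and easy. The tidyr package focuses on getting data into tidy form, while dplyr focuses on operating on tidy data: adding, filtering, summarising, joining, and so on.

Installation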

install.packages("tidyverse")

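Dataset: starwars

The examples below use the starwars dataset to demonstrate basic dplyr usage.

starwars has 87 rows and 14 columns. It records 14 attributes (name, height, mass, hair color, eye color, species, and so on) of 87 Star Wars characters, including droids and aliens.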

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

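Single-table verbs

This part covers the functions that operate on a single data frame, which are also the most commonly used ones.

By what they act on, they fall roughly into three groups:

Rows
- filter()
- slice()
- arrange()
Columns
- select()
- rename()
- mutate()
- relocate()
Groups of rows
- summarise()

filter(): select rows by condition

filter() keeps the rows that satisfy one or more conditions, each written as a logical expression. For example, to keep the rows whose eye color is brown and whose skin color is light, separate the conditions with commas. Comma-separated conditions are combined with AND, so an explicit & is not needed: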

starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Leia Or~    150    49 brown      light      brown             19 female femin~
## 2 Biggs D~    183    84 black      light      brown             24 male   mascu~
## 3 Cordé       157    NA brown      light      brown             NA female femin~
## 4 Dormé       165    NA brown      light      brown             NA female femin~
## 5 Raymus ~    188    79 brown      light      brown             NA male   mascu~
## 6 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 7 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
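
As a small sketch of how conditions combine (the object names are made up for illustration): comma-separated conditions behave like `&`, whereas OR has to be written out with `|` or `%in%`.

```r
library(dplyr)

# Comma-separated conditions are combined with AND, so these two calls
# return the same rows.
with_comma <- starwars %>% filter(skin_color == "light", eye_color == "brown")
with_and   <- starwars %>% filter(skin_color == "light" & eye_color == "brown")
identical(with_comma, with_and)

# OR must be spelled out, either with | or with %in%.
either_or <- starwars %>% filter(eye_color == "brown" | eye_color == "blue")
either_in <- starwars %>% filter(eye_color %in% c("brown", "blue"))
nrow(either_or) == nrow(either_in)
```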

Note, however, that filter() does not accept row numbers directly. To take rows 1 to 3, for example, the following is an error:

starwars %>% filter(1:3)

Use slice() for this instead:

starwars %>% slice(1:3)
## # A tibble: 3 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue              19 male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow           112 none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red               33 none  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
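
If you do want to stay with filter(), one possible workaround is to turn the position into a logical condition with row_number(). This is only a sketch; the object names are illustrative.

```r
library(dplyr)

by_slice  <- starwars %>% slice(1:3)
# row_number() gives each row's position, so the comparison is a logical vector
by_filter <- starwars %>% filter(row_number() <= 3)

identical(by_slice$name, by_filter$name)
```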

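arrange(): sort rows

arrange() sorts rows by one or more columns; later columns break ties in earlier ones. Here the rows are sorted by height and then by mass: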

starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda         66    17 white      green      brown            896 male  mascu~
##  2 Ratts T~     79    15 none       grey, blue unknown           NA male  mascu~
##  3 Wicket ~     88    20 brown      brown      brown              8 male  mascu~
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu~
##  5 R2-D2        96    32 <NA>       white, bl~ red               33 none  mascu~
##  6 R4-P17       96    NA none       silver, r~ red, blue         NA none  femin~
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu~
##  9 Gasgano     122    NA none       white, bl~ black             NA male  mascu~
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Wrap a column in desc() to sort it in descending order:

starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
##    name    height  mass hair_color skin_color  eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>
##  1 Yarael~    264    NA none       white       yellow          NA   male  mascu~
##  2 Tarfful    234   136 brown      brown       blue            NA   male  mascu~
##  3 Lama Su    229    88 none       grey        black           NA   male  mascu~
##  4 Chewba~    228   112 brown      unknown     blue           200   male  mascu~
##  5 Roos T~    224    82 none       grey        orange          NA   male  mascu~
##  6 Grievo~    216   159 none       brown, whi~ green, y~       NA   male  mascu~
##  7 Taun We    213    NA none       grey        black           NA   fema~ femin~
##  8 Rugor ~    206    NA none       green       orange          NA   male  mascu~
##  9 Tion M~    206    80 none       grey        black           NA   male  mascu~
## 10 Darth ~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

slice(): select rows by position

Select rows 5 to 10:

starwars %>% slice(5:10)
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Or~    150    49 brown       light      brown             19 fema~ femin~
## 2 Owen La~    178   120 brown, grey light      blue              52 male  mascu~
## 3 Beru Wh~    165    75 brown       light      blue              47 fema~ femin~
## 4 R5-D4        97    32 <NA>        white, red red               NA none  mascu~
## 5 Biggs D~    183    84 black       light      brown             24 male  mascu~
## 6 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

slice() is in fact a family of functions with several variants. For example, slice_head() takes the first 4 rows:

starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue            19   male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow         112   none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red             33   none  mascu~
## 4 Darth V~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Take the last 10% of the rows:

starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Sly Moo~    178    48 none       pale       white             NA <NA>   <NA>  
## 2 Tion Me~    206    80 none       grey       black             NA male   mascu~
## 3 Finn         NA    NA black      dark       dark              NA male   mascu~
## 4 Rey          NA    NA brown      light      hazel             NA female femin~
## 5 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 6 BB8          NA    NA none       none       black             NA none   mascu~
## 7 Captain~     NA    NA unknown    unknown    unknown           NA <NA>   <NA>  
## 8 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Randomly sample 10% of the rows:

starwars %>% slice_sample(prop = 0.1) # or request a fixed number of rows, e.g. n = 2
## # A tibble: 8 x 14
##   name    height  mass hair_color skin_color   eye_color birth_year sex   gender
##   <chr>    <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr> 
## 1 Lobot      175    79 none       light        blue              37 male  mascu~
## 2 Zam We~    168    55 blonde     fair, green~ yellow            NA fema~ femin~
## 3 Ric Ol~    183    NA brown      fair         blue              NA <NA>  <NA>  
## 4 R4-P17      96    NA none       silver, red  red, blue         NA none  femin~
## 5 Lando ~    177    79 black      dark         brown             31 male  mascu~
## 6 Greedo     173    74 <NA>       green        black             44 male  mascu~
## 7 Ackbar     180    83 none       brown mottle orange            41 male  mascu~
## 8 Rugor ~    206    NA none       green        orange            NA male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Randomly sample 10 rows with replacement, so the same row can be drawn more than once:

starwars %>% slice_sample(n = 10, replace = TRUE)
## # A tibble: 10 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Jek To~    180   110 brown       fair       blue              NA male  mascu~
##  2 Quarsh~    183    NA black       dark       brown             62 <NA>  <NA>  
##  3 Arvel ~     NA    NA brown       fair       brown             NA male  mascu~
##  4 Darth ~    175    80 none        red        yellow            54 male  mascu~
##  5 Finn        NA    NA black       dark       dark              NA male  mascu~
##  6 Ric Ol~    183    NA brown       fair       blue              NA <NA>  <NA>  
##  7 Mace W~    188    84 none        dark       brown             72 male  mascu~
##  8 Jango ~    183    79 black       tan        brown             66 male  mascu~
##  9 San Hi~    191    NA none        grey       gold              NA male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

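Select the rows with the largest or smallest values in a column. Rows with a missing height are filtered out first here, so NA cannot appear in the result:

starwars %>% filter(!is.na(height)) %>% 
  slice_max(height, n = 5) # height contains no NA at this point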
## # A tibble: 5 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender 
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
## 1 Yarael ~    264    NA none       white      yellow            NA male  mascul~
## 2 Tarfful     234   136 brown      brown      blue              NA male  mascul~
## 3 Lama Su     229    88 none       grey       black             NA male  mascul~
## 4 Chewbac~    228   112 brown      unknown    blue             200 male  mascul~
## 5 Roos Ta~    224    82 none       grey       orange            NA male  mascul~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
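
slice_min() is the mirror image of slice_max(). The sketch below (object names are illustrative) also uses the with_ties argument. As far as I know, missing values are ordered last, so they only show up when n exceeds the number of non-missing values.

```r
library(dplyr)

# The three shortest characters; ties at the cut-off are kept by default
shortest <- starwars %>% slice_min(height, n = 3)
shortest %>% select(name, height)

# with_ties = FALSE returns exactly n rows even if values are tied
exactly_three <- starwars %>% slice_min(height, n = 3, with_ties = FALSE)
nrow(exactly_three)
```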

select(): select columns

Select columns by name; the names do not need quotes:

starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

Select the columns whose names end with color:

starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

Select the columns whose names contain color:

starwars %>% select(contains("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows
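
A few more selection helpers, sketched on the same dataset (the object names are illustrative): a column range, starts_with(), selection by type with where(), and negation with !.

```r
library(dplyr)

# A contiguous range of columns
range_cols <- starwars %>% select(name:mass) %>% names()

# Columns whose names start with "h"
h_cols <- starwars %>% select(starts_with("h")) %>% names()

# Columns selected by type rather than by name
numeric_cols <- starwars %>% select(where(is.numeric)) %>% names()

# Everything except the columns ending in "color"
no_color <- starwars %>% select(!ends_with("color")) %>% names()

range_cols
numeric_cols
```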

Rename a column with rename(), written as new_name = old_name:

starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: home_world <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

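mutate(): create columns

mutate() adds new columns computed from existing ones. Here .before = 1 places the new column first: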

starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

The following gives the same result, again placing the new column first:

starwars %>% 
  mutate(height_m = height/100) %>% 
  select(height_m, everything())
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

A newly created column can be used right away within the same mutate() call:

starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
## # A tibble: 87 x 16
##      BMI name     height  mass hair_color  skin_color eye_color birth_year sex  
##    <dbl> <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr>
##  1  26.0 Luke Sk~    172    77 blond       fair       blue            19   male 
##  2  26.9 C-3PO       167    75 <NA>        gold       yellow         112   none 
##  3  34.7 R2-D2        96    32 <NA>        white, bl~ red             33   none 
##  4  33.3 Darth V~    202   136 none        white      yellow          41.9 male 
##  5  21.8 Leia Or~    150    49 brown       light      brown           19   fema~
##  6  37.9 Owen La~    178   120 brown, grey light      blue            52   male 
##  7  27.5 Beru Wh~    165    75 brown       light      blue            47   fema~
##  8  34.0 R5-D4        97    32 <NA>        white, red red             NA   none 
##  9  25.1 Biggs D~    183    84 black       light      brown           24   male 
## 10  23.2 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   height_m <dbl>

To keep only the newly created columns and drop all the others, use transmute():

starwars %>%
  transmute(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  )
## # A tibble: 87 x 2
##    height_m   BMI
##       <dbl> <dbl>
##  1     1.72  26.0
##  2     1.67  26.9
##  3     0.96  34.7
##  4     2.02  33.3
##  5     1.5   21.8
##  6     1.78  37.9
##  7     1.65  27.5
##  8     0.97  34.0
##  9     1.83  25.1
## 10     1.82  23.2
## # ... with 77 more rows

relocate(): reorder columns

The .before and .after arguments control where the selected columns are moved to:

starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
##    name     sex    gender homeworld height  mass hair_color skin_color eye_color
##    <chr>    <chr>  <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Sk~ male   mascu~ Tatooine     172    77 blond      fair       blue     
##  2 C-3PO    none   mascu~ Tatooine     167    75 <NA>       gold       yellow   
##  3 R2-D2    none   mascu~ Naboo         96    32 <NA>       white, bl~ red      
##  4 Darth V~ male   mascu~ Tatooine     202   136 none       white      yellow   
##  5 Leia Or~ female femin~ Alderaan     150    49 brown      light      brown    
##  6 Owen La~ male   mascu~ Tatooine     178   120 brown, gr~ light      blue     
##  7 Beru Wh~ female femin~ Tatooine     165    75 brown      light      blue     
##  8 R5-D4    none   mascu~ Tatooine      97    32 <NA>       white, red red      
##  9 Biggs D~ male   mascu~ Tatooine     183    84 black      light      brown    
## 10 Obi-Wan~ male   mascu~ Stewjon      182    77 auburn, w~ fair       blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
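
A short sketch of the other placement options (object names are illustrative): .after = name ends up with the same column order as .before = height, and with neither argument the selected columns move to the front.

```r
library(dplyr)

before_height <- starwars %>% relocate(sex:homeworld, .before = height)
after_name    <- starwars %>% relocate(sex:homeworld, .after = name)
identical(names(before_height), names(after_name))

# With neither .before nor .after, the selected columns go first
species_first <- starwars %>% relocate(species)
head(names(species_first), 3)
```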

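summarise(): summarise

summarise() collapses a data frame into a single row of summary statistics. It is most useful in combination with group_by().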

starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
##   height
##    <dbl>
## 1   174.
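
Several statistics can be computed in one call. A minimal sketch follows; the column names n, mean_height and max_mass are chosen here for illustration.

```r
library(dplyr)

overall <- starwars %>%
  summarise(
    n = n(),                                   # number of rows
    mean_height = mean(height, na.rm = TRUE),  # ignore missing heights
    max_mass = max(mass, na.rm = TRUE)         # ignore missing masses
  )
overall
```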

So far this has been an overview of dplyr and its most common operations. The following sections cover more functionality, organised by use case.

grouped data

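Real-world analyses very often call for grouped summaries. A single overall summary is of limited value; differences, and with them the value of the data, only become visible after grouping.

dplyr provides group_by() for this. It splits the data into groups, and the verbs applied afterwards operate group by group, which makes them considerably more powerful.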

group_by()

First create two grouped data frames for the demonstrations, again using the starwars dataset.

by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)

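Printing the two objects shows that the data are the same as in the original dataset, but each is now grouped (see the Groups line in the header):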

by_species
## # A tibble: 87 x 14
## # Groups:   species [38]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
by_sex_gender
## # A tibble: 87 x 14
## # Groups:   sex, gender [6]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Use tally() to count the rows in each group:

by_species %>% tally(sort = T)
## # A tibble: 38 x 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # ... with 28 more rows

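The following produces the same counts; the only difference is that the result is ordered by group rather than sorted by count: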

by_species %>% summarise(n=n())
## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ... with 28 more rows
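
dplyr also offers count(), which groups and tallies in one step. A sketch comparing the two routes (object names are illustrative):

```r
library(dplyr)

via_tally <- starwars %>% group_by(species) %>% tally(sort = TRUE)
# count() does the grouping and the counting in a single call
via_count <- starwars %>% count(species, sort = TRUE)

identical(via_tally$species, via_count$species)
identical(via_tally$n, via_count$n)
```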

Besides grouping by existing variables, you can also group by a function of existing variables. This works like calling mutate() first and then group_by():

bmi_breaks <- c(0,18.5,25,30,Inf)

starwars %>% 
  group_by(bmi_cat = cut(mass/(height/100)^2,breaks = bmi_breaks)) %>% 
  tally(sort = T)
## # A tibble: 5 x 2
##   bmi_cat       n
##   <fct>     <int>
## 1 <NA>         28
## 2 (18.5,25]    24
## 3 (25,30]      13
## 4 (30,Inf]     12
## 5 (0,18.5]     10

Pretty neat, isn't it?
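
To make the equivalence concrete, here is a sketch that computes the BMI categories both ways and compares the results (the object names direct and two_step are illustrative):

```r
library(dplyr)

bmi_breaks <- c(0, 18.5, 25, 30, Inf)

# Grouping by an expression directly
direct <- starwars %>%
  group_by(bmi_cat = cut(mass / (height / 100)^2, breaks = bmi_breaks)) %>%
  tally(sort = TRUE)

# The same thing in two steps: mutate() first, then group_by()
two_step <- starwars %>%
  mutate(bmi_cat = cut(mass / (height / 100)^2, breaks = bmi_breaks)) %>%
  group_by(bmi_cat) %>%
  tally(sort = TRUE)

identical(direct$n, two_step$n)
```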

Inspecting grouping information

group_keys() shows the distinct values of the grouping variables, one row per group. species has 38 groups:

by_species %>% group_keys() 
## # A tibble: 38 x 1
##    species  
##    <chr>    
##  1 Aleena   
##  2 Besalisk 
##  3 Cerean   
##  4 Chagrian 
##  5 Clawdite 
##  6 Droid    
##  7 Dug      
##  8 Ewok     
##  9 Geonosian
## 10 Gungan   
## # ... with 28 more rows

by_sex_gender %>% group_keys()
## # A tibble: 6 x 2
##   sex            gender   
##   <chr>          <chr>    
## 1 female         feminine 
## 2 hermaphroditic masculine
## 3 male           masculine
## 4 none           feminine 
## 5 none           masculine
## 6 <NA>           <NA>

group_indices() shows which group each row belongs to:

by_species %>% group_indices() # which group each row belongs to
##  [1] 11  6  6 11 11 11 11  6 11 11 11 11 34 11 24 12 11 11 36 11 11  6 31 11 11
## [26] 18 11 11  8 26 11 21 11 10 10 10 38 30  7 38 11 37 32 32 33 35 29 11  3 20
## [51] 37 27 13 23 16  4 11 11 11  9 17 17 11 11 11 11  5  2 15 15 11  1  6 25 19
## [76] 28 14 34 11 38 22 11 11 11  6 38 11

group_rows() shows which rows each group contains:

by_species %>% group_rows() # which rows each group contains
## <list_of<integer>[38]>
## [[1]]
## [1] 72
## 
## [[2]]
## [1] 68
## 
## [[3]]
## [1] 49
## 
## [[4]]
## [1] 56
## 
## [[5]]
## [1] 67
## 
## [[6]]
## [1]  2  3  8 22 73 85
## 
## [[7]]
## [1] 39
## 
## [[8]]
## [1] 29
## 
## [[9]]
## [1] 60
## 
## [[10]]
## [1] 34 35 36
## 
## [[11]]
##  [1]  1  4  5  6  7  9 10 11 12 14 17 18 20 21 24 25 27 28 31 33 41 48 57 58 59
## [26] 63 64 65 66 71 79 82 83 84 87
## 
## [[12]]
## [1] 16
## 
## [[13]]
## [1] 53
## 
## [[14]]
## [1] 77
## 
## [[15]]
## [1] 69 70
## 
## [[16]]
## [1] 55
## 
## [[17]]
## [1] 61 62
## 
## [[18]]
## [1] 26
## 
## [[19]]
## [1] 75
## 
## [[20]]
## [1] 50
## 
## [[21]]
## [1] 32
## 
## [[22]]
## [1] 81
## 
## [[23]]
## [1] 54
## 
## [[24]]
## [1] 15
## 
## [[25]]
## [1] 74
## 
## [[26]]
## [1] 30
## 
## [[27]]
## [1] 52
## 
## [[28]]
## [1] 76
## 
## [[29]]
## [1] 47
## 
## [[30]]
## [1] 38
## 
## [[31]]
## [1] 23
## 
## [[32]]
## [1] 43 44
## 
## [[33]]
## [1] 45
## 
## [[34]]
## [1] 13 78
## 
## [[35]]
## [1] 46
## 
## [[36]]
## [1] 19
## 
## [[37]]
## [1] 42 51
## 
## [[38]]
## [1] 37 40 80 86

group_vars() returns the names of the grouping variables:

by_sex_gender %>% group_vars() # the name of the grouping variable
## [1] "sex"    "gender"

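Adding or changing grouping variables

Applying group_by() to an already grouped data frame replaces the existing grouping. Below, by_species is already grouped by species; grouping it again by homeworld leaves homeworld as the only grouping variable: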

by_species %>% 
  group_by(homeworld) %>% 
  tally()
## # A tibble: 49 x 2
##    homeworld          n
##    <chr>          <int>
##  1 Alderaan           3
##  2 Aleen Minor        1
##  3 Bespin             1
##  4 Bestine IV         1
##  5 Cato Neimoidia     1
##  6 Cerea              1
##  7 Champala           1
##  8 Chandrila          1
##  9 Concord Dawn       1
## 10 Corellia           2
## # ... with 39 more rows

This is easy to overlook.

Setting .add = TRUE adds to the existing grouping instead of replacing it:

by_species %>% 
  group_by(homeworld, .add = T) %>% 
  tally()
## # A tibble: 58 x 3
## # Groups:   species [38]
##    species  homeworld       n
##    <chr>    <chr>       <int>
##  1 Aleena   Aleen Minor     1
##  2 Besalisk Ojom            1
##  3 Cerean   Cerea           1
##  4 Chagrian Champala        1
##  5 Clawdite Zolan           1
##  6 Droid    Naboo           1
##  7 Droid    Tatooine        2
##  8 Droid    <NA>            3
##  9 Dug      Malastare       1
## 10 Ewok     Endor           1
## # ... with 48 more rows

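Removing grouping variables

Grouped data stay grouped until you ungroup them, and every later operation is carried out per group. So remember to call ungroup() once the grouped computation is done. ungroup() with no arguments removes all grouping; naming a variable removes only that one: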

by_species %>% 
  ungroup() %>% 
  tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    87

by_sex_gender %>% 
  ungroup(sex) %>% 
  tally()
## # A tibble: 3 x 2
##   gender        n
##   <chr>     <int>
## 1 feminine     17
## 2 masculine    66
## 3 <NA>          4
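
Why this matters can be seen in a small sketch (the object and column names are illustrative): the same mutate() call gives one mean per species on grouped data, but a single overall mean after ungroup().

```r
library(dplyr)

by_species <- starwars %>% group_by(species)

# Still grouped: the mean is computed within each species
grouped <- by_species %>%
  mutate(mean_height = mean(height, na.rm = TRUE))

# Ungrouped: one overall mean, repeated on every row
ungrouped <- by_species %>%
  ungroup() %>%
  mutate(mean_height = mean(height, na.rm = TRUE))

n_distinct(grouped$mean_height)
n_distinct(ungrouped$mean_height)
```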

Combining with other verbs

This part shows how group_by() interacts with the other verbs:

summarise()

by_species %>%
  summarise(
    n = n(),
    height = mean(height, na.rm = TRUE)
  )
## # A tibble: 38 x 3
##    species       n height
##    <chr>     <int>  <dbl>
##  1 Aleena        1    79 
##  2 Besalisk      1   198 
##  3 Cerean        1   198 
##  4 Chagrian      1   196 
##  5 Clawdite      1   168 
##  6 Droid         6   131.
##  7 Dug           1   112 
##  8 Ewok          1    88 
##  9 Geonosian     1   183 
## 10 Gungan        3   209.
## # ... with 28 more rows

control the grouping variables

The .groups argument controls how the result is grouped:

by_sex_gender %>% 
  summarise(n = n()) %>% 
  group_vars()
## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.
## [1] "sex"

# group by sex only (the last grouping variable is dropped)
by_sex_gender %>% 
  summarise(n = n(), .groups = "drop_last") %>% 
  group_vars()
## [1] "sex"

by_sex_gender %>% 
  summarise(n = n(), .groups = "keep") %>% 
  group_vars()
## [1] "sex"    "gender"

# no grouping at all
by_sex_gender %>% 
  summarise(n = n(), .groups = "drop") %>% 
  group_vars()
## character(0)

`select()`/`rename()`/`relocate()`

by_species %>% select(mass) # grouped by species
## Adding missing grouping variables: `species`
## # A tibble: 87 x 2
## # Groups:   species [38]
##    species  mass
##    <chr>   <dbl>
##  1 Human      77
##  2 Droid      75
##  3 Droid      32
##  4 Human     136
##  5 Human      49
##  6 Human     120
##  7 Human      75
##  8 Droid      32
##  9 Human      84
## 10 Human      77
## # ... with 77 more rows

by_species %>% 
  ungroup() %>% 
  select(mass)
## # A tibble: 87 x 1
##     mass
##    <dbl>
##  1    77
##  2    75
##  3    32
##  4   136
##  5    49
##  6   120
##  7    75
##  8    32
##  9    84
## 10    77
## # ... with 77 more rows

arrange()

by_species %>% 
  arrange(desc(mass)) %>% 
  relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex  
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Hutt      1358 Jabba~    175 <NA>       green-tan~ orange         600   herm~
##  2 Kaleesh    159 Griev~    216 none       brown, wh~ green, y~       NA   male 
##  3 Droid      140 IG-88     200 none       metal      red             15   none 
##  4 Human      136 Darth~    202 none       white      yellow          41.9 male 
##  5 Wookiee    136 Tarff~    234 brown      brown      blue            NA   male 
##  6 Human      120 Owen ~    178 brown, gr~ light      blue            52   male 
##  7 Trandos~   113 Bossk     190 none       green      red             53   male 
##  8 Wookiee    112 Chewb~    228 brown      unknown    blue           200   male 
##  9 Human      110 Jek T~    180 brown      fair       blue            NA   male 
## 10 Besalisk   102 Dexte~    198 none       brown      yellow          NA   male 
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

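The .by_group argument makes arrange() sort by the grouping variables first. The example below sorts by species and then, within each species, by descending mass, which differs from the result above: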

by_species %>% 
  arrange(desc(mass), .by_group = T) %>% 
  relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex  
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Aleena      15 Ratts~     79 none       grey, blue unknown           NA male 
##  2 Besalisk   102 Dexte~    198 none       brown      yellow            NA male 
##  3 Cerean      82 Ki-Ad~    198 white      pale       yellow            92 male 
##  4 Chagrian    NA Mas A~    196 none       blue       blue              NA male 
##  5 Clawdite    55 Zam W~    168 blonde     fair, gre~ yellow            NA fema~
##  6 Droid      140 IG-88     200 none       metal      red               15 none 
##  7 Droid       75 C-3PO     167 <NA>       gold       yellow           112 none 
##  8 Droid       32 R2-D2      96 <NA>       white, bl~ red               33 none 
##  9 Droid       32 R5-D4      97 <NA>       white, red red               NA none 
## 10 Droid       NA R4-P17     96 none       silver, r~ red, blue         NA none 
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

`mutate()` and `transmute()`

starwars %>% 
  select(name, homeworld, mass) %>% 
  group_by(homeworld) %>% 
  mutate(means = mean(mass, na.rm = T), 
         standard_mass = mass - mean(mass, na.rm = T))
## # A tibble: 87 x 5
## # Groups:   homeworld [49]
##    name               homeworld  mass means standard_mass
##    <chr>              <chr>     <dbl> <dbl>         <dbl>
##  1 Luke Skywalker     Tatooine     77  85.4         -8.38
##  2 C-3PO              Tatooine     75  85.4        -10.4 
##  3 R2-D2              Naboo        32  64.2        -32.2 
##  4 Darth Vader        Tatooine    136  85.4         50.6 
##  5 Leia Organa        Alderaan     49  64          -15   
##  6 Owen Lars          Tatooine    120  85.4         34.6 
##  7 Beru Whitesun lars Tatooine     75  85.4        -10.4 
##  8 R5-D4              Tatooine     32  85.4        -53.4 
##  9 Biggs Darklighter  Tatooine     84  85.4         -1.38
## 10 Obi-Wan Kenobi     Stewjon      77  77            0   
## # ... with 77 more rows

min_rank() returns the rank of each value:

# Overall rank
starwars %>% 
  select(name, homeworld, height) %>% 
  mutate(rank = min_rank(height))
## # A tibble: 87 x 4
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172    29
##  2 C-3PO              Tatooine     167    21
##  3 R2-D2              Naboo         96     5
##  4 Darth Vader        Tatooine     202    72
##  5 Leia Organa        Alderaan     150    11
##  6 Owen Lars          Tatooine     178    35
##  7 Beru Whitesun lars Tatooine     165    17
##  8 R5-D4              Tatooine      97     7
##  9 Biggs Darklighter  Tatooine     183    45
## 10 Obi-Wan Kenobi     Stewjon      182    44
## # ... with 77 more rows

Group by homeworld first, then create the new column, so that the rank is computed within each homeworld:

# Rank per homeworld
starwars %>% 
  select(name, homeworld, height) %>% 
  group_by(homeworld) %>% 
  mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## # Groups:   homeworld [49]
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172     5
##  2 C-3PO              Tatooine     167     4
##  3 R2-D2              Naboo         96     1
##  4 Darth Vader        Tatooine     202    10
##  5 Leia Organa        Alderaan     150     1
##  6 Owen Lars          Tatooine     178     6
##  7 Beru Whitesun lars Tatooine     165     3
##  8 R5-D4              Tatooine      97     1
##  9 Biggs Darklighter  Tatooine     183     7
## 10 Obi-Wan Kenobi     Stewjon      182     1
## # ... with 77 more rows
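How min_rank() treats ties and missing values is not visible in the output above. The following sketch uses a small made-up vector (heights, hypothetical values) to show it, together with desc() for ranking from largest to smallest and dense_rank() for ranks without gaps.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical heights with one tie and one missing value
heights <- c(180, 165, 180, NA, 172)

asc        <- min_rank(heights)          # ties share the smallest rank, NA stays NA
desc_rank  <- min_rank(desc(heights))    # rank from tallest to shortest
desc_dense <- dense_rank(desc(heights))  # same, but no gap after the tie

asc         # 3 1 3 NA 2
desc_rank   # 1 4 1 NA 3
desc_dense  # 1 3 1 NA 2
```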

filter()

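Keep the tallest (height) character of each species. Groups whose maximum height is NA drop out, which is why only 35 groups remain: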

by_species %>%
  select(name, species, height) %>% 
  filter(height == max(height))
## # A tibble: 35 x 3
## # Groups:   species [35]
##    name                  species        height
##    <chr>                 <chr>           <int>
##  1 Greedo                Rodian            173
##  2 Jabba Desilijic Tiure Hutt              175
##  3 Yoda                  Yoda's species     66
##  4 Bossk                 Trandoshan        190
##  5 Ackbar                Mon Calamari      180
##  6 Wicket Systri Warrick Ewok               88
##  7 Nien Nunb             Sullustan         160
##  8 Nute Gunray           Neimodian         191
##  9 Roos Tarpals          Gungan            224
## 10 Watto                 Toydarian         137
## # ... with 25 more rows

Drop the species that have only one member:

by_species %>%
  filter(n() != 1) %>% 
  tally()
## # A tibble: 9 x 2
##   species      n
##   <chr>    <int>
## 1 Droid        6
## 2 Gungan       3
## 3 Human       35
## 4 Kaminoan     2
## 5 Mirialan     2
## 6 Twi'lek      2
## 7 Wookiee      2
## 8 Zabrak       2
## 9 <NA>         4

Computing on grouping information

Inside dplyr verbs, the family of functions with the cur_ prefix gives access to properties of the current group.

cur_data()

cur_data() returns the current group, excluding grouping variables. It’s useful to feed to functions that take a whole data frame. For example, the following code fits a linear model of mass ~ height to each species (in dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick()):

by_species %>%
  filter(n() > 1) %>% 
  mutate(mod = list(lm(mass ~ height, data = cur_data())))
## # A tibble: 58 x 15
## # Groups:   species [9]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 48 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, mod <list>

cur_group() and cur_group_id()

by_species %>%
  arrange(species) %>% 
  select(name, species, homeworld) %>% 
  mutate(id = cur_group_id())
## # A tibble: 87 x 4
## # Groups:   species [38]
##    name            species  homeworld      id
##    <chr>           <chr>    <chr>       <int>
##  1 Ratts Tyerell   Aleena   Aleen Minor     1
##  2 Dexter Jettster Besalisk Ojom            2
##  3 Ki-Adi-Mundi    Cerean   Cerea           3
##  4 Mas Amedda      Chagrian Champala        4
##  5 Zam Wesell      Clawdite Zolan           5
##  6 C-3PO           Droid    Tatooine        6
##  7 R2-D2           Droid    Naboo           6
##  8 R5-D4           Droid    Tatooine        6
##  9 IG-88           Droid    <NA>            6
## 10 R4-P17          Droid    <NA>            6
## # ... with 77 more rows
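The example above only uses cur_group_id(). As a small sketch on made-up data (the toy tibble and its values are hypothetical), cur_group() can be used alongside it: it returns a one-row tibble holding the grouping keys of the current group.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data
toy <- tibble(
  species = c("Droid", "Human", "Droid", "Human", "Wookiee"),
  height  = c(167, 172, 96, 150, 228)
)

res <- toy %>%
  group_by(species) %>%
  mutate(
    id   = cur_group_id(),        # integer id of the current group
    key  = cur_group()$species,   # cur_group() is a one-row tibble of the keys
    size = n()                    # number of rows in the current group
  ) %>%
  ungroup()

res$id    # 1 2 1 2 3
res$size  # 2 2 2 2 1
```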

two-table verbs

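For a very detailed introduction, see the book R for Data Science.

These are the functions for working with two data sets at once. They are used to:

Create new variables from the variables of another table
Filter one table based on another table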

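Mutating joins

Inner join
- inner_join(): keeps only the observations whose keys match in both x and y.
Outer joins
- Left join, left_join(): keeps all observations in x.
- Right join, right_join(): keeps all observations in y.
- Full join, full_join(): keeps all observations in x and y.

Filtering joins

semi_join(x, y): keeps all observations in x that have a match in y
anti_join(x, y): drops all observations in x that have a match in y

Set operations

intersect(x, y): returns the observations that are in both x and y
union(x, y): returns the unique observations in x and y
setdiff(x, y): returns the observations that are in x but not in y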

Mutating joins: examples

library(nycflights13)
library(dplyr)
# Keep only a few columns to make the demo easier to read
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
glimpse(flights2)
## Rows: 336,776
## Columns: 8
## $ year     2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20~
## $ month    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour     5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6,~
## $ origin   "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK",~
## $ dest     "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO",~
## $ tailnum  "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N~
## $ carrier  "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B~

glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier  "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O~
## $ name     "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline~

The two tables share one column, carrier:

flights2 %>% 
  left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

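by = NULL is the default: the join uses every column that appears in both tables as a key (a natural join) and prints a message listing the keys it used.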

glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin      "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW~
## $ year        2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,~
## $ month       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, ~
## $ temp        39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.~
## $ dewp        26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.~
## $ humid       59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.~
## $ wind_dir    270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,~
## $ wind_speed  10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, ~
## $ wind_gust   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.~
## $ precip      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ pressure    1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101~
## $ visib       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ time_hour   2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:00~

flights2 %>% left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum       "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW~
## $ year          2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199~
## $ type          "Fixed wing multi engine", "Fixed wing multi engine", "Fi~
## $ manufacturer  "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU~
## $ model         "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145~
## $ engines       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ seats         55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5~
## $ speed         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine        "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb~

flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
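The year.x and year.y columns in the output above appear because both tables have a non-key column called year, and the join disambiguates them with the default suffixes .x and .y. As a small sketch on made-up tables (trips and planes_toy are hypothetical names and values), the suffix argument lets you pick more descriptive names.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy tables that both contain a non-key column `year`
trips      <- tibble(id = c("A", "B"), year = c(2013, 2013))
planes_toy <- tibble(id = c("A", "C"), year = c(1999, 2004))

# Default suffixes: year.x (from trips) and year.y (from planes_toy)
default_names <- trips %>% left_join(planes_toy, by = "id")

# Custom suffixes make the origin of each column explicit
custom_names <- trips %>%
  left_join(planes_toy, by = "id", suffix = c("_flight", "_built"))

names(default_names)  # "id" "year.x" "year.y"
names(custom_names)   # "id" "year_flight" "year_built"
```

Plane B has no match in planes_toy, so its year_built is NA, as expected for a left join.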

glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa    "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "~
## $ name   "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur~
## $ lat    41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4~
## $ lon    -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342~
## $ alt    1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10~
## $ tz     -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, ~
## $ dst    "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",~
## $ tzone  "America/New_York", "America/Chicago", "America/Chicago", "Ameri~

Tables can also be joined when the key columns have different names in the two data sets:

flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>      NA    NA    NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

Below is a simple example that compares an inner join with a left join:

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

df1
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1
df2
## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1     3    10 a    
## 2     1    10 a

df1 %>% inner_join(df2)
## Joining, by = "x"
## # A tibble: 1 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>

Right join (the same rows as df2 %>% left_join(df1), with the columns in a different order):

df1 %>% right_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a

df2 %>% left_join(df1)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     a b         y
##   <dbl> <dbl> <chr> <int>
## 1     3    10 a        NA
## 2     1    10 a         2

Full join:

df1 %>% full_join(df2)
## Joining, by = "x"
## # A tibble: 3 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a

Filtering joins: examples

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

df1
## # A tibble: 4 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     3     3
## 4     4     4
df2
## # A tibble: 3 x 2
##       x z    
##   <dbl> <chr>
## 1     1 a    
## 2     1 b    
## 3     2 a

df1 %>% nrow()
## [1] 4

df1 %>% inner_join(df2, by = "x")
## # A tibble: 4 x 3
##       x     y z    
##   <dbl> <int> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 a    
## 4     1     2 b

df1 %>% semi_join(df2, by = "x")
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2

df1 %>% anti_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     3     3
## 2     4     4
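A typical use of filtering joins is checking keys against a lookup table. The sketch below assumes two made-up tables (orders and customers, hypothetical names and values): semi_join() keeps the rows with a known key without ever duplicating them, and anti_join() lists the rows whose key is missing.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy tables: orders reference customers by id
orders    <- tibble(order = 1:4, customer = c("a", "b", "b", "z"))
customers <- tibble(customer = c("a", "b", "c"))

# Orders whose customer exists in the lookup table
known <- orders %>% semi_join(customers, by = "customer")

# Orders whose customer is missing from the lookup table
orphans <- orders %>% anti_join(customers, by = "customer")

known$order    # 1 2 3
orphans$order  # 4
```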

Set operations: examples

(df1 <- tibble(x = 1:2, y = c(1L, 1L)))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
(df2 <- tibble(x = 1:2, y = 1:2))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2

intersect(df1, df2) # intersection
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     1     1

union(df1, df2) # union
## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
## 3     2     2

setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     1
setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     2

[Figure: schematic diagrams of the set operations]

column-wise operations

This section mainly covers across(), a function introduced in dplyr 1.0 that greatly simplifies code.

It applies the same operation to multiple columns.

library(dplyr, warn.conflicts = FALSE)

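across() has two main arguments:

.cols: the columns you want to operate on
.fns: the operation to apply, either a single function or a list of functions

It replaces the scoped verbs with the _if(), _at(), and _all() suffixes.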

starwars %>% 
  summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 8
##    name hair_color skin_color eye_color   sex gender homeworld species
##   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
## 1    87         13         31        15     5      3        49      38

Column names can be given directly:

starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>% 
  summarise(across(c(sex, gender, homeworld), n_distinct))
## # A tibble: 9 x 4
##   species    sex gender homeworld
##   <chr>    <int>  <int>     <int>
## 1 Droid        1      2         3
## 2 Gungan       1      1         1
## 3 Human        2      2        16
## 4 Kaminoan     2      2         1
## 5 Mirialan     1      1         1
## 6 Twi'lek      2      2         1
## 7 Wookiee      1      1         1
## 8 Zabrak       1      1         2
## 9 <NA>         1      1         3

across() also works with where(), which selects columns by a predicate and saves typing:

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 10 x 4
##    homeworld height  mass birth_year
##    <chr>      <dbl> <dbl>      <dbl>
##  1 Alderaan    176.  64         43  
##  2 Corellia    175   78.5       25  
##  3 Coruscant   174.  50         91  
##  4 Kamino      208.  83.1       31.5
##  5 Kashyyyk    231  124        200  
##  6 Mirial      168   53.1       49  
##  7 Naboo       175.  64.2       55  
##  8 Ryloth      179   55         48  
##  9 Tatooine    170.  85.4       54.6
## 10 <NA>        139.  82        334. 

If there are no missing values, you can pass mean directly:

library(tidyr)
starwars %>% drop_na() %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), mean))
## # A tibble: 4 x 4
##   homeworld height  mass birth_year
##   <chr>      <dbl> <dbl>      <dbl>
## 1 Corellia    175   78.5       25  
## 2 Mirial      168   53.1       49  
## 3 Naboo       177   62         60  
## 4 Tatooine    181.  96         37.6

across() accepts several functions at once; just put them in a list:

min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896
starwars %>% summarise(across(c(height, mass, birth_year), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

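The output column names can be customised through the .names argument, which takes a glue specification ({.col} is the column name, {.fn} the function name):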

starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

Writing the calls separately also works:

starwars %>% summarise(
  across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
  across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

where(is.numeric) cannot be used here, because the second across() would also pick up the newly created columns ("min_height", "min_mass" and "min_birth_year").

Wrapping the calls in tibble() solves this:

starwars %>% summarise(
  tibble(
    across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
    across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")  
  )
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

Pitfalls

When using where(is.numeric), watch out for the following situation:

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

df %>% 
  summarise(n = n(), across(where(is.numeric), sd))
##    n x        y
## 1 NA 1 4.041452

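Here n is first computed as 3, but because n is itself numeric, across(where(is.numeric), sd) then picks it up and overwrites it with sd(3), which is NA. Changing the order of the arguments avoids this: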

df %>% summarise(across(where(is.numeric), sd),
                 n = n()
                 )
##   x        y n
## 1 1 4.041452 3

Alternatively, use either of the following two approaches:

df %>% 
  summarise(n = n(), across(where(is.numeric) & !n, sd))
##   n x        y
## 1 3 1 4.041452

df %>% 
  summarise(
    tibble(n = n(), across(where(is.numeric), sd))
  )
##   n x        y
## 1 3 1 4.041452

Using across() with other verbs

across() can also be used with group_by(), count(), and distinct().
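As a brief sketch of these three combinations on starwars, the calls below select the columns whose names end in "color" (hair_color, skin_color, eye_color). Note that recent dplyr releases recommend pick() for this kind of selection without a function; across() is used here to stay consistent with this section.

```r
library(dplyr, warn.conflicts = FALSE)

# Count each combination of the three colour columns, most frequent first
color_counts <- starwars %>%
  count(across(ends_with("color")), sort = TRUE)

# Unique combinations of the same columns
color_combos <- starwars %>%
  distinct(across(ends_with("color")))

# Group by all columns matched by the selection helper
by_colors <- starwars %>%
  group_by(across(ends_with("color")))

group_vars(by_colors)  # "hair_color" "skin_color" "eye_color"
```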

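Using with filter()

across() cannot be used directly with filter(); use if_any() and if_all() instead.

if_any(): keeps a row if any of the selected columns satisfies the condition
if_all(): keeps a row only if all of the selected columns satisfy the condition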

starwars %>% 
  filter(if_any(everything(), ~ !is.na(.x)))
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

starwars %>% 
  filter(if_all(everything(), ~ !is.na(.x)))
## # A tibble: 29 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  3 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  4 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  5 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  6 Biggs ~    183    84 black       light      brown           24   male  mascu~
##  7 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
##  8 Anakin~    188    84 blond       fair       blue            41.9 male  mascu~
##  9 Chewba~    228   112 brown       unknown    blue           200   male  mascu~
## 10 Han So~    180    80 brown       fair       brown           29   male  mascu~
## # ... with 19 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
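The difference between the two helpers is easier to check on a table small enough to read in full. The sketch below uses a made-up tibble (scores, hypothetical values) and restricts the test to two named columns.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data with some missing scores
scores <- tibble(
  id   = 1:4,
  math = c(90, 55, NA, 40),
  art  = c(80, 70, 65, NA)
)

# Rows where at least one of the two scores is missing
any_missing <- scores %>% filter(if_any(c(math, art), is.na))

# Rows where both scores are at least 60 (a row with NA is dropped,
# because the combined condition is not TRUE)
all_pass <- scores %>% filter(if_all(c(math, art), ~ .x >= 60))

any_missing$id  # 3 4
all_pass$id     # 1
```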

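row-wise operations

In the tidyverse, tidy data usually has one observation per row and one variable per column, and almost every operation works on columns. Sometimes, though, you need to compute something within each row; dplyr provides rowwise() for exactly this.

Introduction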

library(dplyr, warn.conflicts = FALSE)

rowwise() is much like group_by(): it does nothing by itself, but verbs such as mutate() that follow it operate one row at a time.

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
## # A tibble: 2 x 3
## # Rowwise: 
##       x     y     z
##   <int> <int> <int>
## 1     1     3     5
## 2     2     4     6

Suppose you want the mean of each row (purely as an example). Without rowwise(), you get the mean of all the values, which is clearly not what you want:

df %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5   3.5
## 2     2     4     6   3.5

After rowwise(), the computation is done row by row:

df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## # Rowwise: 
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5     3
## 2     2     4     6     4

df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)
df
## # A tibble: 2 x 4
##   name       x     y     z
##   <chr>  <int> <int> <int>
## 1 Mara       1     3     5
## 2 Hadley     2     4     6

Compute the mean of each row:

df %>% 
  rowwise() %>% 
  summarise(m = mean(c(x, y, z)))
## # A tibble: 2 x 1
##       m
##   <dbl>
## 1     3
## 2     4

Pass name to rowwise() to keep it as an identifier column in the row-wise result:

df %>% 
  rowwise(name) %>% 
  summarise(m = mean(c(x, y, z)))
## `summarise()` has grouped output by 'name'. You can override using
## the `.groups` argument.
## # A tibble: 2 x 2
## # Groups:   name [2]
##   name       m
##   <chr>  <dbl>
## 1 Mara       3
## 2 Hadley     4

rowwise() can be seen as a special form of group_by() in which every row is its own group, so the row-wise behaviour persists until you remove it with ungroup().
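A short sketch of why ungroup() matters, reusing the small x/y/z tibble from above: the same mutate() call gives different results depending on whether the data is still row-wise.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

row_means <- df %>%
  rowwise() %>%
  mutate(m = mean(c(x, y, z)))

# Still row-wise: sum(m) is evaluated one row at a time
still_rowwise <- row_means %>% mutate(total = sum(m))

# After ungroup(): sum(m) is evaluated over the whole column
regular <- row_means %>% ungroup() %>% mutate(total = sum(m))

still_rowwise$total  # 3 4
regular$total        # 7 7
```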

Row-wise summary statistics

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 x 5
##      id     w     x     y     z
##   <int> <int> <int> <int> <int>
## 1     1    10    20    30    40
## 2     2    11    21    31    41
## 3     3    12    22    32    42
## 4     4    13    23    33    43
## 5     5    14    24    34    44
## 6     6    15    25    35    45

Now switch to row-wise operations, keeping id as an identifier:

rf <- df %>% rowwise(id)

Compute the sum of each row:

rf %>% mutate(total = sum(c(w, x, y, z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

rf %>% summarise(total = sum(c(w, x, y, z)))
## `summarise()` has grouped output by 'id'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 2
## # Groups:   id [6]
##      id total
##   <int> <int>
## 1     1   100
## 2     2   104
## 3     3   108
## 4     4   112
## 5     5   116
## 6     6   120

rf %>% mutate(total = sum(c_across(w:z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

rf %>% mutate(total = sum(c_across(where(is.numeric))))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

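Row-wise and column-wise operations can be combined, for example to turn each value into a proportion of its row total: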

rf %>% 
  mutate(total = sum(c_across(w:z))) %>% 
  ungroup() %>% 
  mutate(across(w:z, ~ . / total))
## # A tibble: 6 x 6
##      id     w     x     y     z total
##   <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1     1 0.1   0.2   0.3   0.4     100
## 2     2 0.106 0.202 0.298 0.394   104
## 3     3 0.111 0.204 0.296 0.389   108
## 4     4 0.116 0.205 0.295 0.384   112
## 5     5 0.121 0.207 0.293 0.379   116
## 6     6 0.125 0.208 0.292 0.375   120

For common summaries, vectorised functions such as `rowSums()` and `rowMeans()` can be used instead of rowwise().
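A minimal sketch of that alternative, on the same id/w/x/y/z tibble as above: across() without a function simply returns the selected columns as a data frame, which rowSums() and rowMeans() accept directly, so no rowwise() call is needed. The result name vectorised is only for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)

vectorised <- df %>%
  mutate(
    total = rowSums(across(w:z)),   # row totals over columns w to z
    mean  = rowMeans(across(w:z))   # w:z does not include the new `total`
  )

vectorised$total  # 100 104 108 112 116 120
vectorised$mean   # 25 26 27 28 29 30
```

The totals match the rowwise() results shown earlier; the vectorised form is usually the faster option on large tables.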

list columns

motivation

df <- tibble(
  x = list(1, 2:3, 4:6)
)

df
## # A tibble: 3 x 1
##   x        
##   <list>   
## 1 <dbl [1]>
## 2 <int [2]>
## 3 <int [3]>

df %>% mutate(l = length(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     3
## 2 <int [2]>     3
## 3 <int [3]>     3

df %>% mutate(l = lengths(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

df %>% mutate(l = sapply(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
df %>% mutate(l = purrr::map_int(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

df %>% 
  rowwise() %>% 
  mutate(l = length(x))
## # A tibble: 3 x 2
## # Rowwise: 
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

subsetting

df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)

gf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Groups:   g [2]
##       g y         type  length
##   <int> <list>    <chr>  <int>
## 1     1 <int [3]> list       1
## 2     2 <chr [1]> list       1
rf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Rowwise:  g
##       g y         type      length
##   <int> <list>    <chr>      <int>
## 1     1 <int [3]> integer        3
## 2     2 <chr [1]> character      1

# grouped
out1 <- integer(2)
for (i in 1:2) {
  out1[[i]] <- length(df$y[i])
}
out1
## [1] 1 1

# rowwise
out2 <- integer(2)
for (i in 1:2) {
  out2[[i]] <- length(df$y[[i]])
}
out2
## [1] 3 1

gf %>% mutate(y2 = y)
## # A tibble: 2 x 3
## # Groups:   g [2]
##       g y         y2       
##   <int> <list>    <list>   
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>
rf %>% mutate(y2 = y)
## Error in `mutate()`:
## ! Problem while computing `y2 = y`.
## x `y2` must be size 1, not 3.
## i Did you mean: `y2 = list(y)` ?
## i The error occurred in row 1.
rf %>% mutate(y2 = list(y))
## # A tibble: 2 x 3
## # Rowwise:  g
##       g y         y2       
##   <int> <list>    <list>   
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>

modeling

by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
## # A tibble: 3 x 2
## # Rowwise:  cyl
##     cyl                data
##   <dbl> <list<tibble[,10]>>
## 1     4           [11 x 10]
## 2     6            [7 x 10]
## 3     8           [14 x 10]

mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
## # A tibble: 3 x 3
## # Rowwise:  cyl
##     cyl                data mod   
##   <dbl> <list<tibble[,10]>> <list>
## 1     4           [11 x 10] <lm>  
## 2     6            [7 x 10] <lm>  
## 3     8           [14 x 10] <lm>  

mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
## # A tibble: 3 x 4
## # Rowwise:  cyl
##     cyl                data mod    pred      
##   <dbl> <list<tibble[,10]>> <list> <list>    
## 1     4           [11 x 10] <lm>   <dbl [11]>
## 2     6            [7 x 10] <lm>   <dbl [7]> 
## 3     8           [14 x 10] <lm>   <dbl [14]>

mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl  rmse
##   <dbl> <dbl>
## 1     4 3.01 
## 2     6 0.985
## 3     8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl   rsq
##   <dbl> <dbl>
## 1     4 0.509
## 2     6 0.465
## 3     8 0.423
mods %>% summarise(broom::glance(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 13
## # Groups:   cyl [3]
##     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     4     0.509         0.454  3.33      9.32  0.0137     1 -27.7   61.5  62.7
## 2     6     0.465         0.357  1.17      4.34  0.0918     1  -9.83  25.7  25.5
## 3     8     0.423         0.375  2.02      8.80  0.0118     1 -28.7   63.3  65.2
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

mods %>% summarise(broom::tidy(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118
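
The modelling steps above can be condensed into a single pipeline. What follows is a minimal sketch, not the author's exact code: it assumes `by_cyl` was created with `mtcars %>% nest_by(cyl)` (the three groups and the `[11 x 10]` data frames in the output are consistent with that), and `fits` and `quality` are illustrative names introduced here.

```r
library(dplyr)

# Assumption: by_cyl is mtcars nested by cyl, one row per cylinder count.
fits <- mtcars %>%
  nest_by(cyl) %>%
  # Wrap each fitted model in list() so it fits into a single cell.
  mutate(mod = list(lm(mpg ~ wt, data = data)))

# In a rowwise data frame, `mod` and `data` refer to the current row's
# model and data frame, so ordinary model helpers can be called directly.
quality <- fits %>%
  summarise(
    rsq  = summary(mod)$r.squared,
    rmse = sqrt(mean((predict(mod, data) - data$mpg)^2)),
    .groups = "drop"
  )

quality
```

Computing both metrics in one `summarise()` avoids storing the intermediate `pred` column, and `.groups = "drop"` silences the grouping message shown in the outputs above.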

repeated function calls

simulations

df <- tribble(
  ~ n, ~ min, ~ max,
    1,     0,     1,
    2,    10,   100,
    3,   100,  1000,
)

df %>% 
  rowwise() %>% 
  mutate(data = list(runif(n, min, max)))
## # A tibble: 3 x 4
## # Rowwise: 
##       n   min   max data     
##   <dbl> <dbl> <dbl> <list>   
## 1     1     0     1 <dbl [1]>
## 2     2    10   100 <dbl [2]>
## 3     3   100  1000 <dbl [3]>

df %>% 
  rowwise() %>% 
  mutate(data = runif(n, min, max))
## Error in `mutate()`:
## ! Problem while computing `data = runif(n, min, max)`.
## x `data` must be size 1, not 2.
## i Did you mean: `data = list(runif(n, min, max))` ?
## i The error occurred in row 2.
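
The error arises because `runif(n, min, max)` returns `n` values, while each cell of an ordinary column can hold only one; wrapping the call in `list()` stores the whole vector in a single cell. The sketch below shows how the stored vectors can be inspected afterwards. The seed value and the name `sim` are assumptions made for illustration.

```r
library(dplyr)

set.seed(2024)  # arbitrary seed, only to make the sketch reproducible

sim <- tibble(n = c(1, 2, 3), min = c(0, 10, 100), max = c(1, 100, 1000)) %>%
  rowwise() %>%
  # list() keeps each simulated vector in one cell, whatever its length.
  mutate(data = list(runif(n, min, max))) %>%
  ungroup()

# Number of draws stored in each cell of the list column.
lengths(sim$data)

# A single vector is extracted with [[ ]].
sim$data[[3]]
```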

multiple combinations

df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

df %>% 
  rowwise() %>% 
  mutate(data = list(rnorm(10, mean, sd)))
## # A tibble: 9 x 3
## # Rowwise: 
##    mean    sd data      
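##   <dbl> <dbl> <list>    
## 1    -1     1 <dbl [10]>
## 2     0     1 <dbl [10]>
## 3     1     1 <dbl [10]>
## 4    -1    10 <dbl [10]>
## 5     0    10 <dbl [10]>
## 6     1    10 <dbl [10]>
## 7    -1   100 <dbl [10]>
## 8     0   100 <dbl [10]>
## 9     1   100 <dbl [10]>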
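
Once every parameter combination has its own simulated vector, the same rowwise approach can summarise each one. This is a hedged sketch rather than part of the original walkthrough: the column names `mu`, `sigma` and `draws` are illustrative (the text above uses `mean`, `sd` and `data`), and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility

# expand.grid() builds all 3 x 3 = 9 combinations of the two parameters.
grid <- expand.grid(mu = c(-1, 0, 1), sigma = c(1, 10, 100))

checked <- grid %>%
  rowwise() %>%
  mutate(draws = list(rnorm(10, mu, sigma))) %>%
  # Inside rowwise(), `draws` is the current row's numeric vector.
  mutate(sample_mean = mean(draws), sample_sd = sd(draws)) %>%
  ungroup()

checked %>% select(mu, sigma, sample_mean, sample_sd)
```

Renaming the parameter columns keeps them from sharing names with the `mean()` and `sd()` functions, which makes the summary step easier to read.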

varying functions

df <- tribble(
   ~rng,     ~params,
   "runif",  list(n = 10), 
   "rnorm",  list(n = 20),
   "rpois",  list(n = 10, lambda = 5),
) %>%
  rowwise()

df %>% 
  mutate(data = list(do.call(rng, params)))
## # A tibble: 3 x 3
## # Rowwise: 
##   rng   params           data      
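##   <chr> <list>           <list>    
## 1 runif <named list [1]> <dbl [10]>
## 2 rnorm <named list [1]> <dbl [20]>
## 3 rpois <named list [2]> <int [10]>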
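
The key to this pattern is base R's `do.call()`, which accepts a function (or its name as a string) together with a list of arguments and performs the call. Arguments missing from the list, such as `min` and `max` for `runif`, fall back to the function's defaults. The short sketch below, with an arbitrary seed and illustrative variable names, suggests that the indirect call is equivalent to a direct one.

```r
# do.call() calls the named function with the arguments taken from the list.
set.seed(42)  # arbitrary seed
via_do_call <- do.call("rpois", list(n = 10, lambda = 5))

# The same call written out directly, with the same seed.
set.seed(42)
direct <- rpois(n = 10, lambda = 5)

identical(via_do_call, direct)
```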
