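[R <- Data Analysis] Using dplyr (3): mutate, summarize, and pipes

This post continues parts 1 and 2. For the complete notes, see "Data transformation with dplyr".

Adding new variables with mutate()

Besides selecting existing columns, another common operation is adding new columns. That is the job of mutate().

By default, mutate() adds new columns at the end of the dataset. So that we can see the new variables, we start by creating a narrower dataset.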

flights_sml <- select(flights,
                      year:day,
                      ends_with("delay"),
                      distance,
                      air_time)

mutate(flights_sml,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time   gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <dbl>
##  1  2013     1     1      2.00     11.0      1400    227     9.00   370
##  2  2013     1     1      4.00     20.0      1416    227    16.0    374
##  3  2013     1     1      2.00     33.0      1089    160    31.0    408
##  4  2013     1     1     -1.00    -18.0      1576    183   -17.0    517
##  5  2013     1     1     -6.00    -25.0       762    116   -19.0    394
##  6  2013     1     1     -4.00     12.0       719    150    16.0    288
##  7  2013     1     1     -5.00     19.0      1065    158    24.0    404
##  8  2013     1     1     -3.00    -14.0       229     53.0 -11.0    259
##  9  2013     1     1     -3.00    - 8.00      944    140   - 5.00   405
## 10  2013     1     1     -2.00      8.00      733    138    10.0    319
## # ... with 336,766 more rows

mutate(flights_sml,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay dista~ air_~   gain hours gain_p~
##    <int> <int> <int>     <dbl>     <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl>
##  1  2013     1     1      2.00     11.0    1400 227     9.00 3.78     2.38
##  2  2013     1     1      4.00     20.0    1416 227    16.0  3.78     4.23
##  3  2013     1     1      2.00     33.0    1089 160    31.0  2.67    11.6 
##  4  2013     1     1     -1.00    -18.0    1576 183   -17.0  3.05   - 5.57
##  5  2013     1     1     -6.00    -25.0     762 116   -19.0  1.93   - 9.83
##  6  2013     1     1     -4.00     12.0     719 150    16.0  2.50     6.40
##  7  2013     1     1     -5.00     19.0    1065 158    24.0  2.63     9.11
##  8  2013     1     1     -3.00    -14.0     229  53.0 -11.0  0.883  -12.5 
##  9  2013     1     1     -3.00    - 8.00    944 140   - 5.00 2.33   - 2.14
## 10  2013     1     1     -2.00      8.00    733 138    10.0  2.30     4.35
## # ... with 336,766 more rows

If you only want to keep the new variables, use transmute():

transmute(flights,
          gain = arr_delay - dep_delay,
          hours = air_time / 60,
          gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
##      gain hours gain_per_hour
##     <dbl> <dbl>         <dbl>
##  1   9.00 3.78           2.38
##  2  16.0  3.78           4.23
##  3  31.0  2.67          11.6 
##  4 -17.0  3.05         - 5.57
##  5 -19.0  1.93         - 9.83
##  6  16.0  2.50           6.40
##  7  24.0  2.63           9.11
##  8 -11.0  0.883        -12.5 
##  9 - 5.00 2.33         - 2.14
## 10  10.0  2.30           4.35
## # ... with 336,766 more rows

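Useful creation functions

Many functions can be combined with mutate() to create new variables. Their key property is that they are vectorized: the function takes a vector of values as input and returns a vector with the same number of values as output. We cannot list every such function, so here is a selection of the ones that are used most often.

Arithmetic operators

The arithmetic operators are all vectorized and follow the "recycling rules": if one argument is shorter than the other, it is automatically extended to the same length. This is most useful when one of the arguments is a single number, as in air_time / 60 or hours * 60.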
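
As a small illustration of the recycling rule, the sketch below uses made-up vectors (air_time here is a hypothetical three-element vector, not the flights column). It shows a single number being recycled against a longer vector, and a length-two vector being recycled against a length-four one.

```r
# Hypothetical flight durations in minutes (illustrative values only)
air_time <- c(227, 160, 53)

# The single value 60 is recycled to length 3 before the division
hours <- air_time / 60

# Multiplying back recovers the original minutes
minutes <- hours * 60

# A shorter vector is repeated to match the longer one:
# c(10, 20) is used as c(10, 20, 10, 20)
recycled <- c(1, 2, 3, 4) + c(10, 20)
print(recycled)
```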

Modular arithmetic (%/% and %%)

%/% is integer division and %% is the remainder.
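
One common use of modular arithmetic is splitting an integer into parts. As a sketch, assume departure times stored as integers in HHMM form; the dep_time vector below is made up for illustration rather than taken from the flights data.

```r
# Hypothetical departure times in HHMM form (illustrative values only)
dep_time <- c(517L, 2356L, 42L)

# Integer division by 100 drops the last two digits, leaving the hour
hour <- dep_time %/% 100L

# The remainder after dividing by 100 is the minute
minute <- dep_time %% 100L

print(data.frame(dep_time, hour, minute))
```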

Logarithms

log(), log2(), and log10().
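
A brief sketch of the three functions on made-up values. log() is the natural logarithm, while log2() and log10() use base 2 and base 10, so a difference of 1 on the log2 scale corresponds to a doubling on the original scale.

```r
# Each value below doubles one or more times relative to 1
x <- c(1, 2, 8)

# Base-2 logarithm: counts the number of doublings
l2 <- log2(x)

# Base-10 logarithm: counts powers of ten
l10 <- log10(c(1, 10, 1000))

# Natural logarithm is the inverse of exp()
ln <- log(exp(2))

print(l2)
print(l10)
print(ln)
```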

Offsets

lead() and lag() let you refer to the following or the preceding values of a variable.

(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA
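
One use of these offsets is computing running differences or detecting where consecutive values change. A minimal sketch, assuming dplyr is loaded and using a made-up vector v:

```r
library(dplyr)

# Illustrative values only
v <- c(1, 3, 6, 6, 10)

# Difference between each value and the one before it;
# the first element has no predecessor, so it is NA
diffs <- v - lag(v)

# TRUE wherever the value differs from the previous one
changed <- v != lag(v)

print(diffs)
print(changed)
```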

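Cumulative aggregates

R provides functions for cumulative sums, products, minimums, and maximums: cumsum(), cumprod(), cummin(), and cummax(). dplyr adds cummean() for cumulative means. If you need rolling aggregates (computed over a moving window), try the RcppRoll package.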

x
##  [1]  1  2  3  4  5  6  7  8  9 10
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

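Logical comparisons

<, <=, >, >=, ==, !=

Ranking

There are many ranking functions, but we start with min_rank(). It does the most usual kind of ranking (for example 1st, 2nd, 2nd, 4th), in which the smallest values get the smallest ranks. Use desc() to give the largest values the smallest ranks instead.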

y <- c(1,2,2,NA,3,4)
min_rank(y)
## [1]  1  2  2 NA  4  5
min_rank(desc(y))
## [1]  5  3  3 NA  2  1

If min_rank() does not do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), and ntile(); see their help pages for details.

row_number(y)
## [1]  1  2  3 NA  4  5
dense_rank(y)
## [1]  1  2  2 NA  3  4
percent_rank(y)
## [1] 0.00 0.25 0.25   NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6  NA 0.8 1.0

Computing summaries with summarize()

The last key verb is summarize(). It collapses a data frame to a single row:

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

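summarize() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they are automatically applied "by group". For example, if we group by date, we get the average delay per date: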

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

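Together, group_by() and summarize() provide the dplyr tool we use most often: grouped summaries. Before going any further, we need to introduce a powerful idea: the pipe.

Combining multiple operations with the pipe

Imagine that you want to explore the relationship between the distance and the average delay for each destination. Using what you already know about dplyr, you might write code like this: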

by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE) )
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
    geom_point(aes(size=count), alpha = 1/3) + 
    geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'


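It looks like delays increase with distance up to about 750 miles and then decrease. Maybe as flights get longer, there is more ability to make up delays in the air?

The code above prepares the data in three steps:

1. Group flights by destination.
2. Summarize to compute the distance, the average delay, and the number of flights.
3. Filter to remove noisy points and the Honolulu airport, which is much farther away than the other destinations.

This code is a little frustrating to write: even though we do not care about the intermediate data frames, we still have to create and name each one. Naming things is hard, and it slows down the analysis.

There is another way to tackle the same problem, using the pipe, %>%: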

delays <- flights %>%
    group_by(dest) %>%
    summarize(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
    filter(count > 20, dest != "HNL")

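This code focuses on the transformations rather than on what is being transformed, which makes it easier to read. You can read it as a series of imperative statements: group, then summarize, then filter. A good way to pronounce %>% when reading code is "then".

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z), and so on. You can use the pipe to rewrite multiple operations so that they read left to right, top to bottom. From now on we will use the pipe frequently because it considerably improves the readability of code; we will come back to it in more detail later.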
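
The rewriting rule can be checked directly on a small example. The sketch below assumes dplyr is loaded (it re-exports %>% from magrittr) and uses a made-up numeric vector; round() plays the role of f and rep() the role of g.

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

# Illustrative values only
x <- c(1.234, 5.678)

# x %>% f(y) is equivalent to f(x, y)
piped  <- x %>% round(1)
direct <- round(x, 1)

# x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z)
chained <- x %>% round(1) %>% rep(times = 2)
nested  <- rep(round(x, 1), times = 2)

print(identical(piped, direct))
print(identical(chained, nested))
```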

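Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2, which was written before the pipe existed. Unfortunately, ggvis, the next iteration of ggplot2, which does use the pipe, is not ready for general use yet.

Missing values

You may have wondered about the na.rm argument we used above. What happens if we do not set it?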

flights %>%
    group_by(dest) %>%
    summarize(
        count = n(),
        dist = mean(distance),
        delay = mean(arr_delay)
    ) %>%
    filter(count > 20, dest != "HNL")
## # A tibble: 96 x 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254  1826  4.38
##  2 ACK     265   199 NA   
##  3 ALB     439   143 NA   
##  4 ATL   17215   757 NA   
##  5 AUS    2439  1514 NA   
##  6 AVL     275   584 NA   
##  7 BDL     443   116 NA   
##  8 BGR     375   378 NA   
##  9 BHM     297   866 NA   
## 10 BNA    6333   758 NA   
## # ... with 86 more rows

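We get a lot of missing values! Aggregation functions obey the usual rule for missing values: if there is any missing value in the input, the output is a missing value. Fortunately, all aggregation functions have an na.rm argument, which removes the missing values before the computation.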
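
The rule is easy to see on a tiny vector. This base R sketch uses made-up values, not the flights data:

```r
# Illustrative delays with one missing observation
d <- c(2, NA, 4)

# A single NA in the input makes the result NA
with_na <- mean(d)

# na.rm = TRUE drops the NA first, so the mean is taken over 2 and 4
without_na <- mean(d, na.rm = TRUE)

print(with_na)
print(without_na)
```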

flights %>%
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

In this case the missing values represent cancelled flights, so we can also tackle the problem by first removing the cancelled flights.

not_cancelled <- flights %>%
    filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
    group_by(year, month, day) %>%
    summarize(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4 
##  2  2013     1     2 13.7 
##  3  2013     1     3 10.9 
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows

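Counts

Whenever you do any aggregation, it is a good idea to include either a count, n(), or a count of non-missing values, sum(!is.na(x)). That way you can check that you are not drawing conclusions from very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays: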

delays <- not_cancelled %>%
    group_by(tailnum) %>%
    summarize(
        delay = mean(arr_delay)
    )

ggplot(data = delays, mapping = aes(x = delay)) + 
    geom_freqpoly(binwidth = 10)


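Wow! Some planes have an average delay of 5 hours (300 minutes).

A scatterplot of the number of flights against the average delay shows more of the story: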

delays <- not_cancelled %>%
    group_by(tailnum) %>%
    summarize(
        delay = mean(arr_delay, na.rm = TRUE),
        n = n()
    )

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)


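Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or another summary) against group size, the variation decreases as the sample size increases.

When looking at this sort of plot, it is often useful to filter out the groups with the smallest numbers of observations, so that you see more of the pattern and less of the extreme variation. That is what the following code does, and it also shows a handy pattern for integrating ggplot2 into dplyr flows. Switching abruptly from %>% to + can feel a bit painful, but it becomes convenient once you get used to it: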

delays %>%
    filter(n > 25) %>%
    ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)


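Let's look at another example: how the average performance of batters in baseball relates to the number of times they are at bat. Here we use data from the Lahman package to compute the batting average (number of hits / number of at-bats) of every player.

When we plot the skill of the batter (measured by the batting average) against the number of opportunities to hit the ball (measured by at-bats), we see two patterns:

1. The more data points there are, the smaller the variation.
2. Skill and opportunities to hit the ball are positively correlated. This is because teams control who gets to play, and obviously they pick their best players:

# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)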

batters <- batting %>%
    group_by(playerID) %>%
    summarize(
        ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
        ab = sum(AB, na.rm = TRUE)
    )

batters %>% 
    filter(ab > 100) %>%
    ggplot(mapping = aes(x = ab, y = ba)) + 
    geom_point() +
    geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam'


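Useful summary functions

Means, counts, and sums alone can take us a long way, but R provides many other useful summary functions:

Measures of location

We have already used mean(), which computes the average (the sum divided by the length). median() is also very useful; it finds the middle value.

It is sometimes useful to combine aggregation with logical subsetting: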

not_cancelled %>%
    group_by(year, month, day) %>% 
    summarize(
        # Average delay
        avg_delay1 = mean(arr_delay),
        # Average positive delay
        avg_delay2 = mean(arr_delay[arr_delay > 0])
    )
## # A tibble: 365 x 5
## # Groups: year, month [?]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4    - 1.93        28.3
##  5  2013     1     5    - 1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7    - 4.95        27.8
##  8  2013     1     8    - 3.23        20.8
##  9  2013     1     9    - 0.264       25.6
## 10  2013     1    10    - 5.90        27.3
## # ... with 355 more rows

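Measures of spread: sd(x), IQR(x), mad(x)

sd() computes the standard deviation (the root mean squared deviation), which is the standard measure of spread. IQR() computes the interquartile range and mad() computes the median absolute deviation; both are robust equivalents that can be more useful when the data contain outliers.

# Why is the distance to some destinations more variable than to others?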
not_cancelled %>% 
    group_by(dest) %>% 
    summarize(distance_sd = sd(distance)) %>% 
    arrange(desc(distance_sd))
## # A tibble: 104 x 2
##    dest  distance_sd
##    <chr>       <dbl>
##  1 EGE         10.5 
##  2 SAN         10.4 
##  3 SFO         10.2 
##  4 HNL         10.0 
##  5 SEA          9.98
##  6 LAS          9.91
##  7 PDX          9.87
##  8 PHX          9.86
##  9 LAX          9.66
## 10 IND          9.46
## # ... with 94 more rows
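
To see why IQR() and mad() are described as robust, here is a base R sketch on a made-up vector with a single outlier. The values are illustrative only; the point is that the outlier inflates sd() far more than the other two measures.

```r
# Five ordinary values plus one extreme outlier (illustrative only)
clean   <- c(1, 2, 3, 4, 5)
outlier <- c(clean, 100)

# Compare each measure of spread with and without the outlier
spread <- data.frame(
  measure         = c("sd", "IQR", "mad"),
  without_outlier = c(sd(clean), IQR(clean), mad(clean)),
  with_outlier    = c(sd(outlier), IQR(outlier), mad(outlier))
)
print(spread)
```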

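Measures of rank: min(x), quantile(x, 0.25), max(x)

Quantiles are a generalization of the median. For example, quantile(x, 0.25) finds the value of x that is greater than 25% of the values and less than the remaining 75%.

# When do the first and last flights leave each day?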
not_cancelled %>% 
    group_by(year, month, day) %>% 
    summarize(
        first = min(dep_time),
        last = max(dep_time)
    )
## # A tibble: 365 x 5
## # Groups: year, month [?]
##     year month   day  first  last
##    <int> <int> <int>  <dbl> <dbl>
##  1  2013     1     1 517     2356
##  2  2013     1     2  42.0   2354
##  3  2013     1     3  32.0   2349
##  4  2013     1     4  25.0   2358
##  5  2013     1     5  14.0   2357
##  6  2013     1     6  16.0   2355
##  7  2013     1     7  49.0   2359
##  8  2013     1     8 454     2351
##  9  2013     1     9   2.00  2252
## 10  2013     1    10   3.00  2320
## # ... with 355 more rows
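
The relationship between quantiles and the median can be checked on a small made-up vector. This base R sketch relies on quantile()'s default interpolation method; other methods may give slightly different values on other data.

```r
# Nine evenly spaced values (illustrative only)
x <- 1:9

# The 25% quantile: a quarter of the values lie below it
q25 <- unname(quantile(x, 0.25))

# The 50% quantile is the median
q50 <- unname(quantile(x, 0.5))

print(q25)
print(q50)
print(median(x))
```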

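Measures of position: first(x), nth(x, 2), last(x)

These functions work like x[1], x[2], and x[length(x)], but they let you set a default value for the case where that position does not exist. For example, we can find the first and last departure for each day: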

not_cancelled %>% 
    group_by(year, month, day) %>% 
    summarize(
        first_dep = first(dep_time),
        last_dep = last(dep_time)
    )
## # A tibble: 365 x 5
## # Groups: year, month [?]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows
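
The default-value behaviour mentioned above can be sketched on a short made-up vector. This assumes dplyr is loaded; the default argument is part of dplyr's first(), nth(), and last().

```r
library(dplyr)

# Only three elements, so position 5 does not exist (illustrative values)
x <- c(10, 20, 30)

first_x  <- first(x)
second_x <- nth(x, 2)
last_x   <- last(x)

# Without a default, a missing position gives NA
missing_pos <- nth(x, 5)

# With a default, the supplied value is returned instead
with_default <- nth(x, 5, default = 0)

print(c(first_x, second_x, last_x))
print(missing_pos)
print(with_default)
```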

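These functions are complementary to filtering on ranks. Filtering keeps all the variables, with each observation in a separate row: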

not_cancelled %>% 
    group_by(year, month, day) %>% 
    mutate(r = min_rank(desc(dep_time))) %>% 
    filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups: year, month, day [365]
##     year month   day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     1     1    517     515   2.00   830    819  11.0  UA     1545
##  2  2013     1     1   2356    2359 - 3.00   425    437 -12.0  B6      727
##  3  2013     1     2     42    2359  43.0    518    442  36.0  B6      707
##  4  2013     1     2   2354    2359 - 5.00   413    437 -24.0  B6      727
##  5  2013     1     3     32    2359  33.0    504    442  22.0  B6      707
##  6  2013     1     3   2349    2359 -10.0    434    445 -11.0  B6      739
##  7  2013     1     4     25    2359  26.0    505    442  23.0  B6      707
##  8  2013     1     4   2358    2359 - 1.00   429    437 - 8.00 B6      727
##  9  2013     1     4   2358    2359 - 1.00   436    445 - 9.00 B6      739
## 10  2013     1     5     14    2359  15.0    503    445  18.0  B6      739
## # ... with 760 more rows, and 9 more variables: tailnum <chr>, origin
## #   <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## #   <dbl>, time_hour <dttm>, r <int>

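Counts

You have already seen n(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct():

# Which destinations have the most carriers?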
not_cancelled %>% 
    group_by(dest) %>% 
    summarize(carriers = n_distinct(carrier)) %>% 
    arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

Counts are so useful that dplyr provides a simple helper for the case where all you want is a count:

not_cancelled %>% 
    count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

You can optionally provide a weight variable. For example, you could use it to "count" (sum) the total number of miles a plane flew:

not_cancelled %>% 
    count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>     <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows
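
A weighted count is the same as grouping and summing the weight variable. The sketch below checks this on a tiny made-up tibble; trips, plane, and distance here are hypothetical names and values, not the flights data.

```r
library(dplyr)

# Three hypothetical trips flown by two planes
trips <- tibble(
  plane    = c("A", "A", "B"),
  distance = c(100, 250, 400)
)

# Weighted count: sums distance within each plane
by_count <- trips %>%
  count(plane, wt = distance)

# The same result written as an explicit grouped summary
by_sum <- trips %>%
  group_by(plane) %>%
  summarize(n = sum(distance))

print(by_count)
print(by_sum)
```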

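Counts and proportions of logical values: sum(x > 10), mean(y == 0)

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion:

# How many flights left before 5am?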
not_cancelled %>% 
    group_by(year, month, day) %>% 
    summarize(n_early = sum(dep_time < 500))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows


# What proportion of flights are delayed by more than an hour?
not_cancelled %>% 
    group_by(year, month, day) %>% 
    summarize(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows
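
The conversion of TRUE and FALSE to 1 and 0 can be seen directly on a small made-up vector in base R:

```r
# Illustrative values only
x <- c(5, 12, 20, 8)

# The comparison produces a logical vector
over_ten <- x > 10

# Summing counts the TRUEs; averaging gives their proportion
n_over    <- sum(over_ten)
prop_over <- mean(over_ten)

print(over_ten)
print(n_over)
print(prop_over)
```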

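Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset: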

daily <- group_by(flights, year, month, day)
(per_day   <- summarize(daily, flights = n()))
## # A tibble: 365 x 4
## # Groups: year, month [?]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows
(per_month <- summarize(per_day, flights = sum(flights)))
## # A tibble: 12 x 3
## # Groups: year [?]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135
(per_year  <- summarize(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

Ungrouping

When you want to remove grouping, use ungroup():

daily %>%
    ungroup() %>%  # no longer grouped by date
    summarize(flights = n()) # all flights
## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776

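Grouped mutates (and filters)

Grouping is most useful together with summarize(), but you can also do convenient operations with mutate() and filter():

Find the worst members of each group: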

flights_sml %>% 
    group_by(year, month, day) %>% 
    filter(rank(desc(arr_delay)) < 10 )
## # A tibble: 3,306 x 7
## # Groups: year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184     41.0
##  2  2013     1     1       290       338     1134    213  
##  3  2013     1     1       260       263      266     46.0
##  4  2013     1     1       157       174      213     60.0
##  5  2013     1     1       216       222      708    121  
##  6  2013     1     1       255       250      589    115  
##  7  2013     1     1       285       246     1085    146  
##  8  2013     1     1       192       191      199     44.0
##  9  2013     1     1       379       456     1092    222  
## 10  2013     1     2       224       207      550     94.0
## # ... with 3,296 more rows

Find all groups bigger than a threshold:

(popular_dests <- flights %>% 
    group_by(dest) %>% 
    filter(n() > 365))
## # A tibble: 332,577 x 19
## # Groups: dest [77]
##     year month   day dep_t~ sched_~ dep_d~ arr_~ sched~ arr_d~ carr~ flig~
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     1     1    517     515   2.00   830    819  11.0  UA     1545
##  2  2013     1     1    533     529   4.00   850    830  20.0  UA     1714
##  3  2013     1     1    542     540   2.00   923    850  33.0  AA     1141
##  4  2013     1     1    544     545  -1.00  1004   1022 -18.0  B6      725
##  5  2013     1     1    554     600  -6.00   812    837 -25.0  DL      461
##  6  2013     1     1    554     558  -4.00   740    728  12.0  UA     1696
##  7  2013     1     1    555     600  -5.00   913    854  19.0  B6      507
##  8  2013     1     1    557     600  -3.00   709    723 -14.0  EV     5708
##  9  2013     1     1    557     600  -3.00   838    846 - 8.00 B6       79
## 10  2013     1     1    558     600  -2.00   753    745   8.00 AA      301
## # ... with 332,567 more rows, and 8 more variables: tailnum <chr>, origin
## #   <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## #   <dbl>, time_hour <dttm>

Standardize to compute per-group metrics:

popular_dests %>% 
    filter(arr_delay > 0) %>% 
    mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
    select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH       11.0   0.000111 
##  2  2013     1     1 IAH       20.0   0.000201 
##  3  2013     1     1 MIA       33.0   0.000235 
##  4  2013     1     1 ORD       12.0   0.0000424
##  5  2013     1     1 FLL       19.0   0.0000938
##  6  2013     1     1 ORD        8.00  0.0000283
##  7  2013     1     1 LAX        7.00  0.0000344
##  8  2013     1     1 DFW       31.0   0.000282 
##  9  2013     1     1 ATL       12.0   0.0000400
## 10  2013     1     1 DTW       16.0   0.000116 
## # ... with 131,096 more rows
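
The per-group standardization above is easier to verify on a tiny example. In this sketch the tibble and its values are made up; the point is that sum(arr_delay) inside a grouped mutate() is computed separately for each destination, so the proportions add up to 1 within each group.

```r
library(dplyr)

# Two hypothetical destinations with two delayed flights each
small <- tibble(
  dest      = c("X", "X", "Y", "Y"),
  arr_delay = c(10, 30, 5, 15)
)

shares <- small %>%
  group_by(dest) %>%
  # sum(arr_delay) is the total for the current destination only
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  ungroup()

print(shares)
```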

(End)