[21] R for Data Science: Grouping

Grouping by multiple variables

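When you group by multiple variables, each summary peels off one level of the grouping (the last grouping variable). This makes it easy to roll up a dataset step by step: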

library(dplyr)
library(nycflights13)
daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day flights
   <int> <int> <int>   <int>
 1  2013     1     1     842
 2  2013     1     2     943
 3  2013     1     3     914
 4  2013     1     4     915
 5  2013     1     5     720
 6  2013     1     6     832
 7  2013     1     7     933
 8  2013     1     8     899
 9  2013     1     9     902
10  2013     1    10     932
# ... with 355 more rows
(per_month <- summarize(per_day, flights = sum(flights)))
# A tibble: 12 x 3
# Groups:   year [1]
    year month flights
   <int> <int>   <int>
 1  2013     1   27004
 2  2013     2   24951
 3  2013     3   28834
 4  2013     4   28330
 5  2013     5   28796
 6  2013     6   28243
 7  2013     7   29425
 8  2013     8   29327
 9  2013     9   27574
10  2013    10   28889
11  2013    11   27268
12  2013    12   28135
(per_year <- summarize(per_month, flights = sum(flights)))
# A tibble: 1 x 2
   year flights
  <int>   <int>
1  2013  336776
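
One caveat when rolling up step by step: counts and sums can be safely re-aggregated, but a mean of per-group means is only correct if each group is weighted by its size. Here is a minimal sketch on a made-up tibble (`toy` and its columns are hypothetical and not part of nycflights13):

```r
library(dplyr)

# Hypothetical toy data: day 1 has three flights, day 2 has one.
toy <- tibble(
  day   = c(1, 1, 1, 2),
  delay = c(10, 20, 30, 100)
)

# First level of the roll-up: one row per day.
per_day <- toy %>%
  group_by(day) %>%
  summarize(n = n(), mean_delay = mean(delay))

# Summing the daily counts recovers the overall count exactly.
total_n <- sum(per_day$n)

# Averaging the daily means ignores how many flights each day had ...
naive_mean <- mean(per_day$mean_delay)

# ... so weight each daily mean by its count to recover the overall mean.
weighted_mean <- weighted.mean(per_day$mean_delay, per_day$n)
```

In this toy case `weighted_mean` agrees with `mean(toy$delay)`, while `naive_mean` does not, because the single flight on day 2 counts as much as the three flights on day 1.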

Ungrouping

If you need to remove the grouping and go back to working on ungrouped data, use ungroup():

daily %>% ungroup() %>% summarize(flights = n())
# A tibble: 1 x 1
  flights
    <int>
1  336776
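
To check whether a data frame is still grouped before summarizing, dplyr's group_vars() returns the names of the current grouping variables. A small sketch, again on a hypothetical toy tibble rather than the flights data:

```r
library(dplyr)

# Hypothetical toy data with one grouping column.
toy <- tibble(g = c("a", "a", "b"), x = 1:3)

grouped   <- group_by(toy, g)
ungrouped <- ungroup(grouped)

# Names of the grouping variables before and after ungroup().
vars_before <- group_vars(grouped)
vars_after  <- group_vars(ungrouped)

# Grouped: one summary row per group. Ungrouped: a single summary row.
by_group <- summarize(grouped, n = n())
overall  <- summarize(ungrouped, n = n())
```

Here `vars_before` contains "g" and `vars_after` is empty, which is why the second summary collapses to a single row.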

Grouped mutates and filters

Grouping is most often used together with group_by() and summarize(), but it can also be combined with mutate() and filter().

  • Find the worst members of each group (here, roughly the nine largest arrival delays on each day):
flights %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10)
# A tibble: 3,306 x 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      848           1835       853     1001           1950       851
 2  2013     1     1     1815           1325       290     2120           1542       338
 3  2013     1     1     1842           1422       260     1958           1535       263
 4  2013     1     1     1942           1705       157     2124           1830       174
 5  2013     1     1     2006           1630       216     2230           1848       222
 6  2013     1     1     2115           1700       255     2330           1920       250
 7  2013     1     1     2205           1720       285       46           2040       246
 8  2013     1     1     2312           2000       192       21           2110       191
 9  2013     1     1     2343           1724       379      314           1938       456
10  2013     1     2     1244            900       224     1431           1104       207
# ... with 3,296 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
  • Find all groups bigger than a threshold (here, destinations with more than 365 flights):
(popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365))
# A tibble: 332,577 x 19
# Groups:   dest [77]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      544            545        -1     1004           1022       -18
 5  2013     1     1      554            600        -6      812            837       -25
 6  2013     1     1      554            558        -4      740            728        12
 7  2013     1     1      555            600        -5      913            854        19
 8  2013     1     1      557            600        -3      709            723       -14
 9  2013     1     1      557            600        -3      838            846        -8
10  2013     1     1      558            600        -2      753            745         8
# ... with 332,567 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
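  • Standardize to compute per-group metrics (here, each delayed flight's share of the total arrival delay for its destination):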
popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay) %>%
  head()
# A tibble: 6 x 6
# Groups:   dest [4]
   year month   day dest  arr_delay prop_delay
  <int> <int> <int> <chr>     <dbl>      <dbl>
1  2013     1     1 IAH          11  0.000111 
2  2013     1     1 IAH          20  0.000201 
3  2013     1     1 MIA          33  0.000235 
4  2013     1     1 ORD          12  0.0000424
5  2013     1     1 FLL          19  0.0000938
6  2013     1     1 ORD           8  0.0000283
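
The three patterns above are easier to verify by hand on a tiny example. The sketch below uses a hypothetical five-row tibble (`toy`, with made-up `dest` and `delay` columns) and smaller thresholds than the flights examples, so each result can be checked by eye:

```r
library(dplyr)

# Hypothetical toy data: destination A has three rows, B has two.
toy <- tibble(
  dest  = c("A", "A", "A", "B", "B"),
  delay = c(5, 15, 30, 10, 40)
)

# Grouped filter on rank: keep the single largest delay per destination.
worst <- toy %>%
  group_by(dest) %>%
  filter(rank(desc(delay)) < 2)

# Grouped filter on group size: keep destinations with more than 2 rows.
big <- toy %>%
  group_by(dest) %>%
  filter(n() > 2)

# Grouped mutate: each row's share of its destination's total delay.
shares <- toy %>%
  group_by(dest) %>%
  mutate(prop = delay / sum(delay))
```

Because the data is grouped, rank(), n(), and sum() are all evaluated within each destination: `worst` keeps one row for A and one for B, `big` keeps only the rows for A, and the `prop` values sum to 1 within each destination.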
