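Earlier notes in this series covered several simple data-processing functions (R's built-in transform, aggregate, by, summary, and others) and packages (mainly reshape and reshape2). These tools are convenient but limited: for example, grouped processing with aggregate or reshape2 typically returns only a single value per group. This post introduces dplyr, a more powerful and more systematic package for working with data-frame-structured data. It is worth noting that reshape, reshape2, plyr, dplyr, and ggplot2 all share the same author, Hadley Wickham. The examples below follow those on the dplyr website.
dplyr's functionality falls into three main areas: Single table verbs, Two-table verbs, and Databases.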
This post focuses on how dplyr handles a single table, that is, a data frame. The example dataset is flights from the nycflights13 package; note that flights is stored as a tibble.
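The main single-table verbs are:
Select the rows (records) that meet a condition - filter()
Sort the data - arrange()
Pick variables by column name - select()
Create (compute and assign) new columns from existing ones - mutate()
Aggregate data, usually by grouping first and then returning one value per group from an aggregate function - summarise()
Random sampling (pick rows at random) - sample_n() and sample_frac()
Grouping - group_by()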
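1.1 Filtering: filter()
### Inspect the data: flights is a tibble, so printing it shows only the first rows and the columns that fit on screen rather than the whole table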
> library(nycflights13)
> dim(flights)
[1] 336776 19
> flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013 1 1 517 515 2. 830 819
2 2013 1 1 533 529 4. 850 830
3 2013 1 1 542 540 2. 923 850
4 2013 1 1 544 545 -1. 1004 1022
5 2013 1 1 554 600 -6. 812 837
6 2013 1 1 554 558 -4. 740 728
7 2013 1 1 555 600 -5. 913 854
8 2013 1 1 557 600 -3. 709 723
9 2013 1 1 557 600 -3. 838 846
10 2013 1 1 558 600 -2. 753 745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
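Rows are filtered with filter(data, condition), where the condition is a logical expression. It can be built from comparison operators (==, >, >=, and so on), the logical operators &, |, !, and xor(), and helpers such as is.na(), between(), and near().
### Keep the rows where month == 1 and day == 1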
> filter(flights, month == 1, day == 1)
Error in match.arg(method) : object 'day' not found
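### The error occurs because more than one package provides a function named filter, and the call above was dispatched to a different one (the match.arg(method) message points to stats::filter). Calling dplyr's version explicitly avoids the clash: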
> dplyr::filter(flights, month==1, day==1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 1 1 517 515 2. 830 819 11.
2 2013 1 1 533 529 4. 850 830 20.
3 2013 1 1 542 540 2. 923 850 33.
4 2013 1 1 544 545 -1. 1004 1022 -18.
5 2013 1 1 554 600 -6. 812 837 -25.
6 2013 1 1 554 558 -4. 740 728 12.
7 2013 1 1 555 600 -5. 913 854 19.
8 2013 1 1 557 600 -3. 709 723 -14.
9 2013 1 1 557 600 -3. 838 846 -8.
10 2013 1 1 558 600 -2. 753 745 8.
# ... with 832 more rows, and 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
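### For comparison, benchmark the same kind of row filtering done with base R and with dplyr
# Build a 1,000,000-row data frame to filter
df <- expand.grid(A = 1:100, B = 1:100, C = 1:100)
df$value <- 1:nrow(df)
library(dplyr); library(microbenchmark)
# Four ways to keep the rows where (A == 1 and B == 3) or (A == 3 and B == 2)
f1 <- function() subset(df, A == 1 & B == 3 | A == 3 & B == 2)          # base subset()
f2 <- function() dplyr::filter(df, A == 1 & B == 3 | A == 3 & B == 2)   # dplyr filter()
f3 <- function() df[with(df, A == 1 & B == 3 | A == 3 & B == 2), ]      # indexing via with()
f4 <- function() df[(df$A == 1 & df$B == 3) | (df$A == 3 & df$B == 2),] # indexing via $
# Time each approach (microbenchmark runs each expression 100 times by default)
microbenchmark(subset = f1(), filter = f2(), with = f3(), "$" = f4())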
# Unit: milliseconds
# expr min lq mean median uq max neval
# subset 47.42671 49.99802 75.95385 92.24430 96.05960 141.2964 100
# filter 36.94019 38.77325 60.22831 42.64112 84.35896 155.0145 100
# with 38.90918 44.36299 71.29214 86.39629 88.89008 134.7670 100
# $ 40.22723 44.08606 71.32186 86.71372 89.59275 133.1132 100
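The condition helpers listed at the start of this section can be tried on a small table. The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical and not part of nycflights13); it only illustrates how comma-separated conditions, between(), is.na(), and near() behave.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  month = c(1, 1, 2, 3, 12),
  delay = c(5, NA, -3, 40, 0.1 + 0.2)
)

# Comma-separated conditions are combined with AND;
# rows where the condition evaluates to NA are dropped
jan_delayed <- dplyr::filter(toy, month == 1, delay > 0)

# between(x, left, right) is shorthand for x >= left & x <= right
feb_mar <- dplyr::filter(toy, between(month, 2, 3))

# is.na() is the way to select rows with a missing value
missing_delay <- dplyr::filter(toy, is.na(delay))

# near() compares floating-point numbers with a small tolerance,
# so 0.1 + 0.2 is treated as equal to 0.3
near_0_3 <- dplyr::filter(toy, near(delay, 0.3))
```

Note that the second January row is dropped from jan_delayed because comparing NA with 0 yields NA rather than TRUE.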
1.2 Sorting: arrange()
Sort by one or more columns. The format is: arrange(data, colnames, ...)
### Ascending order
> dplyr::arrange(flights, month, day)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 1 1 517 515 2. 830 819 11.
2 2013 1 1 533 529 4. 850 830 20.
3 2013 1 1 542 540 2. 923 850 33.
4 2013 1 1 544 545 -1. 1004 1022 -18.
5 2013 1 1 554 600 -6. 812 837 -25.
6 2013 1 1 554 558 -4. 740 728 12.
7 2013 1 1 555 600 -5. 913 854 19.
8 2013 1 1 557 600 -3. 709 723 -14.
9 2013 1 1 557 600 -3. 838 846 -8.
10 2013 1 1 558 600 -2. 753 745 8.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
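### Descending order: desc() wraps a single column, so to sort several columns in descending order wrap each one separately
> dplyr::arrange(flights, desc(month))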
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 12 1 13 2359 14. 446 445 1.
2 2013 12 1 17 2359 18. 443 437 6.
3 2013 12 1 453 500 -7. 636 651 -15.
4 2013 12 1 520 515 5. 749 808 -19.
5 2013 12 1 536 540 -4. 845 850 -5.
6 2013 12 1 540 550 -10. 1005 1027 -22.
7 2013 12 1 541 545 -4. 734 755 -21.
8 2013 12 1 546 545 1. 826 835 -9.
9 2013 12 1 549 600 -11. 648 659 -11.
10 2013 12 1 550 600 -10. 825 854 -29.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
### Sorting with base R's order function
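The base R call itself is not shown above. As a rough sketch, sorting with order() could look like the following; the toy data frame is hypothetical and stands in for flights so that the example stays small.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(month = c(3, 1, 1, 12), day = c(5, 20, 2, 1))

# Base R: order() returns the row permutation, which is then used for indexing
asc_base  <- toy[order(toy$month, toy$day), ]   # ascending by month, then day
desc_base <- toy[order(-toy$month, toy$day), ]  # descending month, ascending day

# dplyr calls that should give the same row order
asc_dplyr  <- dplyr::arrange(toy, month, day)
desc_dplyr <- dplyr::arrange(toy, desc(month), day)
```

One visible difference is that base R indexing keeps the original row names, while arrange() returns rows renumbered from 1.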
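1.3 Selecting and renaming: select()
Select a subset of columns by name. The format is: select(data, colnames, ...)
### Select the three columns year, month, and day as a subset, renaming day to DAY along the way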
> df<-dplyr::select(flights,year,month,DAY=day);df
# A tibble: 336,776 x 3
year month DAY
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ... with 336,766 more rows
> dplyr::rename(flights,DAY=day)
# A tibble: 336,776 x 19
year month DAY dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 1 1 517 515 2. 830 819 11.
2 2013 1 1 533 529 4. 850 830 20.
3 2013 1 1 542 540 2. 923 850 33.
4 2013 1 1 544 545 -1. 1004 1022 -18.
5 2013 1 1 554 600 -6. 812 837 -25.
6 2013 1 1 554 558 -4. 740 728 12.
7 2013 1 1 555 600 -5. 913 854 19.
8 2013 1 1 557 600 -3. 709 723 -14.
9 2013 1 1 557 600 -3. 838 846 -8.
10 2013 1 1 558 600 -2. 753 745 8.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
### The base R way of selecting a subset of columns
> flights[c('year','month','day')]
# A tibble: 336,776 x 3
year month day
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ... with 336,766 more rows
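The difference between select() and rename() is easier to see on a small table. The sketch below uses a hypothetical toy data frame; the range (year:day) and negation (-dep_delay) forms go beyond what the text above covers and are included only as an illustration of other ways to name columns.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(year = 2013, month = 1:3, day = 4:6, dep_delay = c(2, 4, -1))

# select() keeps only the named columns; NEW = old renames while selecting
picked <- dplyr::select(toy, year, month, DAY = day)

# rename() changes the name but keeps every column
renamed <- dplyr::rename(toy, DAY = day)

# Columns can also be given as a contiguous range or excluded with a minus sign
ranged  <- dplyr::select(toy, year:day)
dropped <- dplyr::select(toy, -dep_delay)
```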
1.4 Transforming: mutate()
> dplyr::mutate(flights,gain = arr_delay - dep_delay,gain_per_hour = gain / (air_time / 60))
# A tibble: 336,776 x 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 1 1 517 515 2. 830 819 11.
2 2013 1 1 533 529 4. 850 830 20.
3 2013 1 1 542 540 2. 923 850 33.
4 2013 1 1 544 545 -1. 1004 1022 -18.
5 2013 1 1 554 600 -6. 812 837 -25.
6 2013 1 1 554 558 -4. 740 728 12.
7 2013 1 1 555 600 -5. 913 854 19.
8 2013 1 1 557 600 -3. 709 723 -14.
9 2013 1 1 557 600 -3. 838 846 -8.
10 2013 1 1 558 600 -2. 753 745 8.
# ... with 336,766 more rows, and 12 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>
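### The result has two more columns than flights (flights itself is not modified), and the new column gain_per_hour is computed from gain, which is created in the same call
### transmute() works the same way but keeps only the newly created columns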
> dplyr::transmute(flights,gain = arr_delay - dep_delay,gain_per_hour = gain / (air_time / 60))
# A tibble: 336,776 x 2
gain gain_per_hour
1 9. 2.38
2 16. 4.23
3 31. 11.6
4 -17. -5.57
5 -19. -9.83
6 16. 6.40
7 24. 9.11
8 -11. -12.5
9 -5. -2.14
10 10. 4.35
# ... with 336,766 more rows
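To check the arithmetic by hand, the same two calls can be run on a few made-up rows. This is a small sketch on a hypothetical toy data frame whose column names mirror flights; the values are invented.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  arr_delay = c(11, 20, -18),
  dep_delay = c(2, 4, -1),
  air_time  = c(120, 60, 180)   # minutes
)

# mutate() appends the new columns; gain_per_hour can refer to gain
# because columns are created in order within the same call
with_gain <- dplyr::mutate(
  toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transmute() computes the same columns but drops the original ones
only_gain <- dplyr::transmute(
  toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
```

For the first row, gain is 11 - 2 = 9 and the flight lasted 2 hours, so gain_per_hour is 4.5.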
1.5 Aggregation: summarize()
> dplyr::summarise(flights,delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
delay
1 12.6
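summarise() is not limited to one output column, and the na.rm = TRUE argument matters whenever a column contains missing values. The following sketch shows both points on a hypothetical four-row data frame.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(dep_delay = c(2, 4, NA, -6))

stats <- dplyr::summarise(
  toy,
  n_rows       = n(),                          # number of rows
  n_missing    = sum(is.na(dep_delay)),        # how many values are NA
  mean_delay   = mean(dep_delay, na.rm = TRUE),# mean of the non-missing values
  mean_with_na = mean(dep_delay)               # NA, because one input is NA
)
```

The result is a single row with four columns; without na.rm = TRUE the mean is NA as soon as one value is missing.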
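1.6 Sampling: sample_n()
sample_n() draws a fixed number of rows at random, while sample_frac() draws a given fraction of the rows. The formats are: sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL)
and sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL)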
> sample_n(flights, 10);sample_frac(flights,0.1)
# A tibble: 10 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 2 12 819 825 -6. 937 945 -8.
2 2013 5 5 2158 2159 -1. 2258 2337 -39.
3 2013 5 3 1535 1540 -5. 1745 1650 55.
4 2013 5 2 1824 1830 -6. 2115 2200 -45.
5 2013 5 1 1610 1610 0. 1719 1751 -32.
6 2013 9 8 1954 1859 55. 2134 2127 7.
7 2013 5 27 537 540 -3. 828 840 -12.
8 2013 1 24 806 810 -4. 1022 1044 -22.
9 2013 5 13 1551 1555 -4. 1659 1727 -28.
10 2013 8 12 NA 920 NA NA 1210 NA
# ... with 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 33,678 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
1 2013 11 16 1950 1930 20. 2055 2047 8.
2 2013 2 8 1724 1655 29. 2043 2009 34.
3 2013 9 12 652 700 -8. 931 949 -18.
4 2013 4 8 1830 1831 -1. 2157 2203 -6.
5 2013 10 17 1022 1025 -3. 1136 1140 -4.
6 2013 5 14 1736 1745 -9. 1942 2021 -39.
7 2013 11 28 745 736 9. 924 920 4.
8 2013 12 17 1034 1035 -1. 1418 1405 13.
9 2013 12 6 1953 2000 -7. 2114 2115 -1.
10 2013 10 22 1057 1100 -3. 1416 1415 1.
# ... with 33,668 more rows, and 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
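Because sampling is random, the rows returned differ from run to run unless a seed is set. The sketch below, on a hypothetical 50-row data frame, fixes the seed and also shows the replace argument. In dplyr 1.0.0 and later these two functions are superseded by slice_sample(), though they still work.

```r
library(dplyr)

set.seed(42)                      # make the random draws reproducible
toy <- data.frame(id = 1:50)      # hypothetical toy data

ten_rows  <- dplyr::sample_n(toy, 10)        # exactly 10 rows
one_tenth <- dplyr::sample_frac(toy, 0.1)    # 10% of 50 rows = 5 rows

# Sampling more rows than the table holds requires replace = TRUE
with_repl <- dplyr::sample_n(toy, 100, replace = TRUE)
```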
1.7 Grouping: group_by()
> dim(flights)
[1] 336776 19
> length(levels(factor(flights$tailnum)))
[1] 4043
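### flights has 336776 rows, and the tailnum column contains 4043 distinct tail numbers (aircraft identifiers); group_by() below reports 4,044 groups because rows with a missing tailnum form their own NA group
### Grouping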
> df1<-group_by(flights, tailnum);df1
# A tibble: 336,776 x 19
# Groups: tailnum [4,044]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
1 2013 1 1 517 515 2. 830 819 11. UA 1545 N14228 EWR IAH
2 2013 1 1 533 529 4. 850 830 20. UA 1714 N24211 LGA IAH
3 2013 1 1 542 540 2. 923 850 33. AA 1141 N619AA JFK MIA
4 2013 1 1 544 545 -1. 1004 1022 -18. B6 725 N804JB JFK BQN
5 2013 1 1 554 600 -6. 812 837 -25. DL 461 N668DN LGA ATL
6 2013 1 1 554 558 -4. 740 728 12. UA 1696 N39463 EWR ORD
7 2013 1 1 555 600 -5. 913 854 19. B6 507 N516JB EWR FLL
8 2013 1 1 557 600 -3. 709 723 -14. EV 5708 N829AS LGA IAD
9 2013 1 1 557 600 -3. 838 846 -8. B6 79 N593JB JFK MCO
10 2013 1 1 558 600 -2. 753 745 8. AA 301 N3ALAA LGA ORD
# ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
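### Aggregation
### Three operations are applied to the grouped data: (1) count the rows in each group with n(); (2) compute the mean distance in each group with mean(distance); (3) compute the mean arrival delay in each group with mean(arr_delay)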
> df2<-summarise(df1,count=n(),dist=mean(distance, na.rm=TRUE),delay=mean(arr_delay,na.rm=TRUE));df2
# A tibble: 4,044 x 4
tailnum count dist delay
1 D942DN 4 854. 31.5
2 N0EGMQ 371 676. 9.98
3 N10156 153 758. 12.7
4 N102UW 48 536. 2.94
5 N103US 46 535. -6.93
6 N104UW 47 535. 1.80
7 N10575 289 520. 20.7
8 N105UW 45 525. -0.267
9 N107US 41 529. -5.73
10 N108UW 60 534. -1.25
# ... with 4,034 more rows
### Finally, filter the aggregated data
> df3<-filter(df2, count>=20 & dist<2000);df3
# A tibble: 2,986 x 4
tailnum count dist delay
1 N0EGMQ 371 676. 9.98
2 N10156 153 758. 12.7
3 N102UW 48 536. 2.94
4 N103US 46 535. -6.93
5 N104UW 47 535. 1.80
6 N10575 289 520. 20.7
7 N105UW 45 525. -0.267
8 N107US 41 529. -5.73
9 N108UW 60 534. -1.25
10 N109UW 48 536. -2.52
# ... with 2,976 more rows
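### Plot mean arrival delay against mean distance, sizing each point by the number of flights
> library(ggplot2)
> ggplot(data=df3) +
+ geom_point(aes(x=dist, y=delay, size=count)) +
+ geom_smooth(aes(x=dist,y=delay))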
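dplyr also provides helper functions that are useful inside summarise():
Count the rows - n()
Count the distinct values in each group - n_distinct(x)
first(x), last(x), and nth(x, n)
return the value at the given position, similar to base R's x[1], x[length(x)], and x[n]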
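The helpers above can be combined in one summarise() call. This is a minimal sketch on a hypothetical five-row data frame (the tail numbers A and B are invented), grouped the same way as the flights example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  tailnum  = c("A", "A", "A", "B", "B"),
  dest     = c("IAH", "MIA", "IAH", "ORD", "ORD"),
  distance = c(1400, 1089, 1400, 719, 719),
  stringsAsFactors = FALSE
)

per_plane <- toy %>%
  group_by(tailnum) %>%
  summarise(
    count       = n(),               # rows in the group
    n_dest      = n_distinct(dest),  # distinct destinations in the group
    first_dest  = first(dest),       # like dest[1]
    last_dest   = last(dest),        # like dest[length(dest)]
    second_dist = nth(distance, 2)   # like distance[2]
  )
```

For plane A this gives count 3, two distinct destinations, IAH as both the first and last destination, and 1089 as the second distance.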
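### The same group, summarise, and filter steps chained with the pipe operator %>%
df<-flights %>%
  group_by(tailnum) %>%
  summarise(count=n(),
            dist=mean(distance, na.rm=TRUE),
            delay=mean(arr_delay,na.rm=TRUE)) %>%
  filter(count>=20 & dist<2000)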
> df
# A tibble: 2,986 x 4
tailnum count dist delay
1 N0EGMQ 371 676. 9.98
2 N10156 153 758. 12.7
3 N102UW 48 536. 2.94
4 N103US 46 535. -6.93
5 N104UW 47 535. 1.80
6 N10575 289 520. 20.7
7 N105UW 45 525. -0.267
8 N107US 41 529. -5.73
9 N108UW 60 534. -1.25
10 N109UW 48 536. -2.52
# ... with 2,976 more rows
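The pipe passes the result on its left as the first argument of the call on its right, so a piped chain and the equivalent nested call should give the same result. The sketch below checks this on a hypothetical toy data frame; the threshold count >= 3 is arbitrary and chosen only to fit the toy data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  tailnum  = c("A", "A", "B", "B", "B"),
  distance = c(100, 300, 50, 70, 90),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- dplyr::filter(
  summarise(group_by(toy, tailnum), count = n(), dist = mean(distance)),
  count >= 3
)

# Piped form: read from top to bottom
piped <- toy %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance)) %>%
  dplyr::filter(count >= 3)
```

Both forms keep only plane B, with a count of 3 and a mean distance of 70. The piped version avoids the intermediate variables (df1, df2, df3) used in the step-by-step version.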