filter()
arrange()
select()
mutate()
summarize()
dplyr functions never modify their input; to keep a result, assign it to a variable
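A minimal sketch of this point, assuming a small made-up data frame `df` (hypothetical, not part of the flights data):

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(id = 1:4, value = c(10, 25, 5, 40))

# The result is printed but not stored; df itself is unchanged
filter(df, value > 8)

# Assign the result to keep it
big <- filter(df, value > 8)

nrow(df)   # still 4 rows
nrow(big)  # 3 rows
```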
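1. filter(): filter rows
filter(data, expr1, expr2, ..., .preserve = FALSE)
data: a data frame
expr: logical expressions used to filter the rows; multiple expressions are combined with "and"
filter() keeps only the rows for which every expression is TRUE, so rows where an expression evaluates to FALSE or NA are dropped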
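The handling of NA can be sketched on a toy column; the data frame `df` and the names `kept` and `kept_na` are assumptions made for this illustration:

```r
library(dplyr)

# Hypothetical data with one missing value
df <- data.frame(x = c(1, NA, 3))

# NA > 1 evaluates to NA, so that row is dropped along with the FALSE row
kept <- filter(df, x > 1)

# To keep missing values, ask for them explicitly
kept_na <- filter(df, is.na(x) | x > 1)

kept$x     # 3
kept_na$x  # NA 3
```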
e.g. using the flights data from the nycflights13 package
> nycflights13::flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
> filter(flights, month == 1 | day == 1)
# A tibble: 37,198 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# ... with 37,188 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
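# Flights operated by UA, AA, or DL
> filter(flights, carrier %in% c('UA','AA','DL'))
# Flights that departed at least 1 hour late but made up more than 30 minutes in the air
> filter(flights, dep_delay >= 60, dep_delay > arr_delay + 30)
# Flights that departed between midnight and 6 a.m. (dep_time is an HHMM integer; midnight itself is recorded as 2400 in this data set, so it is not matched here)
> filter(flights, dep_time >= 0, dep_time <= 600)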
between(x, left, right)
is a shortcut for (x >= left & x <= right), so
filter(flights, dep_time >= 0, dep_time <= 600) is equivalent to filter(flights, between(dep_time, 0, 600))
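The equivalence can be checked on a few hand-picked values; the data frame below is hypothetical and only mimics the HHMM format of dep_time:

```r
library(dplyr)

# Hypothetical departure times, including both boundaries and a missing value
df <- data.frame(dep_time = c(30, 600, 601, 2400, NA))

a <- filter(df, dep_time >= 0, dep_time <= 600)
b <- filter(df, between(dep_time, 0, 600))

a$dep_time  # 30 600
b$dep_time  # 30 600
```

Both bounds are inclusive, and the NA row is dropped in either form.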
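2. arrange(): order rows
Changes the order of the rows
arrange(data, col/expr, ...)
data: the data frame to sort
col: a column to sort by
expr: an expression to sort by
Rows are sorted in ascending order by default; wrap a column in desc() to sort it in descending order. Missing values (NA) are always placed last, in either direction.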
e.g.
# Find the flights with the longest departure delays
arrange(flights, desc(dep_delay))
# Put the rows with missing values first
arrange(flights, desc(is.na(dep_delay)))
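The sorting rules above can be sketched on four made-up rows (the data frame and the result names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical flights, one with a missing delay
df <- data.frame(flight = c("a", "b", "c", "d"),
                 dep_delay = c(5, NA, 120, -3))

asc      <- arrange(df, dep_delay)                        # -3, 5, 120, NA
dsc      <- arrange(df, desc(dep_delay))                  # 120, 5, -3, NA
na_first <- arrange(df, desc(is.na(dep_delay)), dep_delay) # NA, -3, 5, 120
```

The NA row stays last in both `asc` and `dsc`; sorting on `desc(is.na(dep_delay))` first is what moves it to the top.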
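3. select(): select columns
Picks out specific columns
select(data, var/expr, ...)
e.g.
# Select the three columns year, month, and day
select(flights, year, month, day)
# Select all columns from year through day (inclusive)
select(flights, year:day)
# Select all columns except those from year through day
select(flights, -(year:day))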
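Helper functions:
starts_with("abc")
matches column names that begin with "abc"
ends_with("xyz")
matches column names that end with "xyz"
contains("ijk")
matches column names that contain "ijk"; case-insensitive by default
one_of(vars)
matches the columns named in the character vector vars (superseded by any_of() and all_of() in current dplyr)
matches("(.)\\1")
matches column names against a regular expression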
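A short sketch of the selection helpers, assuming a one-row made-up data frame whose column names imitate those of flights:

```r
library(dplyr)

# Hypothetical one-row data frame; only the column names matter here
df <- data.frame(dep_time = 1, dep_delay = 2, arr_time = 3,
                 arr_delay = 4, carrier = "UA")

s <- names(select(df, starts_with("dep")))      # dep_time, dep_delay
e <- names(select(df, ends_with("delay")))      # dep_delay, arr_delay
k <- names(select(df, contains("TIME")))        # dep_time, arr_time (case is ignored)
m <- names(select(df, matches("^(dep|arr)_")))  # the four dep_/arr_ columns
# any_of() silently skips names that do not exist in the data
a <- names(select(df, any_of(c("carrier", "no_such_column"))))  # carrier
```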
- Renaming a variable:
rename()
rename(flights, tail_num = tailnum)
- Moving selected variables to the front:
select()
combined with everything()
# Select the time_hour and air_time variables and move them to the front
select(flights, time_hour, air_time, everything())
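Both operations can be tried on a made-up data frame with a few flights-like columns (the data and the result names are assumptions):

```r
library(dplyr)

# Hypothetical one-row data frame
df <- data.frame(year = 2013, tailnum = "N1", air_time = 100, time_hour = 5)

# rename() uses new_name = old_name and keeps every other column in place
renamed <- rename(df, tail_num = tailnum)
names(renamed)  # year, tail_num, air_time, time_hour

# everything() stands for all columns not already named
front <- select(df, time_hour, air_time, everything())
names(front)    # time_hour, air_time, year, tailnum
```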
4. mutate(): add columns
mutate(data, colname = expr)
Adds new columns that are functions of existing columns
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
To keep only the new columns, use transmute()
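The difference between the two functions can be sketched on two made-up flights. In recent dplyr releases transmute() is marked as superseded, and `mutate(..., .keep = "none")` is the suggested replacement; the sketch uses transmute() to match the notes.

```r
library(dplyr)

# Hypothetical flights: delays in minutes, distance in miles, air_time in minutes
df <- data.frame(dep_delay = c(10, 60), arr_delay = c(0, 75),
                 distance = c(300, 1200), air_time = c(60, 120))

# mutate() appends the new columns to the existing ones
with_new <- mutate(df,
  gain  = arr_delay - dep_delay,
  speed = distance / air_time * 60   # miles per hour
)

# transmute() returns only the newly created columns
only_new <- transmute(df,
  gain  = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

names(with_new)  # dep_delay, arr_delay, distance, air_time, gain, speed
names(only_new)  # gain, speed
```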
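5. summarize(): summarize
Collapses a data frame into a single row of summary values
summarize(data, var = func(...))
summarize() is usually paired with group_by(), which changes the unit of analysis from the whole data set to individual groups, so the summary returns one row per group.
Use ungroup() to remove the grouping.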
e.g.
> by_day <- group_by(flights, year,month,day)
> summarize(by_day, delay = mean(dep_delay, na.rm=TRUE))
# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# ... with 355 more rows
# Unlike select(), group_by() keeps every column of the data set and only changes the unit of analysis to the individual group
> by.day <- select(flights, year,month,day)
> summarize(by.day, delay = mean(dep_delay, na.rm=TRUE))
Error in mean(dep_delay, na.rm = TRUE) : object 'dep_delay' not found
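The effect of ungroup() is not shown in the examples above; a minimal sketch on three made-up rows (hypothetical data and names) might look like this:

```r
library(dplyr)

# Hypothetical data: two rows in month 1, one row in month 2
df <- data.frame(month = c(1, 1, 2), day = c(1, 2, 1),
                 dep_delay = c(10, NA, 30))

by_month <- group_by(df, month)

# Grouped: one summary row per month
per_group <- summarize(by_month, delay = mean(dep_delay, na.rm = TRUE))

# Ungrouped: the whole data set collapses to a single row
overall <- summarize(ungroup(by_month), delay = mean(dep_delay, na.rm = TRUE))

per_group$delay  # 10 30
overall$delay    # 20
```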
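The pipe operator %>%
reduces the number of intermediate variables that need naming and makes the code easier to read
e.g.
# delay <- flights %>% group_by(year, month, day) %>% summarize(mean(dep_delay, na.rm = TRUE)) is equivalent to the two steps below, except that the summary column is left unnamed: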
# > by_day <- group_by(flights, year,month,day)
# > summarize(by_day, delay = mean(dep_delay, na.rm=TRUE))
> delay <- flights %>% group_by(year, month, day) %>% summarize(mean(dep_delay, na.rm = TRUE))
> delay
# A tibble: 365 x 4
# Groups: year, month [12]
year month day `mean(dep_delay, na.rm = TRUE)`
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
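The long column name in the output above comes from leaving the summary unnamed. A small sketch on made-up data (hypothetical `df`, `unnamed`, and `named`) shows the difference:

```r
library(dplyr)

# Hypothetical data: two flights in month 1, one in month 2
df <- data.frame(month = c(1, 1, 2), dep_delay = c(10, 20, 30))

# Without a name, the column is labelled with the expression itself
unnamed <- df %>% group_by(month) %>% summarize(mean(dep_delay))

# With a name, the column is called delay
named <- df %>% group_by(month) %>% summarize(delay = mean(dep_delay))

names(unnamed)  # "month" "mean(dep_delay)"
names(named)    # "month" "delay"
```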
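- Inside summarize(), aggregation functions can be combined with logical subsetting
# not_cancelled is not defined in these notes; it is assumed here to be the flights with non-missing delays
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    # average arrival delay
    avg_delay1 = mean(arr_delay),
    # average positive arrival delay
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )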
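To see what the logical subset does, the same two summaries can be computed on a handful of made-up arrival delays (the data frame and `res` are assumptions for illustration):

```r
library(dplyr)

# Hypothetical arrival delays in minutes for two days
df <- data.frame(day = c(1, 1, 1, 2, 2),
                 arr_delay = c(-15, 20, 40, -5, 15))

res <- df %>%
  group_by(day) %>%
  summarize(
    # mean over all flights of the day, early arrivals included
    avg_delay1 = mean(arr_delay),
    # mean over the late flights only: the subset is taken within each group
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )

res$avg_delay1  # 15 5
res$avg_delay2  # 30 15
```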
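- Commonly used summary functions:
Measures of location: mean(x), median(x)
Measures of spread: sd(x), standard deviation; IQR(x), interquartile range; mad(x), median absolute deviation
Measures of rank: min(x); quantile(x, 0.25), the value that is greater than 25% of the values and less than the other 75%; max(x)
Measures of position: first(x), nth(x, n), last(x)
Counts: n() takes no arguments and returns the size of the current group; sum(!is.na(x)) counts the non-missing values; n_distinct(x) counts the unique values
count(data, x) is a shortcut when all that is needed is a count per value of x; it is used on its own rather than inside summarize()
Counts and proportions of logical values: sum(x > 10), mean(y == 0); in numeric functions TRUE is converted to 1 and FALSE to 0, so sum() gives the number of TRUE values and mean() gives their proportion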
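Several of these functions can be combined in one summarize() call. The sketch below uses five made-up rows; the data frame and the column names of the result are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical departure delays for two carriers, one value missing
df <- data.frame(carrier = c("UA", "UA", "AA", "AA", "AA"),
                 dep_delay = c(5, NA, 20, 0, 40))

stats <- df %>%
  group_by(carrier) %>%
  summarize(
    n             = n(),                                # rows in the group
    n_non_missing = sum(!is.na(dep_delay)),             # non-missing values
    n_late        = sum(dep_delay > 10, na.rm = TRUE),  # count of TRUE
    prop_on_time  = mean(dep_delay == 0, na.rm = TRUE), # proportion of TRUE
    first_delay   = first(dep_delay),
    max_delay     = max(dep_delay, na.rm = TRUE)
  )

# count() on its own gives the per-carrier row counts in a column named n
counts <- count(df, carrier)
```

Groups come back sorted by carrier, so the AA row precedes the UA row in both results.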
e.g.
> # Find the planes (by tail number) with the worst on-time record, measured by average departure delay
> flights %>% group_by(tailnum) %>%
+ summarise(avrg_delay = mean(dep_delay, na.rm = TRUE)) %>%
+ arrange(desc(avrg_delay))
# A tibble: 4,044 x 2
tailnum avrg_delay
1 N844MH 297
2 N922EV 274
3 N587NW 272
4 N911DA 268
5 N851NW 233
6 N654UA 227
7 N928DN 203
8 N7715E 186
9 N665MQ 177
10 N136DL 165
# … with 4,034 more rows
> # Relationship between departure hour and average departure delay
> library(ggplot2)
> flights %>% group_by(hour) %>%
+ summarize(avrg_delay = mean(dep_delay, na.rm = TRUE)) %>%
+ ggplot(aes(x = hour, y = avrg_delay)) +
+ geom_point() +
+ geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE) +
+ labs(x = 'dep_time', y = 'avrg_delay')
Warning messages:
1: Removed 1 rows containing non-finite values (stat_smooth).
2: Removed 1 rows containing missing values (geom_point).