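About these notes
Data cleaning is often the most time-consuming part of a data analysis. dplyr is an efficient R package for data cleaning and manipulation, and it is one of the core packages of the tidyverse.
Commonly used dplyr operations include: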
mutate()
adds new variables that are functions of existing variables.
select()
picks variables based on their names.
filter()
picks cases based on their values.
summarise()
reduces multiple values down to a single summary.
arrange()
changes the ordering of the rows.
group_by()
allows you to perform any operation "by group".
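As a rough illustration of how these six verbs fit together, here is a minimal sketch on a small made-up table. The `sales` data and its column names are hypothetical and do not come from the notes; `tibble()` is assumed to be available through `library(dplyr)`, which re-exports it.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
sales <- tibble(
  region = c("north", "north", "south", "south"),
  units  = c(10, 20, 5, 15),
  price  = c(2, 3, 4, 1)
)

result <- sales %>%
  mutate(revenue = units * price) %>%          # add a variable computed from existing ones
  select(region, units, revenue) %>%           # keep variables by name
  filter(units >= 10) %>%                      # keep rows by value
  group_by(region) %>%                         # following steps run "by group"
  summarise(total_revenue = sum(revenue)) %>%  # collapse each group to one row
  arrange(desc(total_revenue))                 # order rows, largest total first

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes this kind of chaining possible.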
Main reference: https://b-rodrigues.github.io/modern_R/descriptive-statistics-and-data-manipulation.html#the-tidyverses-enfant-prodige-dplyr
Recommended reading: https://dplyr.tidyverse.org/
Preparation
Load the dplyr package:
library(dplyr)
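For the data, we use the Gasoline dataset from the plm package as the running example. It contains gasoline consumption for 18 countries from 1960 to 1978. The raw data is a data.frame object, which we convert to a tibble object with as_tibble().
A tibble can be thought of as an improved data.frame. The functions in dplyr work on data.frame objects as well as on tibble objects.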
# Data preparation
# install.packages() only needs to run once, if plm is not installed yet
install.packages("plm")
data(Gasoline, package = "plm")
gasoline <- as_tibble(Gasoline)
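To make the data.frame versus tibble point concrete without requiring plm, the following sketch uses `mtcars`, a dataset that ships with base R. It is an assumption-laden stand-in for Gasoline, chosen only so the snippet runs on its own.

```r
library(dplyr)

# mtcars ships with base R, so no extra package is needed here
cars_df  <- mtcars
cars_tbl <- as_tibble(mtcars)   # note: as_tibble() drops the row names

print(class(cars_df))    # a plain data.frame
print(class(cars_tbl))   # a tibble is still a data.frame underneath

# The same dplyr verb accepts both kinds of object
n_df  <- nrow(filter(cars_df,  cyl == 4))
n_tbl <- nrow(filter(cars_tbl, cyl == 4))
print(c(n_df, n_tbl))
```

Because a tibble inherits from data.frame, code written for data frames generally keeps working after the conversion; the main visible difference is the more compact printing seen in the outputs throughout these notes.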
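Selecting observations with filter()
filter() keeps the observations that satisfy a given condition (an observation is a row of the dataset). For example, to keep the observations in the gasoline dataset where the year is 1969: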
filter(gasoline, year == 1969)
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
##
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
The same line rewritten with the pipe operator:
gasoline %>% filter(year == 1969)
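The result is the same. The pipe operator %>% passes the object on its left as the first argument to the function on its right, so x %>% f(y) is equivalent to f(x, y).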
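The equivalence between x %>% f(y) and f(x, y) can be checked directly. This is a small sketch using base R functions only; the vectors are made up for illustration.

```r
library(dplyr)  # provides the %>% operator

values <- c(1.234, 5.678)

# x %>% f(y) versus f(x, y), with f = round and y = 1 digit
piped  <- values %>% round(1)
direct <- round(values, 1)
print(identical(piped, direct))

# Pipes read left to right; nested calls read inside out
squares <- c(4, 9, 16)
piped_sum  <- squares %>% sqrt() %>% sum()
nested_sum <- sum(sqrt(squares))
print(c(piped_sum, nested_sum))
```

The piped form tends to pay off once several steps are chained, since each step appears in the order it is executed.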
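Suppose we want the observations with years from 1969 to 1973, inclusive. We can use either the %in% operator or between().
The %in% operator tests, for each element of the vector on its left, whether it occurs in the vector on its right.
between(x, left, right) is a dplyr function equivalent to x >= left & x <= right.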
gasoline %>% filter(year %in% seq(1969, 1973))
gasoline %>% filter(between(year, 1969, 1973))
Both lines give the same result:
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
##
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # ... with 80 more rows
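The two filters agree here because year holds whole numbers. The sketch below, on a hypothetical `obs` table invented for this purpose, checks that agreement and also shows where the two approaches part ways: %in% tests membership in a set of values, while between() tests a closed interval.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
obs <- tibble(
  year  = c(1968, 1969, 1971, 1973, 1974),
  value = c(1, 2, 3, 4, 5)
)

by_in      <- obs %>% filter(year %in% seq(1969, 1973))
by_between <- obs %>% filter(between(year, 1969, 1973))

print(by_in$year)
print(identical(by_in$year, by_between$year))

# For a non-integer value the two tests differ:
in_set      <- 1970.5 %in% seq(1969, 1973)   # not one of 1969, 1970, ..., 1973
in_interval <- between(1970.5, 1969, 1973)   # inside the interval [1969, 1973]
print(c(in_set, in_interval))
```

For integer-valued columns such as year either form works; for continuous variables between() is usually the one that expresses a range.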
Selecting variables with select()
select() can be used to extract specific variables:
gasoline %>% select(country, year, lrpmg)
## # A tibble: 342 x 3
## country year lrpmg
##
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
select() can also be used to drop specific variables:
gasoline %>% select(-country, -year, -lrpmg)
## # A tibble: 342 x 3
## lgaspcar lincomep lcarpcap
##
## 1 4.17 -6.47 -9.77
## 2 4.10 -6.43 -9.61
## 3 4.07 -6.41 -9.46
## 4 4.06 -6.37 -9.34
## 5 4.04 -6.32 -9.24
## 6 4.03 -6.29 -9.12
## 7 4.05 -6.25 -9.02
## 8 4.05 -6.23 -8.93
## 9 4.05 -6.21 -8.85
## 10 4.05 -6.15 -8.79
## # ... with 332 more rows
When extracting variables, you can rename them with the form new_name = old_name:
gasoline %>% select(country, date = year, lrpmg)
## # A tibble: 342 x 3
## country date lrpmg
##
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
If you only want to rename variables and keep all the others, use rename():
gasoline %>% rename(nation = country, date = year)
## # A tibble: 342 x 6
## nation date lgaspcar lincomep lrpmg lcarpcap
##
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
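The difference between renaming inside select() and using rename() is easiest to see side by side. This is a minimal sketch on a hypothetical two-row table whose column names mirror a few of the Gasoline variables; the values are made up.

```r
library(dplyr)

# Hypothetical toy data with Gasoline-like column names
toy <- tibble(
  country = c("A", "B"),
  year    = c(1960L, 1961L),
  lrpmg   = c(-0.3, -0.4)
)

via_select <- toy %>% select(date = year)   # renames, and keeps only that column
via_rename <- toy %>% rename(date = year)   # renames, and keeps every column

print(names(via_select))
print(names(via_rename))
```

rename() leaves the column order untouched, so it is the safer choice when the goal is only to change names.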
select() can also be used to reorder variables:
gasoline %>% select(year, country, lrpmg, everything())
## # A tibble: 342 x 6
## year country lrpmg lgaspcar lincomep lcarpcap
##
## 1 1960 AUSTRIA -0.335 4.17 -6.47 -9.77
## 2 1961 AUSTRIA -0.351 4.10 -6.43 -9.61
## 3 1962 AUSTRIA -0.380 4.07 -6.41 -9.46
## 4 1963 AUSTRIA -0.414 4.06 -6.37 -9.34
## 5 1964 AUSTRIA -0.445 4.04 -6.32 -9.24
## 6 1965 AUSTRIA -0.497 4.03 -6.29 -9.12
## 7 1966 AUSTRIA -0.467 4.05 -6.25 -9.02
## 8 1967 AUSTRIA -0.506 4.05 -6.23 -8.93
## 9 1968 AUSTRIA -0.522 4.05 -6.21 -8.85
## 10 1969 AUSTRIA -0.559 4.05 -6.15 -8.79
## # ... with 332 more rows
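In this code, everything() matches all variables; because year, country and lrpmg are already listed before it, its effect is to append the remaining variables after them. It is one of the "select helpers".
Select helpers are special functions that only work inside selecting functions such as select(). They make it convenient to pick variables based on their names. The select helpers include: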
starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
num_range(): Matches a numerical range like x01, x02, x03.
one_of(): Matches variable names in a character vector (superseded by any_of() and all_of() in recent tidyselect releases).
everything(): Matches all variables.
last_col(): Select last variable, possibly with an offset.
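A few of these helpers can be tried on a one-row table that reuses the Gasoline column names, so the snippet runs without plm. The values are copied from the first row of the output shown earlier (AUSTRIA, 1960); the variable names on the left of each assignment are just labels chosen for this sketch.

```r
library(dplyr)

# One row with the same column names as the gasoline tibble
toy <- tibble(
  country  = "AUSTRIA",
  year     = 1960L,
  lgaspcar = 4.17,
  lincomep = -6.47,
  lrpmg    = -0.335,
  lcarpcap = -9.77
)

log_vars  <- names(select(toy, starts_with("l")))     # prefix match
p_suffix  <- names(select(toy, ends_with("p")))       # suffix match
car_vars  <- names(select(toy, contains("car")))      # literal substring
regex_sel <- names(select(toy, matches("^l.*p$")))    # regular expression
last_var  <- names(select(toy, last_col()))           # last column
reordered <- names(select(toy, year, everything()))   # year first, then the rest

print(log_vars)
print(p_suffix)
print(car_vars)
print(regex_sel)
print(last_var)
print(reordered)
```

Helpers can also be combined or negated inside a single select() call, for example select(toy, -starts_with("l")) to drop every variable whose name starts with "l".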
More examples can be found at: https://dplyr.tidyverse.org/reference/select.html