Reference:
Data Analysis and Prediction Algorithms with R
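The data.table library is used for data wrangling and analysis. Chapter 3 introduced the dplyr package for data processing; this chapter shows how to do the same things in data.table.
data.table is a separate package that must be installed and loaded on its own. This chapter covers the data.table counterparts of the functions from Chapter 3 (Data Processing in R): mutate, filter, select, group_by, and others.
First, we use the setDT function to convert the data frame into a data.table; otherwise the operations that follow may not work.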
library(tidyverse)
library(data.table)
library(dslabs)
# Work on our own copy of the dslabs dataset
murders <- copy(murders)
# Convert the data frame to a data.table
murders <- setDT(murders)
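The conversion step can be checked on a small example. The sketch below uses a made-up data frame named df (not part of the original example) and assumes the data.table package is installed. It contrasts as.data.table(), which returns a new object, with setDT(), which converts the existing object in place.

```r
library(data.table)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(id = 1:3, value = c(10, 20, 30))

# as.data.table() returns a new data.table and leaves df untouched
dt_copy <- as.data.table(df)
is_dt_before <- is.data.table(df)   # df is still a plain data.frame here

# setDT() converts df itself, by reference, without copying the data
setDT(df)
is_dt_after <- is.data.table(df)

print(c(before = is_dt_before, after = is_dt_after))
```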
To select specific columns with dplyr, we write:
select(murders, state, region) %>% head()
state | region |
---|---|
Alabama | South |
Alaska | West |
Arizona | West |
Arkansas | South |
California | West |
Colorado | West |
Here is how the same selection is written in data.table:
murders[, c('state', 'region')] %>% head()
state | region |
---|---|
Alabama | South |
Alaska | West |
Arizona | West |
Arkansas | South |
California | West |
Colorado | West |
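We can also use .() to refer to the columns directly, without quoting their names. (The rate column shown here is the one created with := later in this chapter.)
murders[, .(state, rate)] %>% head()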
state | rate |
---|---|
Alabama | 0.2824424 |
Alaska | 0.2675186 |
Arizona | 0.3629527 |
Arkansas | 0.3189390 |
California | 0.3374138 |
Colorado | 0.1292453 |
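When the column names are stored in a character vector rather than typed inline, data.table needs to be told that the variable holds names and is not itself a column. The sketch below shows three ways to do that on a made-up table; the table dt and the vector cols are illustrative names, not part of the original example.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(
  state  = c("A", "B"),
  region = c("South", "West"),
  total  = c(5, 7)
)

cols <- c("state", "total")

# The .. prefix tells data.table to look up cols outside the table
sel1 <- dt[, ..cols]

# with = FALSE treats j as a vector of column names
sel2 <- dt[, cols, with = FALSE]

# .SDcols restricts .SD (the subset of data) to the named columns
sel3 <- dt[, .SD, .SDcols = cols]

print(sel1)
```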
In dplyr, we add a column with the mutate function:
murders %>% mutate(rate = total / population * 10^5) %>% head()
state | abb | region | population | total | rate |
---|---|---|---|---|---|
Alabama | AL | South | 4779736 | 135 | 2.824424 |
Alaska | AK | West | 710231 | 19 | 2.675186 |
Arizona | AZ | West | 6392017 | 232 | 3.629527 |
Arkansas | AR | South | 2915918 | 93 | 3.189390 |
California | CA | West | 37253956 | 1257 | 3.374138 |
Colorado | CO | West | 5029196 | 65 | 1.292453 |
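In data.table, we use := to define a new column. It updates the table by reference instead of building a modified copy, which saves memory:
murders[, rate := total / population * 10^5] %>% head()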
state | abb | region | population | total | rate |
---|---|---|---|---|---|
Alabama | AL | South | 4779736 | 135 | 2.824424 |
Alaska | AK | West | 710231 | 19 | 2.675186 |
Arizona | AZ | West | 6392017 | 232 | 3.629527 |
Arkansas | AR | South | 2915918 | 93 | 3.189390 |
California | CA | West | 37253956 | 1257 | 3.374138 |
Colorado | CO | West | 5029196 | 65 | 1.292453 |
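Likewise, we can use := in its functional form to define several columns at once. Note that this call scales by 10000 rather than 10^5, so from here on rate is expressed per 10,000 people:
murders[, ':='(rate = total / population * 10000, rank = rank(population))] %>% head()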
state | abb | region | population | total | rate | rank |
---|---|---|---|---|---|---|
Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
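As a small sketch of the := forms, the block below repeats the multi-column assignment on a made-up two-row table and shows an equivalent spelling with a character vector of names on the left-hand side. The table dt and the columns rate2 and rank2 are hypothetical names introduced for illustration.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(total = c(135, 19), population = c(4779736, 710231))

# Functional form: several columns in one call
dt[, `:=`(rate = total / population * 10^5, rank = rank(population))]

# Equivalent form: character vector of names := list of values
dt[, c("rate2", "rank2") := .(total / population * 10^5, rank(population))]

# Assigning NULL removes a column, again by reference
dt[, rank2 := NULL]

print(dt)
```

In every case dt is modified in place, so there is no need to assign the result back to dt.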
The data.table package is designed to avoid wasting memory. So when we assign a table to a second name like this:
x <- data.table(a=1)
y <- x
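y is actually a reference to x rather than a new object; it is just another name for x. Ordinarily a new object is created only when y is changed.
With :=, however, the change is made by reference, so changing x does not create a new object and y changes along with it. When we do not want the original object to be affected, we need the copy() function.
x[, a := 2]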
y
a |
---|
2 |
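With copy(), later changes to x leave the copy untouched:
z <- copy(x)  # independent copy; x holds 2 at this point
x[, a := 3]   # changes x (and y), but not z
z
a |
---|
2 |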
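The difference between a second name and a real copy can also be inspected with data.table's address() function, which reports where an object lives in memory. This is a minimal sketch that starts from a fresh table; the variable names same_xy and same_xz are illustrative.

```r
library(data.table)

x <- data.table(a = 1)
y <- x          # y is another name for the same object
z <- copy(x)    # z is an independent copy

# Compare memory addresses before any modification
same_xy <- address(x) == address(y)
same_xz <- address(x) == address(z)

# Update x by reference; y follows, z does not
x[, a := 2]

print(c(same_xy = same_xy, same_xz = same_xz))
print(c(x = x$a, y = y$a, z = z$a))
```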
In dplyr, we filter rows with the following code:
filter(murders, rate <= 0.7) %>% head()
state | abb | region | population | total | rate | rank |
---|---|---|---|---|---|---|
Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
In data.table, we can put the condition directly in the row index:
murders[rate <= 0.7, .(state, rate)] %>% head()
state | rate |
---|---|
Alabama | 0.2824424 |
Alaska | 0.2675186 |
Arizona | 0.3629527 |
Arkansas | 0.3189390 |
California | 0.3374138 |
Colorado | 0.1292453 |
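Several conditions can be combined in the row index with the usual logical operators. The sketch below uses a made-up table dt with hypothetical values, not the murders data.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(
  state  = c("A", "B", "C", "D"),
  region = c("South", "West", "West", "Northeast"),
  rate   = c(0.3, 0.9, 0.5, 0.1)
)

# Rows that satisfy both conditions
low_west <- dt[rate <= 0.7 & region == "West"]

# Filter rows and select columns in the same call
low_sel <- dt[rate <= 0.7 & region %in% c("West", "Northeast"), .(state, rate)]

print(low_west)
print(low_sel)
```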
As in Chapter 3, we use the heights dataset as an example:
data(heights)
# Convert the data to a data.table object
heights <- setDT(heights)
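In data.table, we can call functions inside .() and they are applied to the columns directly. The earlier dplyr code can therefore be written more compactly: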
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
s
average | standard_deviation |
---|---|
68.32301 | 4.078617 |
Now suppose we want the average height and standard deviation for females only:
s <- heights[sex == 'Female', .(avg = mean(height), standard_deviation = sd(height))]
s
avg | standard_deviation |
---|---|
64.93942 | 3.760656 |
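Row counts can be added to a summary with the special symbol .N, which holds the number of rows in the current subset. The following is a small sketch on a made-up table, not the heights data.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(
  sex    = c("Female", "Male", "Female", "Male"),
  height = c(64, 70, 66, 72)
)

# Filter in i, summarize in j; .N counts the rows that passed the filter
s <- dt[sex == "Female",
        .(n = .N, avg = mean(height), standard_deviation = sd(height))]

print(s)
```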
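Recall that in Chapter 3 we defined the following function:
median_min_max <- function(x){
  # quantile() returns the median, minimum and maximum in this order
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
We can call it inside .() in the same way:
heights[, .(median_min_max(height))]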
median | min | max |
---|---|---|
68.5 | 50 | 82.67717 |
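A function used in this position can also return a named list, since data.table turns each list element into a column. The variant below is a sketch under that assumption; median_min_max_list and the table dt are hypothetical names, not from the original chapter.

```r
library(data.table)

# Hypothetical variant of the helper that returns a named list
median_min_max_list <- function(x) {
  qs <- quantile(x, c(0.5, 0, 1))
  # [[ ]] drops the quantile names so only the list names remain
  list(median = qs[[1]], min = qs[[2]], max = qs[[3]])
}

# Hypothetical toy table, used only for illustration
dt <- data.table(height = c(60, 65, 70, 75, 80))

# Each list element becomes a column of the result
res <- dt[, median_min_max_list(height)]

print(res)
```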
In dplyr we group with group_by; in data.table we use the by argument:
heights[,.(avg = mean(height), standard_deviation=sd(height)), by = sex]
sex | avg | standard_deviation |
---|---|---|
Male | 69.31475 | 3.611024 |
Female | 64.93942 | 3.760656 |
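Grouped results come back in the order in which each group first appears in the data. If sorted groups are preferred, keyby can be used instead of by. This is a minimal sketch on a made-up table with illustrative values.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(
  sex    = c("Male", "Female", "Male", "Female", "Male"),
  height = c(70, 64, 72, 66, 68)
)

# by keeps groups in order of first appearance
by_sex <- dt[, .(n = .N, avg = mean(height)), by = sex]

# keyby additionally sorts the result by the grouping column
keyby_sex <- dt[, .(n = .N, avg = mean(height)), keyby = sex]

print(by_sex)
print(keyby_sex)
```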
We can sort rows using the same approach we used for filtering. Here are the states ordered by population:
murders[order(population)] %>% head()
state | abb | region | population | total | rate | rank |
---|---|---|---|---|---|---|
Wyoming | WY | West | 563626 | 5 | 0.08871131 | 1 |
District of Columbia | DC | South | 601723 | 99 | 1.64527532 | 2 |
Vermont | VT | Northeast | 625741 | 2 | 0.03196211 | 3 |
North Dakota | ND | North Central | 672591 | 4 | 0.05947151 | 4 |
Alaska | AK | West | 710231 | 19 | 0.26751860 | 5 |
South Dakota | SD | North Central | 814180 | 8 | 0.09825837 | 6 |
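To sort in descending order, a minus sign can be placed in front of the column inside order(). data.table also provides setorder(), which reorders the table itself by reference rather than returning a sorted copy. The sketch below uses a made-up table dt.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(state = c("A", "B", "C"), population = c(300, 100, 200))

# Ascending and descending order; dt itself is unchanged
asc  <- dt[order(population)]
desc <- dt[order(-population)]
state_before <- dt$state

# setorder() reorders dt in place, here from largest to smallest
setorder(dt, -population)

print(asc)
print(dt)
```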