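Introduction to data.table syntax
This article is mainly about data.table, but before going into detail, here is a brief look at dplyr for comparison.
dplyr's strength is its elegant syntax, which follows the way people think and is easy to read. data.table's strength is its concise syntax and fast execution, which makes it very powerful on large data sets, although its syntax is sometimes harder to understand.
Commonly used dplyr functions
- select(): select columns
- filter(): filter rows
- mutate(): add new columns, similar to transform()
- group_by(): group the data
- summarise(): summarise the data
data.table
The general form of a data.table call is DT[i,j,by], where i selects rows, j selects or computes on columns, and by gives the grouping.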
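The iris data set is used for the examples below.
> library(data.table)
> DT <- data.table(iris)
> set.seed(45L)
> #recent data.table versions do not recycle a short right-hand side in :=, so it is repeated explicitly
> DT[,c("V1","V2"):=list(rep_len(LETTERS[1:3],.N),rep_len(c(1L,2L),.N))]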
> names(DT) <- tolower(names(DT))
> head(DT)
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
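As a rough sketch of how the three parts of DT[i,j,by] work together, the call below uses a small made-up table. The names dt, grp and value are illustrative and do not come from the iris example.

```r
library(data.table)

# Toy table (hypothetical data, used only for illustration)
dt <- data.table(
  grp   = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5)
)

# i: keep rows with value > 1
# j: compute the sum of value
# by: produce one result row per group
res <- dt[value > 1, .(total = sum(value)), by = grp]
print(res)
```

Reading the call from left to right gives "take these rows, compute this, grouped by that", which is the pattern used throughout the rest of the article.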
1. Filtering rows with i
- By row number
Select rows 3 to 5:
> DT[3:5,] #or DT[3:5]
sepal.length sepal.width petal.length petal.width species v1 v2
1: 4.7 3.2 1.3 0.2 setosa C 1
2: 4.6 3.1 1.5 0.2 setosa A 2
3: 5.0 3.6 1.4 0.2 setosa B 1
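- By condition
This example uses ==. It is easy to read, but it scans the whole column, which can be slow on large tables, so setting a key is recommended (covered in section 5).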
> head(DT[species=='setosa'])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
> tail(DT[species=='setosa'])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.8 1.9 0.4 setosa C 1
2: 4.8 3.0 1.4 0.3 setosa A 2
3: 5.1 3.8 1.6 0.2 setosa B 1
4: 4.6 3.2 1.4 0.2 setosa C 2
5: 5.3 3.7 1.5 0.2 setosa A 1
6: 5.0 3.3 1.4 0.2 setosa B 2
> head(DT[species %in% c("setosa","versicolor")]) #%in% acts as "or": rows matching either value are kept
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
> tail(DT[species %in% c("setosa","versicolor")])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.6 2.7 4.2 1.3 versicolor B 1
2: 5.7 3.0 4.2 1.2 versicolor C 2
3: 5.7 2.9 4.2 1.3 versicolor A 1
4: 6.2 2.9 4.3 1.3 versicolor B 2
5: 5.1 2.5 3.0 1.1 versicolor C 1
6: 5.7 2.8 4.1 1.3 versicolor A 2
> head(DT[sepal.length %between% c(4.5,5)])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 4.9 3.0 1.4 0.2 setosa B 2
2: 4.7 3.2 1.3 0.2 setosa C 1
3: 4.6 3.1 1.5 0.2 setosa A 2
4: 5.0 3.6 1.4 0.2 setosa B 1
5: 4.6 3.4 1.4 0.3 setosa A 1
6: 5.0 3.4 1.5 0.2 setosa B 2
> tail(DT[sepal.length %between% c(4.5,5)])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 4.6 3.2 1.4 0.2 setosa C 2
2: 5.0 3.3 1.4 0.2 setosa B 2
3: 4.9 2.4 3.3 1.0 versicolor A 2
4: 5.0 2.0 3.5 1.0 versicolor A 1
5: 5.0 2.3 3.3 1.0 versicolor A 2
6: 4.9 2.5 4.5 1.7 virginica B 1
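The filters used in this section can be checked on a small table. The sketch below assumes a made-up table dt with columns species and len. It shows that %between% includes both end points by default, and that on= is one way to get a binary-search lookup without setting a key first.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(
  species = c("setosa", "setosa", "versicolor", "virginica"),
  len     = c(4.5, 5.1, 5.0, 4.4)
)

# %between% is inclusive on both ends by default,
# so these two filters select the same rows
a <- dt[len %between% c(4.5, 5)]
b <- dt[len >= 4.5 & len <= 5]

# %in% keeps rows whose value matches any of the listed values
either <- dt[species %in% c("setosa", "versicolor")]

# on= names the column to match against; no key is required
looked_up <- dt[.("setosa"), on = "species"]
```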
2. Operating on columns with j
2.1 Selecting columns
- Select one column
.() is an alias for list()
> head(DT[,sepal.width]) #returned as a vector
[1] 3.5 3.0 3.2 3.1 3.6 3.9
> head(DT[,.(sepal.width)]) #returned as a data.table
sepal.width
1: 3.5
2: 3.0
3: 3.2
4: 3.1
5: 3.6
6: 3.9
- Select multiple columns
> head(DT[,.(sepal.width,sepal.length)])
sepal.width sepal.length
1: 3.5 5.1
2: 3.0 4.9
3: 3.2 4.7
4: 3.1 4.6
5: 3.6 5.0
6: 3.9 5.4
- Select columns by column number
> head(DT[,1,with=FALSE]) #select the first column
sepal.length
1: 5.1
2: 4.9
3: 4.7
4: 4.6
5: 5.0
6: 5.4
> head(DT[,2,with=FALSE]) #select the second column
sepal.width
1: 3.5
2: 3.0
3: 3.2
4: 3.1
5: 3.6
6: 3.9
> head(DT[,3,with=FALSE]) #select the third column
petal.length
1: 1.4
2: 1.4
3: 1.3
4: 1.5
5: 1.4
6: 1.7
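Besides with=FALSE, recent data.table versions offer a few other ways to select columns by position or by a vector of names. The sketch below assumes a made-up table with columns id, score and label.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(id = 1:3, score = c(2.5, 3.5, 4.5), label = letters[1:3])

# By position: recent versions accept a bare number in j
first_col <- dt[, 1]

# By name, with the names stored in a variable
cols <- c("id", "label")
by_prefix <- dt[, ..cols]              # the .. prefix looks up `cols` outside dt
by_with   <- dt[, cols, with = FALSE]  # older spelling with the same result
```

All three calls return a data.table rather than a vector.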
2.2 Using functions in j
> DT[,sum(sepal.width)]
[1] 458.6
> DT[,.(sum(sepal.width))]
V1
1: 458.6
> DT[,.(SUM=sum(sepal.width))] #the result column can be named
SUM
1: 458.6
- Column selection and functions can be combined
If the results have different lengths, the shorter one is recycled to match
> head(DT[,.(sepal.width,sd=sd(sepal.width))])
sepal.width sd
1: 3.5 0.4358663
2: 3.0 0.4358663
3: 3.2 0.4358663
4: 3.1 0.4358663
5: 3.6 0.4358663
6: 3.9 0.4358663
- Several expressions can be wrapped in curly braces
> DT[,{print(head(sepal.width))
+ plot(sepal.width)
+ NULL}]
[1] 3.5 3.0 3.2 3.1 3.6 3.9
#a scatter plot is drawn here; it cannot be shown in a code block
NULL
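The two ideas above, recycling in j and multi-statement blocks in braces, can be sketched on a small made-up table. The names dt, x, m and max_dev are illustrative only.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(x = c(1, 2, 3, 4))

# A length-1 result is recycled to match the longer column
recycled <- dt[, .(x, total = sum(x))]

# Statements inside { } run in order; the value of the last one is returned
res <- dt[, {
  m <- mean(x)                              # intermediate value, local to j
  .(mean = m, max_dev = max(abs(x - m)))    # the result of the block
}]
```

Because the last expression is a list, res is a one-row data.table. If the last expression were NULL, as in the plot example above, the call would return NULL.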
3. Operating on j by group
- Compute the sum of sepal.length for each species
> DT[,.(SUM=sum(sepal.length),by=species)]
SUM by
1: 876.5 setosa
2: 876.5 setosa
3: 876.5 setosa
4: 876.5 setosa
5: 876.5 setosa
---
146: 876.5 virginica
147: 876.5 virginica
148: 876.5 virginica
149: 876.5 virginica
150: 876.5 virginica
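#Note: above, by=species sits inside .(), so it becomes an ordinary column named by and no grouping takes place; below, by is passed as a separate argument, so the sum is computed per species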
> DT[,.(SUM=sum(sepal.length)),by=.(species)]
species SUM
1: setosa 250.3
2: versicolor 296.8
3: virginica 329.4
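The difference between the two calls is easier to see on a three-row table. This is a minimal sketch with made-up data; it also shows that by=g, by=.(g) and by="g" are interchangeable ways of naming the grouping column.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3))

# by= inside .(): it is just a column called "by"; the sum covers all rows
wrong <- dt[, .(SUM = sum(x), by = g)]

# by= as its own argument: one sum per group
right <- dt[, .(SUM = sum(x)), by = g]

# Equivalent spellings of the grouping argument
right_list <- dt[, .(SUM = sum(x)), by = .(g)]
right_chr  <- dt[, .(SUM = sum(x)), by = "g"]
```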
- Group by multiple columns
> DT[,.(SUM=sum(sepal.width)),by=.(species,v1)]
species v1 SUM
1: setosa A 59.0
2: setosa B 58.6
3: setosa C 53.8
4: versicolor C 46.5
5: versicolor A 45.5
6: versicolor B 46.5
7: virginica B 51.4
8: virginica C 49.6
9: virginica A 47.7
- Use a function in by
> DT[,.(SUM=sum(sepal.length)),by=sign(v2-1)]
sign SUM
1: 0 438.0
2: 1 438.5
- Group and summarise only the subset of rows selected in i
> DT[1:40,.(SUM=sum(sepal.length)),by=species]
species SUM
1: setosa 201.5
- Use .N to count the rows in each group
> DT[,.(count=.N),by=species]
species count
1: setosa 50
2: versicolor 50
3: virginica 50
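Two small refinements of the grouping examples above, sketched on made-up data: an expression in by can be given a name, and keyby= groups like by= while also sorting the result by the grouping columns.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(g = c("b", "a", "b", "a"), x = c(1, 6, 7, 2))

# Name the computed grouping column instead of relying on a default name
by_expr <- dt[, .(n = .N, total = sum(x)), by = .(big = x > 5)]

# keyby= sorts the result by g and sets g as its key
sorted <- dt[, .(n = .N), keyby = g]
```

With by=, groups appear in the order they are first seen in the data, so keyby= is a convenient choice when a sorted summary is wanted.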
4. Adding, updating, and deleting columns with :=
Note: := modifies the original data set in place, so DT <- DT[,:=] is unnecessary; DT[,:=] on its own is enough.
- Update a column
> dt <- copy(DT)
> head(dt)
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
> head(dt[,v1:=round(exp(v2))])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa 3 1
2: 4.9 3.0 1.4 0.2 setosa 7 2
3: 4.7 3.2 1.3 0.2 setosa 3 1
4: 4.6 3.1 1.5 0.2 setosa 7 2
5: 5.0 3.6 1.4 0.2 setosa 3 1
6: 5.4 3.9 1.7 0.4 setosa 7 2
- Add multiple columns
> dt[,c("h1","h2"):=.(round(exp(v2)),rep_len(LETTERS[4:6],.N))]
> head(dt)
sepal.length sepal.width petal.length petal.width species v1 v2 h1 h2
1: 5.1 3.5 1.4 0.2 setosa 3 1 3 D
2: 4.9 3.0 1.4 0.2 setosa 7 2 7 E
3: 4.7 3.2 1.3 0.2 setosa 3 1 3 F
4: 4.6 3.1 1.5 0.2 setosa 7 2 7 D
5: 5.0 3.6 1.4 0.2 setosa 3 1 3 E
6: 5.4 3.9 1.7 0.4 setosa 7 2 7 F
# The above can also be written as follows; for display, only columns 5 to 9 are shown
> head(dt[,':='(h1=round(exp(v2)),h2=rep_len(LETTERS[4:6],.N))][,5:9])
species v1 v2 h1 h2
1: setosa A 1 3 D
2: setosa B 2 7 E
3: setosa C 1 3 F
4: setosa A 2 7 D
5: setosa B 1 3 E
6: setosa C 2 7 F
- Delete columns
> dt[,':='(h1=NULL,h2=NULL)]
> head(dt)
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
It can also be written as follows
----------
> head(dt[,c("h1","h2"):=NULL])
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
- Update values only where a condition holds
> dt[sepal.length>4&v1=='A',v2:=3]
> head(dt[,.(v2)])
v2
1: 3
2: 2
3: 1
4: 3
5: 1
6: 2
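Because := works by reference, plain assignment with <- does not create an independent table. This is why the examples above start from copy(DT). A minimal sketch with made-up data (the names alias and snap are illustrative):

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(id = 1:3, v = c(10, 20, 30))

alias <- dt          # not a copy: both names refer to the same table
snap  <- copy(dt)    # an independent copy

dt[, w := v * 2]         # add a column by reference
dt[id == 2L, v := 0]     # update only the rows matched in i

alias_has_w <- "w" %in% names(alias)   # the alias sees the new column
snap_has_w  <- "w" %in% names(snap)    # the copy is unaffected
```

If the original table has to stay untouched, take a copy() first and modify the copy.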
5. Setting a key and operating on it
- Set the key when creating the data.table
> data <- data.table(a=c('A','B','C','A','A','B'),b=rnorm(6),key="a")
> head(data)
a b
1: A 0.3407997
2: A -0.7460474
3: A -0.8981073
4: B -0.7033403
5: B -0.3347941
6: C -0.3795377
- Set the key on an existing data.table
> dt <- data.table(a=c('A','B','C','A','A','B'),b=rnorm(6))
> dt
a b
1: A -0.5013782
2: B -0.1745357
3: C 1.8090374
4: A -0.2301050
5: A -1.1304182
6: B 0.2159889
#compare the row order of dt before and after setkey
> setkey(dt,a) #rows are sorted by the key column automatically
> dt
a b
1: A -0.5013782
2: A -0.2301050
3: A -1.1304182
4: B -0.1745357
5: B 0.2159889
6: C 1.8090374
- Check whether a data.table has a key
> key(dt)
[1] "a"
> haskey(dt)
[1] TRUE
> attributes(dt)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6
$class
[1] "data.table" "data.frame"
$.internal.selfref
$sorted
[1] "a"
> attributes(dt)$sorted
[1] "a"
- With column a set as the key, select the rows where a is B
> dt['B']
a b
1: B -0.1745357
2: B 0.2159889
- With the key set, select the first row where a is B
> dt['B',mult='first'] #mult defaults to "all"
a b
1: B -0.1745357
- With the key set, select the last row where a is B
> dt['B',mult='last']
a b
1: B 0.2159889
- With column a set as the key, select the rows where a is A or B
> dt[c('A','B')]
a b
1: A -0.5013782
2: A -0.2301050
3: A -1.1304182
4: B -0.1745357
5: B 0.2159889
- The nomatch argument controls what is returned for values in i that have no match. The default is NA; with nomatch=0, unmatched rows are not returned
> dt[c('A','D')]
a b
1: A -0.5013782
2: A -0.2301050
3: A -1.1304182
4: D NA
----------
> dt[c('A','D'),nomatch=0]
a b
1: A -0.5013782
2: A -0.2301050
3: A -1.1304182
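The key lookups above can be reproduced with fixed numbers instead of rnorm(), which makes the results easy to check. In this sketch the data are made up, and nomatch=NULL is used as the spelling that recent data.table versions accept in place of nomatch=0.

```r
library(data.table)

# Toy keyed table (hypothetical data); rows are sorted by a on creation
dt <- data.table(a = c("A", "B", "A"), b = c(1, 2, 3), key = "a")

with_na <- dt[c("A", "D")]                  # unmatched "D" is kept, b is NA
dropped <- dt[c("A", "D"), nomatch = NULL]  # unmatched "D" is dropped
first_a <- dt["A", mult = "first"]          # only the first matching row
```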
- by=.EACHI groups by each value given in i; a key must be set first (or the matching column named with on=)
> dt[c('A','B'),sum(b)]
[1] -1.820448
----------
> dt[c('A','B'),sum(b),by=.EACHI]
a V1
1: A -1.86190135
2: B 0.04145319
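With fixed numbers, the effect of by=.EACHI can be verified by hand. A minimal sketch on made-up data:

```r
library(data.table)

# Toy keyed table (hypothetical data)
dt <- data.table(a = c("A", "B", "A", "B"), b = c(1, 2, 3, 4), key = "a")

overall <- dt[c("A", "B"), sum(b)]                # one number for all matched rows
per_key <- dt[c("A", "B"), sum(b), by = .EACHI]   # one row per value in i
```

Here overall is the sum over every matched row, while per_key holds one sum for "A" and one for "B".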
- Set multiple key columns
> head(DT)
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.9 3.0 1.4 0.2 setosa B 2
3: 4.7 3.2 1.3 0.2 setosa C 1
4: 4.6 3.1 1.5 0.2 setosa A 2
5: 5.0 3.6 1.4 0.2 setosa B 1
6: 5.4 3.9 1.7 0.4 setosa C 2
> setkey(DT,v1,v2) #sorts by v1 first, then by v2
> head(DT[.('B',1)]) #rows where v1 is B and v2 is 1
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.0 3.6 1.4 0.2 setosa B 1
2: 5.4 3.7 1.5 0.2 setosa B 1
3: 5.4 3.9 1.3 0.4 setosa B 1
4: 4.6 3.6 1.0 0.2 setosa B 1
5: 5.2 3.4 1.4 0.2 setosa B 1
6: 4.9 3.1 1.5 0.2 setosa B 1
> head(DT[.(c('A','B'),1)]) #rows where v1 is A or B and v2 is 1
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.1 3.5 1.4 0.2 setosa A 1
2: 4.6 3.4 1.4 0.3 setosa A 1
3: 4.8 3.0 1.4 0.1 setosa A 1
4: 5.7 3.8 1.7 0.3 setosa A 1
5: 4.8 3.4 1.9 0.2 setosa A 1
6: 4.8 3.1 1.6 0.2 setosa A 1
> tail(DT[.(c('A','B'),1)]) #rows where v1 is A or B and v2 is 1
sepal.length sepal.width petal.length petal.width species v1 v2
1: 7.7 2.6 6.9 2.3 virginica B 1
2: 6.7 3.3 5.7 2.1 virginica B 1
3: 7.4 2.8 6.1 1.9 virginica B 1
4: 6.3 3.4 5.6 2.4 virginica B 1
5: 5.8 2.7 5.1 1.9 virginica B 1
6: 6.2 3.4 5.4 2.3 virginica B 1
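The same two-column lookup can be sketched on a four-row table, where the matches are easy to verify. The data below are made up; the column names v1 and v2 simply mirror the example above.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(
  v1 = c("A", "B", "A", "B"),
  v2 = c(1L, 1L, 2L, 2L),
  x  = c(10, 20, 30, 40)
)
setkey(dt, v1, v2)   # sorted by v1, then v2

one       <- dt[.("B", 1L)]           # v1 is "B" and v2 is 1
both      <- dt[.(c("A", "B"), 1L)]   # v1 is "A" or "B", and v2 is 1
first_key <- dt[.("B")]               # match on the leading key column only
```

Values inside .() are matched to the key columns in order, so supplying only one value matches on v1 alone.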
6. Advanced data.table operations
- Use .N for the number of rows
> DT[.N] #in i, .N returns the last row
sepal.length sepal.width petal.length petal.width species v1 v2
1: 5.9 3 5.1 1.8 virginica C 2
> DT[,.N] #in j, .N returns the number of rows
[1] 150
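Since .N is simply an integer, it can also be used in arithmetic inside i. A small sketch on made-up data:

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(x = c(5, 6, 7))

last_row <- dt[.N]              # in i: the row at position .N, i.e. the last row
n_rows   <- dt[, .N]            # in j: the number of rows
last_two <- dt[(.N - 1):.N]     # the last two rows
```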
- .SD
.SD is a data.table that holds the data of the current group: every column except those named in by. It can only be used in j
> DT[,print(.SD),by=v1]
sepal.length sepal.width petal.length petal.width species v2
1: 5.1 3.5 1.4 0.2 setosa 1
2: 4.6 3.4 1.4 0.3 setosa 1
3: 4.8 3.0 1.4 0.1 setosa 1
4: 5.7 3.8 1.7 0.3 setosa 1
5: 4.8 3.4 1.9 0.2 setosa 1
6: 4.8 3.1 1.6 0.2 setosa 1
7: 5.5 3.5 1.3 0.2 setosa 1
8: 4.4 3.2 1.3 0.2 setosa 1
9: 5.3 3.7 1.5 0.2 setosa 1
10: 6.5 2.8 4.6 1.5 versicolor 1
11: 5.0 2.0 3.5 1.0 versicolor 1
12: 5.6 3.0 4.5 1.5 versicolor 1
13: 6.3 2.5 4.9 1.5 versicolor 1
14: 6.0 2.9 4.5 1.5 versicolor 1
15: 5.4 3.0 4.5 1.5 versicolor 1
16: 5.5 2.6 4.4 1.2 versicolor 1
17: 5.7 2.9 4.2 1.3 versicolor 1
18: 7.1 3.0 5.9 2.1 virginica 1
19: 6.7 2.5 5.8 1.8 virginica 1
20: 5.8 2.8 5.1 2.4 virginica 1
21: 6.9 3.2 5.7 2.3 virginica 1
22: 6.2 2.8 4.8 1.8 virginica 1
23: 6.4 2.8 5.6 2.2 virginica 1
24: 6.0 3.0 4.8 1.8 virginica 1
25: 6.7 3.3 5.7 2.5 virginica 1
26: 4.6 3.1 1.5 0.2 setosa 2
27: 4.9 3.1 1.5 0.1 setosa 2
28: 5.7 4.4 1.5 0.4 setosa 2
29: 5.1 3.7 1.5 0.4 setosa 2
30: 5.2 3.5 1.5 0.2 setosa 2
31: 5.5 4.2 1.4 0.2 setosa 2
32: 5.1 3.4 1.5 0.2 setosa 2
33: 4.8 3.0 1.4 0.3 setosa 2
34: 6.4 3.2 4.5 1.5 versicolor 2
35: 4.9 2.4 3.3 1.0 versicolor 2
36: 6.1 2.9 4.7 1.4 versicolor 2
37: 5.6 2.5 3.9 1.1 versicolor 2
38: 6.6 3.0 4.4 1.4 versicolor 2
39: 5.5 2.4 3.7 1.0 versicolor 2
40: 6.3 2.3 4.4 1.3 versicolor 2
41: 5.0 2.3 3.3 1.0 versicolor 2
42: 5.7 2.8 4.1 1.3 versicolor 2
43: 7.6 3.0 6.6 2.1 virginica 2
44: 6.4 2.7 5.3 1.9 virginica 2
45: 7.7 3.8 6.7 2.2 virginica 2
46: 6.3 2.7 4.9 1.8 virginica 2
47: 7.2 3.0 5.8 1.6 virginica 2
48: 7.7 3.0 6.1 2.3 virginica 2
49: 6.9 3.1 5.1 2.3 virginica 2
50: 6.5 3.0 5.2 2.0 virginica 2
sepal.length sepal.width petal.length petal.width species v2
sepal.length sepal.width petal.length petal.width species v2
1: 5.0 3.6 1.4 0.2 setosa 1
2: 5.4 3.7 1.5 0.2 setosa 1
3: 5.4 3.9 1.3 0.4 setosa 1
4: 4.6 3.6 1.0 0.2 setosa 1
5: 5.2 3.4 1.4 0.2 setosa 1
6: 4.9 3.1 1.5 0.2 setosa 1
7: 5.0 3.5 1.3 0.3 setosa 1
8: 5.1 3.8 1.6 0.2 setosa 1
9: 6.9 3.1 4.9 1.5 versicolor 1
10: 6.6 2.9 4.6 1.3 versicolor 1
11: 5.6 2.9 3.6 1.3 versicolor 1
12: 5.9 3.2 4.8 1.8 versicolor 1
13: 6.8 2.8 4.8 1.4 versicolor 1
14: 5.8 2.7 3.9 1.2 versicolor 1
15: 5.6 3.0 4.1 1.3 versicolor 1
16: 5.6 2.7 4.2 1.3 versicolor 1
17: 6.3 3.3 6.0 2.5 virginica 1
18: 4.9 2.5 4.5 1.7 virginica 1
19: 6.8 3.0 5.5 2.1 virginica 1
20: 7.7 2.6 6.9 2.3 virginica 1
21: 6.7 3.3 5.7 2.1 virginica 1
22: 7.4 2.8 6.1 1.9 virginica 1
23: 6.3 3.4 5.6 2.4 virginica 1
24: 5.8 2.7 5.1 1.9 virginica 1
25: 6.2 3.4 5.4 2.3 virginica 1
26: 4.9 3.0 1.4 0.2 setosa 2
27: 5.0 3.4 1.5 0.2 setosa 2
28: 4.3 3.0 1.1 0.1 setosa 2
29: 5.1 3.8 1.5 0.3 setosa 2
30: 5.0 3.0 1.6 0.2 setosa 2
31: 5.4 3.4 1.5 0.4 setosa 2
32: 4.9 3.6 1.4 0.1 setosa 2
33: 5.0 3.5 1.6 0.6 setosa 2
34: 5.0 3.3 1.4 0.2 setosa 2
35: 5.7 2.8 4.5 1.3 versicolor 2
36: 5.9 3.0 4.2 1.5 versicolor 2
37: 5.8 2.7 4.1 1.0 versicolor 2
38: 6.1 2.8 4.7 1.2 versicolor 2
39: 5.7 2.6 3.5 1.0 versicolor 2
40: 6.0 3.4 4.5 1.6 versicolor 2
41: 6.1 3.0 4.6 1.4 versicolor 2
42: 6.2 2.9 4.3 1.3 versicolor 2
43: 6.3 2.9 5.6 1.8 virginica 2
44: 7.2 3.6 6.1 2.5 virginica 2
45: 6.4 3.2 5.3 2.3 virginica 2
46: 5.6 2.8 4.9 2.0 virginica 2
47: 6.1 3.0 4.9 1.8 virginica 2
48: 6.3 2.8 5.1 1.5 virginica 2
49: 6.9 3.1 5.4 2.1 virginica 2
50: 6.7 3.0 5.2 2.3 virginica 2
sepal.length sepal.width petal.length petal.width species v2
sepal.length sepal.width petal.length petal.width species v2
1: 4.7 3.2 1.3 0.2 setosa 1
2: 4.4 2.9 1.4 0.2 setosa 1
3: 5.8 4.0 1.2 0.2 setosa 1
4: 5.4 3.4 1.7 0.2 setosa 1
5: 5.0 3.4 1.6 0.4 setosa 1
6: 5.2 4.1 1.5 0.1 setosa 1
7: 4.4 3.0 1.3 0.2 setosa 1
8: 5.1 3.8 1.9 0.4 setosa 1
9: 7.0 3.2 4.7 1.4 versicolor 1
10: 6.3 3.3 4.7 1.6 versicolor 1
11: 6.0 2.2 4.0 1.0 versicolor 1
12: 6.2 2.2 4.5 1.5 versicolor 1
13: 6.4 2.9 4.3 1.3 versicolor 1
14: 5.5 2.4 3.8 1.1 versicolor 1
15: 6.7 3.1 4.7 1.5 versicolor 1
16: 5.8 2.6 4.0 1.2 versicolor 1
17: 5.1 2.5 3.0 1.1 versicolor 1
18: 6.5 3.0 5.8 2.2 virginica 1
19: 6.5 3.2 5.1 2.0 virginica 1
20: 6.5 3.0 5.5 1.8 virginica 1
21: 7.7 2.8 6.7 2.0 virginica 1
22: 6.4 2.8 5.6 2.1 virginica 1
23: 6.1 2.6 5.6 1.4 virginica 1
24: 6.7 3.1 5.6 2.4 virginica 1
25: 6.3 2.5 5.0 1.9 virginica 1
26: 5.4 3.9 1.7 0.4 setosa 2
27: 4.8 3.4 1.6 0.2 setosa 2
28: 5.1 3.5 1.4 0.3 setosa 2
29: 5.1 3.3 1.7 0.5 setosa 2
30: 4.7 3.2 1.6 0.2 setosa 2
31: 5.0 3.2 1.2 0.2 setosa 2
32: 4.5 2.3 1.3 0.3 setosa 2
33: 4.6 3.2 1.4 0.2 setosa 2
34: 5.5 2.3 4.0 1.3 versicolor 2
35: 5.2 2.7 3.9 1.4 versicolor 2
36: 6.7 3.1 4.4 1.4 versicolor 2
37: 6.1 2.8 4.0 1.3 versicolor 2
38: 6.7 3.0 5.0 1.7 versicolor 2
39: 6.0 2.7 5.1 1.6 versicolor 2
40: 5.5 2.5 4.0 1.3 versicolor 2
41: 5.7 3.0 4.2 1.2 versicolor 2
42: 5.8 2.7 5.1 1.9 virginica 2
43: 7.3 2.9 6.3 1.8 virginica 2
44: 5.7 2.5 5.0 2.0 virginica 2
45: 6.0 2.2 5.0 1.5 virginica 2
46: 7.2 3.2 6.0 1.8 virginica 2
47: 7.9 3.8 6.4 2.0 virginica 2
48: 6.4 3.1 5.5 1.8 virginica 2
49: 6.8 3.2 5.9 2.3 virginica 2
50: 5.9 3.0 5.1 1.8 virginica 2
sepal.length sepal.width petal.length petal.width species v2
Empty data.table (0 rows) of 1 col: v1
> DT[,.SD,by=v1][]
v1 sepal.length sepal.width petal.length petal.width species v2
1: A 5.1 3.5 1.4 0.2 setosa 1
2: A 4.6 3.4 1.4 0.3 setosa 1
3: A 4.8 3.0 1.4 0.1 setosa 1
4: A 5.7 3.8 1.7 0.3 setosa 1
5: A 4.8 3.4 1.9 0.2 setosa 1
---
146: C 7.2 3.2 6.0 1.8 virginica 2
147: C 7.9 3.8 6.4 2.0 virginica 2
148: C 6.4 3.1 5.5 1.8 virginica 2
149: C 6.8 3.2 5.9 2.3 virginica 2
150: C 5.9 3.0 5.1 1.8 virginica 2
- Return the first and last row of each group defined by v1
> DT[,.SD[c(1,.N)],by=v1]
v1 sepal.length sepal.width petal.length petal.width species v2
1: A 5.1 3.5 1.4 0.2 setosa 1
2: A 6.5 3.0 5.2 2.0 virginica 2
3: B 5.0 3.6 1.4 0.2 setosa 1
4: B 6.7 3.0 5.2 2.3 virginica 2
5: C 4.7 3.2 1.3 0.2 setosa 1
6: C 5.9 3.0 5.1 1.8 virginica 2
- Return the sums of all other columns, grouped by v1 and species
> DT[,lapply(.SD,sum),by=c("v1","species")]
v1 species sepal.length sepal.width petal.length petal.width v2
1: A setosa 85.9 59.0 25.3 3.9 25
2: A versicolor 98.1 45.5 71.4 22.0 26
3: A virginica 108.1 47.7 89.1 33.1 24
4: B setosa 85.2 58.6 24.0 4.2 26
5: B versicolor 96.3 46.5 69.3 21.4 24
6: B virginica 109.6 51.4 93.3 35.5 25
7: C setosa 79.2 53.8 23.8 4.2 24
8: C versicolor 102.4 46.5 72.3 22.9 25
9: C virginica 111.7 49.6 95.2 32.7 26
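Both .SD idioms can be checked on a five-row table. This is a minimal sketch with made-up data; g, x and y are illustrative names.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(
  g = c("a", "a", "a", "b", "b"),
  x = 1:5,
  y = c(10, 20, 30, 40, 50)
)

# .SD is the per-group subset, so .SD[c(1, .N)] is its first and last row
first_last <- dt[, .SD[c(1, .N)], by = g]

# lapply over .SD applies the function to every non-grouping column
sums <- dt[, lapply(.SD, sum), by = g]
```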
- .SDcols
Often used together with .SD to restrict .SD to certain columns
> DT[,.SD,by=v1,.SDcols=c("species","sepal.length")]
v1 species sepal.length
1: A setosa 5.1
2: A setosa 4.6
3: A setosa 4.8
4: A setosa 5.7
5: A setosa 4.8
---
146: C virginica 7.2
147: C virginica 7.9
148: C virginica 6.4
149: C virginica 6.8
150: C virginica 5.9
> DT[,.(species,sepal.length),by=v1] #equivalent to the call above
v1 species sepal.length
1: A setosa 5.1
2: A setosa 4.6
3: A setosa 4.8
4: A setosa 5.7
5: A setosa 4.8
---
146: C virginica 7.2
147: C virginica 7.9
148: C virginica 6.4
149: C virginica 6.8
150: C virginica 5.9
#.SDcols can also be the return value of a function call:
> DT[,lapply(.SD,sum),by=v1,.SDcols=paste0("v",2)]
v1 v2
1: A 75
2: B 75
3: C 75
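In the same spirit as paste0("v",2), the column names passed to .SDcols can be computed from a pattern. The sketch below assumes a made-up table whose value columns share the prefix val.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(
  g    = c("a", "a", "b"),
  val1 = c(1, 2, 3),
  val2 = c(10, 20, 30),
  note = c("p", "q", "r")
)

# Pick the columns whose names start with "val"
cols <- grep("^val", names(dt), value = TRUE)

# Only those columns end up in .SD, so the character column is left out
means <- dt[, lapply(.SD, mean), by = g, .SDcols = cols]
```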
- Chaining, which feels a little like the pipe operator (%>%)
Without chaining
> DT2 <- copy(DT)
> DT2 <- DT2[,.(SUM=sum(sepal.length)),by=v1]
> DT2[SUM>291.5]
v1 SUM
1: A 292.1
2: C 293.3
> ##with chaining
> DT2 <- copy(DT)
> DT2[,.(SUM=sum(sepal.length)),by=v1][SUM>291.5] #after grouping, this resembles HAVING in SQL
v1 SUM
1: A 292.1
2: C 293.3
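A chain can have more than two links. The following sketch, on made-up data, adds a sorting step after the HAVING-style filter.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(
  g = c("a", "b", "c", "a", "b", "c"),
  x = c(1, 2, 3, 4, 5, 6)
)

# aggregate per group, keep the large sums, then sort in descending order
res <- dt[, .(SUM = sum(x)), by = g][SUM > 5][order(-SUM)]
```

Each pair of brackets operates on the result of the previous one, so no intermediate variable is needed.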
7. melt and dcast in data.table
Their usage is much the same as in the reshape2 package; see the article "Unpivoting and pivoting data with the reshape2 package".
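As a minimal sketch of the two functions, the code below reshapes a small made-up table from wide to long with melt and back again with dcast. The table and the column names id, height, weight, measure and value are illustrative.

```r
library(data.table)

# Toy wide table (hypothetical data): one row per id, one column per measure
wide <- data.table(id = 1:2, height = c(1.5, 1.7), weight = c(50, 60))

# Wide to long: one row per (id, measure) pair
long <- melt(wide, id.vars = "id",
             variable.name = "measure", value.name = "value")

# Long to wide again: one column per level of measure
back <- dcast(long, id ~ measure, value.var = "value")
```

In the dcast formula, the left-hand side names the columns that identify a row and the right-hand side names the column whose values become the new column names.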