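Today I saw someone ask how to make readr::read_csv() read every character variable in a CSV file as a factor. HY suggested dplyr's mutate_if(), which converts column types very quickly. I later searched around and found that fread() in the data.table package can do the conversion directly while reading a CSV, by setting stringsAsFactors = T. So I compared the speed of readr::read_csv() + dplyr::mutate_if() against data.table::fread(), using base R's read.csv() as the baseline.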
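For orientation, the three approaches being compared look roughly like this on a tiny temporary file. This is only a minimal sketch, not the benchmark itself: the file contents and the names tmp, x_base, x_readr and x_dt are made up for illustration.

```r
library(dplyr)
library(data.table)
library(readr)

# A tiny illustrative CSV in a temporary file (not the benchmark data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(v1 = c("A1", "A2", "A1"), v2 = c("A3", "A3", "A2")),
          tmp, row.names = FALSE)

# 1. base R: convert strings to factors while reading
x_base <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)

# 2. readr + dplyr: read as character, then convert in a second step
x_readr <- read_csv(tmp, col_names = TRUE) %>% mutate_if(is.character, factor)

# 3. data.table: convert strings to factors while reading
x_dt <- fread(input = tmp, stringsAsFactors = TRUE)

# Check whether every column of each result is a factor
sapply(list(x_base, x_readr, x_dt), function(d) all(sapply(d, is.factor)))
```

In recent dplyr versions, mutate(across(where(is.character), factor)) is the suggested replacement for mutate_if(); the benchmark below keeps mutate_if() as in the original question.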
Dataset 1: 10 columns, 10 levels per column, 100,000 rows
library(dplyr)
library(data.table)
library(readr)
# test1: 10 columns with 10 levels for each column, 100,000 rows
v1<-as.factor(paste('A',c(1:10), sep=''))
df<-data.frame(matrix(nrow=100000))
for(i in 1:10){
df[,i]<-sample(v1, 100000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', row.names = F)
system.time(x1<-read.csv('/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', header=T, stringsAsFactors = T))
# user system elapsed
# 1.080 0.054 1.326
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', col_names = T))
# user system elapsed
# 0.153 0.021 0.261
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 0.089 0.016 0.157
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors.csv', stringsAsFactors = T))
# user system elapsed
# 0.111 0.012 0.255
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.001 0.000 0.002
Because fread() returns a data.table object, one extra step is needed to convert it to a data.frame.
Looking at elapsed time only: fread + as.data.frame is slightly faster.
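If the separate conversion step is unwanted, fread() can also return a data.frame directly through its data.table argument, and data.table::setDF() converts an existing data.table by reference. A small sketch on a made-up temporary file (the names tmp, x3a and x3b are illustrative and not part of the benchmark):

```r
library(data.table)

# Illustrative file only
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(v1 = c("A1", "A2"), v2 = c("A2", "A1")),
          tmp, row.names = FALSE)

# Option 1: ask fread() for a plain data.frame
x3a <- fread(input = tmp, stringsAsFactors = TRUE, data.table = FALSE)
class(x3a)

# Option 2: read as a data.table, then convert it in place (no copy)
x3b <- fread(input = tmp, stringsAsFactors = TRUE)
setDF(x3b)
class(x3b)
```

Neither option was timed here, so whether it changes the totals in the tables is an open question.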
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read.csv | 1.326 | | 1.326 |
read_csv+mutate_if() | 0.261 | 0.157 | 0.418 |
fread+as.data.frame | 0.255 | 0.002 | 0.257 |
Dataset 2: 100 columns, 10 levels per column, 100,000 rows
v1<-as.factor(paste('A',c(1:10), sep=''))
df<-data.frame(matrix(nrow=100000))
for(i in 1:100){
df[,i]<-sample(v1, 100000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', row.names = F)
system.time(x1<-read.csv('/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', header=T, stringsAsFactors = T))
# user system elapsed
# 12.406 1.200 19.187
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', col_names = T))
# user system elapsed
# 1.816 0.309 2.909
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 0.833 0.222 1.163
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors2.csv', stringsAsFactors = T))
# user system elapsed
# 1.117 0.275 2.277
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.025 0.088 0.115
Looking at elapsed time only: fread() starts to pull ahead.
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read.csv | 19.187 | | 19.187 |
read_csv+mutate_if() | 2.909 | 1.163 | 4.072 |
fread+as.data.frame | 2.277 | 0.115 | 2.392 |
Dataset 3: 100 columns, 100 levels per column, 1,000,000 rows
read.csv() is left out here, since it would cook my computer.
v1<-as.factor(paste('A',c(1:100), sep=''))
df<-data.frame(matrix(nrow=1000000))
for(i in 1:100){
df[,i]<-sample(v1, 1000000, replace = T)
names(df)[i]<-paste('v', i, sep='')
}
write.csv(df, '/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', row.names = F)
system.time(x2<-read_csv('/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', col_names = T))
# user system elapsed
# 22.708 13.303 55.010
system.time(x2<-x2 %>% mutate_if(is.character, factor))
# user system elapsed
# 6.074 2.329 9.411
system.time(x3<-fread(input='/Users/xiatt/Desktop/compare_read_csv_with_factors3.csv', stringsAsFactors = T))
# user system elapsed
# 15.236 6.787 38.246
system.time(x3<-as.data.frame(x3))
# user system elapsed
# 0.238 0.809 1.072
Looking at elapsed time only: the gap is quite clear here, and fread() is faster.
Method | Step 1 (s) | Step 2 (s) | Total (s) |
---|---|---|---|
read_csv+mutate_if() | 55.010 | 9.411 | 64.421 |
fread+as.data.frame | 38.246 | 1.072 | 39.318 |
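To put the three tables side by side, the totals can be turned into ratios. The sketch below only redoes the arithmetic on the elapsed times reported above; the data frame and column names are made up for illustration. Each number comes from a single system.time() run, so the ratios are rough.

```r
# Elapsed times in seconds, copied from the three tables (step 1 + step 2)
timings <- data.frame(
  dataset     = c("Dataset 1", "Dataset 2", "Dataset 3"),
  readr_dplyr = c(0.261 + 0.157, 2.909 + 1.163, 55.010 + 9.411),
  fread_asdf  = c(0.255 + 0.002, 2.277 + 0.115, 38.246 + 1.072)
)

# How many times slower read_csv + mutate_if() is than fread + as.data.frame
timings$ratio <- round(timings$readr_dplyr / timings$fread_asdf, 2)
timings
```

On these runs the ratio comes out at roughly 1.6 to 1.7 for all three datasets.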
Other comparisons
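- With the same number of rows and columns, increasing the number of levels per column to 100 does not affect the read time.
- Setting or omitting stringsAsFactors = T in fread() does not affect the read time either.
- Converting each column's type inside data.table is not much faster than mutate_if().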
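The post does not show how the conversion inside data.table was done. One common idiom, given here as an assumption rather than the code that was actually timed, is to update the character columns by reference with := and .SDcols:

```r
library(data.table)

# Illustrative data only; the column names are made up
dt <- data.table(v1 = c("A1", "A2", "A1"),
                 v2 = c("A3", "A3", "A2"),
                 n  = 1:3)

# Pick the character columns, then convert them to factor by reference
chr_cols <- names(dt)[sapply(dt, is.character)]
dt[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]

sapply(dt, class)
```

Non-character columns such as n are left untouched, which mirrors what mutate_if(is.character, factor) does on a data.frame or tibble.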
Conclusion
So the conclusion is that fread() from the data.table package is a bit faster.
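Further reading
- The readr author's own comparison of readr and data.table::fread(), which is refreshingly candid:
  Compared to fread, readr functions:
  Are slower (currently ~1.2-2x slower. If you want absolutely the best performance, use data.table::fread().
- A comparison of processing speed between data.table and pandas: grouping. The conclusion there is that data.table is slightly faster.
- ParaText, a Python library recommended by HY, takes on all the other readers and looks very impressive: link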