Smart data file reading: the R package vroom

While tinkering with Shiny recently, I came across a very handy package for reading data. These are my notes for future reference.

1. Automatic delimiter detection

vroom detects the delimiter automatically, so whether the file is csv or tsv, the same call, vroom("xxx.csv"), is all you need.

library(vroom)

data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

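This prints a long message describing the type of each column. If you don't need it, you can turn it off by passing the column specification back in: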

s <- spec(data)

data <- vroom("flights.tsv", col_types = s)
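
A minimal sketch of the two points above, assuming vroom 1.5.0 or later (for the show_col_types argument). The toy data frame and temporary files are made up for illustration. The same table is written once comma-separated and once tab-separated, then read back with an identical call.

```r
library(vroom)

# Toy data, invented for this example
toy <- data.frame(id = 1:3, name = c("a", "b", "c"))

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
vroom_write(toy, csv_path, delim = ",")
vroom_write(toy, tsv_path, delim = "\t")

# Same call for both files: the delimiter is guessed from the content.
# show_col_types = FALSE suppresses the column-type message.
from_csv <- vroom(csv_path, show_col_types = FALSE)
from_tsv <- vroom(tsv_path, show_col_types = FALSE)
```

If guessing ever picks the wrong separator, passing delim explicitly to vroom() overrides it.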

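2. Reading multiple files at once

Reading files in batch is one of vroom's highlights. Pass a vector of file paths and the files are combined into a single table.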

files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
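
When combining files it is often useful to know which file each row came from. The sketch below uses vroom's id argument for that; the two small files and the column name "source" are invented for this example, and show_col_types assumes vroom 1.5.0 or later.

```r
library(vroom)

# Two toy files with the same columns, invented for this example
dir <- tempfile("flights_")
dir.create(dir)
vroom_write(data.frame(carrier = "AA", flight = c(1, 2)),
            file.path(dir, "flights_AA.tsv"))
vroom_write(data.frame(carrier = "UA", flight = 3),
            file.path(dir, "flights_UA.tsv"))

files <- list.files(dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# id = "source" adds a column holding the path each row was read from
combined <- vroom(files, id = "source", show_col_types = FALSE)
```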

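3. Reading and writing compressed files

vroom_write() can write compressed files directly; the compression is chosen from the file extension.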

vroom_write(flights, "flights.tsv.gz")

# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M  7.87M

# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
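
The same round trip can be tried without the flights data. This is a small sketch that uses the mtcars.csv sample shipped with vroom (located with vroom_example()); the temporary file names are arbitrary, and show_col_types assumes vroom 1.5.0 or later.

```r
library(vroom)

# vroom_example() returns the path to a sample file shipped with the package
cars <- vroom(vroom_example("mtcars.csv"), show_col_types = FALSE)

plain_path <- tempfile(fileext = ".tsv")
gz_path <- tempfile(fileext = ".tsv.gz")
vroom_write(cars, plain_path)
vroom_write(cars, gz_path)  # compression is chosen from the .gz extension

# Compare sizes on disk, then read the compressed file straight back
file.size(c(plain_path, gz_path))
cars_back <- vroom(gz_path, show_col_types = FALSE)
```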

4. Reading files from the web

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

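5. Reading from and writing to pipe connections

This one feels a bit like magic; for this kind of job it can replace Perl entirely.

Extract the rows for United Airlines (lines containing UA as a whole word). Because grep drops the header line, the column names are passed in through col_names: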

# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

You can also hand compression over to an external tool such as pigz when writing a compressed file:

bench::workout({
  vroom_write(flights, "flights.tsv.gz")
  vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#>   exprs                                                process     real
#>   <bch:expr>                                          <bch:tm> <bch:tm>
#> 1 vroom_write(flights, "flights.tsv.gz")                  3.5s    2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz"))    1.54s 975.09ms

6. Selecting columns

Keep only the specified columns:

data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Exclude the specified columns:

data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Rename a column while selecting:

data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
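
Selecting, ranges and renaming can be combined in one col_select expression. A small sketch on the mtcars.csv sample that ships with vroom; the new name car is made up for this example, and show_col_types assumes vroom 1.5.0 or later.

```r
library(vroom)

path <- vroom_example("mtcars.csv")

# Keep `model` (renamed to `car`) plus the range mpg through disp
picked <- vroom(
  path,
  col_select = c(car = model, mpg:disp),
  show_col_types = FALSE
)
names(picked)
```

Columns left out by col_select are not parsed, so narrowing the selection also saves work on wide files.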

7. Changing column types

In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, and then you can specify the type by hand. You could also fix the type afterwards with dplyr, but that is a little more work.

Column type reference. The character in quotes is the abbreviation you actually type.

col_logical() 'l': containing only T, F, TRUE, FALSE, 1 or 0.
col_integer() 'i': integer values.
col_double() 'd': floating point values.
col_number() 'n': numbers containing the grouping_mark.
col_date(format = "") 'D': with the locale's date_format.
col_time(format = "") 't': with the locale's time_format.
col_datetime(format = "") 'T': ISO8601 date times.
col_factor(levels, ordered) 'f': a fixed set of values.
col_character() 'c': everything else.
col_skip() '_' or '-': don't import this column.
col_guess() '?': parse using the "best" type based on the input.

Usage examples:

# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))

# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))

# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))

# the same specification, written with the full col_*() functions
data <- vroom("flights.tsv",
  col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
)
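
To confirm that a specification took effect, it helps to inspect the result. A minimal sketch on a toy file; the data and the note column are invented for this example.

```r
library(vroom)

# Toy file, invented for this example
path <- tempfile(fileext = ".tsv")
vroom_write(
  data.frame(year = c(2013, 2013), carrier = c("UA", "AA"), note = c("x", "y")),
  path
)

typed <- vroom(
  path,
  col_types = list(year = col_integer(), carrier = col_factor(), note = col_skip())
)

# Check what was actually parsed
class(typed$year)
class(typed$carrier)
names(typed)
```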

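8. Read speed

In a word: fast! It is a great fit for machine learning work, where datasets routinely run to several gigabytes.

The figure below compares the time each package takes to read and write 1.55 GB of data.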
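
For a rough check on your own machine, something like the following sketch can be used. The synthetic data and its size are arbitrary choices for this example, base R's read.csv() is simply one convenient baseline, and show_col_types assumes vroom 1.5.0 or later. The timings vary by machine and file, so no results are shown here.

```r
library(vroom)

# Synthetic data, invented for this example; the size is arbitrary
n <- 2e5
big <- data.frame(
  id = seq_len(n),
  x = runif(n),
  g = sample(letters, n, replace = TRUE)
)
path <- tempfile(fileext = ".csv")
vroom_write(big, path, delim = ",")

t_base <- system.time(base_df <- read.csv(path))
t_vroom <- system.time(vroom_df <- vroom(path, show_col_types = FALSE))

# Elapsed seconds for each reader
c(read.csv = t_base[["elapsed"]], vroom = t_vroom[["elapsed"]])
```

Note that vroom indexes the file up front and materialises character columns lazily, so a fair comparison should also time the first real use of the data, not only the read call.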