As its name suggests, the readr package provides functions for reading data into R. Here we load the tidyverse, which includes readr:
library(tidyverse)
The main functions include read_csv(), read_tsv(), read_delim() and read_fwf(). Let's start with read_csv():
heights <- read_csv("data/heights.csv")
#> Parsed with column specification:
#> cols(
#> earn = col_double(),
#> height = col_double(),
#> sex = col_character(),
#> ed = col_integer(),
#> age = col_integer(),
#> race = col_character()
#> )
read_csv("a,b,c
1,2,3
4,5,6")
#> # A tibble: 2 x 3
#> a b c
#> <int> <int> <int>
#> 1 1 2 3
#> 2 4 5 6
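Unlike read.csv(), read_csv() returns a tibble by default. This can cause trouble for older code that was written for plain data frames. In that case you can read the file with read.csv() to get a data.frame and convert it with as_tibble() whenever a tibble is needed.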
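The conversion can go in both directions. The following is a minimal sketch, assuming the tibble package is installed (it ships with the tidyverse); the inline CSV text and the object names are made up for illustration.

```r
library(tibble)

# Read with base R: the result is a plain data.frame
df <- read.csv(text = "a,b,c\n1,2,3\n4,5,6")

# Convert to a tibble when tidyverse-style behaviour is wanted
tb <- as_tibble(df)

# Convert back for older code that expects a plain data.frame
df_again <- as.data.frame(tb)

class(tb)
class(df_again)
```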
Some special cases:
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
#> # A tibble: 1 x 3
#> x y z
#> <int> <int> <int>
#> 1 1 2 3
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
#> # A tibble: 1 x 3
#> x y z
#> <int> <int> <int>
#> 1 1 2 3
read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#> X1 X2 X3
#> <int> <int> <int>
#> 1 1 2 3
#> 2 4 5 6
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 x 3
#> x y z
#> <int> <int> <int>
#> 1 1 2 3
#> 2 4 5 6
read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#> a b c
#> <int> <int> <chr>
#> 1 1 2 <NA>
The options above cover roughly 75% of the CSV files you will meet in practice. For tab-separated or fixed-width files, use read_tsv() and read_fwf().
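As a rough illustration of the other readers, the sketch below parses the same small table from tab-separated and pipe-separated text. The inline data and the object names tsv and piped are invented for this example. Passing col_types = cols() is used here only to silence the column specification message.

```r
library(readr)

# Tab-separated values: read_tsv() behaves like read_csv() with "\t" as delimiter
tsv <- read_tsv("a\tb\tc\n1\t2\t3\n4\t5\t6", col_types = cols())

# Any other single-character delimiter: read_delim() with an explicit `delim`
piped <- read_delim("a|b|c\n1|2|3\n4|5|6", delim = "|", col_types = cols())

tsv
piped
```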
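When reading a file, readr guesses the type of each column. The guessing is exposed through guess_parser(), which returns readr's best guess, and parse_guess(), which uses that guess to parse the values: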
guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "integer"
guess_parser(c("12,352,561"))
#> [1] "number"
str(parse_guess("2010-10-10"))
#> Date[1:1], format: "2010-10-10"
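When the guess is not what you want, the type-specific parsers can be called directly. The calls below are a small sketch using readr's parse_*() family; the input strings are invented for illustration.

```r
library(readr)

ints  <- parse_integer(c("1", "5", "9"))    # integer vector
dbl   <- parse_double("1.23")               # double
num   <- parse_number("$1,234")             # ignores "$" and the "," grouping mark
day   <- parse_date("2010-10-01")           # Date
flags <- parse_logical(c("TRUE", "FALSE"))  # logical

str(list(ints, dbl, num, day, flags))
```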
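However, because the guess is based only on the first 1000 rows, it can fail in two ways: the first 1000 rows may be a special case (for example, integers followed by doubles), or a column may start with so many missing values that readr falls back to character.
We test this with readr_example("challenge.csv"). The dataset has two columns, x and y: the first 1000 rows of x are integers and the rest are doubles, while the first 1000 rows of y are NA and the rest are dates: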
challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#> x = col_integer(),
#> y = col_character()
#> )
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 1000 parsing failures.
#> row col expected actual file
#> 1001 x no trailing cha… .2383797508… '/home/travis/R/Library/readr…
#> 1002 x no trailing cha… .4116799717… '/home/travis/R/Library/readr…
#> 1003 x no trailing cha… .7460716762… '/home/travis/R/Library/readr…
#> 1004 x no trailing cha… .7234505538… '/home/travis/R/Library/readr…
#> 1005 x no trailing cha… .6145241374… '/home/travis/R/Library/readr…
#> ...
#> See problems(...) for more details.
Use problems() to retrieve the parsing failures:
problems(challenge)
#> # A tibble: 1,000 x 5
#> row col expected actual file
#> <int> <chr> <chr> <chr> <chr>
#> 1 1001 x no trailing cha… .2383797508… '/home/travis/R/Library/readr…
#> 2 1002 x no trailing cha… .4116799717… '/home/travis/R/Library/readr…
#> 3 1003 x no trailing cha… .7460716762… '/home/travis/R/Library/readr…
#> 4 1004 x no trailing cha… .7234505538… '/home/travis/R/Library/readr…
#> 5 1005 x no trailing cha… .6145241374… '/home/travis/R/Library/readr…
#> 6 1006 x no trailing cha… .4739805692… '/home/travis/R/Library/readr…
#> # ... with 994 more rows
The best approach is to fix the column types one at a time. We start from the specification that readr guessed by default:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_integer(),
y = col_character()
)
)
Then change the type of x to double:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_character()
)
)
tail(challenge)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <chr>
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06
This fixes the first problem. Next, fix the y column by parsing it as a date:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <date>
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06
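Once the correct types are known, the same specification can be written more compactly. The sketch below uses readr's compact string form for col_types, in which "d" stands for double and "D" for date, and then checks that no parsing failures remain. The name challenge_ok is chosen only for this example.

```r
library(readr)

# "dD": first column double, second column date
challenge_ok <- read_csv(readr_example("challenge.csv"), col_types = "dD")

nrow(problems(challenge_ok))  # number of parsing failures
class(challenge_ok$x)
class(challenge_ok$y)
```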
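By default read_csv() guesses column types from the first 1000 rows. The guess_max argument lets us raise that to 1001, so that one more row is inspected: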
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> )
challenge2
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 NA
#> 2 4172 NA
#> 3 3004 NA
#> 4 787 NA
#> 5 37 NA
#> 6 2332 NA
#> # ... with 1,994 more rows
Sometimes it is easier to read every column as character:
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
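This is particularly useful together with type_convert(), which applies the parsing heuristics to the character columns of a data frame: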
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
#> # A tibble: 3 x 2
#> x y
#> <chr> <chr>
#> 1 1 1.21
#> 2 2 2.32
#> 3 3 4.56
# Note the column types
type_convert(df)
#> Parsed with column specification:
#> cols(
#> x = col_integer(),
#> y = col_double()
#> )
#> # A tibble: 3 x 2
#> x y
#> <int> <dbl>
#> 1 1 1.21
#> 2 2 2.32
#> 3 3 4.56
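The types that type_convert() guesses may differ between readr versions, so it can be safer to state them explicitly. The following is a minimal sketch that passes a col_types specification to type_convert(); the name converted is chosen only for this example.

```r
library(readr)
library(tibble)

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)

# State the target types instead of relying on the guess
converted <- type_convert(
  df,
  col_types = cols(x = col_integer(), y = col_double())
)

sapply(converted, class)
```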
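write_csv() and write_tsv() are the main functions for writing files. They always encode strings in UTF-8 and save dates and date-times in ISO 8601 format. To export a CSV file for Excel, use write_excel_csv(), which tells Excel that the file is UTF-8 encoded.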
write_csv(challenge, "challenge.csv")
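write_excel_csv() signals the encoding by writing a UTF-8 byte order mark at the start of the file. The sketch below checks for that mark; the data frame and the temporary path are hypothetical and exist only for this example.

```r
library(readr)

# Small data frame with non-ASCII text (written with escapes to stay portable)
df <- data.frame(word = c("caf\u00e9", "na\u00efve"), n = c(1, 2))

path <- tempfile(fileext = ".csv")
write_excel_csv(df, path)

# The first three bytes should be the UTF-8 byte order mark: ef bb bf
first_bytes <- readBin(path, what = "raw", n = 3)
first_bytes
```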
Note that the column type information is lost when the data is saved to CSV:
challenge
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 NA
#> 2 4172 NA
#> 3 3004 NA
#> 4 787 NA
#> 5 37 NA
#> 6 2332 NA
#> # ... with 1,994 more rows
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
#> Parsed with column specification:
#> cols(
#> x = col_integer(),
#> y = col_character()
#> )
#> # A tibble: 2,000 x 2
#> x y
#> <int> <chr>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
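For caching intermediate results, write_rds() and read_rds() are a better choice. They store the data in RDS, R's own binary format, and are thin wrappers around the base functions saveRDS() and readRDS():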
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 NA
#> 2 4172 NA
#> 3 3004 NA
#> 4 787 NA
#> 5 37 NA
#> 6 2332 NA
#> # ... with 1,994 more rows
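The difference between the two formats shows up clearly with a factor column. In the sketch below, the data frame and the temporary file paths are hypothetical and serve only to illustrate the round trip.

```r
library(readr)

df <- data.frame(
  group = factor(c("a", "b", "a")),
  value = c(1.5, 2.5, 3.5)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

# col_types = cols() keeps the guessing but silences the message
from_csv <- read_csv(csv_path, col_types = cols())
from_rds <- read_rds(rds_path)

class(from_csv$group)  # "character": the factor information was lost
class(from_rds$group)  # "factor": RDS keeps the R type
```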
The feather package is another option. Its binary file format is faster to read and write:
library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
The full code for this post has been uploaded to GitHub.