[R <- Data Wrangling] Data Import

library(tidyverse)

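Getting started

Most of readr's functions are concerned with turning flat text files into data frames:

read_csv() reads comma-delimited files, read_csv2() reads semicolon-separated files (common in countries where , is used as the decimal mark), and read_delim() reads files with any delimiter.
read_fwf() reads fixed-width files. You can specify fields either by their widths with fwf_widths() or by their positions with fwf_positions(). read_table() reads a common variation of fixed-width files where columns are separated by white space.
read_log() reads Apache-style log files. (Also check out webreadr, which is built on top of read_log() and provides many more tools for reading web files.)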
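
As a minimal sketch of the delimited-file readers listed above, the following reads the same two values from a semicolon-separated string and from a pipe-separated string. The inline data and the column names a and b are made up for illustration; col_types = cols() is only there to silence the column specification message.

```r
library(readr)

# Semicolon-separated values with "," as the decimal mark
semi <- read_csv2("a;b\n1,5;2\n", col_types = cols())

# Any other delimiter goes through read_delim(); here a pipe
piped <- read_delim("a|b\n1.5|2\n", delim = "|", col_types = cols())

semi$a
piped$a
```

Both calls should yield the same number in column a, which is the point of read_csv2(): the decimal comma is handled by the reader rather than by post-processing.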

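These functions all have similar syntax: once you have mastered one, the others pose no problem. From here on we focus on read_csv().

The first argument to read_csv() is the most important: it is the path to the file to read.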

heights <- read_csv("data/heights.csv")
#> Parsed with column specification:
#> cols(
#>   earn = col_double(),
#>   height = col_double(),
#>   sex = col_character(),
#>   ed = col_integer(),
#>   age = col_integer(),
#>   race = col_character()
#> )

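When you run read_csv(), it prints a column specification that gives the name and type of each column.

You can also supply an inline CSV file. This is useful for creating reproducible examples to share with others.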

read_csv("a,b,c
1,2,3
4,5,6")
#> # A tibble: 2 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

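In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to change this behaviour:

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines, or use comment = "#" to drop all lines that start with #.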

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3

read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")
#> # A tibble: 1 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3

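The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings; it will then label the columns sequentially from X1 to Xn: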

read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#>      X1    X2    X3
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

Alternatively, you can pass col_names a character vector to use as the column names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
#> # A tibble: 2 x 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     4     5     6

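Another option that commonly needs tweaking is na: it specifies the value (or values) used to represent missing values in your file: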

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#>       a     b c    
#>   <int> <int> <chr>
#> 1     1     2 <NA>
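
Because na accepts a character vector, several missing-value codes can be declared at once. The sketch below uses made-up data in which both "." and "-99" stand for a missing value; the codes and column names are assumptions for illustration.

```r
library(readr)

# Hypothetical data where both "." and "-99" mark a missing value
df <- read_csv("x,y\n1,.\n-99,3\n", na = c(".", "-99"), col_types = cols())

is.na(df$x)  # FALSE TRUE
is.na(df$y)  # TRUE FALSE
```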

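This is enough to read roughly 75% of the CSV files you will encounter in practice.

Compared to base R

Why use the readr functions instead of their base R equivalents?

They are typically much faster (around 10x). If you need even more speed, try data.table::fread().
They produce tibbles, they do not convert character vectors to factors, they do not use row names, and so on.
They are more reproducible.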
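
A small side-by-side comparison makes the second point concrete. This is a sketch with made-up data, and it shows only two of the differences: the class of the result and what happens to a header that is not a syntactic R name.

```r
library(readr)

csv_text <- "first name,score\nAda,1\nLin,2\n"

base_df  <- read.csv(text = csv_text)
readr_df <- read_csv(csv_text, col_types = cols())

names(base_df)   # base R rewrites "first name" to a syntactic name
names(readr_df)  # readr keeps the header as written
class(readr_df)  # includes "tbl_df"
```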

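Parsing a vector

Let's now look at the parse_*() functions. These functions take a character vector and return a more specialised vector such as a logical, integer, or date: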

str(parse_logical(c("TRUE", "FALSE", "NA")))
#>  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
#>  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
#>  Date[1:2], format: "2010-01-01" "1979-10-14"

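These functions are useful in their own right, and they are also an important building block for readr. Once you have mastered them, we can look at how to parse a complete file.

Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is the character vector to parse, and the na argument specifies which strings should be treated as missing: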

parse_integer(c("1", "231", ".", "456"), na = ".")
#> [1]   1 231  NA 456

If parsing fails, you get a warning:

x <- parse_integer(c("123", "345", "abc", "123.45"))
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 2 parsing failures.
#> row col expected               actual
#>   3  NA an integer             abc
#>   4  NA no trailing characters .45

The failures are replaced with missing values in the output:

x
#> [1] 123 345  NA  NA
#> attr(,"problems")
#> # A tibble: 2 x 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr> 
#> 1     3    NA an integer             abc   
#> 2     4    NA no trailing characters .45

If there are many parsing failures, use problems() to get the complete set. It returns a tibble.

problems(x)
#> # A tibble: 2 x 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr> 
#> 1     3    NA an integer             abc   
#> 2     4    NA no trailing characters .45
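
Since problems() returns an ordinary tibble, its row column can be used to index back into the input. The following is a minimal sketch of that idea; the names raw and bad_rows are made up, and suppressWarnings() is used only to keep the example quiet.

```r
library(readr)

raw <- c("123", "345", "abc", "123.45")
x <- suppressWarnings(parse_integer(raw))

# Look up the original strings that failed to parse
bad_rows <- problems(x)$row
raw[bad_rows]
```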

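Using parsers is mostly a matter of understanding what is available and how each one deals with different types of input. There are eight particularly important parsers:

parse_logical() and parse_integer() parse logicals and integers respectively.
parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser.
parse_character() looks so simple that it seems unnecessary (the input is already a character vector), but one complication makes it quite important: character encodings.
parse_factor() creates factors.
parse_datetime(), parse_date(), and parse_time() let you parse various date and time specifications.

The following sections describe some of these parsers in more detail.

Numbers

Three problems make parsing numbers tricky:

People write numbers differently in different parts of the world. For example, some countries use . as the decimal mark and others use ,.
Numbers are often surrounded by other characters that provide context, such as “$1000” or “10%”.
Numbers often contain grouping characters to make them easier to read, such as “1,000,000”.

To address the first problem, readr has the notion of a "locale": an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character used for the decimal mark. You can override the default by creating a new locale: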

parse_double("1.23")
#> [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

readr's default locale is US-centric.

parse_number() addresses the second problem: it ignores non-numeric characters before and after the number.

parse_number("$100")
#> [1] 100
parse_number("20%")
#> [1] 20
parse_number("It cost $123.45")
#> [1] 123

The final problem is addressed by combining parse_number() with a locale, since parse_number() ignores the grouping mark:

# Used in America
parse_number("$123,456,789")
#> [1] 1.23e+08

# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
#> [1] 1.23e+08

# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
#> [1] 1.23e+08
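
The decimal mark and the grouping mark can also be set together in one locale. The sketch below parses a made-up European-style amount; the object name euro and the input string are assumptions for illustration.

```r
library(readr)

# "1.234.567,89" uses "." for grouping and "," for decimals
euro <- locale(decimal_mark = ",", grouping_mark = ".")
amount <- parse_number("1.234.567,89 EUR", locale = euro)
amount
```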

Strings

With charToRaw() we can get at the underlying representation of an R string, shown as hexadecimal bytes:

charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79

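(Each hexadecimal number is one byte, and the mapping from bytes to characters is called the encoding. Readers can look up the details online; the most widely used encoding today is UTF-8.)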
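
To see that the same character can map to different bytes, the sketch below encodes one string in UTF-8 and in Latin1 using base R only. The string is an assumption chosen for illustration, and the iconv() conversion relies on Latin1 being available on the platform.

```r
# n-with-tilde written as a Unicode escape so the source stays ASCII
s <- enc2utf8("Ni\u00f1o")

utf8_bytes   <- charToRaw(s)
latin1_bytes <- charToRaw(iconv(s, from = "UTF-8", to = "latin1"))

utf8_bytes    # the accented letter takes two bytes (c3 b1) in UTF-8
latin1_bytes  # the accented letter takes one byte (f1) in Latin1
```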

For example, here are two strings that need to be parsed from different encodings:

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

x1
#> [1] "El Ni\xf1o was particularly bad this year"
x2
#> [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"

To fix the problem, specify the encoding in parse_character():

parse_character(x1, locale = locale(encoding = "Latin1"))
#> [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
#> [1] "こんにちは"

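How do you find the correct encoding? If you are lucky, it will be included somewhere in the data documentation. Unfortunately that is rarely the case, so readr provides guess_encoding() to help you figure it out.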

guess_encoding(charToRaw(x1))
#> # A tibble: 2 x 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 ISO-8859-1       0.46
#> 2 ISO-8859-9       0.23
guess_encoding(charToRaw(x2))
#> # A tibble: 1 x 2
#>   encoding confidence
#>   <chr>         <dbl>
#> 1 KOI8-R         0.42

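The first argument to guess_encoding() can be either a path to a file or a raw vector.

For more on encodings, see http://kunststube.net/encoding/.

Factors

R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to get a warning whenever an unexpected value is present: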

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
#> Warning: 1 parsing failure.
#> row col expected           actual
#>   3  NA value in level set bananana
#> [1] apple  banana <NA>
#> attr(,"problems")
#> # A tibble: 1 x 4
#>     row   col expected           actual  
#>   <int> <int> <chr>              <chr>   
#> 1     3    NA value in level set bananana
#> Levels: apple banana

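If your data has many problematic entries, it is often easier to leave them as character vectors and clean them up with basic string operations first.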
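
As a minimal sketch of that clean-up step, the following normalises case and surrounding whitespace with base R before calling parse_factor(). The messy input values are made up for illustration.

```r
library(readr)

fruit <- c("apple", "banana")
raw   <- c("Apple", " banana", "apple ")

# Normalise case and whitespace with base R before parsing
cleaned <- tolower(trimws(raw))
f <- parse_factor(cleaned, levels = fruit)
f
```

After cleaning, every value matches a known level, so no parsing failure is reported.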

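Dates, date-times, and times

Pick the parser that matches what you need: a date, a date-time, or a time.

parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.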

parse_datetime("2010-10-01T2010")
#> [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
#> [1] "2010-10-10 UTC"

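This is the most important date/time standard. If you work with dates and times frequently, I recommend reading https://en.wikipedia.org/wiki/ISO_8601

parse_date() expects a four-digit year, a - or /, the month, a - or /, then the day: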
```
parse_date("2010-10-01")
#> [1] "2010-10-01"
```
parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:
```
library(hms)
parse_time("01:10 am")
#> 01:10:00
parse_time("20:10:01")
#> 20:10:01
```

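Base R does not have a great built-in class for time data, so we use the one provided by the hms package.

If these defaults do not work for your data, you can supply your own date-time format, built up from the following pieces: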

Year

%Y (4 digits).
%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

Month

%m (2 digits).
%b (abbreviated name, like “Jan”).
%B (full name, “January”).

Day

%d (2 digits).
%e (optional leading space).

Time

%H 0-23 hour.
%I 0-12, must be used with %p.
%p AM/PM indicator.
%M minutes.
%S integer seconds.
%OS real seconds.
%Z Time zone (as name, e.g. America/Chicago). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time!
%z (as offset from UTC, e.g. +0800).

Non-digits

%. skips one non-digit character.
%* skips any number of non-digits.

The best way to figure out the correct format is to try a few small examples and test them:

parse_date("01/02/15", "%m/%d/%y")
#> [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
#> [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
#> [1] "2001-02-15"

If your data uses non-English month names (with %b or %B), you need to specify a locale:

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"
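
The same format pieces work with parse_datetime() and parse_time(). Below is a short sketch that combines several of them; the input strings are made up for illustration.

```r
library(readr)

# Abbreviated month name between year and day
d <- parse_date("2015-Mar-07", "%Y-%b-%d")

# Day-first date followed by a 24-hour time
dt <- parse_datetime("07/03/2015 17:05", "%d/%m/%Y %H:%M")

# Time written without a separator
tm <- parse_time("1705", "%H%M")

d
dt
tm
```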

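Parsing a file

Now that you have learned how to parse an individual vector, it is time to see how readr parses a file. This section covers two things:

How readr automatically guesses the type of each column.
How to override the default specification.

Strategy

readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (fairly conservative) rules to them. You can emulate this process with guess_parser(), which returns readr's best guess, and parse_guess(), which uses that guess to parse the vector: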

guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "integer"
guess_parser(c("12,352,561"))
#> [1] "number"

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"

The heuristic tries each of the following types in turn, stopping when it finds a match:

logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
integer: contains only numeric characters (and -).
double: contains only valid doubles (including numbers like 4.5e-5).
number: contains valid doubles with the grouping mark inside.
time: matches the default time_format.
date: matches the default date_format.
date-time: any ISO8601 date.

If none of these rules apply, the column stays as a character vector.
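
A single value that does not fit is enough to push the guess to the character fallback. The sketch below shows this with made-up vectors; "n/a" is deliberately not one of the default missing-value strings.

```r
library(readr)

all_numeric <- guess_parser(c("1.5", "2", "3e-2"))
one_stray   <- guess_parser(c("1.5", "2", "n/a"))

all_numeric  # "double"
one_stray    # "character"
```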

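Problems

These defaults do not always work for larger files. There are two basic problems:

The first 1000 rows might be a special case, so the guessed type is not general enough. For example, a column of doubles might contain only integers in its first 1000 rows.
The column might contain many missing values. If the first 1000 rows contain only NAs, readr may guess a character vector, which is probably not what you want.

readr ships with a challenging CSV file that illustrates both problems: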

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 1000 parsing failures.
#> row  col expected               actual             file
#> 1001 x   no trailing characters .23837975086644292 '/home/travis/R/L…
#> 1002 x   no trailing characters .41167997173033655 '/home/travis/R/L…
#> 1003 x   no trailing characters .7460716762579978  '/home/travis/R/L…
#> 1004 x   no trailing characters .723450553836301   '/home/travis/R/L…
#> 1005 x   no trailing characters .614524137461558   '/home/travis/R/L…
#> ...
#> See problems(...) for more details.

Pull out the problems to inspect them in more depth:

problems(challenge)
#> # A tibble: 1,000 x 5
#>     row col   expected               actual             file              
#>   <int> <chr> <chr>                  <chr>              <chr>             
#> 1  1001 x     no trailing characters .23837975086644292 '/home/travis/R/L…
#> 2  1002 x     no trailing characters .41167997173033655 '/home/travis/R/L…
#> 3  1003 x     no trailing characters .7460716762579978  '/home/travis/R/L…
#> 4  1004 x     no trailing characters .723450553836301   '/home/travis/R/L…
#> 5  1005 x     no trailing characters .614524137461558   '/home/travis/R/L…
#> 6  1006 x     no trailing characters .473980569280684   '/home/travis/R/L…
#> # ... with 994 more rows

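When there are problems, a good strategy is to work column by column until none remain. Here the x column has trailing characters after the integer part, which suggests it needs a double parser. Start by copying the column specification into the original call: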

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_integer(),
    y = col_character()
  )
)

Then change the type of the x column to double:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)

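That fixes the first problem. But if you look at the last few rows, you will see that the y column holds dates stored as character strings: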

tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <chr>     
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

You can fix that by declaring y as a date column:

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <date>    
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

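Every parse_*() function has a corresponding col_*() function. These let you correct a column once you have spotted a problem. Better still, if you already know the data types, specify them up front when reading the file.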
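
A minimal sketch of specifying the types up front, using made-up inline data rather than challenge.csv. The call to stop_for_problems() is an optional extra: it turns any remaining parsing failure into an error, so a script stops instead of continuing with silently missing values.

```r
library(readr)

csv_text <- "x,y\n0.5,2019-11-21\n1,2018-03-29\n"

df <- read_csv(
  csv_text,
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

# Turn any parsing failure into an error instead of a warning
stop_for_problems(df)

class(df$x)
class(df$y)
```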

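Other strategies

A few other general strategies can help you parse files:

Look at more rows when guessing. In the previous example, reading just one more row than the default is enough: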

challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
challenge2
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows

Sometimes it is easier to diagnose problems if you read in all the columns as character vectors:

challenge2 <- read_csv(readr_example("challenge.csv"), 
  col_types = cols(.default = col_character())
)

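This is particularly useful in combination with type_convert(), which applies the parsing heuristics to the character columns of a data frame: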

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 1     1.21 
#> 2 2     2.32 
#> 3 3     4.56

# Note the column types
type_convert(df)
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_double()
#> )
#> # A tibble: 3 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1  1.21
#> 2     2  2.32
#> 3     3  4.56

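Writing to a file

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both increase the chances of the output file being read back in correctly by:

Always encoding strings in UTF-8.
Always saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.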

If you want to export a csv file to Excel, use write_excel_csv() — this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.

The most important arguments are the first, the data frame to save, and the second, the path to save it to.

write_csv(challenge, "challenge.csv")

Note that the type information is lost when you save to CSV:

challenge
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> # A tibble: 2,000 x 2
#>       x y    
#>   <int> <chr>
#> 1   404 <NA> 
#> 2  4172 <NA> 
#> 3  3004 <NA> 
#> 4   787 <NA> 
#> 5    37 <NA> 
#> 6  2332 <NA> 
#> # ... with 1,994 more rows

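This makes CSV files somewhat unreliable for caching interim results, because you need to recreate the column specification every time you load them. There are two alternatives:

write_rds() and read_rds() are consistent wrappers around the base functions saveRDS() and readRDS(). They store data in R's own binary format, RDS, which preserves all the type information: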

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # ... with 1,994 more rows
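
The difference between the two formats is easy to check on a tiny data frame. The sketch below writes a made-up data frame with a factor column to temporary files and reads it back both ways; the paths come from tempfile(), so nothing is left in the working directory.

```r
library(readr)

df <- data.frame(x = c(1.5, 2.5), f = factor(c("a", "b")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

csv_back <- read_csv(csv_path, col_types = cols())
rds_back <- read_rds(rds_path)

class(csv_back$f)  # "character": the factor type was lost
class(rds_back$f)  # "factor": the type survived
```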

The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#>       x      y
#>   <dbl> <date>
#> 1   404   <NA>
#> 2  4172   <NA>
#> 3  3004   <NA>
#> 4   787   <NA>
#> 5    37   <NA>
#> 6  2332   <NA>
#> # ... with 1,994 more rows

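Feather tends to be faster than RDS and is usable outside of R.

Other types of data

To get other types of data into R, I recommend starting with the tidyverse packages listed below. For rectangular data:

haven reads SPSS, Stata, and SAS files.
readxl reads Excel files (.xls and .xlsx).
DBI, along with a database-specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc.), allows you to run SQL queries against a database and return a data frame.

For hierarchical data: use jsonlite (by Jeroen Ooms) for JSON, and xml2 for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/.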
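
As a small illustration of the JSON case, the sketch below parses a made-up JSON array of records with jsonlite::fromJSON(), which by default simplifies such an array into a data frame. The field names are assumptions for illustration.

```r
library(jsonlite)

json_text <- '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'
records <- fromJSON(json_text)

class(records)
records$name
```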

For other file types, see the R data import/export manual and the rio package.
