[R for Data Science] (1) readr and tidyr

Notes on data wrangling, which covers data import, tidying, and transformation.

wrangle.png

Overview

Main topics

readr: data import

tidyr: data tidying

tibbles: a modern take on the data frame

Main objects

Relational data

Strings

Factors

Dates and times

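1. Preparation

1.1 Install the required packages in one step: `tidyverse`

install.packages("tidyverse")
library(tidyverse)

1.2 Preparing the files

I keep all files under "E:/R for Data Science/data". They are provided by the book's authors and can be downloaded from GitHub.
File name: "heights.csv"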

2. Data import [`readr`]

This part relies on readr. Its main functions are:

read_csv() reads comma-delimited files

read_csv2() reads semicolon-separated files (common where "," is used as the decimal mark)

read_tsv() reads tab-delimited files

read_fwf() reads fixed-width files; the field layout is described with the helper functions fwf_widths() or fwf_positions()

read_table() reads a common variation of fixed-width files in which columns are separated by white space

read_log() reads Apache-style log files
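
As a quick illustration of how the delimiter-specific readers differ, the sketch below feeds small inline strings to read_csv2() and read_tsv(). The data are made up for this note and do not come from the book's files. It assumes readr 2.0 or later, where I() marks a string as literal data and show_col_types = FALSE silences the column report.

```r
library(readr)

# Semicolon-delimited input with "," as the decimal mark
semi <- read_csv2(I("a;b\n1,5;2\n3,25;4\n"), show_col_types = FALSE)

# Tab-delimited input
tabs <- read_tsv(I("x\ty\n1\thello\n2\tworld\n"), show_col_types = FALSE)

semi$a   # numeric column: 1.50 and 3.25
tabs$y   # character column: "hello" and "world"
```

read_csv2() reports the decimal and grouping marks it assumes, which is a useful reminder that "1,5" is being read as one and a half.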

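2.1 Reading an existing data file

All of these functions share a similar syntax, so read_csv() serves as the example:

library(tidyverse)   ## load the packages

setwd("E:/R for Data Science")  ## set the working directory
heights1 <- read_csv("data/heights.csv")   ## readr guesses the column types and prints the specification it used
heights2 <- read_csv("data/heights.csv",     ## state every column type explicitly with the col_types argument
                     col_types =  cols(
                       earn = col_double(),
                       height = col_double(),
                       sex = col_character(),
                       ed = col_double(),
                       age = col_double(),
                       race = col_character()))

The result is shown in the figure below; the heights2 object is a tibble: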

read_csv截图.png
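
The same column specification can also be written more compactly. The sketch below uses an inline sample that only imitates the column layout of heights.csv (the values are invented), and passes col_types as a string in which d stands for double and c for character. spec() then reports the specification that was applied.

```r
library(readr)

# Invented rows in the same column layout as heights.csv
sample_csv <- I("earn,height,sex,ed,age,race
50000,74.4,male,16,45,white
60000,65.5,female,16,58,white
")

# "ddcddc": double, double, character, double, double, character
heights_demo <- read_csv(sample_csv, col_types = "ddcddc")

spec(heights_demo)            # prints the column specification
sapply(heights_demo, class)   # the resulting column classes
```

Spelling out the types this way makes the import reproducible: if the file changes shape, readr raises parsing problems instead of silently guessing a different type.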

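2.2 Inline data

## By default the first line is used as the column names.
## Comments must stay outside the string, otherwise they become part of the data.
read_csv("a,b,c
                1,2,3
                4,5,6")
## Leading metadata lines are allowed, but must be skipped with skip (or comment).
read_csv("The first line of metadata
          The second line of metadata
          x,y,z
          1,2,3", skip = 2)
read_csv("# A comment I want to skip
         x,y,z
         1,2,3", comment = "#")

The results are shown in the figure below: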

read_csv截图2.png

Other arguments

col_names = TRUE/FALSE: whether the first line of the input is used as the column names. With FALSE, the columns are named X1, X2, and so on; a character vector supplies the names directly.

col_types = cols(...): the type of each column, chosen from the following specifications:

```
col_logical()                 ## T, F, TRUE or FALSE
col_integer()                 ## integers
col_double()                  ## doubles
col_character()               ## everything else
col_factor(levels, ordered)   ## a fixed set of values
col_date(format = "")         ## with the locale's date_format
col_time(format = "")         ## with the locale's time_format
col_datetime(format = "")     ## ISO8601 date times
col_number()                  ## numbers containing the grouping_mark
col_skip()                    ## don't import this column (shorthand: _ or -)
col_guess()                   ## parse using the "best" type based on the input
```

Other ways to read data include `read.csv()` in base R and `fread()` in the data.table package.
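
To make the col_names argument concrete, here is a minimal sketch on inline data. It also shows the na argument, which the notes above do not list but which belongs to the same family of options. It assumes readr 2.0 or later for I() and show_col_types.

```r
library(readr)

# No header line: readr names the columns X1, X2, X3
no_header <- read_csv(I("1,2,3\n4,5,6"), col_names = FALSE, show_col_types = FALSE)

# Supply the column names directly
named <- read_csv(I("1,2,3\n4,5,6"), col_names = c("x", "y", "z"), show_col_types = FALSE)

# Treat "." as a missing value
with_na <- read_csv(I("a,b,c\n1,2,."), na = ".", show_col_types = FALSE)

names(no_header)   # "X1" "X2" "X3"
names(named)       # "x" "y" "z"
with_na$c          # NA
```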

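2.3 Parsing a vector

The parse_*() functions do the main work, for example:

parse_logical()

parse_integer()

parse_date()

parse_time()

parse_datetime()

parse_character()

parse_factor()

2.3.1 Numbers

parse_number()

parse_double()
Points to note:
1. Countries write the decimal mark differently. readr assumes "." by default; for other conventions, set it through the locale argument, e.g. locale(decimal_mark = ",").
2. parse_number() ignores non-numeric characters before and after the number, such as $, @, and %.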

parse_logical(c("TRUE", "FALSE", "NA"))
str(parse_date(c("2010-01-01", "1979-10-14")))
str(parse_logical(c("TRUE", "FALSE", "NA")))
parse_integer(c("1", "231", ".", "456"), na = ".")
parse_integer(c("123", "345", "abc", "123.45"))
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")

parse_X（）.png
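
The examples above vary the grouping mark only. As a complement, this sketch sets the decimal mark, which is the situation described in note 1. The input strings are invented for illustration.

```r
library(readr)

# Default locale: "." is the decimal mark
parse_double("1.23")

# Comma as the decimal mark
x <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Grouping mark "." and decimal mark "," together, with trailing text ignored
y <- parse_number(
  "1.234,56 EUR",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

x   # 1.23
y   # 1234.56
```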

2.3.2 Strings

parse_character()

charToRaw() returns the underlying bytes of a string, printed as hexadecimal numbers.

charToRaw("Hadley")
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
guess_encoding(charToRaw(x1))
guess_encoding(charToRaw(x2))

parse_character().png

2.3.3 Factors

parse_factor() takes a vector of known levels; a value that is not among the levels produces a warning and becomes NA.

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)

parse_factor().png
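
To see exactly which value was rejected, the parsed vector can be passed to problems(). A minimal sketch, reusing the fruit example above and suppressing the warning so that only the inspection step remains:

```r
library(readr)

fruit <- c("apple", "banana")

# "bananana" is not a known level, so it becomes NA (with a warning)
x <- suppressWarnings(
  parse_factor(c("apple", "banana", "bananana"), levels = fruit)
)

x             # apple, banana, NA
problems(x)   # one row describing the unexpected value
```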

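2.3.4 Dates and times (time, date, date-time)

parse_date(): expects a four-digit year, "-" or "/", the month, "-" or "/", then the day. Returns a date (the number of days since 1970-01-01).
parse_time(): expects the hour, ":", the minutes, optionally ":" and the seconds, and an optional "am" or "pm". Returns a time (the number of seconds since midnight).
parse_datetime(): expects an ISO8601 date-time, ordered from the largest to the smallest component: year, month, day, hour, minute, second. Returns a date-time (the number of seconds since midnight 1970-01-01).

Note: month names and the AM/PM marker are expected in English by default; for other languages, specify them with locale().

Year
%Y (4 digits)
%y (2 digits); 00-69 means 2000-2069, 70-99 means 1970-1999.
Month
%m (2 digits, 1-12).
%b (abbreviated name, e.g. "Jan").
%B (full name, e.g. "January").
Day
%d (2 digits, 1-31).
%e (optional leading space).
Time
%H (hour, 0-23)
%I (hour, 0-12, must be used with %p)
%p (AM/PM indicator)
%M (minutes)
%S (integer seconds)
%OS (real seconds)
%Z (time zone as a name, e.g. America/Chicago)
%z (offset from UTC, e.g. +0800)
Non-digits
%. skips one non-digit character
%* skips any number of non-digit characters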

parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr")) ## French month name
library(hms)
parse_time("01:10 am")
parse_time("20:10:01")
parse_datetime("2010-10-01T2010")
parse_datetime("20101010")
parse_date("2010-10-01")

时间.png
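
The format codes listed above can be combined freely. The following sketch applies a few of them to invented inputs. It is meant as an illustration of how the pieces fit together, not as a complete tour of the options.

```r
library(readr)

# Day, full English month name, four-digit year
d <- parse_date("5 March 2015", "%d %B %Y")

# Explicit date-time format; the result is in UTC by default
dt <- parse_datetime("2015-03-05 14:30:00", "%Y-%m-%d %H:%M:%S")

# 12-hour clock: %I must be paired with %p
tm <- parse_time("02:30 pm", "%I:%M %p")

d    # 2015-03-05
dt   # 2015-03-05 14:30:00 UTC
tm   # 14:30:00
```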

2.4 Parsing a file

readr guesses the type of each column with guess_parser() and then parses it with parse_guess().
To find out where parsing went wrong, use problems().

logical: only “F”, “T”, “FALSE”, or “TRUE”.
integer: only numeric characters (and -).
double: only valid doubles (including numbers like 4.5e-5).
number: valid doubles with the grouping mark inside.
time: matches the default time_format
date: matches the default date_format
date-time: any ISO8601 date.

guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))
parse_guess("2010-10-10")
str(parse_guess("2010-10-10"))

guess_parse().png
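
Since problems() is mentioned above without an example, here is a short sketch of how it is typically used. It reuses the parse_integer() call from section 2.3.1, where two of the four inputs cannot be read as integers.

```r
library(readr)

# "abc" is not a number and "123.45" has trailing characters
x <- suppressWarnings(parse_integer(c("123", "345", "abc", "123.45")))

x             # the two failures come back as NA
problems(x)   # a tibble with one row per failure: row, col, expected, actual
```

The same call works on a data frame returned by read_csv(), which makes it the first thing to check when a column ends up with unexpected NA values.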

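2.5 Writing to a file

For CSV files, use write_csv().
write_rds() and read_rds() form a pair; they are readr's wrappers around the base functions saveRDS() and readRDS(), which store R objects in the RDS format.

write_csv(heights_try, "heights_try.csv")   ## heights_try: a data frame assumed to exist already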
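
One practical difference between the two formats is that CSV does not store column types, whereas RDS keeps the R object as it was. The sketch below demonstrates this with a small invented tibble written to temporary files. The object and file names are placeholders chosen for this example.

```r
library(readr)

df <- tibble::tibble(
  grp = factor(c("a", "b", "a")),
  val = c(1.5, 2.5, 3.5)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$grp)   # "character": the factor type is lost in the CSV
class(from_rds$grp)   # "factor": the RDS file preserves it
```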

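2.6 Other types of data

haven reads SPSS, Stata, and SAS files
readxl reads Excel files (.xls and .xlsx)
DBI, together with a database backend (e.g. RMySQL, RSQLite, RPostgreSQL), runs queries against a database
Hierarchical data such as JSON and XML from the web: jsonlite and xml2

3. Tidy data [`tidyr`]

Rules for a tidy data frame: each variable has its own column, each observation has its own row, and each value has its own cell.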

说明.png

Examples:

table1
table2
table3
table4a   # cases
table4b   # population

table.png

Taking table1 as an example:

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>% 
  count(year, wt = cases)

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

1例.png

Rplot.jpeg

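3.1 Gathering (wide to long)

gather() solves the problem of one variable being spread across several columns, where some column names are really values of a variable.
gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

Take table4a, which records the number of cases per country in 1999 and 2000. The tidy data frame should keep this information, with two new columns defined for the year and the number of cases.
In the code, the column names 1999 and 2000 are values of a single variable, "year", and the six cell values are the counts, "cases".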

table4a   # cases
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")

## combine the tidied versions of table4a and table4b into a single tibble, use `dplyr::left_join()`
tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)

gather_table4.png

gather.png
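
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The notes above use gather(), so the following is only a sketch of the equivalent call for readers on a recent tidyr. The rows come out in a different order, since pivot_longer() works through the input row by row.

```r
library(tidyr)

# Same reshaping as gather(`1999`, `2000`, key = "year", value = "cases")
long_cases <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

long_cases   # 6 rows: one per country and year
```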

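3.2 Spreading (long to wide)

spread() solves the problem of one observation being scattered across several rows.
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Take table2: the type column holds the names of two variables, cases and population.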

table2
table2 %>%   
  spread(key = type, value = count)

table2.png

spread.png
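
For readers on tidyr 1.0.0 or later, where spread() is superseded, the same reshaping can be sketched with pivot_wider(). This is an alternative to the spread() call above, not the method used in these notes.

```r
library(tidyr)

# Same reshaping as spread(key = type, value = count)
wide <- pivot_wider(table2, names_from = type, values_from = count)

wide   # 6 rows, with cases and population as separate columns
```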

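3.3 Separating

separate() solves the problem of one column holding two variables by splitting it into several columns.
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
Take table3: the rate column contains both the number of cases and the population.

By default, the split happens wherever a non-alphanumeric character (anything that is not a digit or a letter) appears. Here that is the forward slash.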

table3
table3 %>% 
  separate(rate, into = c("cases", "population"))
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")  ## same result as above
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)  ## convert = TRUE gives the new columns more suitable types (integers here)
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)  ## an integer sep splits at that character position

table3.png

separate.png

Q: What if a row yields too few or too many pieces? For too few, set fill = "warn" (the default), "right", or "left"; for too many, set extra = "warn" (the default), "drop", or "merge".
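
A small sketch of what these options do, using two invented one-column tibbles: one with a row that has an extra piece, and one with a row that is a piece short.

```r
library(tidyr)

too_many <- tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
too_few  <- tibble::tibble(x = c("a,b,c", "d,e", "f,g,i"))

# extra = "drop" discards the surplus piece, "merge" keeps it in the last column
dropped <- separate(too_many, x, c("one", "two", "three"), extra = "drop")
merged  <- separate(too_many, x, c("one", "two", "three"), extra = "merge")

# fill = "right" pads the missing piece with NA on the right
filled <- separate(too_few, x, c("one", "two", "three"), fill = "right")

dropped$three[2]   # "f"
merged$three[2]    # "f,g"
filled$three[2]    # NA
```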

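3.4 Uniting

unite() is the inverse of separate(): it solves the problem of a single variable being split across several columns by combining them into one.
unite(data, col, ..., sep = "_", remove = TRUE)
Take table5: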

table5
table5 %>% 
  unite(new, century, year) 
table5 %>% 
  unite(new, century, year, sep = "") ## the default separator is "_"; sep = "" joins the values directly

table5.png

unite.png
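
Because unite() undoes separate(), a round trip should return the original values. The sketch below checks this on table3; the intermediate column name yr is a placeholder chosen for this example.

```r
library(tidyr)

# Split the year into century and two-digit year ...
split_year <- separate(table3, year, into = c("century", "yr"), sep = 2)

# ... and glue the two pieces back together
back <- unite(split_year, "year", century, yr, sep = "")

back$year   # character values such as "1999" and "2000"
```

Note that the reunited column is a character vector, so it needs as.integer() before it can be compared with the original year column.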

3.5 Missing values

Values can be missing in two ways:
Explicitly, i.e. flagged with NA
Implicitly, i.e. simply absent from the data
complete(data, ..., fill = list())
fill(data, ..., .direction = c("down", "up"))

Using stocks and treatment as examples:

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
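stocks %>% 
  spread(year, return)   ## makes the implicit missing value explicit
stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)  ## if the missing values do not matter, drop them with na.rm = TRUE
stocks %>% 
  complete(year, qtr)  ## shows every combination of year and qtr, including rows with missing values

## Filling in values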
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
treatment %>% 
  fill(person)       ## fills each NA with the most recent non-missing value above it ("down" is the default direction)

stocks.png

treatment.png
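
To make the effect of complete() and the .direction argument of fill() easier to verify, here is a compact sketch. The stocks tibble is the one defined above; the people tibble is invented for this example.

```r
library(tidyr)

stocks <- tibble::tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# 2 years x 4 quarters = 8 rows; the row for 2016 Q1 is added with return = NA
full <- complete(stocks, year, qtr)
nrow(full)                # 8
sum(is.na(full$return))   # 2: one explicit NA plus one formerly implicit

people <- tibble::tibble(person = c(NA, NA, "A", NA, "B"))

# "down" carries the last value forward, "up" carries the next value backward
down <- fill(people, person, .direction = "down")
up   <- fill(people, person, .direction = "up")

down$person   # NA NA "A" "A" "B"
up$person     # "A" "A" "A" "B" "B"
```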
