The tidyr package
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.1.0 √ purrr 0.3.0
## √ tibble 2.0.1 √ dplyr 0.8.0.1
## √ tidyr 0.8.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The tables below all hold the same data, organized in different ways.
Some of these layouts are much harder to work with than others.
table1
## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4
## country year type count
##
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## *
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## *
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## *
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
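Use tidyr to make data tidy, meaning it meets three requirements:
1. Each variable occupies its own column.
2. Each observation occupies its own row.
3. Each value occupies its own cell.
- Note: rows are observations, columns are variables, and each position is a single cell.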
By this standard, only the first table, table1, is tidy.
dplyr, ggplot2 and similar packages work naturally with tidy data, so it pays to tidy your data first.
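To make the "one value per cell" requirement concrete, here is a minimal sketch of one way to repair table3, whose rate column packs two values into a single string. It uses tidyr's separate(), which these notes do not otherwise cover; the name table3_tidy is made up for illustration.

```r
library(tidyr)

# table3 ships with tidyr; its rate column holds "cases/population" strings.
# separate() splits each string on "/" and convert = TRUE turns the
# resulting pieces into numbers instead of leaving them as text.
table3_tidy <- separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/",
  convert = TRUE
)

table3_tidy
```

The result should have the same four columns as table1, which is why table1 counts as tidy and table3 does not.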
Computing the rate (cases per 10,000 people) with mutate
table1 %>%
dplyr::mutate(rate = cases / population * 10000)
## # A tibble: 6 x 5
## country year cases population rate
##
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
Computing total cases per year
table1 %>%
count(year, wt = cases)
## # A tibble: 2 x 2
## year n
##
## 1 1999 250740
## 2 2000 296920
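The wt argument turns count() from a row counter into a weighted sum. As a rough sketch of what that means, the same totals can be written with group_by() and summarise(); the names by_count and by_summarise are only for illustration.

```r
library(dplyr)
library(tidyr)  # provides the table1 example data

# Weighted count: n is the sum of cases within each year.
by_count <- table1 %>%
  count(year, wt = cases)

# The same totals written out explicitly.
by_summarise <- table1 %>%
  group_by(year) %>%
  summarise(n = sum(cases))

by_count
by_summarise
```

Both forms should give the per-year totals shown above; count() is simply the shorter spelling.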
Visualizing the data along the way
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
In real data analysis, most of the data we encounter is not tidy.
spreading and gathering
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## *
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
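gather reshapes the data. Before using it, be clear about what the variables are and what the observations are.
In table4a, 1999 is really a value, not a variable. The variable should be year. gather takes the name of that new column as key = "" and the name of the column holding the cell values as value = "".
Example: list the columns to gather, name the key column year and the value column cases.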
table4a<-table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population") %>%
right_join(table4a) ## right_join keeps every row of y (table4a here)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
## country year population cases
##
## 1 Afghanistan 1999 19987071 745
## 2 Brazil 1999 172006362 37737
## 3 China 1999 1272915272 212258
## 4 Afghanistan 2000 20595360 2666
## 5 Brazil 2000 174504898 80488
## 6 China 2000 1280428583 213766
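The example above overwrites table4a and lets the join guess its keys. A slightly more defensive variant, offered only as a sketch, keeps the packaged tables untouched and names the keys explicitly; tidy4a, tidy4b and tidy4 are hypothetical names, and left_join is swapped in for right_join so that cases comes before population.

```r
library(dplyr)
library(tidyr)

# Read the originals through tidyr:: so the sketch still works
# even if table4a was reassigned earlier in the session.
tidy4a <- tidyr::table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

tidy4b <- tidyr::table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

# Naming the join keys also silences the "Joining, by" message.
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year"))

tidy4
```

Note that year is a character column here, because its values came from column names.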
spread: the reverse operation
table2
## # A tibble: 12 x 4
## country year type count
##
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
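As a quick sanity check, the spread result can be compared with table1 directly. This is only a sketch: wide is an illustrative name, and the comparison looks at column names and values rather than exact column types, which may differ between tidyr versions.

```r
library(tidyr)

# Spread the type column into one column per type (cases, population).
wide <- spread(table2, key = type, value = count)

# Same column names and the same values as table1, row for row.
identical(names(wide), names(table1))
all(wide$cases == table1$cases)
all(wide$population == table1$population)
```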
gather
table2
## # A tibble: 12 x 4
## country year type count
##
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
df3<-table2 %>%
spread(key = year, value = count) ## spread by year
df3
## # A tibble: 6 x 4
## country type `1999` `2000`
##
## 1 Afghanistan cases 745 2666
## 2 Afghanistan population 19987071 20595360
## 3 Brazil cases 37737 80488
## 4 Brazil population 172006362 174504898
## 5 China cases 212258 213766
## 6 China population 1272915272 1280428583
Reversing the spread
df4 <- gather(df3, "1999", "2000", key = "year", value = "count")
df4
## # A tibble: 12 x 4
## country type year count
##
## 1 Afghanistan cases 1999 745
## 2 Afghanistan population 1999 19987071
## 3 Brazil cases 1999 37737
## 4 Brazil population 1999 172006362
## 5 China cases 1999 212258
## 6 China population 1999 1272915272
## 7 Afghanistan cases 2000 2666
## 8 Afghanistan population 2000 20595360
## 9 Brazil cases 2000 80488
## 10 Brazil population 2000 174504898
## 11 China cases 2000 213766
## 12 China population 2000 1280428583
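spread and gather are not perfectly symmetrical: a round trip changes the column order, and year comes back as a character column because its values passed through column names.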
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
## year half return
##
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2016 1 0.92
## 4 2016 2 0.17
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
## half year return
##
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
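One way to soften the asymmetry is gather's convert argument, which runs type conversion on the key column. The sketch below rebuilds stocks so that it runs on its own; wide, as_chr and as_num are illustrative names.

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

wide <- spread(stocks, year, return)

# Default: year comes back as character, since it was stored as column names.
as_chr <- gather(wide, "year", "return", `2015`:`2016`)

# convert = TRUE converts the key column back to a numeric type.
as_num <- gather(wide, "year", "return", `2015`:`2016`, convert = TRUE)

class(as_chr$year)
class(as_num$year)
```

Even with convert = TRUE the column order still differs from the original stocks, so the round trip is not exact.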