The tidyr package

library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.1.0       √ purrr   0.3.0  
## √ tibble  2.0.1       √ dplyr   0.8.0.1
## √ tidyr   0.8.2       √ stringr 1.4.0  
## √ readr   1.3.1       √ forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

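The tables below all hold the same underlying data, but each one organizes it differently.
Some of these layouts are much harder to work with than others.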

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

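Use tidyr to make data tidy, which means meeting the following requirements:

1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
In other words, rows are observations, columns are variables, and each position holds exactly one value.

By these rules, only the first table, table1, is tidy.

dplyr, ggplot2 and the other tidyverse packages work naturally with tidy data, so it pays to get your data into tidy form first.

Computing rate with mutate (cases per 10,000 people)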

table1 %>% 
  dplyr::mutate(rate = cases / population * 10000)
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

Computing total cases per year

table1 %>% 
  count(year, wt = cases)
## # A tibble: 2 x 2
##    year      n
##   <int>  <int>
## 1  1999 250740
## 2  2000 296920
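
The `wt = cases` argument makes count() sum the cases column within each year instead of counting rows. As a rough sketch of what that shorthand does, the same totals can be written out with group_by() and summarise(); the name cases_by_year is only an illustrative choice.

```r
library(dplyr)
library(tidyr)  # provides the table1 example data

# Sum `cases` within each year; this mirrors count(year, wt = cases)
cases_by_year <- table1 %>%
  group_by(year) %>%
  summarise(n = sum(cases))

cases_by_year
```

The explicit form is longer, but it makes it easier to swap sum() for another summary such as mean() or max().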

A quick visualization while we are at it

library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

In real data analysis, most of the data we encounter is not tidy.

spreading and gathering
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

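gather reshapes the data. Before using it, be clear about what the variables are and what the observations are.

In table4a, 1999 is really a value, not a variable. The variable should be year; in gather() the name of this new column is given by key = "", and the name of the column that receives the cell values is given by value = "".

Example: list the columns to gather, name the key column year and the value column cases.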

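## note: this overwrites table4a in the workspace with its tidy version
table4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population") %>% 
  right_join(table4a)  ## right_join keeps every row of y (the tidy table4a)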
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
##   country     year  population  cases
##   <chr>       <chr>      <int>  <int>
## 1 Afghanistan 1999    19987071    745
## 2 Brazil      1999   172006362  37737
## 3 China       1999  1272915272 212258
## 4 Afghanistan 2000    20595360   2666
## 5 Brazil      2000   174504898  80488
## 6 China       2000  1280428583 213766
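
The session above runs tidyr 0.8.2. From tidyr 1.0.0 onward the package also provides pivot_longer(), which covers the same use case as gather(). The following is a minimal sketch assuming tidyr 1.0.0 or later is installed; the names tidy4a, tidy4b and tidy4 are illustrative, and the datasets are referenced as tidyr::table4a and tidyr::table4b because the example above overwrites table4a in the workspace.

```r
library(dplyr)
library(tidyr)  # pivot_longer() requires tidyr >= 1.0.0

# Move the year columns into a `year` key column and a value column
tidy4a <- tidyr::table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- tidyr::table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Spell out the join keys instead of relying on the "Joining, by" message
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year"))
tidy4
```

As with gather(), year comes back as a character column. Rows are ordered country by country rather than year by year, so the result matches the gather() output in content but not in row order.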

spread: the reverse process

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>%
    spread(key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
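
For readers on tidyr 1.0.0 or later, pivot_wider() plays the role of spread(). This is a hedged sketch under that version assumption; wide2 is an illustrative name.

```r
library(tidyr)  # pivot_wider() requires tidyr >= 1.0.0

# Each distinct value of `type` becomes a column filled from `count`
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
wide2
```

The result has one row per country and year, with cases and population as separate columns, which is the layout of table1.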

gather

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
df3 <- table2 %>%
    spread(key = year, value = count)  ## spread by year: one column per year
df3
## # A tibble: 6 x 4
##   country     type           `1999`     `2000`
##   <chr>       <chr>           <int>      <int>
## 1 Afghanistan cases             745       2666
## 2 Afghanistan population   19987071   20595360
## 3 Brazil      cases           37737      80488
## 4 Brazil      population  172006362  174504898
## 5 China       cases          212258     213766
## 6 China       population 1272915272 1280428583

Reversing the spread

## gather the two year columns back into year/count pairs
df4 <- gather(df3, "1999", "2000", key = "year", value = "count")
df4
## # A tibble: 12 x 4
##    country     type       year       count
##    <chr>       <chr>      <chr>      <int>
##  1 Afghanistan cases      1999         745
##  2 Afghanistan population 1999    19987071
##  3 Brazil      cases      1999       37737
##  4 Brazil      population 1999   172006362
##  5 China       cases      1999      212258
##  6 China       population 1999  1272915272
##  7 Afghanistan cases      2000        2666
##  8 Afghanistan population 2000    20595360
##  9 Brazil      cases      2000       80488
## 10 Brazil      population 2000   174504898
## 11 China       cases      2000      213766
## 12 China       population 2000  1280428583

spread and gather are not perfectly symmetric

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
##    year  half return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2016     1   0.92
## 4  2016     2   0.17
stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     1 2016    0.92
## 4     2 2016    0.17
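
The round trip above does not return the original tibble: the columns come back in a different order, and year is now character because spread() turned its values into column names. One way to recover a numeric type is the convert argument of gather(), which re-guesses the type of the key column. The sketch below rebuilds stocks so that it runs on its own; round_trip is an illustrative name.

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# convert = TRUE runs type conversion on the new key column
round_trip <- stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE)

class(stocks$year)      # "numeric"
class(round_trip$year)  # "integer"
```

Even with convert = TRUE the result is not identical to the input: year becomes integer rather than double, and half still comes first.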
