R for data science ||使用dplyr处理关系数据

在处理数据的时候,很少遇到那种只对一张表进行处理的情况。经常的状况是关于某个主题有几张表,这些表以某种关键字联系在一起。

这时候:
• 合并连接:向数据框中加入新变量,新变量的值是另一个数据框中的匹配观测。
• 筛选连接:根据是否匹配另一个数据框中的观测,筛选数据框中的观测。
• 集合操作:将观测作为集合元素来处理。

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.2.0       √ purrr   0.3.2  
√ tibble  2.1.1       √ dplyr   0.8.0.1
√ tidyr   0.8.3       √ stringr 1.4.0  
√ readr   1.3.1       √ forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: 程辑包‘tidyverse’是用R版本3.5.3 来建造的 
2: 程辑包‘ggplot2’是用R版本3.5.3 来建造的 
3: 程辑包‘tibble’是用R版本3.5.3 来建造的 
4: 程辑包‘tidyr’是用R版本3.5.3 来建造的 
5: 程辑包‘readr’是用R版本3.5.2 来建造的 
6: 程辑包‘purrr’是用R版本3.5.3 来建造的 
7: 程辑包‘dplyr’是用R版本3.5.2 来建造的 
8: 程辑包‘stringr’是用R版本3.5.2 来建造的 
> setwd("D:\\Users\\Administrator\\Desktop\\RStudio\\R-Programming")

描述航空公司的四个tibble数据:

  • airlines:可以根据航空公司的缩写码查到公司全名。
  • airports:给出了每个机场的信息,通过 faa 机场编码进行标识。
  • planes:给出了每架飞机的信息,通过 tailnum 进行标识。
  • weather:给出了纽约机场每小时的天气状况。
> library(nycflights13)
Warning message:
程辑包‘nycflights13’是用R版本3.5.3 来建造的 
> airlines
# A tibble: 16 x 2
   carrier name                       
                            
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
> airports
# A tibble: 1,458 x 8
   faa   name                             lat    lon   alt    tz dst   tzone              
                                                  
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/New_York   
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/Chicago    
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/Chicago    
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/New_York   
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/New_York   
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/New_York   
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/New_York   
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/New_York   
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/New_York   
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/Los_Angeles
# ... with 1,448 more rows
> planes
# A tibble: 3,322 x 9
   tailnum  year type                    manufacturer     model     engines seats speed engine   
                                                    
 1 N10156   2004 Fixed wing multi engine EMBRAER          EMB-145XR       2    55    NA Turbo-fan
 2 N102UW   1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 3 N103US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 4 N104UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 5 N10575   2002 Fixed wing multi engine EMBRAER          EMB-145LR       2    55    NA Turbo-fan
 6 N105UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 7 N107US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 8 N108UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 9 N109UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
10 N110UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
# ... with 3,312 more rows
> weather
# A tibble: 26,115 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
                                               
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA      0    1012     10 2013-01-01 01:00:00
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA      0    1012.    10 2013-01-01 02:00:00
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA      0    1012.    10 2013-01-01 03:00:00
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA      0    1012.    10 2013-01-01 04:00:00
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA      0    1012.    10 2013-01-01 05:00:00
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA      0    1012.    10 2013-01-01 06:00:00
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0         NA      0    1012.    10 2013-01-01 07:00:00
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4         NA      0    1012.    10 2013-01-01 08:00:00
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0         NA      0    1013.    10 2013-01-01 09:00:00
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8         NA      0    1012.    10 2013-01-01 10:00:00
# ... with 26,105 more rows

R for data science ||使用dplyr处理关系数据_第1张图片
  • flights connects to planes via a single variable, tailnum.
  • flights connects to airlines through the carrier variable.
  • flights connects to airports in two ways: via the origin and dest variables.
  • flights connects to weather via origin (the location), and year, month, day and hour (the time).

用于连接每对数据表的变量称为键。键是能唯一标识观测的变量(或变量集合)。
• 主键:唯一标识其所在数据表中的观测。例如, planes
$tailnum 是一个主键,因为它可以唯一表示planes的每架飞机。

  • 外键:唯一表示另一个数据框的观测数据。如filghts$tailnum 是一个外键,因为其出现在 flights 表中,并可以将每次航班与唯一一架飞机匹配。
> # 对主键进行 count() 操作,然后查看是否有 n 大于 1 的记录,即查看是否唯一值
> planes %>%
+   count(tailnum) %>%
+   filter(n > 1)
# A tibble: 0 x 2
# ... with 2 variables: tailnum , n 
> weather %>%
+   count(year, month, day, hour, origin) %>%
+   filter(n > 1)
# A tibble: 3 x 6
   year month   day  hour origin     n
        
1  2013    11     3     1 EWR        2
2  2013    11     3     1 JFK        2
3  2013    11     3     1 LGA        2
> #flights 表中的主键是什么?
> flights %>%
+   count(year, month, day, flight) %>%
+   filter(n > 1)
# A tibble: 29,768 x 5
    year month   day flight     n
        
 1  2013     1     1      1     2
 2  2013     1     1      3     2
 3  2013     1     1      4     2
 4  2013     1     1     11     3
 5  2013     1     1     15     2
 6  2013     1     1     21     2
 7  2013     1     1     27     4
 8  2013     1     1     31     2
 9  2013     1     1     32     2
10  2013     1     1     35     2
# ... with 29,758 more rows
> flights %>%
+   count(year, month, day, tailnum) %>%
+   filter(n > 1)
# A tibble: 64,928 x 5
    year month   day tailnum     n
         
 1  2013     1     1 N0EGMQ      2
 2  2013     1     1 N11189      2
 3  2013     1     1 N11536      2
 4  2013     1     1 N11544      3
 5  2013     1     1 N11551      2
 6  2013     1     1 N12540      2
 7  2013     1     1 N12567      2
 8  2013     1     1 N13123      2
 9  2013     1     1 N13538      3
10  2013     1     1 N13566      3
# ... with 64,918 more rows
> 

如果一张表没有主键,有时就需要使用 mutate() 函数和 row_number() 函数为表加上一个主键。这样一来,
如果你完成了一些筛选工作,并想要使用原始数据检查的话,就可以更容易地匹配观测。这种主键称为代理键。

理解链接
R for data science ||使用dplyr处理关系数据_第2张图片
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
R for data science ||使用dplyr处理关系数据_第3张图片
R for data science ||使用dplyr处理关系数据_第4张图片
Inner join
R for data science ||使用dplyr处理关系数据_第5张图片
x %>% 
  inner_join(y, by = "key")
#> # A tibble: 2 x 3
#>     key val_x val_y
#>     
#> 1     1 x1    y1   
#> 2     2 x2    y2
Outer joins

A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

R for data science ||使用dplyr处理关系数据_第6张图片

Another way to depict the different types of joins is with a Venn diagram:

R for data science ||使用dplyr处理关系数据_第7张图片
Duplicate keys
R for data science ||使用dplyr处理关系数据_第8张图片
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
#> # A tibble: 4 x 3
#>     key val_x val_y
#>     
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x3    y2   
#> 4     1 x4    y1
R for data science ||使用dplyr处理关系数据_第9张图片
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")
#> # A tibble: 6 x 3
#>     key val_x val_y
#>     
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x2    y3   
#> 4     2 x3    y2   
#> 5     2 x3    y3   
#> 6     3 x4    y4
Defining the key columns
flights2 %>% 
  left_join(weather)
#> Joining, by = c("year", "month", "day", "hour", "origin")
#> # A tibble: 336,776 x 18
#>    year month   day  hour origin dest  tailnum carrier  temp  dewp humid
#>                  
#> 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
#> 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
#> 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
#> # … with 3.368e+05 more rows, and 7 more variables: wind_dir ,
#> #   wind_speed , wind_gust , precip , pressure ,
#> #   visib , time_hour 
flights2 %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>                   
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer ,
#> #   model , engines , seats , speed , engine 
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon
#>                  
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Geor…  30.0 -95.3
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      Geor…  30.0 -95.3
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6         NA    NA  
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Chic…  42.0 -87.9
#> # … with 3.368e+05 more rows, and 4 more variables: alt , tz ,
#> #   dst , tzone 

flights2 %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon
#>                  
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2
#> # … with 3.368e+05 more rows, and 4 more variables: alt , tz ,
#> #   dst , tzone 
Other implementations
dplyr merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)
Filtering joins

semi_join(x, y): 保留 x 表中与 y 表中的观测相匹配的所有观测。
• anti_join(x, y): 丢弃 x 表中与 y 表中的观测相匹配的所有观测。

top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# 筛选出前10个
flights %>%
filter(dest %in% top_dest$dest)
# 半连接,只保留两个数据框都有的行的数据
flights %>%
semi_join(top_dest)

# 反连接保留 x 表中那些没有匹配 y 表的行。
# 反连接可以用于诊断连接中的不匹配。例如,在连接 flights 和 planes 时,你可能想知道flights 中是否有很多行在 planes 中没有匹配记录:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
R for data science ||使用dplyr处理关系数据_第10张图片
R for data science ||使用dplyr处理关系数据_第11张图片
R for data science ||使用dplyr处理关系数据_第12张图片
Join problems
Set operations

The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:

intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y.
Given this simple data:

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
The four possibilities are:
intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>    
#> 1     1     1
# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>    
#> 1     1     2
#> 2     2     1
#> 3     1     1
setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>    
#> 1     2     1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>    
#> 1     1     2

r4ds
R数据科学(九)使用dplyr处理关系数据

你可能感兴趣的:(R for data science ||使用dplyr处理关系数据)