R-dplyr 处理关系数据

  • 合并连接 加入新变量
  • 筛选连接 筛选原有观测
  • 集合操作 相同变量进行集合

由于连接每对数据表的变量称为键(key)
主键:唯一能识别其所在数据表中的观测
外键:唯一标识另一个数据表的观测
可通过count()函数验证主键

> planes %>% count(tailnum) %>% 
+   filter(n > 1)
# A tibble: 0 x 2
# … with 2 variables: tailnum , n 

1. 合并连接

R-dplyr 处理关系数据_第1张图片
连接方式

left_join(x, y, by = NULL)
e.g.

> flight2 <- flights %>%
+   select(year:day, hour, origin, dest, tailnum, carrier)
> flight2
# A tibble: 336,776 x 8
    year month   day  hour origin dest  tailnum carrier
               
 1  2013     1     1     5 EWR    IAH   N14228  UA     
 2  2013     1     1     5 LGA    IAH   N24211  UA     
 3  2013     1     1     5 JFK    MIA   N619AA  AA     
 4  2013     1     1     5 JFK    BQN   N804JB  B6     
 5  2013     1     1     6 LGA    ATL   N668DN  DL     
 6  2013     1     1     5 EWR    ORD   N39463  UA     
 7  2013     1     1     6 EWR    FLL   N516JB  B6     
 8  2013     1     1     6 LGA    IAD   N829AS  EV     
 9  2013     1     1     6 JFK    MCO   N593JB  B6     
10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
# … with 336,766 more rows

> flight2 %>% select(-origin, -dest) %>%
+   left_join(airlines, by = "carrier")
# A tibble: 336,776 x 7
    year month   day  hour tailnum carrier name                   
                               
 1  2013     1     1     5 N14228  UA      United Air Lines Inc.  
 2  2013     1     1     5 N24211  UA      United Air Lines Inc.  
 3  2013     1     1     5 N619AA  AA      American Airlines Inc. 
 4  2013     1     1     5 N804JB  B6      JetBlue Airways        
 5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.   
 6  2013     1     1     5 N39463  UA      United Air Lines Inc.  
 7  2013     1     1     6 N516JB  B6      JetBlue Airways        
 8  2013     1     1     6 N829AS  EV      ExpressJet Airlines In…
 9  2013     1     1     6 N593JB  B6      JetBlue Airways        
10  2013     1     1     6 N3ALAA  AA      American Airlines Inc. 
# … with 336,766 more rows
  • 当存在重复键时,会得倒所有可能的组合

  • 定义键列:

  1. key默认值为NULL
flight2 %>% left_join(weather)
  1. by = "x"
flights2 %>% left_join(planes, by = "tailnum")
  1. 命名字符变量 by = c("a" = "b") 匹配x表中的a变量与y表中的b变量
flight2 %>% left_join(airports, c("dest" = "faa"))

2. 筛选连接

semi_join(x, y) 保留x中与y中观测想匹配的观测
anti_join(x, y) 丢弃x中与y中观测想匹配的观测

e.g. 找出飞往最受欢迎的10个目的地的航班

> top_dest <- flights %>%
+   count(dest, sort = TRUE) %>%
+   head(10)

> flights %>%
+   semi_join(top_dest)
Joining, by = "dest"
# A tibble: 141,145 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
                            
 1  2013     1     1      542            540         2      923
 2  2013     1     1      554            600        -6      812
 3  2013     1     1      554            558        -4      740
 4  2013     1     1      555            600        -5      913
 5  2013     1     1      557            600        -3      838
 6  2013     1     1      558            600        -2      753
 7  2013     1     1      558            600        -2      924
 8  2013     1     1      558            600        -2      923
 9  2013     1     1      559            559         0      702
10  2013     1     1      600            600         0      851
# … with 141,135 more rows, and 12 more variables:
#   sched_arr_time , arr_delay , carrier ,
#   flight , tailnum , origin , dest ,
#   air_time , distance , hour , minute ,
#   time_hour 

3. 集合

集合操作基于整行,需要 x 和 y 具有相同的变量


R-dplyr 处理关系数据_第2张图片
集合

你可能感兴趣的:(R-dplyr 处理关系数据)