plydata包 - Python中的dplyr Pipe(管道)数据处理方式

Hassan Kibirige开发的plydata包借鉴了R dplyr包的语法以及实现了管道函数%>%，并且与该作者另一个包plotnine(类ggplot2语法绘图工具包)，本文在于对plydata包语法的入门以及与dplyr包的对比，让对R熟悉的同学（比如我）更好的在python进行数据处理工作。

首先导入包以及数据集，数据集使用经典的nycflights13中的flights，这个数据集在python和R中均有包直接调用

# in R
library(dplyr)
library(nycflights13)

> head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58
# ... with 1 more variable: time_hour

# in Python
import pandas as pd
import numpy as np
from plydata import *
from nycflights13 import flights

flights.head(6)
Out[]: 
   year  month  day  dep_time  ...  distance  hour  minute             time_hour
0  2013      1    1     517.0  ...      1400     5      15  2013-01-01T10:00:00Z
1  2013      1    1     533.0  ...      1416     5      29  2013-01-01T10:00:00Z
2  2013      1    1     542.0  ...      1089     5      40  2013-01-01T10:00:00Z
3  2013      1    1     544.0  ...      1576     5      45  2013-01-01T10:00:00Z
4  2013      1    1     554.0  ...       762     6       0  2013-01-01T11:00:00Z
5  2013      1    1     554.0  ...       719     5      58  2013-01-01T10:00:00Z

[6 rows x 19 columns]

Single Table Verbs

dplyr与plydata的数据处理逻辑为提供一个函数，仅仅执行一种Verb数据处理（data manipulation）。所有的Verbs可以归于以下三类：

行
- filter() 根据条件筛选行注：python中为query()
- slice() 根据位置选择行
- arrange() 排列行
列
- select() 选择制定列
- rename() 修改列名
- mutate() 创建新列
- relocate() 修改列顺序
行聚合
- summarise() 根据运算方式聚合为一行

Pipe - 管道操作

所有的dplyr与plydata函数均取data frame为第一个参数，所以与之每个函数中放入data frame，管道操作提供了一个数据处理的流水线，通过%>%和>>符号实现。

filter()与query()

选取数据集中飞行距离小于50英里的航班

# in R
> flights %>% filter(distance < 50)
# A tibble: 1 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                
1  2013     7    27       NA            106        NA       NA            245        NA US        1632 NA      EWR    LGA         NA       17     1      6
# ... with 1 more variable: time_hour

# in Python
flights >> query('distance < 50')
Out[]: 
        year  month  day  ...  hour  minute             time_hour
275945  2013      7   27  ...     1       6  2013-07-27T05:00:00Z

[1 rows x 19 columns]

由结果可知，2013年唯一一个飞行距离低于50英里的航班是从纽约NWR纽瓦克机场飞到纽约LGA拉瓜地机场

选取所有1月1日从LGA机场起飞至DTW机场的航班

# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW')
# A tibble: 12 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                 
 1  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6      5
 2  2013     1     1      659            705        -6      907            913        -6 DL         831 N998DL  LGA    DTW        105      502     7      5
 3  2013     1     1      857            905        -8     1107           1120       -13 DL         181 N321NB  LGA    DTW        110      502     9      5
 4  2013     1     1      858            900        -2     1102           1110        -8 MQ        4478 N737MQ  LGA    DTW        103      502     9      0
 5  2013     1     1     1048           1050        -2     1302           1250        12 MQ        4589 N537MQ  LGA    DTW        112      502    10     50
 6  2013     1     1     1113           1115        -2     1318           1315         3 DL        1031 N320NB  LGA    DTW        104      502    11     15
 7  2013     1     1     1231           1238        -7     1449           1446         3 DL        1131 N920DL  LGA    DTW        108      502    12     38
 8  2013     1     1     1305           1315       -10     1523           1520         3 MQ        4564 N725MQ  LGA    DTW        102      502    13     15
 9  2013     1     1     1440           1440         0     1658           1643        15 DL        1231 N926DL  LGA    DTW         94      502    14     40
10  2013     1     1     1640           1640         0     1859           1850         9 MQ        4540 N723MQ  LGA    DTW        106      502    16     40
11  2013     1     1     1739           1745        -6     1956           1953         3 DL        2331 N965DL  LGA    DTW         93      502    17     45
12  2013     1     1     2053           2055        -2     2254           2250         4 MQ        4573 N730MQ  LGA    DTW        102      502    20     55
# ... with 1 more variable: time_hour

# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'")
Out[490]: 
     year  month  day  dep_time  ...  distance  hour  minute             time_hour
21   2013      1    1     602.0  ...       502     6       5  2013-01-01T11:00:00Z
66   2013      1    1     659.0  ...       502     7       5  2013-01-01T12:00:00Z
163  2013      1    1     857.0  ...       502     9       5  2013-01-01T14:00:00Z
166  2013      1    1     858.0  ...       502     9       0  2013-01-01T14:00:00Z
251  2013      1    1    1048.0  ...       502    10      50  2013-01-01T15:00:00Z
267  2013      1    1    1113.0  ...       502    11      15  2013-01-01T16:00:00Z
324  2013      1    1    1231.0  ...       502    12      38  2013-01-01T17:00:00Z
355  2013      1    1    1305.0  ...       502    13      15  2013-01-01T18:00:00Z
419  2013      1    1    1440.0  ...       502    14      40  2013-01-01T19:00:00Z
558  2013      1    1    1640.0  ...       502    16      40  2013-01-01T21:00:00Z
612  2013      1    1    1739.0  ...       502    17      45  2013-01-01T22:00:00Z
782  2013      1    1    2053.0  ...       502    20      55  2013-01-02T01:00:00Z

[12 rows x 19 columns]
...
[44 rows x 19 columns]

python的plydata包目前无法使用,作为多个条件中的分割，但是可以使用& and代替

arrange()

arrange()可以通过一个或多个字段进行排序

对所有1月1日从LGA机场起飞至DTW机场的航班，按照延误时间排序，延误时间相同的话按照起飞时间逆序排序

# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW') %>% arrange(arr_delay, dep_time) 
# A tibble: 12 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                 
 1  2013     1     1      857            905        -8     1107           1120       -13 DL         181 N321NB  LGA    DTW        110      502     9      5
 2  2013     1     1      858            900        -2     1102           1110        -8 MQ        4478 N737MQ  LGA    DTW        103      502     9      0
 3  2013     1     1      659            705        -6      907            913        -6 DL         831 N998DL  LGA    DTW        105      502     7      5
 4  2013     1     1     1113           1115        -2     1318           1315         3 DL        1031 N320NB  LGA    DTW        104      502    11     15
 5  2013     1     1     1231           1238        -7     1449           1446         3 DL        1131 N920DL  LGA    DTW        108      502    12     38
 6  2013     1     1     1305           1315       -10     1523           1520         3 MQ        4564 N725MQ  LGA    DTW        102      502    13     15
 7  2013     1     1     1739           1745        -6     1956           1953         3 DL        2331 N965DL  LGA    DTW         93      502    17     45
 8  2013     1     1     2053           2055        -2     2254           2250         4 MQ        4573 N730MQ  LGA    DTW        102      502    20     55
 9  2013     1     1     1640           1640         0     1859           1850         9 MQ        4540 N723MQ  LGA    DTW        106      502    16     40
10  2013     1     1     1048           1050        -2     1302           1250        12 MQ        4589 N537MQ  LGA    DTW        112      502    10     50
11  2013     1     1     1440           1440         0     1658           1643        15 DL        1231 N926DL  LGA    DTW         94      502    14     40
12  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6      5
# ... with 1 more variable: time_hour

# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'") >> arrange("arr_delay", "-dep_time") 
Out[]: 
     year  month  day  dep_time  ...  distance  hour  minute             time_hour
163  2013      1    1     857.0  ...       502     9       5  2013-01-01T14:00:00Z
166  2013      1    1     858.0  ...       502     9       0  2013-01-01T14:00:00Z
66   2013      1    1     659.0  ...       502     7       5  2013-01-01T12:00:00Z
612  2013      1    1    1739.0  ...       502    17      45  2013-01-01T22:00:00Z
355  2013      1    1    1305.0  ...       502    13      15  2013-01-01T18:00:00Z
324  2013      1    1    1231.0  ...       502    12      38  2013-01-01T17:00:00Z
267  2013      1    1    1113.0  ...       502    11      15  2013-01-01T16:00:00Z
782  2013      1    1    2053.0  ...       502    20      55  2013-01-02T01:00:00Z
558  2013      1    1    1640.0  ...       502    16      40  2013-01-01T21:00:00Z
251  2013      1    1    1048.0  ...       502    10      50  2013-01-01T15:00:00Z
419  2013      1    1    1440.0  ...       502    14      40  2013-01-01T19:00:00Z
21   2013      1    1     602.0  ...       502     6       5  2013-01-01T11:00:00Z

[12 rows x 19 columns]

slice()与slice_rows()

slice()与slice_rows()可以通过行号来选取行

# in R
> flights %>% slice(1:5)
# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
# ... with 1 more variable: time_hour

# in Python
> flights %>% slice(1:5)
# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
                                                                
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
# ... with 1 more variable: time_hour

python中的计数项是从0开始，R中是从1开始
与dplyr中slice_head()和slice_tail()对应的函数是head()和tai()

select()

select()提供选取某列的功能，基本等同于sql中的select

# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance)
# A tibble: 5 x 3
  origin dest  distance
        
1 EWR    IAH       1400
2 LGA    IAH       1416
3 JFK    MIA       1089
4 JFK    BQN       1576
5 LGA    ATL        762

# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance')
Out[509]: 
  origin dest  distance
0    EWR  IAH      1400
1    LGA  IAH      1416
2    JFK  MIA      1089
3    JFK  BQN      1576
4    LGA  ATL       762

添加end_with/endswith可以截取以特定字符结尾的列

# in R
> flights %>% slice(1:5) %>% select(ends_with('time'))
# A tibble: 5 x 5
  dep_time sched_dep_time arr_time sched_arr_time air_time
                                 
1      517            515      830            819      227
2      533            529      850            830      227
3      542            540      923            850      160
4      544            545     1004           1022      183
5      554            600      812            837      116

# in Python
flights >> slice_rows(0, 5) >> select(endswith = 'time')
Out[511]: 
   dep_time  sched_dep_time  arr_time  sched_arr_time  air_time
0     517.0             515     830.0             819     227.0
1     533.0             529     850.0             830     227.0
2     542.0             540     923.0             850     160.0
3     544.0             545    1004.0            1022     183.0
4     554.0             600     812.0             837     116.0

mutate()

mutate()可以通过运算增加新列

# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance) %>% mutate(distance_in_km = distance/0.6213712 )
# A tibble: 5 x 4
  origin dest  distance distance_in_km
                  
1 EWR    IAH       1400          2253.
2 LGA    IAH       1416          2279.
3 JFK    MIA       1089          1753.
4 JFK    BQN       1576          2536.
5 LGA    ATL        762          1226.

# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'distance/0.6213712')
Out[]: 
  origin dest  distance  distance_in_km
0    EWR  IAH      1400     2253.081572
1    LGA  IAH      1416     2278.831076
2    JFK  MIA      1089     1752.575594
3    JFK  BQN      1576     2536.326112
4    LGA  ATL       762     1226.320113

plydata包同样支持使用自定义函数、以及Numpy、Pandas函数来创建新列

miles_to_km = lambda x: np.ceil(x/0.6213712)
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'miles_to_km(distance)')
Out[519]: 
  origin dest  distance  distance_in_km
0    EWR  IAH      1400          2254.0
1    LGA  IAH      1416          2279.0
2    JFK  MIA      1089          1753.0
3    JFK  BQN      1576          2537.0
4    LGA  ATL       762          1227.0

summarise() 和 group_by()

summarise()和group_by()可以实现分组聚合

选取1月1日的各机场起飞的航班数

# in R
> flights %>% filter(month == 1 & day == 1) %>% group_by(origin) %>% summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
  origin count
    
1 EWR      305
2 JFK      297
3 LGA      240

# in Python
flights >> query("month == 1 & day == 1") >> group_by('origin') >> summarise(count = 'n()')
Out[]: 
  origin  count
0    EWR    305
1    LGA    240
2    JFK    297

n() 返回值为该类的长度

总结

plydata的完成度目前来说已经很高了，大部分dplyr的功能都被完美移植了过来，并且与另外一个R移植包plotnine(ggplot2移植)很高的联动，应该会是笔者日后Python常用的包之一。
在这里感谢作者Hassan Kibirige，下篇文章会讲解他的另外一个包，也就是刚才提到大名鼎鼎的ggplot2在python中的移植版 - plotnine。

reference
Hassan Kibirige github主页 - https://github.com/has2k1
plydata github主页 - https://github.com/has2k1/plydata
plydata完整文档 - https://plydata.readthedocs.io/en/latest/index.html
dplyr github主页 - https://github.com/tidyverse/dplyr