import pandas as pd
left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
'lv': [1, 2, 3, 4],
's': ['a', 'b', 'c', 'd']})
right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
'rv': [1, 2, 3]})
pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
# 当merge没有指定链接键时,默认从left和right列的交集将被推断为连接键
Out[2]:
k lv s rv
0 K0 1.0 a NaN
1 K1 1.0 a 1.0
2 K2 1.0 a 2.0
3 K4 1.0 a 3.0
4 K1 2.0 b 1.0
5 K2 2.0 b 2.0
6 K4 2.0 b 3.0
7 K1 3.0 c 1.0
8 K2 3.0 c 2.0
9 K4 3.0 c 3.0
10 K1 NaN d 1.0
11 K2 4.0 d 2.0
12 K4 4.0 d 3.0
trades = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.038',
'20160525 13:30:00.048',
'20160525 13:30:00.048',
'20160525 13:30:00.048']),
'ticker': ['MSFT', 'MSFT',
'GOOG', 'GOOG', 'AAPL'],
'price': [51.95, 51.95,
720.77, 720.92, 98.00],
'quantity': [75, 155,
100, 100, 100]},
columns=['time', 'ticker', 'price', 'quantity'])
quotes = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.023',
'20160525 13:30:00.030',
'20160525 13:30:00.041',
'20160525 13:30:00.048',
'20160525 13:30:00.049',
'20160525 13:30:00.072',
'20160525 13:30:00.075']),
'ticker': ['GOOG', 'MSFT', 'MSFT',
'MSFT', 'GOOG', 'AAPL', 'GOOG',
'MSFT'],
'bid': [720.50, 51.95, 51.97, 51.99,
720.50, 97.99, 720.50, 52.01],
'ask': [720.93, 51.96, 51.98, 52.00,
720.93, 98.01, 720.88, 52.03]},
columns=['time', 'ticker', 'bid', 'ask'])
trades
Out[4]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
quotes
Out[5]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
下面Oder的顺序是按照ticker排序,连接键是time:
pd.merge_asof(trades, quotes,
on='time',
by='ticker')
Out[6]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
可以看出相当于left_join,只根据left的key进行merge。值得注意的是,right对应的值的time最大不能超过left的time,比如left中 2016-05-25 13:30:00.038 MSFT,那么拼接的right对应的时间最大是 2016-05-25 13:30:00.030 MSFT 51.97 51.98。APPL在right中就没有对应的值。
pd.merge_asof(trades, quotes,on='time',by='ticker',tolerance=pd.Timedelta('2ms'))
Out[7]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
我们仅在报价时间和交易时间之间的10ms内,我们按时排除完全匹配。 请注意,虽然我们排除了(报价quote)完全匹配,但先前的quote会传播到该时间点。
pd.merge_asof(trades, quotes,
on='time',
by='ticker',
tolerance=pd.Timedelta('10ms'),
allow_exact_matches=False)
Out[8]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN