Provides rolling (sliding) window calculations, which also work on time series data indexed by dates and times.
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')
Parameters:
Rolling sum with a window size of 2 observations
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'B':[0,1,2,np.nan,4]})
>>> df
B
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
Rolling sum with a window span of 2 seconds
>>> df_time = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
...                        index=[
...                            pd.Timestamp('20130101 09:00:00'),
...                            pd.Timestamp('20130101 09:00:02'),
...                            pd.Timestamp('20130101 09:00:03'),
...                            pd.Timestamp('20130101 09:00:05'),
...                            pd.Timestamp('20130101 09:00:06')])
>>> df_time
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
>>> df_time.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
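The NaN at 09:00:05 above comes from which end of the time interval is included. As a minimal sketch (rebuilding the same data so it runs on its own), the `closed` parameter from the signature can be used to compare the default right-closed interval with one closed on both sides:

```python
import numpy as np
import pandas as pd

df_time = pd.DataFrame(
    {'B': [0, 1, 2, np.nan, 4]},
    index=pd.to_datetime([
        '2013-01-01 09:00:00',
        '2013-01-01 09:00:02',
        '2013-01-01 09:00:03',
        '2013-01-01 09:00:05',
        '2013-01-01 09:00:06',
    ]),
)

# Default for offset windows: closed='right', i.e. the interval (t - 2s, t].
right_closed = df_time.rolling('2s').sum()

# closed='both' uses [t - 2s, t], so an observation exactly 2s old is included.
both_closed = df_time.rolling('2s', closed='both').sum()

print(right_closed)
print(both_closed)
```

With `closed='both'`, the row at 09:00:05 also sees the observation at 09:00:03, so its window contains one valid value instead of none.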
Rolling sum over a forward-looking window of 2 observations (rows i and i+1)
# Set up a forward-looking window
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(window=indexer,min_periods=1).sum()
B
0 1.0
1 3.0
2 2.0
3 4.0
4 4.0
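One way to convince yourself what the forward-looking indexer does is to compare it with an ordinary backward-looking window applied to the reversed data. This is only an illustrative check, not how pandas implements the indexer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
forward = df.rolling(window=indexer, min_periods=1).sum()

# Reverse the rows, apply a normal backward-looking window, then reverse back.
reversed_trick = df[::-1].rolling(2, min_periods=1).sum()[::-1]

print(forward)
print(reversed_trick)
```

Both frames should hold the same values, because looking forward from row i is the same as looking backward from row i in the reversed order.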
Rolling sum with a window length of 2 observations, requiring at least 1 valid observation to produce a value
>>> df.rolling(2,min_periods=1).sum()
B
0 0.0
1 1.0
2 3.0
3 2.0
4 4.0
Rolling sum with the result assigned to the center of the window index
>>> df.rolling(3, min_periods=1, center=True).sum()
B
0 1.0
1 3.0
2 3.0
3 6.0
4 4.0
>>> df.rolling(3, min_periods=1, center=False).sum()
B
0 0.0
1 1.0
2 3.0
3 3.0
4 6.0
Rolling sum with a Gaussian window (win_type requires SciPy)
>>> df.rolling(2,win_type='gaussian').sum(std=3)
B
0 NaN
1 0.986207
2 2.958621
3 NaN
4 NaN
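The value 0.986207 above can be reproduced by hand. The sketch below assumes the weights follow the standard Gaussian window formula w[n] = exp(-0.5 * (n / std)^2) with n centered on zero, and applies them through `rolling(...).apply` so that SciPy is not needed. The names `weights` and `manual` are only for illustration:

```python
import numpy as np
import pandas as pd

window, std = 2, 3

# Gaussian weights with n centered on zero: here n = [-0.5, 0.5].
n = np.arange(window) - (window - 1) / 2
weights = np.exp(-0.5 * (n / std) ** 2)

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})

# Weighted rolling sum by hand. Windows with fewer than `window` valid
# observations (the first row and those touching the NaN) stay NaN.
manual = df['B'].rolling(window).apply(lambda x: np.sum(weights * x), raw=True)

print(weights)
print(manual)
```

Both weights come out at roughly 0.986207, so the weighted sum of the window [0, 1] is about 0.986207 and that of [1, 2] about 2.958621.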
Each window spans from the current observation back over the window length
>>> import pandas as pd
>>> s = pd.Series(range(5))
>>> s
0 0
1 1
2 2
3 3
4 4
dtype: int64
# 5 windows, one per observation
>>> for window in s.rolling(window=2):
...     print(window)
0 0
dtype: int64
0 0
1 1
dtype: int64
1 1
2 2
dtype: int64
2 2
3 3
dtype: int64
3 3
4 4
dtype: int64
pandas supports 4 types of windowing operations: rolling, weighted, expanding, and exponentially weighted
>>> s = pd.Series(range(5),index = pd.date_range('2020-01-01',periods=5,freq='1D'))
>>> s
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
Freq: D, dtype: int64
>>> s.rolling(window='2D').sum()
2020-01-01 0.0
2020-01-02 1.0
2020-01-03 3.0
2020-01-04 5.0
2020-01-05 7.0
Freq: D, dtype: float64
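To make the different kinds of windowing operations more concrete, here is a small side-by-side sketch on one series. It covers rolling, expanding, and exponentially weighted windows; the weighted kind is left out here because `win_type` needs SciPy. The choice of `alpha=0.5` is arbitrary:

```python
import pandas as pd

s = pd.Series(range(5))

# Fixed-size sliding window over the last 2 observations.
rolling_sum = s.rolling(window=2).sum()

# Expanding window: every observation up to and including the current one.
expanding_sum = s.expanding().sum()

# Exponentially weighted window: older observations get geometrically less weight.
ewm_mean = s.ewm(alpha=0.5).mean()

print(pd.DataFrame({'rolling': rolling_sum,
                    'expanding': expanding_sum,
                    'ewm': ewm_mean}))
```

The expanding sum is a cumulative sum, while the rolling sum forgets everything older than the window.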
Some windowing operations can also be applied after grouping the data
>>> df = pd.DataFrame({'A':['a', 'b', 'a', 'b', 'a'],'B':range(5)})
>>> df
A B
0 a 0
1 b 1
2 a 2
3 b 3
4 a 4
>>> df.groupby('A').expanding().sum()
       B
A
a 0  0.0
  2  2.0
  4  6.0
b 1  1.0
  3  4.0
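The same pattern should work with a fixed-size window. A minimal sketch, selecting column `B` explicitly so the result does not depend on how a given pandas version treats the grouping column:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})

# Each group gets its own windows; rows of group 'b' never enter a window of 'a'.
grouped = df.groupby('A')['B'].rolling(2).sum()

print(grouped)
```

The first row of each group is NaN because a window of 2 observations cannot be filled within that group yet.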
>>> times = ['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-29']
>>> s = pd.Series(range(5),index = pd.DatetimeIndex(times))
>>> s
2020-01-01 0
2020-01-03 1
2020-01-04 2
2020-01-05 3
2020-01-29 4
dtype: int64
# Window of 2 observations
>>> s.rolling(2).sum()
2020-01-01 NaN
2020-01-03 1.0
2020-01-04 3.0
2020-01-05 5.0
2020-01-29 7.0
dtype: float64
# Window of 2 days
>>> s.rolling('2D').sum()
2020-01-01 0.0
2020-01-03 1.0
2020-01-04 3.0
2020-01-05 5.0
2020-01-29 4.0
dtype: float64
By default the result is labeled with the last index of the window; center=True labels it with the middle index instead
>>> s = pd.Series(range(10))
>>> s.rolling(window=5).mean()
0 NaN
1 NaN
2 NaN
3 NaN
4 2.0
5 3.0
6 4.0
7 5.0
8 6.0
9 7.0
dtype: float64
>>> s.rolling(window=5, center=True).mean()
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 NaN
9 NaN
dtype: float64
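For an odd window size, centering can be read as the right-labeled result moved back by (window - 1) // 2 positions. The following check is a sketch of that relationship for this particular example, not a statement about how pandas computes it internally:

```python
import pandas as pd

s = pd.Series(range(10))

centered = s.rolling(window=5, center=True).mean()

# Right-labeled result, shifted back by (5 - 1) // 2 = 2 positions.
shifted = s.rolling(window=5).mean().shift(-2)

print(pd.DataFrame({'centered': centered, 'shifted': shifted}))
```

The two columns should agree row by row, including the NaN values at both ends.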
Applying a custom function to each window
>>> import numpy as np
>>> def mad(x):
...     return np.fabs(x - x.mean()).mean()
...
>>> s = pd.Series(range(10))
>>> s.rolling(window=4).apply(mad, raw=True)
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
dtype: float64
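The `raw` argument controls what the custom function receives: with `raw=True` each window is passed as a NumPy array, with `raw=False` as a Series that keeps its index. A small sketch showing that `mad` gives the same result either way, plus a hand check of the first full window:

```python
import numpy as np
import pandas as pd

def mad(x):
    # Mean absolute deviation; works for both ndarray and Series input.
    return np.fabs(x - x.mean()).mean()

s = pd.Series(range(10))

as_array = s.rolling(window=4).apply(mad, raw=True)    # x is a NumPy array
as_series = s.rolling(window=4).apply(mad, raw=False)  # x is a Series

# First full window by hand: 0, 1, 2, 3 has mean 1.5 and
# absolute deviations 1.5, 0.5, 0.5, 1.5.
first_window = np.array([0.0, 1.0, 2.0, 3.0])
print(mad(first_window))
print(as_array.equals(as_series))
```

`raw=True` is usually the faster choice when the function only needs the values and not the index labels.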
Weighting the values inside each window
>>> s = pd.Series(range(10))
>>> s.rolling(window=5, win_type="gaussian").mean(std=0.1)
0 NaN
1 NaN
2 NaN
3 NaN
4 2.0
5 3.0
6 4.0
7 5.0
8 6.0
9 7.0
dtype: float64
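On evenly spaced data the weighted mean above looks identical to a plain mean, which hides what the weights do. The sketch below uses hypothetical non-linear data and hand-built weights (assuming the standard Gaussian window formula, and that the weighted mean divides by the sum of the weights) to show that with `std=0.1` almost all weight falls on the middle observation of each window:

```python
import numpy as np
import pandas as pd

window, std = 5, 0.1

# Gaussian weights with n = [-2, -1, 0, 1, 2]; effectively [0, 0, 1, 0, 0].
n = np.arange(window) - (window - 1) / 2
weights = np.exp(-0.5 * (n / std) ** 2)

# Hypothetical non-linear data, so plain and weighted means differ.
s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0])

weighted = s.rolling(window).apply(
    lambda x: np.average(x, weights=weights), raw=True)
plain = s.rolling(window).mean()

print(pd.DataFrame({'weighted': weighted, 'plain': plain}))
```

A larger `std` spreads the weight more evenly, and the weighted mean then moves toward the plain mean.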