Time Series Data in pandas

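Table of Contents

  • I. Creating Time Series
    • DateOffset objects
  • II. Resampling
    • 1. Basic operations on the resample object
    • 2. Aggregating after resampling
  • III. Window Functions
    • 1. Rolling
    • 2. Expanding
  • Questions and Exercises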


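I. Creating Time Series

bdate_range is a method similar to date_range. Its distinguishing feature is that, on top of the built-in business-day frequency, it also accepts the weekmask and holidays parameters.
Its freq has the special options 'C', 'CBM', and 'CBMS' ("C" for custom), which are meant to be combined with weekmask and holidays (pandas 2.2 and later spell 'CBM' as 'CBME').
For example, suppose we want to keep only Monday, Tuesday, and Thursday among the business days, and also exclude some holidays:

import pandas as pd

weekmask = 'Mon Tue Thu'
# holidays to exclude: 2020-01-07 through 2020-01-12
holidays = [pd.Timestamp('2020/1/%s' % i) for i in range(7, 13)]
print(pd.bdate_range(start='2020-1-1', end='2020-1-15', freq='C', weekmask=weekmask, holidays=holidays))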

'''
DatetimeIndex(['2020-01-02', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')
'''

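As the output shows, weekmask selects the weekdays to keep, while holidays lists the dates to exclude. A date listed in holidays is dropped even when it falls on one of the kept weekdays.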
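The same calendar can also be written as a CustomBusinessDay offset and handed to date_range. This is a minimal sketch, not the author's method; the names cbd and idx are introduced here only for illustration.

```python
import pandas as pd

# Keep Mon/Tue/Thu only and treat 2020-01-07 .. 2020-01-12 as holidays
holidays = list(pd.date_range("2020-01-07", "2020-01-12"))
cbd = pd.offsets.CustomBusinessDay(weekmask="Mon Tue Thu", holidays=holidays)

idx = pd.date_range(start="2020-01-01", end="2020-01-15", freq=cbd)
print(idx)

# 2020-01-07 is a Tuesday (allowed by weekmask) but listed as a holiday
print(pd.Timestamp("2020-01-07") in idx)
```

Building the offset once is convenient when the same calendar is reused for several ranges or for date arithmetic.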


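DateOffset objects

The difference between DateOffset and Timedelta
Timedelta is an absolute time difference: regardless of winter time or daylight saving time, adding or subtracting 1 day always means exactly 24 hours.
DateOffset is a relative time difference: whether a day lasts 23, 24, or 25 hours, adding or subtracting 1 day lands on the same wall-clock time as the starting moment.
For example, daylight saving time began on 29 March 2020. In the UK the clocks moved forward one hour at 01:00:00 local time to 02:00:00; in the Europe/Helsinki time zone used in the code below, the same change happened at 03:00:00 local time, which became 04:00:00.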

ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
print(ts + pd.Timedelta(days=1))
print(ts + pd.DateOffset(days=1))

'''
2020-03-30 02:00:00+03:00
2020-03-30 01:00:00+03:00
'''

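The Timedelta result advances by exactly 24 elapsed hours, so after crossing the clock change the wall-clock time reads 02:00 instead of 01:00.
The DateOffset result keeps the wall-clock time at 01:00 on the next day, which is only 23 elapsed hours later.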
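To make the difference concrete, the elapsed time of each shift can be measured by subtracting the original timestamp. A small sketch under the same Europe/Helsinki assumption; abs_shift and rel_shift are illustrative names.

```python
import pandas as pd

ts = pd.Timestamp("2020-03-29 01:00:00", tz="Europe/Helsinki")

abs_shift = ts + pd.Timedelta(days=1)    # exactly 24 elapsed hours
rel_shift = ts + pd.DateOffset(days=1)   # same wall-clock time on the next day

# Subtracting tz-aware timestamps gives the real elapsed time
print(abs_shift - ts)   # 1 days 00:00:00
print(rel_shift - ts)   # 0 days 23:00:00
```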


Adding or subtracting a span of time
The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds

print(pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2))

'''
2019-12-18 00:20:00
'''
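Calendar-aware units such as months are where DateOffset differs most visibly from a fixed Timedelta. The sketch below is an illustration added here, with made-up variable names, and is not part of the original article.

```python
import pandas as pd

end_of_jan = pd.Timestamp("2020-01-31")

# months=1 is calendar-aware: February has no 31st, so the day is clipped
plus_month = end_of_jan + pd.DateOffset(months=1)
print(plus_month)      # 2020-02-29 00:00:00 (2020 is a leap year)

# a fixed 30-day Timedelta ignores month lengths
plus_30_days = end_of_jan + pd.Timedelta(days=30)
print(plus_30_days)    # 2020-03-01 00:00:00
```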

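II. Resampling

1. Basic operations on the resample object

The resampling frequency is usually given as one of the offset alias strings mentioned above.

import numpy as np
import pandas as pd

# 1000 rows of random data at one-second intervals (values differ on every run)
# 's' is the seconds alias; older pandas versions spell it 'S'
df_r = pd.DataFrame(np.random.randn(1000, 3), index=pd.date_range('1/1/2020', freq='s', periods=1000),
                  columns=['A', 'B', 'C'])
print(df_r)

# group the rows into 3-minute bins; nothing is computed until an aggregation is called
r = df_r.resample('3min')
print(r.sum())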

'''
                             A          B          C
2020-01-01 00:00:00  21.871583  18.603125 -11.916000
2020-01-01 00:03:00  21.430003 -26.923329  -0.848451
2020-01-01 00:06:00  18.780852   3.067451  -2.614459
2020-01-01 00:09:00  -4.071431   1.847722  21.474951
2020-01-01 00:12:00  -3.105468  11.821253 -10.143737
2020-01-01 00:15:00  -1.277825   4.847213   0.936945
'''
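Because the data above is random, the binning is easier to check on a tiny deterministic series. A minimal sketch with illustrative names (s_min, r_min, totals):

```python
import numpy as np
import pandas as pd

# ten observations, one per minute, with values 0..9
s_min = pd.Series(np.arange(10), index=pd.date_range("2020-01-01", periods=10, freq="min"))

r_min = s_min.resample("3min")   # lazy: this only defines the bins
totals = r_min.sum()             # 0+1+2, 3+4+5, 6+7+8, 9
print(totals)
```

Each bin is labelled by its left edge (00:00, 00:03, 00:06, 00:09), which matches the index shown in the output above.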

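2. Aggregating after resampling

# '3min' is the current spelling; older pandas versions also accept '3T' and print "Freq: 3T"
r = df_r.resample('3min')
print(r['A'].mean())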

'''
2020-01-01 00:00:00   -0.041984
2020-01-01 00:03:00   -0.015345
2020-01-01 00:06:00   -0.002229
2020-01-01 00:09:00   -0.104800
2020-01-01 00:12:00    0.010224
2020-01-01 00:15:00   -0.018361
Freq: 3T, Name: A, dtype: float64
'''
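A resample object also supports agg, so several statistics, or a different statistic per column, can be computed in one call. The frame df_small below is a made-up example, not data from the article.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(
    {"A": np.arange(6.0), "B": np.arange(6.0) * 10},
    index=pd.date_range("2020-01-01", periods=6, freq="min"),
)
r_small = df_small.resample("3min")

# several statistics for one column
print(r_small["A"].agg(["sum", "mean"]))

# a different statistic for each column
per_col = r_small.agg({"A": "mean", "B": "max"})
print(per_col)
```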

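III. Window Functions

1. Rolling

The rolling method defines a window. Like a groupby object, it performs no computation by itself; it has to be combined with an aggregation function to produce a result.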

s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
print(s.head())
print(s.rolling(window=50))
print(s.rolling(window=50).mean().head())

'''
Rolling [window=50,center=False,axis=0]
2020-01-01   NaN
2020-01-02   NaN
2020-01-03   NaN
2020-01-04   NaN
2020-01-05   NaN
Freq: D, dtype: float64
'''
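The NaN values at the start appear because a window of 50 is not yet full. The same behaviour is easier to see with a window of 3 on five values; this is an illustrative sketch with made-up names.

```python
import pandas as pd

s_small = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

roller = s_small.rolling(window=3)   # a Rolling object, no numbers yet
rolled_mean = roller.mean()          # first two windows are incomplete
print(rolled_mean.tolist())          # [nan, nan, 2.0, 3.0, 4.0]
```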

The min_periods parameter is the minimum number of non-missing observations a window must contain to produce a value.

print(s.rolling(window=50,min_periods=3).mean().head())

'''
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -1.122268
2020-01-04   -0.924381
2020-01-05   -0.968945
Freq: D, dtype: float64
'''
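Note that min_periods counts non-missing values, not positions. A small sketch with a gap in the data (s_gap is a hypothetical series):

```python
import numpy as np
import pandas as pd

s_gap = pd.Series([1.0, np.nan, 3.0, 4.0])

# each 3-wide window needs at least 2 non-missing values to yield a result
gap_mean = s_gap.rolling(window=3, min_periods=2).mean()
print(gap_mean.tolist())   # [nan, nan, 2.0, 3.5]
```

The second position stays NaN because its window holds only one valid value, even though two positions are covered.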

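When aggregating with apply, just remember that the function receives the data of each window as a Series and must return a scalar. For example, the coefficient of variation can be computed as follows: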

print(s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head())

'''
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -1.257953
2020-01-04   -1.318450
2020-01-05   -1.094140
Freq: D, dtype: float64
'''
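For statistics that can be composed from built-in aggregations, the apply result can be cross-checked against them. The sketch below assumes a short hypothetical series s_cv and compares both routes.

```python
import numpy as np
import pandas as pd

s_cv = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

# apply: each window arrives as a Series, the function returns one scalar
cv_apply = s_cv.rolling(window=3).apply(lambda x: x.std() / x.mean())

# the same statistic from the built-in rolling aggregations
cv_builtin = s_cv.rolling(window=3).std() / s_cv.rolling(window=3).mean()

print(np.allclose(cv_apply, cv_builtin, equal_nan=True))
```

The built-in route avoids calling a Python function once per window, which tends to matter on long series.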

Time-based rolling

print(s.rolling('15D').mean().head())

'''
2020-01-01   -1.825596
2020-01-02   -1.032927
2020-01-03   -1.076893
2020-01-04   -0.892363
2020-01-05   -0.895791
Freq: D, dtype: float64
'''

The closed parameter accepts 'right' (the default), 'left', 'both', or 'neither', and determines which endpoints of the window are included.

print(s.rolling('15D', closed='right').sum().head())

'''
2020-01-01   -0.077588
2020-01-02   -0.245890
2020-01-03   -0.377603
2020-01-04   -1.024685
2020-01-05   -0.571623
Freq: D, dtype: float64
'''
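The effect of closed is easiest to see on a series of ones, where the rolling sum simply counts the points in each window. A minimal sketch with a hypothetical 2-day window:

```python
import pandas as pd

ones = pd.Series(1.0, index=pd.date_range("2020-01-01", periods=5, freq="D"))

# a '2D' window ending at time t covers (t - 2 days, t] by default
right = ones.rolling("2D", closed="right").sum()
# closed='both' also includes the point exactly 2 days before t
both = ones.rolling("2D", closed="both").sum()

print(right.tolist())   # [1.0, 2.0, 2.0, 2.0, 2.0]
print(both.tolist())    # [1.0, 2.0, 3.0, 3.0, 3.0]
```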

2. Expanding

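The plain expanding function is equivalent to rolling(window=len(s), min_periods=1); it performs a cumulative calculation over the series. (The two printed results below do not match, most likely because s was regenerated with fresh random values between the two runs; on the same s they are identical.)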

print(s.rolling(window=len(s),min_periods=1).sum().head())

'''
2020-01-01   -0.350793
2020-01-02   -1.364236
2020-01-03   -2.225010
2020-01-04   -1.150104
2020-01-05   -0.270128
Freq: D, dtype: float64
'''
print(s.expanding().sum().head())

'''
2020-01-01   -0.120957
2020-01-02   -0.449439
2020-01-03   -0.123427
2020-01-04    0.718118
2020-01-05    2.795019
Freq: D, dtype: float64
'''
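The equivalence can be checked on a fixed series so that both calls see the same data. An illustrative sketch with made-up names:

```python
import pandas as pd

s_exp = pd.Series([1.0, 2.0, 3.0, 4.0])

via_expanding = s_exp.expanding().sum()
via_rolling = s_exp.rolling(window=len(s_exp), min_periods=1).sum()

print(via_expanding.tolist())   # [1.0, 3.0, 6.0, 10.0]
print(via_rolling.tolist())     # [1.0, 3.0, 6.0, 10.0]
```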

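Questions and Exercises

[Question 1] How can you add frames to a date_range in bulk, or increase the timestamp density over a given period?
Adjust periods, or adjust freq.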

print(pd.date_range(start='2020/1/1',end='2020/1/10',periods=3))

print(pd.date_range(start='2020/1/1',end='2020/1/10',periods=7))

'''
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-02 12:00:00',
               '2020-01-04 00:00:00', '2020-01-05 12:00:00',
               '2020-01-07 00:00:00', '2020-01-08 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
'''
print(pd.date_range(start='2020/1/1',end='2020/1/10',freq='D'))

print(pd.date_range(start='2020/1/1',end='2020/1/10',freq='S'))

'''
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:00:01',
               '2020-01-01 00:00:02', '2020-01-01 00:00:03',
               '2020-01-01 00:00:04', '2020-01-01 00:00:05',
               '2020-01-01 00:00:06', '2020-01-01 00:00:07',
               '2020-01-01 00:00:08', '2020-01-01 00:00:09',
               ...
               '2020-01-09 23:59:51', '2020-01-09 23:59:52',
               '2020-01-09 23:59:53', '2020-01-09 23:59:54',
               '2020-01-09 23:59:55', '2020-01-09 23:59:56',
               '2020-01-09 23:59:57', '2020-01-09 23:59:58',
               '2020-01-09 23:59:59', '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', length=777601, freq='S')
'''
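If "increasing density over a given period" means making only part of an existing index denser, one possible reading is to build a finer range for that sub-period and take the union. This is a sketch under that assumption; the names daily, dense_part, and combined are hypothetical.

```python
import pandas as pd

daily = pd.date_range("2020-01-01", "2020-01-03", freq="D")

# denser 12-hour stamps for the sub-period 2020-01-02 .. 2020-01-03
dense_part = pd.date_range("2020-01-02", "2020-01-03", freq=pd.Timedelta(hours=12))

combined = daily.union(dense_part)   # sorted, duplicates removed
print(combined)
```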

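[Question 2] How can you increase the precision of Timestamps in bulk?

Timestamp precision goes far beyond the day level; it can go all the way down to nanoseconds (ns), for example: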

print(pd.to_datetime('2020/1/1 00:00:00.123456789'))

'''
2020-01-01 00:00:00.123456789
'''

[Question 3] For time points that fall outside the range that can be handled, is there really no way at all to deal with them?
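The article leaves this question open. One option that is often suggested, offered here as an assumption about what the question is after, is to represent such dates as Period objects, which store a span rather than a nanosecond timestamp. The names far_future and old_days are illustrative.

```python
import pandas as pd

# the representable bounds of nanosecond-resolution Timestamps
print(pd.Timestamp.min, pd.Timestamp.max)

# Period stores a span (here: one day) and is not limited to that range
far_future = pd.Period("2500-01-01", freq="D")
old_days = pd.period_range("1500-01-01", periods=3, freq="D")
print(far_future)
print(old_days)
```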
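[Question 4] Given a set of non-consecutive dates, how can you quickly find the dates that lie between its maximum and minimum but do not appear in the set?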

time = pd.date_range(start='2020/12/20', end='2020/12/31', periods=3)
print(time)
print(time.max(), time.min())
# build the full daily range between the smallest and largest date
time1 = pd.date_range(start=time.min(), end=time.max(), freq='D')
# keep the days that do not appear in the original dates
print(time1[~time1.isin(time)])

'''
DatetimeIndex(['2020-12-20 00:00:00', '2020-12-25 12:00:00',
               '2020-12-31 00:00:00'],
              dtype='datetime64[ns]', freq=None)
2020-12-31 00:00:00 2020-12-20 00:00:00
DatetimeIndex(['2020-12-21', '2020-12-22', '2020-12-23', '2020-12-24',
               '2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28',
               '2020-12-29', '2020-12-30'],
              dtype='datetime64[ns]', freq='D')
'''
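An alternative to the isin mask is a set difference between the full daily range and the given dates. The sketch below uses a hypothetical list of whole-day dates.

```python
import pandas as pd

dates = pd.to_datetime(["2020-12-20", "2020-12-22", "2020-12-23", "2020-12-26"])

# every calendar day between the smallest and largest date
full = pd.date_range(dates.min(), dates.max(), freq="D")

# days in the full range that are absent from the given dates
missing = full.difference(dates)
print(missing)
```

Both approaches compare exact timestamps, so a date that carries a time of day (such as 2020-12-25 12:00:00 in the output above) does not count as covering that calendar day unless it is normalized first.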
