时间序列数据分析--Time Series--时间数据重采样

重采样(resample)

  • 将时间序列从一个频率转换到另一个频率的过程,需要使用聚合显示
  • pandas中的resample方法实现重采样
    • 产生Resample对象
    • 高频  -> 低频率
    • resample(freq).sum(),resample(freq).mean()............
    • #resample
      
      import pandas as pd
      import numpy as np
      
      date_rng = pd.date_range('20170101', periods=100, freq='D')
      ser_obj = pd.Series(range(len(date_rng)), index=date_rng)
      print(ser_obj.head(10))
      answer:
      2017-01-01    0
      2017-01-02    1
      2017-01-03    2
      2017-01-04    3
      2017-01-05    4
      2017-01-06    5
      2017-01-07    6
      2017-01-08    7
      2017-01-09    8
      2017-01-10    9
      Freq: D, dtype: int64
      
      # 统计每个月的数据总和
      resample_month_sum = ser_obj.resample('M').sum()
      # 统计每个月的数据平均
      resample_month_mean = ser_obj.resample('M').mean()
      
      print('按月求和:', resample_month_sum)
      print('按月求均值:', resample_month_mean)
      answer:
      按月求和: 2017-01-31     465
      2017-02-28    1246
      2017-03-31    2294
      2017-04-30     945
      Freq: M, dtype: int64
      按月求均值: 2017-01-31    15.0
      2017-02-28    44.5
      2017-03-31    74.0
      2017-04-30    94.5
      Freq: M, dtype: float64

       

降采样

  • 将数据聚合到规整的低频率
  • OHLC重采样,open,high,low,close
  • 使用groupby降采样
  • # 将数据聚合到5天的频率
    five_day_sum_sample = ser_obj.resample('5D').sum()
    five_day_mean_sample = ser_obj.resample('5D').mean()
    five_day_ohlc_sample = ser_obj.resample('5D').ohlc()
    
    print('降采样,sum..')
    print(five_day_sum_sample.head())
    降采样,sum..
    2017-01-01     10
    2017-01-06     35
    2017-01-11     60
    2017-01-16     85
    2017-01-21    110
    dtype: int64
    
    print('降采样,ohlc')
    print(five_day_ohlc_sample.head())
    降采样,ohlc
                open  high  low  close
    2017-01-01     0     4    0      4
    2017-01-06     5     9    5      9
    2017-01-11    10    14   10     14
    2017-01-16    15    19   15     19
    2017-01-21    20    24   20     24
    
    # 使用groupby降采样
    print(ser_obj.groupby(lambda x: x.month).sum())
    answer
    1     465
    2    1246
    3    2294
    4     945
    dtype: int32
    
    print(ser_obj.groupby(lambda x: x.weekday).sum())
    answer
    0    750
    1    665
    2    679
    3    693
    4    707
    5    721
    6    735
    dtype: int32

3.升采样

  • 数据从低频至高频,需要插值,否则为NaN
  • 常用的插值方法
    • ffill(limit),空值取前面limit个值填充
    • bfill(limit)
    • fillna('ffill‘)/  'bfill'
    • interpolate   插值算法
#升采样
df = pd.DataFrame(np.random.randn(5, 3),
                 index=pd.date_range('20170101', periods=5, freq='W-MON'),
                 columns=['S1', 'S2', 'S3'])
print(df)
answer
  S1        S2        S3
2017-01-02  0.087264 -0.047404 -0.754223
2017-01-09  1.148830  2.439266 -0.889873
2017-01-16  0.331767  0.918984  1.164783
2017-01-23 -0.582157  0.923737  1.938061
2017-01-30 -0.637087  0.143846 -1.500307

# 直接重采样会产生空值
print(df.resample('D').asfreq().head(10))
answer
S1        S2        S3
2017-01-02  0.003409 -0.939362  2.036451
2017-01-03       NaN       NaN       NaN
2017-01-04       NaN       NaN       NaN
2017-01-05       NaN       NaN       NaN
2017-01-06       NaN       NaN       NaN
2017-01-07       NaN       NaN       NaN
2017-01-08       NaN       NaN       NaN
2017-01-09  0.291274 -0.655332 -1.034041
2017-01-10       NaN       NaN       NaN
2017-01-11       NaN       NaN       NaN

#ffill
print(df.resample('D').ffill(2).head())
answer
S1        S2        S3
2017-01-02  0.003409 -0.939362  2.036451
2017-01-03  0.003409 -0.939362  2.036451
2017-01-04  0.003409 -0.939362  2.036451
2017-01-05       NaN       NaN       NaN
2017-01-06       NaN       NaN       NaN
2017-01-07       NaN       NaN       NaN
2017-01-08       NaN       NaN       NaN
2017-01-09  0.291274 -0.655332 -1.034041
2017-01-10  0.291274 -0.655332 -1.034041
2017-01-11  0.291274 -0.655332 -1.034041

print(df.resample('D').bfill())
print(df.resample('D').fillna('ffill'))
print(df.resample('D').interpolate('linear'))

 

你可能感兴趣的:(python)