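During data preprocessing, it is often necessary to expand data recorded once a year, month, or week into daily data. For low-frequency data such as monthly or yearly series, or for irregularly sampled data, **resample()** can convert it to daily frequency (the data must be indexed by dates), as follows:
df.resample('D').ffill().reset_index()  # D for calendar day frequency
The example below fills the monthly data 'month_data' forward to daily frequency using the most recent known value. This keeps the daily series from using future data, which is especially important in time series forecasting.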
import pandas as pd
from datetime import datetime
month_data = pd.read_excel('month_data.xlsx')
x_ticks = month_data['Date']
xs = [datetime.strptime(str(d), '%Y-%m-%d %H:%M:%S') for d in x_ticks]
data = month_data.iloc[:,1]
new_month_data = month_data.set_index('Date').resample('D').ffill().reset_index()
Here, 'month_data' holds one record at the end of each month, 140 rows in total.
After filling, 'new_month_data' is daily data, 4231 rows in total.
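Since 'month_data.xlsx' is not available here, the same forward-fill step can be sketched on a small made-up frame. The three month-end values and the column name `value` below are assumptions for illustration only; the point is that each day before the next month-end still carries the previous month's value.

```python
import pandas as pd

# Hypothetical month-end data (stand-in for month_data.xlsx)
month_data = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
    "value": [1.0, 2.0, 3.0],
})

# Upsample to calendar days and carry the last known value forward
daily = month_data.set_index("Date").resample("D").ffill().reset_index()

print(len(daily))   # number of daily rows between the first and last month-end
print(daily.head())

# A mid-February day only sees the January value, never the February one
feb_15 = daily.loc[daily["Date"] == pd.Timestamp("2020-02-15"), "value"].iloc[0]
print(feb_15)
```

Swapping `ffill()` for `bfill()` would instead assign each day the next month-end value, which is exactly the look-ahead the forward fill avoids.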
Commonly used frequency aliases for resample():
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T minutely frequency
S secondly frequency
L milliseconds
U microseconds
N nanoseconds
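To see what a few of these aliases produce, one option is to pass them to `pd.date_range` over a fixed window. This is only a sketch; the window (first quarter of 2024) is an arbitrary choice. Note that, as far as I know, pandas 2.2 and later deprecate several of the older spellings in the list above (for example 'M' in favor of 'ME', 'H' in favor of 'h', 'T' in favor of 'min'), so the sketch sticks to aliases that have not changed.

```python
import pandas as pd

# Arbitrary illustration window: first quarter of 2024
start, end = "2024-01-01", "2024-03-31"

counts = {}
for alias in ["D", "B", "W", "MS"]:
    idx = pd.date_range(start, end, freq=alias)
    counts[alias] = len(idx)
    # Show how many timestamps each alias generates and where they start/end
    print(alias, len(idx), idx[0].date(), idx[-1].date())
```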
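A multiple of a base frequency also works:
df.resample('5D')  # returns a Resampler object with the frequency set to 5 days
df.resample('5D').sum()  # returns a new aggregated object (a DataFrame for a DataFrame, a Series for a Series), summing within each 5-day bin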
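A minimal sketch of the 5-day aggregation on made-up data (ten consecutive daily values 1 to 10; the column name `x` is hypothetical):

```python
import pandas as pd

# Ten daily observations: 1, 2, ..., 10
s = pd.Series(range(1, 11), index=pd.date_range("2024-01-01", periods=10, freq="D"))

resampler = s.resample("5D")   # lazy Resampler object, nothing computed yet
five_day = resampler.sum()     # one row per 5-day bin
print(five_day)

# The same call on a DataFrame returns a DataFrame
df = s.to_frame("x")
five_day_df = df.resample("5D").sum()
print(type(five_day_df).__name__)
```

Other aggregations such as `.mean()`, `.max()`, or `.last()` can be used in place of `.sum()` in the same way.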
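To interpolate monthly values to daily values linearly instead of forward-filling them, use the function: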
df.interpolate(axis=1)
df.columns = pd.to_datetime(df.columns, format='%Y%m')
#by first and last values of columns
rng = pd.date_range(df.columns[0], df.columns[-1])
#alternatively min and max of columns
#rng = pd.date_range(df.columns.min(), df.columns.max())
df = df.reindex(rng, axis=1).interpolate(axis=1)
The original df (monthly columns):
201301 201302 201303 … 201709 201710
a 0.747711 0.793101 0.771819 … 0.818161 0.812522
b 0.776537 0.759745 0.733673 … 0.757496 0.765181
c 0.801699 0.847655 0.796586 … 0.784537 0.763551
d 0.797942 0.687899 0.729911 … 0.819887 0.772395
e 0.777472 0.799676 0.782947 … 0.804533 0.791759
f 0.780933 0.750774 0.781056 … 0.790846 0.773705
g 2.071699 2.261739 2.126915 … 1.891780 2.098914
The result after interpolation, print(df):
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 …
a 0.747711 0.749175 0.750639 0.752104 0.753568 …
b 0.776537 0.775995 0.775454 0.774912 0.774370 …
c 0.801699 0.803181 0.804664 0.806146 0.807629 …
d 0.797942 0.794392 0.790842 0.787293 0.783743 …
e 0.777472 0.778188 0.778905 0.779621 0.780337 …
f 0.780933 0.779960 0.778987 0.778014 0.777042 …
g 2.071699 2.077829 2.083960 2.090090 2.096220 …
[7 rows x 1735 columns]
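The interpolation steps above can be tried end to end on a small made-up frame. The two rows and their values are assumptions chosen so the result is easy to check by eye: row `a` rises by exactly one unit per day.

```python
import pandas as pd

# Hypothetical monthly data with YYYYMM column labels, one row per series
df = pd.DataFrame(
    [[0.0, 31.0, 59.0], [10.0, 10.0, 38.0]],
    index=["a", "b"],
    columns=["201301", "201302", "201303"],
)

# Parse the labels as dates (first day of each month)
df.columns = pd.to_datetime(df.columns, format="%Y%m")

# Build the full daily range, reindex the columns onto it (new days become NaN),
# then fill the NaN values linearly along each row
rng = pd.date_range(df.columns.min(), df.columns.max())
daily = df.reindex(rng, axis=1).interpolate(axis=1)

print(daily.shape)
print(daily.iloc[:, :5])
```

Unlike the forward fill shown earlier, linear interpolation uses the next month's value to compute each day, so it is better suited to smoothing historical data than to building features for forecasting.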