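Datawhale Tips
Author: Geng Yuanhao, Datawhale member, East China Normal University
Time series data is a sequence of values of a single indicator recorded in chronological order. All values in one series must be measured on the same basis so that they are comparable. The values can be period totals (measured over an interval) or point-in-time readings (measured at an instant).
The goal of time series analysis is to identify the statistical properties and patterns of the in-sample series, build a time series model, and use it for out-of-sample forecasting.
Now let's learn how to process time series data with Pandas.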
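Contents
1. Creating time series
1.1. Four types of time variables
1.2. Creating time points
1.3. DateOffset objects
2. Indexing and attributes of time series
2.1. Index slicing
2.2. Subset indexing
2.3. Attributes of time points
3. Resampling
3.1. Basic operations on the resample object
3.2. Aggregation after sampling
3.3. Iterating over sampling groups
4. Window functions
4.1. Rolling
4.2. Expanding
5. Questions and exercises
5.1. Questions
5.2. Exercises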
import pandas as pd
import numpy as np
pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')
Timestamp('2020-01-01 00:00:00')
#pd.to_datetime('2020\\1\\1')
#pd.to_datetime('2020`1`1')
#pd.to_datetime('2020.1 1')
#pd.to_datetime('1 1.2020')
pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d')
pd.to_datetime('2020`1`1',format='%Y`%m`%d')
pd.to_datetime('2020.1 1',format='%Y.%m %d')
pd.to_datetime('1 1.2020',format='%d %m.%Y')
Timestamp('2020-01-01 00:00:00')
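To make the role of the format argument concrete, here is a minimal sketch. The sample strings and the names samples, parsed and bad are illustrative and not from the original. It shows that layouts which automatic inference may reject parse cleanly once the pattern is spelled out, and that errors='coerce' is one way to handle strings that cannot be parsed at all.

```python
import pandas as pd

# Map each oddly separated date string to the strptime-style pattern describing it.
samples = {
    "2020`1`1": "%Y`%m`%d",
    "2020.1 1": "%Y.%m %d",
    "1 1.2020": "%d %m.%Y",
}

# With an explicit format, each string is parsed without any guessing.
parsed = {text: pd.to_datetime(text, format=fmt) for text, fmt in samples.items()}
for text, value in parsed.items():
    print(repr(text), "->", value)

# errors="coerce" returns NaT for an unparseable string instead of raising.
bad = pd.to_datetime("not a date", errors="coerce")
print(bad)
```

All three strings resolve to the same Timestamp, 2020-01-01, which matches the output shown above.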
pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
type(pd.to_datetime(['2020/1/1','2020/1/2']))
pandas.core.indexes.datetimes.DatetimeIndex
df = pd.DataFrame({'year': [2020, 2020],'month': [1, 1], 'day': [1, 2]})
pd.to_datetime(df)
pd.to_datetime('2020/1/1 00:00:00.123456789')
Timestamp('2020-01-01 00:00:00.123456789')
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
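The bounds above follow from how the value is stored. As a rough sketch, assuming the default nanosecond resolution, a Timestamp is a signed 64-bit count of nanoseconds since 1970-01-01, so the representable range reaches roughly 292 years on either side of 1970. The names max_ns and years_each_side are illustrative.

```python
import pandas as pd

# The largest Timestamp corresponds to the largest signed 64-bit integer, in nanoseconds.
max_ns = pd.Timestamp.max.value
print(max_ns == 2**63 - 1)

# Convert nanoseconds -> seconds -> days -> years to see how far the range extends from 1970.
years_each_side = max_ns / 1e9 / 86400 / 365.25
print(round(years_each_side, 1))
```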
pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)
pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='D')
pd.date_range(end='2020/1/3',periods=3,freq='D')
pd.date_range(start='2020/1/1',periods=3,freq='min')
pd.date_range(start='2020/1/1',periods=3,freq='M')
pd.date_range(start='2020/1/1',periods=3,freq='BYS')
weekmask = 'Mon Tue Fri'
holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
# note the holidays
pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays)
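To see what the custom calendar keeps, the sketch below repeats the same call and prints the surviving dates with their weekday names. Only Mondays, Tuesdays and Fridays count as business days, and January 7 to 12 are excluded as holidays. The variable name days is illustrative.

```python
import pandas as pd

weekmask = "Mon Tue Fri"
# January 7 through January 12, 2020 are treated as holidays.
holidays = [pd.Timestamp(2020, 1, day) for day in range(7, 13)]

days = pd.bdate_range(
    start="2020-01-01",
    end="2020-01-15",
    freq="C",            # custom business day
    weekmask=weekmask,
    holidays=holidays,
)

# Print each remaining date together with its weekday name.
for day in days:
    print(day.date(), day.day_name())
```

Four dates remain: January 3 (Friday), January 6 (Monday), January 13 (Monday) and January 14 (Tuesday). January 7 and January 10 would otherwise qualify but fall inside the holiday list.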
ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 02:00:00+0300', tz='Europe/Helsinki')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00+0300', tz='Europe/Helsinki')
ts = pd.Timestamp('2020-3-29 01:00:00')
ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 01:00:00')
ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00')
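The difference between the two results comes down to elapsed time versus calendar time. A small sketch, reusing the Helsinki example where daylight saving time starts on 2020-03-29, measures how much absolute time each addition covers. The names by_timedelta and by_offset are illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2020-03-29 01:00:00", tz="Europe/Helsinki")

# Timedelta adds exactly 24 hours of absolute time.
by_timedelta = ts + pd.Timedelta(days=1)
# DateOffset moves to the same wall-clock time on the next calendar day.
by_offset = ts + pd.DateOffset(days=1)

print(by_timedelta, "elapsed:", by_timedelta - ts)
print(by_offset, "elapsed:", by_offset - ts)
```

Because one hour is skipped when the clocks move forward, the DateOffset result is only 23 hours after ts. Without a time zone there is no such jump, so both approaches agree.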
pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)
Timestamp('2019-12-18 00:20:00')
pd.Timestamp('2020-01-01') + pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)
Timestamp('2020-03-02 00:00:00')
pd.Series([i + pd.offsets.BYearBegin(3) for i in pd.date_range('20200101',periods=3,freq='Y')])
pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3)
pd.Series([i + pd.offsets.CDay(3,weekmask='Wed Fri',holidays='2020010')
           for i in pd.date_range('20200105',periods=3,freq='D')])
rng = pd.date_range('2020','2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
ts['2020-01-26']
-0.47982974619679947
ts['2020-01-26':'20200726'].head()
ts['2020-7'].head()
ts['2011-1':'20200726'].head()
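Since the series above holds random values, the selections are hard to check by eye. The following sketch uses a small weekly series with known values (an assumption for illustration, as are the names feb, closed_slice and open_start) and goes through .loc, which accepts the same partial date strings.

```python
import pandas as pd

# Eight consecutive Sundays starting 2020-01-05, with values 0..7.
rng = pd.date_range("2020-01-05", periods=8, freq="W")
ts = pd.Series(range(8), index=rng)

# A year-month string selects every entry in that month.
feb = ts.loc["2020-02"]
print(feb.tolist())

# Slicing by date strings includes both endpoints.
closed_slice = ts.loc["2020-01-12":"2020-01-26"]
print(closed_slice.tolist())

# A start date earlier than the index simply begins at the first entry.
open_start = ts.loc["2011-01":"2020-01-12"]
print(open_start.tolist())
```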
pd.Series(ts.index).dt.isocalendar().week.head()
pd.Series(ts.index).dt.day.head()
pd.Series(ts.index).dt.strftime('%Y-sep1-%m-sep2-%d').head()
pd.date_range('2020','2021', freq='W').month
pd.date_range('2020','2021', freq='W').weekday
df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='s', periods=1000),
                    columns=['A', 'B', 'C'])
r = df_r.resample('3min')
r
r.sum()
df_r2 = pd.DataFrame(np.random.randn(200, 3),index=pd.date_range('1/1/2020', freq='D', periods=200),
columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
r.sum()
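With random data it is hard to see what resample actually groups. A minimal sketch with six known values (illustrative data, not from the original) makes the binning visible: each 3-minute bin is labeled by its left edge and aggregates the rows that fall inside it.

```python
import pandas as pd

# One value per minute from 00:00 to 00:05.
idx = pd.date_range("2020-01-01", periods=6, freq="min")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Two bins: [00:00, 00:03) holds 1, 2, 3 and [00:03, 00:06) holds 4, 5, 6.
out = s.resample("3min").sum()
print(out)
```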
3.2. Aggregation after sampling
r = df_r.resample('3min')
r['A'].mean()
r['A'].agg(['sum', 'mean', 'std'])
Similarly, functions and lambda expressions can be used:
r.agg({'A': 'sum','B': lambda x: max(x)-min(x)})
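As a check on the per-column aggregation, here is a small sketch on known values. The frame df and its numbers are made up for illustration. Column A is summed per bin while column B gets a custom range (max minus min) through a lambda.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 30, 20, 60]},
    index=pd.date_range("2020-01-01", periods=4, freq="min"),
)

# Each 2-minute bin holds two rows; a dict maps each column to its own aggregation.
out = df.resample("2min").agg({"A": "sum", "B": lambda x: x.max() - x.min()})
print(out)
```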
3.3. Iterating over sampling groups
small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
, '2020-01-01 00:31:00','2020-01-01 01:00:00'
,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('h')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")
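Before looping over the groups, it can help to count how many rows each hourly bin holds. The sketch below rebuilds the same small series and calls size(). The name sizes is illustrative.

```python
import pandas as pd

small = pd.Series(
    range(6),
    index=pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:30:00", "2020-01-01 00:31:00",
        "2020-01-01 01:00:00", "2020-01-01 03:00:00", "2020-01-01 03:05:00",
    ]),
)

# size() reports one count per hourly bin between the first and last timestamp.
sizes = small.resample("h").size()
print(sizes)
```

The 02:00 bin appears with a count of 0: resampling produces a regular grid of bins, including hours for which the data has no rows.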
s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
s.head()
s.rolling(window=50)
Rolling [window=50,center=False,axis=0]
s.rolling(window=50).mean()
s.rolling(window=50,min_periods=3).mean().head()
s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()
s.rolling('15D').mean().head()
s.rolling('15D', closed='right').sum().head()
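The effect of window, min_periods and closed is easier to follow on a short series with known values. The sketch below uses five daily values (illustrative data and variable names) and compares a count-based window with a time-based one.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=pd.date_range("2020-01-01", periods=5))

# A count-based window of 3 yields NaN until 3 observations are available.
fixed = s.rolling(window=3).sum()

# A time-based window of 3 days needs only 1 observation by default.
by_time = s.rolling("3D").sum()

# closed="both" also includes the point exactly 3 days back, so up to 4 daily values are summed.
both = s.rolling("3D", closed="both").sum()

print(pd.DataFrame({"fixed": fixed, "by_time": by_time, "both": both}))
```

Setting min_periods=1 on the count-based window gives the same numbers as the default time-based window in this example, because the data is evenly spaced at one value per day.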
s.rolling(window=len(s),min_periods=1).sum().head()
s.expanding().sum().head()
s.expanding().apply(lambda x:sum(x)).head()
s.cumsum().head()
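The calls above are three ways of computing the same running total. A minimal sketch on four known values (illustrative data and variable names) shows that a rolling window as long as the series with min_periods=1, an expanding window, and cumsum agree.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# A rolling window spanning the whole series, allowed to start with a single value.
via_rolling = s.rolling(window=len(s), min_periods=1).sum()
# An expanding window grows from the first element to the current one.
via_expanding = s.expanding().sum()
# cumsum is the built-in shortcut for the same running total.
via_cumsum = s.cumsum()

print(pd.DataFrame({"rolling": via_rolling, "expanding": via_expanding, "cumsum": via_cumsum}))
```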
s.shift(2).head()
s.diff(3).head()
s.pct_change(3).head()
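shift, diff and pct_change are closely related. As a sketch on known values (illustrative data and variable names), diff(n) can be read as the series minus its n-step shift, and pct_change(n) as the ratio to that shifted series minus one.

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 15.0, 18.0])

# shift(2) moves every value two positions down, leaving NaN at the top.
lagged = s.shift(2)

# diff(3) is the difference from the value three positions earlier.
diffed = s.diff(3)
manual_diff = s - s.shift(3)

# pct_change(3) is the relative change from the value three positions earlier.
changed = s.pct_change(3)
manual_change = s / s.shift(3) - 1

print(pd.DataFrame({"shift2": lagged, "diff3": diffed, "pct3": changed}))
```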
5.1. Questions
5.2. Exercises
[Exercise 2] Continuing with the data from the previous exercise, complete the following tasks: