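In any field (finance, economics, ecology, neuroscience, physics, and so on), time series data is an important form of structured data. Anything observed or measured at multiple points in time forms a time series. Many time series are fixed frequency, meaning that data points occur at regular intervals according to some rule (for example every 15 minutes, every 5 minutes, or once a month). Time series can also be irregular. What time series data means depends on the specific application.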
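The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main ones used here. The datetime.datetime type (often written simply as datetime) is the most widely used: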
from datetime import datetime
now = datetime.now()
print(now)
print(now.year)
print(now.month)
print(now.day)
"""
2019-03-25 13:07:37.236060
2019
3
25
"""
delta = datetime(2015,1,7) - datetime(2013,3,4,8,20)
print(delta)
print(delta.days)
print(delta.seconds)
"""
673 days, 15:40:00
673
56400
"""
from datetime import timedelta
start = datetime(2015,1,7)
s = start + timedelta(12)
print(s)
a = start - 2*timedelta(3)
print(a)
"""
2015-01-19 00:00:00
2015-01-01 00:00:00
"""
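Type | Description |
---|---|
date | Stores a calendar date (year, month, day) using the Gregorian calendar |
time | Stores a time of day as hours, minutes, seconds, and microseconds |
datetime | Stores both date and time |
timedelta | Represents the difference between two datetime values (as days, seconds, and microseconds) |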
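As a rough illustration of the types in the table, the sketch below builds a `date` and a `time`, joins them with `datetime.combine`, and inspects a `timedelta`. The variable names are made up for this example. One detail worth noticing is that `timedelta.seconds` holds only the seconds component (always less than one day), while `total_seconds()` returns the whole span.

```python
from datetime import date, time, datetime

d = date(2015, 1, 7)             # calendar date only
t = time(8, 20, 0, 500)          # hour, minute, second, microsecond
combined = datetime.combine(d, t)  # date + time -> datetime
print(combined)                  # 2015-01-07 08:20:00.000500

span = datetime(2015, 1, 7) - datetime(2013, 3, 4, 8, 20)
# days / seconds / microseconds are the three stored components
print(span.days, span.seconds, span.microseconds)  # 673 56400 0
# total_seconds() folds the days back in
print(span.total_seconds())      # 58203600.0
```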
datetime objects and pandas Timestamp objects can be formatted as strings with str or with the strftime method (which takes a format string):
stamp = datetime(2015,1,7)
print(str(stamp))
"""
2015-01-07 00:00:00
"""
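Code | Description |
---|---|
%Y | Four-digit year |
%y | Two-digit year |
%m | Two-digit month [01,12] |
%d | Two-digit day [01,31] |
%H | Hour (24-hour clock) [00,23] |
%I | Hour (12-hour clock) [01,12] |
%M | Two-digit minute [00,59] |
%S | Second [00,61] (seconds 60 and 61 account for leap seconds) |
%w | Weekday as an integer [0 (Sunday), 6] |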
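The format codes in the table can be tried out with `strftime`. The following is a minimal sketch; the variable name `moment` and the chosen format strings are only for illustration.

```python
from datetime import datetime

moment = datetime(2015, 1, 7, 13, 5, 9)

print(moment.strftime('%Y-%m-%d'))           # 2015-01-07
print(moment.strftime('%y/%m/%d %H:%M:%S'))  # 15/01/07 13:05:09
print(moment.strftime('%I:%M'))              # 01:05 (12-hour clock)
print(moment.strftime('%w'))                 # 3 (Wednesday; Sunday is 0)
```

The same codes are used in the other direction by `datetime.strptime`, which parses a string into a datetime.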
value = '2001-01-07'
v = datetime.strptime(value,'%Y-%m-%d')
print(v)
values = ['7/1/2011','9/3/2019']
vs = [datetime.strptime(value,'%m/%d/%Y') for value in values]
print(vs)
"""
2001-01-07 00:00:00
[datetime.datetime(2011, 7, 1, 0, 0), datetime.datetime(2019, 9, 3, 0, 0)]
"""
from dateutil.parser import parse
print(parse('2017/01/07'))
"""
2017-01-07 00:00:00
"""
print(parse('Jan 31 1999 10:20 PM'))
# Missing fields are taken from the current date, so this output depends on when the code is run
print(parse('12'))
"""
1999-01-31 22:20:00
2019-03-12 00:00:00
"""
print(parse('7/1/2015',dayfirst = True))
"""
2015-01-07 00:00:00
"""
import numpy as np
import pandas as pd
print(values)
print(pd.to_datetime(values))
"""
['7/1/2011', '9/3/2019']
DatetimeIndex(['2011-07-01', '2019-09-03'], dtype='datetime64[ns]', freq=None)
"""
idx = pd.to_datetime(values + [None])
print(idx)
print(pd.isnull(idx))
"""
DatetimeIndex(['2011-07-01', '2019-09-03', 'NaT'], dtype='datetime64[ns]', freq=None)
[False False True]
"""
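The most basic kind of time series in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects: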
from datetime import datetime
dates = [datetime(2011,1,2),datetime(2011,1,5),datetime(2011,1,7),datetime(2011,1,8),datetime(2011,1,10),datetime(2011,1,12)]
ts = pd.Series(np.random.randn(6),index=dates)
print(ts)
"""
2011-01-02 1.598238
2011-01-05 0.333948
2011-01-07 -0.347980
2011-01-08 -1.603143
2011-01-10 -1.080838
2011-01-12 -0.655313
dtype: float64
"""
print(type(ts))
print(ts.index)
"""
<class 'pandas.core.series.Series'>
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
"""
Note: there is no need to call a TimeSeries constructor explicitly. When a Series is created with a DatetimeIndex, pandas knows that the object is a time series.
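One way to see this is to check the index type directly. The sketch below uses its own small Series (`sample` and `sample_dates` are names invented for this example) so that it does not disturb `ts`.

```python
from datetime import datetime

import numpy as np
import pandas as pd

sample_dates = [datetime(2011, 1, 2), datetime(2011, 1, 5)]
sample = pd.Series(np.arange(2.0), index=sample_dates)

# A list of datetime objects is converted to a DatetimeIndex automatically
print(isinstance(sample.index, pd.DatetimeIndex))   # True
# Each element of the index is a pandas Timestamp
print(isinstance(sample.index[0], pd.Timestamp))    # True
# A Timestamp compares equal to the matching datetime
print(sample.index[0] == datetime(2011, 1, 2))      # True
```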
print(ts + ts[::2])
"""
2011-01-02 3.196476
2011-01-05 NaN
2011-01-07 -0.695960
2011-01-08 NaN
2011-01-10 -2.161677
2011-01-12 NaN
dtype: float64
"""
print(ts.index.dtype)
"""
datetime64[ns]
"""
stamp = ts.index[0]
print(stamp)
"""
2011-01-02 00:00:00
"""
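A time series behaves like any other Series when it comes to indexing and selecting data. (In older pandas versions TimeSeries was a subclass of Series; in current versions a time series is simply a Series with a DatetimeIndex.)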
stamp = ts.index[2]
print(ts[stamp])
"""
-0.34797996475619797
"""
# Note the format: month/day/year
print(ts['1/07/2011'])
"""
-0.34797996475619797
"""
longer_ts = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000',periods=1000))
print(longer_ts['2001'])
"""
2001-01-01 -0.354147
2001-01-02 -0.411002
...
2001-12-30 -0.777096
2001-12-31 -1.450070
Freq: D, Length: 365, dtype: float64
"""
print(longer_ts['2001-5'])
"""
2001-05-01 0.624091
2001-05-02 -0.702856
...
2001-05-30 -0.496207
2001-05-31 1.000048
Freq: D, Length: 31, dtype: float64
"""
print(ts[datetime(2011,1,7):])
"""
2011-01-07 -0.347980
2011-01-08 -1.603143
2011-01-10 -1.080838
2011-01-12 -0.655313
dtype: float64
"""
print(ts)
"""
2011-01-02 1.598238
2011-01-05 0.333948
2011-01-07 -0.347980
2011-01-08 -1.603143
2011-01-10 -1.080838
2011-01-12 -0.655313
dtype: float64
"""
print(ts['1/7/2011':'1/10/2011'])
"""
2011-01-07 -0.347980
2011-01-08 -1.603143
2011-01-10 -1.080838
dtype: float64
"""
print(ts.truncate(after = '1/7/2011'))
"""
2011-01-02 1.598238
2011-01-05 0.333948
2011-01-07 -0.347980
dtype: float64
"""
dates = pd.date_range('1/1/2001',periods=100,freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100,4),
index=dates,
columns=['A','B','C','D'])
# .ix has been removed from pandas; .loc does label-based selection
print(long_df.loc['4-2002'])
"""
A B C D
2002-04-03 0.382645 0.297246 0.259205 -0.355514
2002-04-10 2.171299 -0.234009 -2.047130 -0.568277
2002-04-17 0.436831 0.856942 0.485755 0.490481
2002-04-24 0.220224 -0.485198 -0.263579 -0.562568
"""
In some applications, multiple observations may fall on the same timestamp. Here is an example:
dates = pd.DatetimeIndex(['1/1/2001','1/2/2001','1/2/2001','1/2/2001',
'1/2/2001','1/3/2001'])
dup_ts = pd.Series(np.arange(6),index=dates)
print(dup_ts)
"""
2001-01-01 0
2001-01-02 1
2001-01-02 2
2001-01-02 3
2001-01-02 4
2001-01-03 5
dtype: int64
"""
print(dup_ts.index.is_unique)
"""
False
"""
print(dup_ts['1/3/2001'])  # not duplicated
"""
5
"""
print(dup_ts['1/2/2001'])  # duplicated
"""
2001-01-02 1
2001-01-02 2
2001-01-02 3
2001-01-02 4
dtype: int64
"""
grouped = dup_ts.groupby(level=0)
print(grouped.mean())
"""
2001-01-01 0.0
2001-01-02 2.5
2001-01-03 5.0
dtype: float64
"""
print(grouped.count())
"""
2001-01-01 1
2001-01-02 4
2001-01-03 1
dtype: int64
"""