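A quick skim through Prophet, written down for later reference.
Installation: https://facebook.github.io/prophet/docs/installation.html
Note: the code below imports from fbprophet. Starting with version 1.0 the package is named prophet, so with a newer install use from prophet import Prophet instead.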
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
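1. Run a demo first
Usage: create a Prophet object, then call fit and predict.
Input data: a DataFrame with ds and y columns.
y: must be numeric; this is the value to forecast.
ds: datestamp, either a date (YYYY-MM-DD) or a timestamp (YYYY-MM-DD HH:MM:SS).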
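To make the two accepted ds formats concrete, here is a minimal standard-library sketch that parses both. The helper name parse_ds is made up for this illustration; Prophet does its own parsing through pandas.

```python
from datetime import datetime

def parse_ds(value):
    """Parse a ds string given as YYYY-MM-DD or YYYY-MM-DD HH:MM:SS."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unsupported ds format: {value!r}")

day = parse_ds("2007-12-10")              # date only
moment = parse_ds("2007-12-10 08:30:00")  # full timestamp
```

Anything that fails both formats raises a ValueError, which is a convenient early check before handing a column to the model.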
The examples below use the peyton_manning dataset.
1.1 Download the data:
!curl -o example_wp_log_peyton_manning.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv
1.2 Load the data
import pandas as pd
from fbprophet import Prophet
df = pd.read_csv('example_wp_log_peyton_manning.csv')
df.head()
| | ds | y |
|---|---|---|
| 0 | 2007-12-10 | 9.590761 |
| 1 | 2007-12-11 | 8.519590 |
| 2 | 2007-12-12 | 8.183677 |
| 3 | 2007-12-13 | 8.072467 |
| 4 | 2007-12-14 | 7.893572 |
df.ds
0 2007-12-10
1 2007-12-11
2 2007-12-12
3 2007-12-13
4 2007-12-14
...
2900 2016-01-16
2901 2016-01-17
2902 2016-01-18
2903 2016-01-19
2904 2016-01-20
Name: ds, Length: 2905, dtype: object
As shown, ds runs from 2007-12-10 to 2016-01-20.
1.3 Create a Prophet instance and fit it
m = Prophet()
m.fit(df)
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
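1.4 Extend the ds column
Use Prophet.make_future_dataframe to extend the ds column (the R API documentation is worth a look). The result contains both the historical dates and the new future dates.
Timestamps with hours, minutes, and seconds can be extended the same way.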
future = m.make_future_dataframe(periods=365)
future.ds
0 2007-12-10
1 2007-12-11
2 2007-12-12
3 2007-12-13
4 2007-12-14
...
3265 2017-01-15
3266 2017-01-16
3267 2017-01-17
3268 2017-01-18
3269 2017-01-19
Name: ds, Length: 3270, dtype: datetime64[ns]
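As shown, future.ds runs from 2007-12-10 to 2017-01-19: it contains the historical dates (2007-12-10 to 2016-01-20) and the newly added dates (2016-01-21 to 2017-01-19).
1.5 Predict
The prediction is a new DataFrame that contains the forecast yhat, the individual components, and the uncertainty intervals [xxx_lower, xxx_upper].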
forecast = m.predict(future)
forecast.tail()
Plot the forecast:
# Without assigning the result to a variable, the figure is drawn twice
fig1 = m.plot(forecast)
Look at the forecast components: trend, yearly seasonality, weekly seasonality, holidays, and so on.
fig2 = m.plot_components(forecast)
An interactive figure can be drawn with plotly:
from fbprophet.plot import plot_plotly
import plotly.offline as py
py.init_notebook_mode()
fig = plot_plotly(m, forecast) # This returns a plotly Figure
py.iplot(fig)
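2. Forecasting growth and decline
2.1 Forecasting growth
By default, Prophet uses a linear model for its forecast.
When forecasting growth there is usually a maximum achievable value (the carrying capacity); the minimum defaults to 0.
Prophet lets you specify a carrying capacity and forecast with a logistic growth trend model.
The example below uses the log of the page views (pv) of the Wikipedia page R (programming language):
2.1.1 Download the data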
!curl -o example_wp_log_R.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_R.csv
2.1.2 Read the data and specify the maximum capacity (the cap column)
df = pd.read_csv('example_wp_log_R.csv')
df['cap'] = 8.5
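Note: cap must be specified for every row. It does not have to be a constant; it can also be an increasing sequence (for example, when the achievable maximum itself grows over time).
2.1.3 Pass growth='logistic' to select logistic growth, then fit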
m = Prophet(growth='logistic')
m.fit(df)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
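2.1.4 Extend the ds column
Unlike before, the capacity must also be specified in the future DataFrame.
Here it is set to a constant, and log(pv) is forecast for the next three years.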
future = m.make_future_dataframe(periods=3*365)
future['cap'] = 8.5
fcst = m.predict(future)
fig = m.plot(fcst)
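To get a feel for what a capacity does to the trend, here is a minimal sketch of a saturating logistic curve. It is a simplification, not Prophet's implementation: the real trend also allows rate changes at changepoints and a time-varying cap. The function name and the parameters k (growth rate) and m (midpoint offset) are illustrative assumptions.

```python
import math

def logistic_trend(t, cap, k, m, floor=0.0):
    """Saturating growth: rises from floor towards cap, with its midpoint at t == m."""
    return floor + (cap - floor) / (1.0 + math.exp(-k * (t - m)))

cap = 8.5  # same constant capacity as the cap column above
# Evaluate the curve at t = 0, 10, 20, 30, 40 (k and m are made-up values).
values = [logistic_trend(t, cap, k=0.5, m=10.0) for t in range(0, 41, 10)]
```

The curve passes through half the capacity at t == m and flattens out just below cap, which is the saturation visible in the forecast plot.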
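2.2 Forecasting decline
The minimum defaults to 0 and can also be set separately through a floor column.
To use a logistic growth trend with a saturating minimum, a maximum (cap) must be specified as well.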
df['y'] = 10 - df['y']
df['cap'] = 6
df['floor'] = 1.5
future['cap'] = 6
future['floor'] = 1.5
m = Prophet(growth='logistic')
m.fit(df)
fcst = m.predict(future)
fig = m.plot(fcst)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Now try again without specifying the minimum:
# df was modified above, so reload the data
df = pd.read_csv('example_wp_log_R.csv')
# The maximum is 8.5
df['y'] = 8.5 - df['y']
df['cap'] = 4
#df['floor'] = 1.5
future['cap'] = 4
#future['floor'] = 1.5
m = Prophet(growth='logistic')
m.fit(df)
fcst = m.predict(future)
fig = m.plot(fcst)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
As the plot shows, the default minimum is 0.
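3. Trend changepoints
In the earlier examples, the real trends contain points where the trajectory changes abruptly.
By default Prophet detects these points automatically and lets the trend adapt. When the automatic adjustment goes wrong (overfitting, for example), it can also be tuned by hand.
3.1 Automatic changepoint detection
- Prophet first specifies a large number of potential changepoints at which the rate is allowed to change.
- It then puts a sparse prior on the magnitudes of the rate changes.
In effect there are many places where the rate could change, but as few of them as possible are used.
On the dataset example_wp_log_peyton_manning.csv, Prophet specifies 25 potential changepoints by default, placed uniformly in the first 80% of the time series.
They are the vertical lines below: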
df = pd.read_csv('example_wp_log_peyton_manning.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig = m.plot(forecast)
import matplotlib.pyplot as plt  # needed for plt.axvline below
for cp in m.changepoints:
    plt.axvline(cp, c='gray', ls='--', lw=2)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Because of the sparse prior, most of the changepoints go unused.
Here is the magnitude of the rate change at each changepoint:
deltas = m.params['delta'].mean(0)
fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas, facecolor='#0072B2', edgecolor='#0072B2')
ax.grid(True, which='major', c='gray', ls='-', lw=1, alpha=0.2)
ax.set_ylabel('Rate change')
ax.set_xlabel('Potential changepoint')
fig.tight_layout()
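The number of changepoints can be set with the n_changepoints argument, but it is better to tune this by adjusting the regularization.
Now look at the more significant changepoints: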
from fbprophet.plot import add_changepoints_to_plot
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)
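By default, changepoints are only inferred in the first 80% of the time series, which leaves enough runway for projecting the trend forward and helps prevent overfitting. The changepoint_range argument changes this range; for example, m = Prophet(changepoint_range=0.9) places changepoints in the first 90%.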
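As a rough picture of "25 potential changepoints spread uniformly over the first 80%", here is a sketch that computes evenly spaced row positions. It is my own simplification and is not guaranteed to match Prophet's internals; the function name is hypothetical.

```python
def potential_changepoint_indexes(n_rows, n_changepoints=25, changepoint_range=0.8):
    """Evenly spaced row indexes inside the first changepoint_range of the history."""
    hist_size = int(n_rows * changepoint_range)
    step = (hist_size - 1) / n_changepoints
    # Skip position 0: a rate change at the very first observation carries no information.
    return [round(i * step) for i in range(1, n_changepoints + 1)]

indexes = potential_changepoint_indexes(2905)  # 2905 rows in the Peyton Manning data
```

Raising changepoint_range to 0.9 in this sketch simply stretches the same 25 positions over more of the history.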
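3.2 Adjusting trend flexibility
When the trend overfits or underfits, the changepoint_prior_scale argument adjusts the strength of the sparse prior; the default is 0.05.
The larger the value, the more flexible the trend.
Increasing flexibility: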
m = Prophet(changepoint_prior_scale=0.9)
forecast = m.fit(df).predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Decreasing flexibility:
m = Prophet(changepoint_prior_scale=0.001)
forecast = m.fit(df).predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
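3.3 Specifying changepoint locations
Changepoint locations can also be given manually with the changepoints argument; the rate is then allowed to change only at these points.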
m = Prophet(changepoints=['2014-01-01'])
forecast = m.fit(df).predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
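The idea of a trend whose rate changes only at given changepoints can be sketched as a continuous piecewise-linear function. This is a toy version under my own naming (k for the base rate, m for the offset, deltas for the rate changes), not code taken from Prophet.

```python
def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Continuous trend whose slope changes by deltas[j] at changepoints[j]."""
    slope, offset = k, m
    for s, delta in zip(changepoints, deltas):
        if t >= s:
            slope += delta
            # Shift the offset so the two pieces agree at t == s.
            offset -= s * delta
    return slope * t + offset

# One changepoint at t = 5 where the slope drops from 1.0 to 0.25.
trend = [piecewise_linear_trend(t, k=1.0, m=0.0, changepoints=[5.0], deltas=[-0.75])
         for t in range(0, 11)]
```

With a sparse prior most deltas end up near zero, so the fitted trend bends at only a few of the candidate points.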
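4. Seasonality, holiday effects, and regressors
4.1 Modeling holidays and special events
To model holidays or other recurring events, create a DataFrame with holiday and ds columns.
It must include every occurrence of the event, both in the past and in the future. If a special day does not appear among the future dates being forecast, its effect is not included in the forecast.
The lower_window and upper_window columns extend a holiday to the days around it; for example, Singles' Day (November 11) could be extended to cover November 1 through November 20.
The prior_scale column sets a separate prior scale for each holiday.
The following creates a DataFrame with the dates of all of Peyton Manning's playoff appearances: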
playoffs = pd.DataFrame({
'holiday': 'playoff',
'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
'2010-01-24', '2010-02-07', '2011-01-08',
'2013-01-12', '2014-01-12', '2014-01-19',
'2014-02-02', '2015-01-11', '2016-01-17',
'2016-01-24', '2016-02-07']),
'lower_window': 0,
'upper_window': 1,
})
superbowls = pd.DataFrame({
'holiday': 'superbowl',
'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
'lower_window': 0,
'upper_window': 1,
})
holidays = pd.concat((playoffs, superbowls))
The superbowl dates also appear among the playoff dates, so on those days the superbowl effect is stacked on top of the playoff effect.
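How lower_window and upper_window widen a holiday can be sketched with plain dates. The helper below is hypothetical and only mimics the windowing idea: offsets are counted in days relative to the holiday, with negative values reaching back before it. The year 2020 in the Singles' Day example is an arbitrary choice.

```python
from datetime import date, timedelta

def expand_holiday(day, lower_window=0, upper_window=0):
    """Return every date from day + lower_window to day + upper_window, inclusive."""
    return [day + timedelta(days=offset)
            for offset in range(lower_window, upper_window + 1)]

# Same window as the playoffs/superbowls frames: the day itself plus the day after.
superbowl_days = expand_holiday(date(2016, 2, 7), lower_window=0, upper_window=1)
# Singles' Day stretched to November 1 through November 20.
singles_day = expand_holiday(date(2020, 11, 11), lower_window=-10, upper_window=9)
```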
Use the holidays DataFrame just created:
m = Prophet(holidays=holidays)
forecast = m.fit(df).predict(future)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
The holiday effects can be inspected in the forecast DataFrame:
forecast[(forecast['playoff'] + forecast['superbowl']).abs() > 0][
['ds', 'playoff', 'superbowl']][-10:]
In the components plot, the playoff dates show a spike and the superbowl dates show an even larger one:
fig = m.plot_components(forecast)
An individual holiday can be plotted on its own:
from fbprophet.plot import plot_forecast_component
plot_forecast_component(m, forecast, 'superbowl')
4.2 Built-in holidays
Built-in country holidays can be added with add_country_holidays.
The model's train_holiday_names attribute lists the holidays that were included:
m = Prophet(holidays=holidays)
m.add_country_holidays(country_name='CN')
m.fit(df)
m.train_holiday_names
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
0 playoff
1 superbowl
2 New Year's Day
3 Chinese New Year
4 Tomb-Sweeping Day
5 Labor Day
6 Dragon Boat Festival
7 Mid-Autumn Festival
8 National Day
dtype: object
The holidays for each country are provided by the holidays package.
Their date ranges can be updated with the script generate_holidays_file.py.
Another plot:
m = Prophet(holidays=holidays)
m.add_country_holidays(country_name='US')
m.fit(df)
forecast = m.predict(future)
fig = m.plot_components(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
4.3 Fourier order for seasonalities
Seasonalities are estimated with a partial Fourier sum.
This part is best quoted directly from the original documentation:
Seasonalities are estimated using a partial Fourier sum. See the paper for complete details, and this figure on Wikipedia for an illustration of how a partial Fourier sum can approximate an arbitrary periodic signal. The number of terms in the partial sum (the order) is a parameter that determines how quickly the seasonality can change.
Again using the Peyton Manning data as the example: the default Fourier order for yearly seasonality is 10, which gives this fit:
df = pd.read_csv('example_wp_log_peyton_manning.csv')
from fbprophet.plot import plot_yearly
m = Prophet().fit(df)
a = plot_yearly(m)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
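The default is fine in most cases, but when the seasonality has higher-frequency changes and is generally less smooth, the order can be increased.
Increasing it can also lead to overfitting: N Fourier terms correspond to 2N variables used for modeling the cycle.
Here it is raised to 20: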
from fbprophet.plot import plot_yearly
m = Prophet(yearly_seasonality=20).fit(df)
a = plot_yearly(m)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
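The "N Fourier terms, 2N variables" statement is easy to check with a small sketch that builds the sine and cosine features for one time point. The function name is made up, and the period of 365.25 days for the yearly cycle is an assumption of this example.

```python
import math

def fourier_features(t, period, order):
    """Return sin/cos pairs for harmonics 1..order at time t (same units as period)."""
    features = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features

yearly = fourier_features(t=100.0, period=365.25, order=10)  # 20 values
weekly = fourier_features(t=3.0, period=7.0, order=3)        # 6 values
```

Each extra harmonic adds one sine and one cosine column, so going from order 10 to order 20 doubles the number of coefficients the model has to fit.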
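4.4 Custom seasonalities
- When the time series is more than two cycles long, Prophet fits weekly and yearly seasonality by default.
- For a sub-daily time series, it also fits daily seasonality.
- Other seasonalities (hourly, monthly, quarterly, and so on) can be added with add_seasonality.
Arguments of add_seasonality:
- name: the name of the seasonality
- period: the length of the period, in days
- fourier_order: the Fourier order of the seasonality
- prior_scale: optional, covered below
By default the Fourier order is 3 for weekly seasonality and 10 for yearly seasonality.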
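The "more than two cycles" rule also explains the INFO lines about disabled seasonalities that appear throughout these notes. Below is a sketch of the rule as I understand it; the exact thresholds and the function name are assumptions, and the spans passed in are approximate.

```python
from datetime import timedelta

def auto_seasonalities(span, min_spacing):
    """Decide which built-in seasonalities to fit from the history's span and spacing."""
    return {
        "yearly": span >= timedelta(days=730),
        "weekly": span >= timedelta(days=14) and min_spacing < timedelta(days=7),
        "daily": span >= timedelta(days=2) and min_spacing < timedelta(days=1),
    }

# Roughly: daily data over 8 years, monthly data over 12 years, 5-minute data over 2 months.
daily_data = auto_seasonalities(timedelta(days=2963), timedelta(days=1))
monthly_data = auto_seasonalities(timedelta(days=4352), timedelta(days=28))
five_minute_data = auto_seasonalities(timedelta(days=60), timedelta(minutes=5))
```

Under these assumptions the three cases reproduce the messages seen in the logs: daily data loses daily seasonality, monthly data loses weekly and daily seasonality, and two months of 5-minute data loses yearly seasonality.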
Still on the Peyton Manning dataset, replace the weekly seasonality with a monthly one (period=30.5) and plot it:
m = Prophet(weekly_seasonality=False)
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)
forecast = m.fit(df).predict(future)
fig = m.plot_components(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
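4.5 Seasonalities that depend on other factors
Sometimes a seasonality depends on other factors: a weekly pattern may look different in summer than in the rest of the year, or a daily pattern may differ between weekends and weekdays.
These types of seasonalities can be modeled using conditional seasonalities.
In the Peyton Manning example above, the default weekly seasonality assumes the same weekly pattern all year round, but we would expect it to differ between the on-season (when there are games every weekend) and the off-season.
Conditional seasonalities let us build separate weekly seasonalities for the two.
First add boolean columns that indicate whether each date falls in the on-season or the off-season: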
def is_nfl_season(ds):
    date = pd.to_datetime(ds)
    # On-season runs from September through January
    return (date.month > 8 or date.month < 2)
df['on_season'] = df['ds'].apply(is_nfl_season)
df['off_season'] = ~df['ds'].apply(is_nfl_season)
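Then disable the built-in weekly seasonality and replace it with two weekly seasonalities, one for the on-season and one for the off-season.
Each seasonality is applied only to dates where its condition_name column is True.
The same columns must also be added to the future DataFrame used for prediction: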
m = Prophet(weekly_seasonality=False)
m.add_seasonality(name='weekly_on_season', period=7, fourier_order=3, condition_name='on_season')
m.add_seasonality(name='weekly_off_season', period=7, fourier_order=3, condition_name='off_season')
future['on_season'] = future['ds'].apply(is_nfl_season)
future['off_season'] = ~future['ds'].apply(is_nfl_season)
forecast = m.fit(df).predict(future)
fig = m.plot_components(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
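The plot shows that during the on-season, when games are played every weekend, there are large increases on Sunday and Monday, which are completely absent in the off-season.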
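4.6 Prior scale for holidays and seasonality
If the holidays are overfitting, set holidays_prior_scale to adjust their prior scale and smooth them.
The default is 10; reducing it dampens the holiday effects: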
m = Prophet(holidays=holidays, holidays_prior_scale=0.05).fit(df)
forecast = m.predict(future)
forecast[(forecast['playoff'] + forecast['superbowl']).abs() > 0][
['ds', 'playoff', 'superbowl']][-10:]
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
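Compared with before, the holiday effects are reduced, especially for superbowls, which have the fewest observations.
A prior_scale for the weekly seasonality can be set like this: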
m = Prophet()
m.add_seasonality(
name='weekly', period=7, fourier_order=3, prior_scale=0.1)
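4.7 Additional regressors
Additional regressors can be added to the linear part of the model with add_regressor.
The regressor values must be present in both the training and the prediction DataFrames.
Below, a regressor marking Sundays during the NFL season is added, and its effect is plotted: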
def nfl_sunday(ds):
    date = pd.to_datetime(ds)
    # weekday() == 6 is Sunday; the season runs from September through January
    if date.weekday() == 6 and (date.month > 8 or date.month < 2):
        return 1
    else:
        return 0
df['nfl_sunday'] = df['ds'].apply(nfl_sunday)
m = Prophet()
m.add_regressor('nfl_sunday')
m.fit(df)
future['nfl_sunday'] = future['ds'].apply(nfl_sunday)
forecast = m.predict(future)
fig = m.plot_components(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
The rest of this section is quoted from the original documentation:
NFL Sundays could also have been handled using the "holidays" interface described above, by creating a list of past and future NFL Sundays. The add_regressor function provides a more general interface for defining extra linear regressors, and in particular does not require that the regressor be a binary indicator. Another time series could be used as a regressor, although its future values would have to be known.
This notebook shows an example of using weather factors as extra regressors in a forecast of bicycle usage, and provides an excellent illustration of how other time series can be included as extra regressors.
The add_regressor function has optional arguments for specifying the prior scale (holiday prior scale is used by default) and whether or not the regressor is standardized - see the docstring with help(Prophet.add_regressor) in Python and ?add_regressor in R. Note that regressors must be added prior to model fitting.
The extra regressor must be known for both the history and for future dates. It thus must either be something that has known future values (such as nfl_sunday), or something that has separately been forecasted elsewhere. Prophet will also raise an error if the regressor is constant throughout the history, since there is nothing to fit from it.
Extra regressors are put in the linear component of the model, so the underlying model is that the time series depends on the extra regressor as either an additive or multiplicative factor (see the next section for multiplicativity).
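5. Multiplicative seasonality
By default Prophet fits additive seasonality: the seasonal effect is added to the trend to get the forecast.
Additive seasonality does not work well for this time series of air passenger counts:
Download the dataset first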
!curl -o example_air_passengers.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_air_passengers.csv
df = pd.read_csv('example_air_passengers.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(50, freq='MS')
forecast = m.predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
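This time series has a clear yearly cycle, but the seasonality in the forecast is too large at the start of the series and too small at the end.
The seasonality here is not a constant additive term; it grows with the trend.
Setting seasonality_mode='multiplicative' models it as multiplicative seasonality: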
m = Prophet(seasonality_mode='multiplicative')
m.fit(df)
forecast = m.predict(future)
fig = m.plot(forecast)
fig2 = m.plot_components(forecast)
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
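With seasonality_mode='multiplicative', holiday effects are modeled as multiplicative too.
The seasonality_mode argument sets the mode for the whole model. To change the mode of a single seasonality, pass mode='additive' or mode='multiplicative' to override it.
In the example below the model's seasonality is multiplicative, while the quarterly seasonality and an extra regressor override it with additive mode: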
m = Prophet(seasonality_mode='multiplicative')
m.add_seasonality('quarterly', period=91.25, fourier_order=8, mode='additive')
m.add_regressor('regressor', mode='additive')
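The difference between the two modes comes down to how the seasonal term is combined with the trend. Here is a toy sketch with made-up numbers; the function name is hypothetical and the formula trend * (1 + seasonal) is the usual way to write a multiplicative effect expressed as a fraction of the trend.

```python
def combine(trend, seasonal, mode="additive"):
    """Combine a trend value with a seasonal term in the given mode."""
    if mode == "additive":
        return trend + seasonal
    if mode == "multiplicative":
        # The seasonal term is a fraction of the trend, so its size scales with it.
        return trend * (1.0 + seasonal)
    raise ValueError(f"unknown mode: {mode}")

low, high = 100.0, 400.0  # trend early and late in the series
additive_swing = combine(high, 20.0) - high                                  # 20.0 at any level
multiplicative_swing_low = combine(low, 0.25, "multiplicative") - low        # 25.0
multiplicative_swing_high = combine(high, 0.25, "multiplicative") - high     # 100.0
```

An additive swing stays the same size however large the trend gets, while a multiplicative swing grows with it, which is the pattern in the air passenger data.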
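6. Uncertainty intervals
By default Prophet returns uncertainty intervals for the forecast.
There are three sources of uncertainty: uncertainty in the trend, uncertainty in the seasonality estimates, and observation noise.
6.1 Uncertainty in the trend
The biggest source of uncertainty in the forecast is the potential for future trend changes.
Future trend changes cannot be known exactly, so the most reasonable thing to do is to assume that the future will look like the history.
In particular, the future is assumed to have trend changes of the same frequency and magnitude as the past; these changes are projected forward to obtain the uncertainty intervals. The assumption is reasonable but not necessarily true, so the intervals may not be perfectly accurate.
With this way of measuring uncertainty, the more flexible the rate, the wider the forecast intervals. Flexibility is set with changepoint_prior_scale (larger means more flexible), so the width of the intervals can also serve as an indicator of overfitting.
The width of the uncertainty interval (80% by default) is set with interval_width: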
forecast = Prophet(interval_width=0.95).fit(df).predict(future)
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
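What interval_width means in practice can be sketched as picking a lower and an upper percentile of simulated forecast values. The sample values below are made up, and the helper names are hypothetical; this only illustrates the arithmetic, not Prophet's simulation.

```python
def interval_percentiles(interval_width):
    """Percentiles that leave (1 - interval_width) / 2 of the mass in each tail."""
    lower = 100 * (1 - interval_width) / 2
    upper = 100 * (1 + interval_width) / 2
    return lower, upper

def percentile(sorted_values, p):
    """Linear-interpolation percentile on an already sorted list."""
    pos = (len(sorted_values) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

samples = list(range(101))  # hypothetical simulated yhat values 0..100, already sorted
lower_p, upper_p = interval_percentiles(0.8)
yhat_lower = percentile(samples, lower_p)
yhat_upper = percentile(samples, upper_p)
```

Moving from 0.8 to 0.95 pushes the percentiles out to 2.5 and 97.5, so the band widens without the point forecast changing.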
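6.2 Uncertainty in seasonality
By default Prophet returns only the uncertainty in the trend and the observation noise.
To get uncertainty in seasonality, full Bayesian sampling is required. It is enabled with the mcmc_samples argument in Python (mcmc.samples in R), which defaults to 0.
The example below uses the first six months of the Peyton Manning dataset. Fitting takes a few minutes rather than a few seconds; as the original documentation puts it: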
This replaces the typical MAP estimation with MCMC sampling, and can take much longer depending on how many observations there are - expect several minutes instead of several seconds. If you do full sampling, then you will see the uncertainty in seasonal components when you plot them:
df = pd.read_csv('example_wp_log_peyton_manning.csv')
df = df.loc[:180,] # Limit to first six months
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=60)
m = Prophet(mcmc_samples=300)
forecast = m.fit(df).predict(future)
fig = m.plot_components(forecast)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
The raw posterior predictive samples can be accessed with m.predictive_samples(future).
MCMC sampling with PyStan is extremely slow on Windows. For MCMC sampling it is best to use R on Windows and Python on Linux.
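7. Outliers
Outliers affect the forecast in two main ways.
The example below uses page-view data for a Wikipedia page that contains a block of erroneous data.
Download the dataset first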
!curl -o example_wp_log_R_outliers1.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_R_outliers1.csv
df = pd.read_csv('example_wp_log_R_outliers1.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=1096)
forecast = m.predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
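The trend forecast looks reasonable, but the uncertainty intervals are far too wide.
Prophet can cope with outliers in the history, but only by fitting them with trend changes; the model then expects future trend changes of similar magnitude.
The best way to handle outliers is to remove them (Prophet has no problem with missing data). If the values for some historical dates are set to NA but those dates are still present in future, Prophet will give predictions for them all the same.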
df.loc[(df['ds'] > '2010-01-01') & (df['ds'] < '2011-01-01'), 'y'] = None
model = Prophet().fit(df)
fig = model.plot(model.predict(future))
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
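In the example above the outliers distorted the uncertainty estimate but did not affect the forecast itself.
That is not always the case. Here some additional outliers have been added: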
!curl -o example_wp_log_R_outliers2.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_R_outliers2.csv
df = pd.read_csv('example_wp_log_R_outliers2.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=1096)
forecast = m.predict(future)
fig = m.plot(forecast)
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
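A cluster of extreme outliers in June 2015 distorts the seasonality estimate, and the effect persists throughout the forecast. Again, the fix is to remove these points: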
df.loc[(df['ds'] > '2015-06-01') & (df['ds'] < '2015-06-30'), 'y'] = None
m = Prophet().fit(df)
fig = m.plot(m.predict(future))
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
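8. Non-daily data
8.1 Sub-daily data
Prophet can forecast sub-daily time series; timestamps must be in the format YYYY-MM-DD HH:MM:SS. In this case daily seasonality is fitted automatically.
Below, a dataset of temperatures at 5-minute intervals is used.
Download the dataset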
!curl -o example_yosemite_temps.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_yosemite_temps.csv
df = pd.read_csv('example_yosemite_temps.csv')
m = Prophet(changepoint_prior_scale=0.01).fit(df)
future = m.make_future_dataframe(periods=300, freq='H')
fcst = m.predict(future)
fig = m.plot(fcst)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
Here is the daily seasonality:
fig = m.plot_components(fcst)
8.2 Data with regular gaps
Suppose the dataset above only had observations from midnight to 6 a.m. each day:
df2 = df.copy()
df2['ds'] = pd.to_datetime(df2['ds'])
df2 = df2[df2['ds'].dt.hour < 6]
m = Prophet().fit(df2)
future = m.make_future_dataframe(periods=300, freq='H')
fcst = m.predict(future)
fig = m.plot(fcst)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
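This forecast is poor, with far larger fluctuations in the future than in the history. The training data only cover midnight to 6 a.m., so the daily seasonality is unconstrained for the rest of the day.
The fix is to predict only for the window from midnight to 6 a.m. and drop all other times: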
future2 = future.copy()
future2 = future2[future2['ds'].dt.hour < 6]
fcst = m.predict(future2)
fig = m.plot(fcst)
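The same applies to other datasets with regular gaps. For example, if the history contains only weekdays, predictions should be made only for weekdays; forecasts for weekends would be unreliable.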
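For the weekday-only case, the future dates can be filtered the same way as the hours were. Here is a standard-library sketch with a hypothetical last historical date; the function name is made up for this example.

```python
from datetime import date, timedelta

def future_weekdays(start, periods):
    """Return the next `periods` calendar days after start, keeping Monday to Friday only."""
    days = [start + timedelta(days=i) for i in range(1, periods + 1)]
    return [d for d in days if d.weekday() < 5]  # weekday() is 0 for Monday, 6 for Sunday

# Hypothetical last historical date: Wednesday 2016-01-20.
future_days = future_weekdays(date(2016, 1, 20), periods=14)
```

On a future DataFrame the equivalent pandas filter would be future[future['ds'].dt.dayofweek < 5], applied before calling predict.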
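8.3 Monthly data
Prophet can fit monthly data, but the underlying model is continuous-time, so fitting on monthly data and then asking for daily forecasts gives strange results.
Here, US retail sales are forecast for the next 10 years:
Download the dataset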
!curl -o example_retail_sales.csv https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv
df = pd.read_csv('example_retail_sales.csv')
m = Prophet(seasonality_mode='multiplicative').fit(df)
future = m.make_future_dataframe(periods=3652)
fcst = m.predict(future)
fig = m.plot(fcst)
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
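The problem is the same as above: yearly seasonality is being fitted, but the training data only have the first day of each month, so the seasonal components for the other days are unidentifiable and overfit.
MCMC makes the uncertainty in this seasonality visible: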
m = Prophet(seasonality_mode='multiplicative', mcmc_samples=300).fit(df)
fcst = m.predict(future)
fig = m.plot_components(fcst)
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
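The seasonality has low uncertainty at the start of each month, where there are data points, and much higher uncertainty in between.
Passing the freq argument to make_future_dataframe restricts the forecast to monthly dates: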
future = m.make_future_dataframe(periods=120, freq='MS')  # 'MS' = month start, matching the dates in the training data
fcst = m.predict(future)
fig = m.plot(fcst)
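9. Diagnostics
Prophet includes cross-validation functionality.
It works by selecting cutoff points in the history and, for each of them, fitting the model using only the data up to that cutoff, then comparing the forecast with the actual values.
The figure below shows a simulated historical forecast on the Peyton Manning dataset, where the model is fit on the first five years of data and forecasts the following year.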
from fbprophet import Prophet
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('example_wp_log_peyton_manning.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=366)
from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(
m, '365 days', initial='1825 days', period='365 days')
cutoff = df_cv['cutoff'].unique()[0]
df_cv = df_cv[df_cv['cutoff'].values == cutoff]
fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
ax.plot(m.history['ds'].values, m.history['y'], 'k.')
ax.plot(df_cv['ds'].values, df_cv['yhat'], ls='-', c='#0072B2')
ax.fill_between(df_cv['ds'].values, df_cv['yhat_lower'],
df_cv['yhat_upper'], color='#0072B2',
alpha=0.2)
ax.axvline(x=pd.to_datetime(cutoff), c='gray', lw=4, alpha=0.5)
ax.set_ylabel('y')
ax.set_xlabel('ds')
ax.text(x=pd.to_datetime('2010-01-01'),y=12, s='Initial', color='black',
fontsize=16, fontweight='bold', alpha=0.8)
ax.text(x=pd.to_datetime('2012-08-01'),y=12, s='Cutoff', color='black',
fontsize=16, fontweight='bold', alpha=0.8)
ax.axvline(x=pd.to_datetime(cutoff) + pd.Timedelta('365 days'), c='gray', lw=4,
alpha=0.5, ls='--')
ax.text(x=pd.to_datetime('2013-01-01'),y=6, s='Horizon', color='black',
fontsize=16, fontweight='bold', alpha=0.8);
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Making 3 forecasts with cutoffs between 2013-01-20 00:00:00 and 2015-01-20 00:00:00
For more on simulated historical forecasts, see the paper.
The cross_validation function performs this cross-validation automatically over a range of historical cutoffs.
Its arguments:
- horizon: how far past each cutoff to forecast
- initial: the size of the initial training period
- period: the spacing between cutoff dates
By default, initial is three times the horizon, and a cutoff is made every half horizon.
The output of cross_validation is a dataframe with the true values y and the out-of-sample forecast values yhat, at each simulated forecast date and for each cutoff date. In particular, a forecast is made for every observed point between cutoff and cutoff + horizon. This dataframe can then be used to compute error measures of yhat vs. y.
Here we do cross-validation to assess prediction performance on a horizon of 365 days, starting with 730 days of training data in the first cutoff and then making predictions every 180 days. On this 8 year time series, this corresponds to 11 total forecasts: the data span 2963 days (2007-12-10 to 2016-01-20), so there are floor((2963 - 730 - 365) / 180) + 1 = 11 cutoffs.
from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(m, initial='730 days', period='180 days', horizon = '365 days')
df_cv.head()
INFO:fbprophet:Making 11 forecasts with cutoffs between 2010-02-15 00:00:00 and 2015-01-20 00:00:00
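The count of 11 and the first and last cutoff dates in the log line can be reproduced with a few lines of date arithmetic. This is a sketch of the idea (walk back from the last possible cutoff in steps of period while enough training data remains), with a made-up function name; it is not Prophet's own code.

```python
from datetime import date, timedelta

def make_cutoffs(first_day, last_day, initial, period, horizon):
    """Walk back from last_day - horizon in steps of period, keeping `initial` of training data."""
    cutoffs = []
    cutoff = last_day - horizon
    while cutoff >= first_day + initial:
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)

cutoffs = make_cutoffs(
    first_day=date(2007, 12, 10), last_day=date(2016, 1, 20),
    initial=timedelta(days=730), period=timedelta(days=180), horizon=timedelta(days=365),
)
```

The result has 11 entries running from 2010-02-15 to 2015-01-20, matching the log output.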
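In Python, the strings for initial, period, and horizon must be in the format of a Pandas Timedelta, which accepts units of days or shorter.
The performance_metrics function evaluates the model and reports MSE, RMSE, MAE, MAPE, and the coverage of the uncertainty intervals.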
from fbprophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)
df_p.head()
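For reference, two of these metrics written out by hand on a few made-up values. This is a plain average over all points; performance_metrics additionally groups the errors by horizon with a rolling window, which this sketch does not attempt.

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    errors = [abs((y - yhat) / y) for y, yhat in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    squared = [(y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

y = [10.0, 8.0, 5.0]      # hypothetical true values
yhat = [9.0, 8.4, 5.5]    # hypothetical forecasts
mape_value = mape(y, yhat)
rmse_value = rmse(y, yhat)
```

Note that MAPE divides by y, so it is undefined when a true value is zero.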
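These metrics can be visualized with plot_cross_validation_metric. The MAPE plot below shows errors of around 5% for forecasts one month into the future, rising to around 11% for forecasts a year out.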
from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')
The size of the rolling window in the figure can be changed with the optional argument rolling_window, which specifies the proportion of forecasts to use in each rolling window. The default is 0.1, corresponding to 10% of rows from df_cv included in each window; increasing this will lead to a smoother average curve in the figure.
The initial period should be long enough to capture all of the components of the model, in particular seasonalities and extra regressors: at least a year (365 days) for yearly seasonality, at least a week (7 days) for weekly seasonality, etc.
References:
Official documentation
trend_changepoints.ipynb
diagnostics.ipynb
A detailed walkthrough of the Prophet time series model