So far, we have covered techniques to extract data from various sources. This was covered in Chapter 2, Reading Time Series Data from Files, and Chapter 3, Reading Time Series Data from Databases. Chapter 6, Working with Date and Time in Python, and Chapter 7, Handling Missing Data, covered several techniques to help prepare, clean, and adjust data.
You will continue to explore additional techniques to better understand the time series process behind the data. Before modeling the data or doing any further analysis, an important step is to inspect the data at hand. More specifically, there are specific time series characteristics that you need to check for, such as stationarity, the effects of trend and seasonality, and autocorrelation, to name a few. These characteristics, which describe the time series process you are working with, need to be combined with domain knowledge of the process itself.
This chapter will build on what you have learned from previous chapters to prepare you for creating and evaluating forecasting models starting from Chapter 10, Building Univariate Time Series Models Using Statistical Methods.
In this chapter, you will learn how to visualize time series data, decompose a time series into its components (trend, seasonality, and residuals), test for different assumptions that your models may rely on (such as stationarity, normality, and homoskedasticity), and explore techniques to transform the data to satisfy some of these assumptions.
The recipes that you will encounter in this chapter are as follows:
Throughout this chapter, you will be using three datasets (Closing Price Stock Data, CO2, and Air Passengers). The CO2 and Air Passengers datasets are provided with the statsmodels library. The Air Passengers dataset contains monthly airline passenger numbers from 1949 to 1960. The CO2 dataset contains weekly atmospheric carbon dioxide levels at Mauna Loa. The Closing Price Stock Data dataset includes Microsoft, Apple, and IBM stock prices from November 2019 to November 2021.
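As a quick aside, the CO2 dataset bundled with statsmodels can be loaded as shown below (a minimal sketch; the recipes in this chapter actually pull the CO2 data from the Scripps website and read the Air Passengers data from a CSV file):
import statsmodels.api as sm
# weekly Mauna Loa CO2 measurements as a pandas DataFrame
co2_sm = sm.datasets.co2.load_pandas().data
co2_sm.head()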
The pandas library offers built-in plotting capabilities for visualizing data stored in a DataFrame or Series. Behind the scenes, these visualizations are powered by the Matplotlib library, which is also the default plotting backend.
The pandas library offers many convenient methods to plot data. Simply calling DataFrame.plot() or Series.plot() will generate a line plot by default. You can change the type of the plot in two ways: by passing the plot type to the kind argument of .plot(), or by using one of the .plot.<kind>() accessor methods.
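For example, both of the following calls produce the same bar chart (a minimal sketch, assuming a DataFrame named df):
# 1. pass the plot type to the kind argument
df.plot(kind='bar')
# 2. use the corresponding plot accessor method
df.plot.bar()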
This recipe will use the standard pandas .plot() method with Matplotlib backend support.
You will be using the stock data for Microsoft, Apple, and IBM, which you can find in the closing_price.csv file (in the code below, the data is downloaded directly with the yfinance library instead):
import yfinance as yf
df = yf.download('AAPL MSFT IBM',
start='2019-01-01')
df
import matplotlib.pyplot as plt
fig, ax = plt.subplots( 1,1, figsize=(10,8) )
# df['Adj Close'].plot(kind='line', ax=ax)
symbols = list( set( df.columns.get_level_values(1) ) )
color_list=['b','g','k']
for idx, tick in enumerate(symbols):
ax.plot( df.index,
df['Adj Close'][tick],
label=tick,
color=color_list[idx],
)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Adj Close Price', fontsize=14)
plt.setp( ax.get_xticklabels(), rotation=45,
horizontalalignment='right', fontsize=12 )
plt.setp( ax.get_yticklabels(), #rotation=45,
horizontalalignment='right', fontsize=12 )
plt.legend( loc='best', fontsize=14)
plt.show()
https://seekingalpha.com/symbol/GOOG/splits
Apple Inc. (AAPL) Stock Split History | Seeking Alpha
import matplotlib.pyplot as plt
fig, ax = plt.subplots( 1,1, figsize=(18,10) )
# df['Adj Close'].plot(kind='line', ax=ax)
symbols = list( set( df.columns.get_level_values(1) ) )
color_list=['b','g','k']
aapl_event={"2020-08-31": "4:1 split",
"2022-09-16" : "iphone 14",
"2021-09-14" : "iphone 13",
"2020-10-23" : "iphone 12",
"2019-09-20" : "iphone 11",
}
for idx, tick in enumerate(symbols):
ax.plot( df.index,
df['Adj Close'][tick],
label=tick,
color=color_list[idx],
)
from datetime import datetime, timedelta
for date, label in aapl_event.items():
ax.annotate(label,
ha='center',
va='top',
# String to date object
xytext=( datetime.strptime(date, '%Y-%m-%d') -timedelta(days=7) ,
df['Adj Close']['AAPL'].loc[date] +50), #The xytext parameter specifies the text position
xy=( datetime.strptime(date, '%Y-%m-%d'),
df['Adj Close']['AAPL'].loc[date]+10), #The xy parameter specifies the arrow's destination
arrowprops=dict( arrowstyle="-|>,head_width=0.5, head_length=1",
facecolor='r',
linewidth=2, edgecolor='k' ),
#arrowprops={'facecolor':'blue', 'headwidth':10, 'headlength':4, 'width':2} #OR
fontsize=12
)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Adj Close Price', fontsize=14)
plt.setp( ax.get_xticklabels(), rotation=45,
horizontalalignment='right', fontsize=12 )
plt.setp( ax.get_yticklabels(), #rotation=45,
horizontalalignment='right', fontsize=12 )
ax.autoscale(enable=True, axis='x', tight=True) # move all curves to left(touch y-axis)
plt.legend( loc='best', fontsize=14)
plt.show()
Apple's stock price tends to rise in the short period before a new iPhone is released, fall shortly after the release, and then rise again; a similar pattern appears around the stock split.
import matplotlib.pyplot as plt
fig, ax = plt.subplots( 1,1, figsize=(18,10) )
# df['Adj Close'].plot(kind='line', ax=ax)
symbols = list( set( df.columns.get_level_values(1) ) )
color_list=['b','g','k']
aapl_event={"2020-08-31": "4:1 split",
"2022-09-16" : "iphone 14",
"2021-09-14" : "iphone 13",
"2020-10-23" : "iphone 12",
"2019-09-20" : "iphone 11",
}
hike_dates=['2022-11-2', '2022-09-21', '2022-07-27', '2022-06-16', '2022-05-05', '2022-03-17']
cuts_dates=['2019-10-31', '2019-09-19', '2019-08-01',
'2020-03-16', '2020-03-13']
for idx, tick in enumerate(symbols):
ax.plot( df.index,
df['Adj Close'][tick],
label=tick,
color=color_list[idx],
)
from datetime import datetime, timedelta
for date, label in aapl_event.items():
ax.annotate(label,
ha='center',
va='top',
# String to date object
xytext=( datetime.strptime(date, '%Y-%m-%d') -timedelta(days=7) ,
df['Adj Close']['AAPL'].loc[date] +50), #The xytext parameter specifies the text position
xy=( datetime.strptime(date, '%Y-%m-%d'),
df['Adj Close']['AAPL'].loc[date]+10), #The xy parameter specifies the arrow's destination
arrowprops=dict(facecolor='k', headwidth=5, headlength=5, width=1 ),
#arrowprops={'facecolor':'blue', 'headwidth':10, 'headlength':4, 'width':2} #OR
fontsize=14
)
for date in hike_dates:
ax.axvline( datetime.strptime(date, '%Y-%m-%d'),
ls=':',
color='r')
for date in cuts_dates:
ax.axvline( datetime.strptime(date, '%Y-%m-%d'),
ls='--',
lw=0.9,
color='y')
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Adj Close Price', fontsize=14)
plt.setp( ax.get_xticklabels(), rotation=45,
horizontalalignment='right', fontsize=12 )
plt.setp( ax.get_yticklabels(), #rotation=45,
horizontalalignment='right', fontsize=12 )
ax.autoscale(enable=True, axis='x', tight=True) # move all curves to left(touch y-axis)
plt.legend( loc='best', fontsize=14)
plt.show()
The Fed's rate cuts (dashed yellow lines) and rate hikes (dotted red lines) appear to have some impact on stock prices.
2. If you want to see how the prices fluctuate (up or down) in comparison to each other, one easy approach is to normalize the data. To accomplish this, just divide the stock prices by the first-day price (first row) for each stock. This will make all the stocks have the same starting point:
closing_price_n=df['Adj Close'].div(df['Adj Close'].iloc[0])
import matplotlib.pyplot as plt
fig, ax = plt.subplots( 1,1, figsize=(18,10) )
# df['Adj Close'].plot(kind='line', ax=ax)
symbols = list( set( df.columns.get_level_values(1) ) )
color_list=['b','g','k']
aapl_event={"2020-08-31": "4:1 split",
"2022-09-16" : "iphone 14",
"2021-09-14" : "iphone 13",
"2020-10-23" : "iphone 12",
"2019-09-20" : "iphone 11",
}
for idx, tick in enumerate(symbols):
ax.plot( df.index,
closing_price_n[tick],
label=tick,
color=color_list[idx],
)
from datetime import datetime, timedelta
for date, label in aapl_event.items():
ax.annotate(label,
ha='center',
va='top',
# String to date object
xytext=( datetime.strptime(date, '%Y-%m-%d') -timedelta(days=7) ,
closing_price_n['AAPL'].loc[date] +0.9), #The xytext parameter specifies the text position
xy=( datetime.strptime(date, '%Y-%m-%d'),
closing_price_n['AAPL'].loc[date]+0.35), #The xy parameter specifies the arrow's destination
# arrowprops=dict( arrowstyle="-|>,head_width=1, head_length=1",
# facecolor='b',
# linewidth=4, edgecolor='k' ),
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.2',
color='b',),
bbox=dict(boxstyle='round,pad=0.2', fc='yellow', alpha=0.95),
fontsize=12
)
ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Adj Close Price', fontsize=14)
plt.setp( ax.get_xticklabels(), rotation=45,
horizontalalignment='right', fontsize=12 )
plt.setp( ax.get_yticklabels(), #rotation=45,
horizontalalignment='right', fontsize=12 )
ax.autoscale(enable=True, axis='x', tight=True) # move all curves to left(touch y-axis)
plt.legend( loc='best', fontsize=14)
plt.show()
From the normalization output, you can observe that the lines now have the same starting point (origin), set to 1. The plot shows how the prices in the time series deviate from each other:
closing_price_n
Figure 9.3 – Output of normalized time series with a common starting point at 1
3. Additionally, Matplotlib allows you to change the style of the plots. To do that, you can use the style.use function. You can specify a style name from an existing template or use a custom style. For example, the following code shows how you can change from the default style to the ggplot style:
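A minimal sketch (reusing the normalized closing_price_n DataFrame defined earlier):
import matplotlib.pyplot as plt
plt.style.use('ggplot')
closing_price_n.plot( figsize=(10,8), title='ggplot style' )
plt.show()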
You can explore other attractive styles such as fivethirtyeight (inspired by https://fivethirtyeight.com/), dark_background, seaborn-dark, and tableau-colorblind10. For a comprehensive list of available style sheets, you can reference the Matplotlib documentation here: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
If you want to revert to the original theme, specify the following:
plt.style.use("default")
https://blog.csdn.net/Linli522362242/article/details/121045744 (Adjusting the resolution: dpi)
You can customize the plot further by adding a title, updating the axes labels, and customizing the x ticks and y ticks, to name a few.
Add a title and a label to the y axis, then save it as a .jpg file:
start_date = '2019'
end_date = '2022'
plt.style.use('ggplot' )
plot = closing_price_n.plot( figsize=(10,8),
title=f'Stock Prices from {start_date} - {end_date}',
ylabel='Norm. Price'
)
# plot.get_figure().savefig('plot_1.jpg')
pandas and Matplotlib are closely integrated, and pandas continues to add more plotting capabilities on top of Matplotlib.
There are many plot types that you can produce within pandas simply by providing a value to the kind argument, for example, 'line' (the default), 'bar', 'barh', 'hist', 'box', 'kde' (or 'density'), 'area', 'pie', 'scatter', and 'hexbin'.
tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])
tips.head()
https://blog.csdn.net/Linli522362242/article/details/87891370
As observed in the previous section, we plotted all three columns of the time series in one plot (three line charts in the same figure). What if you want each symbol (column) plotted separately?
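The code referenced next appears to be missing here; a minimal sketch using the pandas subplots argument (assuming df['Adj Close'] holds the three price columns) would be:
df['Adj Close'].plot( subplots=True, figsize=(12,8) )
plt.show()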
The preceding code will generate a subplot for each column in the DataFrame. For the three-symbol closing price data, this will generate three subplots.
fig, axes = plt.subplots( 3,1, figsize=(12,8) )
symbols = list( set( df.columns.get_level_values(1) ) )
color_list=['b','g','k']
for idx in range( len(axes) ):
axes[idx].plot( df.index, df['Adj Close'][df['Adj Close'].columns[idx]],
label=df['Adj Close'].columns[idx],
color=color_list[idx]
)
plt.setp( axes[idx].get_yticklabels(), fontsize=12 )
axes[idx].set_xticks([])
axes[idx].legend(fontsize=12)
from matplotlib.dates import DateFormatter
import matplotlib.ticker as ticker
axes[-1].set_xticks(closing_price_n.index)
axes[-1].xaxis.set_major_locator(ticker.MaxNLocator(12))
axes[-1].xaxis.set_major_formatter( DateFormatter('%Y-%m') )
axes[0].set_title(f'Stock Prices from {start_date} - {end_date}')
plt.setp( axes[-1].get_xticklabels(), rotation=45, horizontalalignment='right', fontsize=12 )
plt.show()
To learn more about pandas charting and plotting capabilities, please visit the official documentation here: Chart visualization — pandas 1.5.1 documentation.
In this recipe, you will explore the hvPlot library to create interactive visualizations. hvPlot works well with pandas DataFrames to render interactive visualizations with minimal effort. You will be using the same closing_price.csv dataset to explore the library.
To install hvPlot (part of the HoloViz, formerly PyViz, ecosystem), you can use conda:
conda install -c pyviz hvplot
or, from within a Jupyter notebook:
!pip install hvplot
1. Start by importing the libraries needed. Notice that hvPlot has a pandas extension, which makes it more convenient. This will allow you to use the same syntax as in the previous recipe:
import hvplot.pandas
# normalize the data :
# divide the stock prices by the first-day price (first row)
# closing_price_n=df['Adj Close'].div( df['Adj Close'].iloc[0] )
closing_price_n.hvplot( title='Time Series plot using hvplot',
width=800, height=400 )
Figure 9.6 – hvPlot interactive visualization
The same result can be accomplished simply by switching the pandas plotting backend. The default backend is matplotlib. To switch it to hvPlot, you just pass backend='hvplot':
closing_price_n.plot( backend='hvplot',
title='Time Series plot using hvplot', width=800, height=400
)
Notice the widget bar to the right, which has a set of modes for interaction, including pan, box zoom, wheel zoom, save, reset, and hover.
Figure 9.7 – Widget bar with six modes of interaction
2. You can split the time series into a separate plot per symbol (column), for example, one each for MSFT, AAPL, and IBM. Subplotting is done by specifying subplots=True:
You can use the .cols() method for more control over the layout. The method allows you to control the number of plots per row. For example, .cols(1) means one plot per row, whereas .cols(2) indicates two plots per row:
# fontsize={
# 'title': '200%',
# 'labels': '200%',
# 'ticks': '200%',
# }
closing_price_n.hvplot( width=300, height=400,
subplots=True,
rot=45,
fontsize={ 'title': 14,
'labels': 14,
'xticks': 12,
'yticks': 10,
}
).cols(2)
Keep in mind that the .cols() method only works if the subplots parameter is set to True. Otherwise, you will get an error.
hvPlot offers convenient options for plotting your DataFrame: switching the pandas plotting backend, extending pandas with DataFrame.hvplot(), or using hvPlot's native API.
hvPlot also allows you to use two arithmetic operators, + and *, to configure the layout of the plots.
The plus sign (+) places two charts side by side, while the multiplication sign (*) combines charts (overlays one graph on another). In the following example, we add two plots so they are aligned side by side on the same row:
( closing_price_n['AAPL'].hvplot( width=400, rot=45, fontsize={'xticks': 12} ) +
closing_price_n['MSFT'].hvplot( width=400, rot=45, fontsize={'xticks': 12} )
)
Notice that the two plots will share the same widget bar. If you filter or zoom into one of the charts, the same action is applied to the other chart.
Now, let's see how multiplication will combine the two plots into one:
( closing_price_n['AAPL'].hvplot( width=800, height=400, rot=45, fontsize={'xticks': 12} ) *
closing_price_n['MSFT'].hvplot()
)
Figure 9.11 – Two plots combined into one using the multiplication operator
For more information on hvPlot, please visit the official documentation here: hvPlot — hvPlot 0.8.1 documentation
When performing time series analysis, one of your objectives may be forecasting, where you build a model to make a future prediction. Before starting the modeling process, you will need to extract the components of the time series process for analysis. This will help you make informed decisions during the modeling process. There are three major components of any time series process: trend, seasonality, and residual.
The decomposition of a time series is the process of extracting these three components and representing each one explicitly. The relationship between the components can be modeled as either additive or multiplicative.
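In symbols (using T for the trend, S for the seasonality, and R for the residual at time t), the two models can be written as follows:
Additive model:       y_t = T_t + S_t + R_t
Multiplicative model: y_t = T_t x S_t x R_t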
Furthermore, you can group these into predictable versus non-predictable components.
In this recipe, you will explore different techniques for decomposing your time series using the seasonal_decompose, Seasonal-Trend decomposition with LOESS (STL), and hp_filter methods available in the statsmodels library.
You will start with statsmodels' seasonal_decompose approach:
https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record.html
"The data file below contains 10 columns. Columns 1-4 give the dates in several redundant formats.
import numpy as np
import pandas as pd
source='https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv'
co2_ds = pd.read_csv( source,
comment='"',
header=[0,1,2],
sep=',',
na_values='-99.99'
)
co2_ds
co2_ds.columns
cols = [ '_'.join( ' '.join(col).strip().split() )
for col in co2_ds.columns.values
]
co2_ds.set_axis(cols, axis = 1, inplace = True)
co2_ds
co2_ds.columns
The monthly values have been adjusted to 24:00 hours on the 15th of each month
# Converting Excel date format to datetime
# 1958-21200/365=1899.9178082191781
# 365-.9178082191781*365 = 29.99999999999352 = 30
co2_ds['datetime'] = pd.to_datetime( co2_ds['Date_Excel'], # 1958-21200/365=1899.9178082191781
origin = pd.Timestamp('1899-12-30'), # before 1890
unit = 'D'
)
co2_ds
# and setting as dataframe index
co2_ds.set_index('datetime', inplace = True)
co2_ds
Column 5(CO2_[ppm]) below gives monthly Mauna Loa CO2 concentrations in micro-mol CO2 per mole (ppm), reported on the 2012 SIO manometric mole fraction scale. This is the standard version of the data most often sought. The monthly values have been adjusted to 24:00 hours on the 15th of each month.
Column 9(CO2_filled_[ppm]) is identical to Column 5 except that the missing values from Column 5 have been filled with values from Column 7.
co2_df = pd.DataFrame( co2_ds[ 'CO2_filled_[ppm]' ] )
co2_df.rename( columns={'CO2_filled_[ppm]':'CO2'},
inplace=True
)
co2_df.dropna( inplace=True )
co2_df = co2_df.resample('M').sum()
co2_df
##############
Why resample('M').sum()?
Because we are going to use seasonal_decompose(), which requires that "x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None"; resampling to monthly frequency sets the index freq.
co2_df = co2_df.resample('M').sum()
co2_df.index
hvplot.extension("bokeh")
co2_df.hvplot( title='Mauna Loa Weekly Atmospheric CO2 Data',
width=600, height=400,
rot=45, fontsize={'xticks':12, 'yticks':12, 'xlabel':14}
)
Figure 9.12 – The CO2 dataset showing an upward trend and constant seasonal variation
The co2_df data shows a long-term linear (upward) trend, with a repeated seasonal pattern at a constant rate (seasonal variation).
This indicates that an additive model fits the CO2 dataset (the additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series).
Similarly, you can explore the airp_df DataFrame for the Air Passengers dataset to observe whether the seasonality shows multiplicative or additive behavior:
airp_df = pd.read_csv('air_passenger.csv')
# and setting as dataframe index
airp_df.set_index('date', inplace = True)
airp_df
airp_df.index
airp_df.index = pd.to_datetime( airp_df.index )
airp_df = airp_df.resample('M').sum()
airp_df.index
Why resample('M').sum()?
Again, because seasonal_decompose() requires that "x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None".
hvplot.extension('plotly') # 'matplotlib' # 'bokeh' # holoviews
start = pd.DatetimeIndex( airp_df.index ).year[0]
end = pd.DatetimeIndex( airp_df.index ).year[-1]
airp_df.plot( backend='hvplot',
title=f'Monthly Airline Passenger Numbers {start}-{end}',
xlabel='Date',
width=800, height=400,
)
Figure 9.13 – The Air Passengers dataset showing trend and increasing seasonal variation
The airp_df data shows a long-term linear (upward) trend and seasonality. However, the seasonal fluctuations seem to be increasing as well, indicating a multiplicative model (a multiplicative model is suitable when the seasonal variation fluctuates over time; in other words, when the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, a multiplicative decomposition is more appropriate).
3. Use seasonal_decompose on the two datasets: an additive model for the CO2 data and a multiplicative model for the air passenger data:
from statsmodels.tsa.seasonal import seasonal_decompose
co2_decomposed = seasonal_decompose( co2_df['CO2'], model='additive' )
air_decomposed = seasonal_decompose( airp_df, model='multiplicative' )
Both co2_decomposed and air_decomposed are DecomposeResult objects that give you access to several attributes and methods, including .observed, .trend, .seasonal, .resid, and the .plot() method:
air_dec_df = airp_df.copy() # copy so the original airp_df is not modified
air_dec_df['trend']=air_decomposed.trend
air_dec_df['seasonal']=air_decomposed.seasonal
air_dec_df['resid']=air_decomposed.resid
air_dec_df
You can plot all three components by using the .plot() method:
plt.rcParams['figure.figsize'] = (10,10)
#https://matplotlib.org/stable/tutorials/introductory/customizing.html
plt.style.use('seaborn-dark')
air_decomposed.plot()
plt.show()
hvplot.extension("bokeh")
air_dec_df.hvplot( width=350, height=350,
xlabel='Date',
subplots=True, shared_axes=False
).cols(2)
Figure 9.14 – Air Passengers multiplicative decomposed into trend, seasonality, and residual
Let's break down the resulting plot into its four parts: the observed data, the trend component, the seasonal component, and the residual.
Similarly, you can plot the decomposition of the CO2 dataset:
plt.rcParams['figure.figsize'] = (10,10)
# https://matplotlib.org/stable/tutorials/introductory/customizing.html
plt.style.use('seaborn-white')
fig=co2_decomposed.plot()
axs = fig.get_axes()
axs[3].clear()
axs[3].plot(co2_decomposed.resid)
axs[3].axhline(y=0, color='k', linestyle='--')
axs[3].set_ylabel('Resid')
plt.show()
co2_dec_df = co2_df.copy(deep=True)
co2_dec_df['trend']=co2_decomposed.trend
co2_dec_df['seasonal']=co2_decomposed.seasonal
co2_dec_df['resid']=co2_decomposed.resid
co2_dec_df
hvplot.extension("bokeh")
co2_dec_df.hvplot(width=800, height=240,
xlabel='Date',
subplots=True, shared_axes=False
).cols(1)
Creating layouts — Bokeh 2.4.3 Documentation
from bokeh.layouts import column # row,
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
# bokeh.__version__ : '2.4.3'
source = ColumnDataSource(data={ 'date': co2_decomposed.observed.index,
'co2' : co2_decomposed.observed,
'trend': co2_decomposed.trend,
'seasonl': co2_decomposed.seasonal,
'residual': co2_decomposed.resid
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
ps = []
# source.data.keys() : dict_keys(['date', 'co2', 'trend', 'seasonl', 'residual'])
for col in list( source.data.keys() )[1:]:
p = figure( width=800, height=230, #background_fill_color="#fafafa"
x_axis_type="datetime",
# x_axis_label='Date',
y_axis_label=col,
)
p.line( x='date', y=col, source=source, line_width=2, color='blue'
# legend_label=col
)
p.add_tools( HoverTool( # key
tooltips=[ ( 'Date', '@date{%F}'),
( col, '@%s{0.000}' % col ), # use @{ } for field names with spaces
],
formatters={ '@date' : "datetime", # use 'datetime' formatter for 'date' field
'@%s{0.000}' % col : 'numeral', # use default 'numeral' formatter
},
# display a tooltip whenever the cursor is vertically in line with a glyph
mode='vline'
)
)
ps.append(p)
show(column(ps))
from bokeh.layouts import column # row,
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
# bokeh.__version__ : '2.4.3'
source = ColumnDataSource(data={ 'date': co2_decomposed.observed.index,
'co2' : co2_decomposed.observed,
'trend': co2_decomposed.trend,
'seasonl': co2_decomposed.seasonal,
'residual': co2_decomposed.resid
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
ps = []
# source.data.keys() : dict_keys(['date', 'co2', 'trend', 'seasonl', 'residual'])
for col in list( source.data.keys() )[1:]:
p = figure( width=800, height=230, #background_fill_color="#fafafa"
x_axis_type="datetime",
# x_axis_label='Date',
y_axis_label=col,
)
p.line( x='date', y=col, source=source, line_width=2, color='blue'
# legend_label=col
)
p.add_tools( HoverTool( # key
tooltips=[ ( 'Date', '@date{%F}' ),
( 'co2', '@co2{0.000}' ), # use @{ } for field names with spaces
( 'trend', '@trend{0.000}' ),
( 'seasonl', '@seasonl{0.000}' ),
( 'residual', '@residual{0.000}'),
],
formatters={ '@date' : "datetime", # use 'datetime' formatter for 'date' field
'@co2{0.000}': 'numeral', # use default 'numeral' formatter
},
# display a tooltip whenever the cursor is vertically in line with a glyph
mode='vline'
)
)
ps.append(p)
# https://docs.bokeh.org/en/2.4.2/docs/reference/models/tools.html
# https://docs.bokeh.org/en/latest/docs/reference/colors.html
from bokeh.models import CrosshairTool
def addLinkedCrosshairs(plots):
crosshair = CrosshairTool(dimensions="height", line_color='green')
for p in plots:
p.add_tools(crosshair)
addLinkedCrosshairs(ps)
show(column(ps))
Figure 9.15 – CO2 additive decomposed into trend, seasonality, and residual
5. When reconstructing the time series from a multiplicative model, you multiply the three components. To demonstrate this concept, use air_decomposed, which is an instance of the DecomposeResult class. The class provides the observed, seasonal, trend, and resid attributes, as well as the .plot() method.
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
import numpy as np
rec_model=air_decomposed.trend * air_decomposed.seasonal * air_decomposed.resid
source = ColumnDataSource(data={ 'date': air_decomposed.observed.index,
'origin': air_decomposed.observed,
'refactored': rec_model,
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
p = figure( width=800, height=500,
title='Refactored VS Original models',
x_axis_type='datetime',
x_axis_label='Date',
y_axis_label='Passengers',
)
p.title.align = "center"
p.xaxis.major_label_orientation=np.pi/4 # rotation
p.line( x='date', y='origin', source=source, legend_label='Origin',
line_width=2, color='blue'
)
p.circle( x='date', y='refactored', source=source, legend_label='Refactored',
fill_color='white', size=5
)
p.legend.location = "top_left"
p.add_tools( HoverTool(
tooltips=[ ('Date', '@date{%F}'),
('Origin', '@origin{0.0}'),
('Refactored', '@refactored{0.0}' ),
],
formatters={'@date':'datetime',},
#model='vline'
)
)
show(p)
Note: there are missing points at the start and end of the reconstructed series. This is because seasonal_decompose estimates the trend with a centered moving average, so the trend (and therefore the residual) is undefined for the first and last few observations.
STL is a versatile and robust method for decomposing time series. STL is an acronym for Seasonal and Trend decomposition using Loess, where Loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland, Cleveland, McRae, and Terpenning (1990).
The STL class uses the LOESS seasonal smoother (Locally Estimated Scatterplot Smoothing). STL is more robust than seasonal_decompose for capturing non-linear relationships. On the other hand, STL assumes an additive composition, so you do not need to indicate a model, unlike with seasonal_decompose.
STL has several advantages over the classical, SEATS, and X11 decomposition methods:
Unlike SEATS and X11, STL will handle any type of seasonality, not only monthly and quarterly data.
The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
The smoothness of the trend-cycle can also be controlled by the user.
It can be robust to outliers (i.e., the user can specify a robust decomposition; setting robust=True reduces the impact of outliers on the estimated seasonal and trend components), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
On the other hand, STL has some disadvantages. In particular, it does not handle trading day or calendar variation automatically, and it only provides facilities for additive decompositions.
It is possible to obtain a multiplicative decomposition by first taking logs of the data, then back-transforming the components.
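A minimal sketch of that log trick, assuming the monthly co2_df DataFrame prepared earlier (purely illustrative, since the CO2 series is arguably additive to begin with):
import numpy as np
from statsmodels.tsa.seasonal import STL
# decompose the log of the series additively, then back-transform the components
log_stl = STL( np.log(co2_df['CO2']), seasonal=13 ).fit()
trend_mult = np.exp( log_stl.trend )
seasonal_mult = np.exp( log_stl.seasonal )
resid_mult = np.exp( log_stl.resid )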
We will look at several methods for obtaining the components T_t, S_t, and R_t later in this chapter, but first, it is helpful to see an example. We will decompose the new orders index for electrical equipment shown in Figure 6.1. The data show the number of new orders for electrical equipment (computer, electronic, and optical products) in the Euro area (16 countries). The data have been adjusted by working days and normalised so that a value of 100 corresponds to 2005.
Figure 6.1 shows the trend-cycle component, T_t, in red and the original data, y_t, in grey. The trend-cycle shows the overall movement in the series, ignoring the seasonality and any small random fluctuations.
Figure 6.2 shows an additive decomposition of these data. The method used for estimating components in this example is STL.
The electrical equipment orders are shown in the top panel of Figure 6.2, and the three additive components are shown separately in the bottom three panels. These components can be added together to reconstruct the data shown in the top panel. Notice that the seasonal component changes slowly over time, so that any two consecutive years have similar patterns, but years far apart may have different seasonal patterns. The remainder component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.
The grey bars to the right of each panel show the relative scales of the components. Each grey bar represents the same length, but because the plots are on different scales, the bars vary in length. The longest grey bar, in the bottom panel, shows that the variation in the remainder component is small compared to the variation in the data, whose bar is about one quarter the size. If we shrunk the bottom three panels until their bars became the same size as that in the data panel, then all the panels would be on the same scale.
##########
So, on the upper (data) panel, we might consider the grey bar as one unit of variation. Each of the other panels (trend, seasonal, and remainder) carries a bar representing that same amount of variation, but drawn on that panel's own scale; a component with small variation is zoomed in, so its bar appears longer. The general idea is that if you scaled all the panels such that the grey bars were all the same length, you would be able to determine the relative magnitude of the variation in each of the components and how much of the variation in the original data they contain. Because the plot draws each component on its own scale, we need the bars to give us a relative scale for comparison.
##########
Seasonally adjusted data
If the seasonal component is removed from the original data, the resulting values are the "seasonally adjusted" data. For an additive decomposition, the seasonally adjusted data are given by y_t - S_t, and for multiplicative data, the seasonally adjusted values are obtained using y_t / S_t.
If the variation due to seasonality is not of primary interest, the seasonally adjusted series can be useful. For example, monthly unemployment data are usually seasonally adjusted (the seasonal component is removed from the original data) in order to highlight variation due to the underlying state of the economy rather than the seasonal variation.
Seasonally adjusted series contain the remainder component as well as the trend-cycle. Therefore, they are not "smooth", and "downturns" or "upturns" can be misleading. If the purpose is to look for turning points in a series and interpret any changes in direction, then it is better to use the trend-cycle component rather than the seasonally adjusted data.
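As a quick illustration (assuming the co2_decomposed and air_decomposed results from step 3 above), the seasonally adjusted series can be computed directly from the decomposition output:
# additive model (CO2): subtract the seasonal component
co2_seas_adj = co2_decomposed.observed - co2_decomposed.seasonal
# multiplicative model (Air Passengers): divide by the seasonal component
air_seas_adj = air_decomposed.observed / air_decomposed.seasonal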
The best way to begin learning how to use STL is to see some examples and experiment with the settings. Figure 6.2 showed an example of STL applied to the electrical equipment orders data. Figure 6.13 shows an alternative STL decomposition where the trend-cycle is more flexible, the seasonal component does not change over time, and the robust option has been used. Here, it is more obvious that there has been a downturn at the end of the series, and that the orders in 2009 were unusually low (corresponding to some large negative values, for example around -10, in the remainder component).
Figure 6.13: The electrical equipment orders (top) and its three additive components obtained from a robust STL decomposition with flexible trend-cycle and fixed seasonality.
The two main parameters to be chosen when using STL are the trend-cycle window (t.window) and the seasonal window (s.window). t.window is the number of consecutive observations to be used when estimating the trend-cycle; it is optional, and a default value will be used if it is omitted. s.window is the number of consecutive years to be used in estimating each value in the seasonal component; the user must specify s.window as there is no default, and setting it to be infinite is equivalent to forcing the seasonal component to be periodic (i.e., identical across years). Both t.window and s.window should be odd numbers. The mstl() function provides a convenient automated STL decomposition using s.window=13, with t.window also chosen automatically. This usually gives a good balance between overfitting the seasonality and allowing it to slowly change over time. But, as with any automated procedure, the default settings will need adjusting for some time series.
As with the other decomposition methods discussed in this book, to obtain the separate components plotted in Figure 6.8, use the seasonal() function for the seasonal component, the trendcycle() function for the trend-cycle component, and the remainder() function for the remainder component. The seasadj() function can be used to compute the seasonally adjusted series.
6. Another decomposition option within statsmodels is STL, which is a more advanced decomposition technique. In statsmodels, the STL class requires more parameters than the seasonal_decompose function. The two additional parameters you will use are seasonal and robust.
You will use STL to decompose the co2_df DataFrame:
https://docs.bokeh.org/en/2.4.2/docs/reference/models/glyphs/scatter.html
Linking behavior — Bokeh 2.4.3 Documentation
It’s often desired to link pan or zooming actions across many plots. All that is needed to enable this feature is to share range objects between figure() calls.
When you used STL , you provided seasonal=13 because the data has an annual seasonal effect.
from statsmodels.tsa.seasonal import STL
plt.style.use('seaborn-white')
#plt.style.use('ggplot' )
# robust : Flag indicating whether to use a weighted version that
# is robust to some forms of outliers.
co2_stl = STL( co2_df, seasonal=13, robust=True ).fit()
# co2_stl.plot()
# plt.show()
from bokeh.layouts import column # row,
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
# bokeh.__version__ : '2.4.3'
source = ColumnDataSource(data={ 'date': co2_stl.observed.index,
'co2' : co2_stl.observed['CO2'], # co2_stl.observed is a dataframe
'trend': co2_stl.trend,
'seasonl': co2_stl.seasonal,
'residual': co2_stl.resid
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
ps = []
# source.data.keys() : dict_keys(['date', 'co2', 'trend', 'seasonl', 'residual'])
for col in list( source.data.keys() )[1:]:
p = figure( width=800, height=220, #background_fill_color="#fafafa"
x_axis_type="datetime",
# x_axis_label='Date',
y_axis_label=col,
x_range=ps[0].x_range if len(ps)>0 else None, ###########
y_range=ps[0].y_range if len(ps)==1 else None, ###########
)
# if col != 'residual':
p.line( x='date', y=col, source=source, line_width=2, color='blue'
# legend_label=col
)
# else:
# p.scatter( x='date', y=col, source=source, line_width=2, color='blue',
# marker='circle'
# # legend_label=col
# )
p.add_tools( HoverTool( # key
tooltips=[ ( 'Date', '@date{%F}' ),
( 'co2', '@co2{0.000}' ), # use @{ } for field names with spaces
( 'trend', '@trend{0.000}' ),
( 'seasonl', '@seasonl{0.000}' ),
( 'residual', '@residual{0.000}'),
],
formatters={ '@date' : "datetime", # use 'datetime' formatter for 'date' field
'@co2{0.000}': 'numeral', # use default 'numeral' formatter
},
# display a tooltip whenever the cursor is vertically in line with a glyph
mode='vline'
)
)
ps.append(p)
ps[3].xaxis.major_label_orientation=np.pi/4 # rotation
# https://docs.bokeh.org/en/2.4.2/docs/reference/models/tools.html
# https://docs.bokeh.org/en/latest/docs/reference/colors.html
from bokeh.models import CrosshairTool
def addLinkedCrosshairs(plots):
crosshair = CrosshairTool(dimensions="height", line_color='green', line_alpha=1)
for p in plots:
p.add_tools(crosshair)
addLinkedCrosshairs(ps)
show(column(ps))
Figure 9.17 – Decomposing the CO2 dataset with STL
Compare the output in Figure 9.17 to that in Figure 9.15. You will notice that the residual plots look different, indicating that the two methods extract the components using different mechanisms. When you used STL, you provided seasonal=13 because the data has an annual seasonal effect.
You used two different approaches for time series decomposition. Both methods decompose a time series into trend, seasonal, and residual components.
Both approaches can extract seasonality from time series to better observe the overall trend in the data.
########### STL is more robust than seasonal_decompose for measuring non-linear relationships — verify by reconstructing the original series from each method's components below:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
import numpy as np
rec_co2_stl=co2_stl.trend + co2_stl.seasonal + co2_stl.resid
source = ColumnDataSource(data={ 'date': co2_stl.observed.index,
'origin': co2_stl.observed['CO2'],
'reconstructed': rec_co2_stl,
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
p = figure( width=800, height=500,
title='Refactored(STL) VS Original models',
x_axis_type='datetime',
x_axis_label='Date',
)
# https://docs.bokeh.org/en/1.1.0/docs/user_guide/annotations.html
p.title.align = "center"
p.xaxis.major_label_orientation=np.pi/4 # rotation
p.line( x='date', y='origin', source=source, legend_label='Origin',
line_width=2, color='blue'
)
p.circle( x='date', y='reconstructed', source=source, legend_label='Reconstructed',
fill_color='white', size=3
)
p.legend.location = "top_left"
p.add_tools( HoverTool(
tooltips=[ ('Date', '@date{%F}'),
('Origin', '@origin{0.0}'),
('Reconstructed', '@reconstructed{0.0}' ),
],
formatters={'@date':'datetime',},
#model='vline'
)
)
show(p)
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
import numpy as np
rec_co2_dec=co2_decomposed.trend + co2_decomposed.seasonal + co2_decomposed.resid
source = ColumnDataSource(data={ 'date': co2_decomposed.observed.index,
'origin': co2_decomposed.observed,
'reconstructed': rec_co2_dec,
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
p = figure( width=800, height=500,
title='Refactored(seasonal_decompose) VS Original models',
x_axis_type='datetime',
x_axis_label='Date',
)
# https://docs.bokeh.org/en/1.1.0/docs/user_guide/annotations.html
p.title.align = "center"
p.xaxis.major_label_orientation=np.pi/4 # rotation
p.line( x='date', y='origin', source=source, legend_label='Origin',
line_width=2, color='blue'
)
p.circle( x='date', y='reconstructed', source=source, legend_label='Reconstructed',
fill_color='white', size=3
)
p.legend.location = "top_left"
p.add_tools( HoverTool(
tooltips=[ ('Date', '@date{%F}'),
('Origin', '@origin{0.0}'),
('Reconstructed', '@reconstructed{0.0}' ),
],
formatters={'@date':'datetime',},
#model='vline'
)
)
show(p)
Note: as before, there are missing points at the start and end of the reconstructed series because seasonal_decompose leaves the trend (and therefore the residual) undefined at the edges of the data.
###########
A time series decomposition can be used to measure the strength of trend and seasonality in a time series (Wang, Smith, & Hyndman, 2006). Recall that an additive decomposition is written as
y_t = T_t + S_t + R_t
where T_t is the smoothed trend component, S_t is the seasonal component, and R_t is a remainder component.
The Hodrick-Prescott filter is a smoothing filter that can be used to separate short-term fluctuations (cyclic variations) from long-term trends. It is implemented as the hpfilter function in the statsmodels library (statsmodels.tsa.filters.hp_filter).
Recall that STL and seasonal_decompose returned three components (trend, seasonal, and residual). On the other hand, hpfilter returns only two components: a cyclical component and a trend component.
Start by importing the hpfilter function from the statsmodels library:
lamb : float
The Hodrick-Prescott smoothing parameter. A value of 1600 is suggested for quarterly data. Ravn and Uhlig suggest using a value of 6.25 (1600/4**4) for annual data and 129600 (1600*3**4) for monthly data.
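For instance, since the CO2 data used below is monthly, a hedged sketch of passing the suggested monthly smoothing value explicitly would be the following (the call later in this recipe simply keeps the default of 1600):
from statsmodels.tsa.filters.hp_filter import hpfilter
# lamb=129600 is the value suggested for monthly data (1600 * 3**4)
co2_cyclic_m, co2_trend_m = hpfilter( co2_df['CO2'], lamb=129600 )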
The reasoning for the methodology uses ideas related to the decomposition of time series. Let y_t (for t = 1, ..., T) denote the logarithms of a time series variable. The series is made up of a trend component τ_t, a cyclical component c_t, and an error component ε_t, such that y_t = τ_t + c_t + ε_t. Given an adequately chosen, positive value of λ, there is a trend component τ_t that will solve (the HP filter removes the smooth trend τ_t from the data by solving):
min_τ  Σ_t (y_t − τ_t)² + λ Σ_t [(τ_{t+1} − τ_t) − (τ_t − τ_{t−1})]²
The statsmodels implementation (statsmodels.tsa.filters.hp_filter) solves this as a ridge-regression rule using scipy.sparse. In this sense, the solution can be written as
τ = (I_T + λ K'K)⁻¹ y
where I_T is a T x T identity matrix, T is the number of observations, and K is a (T−2) x T matrix such that
K[i,j] = 1 if i == j or i == j + 2
K[i,j] = -2 if i == j + 1
K[i,j] = 0 otherwise
The Hodrick–Prescott filter can also be written explicitly in terms of the lag operator L, as can be seen from the first-order condition for the minimization problem.
from statsmodels.tsa.filters.hp_filter import hpfilter
plt.rcParams["figure.figsize"] = (20, 3)
plt.rcParams['font.size']=12
# co2_df = pd.DataFrame( co2_ds[ 'CO2_filled_[ppm]' ] )
# co2_df.rename( columns={'CO2_filled_[ppm]':'CO2'},
# inplace=True
# )
# co2_df.dropna( inplace=True )
# co2_df = co2_df.resample('M').sum()
co2_cyclic, co2_trend = hpfilter(co2_df)
The hpfilter function returns two pandas Series: the first Series is for the cycle and the second Series is for the trend. Plot co2_cyclic and co2_trend side by side to gain a better idea of what information the Hodrick-Prescott filter was able to extract from the data:
fig, ax = plt.subplots(2, 1, figsize=(10,8))
co2_cyclic.plot( ax=ax[0], title='CO2 Cyclic Component' )
co2_trend.plot( ax=ax[1] , title='CO2 Trend Component' )
ax[0].title.set_size(20)
ax[1].title.set_size(20)
plt.subplots_adjust(hspace = 0.3)
Note that the two components from hp_filter are additive. In other words, to reconstruct the original time series, you would add co2_cyclic and co2_trend.
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
import numpy as np
rec_co2_hp= co2_trend + co2_cyclic
source = ColumnDataSource(data={ 'date': co2_df.index,
'origin': co2_df['CO2'],
'reconstructed': rec_co2_hp,
}
)
def datetime(x):
return np.array(x, dtype=np.datetime64)
p = figure( width=800, height=500,
title='Refactored(hp) VS Original models',
x_axis_type='datetime',
x_axis_label='Date',
)
# https://docs.bokeh.org/en/1.1.0/docs/user_guide/annotations.html
p.title.align = "center"
p.xaxis.major_label_orientation=np.pi/4 # rotation
p.line( x='date', y='origin', source=source, legend_label='Origin',
line_width=2, color='blue'
)
p.circle( x='date', y='reconstructed', source=source, legend_label='Reconstructed',
fill_color='white', size=3
)
p.legend.location = "top_left"
p.add_tools( HoverTool(
tooltips=[ ('Date', '@date{%F}'),
('Origin', '@origin{0.0}'),
('Reconstructed', '@reconstructed{0.0}' ),
],
formatters={'@date':'datetime',},
#model='vline'
)
)
show(p)
To learn more about hpfilter(), please visit the official documentation page here: statsmodels.tsa.filters.hp_filter.hpfilter — https://www.statsmodels.org/0.8.0/generated/statsmodels.tsa.filters.hp_filter.hpfilter.html
Several time series forecasting techniques assume stationarity. This makes it essential to understand whether the time series you are working with is stationary or non-stationary.
There are different approaches for defining stationarity; some are strict and may not be possible to observe in real-world data, referred to as strong stationarity. In contrast, other definitions are more modest in their criteria and can be observed in (or transformed into) real-world data, known as weak stationarity.
Stationarity is an essential concept in time series forecasting and is especially relevant when working with financial or economic data. The mean is considered stable and constant if the time series is stationary. In other words, there is an equilibrium: values may deviate from the mean (above or below), but eventually they always revert to it. Some trading strategies rely on this core assumption, formally called a mean reversion strategy (see https://blog.csdn.net/Linli522362242/article/details/121896073 and https://blog.csdn.net/Linli522362242/article/details/126353102).
There are a number of definitions of stationarity that you may come across in time series studies:
In this recipe, and for practical reasons, a stationary time series is defined as a time series with a constant mean (μ), a constant variance (σ²), and a consistent covariance (or autocorrelation) between identically distanced periods (lags). Having the mean and variance as constants simplifies modeling since you are not solving for them as functions of time.
Generally, a time series with trend or seasonality can be considered non-stationary. Usually, spotting trends or seasonality visually in a plot can help you determine whether the time series is stationary or not. In such cases, a simple line plot would suffice. But in this recipe, you will explore statistical tests to help you identify a stationary or non-stationary time series numerically. You will explore testing for stationarity and techniques for making a time series stationary.
The statsmodels library offers stationarity tests, such as the adfuller and kpss functions. Both are considered unit root tests and are used to determine whether differencing or other transformations are needed to make the time series stationary.
You will explore two statistical tests, the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, using the statsmodels library. Both ADF and KPSS test for unit roots in a univariate time series process. Note that unit roots are just one cause for a time series to be non-stationary, but generally, the presence of unit roots indicates non-stationarity.
Both ADF and KPSS are based on linear regression and are types of statistical hypothesis test. For example, the null hypothesis of the ADF test is that the time series has a unit root (it is non-stationary), whereas the null hypothesis of the KPSS test is that the time series is stationary.
Therefore, you will need to interpret the test results to determine whether you can reject or fail to reject the null hypothesis. Generally, you can rely on the p-values returned to decide whether you reject or fail to reject the null hypothesis. Remember, the interpretation for ADF and KPSS results is different given their opposite null hypotheses.
In this recipe, you will be using the CO2 dataset, which was previously loaded as a pandas DataFrame under the Technical requirements section of this chapter.
In addition to the visual interpretation of a time series plot to determine stationarity, a more concrete method is to use one of the unit root tests, such as the ADF or KPSS tests.
In Figure 9.13, you can spot an upward trend and a recurring (annual) seasonal pattern. However, when trend or seasonality exists (in this case, both), the time series is non-stationary. It is not always this easy to identify stationarity, or the lack of it, visually, and therefore you will rely on statistical tests.
You will use both the adfuller and KPSS tests from the statsmodels library and interpret their results knowing they have opposite null hypotheses:
from datetime import datetime
source='https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/weekly/weekly_in_situ_co2_mlo.csv'
co2_df = pd.read_csv( source,
comment='"',
sep=',',
names=['co2'],# as second column name
index_col=0, # use first column as index
parse_dates=True ,
na_values='-99.99'
)
#co2_df.set_index('Date', inplace=True,)
#co2_df.index.name=None
co2_df.dropna( inplace=True )
co2_df=co2_df.asfreq('W-SAT', 'ffill')#ffill()
#co2_df=co2_df.loc['1958-03-29':'2001-12-30']
co2_df
co2_df.plot(kind='hist', figsize=(10,8))
Run both the kpss and adfuller tests. Use the default parameter values for both functions:
from statsmodels.tsa.stattools import adfuller, kpss
adf_output = adfuller( co2_df )
kpss_output = kpss( co2_df)
adf_output
(1.2234524495363004,             # the test statistic
 0.9961439788365943,             # MacKinnon's approximate p-value
 29,                             # the number of lags used
 3341,                           # the number of observations used for the ADF regression and the calculation of the critical values
 {'1%': -3.4323087941815134,     # critical value for the test statistic at the 1% level
  '5%': -2.8624054806561885,     # critical value at the 5% level
  '10%': -2.5672307125909124},   # critical value at the 10% level
 4511.855869092864)              # the maximized information criterion if autolag is not None (default autolag='AIC')
kpss_output
To simplify the interpretation of the test results, create a function that outputs the results in a user-friendly way. Let's call the function print_results :
def print_results( output, test='adf' ):
test_score = output[0]
pval = output[1]
lags = output[2]
decision = 'Non-Stationary'
if test == 'adf':
critical = output[4]
if pval < 0.05:
decision = 'Stationary'
elif test =='kpss':
critical = output[3]
if pval >= 0.05:
decision='Stationary'
output_dict = { 'Test Statistic': test_score,
'p-value': pval,
'Numbers of lags': lags,
'decision': decision
}
for key, value in critical.items():
output_dict['Critical Value (%s)' % key] = value
return pd.Series(output_dict, name=test)
Pass both outputs to the print_results function and concatenate them into a pandas DataFrame for easier comparison:
pd.concat([ print_results(adf_output, 'adf'),
print_results(kpss_output, 'kpss')
],
axis=1
)
You will explore several techniques for making the time series stationary, such as transformations and differencing. The techniques covered below are first-order differencing, second-order (seasonal then first-order) differencing, seasonal differencing, subtracting the moving average, log transformation, square root transformation, detrending with seasonal_decompose, and extracting the cyclic component with the Hodrick-Prescott filter.
Essentially, stationarity can be achieved by removing trend (detrending) and seasonality effects. For each transformation, you will run the stationarity tests and compare the results between the different techniques. To simplify the interpretation and comparison, you will create two functions:
Create the check_stationarity function, which is a simplified rewrite of the print_results function used earlier:
def check_stationarity( df ):
kps = kpss(df)
adf = adfuller(df)
kpss_pv, adf_pv = kps[1], adf[1]
kpss_h0, adf_h0 = 'Stationary', 'Non-stationary'
if adf_pv < 0.05:
# Reject ADF Null Hypothesis
adf_h0 = 'Stationary'
if kpss_pv < 0.05:
kpss_h0 = 'Non Stationary'
return (kpss_h0, adf_h0)
#plt.rc('text', usetex=False)
def plot_comparison( methods, plot_type='line' ):
n = len(methods) // 2
fig, ax = plt.subplots( n,2, sharex=True, figsize=(20,16) )
for i, method in enumerate(methods):
method.dropna( inplace=True )
name = [n for n in globals()
if globals()[n] is method
]
row_idx, col_idx = i//2, i%2
kpss_decision, adf_decision = check_stationarity(method)
method.plot( kind=plot_type,
ax=ax[row_idx, col_idx],
legend=False,
title=f'{name[0].upper()}: KPSS={kpss_decision}, ADF={adf_decision}'
)
ax[row_idx, col_idx].title.set_size(14)
method.rolling(52).mean().plot( ax=ax[row_idx, col_idx],color='blue',
legend=False
)
Notice the blue line (color='blue') overlaid on each subplot, representing the rolling (moving) average of the series. For a stationary time series, the mean should be constant, so this line should look roughly like a straight line.
Let's implement the methods for making the time series stationary or extracting a stationary component, combine them into a Python list, and then pass the list to the plot_comparison function:
# using first order differencing (detrending)
first_order_diff = co2_df.diff(periods=1).dropna()
# using second order differencing # note diff(periods=1 default)
second_order_diff = co2_df.diff(52).diff().dropna()
# differencing to remove seasonality
disseasonalize = co2_df.diff(52).dropna()
# subtracting moving average
ma = co2_df.rolling( window=52 ).mean()
subtract_ma = co2_df - ma
# log transformation,
log_transform = np.log( co2_df )
# Square root transform
square_root = np.sqrt(co2_df)
# using seasonal_decompose to remove trend
decomp = seasonal_decompose( co2_df, model='additive' )
detrend_sd = (decomp.observed - decomp.trend)
# using Hodrick-Prescott filter(additive)
cyclic_extract, trend = hpfilter( co2_df )
# combine the methods into a list
methods = [ first_order_diff, second_order_diff,
disseasonalize, subtract_ma,
log_transform, square_root,
detrend_sd, cyclic_extract
]
This should display 4 x 2 subplots, which defaults to line charts:
import warnings
warnings.filterwarnings('ignore')
###### configurations for image quality#######
plt.rcParams["figure.figsize"] = [12, 6] ##
# plt.rcParams['figure.dpi'] = 300 ## 300 for printing
# plt.rc('font', size=8) ##
# plt.rc('axes', titlesize=10) ##
# plt.rc('axes', labelsize=12) ##
# plt.rc('xtick', labelsize=10) ##
# plt.rc('ytick', labelsize=10) ##
# plt.rc('legend', fontsize=10) ##
# plt.rc('figure', titlesize=10) ##
#############################################
plot = plot_comparison(methods)
warnings.simplefilter(action='ignore')
Figure 9.20 – Plotting the different methods to make the CO2 time series stationary
Generally, you do not want to over-difference your time series as some studies have shown that models based on over-differenced data are less accurate. For example, first_order_diff already made the time series stationary, and thus there was no need to difference it any further. In other words, second_order_diff = co2_df.diff(52).diff().dropna() was not needed. Additionally, notice how log_transform is still non-stationary.
When you decide to detrend your data, you are essentially removing an element of distraction so you can focus on hidden patterns that are not as obvious. Hence, you can build a model to capture these hidden patterns and not be overshadowed by the long-term trend (upward or downward movement).
An example was the first-order differencing approach. However, in the presence of seasonal patterns, you will need to remove the seasonal effect as well, which can be done through seasonal differencing. This is done in addition to the first-order differencing for detrending; hence it can be called second-order differencing, twice-differencing, or differencing twice, as you use differencing to remove the seasonality effect first and again to remove the trend (co2_df.diff(52).diff().dropna()). This assumes the seasonal differencing was insufficient to make the time series stationary and thus you need to detrend as well. Your goal is to use the minimal amount of differencing needed and avoid over-differencing. You will rarely need to go beyond differencing twice.
In the introduction section of this recipe, we mentioned that both ADF and KPSS use Ordinary Least Squares (OLS) regression. More specifically, OLS regression is used to compute the model's coefficients. To view the OLS results for ADF, you use the store parameter and set it to True:
# using first order differencing (detrending)
# first_order_diff = co2_df.diff(periods=1).dropna()
adf_result = adfuller( first_order_diff, store=True)
adf_result
The preceding code will return a tuple that contains the test results. The regression summary will be appended as the last item. There should be four items in the tuple:
adf_result[-1].resols.summary()
The ResultStore object gives you access to .resols, which contains the .summary() method. This should produce the following output:
Figure 9.21 – ADF OLS regression summary and the first 30 lags and their coefficients
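If you prefer named variables over indexing with [-1], you can unpack the tuple directly. This is a small sketch assuming, as described above, that adfuller with store=True returns four items (test statistic, p-value, critical values, and the ResultStore):
adf_stat, pvalue, crit_values, res_store = adf_result
print('ADF statistic:', adf_stat)
print('p-value:', pvalue)
print('Critical values:', crit_values)
print(res_store.resols.summary())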
Time series data can be complex, and embedded within the data is critical information that you will need to understand and inspect to determine the best approach for building a model. For example, you have explored time series decomposition, understood the impact of trend and seasonality, and tested for stationarity. In the previous recipe, Detecting time series stationarity, you examined techniques to transform data from non-stationary to stationary, including detrending, which attempts to stabilize the mean over time.
Depending on the model and analysis you are pursuing, you may need to test for additional assumptions against the observed dataset or the model's residuals, for example, testing for normality and homoskedasticity.
Therefore, it is important to be aware of the assumptions made by specific models or techniques so you can determine which test to use and against which dataset. If you do not, you may end up with a flawed model or an outcome that is overly optimistic or overly pessimistic.
Additionally, in this recipe, you will learn about the Box-Cox transformation, which you can use to transform the data to satisfy normality and homoskedasticity. The Box-Cox transformation takes the following form: y(λ) = (y^λ − 1)/λ when λ ≠ 0, and y(λ) = ln(y) when λ = 0.
Figure 9.22 – Box-Cox transformation
The Box-Cox transformation relies on just one parameter, lambda ( λ ), and covers both logarithm and power transformations.
The approach is to try different values of λ and then test for normality and homoskedasticity. For example, the SciPy library has the boxcox function, and you can specify different λ values using the lmbda parameter (yes, that is how it is spelled in the implementation, since lambda is a reserved Python keyword). If the lmbda parameter is set to None, the function will find the optimal lambda (λ) value for you. A short sketch follows this paragraph.
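As a quick illustration (a sketch using the positive-valued CO2 series already in memory, not a dataset prescribed by this recipe), passing an explicit lmbda applies that specific member of the power-transform family:
import numpy as np
from scipy.stats import boxcox
y = co2_df['co2'].values
log_equiv = boxcox(y, lmbda=0)      # lambda = 0 corresponds to the log transform
sqrt_scaled = boxcox(y, lmbda=0.5)  # lambda = 0.5 is a rescaled square root transform
print(np.allclose(log_equiv, np.log(y)))               # True
print(np.allclose(sqrt_scaled, 2 * (np.sqrt(y) - 1)))  # True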
In this recipe, you will extend what you learned from the previous recipe, Detecting time series stationarity, and test for two additional assumptions: normality and homoskedasticity.
Usually, stationarity is the most crucial assumption you will need to worry about, but being familiar with additional diagnostic techniques will serve you well.
Sometimes, you can determine normality and homoskedasticity from plots, for example, a histogram or a Q-Q plot. This recipe aims to teach you how to perform these diagnostic tests programmatically in Python. In addition, you will be introduced to the White test and the Breusch-Pagan Lagrange multiplier test for homoskedasticity.
For normality diagnostics, you will explore the Shapiro-Wilk, D'Agostino-Pearson, and Kolmogorov-Smirnov statistical tests. Overall, Shapiro-Wilk tends to perform best and handles a broader set of cases.
The statsmodels library and the SciPy library have overlapping implementations. For example, the Kolmogorov-Smirnov test is implemented as kstest in SciPy and kstest_normal in statsmodels. In SciPy, the D'Agostino-Pearson test is implemented as normaltest and the Shapiro-Wilk test as shapiro.
The normality diagnostic is a statistical test based on a null hypothesis that you need to determine whether you can accept or reject. Conveniently, the tests that you will implement below share the same null hypothesis: the data is normally distributed. You would reject the null hypothesis if the p-value is less than 0.05, concluding that the time series is not normally distributed. Let's create a simple function, is_normal(), that will return either Normal or Not Normal based on the p-value:
from scipy.stats import shapiro, kstest, normaltest
from statsmodels.stats.diagnostic import kstest_normal, normal_ad
def is_normal( test, p_level=0.05, name='' ):
    stat, pvalue = test
    print( name + ' test' )
    print( 'statistic: ', stat )
    print( 'p-value:', pvalue )
    return 'Normal' if pvalue > p_level else 'Not Normal'
normal_args = ( np.mean(co2_df), np.std(co2_df) )
# The Shapiro-Wilk test tests the null hypothesis that
# the data was drawn from a normal distribution.
print( is_normal( shapiro(co2_df), name='Shapiro-Wilk' ) )
# Test whether a sample differs from a normal distribution.
# statistic: z-score = (x-mean)/std
# s^2 + k^2, where s is the z-score returned by skewtest
# and k is the z-score returned by kurtosistest.
print( is_normal( normaltest(co2_df), name='normaltest' ) )
# Anderson-Darling test for normal distribution unknown mean and variance.
print( is_normal( normal_ad(co2_df), name='Anderson-Darling' ) )
# Test assumed normal or exponential distribution using Lilliefors’ test.
# Kolmogorov-Smirnov test statistic with estimated mean and variance.
print( is_normal( kstest_normal(co2_df), name='Kolmogorov-Smirnov') )
# The one-sample test compares the underlying distribution F(x) of
# a sample against a given distribution G(x).
# The two-sample test compares the underlying distributions of
# two independent samples.
# Both tests are valid only for continuous distributions.
print( is_normal( kstest(co2_df, cdf='norm', args=normal_args), name='KS' ))
The output from the tests confirms that the data does not come from a normal distribution. You do not need to run that many tests; the Shapiro-Wilk (shapiro) test, for example, is a very common and popular test that you can rely on. Generally, as with any statistical test, you need to read the documentation regarding the implementation to gain an understanding of the test. More specifically, you need to understand the null hypothesis behind the test to determine whether you can reject or fail to reject it.
###########################
scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', method='auto')
The one-sample test compares the underlying distribution F(x) of a sample against a given distribution G(x). The two-sample test compares the underlying distributions of two independent samples. Both tests are valid only for continuous distributions.
There are three options for the null and corresponding alternative hypothesis that can be selected using the alternative parameter.
two-sided:
The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x;
the alternative is that they are not identical.
less:
The null hypothesis is that F(x) >= G(x) for all x;
the alternative is that F(x) < G(x) for at least one x.
greater:
The null hypothesis is that F(x) <= G(x) for all x;
the alternative is that F(x) > G(x) for at least one x.
Note that the alternative hypotheses describe the CDFs of the underlying distributions, not the observed values. For example, suppose x1 ~ F and x2 ~ G. If F(x) > G(x) for all x, the values in x1 tend to be less than those in x2.
Suppose we wish to test the null hypothesis that a sample is distributed according to the standard normal. We choose a confidence level of 95%; that is, we will reject the null hypothesis in favor of the alternative if the p-value is less than 0.05.
When testing uniformly distributed data, we would expect the null hypothesis to be rejected.
from scipy import stats
rng = np.random.default_rng()
stats.kstest(stats.uniform.rvs(size=100, random_state=rng),
stats.norm.cdf)
Indeed, the p-value is lower than our threshold of 0.05, so we reject the null hypothesis in favor of the default “two-sided” alternative: the data are not distributed according to the standard normal.
When testing random variates from the standard normal distribution, we expect the data to be consistent with the null hypothesis most of the time.
x = stats.norm.rvs(size=100, random_state=rng)
stats.kstest(x, stats.norm.cdf)
As expected, the p-value of 0.75 is not below our threshold of 0.05, so we cannot reject the null hypothesis.
Suppose, however, that the random variates are distributed according to a normal distribution that is shifted toward greater values. In this case, the cumulative distribution function (CDF) of the underlying distribution tends to be less than the CDF of the standard normal. Therefore, we would expect the null hypothesis to be rejected with alternative='less':
x = stats.norm.rvs(size=100, loc=0.5, random_state=rng)
stats.kstest(x, stats.norm.cdf, # or "norm"
alternative='less')
and indeed, with p-value smaller than our threshold, we reject the null hypothesis in favor of the alternative.
The examples above have all been one-sample tests identical to those performed by ks_1samp. Note that kstest can also perform two-sample tests identical to those performed by ks_2samp. For example, when two samples are drawn from the same distribution, we expect the data to be consistent with the null hypothesis most of the time.
sample1 = stats.laplace.rvs(size=105, random_state=rng)
sample2 = stats.laplace.rvs(size=95, random_state=rng)
stats.kstest(sample1, sample2)
As expected, the p-value of 0.45 is not below our threshold of 0.05, so we cannot reject the null hypothesis.
###########################
Sometimes, you may need to test normality as part of model evaluation and diagnostics. For example, you would evaluate the residuals (defined as the difference between actual and predicted values) to check whether they follow a normal distribution. In Chapter 10, Building Univariate Time Series Models Using Statistical Methods, you will explore building forecasting models using autoregressive and moving average models. For now, you will run a simple autoregressive (AR(1)) model to demonstrate how you can use a normality test against the residuals of a model:
import statsmodels.tsa.api as smt
fig, ax = plt.subplots(figsize = (12,8))
smt.graphics.plot_pacf( co2_df,
lags=26,
ax = ax,
auto_ylims=True,
zero=True # Flag indicating whether to include the 0-lag autocorrelation.
)
plt.show()
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg( co2_df.dropna(),
                 lags=1,    # AR(1)
                 # trend: 'n'-No trend, 'c'-Constant only (default),
                 #        't'-Time trend only, 'ct'-Constant and time trend.
                 #trend='n',
               ).fit()
model.summary()
You can run the shapiro test against the residuals. To access the residuals, use the .resid property, as in model.resid. This is common in many models you will build in Chapter 10, Building Univariate Time Series Models Using Statistical Methods:
print( is_normal( shapiro(model.resid) ) )
model = AutoReg( co2_df.diff(periods=1).dropna(),
                 lags=1,    # AR(1)
                 trend='n', # 'n'-No trend, 'c'-Constant only,
                            # 't'-Time trend only, 'ct'-Constant and time trend.
               ).fit()
model.summary()
print( is_normal( shapiro(model.resid) ) )
###### detrending + deseasonalizing (removing trend and seasonality)
co2_decomposed = seasonal_decompose( co2_df, period=13, #seasonal=13 because the data has an annual seasonal effect.
model='additive' )
print( is_normal( shapiro(co2_decomposed.resid.dropna()) ) )
smt.graphics.plot_acf(co2_decomposed.resid.dropna())
plt.show()
smt.graphics.plot_pacf(co2_decomposed.resid.dropna())
plt.show()
model = AutoReg( co2_decomposed.resid.dropna(),
                 lags=1,    # AR(1)
                 trend='n', # 'n'-No trend, 'c'-Constant only,
                            # 't'-Time trend only, 'ct'-Constant and time trend.
               ).fit()
print( is_normal( shapiro(model.resid.dropna()) ) )
smt.graphics.plot_acf(model.resid.dropna())
plt.show()
####
# co2_decomposed = seasonal_decompose( co2_df, model='additive' )
co2_stl = STL( co2_df, seasonal=13, #seasonal=13 because the data has an annual seasonal effect.
robust=True ).fit()
print( is_normal( shapiro(co2_stl.resid.dropna()) ) )
smt.graphics.plot_acf(co2_stl.resid.dropna())
plt.show()
smt.graphics.plot_pacf(co2_stl.resid.dropna())
plt.show()
model = AutoReg( co2_stl.resid.dropna(),
                 lags=1,    # AR(1)
                 trend='n', # 'n'-No trend, 'c'-Constant only,
                            # 't'-Time trend only, 'ct'-Constant and time trend.
               ).fit()
print( is_normal( shapiro(model.resid.dropna()) ) )
smt.graphics.plot_acf(model.resid.dropna())
plt.show()
The output indicates that the residuals are not normally distributed. This fact is not, by itself, enough to determine the model's validity or potential improvements, but taken in context with the other tests, it should help you determine how good your model is. This is a topic you will explore further in the next chapter.
model.plot_diagnostics()
plt.show()
In addition to normality, you may need to test the model's residuals for homoskedasticity, that is, whether their variance is constant over time.
You will be testing for the stability of the variance against the model's residuals. This will be the same AR(1) model used in the previous normality test:
You will perform a homoskedasticity test on the model's residuals. As stated earlier regarding statistical tests, it is vital to understand the hypothesis behind them. For both tests, the null hypothesis states that the data is homoskedastic; for example, you would reject the null hypothesis if the p-value is less than 0.05, concluding that the residuals are heteroskedastic.
statsmodels.stats.diagnostic.het_breuschpagan(resid, exog_het, robust=True):
Breusch-Pagan Lagrange Multiplier test for heteroscedasticity
The test checks the hypothesis that the residual variance does not depend on the variables in x, in the form σ_i^2 = σ^2 · f(α_0 + α'z_i), where the functional form f is not specified.
Homoskedasticity implies that α = 0.
Notes
The function assumes x contains a constant (needed for counting the degrees of freedom and for the R² calculation). Here, R² = 1 − SSE/SST, where SSE is the Sum of Squared Errors (the sum of squared residuals), that is, the unexplained variance, and SST is the Total Sum of Squares (the total variance) around the average actual value of y; SSE is just a rescaled version of the MSE (MSE = SSE/n). n is the number of cases (samples) used to fit the model and k is the number of predictor variables (features) in the model (see https://blog.csdn.net/Linli522362242/article/details/121551663).
In the general description of the Lagrange multiplier (LM) test, Greene mentions that this test exaggerates the significance of results in small or moderately large samples. In this case, the F-statistic (3.23) is preferable.
p: the number of predictor variables (features) used to fit the model.
In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = ... = βp = 0. As in the simple linear regression setting, we use a hypothesis test to answer this question: we test the null hypothesis H0: β1 = β2 = ... = βp = 0 against the alternative Ha: at least one βj is non-zero, using the F-statistic in (3.23) below.
If the linear model assumptions are correct, one can show that E[RSS/(n − p − 1)] = σ^2 and that, provided H0 is true, E[(TSS − RSS)/p] = σ^2. Hence, when there is no relationship between the response and the predictors (features), one would expect the F-statistic to take on a value close to 1. On the other hand, if Ha is true, then E[(TSS − RSS)/p] > σ^2, so we expect F to be greater than 1.
####################
Cochran's theorem: if the samples are independent and follow a normal distribution with mean μ and variance σ^2, then (n − 1)S^2/σ^2 follows a χ2 distribution with n − 1 degrees of freedom.
We assume the error terms ε_i are independent and identically distributed with mean 0 and variance σ^2. When the assumed linear model is correct, RSS/σ^2 (conditional on X) follows a χ2 distribution with n − p − 1 degrees of freedom, so E[RSS/(n − p − 1)] = σ^2. When H0 also holds, (TSS − RSS)/σ^2 follows a χ2 distribution with p degrees of freedom (otherwise its distribution would still depend on the coefficients), so E[(TSS − RSS)/p] = σ^2. This leads to the F-statistic:
F = ((TSS − RSS)/p) / (RSS/(n − p − 1))    (3.23)
Notice that in Table 3.4, for each individual predictor a t-statistic and a p-value were reported. These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors. It turns out that each of these are exactly equivalent to the F-test that omits that single variable from the model, leaving all the others in—i.e. q=1 in (3.24). So it reports the partial effect of adding that variable to the model. For instance, as we discussed earlier, these p-values(<0.05) indicate that TV and radio are related to sales , but that there is no evidence that newspaper is associated with sales , in the presence of these two.
####################
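To make equation (3.23) concrete, here is a small self-contained sketch on simulated data (the numbers are illustrative only and not part of the recipe) that computes the F-statistic by hand and checks it against the value statsmodels reports for an OLS fit:
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(42)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)
res = sm.OLS(y, sm.add_constant(X)).fit()
tss = ((y - y.mean()) ** 2).sum()   # total sum of squares
rss = (res.resid ** 2).sum()        # residual (error) sum of squares
f_manual = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_manual, res.fvalue)         # the two values should agree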
Verification
The chi-square test statistic matches the result of bptest in R's stats package with default settings (studentize=True) to within 1e-13.
Let's create a small function, called het_test(model, test), that takes in a model and a test function and returns either Heteroskedastic or Homoskedastic based on the p-value, to determine whether the null hypothesis is accepted or rejected:
from statsmodels.datasets import co2
co2_df = co2.load_pandas().data
co2_df = co2_df.ffill()
co2_df
# from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(co2_df.dropna(), lags=1, trend='n').fit()
model.summary()
print( is_normal(shapiro(model.resid)) )
print( is_normal(normaltest(model.resid)) )
print( is_normal(normal_ad(model.resid)) )
print( is_normal(kstest_normal(model.resid)) )
print( is_normal( kstest( model.resid,
cdf='norm',
args=( np.mean(model.resid),
np.std(model.resid)
)
)
)
)
plt.hist(model.resid)
from statsmodels.graphics.gofplots import qqplot
qqplot(model.resid, line='q')
plt.show()
smt.graphics.plot_acf(model.resid)
plt.show()
model.plot_diagnostics()
plt.show()
from statsmodels.stats.api import( het_breuschpagan,
het_goldfeldquandt,
het_white
)
from statsmodels.tools.tools import add_constant
def het_test( model, test=het_breuschpagan ):
    # lm: Lagrange multiplier statistic
    # lm_pvalue: p-value of the Lagrange multiplier test
    # fvalue: F-statistic of the hypothesis that the error variance does not depend on x
    # f_pvalue: p-value for the F-statistic
    lm, lm_pvalue, fvalue, f_pvalue = test( model.resid,
                                            add_constant(model.fittedvalues) )
    return 'Heteroskedastic' if f_pvalue < 0.05 else 'Homoskedastic'
Start with the Breusch-Pagan Lagrange multiplier test to diagnose the residuals. In statsmodels, you will use the het_breuschpagan function, which takes resid, the model's residuals, and exog_het, where you provide the explanatory variables related to the heteroskedasticity in the residuals:
het_test( model, test=het_breuschpagan)
This is consistent with the standardized residuals plot produced by plot_diagnostics() above.
This result indicates that the residual is homoskedastic, with a constant variance (stable).
A very similar test is White's Lagrange multiplier test. In statsmodels, you will use the het_white function, which takes the same two parameters you used with het_breuschpagan:
het_test( model, test=het_white )
plt.scatter(co2_df['co2'].values[1:],model.resid)
Both tests indicate that the residuals of the autoregressive model have constant variance (they are homoskedastic). Both tests estimate an auxiliary regression of the squared residuals against the explanatory variables.
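If you want to see the mechanics, here is a hedged sketch of that auxiliary regression, assuming model is the fitted AutoReg(1) from the steps above. The Lagrange multiplier statistic is n times the R² of the regression of the squared residuals on the explanatory variables, which should match (up to the studentized variant statsmodels implements) the first value returned by het_breuschpagan:
import statsmodels.api as sm
resid_sq = model.resid ** 2
exog = sm.add_constant(model.fittedvalues)  # same explanatory variables passed to het_breuschpagan
aux = sm.OLS(resid_sq, exog).fit()
lm_manual = len(resid_sq) * aux.rsquared    # LM = n * R-squared of the auxiliary regression
print('LM statistic (manual):', lm_manual)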
Keep in mind that both normality and homoskedasticity are some of the tests you may need to conduct on the residuals as you diagnose your model. Another essential test is testing for autocorrelation, which is discussed in the following recipe, Testing for autocorrelation in time series data.
from datetime import datetime
source='https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/weekly/weekly_in_situ_co2_mlo.csv'
co2_df = pd.read_csv( source,
comment='"',
sep=',',
names=['co2'],# as second column name
index_col=0, # use first column as index
parse_dates=True ,
na_values='-99.99'
)
#co2_df.set_index('Date', inplace=True,)
#co2_df.index.name=None
co2_df.dropna( inplace=True )
co2_df=co2_df.asfreq('W-SAT', 'ffill')#ffill()
#co2_df=co2_df.loc['1958-03-29':'2001-12-30']
co2_df
# from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(co2_df.dropna(), lags=1, trend='n').fit()
model.summary()
print( is_normal(shapiro(model.resid)) )
print( is_normal(normaltest(model.resid)) )
print( is_normal(normal_ad(model.resid)) )
print( is_normal(kstest_normal(model.resid)) )
print( is_normal( kstest( model.resid,
cdf='norm',
args=( np.mean(model.resid),
np.std(model.resid)
)
)
)
)
smt.graphics.plot_acf(model.resid)
plt.show()
model.plot_diagnostics()
plt.show()
The results are consistent with the standardized residuals plot produced by plot_diagnostics() above. Run both homoskedasticity tests against the residuals:
het_test( model, test=het_breuschpagan)
het_test( model, test=het_white )
plt.scatter(co2_df['co2'].values[1:],model.resid)
The Box-Cox transformation can be a useful tool, and it is good to be familiar with it. Box-Cox transforms a non-normally distributed dataset into a normally distributed one. At the same time, it stabilizes the variance, making the data homoskedastic. To gain a better understanding of the effect of the Box-Cox transformation, you will use the Air Passengers dataset, which contains both trend and seasonality:
from scipy.stats import boxcox
airp_df = pd.read_csv('air_passenger.csv')  # the 'date' column is parsed below
# and setting as dataframe index
airp_df.set_index('date', inplace = True)
airp_df.index = pd.to_datetime( airp_df.index )
airp_df = airp_df.resample('M').sum()
airp_df
The airp_df data shows a long-term linear (upward) trend and seasonality. However, the seasonal fluctuations seem to be increasing as well, indicating a multiplicative model. (A multiplicative model is suitable when the seasonal variation fluctuates over time; in other words, when the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, a multiplicative decomposition is more appropriate.)
Box-Cox allows us to make the data both normal and homoskedastic and is part of a family of power transforms that includes log transform and square root transform. Box-Cox is a powerful transform because it supports both root and log transforms, and others are made possible by changing the lambda values.
Note
One thing to point out is that the boxcox function requires the data to be positive. Some formulations of the Box-Cox transformation provide a shift parameter to achieve this; boxcox does not. Such a shift parameter is equivalent to adding a positive constant to x before calling boxcox.
However, Box and Cox did propose a second, two-parameter formula that can be used for non-positive y-values: y(λ) = ((y + λ2)^λ1 − 1)/λ1 when λ1 ≠ 0, and y(λ) = ln(y + λ2) when λ1 = 0, where λ2 is a shift chosen so that y + λ2 > 0.
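Here is a minimal sketch of the shift-then-transform workaround; data is a hypothetical array (not one of this chapter's datasets) that contains zero and negative values:
import numpy as np
from scipy.stats import boxcox
data = np.arange(-3.0, 10.0, 0.5)  # hypothetical series with non-positive values
shift = 1 - data.min()             # any constant that makes min(data + shift) > 0
xt, lmbda = boxcox(data + shift)   # equivalent to a two-parameter Box-Cox with lambda_2 = shift
print('lambda:', lmbda)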
The confidence limits returned when alpha is provided give the interval where llf(λ*) − llf(λ) < ½ χ2(1 − α, 1), that is, half of the (1 − α) quantile of the chi-squared distribution with 1 degree of freedom (α is the significance level and λ* is the maximum-likelihood estimate), with llf the log-likelihood function and χ2 the chi-squared function.
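For example, here is a sketch of requesting those confidence limits via the alpha parameter (using the airp_df Air Passengers data loaded earlier in this recipe); with alpha provided and lmbda left as None, boxcox returns a third item holding the interval:
from scipy.stats import boxcox
xt, lmbda, (ci_lower, ci_upper) = boxcox(airp_df['passengers'], alpha=0.05)
print('lambda:', lmbda)
print('95% confidence interval for lambda:', (ci_lower, ci_upper))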
##############
The BUPA liver data set contains data on the liver enzymes ALT and γGT. Suppose we are interested in using log(γGT) to predict ALT. A plot of the data appears in panel (a) of the figure. There appears to be non-constant variance, and a Box–Cox transformation might help.
The log-likelihood of the power parameter appears in panel (b). The horizontal reference line is at a distance of χ2(0.95, 1)/2 from the maximum and can be used to read off an approximate 95% (1 − α = 0.95) confidence interval for λ. It appears as though a value close to zero would be good, so we take logs.
Possibly, the transformation could be improved by adding a shift parameter to the log transformation. Panel (c) of the figure shows the log-likelihood. In this case, the maximum of the likelihood is close to zero, suggesting that a shift parameter is not needed.
The final panel shows the transformed data with a superimposed regression line.
Note that although Box–Cox transformations can make big improvements in model fit, there are some issues that the transformation cannot help with. In the current example, the data are rather heavy-tailed, so the assumption of normality is not realistic, and a robust regression approach leads to a more precise model.
##############
Recall from the introduction section of this recipe and Figure 9.22 that there is a lambda parameter that determines which transformation to apply (logarithm or power transform). Use the boxcox function with the default value for lmbda, which is None. Just provide the dataset to satisfy the required x parameter:
# xt: Box-Cox power transformed array.
# maxlog : If the lmbda parameter is None, the second returned argument
# is the lmbda that maximizes the log-likelihood function.
xt, lmbda = boxcox( airp_df['passengers'], lmbda=None )
print('lambda:', lmbda)
xts = pd.Series(xt, index=airp_df.index)
xts
By not providing a value for lmbda and keeping it at None, the function will find the optimal lambda (λ) value. From the introduction of this recipe, you will recall that lambda is spelled lmbda in the boxcox implementation. The function returns two values: xt for the transformed data and lmbda for the optimal lambda value found.
A histogram can visually show the impact of the transformation:
fig, ax = plt.subplots( 1,2 )
airp_df.hist( ax=ax[0] )
xts.hist( ax=ax[1] )
ax[1].set_title('Box-Cox Transformed')
plt.show()
The second histogram shows that the data was transformed, and the overall distribution changed. It would be interesting to examine the dataset as a time series plot.
from scipy import stats
fig, ax = plt.subplots( 1,2 )
stats.probplot(airp_df['passengers'].values, dist=stats.norm, plot=ax[0])
prob = stats.probplot(xts, dist=stats.norm, plot=ax[1])
ax[1].set_title('Box-Cox Transformed')
plt.show()
Plot both datasets to compare before and after the transformation:
fig, ax = plt.subplots(1,2, figsize=(14,8))
airp_df.plot( ax=ax[0] )
ax[0].set_title( 'Original Time Series' )
xts.plot( ax=ax[1] )
ax[1].set_title( 'Box-Cox Transformed' )
plt.show()
Figure 9.24 – Box-Cox transformation and overall effect on time series data
Notice how the seasonal effect on the transformed dataset looks more stable than before.
Finally, build two simple autoregressive models to compare the effect on the residuals before and after the transformation:
model_airp = AutoReg( airp_df, lags=1, trend='n' ).fit()
model_box = AutoReg( xts, lags=1, trend='n' ).fit()
fig, ax = plt.subplots( 1,2, figsize=(16,8) )
model_airp.resid.plot( ax=ax[0] )
ax[0].set_title('Residuals Plot - Regular Time Series')
model_box.resid.plot( ax=ax[1] )
ax[1].set_title('Residuals Plot - Box-cox Transformed')
plt.show()
The AutoReg model comes with two useful methods: diagnostic_summary() and plot_diagnostics() . They will save you time from having to write additional code to test the model's residuals for normality, homoskedasticity, and autocorrelation.
print( model_box.diagnostic_summary() )
This should display the results of the Ljung-Box test for autocorrelation, the normality test, and the homoskedasticity test against the model's residuals.
In addition to looking at the ACF plot, we can also do a more formal test for autocorrelation by considering a whole set of values as a group, rather than treating each one separately.
Recall that r_k is the autocorrelation for lag k. When we look at the ACF plot to see whether each spike is within the required limits, we are implicitly carrying out multiple hypothesis tests, each one with a small probability of giving a false positive. (Approximate (1 − α) × 100% significance bounds are given by ±z(1 − α/2)/√n, and values lying outside either of these bounds are statistically non-zero. In the plot, this is the shaded blue area, which depicts the 95% confidence interval and acts as the significance threshold: anything within the shaded area is statistically close to zero, and anything outside it is statistically non-zero, that is, statistically significant.)
Here, a false positive (FP) means that the non-target class (the residuals have no autocorrelation) is wrongly classified as the target class (the residuals have autocorrelation). When enough of these tests are done, it is likely that at least one will give a false positive, and so we may conclude that the residuals have some remaining autocorrelation when in fact they do not.
In order to overcome this problem, we test whether the first h autocorrelations are significantly different from what would be expected from a white noise process. A test for a group of autocorrelations is called a portmanteau test, from a French word describing a suitcase or coat rack carrying several items of clothing.
Box-Pierce test: Q = T Σ_{k=1..h} r_k², where h is the maximum lag being considered and T is the number of observations. A common suggestion is h = 10 for non-seasonal data and h = 2m for seasonal data (where m is the period of seasonality). However, the test is not good when h is large, so if these values are larger than T/5, then use h = T/5.
(More accurate) Ljung-Box test: Q* = T(T + 2) Σ_{k=1..h} r_k²/(T − k).
Large values of Q* suggest that the autocorrelations do not come from a white noise series (there is autocorrelation).
How large is too large? If the autocorrelations did come from a white noise series, then both Q and Q* would have a χ2 distribution with (h − K) degrees of freedom, where K is the number of parameters in the model. If they are calculated from raw data (rather than the residuals from a model), then set K = 0.
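When testing a model's residuals rather than raw data, statsmodels lets you pass the number of model parameters through the model_df argument so the degrees of freedom become h − K. A short sketch, assuming model_box is the AR(1) model fitted earlier in this recipe:
from statsmodels.stats.diagnostic import acorr_ljungbox
# K = 1 for the AR(1) model, so the chi-squared test uses h - 1 degrees of freedom
acorr_ljungbox(model_box.resid, lags=[10], model_df=1, return_df=True)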
From the following Ljung-Box results, the p-values are less than 0.05, so you reject the null hypothesis, and there is autocorrelation.
The Jarque-Bera test checks whether the sample data has skewness and kurtosis matching a normal distribution. If the p-value is below 0.05, then at the 5% significance level (α = 0.05) we reject the null hypothesis that the data is normally distributed.
print( is_normal(shapiro(model_box.resid)) )
print( is_normal(normaltest(model_box.resid)) )
print( is_normal(normal_ad(model_box.resid)) )
print( is_normal(kstest_normal(model_box.resid)) )
print( is_normal( kstest( model_box.resid,
cdf='norm',
args=( np.mean(model_box.resid),
np.std(model_box.resid)
)
)
)
)
By looking at metrics such as the mean, standard deviation, skewness, and kurtosis, we can infer that they deviate from what we would expect under normality. Additionally, the Jarque-Bera normality test gives us reason to reject the null hypothesis that the distribution is normal at the 95% confidence level (α = 0.05).
model_box.plot_diagnostics(figsize=(12,10))
plt.show()
Diagnostic plots for the standardized residuals:
- Standardized residuals over time
- Histogram plus estimated density of the standardized residuals, along with a Normal(0,1) density and a KDE for reference
- Normal Q-Q plot, with a Normal reference line
- Correlogram
The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at λ = 0.148. The .plot_diagnostics() function will show four plots so you can examine the model's residuals. Mainly, the plots show whether the residuals are normally distributed, via the Q-Q plot and the histogram (here, they are not). Additionally, the autocorrelation function (ACF) plot allows you to examine autocorrelation. You will examine ACF plots in more detail in the Plotting ACF and PACF recipe in Chapter 10, Building Univariate Time Series Models Using Statistical Methods.
Autocorrelation is like statistical correlation (think of the Pearson correlation from high school; see https://blog.csdn.net/Linli522362242/article/details/121721868), which measures the strength of a linear relationship between two variables, except that we measure the linear relationship between time series values separated by a lag. In other words, we are comparing a variable with a lagged version of itself.
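As a quick sketch of that idea (assuming the co2_df DataFrame with its 'co2' column is still in memory), pandas can compute the Pearson correlation between a series and a lagged copy of itself with Series.autocorr:
lag1 = co2_df['co2'].autocorr(lag=1)    # correlation with the value one week earlier
lag52 = co2_df['co2'].autocorr(lag=52)  # correlation with the value one year (52 weeks) earlier
print(f'lag 1: {lag1:.3f}, lag 52: {lag52:.3f}')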
In this recipe, you will perform a Ljung-Box test to check for autocorrelation.
When running the test using acorr_ljungbox from statsmodels, you need to provide a lag value. The test will run for all lags up to the specified lag (maximum lag).
The autocorrelation test is another helpful test for model diagnostics. As discussed in the previous recipe, Applying power transformations, there are assumptions that you need to test against the model's residuals. For example, when testing for autocorrelation on the residuals, the expectation is that the residuals show no autocorrelation, that is, they behave like white noise.
from statsmodels.datasets import co2
co2_df = co2.load_pandas().data
co2_df = co2_df.ffill()
co2_df
Since the data is not stationary (review the Detecting time series stationarity recipe), you will perform a log transform followed by first-order differencing this time (log differencing):
Run the Ljung-Box test. Start with lags=10:
from statsmodels.stats.diagnostic import acorr_ljungbox
co2_diff = np.log(co2_df).diff().dropna()
acorr_ljungbox( co2_diff, lags=10, return_df=True )
lb_stat - The Ljung-Box test statistic.
lb_pvalue - The p-value based on chi-square distribution. The p-value is computed as 1 - chi2.cdf(lb_stat, dof) (cumulative distribution function)where dof is lag - model_df. If lag - model_df <= 0, then NaN is returned for the pvalue.(model_df: Number of degrees of freedom consumed by the model. In an ARMA model, this value is usually p+q where p is the AR order and q is the MA order. This value is subtracted from the degrees-of-freedom used in the test so that the adjusted dof for the statistics are lags - model_df. If lags - model_df <= 0, then NaN is returned.)
bp_stat - The Box-Pierce test statistic.
bp_pvalue - The p-value based for Box-Pierce test on chi-square distribution. The p-value is computed as 1 - chi2.cdf(bp_stat, dof) where dof is lag - model_df. If lag - model_df <= 0, then NaN is returned for the pvalue.
This shows that the test statistics for all lags up to lag 10 are significant (p-value < 0.05), so you can reject the null hypothesis (that the previous lags are not correlated with the current period). Rejecting the null hypothesis means you reject the claim that there is no autocorrelation.
acorr_ljungbox is a function that accumulates autocorrelation up to the lag specified. Therefore, it is helpful for determining whether the structure is worth modeling in the first place.
Let's use the Ljung-Box test against the residual from model_box that was created in the Applying power transformations recipe:
acorr_ljungbox( model_box.resid, return_df=True, lags=10 )
From the preceding example, the p-values are less than 0.05, so you reject the null hypothesis (that the previous lags are not correlated with the current period), and there is autocorrelation.