Persistence Model

A good baseline for time series forecasting is the persistence model.


This is a forecasting model where the last observation is persisted forward. Because of its simplicity, it is often called the naive forecast.
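
As a minimal sketch of the idea, the snippet below repeats the last known observation for every step of the forecast horizon. The helper name persistence_forecast, its n_seq parameter, and the sample values are assumptions made for illustration; they are not part of the listing in this section.

```python
# hypothetical helper: persist the last observation forward for n_seq steps
def persistence_forecast(last_ob, n_seq):
    return [last_ob for _ in range(n_seq)]

# made-up history, used only to show the behaviour
history = [10.0, 12.5, 11.0]
forecast = persistence_forecast(history[-1], 3)
print(forecast)
```

Every forecast step carries the same value, which is why the approach works as a baseline: any model worth using should beat it.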


Prepare Data

The first step is to transform the data from a series into a supervised learning problem.


That is, we go from a list of numbers to a list of input and output patterns. We can achieve this using a pre-prepared function called series_to_supervised().


The function is listed as part of the complete example below.


The function can be called by passing in the loaded series values, an n_in value of 1, and an n_out value of 3.
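
To make the framing concrete, here is a small sketch of what one lagged input and three outputs look like for a toy series. It uses pandas shift() directly rather than series_to_supervised(), and the six sample values are invented for illustration.

```python
from pandas import DataFrame, concat

# toy series standing in for the loaded values (illustrative only)
values = [1, 2, 3, 4, 5, 6]
df = DataFrame(values)

# one lagged input column (t-1) followed by three output columns (t, t+1, t+2)
cols = [df.shift(1), df.shift(0), df.shift(-1), df.shift(-2)]
framed = concat(cols, axis=1)
framed.columns = ['var1(t-1)', 'var1(t)', 'var1(t+1)', 'var1(t+2)']

# rows that lack a full set of inputs and outputs contain NaN and are dropped
framed = framed.dropna()
print(framed)
```

Six observations yield only three complete rows: one row is lost at the start for the lag and two at the end for the extra forecast steps.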


Next, we can split the supervised learning dataset into training and test sets.


We know that in this form, the last 10 rows contain data for the final year. These rows comprise the test set and the rest of the data makes up the training dataset.
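
The split itself is plain array slicing. The sketch below assumes the supervised data is held in a NumPy array; the 25-row array is an arbitrary stand-in chosen for illustration, not the real dataset.

```python
import numpy as np

# stand-in for the supervised values: 25 rows, 4 columns (arbitrary data)
supervised_values = np.arange(25 * 4).reshape(25, 4)
n_test = 10

# everything except the last n_test rows is used for training
train = supervised_values[:-n_test]
# the last n_test rows are held back for testing
test = supervised_values[-n_test:]
print(train.shape, test.shape)
```

Because the rows are in time order, slicing from the end keeps the most recent observations in the test set and avoids training on data from the test period.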


We can put all of this together in a new function that takes the loaded series and some parameters and returns a train and test set ready for modeling.


We can test this with the Shampoo dataset. The complete example is listed below.


from pandas import DataFrame
from pandas import concat
from pandas import read_csv
from datetime import datetime

# date-time parsing function for loading the dataset
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# transform series into train and test sets for supervised learning
def prepare_data(series, n_test, n_lag, n_seq):
    # extract raw values
    raw_values = series.values
    raw_values = raw_values.reshape(len(raw_values), 1)
    # transform into supervised learning problem X, y
    supervised = series_to_supervised(raw_values, n_lag, n_seq)
    supervised_values = supervised.values
    # split into train and test sets
    train, test = supervised_values[0:-n_test], supervised_values[-n_test:]
    return train, test

# load dataset
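# squeeze=True is no longer accepted and date_parser is deprecated in recent
# pandas releases, so the index is parsed and the single column selected explicitly
series = read_csv('shampoo-sales.csv', header=0, index_col=0)
series.index = series.index.map(parser)
series = series.iloc[:, 0]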
# configure
n_lag = 1
n_seq = 3
n_test = 10
# prepare data
train, test = prepare_data(series, n_test, n_lag, n_seq)
print(test)
print('Train: %s, Test: %s' % (train.shape, test.shape))

Running the example first prints the entire test dataset, which is the last 10 rows. The shapes of the train and test datasets are also printed.


We can see that the single input value (first column) on the first row of the test dataset matches the observation in the shampoo-sales dataset for December of the 2nd year.


We can also see that each row contains 4 columns for the 1 input and 3 output values in each observation.
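
Given that layout, the input and output parts of each row can be separated by column position. This is a sketch under the assumption that rows are stored in a NumPy array with the n_lag inputs first and the n_seq outputs after them; the two sample rows are invented.

```python
import numpy as np

n_lag = 1
n_seq = 3
# two made-up rows in the [input, output 1, output 2, output 3] layout
rows = np.array([[10.0, 11.0, 12.0, 13.0],
                 [11.0, 12.0, 13.0, 14.0]])

# the first n_lag columns are the inputs
X = rows[:, :n_lag]
# the remaining n_seq columns are the values to be forecast
y = rows[:, n_lag:]
print(X.shape, y.shape)
```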


