LSTM Energy Consumption Forecasting: Predicting Energy Consumption, Part 1

An Introduction to Time Series Analysis and Forecasting Using Python

Time Series Analysis & Forecasting

Time series data refers to a set of observations collected at different points in time, often at a regular interval. Analysis of time series data is crucial in a wide array of industries, including finance, epidemiology, meteorology, social sciences and many others.

In this article, we will look at the following topics within the domain of time series analysis and forecasting:

  • Exploratory Analysis
  • Visualizations
  • Seasonal Decomposition
  • Stationarity
  • ARIMA Models

These topics provide a very basic introduction to time series analysis and forecasting. More advanced forecasting methods will be discussed in Part 2 of this article. You can find a notebook with the code from this article in this GitHub repo.

Data Description

We will be analyzing hourly energy consumption data provided by PJM Interconnection. PJM Interconnection is a regional transmission organization (RTO) that coordinates distribution of electricity across a region including all or parts of 14 states in the northeastern United States. The data we will be looking at is composed of hourly energy consumption data from Duquesne Light Co. from January 1, 2005 to August 3, 2018. Duquesne Light Co. serves the Pittsburgh, PA, metropolitan area as well as large segments of Allegheny and Beaver Counties. A copy of the dataset we will be using can be found here. Individual service areas for large member companies of PJM, including Duquesne Light (DUQ), are shown in the figure below.

(Figure: PJM zone map showing individual service areas, source: https://www.pjm.com/library/~/media/about-pjm/pjm-zones.ashx)

Energy consumption data are provided in megawatts (MW). The values provided represent a measurement of instantaneous power within the Duquesne Light grid. Since power represents the rate at which work is done, the data includes a collection of instantaneous rates, similar to a set of velocity data.

The total amount of energy consumed over a time period is represented as megawatt-hours (MWh). If power is constant, this value can be determined by multiplying the instantaneous power by the duration over which it is applied. In our data, instantaneous power varies hourly, so to determine the total amount of energy consumed we would need to integrate over time. We will be looking at consumption rates only for our data, as opposed to the total amount of energy consumed.
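
For example, a constant draw of 1,500 MW sustained for three hours corresponds to 1,500 MW × 3 h = 4,500 MWh of energy. Because our readings are hourly, the total energy consumed over any window is approximately the sum of the hourly MW values, each applied over its one-hour interval.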

Exploratory Analysis

First, we will need to load the modules we will be using as well as our dataset:

# Load Modules
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Set Plotting Styles
plt.style.use('ggplot')

# Load Data
duq_df = pd.read_csv('data/DUQ_hourly.csv', index_col=[0], parse_dates=[0])
# Sort Data
duq_df.sort_index(inplace=True)

When our data was originally stored in the CSV file, timestamps were converted to strings. We need to parse them back into datetime objects for our analysis and set them as the dataset index. This can be done automatically using the parse_dates and index_col parameters of the read_csv function included with Pandas. Once we have established a datetime index, we sort the values to make sure they are in chronological order. This step matters because some timestamps arrive out of order; the file appears to have been sorted by its timestamp strings rather than by the actual datetime values.

A quick evaluation of the data indicates that it has 119,008 rows and one column (DUQ_MW). We will now go through the data to identify duplicate timestamps, impute missing values and prepare some basic visualizations.

Duplicate Values

Some datasets may include duplicate readings that share a common timestamp. The reason for this duplication should be investigated for each unique dataset, as reasons for data duplication can vary widely depending on the data collection methodology.

There are several approaches to handling duplicate values. We can discard one or both of the duplicate readings, or combine the available measurements into a single imputed value for that timestamp. For our example, we will compute the mean energy consumption for each pair of duplicate readings and use that value moving forward.

Once we have removed duplicate values, we manually set the frequency of the DatetimeIndex to hourly (‘H’). Normally, the date parser would be able to determine the frequency of the DatetimeIndex automatically. The presence of duplicate DatetimeIndex values prevents this from happening though, so the frequency must be set manually following removal of duplicate values. Setting the frequency now will help us avoid problems with plotting and calculations down the line.

# Identify Duplicate Indices
duplicate_index = duq_df[duq_df.index.duplicated()]
print(duq_df.loc[duplicate_index.index.values, :])

# Replace Duplicates with Mean Value
duq_df = duq_df.groupby('Datetime').agg(np.mean)

# Set DatetimeIndex Frequency
duq_df = duq_df.asfreq('H')

Missing Values

A quick search through the dataset indicates that there are 24 missing values across our date range. We will use linear interpolation to impute these missing values. We can do this using the interpolate() method on our dataframe object:

# Determine # of Missing Values
print('# of Missing DUQ_MW Values: {}'.format(len(duq_df[duq_df['DUQ_MW'].isna()])))

# Impute Missing Values
duq_df['DUQ_MW'] = duq_df['DUQ_MW'].interpolate(limit_area='inside', limit=None)

Visualizing Energy Consumption Data

First, let’s look at a simple time series plot of our data:

plt.plot(duq_df.index, duq_df['DUQ_MW'])
plt.title('Duquesne Light Energy Consumption')
plt.ylabel('Energy Consumption (MW)')
plt.show()

There is a clear annual seasonal pattern visible in the data, but the details of each individual year are obscured by the density of the plot. Let’s look at a single week:

WEEK_END_INDEX = 7*24

plt.plot(duq_df.index[:WEEK_END_INDEX], duq_df['DUQ_MW'][:WEEK_END_INDEX])
plt.title('Duquesne Light Energy Consumption (One Week)')
plt.ylabel('Energy Consumption (MW)')
plt.show()

From this plot we can also see a clear daily seasonality, with a repeated pattern of hourly consumption rates occurring each day. We also observe an upward trend within this week of data, but this trend is likely a part of the annual seasonality we saw previously.

Let’s break our DatetimeIndex into separate features so we can look at some of these patterns a little more closely:

def create_features(df):
    df['Date'] = df.index
    df['Hour'] = df['Date'].dt.hour
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['Quarter'] = df['Date'].dt.quarter
    df['Month'] = df['Date'].dt.month
    df['Year'] = df['Date'].dt.year
    df['DayOfYear'] = df['Date'].dt.dayofyear
    df['DayOfMonth'] = df['Date'].dt.day
    df['WeekOfYear'] = df['Date'].dt.weekofyear
    df['DayOfYearFloat'] = df['DayOfYear'] + df['Hour'] / 24
    df.drop('Date', axis=1, inplace=True)
    return df

duq_df = create_features(duq_df)

fig, axes = plt.subplots(2, 2, figsize=(16, 16))

# Day of Week
dow_labels = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
g = sns.boxplot(x=duq_df.DayOfWeek, y=duq_df.DUQ_MW, ax=axes[0][0])
g.set_xticklabels(dow_labels)
g.set_ylabel('')

# Month of Year
g = sns.boxplot(x=duq_df.Month, y=duq_df.DUQ_MW, ax=axes[0][1])
g.set_ylabel('')

# Hour of Day
g = sns.boxplot(x=duq_df.Hour, y=duq_df.DUQ_MW, ax=axes[1][0])
g.set_ylabel('')

# Year
g = sns.boxplot(x=duq_df.Year, y=duq_df.DUQ_MW, ax=axes[1][1])
g.set_ylabel('')

fig.text(0.08, 0.5, 'Energy Consumption (MW)', va='center', rotation='vertical')
plt.show()

Here, we look at energy consumption in terms of the hour of day, day of the week, month of year, and year of our dataset range. Energy consumption is slightly higher Monday through Friday than on Saturday and Sunday, which is to be expected since many businesses in the U.S. do not operate on weekends. Energy consumption typically peaks around July/August, and the months immediately before and after the peak period show more outliers than the cooler months (November to April). Hourly energy consumption is at its lowest around 4 AM. After this time it gradually increases until it plateaus around 2 PM, peaks again around 7 PM, and then declines until it reaches its early morning low point. Mean energy consumption rates and interquartile ranges are fairly consistent from year to year, although a slight downward trend is observable.

Next, we’ll look at a seasonal plot to get a better idea how energy consumption patterns vary from year to year. We’ll use monthly mean energy consumption instead of hourly data to make it easier to distinguish between years when all of the data are plotted together.

year_group = duq_df.groupby(['Year', 'Month']).mean().reset_index()
years = duq_df['Year'].unique()
NUM_COLORS = len(years)

cm = plt.get_cmap('gist_rainbow')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_prop_cycle(color=[cm(1.*i/NUM_COLORS) for i in range(NUM_COLORS)])

# Plot each year's monthly mean consumption as a separate line
for i, y in enumerate(years):
    df = year_group[year_group['Year'] == y]
    plt.plot(df['Month'], df['DUQ_MW'])

plt.title('Mean Monthly Energy Consumption by Year')
plt.xlabel('Month')
plt.ylabel('Mean Energy Consumption (MW)')
plt.legend(duq_df.Year.unique())
plt.show()

Now, let’s plot each year’s worth of data on a separate subplot so individual years can be evaluated more closely. We will use a 7-day moving average to smooth the data, making general trends easier to identify for comparison.

num_rows = 7
num_cols = 2
year_index = 0

fig, axes = plt.subplots(num_rows, num_cols, figsize=(18, 18))
years = duq_df['Year'].unique()

for i in range(num_rows):
    for j in range(num_cols):
        df = duq_df[duq_df['Year'] == years[year_index]]
        rolling_mean = df['DUQ_MW'].rolling(window=7*24).mean()
        axes[i][j].plot(df['DayOfYearFloat'], rolling_mean.values)
        axes[i][j].set_title(str(years[year_index]))
        axes[i][j].set_ylim(1100, 2500)
        axes[i][j].set_xlim(0, 370)
        year_index += 1

fig.text(0.5, 0.08, 'Elapsed Days', ha='center')
fig.text(0.08, 0.5, 'Energy Consumption (MW)', va='center', rotation='vertical')
fig.subplots_adjust(hspace=0.5)
plt.show()

These plots help us identify the unique fluctuations of energy consumption rate across each year. For example, in 2008 the peak energy consumption rate occurs much earlier than in other years. Some years, such as 2012, demonstrate a clear peak summer season, while other years, such as 2015, also show local maxima in the winter season as well.

Moving Average

In the plots we just created, we used a smoothing method called a moving average. A moving average is the average of the observations within a fixed-length window. Once the average for one window has been calculated, the window is shifted one time step into the future and a new average is calculated; this process is repeated in a stepwise fashion across the entire time series.
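
For example, a 3-hour moving average of the hourly values 1,500, 1,550, 1,600, and 1,580 MW yields 1,550 MW for the first three readings and roughly 1,576.7 MW for the last three; no average is defined until a full window of observations is available.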

Moving averages are extremely useful for reducing the amount of noise that is present in a time series. They provide a smoothing effect that makes it easier to identify general trends and patterns within the data.

The size of the fixed-length window that is used to calculate the moving average is determined by the analyst. Larger window sizes result in greater smoothing but reduce granularity, obscuring patterns that appear at lower levels within the time series. The plot below includes a few moving averages calculated using our energy consumption data to demonstrate the smoothing effect produced by increasing the moving average window size.

MONTH_PERIOD = 24*30
MIDYEAR_PERIOD = 24*182
YEAR_PERIOD = 24*365

month_roll = duq_df.rolling(MONTH_PERIOD).mean()
midyear_roll = duq_df.rolling(MIDYEAR_PERIOD).mean()
year_roll = duq_df.rolling(YEAR_PERIOD).mean()

fig, ax = plt.subplots(figsize=(24, 10))
plt.plot(month_roll.index, month_roll['DUQ_MW'], color='red', label='30-Day MA')
plt.plot(midyear_roll.index, midyear_roll['DUQ_MW'], color='blue', label='180-Day MA')
plt.plot(year_roll.index, year_roll['DUQ_MW'], color='black', label='365-Day MA')
plt.title('Duquesne Light Energy Consumption Moving Averages')
plt.ylabel('Energy Consumption (MW) Moving Average')
plt.legend()
plt.show()

Seasonality and Classical Seasonal Decomposition

Classical seasonal decomposition asserts that time series data can be broken apart into four separate components:

  • Base Level / Average Value — The average, stationary value of the observations
  • Trend — Increase / decrease of observations over time
  • Seasonality — A pattern in the data that repeats over a fixed period
  • Residuals — Error

For our analysis, the Base Level and Trend will be combined to form a single Trend component.

It is important to note the difference between seasonal and cyclical patterns, as both may be observed in time series data. Seasonal patterns occur with a regular, fixed period. Seasonality may be used to describe regular patterns that repeat daily, weekly, annually, etc. Cyclic patterns, on the other hand, are observed when data rises and falls with an irregular period.

When trying to determine if an observed pattern is seasonal or cyclic, first determine whether or not the pattern recurs at regular intervals. Patterns that occur at regular intervals can usually be identified by period lengths that correspond to our calendar/time-keeping system. For example, patterns that repeat daily, weekly, or annually would be considered seasonal. If the observed period of the pattern is irregular and does not seem to be tied to our calendar/time-keeping system, it is most likely a cyclical pattern. For example, stock price data often rises and falls based on business cycles, but these cycles are irregular and very difficult to predict since they are based on numerous external factors. The distance between peaks produced by these cycles is not constant, indicating a cyclical pattern.

Let’s break our energy consumption data into three components (trend, seasonality, and residuals). To do this, we will use the seasonal_decompose function that is provided as part of the statsmodels module.

Additive vs. Multiplicative Decomposition

When using the seasonal_decompose function, we will need to specify whether the decomposition should assume an additive or multiplicative model. An additive model assumes that the sum of the individual time series components (Trend + Seasonality + Residuals) is equal to the observed data. A multiplicative model assumes that the product of the individual time series components (Trend * Seasonality * Residuals) is equal to the observed data.
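
In equation form, the two models are:

additive:        observed_value(t) = trend(t) + seasonality(t) + residual(t)
multiplicative:  observed_value(t) = trend(t) * seasonality(t) * residual(t)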

Typically, additive decomposition is most appropriate if there is little to no variation in the seasonality over time. Multiplicative decomposition is most appropriate when the magnitude of seasonality varies over time. Before we perform seasonal decomposition, we will look at the magnitude of some of our seasonalities by measuring the peak-to-peak amplitude of each seasonal waveform. The peak-to-peak amplitude is equal to the difference between the maximum value (peak) and minimum value (trough) within each seasonal cycle.

max_daily_vals = duq_df.groupby(['Year', 'DayOfYear']).max()['DUQ_MW'].values
min_daily_vals = duq_df.groupby(['Year', 'DayOfYear']).min()['DUQ_MW'].values
daily_amp = max_daily_vals - min_daily_vals

plt.plot(daily_amp)
plt.xlabel('Day #')
plt.ylabel('Amplitude (MW)')
plt.title('Estimated Daily Amplitude')
plt.show()

We can see that the daily amplitude also follows a seasonal pattern, increasing or decreasing depending on the day of the year. In an additive model, we would expect the amplitude to remain relatively constant. Since the amplitude changes over time, we will assume that a multiplicative model best applies to the data.

from statsmodels.tsa.seasonal import seasonal_decompose

ANNUAL_PERIOD = 365*24
mult_decomp = seasonal_decompose(duq_df['DUQ_MW'], model='multiplicative', extrapolate_trend='freq', period=ANNUAL_PERIOD)
mult_decomp.plot()
plt.show()

Stationarity

The stationarity of a time series refers to whether or not its statistical properties (mean, variance, etc.) remain constant as time progresses. In a stationary time series, statistical properties of the data do not change over time, whereas in a non-stationary time series statistical properties vary. The observed values in a stationary time series are fully independent of time. As a result, time series that exhibit trend or seasonality are non-stationary.

It is important to consider the stationarity of a time series when the intention is to develop a forecasting model. Most forecasting approaches assume the time series is either stationary or can be rendered stationary via mathematical transformation.

Augmented Dickey-Fuller (ADF) Test

There are several statistical methods that can be used to evaluate stationarity, but one of the most common methods is the Augmented Dickey-Fuller test, or ADF test. The ADF test is a type of test known as a unit root test. A unit root is a stochastic trend component within a time series. It adds a random, unpredictable pattern to the time series that is unrelated to trend or seasonality. By detecting whether or not the time series contains a unit root, we can determine whether or not the series is stationary.
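
In a first-order autoregressive process, a unit root corresponds to an autoregressive coefficient of exactly one, so past shocks accumulate rather than decay; a random walk is the simplest example of such a process.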

The ADF test is an augmented form of the Dickey-Fuller test. The Dickey-Fuller test assumes that the time series data can be approximated by a first order autoregressive model with a white noise error. For most real-world datasets, this is an oversimplification and does not provide an adequate model. The augmented version of the Dickey-Fuller test allows for datasets that are better fit using higher order autoregressive models.

An ADF test assumes the following hypotheses:

  • Null Hypothesis — If not rejected, suggests that the time series contains a unit root and is non-stationary.
  • Alternative Hypothesis — If the Null Hypothesis is rejected, suggests that the time series does not have a unit root and is stationary.

To perform an ADF test, we will use the adfuller function from the statsmodels module. This function takes a time series as input and outputs a tuple containing the following values:

  • ADF Test Statistic
  • p-value
  • Number of Lags Used
  • Number of Observations Used
  • Critical Values (1%, 5%, and 10%)
  • Maximum Information Criterion (if # of lags is specified)

The first two items (ADF Test Statistic and p-value) are the ones we are most interested in. The p-value will help us determine whether or not we should reject the null hypothesis. For our analysis, we will reject the null hypothesis if the p-value obtained by the ADF test is < 0.05.

The ADF test is designed to identify non-stationarity caused by the presence of a unit root within the process of a time series, but it is not designed to detect non-stationarity in other forms such as seasonality. We can see an example of this with our energy consumption data, which includes clear seasonality:

from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(duq_df['DUQ_MW'])
print(f'ADF Statistic: {adf_result[0]}')
print(f'p-value: {adf_result[1]}')

The p-value obtained by the ADF test is <0.05, indicating that we should reject the Null Hypothesis and accept our Alternative Hypothesis that the time series does not contain a unit root. However, our seasonal decomposition showed that there are very clear seasonality components within our data.

ARIMA Models

ARIMA is an acronym that stands for Autoregressive Integrated Moving Average. ARIMA models offer a basic approach to time series forecasting. An ARIMA model consists of three components:

  • Autoregressive Model Term (p)
  • Moving Average Model Term (q)
  • Order of Differencing (d)
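
Together these are written as ARIMA(p, d, q); an ARIMA(1, 1, 1) model, for example, combines one autoregressive term, one order of differencing, and one moving average term.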

The term “integrated” that is included in the ARIMA acronym refers to the use of differencing of raw data to make the time series stationary. Let’s explore the components of an ARIMA model a bit further.

Autoregressive Models

An autoregressive model is a model that makes use of a regression developed using past (lagged) values observed from the time series. In other words, past values from the time series are used to create a regression equation that can be used to predict future values of the time series. If we define the observed value of a time series at time t as y(t), an autoregressive model developed using the immediate preceding value from the time series would be expressed by the following linear regression equation:
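
y(t) = A * y(t-1) + B + ε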

In this equation, y(t-1) represents the observed value of the time series one unit time step prior to the value we are trying to predict, A represents the linear regression coefficient, B represents the linear regression constant term (in this case the mean, or base, value of our time series), and ε represents the regression error.

Autoregressive models are defined by their order, which is the number of preceding time series values that are used to predict future values. Autoregressive models may be referred to using the abbreviated form AR(n), where n represents the model order. The autoregressive model described above is a first order model since it only incorporates one preceding value into the regression, and it may be written as AR(1).

Moving Average Models

A moving average model is a form of regression model that uses a linear combination of white noise series terms to forecast future values. A white noise series is a sequence of uncorrelated variables that have a constant variance and mean. A white noise series does not exhibit autocorrelation.

To understand what this means, let’s look at the linear regression equation for a first order moving average model:
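
y(t) = C * ε(t-1) + D + ε(t)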

In this equation, y(t) represents the value of the time series at time t (the value we are trying to predict), C represents a regression coefficient, D represents a regression constant (the base level of our time series), ε(t-1) represents a white noise series lagged by one time unit, and ε(t) represents a white noise series.

This regression equation says that the next value in the time series can be predicted by taking the base level of the time series and adding white noise error. The amount of white noise to be added is a linear combination of past white noise error terms with different coefficients. These coefficients essentially produce a weighted average of previous error terms to be added to the time series base level.

It should be noted that the term “moving average” has two separate meanings in time series analysis: 1.) Data smoothing, and 2.) Use of white noise errors to develop a regression for forecasting.

Moving average models can be written as MA(n), where n represents the model order, or number of previous residual values that are included in the regression. The first order moving average model illustrated above can be written as MA(1).

Differencing

When defining an ARIMA model, we must also specify the degree of differencing to implement. Differencing is a technique that attempts to increase stationarity by subtracting a previous observation from the current observation. Subtracting the observation immediately preceding the current observation produces a first difference:

first_difference = observed_value(t) - observed_value(t-1) 

Differencing can also be performed with a lag. A lag of 7, for example, means that an observation 7 time units prior to our current observation should be subtracted from our current observation:

weekly_lag_difference = observed_value(t) - observed_value(t-7)

Over-differencing has a negative impact on model performance, so it is important to select the correct degree of differencing when defining an ARIMA model. When specifying the degree of differencing for our ARIMA model, we will be identifying the number of times differencing will be performed on our time series. It is important to note that this is not the same as specifying a lag period. As an example, setting the degree of differencing equal to one will produce the following result:

differenced_series = observed_value(t) - observed_value(t-1)

Setting the degree of differencing equal to two will produce the following result (which can also be described as the first difference of the first difference):

differenced_series = (observed_value(t) - observed_value(t-1)) - 
(observed_value(t-1) - observed_value(t-2))
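
As a minimal sketch, both operations map onto the pandas diff() method (the variable names here are just for illustration):

# First difference (degree of differencing d = 1)
first_diff = duq_df['DUQ_MW'].diff()

# Second difference (d = 2): the first difference of the first difference
second_diff = duq_df['DUQ_MW'].diff().diff()

# Difference with a lag of 7 time steps (7 hours for this hourly series)
lag7_diff = duq_df['DUQ_MW'].diff(7)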

Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)

To determine the moving average and autoregressive model orders that are most appropriate for our data, we will need to use the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF), respectively. To understand what these are, we will need to define some additional terms.

Autocorrelation refers to the correlation between values of a time series and lagged values of the same time series. It represents the degree of correlation between present and past values of a time series. The autocorrelation of a time series can be calculated for several different lag values and plotted. This plot is known as the Autocorrelation Function, or ACF. The ACF for our energy consumption data is shown below.
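
For instance, duq_df['DUQ_MW'].autocorr(lag=24) returns the correlation between each observation and the observation 24 hours earlier.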

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(duq_df['DUQ_MW'], lags=50)
plt.savefig('fig9.png')
plt.show()

The Partial Autocorrelation Function, or PACF, summarizes the relationship between present and past values within a time series but removes the effects of any potential relationships at intermediate, or "lower-order", time steps. A typical autocorrelation includes both the direct correlation between an observation at a lagged time value and the present value and the correlations between intermediate time values and the present value. By removing the intermediate correlations, we are left with a PACF plot that we can use to determine the order of our autoregressive model. The PACF for the first 100 lags of our energy consumption data is shown below:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(duq_df['DUQ_MW'], lags=100)
plt.show()

In order to determine the correct order for our autoregressive and moving average model terms, we will need to review our ACF and PACF plots. In general, if the PACF shows a sharp cutoff after a certain point and the lag-1 autocorrelation is positive, the order of the AR term can be determined by looking at the lag value for which the PACF cuts off. The PACF is highly significant for the first 24-hour cycle then drops off. This suggests that an AR order of 24 may be appropriate. However, use of an order this large will dramatically slow down model training time.

Addition of a moving average term to our model can be evaluated by looking at the ACF plot. If the plot demonstrates a sharp cutoff point and the lag-1 autocorrelation is negative, this suggests that an MA term should be added to the model. We do not see either of these features in our ACF plot, which suggests that an MA term may not help our model.

The AR order demonstrated by the PACF plot is deceptive. Since we did not use a differenced series, seasonality is captured in the autocorrelations. An AR(24) model will be inefficient to train, so we are better off exploring other options for incorporating seasonality into our ARIMA model.

Seasonality & ARIMA

Seasonality is not addressed directly using a standard ARIMA model. However, we can use an ARIMA model adapted to manage seasonality to model data with clear seasonal trends. The seasonal ARIMA model we will be using is known as SARIMAX, and it is included in the statsmodels module. This model can also be used to create a standard ARIMA model by setting all parameters related to seasonality to zero.

A seasonal ARIMA model is defined using each of the model components identified for a standard ARIMA model (p, d, and q) with the addition of a set of seasonal terms (P, D, Q, and s) that must also be defined. When defining a seasonal ARIMA model, the terms P, D, and Q represent the autoregressive order, order of differencing, and moving average order, respectively, of the seasonal component of the data. The lowercase versions of these parameters represent the same set of orders for the non-seasonal component of the data. The value s represents the periodicity of the seasonality.
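
A minimal sketch of how such a model is specified with statsmodels is shown below; the orders used here are illustrative placeholders rather than tuned values:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Non-seasonal orders (p, d, q) and seasonal orders (P, D, Q, s), with s = 24 for daily
# seasonality in hourly data. Fitting on the full hourly series can be slow.
sarima_model = SARIMAX(duq_df['DUQ_MW'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
sarima_results = sarima_model.fit(disp=False)
print(sarima_results.summary())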

Example Models

We will create some baseline ARIMA models using different combinations of autoregression and moving average model components. We will then use these models to forecast energy consumption for just over one day. To speed up model training for this example, we will discard the first several years of data when creating our training dataset. We will only use data collected between January 1, 2014 and August 1, 2018 for model training, and we will attempt to predict data from August 1, 2018 to August 2, 2018.

Error Metric

We will be evaluating the performance of our model using mean squared error. Mean squared error (MSE) is a measurement of the average value of the squares of the error residuals. Here is the equation that will be used to calculate the mean squared error of the forecasted values:
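
MSE = (1 / n) * Σ (yi - ŷi)²,  summed over i = 1, ..., n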

In this equation, n represents the number of observations in our test set, yi represents the forecasted value at time step i, and ŷi represents the actual value at time step i. Using this equation, we take the difference between the forecasted and observed values at each time step (the error), square it, sum the squared errors over our forecasting window, and divide the sum by the number of time steps in the window to calculate the average (mean) squared error.

One-Step Forecasting

Forecasts will be generated using a one-step approach. This approach forecasts the value for the next time step in the series. The forecasted value is stored in a list. Once an actual value has been observed for the next time step, this value is incorporated into the time series. The model is then retrained and used to predict the value for the next unobserved time step. This process is repeated over a user-identified window of time.

This forecasting approach is being used as an example, but some problems may require forecasting of multiple unobserved time steps simultaneously. In fact, a multistep forecasting approach may be more suited towards our dataset since a utility would likely prefer to know the forecasted consumption for a full day or longer as opposed to only one hour ahead. We will stick with this approach for now though to see how selection of model parameters impacts our forecasting accuracy.

The example code below shows the one-step forecasting approach using a simple AR(1) model. This approach will be used for a few sets of parameters to generate some baseline models for comparison.

from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from datetime import datetime

train_series = duq_df.loc[(duq_df.index >= datetime(2014, 1, 1)) & (duq_df.index < datetime(2018, 8, 1)), 'DUQ_MW']
test_series = duq_df.loc[(duq_df.index >= datetime(2018, 8, 1)), 'DUQ_MW']

preds = []
history = [x for x in train_series]

# One-step forecasting: refit the model after each newly observed value
for t in range(len(test_series)):
    model = ARIMA(history, order=(1, 0, 0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    preds.append(output[0][0])
    history.append(test_series[t])

pred_series = pd.Series(preds, index=test_series.index)

error = mean_squared_error(test_series, pred_series)
print('MSE: %.3f' % error)

plt.plot(test_series, label='Observed Values')
plt.plot(pred_series, color='blue', label='Predictions')
plt.legend()
plt.ylabel('Energy Consumption (MW)')
plt.show()

Model Results

The table below summarizes the results from each of our models.

Plots of observed values and values forecasted using standard ARIMA models are shown below.

Plots of observed values and values forecasted using seasonal ARIMA models are shown below.

The forecast results indicate that autoregressive terms appear to be more important than moving average terms in the models that were developed. This coincides with our conclusion from the ACF plot that a moving average term may not be a useful model component. However, the combination of an autoregressive term and a moving average term provides better forecasting results than either term on its own. This holds true when seasonality is incorporated into the model as well.

First-order differencing improved model performance both with and without incorporation of seasonality. We can also see that the addition of 24-hour seasonality to the model improved forecasting accuracy significantly. The only exception to this trend is the moving average model. Performance of this model declined heavily following the incorporation of seasonality.

The best performing model was a SARIMA model with a seasonality of 24 and a value of 1 for all available parameters related to autoregression, moving average, and differencing. Since we have only developed baseline models to see the impacts of different model terms, we may be able to obtain even better results by testing out different combinations of parameters and exploring our data more thoroughly. We may also be able to improve the results by adopting alternative forecasting models, such as exponential smoothing or LSTMs. We will explore these alternatives in Part 2 of this article.

Thanks for Reading!

Please give this article a clap if you found it useful. Code for this article can be found in this Github repository.

Translated from: https://medium.com/@scottmduda/predicting-energy-consumption-part-1-ad38f7d7106f
