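Converting a Time Series into a Supervised Learning Problem

Two different ways of splitting the data are presented here; choose whichever you prefer.

The first way of splitting the data

The pandas shift() function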

import pandas as pd
df = pd.DataFrame()
df["time"] = [x for x in range(10)]
df
time
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
df["time-1"] = df["time"].shift(1)
df
time time-1
0 0 NaN
1 1 0.0
2 2 1.0
3 3 2.0
4 4 3.0
5 5 4.0
6 6 5.0
7 7 6.0
8 8 7.0
9 9 8.0
df["time+1"] = df["time"].shift(-1)
df
time time-1 time+1
0 0 NaN 1.0
1 1 0.0 2.0
2 2 1.0 3.0
3 3 2.0 4.0
4 4 3.0 5.0
5 5 4.0 6.0
6 6 5.0 7.0
7 7 6.0 8.0
8 8 7.0 9.0
9 9 8.0 NaN
df["time+2"] = df["time"].shift(-2)
df
time time-1 time+1 time+2
0 0 NaN 1.0 2.0
1 1 0.0 2.0 3.0
2 2 1.0 3.0 4.0
3 3 2.0 4.0 5.0
4 4 3.0 5.0 6.0
5 5 4.0 6.0 7.0
6 6 5.0 7.0 8.0
7 7 6.0 8.0 9.0
8 8 7.0 9.0 NaN
9 9 8.0 NaN NaN

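In a time series forecasting problem, the current time (t) and the future times (t+1, ..., t+n) are called forecast times, while the past observations (t-1, ..., t-n) are the inputs used to make the forecast.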
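As a minimal sketch of this framing, the lagged column can serve as the input and the unshifted column as the target once the rows containing NaN are dropped. The names `frame`, `X`, and `y` are illustrative assumptions and do not come from the original post.

```python
import pandas as pd

# Toy series 0..9, the same values used in the shift() demo.
frame = pd.DataFrame({"t": list(range(10))})
frame["t-1"] = frame["t"].shift(1)   # past observation, used as input

# The first row has no past observation (NaN), so it is dropped.
frame = frame.dropna()

X = frame[["t-1"]].to_numpy()   # inputs, one lag feature per row
y = frame["t"].to_numpy()       # targets, the value at the current time
print(X.shape, y.shape)
```

Dropping the NaN rows costs one sample per lag step, which is why longer input windows leave fewer training rows.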

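The new dataset is built as a DataFrame, and each column is named after the variable number and the number of steps by which that column was shifted backward or forward.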

def series_to_supervisied(data,step_in,step_out,dropnan = True):
    """
    param data: sequence of observations, as a list or a 2D NumPy array
    param step_in: number of lagged observations used as input (x)
    param step_out: number of observations used as output (y)
    param dropnan: bool, whether to drop rows containing NaN; defaults to True

    return: DataFrame of the series reframed for supervised learning
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols,names = [],[]
    # input sequence (t-n, ..., t-2, t-1)
    for i in range(step_in,0,-1):
        cols.append(df.shift(i))
        names+=[("var%d(t-%d)"%(j+1,i)) for j in range(n_vars)]

    # forecast sequence (t+0, t+1, ..., t+n-1)
    for i in range(0,step_out):
        cols.append(df.shift(-i))
        names+=[("var%d(t+%d)"%(j+1,i)) for j in range(n_vars)]

    agg = pd.concat(cols,axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return  agg
values = [x for x in range(10)]
data = series_to_supervisied(data=values,step_in=1,step_out=1)
data
var1(t-1) var1(t+0)
1 0.0 1
2 1.0 2
3 2.0 3
4 3.0 4
5 4.0 5
6 5.0 6
7 6.0 7
8 7.0 8
9 8.0 9
def series_to_supervisied_(data,step_in,step_out,dropnan = True):
    """
    param data: sequence of observations, as a list or a 2D NumPy array
    param step_in: number of lagged observations used as input (x)
    param step_out: number of observations used as output (y)
    param dropnan: bool, whether to drop rows containing NaN; defaults to True

    return: DataFrame of the series reframed for supervised learning
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame()
    df_time_in =pd.DataFrame()
    df_time_out = pd.DataFrame()
    df["time"] =data
    # input sequence (t-n, ..., t-2, t-1)
    for i in range(step_in,0,-1):
        name = "step_"+"time-"+str(i)
        print(name)
        df_time_in[name] = df["time"].shift(i)
        print(name)

    # forecast sequence (t+1, t+2, ..., t+n); the current time t is the "time" column
    for i in range(1,step_out+1):
        name = "step_"+"time+"+str(i)
        print(name)
        df_time_out[name] = df["time"].shift(-i)
        print(name)
    df_re = pd.concat([df_time_in,df,df_time_out],axis =1)
    del df,df_time_in,df_time_out
    if dropnan:
        df_re.dropna(inplace=True)
    return  df_re
values = [x for x in range(10)]
data = series_to_supervisied_(data=values,step_in=3,step_out=0)
data
step_time-3
step_time-3
step_time-2
step_time-2
step_time-1
step_time-1
step_time-3 step_time-2 step_time-1 time
3 0.0 1.0 2.0 3
4 1.0 2.0 3.0 4
5 2.0 3.0 4.0 5
6 3.0 4.0 5.0 6
7 4.0 5.0 6.0 7
8 5.0 6.0 7.0 8
9 6.0 7.0 8.0 9

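One-step univariate forecasting

Use the observation at (t-1) as the input variable to predict the observation at the current time (t). In the same way, an input window of any length can be specified.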
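To show what the one-step framing is for, here is a minimal sketch that fits a straight line from the (t-1) value to the (t) value and uses it for a one-step-ahead forecast. The use of `np.polyfit` and the names `framed`, `slope`, `intercept`, and `next_value` are assumptions made for illustration; the original post does not fit any model.

```python
import numpy as np
import pandas as pd

series = pd.Series(range(10), dtype=float)

# One-step framing: x is the observation at t-1, y is the observation at t.
framed = pd.DataFrame({"x": series.shift(1), "y": series}).dropna()

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(framed["x"].to_numpy(), framed["y"].to_numpy(), 1)

# One-step-ahead forecast from the last known observation.
next_value = slope * series.iloc[-1] + intercept
print(round(next_value, 6))
```

For this perfectly linear toy series the fit should recover a slope and an intercept both close to 1, so the forecast lands close to 10.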

def series_to_supervisied_(data,step_in,step_out,dropnan = True):
    """
    param data: sequence of observations, as a list or a 2D NumPy array
    param step_in: number of lagged observations used as input (x)
    param step_out: number of observations used as output (y)
    param dropnan: bool, whether to drop rows containing NaN; defaults to True

    return: DataFrame of the series reframed for supervised learning
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame()
    df_time_in =pd.DataFrame()
    df_time_out = pd.DataFrame()
    df["time"] =data
    # input sequence (t-n, ..., t-2, t-1)
    for i in range(step_in,0,-1):
        name = "step_"+"time-"+str(i)
        df_time_in[name] = df["time"].shift(i)

    # forecast sequence (t+0, t+1, ..., t+n-1)
    for i in range(0,step_out):
        name = "step_"+"time+"+str(i)
        df_time_out[name] = df["time"].shift(-i)

    df_re = pd.concat([df_time_in,df_time_out],axis =1)
    del df,df_time_in,df_time_out
    if dropnan:
        df_re.dropna(inplace=True)
    return  df_re

values = [x for x in range(10)]
data = series_to_supervisied_(data=values,step_in=1,step_out=1)
data
step_time-1 step_time+0
1 0.0 1
2 1.0 2
3 2.0 3
4 3.0 4
5 4.0 5
6 5.0 6
7 6.0 7
8 7.0 8
9 8.0 9

values = [x for x in range(10)]
data = series_to_supervisied_(data=values,step_in=2,step_out=1)
data
step_time-2 step_time-1 step_time+0
2 0.0 1.0 2
3 1.0 2.0 3
4 2.0 3.0 4
5 3.0 4.0 5
6 4.0 5.0 6
7 5.0 6.0 7
8 6.0 7.0 8
9 7.0 8.0 9

Multi-step forecasting

values = [x for x in range(10)]
data = series_to_supervisied_(data=values,step_in=2,step_out=2)
data
step_time-2 step_time-1 step_time+0 step_time+1
2 0.0 1.0 2 3.0
3 1.0 2.0 3 4.0
4 2.0 3.0 4 5.0
5 3.0 4.0 5 6.0
6 4.0 5.0 6 7.0
7 5.0 6.0 7 8.0
8 6.0 7.0 8 9.0

Multivariate forecasting

def series_to_superivsed(data,step_in =1,step_out=1,dropnan = True):
    """
    param data: sequence of observations, as a list or a 2D NumPy array
    param step_in: number of lagged observations used as input (x)
    param step_out: number of observations used as output (y)
    param dropnan: bool, whether to drop rows containing NaN; defaults to True

    return: DataFrame of the series reframed for supervised learning
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols = []
    names = []
    # input sequence: [(t-n), (t-n+1), ..., (t-1)]
    for i in range(step_in,0,-1):
        cols.append(df.shift(i))
        names+=[("var%d(t-%d)"%(j+1,i)) for j in range(n_vars)]
    # forecast sequence: [t, (t+1), (t+2), ..., (t+n-1)]
    for i in range(0,step_out):
        cols.append(df.shift(-i))
        if i ==0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names+=[("var%d(t+%d)"%(j+1,i)) for j in range(n_vars)]

    df_re = pd.concat(cols,axis=1)
    df_re.columns = names
    if dropnan:
        df_re.dropna(inplace =True)
    return df_re
raw = pd.DataFrame()

raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_superivsed(values)
data
var1(t-1) var2(t-1) var1(t) var2(t)
1 0.0 50.0 1 51
2 1.0 51.0 2 52
3 2.0 52.0 3 53
4 3.0 53.0 4 54
5 4.0 54.0 5 55
6 5.0 55.0 6 56
7 6.0 56.0 7 57
8 7.0 57.0 8 58
9 8.0 58.0 9 59
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_superivsed(values,1,2)
data
var1(t-1) var2(t-1) var1(t) var2(t) var1(t+1) var2(t+1)
1 0.0 50.0 1 51 2.0 52.0
2 1.0 51.0 2 52 3.0 53.0
3 2.0 52.0 3 53 4.0 54.0
4 3.0 53.0 4 54 5.0 55.0
5 4.0 54.0 5 55 6.0 56.0
6 5.0 55.0 6 56 7.0 57.0
7 6.0 56.0 7 57 8.0 58.0
8 7.0 57.0 8 58 9.0 59.0
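Once the multivariate frame exists, the column names can be used to separate inputs from targets. The sketch below rebuilds a small (t-1) to (t) frame with the same naming scheme and then selects columns by name. Choosing var1(t) as the only target is an assumption made for illustration, as are the names `lagged`, `current`, `framed`, and `input_cols`.

```python
import pandas as pd

raw = pd.DataFrame({"ob1": range(10), "ob2": range(50, 60)})

# Rebuild a (t-1) -> (t) framing with the same column naming scheme.
lagged = raw.shift(1)
lagged.columns = ["var1(t-1)", "var2(t-1)"]
current = raw.copy()
current.columns = ["var1(t)", "var2(t)"]
framed = pd.concat([lagged, current], axis=1).dropna()

# Inputs are the lagged columns; the target here is var1 at time t only.
input_cols = [c for c in framed.columns if "(t-" in c]
X = framed[input_cols].to_numpy()
y = framed["var1(t)"].to_numpy()
print(X.shape, y.shape)
```

Selecting by the "(t-" substring keeps the split correct when step_in grows, since every lagged column carries that marker in its name.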

The second way of splitting the data

Index Data
0 10
1 20
2 30
3 40
4 50
5 60
6 70
7 80
8 90
9 100
10 110

Suppose the previous time_step (3) values are used to predict the next value, which gives the following table:

Index x y
0 10,20,30 40
1 20,30,40 50
2 30,40,50 60
3 40,50,60 70
4 50,60,70 80
5 60,70,80 90
6 70,80,90 100
7 80,90,100 110
8 90,100,110 ?
9 100,110,? ??
10 110,?,?? ???
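Rows 8 to 10 of the table have no complete target, so only rows 0 to 7 can become training samples. As a cross-check, the same windows can be produced without an explicit loop. This is a sketch that assumes NumPy 1.20 or later, where `sliding_window_view` is available; the names `seq` and `windows` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

seq = np.arange(10, 120, 10)   # 10, 20, ..., 110
n_steps = 3

# Each window holds n_steps inputs followed by the one value to predict.
windows = sliding_window_view(seq, n_steps + 1)
X, y = windows[:, :-1], windows[:, -1]

for xi, yi in zip(X, y):
    print(xi, yi)
```

The result is a read-only view onto the original array, so copy it with `np.array(...)` before modifying it in place.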
import numpy as np 
def split_sequence(sequence,n_steps):
    x,y = [],[]
    for i in range(len(sequence)):
        # index just past the end of the current input window (the target position)
        end_idx = i+n_steps
        if end_idx>len(sequence)-1:
            break
        input_x,input_y = sequence[i:end_idx],sequence[end_idx]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)
raw_seq = [10,20,30,40,50,60,70,80,90]
n_steps = 3
x,y = split_sequence(raw_seq,n_steps)
for i in range(len(x)):
    print(x[i],y[i])
[10 20 30] 40
[20 30 40] 50
[30 40 50] 60
[40 50 60] 70
[50 60 70] 80
[60 70 80] 90

A multivariate time series is data with more than one observation at each time step.

Multiple input series

Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
3 40,45 85
4 50,55 105
5 60,65 125
6 70,75 145
7 80,85 165
8 90,95 185
in_seq1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = np.array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])

array([ 25,  45,  65,  85, 105, 125, 145, 165, 185])
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
data = np.hstack((in_seq1,in_seq2,out_seq))
data
array([[ 10,  15,  25],
       [ 20,  25,  45],
       [ 30,  35,  65],
       [ 40,  45,  85],
       [ 50,  55, 105],
       [ 60,  65, 125],
       [ 70,  75, 145],
       [ 80,  85, 165],
       [ 90,  95, 185]])
With n_steps = 3, the first two samples are:

Index x1,x2 y
0 10,15
1 20,25
2 30,35 65

Index x1,x2 y
1 20,25
2 30,35
3 40,45 85
def split_sequence(sequence,n_steps):
    x,y = [],[]
    for i in range(len(sequence)):
        end_idx = i+n_steps
        if end_idx>len(sequence):
            break
        input_x,input_y = sequence[i:end_idx,:-1],sequence[end_idx-1,-1]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)

n_steps = 3
x,y = split_sequence(data,n_steps)
for i in range(len(x)):
    print(x[i], y[i])
    print("="*15)

[[10 15]
 [20 25]
 [30 35]] 65
===============
[[20 25]
 [30 35]
 [40 45]] 85
===============
[[30 35]
 [40 45]
 [50 55]] 105
===============
[[40 45]
 [50 55]
 [60 65]] 125
===============
[[50 55]
 [60 65]
 [70 75]] 145
===============
[[60 65]
 [70 75]
 [80 85]] 165
===============
[[70 75]
 [80 85]
 [90 95]] 185
===============
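The samples printed above form a 3D array laid out as [samples, time steps, features]. Some models expect one flat row per sample instead, so the sketch below checks the shapes and flattens each window. Flattening is an assumption about what a downstream model might need, not something the original post does, and `X_flat` is an illustrative name.

```python
import numpy as np

# Rebuild the two input series and their sum, as in the example.
in_seq1 = np.arange(10, 100, 10)
in_seq2 = np.arange(15, 105, 10)
data = np.column_stack((in_seq1, in_seq2, in_seq1 + in_seq2))

n_steps = 3
# Window i covers rows i .. i+n_steps-1; the target is y in the window's last row.
X = np.array([data[i:i + n_steps, :-1] for i in range(len(data) - n_steps + 1)])
y = data[n_steps - 1:, -1]

print(X.shape, y.shape)             # [samples, time steps, features] and [samples]
X_flat = X.reshape(X.shape[0], -1)  # one row of n_steps * n_features values per sample
print(X_flat.shape)
```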

Multiple series output

Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
3 40,45 85
4 50,55 105
5 60,65 125
6 70,75 145
7 80,85 165
8 90,95 185
Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
Output
3 40,45 85
def split_sequence(sequences,n_step):
    x,y = [],[]
    for i in range(len(sequences)):
        end_idx = i+n_step
        if end_idx>len(sequences)-1:
            break
        input_x,input_y = sequences[i:end_idx,:],sequences[end_idx,:]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)

in_seq1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = np.array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])

in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))

data = np.hstack((in_seq1, in_seq2, out_seq))
n_steps = 3
x,y = split_sequence(data,n_steps)
for i in range(len(x)):
    print(x[i], y[i])
    print("="*20)
[[10 15 25]
 [20 25 45]
 [30 35 65]] [40 45 85]
====================
[[20 25 45]
 [30 35 65]
 [40 45 85]] [ 50  55 105]
====================
[[ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]] [ 60  65 125]
====================
[[ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]] [ 70  75 145]
====================
[[ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]] [ 80  85 165]
====================
[[ 60  65 125]
 [ 70  75 145]
 [ 80  85 165]] [ 90  95 185]
====================

Multi-step input, multi-step output

Index Data
0 10
1 20
2 30
3 40
4 50
5 60
6 70
7 80
8 90
9 100
10 110
Index Data
0 10
1 20
2 30
Output
3 40
4 50
5 60
def split_sequence(sequence,n_steps_in,n_steps_out):
    x,y =[],[]
    for i in range(len(sequence)):
        end_idx = i+n_steps_in
        out_end_idx = end_idx + n_steps_out
        
        if out_end_idx>len(sequence):
            break
        
        input_x,input_y = sequence[i:end_idx],sequence[end_idx:out_end_idx]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)



raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]

n_steps_in, n_steps_out = 3, 2
x, y = split_sequence(raw_seq, n_steps_in, n_steps_out)
for i in range(len(x)):
    print(x[i], y[i])
    print("="*20)
[10 20 30] [40 50]
====================
[20 30 40] [50 60]
====================
[30 40 50] [60 70]
====================
[40 50 60] [70 80]
====================
[50 60 70] [80 90]
====================
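The number of samples follows from the window arithmetic: a series of length N with n_steps_in inputs and n_steps_out outputs yields N - n_steps_in - n_steps_out + 1 samples. The sketch below checks that count against the run above; the names `n_samples` and `pairs` are illustrative.

```python
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n_steps_in, n_steps_out = 3, 2

# A window needs n_steps_in inputs plus n_steps_out outputs in a row.
n_samples = max(0, len(raw_seq) - n_steps_in - n_steps_out + 1)

# Build (input window, output window) pairs for every valid start index.
pairs = [
    (raw_seq[i:i + n_steps_in],
     raw_seq[i + n_steps_in:i + n_steps_in + n_steps_out])
    for i in range(n_samples)
]
print(n_samples, pairs[0], pairs[-1])
```

With 9 values, 3 input steps, and 2 output steps this gives 5 samples, matching the five pairs printed above.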

Multi-dimensional input and output

Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
3 40,45 85
4 50,55 105
5 60,65 125
6 70,75 145
7 80,85 165
8 90,95 185
Index x1,x2 y
0 10,15
1 20,25
2 30,35
Output
2 65
3 85
def split_sequence(sequences,n_steps_in,n_steps_out):
    x,y = [],[]
    for i in range(len(sequences)):
        end_idx = i+n_steps_in
        out_end_idx = end_idx+n_steps_out-1
        if out_end_idx>len(sequences):
            break
        input_x,input_y = sequences[i:end_idx,:-1],sequences[end_idx-1:out_end_idx,-1]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)


in_seq1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = np.array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])
# convert to [rows, columns] structure
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
data = np.hstack((in_seq1, in_seq2, out_seq))
n_steps_in, n_steps_out = 3, 2
x, y = split_sequence(data, n_steps_in, n_steps_out)
for i in range(len(x)):
    print(x[i], y[i])
    print("="*30)
[[10 15]
 [20 25]
 [30 35]] [65 85]
==============================
[[20 25]
 [30 35]
 [40 45]] [ 85 105]
==============================
[[30 35]
 [40 45]
 [50 55]] [105 125]
==============================
[[40 45]
 [50 55]
 [60 65]] [125 145]
==============================
[[50 55]
 [60 65]
 [70 75]] [145 165]
==============================
[[60 65]
 [70 75]
 [80 85]] [165 185]
==============================
Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
3 40,45 85
4 50,55 105
5 60,65 125
6 70,75 145
7 80,85 165
8 90,95 185
Index x1,x2 y
0 10,15 25
1 20,25 45
2 30,35 65
3 40,45 85
Output
4 50,55 105
5 60,65 125
def split_sequences(sequences,n_steps_in,n_steps_out):
    x,y = [],[]
    for i in range(len(sequences)):
        end_idx = i+n_steps_in
        out_end_idx = end_idx+n_steps_out
        
        if out_end_idx>len(sequences):
            break
        input_x,input_y = sequences[i:end_idx,:],sequences[end_idx:out_end_idx,:]
        x.append(input_x)
        y.append(input_y)
    return np.array(x),np.array(y)

in_seq1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = np.array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])

in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))

dataset = np.hstack((in_seq1, in_seq2, out_seq))
n_steps_in, n_steps_out = 3, 2
X, y = split_sequences(dataset, n_steps_in, n_steps_out)
for i in range(len(X)):
    print(X[i], y[i])
    print("="*30)
[[10 15 25]
 [20 25 45]
 [30 35 65]] [[ 40  45  85]
 [ 50  55 105]]
==============================
[[20 25 45]
 [30 35 65]
 [40 45 85]] [[ 50  55 105]
 [ 60  65 125]]
==============================
[[ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]] [[ 60  65 125]
 [ 70  75 145]]
==============================
[[ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]] [[ 70  75 145]
 [ 80  85 165]]
==============================
[[ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]] [[ 80  85 165]
 [ 90  95 185]]
==============================
