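[LSTM Study Notes 1] LSTM Data Preparation

This series of notes records my practice while studying "Long Short Term Memory Networks with Python". The book implements various LSTM networks with Keras, and I am sharing the Jupyter notebook code and notes I wrote along the way.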

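I. Prepare Numeric Data

There are two common ways to scale data, normalization and standardization; both can be done with scikit-learn.

1.Normalize Series Data

Normalization rescales the data to the range 0 to 1. It is a poor fit when the data varies over a very wide range, for example when there are extreme outliers, because most values then get squeezed into a narrow band.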

from pandas import Series
from sklearn.preprocessing import MinMaxScaler
#define the data
data = [10.0,20.0,30.,40.0,50.0,60.0,70.0,80.0,90.0,100.0]
series = Series(data)
print(series)
#prepare data for normalization
values = series.values
values = values.reshape((len(values),1))
print(values.shape)
#train the normalization
scaler = MinMaxScaler(feature_range=(0,1))
scaler = scaler.fit(values)
#data_min_ and data_max_ are arrays with one entry per feature, so take element 0
print('Min:%f,Max:%f \n' % (scaler.data_min_[0],scaler.data_max_[0]))
#normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)
#inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)

Output:

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0
dtype: float64
(10, 1)
Min:10.000000,Max:100.000000 

[[ 0.        ]
 [ 0.11111111]
 [ 0.22222222]
 [ 0.33333333]
 [ 0.44444444]
 [ 0.55555556]
 [ 0.66666667]
 [ 0.77777778]
 [ 0.88888889]
 [ 1.        ]]
[[  10.]
 [  20.]
 [  30.]
 [  40.]
 [  50.]
 [  60.]
 [  70.]
 [  80.]
 [  90.]
 [ 100.]]
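
The scaling applied above follows the formula y = (x - min) / (max - min), and reproducing it by hand makes that explicit. The following is a minimal NumPy sketch of the formula, not the scikit-learn implementation; the names `x_min`, `x_max`, `normalized_manual` and `inversed_manual` are illustrative and do not come from the original notes.

```python
import numpy as np

# Same series as in the MinMaxScaler example above
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])

# Minimum and maximum estimated from the data (here: the whole series)
x_min, x_max = values.min(), values.max()

# Forward transform: y = (x - min) / (max - min)
normalized_manual = (values - x_min) / (x_max - x_min)

# Inverse transform: x = y * (max - min) + min
inversed_manual = normalized_manual * (x_max - x_min) + x_min

print(normalized_manual)
print(inversed_manual)
```

The printed values should agree with the MinMaxScaler output above, up to the array shape (1D here, a column vector there).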

 

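2.Standardize Series Data

Standardization rescales the data to zero mean and unit standard deviation. It is generally less sensitive to extreme values than min-max scaling. Use StandardScaler from sklearn.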

from pandas import Series
from sklearn.preprocessing import StandardScaler
from math import sqrt

#define the data
data = [1.0,5.0,6.0,8.0,2.5,4.1,7.9,6.3]
series = Series(data)
print(series)
#prepare data for standardization
values = series.values
values = values.reshape((len(values),1))
print(values.shape)
#train the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
#mean_ and var_ are arrays with one entry per feature, so take element 0
print('mean:%f,standardDeviation:%f \n' % (scaler.mean_[0],sqrt(scaler.var_[0])))
#standardize the dataset and print
standardized = scaler.transform(values)
print(standardized)
#inverse transform and print
inversed = scaler.inverse_transform(standardized)
print(inversed)

Output:

0    1.0
1    5.0
2    6.0
3    8.0
4    2.5
5    4.1
6    7.9
7    6.3
dtype: float64
(8, 1)
mean:5.100000,standardDeviation:2.320560 

[[-1.7668147 ]
 [-0.04309304]
 [ 0.38783737]
 [ 1.2496982 ]
 [-1.12041908]
 [-0.43093041]
 [ 1.20660516]
 [ 0.5171165 ]]
[[ 1. ]
 [ 5. ]
 [ 6. ]
 [ 8. ]
 [ 2.5]
 [ 4.1]
 [ 7.9]
 [ 6.3]]
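
The same result can be checked by hand with z = (x - mean) / std. The sketch below is a plain NumPy illustration rather than the scikit-learn implementation; it uses the population standard deviation (`ddof=0`, the NumPy default), which is what StandardScaler uses. The variable names are illustrative.

```python
import numpy as np

# Same series as in the StandardScaler example above
values = np.array([1.0, 5.0, 6.0, 8.0, 2.5, 4.1, 7.9, 6.3])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Forward transform: z = (x - mean) / std
standardized_manual = (values - mean) / std

# Inverse transform: x = z * std + mean
inversed_manual = standardized_manual * std + mean

print(mean, std)
print(standardized_manual)
```

Passing `ddof=1` to `std()` would give the sample standard deviation instead, and the scaled values would then differ slightly from the output above.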

 

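II. Prepare Categorical Data

There are two approaches:

1.Integer encoding

 Better suited to categorical variables with a natural order; sklearn's LabelEncoder can be used. Note that LabelEncoder assigns integers in sorted label order (here cold=0, hot=1, warm=2), which need not match the natural order.

2.One-hot encoding

 Better suited to categorical variables with no natural order; sklearn's OneHotEncoder can be used.
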
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#define example
data = ['cold','cold','warm','hot','hot','cold','warm','hot','hot','cold','warm']
values = array(data)
print(values)
#integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

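#one hot
#sparse_output=False returns a dense array; scikit-learn older than 1.2 uses sparse=False instead
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded),1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)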
print(onehot_encoded)

#inverse first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0,:])])
print(inverted)

Output:

['cold' 'cold' 'warm' 'hot' 'hot' 'cold' 'warm' 'hot' 'hot' 'cold' 'warm']
[0 0 2 1 1 0 2 1 1 0 2]
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]
['cold']
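
In the output above, the integer codes follow the sorted label order (cold=0, hot=1, warm=2). When the order cold < warm < hot should be preserved, one option is an explicit mapping. The following is a minimal sketch using a plain dict and NumPy instead of the sklearn encoders; the names `order` and `to_int` and the chosen ordering are assumptions made for illustration.

```python
import numpy as np

data = ['cold', 'cold', 'warm', 'hot', 'hot', 'cold', 'warm', 'hot', 'hot', 'cold', 'warm']

# Assumed natural order of the categories (an illustrative choice)
order = ['cold', 'warm', 'hot']
to_int = {label: i for i, label in enumerate(order)}

# Integer encoding that respects the chosen order
integer_encoded = np.array([to_int[label] for label in data])
print(integer_encoded)

# One-hot encoding: row i of the identity matrix is the one-hot vector for class i
onehot_encoded = np.eye(len(order))[integer_encoded]
print(onehot_encoded[:3])

# Decode the first row back to its label
decoded = order[int(np.argmax(onehot_encoded[0]))]
print(decoded)
```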

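III. Prepare Variable-Length Sequences

1.Sequence padding: the pad_sequences() function in keras pads each sequence at the start or at the end with a given value. The default fill value is 0.0 and the default is padding='pre'.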

import keras
from keras.preprocessing.sequence import pad_sequences
#define sequence
sequence = [
    [1,2,3,4],
    [1,2,3],
    [1]
]
#pad sequence
padded = pad_sequences(sequence)
print('padded_pre:\n',padded)
padded_post = pad_sequences(sequence,padding='post')
print('padded_post:\n',padded_post)
Output:
padded_pre:
 [[1 2 3 4]
 [0 1 2 3]
 [0 0 0 1]]
padded_post:
 [[1 2 3 4]
 [1 2 3 0]
 [1 0 0 0]]

2.Sequence truncation

Sequences longer than maxlen can be truncated from the start or from the end; the default is truncating='pre'.

import keras
from keras.preprocessing.sequence import pad_sequences
#define sequence
sequence = [
    [1,2,3,4],
    [1,2,3],
    [1]
]
#truncate sequence
truncated_pre = pad_sequences(sequence,maxlen=2)
print('truncated_pre:\n',truncated_pre)
truncated_post = pad_sequences(sequence,maxlen=2,truncating='post')
print('truncated_post:\n',truncated_post)
Output:
truncated_pre:
 [[3 4]
 [2 3]
 [0 1]]
truncated_post:
 [[1 2]
 [1 2]
 [0 1]]
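
To make the 'pre' and 'post' options concrete, the behaviour shown above can be imitated in a few lines of plain Python. This is an illustrative sketch only, not the Keras source; the helper name `pad_or_truncate` is hypothetical, and it assumes `maxlen` is at least 1 when given.

```python
def pad_or_truncate(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Illustrative helper that imitates the padding and truncation shown above."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: 'pre' drops items from the front, 'post' from the end
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        # Then pad up to maxlen on the requested side
        fill = [value] * (maxlen - len(seq))
        seq = fill + seq if padding == 'pre' else seq + fill
        result.append(seq)
    return result


sequence = [[1, 2, 3, 4], [1, 2, 3], [1]]
print(pad_or_truncate(sequence))
print(pad_or_truncate(sequence, padding='post'))
print(pad_or_truncate(sequence, maxlen=2))
print(pad_or_truncate(sequence, maxlen=2, truncating='post'))
```

The four calls mirror the four pad_sequences() calls in the two examples above, so their results can be compared line by line with the printed output.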

 

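IV. Framing Sequence Prediction as Supervised Learning

Supervised learning needs both inputs and labels. To apply it to sequence prediction, pair each observation with the one from the previous time step: the value at t-1 is the input and the value at t is the label. This lag can be produced with the shift() function in pandas.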

from pandas import DataFrame

#define sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)

#shift forward
df['t-1'] = df['t'].shift(1)
print("shift forward:\n",df)
#shift backward
df['t+1'] = df['t'].shift(-1)
print("\n shift backward:\n",df)

Output:

   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
shift forward:
    t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0

 shift backward:
    t  t-1  t+1
0  0  NaN  1.0
1  1  0.0  2.0
2  2  1.0  3.0
3  3  2.0  4.0
4  4  3.0  5.0
5  5  4.0  6.0
6  6  5.0  7.0
7  7  6.0  8.0
8  8  7.0  9.0
9  9  8.0  NaN
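
The shifted columns can then be turned into input and label arrays. The following is a minimal sketch of one way to do this, assuming a one-step lag with 't-1' as the input and 't' as the label; the names `supervised`, `X`, `y` and `X_lstm` are illustrative, and the final reshape to [samples, timesteps, features] is an assumed downstream step for feeding a Keras LSTM layer.

```python
from pandas import DataFrame

# Same sequence as above, with a one-step lagged column
df = DataFrame({'t': list(range(10))})
df['t-1'] = df['t'].shift(1)

# The first row has no previous observation (NaN), so drop it
supervised = df.dropna()

# Input: value at t-1, label: value at t
X = supervised[['t-1']].values   # shape (9, 1)
y = supervised['t'].values       # shape (9,)

# Assumed downstream step: reshape to [samples, timesteps, features]
X_lstm = X.reshape((X.shape[0], 1, 1))
print(X_lstm.shape, y.shape)
```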
