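This series of notes records my exercises while working through "Long Short Term Memory Networks with Python". The book focuses on implementing various kinds of LSTM networks with Keras, and I am sharing the Jupyter notebook code and notes I wrote along the way.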
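Data is usually scaled in one of two ways, normalization or standardization, and both are available in scikit-learn.
1.Normalize Series data
Normalization rescales the data to the range 0 to 1. It is a poor fit when the data varies widely, because the fitted minimum and maximum are then dominated by the extreme values.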
from pandas import Series
from sklearn.preprocessing import MinMaxScaler
#define the data
data = [10.0,20.0,30.,40.0,50.0,60.0,70.0,80.0,90.0,100.0]
series = Series(data)
print(series)
#prepare data for normalization
values = series.values
values = values.reshape((len(values),1))
print(values.shape)
#train the normalization
scaler = MinMaxScaler(feature_range=(0,1))
scaler = scaler.fit(values)
#data_min_ and data_max_ are per-feature arrays, so index the single feature
print('Min:%f,Max:%f \n' % (scaler.data_min_[0],scaler.data_max_[0]))
#normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)
#inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)
Output:
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
7 80.0
8 90.0
9 100.0
dtype: float64
(10, 1)
Min:10.000000,Max:100.000000
[[ 0. ]
[ 0.11111111]
[ 0.22222222]
[ 0.33333333]
[ 0.44444444]
[ 0.55555556]
[ 0.66666667]
[ 0.77777778]
[ 0.88888889]
[ 1. ]]
[[ 10.]
[ 20.]
[ 30.]
[ 40.]
[ 50.]
[ 60.]
[ 70.]
[ 80.]
[ 90.]
[ 100.]]
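To make the rescaling explicit, here is a minimal sketch of the same min-max arithmetic written by hand with NumPy: x_scaled = (x - min) / (max - min). The helper names minmax_scale and minmax_inverse are made up for this illustration and are not part of scikit-learn. The last lines also show why the method is fragile: a later value outside the fitted range is mapped outside 0 to 1.

```python
import numpy as np

# Same data as in the example above
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])

# "Fit": remember the minimum and maximum of the training data
data_min = values.min()
data_max = values.max()

def minmax_scale(x, lo=data_min, hi=data_max):
    """Hypothetical helper: map x into [0, 1] using the fitted min and max."""
    return (x - lo) / (hi - lo)

def minmax_inverse(x_scaled, lo=data_min, hi=data_max):
    """Hypothetical helper: undo minmax_scale."""
    return x_scaled * (hi - lo) + lo

normalized = minmax_scale(values)
print(normalized)

restored = minmax_inverse(normalized)
print(restored)

# A later observation outside the fitted range lands outside [0, 1]
outlier_scaled = minmax_scale(np.array([150.0]))
print(outlier_scaled)
```

If new data is expected to exceed the range seen during fitting, standardization is often the safer choice.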
2.Standardize Series data
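Standardization rescales the data to a mean of 0 and a standard deviation of 1. It is more robust than the min-max approach because it does not depend only on the two most extreme values. Use StandardScaler from sklearn.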
from pandas import Series
from sklearn.preprocessing import StandardScaler
from math import sqrt
#define the data
data = [1.0,5.0,6.0,8.0,2.5,4.1,7.9,6.3]
series = Series(data)
print(series)
#prepare data for standardization
values = series.values
values = values.reshape((len(values),1))
print(values.shape)
#fit the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
#mean_ and var_ are per-feature arrays, so index the single feature
print('mean:%f,standardDeviation:%f \n' % (scaler.mean_[0],sqrt(scaler.var_[0])))
#standardize the dataset and print
standardized = scaler.transform(values)
print(standardized)
#inverse transform and print
inversed = scaler.inverse_transform(standardized)
print(inversed)
Output:
0 1.0
1 5.0
2 6.0
3 8.0
4 2.5
5 4.1
6 7.9
7 6.3
dtype: float64
(8, 1)
mean:5.100000,standardDeviation:2.320560
[[-1.7668147 ]
[-0.04309304]
[ 0.38783737]
[ 1.2496982 ]
[-1.12041908]
[-0.43093041]
[ 1.20660516]
[ 0.5171165 ]]
[[ 1. ]
[ 5. ]
[ 6. ]
[ 8. ]
[ 2.5]
[ 4.1]
[ 7.9]
[ 6.3]]
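The standardization above can also be checked by hand. The sketch below, using only NumPy, applies z = (x - mean) / std to the same data. One detail worth knowing: StandardScaler uses the population standard deviation (dividing by n, i.e. ddof=0), while pandas Series.std() defaults to the sample standard deviation (dividing by n - 1), so the two give slightly different numbers. The variable names are chosen for this illustration only.

```python
import numpy as np

# Same data as in the example above
values = np.array([1.0, 5.0, 6.0, 8.0, 2.5, 4.1, 7.9, 6.3])

mean = values.mean()
std_population = values.std(ddof=0)  # divide by n, the convention StandardScaler uses
std_sample = values.std(ddof=1)      # divide by n - 1, the pandas Series.std() default

# Standardize, then invert the transform
standardized = (values - mean) / std_population
restored = standardized * std_population + mean

print('mean:%f, population std:%f, sample std:%f' % (mean, std_population, std_sample))
print(standardized)
print(restored)
```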
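Categorical data can be encoded in two ways:
1.Integer encoding
Suited to categorical variables whose values have a natural order; sklearn's LabelEncoder can be used. Note that LabelEncoder numbers the classes in sorted order of their names (here cold=0, hot=1, warm=2), not in order of temperature.
2.One-hot encoding
Suited to categorical variables whose values have no natural order; sklearn's OneHotEncoder can be used.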
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
#define example
data = ['cold','cold','warm','hot','hot','cold','warm','hot','hot','cold','warm']
values = array(data)
print(values)
#integer encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
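#one hot (scikit-learn 1.2 and later call this argument sparse_output; older releases use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded),1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)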
print(onehot_encoded)
#inverse first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0,:])])
print(inverted)
Output:
['cold' 'cold' 'warm' 'hot' 'hot' 'cold' 'warm' 'hot' 'hot' 'cold' 'warm']
[0 0 2 1 1 0 2 1 1 0 2]
[[ 1. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]
['cold']
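For intuition, both encodings can be reproduced with NumPy alone. This is only a sketch of the idea, not how scikit-learn implements it: np.unique supplies the sorted class list and the integer codes, and rows of an identity matrix supply the one-hot vectors. It also decodes every row at once rather than just the first one.

```python
import numpy as np

data = ['cold', 'cold', 'warm', 'hot', 'hot', 'cold', 'warm', 'hot', 'hot', 'cold', 'warm']
values = np.array(data)

# Integer encoding: np.unique returns the sorted distinct labels and,
# with return_inverse=True, the index of each value in that sorted list
classes, integer_encoded = np.unique(values, return_inverse=True)
print(classes)
print(integer_encoded)

# One-hot encoding: take row i of the identity matrix for class index i
onehot_encoded = np.eye(len(classes))[integer_encoded]
print(onehot_encoded)

# Decode every row: argmax recovers the class index, which indexes into classes
decoded = classes[onehot_encoded.argmax(axis=1)]
print(decoded)
```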
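1.Sequence padding: the pad_sequences() function in keras pads sequences at the start or the end with a given value. The default fill value is 0.0 and the default is padding='pre', which pads at the start.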
import keras
from keras.preprocessing.sequence import pad_sequences
#define sequence
sequence = [
[1,2,3,4],
[1,2,3],
[1]
]
#pad sequence
padded = pad_sequences(sequence)
print('padded_pre:\n',padded)
padded_post = pad_sequences(sequence,padding='post')
print('padded_post:\n',padded_post)
Output:
padded_pre:
[[1 2 3 4]
[0 1 2 3]
[0 0 0 1]]
padded_post:
[[1 2 3 4]
[1 2 3 0]
[1 0 0 0]]
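2.Sequence truncation
Setting maxlen in pad_sequences() truncates longer sequences, either from the start or from the end. The default is truncating='pre', which drops values from the start.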
import keras
from keras.preprocessing.sequence import pad_sequences
#define sequence
sequence = [
[1,2,3,4],
[1,2,3],
[1]
]
#truncate sequence
truncated_pre = pad_sequences(sequence,maxlen=2)
print('truncated_pre:\n',truncated_pre)
truncated_post = pad_sequences(sequence,maxlen=2,truncating='post')
print('truncated_post:\n',truncated_post)
Output:
truncated_pre:
[[3 4]
[2 3]
[0 1]]
truncated_post:
[[1 2]
[1 2]
[0 1]]
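To see how padding and truncation interact, the following NumPy sketch re-creates the behaviour shown in the two examples above. The function pad_or_truncate is a hypothetical helper written for these notes; it is not Keras code and only covers lists of integers with maxlen of at least 1.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical helper mimicking the padding/truncation shown above."""
    if maxlen is None:
        # Without maxlen, pad everything to the longest sequence
        maxlen = max(len(s) for s in sequences)
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue
        # Truncate: keep the last maxlen items ('pre') or the first maxlen items ('post')
        kept = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        # Pad: right-align ('pre') or left-align ('post') the kept items
        if padding == 'pre':
            out[i, maxlen - len(kept):] = kept
        else:
            out[i, :len(kept)] = kept
    return out

sequence = [[1, 2, 3, 4], [1, 2, 3], [1]]
print(pad_or_truncate(sequence))
print(pad_or_truncate(sequence, padding='post'))
print(pad_or_truncate(sequence, maxlen=2))
print(pad_or_truncate(sequence, maxlen=2, truncating='post'))
```

Note that the shortest sequence [1] is still padded at the start in both truncation cases, because truncating and padding are controlled by separate arguments.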
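Supervised learning needs both inputs and labels. To frame sequence prediction as a supervised problem, the input at each time step is the observation from the previous time step (the series lagged by one step) and the label is the current observation. The pandas shift() function can be used to build these lagged columns.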
from pandas import DataFrame
#define sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)
#shift forward
df['t-1'] = df['t'].shift(1)
print("shift forward:\n",df)
#shift backward
df['t+1'] = df['t'].shift(-1)
print("\n shift backward:\n",df)
Output:
t
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
shift forward:
t t-1
0 0 NaN
1 1 0.0
2 2 1.0
3 3 2.0
4 4 3.0
5 5 4.0
6 6 5.0
7 7 6.0
8 8 7.0
9 9 8.0
shift backward:
t t-1 t+1
0 0 NaN 1.0
1 1 0.0 2.0
2 2 1.0 3.0
3 3 2.0 4.0
4 4 3.0 5.0
5 5 4.0 6.0
6 6 5.0 7.0
7 7 6.0 8.0
8 8 7.0 9.0
9 9 8.0 NaN
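As a possible next step, the shifted column can be turned into input/label arrays. The sketch below assumes a one-step lag and a single feature: it drops the row where the lagged value is NaN, uses t-1 as the input X and t as the label y, and reshapes X to the three-dimensional [samples, time steps, features] layout that Keras LSTM layers take as input. The variable names are chosen for this illustration.

```python
import pandas as pd

# Same series as above: 0..9
df = pd.DataFrame({'t': range(10)})

# Input is the previous observation, label is the current observation
df['t-1'] = df['t'].shift(1)

# The first row has no previous value (NaN), so drop it
supervised = df.dropna()

X = supervised[['t-1']].to_numpy()
y = supervised['t'].to_numpy()

# Reshape to [samples, time steps, features] = [9, 1, 1]
X = X.reshape((X.shape[0], 1, 1))
print(X.shape, y.shape)
```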