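After reviewing several recent projects, my biggest takeaway is that two factors decide whether a project succeeds: modeling and data. Data problems are a given and cannot be changed, so can the model itself be made more stable and better at generalizing?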
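In July this year I read an NLP article that suggested an idea: arrange neural networks in a structure similar to a random forest and take the vote (the mean for regression, the class with the most votes for classification) as the model output. I thought this was a very good idea and decided to try it and see how it performs.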
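The voting rule described above can be sketched in a few lines. This is only an illustration under assumptions: the helper names `aggregate_regression` and `aggregate_classification` and the toy predictions are made up here, and class labels are assumed to be non-negative integers.

```python
import numpy as np

def aggregate_regression(predictions):
    """Average the outputs of all members (rows = models, columns = samples)."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def aggregate_classification(predictions):
    """Majority vote per sample; ties go to the smallest class label."""
    votes = np.asarray(predictions, dtype=int)
    n_classes = votes.max() + 1
    # Count the votes for each class, one column (sample) at a time.
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Toy predictions from three hypothetical models.
reg_preds = [[1.0, 2.0], [3.0, 4.0], [2.0, 6.0]]
cls_preds = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]
print(aggregate_regression(reg_preds))      # [2. 4.]
print(aggregate_classification(cls_preds))  # [0 1 2]
```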
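I decided to take the dataset from my previous case study (the butt welding machine data mining project) and use the random forest idea to build a "random neural network forest". First, the main ideas behind the random forest algorithm: 1. random samples; 2. random features. Random feature selection cannot be carried over here, because the parameters I use for modeling were all carefully chosen based on the process mechanism; picking them at random would be counterproductive.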
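For readers who have not seen the two kinds of randomness in code, here is a minimal sketch. It is simplified: a real random forest usually redraws the candidate features at every split, not once per tree. The function name `random_forest_style_sample` and the demo arrays are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def random_forest_style_sample(X, y, n_features):
    """Draw one bootstrap sample of rows and a random subset of feature columns."""
    n_rows, n_cols = X.shape
    row_idx = rng.integers(0, n_rows, size=n_rows)                # rows, with replacement
    col_idx = rng.choice(n_cols, size=n_features, replace=False)  # features, without replacement
    return X[row_idx][:, col_idx], y[row_idx], col_idx

# Toy data: 10 rows, 4 features.
X_demo = np.arange(40, dtype=float).reshape(10, 4)
y_demo = np.arange(10, dtype=float)
X_sub, y_sub, cols = random_forest_style_sample(X_demo, y_demo, n_features=2)
print(X_sub.shape, y_sub.shape, sorted(cols.tolist()))
```

In the approach described here only the row sampling is kept, since the features are fixed by process knowledge.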
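My Bagging BPnet approach is: 1. random samples; 2. random cell (cell is a hyperparameter, the number of neurons in each hidden layer). Random sampling is easy to understand, so let me explain random cell. In past neural network tuning I found that choosing hyperparameters is the hard part, so in this structure I let the hyperparameter be random and combine networks of different structures. This may give a different result.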
The Python code follows:
Bagging BPnet:
import numpy as np
import pandas as pd
from math import sqrt
from numpy import concatenate
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
# Load the data
dataset = pd.read_csv(r'C:\Users\Think\Desktop\dhj0418.csv')
scaler = MinMaxScaler(feature_range=(0, 1))
data = dataset.iloc[:, 0:9]
scaled = scaler.fit_transform(data)
Y = scaled[:, -1]
X = scaled[:, 0:-1]


def bp_tree(num):
    rmse_group = []
    r_group = []
    cell = [8, 16, 32, 64, 128, 256]
    for _ in range(num):
        # Random samples: a fresh 80/20 split for every network
        train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
        # Random cell: pick the hidden-layer width for this network
        active_cell = int(np.random.choice(cell))
        print(active_cell)
        n_inputs = train_x.shape[1]
        model = Sequential()
        model.add(Dense(active_cell, input_shape=(n_inputs,)))
        model.add(Activation('relu'))
        model.add(Dropout(0.2))
        model.add(Dense(active_cell))
        model.add(Activation('relu'))
        model.add(Dense(active_cell))
        model.add(Activation('relu'))
        model.add(Dense(1))
        model.compile(loss='mean_squared_error', optimizer=Adam())
        # Note: the test split is also used as the early-stopping validation set
        early_stopping = EarlyStopping(monitor='val_loss', patience=50, verbose=2)
        model.fit(train_x, train_y, epochs=350, batch_size=20,
                  validation_data=(test_x, test_y), verbose=2,
                  shuffle=False, callbacks=[early_stopping])
        yhat = model.predict(test_x)
        # Inverse-scale the predicted y
        inv_yhat0 = concatenate((test_x, yhat), axis=1)
        inv_yhat1 = scaler.inverse_transform(inv_yhat0)
        inv_yhat = inv_yhat1[:, -1]
        # Inverse-scale the original y
        test_y1 = test_y.reshape(len(test_y), 1)
        inv_y0 = concatenate((test_x, test_y1), axis=1)
        inv_y1 = scaler.inverse_transform(inv_y0)
        inv_y = inv_y1[:, -1]
        # Compute RMSE and R^2 for this network
        rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
        r_2 = r2_score(inv_y, inv_yhat)
        rmse_group.append(rmse)
        r_group.append(r_2)
    # Average the per-network scores
    result1 = np.mean(rmse_group)
    result2 = np.mean(r_group)
    return [result1, result2]


print(bp_tree(10))
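Parameter notes:
num: the number of trees, i.e. how many neural network models are built
cell=[8,16,32,64,128,256]: each time a network is built, one value is drawn at random from this list and used as the network's hyperparameter (the hidden-layer width)
result1: the mean RMSE over all networks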
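Note that `bp_tree` averages the RMSE of each network on its own test split. To get the "vote" described earlier (the mean of the predictions for regression), the members would have to predict on one shared held-out set and have their outputs averaged. The following is a rough sketch of that variant, not the code used in the experiment: scikit-learn's `MLPRegressor` is swapped in for the Keras network to keep it short, and `bagged_mlp_predict`, its defaults, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def bagged_mlp_predict(train_x, train_y, test_x, num=5, cell=(8, 16, 32), seed=0):
    """Train `num` small networks on random subsets and average their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _member in range(num):
        # Random samples: each member sees a different 80% of the training rows.
        sub_x, _, sub_y, _ = train_test_split(
            train_x, train_y, test_size=0.2,
            random_state=int(rng.integers(1_000_000)))
        # Random cell: each member gets its own hidden-layer width.
        active_cell = int(rng.choice(cell))
        model = MLPRegressor(hidden_layer_sizes=(active_cell,) * 3,
                             max_iter=300,
                             random_state=int(rng.integers(1_000_000)))
        model.fit(sub_x, sub_y)
        preds.append(model.predict(test_x))
    # Regression "vote": the mean prediction over all members.
    return np.mean(preds, axis=0)

# Synthetic data, used only to show that the sketch runs.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.uniform(0, 1, size=(200, 8))
y_demo = X_demo.sum(axis=1) + demo_rng.normal(0, 0.1, size=200)
tr_x, te_x, tr_y, te_y = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
y_hat = bagged_mlp_predict(tr_x, tr_y, te_x)
print('Ensemble RMSE: %.3f' % np.sqrt(np.mean((te_y - y_hat) ** 2)))
```

Keeping the test set fixed and outside the resampling loop means every member is scored on rows none of them trained on, so the ensemble and a single network can be compared on equal terms.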
BPnet:
import pandas as pd
from math import sqrt
from matplotlib import pyplot
from numpy import concatenate
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
dataset = pd.read_csv(r'C:\Users\Think\Desktop\dhj0418.csv')
scaler = MinMaxScaler(feature_range=(0, 1))
data=dataset.iloc[:,0:9]
scaled = scaler.fit_transform(data)
Y = scaled[:, -1]
X = scaled[:, 0:-1]
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
# Three hidden layers of 128 units; only the first layer needs the input shape
model = Sequential()
n_inputs = X.shape[1]
model.add(Dense(128, input_shape=(n_inputs,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=Adam())
early_stopping = EarlyStopping(monitor='val_loss', patience=50, verbose=2)
history = model.fit(train_x, train_y, epochs=350,batch_size=20,
validation_data=(test_x, test_y), verbose=2,
shuffle=False, callbacks=[early_stopping])
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.title('Model loss')
pyplot.ylabel('Loss')
pyplot.xlabel('Epoch')
pyplot.legend()
pyplot.show()
yhat = model.predict(test_x)
# Inverse-scale the predicted y
inv_yhat0 = concatenate((test_x, yhat), axis=1)
inv_yhat1 = scaler.inverse_transform(inv_yhat0)
inv_yhat = inv_yhat1[:, -1]
# Inverse-scale the original y
test_y = test_y.reshape(len(test_y), 1)
inv_y0 = concatenate((test_x, test_y), axis=1)
inv_y1 = scaler.inverse_transform(inv_y0)
inv_y = inv_y1[:, -1]
# Compute R^2
r_2 = r2_score(inv_y, inv_yhat)
print('Test r_2: %.3f' % r_2)
# Compute MAE
mae = mean_absolute_error(inv_y, inv_yhat)
print('Test MAE: %.3f' % mae)
# Compute RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
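Both scripts undo the scaling by gluing the target back onto `test_x` and inverse-transforming the whole table. A possibly simpler alternative, sketched below, uses the fitted `min_` and `scale_` attributes of `MinMaxScaler` to invert only the last column. The helper name `inverse_last_column` and the synthetic table are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def inverse_last_column(scaler, y_scaled):
    """Undo MinMaxScaler for the last column only (scaled = raw * scale_ + min_)."""
    y_scaled = np.asarray(y_scaled, dtype=float).ravel()
    return (y_scaled - scaler.min_[-1]) / scaler.scale_[-1]

# Synthetic 9-column table standing in for the real dataset.
rng = np.random.default_rng(0)
raw = rng.uniform(10, 50, size=(20, 9))
demo_scaler = MinMaxScaler(feature_range=(0, 1))
scaled_demo = demo_scaler.fit_transform(raw)

recovered = inverse_last_column(demo_scaler, scaled_demo[:, -1])
print(np.allclose(recovered, raw[:, -1]))  # True
```

The helper also accepts the `(n, 1)` array returned by `model.predict`, because it flattens its input first.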
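Experiment: I built the bagging BPnet and the plain BPnet ten times each, on the same dataset and with all other hyperparameters identical, and compared the RMSE of the two structures; see the figure below.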
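The standard deviation of the bagging BPnet RMSE is 0.03, far smaller than the 2.5 of the plain BP network, which shows that the bagging BPnet is more stable than the BPnet.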
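Summary: bagging is similar to Monte Carlo sampling: when the number of samples is large enough, the average converges to the expected value of the function, which approximates the true value.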
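In the table above, the BP network is sometimes very good and sometimes very poor. Bagging helps guard against a network overfitting or failing to converge. This experiment showed me that the method has a lot of potential; I will keep using it in later projects and watch how it performs.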
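The Monte Carlo analogy in the summary can be checked with a small simulation. This sketch assumes the members' errors are independent, which real networks trained on overlapping data are not, so the actual gain is likely smaller; averaging also reduces variance only, not systematic bias. The numbers are synthetic and are not taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, noise_std = 5.0, 2.0  # synthetic values, chosen only for the demo

def std_of_average(n_models, n_trials=20000):
    """Empirical std of the mean of `n_models` independent noisy predictions."""
    preds = true_value + rng.normal(0.0, noise_std, size=(n_trials, n_models))
    return preds.mean(axis=1).std()

# Under independence the std of the average should shrink like noise_std / sqrt(n).
for n in (1, 10, 100):
    print(n, round(std_of_average(n), 3), 'theory:', round(noise_std / np.sqrt(n), 3))
```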