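This experiment runs a regression analysis on house prices using the boston_housing dataset. The data were collected in the 1970s from the areas around Boston, and the dataset is a classic in machine learning. It is small: 506 records in total, split into 404 training samples and 102 test samples, so K-fold validation is used, here with K=4. Each record has 13 features.
Because each feature has a different range of values, the data must be normalized first. The implementation is as follows: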
from keras.datasets import boston_housing
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt
#train_data.shape:(404, 13),test_data.shape:(102, 13)
#train_targets.shape:(404,),test_targets.shape:(102,)
#the data comprises 13 features
#the targets are the median values of owner-occupied homes,in thousands of dollars
(train_data,train_targets),(test_data,test_targets)=boston_housing.load_data()
#feature-wise normalization
mean = train_data.mean(axis=0)
train_data -=mean
std = train_data.std(axis=0)
train_data/=std
#never use any quantity computed on the test data
test_data-=mean
test_data/=std
#build the model
#the model is built several times (once per fold),so construction is wrapped in a function
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64,activation='relu',input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64,activation='relu'))
    #single linear output unit for scalar regression
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop',loss='mse',metrics=['mae'])
    return model
#K-fold validation
k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_scores = []
all_mae_histories = []
for i in range(k):
    print('processing fold #',i)
    #prepare the validation data:data from partition #i
    val_data = train_data[i*num_val_samples:(i+1)*num_val_samples]
    val_targets = train_targets[i*num_val_samples:(i+1)*num_val_samples]
    #prepare the training data:data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i*num_val_samples],
         train_data[(i+1)*num_val_samples:]],
        axis=0
    )
    partial_train_targets = np.concatenate(
        [train_targets[:i*num_val_samples],
         train_targets[(i+1)*num_val_samples:]],
        axis=0
    )
    #build a fresh model for each fold
    model = build_model()
    #train the model,silent mode
    history = model.fit(partial_train_data,partial_train_targets,
                        validation_data=(val_data,val_targets),
                        epochs=num_epochs,batch_size=1,verbose=0)
    #per-epoch validation MAE;this key name is used by older Keras releases,
    #newer releases may name it 'val_mae'
    mae_history = history.history['val_mean_absolute_error']
    #evaluate the model on the validation data
    val_mse,val_mae = model.evaluate(val_data,val_targets,verbose=0)
    all_scores.append(val_mae)
    all_mae_histories.append(mae_history)
print("Complete!")
#average the per-epoch validation MAE over the k folds
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
mean_score = np.mean(all_scores)
print("mean_score:",mean_score)
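The fold slicing used in the loop above can be checked in isolation. The following is a minimal sketch that works on index arrays only, with no model involved; `kfold_indices` is a hypothetical helper introduced here for illustration and is not part of Keras or of the original script.

```python
import numpy as np

def kfold_indices(num_samples, k):
    """Return (train_idx, val_idx) pairs for contiguous K-fold splits."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    splits = []
    for i in range(k):
        # Validation fold: the i-th contiguous block of indices.
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        # Training set: everything before and after that block.
        train_idx = np.concatenate(
            [indices[:i * fold_size], indices[(i + 1) * fold_size:]])
        splits.append((train_idx, val_idx))
    return splits

# 404 training samples and k=4, as in the experiment.
splits = kfold_indices(404, 4)
for i, (train_idx, val_idx) in enumerate(splits):
    print(i, len(train_idx), len(val_idx))
```

With 404 samples and k=4, each fold validates on 101 samples and trains on the remaining 303. When the sample count is not divisible by k, this slicing scheme always keeps the leftover samples at the end in the training portion and never uses them for validation.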
Plot the validation score. The raw curve is noisy and hard to read:
#plotting validation scores
plt.plot(range(1,len(average_mae_history)+1),average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
Replot the curve, smoothed with an exponential moving average:
#plotting validation scores-excluding the first 10 data points
#using an exponential moving average to smooth the curve
def smooth_curve(points,factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous*factor+point*(1-factor))
        else:
            smoothed_points.append(point)
    return smoothed_points
#remove the first 10 points
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1,len(smooth_mae_history)+1),smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
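To see what the smoothing does, it helps to run it on a few numbers by hand. The sketch below repeats the same exponential-moving-average rule on made-up values (`toy_history` is illustrative, not measured data) and then picks the epoch with the lowest smoothed value using `np.argmin`. The `skipped` variable is a hypothetical offset for any leading points dropped before smoothing.

```python
import numpy as np

def smooth_curve(points, factor=0.9):
    """Exponential moving average: each point is blended with the previous smoothed value."""
    smoothed = []
    for point in points:
        if smoothed:
            smoothed.append(smoothed[-1] * factor + point * (1 - factor))
        else:
            smoothed.append(point)
    return smoothed

# Made-up validation MAE values, for illustration only.
toy_history = [4.0, 2.0, 3.0, 1.0, 2.0]
smoothed = smooth_curve(toy_history, factor=0.5)
print(smoothed)

# Number of leading points dropped before smoothing (hypothetical; 0 here).
skipped = 0
# argmin is 0-based, epochs are 1-based; add back any dropped points.
best_epoch = int(np.argmin(smoothed)) + 1 + skipped
print(best_epoch)
```

If the first 10 points are dropped before plotting, as in the script above, position p on the x-axis of the smoothed plot corresponds to training epoch p+10, so the offset should be added back when reading the best epoch off the curve.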
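The plot shows that the validation score is best at the 80th epoch, after which the model starts overfitting. Rebuild the model and train it on the full training set:
#from the plot above,the validation MAE reaches its lowest value at the 80th epoch
model = build_model()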
model.fit(train_data,train_targets,epochs=80,batch_size=16,verbose=0)
test_mse_score,test_mae_score = model.evaluate(test_data,test_targets)
print("final test_mae_score:",test_mae_score)
The output is:
102/102 [==============================] - 0s 853us/step
final test_mae_score: 2.667761251038196
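Summary:
1. Regression and classification use different loss functions; regression typically uses Mean Squared Error (MSE).
2. The usual evaluation metric for regression is Mean Absolute Error (MAE), whereas classification typically uses accuracy.
3. When little data is available, K-fold validation is typically used to make the evaluation of the trained model more reliable.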
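The difference between the loss (MSE) and the metric (MAE) is easiest to see on a few numbers. The following sketch computes both with NumPy on made-up targets and predictions; the values are assumptions chosen for illustration and do not come from the dataset.

```python
import numpy as np

# Made-up targets and predictions, in thousands of dollars.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

errors = y_pred - y_true           # [2, -2, 3]
mse = np.mean(errors ** 2)         # mean of squared errors: (4 + 4 + 9) / 3
mae = np.mean(np.abs(errors))      # mean of absolute errors: (2 + 2 + 3) / 3
print("MSE:", mse)
print("MAE:", mae)
```

MSE penalizes large errors more heavily, which makes it a common training loss, while MAE stays in the units of the target and is easier to interpret. Since the targets here are in thousands of dollars, the final test MAE of about 2.67 reported above means the predictions are off by roughly $2,670 on average.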