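WeChat official account: 机器学习杂货店 (Machine Learning Grocery)
Author: Peter
Editor: Peter
Hello everyone, and welcome to Machine Learning Grocery~
The case study in this article covers an important problem in machine learning: regression, which predicts a continuous value rather than a discrete label.
Note: despite its name, logistic regression is a classification algorithm, not a regression algorithm.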
The dataset contains 506 samples: 404 training samples and 102 test samples.
In [1]:
import numpy as np
from keras.datasets import boston_housing
In [2]:
# Load the Boston housing data, already split into training and test sets
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
In [3]:
train_data.shape  # each sample has 13 features
Out[3]:
(404, 13)
In [4]:
test_data.shape
Out[4]:
(102, 13)
In [5]:
train_targets[:10]
Out[5]:
array([15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6, 14.4])
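Feeding a neural network features whose value ranges differ widely makes learning harder, so the data should be standardized first.
Standardizing each feature: (original value - feature mean) / standard deviation. The result is a feature with a mean of 0 and a standard deviation of 1.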
In [6]:
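# NumPy implementation
mean = train_data.mean(axis=0)
train_data -= mean  # equivalent to train_data = train_data - mean
std = train_data.std(axis=0)
train_data /= std  # in-place division; a bare "train_data / std" would discard the result
# Test set
test_data -= mean  # use the mean and standard deviation of the training set
test_data /= std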
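To see the effect of this step on a small scale, here is a minimal sketch with made-up toy arrays (the names toy_train and toy_test are illustrative and not part of the original notebook). It checks that the training features end up with mean 0 and standard deviation 1, and that the test set is scaled with the training statistics only.

```python
import numpy as np

# Toy data (hypothetical values): 4 training samples, 2 features on very different scales
toy_train = np.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [3.0, 500.0],
                      [4.0, 700.0]])
toy_test = np.array([[2.5, 400.0]])

# Statistics are computed on the training set only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # reuse the training mean and std

print(toy_train_scaled.mean(axis=0))  # per-feature mean after scaling
print(toy_train_scaled.std(axis=0))   # per-feature standard deviation after scaling
print(toy_test_scaled)
```

A quick check like this on the real arrays can catch a scaling step that silently did nothing.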
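Points to note:
Because there are so few samples, we build a small network with 2 hidden layers of 64 units each. A smaller network helps reduce overfitting.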
In [7]:
from keras import models
from keras import layers

def build_model():
    # The same model is instantiated several times, so a single function builds it
    model = models.Sequential()
    model.add(layers.Dense(64,
                           activation="relu",
                           # second value of the training set's shape (number of features)
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1))  # single output unit, no activation
    model.compile(optimizer="rmsprop",
                  loss="mse",
                  metrics=["mae"])
    return model
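To get a feel for how small this network is, the sketch below counts its trainable parameters by hand. It assumes the layer sizes used in build_model (64, 64, 1) and 13 input features; the helper dense_params is a hypothetical name introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus biases in one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 13            # train_data.shape[1] in this article
layer_sizes = [64, 64, 1]  # units per Dense layer in build_model

total = 0
n_inputs = n_features
for n_units in layer_sizes:
    count = dense_params(n_inputs, n_units)
    print(f"Dense({n_units}): {count} parameters")
    total += count
    n_inputs = n_units  # this layer's output feeds the next layer

print("Total:", total)
```

Even this small model has far more parameters than the 404 training samples, which is one reason the article keeps the network shallow.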
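Characteristics of the network: the last layer has a single unit and no activation function, so it is a linear layer that can output any value; the loss is the mean squared error (MSE), and the mean absolute error (MAE) is monitored during training.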
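When there are very few samples, the way the validation set is split can cause high variance in the validation score, so the model cannot be evaluated reliably.
The best practice: use K-fold cross-validation.
How K-fold cross-validation works, taking 3 folds as an example: split the data into 3 partitions, train 3 identical models, each on 2 partitions and validated on the remaining one, then average the 3 validation scores.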
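The partitioning can be sketched on plain index arrays. The helper kfold_slices below is a hypothetical name used only for illustration; it assumes contiguous, unshuffled folds, and any samples left over when the size is not divisible by k stay in the training part.

```python
import numpy as np

def kfold_slices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    indices = np.arange(n_samples)
    for i in range(k):
        # Fold i is held out for validation
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        # Everything before and after fold i is used for training
        train_idx = np.concatenate([indices[:i * fold_size],
                                    indices[(i + 1) * fold_size:]])
        yield train_idx, val_idx

folds = list(kfold_slices(9, 3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validate on {val_idx}, train on {train_idx}")
```

Each sample is used for validation exactly once and for training in the other folds.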
In [8]:
import numpy as np

k = 4
num_val_samples = len(train_data) // k

for i in range(k):
    print("processing fold ......", i)
    # Prepare the validation data: the data from partition i
    val_data = train_data[i*num_val_samples: (i+1) * num_val_samples]
    val_targets = train_targets[i*num_val_samples: (i+1) * num_val_samples]
    # Merge the remaining partitions into the training data, along axis=0
    # Merge train_data
    partial_train_data = np.concatenate(
        [train_data[:i*num_val_samples],
         train_data[(i+1)*num_val_samples:]],
        axis=0
    )
    # Merge train_targets
    partial_train_targets = np.concatenate(
        [train_targets[:i*num_val_samples],
         train_targets[(i+1)*num_val_samples:]],
        axis=0
    )
processing fold ...... 0
processing fold ...... 1
processing fold ...... 2
processing fold ...... 3
In [9]:
import numpy as np

k = 4
num_val_samples = len(train_data) // k
number_epochs = 100
all_scores = []

for i in range(k):
    print("processing fold ......", i)
    # Prepare the validation data: the data from partition i
    val_data = train_data[i*num_val_samples: (i+1) * num_val_samples]
    val_targets = train_targets[i*num_val_samples: (i+1) * num_val_samples]
    # Merge the remaining partitions into the training data, along axis=0
    # Merge train_data
    partial_train_data = np.concatenate(
        [train_data[:i*num_val_samples],
         train_data[(i+1)*num_val_samples:]],
        axis=0
    )
    # Merge train_targets
    partial_train_targets = np.concatenate(
        [train_targets[:i*num_val_samples],
         train_targets[(i+1)*num_val_samples:]],
        axis=0
    )
    # Build a fresh model for every fold
    model = build_model()
    model.fit(partial_train_data,
              partial_train_targets,
              epochs=number_epochs,
              batch_size=1,
              verbose=0  # silent mode
              )
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
processing fold ...... 0
processing fold ...... 1
processing fold ...... 2
processing fold ...... 3
In [10]:
all_scores
Out[10]:
[2.6683619022369385,
2.8356902599334717,
2.8533785343170166,
2.9509527683258057]
In [11]:
np.mean(all_scores)
Out[11]:
2.827095866203308
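The validation MAE varies noticeably from fold to fold (from about 2.67 to 2.95), but the mean of about 2.83 is a much more reliable figure than any single score.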
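As a small worked example, the sketch below summarizes the four fold scores shown in Out[10] and converts the mean MAE into dollars, given that the targets in this dataset are expressed in thousands of dollars. The variable names fold_scores, mean_mae and spread are introduced here for illustration.

```python
import numpy as np

# Validation MAE per fold, copied from Out[10] above
fold_scores = np.array([2.6683619022369385, 2.8356902599334717,
                        2.8533785343170166, 2.9509527683258057])

mean_mae = fold_scores.mean()
spread = fold_scores.max() - fold_scores.min()

# Targets are in thousands of dollars, so multiply by 1000 for a dollar figure
print(f"mean MAE: {mean_mae:.3f} (about ${mean_mae * 1000:,.0f})")
print(f"range across folds: {spread:.3f}")
```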
In [14]:
import numpy as np

k = 4
num_val_samples = len(train_data) // k
number_epochs = 500
# Change: record the per-epoch validation MAE of every fold
all_mae_histories = []

for i in range(k):
    print("processing fold ......", i)
    # Prepare the validation data: the data from partition i
    val_data = train_data[i*num_val_samples: (i+1) * num_val_samples]
    val_targets = train_targets[i*num_val_samples: (i+1) * num_val_samples]
    # Merge the remaining partitions into the training data, along axis=0
    # Merge train_data
    partial_train_data = np.concatenate(
        [train_data[:i*num_val_samples],
         train_data[(i+1)*num_val_samples:]],
        axis=0
    )
    # Merge train_targets
    partial_train_targets = np.concatenate(
        [train_targets[:i*num_val_samples],
         train_targets[(i+1)*num_val_samples:]],
        axis=0
    )
    model = build_model()
    history = model.fit(
        partial_train_data,
        partial_train_targets,
        validation_data=(val_data, val_targets),  # added: validate after every epoch
        epochs=number_epochs,
        batch_size=1,
        verbose=0  # silent mode
    )
    # Record the validation MAE of every epoch
    # The original text uses the full key name:
    # mae_history = history.history["val_mean_absolute_error"]
    # The current version uses the abbreviated key
    mae_history = history.history["val_mae"]
    all_mae_histories.append(mae_history)
processing fold ...... 0
processing fold ...... 1
processing fold ...... 2
processing fold ...... 3
Compute, for each epoch, the mean of the MAE across all folds:
In [24]:
len(all_mae_histories)
Out[24]:
4
In [25]:
len(all_mae_histories[0])
Out[25]:
500
In [26]:
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(number_epochs)]
average_mae_history[:10]
Out[26]:
[6.998382568359375,
5.590555667877197,
6.159129858016968,
5.215211391448975,
5.865886926651001,
5.083814740180969,
5.314043760299683,
4.525156497955322,
4.432371437549591,
5.098088502883911]
In [27]:
len(average_mae_history)
Out[27]:
500
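The per-epoch average can also be written as a single NumPy reduction. The sketch below uses made-up toy histories (2 folds, 3 epochs) to show that averaging over axis 0 gives the same values as the list comprehension; it assumes every fold has the same number of epochs.

```python
import numpy as np

# Toy histories (made-up numbers): 2 folds, 3 epochs each
toy_histories = [[4.0, 3.0, 2.0],
                 [6.0, 5.0, 2.0]]
n_epochs = len(toy_histories[0])

# List-comprehension form, as in the cell above
avg_loop = [np.mean([h[i] for h in toy_histories]) for i in range(n_epochs)]

# Vectorized form: rows are folds, columns are epochs, so average over axis 0
avg_vec = np.mean(np.array(toy_histories), axis=0)

print(avg_loop)
print(avg_vec)
```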
In [31]:
import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()
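The plot above has a wide vertical range and the data has high variance, so it is hard to read. Redraw it by dropping the first 10 points and replacing each point with an exponential moving average of the previous points: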
In [38]:
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            # Blend the previous smoothed value with the new point
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Skip the first 10 epochs, whose values are on a different scale from the rest
smooth_mae_history = smooth_curve(average_mae_history[10:])

import matplotlib.pyplot as plt
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()
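One possible way to read a candidate epoch count off the smoothed curve is to locate its minimum. The sketch below is only an illustration on a synthetic curve (the values, the name toy_history and the variable n_skipped are assumptions, not results from this notebook), and it is not necessarily how the epoch count used in the next cell was chosen.

```python
import numpy as np

def smooth_curve(points, factor=0.9):
    """Exponential moving average, same logic as the cell above."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Synthetic validation curve: a shallow bowl around epoch 80 plus a small wiggle
epochs = np.arange(1, 201)
toy_history = (epochs - 80) ** 2 / 1000 + 2.5 + 0.2 * np.sin(epochs)

n_skipped = 10  # same number of leading points dropped as in the article
smoothed = smooth_curve(toy_history[n_skipped:])

# argmin is 0-based and relative to the truncated curve, so shift it back
best_epoch = int(np.argmin(smoothed)) + 1 + n_skipped
print("lowest smoothed validation MAE at epoch", best_epoch)
```

Note that the moving average lags behind the raw curve, so the minimum of the smoothed curve sits a few epochs later than the minimum of the underlying trend.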
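Train on the entire training set with the chosen parameters (here 80 epochs and a batch size of 16), then evaluate on the test set: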
In [40]:
model = build_model()
model.fit(train_data, train_targets,
          epochs=80,
          batch_size=16,
          verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
4/4 [==============================] - 0s 3ms/step - loss: 372.9089 - mae: 18.3248
In [41]:
test_mae_score
Out[41]:
18.324810028076172
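In this run the test MAE is about 18.32, so the predicted prices differ from the actual prices by roughly $18,000 on average (the targets are in thousands of dollars). That is far above the cross-validation MAE of about 2.83, which suggests the training and test features were not on the same scale in this run; it is worth checking that the training set was actually divided by std in place during standardization.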