Car Evaluation dataset download link: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
The download page contains a detailed description of the data; process the data according to that description.
The data is split into 7 columns, including the classification result, Result.
column_names = ['Buying', 'Maint', 'Doors', 'Persons', 'Lug_boot', 'Safety', 'Result']
raw_datasets = pd.read_csv('./data/PRACTICE/car.data', names=column_names,
                           na_values='?', sep=',', skipinitialspace=True)
# Note: car.data has no header row, so skiprows=1 would silently drop the first record.
Note that five of these columns hold string categories rather than numbers, so they must be converted to numeric values before they can be fed to the model for training.
## Check for unknown values (NaN)
# print(raw_datasets.isna().sum())
## Simply drop any rows containing NaN
# raw_datasets = raw_datasets.dropna()
## get_dummies(): convert a categorical variable into indicator (one-hot) variables
price = pd.get_dummies(raw_datasets.Buying, prefix='Buying')
## Convert the other string columns the same way.
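Rather than encoding column by column, `pd.get_dummies` can one-hot encode all string columns in a single call. A minimal sketch on a hypothetical three-row sample (column values mirror car.data's categories):

```python
import pandas as pd

# Hypothetical sample with a few of car.data's categorical columns.
df = pd.DataFrame({
    'Buying': ['vhigh', 'low', 'med'],
    'Safety': ['low', 'high', 'med'],
    'Result': ['unacc', 'acc', 'good'],
})

# One-hot encode every listed column at once; the column name becomes the
# prefix automatically, e.g. Buying -> Buying_vhigh / Buying_low / Buying_med.
encoded = pd.get_dummies(df, columns=['Buying', 'Safety', 'Result'])
print(sorted(encoded.columns))
```

Each original column with k distinct values expands into k indicator columns, which is where names like `Result_acc` and `Result_unacc` below come from.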
def dataframe_to_tfdata(dataframe, shuffle=True, batch_size=32):
    c_dataframe = dataframe.copy()
    labels_tmp0 = c_dataframe.pop('Result_acc')
    labels_tmp1 = c_dataframe.pop('Result_good')
    labels_tmp2 = c_dataframe.pop('Result_unacc')
    labels_tmp3 = c_dataframe.pop('Result_vgood')
    labels_tensor = pd.concat([labels_tmp0, labels_tmp1, labels_tmp2, labels_tmp3], axis=1)
    tfdata = tf.data.Dataset.from_tensor_slices((dict(c_dataframe), labels_tensor.values))
    if shuffle:
        tfdata = tfdata.shuffle(buffer_size=len(c_dataframe))
    tfdata = tfdata.batch(batch_size)
    return tfdata
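A quick sanity check of this feature/label conversion on a tiny hypothetical dataframe (two feature columns, two one-hot label columns), to confirm the shapes one batch yields:

```python
import pandas as pd
import tensorflow as tf

# Tiny hypothetical frame: two features, two one-hot label columns.
df = pd.DataFrame({'f1': [0., 1., 0., 1.],
                   'f2': [1., 0., 1., 0.],
                   'Result_a': [1, 0, 1, 0],
                   'Result_b': [0, 1, 0, 1]})

# pop() the label columns off the frame, stack them side by side,
# then pair the remaining feature dict with the label matrix.
labels = pd.concat([df.pop('Result_a'), df.pop('Result_b')], axis=1)
ds = tf.data.Dataset.from_tensor_slices((dict(df), labels.values)).batch(2)

features, y = next(iter(ds))
print(features['f1'].shape, y.shape)
```

Each batch yields a dict of feature tensors of shape (batch,) plus a label tensor of shape (batch, num_classes).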
## The more complete the set of feature columns, the more informative the feature layer, and the more accurate the trained model.
feature_column_data = []
for col in ['Buying_high', 'Buying_low', 'Buying_med', 'Buying_vhigh',
            'Maint_high', 'Maint_low', 'Maint_med', 'Maint_vhigh',
            'Doors', 'Persons', 'Lug_boot_big',
            'Lug_boot_med', 'Lug_boot_small', 'Safety_high', 'Safety_low', 'Safety_med']:
    feature_column_data.append(tf.feature_column.numeric_column(col))
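The code below assumes `train_data`, `val_data`, and `test_data` already exist. One hedged way to produce such a split with plain pandas (the 80/20/20 fractions here are just an illustrative choice):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the fully one-hot-encoded dataframe.
rng = np.random.RandomState(0)
datasets = pd.DataFrame(rng.rand(100, 3), columns=['a', 'b', 'c'])

# 80% train+val / 20% test, then carve 20% of the training part off
# as validation -> 64% train, 16% val, 20% test overall.
train_data = datasets.sample(frac=0.8, random_state=0)
test_data = datasets.drop(train_data.index)
val_data = train_data.sample(frac=0.2, random_state=0)
train_data = train_data.drop(val_data.index)
print(len(train_data), len(val_data), len(test_data))
```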
batch = 16
train_ds = dataframe_to_tfdata(train_data, batch_size=batch)
#train_ds = train_ds.shuffle(10000).repeat().batch(batch)
val_ds = dataframe_to_tfdata(val_data, shuffle=False, batch_size=batch)
test_ds = dataframe_to_tfdata(test_data, shuffle=False, batch_size=batch)
model = tf.keras.Sequential([
    layers.DenseFeatures(feature_column_data),
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(4, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds,
          validation_data=val_ds,
          # steps_per_epoch=23,
          epochs=100
          # validation_steps=5
          )
## If you are unsure about the two parameters steps_per_epoch and validation_steps, you can simply omit them; leaving them out does not hurt the trained model's accuracy.
loss, accuracy = model.evaluate(test_ds)
print("accuracy=", accuracy)
The error "BaseCollectiveExecutor::StartAbort Out of range: End of sequence" is very common; it occurs when these parameters (batch, epochs, steps_per_epoch, validation_steps) are not set consistently with each other.
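The error typically means `steps_per_epoch` asked for more batches than the dataset contains. If you do set it, either call `.repeat()` on the dataset or derive the step counts from the data sizes. A sketch with hypothetical sizes (the actual split sizes depend on your data):

```python
import math

# Hypothetical sizes for this kind of setup.
n_train, n_val, batch = 1106, 277, 16

# Without .repeat(), one epoch can supply at most ceil(n / batch) batches
# (the final partial batch still counts as a step), so steps_per_epoch
# must not exceed this number.
steps_per_epoch = math.ceil(n_train / batch)
validation_steps = math.ceil(n_val / batch)
print(steps_per_epoch, validation_steps)
```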
The best training accuracy reached 0.9928 with a loss of 0.0163;
the accuracy on the actual test set was 0.9798.
"Tips for model training"
## Adding the following two lines stops training early once validation accuracy stops improving noticeably:
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=0,
                    callbacks=[early_stop, PrintDot()])  # PrintDot: a user-defined progress callback
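A self-contained sketch of the early-stopping behavior, on hypothetical random data (sizes and patience are illustrative): training halts once `val_loss` fails to improve for `patience` consecutive epochs, so the run usually ends well before the `epochs` limit.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy regression data.
x = np.random.rand(64, 4).astype('float32')
y = np.random.rand(64, 1).astype('float32')

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu'),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Stop when val_loss has not improved for 3 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(x, y, epochs=50, validation_split=0.2,
                    verbose=0, callbacks=[early_stop])
print(len(history.history['loss']))
```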
## Lessons learned
1. When numeric input features have different value ranges, scale each feature independently to a common range.
2. When training data is scarce, prefer a small network with few hidden layers to avoid overfitting.
3. Early stopping is an effective technique for preventing overfitting.
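Point 1 above can be sketched with plain pandas: standardize each column independently to zero mean and unit variance (the two-column frame here is a hypothetical example of features on very different scales).

```python
import pandas as pd

# Hypothetical features with very different ranges.
df = pd.DataFrame({'age': [20.0, 40.0, 60.0],
                   'income': [2e4, 5e4, 8e4]})

# describe().transpose() puts per-column mean/std into rows we can index;
# subtracting the mean and dividing by the std standardizes each column.
stats = df.describe().transpose()
normed = (df - stats['mean']) / stats['std']
print(normed.round(3))
```

After this, both columns contribute on the same scale regardless of their original units.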