A Fairly Complete Machine Learning Example
This article is organized from Chapter 2 of Hands-on Machine Learning with Scikit-Learn and TensorFlow. The original chapter discusses several alternative techniques for each step of a machine learning project, but mixing them all together is not beginner-friendly and makes it easy to lose the main thread. This article therefore trims the material down to one main technique per step, giving a clearer storyline. It also explains the book's content in plainer terms, and where an upgrade of Scikit-Learn breaks some of the book's code, a fix is provided.
The main contents are:
- data acquisition
- dataset splitting
- visual exploration
- data preparation for ML
- applying ML algorithms
- hyperparameter tuning
- model validation
Data Acquisition
We use a housing price dataset downloaded from the web.
Everything here is done in code, including creating the data directory, downloading the file, and extracting it.
The data acquisition step uses three modules:
- os: creating local directories
- urllib: handling the web URL and downloading the data file
- tarfile: extracting the compressed archive
# Fetch the data
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download and extract the housing data.

    Arguments: the source URL and the local directory to save into.
    """
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)  # create the target directory if it does not exist
    tgz_path = os.path.join(housing_path, "housing.tgz")  # build the local file path
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive from the URL
    housing_tgz = tarfile.open(tgz_path)  # open the compressed archive
    housing_tgz.extractall(path=housing_path)  # extract its contents
    housing_tgz.close()

# fetch_housing_data()
# # [Finished in 53.8s]
After the data has been fetched, load it into the Python environment with Pandas. Then get a first look at the data with the following methods:
- head(): view the first five rows
- info(): a concise summary of the data (row count, column dtypes, non-null counts)
- [].value_counts(): counts for a categorical feature
- describe(): summary statistics for the numerical features
# Load the data with Pandas, again via a small function
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    """Load the housing data.

    Argument: the directory containing the CSV file.
    """
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
# print(housing.head())  # first five rows
# print(housing.info())  # concise summary of the data
# print(housing["ocean_proximity"].value_counts())  # counts of the categorical feature
# print(housing.describe())  # summary statistics of the numerical features

# # Histograms of the numerical features
import matplotlib.pyplot as plt
# housing.hist(bins=50, figsize=(20, 15))  # bins: number of bins per histogram
# # figsize: size of the figure
# plt.show()
Dataset Splitting
There are several ways to split the dataset:
- write your own function that splits the data purely at random
- write your own function that splits on a hash of each instance's identifier
- call Scikit-Learn's train_test_split()
- stratified sampling (a sketch follows the code below)

The most widely applicable is the third: splitting with Scikit-Learn's train_test_split(). The others are worth knowing about, but each targets a specific situation.
# Split the data with Scikit-Learn
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
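For the stratified option, the book stratifies on an income category so the test set mirrors the income distribution. Here is a minimal sketch under those assumptions; the income_cat helper column, the 1.5 divisor, and the cap at category 5 follow the book's choices, and the rest of this article keeps using the random split above.

# Sketch: stratified split on an income category (not used below)
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)  # bucket incomes
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)  # cap at 5
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)  # remove the helper column
housing.drop("income_cat", axis=1, inplace=True)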
Visual Exploration
Mainly scatter plots based on the geographic information.
housing = train_set.copy()
# housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
# # alpha sets the transparency of the points
# plt.show()
# # Scatter plot based on housing prices
# housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
#              s=housing["population"]/100, label="population",
#              # circle radius scales with the district's population; add a legend entry
#              c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
#              # circles are colored by median house value, using the predefined "jet" colormap
#              )
# plt.legend()
# plt.show()
Data Preparation for ML
Here we go straight to Scikit-Learn's Imputer, Pipeline and FeatureUnion.
Imputer fills in missing values.
Pipeline chains data processing steps together.
FeatureUnion joins the outputs of several pipelines.
In this project the processing falls into two groups (a standalone Imputer sketch follows the list):
- numerical features
  - missing-value imputation
  - attribute combination
  - standardization
- categorical features
  - one-hot encoding
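As a quick illustration of what Imputer does on its own, here is a minimal sketch; housing_num and housing_num_filled are just throwaway names, and in sklearn 0.20+ the equivalent class is SimpleImputer from sklearn.impute:

# Sketch: Imputer used standalone on the numerical columns
from sklearn.preprocessing import Imputer
housing_num = housing.drop("ocean_proximity", axis=1)  # numerical columns only
imputer = Imputer(strategy="median")
imputer.fit(housing_num)  # learn the median of every column
# print(imputer.statistics_)  # the learned medians
housing_num_filled = imputer.transform(housing_num)  # NumPy array, missing values filled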
# Prepare the data for Machine Learning
housing = train_set.drop("median_house_value", axis=1)  # features
housing_labels = train_set["median_house_value"].copy()  # labels

# Data cleaning
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Combine existing attributes into new, more informative ones."""
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
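# A quick usage check of the transformer above (illustrative sketch only;
# housing_extra_attribs is a throwaway name):
# attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# housing_extra_attribs = attr_adder.transform(housing.values)
# print(housing_extra_attribs.shape)  # two columns appended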
class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given columns and return a NumPy array:
    Scikit-Learn transformers cannot consume a Pandas DataFrame directly."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
from sklearn.preprocessing import LabelBinarizer

class MyLabelBinarizer(TransformerMixin):
    """Wrapper: LabelBinarizer changed between Scikit-Learn 0.18 and 0.19,
    so the book's version no longer works inside a pipeline."""
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

num_attribs = list(housing.drop("ocean_proximity", axis=1))  # numerical column names
cat_attribs = ["ocean_proximity"]  # the categorical column

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', MyLabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)
Applying ML Algorithms
Three algorithms are applied:
- linear regression
- decision tree
- random forest

Linear Regression
# Training
from sklearn.metrics import mean_squared_error

# Linear regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Sanity check on a few instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
# print("Predictions:", lin_reg.predict(some_data_prepared))
# print("Labels:\t\t", list(some_labels))

# Compute the RMSE
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
Decision Tree
# Decision tree regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
# RMSE of 0 on the training set: the tree has badly overfit

# Cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # RMSE of each of the 10 folds

# Display the results
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
# display_scores(tree_rmse_scores)
Random Forest
# Random forest regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)  # about 21925 on the training set
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Hyperparameter Tuning
- grid search
- randomized search

This example only uses grid search. In a real project you would first narrow things down with randomized search and then grid-search the smaller range in detail; a randomized-search sketch follows the list below.
Grid search fits when the candidate parameter combinations are already confined to a small range. When the search space is large, use randomized search instead: RandomizedSearchCV. The class is similar to GridSearchCV, but rather than trying every possible combination it evaluates a random subset. That has two advantages:
- running it for, say, 1000 iterations tests 1000 different parameter combinations
- the computational budget is controlled simply by the number of iterations
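Here is a minimal sketch of randomized search, for comparison; the parameter distributions are illustrative choices, not values from the text above:

# Sketch: randomized search over the forest's hyperparameters
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),  # sampled uniformly per iteration
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
# print(rnd_search.best_params_)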
# Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_  # the best parameter combination found
grid_search.best_estimator_  # the estimator refit with those parameters
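The fitted search also records how every combination fared in cv_results_; a short sketch of printing each combination's RMSE:

# Sketch: RMSE of every parameter combination tried by the grid search
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)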
Validating on the Test Set
The last step checks that the approach actually works, by evaluating the model on the 20% of the data held out earlier. This is where building the preprocessing pipeline pays off: the raw test set can be fed straight through it and comes out in exactly the format the model expects.
# Evaluate the model on the test set
final_model = grid_search.best_estimator_
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
# print(final_rmse)  # 49259.0757201 [Finished in 47.4s]
Summary
That completes a full machine learning project. A few shortcomings of the approach are worth noting. Although linear regression (LR), decision trees (DT) and random forests (RF) were all tried, no attempt was made to combine them into an ensemble; a small sketch of the idea follows. And for tuning, only grid search was used, whereas in practice you would first locate a rough range with randomized search and then grid-search within that smaller range.
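As a pointer, here is a minimal sketch of the simplest form of ensembling, uniformly averaging the test-set predictions of the models fitted above (the uniform weights are an illustrative choice; proper ensembling would tune the weights or use stacking):

# Sketch: average the predictions of the already-fitted models
ensemble_predictions = (lin_reg.predict(X_test_prepared)
                        + tree_reg.predict(X_test_prepared)
                        + final_model.predict(X_test_prepared)) / 3.0
ensemble_rmse = np.sqrt(mean_squared_error(y_test, ensemble_predictions))
print(ensemble_rmse)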