{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Evaluation and Validation: Predicting Boston Housing Prices"
]
},
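{
"cell_type": "markdown",
"metadata": {},
"source": [
"Welcome to the Boston housing price prediction project of the hands-on machine learning series! Some sample code is already provided in this notebook, but you will need to implement additional functionality for the project to run successfully. Unless explicitly asked to, you do not need to modify any of the provided code. Headings that begin with **Programming Exercise** indicate that the following section contains functionality you must implement. The parts you need to implement are also marked with **TODO** in the comments. Please read all the hints carefully! You can click **Hint** to see detailed guidance for each part, or click **Insert Answer** to insert the correct answer into the code cell below.\n",
"\n",
"Besides implementing code, you will also need to answer some questions about the project and your implementation. Every question you need to answer is headed **Question**. Read each question carefully and write your response. **Hint** and **View Answer** buttons are provided for these as well.\n",
"\n",
"**The code in this notebook is order-dependent: run the cells from top to bottom without skipping any, or you may hit unexpected errors!**"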
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Step 1: Import the Data"
]
},
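{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In this project, you will train and test a model on housing data from the suburbs of Boston, Massachusetts, and evaluate its performance and predictive power. A model trained on this data can then be used to make predictions about a home, in particular its monetary value. Such a model could prove invaluable in the day-to-day work of someone like a real estate agent.\n",
"\n",
"In the file list, `visuals.py` contains supporting code, `housing.csv` is the dataset, and the `result` folder stores output files (if you create a job to run the code on CPU/GPU, the job's output files are also stored there)."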
]
},
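{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Boston housing data was collected in 1978. It contains 506 data points covering 14 features of homes from various suburbs of Boston, Massachusetts. For this project, the original dataset has been preprocessed as follows:\n",
"- 16 data points with an `'MEDV'` value of 50.0 have been removed, as they most likely contain **missing or censored values**.\n",
"- 1 data point with an `'RM'` value of 8.78 has been removed as an outlier.\n",
"- The features `'RM'`, `'LSTAT'`, `'PTRATIO'`, and `'MEDV'` are essential for this project; the remaining, non-relevant features have been removed.\n",
"- The `'MEDV'` feature has been scaled to account for 35 years of market inflation.\n",
"\n",
"The features have the following meanings:  \n",
"`RM`: average number of rooms per dwelling  \n",
"`LSTAT`: proportion of the area's population considered lower income  \n",
"`PTRATIO`: pupil-teacher ratio by town  \n",
"`MEDV`: median value of homes  \n",
"\n",
"Run the code cell below to load the Boston housing dataset along with a few Python libraries required for this project. If the size of the dataset is reported, the dataset has loaded successfully. The output also shows the structure of the dataset."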
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Import libraries necessary for this project\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import ShuffleSplit\n",
"\n",
"# Import supplementary visualizations code visuals.py\n",
"import visuals as vs\n",
"\n",
"# Pretty display for notebooks\n",
"%matplotlib inline\n",
"\n",
"# Load the Boston housing dataset\n",
"data = pd.read_csv('housing.csv')\n",
"prices = data['MEDV']\n",
"features = data.drop('MEDV', axis = 1)\n",
"\n",
"# Success\n",
"print(\"Boston housing dataset has {} data points with {} variables each.\".format(*data.shape))\n",
"\n",
"# Show the structure of the data\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Analyze the Data"
]
},
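{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In the first part of this project, you will make a cursory investigation of the Boston housing data and provide your observations. Familiarizing yourself with the data through exploration helps you better understand and justify your results. Since the main goal of this project is to build a model that can predict the value of houses, we need to separate the dataset into **features** and the **target variable**.\n",
"- The **features** `'RM'`, `'LSTAT'`, and `'PTRATIO'` give us quantitative information about each data point.\n",
"- The **target variable**, `'MEDV'`, is the variable we seek to predict.\n",
"\n",
"These are stored in the variables `features` and `prices`, respectively."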
]
},
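{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Your first programming exercise is to calculate descriptive statistics about the Boston housing prices. `NumPy` has already been imported for you; use this library to perform the necessary calculations. These statistics are very important later on for analyzing the model's predictions.\n",
"In the code below, you will need to:\n",
"- Calculate the minimum, maximum, mean, median, and standard deviation of `'MEDV'`, which is stored in `prices`;\n",
"- Store each result in the corresponding variable."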
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: Minimum price of the data\n",
"minimum_price = None\n",
"\n",
"# TODO: Maximum price of the data\n",
"maximum_price = None\n",
"\n",
"# TODO: Mean price of the data\n",
"mean_price = None\n",
"\n",
"# TODO: Median price of the data\n",
"median_price = None\n",
"\n",
"# TODO: Standard deviation of prices of the data\n",
"std_price = None\n",
"\n",
"# Show the calculated statistics\n",
"print(\"Statistics for Boston housing dataset:\\n\")\n",
"print(\"Minimum price: ${:.2f}\".format(minimum_price))\n",
"print(\"Maximum price: ${:.2f}\".format(maximum_price))\n",
"print(\"Mean price: ${:.2f}\".format(mean_price))\n",
"print(\"Median price: ${:.2f}\".format(median_price))\n",
"print(\"Standard deviation of prices: ${:.2f}\".format(std_price))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  Insert Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimum price of the data\n",
"minimum_price = np.min(prices)\n",
"\n",
"# Maximum price of the data\n",
"maximum_price = np.max(prices)\n",
"\n",
"# Mean price of the data\n",
"mean_price = np.mean(prices)\n",
"\n",
"# Median price of the data\n",
"median_price = np.median(prices)\n",
"\n",
"# Standard deviation of prices of the data (np.std defaults to ddof=0)\n",
"std_price = np.std(prices)\n",
"\n",
"# Show the calculated statistics\n",
"print(\"Statistics for Boston housing dataset:\\n\")\n",
"print(\"Minimum price: ${:.2f}\".format(minimum_price))\n",
"print(\"Maximum price: ${:.2f}\".format(maximum_price))\n",
"print(\"Mean price: ${:.2f}\".format(mean_price))\n",
"print(\"Median price: ${:.2f}\".format(median_price))\n",
"print(\"Standard deviation of prices: ${:.2f}\".format(std_price))\n"
]
},
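{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the standard deviation: NumPy and pandas use different defaults. `np.std` computes the population standard deviation (`ddof=0`), while `Series.std` computes the sample standard deviation (`ddof=1`). The cell below is a minimal sketch of that difference; `toy_prices` is a hypothetical series made up for illustration and is not taken from `housing.csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy values, not taken from housing.csv\n",
"toy_prices = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])\n",
"\n",
"# NumPy default: ddof=0, i.e. divide the squared deviations by n\n",
"population_std = np.std(toy_prices.to_numpy())\n",
"\n",
"# pandas default: ddof=1, i.e. divide the squared deviations by n - 1\n",
"sample_std = toy_prices.std()\n",
"\n",
"print('np.std (ddof=0):', population_std)\n",
"print('Series.std (ddof=1):', sample_std)\n",
"print('Series.std(ddof=0):', toy_prices.std(ddof=0))\n"
]
},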
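{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"As mentioned above, this project focuses on three features: `'RM'`, `'LSTAT'`, and `'PTRATIO'`. For each data point:\n",
"- `'RM'` is the average number of rooms among homes in the neighborhood;\n",
"- `'LSTAT'` is the percentage of homeowners in the neighborhood considered lower income (working but poorly paid);\n",
"- `'PTRATIO'` is the ratio of students to teachers in the neighborhood's primary and secondary schools (`students/teachers`).\n",
"\n",
"_Using your intuition, for each of the three features above, do you think that an increase in the value of that feature would lead to an **increase** or a **decrease** in the value of `'MEDV'`? Justify your answer for each._"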
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Build the Model"
]
},
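{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In the third step of this project, you will learn the tools and techniques needed for a model to make predictions. Being able to accurately measure each model's performance with these tools and techniques greatly reinforces confidence in your predictions."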
]
},
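{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"It is difficult to measure the quality of a model without quantifying its performance on training and testing. This is typically done with a performance metric, computed from some type of error or goodness of fit. In this project, you will calculate the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), $R^2$, to quantify the model's performance. The coefficient of determination is a widely used statistic in regression analysis and is often taken as a measure of how well a model predicts.\n",
"\n",
"$R^2$ usually lies between 0 and 1 and captures the squared correlation between the predicted and actual values of the **target variable**. A model with an $R^2$ of 0 is no better than one that always predicts the **mean** of the target variable, whereas a model with an $R^2$ of 1 predicts the target variable perfectly. A value between 0 and 1 indicates what fraction of the variation in the target variable can be explained by the **features**. A model can also have a negative $R^2$, which means its predictions are worse than always predicting the mean of the target variable.\n",
"\n",
"\n",
"In the `performance_metric` function in the code below, you will need to:\n",
"- Use [`r2_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) from `sklearn.metrics` to compute the $R^2$ between `y_true` and `y_predict` as the performance score.\n",
"- Store the score in the `score` variable."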
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: Import 'r2_score'\n",
"\n",
"def performance_metric(y_true, y_predict):\n",
"    \"\"\" Calculates and returns the performance score between\n",
"        true and predicted values based on the metric chosen. \"\"\"\n",
"\n",
"    # TODO: Calculate the performance score between 'y_true' and 'y_predict'\n",
"    score = None\n",
"\n",
"    # Return the score\n",
"    return score\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  Insert Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import 'r2_score'\n",
"from sklearn.metrics import r2_score\n",
"\n",
"def performance_metric(y_true, y_predict):\n",
"    \"\"\"Calculate and return the R^2 score of the predicted values against the true values.\"\"\"\n",
"    score = r2_score(y_true, y_predict)\n",
"    return score\n"
]
},
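{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the meaning of $R^2$ concrete, the cell below is a minimal sketch that computes it by hand as $R^2 = 1 - SS_{res} / SS_{tot}$. The helper name `r2_manual` and the toy targets are hypothetical and chosen only for illustration; in the project itself, `r2_score` from `sklearn.metrics` should be used. The three cases cover a perfect prediction, a constant prediction of the mean, and a prediction that is worse than the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def r2_manual(y_true, y_predict):\n",
"    # Hypothetical helper: R^2 = 1 - SS_res / SS_tot\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_predict = np.asarray(y_predict, dtype=float)\n",
"    ss_res = np.sum((y_true - y_predict) ** 2)      # residual sum of squares\n",
"    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares\n",
"    return 1.0 - ss_res / ss_tot\n",
"\n",
"y_toy = [1.0, 2.0, 3.0, 4.0]                          # toy targets, mean is 2.5\n",
"perfect = r2_manual(y_toy, y_toy)                     # predictions equal the targets\n",
"mean_only = r2_manual(y_toy, [2.5, 2.5, 2.5, 2.5])    # always predict the mean\n",
"worse = r2_manual(y_toy, [4.0, 3.0, 2.0, 1.0])        # reversed predictions\n",
"\n",
"print('perfect:', perfect)\n",
"print('mean only:', mean_only)\n",
"print('worse than the mean:', worse)\n"
]
},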
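{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Assume that a dataset contains five data points and a model makes the following predictions for the target variable:\n",
"\n",
"| True Value | Prediction |\n",
"| :-------------: | :--------: |\n",
"| 3.0 | 2.5 |\n",
"| -0.5 | 0.0 |\n",
"| 2.0 | 2.1 |\n",
"| 7.0 | 7.8 |\n",
"| 4.2 | 5.3 |\n",
"\n",
"*Would you consider this model to have successfully captured the variation of the target variable? Explain why or why not.*  \n"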
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Calculate the performance of this model\n",
"score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])\n",
"print(\"Model has a coefficient of determination, R^2, of {:.3f}.\".format(score))\n"
]
},
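{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Next, you will split the Boston housing dataset into training and testing subsets. The data is typically also shuffled during this process to remove any bias in the ordering of the dataset.\n",
"In the code below, you will need to:\n",
"\n",
"* Use `train_test_split` from `sklearn.model_selection` to split both `features` and `prices` into training and testing subsets.\n",
"  - Use 80% of the data for training and 20% for testing;\n",
"  - Set `random_state` in `train_test_split` to a value of your choice, which ensures consistent results;\n",
"* Assign the resulting training and testing subsets to `X_train`, `X_test`, `y_train`, and `y_test`."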
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: Import 'train_test_split'\n",
"\n",
"# TODO: Shuffle and split the data into training and testing subsets\n",
"X_train, X_test, y_train, y_test = (None, None, None, None)\n",
"\n",
"# Success\n",
"print(\"Training and testing split was successful.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  Insert Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Import 'train_test_split'\n",
"\n",
"# TODO: Shuffle and split the data into training and testing subsets\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)\n",
"print(X_train)\n",
"print(X_test)\n",
"print(y_train)\n",
"print(y_test)\n",
"\n",
"# Success\n",
"print(\"Training and testing split was successful.\")\n"
]
},
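{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"*What is the benefit to a learning algorithm of splitting a dataset into training and testing subsets at some ratio?*\n",
"\n",
"*What is the downside of testing on data the model has already seen, such as part of the training set?*"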
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Analyze Model Performance"
]
},
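{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the fourth step of this project, you will look at how a model performs on the training and validation sets under different parameter settings. Here we focus on one particular algorithm (a decision tree with pruning, although the algorithm itself is not the focus of this project) and one of its parameters, `'max_depth'`. The model is trained on the full training set with different values of `'max_depth'` to observe how this parameter affects performance. Plotting the model's performance is very helpful in this analysis."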
]
},
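{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Learning Curves**\n",
"\n",
"The code cell below produces four graphs showing the performance of a decision tree model at different maximum depths. Each graph shows how the model's training and validation scores change as the amount of training data increases; the scores are the coefficient of determination, $R^2$. The shaded region around each curve denotes its uncertainty (measured as the standard deviation).\n",
"\n",
"Run the code cell below and use the resulting graphs to answer the following questions."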
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
"source": [
"# Produce learning curves for varying training set sizes and maximum depths\n",
"vs.ModelLearning(features, prices)\n"
]
},
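{
"cell_type": "markdown",
"metadata": {},
"source": [
"`visuals.py` is not shown in this notebook, so how `vs.ModelLearning` computes its curves is not known here. As a rough sketch under that caveat, scores of this kind can be computed with `learning_curve` from `sklearn.model_selection`. The synthetic `X_demo` and `y_demo` data, the fixed `max_depth=3`, and the five training sizes below are assumptions made for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import ShuffleSplit, learning_curve\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic stand-in data (hypothetical), not the housing data\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(200, 1))\n",
"y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)\n",
"\n",
"# Ten shuffled splits, each holding out 20% of the data for validation\n",
"cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)\n",
"\n",
"# Score the model on growing subsets of each training split\n",
"sizes, train_scores, valid_scores = learning_curve(\n",
"    DecisionTreeRegressor(max_depth=3, random_state=0),\n",
"    X_demo, y_demo,\n",
"    train_sizes=np.linspace(0.1, 1.0, 5),\n",
"    cv=cv, scoring='r2')\n",
"\n",
"# Mean and standard deviation across the ten splits for each training size\n",
"for n, tr, va in zip(sizes, train_scores, valid_scores):\n",
"    print(n, 'train:', round(tr.mean(), 3), '+/-', round(tr.std(), 3),\n",
"          'validation:', round(va.mean(), 3), '+/-', round(va.std(), 3))\n"
]
},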
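{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"* Choose one of the graphs above and state its maximum depth.\n",
"* What happens to the score of the training curve as more training data is added? What about the validation curve?\n",
"* Would having more training data effectively improve the model's performance?"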
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
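{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Complexity Curves**\n",
"\n",
"The code cell below produces a graph showing the performance of a decision tree model that has been trained and validated at different maximum depths. The graph contains two curves, one for the training set and one for the validation set. As with the **learning curves**, the shaded regions denote the uncertainty of each curve, and both the training and validation scores are computed with the `performance_metric` function.\n",
"\n",
"**Run the code cell below and use the resulting graph to answer Questions 5 and 6 below.**"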
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"vs.ModelComplexity(X_train, y_train)\n"
]
},
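{
"cell_type": "markdown",
"metadata": {},
"source": [
"The internals of `vs.ModelComplexity` are likewise not shown here. As a hedged sketch, a complexity curve of this kind can be computed with `validation_curve` from `sklearn.model_selection`, which scores a model over a range of values for one parameter. The synthetic `X_demo` and `y_demo` data below are hypothetical stand-ins for the housing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import ShuffleSplit, validation_curve\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic stand-in data (hypothetical), not the housing data\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(200, 1))\n",
"y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)\n",
"\n",
"cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)\n",
"depths = np.arange(1, 11)\n",
"\n",
"# One row of scores per max_depth value, one column per split\n",
"train_scores, valid_scores = validation_curve(\n",
"    DecisionTreeRegressor(random_state=0), X_demo, y_demo,\n",
"    param_name='max_depth', param_range=depths,\n",
"    cv=cv, scoring='r2')\n",
"\n",
"mean_train = train_scores.mean(axis=1)\n",
"mean_valid = valid_scores.mean(axis=1)\n",
"for d, tr, va in zip(depths, mean_train, mean_valid):\n",
"    print('max_depth', d, 'train:', round(tr, 3), 'validation:', round(va, 3))\n"
]
},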
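{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Answer the following questions about the bias-variance tradeoff:\n",
"* When the model is trained with a maximum depth of 1, does it suffer from high bias or from high variance?\n",
"* What about when the model is trained with a maximum depth of 10?\n",
"* Which visual cues in the graph support your conclusions?"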
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"* Based on the graph for Question 5, which maximum depth do you think yields a model that best generalizes to unseen data?\n",
"* What led you to this answer?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Evaluate Model Performance"
]
},
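{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In this final section of the project, you will construct a model and use the optimized model from `fit_model` to make predictions on clients' feature sets."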
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Answer the following questions about grid search:\n",
"* What is the grid search technique?\n",
"* How can it be applied to optimize a model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset\n",
"- Training set\n",
"  - Training folds\n",
"  - Validation fold (the held-out test fold)\n",
"- Test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
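{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the idea behind grid search, the cell below tries every candidate value of `max_depth`, scores each with cross-validation, and keeps the best one. This is a hand-written sketch, not how `GridSearchCV` is implemented internally; the synthetic `X_demo` and `y_demo` data and the name `mean_scores` are assumptions for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import ShuffleSplit, cross_val_score\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic stand-in data (hypothetical), not the housing data\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(200, 1))\n",
"y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)\n",
"\n",
"cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)\n",
"\n",
"# Evaluate every candidate on the same splits and record its mean validation R^2\n",
"mean_scores = {}\n",
"for depth in range(1, 11):\n",
"    model = DecisionTreeRegressor(max_depth=depth, random_state=0)\n",
"    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='r2')\n",
"    mean_scores[depth] = scores.mean()\n",
"\n",
"# The candidate with the highest mean validation score wins\n",
"best_depth = max(mean_scores, key=mean_scores.get)\n",
"print('Best max_depth on the toy data:', best_depth)\n"
]
},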
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Answer the following questions about cross-validation:\n",
"- What is the k-fold cross-validation technique?\n",
"- How does [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) combine cross-validation with the search to select the best parameter combination?\n",
"- What does the `'cv_results_'` attribute of [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) tell us?\n",
"- Why should grid search use k-fold cross-validation? What problem does k-fold cross-validation avoid?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
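{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `cv_results_` contains, the cell below runs a small `GridSearchCV` with 5-fold cross-validation and loads the attribute into a DataFrame. It is a minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins), and the selected columns are only a subset of what `cv_results_` provides."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import GridSearchCV, KFold\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic stand-in data (hypothetical), not the housing data\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(200, 1))\n",
"y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)\n",
"\n",
"# Grid search over five depths, each scored with 5-fold cross-validation\n",
"grid = GridSearchCV(\n",
"    DecisionTreeRegressor(random_state=0),\n",
"    {'max_depth': list(range(1, 6))},\n",
"    scoring='r2',\n",
"    cv=KFold(n_splits=5, shuffle=True, random_state=0))\n",
"grid.fit(X_demo, y_demo)\n",
"\n",
"# One row per parameter setting, with per-setting score statistics\n",
"results = pd.DataFrame(grid.cv_results_)\n",
"columns = ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']\n",
"print(results[columns])\n",
"print('Best parameters:', grid.best_params_)\n"
]
},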
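{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In this exercise, you will bring everything together and train a model using the **decision tree algorithm**. To ensure that you are producing an optimized model, you will train it using grid search to find the best `'max_depth'` parameter. You can think of `'max_depth'` as the number of questions the decision tree algorithm is allowed to ask about the data before making a prediction. Decision trees are one kind of **supervised learning algorithm**.\n",
"\n",
"In addition, you will find that the implementation uses `ShuffleSplit()` as an alternative form of cross-validation (see the `'cv_sets'` variable). Although it is not the k-fold cross-validation technique you described in Question 8, it is just as useful! The `ShuffleSplit()` implementation below creates 10 (`'n_splits'`) shuffled sets, and for each shuffle, 20% (`'test_size'`) of the data is used as the validation set. While working on your implementation, think about how it differs from and resembles `K-fold cross-validation`.\n",
"\n",
"Note that `ShuffleSplit` takes different parameters in `Scikit-Learn` versions 0.17 and 0.18. For the `fit_model` function in the code cell below, you will need to implement the following:\n",
"\n",
"1. **Define the `'regressor'` variable**: use [`DecisionTreeRegressor`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) from `sklearn.tree` to create a decision tree regressor object;\n",
"2. **Define the `'params'` variable**: create a dictionary for the `'max_depth'` parameter whose value is an array of the integers 1 through 10;\n",
"3. **Define the `'scoring_fnc'` variable**: use [`make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) from `sklearn.metrics` to create a scoring function, passing `'performance_metric'` to it as an argument;\n",
"4. **Define the `'grid'` variable**: use [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from `sklearn.model_selection` to create a grid search object, passing `'regressor'`, `'params'`, `'scoring_fnc'`, and `'cv_sets'` to its constructor as arguments.\n",
"\n",
"  \n",
"If you are unfamiliar with how default arguments are defined and passed in Python functions, see this [video](http://cn-static.udacity.com/mlnd/videos/MIT600XXT114-V004200_DTH.mp4) from an MIT course."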
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO: Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'\n",
"\n",
"def fit_model(X, y):\n",
"    \"\"\" Performs grid search over the 'max_depth' parameter for a\n",
"        decision tree regressor trained on the input data [X, y]. \"\"\"\n",
"\n",
"    # Create cross-validation sets from the training data\n",
"    # sklearn version 0.18: ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)\n",
"    # sklearn version 0.17: ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None)\n",
"    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)\n",
"\n",
"    # TODO: Create a decision tree regressor object\n",
"    regressor = None\n",
"\n",
"    # TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10\n",
"    params = {}\n",
"\n",
"    # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer'\n",
"    scoring_fnc = None\n",
"\n",
"    # TODO: Create the grid search cv object --> GridSearchCV()\n",
"    # Make sure to include the right parameters in the object:\n",
"    # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.\n",
"    grid = None\n",
"\n",
"    # Fit the grid search object to the data to compute the optimal model\n",
"    grid = grid.fit(X, y)\n",
"\n",
"    # Return the optimal model after fitting the data\n",
"    return grid.best_estimator_\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  Insert Answer"
]
},
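{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import 'make_scorer', 'DecisionTreeRegressor', 'GridSearchCV', and 'ShuffleSplit'\n",
"from sklearn.metrics import make_scorer\n",
"from sklearn.model_selection import GridSearchCV, ShuffleSplit\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"def fit_model(X, y):\n",
"    \"\"\"Use grid search to find the optimal decision tree model for the input data [X, y].\"\"\"\n",
"\n",
"    # Create 10 shuffled cross-validation sets, each holding out 20% for validation\n",
"    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)\n",
"\n",
"    # Create a decision tree regressor object\n",
"    regressor = DecisionTreeRegressor()\n",
"\n",
"    # Candidate values for 'max_depth': 1 through 10\n",
"    params = {'max_depth': list(range(1, 11))}\n",
"\n",
"    # Turn 'performance_metric' into a scoring function\n",
"    scoring_fnc = make_scorer(performance_metric)\n",
"\n",
"    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)\n",
"\n",
"    # Run the grid search on the input data [X, y]\n",
"    grid = grid.fit(X, y)\n",
"    # Uncomment to inspect the cross-validation results\n",
"    # print(pd.DataFrame(grid.cv_results_))\n",
"    # Return the optimal model found by the grid search\n",
"    return grid.best_estimator_\n",
"\n",
"fit_model(X_train, y_train)\n"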
]
},
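{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to compare `ShuffleSplit` with k-fold cross-validation is to count how often each sample lands in a validation set. The sketch below does this on ten hypothetical samples (`X_demo` is made up for illustration). With `KFold`, the validation folds partition the data, so every sample is held out exactly once; with `ShuffleSplit`, each split is drawn independently, so a sample may be held out several times or not at all."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold, ShuffleSplit\n",
"\n",
"# Ten hypothetical samples, identified by their row index\n",
"X_demo = np.arange(10).reshape(-1, 1)\n",
"\n",
"# Collect the validation indices produced by each splitter\n",
"kfold_test = [test for _, test in KFold(n_splits=5).split(X_demo)]\n",
"shuffle = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)\n",
"shuffle_test = [test for _, test in shuffle.split(X_demo)]\n",
"\n",
"# Count how many times each sample was used for validation\n",
"kfold_counts = np.bincount(np.concatenate(kfold_test), minlength=10)\n",
"shuffle_counts = np.bincount(np.concatenate(shuffle_test), minlength=10)\n",
"\n",
"print('KFold validation counts per sample:       ', kfold_counts)\n",
"print('ShuffleSplit validation counts per sample:', shuffle_counts)\n"
]
},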
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Make Predictions"
]
},
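{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Once a model has been trained on a given set of data, it can be used to make predictions on new data. In the case of a decision tree regressor, the model has learned *what questions to ask* about the input data and responds with a prediction for the **target variable**. You can use these predictions to gain information about data whose target variable is unknown, provided the data was not part of the training set."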
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"*What maximum depth does the optimal model have? How does this result compare to your guess in **Question 6**?*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Fit the training data to the model using grid search\n",
"reg = fit_model(X_train, y_train)\n",
"\n",
"# Produce the value for 'max_depth'\n",
"print(\"Parameter 'max_depth' is {} for the optimal model.\".format(reg.get_params()['max_depth']))\n"
]
},
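{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Imagine that you are a real estate agent in the Boston area looking to use this model to help price homes that your clients wish to sell. You have collected the following information from three of your clients:\n",
"\n",
"| Feature | Client 1 | Client 2 | Client 3 |\n",
"| :---: | :---: | :---: | :---: |\n",
"| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |\n",
"| Neighborhood poverty level (% considered poor) | 17% | 32% | 3% |\n",
"| Student-teacher ratio of nearby schools | 15:1 | 22:1 | 12:1 |\n",
"\n",
"* What price would you recommend each client sell their home at?\n",
"* Do these prices seem reasonable given the values of the respective features? Why?"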
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Produce a matrix for client data\n",
"client_data = [[5, 17, 15], # Client 1\n",
"               [4, 32, 22], # Client 2\n",
"               [8, 3, 12]]  # Client 3\n",
"\n",
"# Show predictions\n",
"for i, price in enumerate(reg.predict(client_data)):\n",
"    print(\"Predicted selling price for Client {}'s home: ${:,.2f}\".format(i+1, price))\n"
]
},
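{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You have just predicted the selling prices of three clients' homes. In this exercise, you will use your optimal model to make predictions on the entire test set and calculate the coefficient of determination, $R^2$, relative to the target variable.\n"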
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO Calculate the r2 score between 'y_true' and 'y_predict'\n",
"\n",
"r2 = None\n",
"\n",
"print(\"Optimal model has R^2 score {:,.2f} on test data\".format(r2))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  Insert Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#TODO 5\n",
"\n",
"predict = reg.predict(X_test)\n",
"\n",
"r2 = performance_metric(y_test,predict)\n",
"\n",
"print(\"Optimal model has R^2 score {:,.2f} on test data\".format(r2))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You have just calculated the coefficient of determination of the optimal model on the test set. How would you assess this result?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
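{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Model Robustness**\n",
"\n",
"An optimal model is not necessarily a robust model. Sometimes a model is too complex or too simple to generalize well to new data; sometimes the learning algorithm is not appropriate for the structure of the given data; and sometimes the data itself is too noisy or contains too few samples for the model to predict the target variable accurately. In such cases the model is said to be overfitted (when it is too complex) or underfitted.\n",
"\n",
"\n",
"Is the model robust enough to make consistent predictions?\n"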
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"vs.PredictTrials(features, prices, fit_model, client_data)\n"
]
},
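{
"cell_type": "markdown",
"metadata": {},
"source": [
"The implementation of `vs.PredictTrials` is not shown in this notebook. One plausible way to probe robustness, sketched below as an assumption rather than a description of that function, is to refit the model on several different random train/test splits and look at how much the prediction for one fixed input varies. The synthetic data, the fixed `max_depth=4`, and the `query` point are all hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic stand-in data (hypothetical), not the housing data\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(200, 1))\n",
"y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)\n",
"query = [[5.0]]  # one hypothetical new data point\n",
"\n",
"# Refit on ten different random splits and predict the same point each time\n",
"predictions = []\n",
"for seed in range(10):\n",
"    X_tr, _, y_tr, _ = train_test_split(\n",
"        X_demo, y_demo, test_size=0.2, random_state=seed)\n",
"    model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)\n",
"    predictions.append(model.predict(query)[0])\n",
"\n",
"# A small spread suggests the prediction is stable across training samples\n",
"print('Predictions:', np.round(predictions, 3))\n",
"print('Range of predictions:', max(predictions) - min(predictions))\n"
]
},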
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"*Briefly discuss whether the model you built could be used in a real-world setting. Is it practical?* "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hint  View Answer"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}