Steps
1. Frame the task
2. Get the data
3. Explore and preprocess the data
4. Select a model and train it
5. Fine-tune the parameters
6. Present the final solution
7. Launch the system
We will use California census data to predict housing prices.
The data's smallest unit is the district; for each district we have its population, median income, median house value, and so on.
What are we going to build?
Keep one concept in mind: the pipeline, analogous to a pipe in Linux.
Pipeline components run independently; if an upstream component stops producing output, the downstream components simply keep using the last data it produced, which makes the whole system more robust.
1. Collect the data
2. Predict district prices from the data
3. Feed the price predictions into a downstream system that evaluates districts and produces investment recommendations
The performance measure:
RMSE and MAE both measure the distance between two vectors: the vector of the model's predictions and the vector of the true target values.
RMSE is simply the L2 norm; its geometric meaning is the Euclidean distance.
MAE is the L1 norm, corresponding to the Manhattan distance: if you can only move horizontally or vertically, the distance between two points is the sum of the distances along each axis.
The higher the order of a norm, the more it focuses on large values and the more it neglects small ones. That is why RMSE is more sensitive to outliers than MAE. When outliers are rare, however, RMSE works very well and is the most commonly used metric.
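Written out (these are the standard definitions, for $m$ samples with predictions $h(\mathbf{x}^{(i)})$ and true labels $y^{(i)}$):
$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}} \qquad \mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$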
This dataset is static: there is no continuous stream of new data, no need to react instantly to rapidly changing data, and the whole dataset is small enough to fit in memory,
so a conventional batch learning algorithm is perfectly adequate.
If the data were very large, we could switch to a distributed learning framework such as MapReduce.
In [1]:
import os
import pandas as pd
HOUSE_PATH = './dataset'
def load_housing_data(housing_path=HOUSE_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)
In [2]:
housing = load_housing_data()
housing.head()
Out[2]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
The columns are: longitude, latitude, district median housing age, total rooms in the district, total bedrooms, population, households, median income, median house value, and ocean proximity.
In [3]:
housing.info()
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.5+ MB
total_bedrooms has only 20433 non-null entries, i.e. it contains missing values; fields with missing values deserve special attention.
In [4]:
housing['ocean_proximity'].value_counts()
Out[4]:
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
The describe method produces summary statistics for the numeric columns.
The count, mean, min, and max rows are self-explanatory. All of these statistics exclude null values, which is why the count for total_bedrooms is 20433 rather than 20640.
The std row shows the standard deviation, which measures how dispersed the values are.
The 25%, 50%, and 75% rows show the corresponding percentiles. For example, 25% of the districts have a housing median age below 18 years, 50% below 29 years, and 75% below 37 years.
These points are often called the first quartile, the median, and the third quartile.
In [5]:
housing.describe()
Out[5]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
In [7]:
housing.hist(bins=50,figsize=(20,15))
plt.show()
1. The median income does not appear to be in dollars. It ranges from 0 to 15; what it actually means is unclear, so we need to check with the team that collected the data. We might be told that the field was scaled and capped: values above 15 were set to 15, and values below 0.5 were set to 0.5.
Working with preprocessed attributes is common in machine learning and usually not a problem, but to be safe we should always understand how the data we use was processed.
2. The housing median age and the house value also look capped. The cap on house value may be a serious problem, because it is our prediction target: the system might never predict a price above $500,000. We must check with the downstream teams whether this is a problem for them. If they need accurate predictions even above $500,000, we essentially have two options:
a) collect the true prices for the capped districts, or
b) remove the capped districts from the dataset.
3. The attributes have very different scales. We need to remember to apply feature scaling later.
4. Finally, many histograms are tail-heavy: they extend much farther to the right than to the left. Such distributions can be problematic for some machine learning algorithms; later we will try to transform them into more bell-shaped distributions.
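For the tail-heavy attributes, a log transform is one common way to get closer to a bell shape. A minimal sketch (the column choice here is just for illustration):
import numpy as np
housing['population'].apply(np.log).hist(bins=50)
plt.show()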
In [8]:
import numpy as np
Both shuffle and permutation reshuffle an array (i.e., randomly reorder its elements).
The difference:
shuffle operates on the array in place, changing its order, and returns nothing.
permutation does not modify the original array; it returns a new, shuffled copy.
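A minimal demonstration of the difference (assuming numpy has been imported as np):
arr = np.arange(5)
print(np.random.permutation(arr))  # returns a shuffled copy; arr is unchanged
np.random.shuffle(arr)             # shuffles arr in place and returns None
print(arr)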
In [9]:
def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
In [10]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), 'train samples\n', len(test_set), 'test samples')
16512 train samples
4128 test samples
In [11]:
import hashlib
In [12]:
def test_set_check(identifier, test_ratio, hash):
    # a sample goes to the test set iff the last byte of the hash of its id
    # is below test_ratio * 256, so the split is stable across runs and
    # survives dataset refreshes that append new rows
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]
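A quick sanity check of the idea: the last byte of an MD5 digest is roughly uniform over 0-255, so about test_ratio of all ids should land in the test set (a sketch):
ratio = np.mean([test_set_check(i, 0.2, hashlib.md5) for i in range(10000)])
print(ratio)  # should be close to 0.2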
In [13]:
# use the row number as the id
housing_with_id = housing.reset_index() # adds an index column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
In [14]:
# build an id from the most stable features
housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')
train_test_split
In [15]:
from sklearn.model_selection import train_test_split
In [16]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
For important features, we want the test set to contain enough representative samples of every stratum.
For example, suppose domain experts tell us that the median income is a very strong predictor of house prices. Then we should make sure every income level
is represented in the test set.
In [17]:
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5) # divide by 1.5 to limit the number of income categories
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True) # merge every category above 5 into category 5
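An equivalent and arguably more readable construction uses pd.cut (a sketch; the bin edges mirror the divide-by-1.5-and-cap-at-5 scheme above):
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])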
In [18]:
housing['income_cat'].hist()
Out[18]:
In [19]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
In [20]:
housing['income_cat'].value_counts() / len(housing)
Out[20]:
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64
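To confirm that stratification preserved these proportions, one can compare the income-category distribution of the stratified test set against the full dataset (a quick check; run it before income_cat is dropped below):
strat_test_set['income_cat'].value_counts() / len(strat_test_set)
# the resulting proportions should be nearly identical to the ones above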
In [21]:
# drop the attribute that was only used for stratified sampling
for s in (strat_train_set, strat_test_set):
    s.drop(['income_cat'], axis=1, inplace=True)
4. Analyze the correlations between features
In [22]:
housing = strat_train_set.copy()
In [23]:
housing.plot(kind='scatter', x='longitude', y='latitude')
Out[23]:
In [24]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
Out[24]:
In [25]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population",
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
Out[25]:
In [26]:
corr_matrix = housing.corr() # corr computes the pairwise correlation coefficients between features
In [27]:
corr_matrix["median_house_value"].sort_values(ascending=False) # 跟街区价格中位数特征的其他特征的相关系数
Out[27]:
median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
In [28]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms","housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
Out[28]:
(Out: a 4x4 array of matplotlib AxesSubplot objects; the scatter-matrix plot of the four attributes is rendered here.)
In [29]:
housing.plot(kind='scatter', x='median_income',
y='median_house_value', alpha=0.1)
Out[29]:
In [30]:
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']
In [31]:
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
Out[31]:
median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
In [32]:
housing = strat_train_set.drop('median_house_value', axis=1) # drop the label column (drop returns a copy)
housing_labels = strat_train_set['median_house_value'].copy() # keep a copy of median_house_value as the labels
The total_bedrooms feature has some missing values that we need to handle.
There are generally three options:
1. housing.dropna(subset=["total_bedrooms"]) # option 1: drop the samples with missing values
2. housing.drop("total_bedrooms", axis=1) # option 2: drop the whole feature
3. median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median) # option 3: fill the missing values with the median
In [33]:
from sklearn.impute import SimpleImputer
In [34]:
imputer = SimpleImputer(strategy='median')
In [35]:
# the median strategy only works for numeric attributes
housing_num = housing.drop('ocean_proximity', axis=1)
In [36]:
imputer.fit(housing_num)
imputer.statistics_
Out[36]:
array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. ,
408. , 3.5409])
In [37]:
housing_num.median().values
Out[37]:
array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. ,
408. , 3.5409])
In [38]:
X = imputer.transform(housing_num)
In [39]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.describe()
Out[39]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
count | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 |
mean | -119.575834 | 35.639577 | 28.653101 | 2622.728319 | 533.998123 | 1419.790819 | 497.060380 | 3.875589 |
std | 2.001860 | 2.138058 | 12.574726 | 2138.458419 | 410.839621 | 1115.686241 | 375.720845 | 1.904950 |
min | -124.350000 | 32.540000 | 1.000000 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 0.499900 |
25% | -121.800000 | 33.940000 | 18.000000 | 1443.000000 | 296.000000 | 784.000000 | 279.000000 | 2.566775 |
50% | -118.510000 | 34.260000 | 29.000000 | 2119.500000 | 433.000000 | 1164.000000 | 408.000000 | 3.540900 |
75% | -118.010000 | 37.720000 | 37.000000 | 3141.000000 | 641.000000 | 1719.250000 | 602.000000 | 4.744475 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6210.000000 | 35682.000000 | 5358.000000 | 15.000100 |
In [40]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
Out[40]:
array([0, 0, 4, ..., 1, 0, 3])
In [41]:
# what do the codes 0-4 stand for? classes_ maps each integer code back to its original category
encoder.classes_
Out[41]:
array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)
Now every feature is numeric and can be consumed directly by a machine learning algorithm.
But think about what happens inside the algorithm, taking simple linear regression as an example:
$y = \sum_{j=0}^{n} \theta_j x_j$
The relative magnitude of the encoded values carries no meaning here: ocean proximity is a categorical feature, and there should be no order relation among its categories, yet the integer encoding imposes one. How do we fix this?
Use five binary features and set exactly one of them to 1 and the rest to 0, since each district belongs to exactly one category (one-hot encoding).
In [42]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# OneHotEncoder expects a 2-D array (samples x features), so we reshape the 1-D encoded vector into a single column with reshape(-1, 1)
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot # a sparse matrix
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
Out[42]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
with 16512 stored elements in Compressed Sparse Row format>
In [43]:
housing_cat_1hot.toarray()
Out[43]:
array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
...,
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
In older scikit-learn versions, OneHotEncoder could not consume string data, only numbers, which is why we first converted the categories to integers and only then to one-hot vectors.
scikit-learn also ships a transformer that does both steps in one shot: LabelBinarizer.
In [44]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
Out[44]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
...,
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 0]], dtype=int32)
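As the FutureWarning above hints, from scikit-learn 0.20 onward OneHotEncoder can encode string categories directly, making the LabelEncoder detour unnecessary. A sketch under that assumption (note the double brackets: the encoder expects a 2-D input):
cat_encoder = OneHotEncoder(sparse=False) # sparse=False returns a dense array
housing_cat_1hot = cat_encoder.fit_transform(housing[['ocean_proximity']])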
scikit-learn provides a great many useful transformers (transformers), but sometimes we still need to define our own.
scikit-learn relies on so-called duck typing rather than interface inheritance, so a custom transformer only needs to implement the three methods fit, transform, and fit_transform to be usable as a transformer (inheriting TransformerMixin gives us fit_transform for free).
Here we implement a transformer that adds combined features:
In [45]:
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
Feature scaling is an indispensable step.
The two most common scalers are:
MinMaxScaler, which rescales values into [0, 1]: $\frac{x - \min}{\max - \min}$
StandardScaler, which is far more tolerant of outliers: $\frac{x - \mu}{\sigma}$
Note: for any transformer (scalers included), never call fit on the full dataset; fit only on the training set.
Turning the raw data into the final form the learning algorithm consumes may require several transformations, and this sequence can be assembled with scikit-learn's Pipeline.
The Pipeline constructor takes a list whose elements are 2-tuples of a step name and a transformer instance. Every element of the list except the final estimator must be a transformer (i.e. it must have a fit_transform method).
When we call the pipeline's fit method, it calls fit_transform on each transformer in order and plain fit on the final estimator. In effect, the pipeline exposes exactly the same methods as its last estimator.
In [46]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
The pipeline above handles the numeric attributes; the categorical attribute needs its own transformations. To run a numeric pipeline and a categorical pipeline side by side and join their outputs, we can use scikit-learn's FeatureUnion.
In [47]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    # selects the given DataFrame columns and returns them as a NumPy array
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
In [48]:
from sklearn.pipeline import FeatureUnion
class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    # LabelBinarizer.fit_transform takes only (y), not (X, y), so it cannot be
    # used inside a Pipeline directly; this thin wrapper adapts its signature
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
In [49]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Out[49]:
array([[-1.15604281, 0.77194962, 0.74333089, ..., 0. ,
0. , 0. ],
[-1.17602483, 0.6596948 , -1.1653172 , ..., 0. ,
0. , 0. ],
[ 1.18684903, -1.34218285, 0.18664186, ..., 0. ,
0. , 1. ],
...,
[ 1.58648943, -0.72478134, -1.56295222, ..., 0. ,
0. , 0. ],
[ 0.78221312, -0.85106801, 0.18664186, ..., 0. ,
0. , 0. ],
[-1.43579109, 0.99645926, 1.85670895, ..., 0. ,
1. , 0. ]])
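Newer scikit-learn versions (0.20+) provide ColumnTransformer, which routes DataFrame columns to different pipelines directly and makes the DataFrameSelector/FeatureUnion plumbing unnecessary. A sketch under that assumption, reusing the transformers defined above:
from sklearn.compose import ColumnTransformer
num_pipeline_ct = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
full_pipeline_ct = ColumnTransformer([
    ('num', num_pipeline_ct, num_attribs), # numeric columns
    ('cat', OneHotEncoder(), cat_attribs), # the categorical column
])
housing_prepared_ct = full_pipeline_ct.fit_transform(housing)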
1. Linear model
In [50]:
from sklearn.linear_model import LinearRegression
In [51]:
lin_reg = LinearRegression()
# supervised learning: housing_prepared holds the features, housing_labels the targets
lin_reg.fit(housing_prepared, housing_labels)
Out[51]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
In [52]:
# sanity-check the model on a few training samples
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
Let's measure the RMSE on the whole training set with mean_squared_error:
In [53]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[53]:
68628.19819848923
Most districts' median house values lie between $120,000 and $265,000, so a typical prediction error of $68,628 is clearly not satisfactory.
The error is large on the training set itself,
which means the model is underfitting. The usual remedies for underfitting are to select a more powerful model, to feed the algorithm better features, or to loosen the constraints on the model (ours is not regularized, which leaves the first two).
Let's try a different model:
2. Decision tree
In [54]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
Out[54]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [55]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Out[55]:
0.0
We could use train_test_split again
to carve a validation set out of the training set,
but scikit-learn offers a more convenient cross-validation utility. Note that it expects a utility function (greater is better) rather than a cost function, which is why we score with neg_mean_squared_error and negate the scores before taking the square root.
In [56]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
In [57]:
def display_scores(scores):
    print("RMSE:", scores)
    print("Mean RMSE:", scores.mean())
    print("RMSE std:", scores.std())
display_scores(rmse_scores)
RMSE: [69327.01708558 65486.39211857 71358.25563341 69091.37509104
 70570.20267046 75529.94622521 69895.20650652 70660.14247357
 75843.74719231 68905.17669382]
Mean RMSE: 70666.74616904806
RMSE std: 2928.322738055112
In [58]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
RMSE: [66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean RMSE: 69052.46136345084
RMSE std: 2731.6740017983475
The decision tree's cross-validation RMSE is even worse than the linear model's, despite a training RMSE of 0: it is clearly overfitting.
3. Random forest
In [59]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print('training-set RMSE:', forest_rmse)
C:\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
training-set RMSE: 22256.354579885898
In [60]:
forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
(the same FutureWarning about the default value of n_estimators is emitted once per cross-validation fold; repetitions omitted)
RMSE: [52868.21871547 49194.92340594 51713.77716246 55012.10310441
50973.88861214 55990.4479905 52298.68285092 50953.09058463
54428.48087563 53461.73225682]
Mean RMSE: 52689.53455589254
RMSE std: 1980.36257012708
We usually experiment with many models, and with many hyperparameter settings for each.
It pays to persist every model to disk so that it can be reloaded later for quick side-by-side comparisons:
from sklearn.externals import joblib
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
In [61]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    # 3 x 4 = 12 combinations of n_estimators and max_features ...
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # ... plus 2 x 3 = 6 combinations with bootstrap=False: 18 in total,
    # each trained cv=5 times, i.e. 90 training runs
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
Out[61]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=None,
param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring='neg_mean_squared_error', verbose=0)
In [62]:
grid_search.best_params_
Out[62]:
{'max_features': 6, 'n_estimators': 30}
In [63]:
grid_search.best_estimator_
Out[63]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=30, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
In [64]:
# inspect the evaluation score of every hyperparameter combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
64560.56453532923 {'max_features': 2, 'n_estimators': 3}
55582.7022830937 {'max_features': 2, 'n_estimators': 10}
52946.41821544518 {'max_features': 2, 'n_estimators': 30}
60706.36585084631 {'max_features': 4, 'n_estimators': 3}
52620.50721816631 {'max_features': 4, 'n_estimators': 10}
50524.81738581646 {'max_features': 4, 'n_estimators': 30}
58527.66047437809 {'max_features': 6, 'n_estimators': 3}
52222.271025610295 {'max_features': 6, 'n_estimators': 10}
49935.06005093446 {'max_features': 6, 'n_estimators': 30}
58659.80511658877 {'max_features': 8, 'n_estimators': 3}
52363.81008209303 {'max_features': 8, 'n_estimators': 10}
50107.93319293479 {'max_features': 8, 'n_estimators': 30}
62413.99504781873 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54498.69196389087 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60335.933168485164 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52359.94950597992 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59131.91266427517 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51851.7761074725 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
When the hyperparameter space is large, scikit-learn's RandomizedSearchCV
makes the search for good hyperparameters much more practical.
It is used almost exactly like GridSearchCV,
but it offers two extra benefits: it evaluates a fixed number of random combinations, so each hyperparameter gets to explore many different values rather than only the few listed in a grid; and you control the computational budget directly through the number of iterations.
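A sketch of the randomized counterpart of the grid search above (the parameter ranges here are illustrative assumptions, not tuned values):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)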
Another way to improve is to combine the best-performing models into an ensemble; the ensemble often beats its individual members, especially when their error types differ markedly.
Analyzing the best models and their errors can also yield deeper insight, for example through the feature importances:
In [65]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[65]:
array([7.69687766e-02, 7.04319887e-02, 4.38660619e-02, 1.80630448e-02,
1.66508742e-02, 1.79482850e-02, 1.59942989e-02, 3.27548930e-01,
5.57736006e-02, 1.05319561e-01, 9.13965806e-02, 1.16638033e-02,
1.38350194e-01, 1.03976446e-04, 3.83940095e-03, 6.08062375e-03])
In [66]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Out[66]:
[(0.32754893001195934, 'median_income'),
(0.13835019350471953, 'INLAND'),
(0.10531956070724346, 'pop_per_hhold'),
(0.09139658055040592, 'bedrooms_per_room'),
(0.07696877663966344, 'longitude'),
(0.07043198869339834, 'latitude'),
(0.05577360056115168, 'rooms_per_hhold'),
(0.04386606192454317, 'housing_median_age'),
(0.018063044805051474, 'total_rooms'),
(0.01794828495917463, 'population'),
(0.016650874220719983, 'total_bedrooms'),
(0.01599429894210943, 'households'),
(0.01166380333543238, '<1H OCEAN'),
(0.006080623745976418, 'NEAR OCEAN'),
(0.0038394009525752966, 'NEAR BAY'),
(0.00010397644587548845, 'ISLAND')]
In [67]:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
47579.68069327446
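A single point estimate of the test error can be misleading. A sketch of a 95% confidence interval for the generalization RMSE (assuming scipy is available):
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)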
A few follow-up experiments:
1. Try a support vector machine regressor (sklearn.svm.SVR) with different kernels and different values of the regularization factor C.
2. Replace GridSearchCV with RandomizedSearchCV and see whether it finds better hyperparameters.
3. Use GridSearchCV to automatically decide whether certain data-preparation steps should be performed at all.