sklearn: Partial Dependence Plots


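Partial dependence plots (PDPs) show the dependence between the target function [2] and a set of "target" features, marginalizing over the values of all other features (the complement features). Because of the limits of human perception, the set of target features must be small (usually one or two), so the target features are usually chosen among the most important features (see feature_importances_).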
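To make the marginalization concrete, here is a minimal brute-force sketch: for each grid value, the target feature is overwritten in every row and the predictions are averaged over the data. The helper `brute_force_partial_dependence` and the model `toy_predict` are hypothetical names introduced for illustration; they are not part of scikit-learn, and library implementations may use faster methods.

```python
import numpy as np


def brute_force_partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                # fix the target feature
        averaged.append(predict(X_mod).mean())   # average over the complement features
    return np.array(averaged)


# Hypothetical stand-in for the predict method of a fitted model.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1] ** 2


rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 2))
grid = np.linspace(-1.0, 1.0, 5)

pd_values = brute_force_partial_dependence(toy_predict, X_demo, feature=0, grid=grid)
print(pd_values)
```

For this toy model the result is a straight line in the target feature, shifted by the average contribution of the other column. In practice `predict` would be something like the `predict` method of a fitted estimator.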

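This example shows how to obtain partial dependence plots from a GradientBoostingRegressor trained on the California housing dataset. The example is taken from [1].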
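Since target features are usually picked among the most important ones, a quick ranking by feature_importances_ can guide the choice. The sketch below uses small synthetic data (an assumption, so that it runs without downloading the housing dataset); the names `X_demo`, `y_demo`, and `model` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 4))
# Synthetic target: column 0 dominates, column 2 contributes a little.
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X_demo, y_demo)

# Feature indices sorted from most to least important.
ranking = np.argsort(model.feature_importances_)[::-1]
target_features = ranking[:2].tolist()  # candidates for one-way or two-way plots
print(target_features)
```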

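The figure shows four one-way and one two-way partial dependence plots. The target variables for the one-way PDPs are: median income (MedInc), average occupants per household (AveOccup), median house age (HouseAge), and average rooms per household (AveRooms).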
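The script further down refers to these features by column index. As a small sketch, the indices can be derived from the feature names instead of being typed by hand. The name list is hard-coded here (an assumption made so the snippet runs without fetching the data); it follows the column order reported by `fetch_california_housing().feature_names`.

```python
# Column order of the California housing features.
names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
         'Population', 'AveOccup', 'Latitude', 'Longitude']

one_way = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms']
two_way = [('AveOccup', 'HouseAge')]

# Integers request one-way plots; tuples request two-way plots.
features = [names.index(name) for name in one_way]
features += [tuple(names.index(name) for name in pair) for pair in two_way]
print(features)  # [0, 5, 1, 2, (5, 1)]
```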

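We can clearly see that the median house price has a linear relationship with the median income (top left) and that the house price drops as the average number of occupants per household increases (top middle). The top right plot shows that the house age in a district does not have a strong influence on the (median) house price; the same holds for the average number of rooms per household. The tick marks on the x-axis represent the deciles of the feature values in the training data.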
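The decile ticks can be reproduced by hand for any feature column. The following sketch uses a synthetic, skewed column as a stand-in for a real feature (an assumption); with the housing data one would pass a column of the training matrix instead.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in for one feature column of the training data.
feature_values = rng.lognormal(mean=1.0, sigma=0.5, size=1000)

# 10th, 20th, ..., 90th percentiles: nine cut points splitting the data into ten bins.
deciles = np.percentile(feature_values, np.arange(10, 100, 10))
print(deciles)
```

Closely spaced ticks mark regions with many training samples; partial dependence values in sparsely covered regions rest on fewer observations and deserve more caution.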

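Partial dependence plots with two target features let us visualize interactions between them. The two-way partial dependence plot shows the dependence of the median house price on joint values of house age and average occupants per household. We can clearly see an interaction between the two features: for an average occupancy greater than two, the house price is nearly independent of the house age, whereas for values less than two there is a strong dependence on age.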
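One way to check such an interaction numerically is to compare how much the two-way surface varies along one feature for different values of the other. The sketch below does this with a hypothetical model (`toy_predict`) built to mimic the pattern described above; the helper `two_way_partial_dependence` is likewise illustrative and not a scikit-learn function.

```python
import numpy as np


def two_way_partial_dependence(predict, X, features, grids):
    """Average prediction over X for every pair of values on the two grids."""
    X = np.asarray(X, dtype=float)
    f0, f1 = features
    surface = np.empty((len(grids[0]), len(grids[1])))
    for i, v0 in enumerate(grids[0]):
        for j, v1 in enumerate(grids[1]):
            X_mod = X.copy()
            X_mod[:, f0] = v0
            X_mod[:, f1] = v1
            surface[i, j] = predict(X_mod).mean()
    return surface


# Hypothetical model: column 0 ("age") only matters when column 1
# ("occupancy") is below 2.
def toy_predict(X):
    age, occupancy = X[:, 0], X[:, 1]
    return np.where(occupancy < 2.0, 0.05 * age, 0.0) + X[:, 2]


rng = np.random.RandomState(0)
X_demo = np.column_stack([
    rng.uniform(1, 52, 300),   # age-like column
    rng.uniform(1, 5, 300),    # occupancy-like column
    rng.normal(size=300),      # unrelated complement feature
])
age_grid = np.linspace(1, 52, 6)
occ_grid = np.array([1.5, 3.0])
surface = two_way_partial_dependence(toy_predict, X_demo, (0, 1), (age_grid, occ_grid))

# Spread of the partial dependence along the age axis, per occupancy value.
spread = surface.max(axis=0) - surface.min(axis=0)
print(dict(zip(occ_grid.tolist(), spread.tolist())))
```

A large spread at low occupancy next to a near-zero spread at high occupancy is the numerical counterpart of the pattern visible in the two-way plot.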

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
# The old sklearn.ensemble.partial_dependence module has been removed.
# The sklearn.inspection calls below assume scikit-learn >= 1.3
# (the ``grid_values`` key was introduced in that release).
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence
from sklearn.datasets import fetch_california_housing

# fetch California housing dataset
cal_housing = fetch_california_housing()

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                    cal_housing.target,
                                                    test_size=0.2,
                                                    random_state=1)
names = cal_housing.feature_names

print('_' * 80)
print("Training GBRT...")
clf = GradientBoostingRegressor(n_estimators=100, max_depth=4,
                                learning_rate=0.1, loss='huber',
                                random_state=1)
clf.fit(X_train, y_train)
print("done.")

print('_' * 80)
print('Convenience plot with ``PartialDependenceDisplay.from_estimator``')
print()

# 0: MedInc, 5: AveOccup, 1: HouseAge, 2: AveRooms, (5, 1): two-way plot
features = [0, 5, 1, 2, (5, 1)]
display = PartialDependenceDisplay.from_estimator(clf, X_train, features,
                                                  feature_names=names,
                                                  n_jobs=3, grid_resolution=50)
fig = display.figure_
fig.suptitle('Partial dependence of house value on nonlocation features\n'
             'for the California housing dataset')
plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle

print('_' * 80)
print('Custom 3d plot via ``partial_dependence``')
print()
fig = plt.figure()

target_feature = (1, 5)  # (HouseAge, AveOccup)
pdp = partial_dependence(clf, X_train, features=list(target_feature),
                         kind='average', grid_resolution=50)
x_axis, y_axis = pdp['grid_values']
XX, YY = np.meshgrid(x_axis, y_axis)
# pdp['average'] has shape (n_outputs, len(x_axis), len(y_axis));
# transpose so that Z matches the (len(y_axis), len(x_axis)) meshgrid
Z = pdp['average'][0].T
# Axes3D(fig) no longer attaches the axes to the figure in recent Matplotlib
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1, cmap=plt.cm.BuPu)
ax.set_xlabel(names[target_feature[0]])
ax.set_ylabel(names[target_feature[1]])
ax.set_zlabel('Partial dependence')
#  pretty init view
ax.view_init(elev=22, azim=122)
plt.colorbar(surf)
plt.suptitle('Partial dependence of house value on median age and '
            'average occupancy')
plt.subplots_adjust(top=0.9)

plt.show()

 
