Machine Learning with Scikit-Learn and Tensorflow 6.9: Limitations of Decision Trees

Book information
Hands-On Machine Learning with Scikit-Learn and Tensorflow
Publisher: O’Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Product dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291

This series of posts is a translation of the book.
Code and data download: https://github.com/ageron/handson-ml

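Decision trees have a few limitations. First, they favor axis-aligned (horizontal/vertical) decision boundaries, because every split tests a single feature against a threshold. This makes them sensitive to rotation of the data. In the figure below, for example, the data on the right is the data on the left rotated by 45 degrees. The left model is clearly better than the right one: it separates the two classes with a single split, while the right model needs a staircase of splits to approximate a diagonal boundary. One way to address this problem is to preprocess the data with PCA, the technique commonly used for dimensionality reduction.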
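As a minimal sketch of the PCA idea, the snippet below re-orients rotated data with PCA before training a tree. It rests on assumptions that are not in the original post: the data is hypothetical and deliberately elongated along one axis, because PCA can only recover an orientation when the data has a dominant direction of variance. The uniformly distributed square data used in this section's example has (in expectation) the same variance in every direction, so PCA would not be expected to undo its rotation. All components are kept, so PCA acts here as a change of axes rather than as dimensionality reduction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical elongated data: wide along the first axis, narrow along the second.
X = rng.randn(200, 2) * np.array([3.0, 0.5])
y = (X[:, 0] > 0).astype(int)

# Rotate by 45 degrees so that the class boundary becomes diagonal.
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle), np.cos(angle)]])
X_rot = X.dot(rotation_matrix)

# Keep both components: PCA only changes the axes, no information is dropped.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_rot)

tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y)
tree_pca = DecisionTreeClassifier(random_state=42).fit(X_pca, y)

print("share of variance on the first component:", pca.explained_variance_ratio_[0])
print("tree depth on rotated data:", tree_rot.get_depth())
print("tree depth after PCA:      ", tree_pca.get_depth())
```

The recovered axes are only an estimate, so the tree trained after PCA is often, but not always, shallower than the one trained on the rotated data.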

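# imports used by the code in this post
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier

# create the data
rnd.seed(6)
Xs = rnd.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32)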

# do the rotation
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

# train the model
tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)

# visualize the model
def plot_decision_boundary(clf, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#a0faa0'])
    # contourf has no "linewidth" argument, so it is dropped here
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, [-0.7, 0.7, -0.7, 0.7])
plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, [-0.7, 0.7, -0.7, 0.7])
plt.show()

[Figure 1: decision boundaries of the trees trained on the original data (left) and on the rotated data (right)]
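The difference between the two models can also be read off the trees themselves rather than the plot. The following sketch rebuilds the same two datasets and compares the depth and leaf count of each fitted tree; `get_depth()` and `get_n_leaves()` are available in recent scikit-learn releases (0.21 and later), which is an assumption about the installed version.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same data as in the example above.
np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32)

# Same 45 degree rotation.
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

tree_s = DecisionTreeClassifier(random_state=42).fit(Xs, ys)
tree_sr = DecisionTreeClassifier(random_state=42).fit(Xsr, ys)

# Report how complex each tree had to become to fit its training set.
for name, tree in [("original", tree_s), ("rotated", tree_sr)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

On the original data one split on the first feature is enough, whereas the rotated data forces the tree to grow deeper to follow the diagonal boundary.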

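More generally, the main issue with decision trees is that they are very sensitive to small variations in the training data. For example, if we remove one particular data point from the iris dataset (the widest Iris-Versicolor, with a petal width of 1.8), we get a completely different decision tree (the upper figure shows the tree obtained in 6.2, the lower figure shows the new tree). We may even get different results on the very same dataset; to get reproducible results, set the random_state parameter.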

[Figure 2: decision boundaries of the decision tree obtained in 6.2]

# load the data
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

# remove certain point
not_widest_versicolor = (X[:, 1]!=1.8) | (y==2)
X_tweaked = X[not_widest_versicolor]
y_tweaked = y[not_widest_versicolor]

# train the model
tree_clf_tweaked = DecisionTreeClassifier(max_depth=2, random_state=40)
tree_clf_tweaked.fit(X_tweaked, y_tweaked)

# visualize the model
from matplotlib.colors import ListedColormap
plt.figure(figsize=(9, 4))
x1s = np.linspace(0, 7.5, 100)
x2s = np.linspace(0, 3, 100)
x1, x2 = np.meshgrid(x1s, x2s)
X_new = np.c_[x1.ravel(), x2.ravel()]
y_pred = tree_clf_tweaked.predict(X_new).reshape(x1.shape)
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
plt.plot(X_tweaked[:, 0][y_tweaked==0], X_tweaked[:, 1][y_tweaked==0], "yo", label="Iris-Setosa")
plt.plot(X_tweaked[:, 0][y_tweaked==1], X_tweaked[:, 1][y_tweaked==1], "bs", label="Iris-Versicolor")
plt.plot(X_tweaked[:, 0][y_tweaked==2], X_tweaked[:, 1][y_tweaked==2], "g^", label="Iris-Virginica")
plt.axis([0, 7.5, 0, 3])
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.text(1.0, 0.9, "Depth=0", fontsize=15)
plt.text(1.0, 1.80, "Depth=1", fontsize=13)
plt.show()

[Figure 3: decision boundaries of the tree trained on the tweaked iris data]
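To isolate the effect of the removed point from the effect of the random seed, one option is to train on the full and the tweaked data with the same random_state and compare the resulting trees as text. This is only a sketch: the value of `SEED` and the short feature names are arbitrary choices made for illustration, and `export_text` requires a recent scikit-learn release (0.21 or later).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# Same mask as above: drop non-Virginica samples whose petal width is 1.8.
keep = (X[:, 1] != 1.8) | (y == 2)
X_tweaked, y_tweaked = X[keep], y[keep]

feature_names = ["petal length", "petal width"]  # illustrative labels
SEED = 42  # arbitrary; what matters is that both trees share it

tree_full = DecisionTreeClassifier(max_depth=2, random_state=SEED).fit(X, y)
tree_tweaked = DecisionTreeClassifier(max_depth=2, random_state=SEED).fit(X_tweaked, y_tweaked)

# Print both trees as indented rules so that they can be compared line by line.
text_full = export_text(tree_full, feature_names=feature_names)
text_tweaked = export_text(tree_tweaked, feature_names=feature_names)
print("full data:\n" + text_full)
print("tweaked data:\n" + text_tweaked)
```

If the two printouts differ while the seed is held fixed, the difference comes from the data; if they match, the seed is the more likely cause.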

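Translator's note:
In my experiments, the difference in results here is caused by the random_state parameter, not by the removed data point. Note that the decision tree in 6.1 was trained with random_state=42, whereas the tree here uses random_state=40. Training the tree from 6.1 with random_state=40 gives a similar result.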

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=40)
tree_clf.fit(X, y)

plt.figure(figsize=(9, 4))
x1s = np.linspace(0, 7.5, 100)
x2s = np.linspace(0, 3, 100)
x1, x2 = np.meshgrid(x1s, x2s)
X_new = np.c_[x1.ravel(), x2.ravel()]
y_pred = tree_clf.predict(X_new).reshape(x1.shape)
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
plt.axis([0, 7.5, 0, 3])
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.text(1.0, 0.9, "Depth=0", fontsize=15)
plt.text(1.0, 1.80, "Depth=1", fontsize=13)
plt.show()

[Figure 4: decision boundaries of the tree trained on the full iris data with random_state=40]
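The role of random_state can be probed further by training the same tree under several seeds and recording the root split each time. The sketch below assumes the range of seeds is an arbitrary choice. On these two petal features there are two candidate root splits that each isolate Iris-Setosa perfectly (petal length at about 2.45 and petal width at about 0.8), so they are tied, and scikit-learn resolves such ties according to the order in which it examines the features, which depends on random_state.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
feature_names = ["petal length", "petal width"]  # illustrative labels

root_splits = []
for seed in range(10):  # arbitrary set of seeds
    tree = DecisionTreeClassifier(max_depth=2, random_state=seed).fit(X, y)
    # Node 0 is the root; tree_.feature and tree_.threshold describe its test.
    feature = int(tree.tree_.feature[0])
    threshold = float(tree.tree_.threshold[0])
    root_splits.append((feature, threshold))
    print("random_state=%d: root splits on %s <= %.2f"
          % (seed, feature_names[feature], threshold))
```

Seeds that pick different root features produce trees that look different in the plots above even though the training data is identical.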

Ensemble methods such as random forests can reduce this instability by training many trees.
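As a rough illustration of how an ensemble combines many trees, the sketch below trains scikit-learn's `RandomForestClassifier` on the same two petal features and checks that the class probabilities of the forest are the average of the probabilities of its individual trees. The hyperparameters (`n_estimators=100`, `max_depth=2`, `random_state=42`) are assumptions chosen for illustration, not values taken from the book.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)
forest.fit(X, y)

# Each fitted tree is stored in forest.estimators_.
tree_probas = np.array([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X)

print("number of trees:", len(forest.estimators_))
print("forest output equals the mean over trees:", np.allclose(mean_proba, forest_proba))
```

Because each prediction is an average over many trees, a change that flips the structure of a few individual trees has a much smaller effect on the final output.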
