Hands-On Guide to Plotting a Decision Surface for ML in Python
Introduction
Lately, I had been struggling to visualize the output of a classification model. I relied only on the classification report and the confusion matrix to weigh the model performance.
However, visualizing the classification results has its charm and makes the model easier to reason about. So I built a decision surface, and once I succeeded, I decided to write it up as a learning exercise and for anyone who might be stuck on the same issue.
Tutorial content
In this tutorial, I will start with the built-in dataset package within the Sklearn library to focus on the implementation steps. After that, I will use pre-processed data (without missing values or outliers) to plot the decision surface after applying the standard scaler.
- Decision Surface
- Importing important libraries
- Dataset generation
- Generating decision surface
- Applying to real data
Decision Surface
Classification in machine learning means training a model on your data to assign class labels to input examples.
Each input feature defines an axis in the feature space. With a minimum of two input features the feature space is a plane, with dots representing input coordinates. If there were three input variables, the feature space would be a three-dimensional volume.
The ultimate goal of classification is to separate the feature space so that labels are assigned to points in the feature space as correctly as possible.
This method is called a decision surface or decision boundary, and it works as a demonstrative tool for explaining a model on a classification predictive modeling task. If there are more than two input features, we can create a decision surface for each pair of them.
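As a quick illustration of that last point, here is a minimal sketch of looping over feature pairs, assuming a feature matrix X with more than two columns; the fit-and-plot recipe developed below would be repeated for each pair:
from itertools import combinations
# loop over every pair of feature columns; each pair gets its own 2-D surface
for i, j in combinations(range(X.shape[1]), 2):
    # select just this pair of columns and reuse the plotting recipe below
    X_pair = X[:, [i, j]]
    print(f'features {i} and {j}: X_pair shape = {X_pair.shape}')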
Importing important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
Generate dataset
I will use the make_blobs() function within the datasets module from the Sklearn library to generate a custom dataset. Doing so lets us focus on the implementation rather than on cleaning the data; the steps are otherwise the same and follow a typical pattern. Let's start by defining the dataset variables with 1000 samples, only two features, and a standard deviation of 3 for simplicity's sake.
X, y = datasets.make_blobs(n_samples = 1000,
centers = 2,
n_features = 2,
random_state = 1,
cluster_std = 3)
Once the dataset is generated, we can draw a scatter plot to see the variability between the two classes.
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create a scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()
Here we looped over the dataset and plotted the points of X, colored by the class label y. In the next step, we need to build a predictive classification model to predict the class of unseen points. Logistic regression can be used in this case since we have only two categories.
Develop the logistic regression model
regressor = LogisticRegression()
# fit the regressor on X and y
regressor.fit(X, y)
# apply the predict method
y_pred = regressor.predict(X)
The predictions in y_pred can be evaluated using the accuracy_score function from the sklearn library.
accuracy = accuracy_score(y, y_pred)
print('Accuracy: %.3f' % accuracy)
## Accuracy: 0.972
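Since confusion_matrix is already imported above, we can also check where the remaining misclassifications fall; a minimal sketch using the variables defined so far:
# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)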
Generating decision surface
matplotlib provides a handy function called contourf(), which can fill in colors between grid points. However, as the documentation suggests, we need to define a grid of points covering the feature space. The starting point is to find the minimum and maximum value of each feature, then widen that range by one on each side to make sure the whole space is covered.
min1, max1 = X[:, 0].min() - 1, X[:, 0].max() + 1 #1st feature
min2, max2 = X[:, 1].min() - 1, X[:, 1].max() + 1 #2nd feature
Then we can define the scale of the coordinates using the arange() function from the numpy library, here with a 0.1 resolution, to get the scale range.
x1_scale = np.arange(min1, max1, 0.1)
x2_scale = np.arange(min2, max2, 0.1)
The next step is converting x1_scale and x2_scale into a grid. The meshgrid() function from the numpy library is what we need.
x_grid, y_grid = np.meshgrid(x1_scale, x2_scale)
The generated x_grid is a 2-D array. To be able to use it, we need to reduce it to a one-dimensional vector using the flatten() method from the numpy library.
# flatten each grid to a vector
x_g, y_g = x_grid.flatten(), y_grid.flatten()
x_g, y_g = x_g.reshape((len(x_g), 1)), y_g.reshape((len(y_g), 1))
Finally, we stack the vectors side by side as columns of an input dataset, like the original dataset but at a much higher resolution.
grid = np.hstack((x_g, y_g))
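As a side note, the flatten, reshape, and hstack steps can be collapsed into a single call; a minimal equivalent sketch using numpy's column_stack:
# equivalent one-step construction of the grid input
grid = np.column_stack((x_grid.ravel(), y_grid.ravel()))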
Now, we can feed the grid into the fitted model to predict values.
# make predictions for the grid
y_pred_2 = regressor.predict(grid)
# predict the probabilities
p_pred = regressor.predict_proba(grid)
# keep just the probabilities for class 0
p_pred = p_pred[:, 0]
# reshape the results to match the grid
pp_grid = p_pred.reshape(x_grid.shape)
Now, a grid of values and the predicted class probabilities across the feature space have been generated.
Subsequently, we will plot those grids as a filled contour plot using contourf(). The contourf() function needs a separate grid per axis. To achieve that, we can use x_grid and y_grid and reshape the predictions (pp_grid) to the same shape.
# plot the grid of x, y and z values as a surface
surface = plt.contourf(x_grid, y_grid, pp_grid, cmap='Pastel1')
plt.colorbar(surface)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create a scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Pastel1')
# show the plot
plt.show()
Apply to real data
Now it is time to apply the previous steps to real data to tie everything together. As I mentioned earlier, this dataset is already cleaned, with no missing values. The dataset represents the car purchase history of a sample of people according to their age and yearly salary.
dataset = pd.read_csv('../input/logistic-reg-visual/Social_Network_Ads.csv')
dataset.head()
The dataset has two features, Age and EstimatedSalary, and one dependent variable, Purchased, stored as a binary column. A value of 0 means a person of that age and salary did not purchase a car, while 1 means the person did purchase one. The next step is to separate the dependent variable from the features as X and y.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state = 0)
Feature scaling
We need this step because Age and EstimatedSalary are not on the same scale.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Building the Logistic model and fitting the training data
classifier = LogisticRegression(random_state = 0)
# fit the classifier on the training data
classifier.fit(X_train, y_train)
# predicting the values of y
y_pred = classifier.predict(X_test)
Plot the decision surface: training results
#1. reverse the standard scaler on X_train
X_set, y_set = sc.inverse_transform(X_train), y_train
#2. generate the decision surface boundaries
min1, max1 = X_set[:, 0].min() - 10, X_set[:, 0].max() + 10        # for Age
min2, max2 = X_set[:, 1].min() - 1000, X_set[:, 1].max() + 1000    # for salary
#3. set the coordinate scale resolution
x_scale, y_scale = np.arange(min1, max1, 0.25), np.arange(min2, max2, 0.25)
#4. convert into a grid
X1, X2 = np.meshgrid(x_scale, y_scale)
#5. flatten X1 and X2 and return the output as a numpy array
X_flatten = np.array([X1.ravel(), X2.ravel()])
#6. transform the grid points into the scaled space the classifier was trained on
X_transformed = sc.transform(X_flatten.T)
#7. generate the predictions and reshape them to the grid shape
Z_pred = classifier.predict(X_transformed).reshape(X1.shape)
#8. set the plot size
plt.figure(figsize=(20, 10))
#9. plot the contour function
plt.contourf(X1, X2, Z_pred,
             alpha = 0.75,
             cmap = ListedColormap(('red', 'green')))
#10. set the axes limits
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
#11. scatter plot of the training points ([age, salary] colored by actual class)
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0],
                X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i),
                label = j)
#12. plot labels and adjustments
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Decision plot for the test set
It is exactly the same as the previous code, but using the test set instead of the training set, as in the sketch below.
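For completeness, here is a minimal sketch of the test-set version, assuming the training code above has already run; only X_test and y_test are swapped in:
# reverse the standard scaler on the test data
X_set, y_set = sc.inverse_transform(X_test), y_test
# rebuild the grid over the test-data range
min1, max1 = X_set[:, 0].min() - 10, X_set[:, 0].max() + 10
min2, max2 = X_set[:, 1].min() - 1000, X_set[:, 1].max() + 1000
X1, X2 = np.meshgrid(np.arange(min1, max1, 0.25), np.arange(min2, max2, 0.25))
# scale the grid points, predict, and reshape to the grid shape
X_transformed = sc.transform(np.array([X1.ravel(), X2.ravel()]).T)
Z_pred = classifier.predict(X_transformed).reshape(X1.shape)
# draw the surface and overlay the test points
plt.figure(figsize=(20, 10))
plt.contourf(X1, X2, Z_pred, alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()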
Conclusion
Finally, I hope this boilerplate helps in visualizing classification model results. I recommend applying the same steps with another classification model, for example, an SVM with more than two features (see the sketch below). Thanks for reading; I look forward to any constructive comments.
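If you want to try the SVM variant, only the classifier construction changes; a minimal sketch using sklearn's SVC, with the scaling and plotting steps kept as above:
from sklearn.svm import SVC
# kernel SVM instead of logistic regression; fit and plot as before
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)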
Translated from: https://towardsdatascience.com/hands-on-guide-to-plotting-a-decision-surface-for-ml-in-python-149710ee2a0e