Introduction to Yellowbrick: A Python Library to Explain the Predictions of Your Machine Learning Model

Motivation

Congratulations! You have just trained a model and improved your f1-score to 98%! But what does that really mean? Does the increase in f1-score indicate that your model is performing better?

You know the f1-score is the harmonic mean of recall and precision, but how many of the positive predictions are false? And how many of the negative predictions are false?
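
As a quick refresher, here is a toy check of that definition with scikit-learn (the labels below are invented purely for illustration):

from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, purely for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # 0.75
r = recall_score(y_true, y_pred)      # 0.75
print(2 * p * r / (p + r))            # 0.75, the harmonic mean
print(f1_score(y_true, y_pred))       # 0.75, identical by definition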

If you want to tune the hyperparameters of your machine learning model to improve the f1-score to 99%, which category should you focus on improving to gain that extra 1%?

Gaining these insights helps you understand your machine learning results and know which actions to take to improve the model. One of the best ways to understand machine learning is through plots. That is when Yellowbrick becomes helpful.

What is Yellowbrick?

Yellowbrick is a machine learning visualization library. Essentially, Yellowbrick makes it easier for you to:

  • Select features
  • Tune hyperparameters
  • Interpret the scores of your models
  • Visualize text data

Being able to analyze your data and model with plots makes it much easier to understand your model and figure out the next steps to increase the scores that are meaningful to your goal.

In this article, we will play with a classification problem to learn which tools Yellowbrick provides that can help you interpret your classification results.

To install Yellowbrick, type:

pip install yellowbrick

We will use the occupancy dataset: experimental data for binary classification (room occupancy) based on temperature, humidity, light, and CO2 measurements. The ground-truth occupancy was obtained from time-stamped pictures taken every minute.

from yellowbrick.datasets.loaders import load_occupancy
import warnings
warnings.filterwarnings('ignore')

# Load the occupancy dataset: the features X and the binary target y
X, y = load_occupancy()

Visualize the Data

Rank Features

Which features are the most relevant to the predicted column? Rank1D makes it easy for us to rank features with a ranking algorithm that considers only a single feature at a time.

from yellowbrick.features import Rank1D

visualizer = Rank1D(algorithm='shapiro')
visualizer.fit(X, y)           # Fit the data to the visualizer
visualizer.transform(X)        # Transform the data and compute the ranks
visualizer.show()              # Finalize and render the figure

[Figure: Shapiro rank of each feature]

Based on the plot, humidity is the strongest indicator of occupancy. That makes sense, since the more people in the room, the more humid the room would be.

Class Balance

One of the biggest challenges for classification models is an imbalance of classes in the training data. A high f1-score might not be a good evaluation score for an imbalanced dataset, because the classifier can simply predict the majority class every time and still get a high score.
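
To see why, here is a toy illustration (the numbers are invented for this sketch, not taken from the occupancy data): always predicting the majority class on a 90/10 split scores 90% accuracy while catching none of the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 90 + [1] * 10)   # 90% of samples in the majority class
y_pred = np.zeros(100)                   # always predict the majority class
print(accuracy_score(y_true, y_pred))    # 0.9
print(f1_score(y_true, y_pred))          # 0.0, no positive is ever predicted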

Thus, it is important to visualize the distribution of the classes. We can use ClassBalance to visualize the class distribution with a bar chart.

from yellowbrick.target import ClassBalance

visualizer = ClassBalance(labels=["unoccupied", "occupied"])

visualizer.fit(y)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

[Figure: class balance bar chart]

It seems that far more samples are classified as unoccupied than occupied. Knowing this, we can apply several techniques for dealing with class imbalance, such as stratified sampling or class weighting, to get a more informative result.
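
Here is a minimal sketch of both techniques with scikit-learn (the split parameters below are illustrative assumptions, not taken from the original article); we will reuse this train/test split in the sections below:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stratified sampling: preserve the unoccupied/occupied ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Weighting: class_weight="balanced" penalizes minority-class errors more heavily
weighted_model = DecisionTreeClassifier(class_weight="balanced")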

Visualize the Results of the Model

Now we come back to the question: what does an f1-score of 98% really mean? Does the increase in f1-score result in more profit for your company?

Yellowbrick provides several tools you can use to visualize the results of classification problems. Some of them you may or may not have heard of, and they can be extremely helpful for interpreting your model.

Confusion Matrix

What is the percentage of false predictions among the unoccupied class? What is the percentage of false predictions among the occupied class? The confusion matrix helps us answer these questions.

from sklearn.tree import DecisionTreeClassifier
from yellowbrick.classifier import ConfusionMatrix

# X_train, X_test, y_train, y_test come from the stratified split above
model = DecisionTreeClassifier()
classes = ["unoccupied", "occupied"]

# percent=True shows percentages instead of raw counts
cm = ConfusionMatrix(model, classes=classes, percent=True)

cm.fit(X_train, y_train)    # Fit the model to the training data
cm.score(X_test, y_test)    # Evaluate on the test data and fill the matrix
cm.show()                   # Finalize and render the figure

[Figure: confusion matrix shown as percentages]

It seems that the occupied class has a higher percentage of wrong predictions; thus, we could focus on increasing the number of correct predictions for the occupied class to improve the score.

ROCAUC

Imagine we improve our f1-score to 99%. How do we know that the model is actually better? The confusion matrix could help, but instead of comparing the percentage of correct predictions in each class between two models, is there an easier way to compare the performance of two models? That is when ROC AUC would be helpful.

A ROCAUC plot allows the user to visualize the tradeoff between the classifier’s sensitivity and specificity. A ROC curve displays the true positive rate on the Y axis and the false positive rate on the X axis.

from yellowbrick.classifier import ROCAUC

visualizer = ROCAUC(model, classes=classes)
visualizer.fit(X_train, y_train)        # Fit the model to the training data
visualizer.score(X_test, y_test)        # Evaluate the model on the test data
visualizer.show()                       # Finalize and render the figure

[Figure: ROC curves and AUC scores]

The ideal point is therefore the top-left corner of the plot: false positives are zero and true positives are one. The higher the area under the curve (AUC), the better the model generally is.

Considering that our ROC curves are near the top-left corner, the performance of our model is really good. If a different model or different hyperparameters produce ROC curves closer to the top-left corner than our current ones, we can be confident that the performance of the model has actually improved.
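
If you also want the AUC as a single number for programmatic comparison, a quick cross-check with scikit-learn is possible (a sketch, assuming the model fitted above and the same test split):

from sklearn.metrics import roc_auc_score

# Probability of the positive (occupied) class for each test sample
probs = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))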

How do we improve the model?

Now that we understand the performance of our model, how do we go about improving it? To improve our model, we might want to:

  • Prevent our model from underfitting or overfitting
  • Find the features that are the most important to the estimator

We will explore the tools provided by Yellowbrick to help us figure out how to improve our model.

Validation Curve

A model can have many hyperparameters, and we might select the hyperparameters that most accurately predict the training data. A good way to find the best hyperparameters is to try combinations of those parameters with a grid search.
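
For example, a minimal grid search with scikit-learn might look like the sketch below (the grid values are illustrative assumptions, not taken from the original article):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid of decision tree hyperparameters to search over
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)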

But how do we know that those hyperparameters will also accurately predict the test data? It is useful to plot the influence of a single hyperparameter on the training and test data to determine if the estimator is underfitting or overfitting for some hyperparameter values.

A validation curve helps us find the sweet spot: a lower or higher value of the hyperparameter than this point would result in underfitting or overfitting the data.

import numpy as np
from yellowbrick.model_selection import ValidationCurve

viz = ValidationCurve(
    model, param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="f1",
)
viz.fit(X, y)     # Fit the data to the visualizer
viz.show()        # Finalize and render the figure

[Figure: validation curve for max_depth]

As we can see from the plot, while a higher max depth results in a higher training score, it also results in a lower cross-validation score. This makes sense, since decision trees overfit more the deeper they grow.

Thus, the sweet spot is where the cross-validation score has not yet started to decrease, which here is a max depth of 1.
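
To act on this finding, we could refit the classifier at that depth (a sketch; tuned_model is a hypothetical name, not from the original article):

from sklearn.tree import DecisionTreeClassifier

# Refit at the sweet spot suggested by the validation curve
tuned_model = DecisionTreeClassifier(max_depth=1)
tuned_model.fit(X_train, y_train)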

Learning Curve

Will more data result in better performance of the model? Not always; the estimator may simply become more sensitive to error due to variance. That is when the learning curve is helpful.

A learning curve shows the relationship between the training score and the cross-validated test score for an estimator with a varying number of training samples.

import numpy as np
from yellowbrick.model_selection import LearningCurve
from sklearn.model_selection import StratifiedKFold

# Create the learning curve visualizer
cv = StratifiedKFold(n_splits=12)
sizes = np.linspace(0.3, 1.0, 10)

visualizer = LearningCurve(
    model, cv=cv, scoring='f1', train_sizes=sizes,
)

visualizer.fit(X, y)        # Fit the data to the visualizer
visualizer.show()           # Finalize and render the figure

[Figure: learning curve of training vs. cross-validation f1-score]

From the graph, we can see that around 8,700 training instances yield the best f1-score; more training instances beyond that result in a lower f1-score.

Feature Importances

Having more features is not always equivalent to a better model. The more features a model has, the more sensitive it is to errors due to variance. Thus, we want to select the minimum number of features required to produce a valid model.

A common approach to eliminating features is to drop the ones that are the least important to the model, and then re-evaluate whether the model actually performs better during cross-validation.

FeatureImportances is perfect for this task, since it helps us visualize the relative importance of the features to the model.

from yellowbrick.model_selection import FeatureImportances

viz = FeatureImportances(model)
viz.fit(X, y)     # Fit the model and compute the feature importances
viz.show()        # Finalize and render the figure

[Figure: relative feature importances for the decision tree]

It seems that light is the most important feature to the DecisionTreeClassifier, followed by CO2 and temperature.

Considering that there are not many features in our data, we will not eliminate humidity. But if a model has many features, we should eliminate the ones that are not important to the model to prevent errors due to variance.
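
If we did want to try that workflow, a sketch with scikit-learn could look like this (the column name "humidity" is an assumption; check X.columns for the actual names in the loaded DataFrame):

from sklearn.model_selection import cross_val_score

# Compare cross-validated f1 before and after dropping the least important feature
X_reduced = X.drop(columns=["humidity"])   # hypothetical column name
print(cross_val_score(model, X, y, cv=10, scoring="f1").mean())
print(cross_val_score(model, X_reduced, y, cv=10, scoring="f1").mean())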

Conclusion

Congratulations! You have just learned how to create plots that help you interpret the results of your model. Being able to understand your machine learning results makes it easier to figure out the next steps for improving their performance. Yellowbrick has more functionality than I have mentioned here. To learn more about how to use Yellowbrick to interpret your machine learning models, I encourage you to check the documentation here.

The source code of this repo can be found here.

I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.

Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed with my latest data science articles like these:

Translated from: https://towardsdatascience.com/introduction-to-yellowbrick-a-python-library-to-explain-the-prediction-of-your-machine-learning-d63ecee10ecc
