A Good Machine Learning Classifier's Accuracy Metric for the Poker Hand Dataset

What is the dataset?

The Poker Hand dataset [Cattral et al., 2007] is publicly available and very well-documented at the UCI Machine Learning Repository [Dua et al., 2019]. [Cattral et al., 2007] described it as:


Found to be a challenging dataset for classification algorithms


It is an 11-dimensional dataset with 25K samples for training and over 1M samples for testing. Each instance is a 5-card poker hand, with two features per card (suit and rank) plus the poker-hand label.


Why is it hard?

It has two properties that make it particularly challenging for classification algorithms: all of its features are categorical, and it is extremely imbalanced. Categorical features are hard because typical distance (a.k.a. similarity) metrics cannot be naturally applied to them. E.g. this dataset has two features per card, rank and suit, and calculating the Euclidean distance between “spades” and “hearts” simply doesn’t make sense. Imbalanced datasets are hard because most machine learning algorithms implicitly assume a reasonably balanced class distribution; Jason Brownlee from Machine Learning Mastery describes the problem as:

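To make the distance problem concrete, here is a minimal sketch (the example cards and their codes are made up for illustration) showing how one-hot encoding removes the artificial ordering that the raw integer suit/rank codes imply:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three toy cards described by their integer codes,
# following the dataset's encoding (suit 1-4, rank 1-13).
cards = np.array([[1, 10],   # e.g. a spade ten
                  [2, 10],   # a heart ten
                  [4, 10]])  # a club ten

# On the raw codes, Euclidean distance claims the heart is "closer"
# to the spade than the club is -- an ordering that doesn't exist.
# One-hot encoding gives every category its own dimension instead,
# so any two distinct suits are equally far apart.
encoded = OneHotEncoder().fit_transform(cards).toarray()
print(encoded)
```

After encoding, the distance from the spade to the heart equals the distance from the spade to the club, as it should for purely nominal categories.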

Imbalanced classifications pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class


Then, why do I get good results?

So, if this dataset is supposedly hard, why can a simple neural network achieve over 90% accuracy without any particular tuning or data pre-processing?


In this article, Aditya Bhardwaj shows that his neural network achieved 90.04% accuracy. Below, I show that a simple Multi-layer Perceptron neural network achieves over 99% accuracy. One reason this happens is the Class Imbalance Problem. [Ling et al., 2011] explain:


Data are said to suffer the Class Imbalance Problem when the class distributions are highly imbalanced. In this context, many classification learning algorithms have low predictive accuracy for the infrequent class.


The Poker Hand dataset happens to be extremely imbalanced, with the first two classes representing 90% of the samples in both the training and testing sets. A classifier that learns to classify these two classes correctly, but completely misclassifies the remaining classes, will still achieve 90% accuracy. This is not a good classifier! The reason it still receives a good score is simply that the class imbalance is taken into account, i.e. the correct predictions on the dominant classes are given weight proportional to the number of samples. The “low predictive accuracy for the infrequent class” is overshadowed by the better predictions on those classes with lots of samples to learn from.

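This effect is easy to reproduce. Here is a minimal sketch with synthetic labels drawn to roughly match the Poker Hand imbalance (the exact proportions are illustrative, not the dataset's):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
# Synthetic labels: classes 0 and 1 cover ~90% of the hands,
# the eight rare classes share the remaining 10%.
y_true = rng.choice(10, size=100_000, p=[0.50, 0.40] + [0.0125] * 8)

# A "classifier" that handles the two dominant classes perfectly
# and dumps every rare hand into class 0.
y_pred = np.where(y_true <= 1, y_true, 0)

print(accuracy_score(y_true, y_pred))             # ~0.90 despite ignoring 8 classes
print(f1_score(y_true, y_pred, average="macro"))  # far lower: the 8 ignored classes count here
```

Accuracy rewards the classifier for the 90% of samples it gets right, while the macro-averaged F1 score exposes the eight classes it never predicts at all.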

What can we do about it?

A metric that doesn’t take class imbalance into account, i.e. gives equal weight to all classes regardless of their dominance, can provide more “real” or accurate results. Scikit-learn’s classification report includes one such metric. The F1 score combines precision and recall, and the classification report includes a macro-average F1 metric: the unweighted average of the per-label F1 scores. As mentioned in Scikit-learn’s documentation about the F-metrics:


In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance

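A tiny sketch of the difference between the averaging modes (the labels are made up): with a 9-to-1 imbalance and a classifier that never predicts the rare class, support-weighted averaging hides the failure while macro averaging exposes it:

```python
from sklearn.metrics import f1_score

y_true = [0] * 9 + [1]   # 9-to-1 imbalance
y_pred = [0] * 10        # the rare class is never predicted

# 'weighted' scales each class's F1 by its support, so class 0 dominates;
# 'macro' gives both classes equal weight regardless of support.
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # ~0.85
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # ~0.47
```

Class 0 scores F1 ≈ 0.95 and class 1 scores 0; weighting by support yields 0.9 × 0.95 ≈ 0.85, while the unweighted mean (0.95 + 0) / 2 ≈ 0.47 makes the missed class visible.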

This is a good example of a metric that can be used to measure the performance of the classifier against this highly imbalanced dataset.


Many other methods and metrics have been proposed in the machine learning literature to deal with some of the problems mentioned here. E.g., Boriah et al. discuss some of the existing methods for handling categorical features in their paper “Similarity Measures for Categorical Data: A Comparative Evaluation”. Discussing them is beyond the scope of this post, so I will simply leave you with a link to the paper here.


Some results, please?

I went ahead and ran a Multi-layer Perceptron neural network, and here are the results I obtained. The network uses 3 hidden layers of 100 neurons each, with alpha=0.0001 and learning rate=0.01. The following is the confusion matrix. It can be observed that the neural network did a good job overall, correctly classifying most of the first 6 classes, with some particularly bad results for classes 7 and 9 (Four of a kind and Royal flush).


Confusion Matrix
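The network described above can be sketched with scikit-learn's MLPClassifier roughly as follows; note that the solver and random-state settings are my assumptions, since the text only specifies the layer sizes, alpha, and the learning rate:

```python
from sklearn.neural_network import MLPClassifier

# 3 hidden layers of 100 neurons each, alpha=0.0001, learning rate 0.01,
# matching the configuration described above.
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                    alpha=0.0001,
                    learning_rate_init=0.01,
                    random_state=0)

# With the UCI Poker Hand files loaded into X_train/y_train and X_test/y_test:
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```

The fit/predict calls are commented out because they require the downloaded UCI files; the estimator itself is configured exactly as in the experiment described.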

The accuracy reported for this classifier is 99%. Even though classes 7 and 9 did very badly, they contribute only 233 of the 1M samples tested. The bad results from a couple of non-dominant classes are completely overshadowed by the other classes. This clearly gives a false impression of success! The neural network misclassified 66% and 77% of the Royal flush and Four of a kind hands respectively, yet it gets a correct but misleading 99% accuracy result.


The classification report shown below includes the previously mentioned macro-average F1 score over all classes. This unweighted mean provides a much better overview of how well the classifier did. It can be seen that most classes actually did pretty well, but a couple of classes did particularly badly. More importantly, the reported macro average is 78%. This is a more appropriate score for the results observed! 2 out of 10 classes did poorly, while the others did much better, and this is reflected in the metric when the metric is chosen carefully.


              precision    recall  f1-score   support

           0       1.00      0.99      0.99    501209
           1       0.99      0.99      0.99    422498
           2       0.96      1.00      0.98     47622
           3       0.99      0.99      0.99     21121
           4       0.85      0.64      0.73      3885
           5       0.97      0.99      0.98      1996
           6       0.77      0.98      0.86      1424
           7       0.70      0.23      0.35       230
           8       1.00      0.83      0.91        12
           9       0.04      0.33      0.07         3

    accuracy                           0.99   1000000
   macro avg       0.83      0.80      0.78   1000000
weighted avg       0.99      0.99      0.99   1000000
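A report in this format comes straight from scikit-learn's classification_report. A minimal sketch of generating one, with made-up toy labels standing in for the real y_test/y_pred arrays:

```python
from sklearn.metrics import classification_report, f1_score

# Toy stand-ins for the real test labels and predictions.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# Prints the same per-class table as above, including the
# 'macro avg' row that weighs every class equally.
print(classification_report(y_true, y_pred, zero_division=0))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```

The macro-average row of the report matches the value returned by f1_score with average="macro", so either can be used when scripting the evaluation.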

Translated from: https://towardsdatascience.com/a-good-machine-learning-classifiers-accuracy-metric-for-the-poker-hand-dataset-44cc3456b66d
