Fairness in Machine Learning: A Case Study on the Fairness of a Machine Learning Model


COMPAS and classifier fairness

Recent events around the world raise many questions, including whether the society we live in is biased against particular groups. With so many unanswered questions about racial discrimination, let's explore a case study of COMPAS, a tool used in many jurisdictions around the U.S. to predict whether a convicted criminal is likely to re-offend. Machine learning algorithms are certainly beneficial when it comes to predicting unknowns from what we already know. That said, for machine learning models to be accurate without bias, it is imperative that the data the algorithm works on is clean and free of any pre-existing inclination toward a particular group. Sometimes machine learning models are simply not suitable to be used, on the grounds of social ethics and norms.

About COMPAS:


COMPAS, an acronym for Correctional Offender Management Profiling for Alternative Sanctions, is assistive software and a support tool used to predict recidivism risk: the risk that a criminal defendant will re-offend.

  • COMPAS provides scores from 1 (lowest risk) to 10 (highest risk).
  • It also provides a category-based evaluation labeled as high risk of recidivism, medium risk of recidivism, or low risk of recidivism. To simplify things, we can convert this multi-class classification problem into a binary one by combining medium and high risk of recidivism vs. low risk of recidivism.
  • The input used to predict recidivism is wide-ranging, drawing on 137 factors including the age, gender, and criminal history of the defendant.
  • Race is not an explicit feature considered by the model.

Use of COMPAS:

COMPAS has been used by the U.S. states of New York, Wisconsin, and California, by Florida's Broward County, and by other jurisdictions. Depending on the scores generated by this software, a judge can decide whether to detain the defendant prior to trial and/or how to sentence. It has been observed that defendants classified as medium or high risk (scores of 5–10) are more likely to be held in prison while awaiting trial than those classified as low risk (scores of 1–4). Although this software might seem assistive and helpful, it suffers from machine bias. According to the investigative journalism outlet ProPublica:

  • The prediction fails differently for black defendants:
  • Overall, Northpointe’s assessment tool correctly predicts recidivism 61 percent of the time. But blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend. It makes the opposite mistake among whites: they are much more likely than blacks to be labeled lower risk but go on to commit other crimes. This is a major risk that machine learning models can carry, and when someone's freedom is at stake it is a flaw that should not go unnoticed.

To examine what ProPublica claimed, let's try to build our own model that replicates their analysis.

Formulating the ProPublica analysis:

Before starting any analysis, we need the proper tools for it, so we first import the required libraries.
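
The import cell in the original post is shown as an image; a minimal sketch of what it likely contains, based on the libraries the rest of the walkthrough relies on (pandas, numpy, matplotlib, and scikit-learn metrics), is:

```python
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Metrics used later to evaluate the COMPAS labels
from sklearn.metrics import accuracy_score, confusion_matrix
```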

Read in the data:

After importing the libraries, we need to load the data; to see what it looks like, we can use the following code to inspect it:
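
The original cell is an image, so the exact path is not visible; a sketch that loads the ProPublica two-year recidivism data (assuming the copy published in ProPublica's compas-analysis repository) would be:

```python
# Load the ProPublica COMPAS data; adjust the path/URL to wherever your copy lives
url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
df = pd.read_csv(url)

# Inspect the first few rows and the overall shape
print(df.shape)
df.head()
```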

Running the above cell shows the first few rows and columns of the data.

As stated earlier, to turn this into a binary classification problem we carry out the following step.

Transform into a binary classification problem:

First, let’s make this a binary classification problem. We will add a new column that translates the risk score (decile_score) into a binary label.


Any score of 5 or higher (Medium or High risk) means that a defendant is treated as a likely recidivist, and a score of 4 or lower (Low risk) means that a defendant is treated as unlikely to re-offend. As we know, creating a new feature from an existing one can help us see patterns in the data that might otherwise go unexplored. The code to do this is given below:
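
A sketch of that cell, using the is_med_or_high_risk column name referenced later in the post:

```python
# Scores 5-10 (Medium/High risk) -> 1, scores 1-4 (Low risk) -> 0
df['is_med_or_high_risk'] = (df['decile_score'] >= 5).astype(int)
```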

To see the new column that was added, we can use df.head().

Evaluate model performance:

To evaluate the performance of the model, we will compare the model’s predictions to the “truth”:


  • The risk score prediction of the COMPAS system is in the decile_score column,


  • The classification of COMPAS as medium/high risk or low risk is in the is_med_or_high_risk column


  • The “true” recidivism value (whether or not the defendant committed another crime in the next two years) is in the two_year_recid column.


Let’s start by computing the accuracy:

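A minimal sketch of that computation, assuming the column names described above:

```python
# How often does the binarized COMPAS label match the observed two-year outcome?
acc = accuracy_score(df['two_year_recid'], df['is_med_or_high_risk'])
print("Accuracy:", acc)
```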

This, itself, might already be considered problematic…


The accuracy score includes both kinds of errors:


  • false positives (defendant is predicted as medium/high risk but does not re-offend)

  • false negatives (defendant is predicted as low risk, but does re-offend)


but these errors have different costs. It can be useful to pull them out separately, to see the rate of different types of errors.


If we create a confusion matrix, we can use it to derive a whole set of classifier metrics:


  • True Positive Rate (TPR) also called recall or sensitivity
  • True Negative Rate (TNR) also called specificity
  • Positive Predictive Value (PPV) also called precision
  • Negative Predictive Value (NPV)
  • False Positive Rate (FPR)
  • False Discovery Rate (FDR)
  • False Negative Rate (FNR)
  • False Omission Rate (FOR)

To put the theory into practice, we can take several approaches.

We can also use sklearn's confusion_matrix to pull out these values and compute any metrics of interest:

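A sketch of that cell; with labels {0, 1}, scikit-learn orders the cells as [[TN, FP], [FN, TP]]:

```python
tn, fp, fn, tp = confusion_matrix(
    df['two_year_recid'], df['is_med_or_high_risk']).ravel()

print("True negatives: ", tn)
print("False positives:", fp)
print("False negatives:", fn)
print("True positives: ", tp)

# A few of the derived metrics listed above
tpr = tp / (tp + fn)   # recall / sensitivity
fpr = fp / (fp + tn)
ppv = tp / (tp + fp)   # precision
npv = tn / (tn + fn)
print(f"TPR={tpr:.3f}, FPR={fpr:.3f}, PPV={ppv:.3f}, NPV={npv:.3f}")
```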

We get the following distribution:

  • True negatives: 2681
  • False positives: 1282
  • False negatives: 1216
  • True positives: 2035

Or we can compute them directly using crosstab:
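A sketch of the crosstab version, with predictions as rows and outcomes as columns:

```python
# Raw counts of predicted label vs. actual two-year recidivism
pd.crosstab(df['is_med_or_high_risk'], df['two_year_recid'],
            rownames=['Predicted'], colnames=['Actual'], margins=True)
```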

Here, we normalize by row to show the PPV, FDR, FOR, and NPV:
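
With the same orientation, a sketch of the row-normalized table:

```python
# Each row sums to 1: the predicted-positive row shows FDR and PPV,
# the predicted-negative row shows NPV and FOR
pd.crosstab(df['is_med_or_high_risk'], df['two_year_recid'],
            rownames=['Predicted'], colnames=['Actual'], normalize='index')
```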

Here, we normalize by column to show the TPR, FPR, FNR, and TNR:
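
And the column-normalized table:

```python
# Each column sums to 1: the actual-positive column shows FNR and TPR,
# the actual-negative column shows TNR and FPR
pd.crosstab(df['is_med_or_high_risk'], df['two_year_recid'],
            rownames=['Predicted'], colnames=['Actual'], normalize='columns')
```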

Overall, we see that a defendant has a similar likelihood of being wrongly labeled a likely recidivist and of being wrongly labeled as unlikely to re-offend.

We can also directly evaluate the risk score, instead of just the labels. The risk score is meant to indicate the probability that a defendant will re-offend.

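A sketch of how one might check this, looking at the observed recidivism rate at each decile score (the original cell is shown as an image):

```python
# Observed two-year recidivism rate at each COMPAS decile score
recid_by_score = df.groupby('decile_score')['two_year_recid'].mean()
print(recid_by_score)

recid_by_score.plot(kind='bar')
plt.xlabel('COMPAS decile score')
plt.ylabel('Observed recidivism rate')
plt.show()
```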

Defendants with a higher COMPAS score indeed had higher rates of recidivism.


For this evaluation, we get an accuracy of 0.7021662544019724.

Fairness:

COMPAS has been under scrutiny for issues related to fairness with respect to the race of the defendant.

Race is not an explicit input to COMPAS, but some of the questions that are used as input may have strong correlations with race.


First, we will find out how frequently each race is represented in the data:

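A sketch, using the race column from the ProPublica data:

```python
# Number of defendants in each race category
df['race'].value_counts()
```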

We will focus specifically on African-American or Caucasian defendants, since they are the subject of the ProPublica claim.

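The per-group accuracy comparison is shown as an image in the original post; a sketch of it, assuming the 'African-American' and 'Caucasian' labels used in the ProPublica data:

```python
# Keep only the two groups discussed and compute accuracy separately for each
subset = df[df['race'].isin(['African-American', 'Caucasian'])]

for race, group in subset.groupby('race'):
    acc = accuracy_score(group['two_year_recid'], group['is_med_or_high_risk'])
    print(f"{race}: accuracy = {acc:.3f}")
```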

It isn’t exactly the same, but it’s similar — within a few points. This is a type of fairness known as overall accuracy equality.


Next, let’s see whether a defendant who is classified as medium/high risk has the same probability of recidivism for the two groups.


In other words, we will compute the PPV for each group:

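A sketch of the per-group PPV, using the subset defined above:

```python
# PPV: among defendants labeled medium/high risk, the fraction that re-offended
for race, group in subset.groupby('race'):
    flagged = group[group['is_med_or_high_risk'] == 1]
    print(f"{race}: PPV = {flagged['two_year_recid'].mean():.3f}")
```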

Again, similar (within a few points). This is a type of fairness known as predictive parity. We can extend this idea to check whether a defendant with a given score has the same probability of recidivism in the two groups:
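
A sketch of that check, tabulating the observed recidivism rate for each race at each decile score:

```python
# Rows: decile score; columns: race; values: observed recidivism rate
calibration = (subset.groupby(['race', 'decile_score'])['two_year_recid']
               .mean()
               .unstack(level='race'))
print(calibration)
```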

We can see that for both African-American and Caucasian defendants, for any given COMPAS score, recidivism rates are similar. This is a type of fairness known as calibration.


Next, we will look at the frequency with which defendants of each race are assigned each COMPAS score:

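A sketch, normalizing within each race so each row shows the fraction of that group assigned each score:

```python
# Distribution of COMPAS decile scores within each race group
pd.crosstab(subset['race'], subset['decile_score'], normalize='index')
```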

We observe that Caucasian defendants in this sample are more likely to be assigned a low risk score.


However, to evaluate whether this is unfair, we need to know the true prevalence — whether the rates of recidivism are the same in both populations, according to the data:

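A sketch of the comparison, again using the subset defined above:

```python
# Observed two-year recidivism rate (base rate) for each group...
print(subset.groupby('race')['two_year_recid'].mean())

# ...compared with the rate at which each group is labeled medium/high risk
print(subset.groupby('race')['is_med_or_high_risk'].mean())
```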

The predictions of the model are pretty close to the actual prevalence in the population.


So far, our analysis suggests that COMPAS is fair with respect to race:


  • The overall accuracy of the COMPAS label is the same, regardless of race (overall accuracy equality)


  • The likelihood of recidivism among defendants labeled as medium or high risk is similar, regardless of race (predictive parity)


  • For any given COMPAS score, the risk of recidivism is similar, regardless of race — the “meaning” of the score is consistent across race (calibration)


We do not have statistical parity (a type of fairness corresponding to equal probability of positive classification), but we don't necessarily expect it when the prevalence of actual positives differs between groups.

We can fix it, right?

Why is it so tricky to satisfy multiple types of fairness at once? This is due to a proven impossibility result.


Any time


  • the base rate (prevalence of the positive condition) is different in the two groups, and


  • we do not have a perfect classifier


Then we cannot simultaneously satisfy:


  • Equal PPV and NPV for both groups (known as conditional use accuracy equality), and


  • Equal FPR and FNR for both groups (known as equalized odds or conditional procedure accuracy equality)

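One way to see the tension concretely (a sketch added here, not part of the original analysis): the usual confusion-matrix definitions imply the identity FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR), where p is the prevalence of the positive condition. If two groups have different p but equal PPV and FNR, their FPRs must differ, unless the classifier is perfect. We can verify the identity with the overall counts reported earlier:

```python
# Counts reported earlier for the full dataset
tn, fp, fn, tp = 2681, 1282, 1216, 2035

p   = (tp + fn) / (tn + fp + fn + tp)   # prevalence of re-offending
ppv = tp / (tp + fp)
fnr = fn / (fn + tp)
fpr = fp / (fp + tn)

print(fpr)                                         # measured directly, ~0.323
print(p / (1 - p) * (1 - ppv) / ppv * (1 - fnr))   # from the identity, ~0.323
```

Because the base rate p appears explicitly in this relationship, holding PPV equal across groups with different base rates means the FPR and FNR cannot both also be equal, which is exactly the trade-off at the heart of the COMPAS debate.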

What we learned

  • A model can be biased with respect to age, race, or gender, even if those features are not used as input to the model.
  • There are many measures of fairness; it may be impossible to satisfy some combinations of them simultaneously.
  • Human biases and unfairness in society leak into the data used to train machine learning models.


Reference:

  • Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, May 2016, Machine Bias
  • Jeff Larson, Surya Mattu, Lauren Kirchner and Julia Angwin, May 2016, How We Analyzed the COMPAS Recidivism Algorithm
  • William Dieterich, Christina Mendoza, and Tim Brennan, July 2016, COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity
  • Google’s People + AI + Research (PAIR) group explainer: Measuring fairness
  • Another Google explainer: Attacking discrimination with smarter machine learning

Translated from: https://medium.com/@farhanrahman02/compas-case-study-fairness-of-a-machine-learning-model-f0f804108751

