"Why Should I Trust You?" Explaining the Predictions of Any Classifier


    Name Address Email
    Marco Tulio Ribeiro University of Washington Seattle, WA 98105, USA [email protected]
    Sameer Singh University of Washington Seattle, WA 98105, USA [email protected]
    Carlos Guestrin University of Washington Seattle, WA 98105, USA [email protected]


Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides insights into the model, which can be used to turn an untrustworthy model or prediction into a trustworthy one.


In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We further propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects. Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted.


1 Introduction

Machine learning is at the core of many recent advances in science and technology. Unfortunately, the important role of humans is an oft-overlooked aspect in the field. Whether humans are directly using machine learning classifiers as tools, or are deploying models into products that need to be shipped, a vital concern remains: if the users do not trust a model or a prediction, they will not use it. It is important to differentiate between two different (but related) definitions of trust: (1) trusting a prediction, i.e. whether a user trusts an individual prediction sufficiently to take some action based on it, and (2) trusting a model, i.e. whether the user trusts a model to behave in reasonable ways if deployed. Both are directly impacted by how much the human understands a model’s behaviour, as opposed to seeing it as a black box.


Determining trust in individual predictions is an important problem when the model is used for real world actions. When using machine learning for medical diagnosis [6] or terrorism detection, for example, predictions cannot be acted upon on blind faith, as the consequences may be catastrophic. Apart from trusting individual predictions, there is also a need to evaluate the model as a whole before deploying it “in the wild”. To make this decision, users need to be confident that the model will perform well on real-world data, according to the metrics of interest. Currently, models are evaluated using metrics such as accuracy on an available validation dataset. However, real-world data is often significantly different, and further, the evaluation metric may not be indicative of the product’s goal. Inspecting individual predictions and their explanations can be a solution to this problem, in addition to such metrics. In this case, it is important to guide users by suggesting which instances to inspect, especially for larger datasets.


In this paper, we propose providing explanations for individual predictions as a solution to the “trusting a prediction” problem, and selecting multiple such predictions (and explanations) as a solution to the “trusting the model” problem. Our main contributions are summarized as follows.


  • LIME, an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.

  • SP-LIME, a method that selects a set of representative instances with explanations to address the “trusting the model” problem, via submodular optimization.

  • Comprehensive evaluation with simulated and human subjects, where we measure the impact of explanations on trust and associated tasks. In our experiments, non-experts using LIME are able to pick which classifier from a pair generalizes better in the real world. Further, they are able to greatly improve an untrustworthy classifier trained on 20 newsgroups, by doing feature engineering using LIME. We also show how understanding the predictions of a neural network on images helps practitioners know when and why they should not trust a model.

2 The Case for Explanations

By “explaining a prediction”, we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction. We argue that explaining predictions is an important aspect in getting humans to trust and use machine learning effectively, provided the explanations are faithful and intelligible.


The process of explaining individual predictions is illustrated in Figure 1. It is clear that a doctor is much better positioned to make a decision with the help of a model if intelligible explanations are provided. In this case, explanations are a small list of symptoms with relative weights - symptoms that either contribute towards the prediction (in green) or are evidence against it (in red). In this, and other examples where humans make decisions with the help of predictions, trust is of fundamental concern. Even when stakes are lower, as in product or movie recommendations, the user needs to trust the prediction enough to spend money or time on it. Humans usually have prior knowledge about the application domain, which they can use to accept (trust) or reject a prediction if they understand the reasoning behind it. It has been observed, for example, that providing an explanation can increase the acceptance of computer-generated movie recommendations [12] and other automated systems [7].



Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights which symptoms in the patient’s history led to the prediction. Sneeze and headache are portrayed as contributing to the “flu” prediction, while “no fatigue” is evidence against it. With these, a doctor can make an informed decision about the model’s prediction.


Every machine learning application requires a certain measure of trust in the model. Development and evaluation of a classification model often consists of collecting annotated data, followed by learning parameters on a subset and evaluating using automatically computed metrics on the remaining data. Although this is a useful pipeline for many applications, it has become evident that evaluation on validation data often may not correspond to performance “in the wild” due to a number of reasons - and thus trust cannot rely solely on it. Looking at examples is a basic human strategy for comprehension [20], and for deciding if they are trustworthy - especially if the examples are explained. We thus propose explaining several representative individual predictions of a model as a way to provide a global understanding of the model. This global perspective is useful to machine learning practitioners in deciding between different models, or configurations of a model.


There are several ways a model can go wrong, and practitioners are known to overestimate the accuracy of their models based on cross validation [21]. Data leakage, for example, defined as the unintentional leakage of signal into the training (and validation) data that would not appear in the wild [14], potentially increases accuracy. A challenging example cited by Kaufman et al. [14] is one where the patient ID was found to be heavily correlated with the target class in the training and validation data. This issue would be incredibly challenging to identify just by observing the predictions and the raw data, but much easier if explanations such as the one in Figure 1 are provided, as patient ID would be listed as an explanation for predictions. Another particularly hard to detect problem is dataset shift [5], where training data is different than test data (we give an example in the famous 20 newsgroups dataset later on). The insights given by explanations (if the explanations correspond to what the model is actually doing) are particularly helpful in identifying what must be done to turn an untrustworthy model into a trustworthy one - for example, removing leaked data or changing the training data to avoid dataset shift.


Machine learning practitioners often have to select a model from a number of alternatives, requiring them to assess the relative trust between two or more models. In Figure 2, we show how individual prediction explanations can be used to select between models, in conjunction with accuracy. In this case, the algorithm with higher accuracy on the validation set is actually much worse, a fact that is easy to see when explanations are provided (again, due to human prior knowledge), but hard otherwise. Further, there is frequently a mismatch between the metrics that we can compute and optimize (e.g. accuracy) and the actual metrics of interest such as user engagement and retention. While we may not be able to measure such metrics, we have knowledge about how certain model behaviors can influence them. Therefore, a practitioner may wish to choose a less accurate model for content recommendation that does not place high importance in features related to “clickbait” articles (which may hurt user retention), even if exploiting such features increases the accuracy of the model in cross validation. We note that explanations are particularly useful in these (and other) scenarios if a method can produce them for any model, so that a variety of models can be compared.



Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about “Christianity” or “Atheism”. The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for “Christianity”, magenta for “Atheism”). Whole text not shown for space reasons.


Desired Characteristics for Explainers

We have argued thus far that explaining individual predictions of classifiers (or regressors) is a significant component for assessing trust in predictions or models. We now outline a number of desired characteristics from explanation methods:

An essential criterion for explanations is that they must be interpretable, i.e., provide qualitative understanding between joint values of input variables and the resulting predicted response value [11]. We note that interpretability must take into account human limitations. Thus, a linear model [24], a gradient vector [2] or an additive model [6] may or may not be interpretable. If hundreds or thousands of features significantly contribute to a prediction, it is not reasonable to expect users to comprehend why the prediction was made, even if they can inspect individual weights. This requirement further implies that explanations should be easy to understand - which is not necessarily true for features used by the model. Thus, the “input variables” in the explanations may be different than the features used by the model.



Another essential criterion is local fidelity. Although it is often impossible for an explanation to be completely faithful unless it is the complete description of the model itself, for an explanation to be meaningful it must at least be locally faithful - i.e. it must correspond to how the model behaves in the vicinity of the instance being predicted. We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa. While global fidelity would imply local fidelity, presenting globally faithful explanations that are interpretable remains a challenge for complex models.


While there are models that are inherently interpretable [6, 17, 26, 27], an explainer must be able to explain any model, and thus be model-agnostic (i.e. treating the original model as a black box). Apart from the fact that many state-of-the-art classifiers are not currently interpretable, this also provides flexibility to explain future classifiers.


In addition to explaining predictions, providing a global perspective is important to ascertain trust in the model. As mentioned before, accuracy may often not be sufficient to evaluate the model, and thus we want to explain the model. Building upon the explanations for individual predictions, we select a few explanations to present to the user, such that they are representative of the model.


3 Local Interpretable Model-Agnostic Explanations

We now present Local Interpretable Model-agnostic Explanations (LIME). The overall goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier.


3.1 Interpretable Data Representations

Before we present the explanation system, it is important to distinguish between features and interpretable data representations. As mentioned before, interpretable explanations need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example, a possible interpretable representation for text classification is a binary vector indicating the presence or absence of a word, even though the classifier may use more complex (and incomprehensible) features such as word embeddings. Likewise for image classification, an interpretable representation may be a binary vector indicating the “presence” or “absence” of a contiguous patch of similar pixels (a super-pixel), while the classifier may represent the image as a tensor with three color channels per pixel. We let $x \in \mathbb{R}^d$ denote the original representation of an instance being explained, and we use $x' \in \{0,1\}^{d'}$ to denote a binary vector for its interpretable representation.
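
To make the distinction concrete, the following is a minimal sketch of a bag-of-words interpretable representation for a single text instance. The helper names `to_interpretable` and `to_original` and the sample sentence are illustrative assumptions, not part of the paper; the mapping back to text simply drops every occurrence of a word that is switched off.

```python
def to_interpretable(tokens):
    """Return (vocab, x_prime): the distinct words of the instance and an
    all-ones binary vector marking each of them as present."""
    vocab = list(dict.fromkeys(tokens))  # distinct words, first-seen order
    return vocab, [1] * len(vocab)


def to_original(tokens, vocab, z_prime):
    """Map a binary vector z' back to text by dropping every occurrence of
    the words that are switched off (hypothetical helper)."""
    kept = {word for word, bit in zip(vocab, z_prime) if bit == 1}
    return [tok for tok in tokens if tok in kept]


tokens = "this movie is not bad , not bad at all".split()
vocab, x_prime = to_interpretable(tokens)

# Switch off the interpretable component for the word "not".
z_prime = [0 if word == "not" else 1 for word in vocab]
perturbed = to_original(tokens, vocab, z_prime)
print(" ".join(perturbed))
```

Here `x_prime` has one entry per distinct word, regardless of whatever features (for example embeddings) the classifier itself would compute from the recovered text.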

3.2 Fidelity-Interpretability Trade-off

Formally, we define an explanation as a model $g \in G$, where $G$ is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [27]. The assumption is that given a model $g \in G$, we can readily present it to the user with visual or textual artifacts. Note that the domain of $g$ is $\{0,1\}^{d'}$, i.e. $g$ acts over absence/presence of the interpretable components. As noted before, not every $g \in G$ is simple enough to be interpretable - thus we let $\Omega(g)$ be a measure of complexity (as opposed to interpretability) of the explanation $g \in G$. For example, for decision trees $\Omega(g)$ may be the depth of the tree, while for linear models, $\Omega(g)$ may be the number of non-zero weights.

Let the model being explained be denoted $f: \mathbb{R}^d \rightarrow \mathbb{R}$. In classification, $f(x)$ is the probability (or a binary indicator) that $x$ belongs to a certain class. We further use $\Pi_x(z)$ as a proximity measure between an instance $z$ and $x$, so as to define locality around $x$. Finally, let $\mathcal{L}(f, g, \Pi_x)$ be a measure of how unfaithful $g$ is in approximating $f$ in the locality defined by $\Pi_x$. In order to ensure both interpretability and local fidelity, we must minimize $\mathcal{L}(f, g, \Pi_x)$ while having $\Omega(g)$ be low enough to be interpretable by humans. The explanation produced by LIME is obtained by the following:

$$\xi(x) = \mathop{\arg\min}_{g \in G} \; \mathcal{L}(f, g, \Pi_x) + \Omega(g) \tag{1}$$
This formulation can be used with different explanation families $G$, fidelity functions $\mathcal{L}$, and complexity measures $\Omega$. Here we focus on sparse linear models as explanations, and on performing the search using perturbations.

3.3 Sampling for Local Exploration

We want to minimize the expected locally-aware loss $\mathcal{L}(f, g, \Pi_x)$ without making any assumptions about $f$, since we want the explainer to be model-agnostic. Thus, in order to learn the local behaviour of $f$ as the interpretable inputs vary, we approximate $\mathcal{L}(f, g, \Pi_x)$ by drawing samples, weighted by $\Pi_x$. We sample instances around $x'$ by drawing nonzero elements of $x'$ uniformly at random (where the number of such draws is also uniformly sampled). Given a perturbed sample $z' \in \{0, 1\}^{d'}$ (which contains a fraction of the nonzero elements of $x'$), we recover the sample in the original representation $z \in \mathbb{R}^d$ and obtain $f(z)$, which is used as a label for the explanation model. Given this dataset $Z$ of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation $\xi(x)$. The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of $x$ (which have a high weight due to $\Pi_x$) and far away from $x$ (low weight from $\Pi_x$). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by $\Pi_x$. It is worth noting that our method is fairly robust to sampling noise since the samples are weighted by $\Pi_x$ in Eq. (1). We now present a concrete instance of this general framework.
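
The sampling step described above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the text does not state the range from which the number of draws is sampled, so the sketch assumes it is uniform between 1 and the number of nonzero entries of $x'$. The names `sample_around`, `build_dataset`, `to_text` and `toy_model` are hypothetical.

```python
import random


def sample_around(x_prime, rng):
    """Draw one perturbed z' from x'.

    Keeps a uniformly sized, uniformly chosen subset of the nonzero entries
    of x' and zeroes out the rest. Assumes x' has at least one nonzero entry.
    """
    nonzero = [j for j, bit in enumerate(x_prime) if bit == 1]
    n_keep = rng.randint(1, len(nonzero))    # number of draws (assumed range)
    kept = set(rng.sample(nonzero, n_keep))  # which entries survive
    return [1 if j in kept else 0 for j in range(len(x_prime))]


def build_dataset(x_prime, to_original, f, n_samples, seed=0):
    """Return the list Z of (z', f(z)) pairs used to fit the explanation."""
    rng = random.Random(seed)
    Z = []
    for _ in range(n_samples):
        z_prime = sample_around(x_prime, rng)
        Z.append((z_prime, f(to_original(z_prime))))
    return Z


# Toy setup (all hypothetical): three interpretable words and a stand-in model.
vocab = ["not", "bad", "movie"]
x_prime = [1, 1, 1]


def to_text(z_prime):
    return [word for word, bit in zip(vocab, z_prime) if bit]


def toy_model(words):
    return 0.9 if "not" in words and "bad" in words else 0.2


Z = build_dataset(x_prime, to_text, toy_model, n_samples=5)
```

Note that the black box is only ever called on recovered instances `to_original(z_prime)`, which is what keeps the procedure model-agnostic.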

3.4 Sparse Linear Explanations

For the rest of this paper, we let $G$ be the class of linear models, such that $g(z') = w_g \cdot z'$. We use the locally weighted square loss as $\mathcal{L}$, as defined in Eq. (2), where we let $\Pi_x(z) = \exp(-D(x, z)^2 / \sigma^2)$ be an exponential kernel defined on some distance function $D$ (e.g. cosine distance for text, $L_2$ distance for images) with width $\sigma$.

$$\mathcal{L}(f, g, \Pi_x) = \sum_{z, z' \in Z} \Pi_x(z) \left(f(z) - g(z')\right)^2 \tag{2}$$
For text classification, we ensure that the explanation is interpretable by letting the interpretable representation be a bag of words, and by setting a limit $K$ on the number of words included, i.e. $\Omega(g) = \infty \mathbb{I}[\|w_g\|_0 > K]$. We use the same $\Omega$ for image classification, using “super-pixels” (computed using any standard algorithm) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of $\Omega$ makes directly solving Eq. (1) intractable, but we approximate it by first selecting $K$ features with Lasso (using the regularization path [8]) and then learning the weights via least squares (a procedure we call K-LASSO in Algorithm 1). We note that in Algorithm 1, the time required to produce an explanation is dominated by the complexity of the black box model $f(z_i)$. To give a rough idea of running time, explaining predictions from random forests with 1000 trees using scikit-learn on a laptop with $N = 5000$ takes around 3 seconds. Explaining each prediction of the Inception network [25] for image classification takes around 10 minutes.
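
As a rough illustration of the three ingredients in this section, the sketch below writes out the exponential kernel, the locally weighted square loss, and the hard sparsity penalty $\Omega$. It assumes, purely for illustration, that the distance $D$ is computed as cosine distance between binary interpretable vectors; the function names and the sample numbers are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exponential_kernel(distance, sigma):
    """Pi_x(z) = exp(-D(x, z)^2 / sigma^2)."""
    return math.exp(-(distance ** 2) / (sigma ** 2))


def weighted_square_loss(w_g, samples):
    """Sum of Pi_x(z) * (f(z) - w_g . z')^2 over samples (z', f(z), Pi_x(z))."""
    total = 0.0
    for z_prime, f_z, pi_z in samples:
        g_z = sum(w * bit for w, bit in zip(w_g, z_prime))
        total += pi_z * (f_z - g_z) ** 2
    return total


def omega(w_g, K):
    """Infinite penalty when more than K weights are nonzero, else zero."""
    n_nonzero = sum(1 for w in w_g if w != 0)
    return math.inf if n_nonzero > K else 0.0


# Hypothetical numbers: two perturbed samples around x' = [1, 1].
x_prime = [1, 1]
sigma = 0.75
raw = [([1, 0], 0.5), ([0, 1], -0.2)]
samples = [
    (z, f_z, exponential_kernel(cosine_distance(x_prime, z), sigma))
    for z, f_z in raw
]
objective = weighted_square_loss([0.5, -0.2], samples) + omega([0.5, -0.2], K=2)
```

A candidate $g$ that uses more than $K$ nonzero weights gets an infinite objective, which is why Eq. (1) with this $\Omega$ amounts to a best-subset problem.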


Figure 3: Toy example to present intuition for LIME. The black-box model’s complex decision function f (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bright bold red cross is the instance being explained. LIME samples instances, gets predictions using f, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.


Algorithm 1 LIME for Sparse Linear Explanations
Require: Classifier $f$, Number of samples $N$
Require: Instance $x$, and its interpretable version $x'$
Require: Similarity kernel $\Pi_x$, Length of explanation $K$
$Z \leftarrow \{\}$
for $i \in \{1, 2, 3, ..., N\}$ do
$z'_i \leftarrow \text{sample\_around}(x')$
$Z \leftarrow Z \cup \langle z'_i, f(z_i), \Pi_x(z_i) \rangle$
end for
$w \leftarrow \text{K-Lasso}(Z, K)$ with $z'_i$ as features, $f(z)$ as target
return $w$
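
One possible rendering of Algorithm 1 in Python is sketched below, assuming NumPy and scikit-learn are available. It is not the authors' implementation. The feature selection uses scikit-learn's `lars_path` with `method="lasso"` on data rescaled by the square root of the sample weights, and the refit uses `LinearRegression` with `sample_weight`; the fitted intercept is an addition of this sketch, since $g(z') = w_g \cdot z'$ in the text has none. The function names, the toy model `f` and the toy kernel are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, lars_path


def k_lasso(Z_prime, labels, weights, K):
    """Select at most K features on the Lasso path, then refit them with
    weighted least squares. Returns (selected indices, weights, intercept)."""
    # Fold the sample weights into centred data so that an unweighted Lasso
    # path solves the weighted problem.
    root_w = np.sqrt(weights)
    Xc = (Z_prime - np.average(Z_prime, axis=0, weights=weights)) * root_w[:, None]
    yc = (labels - np.average(labels, weights=weights)) * root_w
    _, _, coefs = lars_path(Xc, yc, method="lasso")
    # Walk from the least regularised solution back along the path and stop
    # at the first one with no more than K nonzero coefficients.
    for column in coefs.T[::-1]:
        selected = np.flatnonzero(column)
        if len(selected) <= K:
            break
    model = LinearRegression().fit(
        Z_prime[:, selected], labels, sample_weight=weights
    )
    return selected, model.coef_, model.intercept_


def lime_sparse_linear(f, x_prime, to_original, kernel, N, K, seed=0):
    """Sample N perturbations of x', label them with f, weight them with the
    kernel, and return the K-Lasso explanation."""
    rng = np.random.default_rng(seed)
    nonzero = np.flatnonzero(x_prime)
    Z_prime = np.zeros((N, len(x_prime)))
    for i in range(N):
        n_keep = rng.integers(1, len(nonzero) + 1)  # assumed range 1..len
        Z_prime[i, rng.choice(nonzero, size=n_keep, replace=False)] = 1
    labels = np.array([f(to_original(z)) for z in Z_prime])
    weights = np.array([kernel(x_prime, z) for z in Z_prime])
    return k_lasso(Z_prime, labels, weights, K)


# Toy check (hypothetical): a black box that only looks at components 0 and 3.
x_prime = np.ones(6)
f = lambda z: 0.1 + 0.7 * z[0] - 0.4 * z[3]
kernel = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 4.0)
selected, w, intercept = lime_sparse_linear(
    f, x_prime, lambda z: z, kernel, N=500, K=2
)
```

In this toy setting the black box is itself linear in two components, so the explanation should recover exactly those two components; for a real classifier the cost is dominated by the `N` calls to `f`.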

3.5 Example 1: Text classification with SVM

In Figure 2 (right side), we explain the predictions of a support vector machine with RBF kernel trained on unigrams to differentiate “Christianity” from “Atheism” (on a subset of the 20 newsgroup dataset). Although this classifier achieves 94% held-out accuracy, and one would be tempted to trust it based on this, the explanation for an instance shows that predictions are made for quite arbitrary reasons (words “Posting”, “Host” and “Re” have no connection to either Christianity or Atheism). The word “Posting” appears in 22% of examples in the training set, 99% of them in the class “Atheism”. Even if headers are removed, proper names of prolific posters (such as “Keith”) in the original newsgroups are selected by the classifier, which would also not generalize.


After getting such insights from explanations, it is clear that this dataset has serious issues (which are not evident just by studying the raw data or predictions), and that this classifier, or held-out evaluation, cannot be trusted. It is also clear what the problems are, and the steps that can be taken to fix these issues and train a more trustworthy classifier.



Figure 4: Explaining an image classification prediction made by Google’s Inception network, highlighting positive pixels. The top 3 classes predicted are “Electric Guitar” (p = 0.32), “Acoustic guitar” (p = 0.24) and “Labrador” (p = 0.21)


3.6 Example 2: Deep networks for images

We learn a linear model with positive and negative weights for each super-pixel in an image. For the purpose of visualization, one may wish to just highlight the super-pixels with positive weight towards a specific class, as they give intuition as to why the model would think that class may be present. We explain the prediction of Google’s pre-trained Inception neural network [25] in this fashion on an arbitrary image (Figure 4a). Figures 4b, 4c, 4d show the super-pixels explanations for the top 3 predicted classes (with the rest of the image grayed out), having set K = 10. What the neural network picks up on for each of the classes is quite natural to humans - Figure 4b in particular provides insight as to why acoustic guitar was predicted to be electric: due to the fretboard. This kind of explanation enhances trust in the classifier (even if the top predicted class is wrong), as it shows that it is not acting in an unreasonable manner.


4 Submodular Pick for Explaining Models

Although an explanation of a single prediction provides some understanding into the reliability of the classifier to the user, it is not sufficient to evaluate and assess trust in the model as a whole. We propose to give a global understanding of the model by explaining a set of individual instances. This approach is still model agnostic, and is complementary to computing summary statistics such as held-out accuracy.


Even though explanations of multiple instances can be insightful, these instances need to be selected judiciously, since users may not have the time to examine a large number of explanations. We represent the time and patience that humans have by a budget B that denotes the number of explanations they are willing to look at in order to understand a model. Given a set of instances X, we define the pick step as the task of selecting B instances for the user to inspect.


The pick step is not dependent on the existence of explanations - one of the main purposes of tools like Modeltracker [1] and others [10] is to assist users in selecting instances themselves, and examining the raw data and predictions. However, as we have argued that looking at raw data is not enough to understand predictions and get insights, it is intuitive that a method for the pick step should take into account the explanations that accompany each prediction. Moreover, this method should pick a diverse, representative set of explanations to show the user – i.e. non-redundant explanations that represent how the model behaves globally.


Given all of the explanations for a set of instances $X$, we construct an $n \times d'$ explanation matrix $W$ that represents the local importance of the interpretable components for each instance. When using linear models as explanations, for an instance $x_i$ and explanation $g_i = \xi(x_i)$, we set $W_{ij} = |w_{g_i j}|$. Further, for each component $j$ in $W$, we let $I_j$ denote the global importance, or representativeness of that component in the explanation space. Intuitively, we want $I$ such that features that explain many different instances have higher importance scores. Concretely for the text applications, we set $I_j = \sqrt{\sum_{i=1}^{n} W_{ij}}$. For images, $I$ must measure something that is comparable across the super-pixels in different images, such as color histograms or other features of super-pixels; we leave further exploration of these ideas for future work. In Figure 5, we show a toy example $W$, with $n = d' = 5$, where $W$ is binary (for simplicity). The importance function $I$ should score feature f2 higher than feature f1, i.e. $I_2 > I_1$, since feature f2 is used to explain more instances.
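
The construction of $W$ and $I$ can be illustrated with a small sketch. The binary matrix below is a hypothetical stand-in for the toy example, chosen only so that feature f2 explains more instances than f1; it is not copied from Figure 5, and the function names are illustrative.

```python
import math


def explanation_matrix(explanations, d_prime):
    """Build W from one sparse explanation per instance.

    explanations: list of dicts {component index j: weight w_j}.
    W[i][j] holds the absolute weight, or 0 if component j is unused.
    """
    W = [[0.0] * d_prime for _ in explanations]
    for i, weights in enumerate(explanations):
        for j, w in weights.items():
            W[i][j] = abs(w)
    return W


def global_importance(W):
    """I_j = sqrt(sum_i W_ij), the text-application choice described above."""
    d_prime = len(W[0])
    return [math.sqrt(sum(row[j] for row in W)) for j in range(d_prime)]


# Hypothetical binary explanations over features f1..f5 (indices 0..4).
toy_explanations = [
    {0: 1, 1: 1},   # instance 1 is explained by f1 and f2
    {1: 1, 2: 1},   # instance 2 by f2 and f3
    {1: 1, 2: 1},   # instance 3 by f2 and f3
    {1: -1},        # instance 4 by f2 (the sign is dropped in W)
    {3: 1, 4: 1},   # instance 5 by f4 and f5
]
W = explanation_matrix(toy_explanations, d_prime=5)
I = global_importance(W)
```

With this matrix f2 appears in four explanations and f1 in one, so `I[1]` exceeds `I[0]`, matching the requirement $I_2 > I_1$.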

Algorithm 2 Submodular pick algorithm
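Require: Instances $X$, Budget $B$
for all $x_i \in X$ do
$W_i \leftarrow \text{explain}(x_i, x'_i)$ (using Algorithm 1)
end for
for $j \in \{1 ... d'\}$ do
$I_j \leftarrow \sqrt{\sum_{i=1}^{n} |W_{ij}|}$ (compute feature importances)
end for
$V \leftarrow \{\}$
while $|V| < B$ do (greedy optimization of Eq. (4))
$V \leftarrow V \cup \arg\max_i c(V \cup \{i\}, W, I)$
end while
return $V$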

While we want to pick instances that cover the important components, the set of explanations must not be redundant in the components they show the users, i.e. avoid selecting instances with similar explanations. In Figure 5, after the second row is picked, the third row adds no value, as the user has already seen features f2 and f3 - while the last row exposes the user to completely new features. Selecting the second and last row results in the coverage of almost all the features. We formalize this non-redundant coverage intuition in Eq. (3), where we define coverage as the set function $c$, given $W$ and $I$, which computes the total importance of the features that appear in at least one instance in a set $V$.
$$c(V, W, I) = \sum_{j=1}^{d'} \mathbb{I}_{[\exists i \in V : W_{ij} > 0]} \, I_j \tag{3}$$

The pick problem is defined in Eq. (4), and it consists of finding the set $V$, $|V| \le B$, that achieves highest coverage.

$$Pick(W, I) = \mathop{\arg\max}_{V, |V| \le B} \; c(V, W, I) \tag{4}$$
The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [9]. Let $c(V \cup \{i\}, W, I) - c(V, W, I)$ be the marginal coverage gain of adding an instance $i$ to a set $V$. Due to submodularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of $1 - 1/e$ to the optimum [15]. We outline this approximation for the pick step in Algorithm 2, and call it submodular pick.
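
The coverage function of Eq. (3) and the greedy pick can be sketched as follows. This is an illustrative version, not the authors' code: the matrix `W` is a hypothetical binary example built to match the description of the toy figure (row 3 repeats the features of row 2, and the last row uses otherwise unseen features), and ties are broken in favour of the lowest row index.

```python
import math


def coverage(V, W, I):
    """Eq. (3): total importance of the components that have a positive
    entry in at least one of the rows listed in V."""
    return sum(
        I[j] for j in range(len(I)) if any(W[i][j] > 0 for i in V)
    )


def submodular_pick(W, I, B):
    """Greedily add the row with the highest marginal coverage gain until
    B rows are chosen (or no rows are left)."""
    V = []
    while len(V) < min(B, len(W)):
        candidates = [i for i in range(len(W)) if i not in V]
        best = max(
            candidates,
            key=lambda i: coverage(V + [i], W, I) - coverage(V, W, I),
        )
        V.append(best)
    return V


# Hypothetical binary explanation matrix over features f1..f5.
W = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
I = [math.sqrt(sum(row[j] for row in W)) for j in range(5)]
picked = submodular_pick(W, I, B=2)
print(picked)
```

With a budget of 2, the sketch selects the second and the last row (indices 1 and 4), leaving only f1 uncovered, which mirrors the intuition given for the toy example.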


5 Simulated User Experiments

In this section, we present simulated user experiments to evaluate the usefulness of explanations in trust-related tasks. In particular, we address the following questions: (1) Are the explanations faithful to the model, (2) Can the explanations aid users in ascertaining trust in predictions, and (3) Are the explanations useful for evaluating the model as a whole.


5.1 Experiment Setup

We use two sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [4]. The results on two other datasets (electronics and kitchen) are similar, thus we omit them due to space. We train decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and support vector machines with RBF kernel (SVM), all using bag of words as features. We also include random forests (with 1000 trees) trained with the average word2vec embedding [19] (RF), a model that is impossible to interpret. We use the implementations and default parameters of scikit-learn, unless noted otherwise. We divide each dataset into train (1600 instances) and test (400 instances). Code for replicating our experiments is available online.


To explain individual predictions, we compare our proposed approach (LIME) with parzen [2], for which we take the K features with the highest absolute gradients as explanations. We set the hyperparameters for parzen and LIME using cross validation, and set N = 15,000. We also compare against a greedy procedure (similar to Martens and Provost [18]) in which we greedily remove features that contribute the most to the predicted class until the prediction changes (or we reach the maximum of K features), and a random procedure that randomly picks K features as an explanation. We set K to 10 for our experiments. For experiments where the pick procedure applies, we either do random selection (random pick, RP) or the procedure described in Section 4 (submodular pick, SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.
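The greedy baseline can be sketched as follows. This is an illustrative reading of the description above, not the code used in the experiments: the function `greedy_explain`, the toy word weights, and the 0.5 decision threshold are assumptions, and the black box is reduced to a function that maps a list of words to the probability of the positive class.

```python
import math


def greedy_explain(words, predict_proba, K=10):
    """Remove, one at a time, the word that most supports the predicted class.

    Stops when the predicted label flips or K words have been removed.
    """
    present = list(dict.fromkeys(words))  # unique words, original order
    positive = predict_proba(present) >= 0.5  # originally predicted class
    removed = []
    while present and len(removed) < K:
        p_now = predict_proba(present)

        def support(word):
            # Drop in probability of the predicted class if `word` is removed.
            p_rest = predict_proba([w for w in present if w != word])
            return (p_now - p_rest) if positive else (p_rest - p_now)

        best = max(present, key=support)
        present.remove(best)
        removed.append(best)
        if (predict_proba(present) >= 0.5) != positive:
            break  # the prediction changed
    return removed


# Hypothetical black box: a logistic model over word weights.
toy_weights = {"great": 2.0, "loved": 1.5, "boring": -1.0, "the": 0.0}


def toy_predict_proba(words):
    score = sum(toy_weights.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


explanation = greedy_explain(["the", "great", "loved", "boring"], toy_predict_proba)
print(explanation)
```

Note that the returned list is unweighted: it says which words were removed, but not how much each one contributed.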



Figure 6: Recall on truly important features for two interpretable classifiers on the books dataset.



Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset.


5.2 Are explanations faithful to the model?

We measure faithfulness of explanations on classifiers that are by themselves interpretable (sparse logistic regression and decision trees). In particular, we train both classifiers such that the maximum number of features they use for any instance is 10. For such models, we know the set of truly important features. For each prediction on the test set, we generate explanations and compute the fraction of truly important features that are recovered by the explanations. We report this recall averaged over all the test instances in Figures 6 and 7. We observe that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. However, text is a particularly hard case for the parzen explainer, due to the difficulty in approximating the original classifier in high dimensions, thus the overall recall by parzen is low. LIME consistently provides > 90% recall for both logistic regression and decision trees on both datasets, demonstrating that LIME explanations are quite faithful to the model.
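The recall measure described above can be written down in a few lines. The sketch below is an assumption about the bookkeeping, not the authors' evaluation code: `recall` and `mean_recall` are hypothetical names, the feature sets are made up, and an instance with no truly important features is given a recall of 1.0 by convention.

```python
def recall(gold_features, explained_features):
    """Fraction of the truly important features recovered by an explanation."""
    gold = set(gold_features)
    if not gold:
        return 1.0  # convention assumed here for an empty gold set
    return len(gold & set(explained_features)) / len(gold)


def mean_recall(pairs):
    """Average recall over (gold, explanation) pairs, one pair per test instance."""
    return sum(recall(g, e) for g, e in pairs) / len(pairs)


# Hypothetical gold sets and explanations for two test instances.
pairs = [
    ({"plot", "dull", "waste", "boring"}, ["plot", "dull", "the"]),
    ({"great", "loved"}, ["great", "loved", "book"]),
]
print(mean_recall(pairs))
```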


5.3 Should I trust this prediction?

In order to simulate trust in individual predictions, we first randomly select 25% of the features to be “untrustworthy”, and assume that the users can identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc.). We thus develop oracle “trustworthiness” by labeling test set predictions from a black box classifier as “untrustworthy” if the prediction changes when untrustworthy features are removed from the instance, and “trustworthy” otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human “discounts” the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation, since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle.
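The three decision rules in this setup can be sketched as below. All names are hypothetical, and two modelling choices are assumptions rather than details given in the text: the linear approximation is represented as an intercept plus one weight per explained feature, and its predicted class is obtained by thresholding the score at 0.5.

```python
def oracle_trusts(instance, predict, untrustworthy):
    """Oracle: trustworthy iff the black-box prediction survives feature removal."""
    cleaned = [f for f in instance if f not in untrustworthy]
    return predict(cleaned) == predict(instance)


def user_trusts_weighted(weights, intercept, untrustworthy, threshold=0.5):
    """Simulated user for weighted explanations (LIME, parzen).

    The user discounts untrustworthy features and checks whether the
    prediction of the linear approximation changes.
    """
    full = intercept + sum(weights.values())
    discounted = intercept + sum(
        w for f, w in weights.items() if f not in untrustworthy
    )
    return (full >= threshold) == (discounted >= threshold)


def user_trusts_unweighted(features, untrustworthy):
    """Simulated user for greedy and random: mistrust if any bad feature shows up."""
    return not any(f in untrustworthy for f in features)


# Hypothetical black box that keys entirely on a header token.
def toy_predict(words):
    return int("posting-host" in words)


instance = ["posting-host", "god", "church"]
bad = {"posting-host"}
lime_weights = {"posting-host": 0.3, "god": 0.15}  # made-up local weights

print(oracle_trusts(instance, toy_predict, bad))
print(user_trusts_weighted(lime_weights, intercept=0.3, untrustworthy=bad))
print(user_trusts_unweighted(["god", "posting-host"], bad))
```

The weighted rule can still trust a prediction whose explanation contains an untrustworthy feature, as long as that feature's weight is too small to change the outcome; the unweighted rule cannot make that distinction.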


Using this setup, we report the F1 on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1. The results indicate that LIME dominates others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.
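For reference, the F1 on trustworthy predictions treats "trustworthy" as the positive class and compares the simulated user's decisions against the oracle. The helper below is a hypothetical sketch with made-up labels, not the evaluation script behind Table 1.

```python
def f1_trustworthy(oracle, user):
    """F1 with 'trustworthy' (True) as the positive class."""
    tp = sum(1 for o, u in zip(oracle, user) if o and u)
    fp = sum(1 for o, u in zip(oracle, user) if not o and u)  # trusted too much
    fn = sum(1 for o, u in zip(oracle, user) if o and not u)  # mistrusted too much
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up oracle labels and simulated-user decisions for four predictions.
oracle_labels = [True, True, True, False]
user_labels = [True, True, False, True]
print(f1_trustworthy(oracle_labels, user_labels))
```

Low recall here corresponds to a method that mistrusts more predictions than it should, and low precision to one that trusts too many.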

5.4 Can I trust this model?

In the final simulated user experiment, we evaluate whether the explanations can be used for model selection, simulating the case where a human has to decide between two competing models with similar accuracy on validation data. For this purpose, we add 10 artificially “noisy” features. Specifically, on training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that are noisy and introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data.
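One way to generate such artificial features is sketched below. It makes assumptions the text does not spell out: each feature is sampled independently per example with the stated rate (so the proportions hold in expectation, not exactly), the same class receives the higher rate for every artificial feature, and the names `add_noisy_features` and `noise_j` are invented for illustration.

```python
import random


def add_noisy_features(labels, n_features=10, rates=(0.1, 0.2), seed=0):
    """Return, for each example, the set of artificial features it contains.

    rates[c] is the probability that an artificial feature appears in an
    example of class c (labels are assumed to be 0 or 1).
    """
    rng = random.Random(seed)
    return [
        {f"noise_{j}" for j in range(n_features) if rng.random() < rates[y]}
        for y in labels
    ]


train_labels = [0, 1] * 800  # hypothetical balanced training labels
test_labels = [0, 1] * 200   # hypothetical balanced test labels

# Training/validation data: spurious correlation with the class.
train_noise = add_noisy_features(train_labels, rates=(0.1, 0.2), seed=1)
# Test data: same rate in both classes, so the features carry no signal.
test_noise = add_noisy_features(test_labels, rates=(0.1, 0.1), seed=2)
```

A classifier that leans on these features can look good on validation data while losing accuracy on the test set, which is the gap the experiment relies on.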


The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of B instances from the validation set. The simulated human marks the set of artificial features that appear in the B explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions, and compare this choice to the classifier with higher held-out test set accuracy.
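The selection rule can be sketched as follows. This is a deliberately simplified illustration: it counts a validation prediction as untrustworthy whenever its explanation contains a marked feature (the rule used for unweighted explainers in the previous section), whereas for LIME the text applies the discounting rule on the linear approximation. The function names and toy explanations are hypothetical.

```python
def untrustworthy_count(shown, validation, artificial):
    """Count validation predictions whose explanation uses a marked feature.

    `shown` holds the B explanations the simulated user inspected; only
    artificial features that appear there are marked as untrustworthy.
    """
    marked = {f for expl in shown for f in expl if f in artificial}
    return sum(1 for expl in validation if any(f in marked for f in expl))


def choose_classifier(models, artificial):
    """Pick the classifier with the fewest untrustworthy validation predictions."""
    return min(models, key=lambda name: untrustworthy_count(*models[name], artificial))


artificial = {"noise_0", "noise_1"}
# name -> (explanations shown to the user, explanations for all validation instances)
models = {
    "A": ([["great", "noise_0"], ["plot"]],
          [["great", "noise_0"], ["noise_0", "dull"], ["plot"]]),
    "B": ([["great"], ["noise_1", "dull"]],
          [["great"], ["noise_1"], ["plot"]]),
}
print(choose_classifier(models, artificial))
```

Because only features that appear in the B inspected explanations are marked, the choice of which instances to show directly affects how many spurious features the simulated user can discover.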


Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.




Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs.


We present the accuracy of picking the correct classifier as B varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations for this task, performing only slightly better than random. We see that LIME is consistently better than greedy, irrespective of the pick method. Further, combining submodular pick with LIME outperforms all other methods; in particular, it is much better than RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization, which we validate with human experiments in the next section.


6 Evaluation with Human Subjects

In this section, we recreate three scenarios in machine learning that require trust and understanding of predictions and models. In particular, we evaluate LIME and SP-LIME in the following settings: (1) Can users choose from two classifiers the one that generalizes better (Section 6.2), (2) based on the explanations, can users perform feature engineering to improve the model (Section 6.3), and (3) are users able to identify and describe classifier irregularities by looking at explanations (Section 6.4).


6.1 Experimental setup

For experiments in sections 6.2 and 6.3, we use the subset of 20 newsgroups mentioned beforehand, where the task is to distinguish between “Christianity” and “Atheism” documents. This dataset is quite problematic since it contains features that do not generalize well (e.g. very informative header information and author names), and thus validation accuracy considerably overestimates real-world performance.


In order to estimate the real world performance, we create a new religion dataset for evaluation. We download Atheism and Christianity websites from the DMOZ directory and human curated lists, yielding 819 webpages in each class (more details and data available online). High accuracy on the religion dataset by a classifier trained on 20 newsgroups indicates that the classifier is generalizing using semantic content, instead of placing importance on the data specific issues outlined above.


Unless noted otherwise, we use SVM with RBF kernel, trained on the 20 newsgroups data with hyperparameters tuned via cross-validation. This classifier obtains 94% accuracy on the original 20 newsgroups train-test split.


6.2 Can users select the best classifier?

In this section, we want to evaluate whether explanations can help users decide which classifier generalizes better - that is, which classifier the user trusts more “in the wild”. Specifically, users have to decide between two classifiers: SVM trained on the original 20 newsgroups dataset, and a version of the same classifier trained on a “cleaned” dataset where many of the features that do not generalize are manually removed using regular expressions. The original classifier achieves an accuracy score of 57.3% on the religion dataset, while the “cleaned” classifier achieves a score of 69.0%. In contrast, the test accuracy on the original train/test split for 20 newsgroups is 94.00% and 88.6%, respectively - suggesting that the worse classifier would be selected if accuracy alone is used as a measure of trust.


We recruit human subjects on Amazon Mechanical Turk – by no means machine learning experts, but instead people with basic knowledge about religion. We measure their ability to choose the better algorithm by seeing side-by-side explanations with the associated raw data (as shown in Figure 2). We restrict both the number of words in each explanation (K) and the number of documents that each person inspects (B) to 6. The position of each algorithm and the order of the instances seen are randomized between subjects. After examining the explanations, users are asked to select which algorithm will perform best in the real world, and to explain why. The explanations are produced by either greedy (chosen as a baseline due to its performance in the simulated user experiment) or LIME, and the instances are selected either by random (RP) or submodular pick (SP). We modify the greedy step in Algorithm 2 slightly so it alternates between explanations of the two classifiers. For each setting, we repeat the experiment with 100 users.



Figure 9: Average accuracy of human subject (with standard errors) in choosing between two classifiers.


The results are presented in Figure 9. The first thing to note is that all of the methods are good at identifying the better classifier, demonstrating that the explanations are useful in determining which classifier to trust, while using test set accuracy would result in the selection of the wrong classifier. Further, we see that the submodular pick (SP) greatly improves the user’s ability to select the best classifier when compared to random pick (RP), with LIME outperforming greedy in both cases. While a few users got confused and selected a classifier for arbitrary reasons, most indicated that the fact that one of the classifiers clearly utilized more semantically meaningful words was critical to their selection.


6.3 Can non-experts improve a classifier?

If one notes a classifier is untrustworthy, a common task in machine learning is feature engineering, i.e. modifying the set of features and retraining in order to improve generalization and make the classifier trustworthy. Explanations can aid in this process by presenting the important features, especially for removing features that the users feel do not generalize.


We use the 20 newsgroups data here as well, and ask Amazon Mechanical Turk users to identify which words from the explanations should be removed from subsequent training, in order to improve the worse classifier from the previous section. At each round of interaction, the subject marks words for deletion while seeing B = 10 instances with K = 10 words in each explanation (an interface similar to Figure 2, but with a single algorithm). As a reminder, the users here are not experts in machine learning and are unfamiliar with feature engineering, thus are only identifying words based on their semantic content. Further, users do not have any access to the religion dataset - they do not even know of its existence. We start the experiment with 10 subjects. After they mark words for deletion, we train 10 different classifiers, one for each subject (with the corresponding words removed). The explanations for each classifier are then presented to a set of 5 users in a new round of interaction, which results in 50 new classifiers. We do a final round, after which we have 250 classifiers, each with a path of interaction tracing back to the first 10 subjects.



Figure 10: Feature engineering experiment. Each shaded line represents the average accuracy of subjects in a path starting from one of the initial 10 subjects. Each solid line represents the average across all paths per round of interaction.


The explanations and instances shown to each user are produced by SP-LIME or RP-LIME. We show the average accuracy on the religion dataset at each interaction round for the paths originating from each of the original 10 subjects (shaded lines), and the average across all paths (solid lines) in Figure 10. It is clear from the figure that the crowd workers are able to improve the model by removing features they deem unimportant for the task. Further, SP-LIME outperforms RP-LIME, indicating selection of the instances is crucial for efficient feature engineering.


It is also interesting to observe that paths where the initial users do a relatively worse job in selecting features are later fixed by the subsequent users.


Each subject took an average of 3.6 minutes per round of cleaning, resulting in just under 11 minutes to produce a classifier that generalizes much better to real world data. Each path had on average 200 words removed with SP, and 157 with RP, indicating that incorporating coverage of important features is useful for feature engineering. Further, out of an average of 200 words selected with SP, 174 were selected by at least half of the users, while 68 by all the users. Along with the fact that the variance in the accuracy decreases across rounds, this high agreement demonstrates that the users are converging to similar correct models. This evaluation is an example of how explanations make it easy to improve an untrustworthy classifier – in this case easy enough that machine learning knowledge is not required.
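As a quick check of the figures quoted above (simple arithmetic on the reported averages, assuming each path consists of exactly the three rounds described in the setup):

$$
3 \times 3.6\ \text{min} = 10.8\ \text{min} < 11\ \text{min}, \qquad \frac{174}{200} = 87\%, \qquad \frac{68}{200} = 34\%
$$

That is, roughly 87% of the words removed with SP were selected by at least half of the users, and 34% by all of them.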


6.4 Do explanations lead to insights?

Often artifacts of data collection can induce undesirable correlations that the classifiers pick up during training. These issues can be very difficult to identify just by looking at the raw data and predictions.


In an effort to reproduce such a setting, we take the task of distinguishing between photos of Wolves and Eskimo Dogs (huskies). We train a logistic regression classifier on a training set of 20 images, hand selected such that all pictures of wolves had snow in the background, while pictures of huskies did not. As the features for the images, we use the first max-pooling layer of Google’s pre-trained Inception neural network [25]. On a collection of additional 60 images, the classifier predicts “Wolf” if there is snow (or light background at the bottom), and “Husky” otherwise, regardless of animal color, position, pose, etc. We trained this bad classifier intentionally, to evaluate whether subjects are able to detect it.


The experiment proceeds as follows: we first present a balanced set of 10 test predictions (without explanations), where one wolf is not in a snowy background (and thus the prediction is “Husky”) and one husky is (and is thus predicted as “Wolf”). We show the “Husky” mistake in Figure 11a. The other 8 examples are classified correctly. We then ask the subject three questions: (1) Do they trust this algorithm to work well in the real world, (2) why, and (3) how do they think the algorithm is able to distinguish between these photos of wolves and huskies. After getting these responses, we show the same images with the associated explanations, such as in Figure 11b, and ask the same questions.


Since this task requires some familiarity with the notion of spurious correlations and generalization, the set of subjects for this experiment were graduate students and professors in machine learning and its applications (NLP, Vision, etc.). After gathering the responses, we had 3 independent evaluators read their reasoning and determine if each subject mentioned snow, background, or equivalent as a potential feature the model may be using. We pick the majority as an indication of whether the subject was correct about the insight, and report these numbers before and after showing the explanations in Table 2.
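The majority rule over the three evaluators can be stated in a couple of lines. The votes below are invented for illustration and are unrelated to the numbers in Table 2; `majority` and `count_correct` are hypothetical helper names.

```python
def majority(votes):
    """True if more than half of the evaluators marked the insight as present."""
    return 2 * sum(votes) > len(votes)


def count_correct(votes_per_subject):
    """Number of subjects judged (by majority) to have mentioned the insight."""
    return sum(1 for votes in votes_per_subject if majority(votes))


# One row per subject, one boolean per evaluator (made-up data).
votes_per_subject = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
]
print(count_correct(votes_per_subject))
```

Using an odd number of evaluators guarantees that the majority is always defined.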


Before observing the explanations, more than a third trusted the classifier, a somewhat low number since we presented only 10 examples. They did speculate as to what the neural network was picking up on, and a little less than half mentioned the snow pattern as a possible cause. After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially. Although our sample size is small, this experiment demonstrates the utility of explaining individual predictions for getting insights into classifiers, and for knowing when not to trust them and why. Figuring out the best interfaces and doing further experiments in this area (in particular with real machine learning based services) is an exciting direction for future research.


7 Related Work

The problems with relying on validation set accuracy as the primary measure of trust have been well studied. Practitioners consistently overestimate their model’s accuracy [21], propagate feedback loops [23], or fail to notice data leaks [14]. In order to address these issues, researchers have proposed tools like Gestalt [22] and Modeltracker [1], which help users navigate individual instances. These tools are complementary to LIME in terms of explaining models, since they do not address the problem of explaining individual predictions - instead they let the user browse raw data or features. Further, our submodular pick procedure can be incorporated in such tools to aid users in navigating larger datasets.


Some recent work aims to anticipate failures in machine learning, specifically for vision tasks [3, 29]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding “silly mistakes” [7]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. Other recent work [10] focuses on exposing users to different kinds of mistakes (our pick step). Interestingly, the subjects in their study did not notice the serious problems in the 20 newsgroups data even after looking at many mistakes, suggesting that examining raw data is not sufficient. Note that Groce et al. [10] are not alone in this regard; many researchers in the field have unwittingly published classifiers that would not generalize for this task. Using LIME, we show that even non-experts are able to identify these irregularities when explanations are present. Further, LIME can complement these existing systems, and allow users to assess trust even when a prediction seems “correct” but is made for the wrong reasons.


Recognizing the utility of explanations in assessing trust, many have proposed using interpretable models [27], especially for the medical domain [6, 17, 26]. While such models may be appropriate for some domains, they may not apply equally well to others (e.g. a supersparse linear model [26] with 5-10 features is unsuitable for text applications). Interpretability, in these cases, comes at the cost of flexibility, accuracy, or efficiency. For text, EluciDebug [16] is a full human-in-the-loop system that shares many of our goals (interpretability, faithfulness, etc.). However, they focus on an already interpretable model (Naive Bayes). In computer vision, systems that rely on object detection to produce candidate alignments [13] or attention [28] are able to produce explanations for their predictions. These are, however, constrained to specific neural network architectures or incapable of detecting “non object” parts of the images. Here we focus on general, model-agnostic explanations that can be applied to any classifier or regressor that is appropriate for the domain, even ones that are yet to be proposed.


A common approach to model-agnostic explanation is learning a potentially interpretable model on the predictions of the original model [2]. Having the explanation be a gradient vector captures a similar locality intuition to that of LIME. However, interpreting the coefficients on the gradient is difficult, particularly for confident predictions (where the gradient is near zero). Further, the model that produces the gradient is trained to approximate the original model globally. When the number of dimensions is high, maintaining local fidelity for such models becomes increasingly hard, as our experiments demonstrate. In contrast, LIME solves the much more feasible task of finding a model that approximates the original model locally. The idea of perturbing inputs for explanations has been explored before [24], where the authors focus on learning a specific contribution model, as opposed to our general framework. None of these approaches explicitly take cognitive limitations into account, and thus may produce non-interpretable explanations, such as gradients or linear models with thousands of non-zero weights. The problem becomes worse if the original features are nonsensical to humans (e.g. word embeddings). In contrast, LIME incorporates interpretability both in the optimization and in our notion of interpretable representation, such that domain and task specific interpretability criteria can be accommodated.


8 Conclusion and Future Work

In this paper, we argued that trust is crucial for effective human interaction with machine learning systems, and that explaining individual predictions is important in assessing trust. We proposed LIME, a modular and extensible approach to faithfully explain the predictions of any model in an interpretable manner. We also introduced SP-LIME, a method to select representative and non-redundant predictions, providing a global view of the model to users. Our experiments demonstrated that explanations are useful for trust-related tasks: deciding between models, assessing trust, improving untrustworthy models, and getting insights into predictions.


There are a number of avenues of future work that we would like to explore. Although we describe only sparse linear models as explanations, our framework supports the exploration of a variety of explanation families, such as decision trees; it would be interesting to see a comparative study on these with real users. One issue that we do not mention in this work is how to perform the pick step for images, and we would like to address this limitation in the future. The domain and model agnosticism enables us to explore a variety of applications, and we would like to investigate potential uses in speech, video, and medical domains. Finally, we would like to explore theoretical properties (such as the appropriate number of samples) and computational optimizations (such as using parallelization and GPU processing), in order to provide the accurate, real-time explanations that are critical for any human-in-the-loop machine learning system.


