
How to Build Trust in Your Machine Learning Model's Predictions Using LIME

This article is a step-by-step guide that'll help you interpret your machine learning model's predictions using LIME. Even when your model achieves close to 100% accuracy, there is always one question that runs in your mind: should we trust it?

Consider a situation at a doctor's office: would a doctor trust a computer if it just showed a diagnosis without giving any valid reason behind it?

Any model that fails to explain the reasoning behind its output is considered a black box, and trusting such a model is not the right approach.

Let's say we're given a model which predicts whether an animal is a dog or a cat and has 100% accuracy. But what if it makes that prediction based on the background of the image? Would you trust that model?

As you can see in the figure above, the green regions represent the features the model used to identify the image as a cat, and the red regions represent the features it used to identify it as a dog.

If our model provides such a valid reason for its prediction, it builds our trust in that model. Similarly, in the doctor scenario, if the model can tell which features were important in its prediction and which symptoms it gave more weight, it is easier for the doctor to trust that model.

But is it that simple to interpret any model? Luckily, yes. In 2016, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin published a paper called "Why Should I Trust You?": Explaining the Predictions of Any Classifier.

In it, they proposed their technique, LIME. The basic idea of this technique is to interpret any model easily by approximating it locally around the prediction being explained.

They wrote this paper to understand the explanations behind any model's prediction. So whenever you need to choose a model, you can use the insights from LIME.

In the above diagram, the model predicts that a patient has the flu, and LIME highlights the symptoms in the patient's history that led to the prediction.

Sneezing and headache contribute to the "flu" prediction, while "no fatigue" is evidence against it. With this information, a doctor can make an informed decision about whether to trust the model's prediction.

So, what exactly is LIME?

LIME is model-agnostic, meaning that it can be applied to any machine learning model. The goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier. - Definition from the official paper (link)

To understand this, we need to understand the meaning of the acronym LIME.

Local: Refers to how we get these explanations. LIME approximates the black box model locally in the neighborhood of the prediction being explained.

Interpretable: The explanations provided by LIME are simple enough for humans to understand.

Model-agnostic: LIME treats the model as a black box, and so it works for any model.

Explanations: The justifications given for the actions performed by the model.

LIME provides local model interpretability. It modifies a single data sample by tweaking the feature values and observing the resulting impact on the output.
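
To illustrate that idea (a minimal sketch of the intuition, not LIME's actual implementation), here is what probing a single sample might look like, assuming some fitted classifier model with a predict_proba method:

import numpy as np

# Sketch: nudge one feature of a single sample and watch the prediction move.
# `model` is any fitted classifier with predict_proba; `sample` is one row of features.
def probe_feature(model, sample, feature_idx, deltas=(-1.0, 0.0, 1.0)):
    for delta in deltas:
        perturbed = np.array(sample, dtype=float)
        perturbed[feature_idx] += delta
        print(delta, model.predict_proba(perturbed.reshape(1, -1))[0])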

With LIME, we can explain why the RandomForestClassifier predicts what it does before trusting its prediction.

Let's look at some code

We'll start by using the RandomForestClassifier model to work on the "Did it rain in Seattle" dataset. The data is available here.

First, we will import our base libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

To avoid future warnings in our code, we will add this at the start of our script:

import warnings
warnings.filterwarnings('ignore')

We then import a few sklearn utilities for splitting the dataset and defining the metrics. The RandomForestClassifier is also imported from the same library.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

Since we have all our required libraries, we will read our data:

df = pd.read_csv('seattleWeather_1948-2017.csv')
df.head()

So the data consists of four feature columns (DATE, PRCP, TMAX, TMIN) and a target column, RAIN. Our task is to predict whether it rained in Seattle.

df.shape

(25551, 5)

Our data consists of 25,551 rows, which is enough to train our model.

We will check for missing values, if any:

df.isnull().sum()

Since our main focus is interpreting the model's predictions, we will simply drop the rows with missing values. For simplicity's sake, we will remove the DATE column as well.

df.dropna(inplace=True)
df.pop('DATE')

We will now encode our target column:

df.RAIN.replace({True:1,False:0},inplace=True)
df.head()

This is how our data looks in the end.

target = df.pop('RAIN')
x_train , x_test , y_train , y_test = train_test_split(df, target, train_size=0.75)

We have now split the data into train and test sets, with the training set equal to 75% of the original data.
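
As an aside, if you want the split (and hence the numbers below) to be reproducible across runs, you can fix the seed; random_state is a standard train_test_split parameter:

# Optional: fix the seed so the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(df, target, train_size=0.75, random_state=42)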

We will now create our model with default parameters:

rfc = RandomForestClassifier()

And fit the model to the training samples:

rfc.fit(x_train,y_train)

accuracy_score(y_test,rfc.predict(x_test))

1.0

The model has achieved 100% accuracy. But now let's interpret the model so we can trust it.
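
As an optional quick sanity check before reaching for LIME (feature_importances_ is a built-in attribute of sklearn's random forests), you can peek at which features the forest leans on globally:

# Global feature importances, for comparison with LIME's local weights later
for name, importance in zip(x_train.columns, rfc.feature_importances_):
    print(name, round(importance, 3))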

LIME

First, we need to discuss a bit of theory before we go on.

LIME creates a new dataset consisting of permuted samples and their respective predictions from the black box model.

On this dataset, LIME trains a local surrogate model that is weighted by the proximity of the sampled instances to the instance of interest. This surrogate can be any simple, interpretable model, such as a linear model or a decision tree.

This surrogate must make locally similar predictions to those of the existing model. This accuracy is called local fidelity.
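
To make that concrete, here is a minimal sketch of the idea, not LIME's internal code: the Gaussian perturbation, the kernel width, and the choice of Ridge as the surrogate are assumptions based on common practice.

import numpy as np
from sklearn.linear_model import Ridge

# Sketch: fit a proximity-weighted linear surrogate around one instance.
# `model` is the fitted black box; `instance` is a 1-D feature array.
def local_surrogate(model, instance, num_samples=5000, kernel_width=0.75):
    # perturb the instance with Gaussian noise
    perturbed = instance + np.random.normal(size=(num_samples, instance.shape[0]))
    preds = model.predict_proba(perturbed)[:, 1]          # black-box predictions
    dists = np.linalg.norm(perturbed - instance, axis=1)  # distance to the instance
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)   # proximity kernel
    surrogate = Ridge().fit(perturbed, preds, sample_weight=weights)
    return surrogate.coef_  # local feature weights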

import lime
from lime import lime_tabular

Now that we have imported the required packages, we need to perform our interpretation.

Here's the recipe for training local surrogate models:

  1. Select the model whose prediction you want to explain
  2. Train this model and get its prediction for the test values
  3. For LIME, weight the new samples by their proximity to the instance being explained
  4. Fit a local, interpretable model on this weighted dataset
  5. Finally, explain the prediction by interpreting the local model

Define a LimeTabularExplainer model. The parameters of this model are the training samples, the feature names, and the class names:

explainer = lime_tabular.LimeTabularExplainer(
    x_train.values,
    feature_names=['PRCP', 'TMAX', 'TMIN'],
    class_names=['False', 'True'],
    discretize_continuous=True
)

We need to pass the training samples, the training column names, and the expected target class names.
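
As an aside, rather than hardcoding the column names, you can pull them straight from the training frame, which is equivalent here:

explainer = lime_tabular.LimeTabularExplainer(
    x_train.values,
    feature_names=list(x_train.columns),  # same as ['PRCP', 'TMAX', 'TMIN']
    class_names=['False', 'True'],
    discretize_continuous=True
)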

We then call the explain_instance() function of the explainer we created.

We will use the following parameters of this function: the test sample, the model's prediction function, the number of features, and the top labels to consider:

i = np.random.randint(0, x_test.shape[0])
exp = explainer.explain_instance(
    x_test.iloc[i],
    rfc.predict_proba,
    num_features=x_train.shape[1],
    top_labels=None
)

To display the explanation in the notebook, the following code is required.

exp.show_in_notebook()
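
If you're not in a notebook, you can also retrieve the same explanation as plain (feature, weight) pairs; as_list() is part of LIME's Explanation API:

# (feature condition, local weight) pairs for the explained instance
print(exp.as_list())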

Let's break down the output shown in the notebook.

The top-left panel shows the predicted output with its probability.

The model's output is False with 100% probability.

The top-right panel shows the conditions required to fall into each category, along with their weights.

For example, the condition for the PRCP variable to predict the target as False is PRCP ≤ 0.00, and it carries a weight of 0.96.

The bottom-right panel shows our test values. Since the PRCP value satisfies a False condition, it is shown with a blue background.

To display the explanation as a plot:

fig = exp.as_pyplot_figure()
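
When running as a plain script rather than in a notebook, you can save the figure with standard matplotlib calls (the filename here is just an example):

fig.tight_layout()
fig.savefig('lime_explanation.png')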

Here you can see the weight of each feature along with its predicted class (represented by color). These are the local weights assigned to each feature: red represents the False target, whereas green represents the True target.

It is now easy to interpret the model by looking at the weight given to each feature, as well as the condition under which each test value falls into a specific class.

The values of PRCP and TMAX indicate that the predicted target should be False, whereas the value of TMIN points toward a True target.

LIME is not limited to binary classification on tabular data: it also handles multi-class problems, images, and text.
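
For instance, explaining a text classifier follows the same pattern. Here is a hedged sketch, assuming text_model is a hypothetical fitted sklearn Pipeline (say, a TfidfVectorizer followed by a classifier) whose predict_proba accepts raw strings:

from lime.lime_text import LimeTextExplainer

# Sketch: explain one prediction of a (hypothetical) fitted text pipeline `text_model`
text_explainer = LimeTextExplainer(class_names=['negative', 'positive'])
text_exp = text_explainer.explain_instance(
    "what a great movie",
    text_model.predict_proba,
    num_features=6
)
print(text_exp.as_list())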

The code can be found in my GitHub repository: https://github.com/Sid11/Lime

And here's a link to the LIME official GitHub repository: https://github.com/marcotcr/lime

If you have any questions, please reach out to me. Hope you liked the article!

Translated from: https://www.freecodecamp.org/news/how-to-build-trust-in-models-prediction-with-code/
