Visualizing the strengths and weaknesses of a sample from a pre-trained model

Question

Let's say I'm trying to predict apartment prices. I have a lot of labeled data, where for each apartment I have features that could affect the price, like:

  • city
  • street
  • floor number
  • year built
  • socioeconomic status
  • square footage

And I train a model, let's say XGBoost. Now I want to predict the price of a new apartment. Is there a good way to show what is "good" about this apartment and what is bad, and by how much (scaled 0-1)?

For example: the floor number is a "strong" feature (i.e., in this area this floor number is desirable, and thus affects the apartment's price positively), but the socioeconomic status is a weak feature (i.e., the socioeconomic status is low, and thus affects the price negatively).

What I want is to illustrate, more or less, why my model decided on this price, and I want the user to get a feel for the apartment's value from those indicators.

I thought of an exhaustive search over each feature - but I'm afraid that would take too much time.
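For concreteness, the per-feature "exhaustive search" idea can be sketched as replacing one feature at a time with a baseline value and measuring how the prediction moves. Everything below (toy_price, the feature names, the numbers) is invented for illustration; a real trained model would stand in for toy_price:

```python
def toy_price(f):
    # Hypothetical stand-in for a trained price model
    return 1000 * f["sqft"] + 5000 * f["floor"] + 2000 * f["status"]

# A "typical" apartment to use as the baseline, and the one to explain
baseline = {"sqft": 90, "floor": 3, "status": 5}
apartment = {"sqft": 100, "floor": 7, "status": 2}

# Replace one feature at a time with its baseline value and record
# how much the predicted price changes
contributions = {}
for name in apartment:
    perturbed = dict(apartment, **{name: baseline[name]})
    contributions[name] = toy_price(apartment) - toy_price(perturbed)

print(contributions)  # {'sqft': 10000, 'floor': 20000, 'status': -6000}
```

With many features and fine-grained value grids, the number of model evaluations grows quickly, which is exactly the cost concern here.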

Is there a better way?

Any help would be much appreciated...

Answer

There is good news for you.

A package called SHAP (SHapley Additive exPlanations) was recently released just for that purpose. Here's a link to the GitHub.

It supports visualization of complicated models (which are hard to explain intuitively), like boosted trees (and XGBoost in particular!).

It can show you "real" feature importance, which is better than the "gain", "weight", and "cover" importances that xgboost supplies, as those are not consistent.

You can read all about why SHAP is better for feature evaluation here.

It would be hard to give you code that works for you as-is, but the documentation is good and you should write code that suits your own needs.

Here are the guidelines for building your first graph:

import shap
import xgboost as xgb

# Assume X_train and y_train hold the features and labels of your
# training samples; feature_names and weights_trn are your column
# names and (optional) sample weights
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names, weight=weights_trn)

# Train your xgboost model (params0 and watchlist are your usual
# parameter dict and evaluation set list)
bst = xgb.train(params0, dtrain, num_boost_round=2500, evals=watchlist, early_stopping_rounds=200)

# Build the SHAP "explainer" object for the trained tree model
explainer = shap.TreeExplainer(bst)

# The values you explain - I took them from my test set, but you can
# "explain" whatever samples you want here
shap_values = explainer.shap_values(X_test)

# Beeswarm summary plot, then a bar plot of mean |SHAP| per feature
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
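A useful property when reading these plots: SHAP values are additive, so for any one sample, the explainer's expected value plus the sum of that sample's SHAP values reproduces the model's raw output. A minimal sketch with invented numbers (base_value stands in for explainer.expected_value, and shap_row for one row of shap_values; neither comes from a real model):

```python
# Hypothetical numbers, for illustration only
base_value = 0.5
shap_row = {"city": 0.10, "floor": 0.25, "status": -0.15}

# The model's raw output for this sample is the base value plus
# the sum of its per-feature SHAP values
prediction = base_value + sum(shap_row.values())
print(round(prediction, 2))  # 0.7
```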

要绘制"为什么某个样本得到其分数",您可以使用内置的SHAP函数(仅在Jupyter Notebook上有效). 此处为完美示例

To plot the "Why a certain sample got its score" you can either use built in SHAP function for it (only works on a Jupyter Notebook). Perfect example here

I personally wrote a function that plots it using matplotlib, though that takes some effort.
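As a rough stand-in for such a plot, here is a pure-Python sketch that sorts one sample's contributions by magnitude and renders them as signed text bars. The shap_row numbers are invented for illustration; in practice they would be one row of shap_values:

```python
# Invented per-feature SHAP values for a single sample
shap_row = {"floor": 0.9, "sqft": 0.4, "street": -0.2, "status": -1.1}

# Sort by absolute impact, largest first, and draw a signed text bar
for name, value in sorted(shap_row.items(), key=lambda kv: -abs(kv[1])):
    bar = "#" * round(abs(value) * 10)
    sign = "+" if value >= 0 else "-"
    print(f"{name:>8} {sign} {bar}")
```

The output lists features from most to least influential (here: status, floor, sqft, street), with the sign showing whether each one pushed the prediction up or down.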

Here is an example of a plot I've made using the SHAP values (the features are confidential, so they are all erased):

For that specific sample, you can see a 97% prediction for label=1, and how much each feature added to or subtracted from the log-loss.

