Slide 1

Slide 1 text

L-Shapley and C-Shapley: Efficient Model Interpretation for Structured Data
Kyoto University, Daiki Tanaka

Slide 2

Slide 2 text

Background
• Although many black-box ML models, such as random forests, neural networks, or kernel methods, can produce highly accurate predictions, such predictions lack interpretability.
1. Lack of interpretability is a crucial issue when black-box models are applied in areas such as medicine, financial markets, and criminal justice.
2. Being able to understand the model's reasoning is a good way to improve the model.

Slide 3

Slide 3 text

Background: There are several kinds of approaches for interpreting models.
• Model-specific interpretation vs. model-agnostic interpretation
  • Model-specific: makes some assumptions about the model (e.g. methods based on attention weights, or gradient-based methods such as SmoothGrad and Grad-CAM).
  • Model-agnostic: makes no assumptions about the model (e.g. LIME, or the Shapley value).
• Model-level interpretation vs. instance-wise interpretation
  • Instance-wise: yields feature importances for each input instance (e.g. saliency maps).
  • Model-level: yields feature importances for the whole model (e.g. the weights of logistic regression, or the decision rules of a decision tree).
This study focuses on model-agnostic & instance-wise interpretation.

Slide 4

Slide 4 text

Problem Setting: model-agnostic & instance-wise interpretation
• Input:
  • An instance
  • A predictive model
• Output:
  • A vector of importance scores over the features.
  • It indicates which features are the key for the model to make its prediction on that instance.
[Figure: an instance and a model are fed to the interpretation method, which outputs importance scores while making no assumptions on the model.]

Slide 5

Slide 5 text

Related work: Shapley value
• The Shapley value is an idea from the field of cooperative game theory.
• It was originally proposed as a characterization of a fair distribution of the total profit among all the players.
[Figure: a table of the profit obtained by each coalition of Person A, Person B, and Person C.]

Slide 6

Slide 6 text

Related work: Shapley value
• The Shapley value of player i is defined as:
  \phi(i) = \frac{1}{|N|} \sum_{S \subseteq N \setminus \{i\}} \frac{1}{\binom{|N|-1}{|S|}} \bigl( v(S \cup \{i\}) - v(S) \bigr)
• N is the set of all players (e.g. N = {person A, person B, person C}).
• v(S) is the function that returns the "profit" achieved by the coalition S.
• v(S \cup \{i\}) - v(S) measures the contribution of player i.
• \binom{|N|-1}{|S|} is the number of ways of selecting an |S|-sized subset from the remaining players.

Slide 7

Slide 7 text

Related work: example of the Shapley value
• The Shapley value of person A is computed as:
  \phi(A) = \frac{1}{3} \sum_{S \subseteq \{A,B,C\} \setminus \{A\}} \frac{1}{\binom{|N|-1}{|S|}} \bigl( v(S \cup \{A\}) - v(S) \bigr)
          = \frac{1}{3} \Bigl( \frac{1}{1}(100 - 50) + \frac{1}{2}(55 - 5) + \frac{1}{2}(75 - 30) \Bigr)
[Figure: the table of profits for each coalition of Person A, Person B, and Person C used in this calculation.]
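A minimal Python sketch of this exact computation for a three-player game. The profit table below is illustrative only (the slide's original table is not recoverable from this transcript); the values are merely chosen to be consistent with the three marginal-contribution terms shown above.

```python
from itertools import combinations
from math import comb

# Hypothetical coalition profits v(S); illustrative values, not the slide's table.
v = {
    frozenset(): 0,
    frozenset("A"): 0, frozenset("B"): 5, frozenset("C"): 30,
    frozenset("AB"): 55, frozenset("AC"): 75, frozenset("BC"): 50,
    frozenset("ABC"): 100,
}
players = ["A", "B", "C"]

def shapley(i):
    """Exact Shapley value of player i: weighted average of the marginal
    contributions v(S ∪ {i}) - v(S) over all coalitions S not containing i."""
    others = [p for p in players if p != i]
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            S = frozenset(S)
            total += (v[S | {i}] - v[S]) / comb(len(players) - 1, k)
    return total / len(players)

print({p: round(shapley(p), 2) for p in players})  # e.g. phi(A) = 32.5 for these values
```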

Slide 8

Slide 8 text

Related work: Shapley value
• The Shapley value can be applied to predictive models.
  • Each feature is seen as a player in the underlying game.
• Issue: each exact evaluation of the Shapley value requires an exponential number of model evaluations. There are two kinds of approaches to deal with this problem (a sampling sketch follows below).
  • Approach 1: sampling-based methods
    • Randomly sample feature subsets.
  • Approach 2: regression-based methods
    • Sample feature subsets according to a weighted kernel, and carry out a linear regression to estimate the Shapley value.
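The sampling-based idea (Approach 1) can be sketched as Monte Carlo estimation over random feature permutations. This is a generic sketch, not the paper's implementation; `value_fn` stands in for whatever set function v(S) the interpretation uses (e.g. a model evaluation on a masked input).

```python
import random

def monte_carlo_shapley(value_fn, n_features, n_samples=1000, seed=0):
    """Estimate Shapley values by averaging, over random permutations,
    the marginal contribution of each feature when it joins the prefix."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_samples):
        rng.shuffle(order)
        prefix = set()
        v_prev = value_fn(prefix)
        for i in order:
            prefix.add(i)
            v_curr = value_fn(prefix)
            phi[i] += v_curr - v_prev   # marginal contribution of feature i
            v_prev = v_curr
    return [p / n_samples for p in phi]
```

Each permutation costs d evaluations of value_fn, so the total cost is O(n_samples · d) rather than O(2^d), at the price of the sampling variance mentioned later.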

Slide 9

Slide 9 text

Notation
• Feature vector: x ∈ X ⊂ ℝ^d
  • Note that d is the dimension of the feature vectors.
• Set of features: S ⊂ {1, 2, …, d}
• Sub-vector of features: x_S = {x_j, j ∈ S}
• Output variable: y ∈ Y
• Output distribution of a model given an input vector x: ℙ_m(Y | x)

Slide 10

Slide 10 text

Preliminaries: importance of a feature set
• The importance score of a feature set S is introduced as:
  v_x(S) = \mathbb{E}_m \bigl[ \log \mathbb{P}_m(Y \mid x_S) \mid x \bigr]
• where \mathbb{E}_m[\cdot \mid x] denotes the expectation over \mathbb{P}_m(\cdot \mid x).
• The more similar the prediction produced by x_S is to the prediction produced by x, the higher v_x(S) becomes.

Slide 11

Slide 11 text

Preliminaries: importance of a feature set
• In many cases, a class-specific importance is favored:
  • How important is a feature set S to the predicted class?
• Here, the following degenerate conditional distribution is introduced:
  \hat{\mathbb{P}}_m(y \mid x) = \begin{cases} 1 & \text{if } y \in \arg\max_{y'} \mathbb{P}_m(y' \mid x) \\ 0 & \text{otherwise} \end{cases}
• We can then define the importance of a subset S with respect to \hat{\mathbb{P}}_m using the modified score, which is the expected log-probability of the predicted class (a small sketch follows below):
  v_x(S) = \hat{\mathbb{E}}_m \bigl[ \log \mathbb{P}_m(Y \mid x_S) \mid x \bigr]
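A minimal sketch of this class-specific score for a black-box classifier. Two simplifying assumptions are made here: `predict_proba` is a placeholder for the model's class-probability function, and the conditional expectation over the unobserved features is approximated by simply zeroing them out (mirroring the zero-padding used in the experiments), rather than by sampling from a conditional distribution.

```python
import numpy as np

def importance_score(predict_proba, x, S):
    """v_x(S): log-probability assigned to the model's own predicted class
    when only the features in S are kept and the rest are zero-padded."""
    x = np.asarray(x, dtype=float)
    predicted_class = int(np.argmax(predict_proba(x)))  # class predicted on the full input
    x_masked = np.zeros_like(x)
    idx = list(S)
    x_masked[idx] = x[idx]                              # keep only the features in S
    p = predict_proba(x_masked)[predicted_class]
    return float(np.log(p + 1e-12))                     # guard against log(0)
```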

Slide 12

Slide 12 text

Preliminaries: measuring interactions between features
• Consider quantifying the importance of the i-th feature of a feature vector x.
• A naive way is to compute the importance of the set {i}: v_x({i}).
• But this ignores interactions between features.
  • For example, consider performing sentiment analysis on the following sentence:
    "This movie is not heartwarming or entertaining."
  • We wish to quantify the importance of the feature "not", which plays an important role in this sentence being classified as negative.
  • But one would expect that v_x({not}) ≈ 0, because "not" by itself carries neither negative nor positive sentiment.

Slide 13

Slide 13 text

Preliminaries: marginal contribution of a feature
• It is essential to consider the interactions of a given feature with other features.
• A natural way to assess how feature i interacts with the other features is to compute the difference between the importance of a set S of features with and without i.
• This difference is called the marginal contribution of i to S, and is given by:
  m_x(S, i) := v_x(S) - v_x(S \setminus \{i\})
• To obtain a simple scalar measure for feature i, we need to aggregate these marginal contributions over all subsets S that include i.
  • The Shapley value is one way to do so.

Slide 14

Slide 14 text

Preliminaries: Shapley value
• For k = 1, 2, …, d, let S_k(i) denote the set of k-sized feature subsets that contain feature i.
• The Shapley value is obtained by averaging the marginal contributions:
  • first over the sets in S_k(i) for a fixed k,
  • then over all possible choices of the set size k.
  \phi_x(i) = \frac{1}{d} \sum_{k=1}^{d} \frac{1}{\binom{d-1}{k-1}} \sum_{S \in \mathcal{S}_k(i)} m_x(S, i)

Slide 15

Slide 15 text

Challenge with computing the Shapley value
• The exact computation of the Shapley value leads to computational difficulties:
  • we need to calculate marginal contributions for 2^{d-1} subsets.
  \phi_x(i) = \frac{1}{d} \sum_{k=1}^{d} \frac{1}{\binom{d-1}{k-1}} \sum_{S \in \mathcal{S}_k(i)} m_x(S, i) = \sum_{S \ni i,\, S \subseteq [d]} \frac{1}{d \binom{d-1}{|S|-1}} m_x(S, i)
• There are some sampling-based approaches to deal with this problem.
  • But such approaches suffer from high variance when the number of samples that can be collected per instance is limited.

Slide 16

Slide 16 text

Proposed method
Relaxing the computational difficulty of the Shapley value

Slide 17

Slide 17 text

Key idea: features can be seen as nodes of a graph, and they have relationships with each other.
• In many applications, features can be considered as nodes of a graph, and we can define distances between pairs of features based on the graph structure.
• Features that are distant in the graph have weak interactions with each other.
  • For example, an image can be modeled with a grid graph. Pixels that are far apart may have little effect on each other in the computation of the Shapley value.
  • A text can be represented as a line graph.
[Figure: the sentence "This is a pen" represented as a line graph over its words.]

Slide 18

Slide 18 text

Proposed method: preliminaries
• We are given a feature vector x ∈ ℝ^d.
• We let G = (V, E) denote a connected graph.
  • Each feature i is assigned to a node i ∈ V.
  • Edges represent the interactions between features.
• The graph induces the following distance function on V × V:
  d_G(l, m) = the number of edges in the shortest path joining l to m.
• For a given node i, its k-neighborhood is the set:
  N_k(i) := { j ∈ V | d_G(i, j) ≤ k }
[Figure: the gray area is an example of N_2(i).]

Slide 19

Slide 19 text

Proposed method 1: Local-Shapley (L-Shapley)
• Definition 1: Given a model ℙ_m, a sample x, and a feature i, the L-Shapley estimate of order k on a graph G is given by (a sketch follows below):
  \hat{\phi}^k_x(i) = \frac{1}{|N_k(i)|} \sum_{T \ni i,\, T \subseteq N_k(i)} \frac{1}{\binom{|N_k(i)|-1}{|T|-1}} m_x(T, i)
  where N_k(i) is the k-neighborhood of i.
• Original Shapley value, for comparison:
  \phi_x(i) = \sum_{S \ni i,\, S \subseteq [d]} \frac{1}{d \binom{d-1}{|S|-1}} m_x(S, i)
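A minimal sketch of L-Shapley under this definition. It is written against two assumed callables: `importance_score(S)` returns v_x(S) for the instance under study (e.g. the earlier zero-padding sketch with the model and x fixed via functools.partial), and `neighbors(i)` returns the k-neighborhood N_k(i), including i itself, for whichever graph is used.

```python
from itertools import combinations
from math import comb

def l_shapley(importance_score, neighbors, i):
    """L-Shapley estimate for feature i: an exact Shapley-style average of
    marginal contributions, restricted to subsets of the k-neighborhood N_k(i)."""
    Nk = sorted(neighbors(i))                 # N_k(i); must contain i itself
    others = [j for j in Nk if j != i]
    total = 0.0
    for t in range(len(others) + 1):
        for rest in combinations(others, t):
            T = set(rest) | {i}               # subsets T with i in T and T ⊆ N_k(i)
            m = importance_score(T) - importance_score(T - {i})
            total += m / comb(len(Nk) - 1, len(T) - 1)
    return total / len(Nk)
```

This needs on the order of 2^{|N_k(i)|} model evaluations per feature instead of 2^{d-1}, which stays small when the neighborhood order k is small.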

Slide 20

Slide 20 text

Proposed method 2: Connected-Shapley (C-Shapley)
• Definition 2: Given a model ℙ_m, a sample x, and a feature i, the C-Shapley estimate of order k on a graph G is given by (a sketch for a line graph follows below):
  \hat{\phi}^k_x(i) = \sum_{U \in \mathcal{C}_k(i)} \frac{2}{(|U|+2)(|U|+1)|U|} m_x(U, i)
  where \mathcal{C}_k(i) denotes the set of all subsets of N_k(i) that contain node i and whose nodes are connected in G.
• Original Shapley value, for comparison:
  \phi_x(i) = \sum_{S \ni i,\, S \subseteq [d]} \frac{1}{d \binom{d-1}{|S|-1}} m_x(S, i)
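When the graph is a line graph over text, the connected subsets of N_k(i) that contain i are exactly the contiguous windows of words covering position i, so C-Shapley becomes very cheap. A minimal sketch under that line-graph assumption, again with `importance_score(S)` standing in for v_x(S):

```python
def c_shapley_line_graph(importance_score, d, i, k):
    """C-Shapley estimate for feature i on a line graph with d nodes:
    sum over contiguous windows [l, r] that contain i and lie within N_k(i)."""
    total = 0.0
    for l in range(max(0, i - k), i + 1):
        for r in range(i, min(d - 1, i + k) + 1):
            U = set(range(l, r + 1))          # connected subset containing i
            m = importance_score(U) - importance_score(U - {i})
            n = len(U)
            total += 2.0 * m / ((n + 2) * (n + 1) * n)
    return total
```

On a line graph this requires only O(k^2) evaluations of v_x per feature, compared with roughly 2^{|N_k(i)|} for L-Shapley.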

Slide 21

Slide 21 text

Examples
• The left subset (blue and red) is summed over in L-Shapley, but not in C-Shapley.
• The right subset (blue and red) is summed over in both L-Shapley and C-Shapley.

Slide 22

Slide 22 text

Properties : Error between L-Shapley value and true Shapley value is upper-bounded. • S is the subset of k-nearest features of i. • is the sub-vector having k-nearest features of i. • is the sub-vector having features not included by S. XU XV

Slide 23

Slide 23 text

Properties : Error between C-Shapley value and true Shapley value is upper-bounded. • S is the subset of k-nearest features of i. • is the connected subset in S. U ∋ i

Slide 24

Slide 24 text

Experiments

Slide 25

Slide 25 text

Experiments: tasks and baselines
• Tasks: image classification and text classification
• Baselines: model-agnostic methods
  • KernelSHAP: a regression-based approximation of the Shapley value
  • SampleShapley: a random-sampling-based approximation of the Shapley value
  • LIME: a model-agnostic interpretation method that linearly approximates the black-box function around the target instance

Slide 26

Slide 26 text

Experiments: evaluation method
• Evaluation metric (a sketch follows below):
  • The change in the log-odds score of the predicted class before and after masking the top features ranked by importance scores, where masked words are replaced by zero paddings.
  • A larger decrease in log-odds means that the importance scores produced by an algorithm correctly capture the importance of the features.
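A minimal sketch of this metric, assuming `predict_proba` is the model's class-probability function and that masking a feature means zero-padding it, as described above; `ranked_features` is whatever ordering an interpretation method produces.

```python
import numpy as np

def log_odds_drop(predict_proba, x, ranked_features, n_mask):
    """Change in log-odds of the originally predicted class after masking
    the top n_mask features ranked by an interpretation method."""
    x = np.asarray(x, dtype=float)
    probs = predict_proba(x)
    c = int(np.argmax(probs))                       # predicted class on the full input

    def log_odds(p):
        p = float(np.clip(p, 1e-12, 1 - 1e-12))
        return np.log(p / (1 - p))

    x_masked = x.copy()
    x_masked[list(ranked_features[:n_mask])] = 0.0  # zero-pad the top-ranked features
    return log_odds(probs[c]) - log_odds(predict_proba(x_masked)[c])
```

A larger returned value means that masking those features changed the prediction more, i.e. the method ranked truly important features at the top.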

Slide 27

Slide 27 text

Experiment (1/3): text classification
• We study the performance on three neural models and three datasets:
  Dataset        | Task                     | Method              | Accuracy
  IMDB Review    | Sentiment classification | Word-based CNN      |
  AG news        | Category classification  | Character-based CNN |
  Yahoo! Answers | Category classification  | LSTM                |

Slide 28

Slide 28 text

Experiment (1/3): result
• On IMDB with the Word-CNN, the simplest model among the three, L-Shapley achieves the best performance, while LIME, KernelSHAP, and C-Shapley achieve slightly worse performance.
[Figure: comparison of methods; the arrow marks the better direction.]

Slide 29

Slide 29 text

Experiment (2/3): image classification
• Datasets:
  • A subset of MNIST: only "3" and "8" are included.
  • A subset of CIFAR-10: only deers and horses are included.

Slide 30

Slide 30 text

Experiment (2/3): result
• C-Shapley outperforms the other methods on both datasets.
[Figure: comparison of methods; the arrow marks the better direction.]

Slide 31

Slide 31 text

Experiment (2/3): examples of misclassified images
• The upper image is a "3" and the lower image is an "8"; they are misclassified as "8" and "3", respectively.
• The masked pixels are colored red if activated (white in the original image) and blue otherwise.
• The result seems to show the "reasoning" of the classifier.

Slide 32

Slide 32 text

Experiment (3/3): evaluation by human subjects (5 people)
• They use Amazon Mechanical Turk to compare L-Shapley, C-Shapley, and KernelSHAP on IMDB movie reviews (200 reviews).
• Experimental questions:
  • Are humans able to make a decision with the top words alone?
  • Are humans unable to make a decision when the top words are masked?
• Subjects are asked to classify the sentiment of texts into five categories: strongly positive (+2), positive (+1), neutral (0), negative (-1), strongly negative (-2).

Slide 33

Slide 33 text

Experiment (3/3): evaluation by human subjects (5 people)
• The texts come in three types:
  1. raw reviews
  2. the top 10 words of each review, ranked by L-Shapley, C-Shapley, or KernelSHAP
  3. reviews with the top words masked
• Masked words are chosen by L-Shapley, C-Shapley, or KernelSHAP until the probability score of the correct class produced by the model falls below 10%.
• Around 14.6% of the words in each review are masked for L-Shapley and C-Shapley, and 31.6% for KernelSHAP.

Slide 34

Slide 34 text

Experiment (3/3): evaluation by human subjects (5 people)
• Evaluation metrics:
  • Consistency (0 or 1) between the true label and the label given by human subjects.
  • The standard deviation of the scores on each review.
    • This serves as a measure of disagreement between humans.
  • The absolute value of the averaged score.
    • This serves as a measure of confidence in the decision.

Slide 35

Slide 35 text

Experiment (3/3): result
• Humans become more consistent and confident when they are presented with the top words. On the other hand, when the top words are masked, humans make mistakes more easily and are less certain.
• C-Shapley yields the highest performance in terms of consistency, agreement, and confidence.
• When the top words are masked, L-Shapley hurts human judgment the most among the three algorithms.

Slide 36

Slide 36 text

Conclusion
• They proposed two algorithms, L-Shapley and C-Shapley, for instance-wise and model-agnostic interpretation.
• They demonstrated the superior performance of these algorithms compared to other methods on black-box models in both text and image classification, using both quantitative metrics and human evaluation.