Integrated Gradients — Alibi 0.9.5 documentation

Note

To enable support for Integrated Gradients, you may need to run

pip install alibi[tensorflow]

Overview

Integrated gradients is a method originally proposed in Sundararajan et al., “Axiomatic Attribution for Deep Networks”, that attributes an importance value to each input feature of a machine learning model based on the gradients of the model output with respect to the input. In particular, integrated gradients defines an attribution value for each feature by considering the integral of the gradients taken along a straight path from a baseline instance \(x^\prime\) to the input instance \(x\).

Integrated gradients method

The method is applicable to regression and classification models. In the case of a non-scalar output, such as in classification models or multi-target regression, the gradients are calculated for one given element of the output. For classification models, this element usually corresponds to the true class or to the class predicted by the model.

Let us consider an input instance \(x\), a baseline instance \(x^\prime\) and a model \(M: X \rightarrow Y\) which acts on the feature space \(X\) and produces an output \(y\) in the output space \(Y\). Let us now define the function \(F\) as

\[F(x) = M_k(x),\]

where \(M_k(x)\) denotes the \(k\)-th element of the model output.

For example, in case of a \(K\)-class classification, \(M_k(x)\) is the probability of class \(k\), which could be the true class corresponding to \(x\) or the highest probability class predicted by the model. The attributions \(A_i(x, x^\prime)\) for each feature \(x_i\) with respect to the corresponding feature \(x_i^\prime\) in the baseline are calculated as

\[A_i(x, x^\prime) = (x_i - x_i^\prime) \int_0^1 \frac{\partial F(x^\prime + \alpha (x - x^\prime))}{\partial x_i} d\alpha,\]

where the integral is taken along a straight path from the baseline \(x^\prime\) to the instance \(x\) parameterized by the parameter \(\alpha\).
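
In practice, the integral is approximated numerically by accumulating gradients at a number of points along the path (alibi uses Gauss-Legendre quadrature by default, controlled by the n_steps parameter). The following toy sketch illustrates the computation with a simple Riemann sum and a hypothetical two-feature function \(F\) with a known analytic gradient, rather than a real model:

import numpy as np

def F(x):
    # toy scalar "model": F(x) = x0^2 + 3*x1
    return x[0] ** 2 + 3 * x[1]

def grad_F(x):
    # analytic gradient of the toy model
    return np.array([2 * x[0], 3.0])

def integrated_gradients(x, x_prime, n_steps=500):
    # Riemann-sum approximation of the path integral
    alphas = np.linspace(0.0, 1.0, n_steps)
    path = x_prime + alphas[:, None] * (x - x_prime)        # points on the straight path
    avg_grad = np.mean([grad_F(p) for p in path], axis=0)   # average gradient along the path
    return (x - x_prime) * avg_grad                         # elementwise attributions A_i

x = np.array([2.0, 1.0])
x_prime = np.zeros(2)
attributions = integrated_gradients(x, x_prime)
print(attributions)                           # approximately [4., 3.]
print(attributions.sum(), F(x) - F(x_prime))  # the sums match

Note that the attributions sum to \(F(x) - F(x^\prime)\); this is the completeness property listed below.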

It is shown that such attributions satisfy the following axioms:

Sensitivity: if the input \(x\) and the baseline \(x^\prime\) differ in a single feature and the model produces different outputs for them, that feature receives a non-zero attribution.

Implementation invariance: two functionally equivalent models (models that produce the same outputs for all inputs, regardless of their internal implementation) yield identical attributions.

Completeness: the attributions sum up to the difference \(F(x) - F(x^\prime)\) between the value of \(F\) at the input and at the baseline.

The proofs that integrated gradients satisfies these axioms are relatively straightforward and are discussed in Sections 2 and 3 of the original paper “Axiomatic Attribution for Deep Networks”.

Usage

The alibi implementation of the integrated gradients method is specific to TensorFlow and Keras models.

import tensorflow as tf
from alibi.explainers import IntegratedGradients

model = tf.keras.models.load_model("path_to_your_model")

ig = IntegratedGradients(model,
                         layer=None,
                         target_fn=None,
                         method="gausslegendre",
                         n_steps=50,
                         internal_batch_size=100)

explanation = ig.explain(X, baselines=None, target=None)

attributions = explanation.attributions
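
As a rough sketch (assuming a single-input model and a batch X of \(N\) instances), the returned attributions can be inspected as follows:

# attributions is a list with one array per model input; each array
# has the same shape as the corresponding input batch X
attrs = explanation.attributions[0]
print(attrs.shape)  # e.g. (N, H, W, C) for a batch of N images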

Example

If your model is a classifier outputting class probabilities (i.e. the predictions are \(N\times C\) arrays, where \(N\) is the batch size and \(C\) is the number of classes), then you can provide a target_fn to the constructor that selects, for each data point, the highest-probability class for which to calculate the attributions:

from functools import partial
import numpy as np

target_fn = partial(np.argmax, axis=1)
ig = IntegratedGradients(model=model, target_fn=target_fn)
explanation = ig.explain(X)

Alternatively, you can leave out target_fn and instead provide the predicted class labels directly to the explain method:

predictions = model.predict(X).argmax(axis=1)
ig = IntegratedGradients(model=model)
explanation = ig.explain(X, target=predictions)

Layer attributions

It is possible to calculate the integrated gradients attributions for the model input features or for the elements of an intermediate layer of the model. Specifically,

if the layer parameter is left to its default value None, the attributions are calculated with respect to the input features;

if an intermediate layer of the model is passed as layer, the attributions are calculated with respect to the elements of that layer.

Calculating attributions with respect to an internal layer of the model is particularly useful for models that take text as input and use word-to-vector embeddings: in this case, the integrated gradients are calculated with respect to the embedding layer (see the example on the IMDB dataset), as sketched below.
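
A minimal sketch, assuming model is a Keras text classifier whose first layer is a tf.keras.layers.Embedding, X is a batch of token-id sequences of shape \(N \times L\), and predictions are the predicted class labels:

ig = IntegratedGradients(model, layer=model.layers[0], n_steps=50)
explanation = ig.explain(X, target=predictions)

# the attributions follow the shape of the embedding output, (N, L, embedding_dim);
# summing over the embedding dimension yields a single score per token
token_attributions = explanation.attributions[0].sum(axis=-1)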

Baselines

Conceptually, baselines represent data points that do not carry any information useful for the model's task, and they serve as a reference point for the integrated gradients method. Common choices of baseline are data points with all feature values set to zero (for example, the black image in image classification) or set to random values.

However, the choice of baseline can have a significant impact on the attribution values. For example, consider a simple binary image classification task where a model is trained to predict whether a picture was taken at night or during the day. Here the black image would be a misleading baseline: since the attribution of a feature is proportional to the difference \(x_i - x_i^\prime\), every pixel that is dark in both the image and the baseline receives a near-zero attribution, even though dark pixels are likely to be highly informative for this task.

An extensive discussion about the impact of the baselines on integrated gradients attributions can be found in P. Sturmfels et al., “Visualizing the Impact of Feature Attribution Baselines”.
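
As an example, the following sketch passes a custom baseline to explain instead of the default zero baseline, here a hypothetical per-feature median computed from a training set X_train for a tabular model:

baselines = np.median(X_train, axis=0)           # shape matches a single instance
baselines = np.tile(baselines, (X.shape[0], 1))  # one baseline per instance to explain
explanation = ig.explain(X, baselines=baselines, target=predictions)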

Targets

In the context of integrated gradients, the target variable specifies which element of the output should be considered to calculate the attributions. If the output of the model is a scalar, as in the case of single target regression, a target is not necessary, and the gradients are calculated in a straightforward way.

If the output of the model is a vector, the target value specifies the position of the element in the output vector considered for the calculation of the attributions. In case of a classification model, the target can be either the true class or the class predicted by the model for a given input.
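
For instance (with y_true denoting a hypothetical array of integer class labels, one per instance in X):

# scalar-output regression: no target is needed
explanation = ig.explain(X)

# K-class classification: attribute each instance to its true class
explanation = ig.explain(X, target=y_true)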

Examples

MNIST dataset

Imagenet dataset

IMDB dataset text classification

Text classification using transformers