Integrated Gradients — Alibi 0.9.5 documentation

Note

To enable support for Integrated Gradients, you may need to run

pip install alibi[tensorflow]

Overview

Integrated gradients is a method originally proposed in Sundararajan et al., “Axiomatic Attribution for Deep Networks”, that attributes an importance value to each input feature of a machine learning model based on the gradients of the model output with respect to the input. In particular, integrated gradients defines an attribution value for each feature by considering the integral of the gradients taken along a straight path from a baseline instance \(x^\prime\) to the input instance \(x\).

Integrated gradients method

The method is applicable to regression and classification models. In the case of a non-scalar output, such as in classification models or multi-target regression, the gradients are calculated for one given element of the output. For classification models, this element usually corresponds to the true class or to the class predicted by the model.

Let us consider an input instance \(x\), a baseline instance \(x^\prime\) and a model \(M: X \rightarrow Y\) which acts on the feature space \(X\) and produces an output \(y\) in the output space \(Y\). Let us now define the function \(F\) as

\[F(x) = M_k(x),\]

where \(M_k(x)\) denotes the \(k\)-th element of the model output.

For example, in case of a \(K\)-class classification, \(M_k(x)\) is the probability of class \(k\), which could be the true class corresponding to \(x\) or the highest probability class predicted by the model. The attributions \(A_i(x, x^\prime)\) for each feature \(x_i\) with respect to the corresponding feature \(x_i^\prime\) in the baseline are calculated as

\[A_i(x, x^\prime) = (x_i - x_i^\prime) \int_0^1 \frac{\partial F(x^\prime + \alpha (x - x^\prime))}{\partial x_i} d\alpha,\]

where the integral is taken along a straight path from the baseline \(x^\prime\) to the instance \(x\) parameterized by the parameter \(\alpha\).
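
In practice, the integral is approximated numerically by accumulating gradients at a number of points along the path (alibi uses Gauss-Legendre quadrature by default, controlled by the n_steps parameter). The following toy sketch illustrates the computation with a simple Riemann sum and a hypothetical two-feature function \(F\) with a known analytic gradient, rather than a real model:

import numpy as np

def F(x):
    # toy scalar "model": F(x) = x0^2 + 3*x1
    return x[0] ** 2 + 3 * x[1]

def grad_F(x):
    # analytic gradient of the toy model
    return np.array([2 * x[0], 3.0])

def integrated_gradients(x, x_prime, n_steps=500):
    # Riemann-sum approximation of the path integral
    alphas = np.linspace(0.0, 1.0, n_steps)
    path = x_prime + alphas[:, None] * (x - x_prime)        # points on the straight path
    avg_grad = np.mean([grad_F(p) for p in path], axis=0)   # average gradient along the path
    return (x - x_prime) * avg_grad                         # elementwise attributions A_i

x = np.array([2.0, 1.0])
x_prime = np.zeros(2)
attributions = integrated_gradients(x, x_prime)
print(attributions)                           # approximately [4., 3.]
print(attributions.sum(), F(x) - F(x_prime))  # the sums match

Note that the attributions sum to \(F(x) - F(x^\prime)\); this is the completeness property listed below.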

It is shown that such attributions satisfy the following axioms:

Sensitivity: if the input \(x\) and the baseline \(x^\prime\) differ in a single feature and the model produces different outputs for them, that feature receives a non-zero attribution.

Implementation invariance: two functionally equivalent models (models that produce the same outputs for all inputs, regardless of their internal implementation) yield identical attributions.

Completeness: the attributions sum up to the difference \(F(x) - F(x^\prime)\) between the value of \(F\) at the input and at the baseline.

The proofs that integrated gradients satisfies these axioms are relatively straightforward and are discussed in Sections 2 and 3 of the original paper “Axiomatic Attribution for Deep Networks”.

Usage

The alibi implementation of the integrated gradients method is specific to TensorFlow and Keras models.

import tensorflow as tf
from alibi.explainers import IntegratedGradients

model = tf.keras.models.load_model("path_to_your_model")

ig = IntegratedGradients(model,
                         layer=None,
                         target_fn=None,
                         method="gausslegendre",
                         n_steps=50,
                         internal_batch_size=100)

explanation = ig.explain(X, baselines=None, target=None)

attributions = explanation.attributions
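
As a rough sketch (assuming a single-input model and a batch X of \(N\) instances), the returned attributions can be inspected as follows:

# attributions is a list with one array per model input; each array
# has the same shape as the corresponding input batch X
attrs = explanation.attributions[0]
print(attrs.shape)  # e.g. (N, H, W, C) for a batch of N images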

Example

If your model is a classifier outputting class probabilities (i.e. the predictions are \(N\times C\) arrays, where \(N\) is the batch size and \(C\) is the number of classes), then you can provide a target_fn to the constructor that selects, for each data point, the highest-probability class for which to calculate the attributions:

from functools import partial
import numpy as np

target_fn = partial(np.argmax, axis=1)
ig = IntegratedGradients(model=model, target_fn=target_fn)
explanation = ig.explain(X)

Alternatively, you can leave out target_fn and instead provide the predicted class labels directly to the explain method:

predictions = model.predict(X).argmax(axis=1)
ig = IntegratedGradients(model=model)
explanation = ig.explain(X, target=predictions)

Layer attributions

It is possible to calculate the integrated gradients attributions for the model input features or for the elements of an intermediate layer of the model. Specifically,

if the layer parameter is left to its default value None, the attributions are calculated with respect to the input features;

if an intermediate layer of the model is passed as layer, the attributions are calculated with respect to the elements of that layer.

Calculating attributions with respect to an internal layer of the model is particularly useful for models that take text as input and use word-to-vector embeddings: in this case, the integrated gradients are calculated with respect to the embedding layer (see the example on the IMDB dataset), as sketched below.
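
A minimal sketch, assuming model is a Keras text classifier whose first layer is a tf.keras.layers.Embedding, X is a batch of token-id sequences of shape \(N \times L\), and predictions are the predicted class labels:

ig = IntegratedGradients(model, layer=model.layers[0], n_steps=50)
explanation = ig.explain(X, target=predictions)

# the attributions follow the shape of the embedding output, (N, L, embedding_dim);
# summing over the embedding dimension yields a single score per token
token_attributions = explanation.attributions[0].sum(axis=-1)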

Baselines

Conceptually, baselines represent data points that do not carry any information useful for the model's task, and they serve as a reference point for the integrated gradients method. Common choices of baseline are data points with all feature values set to zero (for example, the black image in image classification) or set to random values.

However, the choice of baseline can have a significant impact on the attribution values. For example, consider a simple binary image classification task where a model is trained to predict whether a picture was taken at night or during the day. Here the black image would be a misleading baseline: since the attribution of a feature is proportional to the difference \(x_i - x_i^\prime\), every pixel that is dark in both the image and the baseline receives a near-zero attribution, even though dark pixels are likely to be highly informative for this task.

An extensive discussion about the impact of the baselines on integrated gradients attributions can be found in P. Sturmfels et al., “Visualizing the Impact of Feature Attribution Baselines”.
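
As an example, the following sketch passes a custom baseline to explain instead of the default zero baseline, here a hypothetical per-feature median computed from a training set X_train for a tabular model:

baselines = np.median(X_train, axis=0)           # shape matches a single instance
baselines = np.tile(baselines, (X.shape[0], 1))  # one baseline per instance to explain
explanation = ig.explain(X, baselines=baselines, target=predictions)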

Targets

In the context of integrated gradients, the target variable specifies which element of the output should be considered to calculate the attributions. If the output of the model is a scalar, as in the case of single target regression, a target is not necessary, and the gradients are calculated in a straightforward way.

If the output of the model is a vector, the target value specifies the position of the element in the output vector considered for the calculation of the attributions. In case of a classification model, the target can be either the true class or the class predicted by the model for a given input.
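
For instance (with y_true denoting a hypothetical array of integer class labels, one per instance in X):

# scalar-output regression: no target is needed
explanation = ig.explain(X)

# K-class classification: attribute each instance to its true class
explanation = ig.explain(X, target=y_true)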

Examples

MNIST dataset

Imagenet dataset

IMDB dataset text classification

Text classification using transformers