Partial Dependence Variance — Alibi 0.9.5 documentation (original) (raw)

Overview

Partial Dependence Variance is a method proposed by Greenwell et al. (2018)[1] to compute the global feature importance or the feature interaction of a pair of features. As the naming suggests, the feature importance and the feature interactions are summarized in a single positive number given by the variance within the Partial Dependence (PD)[2, 3] function. Because the computation relies on the PD, the method is quite intuitive and easy to comprehend.

To get a better intuition of what the proposed method tries to achieve, let us consider a simple example. Given a trained model on the Bike rental[4] dataset, one can compute the PD function for each individual feature. Figure 1 displays the PD for 6 out of 11 features:

PD plots Bike rental dataset.

Figure 1. PD plots for Bike rental datasets.*

From the inspection of the plots, we can observe that temperature (temp), humidity (hum), wind speed (windspeed) have a strong non-linear relationship with the predicted outcome. We can observe that the model prediction increases with temperature till it reaches approx \(17^{\circ}C\). Then it flattens at a high number until the weather becomes too hot (i.e., approx. \(27^{\circ}C\) ), after which it starts dropping again. The humidity larger than \(60\%\) seems to be a factor that inhibits the number of rentals, since we can observe a downward trend form that point onward. Finally, a similar analysis can be conducted for speed. As the wind speed increases, fewer and fewer people are riding the bike.

Quite noticeable are the plots in the second row which show a flat response. Naturally, although some heterogeneity can be hidden, one can assume that the features in the second row have a less impact on the model prediction than the others.

Given the arguments above, one can propose a notion of quantifying the importance of a feature based on a measure of flatness of the PD function, for which the variance represents a natural and straightforward candidate. Figure 2 displays the global feature importance for the given example using the variance of the PD function (left figure) and a model-internal method (i.e., based on impurity because it is a tree ensemble model) (right figure):

Feature importance comparison between the PD variance and impurity-based method.

Figure 2. Feature importance comparison between the PD variance (left) and impurity-based method (right).

As we can observe, the two methods agree on the top most salient features.

Advantages:

Drawbacks:

Usage

To initialize the explainer with any black-box model, one can directly pass the prediction function and optionally a list of feature names, a list of target names, and a dictionary of categorical names for interpretation and specification of the categorical features.

from alibi.explainers import PartialDependenceVariance

pd_variance = PartialDependenceVariance(predictor=predictor_fn, feature_names=feature_names, categorical_names=categorical_names, target_names=target_names)

Since the PartialDependenceVariance uses the PartialDependence explainer, it has support for some tree-based sklearn models directly, just by passing the model to the predictor argument (i.e., predictor=tree_predictor, where tree_predictor is a specific sklearn tree-based model). The rest of the initialization remains the same.

Following the initialization, we can compute two types of explanation for a given dataset X with F features.

The first type of explanation computes feature importance. To compute the feature importance one has to pass the argument method='importance' to the explain function. The call should look like:

exp_importance = pd_variance.explain(X=X, method='importance')

By default, the explainer will compute the feature importance for all F features in the dataset. The feature set for which to compute the importance can be customized through the argument features as will be presented later.

The second type of explanation computes feature interaction between pairs of features. To compute the feature interaction, one has to pass the argument method='interaction' to the explain function. The call should look like:

exp_interaction = pd_variance.explain(X, method='interaction')

By default, the explainer will compute the feature importance for all F x (F - 1) feature pair combinations from the dataset. As before, the pairs of feature to compute the feature importance for can be customized.

Multiple other arguments can be specified to the explain function:

The results exp is an Explanation object which contains the following data-related attributes:

Alibi exposes a utility function to plot a summary of the feature importance and feature interaction, or a more detailed exposition of them.

To plot a summary of the feature interaction, one can simply call the plot_pd_variance as follows:

from alibi.explainers import plot_pd_variance plot_pd_variance(exp=exp_importance)

Figure 3 displays the summary of the feature importance as a horizontal bar plot. By default, the features are sorted in descending order (top to bottom) according to their feature importance.

Feature importance summary.

Figure 3. Feature importance summary.

To plot a more detailed exposition of the feature importance, one should set the summarise=False flag in the plot_pd_variance function. The call should look like:

plot_pd_variance(exp=exp_importance, summarise=False)

Figure 4 displays the PD plots for the explained features. By default the plots are sorted in descending order (left to right, top to bottom) according to their feature importance:

Detailed feature importance plots.

Figure 4. Detailed feature importance plots.

To plot the summary for the feature interaction, we follow the same steps from above:

plot_pd_variance(exp=exp_interaction)

As before, in Figure 5, the feature interaction is plotted as a horizontal bar plot, sorted in descending order according to the feature interaction.

Feature interaction summary.

Figure 5. Feature interaction summary.

To plot the more detailed exposition of the feature interaction, we pass, as before, the flag summarise=False to the plot_pd_variance. It is recommended that the number of axes columns to be divisible by 3 for visualization purposes (see Figure 6).

plot_pd_variance(exp=exp_interaction, summarise=False, n_cols=3).

Feature interaction detailed.

Figure 6. Detailed feature interaction plots.

Note that in this case, for each feature pair, the plots display the 2-way PD function and the two conditional importance plots for each individual feature. By default, the three plots groups are sorted in descending order according to their feature interaction strength.

Theoretical exposition

We split the theoretical exposition in two parts, the first one covering the feature importance and the second one covering the feature interaction.

Feature importance

Following the notation from the Partial Dependence exposition, we say that any variable with a flat PD plot is likely to be less important than those for which the PD plot varies across a wider range. This notion of variable importance is based on a measure of the flatness of the PD function which can be generally stated as:

\[i(x_S) = F(f_{S}(x_S)),\]

where \(F(\cdot)\) is any measure of the “flatness” for the PD of the variables \(S\).

Greenwell et al. (2018)[1] proposed to measure the “flatness” as the sample standard deviation for continuous features and the range statistic divided by four for the categorical features. Note that the range divided by four is an estimate of the standard deviation for a small sample size. Formally, those statistics are defined as:

\begin{equation} i(x_1) = \begin{cases} \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}[f_{1}(x_{1i}) - \frac{1}{k}\sum_{j=1}^{k}f_{1}(x_{1i})]^2} \text{, if } x_{1} \text{ is continuous} \\ [\max_{i}(f_{1}(x_{1i})) - \min_{i}(f_{1}(x_{1i}))] / 4 \text{, if } {x_1} \text{ is categorical} \end{cases} \end{equation}

Connection to t-statistic

Although the choice of computing the variance of the PD can be motivated intuitively, one can justify it more rigorously by considering a linear model as follows:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon\]

where \(\beta_i\), \(i = 1, ..., p\) are the regression coefficients and \(\epsilon \sim \mathcal{N}(0, \sigma^2)\).

To test the significance of a regression coefficient for a least squares problem, one can apply the t-test. For that, one has to compute the t-statistic given by:

\[\text{t-statistic} = \frac{\hat\beta_i - \beta_{H_0}}{s.e.(\hat\beta_i)},\]

where \(\hat\beta_i\) is an estimate of \(\beta_i\), \(\beta_{H_0}\) is the value under the null hypothesis, and \(s.e.\) is the standard error.

If we set \(\beta_{H_0} = 0\), then the t-statistic is given by:

\[\text{t-statistic} = \frac{\hat\beta_i}{s.e.(\beta_i)}.\]

For completeness, we will provide a sketch of the derivation of the t-statistic for the least squares problem. Using the matrix notation, we can rewrite the least squares problem as follows:

\[\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \| Y - X \beta \|_{2}^2,\]

where \(Y\) is the target variable. The least squares solution of the equation above is given by:

\[\hat\beta = (X^TX)^{-1}X^TY.\]

Under the assumption that the true model is given by \(Y = X\beta + \epsilon\), we can infer the distribution of \(\hat\beta\):

\[\hat\beta = (X^TX)^{-1}X^T(X\beta + \epsilon) = \beta + (X^TX)^{-1}X^T\epsilon.\]

From the equation above, one can conclude that \(\hat\beta - \beta \sim \mathcal{N}(0, \sigma^2(X^TX)^{-1})\). Knowing that \(\hat\beta - \beta\) has a multivariate normal distribution, we can look at the diagonal entrance and obtain that \(\hat\beta_i - \beta_i \sim \mathcal{N}(0, \sigma^2 S_{ii})\), where \(S_{ii}\) is the i-th diagonal entrance on of the matrix \((X^TX)^{-1}\). The last statement implies that:

\[z_i = \frac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 S_{ii}}} \sim \mathcal{N}(0, 1).\]

Let us denote by \(\hat\epsilon = (I - X(X^TX)^{-1}X^T)Y\) the residuals, and \(s^2 = \frac{\hat\epsilon^T\hat\epsilon}{n-p}\) be an unbiased estimate of \(\sigma^2\). One can show that:

\[V = \frac{(n-p)s^2}{\sigma^2} \sim \chi_{n-p}^{2}.\]

Given that \(z_i \sim \mathcal{N}(0, 1)\) and \(V \sim \chi_{n-p}^2\), we can conclude that \(t_i = \frac{z_i}{\sqrt{V / (n -p)}}\) is characterized by a t-student distribution with \(n-p\) degrees of freedom. With some simple algebraic manipulation, one can show that \(t_i = \frac{\hat{\beta}_i - \beta_i}{s.e.(\hat{\beta}_i)}\) as follows:

\[t_i = \frac{\frac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 S_{ii}}}}{\sqrt{\frac{(n-p)s^2}{\sigma^2} / (n-p)}} = \frac{\frac{\hat\beta_i - \beta_i}{\sqrt{S_{ii}}}}{\sqrt{s^2}} = \frac{\hat\beta_i - \beta_i}{\sqrt{s^2 S_{ii}}} = \frac{\hat\beta_i - \beta_i}{s.e.(\hat\beta_i)}\]

For a more detailed derivation of the results above, see this page.

To see exactly the connection between the Partial Dependence Variance feature importance and the t-statistic, we consider the following example also presented in Greenwell et al. (2018)[1]. Consider the linear model:

\[Y = \hat\beta_0 + \hat\beta X_1 + \hat\beta X_2\]

where \(\hat\beta_0, \hat\beta_1, \hat\beta_2\) are constants, \(X_1\) and \(X_2\) are both independent \(\text{Uniform}[0, 1]\). Since the distributions of \(X_1\) and \(X_2\) are known, one can compute the exact PD \(f_i(X_i)\). For example, \(f_1(X_1) = \int_{0}^{1} \mathbb{E}[Y | X_1, X_2] p(X_2)dX_2\), where \(p(X_2) = 1\). After simple calculus, one obtains:

\[f_1(X_1) = \hat\beta_0 + \frac{\hat\beta_2}{2} + \hat\beta_1 X_1 \text{ and } f_2(X_2) = \hat\beta_0 + \frac{\hat\beta_1}{2} + \hat\beta_2 X_2.\]

Computing the variance for each PD function above, gives us:

\[\mathbb{V}[f_1(X_1)] = \frac{\hat\beta_{1}^{2}}{12} \text{ and } \mathbb{V}[f_2(X_2)] = \frac{\hat\beta_{2}^{2}}{12}\]

which implies that the standard deviation is given by \(\frac{\lvert \hat\beta_1 \rvert}{\sqrt{12}}\) and \(\frac{\vert \hat\beta_2 \rvert}{\sqrt{12}}\), respectively.

On the other hand, the two-tailed t-statistic is given by:

\[t_1 = \frac{\lvert \hat\beta_1 \rvert}{\sqrt{s^2 S_{11}}} = \frac{\lvert \hat\beta_1 \rvert}{\sqrt{12 s^2}} \text{ and } t_2 = \frac{\lvert \hat\beta_2 \rvert}{\sqrt{s^2 S_{22}}} = \frac{\lvert \hat\beta_2 \rvert}{\sqrt{12 s^2}}\]

which matches the Partial Dependence Variance up to a proportionality constant. Thus, the variance of the PD measures the significance of each regression coefficient and orders them accordingly. In other words, the most important features will correspond to the ones with the most significant p-values.

Feature interaction

Greenwell et al. (2018)[1] also proposed to use the PD to measure the feature interaction of two given features. Let \(SD(X_i, X_j)\), with \(i \neq j\), be the standard deviation of the joint PD values \(f_{ij}(X_{ii^\prime}, X_{jj^\prime})\), for \(i^\prime = 1, 2, ..., k_i\) and \(j^\prime = 1, 2, ..., k_j\). The intuition proposed by the authors is that a weak interaction effect between \(X_i\) and\(X_j\) on the response \(Y\) would suggest that importance \(i(X_i, X_j)\) has little variance when either \(X_i\) or \(X_j\) is held constant and the other varies.

The computation of the feature interaction is straightforward. Consider any two features \((X_i, X_j)\), \(i \neq j\). We construct the PD function \(f_{ij}(X_i, X_j)\) and compute the feature importance of \(X_i\) while keeping \(X_j\) constant, for all values of \(X_j\). We denote it by \(SD(X_i | X_j)\). Following that, we take the standard deviation of the resulting importance scores across all values of \(X_j\). We denote the latter quantity as\(i(X_i | X_j)\). Similarly, we compute \(i(X_j | X_i)\). To compute the feature interaction, one simply averages the two results. A large values will indicate a possible feature interaction.

Although the results reported by Greenwell et al. (2018)[1] seem encouraging, the authors do not offer a rigorous justification for their proposal which makes the method to appear rather heuristic. In the following paragraphs, we provide through concrete examples some arguments on why the method can capture feature interactions and which are some failure cases.

Consider the following function of two variables, \(f: [0, 1] \rightarrow \mathbb{R}\), \(f(X_1, X_2) = X_1 + X_2 + X_1 X_2\). Due to its simplistic form, one might be tempted to eyeball the decomposition of the function in three terms: a main effect of \(X_1\) given by \(f_1(X_1)\), a main effect of \(X_2\) given by \(f_2(X_2)\), and an interaction term between \((X_1, X_2)\) given by \(f_3(X_1, X_2) = X_1 X_2\). Although this can be a valid function decomposition within an axiomatic framework, it is not the case for the PD. This is because in the PD case the term \(X_1 X_2\) does not only contain a feature interaction between \(X_1\) and \(X_2\), but also contains a fraction of their main effects.

To understand why \(X_1 X_2\) contains also a fraction of the main effects of \(X_1\) and \(X_2\), we first provide an intuitive view of what the main effect consists when using the PD functions. Informally, one can think of the main effect of a feature in the PD context as how well w.r.t. the mean squared error (MSE) one can approximate \(f(X_1, X_2)\) by only having access to the feature \(X_1\). In our case we can analyze the three terms:

We now elaborate more on the third bullet point. To intuitively see why this is the case, let us fix the value \(X_1=0.2\) and inspect \(f_3(0.2, X_2)\) by varying \(X_2\) (see Figure 7(a)).

Conditional PD estimation steps.

Figure 7. Conditional PD estimation steps.

Having no access to \(X_2\), the best we can predict \(f_3(X_1, X_2)\) based only on \(X_1 = 0.2\) w.r.t. the MSE, is to predict the average response of \(f_3(0.2, X_2)\) over the marginal of \(X_2\). Formally, the value is given by \(\mathbb{E}_{X_2}[f_3(0.2, X_2)]\). This is depicted by the green line in Figure 7(b).

The residuals (i.e. what we cannot predict) are given by \(f_3(0.2, X_2) - \mathbb{E}[f_3(0.2, X_2)]\), displayed in red in Figure 7(c).

The residuals/errors can be attributed to the following:

Intuitively, we would conclude that we do not have any feature interaction between \(X_1\) and \(X_2\) if for any value of \(X_1 = c\), we obtain the same error patterns. In this case, the errors would only be a result from a fraction of the main effect of \(X_2\). Figure 8 displays the two scenarios, without and with feature interaction.

Conditional PD estimation error patterns.

Figure 8. Conditional PD estimation error patterns. The first row displays error patterns when there is no interaction (i.e., the interaction term is removed). The second row displays error pattern when there is interaction.

We now come back to Greenwell et al. (2018)[1] proposal on how to measure the feature interaction. For feature \(X_1\), the first step is to compute the standard deviation of the PD along the \(X_2\) axis when \(X_1\) is held constant. This is equivalent of computing the root mean squared error (i.e., corresponds to the red lines above). This first step can quantify whether there is some effect we are missing just by using \(X_1\), which we can attribute to either a fraction from the main effect of \(X_2\) or to the interaction between \(X_1\) and \(X_2\) (we still do not know to which one we should attribute). We repeat the same step for every value of \(X_1\). Having all the standard deviations for every value of \(X_{1i}\) (i.e., \(SD_{X_2}(X_2 | X_{1i})\), we compute the standard deviation along the \(X_1\), \(i(X_1 | X_2) = SD_{X_1}[SD_{X_2}(X_2 | X_{1})]\). This step is necessary (but not sufficient) to check whether the variance in the error is attributed to the fraction from the main effect of \(X_2\) or to the feature interaction between \(X_1\) and \(X_2\). Note that if the standard deviation \(i(X_1 | X_2) = SD_{X_1}[SD(X_2 | X_1)] = 0\), it means that we have the same variance along the \(X_2\) for every value \(X_{1i}\). This might happen for multiple reasons, but one reason can be if we encounter the same error pattern when we condition on every \(X_{1i}\). If that happens then it means that there is no feature interaction and all the error is attributed to a fraction of the main effect of \(X_2\).

One simple example for which the method fails to capture the feature interaction is for \(f:[-1, 1] \times [-1, 1] \rightarrow \mathbb{R}\), \(f(X_1, X_2) = \mathbb{1}(X_1 X_2 > 0)\). In this case the variance is the same when keeping \(X_1\) or \(X_2\) constant, but the error patterns differ depending on their sign. The PD and the conditional importances are displayed in Figure 9:

Feature interaction failure case.

Figure 9. Feature interaction failure case.

Although there is clear interaction between \(X_1\) and \(X_2\), the method fails to detect it because the variance along the \(X_1\) and \(X_2\) axis is the same.

Examples

Partial Dependence Variance, regression example (Friedman’s regression problem)

References

[1] Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. “A simple and effective model-based variable importance measure.” arXiv preprint arXiv:1805.04755 (2018).

[2] Friedman, Jerome H. “Greedy function approximation: a gradient boosting machine.” Annals of statistics (2001): 1189-1232.

[3] Molnar, Christoph. Interpretable machine learning. Lulu. com, 2020.

[4] Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

[5] Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.

[6] Friedman, Jerome H. “Greedy function approximation: a gradient boosting machine.” Annals of statistics (2001): 1189-1232.