eli5.lime — ELI5 0.15.0 documentation
eli5.lime.lime
An implementation of LIME (http://arxiv.org/abs/1602.04938), an algorithm to explain predictions of black-box models.
class TextExplainer(n_samples: int = 5000, char_based: bool | None = None, clf=None, vec=None, sampler: BaseSampler | None = None, position_dependent: bool = False, rbf_sigma: float | None = None, random_state=None, expand_factor: int | None = 10, token_pattern: str | None = None)[source]
TextExplainer allows you to explain predictions of black-box text classifiers using the LIME algorithm.
Parameters:
- n_samples (int) – Number of samples to generate and train on. Default is 5000.
With larger n_samples it takes more CPU time and RAM to explain a prediction, but it could give better results. Larger n_samples could also be required to get good results if you don’t want to make strong assumptions about the black-box classifier (e.g. char_based=True and position_dependent=True).
- char_based (bool) – True if the explanation should be char-based, False if it should be token-based. Default is False.
- clf (object, optional) – White-box probabilistic classifier. It should be supported by eli5, follow scikit-learn interface and provide predict_proba method. When not set, a default classifier is used (logistic regression with elasticnet regularization trained with SGD).
- vec (object, optional) – Vectorizer which converts generated texts to feature vectors for the white-box classifier. When not set, a default vectorizer is used; which one depends on the char_based and position_dependent arguments.
- sampler (MaskingTextSampler or MaskingTextSamplers, optional) – Sampler used to generate modified versions of the text.
- position_dependent (bool) – When True, a special vectorizer is used which takes each token or character (depending on the char_based value) into account separately. When False (default), a vectorizer passed in vec or a default vectorizer is used.
The default vectorizer converts text to a vector using a bag-of-ngrams or bag-of-char-ngrams approach (depending on the char_based argument). This means it may not be powerful enough to approximate a black-box classifier which, for example, takes into account the word FOO at the beginning of the document but not at the end.
When position_dependent is True the model becomes powerful enough to account for that, but it can become more noisy and require a larger n_samples to get an OK explanation.
When char_based=False the default vectorizer uses word bigrams in addition to unigrams; this is less powerful than position_dependent=True, but can give similar results in practice.
- rbf_sigma (float, optional) – Sigma parameter of the RBF kernel used to post-process cosine similarity values. Default is None, meaning no post-processing (cosine similarity is used as the sample weight as-is). Small rbf_sigma values (e.g. 0.1) tell the classifier to pay more attention to generated texts which are close to the original text. Large rbf_sigma values (e.g. 1.0) make the distance between texts irrelevant.
Note that if you’re using a large rbf_sigma it could be more efficient to use custom samplers instead, in order to generate text samples which are closer to the original text in the first place. Use e.g. the max_replace parameter of MaskingTextSampler.
- random_state (integer or numpy.random.RandomState, optional) – random state
- expand_factor (int or None) – To approximate the output of the probabilistic classifier, the generated dataset is expanded by expand_factor (10 by default) according to the predicted label probabilities. This is a workaround for a scikit-learn limitation (no cross-entropy loss for non-1/0 labels). With larger values training takes longer, but the probability output can be approximated better.
expand_factor=None turns this feature off; pass None when you know that the black-box classifier returns only 1.0 or 0.0 probabilities.
- token_pattern (str, optional) – Regex which matches a token. Use it to customize tokenization. The default value depends on the char_based parameter.
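A minimal sketch of constructing TextExplainer with some of the parameters above; the specific values are illustrative, not recommendations:

```python
from eli5.lime import TextExplainer

# Token-based explainer with default settings; a fixed random_state
# makes the sampling reproducible.
te = TextExplainer(random_state=42)

# A setup that makes fewer assumptions about the black-box model
# (char-based and position-dependent); such setups usually need a
# larger n_samples to give stable explanations.
te_char = TextExplainer(
    n_samples=20000,
    char_based=True,
    position_dependent=True,
    rbf_sigma=0.5,  # pay more attention to samples close to the original text
    random_state=42,
)
```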
rng_
Random state.
Type:
numpy.random.RandomState
samples_
A list of samples the local model is trained on. Only available after fit().
Type:
list[str]
X_
A matrix with vectorized samples_. Only available after fit().
Type:
ndarray or scipy.sparse matrix
similarity_
Similarity vector. Only available after fit().
Type:
ndarray
y_proba_
Probabilities predicted by the black-box classifier (the predict_proba(self.samples_) result). Only available after fit().
Type:
ndarray
clf_
Trained white-box classifier. Only available after fit().
Type:
object
vec_
Fitted white-box vectorizer. Only available after fit().
Type:
object
metrics_
A dictionary with metrics of how well the local classification pipeline approximates the black-box pipeline. Only available after fit().
Type:
dict
explain_prediction(**kwargs)[source]
Call eli5.explain_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_prediction().
fit() must be called before using this method.
explain_weights(**kwargs)[source]
Call eli5.explain_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.explain_weights().
fit() must be called before using this method.
fit(doc: str, predict_proba: Callable[[Any], Any]) → TextExplainer[source]
Explain the predict_proba probabilistic classification function for the doc example. This method fits a local classification pipeline following the LIME approach.
To get the explanation use show_prediction(), show_weights(), explain_prediction() or explain_weights().
Parameters:
- doc (str) – Text to explain
- predict_proba (callable) – Black-box classification pipeline. predict_proba should be a function which takes a list of strings (documents) and returns a matrix of shape (n_samples, n_classes) with probability values: a row per document and a column per output label.
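A minimal sketch of the predict_proba contract, using a toy stand-in for the black-box model (any callable with the same signature, e.g. a scikit-learn pipeline's predict_proba, works the same way):

```python
import numpy as np
from eli5.lime import TextExplainer

def black_box_predict_proba(docs):
    # Toy "black box": higher positive probability if the word "good" appears.
    # Takes a list of strings, returns an (n_samples, n_classes) matrix.
    p_pos = np.array([0.9 if 'good' in doc else 0.1 for doc in docs])
    return np.column_stack([1 - p_pos, p_pos])

te = TextExplainer(random_state=42)
te.fit('this movie is good', black_box_predict_proba)

print(te.metrics_)  # how well the white-box pipeline approximates the black box
te.show_prediction(target_names=['negative', 'positive'])  # in a Jupyter notebook
```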
set_fit_request(*, doc: bool | None | str = '$UNCHANGED$', predict_proba: bool | None | str = '$UNCHANGED$') → TextExplainer
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters:
- doc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the doc parameter in fit.
- predict_proba (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the predict_proba parameter in fit.
Returns:
self (object) – The updated object.
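A minimal sketch, assuming scikit-learn >= 1.3 with metadata routing enabled; outside of a meta-estimator the call has no practical effect:

```python
import sklearn
from eli5.lime import TextExplainer

sklearn.set_config(enable_metadata_routing=True)

# Ask meta-estimators to route `doc` and `predict_proba` to fit().
te = TextExplainer(random_state=42).set_fit_request(doc=True, predict_proba=True)
```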
show_prediction(**kwargs)[source]
Call eli5.show_prediction() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_prediction().
fit() must be called before using this method.
show_weights(**kwargs)[source]
Call eli5.show_weights() for the locally-fit classification pipeline. Keyword arguments are passed to eli5.show_weights().
fit() must be called before using this method.
eli5.lime.samplers
class BaseSampler[source]
Base sampler class. A sampler is an object which generates examples similar to a given example.
abstractmethod sample_near(doc, n_samples=1)[source]
Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.
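A minimal sketch of a custom sampler built on BaseSampler, assuming it is enough to implement sample_near(); the word-dropping strategy and similarity measure here are purely illustrative:

```python
import numpy as np
from eli5.lime.samplers import BaseSampler

class DropTrailingWordsSampler(BaseSampler):
    """Illustrative sampler: keep a random-length prefix of the document."""

    def sample_near(self, doc, n_samples=1):
        words = doc.split()
        rng = np.random.RandomState(0)
        docs, similarity = [], []
        for _ in range(n_samples):
            n_keep = rng.randint(1, max(len(words), 1) + 1)
            docs.append(' '.join(words[:n_keep]))
            similarity.append(n_keep / max(len(words), 1))  # crude: fraction of words kept
        return docs, np.array(similarity)
```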
class MaskingTextSampler(token_pattern: str | None = None, bow: bool = True, random_state=None, replacement: str = '', min_replace: int | float = 1, max_replace: int | float = 1.0, group_size: int = 1)[source]
Sampler for text data. It randomly removes or replaces tokens from text.
Parameters:
- token_pattern (str, optional) – Regexp for token matching
- bow (bool, optional) – Sampler could either replace all instances of a given token (bow=True, bag of words sampling) or replace just a single token (bow=False).
- random_state (integer or numpy.random.RandomState, optional) – random state
- replacement (str) – Default value is ‘’, i.e. by default tokens are removed. If you want to preserve the total token count, set replacement to a non-empty string, e.g. ‘UNKN’.
- min_replace (int or float) – The minimum number of tokens to replace. Default is 1, meaning 1 token. If this value is a float in the range [0.0, 1.0], it is used as a ratio. More than min_replace tokens could be replaced if group_size > 1.
- max_replace (int or float) – The maximum number of tokens to replace. Default is 1.0, meaning all tokens can be replaced. If this value is a float in the range [0.0, 1.0], it is used as a ratio.
- group_size (int) – When group_size > 1, groups of nearby tokens are replaced all at once (each token is still replaced with a replacement). Default is 1, meaning individual tokens are replaced.
sample_near(doc: str, n_samples: int = 1) → tuple[list[str], ndarray][source]
Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.
sample_near_with_mask(doc: TokenizedText | str, n_samples: int = 1) → tuple[list[str], ndarray, ndarray, TokenizedText][source]
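A minimal sketch of MaskingTextSampler usage; the exact generated variants depend on the random state:

```python
from eli5.lime.samplers import MaskingTextSampler

sampler = MaskingTextSampler(
    replacement='UNKN',  # keep the total token count instead of deleting tokens
    max_replace=0.5,     # replace at most half of the tokens
    random_state=42,
)
docs, similarity = sampler.sample_near('the quick brown fox jumps', n_samples=3)
# docs: variants such as 'the UNKN brown fox UNKN'
# similarity: similarity of each variant to the original text
```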
class MaskingTextSamplers(sampler_params: list[dict[str, Any]], token_pattern: str | None = None, random_state=None, weights: ndarray | list[float] | None = None)[source]
Union of MaskingTextSampler objects, with weights. sample_near() or sample_near_with_mask() generate a requested number of samples using all samplers; the probability of using a sampler is proportional to its weight.
All samplers must use the same token_pattern in order for sample_near_with_mask() to work.
Create it with a list of {param: value} dicts with MaskingTextSampler parameters.
sample_near(doc: str, n_samples: int = 1) → tuple[list[str], ndarray][source]
Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.
sample_near_with_mask(doc: str, n_samples: int = 1) → tuple[list[str], ndarray, ndarray, TokenizedText][source]
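A minimal sketch of combining several samplers; the parameter dicts and weights below are illustrative, not the defaults used by TextExplainer:

```python
from eli5.lime.samplers import MaskingTextSamplers

sampler = MaskingTextSamplers(
    sampler_params=[
        {'bow': True, 'max_replace': 0.3},   # bag-of-words masking, mild
        {'bow': False, 'max_replace': 0.7},  # positional masking, more aggressive
    ],
    weights=[0.7, 0.3],  # how often each sampler is picked
    random_state=42,
)
docs, similarity = sampler.sample_near('the quick brown fox jumps', n_samples=5)
```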
class MultivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)[source]
General-purpose sampler for dense continuous data, based on multivariate kernel density estimation.
The limitation is that a single bandwidth value is used for all dimensions, i.e. the bandwidth matrix is a positive scalar times the identity matrix. This is a problem e.g. when features have different variances (e.g. some of them are one-hot encoded and others are continuous).
sample_near(doc, n_samples=1)[source]
Return (examples, similarity) tuple with generated documents similar to a given document and a vector of similarity values.
class UnivariateKernelDensitySampler(kde=None, metric='euclidean', fit_bandwidth=True, bandwidths=array([1.00000000e-06, 1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02, 1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01, 1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03, 1.00000000e+04]), sigma='bandwidth', n_jobs=1, random_state=None)[source]
General-purpose sampler for dense continuous data, based on univariate kernel density estimation. It estimates a separate probability distribution for each input dimension.
The limitation is that variable interactions are not taken into account.
Unlike MultivariateKernelDensitySampler it uses different bandwidths for different dimensions; because of that it can handle one-hot encoded features to some extent (make sure to at least tune the default sigma parameter). Also, at sampling time it replaces only random subsets of the features instead of generating totally new examples.
sample_near(doc, n_samples=1)[source]
Sample near the document by replacing some of its features with values sampled from the distribution found by KDE.
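A minimal sketch, assuming the KDE samplers follow the usual fit() / sample_near() protocol of BaseSampler and accept a dense 2D array:

```python
import numpy as np
from eli5.lime.samplers import UnivariateKernelDensitySampler

# Toy dense dataset with two continuous features.
X = np.random.RandomState(0).normal(size=(200, 2))

sampler = UnivariateKernelDensitySampler(random_state=42)
sampler.fit(X)  # fit one KDE per feature
samples, similarity = sampler.sample_near(X[0], n_samples=10)
```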
eli5.lime.textutils
Utilities for text generation.
cosine_similarity_vec(num_tokens, num_removed_vec)[source]
Return cosine similarity between a binary vector with all ones of length num_tokens and vectors of the same length with num_removed_vec elements set to zero.
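With n = num_tokens and r removed tokens, the dot product is n - r and the vector norms are sqrt(n) and sqrt(n - r), so the similarity should reduce to sqrt((n - r) / n). A quick check of that closed form (the numbers are illustrative):

```python
import numpy as np
from eli5.lime.textutils import cosine_similarity_vec

n_tokens = 10
n_removed = np.array([0, 2, 5])

expected = np.sqrt((n_tokens - n_removed) / n_tokens)  # [1.0, ~0.894, ~0.707]
print(cosine_similarity_vec(n_tokens, n_removed))
print(expected)
```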
generate_samples(text: TokenizedText, n_samples=500, bow=True, random_state=None, replacement='', min_replace=1.0, max_replace=1.0, group_size=1) → Tuple[List[str], ndarray, ndarray][source]
Return n_samples changed versions of the text (with some words removed), along with distances between the original text and the generated examples. If bow=False, all tokens are considered unique (i.e. token position matters).
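A minimal sketch, assuming TokenizedText from this module is the expected wrapper for the input text; only the first two elements of the returned tuple are used here:

```python
from eli5.lime.textutils import TokenizedText, generate_samples

text = TokenizedText('the quick brown fox jumps over the lazy dog')
result = generate_samples(text, n_samples=5, max_replace=0.5, random_state=42)
docs, similarity = result[0], result[1]  # generated variants and their similarities
```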