f_regression
sklearn.feature_selection.f_regression(X, y, *, center=True, force_finite=True)
Univariate linear regression tests returning F-statistic and p-values.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors.
This is done in 2 steps:
- The cross correlation between each regressor and the target is computed using r_regression as:
E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y))
- It is converted to an F score and then to a p-value.
f_regression is derived from r_regression and will rank features in the same order if all the features are positively correlated with the target.
Note however that, contrary to f_regression, r_regression values lie in [-1, 1] and can thus be negative. f_regression is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
Furthermore, f_regression returns p-values while r_regression does not.
Read more in the User Guide.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The data matrix.
y : array-like of shape (n_samples,)
The target vector.
center : bool, default=True
Whether or not to center the data matrix X and the target vector y. By default, X and y will be centered.
force_finite : bool, default=True
Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:
- when the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
- when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max and the associated p-value is set to 0.0.
Added in version 1.1.
Returns:
f_statistic : ndarray of shape (n_features,)
F-statistic for each feature.
p_values : ndarray of shape (n_features,)
P-values associated with the F-statistic.
See also
r_regression : Pearson’s R between label/feature for regression tasks.
f_classif : ANOVA F-value between label/feature for classification tasks.
chi2 : Chi-squared stats of non-negative features for classification tasks.
SelectKBest : Select features based on the k highest scores.
SelectFpr : Select features based on a false positive rate test.
SelectFdr : Select features based on an estimated false discovery rate.
SelectFwe : Select features based on family-wise error rate.
SelectPercentile : Select features based on percentile of the highest scores.
Examples
>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import f_regression
>>> X, y = make_regression(
...     n_samples=50, n_features=3, n_informative=1, noise=1e-4, random_state=42
... )
>>> f_statistic, p_values = f_regression(X, y)
>>> f_statistic
array([1.2...+00, 2.6...+13, 2.6...+00])
>>> p_values
array([2.7..., 1.5..., 1.0...])
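In practice the scores are often consumed through a selector that keeps the k highest-scoring features. The following is a minimal sketch of that pattern using SelectKBest with f_regression as the scoring function. The dataset sizes and the choice of k=3 are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 10 features, of which 3 carry signal (illustrative sizes).
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# Keep the 3 features with the highest F-statistics.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # number of rows is unchanged, 3 columns remain
print(selector.get_support(indices=True))  # indices of the retained columns
print(selector.scores_)                    # F-statistic per original feature
print(selector.pvalues_)                   # matching p-values
```

The fitted selector exposes the same two arrays that f_regression returns, as scores_ and pvalues_, so they can be inspected after selection.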