Module pyqt_fit.kde — PyQt-Fit 1.3.3 documentation
\[\DeclareMathOperator{\erf}{erf} \DeclareMathOperator{\argmin}{argmin} \newcommand{\R}{\mathbb{R}} \newcommand{\n}{\boldsymbol{n}}\]
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module implementing kernel-based estimation of probability densities.
Given a kernel \(K\), the density function is estimated from a sample \(X = \{X_i \in \mathbb{R}^n\}_{i\in\{1,\ldots,m\}}\) as:
\[f(\mathbf{z}) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} K\left(\frac{X_i-\mathbf{z}}{h\lambda_i}\right)\]\[W = \sum_{i=1}^m w_i\]
where \(h\) is the bandwidth of the kernel, \(w_i\) are the weights of the data points and \(\lambda_i\) are the adaptation factors of the kernel width.
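As an illustration of the estimator above in one dimension, the following sketch evaluates the sum directly with a Gaussian kernel. It is not the pyqt_fit implementation: the names gaussian_kernel and kde_1d are hypothetical, and unit weights and adaptation factors are assumed when none are given.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: integrates to 1, zero mean, unit variance."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(z, xs, h, weights=None, lambdas=None):
    """Evaluate f(z) = 1/(h W) * sum_i (w_i / lambda_i) * K((X_i - z) / (h lambda_i))."""
    m = len(xs)
    weights = [1.0] * m if weights is None else weights
    lambdas = [1.0] * m if lambdas is None else lambdas
    total_weight = sum(weights)  # W in the formula
    acc = 0.0
    for x_i, w_i, lam_i in zip(xs, weights, lambdas):
        acc += w_i / lam_i * gaussian_kernel((x_i - z) / (h * lam_i))
    return acc / (h * total_weight)

# Riemann-sum check that the estimate integrates to about 1 over [-10, 14].
xs = [1.0, 2.0, 3.5]
step = 0.01
area = sum(kde_1d(-10.0 + k * step, xs, h=0.5) * step for k in range(2401))
```

Dividing each term by \(\lambda_i\) and the whole sum by \(hW\) is what keeps the estimate normalised when the per-point widths differ.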
The kernel is a function of \(\mathbb{R}^n\) such that:
\[\begin{split}\begin{array}{rclcl} \idotsint_{\mathbb{R}^n} f(\mathbf{z}) d\mathbf{z} & = & 1 & \Longleftrightarrow & \text{$f$ is a probability}\\ \idotsint_{\mathbb{R}^n} \mathbf{z}f(\mathbf{z}) d\mathbf{z} &=& \mathbf{0} & \Longleftrightarrow & \text{$f$ is centered}\\ \forall \mathbf{u}\in\mathbb{R}^n, \|\mathbf{u}\| = 1\qquad\int_{\mathbb{R}} t^2f(t \mathbf{u}) dt &\approx& 1 & \Longleftrightarrow & \text{The covariance matrix of $f$ is close to the identity.} \end{array}\end{split}\]
The constraint on the covariance is only required to provide a uniform meaning for the bandwidth of the kernel.
If the domain of the density estimation is bounded to the interval \([L,U]\), the density is then estimated with:
\[f(x) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} \hat{K}(x;X,\lambda_i h,L,U)\]
where \(\hat{K}\) is a modified kernel that depends on the exact method used. Currently, only 1D KDE supports bounded domains.
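To make the idea of a modified kernel concrete, here is a minimal sketch of one common choice, boundary reflection, where each kernel is complemented by its mirror images about \(L\) and \(U\). This is an illustration under that assumption, not a description of what any particular pyqt_fit method does; reflected_kernel and bounded_kde_1d are hypothetical names, and unit weights and adaptation factors are assumed.

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def reflected_kernel(x, x_i, bw, lower, upper):
    """Kernel centred on x_i plus its mirror images about each finite bound."""
    if x < lower or x > upper:
        return 0.0  # no density outside the domain
    value = gaussian_kernel((x - x_i) / bw)
    if math.isfinite(lower):
        value += gaussian_kernel((x - (2.0 * lower - x_i)) / bw)
    if math.isfinite(upper):
        value += gaussian_kernel((x - (2.0 * upper - x_i)) / bw)
    return value

def bounded_kde_1d(x, xs, h, lower, upper):
    """Bounded estimate with w_i = 1 and lambda_i = 1 for every point."""
    total = sum(reflected_kernel(x, x_i, h, lower, upper) for x_i in xs)
    return total / (h * len(xs))

# Midpoint-rule check: the mass on [0, 10] should be close to 1.
xs = [0.2, 0.5, 1.0]
step = 0.01
area = sum(bounded_kde_1d((k + 0.5) * step, xs, 0.5, 0.0, 10.0) * step
           for k in range(1000))
```

A single reflection per bound is only accurate when the bandwidth is small compared with \(U - L\), which is the case in this check.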
Kernel Density Estimation Methods
class pyqt_fit.kde.KDE1D(xdata, **kwords)
Perform a kernel-based density estimation in 1D, possibly on a bounded domain \([L,U]\).
Parameters:
- xdata (ndarray) – 1D array with the data points
- kwords (dict) – attributes to set at construction time. Any named argument is equivalent to setting the property after the fact. For example:

      >>> xs = [1,2,3]
      >>> k = KDE1D(xs, lower=0)

  is equivalent to:

      >>> k = KDE1D(xs)
      >>> k.lower = 0
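The keyword-to-property equivalence described above can be sketched with a small stand-in class. TinyKDE is hypothetical and models only this pattern (plus a deleter that resets the bound to \(-\infty\)); it says nothing about how KDE1D is actually written.

```python
import math

class TinyKDE:
    """Hypothetical stand-in showing the keyword-to-property pattern only."""

    def __init__(self, xdata, **kwords):
        self.xdata = list(xdata)
        self._lower = -math.inf
        for name, value in kwords.items():
            # Reject unknown names instead of silently creating attributes.
            if not hasattr(type(self), name):
                raise AttributeError("unknown property: %s" % name)
            setattr(self, name, value)

    @property
    def lower(self):
        return self._lower

    @lower.setter
    def lower(self, value):
        self._lower = float(value)

    @lower.deleter
    def lower(self):
        # Deleting the bound restores the unbounded default.
        self._lower = -math.inf

a = TinyKDE([1, 2, 3], lower=0)   # set at construction time
b = TinyKDE([1, 2, 3])
b.lower = 0                       # set after the fact
```

Routing keyword arguments through setattr means any validation in the property setter applies equally to both ways of configuring the object.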
The calculation is separated into three parts:
- The kernel (kernel)
- The bandwidth or covariance estimation (bandwidth, covariance)
- The estimation method (method)
__call__(points, out=None)
This method is an alias for KDE1D.evaluate().
bandwidth
Bandwidth of the kernel. It can be set either as a fixed value or as a bandwidth calculator, that is, a function of signature w(xdata) that returns a single value.
Note
An ndarray with a single value will be converted to a floating point value.
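The two accepted forms (a fixed value, or a calculator called with the data) can be resolved to one number along these lines. This is a sketch only: resolve_bandwidth and half_range are hypothetical names, and a plain list stands in for a single-value ndarray.

```python
def resolve_bandwidth(bandwidth, xdata):
    """Return a float from either a fixed value or a calculator w(xdata)."""
    if callable(bandwidth):
        bandwidth = bandwidth(xdata)
    if hasattr(bandwidth, "__len__"):
        # Mirror the note above: a one-element container becomes a scalar.
        if len(bandwidth) != 1:
            raise ValueError("expected a single bandwidth value")
        bandwidth = bandwidth[0]
    return float(bandwidth)

def half_range(xdata):
    """Hypothetical calculator, chosen only to have something to call."""
    return (max(xdata) - min(xdata)) / 2.0

fixed = resolve_bandwidth(0.3, [1.0, 2.0, 5.0])
wrapped = resolve_bandwidth([0.3], [1.0, 2.0, 5.0])
computed = resolve_bandwidth(half_range, [1.0, 2.0, 5.0])
```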
cdf_grid(N=None, cut=None)
Compute the cdf from the lower bound to the points given as argument.
closed
Returns true if the density domain is closed (i.e. lower and upper are both finite).
copy()
Shallow copy of the KDE object.
covariance
Covariance of the Gaussian kernel. It can be set either as a fixed value or as a bandwidth calculator, that is, a function of signature w(xdata) that returns a single value.
Note
An ndarray with a single value will be converted to a floating point value.
evaluate(points, out=None)
Compute the PDF of the distribution at each value in points.
fit()
Compute the various parameters needed by the KDE method.
grid(N=None, cut=None)
Evaluate the density on a grid of N points spanning the whole dataset.
Returns: a tuple with the mesh on which the density is evaluated and the density itself
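A grid evaluation of this kind can be sketched as follows. The function names are hypothetical, the kernel is assumed Gaussian with unit weights, and the meaning given to cut here (extend the mesh by cut times the bandwidth beyond the extreme data points) is an assumption for the illustration rather than the documented behaviour.

```python
import math

def gaussian_kde(x, xs, h):
    """Unweighted 1D Gaussian kernel density estimate at x."""
    norm = h * len(xs) * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - x_i) / h) ** 2) for x_i in xs) / norm

def density_grid(xs, h, n_points=64, cut=3.0):
    """Return (mesh, density) on n_points evenly spaced values."""
    lo = min(xs) - cut * h   # assumed role of `cut`
    hi = max(xs) + cut * h
    step = (hi - lo) / (n_points - 1)
    mesh = [lo + k * step for k in range(n_points)]
    density = [gaussian_kde(x, xs, h) for x in mesh]
    return mesh, density

mesh, density = density_grid([1.0, 2.0, 3.5], h=0.5, n_points=101)

# Trapezoid rule over the mesh; slightly below 1 because the tails are cut.
area = sum(0.5 * (density[k] + density[k - 1]) * (mesh[k] - mesh[k - 1])
           for k in range(1, len(mesh)))
```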
icdf_grid(N=None, cut=None)
Compute the inverse cumulative distribution (quantile) function on a grid.
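One way to obtain a quantile function from gridded values is to accumulate the density into a CDF and invert it by linear interpolation. The sketch below does this on an arbitrary (mesh, density) pair; cdf_from_grid and icdf_from_grid are hypothetical helpers, and the trapezoid rule and the final renormalisation are choices made for the illustration.

```python
from bisect import bisect_left

def cdf_from_grid(mesh, density):
    """Cumulative trapezoid integral of the density, rescaled to end at 1."""
    cdf = [0.0]
    for k in range(1, len(mesh)):
        width = mesh[k] - mesh[k - 1]
        cdf.append(cdf[-1] + 0.5 * (density[k] + density[k - 1]) * width)
    total = cdf[-1]
    return [c / total for c in cdf]

def icdf_from_grid(mesh, cdf, p):
    """Quantile for probability p, by linear interpolation between grid nodes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    k = bisect_left(cdf, p)       # first node whose CDF value reaches p
    if k == 0:
        return mesh[0]
    c0, c1 = cdf[k - 1], cdf[k]   # c0 < p <= c1, so c1 - c0 > 0
    t = (p - c0) / (c1 - c0)
    return mesh[k - 1] + t * (mesh[k] - mesh[k - 1])

# Uniform density on [0, 2]: the median is 1 and the 10% quantile is 0.2.
mesh = [0.0, 0.5, 1.0, 1.5, 2.0]
density = [0.5] * 5
cdf = cdf_from_grid(mesh, density)
median = icdf_from_grid(mesh, cdf, 0.5)
```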
kernel
Kernel object. This must be an object modeled on pyqt_fit.kernels.Kernel1D. Inheriting from this class is recommended, as it provides numerical approximations for all methods.
By default, the kernel is an instance of pyqt_fit.kernels.normal_kernel1d.
lambdas
Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.
When deleted, the lambdas are reset to 1.
lower
Lower bound of the density domain. If deleted, it is set to \(-\infty\).
method
Select the method to use. The method should be an object modeled on pyqt_fit.kde_methods.KDE1DMethod, and inheriting from that class is recommended.
The available methods are in the pyqt_fit.kde_methods sub-module.
Default: pyqt_fit.kde_methods.default_method
upper
Upper bound of the density domain. If deleted, it is set to \(\infty\).
weights
Weights associated with each data point. It can be either a single value, or an array with a value per data point. If a single value is provided, the weights will always be set to 1.
Bandwidth Estimation Methods
pyqt_fit.kde.variance_bandwidth(factor, xdata)
Returns the covariance matrix:
\[\mathcal{C} = \tau^2 cov(X)\]
where \(\tau\) is a correcting factor that depends on the method.
pyqt_fit.kde.silverman_covariance(xdata, model=None)
The Silverman bandwidth is defined as a variance bandwidth with the following factor, where \(n\) is the number of data points and \(d\) their dimension:
\[\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}\]
pyqt_fit.kde.scotts_covariance(xdata, model=None)
The Scott bandwidth is defined as a variance bandwidth with factor:
\[\tau = n^\frac{-1}{d+4}\]
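As a worked example of the formulas in this section for 1D data, the sketch below computes both factors and the resulting scaled variance \(\tau^2 cov(X)\). The function names are hypothetical, and the use of the unbiased sample variance (statistics.variance) for \(cov(X)\) is an assumption of this illustration.

```python
import statistics

def silverman_factor(n, d=1):
    """tau = (n (d + 2) / 4) ** (-1 / (d + 4))"""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

def scott_factor(n, d=1):
    """tau = n ** (-1 / (d + 4))"""
    return n ** (-1.0 / (d + 4))

def variance_covariance_1d(factor, xdata):
    """C = tau^2 * cov(X), with cov(X) taken as the sample variance (assumption)."""
    return factor ** 2 * statistics.variance(xdata)

xdata = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(xdata)
scott_cov = variance_covariance_1d(scott_factor(n), xdata)
silverman_cov = variance_covariance_1d(silverman_factor(n), xdata)
```

For \(d = 1\) the two factors differ only by the constant \((3/4)^{-1/5}\), so the Silverman covariance is always slightly larger than the Scott one on the same data.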
pyqt_fit.kde.botev_bandwidth(N=None, **kword)
Implementation of the KDE bandwidth selection method outlined in:
Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.
Based on the implementation of Daniel B. Smith, PhD.
The object is a callable returning the bandwidth for a 1D kernel.