Module pyqt_fit.kde — PyQt-Fit 1.3.3 documentation

Author:	Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>

Module implementing kernel-based estimation of probability densities.

Given a kernel $K$, the density function is estimated from a sample $X = \{X_i \in \mathbb{R}^n\}_{i\in\{1,\ldots,m\}}$ as:

\[f(\mathbf{z}) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} K\left(\frac{X_i-\mathbf{z}}{h\lambda_i}\right)\]

\[W = \sum_{i=1}^m w_i\]

where $h$ is the bandwidth of the kernel, $w_i$ are the weights of the data points and $\lambda_i$ are the adaptation factors of the kernel width.
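As an illustration of the estimator above, here is a minimal pure-Python sketch for the 1D case. It is not the library's implementation: the function names `gaussian_kernel` and `kde_1d` are hypothetical, and a standard normal density is assumed for $K$.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, assumed here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(z, xs, h, weights=None, lambdas=None):
    """Evaluate f(z) = 1/(h W) * sum_i (w_i / lambda_i) * K((X_i - z) / (h lambda_i))."""
    m = len(xs)
    weights = [1.0] * m if weights is None else weights
    lambdas = [1.0] * m if lambdas is None else lambdas
    total_weight = sum(weights)  # W in the formula
    acc = 0.0
    for x_i, w_i, lam_i in zip(xs, weights, lambdas):
        acc += w_i / lam_i * gaussian_kernel((x_i - z) / (h * lam_i))
    return acc / (h * total_weight)

xs = [1.0, 2.0, 3.0]
density_at_2 = kde_1d(2.0, xs, h=0.5)
```

Because the sum is divided by $W$, multiplying every weight by the same constant leaves the estimate unchanged.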

The kernel $K$ is a function on $\mathbb{R}^n$ such that:

\[\begin{split}\begin{array}{rclcl} \idotsint_{\mathbb{R}^n} K(\mathbf{z}) d\mathbf{z} & = & 1 & \Longleftrightarrow & \text{$K$ is a probability density}\\ \idotsint_{\mathbb{R}^n} \mathbf{z}K(\mathbf{z}) d\mathbf{z} &=& \mathbf{0} & \Longleftrightarrow & \text{$K$ is centered}\\ \forall \mathbf{u}\in\mathbb{R}^n, \|\mathbf{u}\| = 1\qquad\int_{\mathbb{R}} t^2K(t \mathbf{u}) dt &\approx& 1 & \Longleftrightarrow & \text{the covariance matrix of $K$ is close to the identity} \end{array}\end{split}\]

The constraint on the covariance is only required to provide a uniform meaning for the bandwidth of the kernel.
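For a concrete 1D kernel, the three conditions above can be checked numerically. The sketch below assumes a standard normal kernel and uses a hand-written trapezoidal rule; `normal_kernel` and `integrate` are illustrative helpers, not part of the library.

```python
import math

def normal_kernel(t):
    """Standard normal density, used as the candidate kernel."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def integrate(func, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * dx)
    return total * dx

# Integrate over [-10, 10]; the tails beyond that range are negligible.
mass = integrate(normal_kernel, -10.0, 10.0)                              # should be close to 1
mean = integrate(lambda t: t * normal_kernel(t), -10.0, 10.0)             # should be close to 0
second_moment = integrate(lambda t: t * t * normal_kernel(t), -10.0, 10.0)  # should be close to 1
```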

If the domain of the density estimation is bounded to the interval $[L,U]$, the density is then estimated with:

\[f(x) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} \hat{K}(x;X,\lambda_i h,L,U)\]

where $\hat{K}$ is a modified kernel that depends on the exact method used. Currently, only 1D KDE supports bounded domains.
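The text does not say which modified kernels are available. As one possible example, the sketch below uses the well-known reflection correction, in which each kernel is mirrored about every finite bound. This choice is an assumption made for illustration; `reflected_kernel` and `bounded_kde` are hypothetical names, and only the unweighted case ($w_i = 1$, $\lambda_i = 1$) is shown.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def reflected_kernel(x, x_i, h, lower, upper):
    """Hypothetical K-hat: kernel centred on x_i plus its mirror images about finite bounds."""
    if x < lower or x > upper:
        return 0.0  # no density outside the domain
    value = normal_kernel((x - x_i) / h)
    if math.isfinite(lower):
        value += normal_kernel((x - (2.0 * lower - x_i)) / h)
    if math.isfinite(upper):
        value += normal_kernel((x - (2.0 * upper - x_i)) / h)
    return value

def bounded_kde(x, xs, h, lower=-math.inf, upper=math.inf):
    """Unweighted bounded estimate: 1/(h m) * sum_i K-hat(x; X_i, h, L, U)."""
    return sum(reflected_kernel(x, x_i, h, lower, upper) for x_i in xs) / (h * len(xs))

xs = [0.2, 0.5, 1.5]
step = 0.001
grid = [i * step for i in range(10001)]  # 0 .. 10
values = [bounded_kde(x, xs, h=0.4, lower=0.0) for x in grid]
# Trapezoidal estimate of the total mass on [0, 10].
mass = sum(0.5 * (a + b) * step for a, b in zip(values, values[1:]))
```

With a single finite bound the reflected estimate integrates to one over the domain. With two finite bounds this first-order reflection is only approximate when the bandwidth is not small compared with $U - L$.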

Kernel Density Estimation Methods

class pyqt_fit.kde.KDE1D(xdata, **kwords)

Perform a kernel-based density estimation in 1D, possibly on a bounded domain $[L,U]$.

Parameters:
    xdata (ndarray) – 1D array with the data points
    kwords (dict) – attributes to set at construction time. Any named argument is equivalent to setting the corresponding property after construction. For example:

        >>> xs = [1,2,3]
        >>> k = KDE1D(xs, lower=0)

    is equivalent to:

        >>> k = KDE1D(xs)
        >>> k.lower = 0

The calculation is separated into three parts:

The kernel (kernel)

The bandwidth or covariance estimation (bandwidth, covariance)

The estimation method (method)

__call__(points, out=None)

This method is an alias for KDE1D.evaluate().

bandwidth

Bandwidth of the kernel. It can be set either to a fixed value or to a bandwidth calculator, i.e. a function with signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.
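A bandwidth calculator in the sense described above is any callable that takes the data and returns one number. The sketch below is a hypothetical example using the normal-reference rule $1.06\,\hat\sigma\,n^{-1/5}$. The function name is made up, and the commented assignment at the end simply follows the description above and is not tested here.

```python
import statistics

def rule_of_thumb_bandwidth(xdata):
    """Hypothetical calculator with the w(xdata) signature: returns a single float."""
    n = len(xdata)
    return 1.06 * statistics.stdev(xdata) * n ** (-1.0 / 5.0)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
h = rule_of_thumb_bandwidth(xs)

# Following the description above, the function itself (not its result)
# could then be assigned as the bandwidth, for example:
#     k = KDE1D(xs, bandwidth=rule_of_thumb_bandwidth)
```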

cdf_grid(N=None, cut=None)

Compute the cumulative distribution function (CDF), integrated from the lower bound, on a grid of N points.

closed

Returns True if the density domain is closed (i.e. lower and upper are both finite).

copy()

Shallow copy of the KDE object.

covariance

Covariance of the Gaussian kernel. It can be set either to a fixed value or to a bandwidth calculator, i.e. a function with signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.

evaluate(points, out=None)

Compute the PDF of the distribution at the given points.

fit()

Compute the various parameters needed by the KDE method.

grid(N=None, cut=None)

Evaluate the density on a grid of N points spanning the whole dataset.

Returns:	a tuple with the mesh on which the density is evaluated and the density itself
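The sketch below shows what a grid evaluation of this kind might look like for an unbounded Gaussian KDE. The meaning given to `cut` here (how far the mesh extends past the extreme data points, in units of the bandwidth) and the default values are assumptions, as is the helper name `grid_density`.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def grid_density(xs, h, N=8, cut=3.0):
    """Return (mesh, density) on N evenly spaced points covering the data.

    Assumption: the mesh extends cut * h beyond the smallest and largest data points.
    """
    lo = min(xs) - cut * h
    hi = max(xs) + cut * h
    step = (hi - lo) / (N - 1)
    mesh = [lo + i * step for i in range(N)]
    density = [
        sum(normal_kernel((x - x_i) / h) for x_i in xs) / (h * len(xs))
        for x in mesh
    ]
    return mesh, density

mesh, density = grid_density([1.0, 2.0, 3.0], h=0.5)
```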

icdf_grid(N=None, cut=None)

Compute the inverse cumulative distribution (quantile) function on a grid.
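One generic way to obtain a CDF and a quantile function on a grid is to accumulate the density with the trapezoidal rule and then invert the result by linear interpolation. The sketch below follows that approach for an unbounded Gaussian KDE. It is not necessarily how the library computes these quantities, and every name and default in it is hypothetical.

```python
import math
from bisect import bisect_left

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def pdf(x, xs, h):
    return sum(normal_kernel((x - x_i) / h) for x_i in xs) / (h * len(xs))

def cdf_on_grid(xs, h, N=512, cut=5.0):
    """Cumulative trapezoidal integral of the density on an evenly spaced mesh."""
    lo, hi = min(xs) - cut * h, max(xs) + cut * h
    step = (hi - lo) / (N - 1)
    mesh = [lo + i * step for i in range(N)]
    dens = [pdf(x, xs, h) for x in mesh]
    cdf = [0.0]
    for a, b in zip(dens, dens[1:]):
        cdf.append(cdf[-1] + 0.5 * (a + b) * step)
    return mesh, cdf

def icdf_on_grid(xs, h, n_probs=9, N=512, cut=5.0):
    """Quantiles at evenly spaced probabilities, by inverting the gridded CDF."""
    mesh, cdf = cdf_on_grid(xs, h, N, cut)
    total = cdf[-1]
    cdf = [c / total for c in cdf]  # renormalise so the last value is exactly 1
    probs = [(k + 1) / (n_probs + 1) for k in range(n_probs)]
    quantiles = []
    for p in probs:
        j = bisect_left(cdf, p)  # first index with cdf[j] >= p, so cdf[j-1] < p
        t = (p - cdf[j - 1]) / (cdf[j] - cdf[j - 1])
        quantiles.append(mesh[j - 1] + t * (mesh[j] - mesh[j - 1]))
    return probs, quantiles

probs, quantiles = icdf_on_grid([-1.0, 0.0, 1.0], h=0.5)
```

The renormalisation step hides the small amount of mass that falls outside the mesh, so `cut` should be large enough for that mass to be negligible.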

kernel

Kernel object. This must be an object modeled on pyqt_fit.kernels.Kernel1D. It is recommended to inherit from this class, as it provides numerical approximations for all methods.

By default, the kernel is an instance of pyqt_fit.kernels.normal_kernel1d.

lambdas

Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.

When deleted, the lambdas are reset to 1.

lower

Lower bound of the density domain. If deleted, it is set to $-\infty$.

method

Select the method to use. The method should be an object modeled on pyqt_fit.kde_methods.KDE1DMethod, and it is recommended to inherit from that class.

The available methods are defined in the pyqt_fit.kde_methods sub-module.

Default:	pyqt_fit.kde_methods.default_method

upper

Upper bound of the density domain. If deleted, it is set to $\infty$.

weights

Weights associated with each data point. It can be either a single value or an array with one value per data point. If a single value is provided, the weights will always be set to 1.

Bandwidth Estimation Methods

pyqt_fit.kde.variance_bandwidth(factor, xdata)

Returns the covariance matrix:

\[\mathcal{C} = \tau^2 cov(X)\]

where $\tau$ is a correcting factor that depends on the method.
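In 1D the covariance matrix reduces to a single variance, so the formula above becomes a scalar product. The sketch below is a hypothetical 1D analogue, not the library function. It assumes the unbiased sample variance for $cov(X)$ and the usual 1D relation that the bandwidth is the square root of the covariance.

```python
import math
import statistics

def variance_bandwidth_1d(factor, xdata):
    """1D analogue of C = tau^2 cov(X), using the unbiased sample variance (assumption)."""
    return factor ** 2 * statistics.variance(xdata)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
covariance = variance_bandwidth_1d(0.5, xs)
bandwidth = math.sqrt(covariance)  # assumed relation between bandwidth and covariance in 1D
```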

pyqt_fit.kde.silverman_covariance(xdata, model=None)

The Silverman bandwidth is defined as a variance bandwidth with factor:

\[\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}\]

where $n$ is the number of data points and $d$ is the dimension of the data.

pyqt_fit.kde.scotts_covariance(xdata, model=None)

The Scott bandwidth is defined as a variance bandwidth with factor:

\[\tau = n^\frac{-1}{d+4}\]
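The Silverman and Scott factors can be compared directly. The sketch below takes $n$ as the number of data points and $d$ as the dimension of the data; the two function names are illustrative and are not the library's.

```python
def silverman_factor(n, d):
    """tau = (n (d + 2) / 4) ** (-1 / (d + 4))"""
    return (n * (d + 2.0) / 4.0) ** (-1.0 / (d + 4.0))

def scott_factor(n, d):
    """tau = n ** (-1 / (d + 4))"""
    return float(n) ** (-1.0 / (d + 4.0))

n, d = 100, 1
tau_silverman = silverman_factor(n, d)
tau_scott = scott_factor(n, d)
# In 1D the ratio is (4/3) ** (1/5), whatever the value of n.
ratio = tau_silverman / tau_scott
```

For $d = 2$ the term $(d+2)/4$ equals 1, so the two factors coincide.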

pyqt_fit.kde.botev_bandwidth(N=None, **kword)

Implementation of the KDE bandwidth selection method outlined in:

Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.

Based on the implementation of Daniel B. Smith, PhD.

The object is a callable returning the bandwidth for a 1D kernel.