Transform the space
To find clusters we want to find the islands of higher density amid a sea of sparser noise -- and the assumption of noise is important: real data is messy and has outliers, corrupt data, and noise. The core of the clustering algorithm is single linkage clustering, and it can be quite sensitive to noise: a single noise data point in the wrong place can act as a bridge between islands, gluing them together. Obviously we want our algorithm to be robust against noise so we need to find a way to help 'lower the sea level' before running a single linkage algorithm.
How can we characterize 'sea' and 'land' without doing a clustering? As long as we can get an estimate of density we can consider lower density points as the 'sea'. The goal here is not to perfectly distinguish 'sea' from 'land' -- this is an initial step in clustering, not the output -- just to make our clustering core a little more robust to noise. So given an identification of 'sea' we want to lower the sea level. For practical purposes that means making 'sea' points more distant from each other and from the 'land'.
That's just the intuition however. How does it work in practice? We need a very inexpensive estimate of density, and the simplest is the distance to the $k$th nearest neighbor. If we have the distance matrix for our data (which we will need imminently anyway) we can simply read that off; alternatively if our metric is supported (and dimension is low) this is the sort of query that kd-trees are good for. Let's formalise this and (following the DBSCAN, LOF, and HDBSCAN literature) call it the core distance, defined for parameter $k$ for a point $x$ and denoted $\mathrm{core}_k(x)$. Now we need a way to spread apart points with low density (correspondingly high core distance). The simple way to do this is to define a new distance metric between points which we will call (again following the literature) the mutual reachability distance. We define mutual reachability distance as follows:

$$d_{\mathrm{mreach-}k}(a,b) = \max \{\mathrm{core}_k(a), \mathrm{core}_k(b), d(a,b)\}$$

where $d(a,b)$ is the original metric distance between $a$ and $b$. Under this metric dense points (with low core distance) remain the same distance from each other, but sparser points are pushed away to be at least their core distance away from any other point. This effectively 'lowers the sea level', spreading sparse 'sea' points out while leaving 'land' untouched. The caveat here is that obviously this is dependent upon the choice of $k$; larger $k$ values interpret more points as being in the 'sea'. All of this is a little easier to understand with a picture (and with a small code sketch, given after the worked example below), so let's use a $k$ value of five. Then for a given point we can draw a circle for the core distance as the circle that touches the sixth nearest neighbor (counting the point itself), like so:
Pick another point and we can do the same thing, this time with a different set of neighbors (one of them even being the first point we picked out).
And we can do that a third time for good measure, with another set of six nearest neighbors and another circle with slightly different radius again.
Now if we want to know the mutual reachability distance between the blue and green points we can start by drawing in an arrow giving the distance between green and blue:
This passes through the blue circle, but not the green circle -- the core distance for green is larger than the distance between blue and green. Thus we need to mark the mutual reachability distance between blue and green as larger -- equal to the radius of the green circle (easiest to picture if we base one end at the green point).
On the other hand the mutual reachability distance from red to green is simply the distance from red to green, since that distance is greater than either core distance (i.e. the distance arrow passes through both circles).
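To make the transform concrete, here is a minimal numpy sketch of that computation. The function name and the brute-force distance matrix are just for illustration -- the library computes this far more efficiently -- but the logic is exactly the definition above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(data, k=5):
    """Sketch of the mutual reachability transform for a small dataset."""
    # Full pairwise distance matrix -- quadratic, but fine for a sketch.
    dist = squareform(pdist(data))
    # Row-sort the matrix; column 0 is each point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor,
    # i.e. core_k(x).
    core = np.sort(dist, axis=1)[:, k]
    # d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b)),
    # computed elementwise by broadcasting.
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist)
```

Note how dense points keep their original distances, while a sparse point has its entire row and column inflated to at least its core distance.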
In general there is underlying theory to demonstrate that mutual reachability distance as a transform works well in allowing single linkage clustering to more closely approximate the hierarchy of level sets of whatever true density distribution our points were sampled from.
Build the minimum spanning tree
Now that we have a new mutual reachability metric on the data we want to start finding the islands of dense data. Of course dense areas are relative, and different islands may have different densities. Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points.
Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.
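A single level of that hierarchy is easy enough to compute; this sketch reuses the `mreach` matrix from the `mutual_reachability` sketch above, and is exactly the naive approach the next paragraph explains how to avoid:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def components_at_threshold(mreach, threshold):
    """Drop every edge heavier than `threshold`; label what remains."""
    # Zeroed entries are treated as absent edges by the sparse graph.
    adjacency = np.where(mreach <= threshold, mreach, 0.0)
    np.fill_diagonal(adjacency, 0.0)
    n_components, labels = connected_components(
        csr_matrix(adjacency), directed=False)
    return n_components, labels
```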
In practice this is very expensive: there are $n^2$ edges and we don't want to have to run a connected components algorithm that many times. The right thing to do is to find a minimal set of edges such that dropping any edge from the set causes a disconnection of components. But we need more: we need this set to be such that there is no lower weight edge that could connect the components. Fortunately graph theory furnishes us with just such a thing: the minimum spanning tree of the graph.
We can build the minimum spanning tree very efficiently via Prim's algorithm -- we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree. You can see the tree HDBSCAN constructed below; note that this is the minimum spanning tree for mutual reachability distance, which is different from the pure distance in the graph. In this case we had a $k$ value of 5.
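For reference, a quadratic-time Prim's algorithm over the dense mutual reachability matrix looks roughly like the sketch below; the real implementation is considerably more sophisticated, but the logic is the same:

```python
import numpy as np

def prim_mst(mreach):
    """Prim's algorithm on a dense (symmetric) weight matrix.

    Returns the spanning tree as (parent, child, weight) edges.
    """
    n = mreach.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    # best[v]: cheapest known edge weight joining v to the current tree;
    # source[v]: the tree vertex on the other end of that edge.
    best = mreach[0].copy()
    source = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # Add the cheapest edge that reaches outside the tree.
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(source[v]), v, float(best[v])))
        in_tree[v] = True
        # v's edges may now offer cheaper connections for the rest.
        improved = mreach[v] < best
        best = np.where(improved, mreach[v], best)
        source = np.where(improved, v, source)
    return edges
```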
In the case that the data lives in a metric space we can use even faster methods, such as Dual Tree Boruvka to build the minimal spanning tree.
Intuitively we want to choose clusters that persist and have a longer lifetime; short-lived clusters are ultimately probably merely artifacts of the single linkage approach. Looking at the previous plot we could say that we want to choose those clusters that have the greatest area of ink in the plot. To make a flat clustering we will need to add a further requirement that, if you select a cluster, then you cannot select any cluster that is a descendant of it. And in fact that intuitive notion of what should be done is exactly what HDBSCAN does. Of course we need to formalise things to make it a concrete algorithm.
First we need a different measure than distance to consider the persistence of clusters; instead we will use $\lambda = \frac{1}{\mathrm{distance}}$. For a given cluster we can then define values $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ to be the lambda value when the cluster split off and became its own cluster, and the lambda value (if any) when the cluster split into smaller clusters, respectively. In turn, for a given cluster, for each point $p$ in that cluster we can define the value $\lambda_p$ as the lambda value at which that point 'fell out of the cluster', which is a value somewhere between $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ since the point either falls out of the cluster at some point in the cluster's lifetime, or leaves the cluster when the cluster splits into two smaller clusters. Now, for each cluster compute the stability as

$$\sum_{p \in \mathrm{cluster}} (\lambda_p - \lambda_{\mathrm{birth}}).$$
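In code, the stability of a single cluster is a one-liner once you have those lambda values; the function and argument names here are hypothetical, standing in for values read off the condensed tree:

```python
def cluster_stability(lambda_birth, lambda_points):
    """Stability = sum over the cluster's points of (lambda_p - lambda_birth).

    lambda_birth:  1 / distance at which the cluster split off.
    lambda_points: for each point p in the cluster, 1 / distance at
                   which p fell out of (or left) the cluster.
    """
    return sum(lam_p - lambda_birth for lam_p in lambda_points)
```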
Declare all leaf nodes to be selected clusters. Now work up through the tree (the reverse topological sort order). If the sum of the stabilities of the child clusters is greater than the stability of the cluster then we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the cluster's stability is greater than the sum of its children then we declare the cluster to be a selected cluster and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clustering and return that.
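A sketch of that selection pass, using a post-order traversal in place of an explicit reverse topological sort, might look like this; the `children` and `stability` dictionaries are hypothetical stand-ins for the condensed tree:

```python
def select_clusters(root, children, stability):
    """Pick the flat clustering out of the cluster tree.

    children:  maps cluster id -> list of child cluster ids ([] for leaves).
    stability: maps cluster id -> the stability computed above.
    """
    selected = set()

    def unselect_descendants(node):
        for child in children[node]:
            selected.discard(child)
            unselect_descendants(child)

    def visit(node):
        # Post-order: children are decided before their parent, which
        # matches working up the tree in reverse topological order.
        if not children[node]:
            selected.add(node)          # every leaf starts out selected
            return stability[node]
        child_total = sum(visit(child) for child in children[node])
        if stability[node] > child_total:
            # The cluster outlives its pieces: take it, drop its descendants.
            unselect_descendants(node)
            selected.add(node)
            return stability[node]
        # Otherwise keep the children and propagate their larger total.
        return child_total

    visit(root)
    return selected
```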
Okay, that was wordy and complicated, but it really is simply performing our 'select the clusters in the plot with the largest total ink area' subject to descendant constraints that we explained earlier. We can select the clusters in the condensed tree dendrogram via this algorithm, and you get what you expect:
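Of course none of this needs to be hand-rolled in practice: the hdbscan library runs the whole pipeline. A minimal usage example, assuming `data` is an (n_samples, n_features) numpy array:

```python
import hdbscan

# The library packages everything above: the mutual reachability
# transform, the minimum spanning tree, the condensed cluster tree,
# and the stability-based cluster extraction.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(data)  # -1 labels points left as noise
```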