Transform the space
To find clusters we want to find the islands of higher density amid a sea of sparser noise -- and the assumption of noise is important: real data is messy and has outliers, corrupt data, and noise. The core of the clustering algorithm is single linkage clustering, and it can be quite sensitive to noise: a single noise data point in the wrong place can act as a bridge between islands, gluing them together. Obviously we want our algorithm to be robust against noise so we need to find a way to help 'lower the sea level' before running a single linkage algorithm.
How can we characterize 'sea' and 'land' without doing a clustering? As long as we can get an estimate of density we can consider lower density points as the 'sea'. The goal here is not to perfectly distinguish 'sea' from 'land' -- this is an initial step in clustering, not the output -- just to make our clustering core a little more robust to noise. So given an identification of 'sea' we want to lower the sea level. For practical purposes that means making 'sea' points more distant from each other and from the 'land'.
That's just the intuition however. How does it work in practice? We need a very inexpensive estimate of density, and the simplest is the distance to the $k$th nearest neighbor. If we have the distance matrix for our data (which we will need imminently anyway) we can simply read that off; alternatively if our metric is supported (and dimension is low) this is the sort of query that kd-trees are good for. Let's formalise this and (following the DBSCAN, LOF, and HDBSCAN literature) call it the core distance, defined for parameter $k$ for a point $x$ and denoted $\mathrm{core}_k(x)$. Now we need a way to spread apart points with low density (correspondingly high core distance). The simple way to do this is to define a new distance metric between points which we will call (again following the literature) the mutual reachability distance. We define mutual reachability distance as follows:

$$d_{\mathrm{mreach-}k}(a,b) = \max \{\mathrm{core}_k(a), \mathrm{core}_k(b), d(a,b)\}$$

where $d(a,b)$ is the original metric distance between $a$ and $b$. Under this metric dense points (with low core distance) remain the same distance from each other, but sparser points are pushed away to be at least their core distance away from any other point. This effectively 'lowers the sea level', spreading sparse 'sea' points out while leaving 'land' untouched. The caveat here is that obviously this is dependent upon the choice of $k$; larger $k$ values interpret more points as being in the 'sea'. All of this is a little easier to understand with a picture (and with a small code sketch, given after the worked example below), so let's use a $k$ value of five. Then for a given point we can draw a circle for the core distance as the circle that touches the sixth nearest neighbor (counting the point itself), like so:
Pick another point and we can do the same thing, this time with a different set of neighbors (one of them even being the first point we picked out).
And we can do that a third time for good measure, with another set of six nearest neighbors and another circle with slightly different radius again.
Now if we want to know the mutual reachability distance between the blue and green points we can start by drawing in an arrow giving the distance between green and blue:
This passes through the blue circle, but not the green circle -- the core distance for green is larger than the distance between blue and green. Thus we need to mark the mutual reachability distance between blue and green as larger -- equal to the radius of the green circle (easiest to picture if we base one end at the green point).
On the other hand the mutual reachability distance from red to green is simply the distance from red to green, since that distance is greater than either core distance (i.e. the distance arrow passes through both circles).
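To make the transform concrete, here is a minimal numpy sketch of that computation. The function name and the brute-force distance matrix are just for illustration -- the library computes this far more efficiently -- but the logic is exactly the definition above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(data, k=5):
    """Sketch of the mutual reachability transform for a small dataset."""
    # Full pairwise distance matrix -- quadratic, but fine for a sketch.
    dist = squareform(pdist(data))
    # Row-sort the matrix; column 0 is each point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor,
    # i.e. core_k(x).
    core = np.sort(dist, axis=1)[:, k]
    # d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b)),
    # computed elementwise by broadcasting.
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist)
```

Note how dense points keep their original distances, while a sparse point has its entire row and column inflated to at least its core distance.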
In general there is underlying theory to demonstrate that mutual reachability distance as a transform works well in allowing single linkage clustering to more closely approximate the hierarchy of level sets of whatever true density distribution our points were sampled from.
Build the minimum spanning tree
Now that we have a new mutual reachability metric on the data we want to start finding the islands of dense data. Of course dense areas are relative, and different islands may have different densities. Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points.
Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.
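A single level of that hierarchy is easy enough to compute; this sketch reuses the `mreach` matrix from the `mutual_reachability` sketch above, and is exactly the naive approach the next paragraph explains how to avoid:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def components_at_threshold(mreach, threshold):
    """Drop every edge heavier than `threshold`; label what remains."""
    # Zeroed entries are treated as absent edges by the sparse graph.
    adjacency = np.where(mreach <= threshold, mreach, 0.0)
    np.fill_diagonal(adjacency, 0.0)
    n_components, labels = connected_components(
        csr_matrix(adjacency), directed=False)
    return n_components, labels
```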
In practice this is very expensive: there are $n^2$ edges and we don't want to have to run a connected components algorithm that many times. The right thing to do is to find a minimal set of edges such that dropping any edge from the set causes a disconnection of components. But we need more: we need this set to be such that there is no lower weight edge that could connect the components. Fortunately graph theory furnishes us with just such a thing: the minimum spanning tree of the graph.
We can build the minimum spanning tree very efficiently via Prim's algorithm -- we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree. You can see the tree HDBSCAN constructed below; note that this is the minimum spanning tree for mutual reachability distance, which is different from the pure distance in the graph. In this case we had a $k$ value of 5.
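For reference, a quadratic-time Prim's algorithm over the dense mutual reachability matrix looks roughly like the sketch below; the real implementation is considerably more sophisticated, but the logic is the same:

```python
import numpy as np

def prim_mst(mreach):
    """Prim's algorithm on a dense (symmetric) weight matrix.

    Returns the spanning tree as (parent, child, weight) edges.
    """
    n = mreach.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    # best[v]: cheapest known edge weight joining v to the current tree;
    # source[v]: the tree vertex on the other end of that edge.
    best = mreach[0].copy()
    source = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # Add the cheapest edge that reaches outside the tree.
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(source[v]), v, float(best[v])))
        in_tree[v] = True
        # v's edges may now offer cheaper connections for the rest.
        improved = mreach[v] < best
        best = np.where(improved, mreach[v], best)
        source = np.where(improved, v, source)
    return edges
```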
In the case that the data lives in a metric space we can use even faster methods, such as Dual Tree Boruvka to build the minimal spanning tree.
Intuitively we want to choose clusters that persist and have a longer lifetime; short-lived clusters are ultimately probably merely artifacts of the single linkage approach. Looking at the previous plot we could say that we want to choose those clusters that have the greatest area of ink in the plot. To make a flat clustering we will need to add a further requirement that, if you select a cluster, then you cannot select any cluster that is a descendant of it. And in fact that intuitive notion of what should be done is exactly what HDBSCAN does. Of course we need to formalise things to make it a concrete algorithm.
First we need a different measure than distance to consider the persistence of clusters; instead we will use $\lambda = \frac{1}{\mathrm{distance}}$. For a given cluster we can then define values $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ to be the lambda value when the cluster split off and became its own cluster, and the lambda value (if any) when the cluster split into smaller clusters, respectively. In turn, for a given cluster, for each point $p$ in that cluster we can define the value $\lambda_p$ as the lambda value at which that point 'fell out of the cluster', which is a value somewhere between $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ since the point either falls out of the cluster at some point in the cluster's lifetime, or leaves the cluster when the cluster splits into two smaller clusters. Now, for each cluster compute the stability as

$$\sum_{p \in \mathrm{cluster}} (\lambda_p - \lambda_{\mathrm{birth}}).$$
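In code, the stability of a single cluster is a one-liner once you have those lambda values; the function and argument names here are hypothetical, standing in for values read off the condensed tree:

```python
def cluster_stability(lambda_birth, lambda_points):
    """Stability = sum over the cluster's points of (lambda_p - lambda_birth).

    lambda_birth:  1 / distance at which the cluster split off.
    lambda_points: for each point p in the cluster, 1 / distance at
                   which p fell out of (or left) the cluster.
    """
    return sum(lam_p - lambda_birth for lam_p in lambda_points)
```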
Declare all leaf nodes to be selected clusters. Now work up through the tree (the reverse topological sort order). If the sum of the stabilities of the child clusters is greater than the stability of the cluster then we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the cluster's stability is greater than the sum of its children then we declare the cluster to be a selected cluster and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clustering and return that.
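A sketch of that selection pass, using a post-order traversal in place of an explicit reverse topological sort, might look like this; the `children` and `stability` dictionaries are hypothetical stand-ins for the condensed tree:

```python
def select_clusters(root, children, stability):
    """Pick the flat clustering out of the cluster tree.

    children:  maps cluster id -> list of child cluster ids ([] for leaves).
    stability: maps cluster id -> the stability computed above.
    """
    selected = set()

    def unselect_descendants(node):
        for child in children[node]:
            selected.discard(child)
            unselect_descendants(child)

    def visit(node):
        # Post-order: children are decided before their parent, which
        # matches working up the tree in reverse topological order.
        if not children[node]:
            selected.add(node)          # every leaf starts out selected
            return stability[node]
        child_total = sum(visit(child) for child in children[node])
        if stability[node] > child_total:
            # The cluster outlives its pieces: take it, drop its descendants.
            unselect_descendants(node)
            selected.add(node)
            return stability[node]
        # Otherwise keep the children and propagate their larger total.
        return child_total

    visit(root)
    return selected
```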
Okay, that was wordy and complicated, but it really is simply performing our 'select the clusters in the plot with the largest total ink area' subject to descendant constraints that we explained earlier. We can select the clusters in the condensed tree dendrogram via this algorithm, and you get what you expect:
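Of course none of this needs to be hand-rolled in practice: the hdbscan library runs the whole pipeline. A minimal usage example, assuming `data` is an (n_samples, n_features) numpy array:

```python
import hdbscan

# The library packages everything above: the mutual reachability
# transform, the minimum spanning tree, the condensed cluster tree,
# and the stability-based cluster extraction.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(data)  # -1 labels points left as noise
```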