How HDBSCAN Works

HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters. The goal of this notebook is to give you an overview of how the algorithm works and the motivations behind it. In contrast to the HDBSCAN paper I'm going to describe it without reference to DBSCAN. Instead I'm going to explain how I like to think about the algorithm, which aligns more closely with Robust Single Linkage with flat cluster extraction on top of it.

Before we get started we'll load up most of the libraries we'll need in the background, and set up our plotting (because I believe the best way to understand what is going on is to actually see it working in pictures).

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
%matplotlib inline
sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

The next thing we'll need is some data. To make for an illustrative example we'll need the data size to be fairly small so we can see what is going on. It will also be useful to have several clusters, preferably of different kinds. Fortunately sklearn has facilities for generating sample clustering data so I'll make use of that and make a dataset of one hundred data points.

moons, _ = data.make_moons(n_samples=50, noise=0.05)
blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)
test_data = np.vstack([moons, blobs])
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)
[Figure: scatter plot of the test data – two interleaved crescents and two Gaussian blobs.]

Now, the best way to explain HDBSCAN is actually to just use it and then go through the steps that occurred along the way, teasing out what is happening at each step. So let's load up the hdbscan library and get to work.

import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
clusterer.fit(test_data)
HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True, gen_min_span_tree=True, leaf_size=40, memory=Memory(None), metric='euclidean', min_cluster_size=5, min_samples=None, p=None)

So now that we have clustered the data – what actually happened? We can break it out into a series of steps:

  1. Transform the space according to the density/sparsity.

  2. Build the minimum spanning tree of the distance weighted graph.

  3. Construct a cluster hierarchy of connected components.

  4. Condense the cluster hierarchy based on minimum cluster size.

  5. Extract the stable clusters from the condensed tree.

Transform the space

To find clusters we want to find the islands of higher density amid a sea of sparser noise – and the assumption of noise is important: real data is messy and has outliers, corrupt data, and noise. The core of the clustering algorithm is single linkage clustering, and it can be quite sensitive to noise: a single noise data point in the wrong place can act as a bridge between islands, gluing them together. Obviously we want our algorithm to be robust against noise so we need to find a way to help 'lower the sea level' before running a single linkage algorithm.

How can we characterize 'sea' and 'land' without doing a clustering? As long as we can get an estimate of density we can consider lower density points as the 'sea'. The goal here is not to perfectly distinguish 'sea' from 'land' – this is an initial step in clustering, not the output – just to make our clustering core a little more robust to noise. So given an identification of 'sea' we want to lower the sea level. For practical purposes that means making 'sea' points more distant from each other and from the 'land'.

That's just the intuition however. How does it work in practice? We need a very inexpensive estimate of density, and the simplest is the distance to the kth nearest neighbor. If we have the distance matrix for our data (which we will need imminently anyway) we can simply read that off; alternatively if our metric is supported (and dimension is low) this is the sort of query that kd-trees are good for. Let's formalise this and (following the DBSCAN, LOF, and HDBSCAN literature) call it the core distance defined for parameter k for a point x, denoted $\mathrm{core}_k(x)$. Now we need a way to spread apart points with low density (correspondingly high core distance). The simple way to do this is to define a new distance metric between points which we will call (again following the literature) the mutual reachability distance. We define mutual reachability distance as follows:

$$d_{\mathrm{mreach-}k}(a,b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a,b)\}$$

where $d(a,b)$ is the original metric distance between a and b. Under this metric dense points (with low core distance) remain the same distance from each other, but sparser points are pushed away to be at least their core distance away from any other point. This effectively 'lowers the sea level', spreading sparse 'sea' points out while leaving 'land' untouched. The caveat here is that obviously this is dependent upon the choice of k; larger k values interpret more points as being in the 'sea'. All of this is a little easier to understand with a picture, so let's use a k value of five. Then for a given point we can draw a circle for the core distance as the circle that touches the sixth nearest neighbor (counting the point itself), like so:

[Figure: a point with a circle of radius equal to its core distance, touching its sixth nearest neighbor.]

Pick another point and we can do the same thing, this time with a different set of neighbors (one of them even being the first point we picked out).

[Figure: a second point with its own core distance circle.]

And we can do that a third time for good measure, with another set of six nearest neighbors and another circle with slightly different radius again.

[Figure: a third point with its core distance circle.]

Now if we want to know the mutual reachability distance between the blue and green points we can start by drawing in an arrow giving the distance between green and blue:

[Figure: an arrow marking the raw distance between the blue and green points.]

This passes through the blue circle, but not the green circle – the core distance for green is larger than the distance between blue and green. Thus we need to mark the mutual reachability distance between blue and green as larger – equal to the radius of the green circle (easiest to picture if we base one end at the green point).

[Figure: the mutual reachability distance between blue and green, equal to the green core distance.]
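To put a number on it (with purely illustrative values, not ones read off the plot): if $\mathrm{core}_k(\mathrm{green}) = 1.2$, $\mathrm{core}_k(\mathrm{blue}) = 0.5$ and $d(\mathrm{blue}, \mathrm{green}) = 0.8$, then $d_{\mathrm{mreach-}k}(\mathrm{blue}, \mathrm{green}) = \max\{1.2,\ 0.5,\ 0.8\} = 1.2$: the green core distance wins, which is exactly the situation drawn above.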

On the other hand the mutual reachability distance from red to green is simply the distance from red to green, since that distance is greater than either core distance (i.e. the distance arrow passes through both circles).

[Figure: the mutual reachability distance between red and green, equal to their raw distance.]

In general there is underlying theory to demonstrate that mutual reachability distance as a transform works well in allowing single linkage clustering to more closely approximate the hierarchy of level sets of whatever true density distribution our points were sampled from.
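If you prefer to see the transform in code, here is a minimal sketch of computing core distances and the mutual reachability matrix with numpy and scipy. It is purely illustrative: the hdbscan library does not materialise a full distance matrix like this, and the names dist, core and mreach are mine.

import numpy as np
from scipy.spatial.distance import pdist, squareform

k = 5
dist = squareform(pdist(test_data))    # ordinary Euclidean distance matrix
# In each sorted row, index 0 is the point itself, so index k is the k-th
# nearest neighbour not counting the point (the sixth counting it), i.e. the
# core distance as drawn in the circles above.
core = np.sort(dist, axis=1)[:, k]
# Mutual reachability: the max of the two core distances and the raw distance.
mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
np.fill_diagonal(mreach, 0.0)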

Build the minimum spanning tree

Now that we have a new mutual reachability metric on the data we want to start finding the islands of dense data. Of course dense areas are relative, and different islands may have different densities. Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points.

Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.

In practice this is very expensive: there are $\binom{n}{2}$ edges and we don't want to have to run a connected components algorithm that many times. The right thing to do is to find a minimal set of edges such that dropping any edge from the set causes a disconnection of components. But we need more: we need this set to be such that there is no lower weight edge that could connect the components. Fortunately graph theory furnishes us with just such a thing: the minimum spanning tree of the graph.

We can build the minimum spanning tree very efficiently via Prim's algorithm – we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree. You can see the tree HDBSCAN constructed below; note that this is the minimum spanning tree for mutual reachability distance, which is different from the pure distance in the graph. In this case we had a k value of 5.

In the case that the data lives in a metric space we can use even faster methods, such as Dual Tree Boruvka, to build the minimal spanning tree.
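As a rough illustration of this step (again a sketch rather than the library's optimized implementation), scipy can build the minimum spanning tree directly from the mreach matrix computed in the earlier sketch:

from scipy.sparse.csgraph import minimum_spanning_tree

# All off-diagonal weights in mreach are strictly positive, so nothing is
# lost when scipy converts the dense matrix to sparse form.
mst = minimum_spanning_tree(mreach).tocoo()
# The tree has n - 1 edges; sort them by weight ready for the next step.
edges = sorted(zip(mst.data, mst.row, mst.col))    # (weight, i, j) triples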

clusterer.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.6, node_size=80, edge_linewidth=2)
[Figure: minimum spanning tree of the mutual reachability graph, edges colored by distance.]

Build the cluster hierarchy

Given the minimal spanning tree, the next step is to convert that into the hierarchy of connected components. This is most easily done in the reverse order: sort the edges of the tree by distance (in increasing order) and then iterate through, creating a new merged cluster for each edge. The only difficult part here is to identify the two clusters each edge will join together, but this is easy enough via a union-find data structure. We can view the result as a dendrogram as we see below:

clusterer.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
[Figure: single linkage tree (dendrogram) built from the minimum spanning tree.]
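For the curious, the union-find step is easy enough to sketch in plain Python, continuing from the sorted edges list in the spanning tree sketch above (the real implementation is optimized Cython):

parent = list(range(len(test_data)))

def find(x):
    # Walk up to the root of x's component, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

merges = []
for weight, i, j in edges:              # already sorted by increasing weight
    a, b = find(int(i)), find(int(j))
    if a != b:
        parent[b] = a                   # merge the two connected components
        merges.append((a, b, float(weight)))
# merges now records, in order of increasing mutual reachability distance,
# which components joined and at what distance; that is the information the
# dendrogram above encodes.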

This brings us to the point where robust single linkage stops. We want more though; a cluster hierarchy is good, but we really want a set of flat clusters. We could do that by drawing a horizontal line through the above diagram and selecting the clusters that it cuts through. This is in practice what DBSCAN effectively does (declaring any singleton clusters at the cut level as noise). The question is, how do we know where to draw that line? DBSCAN simply leaves that as a (very unintuitive) parameter. Worse, we really want to deal with variable density clusters, and any choice of cut line is a choice of mutual reachability distance to cut at, and hence a single fixed density level. Ideally we want to be able to cut the tree at different places to select our clusters. This is where the next steps of HDBSCAN begin and create the difference from robust single linkage.
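Incidentally, if you really do want a DBSCAN-style flat cut, the single linkage tree object exposes a get_clusters method for exactly that (at least in recent versions of the library; check the API docs for the exact signature). The cut value below is arbitrary and only for illustration:

# Cut the dendrogram at a fixed mutual reachability distance; any component
# smaller than min_cluster_size at that level is labelled as noise (-1).
flat_labels = clusterer.single_linkage_tree_.get_clusters(0.125, min_cluster_size=5)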

Condense the cluster tree

The first step in cluster extraction is condensing down the large and complicated cluster hierarchy into a smaller tree with a little more data attached to each node. As you can see in the hierarchy above it is often the case that a cluster split is one or two points splitting off from a cluster; and that is the key point – rather than seeing it as a cluster splitting into two new clusters we want to view it as a single persistent cluster that is 'losing points'. To make this concrete we need a notion of minimum cluster size which we take as a parameter to HDBSCAN. Once we have a value for minimum cluster size we can now walk through the hierarchy and at each split ask if one of the new clusters created by the split has fewer points than the minimum cluster size. If it is the case that we have fewer points than the minimum cluster size we declare it to be 'points falling out of a cluster' and have the larger cluster retain the cluster identity of the parent, marking down which points 'fell out of the cluster' and at what distance value that happened. If on the other hand the split is into two clusters each at least as large as the minimum cluster size then we consider that a true cluster split and let that split persist in the tree. After walking through the whole hierarchy and doing this we end up with a much smaller tree with a small number of nodes, each of which has data about how the size of the cluster at that node decreases over varying distance. We can visualize this as a dendrogram similar to the one above – again we can have the width of the line represent the number of points in the cluster. This time, however, that width varies over the length of the line as points fall out of the cluster. For our data, using a minimum cluster size of 5, the result looks like this:

clusterer.condensed_tree_.plot()
[Figure: condensed cluster tree for the test data with a minimum cluster size of 5.]

This is much easier to look at and deal with, particularly in as simple a clustering problem as our current test dataset. However we still need to pick out clusters to use as a flat clustering. Looking at the plot above should give you some ideas about how one might go about doing this.
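It can also help to look at the condensed tree numerically rather than graphically: the tree object can be exported as a pandas DataFrame, with one row per child (a single point or a sub-cluster), the lambda value at which that child left its parent, and how many points it took with it.

ct = clusterer.condensed_tree_.to_pandas()
ct.columns                      # parent, child, lambda_val, child_size
ct[ct['child_size'] > 1]        # the rows describing genuine cluster splits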

Extract the clusters

Intuitively we want to choose the clusters that persist and have a longer lifetime; short lived clusters are probably merely artifacts of the single linkage approach. Looking at the previous plot we could say that we want to choose those clusters that have the greatest area of ink in the plot. To make a flat clustering we will need to add a further requirement that, if you select a cluster, then you cannot select any cluster that is a descendant of it. And in fact that intuitive notion of what should be done is exactly what HDBSCAN does. Of course we need to formalise things to make it a concrete algorithm.

First we need a different measure than distance to consider the persistence of clusters; instead we will use $\lambda = \frac{1}{\mathrm{distance}}$. For a given cluster we can then define values $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ to be the lambda value when the cluster split off and became its own cluster, and the lambda value (if any) when the cluster split into smaller clusters, respectively. In turn, for a given cluster, for each point p in that cluster we can define the value $\lambda_p$ as the lambda value at which that point 'fell out of the cluster', which is a value somewhere between $\lambda_{\mathrm{birth}}$ and $\lambda_{\mathrm{death}}$ since the point either falls out of the cluster at some point in the cluster's lifetime, or leaves the cluster when the cluster splits into two smaller clusters. Now, for each cluster compute the stability as

$$\sum_{p \in \mathrm{cluster}} (\lambda_p - \lambda_{\mathrm{birth}}).$$

Declare all leaf nodes to be selected clusters. Now work up through the tree (the reverse topological sort order). If the sum of the stabilities of the child clusters is greater than the stability of the cluster, then we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the cluster's stability is greater than the sum of its children, then we declare the cluster to be a selected cluster and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clustering and return that.
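To make that concrete, here is a rough pure-Python sketch of the stability computation and the bottom-up selection, working from the condensed tree DataFrame shown earlier. The names births, stability and children are mine, the root is never selected (matching the library's default behaviour), and the real implementation, with all its extra options, lives in optimized Cython.

tree = clusterer.condensed_tree_.to_pandas()
root = tree['parent'].min()

# lambda_birth for each cluster: the lambda at which it split off from its parent.
births = {int(row['child']): row['lambda_val']
          for _, row in tree[tree['child_size'] > 1].iterrows()}
births[root] = 0.0

# Stability: for everything that leaves a cluster, add (lambda at departure -
# lambda_birth of the cluster), weighted by how many points leave.
stability = {c: 0.0 for c in births}
for _, row in tree.iterrows():
    p = int(row['parent'])
    stability[p] += (row['lambda_val'] - births[p]) * row['child_size']

# Cluster-to-cluster edges of the condensed tree.
children = {c: [] for c in births}
for _, row in tree[tree['child_size'] > 1].iterrows():
    children[int(row['parent'])].append(int(row['child']))

# Bottom-up pass: leaves start selected; a parent replaces its descendants
# only if it is more stable than they are combined. Condensed tree labels are
# assigned so that children have larger labels than their parents, so reverse
# sorted order serves as a reverse topological order.
selected = {c for c in births if not children[c] and c != root}
for cluster in sorted(births, reverse=True):
    if cluster == root or not children[cluster]:
        continue
    child_total = sum(stability[c] for c in children[cluster])
    if stability[cluster] < child_total:
        stability[cluster] = child_total     # the children win; propagate upward
    else:
        selected.add(cluster)                # this cluster wins...
        stack = list(children[cluster])      # ...and unselects all its descendants
        while stack:
            c = stack.pop()
            selected.discard(c)
            stack.extend(children[c])

On a simple example like ours this should agree with the selection the library itself makes, which you can see highlighted below.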

Okay, that was wordy and complicated, but it really is simply performing our 'select the clusters in the plot with the largest total ink area' subject to the descendant constraints that we explained earlier. We can select the clusters in the condensed tree dendrogram via this algorithm, and you get what you expect:

clusterer.condensed_tree_.plot(select_clusters=True, selection_palette=sns.color_palette())
[Figure: condensed tree with the selected clusters circled.]

Now that we have the clusters it is a simple enough matter to turn that into cluster labelling as per the sklearn API. Any point not in a selected cluster is simply a noise point (and assigned the label -1). We can do a little more though: for each cluster we have the $\lambda_p$ for each point p in that cluster; if we simply normalize those values (so they range from zero to one) then we have a measure of the strength of cluster membership for each point in the cluster. The hdbscan library returns this as a probabilities_ attribute of the clusterer object. Thus, with labels and membership strengths in hand we can make the standard plot, choosing a color for points based on cluster label, and desaturating that color according to the strength of membership (and making unclustered points pure gray).

palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                  if col >= 0 else (0.5, 0.5, 0.5)
                  for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)
[Figure: final clustering of the test data, colored by cluster label and desaturated by membership strength.]

And that is how HDBSCAN works. It may seem somewhat complicated – there are a fair number of moving parts to the algorithm – but ultimately each part is actually very straightforward and can be optimized well. Hopefully with a better understanding both of the intuitions and some of the implementation details of HDBSCAN you will feel motivated to try it out. The library continues to develop, and will provide a base for new ideas including a near parameterless Persistent Density Clustering algorithm, and a new semi-supervised clustering algorithm.
