Class DBSCAN
- All Implemented Interfaces:
Serializable,Cloneable
Density-based spatial clustering of applications with noise (DBSCAN) is a
clustering algorithm that can detect clusters of different shapes. It is
based on the notion of density-reachability: Let \(\mathcal{A} \subset
\mathbb{R}^n\) be a set of observations and \(d: \mathbb{R}^n \times
\mathbb{R}^n \rightarrow \mathbb{R}_+ \) be a distance function. The
(closed) \(\epsilon\) - neighborhood of a point \(x \in \mathcal{A} \) is
defined as
$$ \mathcal{N}_\epsilon(x)=
\{p \in \mathcal{A} \; | \; d(x,p) \le \epsilon\}.$$
A point \(x\) is said to be a core point if the number of points in
the \(\epsilon\) - neighborhood of \(x\) is at least minPts,
\(|\mathcal{N}_\epsilon(x)| \ge \) minPts. A point \(y\) is
directly density-reachable from a point \(x\) with respect to
\(\epsilon\) and minPts, if \(y \in \mathcal{N}_\epsilon(x)\)
and \(x\) is a core point. More generally, a point \(y\) is said to be
density-reachable from a point \(x\) if there is a sequence of points
\(x=x_1,x_2,\ldots,x_i=y\) such that \(x_k\) is directly density-reachable
from \(x_{k-1}\) for \(k=2,\ldots,i\). Note that this definition only implies
the points \(x_1,x_2,\ldots,x_{i-1}\) to be core points. The last point
\(x_i\) can be a non-core point as well.
The DBSCAN algorithm starts with an arbitrary core point \(x\) and
determines all points in \(\mathcal{A}\) that are density-reachable from
\(x\) with respect to \(\epsilon\) and minPts. These points form
a cluster, consisting of core points and potential border points. The
algorithm then picks a non-cluster core point and creates the next cluster.
This process continues until all core points are assigned to clusters. The
remaining points that do not belong to any cluster are non-core and usually
located in regions with low density of observations. They are termed noise
points or outliers.
DBSCAN requires the two key parameters \(\epsilon\) and minPts.
For setting appropriate values, users may apply specific domain knowledge or
employ a good heuristic method as described in Sander et al. (1998). They
suggest setting minPts to twice the data dimensionality,
minPts = \(2 \cdot n\) . For the determination of \(\epsilon\)
they use a graphical approach: First, for \(k = 2 \cdot n - 1\), the distance
of each observation to its \(k\) - th nearest neighbor is determined,
resulting in a function \(F_{k}: \mathcal{A} \rightarrow \mathbb{R}_+\). The
distance values \(F_k(\mathcal{A})\) are then ordered in decreasing order and
plotted in a 2-dimensional \(k\)-distance graph. From the graph, the user
selects an observation \(x_0\) as the threshold point and sets \(F_k(x_0)\)
as the \(\epsilon\) - value. In practice, the threshold point can e.g. be
derived from a known percentage of noise in the data or as the first point in
the first "valley" of the \(k\)-distance graph.
minPts and \(\epsilon\) is offered by Scitovski et al. (2021)
and is implemented in auxiliary method
computeDBSCANParams(double).
DBSCAN usually works well on preponderant homogeneous data, i.e. data where almost all core points contain approximately the same number of data points in their \(\epsilon\) - neighborhoods. For non-homogeneous data, where clusters can have highly different densities, the results might sometimes be unsatisfactory.
ClassDBSCAN contains an implementation of the DBSCAN algorithm
based on Ester et al. (1996) and Scitovski et al. (2021).
References
- Ester, M., H.-P. Kriegel, J. Sander and X. Xu (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, Portland, Oregon.
- Sander, J., M. Ester, H.-P. Kriegel, and X. Xu (1998), Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications , Data Mining and Knowledge Discovery 2, 2, 169-194.
- Scitovski, R., K. Sabo, F. Martinez-Alvarez, and S. Ungar (2021), Cluster Analysis and Applications, Springer Nature Switzerland, Cham, Switzerland.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic classClass that holds the minimum number of points and epsilon parameters of theDBSCANalgorithm.static interfacePublic interface for the user-supplied function to compute the distances between points. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionclone()Clones aDBSCANcluster object.final voidcompute(int minPts, double epsilon) Clusters the observations using theDBSCANalgorithm with user-defined epsilon radius and minimum number of points per core point.computeDBSCANParams(double epsQuantile) Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.int[]Returns the number of observations in each cluster.int[]Returns the cluster membership for each observation.int[][]Returns the clusters and outliers computed by DBSCAN.intReturns the number ofjava.lang.Threadinstances used for parallel processing.voidsetNumberOfThreads(int numberOfThreads) Sets the number ofjava.lang.Threadinstances to be used for parallel processing.
-
Constructor Details
-
DBSCAN
Constructor for classDBSCAN.- Parameters:
f- aFunctionobject, the user-supplied distance metricx- adoublematrix containing the observations to be clustered in row-wise order
-
-
Method Details
-
compute
public final void compute(int minPts, double epsilon) Clusters the observations using theDBSCANalgorithm with user-defined epsilon radius and minimum number of points per core point.- Parameters:
minPts- a positiveint, the minimum number of observations in the epsilon neighborhood of a point (including the point itself) required to declare the point a core pointepsilon- a non-negativedouble, the radius defining the epsilon neighborhood of all observations
-
computeDBSCANParams
Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.For a given set \(\mathcal{A}\) of observations, the minimum number of points is estimated as
minPts\(=\lfloor \log (| \mathcal{A} |) \rfloor\) (see Scitovski et al. (2021), p. 94). For each observation, the minimum epsilon ball that contains at leastminPtsobservation points is computed. For the resulting set of epsilon radii the quantile is determined and returned as epsilon radius.This method is parallelized. The number of threads can be controlled by method
setNumberOfThreads(int).- Parameters:
epsQuantile- adoublebetween 0 and 1, the (epsQuantile* 100)% quantile of the minimum radii of the epsilon balls around each observation point that contain at leastminPtspoints. The quantile should reflect the expected percentage of core points in the observation set. Usually, a value between 0.9 and 0.99 should be chosen.- Returns:
- a
DBSCANParamsobject containing the minimum number of points per core point and epsilon values that can be used as arguments in methodcompute(int, double)
-
getClusters
public int[][] getClusters()Returns the clusters and outliers computed by DBSCAN.Note that the
compute(int, double)method must be invoked first before invoking this method. Otherwise, the method throws anIllegalStateExceptionexception.- Returns:
- an
intjagged array containing the indices of the observations belonging to each cluster.Row 0 contains the indices of observations identified as outliers, that is, observations that could not be assigned to any cluster by DBSCAN. If no outliers are present, row 0 is an array of size 0. Row 1 contains the indices of all observations belonging to cluster 1, row 2 contains the indices of all observations belonging to cluster 2, etc.
-
getClusterMembership
public int[] getClusterMembership()Returns the cluster membership for each observation.Note that the
compute(int, double)method must be invoked first before invoking this method. Otherwise, the method throws anIllegalStateExceptionexception.- Returns:
- an
intarray containing the cluster membership for each observationCluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc. An outlier, that is, an observation that could not be assigned to any cluster by
DBSCAN, is indicated by the value 0.
-
getClusterCounts
public int[] getClusterCounts()Returns the number of observations in each cluster.Note that the
compute(int, double)method must be invoked first before invoking this method. Otherwise, the method throws anIllegalStateExceptionexception.- Returns:
- an
intarray containing the number of observations in each cluster. Element 0 contains the number of outliers, element 1 the number of elements in cluster 1, element 2 the number of elements in cluster 2, etc.
-
setNumberOfThreads
public void setNumberOfThreads(int numberOfThreads) Sets the number ofjava.lang.Threadinstances to be used for parallel processing.- Parameters:
numberOfThreads- anintspecifying the number ofjava.lang.Threadinstances to be used for parallel processingDefault:
numberOfThreads= 1.
-
getNumberOfThreads
public int getNumberOfThreads()Returns the number ofjava.lang.Threadinstances used for parallel processing.- Returns:
- an
intcontaining the number ofjava.lang.Threadinstances used for parallel processing
-
clone
Clones aDBSCANcluster object.
-