com.imsl.stat.DBSCAN

All Implemented Interfaces:: Serializable, Cloneable

public class DBSCAN extends Object implements Serializable, Cloneable

Perform a DBSCAN cluster analysis.

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm that can detect clusters of different shapes. It is based on the notion of density-reachability: Let $\mathcal{A} \subset \mathbb{R}^n$ be a set of observations and $d: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}_+ $ be a distance function. The (closed) $\epsilon$ - neighborhood of a point $x \in \mathcal{A} $ is defined as $$ \mathcal{N}_\epsilon(x)= \{p \in \mathcal{A} \; | \; d(x,p) \le \epsilon\}.$$ A point $x$ is said to be a core point if the number of points in the $\epsilon$ - neighborhood of $x$ is at least minPts, $|\mathcal{N}_\epsilon(x)| \ge $ minPts. A point $y$ is directly density-reachable from a point $x$ with respect to $\epsilon$ and minPts, if $y \in \mathcal{N}_\epsilon(x)$ and $x$ is a core point. More generally, a point $y$ is said to be density-reachable from a point $x$ if there is a sequence of points $x=x_1,x_2,\ldots,x_i=y$ such that $x_k$ is directly density-reachable from $x_{k-1}$ for $k=2,\ldots,i$. Note that this definition only implies the points $x_1,x_2,\ldots,x_{i-1}$ to be core points. The last point $x_i$ can be a non-core point as well.

The DBSCAN algorithm starts with an arbitrary core point $x$ and determines all points in $\mathcal{A}$ that are density-reachable from $x$ with respect to $\epsilon$ and minPts. These points form a cluster, consisting of core points and potential border points. The algorithm then picks a non-cluster core point and creates the next cluster. This process continues until all core points are assigned to clusters. The remaining points that do not belong to any cluster are non-core and usually located in regions with low density of observations. They are termed noise points or outliers.

DBSCAN requires the two key parameters $\epsilon$ and minPts. For setting appropriate values, users may apply specific domain knowledge or employ a good heuristic method as described in Sander et al. (1998). They suggest setting minPts to twice the data dimensionality, minPts = $2 \cdot n$ . For the determination of $\epsilon$ they use a graphical approach: First, for $k = 2 \cdot n - 1$, the distance of each observation to its $k$ - th nearest neighbor is determined, resulting in a function $F_{k}: \mathcal{A} \rightarrow \mathbb{R}_+$. The distance values $F_k(\mathcal{A})$ are then ordered in decreasing order and plotted in a 2-dimensional $k$-distance graph. From the graph, the user selects an observation $x_0$ as the threshold point and sets $F_k(x_0)$ as the $\epsilon$ - value. In practice, the threshold point can e.g. be derived from a known percentage of noise in the data or as the first point in the first "valley" of the $k$-distance graph.

Another, fully automated approach for the determination of minPts and $\epsilon$ is offered by Scitovski et al. (2021) and is implemented in auxiliary method computeDBSCANParams(double).

DBSCAN usually works well on preponderant homogeneous data, i.e. data where almost all core points contain approximately the same number of data points in their $\epsilon$ - neighborhoods. For non-homogeneous data, where clusters can have highly different densities, the results might sometimes be unsatisfactory.

Class DBSCAN contains an implementation of the DBSCAN algorithm based on Ester et al. (1996) and Scitovski et al. (2021).

References

Ester, M., H.-P. Kriegel, J. Sander and X. Xu (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, Portland, Oregon.
Sander, J., M. Ester, H.-P. Kriegel, and X. Xu (1998), Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications , Data Mining and Knowledge Discovery 2, 2, 169-194.
Scitovski, R., K. Sabo, F. Martinez-Alvarez, and S. Ungar (2021), Cluster Analysis and Applications, Springer Nature Switzerland, Cham, Switzerland.

See Also:

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

DBSCAN.DBSCANParams

Class that holds the minimum number of points and epsilon parameters of the DBSCAN algorithm.

static interface

DBSCAN.Function

Public interface for the user-supplied function to compute the distances between points.
Constructor Summary

Constructors

Constructor

Description

DBSCAN(DBSCAN.Function f, double[][] x)

Constructor for class DBSCAN.
Method Summary

Modifier and Type

Method

Description

DBSCAN

clone()

Clones a DBSCAN cluster object.

final void

compute(int minPts, double epsilon)

Clusters the observations using the DBSCAN algorithm with user-defined epsilon radius and minimum number of points per core point.

DBSCAN.DBSCANParams

computeDBSCANParams(double epsQuantile)

Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.

int[]

getClusterCounts()

Returns the number of observations in each cluster.

int[]

getClusterMembership()

Returns the cluster membership for each observation.

int[][]

getClusters()

Returns the clusters and outliers computed by DBSCAN.

int

getNumberOfThreads()

Returns the number of java.lang.Thread instances used for parallel processing.

void

setNumberOfThreads(int numberOfThreads)

Sets the number of java.lang.Thread instances to be used for parallel processing.

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DBSCAN
  
  public DBSCAN(DBSCAN.Function f, double[][] x)
  
  Constructor for class DBSCAN.
  
  Parameters:
  
  f - a Function object, the user-supplied distance metric
  
  x - a double matrix containing the observations to be clustered in row-wise order
Method Details
- compute
  
  public final void compute(int minPts, double epsilon)
  
  Clusters the observations using the DBSCAN algorithm with user-defined epsilon radius and minimum number of points per core point.
  
  Parameters:
  
  minPts - a positive int, the minimum number of observations in the epsilon neighborhood of a point (including the point itself) required to declare the point a core point
  
  epsilon - a non-negative double, the radius defining the epsilon neighborhood of all observations
- computeDBSCANParams
  
  public DBSCAN.DBSCANParams computeDBSCANParams(double epsQuantile)
  
  Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.
  For a given set $\mathcal{A}$ of observations, the minimum number of points is estimated as minPts $=\lfloor \log (| \mathcal{A} |) \rfloor$ (see Scitovski et al. (2021), p. 94). For each observation, the minimum epsilon ball that contains at least minPts observation points is computed. For the resulting set of epsilon radii the quantile is determined and returned as epsilon radius.
  
  This method is parallelized. The number of threads can be controlled by method setNumberOfThreads(int).
  
  Parameters:
  
  epsQuantile - a double between 0 and 1, the (epsQuantile * 100)% quantile of the minimum radii of the epsilon balls around each observation point that contain at least minPts points. The quantile should reflect the expected percentage of core points in the observation set. Usually, a value between 0.9 and 0.99 should be chosen.
  
  Returns:
  
  a DBSCANParams object containing the minimum number of points per core point and epsilon values that can be used as arguments in method compute(int, double)
- getClusters
  
  public int[][] getClusters()
  
  Returns the clusters and outliers computed by DBSCAN.
  Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.
  
  Returns:
  
  an int jagged array containing the indices of the observations belonging to each cluster.
  Row 0 contains the indices of observations identified as outliers, that is, observations that could not be assigned to any cluster by DBSCAN. If no outliers are present, row 0 is an array of size 0. Row 1 contains the indices of all observations belonging to cluster 1, row 2 contains the indices of all observations belonging to cluster 2, etc.
- getClusterMembership
  
  public int[] getClusterMembership()
  
  Returns the cluster membership for each observation.
  Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.
  
  Returns:
  
  an int array containing the cluster membership for each observation
  Cluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc. An outlier, that is, an observation that could not be assigned to any cluster by DBSCAN, is indicated by the value 0.
- getClusterCounts
  
  public int[] getClusterCounts()
  
  Returns the number of observations in each cluster.
  Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.
  
  Returns:
  
  an int array containing the number of observations in each cluster. Element 0 contains the number of outliers, element 1 the number of elements in cluster 1, element 2 the number of elements in cluster 2, etc.
- setNumberOfThreads
  
  public void setNumberOfThreads(int numberOfThreads)
  
  Sets the number of java.lang.Thread instances to be used for parallel processing.
  
  Parameters:
  
  numberOfThreads - an int specifying the number of java.lang.Thread instances to be used for parallel processing
  Default: numberOfThreads = 1.
- getNumberOfThreads
  
  public int getNumberOfThreads()
  
  Returns the number of java.lang.Thread instances used for parallel processing.
  
  Returns:
  
  an int containing the number of java.lang.Thread instances used for parallel processing
- clone
  
  public DBSCAN clone()
  
  Clones a DBSCAN cluster object.
  
  Overrides:
  
  clone in class Object
  
  Returns:
  
  a clone of the DBSCAN cluster object

Class DBSCAN

Nested Class Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

DBSCAN

Method Details

compute

computeDBSCANParams

getClusters

getClusterMembership

getClusterCounts

setNumberOfThreads

getNumberOfThreads

clone