Package com.imsl.stat

Class DBSCAN

java.lang.Object
com.imsl.stat.DBSCAN
All Implemented Interfaces:
Serializable, Cloneable

public class DBSCAN extends Object implements Serializable, Cloneable
Perform a DBSCAN cluster analysis.

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm that can detect clusters of different shapes. It is based on the notion of density-reachability: Let \(\mathcal{A} \subset \mathbb{R}^n\) be a set of observations and \(d: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}_+ \) be a distance function. The (closed) \(\epsilon\) - neighborhood of a point \(x \in \mathcal{A} \) is defined as $$ \mathcal{N}_\epsilon(x)= \{p \in \mathcal{A} \; | \; d(x,p) \le \epsilon\}.$$ A point \(x\) is said to be a core point if the number of points in the \(\epsilon\) - neighborhood of \(x\) is at least minPts, \(|\mathcal{N}_\epsilon(x)| \ge \) minPts. A point \(y\) is directly density-reachable from a point \(x\) with respect to \(\epsilon\) and minPts, if \(y \in \mathcal{N}_\epsilon(x)\) and \(x\) is a core point. More generally, a point \(y\) is said to be density-reachable from a point \(x\) if there is a sequence of points \(x=x_1,x_2,\ldots,x_i=y\) such that \(x_k\) is directly density-reachable from \(x_{k-1}\) for \(k=2,\ldots,i\). Note that this definition only implies the points \(x_1,x_2,\ldots,x_{i-1}\) to be core points. The last point \(x_i\) can be a non-core point as well.

The DBSCAN algorithm starts with an arbitrary core point \(x\) and determines all points in \(\mathcal{A}\) that are density-reachable from \(x\) with respect to \(\epsilon\) and minPts. These points form a cluster, consisting of core points and potential border points. The algorithm then picks a non-cluster core point and creates the next cluster. This process continues until all core points are assigned to clusters. The remaining points that do not belong to any cluster are non-core and usually located in regions with low density of observations. They are termed noise points or outliers.

DBSCAN requires the two key parameters \(\epsilon\) and minPts. For setting appropriate values, users may apply specific domain knowledge or employ a good heuristic method as described in Sander et al. (1998). They suggest setting minPts to twice the data dimensionality, minPts = \(2 \cdot n\) . For the determination of \(\epsilon\) they use a graphical approach: First, for \(k = 2 \cdot n - 1\), the distance of each observation to its \(k\) - th nearest neighbor is determined, resulting in a function \(F_{k}: \mathcal{A} \rightarrow \mathbb{R}_+\). The distance values \(F_k(\mathcal{A})\) are then ordered in decreasing order and plotted in a 2-dimensional \(k\)-distance graph. From the graph, the user selects an observation \(x_0\) as the threshold point and sets \(F_k(x_0)\) as the \(\epsilon\) - value. In practice, the threshold point can e.g. be derived from a known percentage of noise in the data or as the first point in the first "valley" of the \(k\)-distance graph.

Another, fully automated approach for the determination of minPts and \(\epsilon\) is offered by Scitovski et al. (2021) and is implemented in auxiliary method computeDBSCANParams(double).

DBSCAN usually works well on preponderant homogeneous data, i.e. data where almost all core points contain approximately the same number of data points in their \(\epsilon\) - neighborhoods. For non-homogeneous data, where clusters can have highly different densities, the results might sometimes be unsatisfactory.

Class DBSCAN contains an implementation of the DBSCAN algorithm based on Ester et al. (1996) and Scitovski et al. (2021).

References

  1. Ester, M., H.-P. Kriegel, J. Sander and X. Xu (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, Portland, Oregon.
  2. Sander, J., M. Ester, H.-P. Kriegel, and X. Xu (1998), Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications , Data Mining and Knowledge Discovery 2, 2, 169-194.
  3. Scitovski, R., K. Sabo, F. Martinez-Alvarez, and S. Ungar (2021), Cluster Analysis and Applications, Springer Nature Switzerland, Cham, Switzerland.

See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    Class that holds the minimum number of points and epsilon parameters of the DBSCAN algorithm.
    static interface 
    Public interface for the user-supplied function to compute the distances between points.
  • Constructor Summary

    Constructors
    Constructor
    Description
    DBSCAN(DBSCAN.Function f, double[][] x)
    Constructor for class DBSCAN.
  • Method Summary

    Modifier and Type
    Method
    Description
    Clones a DBSCAN cluster object.
    final void
    compute(int minPts, double epsilon)
    Clusters the observations using the DBSCAN algorithm with user-defined epsilon radius and minimum number of points per core point.
    computeDBSCANParams(double epsQuantile)
    Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.
    int[]
    Returns the number of observations in each cluster.
    int[]
    Returns the cluster membership for each observation.
    int[][]
    Returns the clusters and outliers computed by DBSCAN.
    int
    Returns the number of java.lang.Thread instances used for parallel processing.
    void
    setNumberOfThreads(int numberOfThreads)
    Sets the number of java.lang.Thread instances to be used for parallel processing.

    Methods inherited from class java.lang.Object

    equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • DBSCAN

      public DBSCAN(DBSCAN.Function f, double[][] x)
      Constructor for class DBSCAN.
      Parameters:
      f - a Function object, the user-supplied distance metric
      x - a double matrix containing the observations to be clustered in row-wise order
  • Method Details

    • compute

      public final void compute(int minPts, double epsilon)
      Clusters the observations using the DBSCAN algorithm with user-defined epsilon radius and minimum number of points per core point.
      Parameters:
      minPts - a positive int, the minimum number of observations in the epsilon neighborhood of a point (including the point itself) required to declare the point a core point
      epsilon - a non-negative double, the radius defining the epsilon neighborhood of all observations
    • computeDBSCANParams

      public DBSCAN.DBSCANParams computeDBSCANParams(double epsQuantile)
      Computes minimum number of points in a core point and epsilon radius for preponderant homogeneous data.

      For a given set \(\mathcal{A}\) of observations, the minimum number of points is estimated as minPts \(=\lfloor \log (| \mathcal{A} |) \rfloor\) (see Scitovski et al. (2021), p. 94). For each observation, the minimum epsilon ball that contains at least minPts observation points is computed. For the resulting set of epsilon radii the quantile is determined and returned as epsilon radius.

      This method is parallelized. The number of threads can be controlled by method setNumberOfThreads(int).

      Parameters:
      epsQuantile - a double between 0 and 1, the (epsQuantile * 100)% quantile of the minimum radii of the epsilon balls around each observation point that contain at least minPts points. The quantile should reflect the expected percentage of core points in the observation set. Usually, a value between 0.9 and 0.99 should be chosen.
      Returns:
      a DBSCANParams object containing the minimum number of points per core point and epsilon values that can be used as arguments in method compute(int, double)
    • getClusters

      public int[][] getClusters()
      Returns the clusters and outliers computed by DBSCAN.

      Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.

      Returns:
      an int jagged array containing the indices of the observations belonging to each cluster.

      Row 0 contains the indices of observations identified as outliers, that is, observations that could not be assigned to any cluster by DBSCAN. If no outliers are present, row 0 is an array of size 0. Row 1 contains the indices of all observations belonging to cluster 1, row 2 contains the indices of all observations belonging to cluster 2, etc.

    • getClusterMembership

      public int[] getClusterMembership()
      Returns the cluster membership for each observation.

      Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.

      Returns:
      an int array containing the cluster membership for each observation

      Cluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc. An outlier, that is, an observation that could not be assigned to any cluster by DBSCAN, is indicated by the value 0.

    • getClusterCounts

      public int[] getClusterCounts()
      Returns the number of observations in each cluster.

      Note that the compute(int, double) method must be invoked first before invoking this method. Otherwise, the method throws an IllegalStateException exception.

      Returns:
      an int array containing the number of observations in each cluster. Element 0 contains the number of outliers, element 1 the number of elements in cluster 1, element 2 the number of elements in cluster 2, etc.
    • setNumberOfThreads

      public void setNumberOfThreads(int numberOfThreads)
      Sets the number of java.lang.Thread instances to be used for parallel processing.
      Parameters:
      numberOfThreads - an int specifying the number of java.lang.Thread instances to be used for parallel processing

      Default: numberOfThreads = 1.

    • getNumberOfThreads

      public int getNumberOfThreads()
      Returns the number of java.lang.Thread instances used for parallel processing.
      Returns:
      an int containing the number of java.lang.Thread instances used for parallel processing
    • clone

      public DBSCAN clone()
      Clones a DBSCAN cluster object.
      Overrides:
      clone in class Object
      Returns:
      a clone of the DBSCAN cluster object