Package com.imsl.stat

Class ClusterKNN

java.lang.Object
com.imsl.stat.ClusterKNN
All Implemented Interfaces:
Serializable, Cloneable

public class ClusterKNN extends Object implements Serializable, Cloneable

Perform a k-Nearest Neighbor classification.

ClusterKNN implements an algorithm to classify objects based on a training set. Among the simpler algorithms for classification, classifying a new object is essentially a majority vote of its closest k neighbors. k must be a positive integer and is typically small and odd. The method is straightforward in that the distance from the new point to every point in the training set is computed and sorted. The k closest points are examined and the new object is assigned to the class that is most common in that set. For the case k = 1 the object is assigned to the class of its nearest neighbor.

The default distance method is the Euclidean distance, but other options are available by using the setDistanceMethod method. The supported methods are:

method Description
L2_NORM The Euclidean distance method, \( L_2\) norm, defined as the sum of the squares of the difference of each coordinate. (Default)
L1_NORM The rectilinear norm or city block method, \(L_1\) norm, defined as the sum of the absolute values of the difference of each coordinate. This is most useful for integer input data.
INFINITY_NORM The Chebyshev distance method, \(L_{\infty} \) norm, defined as the maximum of the absolute values of the difference of each coordinate.

For cases where the data are poorly scaled, it may be necessary to normalize the input data first. For example, if in a 2D space the X values range from 0 to 1 and the Y values, from 0 to 1000, the distance calculations will be dominated by the Y coordinate unless they are normalized.

See Also:
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    Indicates the distance is computed using the \(L_{\infty} \) norm method.
    static final int
    Indicates the distance is computed using the \(L_1\) norm method.
    static final int
    Indicates the distance is computed using the \(L_2\) norm, or Euclidean distance measurement.
  • Constructor Summary

    Constructors
    Constructor
    Description
    ClusterKNN(double[][] x, int[] c)
    Constructor for ClusterKNN.
  • Method Summary

    Modifier and Type
    Method
    Description
    int[]
    classify(double[][] value, int k)
    Classify a set of observations using k nearest neighbors.
    int
    classify(double[] value, int k)
    Classify an observation using k nearest neighbors.
    void
    setDistanceMethod(int method)
    Sets the distance calculation method to be used.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • L2_NORM

      public static final int L2_NORM
      Indicates the distance is computed using the \(L_2\) norm, or Euclidean distance measurement.
      See Also:
    • L1_NORM

      public static final int L1_NORM
      Indicates the distance is computed using the \(L_1\) norm method. Also known as rectilinear distance or city block distance, it is most useful for integer input data.
      See Also:
    • INFINITY_NORM

      public static final int INFINITY_NORM
      Indicates the distance is computed using the \(L_{\infty} \) norm method. This is also known as the maximum difference or Chebyshev distance.
      See Also:
  • Constructor Details

    • ClusterKNN

      public ClusterKNN(double[][] x, int[] c)
      Constructor for ClusterKNN.
      Parameters:
      x - a double matrix containing the known x.length observations of x[0].length variables
      c - an int array containing the categories for the x.length observations. All integer values are valid.
  • Method Details

    • classify

      public int classify(double[] value, int k)
      Classify an observation using k nearest neighbors.
      Parameters:
      value - a double array of x[0].length variables containing the observation to classify
      k - an int containing the number of nearest neighbors to use. An odd value is recommended.
      Returns:
      an int containing the cluster to which the observation belongs
    • classify

      public int[] classify(double[][] value, int k)
      Classify a set of observations using k nearest neighbors.
      Parameters:
      value - a double matrix of value.length observations on x[0].length variables to classify
      k - an int containing the number of nearest neighbors to use. An odd value is recommended.
      Returns:
      an int array containing the cluster to which each of the observations belong
    • setDistanceMethod

      public void setDistanceMethod(int method)
      Sets the distance calculation method to be used.
      Parameters:
      method - an int identifying the distance calculation method to be used. By default, method = L2_NORM.

      method Description
      L2_NORM \(\mathrm{d}(\mathbf{p},\mathbf{q})=\sqrt{ \sum_{i=1}^{n}{(q_i-p_i})^2}\)
      L1_NORM \(\mathrm{d}(\mathbf{p},\mathbf{q})=\sum_{i=1}^{n} \lvert{q_i-p_i}\rvert\)
      INFINITY_NORM \(\mathrm{d}(\mathbf{p},\mathbf{q})=\max_i(|p_i-q_i|) \)