Class DecisionTreeInfoGain

All Implemented Interfaces:
Serializable, Cloneable
Direct Known Subclasses:
ALACART, C45

public abstract class DecisionTreeInfoGain extends DecisionTree implements Serializable, Cloneable

Abstract class that extends DecisionTree for classes that use an information gain criterion.

  • Constructor Details

    • DecisionTreeInfoGain

      public DecisionTreeInfoGain(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
      Constructs a DecisionTreeInfoGain object for a single response variable and multiple predictor variables.
      Parameters:
      xy - a double matrix with rows containing the observations on the predictor variables and one response variable
      responseColumnIndex - an int specifying the column index of the response variable
      varType - a PredictiveModel.VariableType array containing the type of each variable
  • Method Details

    • selectSplitVariable

      protected abstract int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
      Abstract method for selecting the next split variable and split definition for the node.
      Specified by:
      selectSplitVariable in class DecisionTree
      Parameters:
      xy - a double matrix containing the data
      classCounts - a double array containing the counts for each class of the response variable, when it is categorical
      parentFreq - a double array used to indicate which subset of the observations belong in the current node
      splitValue - a double array representing the resulting split point if the selected variable is quantitative
      splitCriterionValue - a double array containing the value of the criterion used to determine the splitting variable
      splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
      Returns:
      an int specifying the column index of the split variable in this.getPredictorIndexes
    • setGainCriteria

      public void setGainCriteria(DecisionTreeInfoGain.GainCriteria gainCriteria)
      Specifies which criteria to use in gain calculations in order to determine the best split at each node.
      Parameters:
      gainCriteria - a DecisionTreeInfoGain.GainCriteria specifying which criteria to use in gain calculations in order to determine the best split at each node

      Default: gainCriteria = DecisionTreeInfoGain.GainCriteria.SHANNON_ENTROPY
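
      For orientation, the default criterion named above, Shannon entropy, can be sketched for a vector of class counts as follows. This is a minimal illustration, not the library's implementation: the class EntropySketch and the method shannonEntropy are hypothetical names, and the base-2 logarithm is an assumption (the source does not state which base is used).

```java
// Hypothetical sketch: Shannon entropy of a class-count vector.
// Not part of DecisionTreeInfoGain; names and log base are assumptions.
final class EntropySketch {

    /** Returns the base-2 Shannon entropy of the given class counts. */
    static double shannonEntropy(double[] classCounts) {
        double total = 0.0;
        for (double c : classCounts) {
            total += c;
        }
        if (total <= 0.0) {
            return 0.0; // an empty node carries no uncertainty
        }
        double entropy = 0.0;
        for (double c : classCounts) {
            if (c > 0.0) { // classes with zero count contribute nothing
                double p = c / total;
                entropy -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        // An evenly split node versus a pure node.
        System.out.println(shannonEntropy(new double[] {5, 5}));
        System.out.println(shannonEntropy(new double[] {10, 0}));
    }
}
```

      A pure node (all observations in one class) yields zero entropy, and an evenly split two-class node yields the maximum for two classes.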

    • useGainRatio

      public boolean useGainRatio()
      Returns whether or not the gain ratio is to be used instead of the gain to determine the best split.
      Returns:
      a boolean indicating if the gain ratio is to be used

      true uses the gain ratio; false uses the gain.

    • setUseRatio

      public void setUseRatio(boolean ratio)
      Sets the flag to use or not use the gain ratio instead of the gain to determine the best split.
      Parameters:
      ratio - a boolean indicating if the gain ratio is to be used

      true uses the gain ratio; false uses the gain.

      Default: useRatio=false
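
      To illustrate the difference between the gain and the gain ratio, the sketch below uses the definition from Quinlan's C4.5: the gain divided by the split information, which is the entropy of the branch sizes. The class GainRatioSketch and its methods are hypothetical, and whether this class computes the ratio in exactly this way is an assumption.

```java
// Hypothetical sketch: gain ratio = gain / split information.
// Not part of DecisionTreeInfoGain; names and log base are assumptions.
final class GainRatioSketch {

    /** Entropy (base 2) of the branch sizes produced by a split. */
    static double splitInformation(double[] branchCounts) {
        double total = 0.0;
        for (double c : branchCounts) {
            total += c;
        }
        if (total <= 0.0) {
            return 0.0;
        }
        double info = 0.0;
        for (double c : branchCounts) {
            if (c > 0.0) {
                double p = c / total;
                info -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return info;
    }

    /** Returns gain / splitInformation, or 0 when the split is degenerate. */
    static double gainRatio(double gain, double[] branchCounts) {
        double info = splitInformation(branchCounts);
        return info > 0.0 ? gain / info : 0.0;
    }

    public static void main(String[] args) {
        // The same gain is penalized more when spread over many small branches.
        System.out.println(gainRatio(0.5, new double[] {4, 4}));
        System.out.println(gainRatio(0.5, new double[] {2, 2, 2, 2}));
    }
}
```

      The ratio penalizes splits that scatter the observations over many branches, which the plain gain tends to favor.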

    • getCriteriaValueCategorical

      protected double getCriteriaValueCategorical(double[][] tableXY, double[] classCounts, int nRows, int maxNumCats)
      Calculates and returns the value of the criterion on the node represented by the data set S = xy.
      Parameters:
      tableXY -
      classCounts - a double array containing the total counts of response variable by category
      nRows - an int, the number of rows in xy
      maxNumCats - an int, the maximum number of categorical values allowed in the problem
      Returns:
      a double, the value of the splitting criterion
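
      As a rough illustration of a criterion value computed from a two-way table, the sketch below evaluates the entropy-based information gain of a categorical predictor. It assumes that the table rows index the predictor categories and the columns index the response classes; the class InfoGainSketch, its methods, and the base-2 logarithm are all hypothetical and are not taken from this class.

```java
// Hypothetical sketch: information gain from a predictor-by-response table.
// Assumes table[i][j] = count of observations with predictor category i
// and response class j. Not the library's implementation.
final class InfoGainSketch {

    /** Base-2 Shannon entropy of a vector of counts. */
    static double entropy(double[] counts) {
        double total = 0.0;
        for (double c : counts) {
            total += c;
        }
        if (total <= 0.0) {
            return 0.0;
        }
        double h = 0.0;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /** Parent entropy minus the size-weighted entropy of each predictor category. */
    static double informationGain(double[][] table, double[] classCounts) {
        double n = 0.0;
        for (double c : classCounts) {
            n += c;
        }
        if (n <= 0.0) {
            return 0.0;
        }
        double conditional = 0.0;
        for (double[] row : table) {
            double rowTotal = 0.0;
            for (double c : row) {
                rowTotal += c;
            }
            if (rowTotal > 0.0) {
                conditional += (rowTotal / n) * entropy(row);
            }
        }
        return entropy(classCounts) - conditional;
    }

    public static void main(String[] args) {
        // A predictor that separates the two classes perfectly.
        double[][] table = {{4, 0}, {0, 4}};
        System.out.println(informationGain(table, new double[] {4, 4}));
    }
}
```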
    • getCountXY

      protected double[][] getCountXY(double[][] xy, int nRows, int xIdx, int yIdx, int maxNumberOfCategories, int[] uniqueX, int[] uniqueY, double[] frequencies)
      Calculates a two-way frequency table with input frequencies. Note: The function assumes that the data is encoded as indices 0,1,...,K < maxNumberOfCategories.
      Parameters:
      xy - a double matrix containing the data to be tabulated
      nRows - an int, the number of rows in xy
      xIdx - an int, the column index of x
      yIdx - an int, the column index of y
      maxNumberOfCategories - an int, the maximum number of categories in either x or y
      uniqueX - an int array containing indicators for the categories of x
      uniqueY - an int array containing indicators for the categories of y
      frequencies - a double array of length nRows containing the row frequencies for the data
      Returns:
      a maxNumberOfCategories by maxNumberOfCategories double matrix containing the cross-tabulated frequencies for x and y. The categories of x vary along the rows and the categories of y vary along the columns.
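
      The tabulation described above can be sketched as a weighted cross-tabulation. This is a simplified illustration under stated assumptions, not the protected method itself: the class CountXYSketch and the method crossTabulate are hypothetical, the uniqueX and uniqueY arguments are omitted because their exact semantics are not documented here, and the data is assumed to be already encoded as integer indices below maxNumberOfCategories.

```java
// Hypothetical sketch: weighted two-way frequency table.
// Omits uniqueX/uniqueY; assumes categories are encoded as 0, 1, ..., K
// with K < maxNumberOfCategories. Not the library's implementation.
final class CountXYSketch {

    /** Rows of the result index categories of x; columns index categories of y. */
    static double[][] crossTabulate(double[][] xy, int nRows, int xIdx, int yIdx,
                                    int maxNumberOfCategories, double[] frequencies) {
        double[][] table = new double[maxNumberOfCategories][maxNumberOfCategories];
        for (int i = 0; i < nRows; i++) {
            int x = (int) xy[i][xIdx]; // category index of the predictor
            int y = (int) xy[i][yIdx]; // category index of the response
            table[x][y] += frequencies[i]; // each row counts with its frequency
        }
        return table;
    }

    public static void main(String[] args) {
        double[][] xy = {{0, 1}, {1, 0}, {0, 1}, {1, 1}};
        double[] frequencies = {1, 2, 1, 1};
        double[][] table = crossTabulate(xy, xy.length, 0, 1, 2, frequencies);
        for (double[] row : table) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

      Cells for category pairs that never occur stay at zero, so the result is always a full maxNumberOfCategories by maxNumberOfCategories matrix.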