com.imsl.datamining.decisionTree.DecisionTreeInfoGain

All Implemented Interfaces: Serializable, Cloneable

Direct Known Subclasses: ALACART, C45

public abstract class DecisionTreeInfoGain extends DecisionTree implements Serializable, Cloneable

Abstract class that extends DecisionTree for classes that use an information gain criterion.

See Also:

Serialized Form

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

DecisionTreeInfoGain.GainCriteria

Specifies which information gain criterion to use in determining the best split at each node.

Nested classes/interfaces inherited from class com.imsl.datamining.decisionTree.DecisionTree
DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException

Nested classes/interfaces inherited from class com.imsl.datamining.PredictiveModel
PredictiveModel.CloneNotSupportedException, PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType
Constructor Summary

Constructors

Constructor

Description

DecisionTreeInfoGain(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)

Constructs a DecisionTree object for a single response variable and multiple predictor variables.
Method Summary

Modifier and Type

Method

Description

protected double[][]

getCountXY(double[][] xy, int nRows, int xIdx, int yIdx, int maxNumberOfCategories, int[] uniqueX, int[] uniqueY, double[] frequencies)

Calculates a two-way frequency table with input frequencies.

protected double

getCriteriaValueCategorical(double[][] tableXY, double[] classCounts, int nRows, int maxNumCats)

Calculates and returns the value of the criterion on the node represented by the data set S = xy.

protected abstract int

selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)

Abstract method for selecting the next split variable and split definition for the node.

void

setGainCriteria(DecisionTreeInfoGain.GainCriteria gainCriteria)

Specifies which criterion to use in gain calculations in order to determine the best split at each node.

void

setUseRatio(boolean ratio)

Sets the flag to use or not use the gain ratio instead of the gain to determine the best split.

boolean

useGainRatio()

Returns whether or not the gain ratio is to be used instead of the gain to determine the best split.
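
As a rough illustration of what the gain and gain-ratio options in this summary choose between, the following sketch computes a Shannon-entropy information gain and a gain ratio from a two-way frequency table. It is not the library's implementation: the class GainSketch and its methods are hypothetical names, and the gain-ratio formula (gain divided by the entropy of the child-node sizes) is the textbook C4.5 definition, assumed here rather than taken from this page.

```java
public class GainSketch {

    /** Shannon entropy, in bits, of a vector of non-negative counts. */
    public static double entropy(double[] counts) {
        double total = 0.0;
        for (double c : counts) {
            total += c;
        }
        if (total <= 0.0) {
            return 0.0;
        }
        double h = 0.0;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Row totals of table[x][y]: the size of each child node. */
    private static double[] rowTotals(double[][] table) {
        double[] totals = new double[table.length];
        for (int i = 0; i < table.length; i++) {
            for (double v : table[i]) {
                totals[i] += v;
            }
        }
        return totals;
    }

    /** Column totals of table[x][y]: the class counts in the parent node. */
    private static double[] columnTotals(double[][] table) {
        double[] totals = new double[table[0].length];
        for (double[] row : table) {
            for (int j = 0; j < row.length; j++) {
                totals[j] += row[j];
            }
        }
        return totals;
    }

    /** Parent entropy minus the size-weighted entropy of the child nodes. */
    public static double informationGain(double[][] table) {
        double[] childSizes = rowTotals(table);
        double total = 0.0;
        for (double s : childSizes) {
            total += s;
        }
        double gain = entropy(columnTotals(table));
        for (int i = 0; i < table.length; i++) {
            if (childSizes[i] > 0.0) {
                gain -= (childSizes[i] / total) * entropy(table[i]);
            }
        }
        return gain;
    }

    /** Gain divided by the entropy of the child sizes (assumed C4.5-style definition). */
    public static double gainRatio(double[][] table) {
        double splitInfo = entropy(rowTotals(table));
        return splitInfo > 0.0 ? informationGain(table) / splitInfo : 0.0;
    }

    public static void main(String[] args) {
        // Rows are categories of the predictor, columns are response classes.
        double[][] twoWay = {{4, 0}, {0, 4}};
        double[][] fourWay = {{2, 0}, {2, 0}, {0, 2}, {0, 2}};
        System.out.println("two-way gain:   " + informationGain(twoWay));
        System.out.println("two-way ratio:  " + gainRatio(twoWay));
        System.out.println("four-way gain:  " + informationGain(fourWay));
        System.out.println("four-way ratio: " + gainRatio(fourWay));
    }
}
```

Both example tables separate the classes perfectly and so have the same gain, but the four-way split has a larger split entropy and therefore a smaller gain ratio. Penalizing splits with many branches in this way is the usual motivation for preferring the ratio over the raw gain.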

Methods inherited from class com.imsl.datamining.decisionTree.DecisionTree
fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinCostComplexityValue, getMinObsPerChildNode, getMinObsPerNode, getNodeAssigments, getNumberOfComplexityValues, getNumberOfRandomFeatures, isAutoPruningFlag, isRandomFeatureSelection, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setConfiguration, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode, setNumberOfRandomFeatures, setRandomFeatureSelection

Methods inherited from class com.imsl.datamining.PredictiveModel
clone, getClassCounts, getClassErrors, getClassErrors, getClassLabels, getClassProbabilities, getCostMatrix, getMaxNumberOfCategories, getMaxNumberOfIterations, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getRandomObject, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isConstantSeries, isMustFitModel, isUserFixedNClasses, setClassCounts, setClassLabels, setClassProbabilities, setCostMatrix, setMaxNumberOfCategories, setMaxNumberOfIterations, setMustFitModel, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setRandomObject, setResponseColumnIndex, setTrainingData, setVariableType, setWeights

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DecisionTreeInfoGain
  
  public DecisionTreeInfoGain(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
  
  Constructs a DecisionTree object for a single response variable and multiple predictor variables.
  
  Parameters:
  
  xy - a double matrix with rows containing the observations on the predictor variables and one response variable
  
  responseColumnIndex - an int specifying the column index of the response variable
  
  varType - a PredictiveModel.VariableType array containing the type of each variable
Method Details
- selectSplitVariable
  
  protected abstract int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
  
  Abstract method for selecting the next split variable and split definition for the node.
  
  Specified by:
  
  selectSplitVariable in class DecisionTree
  
  Parameters:
  
  xy - a double matrix containing the data
  
  classCounts - a double array containing the counts for each class of the response variable, when it is categorical
  
  parentFreq - a double array used to indicate which subset of the observations belong in the current node
  
  splitValue - a double array representing the resulting split point if the selected variable is quantitative
  
  splitCriterionValue - a double array containing the value of the criterion used to determine the splitting variable
  
  splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
  
  Returns:
  
  an int specifying the column index of the split variable in this.getPredictorIndexes()
- setGainCriteria
  
  public void setGainCriteria(DecisionTreeInfoGain.GainCriteria gainCriteria)
  
  Specifies which criterion to use in gain calculations in order to determine the best split at each node.
  
  Parameters:
  
  gainCriteria - a DecisionTreeInfoGain.GainCriteria specifying which criterion to use in gain calculations in order to determine the best split at each node
  Default: gainCriteria = DecisionTreeInfoGain.GainCriteria.SHANNON_ENTROPY
- useGainRatio
  
  public boolean useGainRatio()
  
  Returns whether or not the gain ratio is to be used instead of the gain to determine the best split.
  
  Returns:
  
  a boolean indicating whether the gain ratio is used: true uses the gain ratio; false uses the gain.
- setUseRatio
  
  public void setUseRatio(boolean ratio)
  
  Sets the flag to use or not use the gain ratio instead of the gain to determine the best split.
  
  Parameters:
  
  ratio - a boolean indicating whether the gain ratio is used: true uses the gain ratio; false uses the gain.
  
  Default: useRatio=false
- getCriteriaValueCategorical
  
  protected double getCriteriaValueCategorical(double[][] tableXY, double[] classCounts, int nRows, int maxNumCats)
  
  Calculates and returns the value of the criterion on the node represented by the data set S = xy.
  
  Parameters:
  
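  tableXY - a double matrix containing a two-way frequency table of the predictor and response categories
  
  classCounts - a double array containing the total counts of the response variable by category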
  
  nRows - an int, the number of rows in xy
  
  maxNumCats - an int, the maximum number of categorical values allowed in the problem
  
  Returns:
  
  a double, the value of the splitting criterion
- getCountXY
  
  protected double[][] getCountXY(double[][] xy, int nRows, int xIdx, int yIdx, int maxNumberOfCategories, int[] uniqueX, int[] uniqueY, double[] frequencies)
  
  Calculates a two-way frequency table with input frequencies. Note: The function assumes that the data is encoded as indices 0, 1, ..., K, where K < maxNumberOfCategories.
  
  Parameters:
  
  xy - a double matrix containing the data to be tabulated
  
  nRows - an int, the number of rows in xy
  
  xIdx - an int, the column index of x
  
  yIdx - an int, the column index of y
  
  maxNumberOfCategories - an int, the maximum number of categories in either x or y
  
  uniqueX - an int array containing indicators for the categories of x
  
  uniqueY - an int array containing indicators for the categories of y
  
  frequencies - a double array of length nRows containing the row frequencies for the data
  
  Returns:
  
  a maxNumberOfCategories by maxNumberOfCategories double matrix containing the cross-tabulated frequencies for x and y. The categories of x vary along the rows and the categories of y vary along the columns.
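
The tabulation described for getCountXY can be pictured with the stand-alone sketch below. It is a guess at the behavior based only on the parameter descriptions above, not the library's code: the class CountXYSketch and the method countXY are hypothetical names, and two details are assumptions, namely that uniqueX and uniqueY are output flags set to 1 for each category that is observed, and that rows with a frequency of zero are skipped.

```java
import java.util.Arrays;

public class CountXYSketch {

    /**
     * Builds table[x][y] = sum of frequencies[i] over rows i whose x and y
     * columns hold the category indices x and y.
     */
    public static double[][] countXY(double[][] xy, int nRows, int xIdx, int yIdx,
            int maxNumberOfCategories, int[] uniqueX, int[] uniqueY,
            double[] frequencies) {
        double[][] table = new double[maxNumberOfCategories][maxNumberOfCategories];
        for (int i = 0; i < nRows; i++) {
            if (frequencies[i] == 0.0) {
                continue; // assumption: a zero frequency excludes the row
            }
            // Data is assumed to be encoded as indices 0, 1, ..., K.
            int x = (int) xy[i][xIdx];
            int y = (int) xy[i][yIdx];
            table[x][y] += frequencies[i];
            uniqueX[x] = 1; // assumption: flag each category that occurs
            uniqueY[y] = 1;
        }
        return table;
    }

    public static void main(String[] args) {
        // Column 0 is the predictor x, column 1 is the response y.
        double[][] xy = {{0, 1}, {1, 0}, {0, 1}, {2, 1}};
        double[] frequencies = {1, 1, 2, 0};
        int[] uniqueX = new int[3];
        int[] uniqueY = new int[3];
        double[][] table = countXY(xy, xy.length, 0, 1, 3, uniqueX, uniqueY, frequencies);
        System.out.println(Arrays.deepToString(table));
        System.out.println(Arrays.toString(uniqueX) + " " + Arrays.toString(uniqueY));
    }
}
```

A table built this way has the categories of x along the rows and the categories of y along the columns, which matches the layout stated in the return description.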
