Package com.imsl.datamining.decisionTree
Class DecisionTreeInfoGain
java.lang.Object
com.imsl.datamining.PredictiveModel
com.imsl.datamining.decisionTree.DecisionTree
com.imsl.datamining.decisionTree.DecisionTreeInfoGain
- All Implemented Interfaces:
Serializable, Cloneable
Abstract class that extends DecisionTree for classes that use an
information gain criterion.
Nested Class Summary

Nested Classes
static enum DecisionTreeInfoGain.GainCriteria
    Specifies which information gain criterion to use in determining the best split at each node.

Nested classes/interfaces inherited from class com.imsl.datamining.decisionTree.DecisionTree
DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException

Nested classes/interfaces inherited from class com.imsl.datamining.PredictiveModel
PredictiveModel.CloneNotSupportedException, PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType
-
Constructor Summary

Constructors
DecisionTreeInfoGain(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
    Constructs a DecisionTree object for a single response variable and multiple predictor variables.
-
Method Summary
Modifier and TypeMethodDescriptionprotected double[][]getCountXY(double[][] xy, int nRows, int xIdx, int yIdx, int maxNumberOfCategories, int[] uniqueX, int[] uniqueY, double[] frequencies) Calculates a two-way frequency table with input frequencies.protected doublegetCriteriaValueCategorical(double[][] tableXY, double[] classCounts, int nRows, int maxNumCats) Calculates and returns the value of the criterion on the node represented by the data set S = xy.protected abstract intselectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition) Abstract method for selecting the next split variable and split definition for the node.voidsetGainCriteria(DecisionTreeInfoGain.GainCriteria gainCriteria) Specifies which criteria to use in gain calculations in order to determine the best split at each node.voidsetUseRatio(boolean ratio) Sets the flag to use or not use the gain ratio instead of the gain to determine the best split.booleanReturns whether or not the gain ratio is to be used instead of the gain to determine the best split.Methods inherited from class com.imsl.datamining.decisionTree.DecisionTree
fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinCostComplexityValue, getMinObsPerChildNode, getMinObsPerNode, getNodeAssigments, getNumberOfComplexityValues, getNumberOfRandomFeatures, isAutoPruningFlag, isRandomFeatureSelection, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setConfiguration, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode, setNumberOfRandomFeatures, setRandomFeatureSelectionMethods inherited from class com.imsl.datamining.PredictiveModel
clone, getClassCounts, getClassErrors, getClassErrors, getClassLabels, getClassProbabilities, getCostMatrix, getMaxNumberOfCategories, getMaxNumberOfIterations, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getRandomObject, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isConstantSeries, isMustFitModel, isUserFixedNClasses, setClassCounts, setClassLabels, setClassProbabilities, setCostMatrix, setMaxNumberOfCategories, setMaxNumberOfIterations, setMustFitModel, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setRandomObject, setResponseColumnIndex, setTrainingData, setVariableType, setWeights
-
Constructor Details
-
DecisionTreeInfoGain
public DecisionTreeInfoGain(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
Constructs a DecisionTree object for a single response variable and multiple predictor variables.
- Parameters:
xy - a double matrix with rows containing the observations on the predictor variables and one response variable
responseColumnIndex - an int specifying the column index of the response variable
varType - a PredictiveModel.VariableType array containing the type of each variable
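Usage note: because DecisionTreeInfoGain is abstract, this constructor is normally reached through a concrete subclass. The sketch below is illustrative only; it assumes C45 is such a subclass and that the VariableType constants CATEGORICAL and QUANTITATIVE_CONTINUOUS are available, and the data values are made up.

import com.imsl.datamining.PredictiveModel;
import com.imsl.datamining.decisionTree.C45;

public class InfoGainTreeExample {
    public static void main(String[] args) throws Exception {
        // Each row is one observation; column 2 holds the response.
        double[][] xy = {
            {0.0, 25.5, 1.0},
            {1.0, 41.0, 0.0},
            {1.0, 33.2, 0.0},
            {0.0, 52.7, 1.0}
        };
        // One type per column of xy, response included.
        PredictiveModel.VariableType[] varType = {
            PredictiveModel.VariableType.CATEGORICAL,
            PredictiveModel.VariableType.QUANTITATIVE_CONTINUOUS,
            PredictiveModel.VariableType.CATEGORICAL
        };
        // Column index 2 is the response variable.
        C45 tree = new C45(xy, 2, varType);
        tree.fitModel();
    }
}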
-
-
Method Details
-
selectSplitVariable
protected abstract int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Abstract method for selecting the next split variable and split definition for the node.
- Specified by:
selectSplitVariable in class DecisionTree
- Parameters:
xy - a double matrix containing the data
classCounts - a double array containing the counts for each class of the response variable, when it is categorical
parentFreq - a double array used to indicate which subset of the observations belong in the current node
splitValue - a double array representing the resulting split point if the selected variable is quantitative
splitCriterionValue - a double, the value of the criterion used to determine the splitting variable
splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
- Returns:
an int specifying the column index of the split variable in this.getPredictorIndexes()
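Implementation note: the following schematic sketch shows only the shape of a subclass override; it is not the library's split-search algorithm. The class name, the sentinel return value of -1, and the assumption that selectSplitVariable is the only abstract method left to implement are all illustrative assumptions.

import com.imsl.datamining.PredictiveModel;
import com.imsl.datamining.decisionTree.DecisionTreeInfoGain;

public class MyInfoGainTree extends DecisionTreeInfoGain {
    public MyInfoGainTree(double[][] xy, int responseColumnIndex,
                          PredictiveModel.VariableType[] varType) {
        super(xy, responseColumnIndex, varType);
    }

    @Override
    protected int selectSplitVariable(double[][] xy, double[] classCounts,
            double[] parentFreq, double[] splitValue,
            double[] splitCriterionValue, int[] splitPartition) {
        int bestColumn = -1;    // assumed sentinel: no usable split found
        // A real implementation would loop over the candidate predictors in
        // getPredictorIndexes(), evaluate the gain (or gain ratio) of each
        // candidate split, and record the winner:
        //   * for a quantitative variable, write the cut point into splitValue;
        //   * for a categorical variable, write the category assignment into
        //     splitPartition;
        //   * write the winning criterion value into splitCriterionValue.
        return bestColumn;
    }
}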
-
setGainCriteria
public void setGainCriteria(DecisionTreeInfoGain.GainCriteria gainCriteria)
Specifies which criterion to use in gain calculations in order to determine the best split at each node.
- Parameters:
gainCriteria - a DecisionTreeInfoGain.GainCriteria specifying which criterion to use in gain calculations in order to determine the best split at each node
Default: gainCriteria = DecisionTreeInfoGain.GainCriteria.SHANNON_ENTROPY
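Example (a sketch; tree stands for an instance of a concrete subclass):

// Select Shannon entropy explicitly; this matches the documented default.
tree.setGainCriteria(DecisionTreeInfoGain.GainCriteria.SHANNON_ENTROPY);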
-
useGainRatio
public boolean useGainRatio()
Returns whether or not the gain ratio is to be used instead of the gain to determine the best split.
- Returns:
a boolean indicating if the gain ratio is to be used: true uses the gain ratio; false uses the gain.
-
setUseRatio
public void setUseRatio(boolean ratio)
Sets the flag to use or not use the gain ratio instead of the gain to determine the best split.
- Parameters:
ratio - a boolean indicating if the gain ratio is to be used: true uses the gain ratio; false uses the gain.
Default: useRatio = false
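Example (a sketch; tree again stands for an instance of a concrete subclass):

// Rank candidate splits by the gain ratio rather than the raw gain.
tree.setUseRatio(true);
boolean ratioInUse = tree.useGainRatio();   // now returns true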
-
getCriteriaValueCategorical
protected double getCriteriaValueCategorical(double[][] tableXY, double[] classCounts, int nRows, int maxNumCats)
Calculates and returns the value of the criterion on the node represented by the data set S = xy.
- Parameters:
tableXY - a double matrix containing the two-way frequency table for the candidate split variable and the response (see getCountXY)
classCounts - a double array containing the total counts of the response variable by category
nRows - an int, the number of rows in xy
maxNumCats - an int, the maximum number of categorical values allowed in the problem
- Returns:
a double, the value of the splitting criterion
-
getCountXY
protected double[][] getCountXY(double[][] xy, int nRows, int xIdx, int yIdx, int maxNumberOfCategories, int[] uniqueX, int[] uniqueY, double[] frequencies)
Calculates a two-way frequency table with input frequencies. Note: The function assumes that the data are encoded as indices 0, 1, ..., K < maxNumberOfCategories.
- Parameters:
xy - a double matrix containing the data to be tabulated
nRows - an int, the number of rows in xy
xIdx - an int, the column index of x
yIdx - an int, the column index of y
maxNumberOfCategories - an int, the maximum number of categories in either x or y
uniqueX - an int array containing indicators for the categories of x
uniqueY - an int array containing indicators for the categories of y
frequencies - a double array of length nRows containing the row frequencies for the data
- Returns:
a maxNumberOfCategories by maxNumberOfCategories double matrix containing the cross-tabulated frequencies for x and y. The categories of x vary along the rows and the categories of y vary along the columns.
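Usage sketch: the two protected helpers above might be combined as follows inside a subclass's selectSplitVariable. The variable names xIdx, classCounts, and parentFreq, and the use of parentFreq as the frequencies argument, are assumptions for illustration, not documented usage.

// Inside a subclass of DecisionTreeInfoGain, within selectSplitVariable:
int maxCats = getMaxNumberOfCategories();
int[] uniqueX = new int[maxCats];
int[] uniqueY = new int[maxCats];

// Cross-tabulate the categorical predictor in column xIdx against the
// response column, weighting rows by the node's frequencies.
double[][] tableXY = getCountXY(xy, xy.length, xIdx,
        getResponseColumnIndex(), maxCats, uniqueX, uniqueY, parentFreq);

// Evaluate the splitting criterion (e.g. Shannon entropy) on that table.
double criterion = getCriteriaValueCategorical(tableXY, classCounts,
        xy.length, maxCats);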
-