Class ALACART
- All Implemented Interfaces:
DecisionTreeSurrogateMethod, Serializable, Cloneable
Generates a decision tree using the CART™ method of Breiman, Friedman, Olshen and Stone (1984). CART™ stands for Classification and Regression Trees, and the method applies to both categorical and quantitative variables.
Only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, only splits into two subsets are considered: e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B}, and {C, D} is not.
For classification problems, ALACART uses a criterion similar to
information gain called impurity. The method searches for the split that
reduces the node impurity the most. For a given set of data S at a
node, the node impurity for a C-class categorical response is a function of
the class probabilities:
$$I(S)=\phi\left(p(1|S),\,p(2|S),\,\ldots,\,p(C|S)\right)$$
The measure function \(\phi(\cdot)\) should be 0 for "pure" nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.
As only binary splits of a subset S are considered (S1 and S2, such that \(S=S_1\cup S_2\) and \(S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into S1 and S2 is
$$\Delta I=I(S)-q_1I\left(S_1\right)-q_2 I\left(S_2\right)$$where $$q_j = Pr[S_j], j = 1, 2$$ is the node probability.
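As a concrete illustration of the formula above, the following self-contained sketch computes an entropy impurity and the reduction \(\Delta I\) for a binary split. This is not the IMSL implementation; the class and method names (`ImpurityDemo`, `entropy`, `impurityReduction`) are hypothetical.

```java
// Illustrative sketch (not the IMSL implementation): entropy impurity and
// the reduction in impurity Delta I for a binary split of a node S.
public class ImpurityDemo {

    // Entropy impurity: 0 for a "pure" node, maximal when the classes
    // are uniformly distributed.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * (Math.log(pi) / Math.log(2.0)); // log base 2
            }
        }
        return h;
    }

    // Delta I = I(S) - q1*I(S1) - q2*I(S2), where q1, q2 are the node
    // probabilities Pr[S1], Pr[S2].
    static double impurityReduction(double[] pS, double q1, double[] pS1,
                                    double q2, double[] pS2) {
        return entropy(pS) - q1 * entropy(pS1) - q2 * entropy(pS2);
    }

    public static void main(String[] args) {
        // Parent node: 50/50 two-class mix, split into two pure children.
        double[] pS  = {0.5, 0.5};
        double[] pS1 = {1.0, 0.0};
        double[] pS2 = {0.0, 1.0};
        System.out.println(impurityReduction(pS, 0.5, pS1, 0.5, pS2)); // 1.0
    }
}
```

A pure node contributes zero impurity, so splitting a uniform two-class node into two pure children achieves the maximum possible reduction of 1 bit.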
The gain criterion and the reduction in impurity \(\Delta I\) are similar concepts, and they are equivalent when I is entropy and only binary splits are considered. Another popular measure of the impurity at a node is the Gini index, given by
$$I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array} }^Cp(i|S)\,p(j|S)=1-\sum^C_{i=1}p^2(i|S)$$
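The second form of the Gini index, \(1-\sum_i p^2(i|S)\), is the one typically computed in practice. A minimal sketch (hypothetical names, not the IMSL API):

```java
// Illustrative sketch: the Gini index I(S) = 1 - sum_i p(i|S)^2 for the
// class probabilities at a node.
public class GiniDemo {

    static double gini(double[] p) {
        double sumSq = 0.0;
        for (double pi : p) {
            sumSq += pi * pi;
        }
        return 1.0 - sumSq;
    }

    public static void main(String[] args) {
        System.out.println(gini(new double[]{1.0, 0.0})); // pure node: 0.0
        System.out.println(gini(new double[]{0.5, 0.5})); // uniform 2-class: 0.5
    }
}
```

Like entropy, the Gini index is 0 for a pure node and maximal for a uniform class distribution, so it can be plugged into the same \(\Delta I\) split criterion.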
If Y is an ordered or continuous response, the problem is a regression
problem. ALACART generates the tree using the same steps, except
that the node-level measures or loss functions are the mean squared error (MSE)
or mean absolute deviation (MAD) rather than node impurity measures.
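A minimal sketch of those two node-level loss measures (hypothetical names, not the IMSL API). MSE is taken about the node mean; MAD is taken here about the node median, a common convention for absolute-error splits, though the exact centering is an assumption:

```java
import java.util.Arrays;

// Illustrative sketch: node-level MSE and MAD losses for a regression
// response at a node.
public class NodeLossDemo {

    // Mean squared error about the node mean.
    static double mse(double[] y) {
        double mean = Arrays.stream(y).average().orElse(0.0);
        return Arrays.stream(y).map(v -> (v - mean) * (v - mean))
                     .average().orElse(0.0);
    }

    // Mean absolute deviation about the node median (assumed centering).
    static double mad(double[] y) {
        double[] s = y.clone();
        Arrays.sort(s);
        double median = (s.length % 2 == 1)
                ? s[s.length / 2]
                : 0.5 * (s[s.length / 2 - 1] + s[s.length / 2]);
        return Arrays.stream(y).map(v -> Math.abs(v - median))
                     .average().orElse(0.0);
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0, 4.0};
        System.out.println(mse(y)); // 1.25
        System.out.println(mad(y)); // 1.0
    }
}
```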
Missing Values
Any observation or case with a missing response variable is eliminated from
the analysis. If a predictor has a missing value, the algorithm skips that
case when evaluating the given predictor. When making a prediction for a new
case, if the split variable is missing, the prediction function applies the
surrogate split variables and splitting rules in turn, provided they have been
estimated with the decision tree. Otherwise, the prediction function returns
the prediction from the most recent non-terminal node. In this
implementation, only ALACART estimates surrogate split variables, and only
when requested.
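The fallback behavior described above can be sketched as follows. This is an illustrative routing function with hypothetical types (`SplitRule`, `route`), not the IMSL API; missing values are represented as NaN:

```java
// Illustrative sketch (hypothetical types, not the IMSL API): routing a new
// case at a node when the primary split variable may be missing (NaN).
public class SurrogateRoutingDemo {

    // A quantitative split rule: "go left if x[column] <= threshold".
    static final class SplitRule {
        final int column;
        final double threshold;
        SplitRule(int column, double threshold) {
            this.column = column;
            this.threshold = threshold;
        }
    }

    // Returns +1 (left), -1 (right), or 0 if the primary split variable and
    // all surrogate split variables are missing for this case.
    static int route(double[] x, SplitRule primary, SplitRule[] surrogates) {
        SplitRule[] rules = new SplitRule[surrogates.length + 1];
        rules[0] = primary; // try the primary split variable first
        System.arraycopy(surrogates, 0, rules, 1, surrogates.length);
        for (SplitRule r : rules) {
            if (!Double.isNaN(x[r.column])) {
                return x[r.column] <= r.threshold ? 1 : -1;
            }
        }
        return 0; // caller falls back to the node's own prediction
    }

    public static void main(String[] args) {
        SplitRule primary = new SplitRule(0, 1.0);
        SplitRule[] surrogates = {new SplitRule(1, 5.0)};
        // Primary variable present: routed by the primary rule.
        System.out.println(route(new double[]{0.5, 9.0}, primary, surrogates));
        // Primary variable missing: routed by the surrogate rule.
        System.out.println(route(new double[]{Double.NaN, 9.0}, primary, surrogates));
    }
}
```

When every split variable is missing, the sketch returns 0 and the caller would use the node's own prediction, mirroring the "most recent non-terminal node" fallback described above.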
Nested Class Summary
Nested classes/interfaces inherited from class com.imsl.datamining.decisionTree.DecisionTreeInfoGain:
DecisionTreeInfoGain.GainCriteria
Nested classes/interfaces inherited from class com.imsl.datamining.decisionTree.DecisionTree:
DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException
Nested classes/interfaces inherited from class com.imsl.datamining.PredictiveModel:
PredictiveModel.CloneNotSupportedException, PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType
Constructor Summary
ConstructorsConstructorDescriptionALACART(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType) Constructs anALACARTdecision tree for a single response variable and multiple predictor variables.Constructs a copy of the inputALACARTdecision tree. -
Method Summary
void addSurrogates(Tree tree, double[] surrogateInfo)
Adds the surrogate information to the tree.
clone()
Clones an ALACART decision tree.
int getNumberOfSurrogateSplits()
Returns the number of surrogate splits.
double[] getSurrogateInfo()
Returns the surrogate split information.
protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Selects the split variable for the present node using the CART™ method.
protected final void setConfiguration(PredictiveModel pm)
Sets the configuration of PredictiveModel to that of the input model.
void setNumberOfSurrogateSplits(int nSplits)
Sets the number of surrogate splits.
Methods inherited from class com.imsl.datamining.decisionTree.DecisionTreeInfoGain:
getCountXY, getCriteriaValueCategorical, setGainCriteria, setUseRatio, useGainRatio
Methods inherited from class com.imsl.datamining.decisionTree.DecisionTree:
fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinCostComplexityValue, getMinObsPerChildNode, getMinObsPerNode, getNodeAssigments, getNumberOfComplexityValues, getNumberOfRandomFeatures, isAutoPruningFlag, isRandomFeatureSelection, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode, setNumberOfRandomFeatures, setRandomFeatureSelection
Methods inherited from class com.imsl.datamining.PredictiveModel:
getClassCounts, getClassErrors, getClassErrors, getClassLabels, getClassProbabilities, getCostMatrix, getMaxNumberOfCategories, getMaxNumberOfIterations, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getRandomObject, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isConstantSeries, isMustFitModel, isUserFixedNClasses, setClassCounts, setClassLabels, setClassProbabilities, setCostMatrix, setMaxNumberOfCategories, setMaxNumberOfIterations, setMustFitModel, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setRandomObject, setResponseColumnIndex, setTrainingData, setVariableType, setWeights
Constructor Details
ALACART
ALACART(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
Constructs an ALACART decision tree for a single response variable and multiple predictor variables.
Parameters:
xy - a double matrix containing the training data and associated response values
responseColumnIndex - an int specifying the column index in xy of the response variable
varType - a PredictiveModel.VariableType array containing the type of each variable
ALACART
ALACART(ALACART alacartModel)
Constructs a copy of the input ALACART decision tree.
Parameters:
alacartModel - an ALACART decision tree
Method Details
clone
Clones an ALACART decision tree.
Specified by:
clone in class PredictiveModel
Returns:
a clone of the ALACART decision tree
addSurrogates
Adds the surrogate information to the tree.
Specified by:
addSurrogates in interface DecisionTreeSurrogateMethod
Parameters:
tree - a Tree containing the decision tree structure
surrogateInfo - a double array containing the surrogate split information
getNumberOfSurrogateSplits
public int getNumberOfSurrogateSplits()
Returns the number of surrogate splits.
Specified by:
getNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
Returns:
an int, the number of surrogate splits
setConfiguration
Description copied from class: DecisionTree
Sets the configuration of PredictiveModel to that of the input model.
Overrides:
setConfiguration in class DecisionTree
Parameters:
pm - a PredictiveModel object
setNumberOfSurrogateSplits
public void setNumberOfSurrogateSplits(int nSplits)
Sets the number of surrogate splits.
Specified by:
setNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
Parameters:
nSplits - an int specifying the number of predictors to consider as surrogate splitting variables
Default: nSplits = 0
getSurrogateInfo
public double[] getSurrogateInfo()
Returns the surrogate split information.
Specified by:
getSurrogateInfo in interface DecisionTreeSurrogateMethod
Returns:
a double array containing the surrogate split information
selectSplitVariable
protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Selects the split variable for the present node using the CART™ method.
Specified by:
selectSplitVariable in class DecisionTreeInfoGain
Parameters:
xy - a double matrix containing the data
classCounts - a double array containing the counts for each class of the response variable, when it is categorical
parentFreq - a double array used to determine the subset of the observations that belong to the current node
splitValue - a double array representing the resulting split point if the selected variable is quantitative
splitCriterionValue - a double, the value of the criterion used to determine the splitting variable
splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
Returns:
an int specifying the index of the split variable in this.getPredictorIndexes()