public class ALACART extends DecisionTreeInfoGain implements DecisionTreeSurrogateMethod, Serializable, Cloneable
Generates a decision tree using the CART™ method of Breiman, Friedman, Olshen and Stone (1984). CART™ stands for Classification and Regression Trees and applies to categorical or quantitative variables.
Only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, only splits into two subsets are considered: {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C, D} is not.
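The restriction to two-way splits can be illustrated by enumerating them. The sketch below is not part of the library: the class BinarySplitSketch and its method binarySplits are hypothetical names introduced here, and the code only lists candidate partitions without scoring them. For k category labels it yields 2^(k-1) - 1 distinct splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the library: lists two-way partitions of category labels. */
final class BinarySplitSketch {

    /**
     * Returns every split of the labels into two non-empty subsets as a pair (left, right).
     * The first label always stays on the left, so mirror-image splits are not repeated.
     */
    static List<List<List<String>>> binarySplits(List<String> levels) {
        List<List<List<String>>> splits = new ArrayList<>();
        int k = levels.size();
        if (k < 2) {
            return splits; // fewer than two labels cannot be split
        }
        // Bit (i - 1) of mask decides whether label i joins the left subset.
        // The all-ones mask is excluded because it would leave the right subset empty.
        int allOnes = (1 << (k - 1)) - 1;
        for (int mask = 0; mask < allOnes; mask++) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            left.add(levels.get(0));
            for (int i = 1; i < k; i++) {
                if ((mask & (1 << (i - 1))) != 0) {
                    left.add(levels.get(i));
                } else {
                    right.add(levels.get(i));
                }
            }
            splits.add(List.of(left, right));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Prints one candidate split per line for X in {A, B, C, D}.
        for (List<List<String>> split : binarySplits(List.of("A", "B", "C", "D"))) {
            System.out.println(split.get(0) + " | " + split.get(1));
        }
    }
}
```

A tree builder would evaluate the split criterion on each of these candidates and keep the best one; exhaustive enumeration like this is only practical for a small number of categories.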
For classification problems, ALACART uses a criterion similar to information gain, called impurity. The method searches for the split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities:
$$I(S)=\phi(p(1|S),p(2|S),\ldots,p(C|S))$$
The measure function \(\phi(\cdot)\) should be 0 for "pure" nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.
As only binary splits of a subset S are considered (\(S_1\), \(S_2\) such that \(S=S_1\cup S_2\) and \(S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into \(S_1\), \(S_2\) is
$$\Delta I=I(S)-q_1I\left(S_1\right)-q_2 I\left(S_2\right)$$
where \(q_j = \Pr[S_j]\), \(j = 1, 2\), is the node probability.
The gain criteria and the reduction in impurity \(\Delta I\) are similar concepts and equivalent when I is entropy and when only binary splits are considered. Another popular measure for the impurity at a node is the Gini index, given by
$$I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array} }^C p(i|S)\,p(j|S)=1-\sum^C_{i=1}p^2(i|S)$$
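As a minimal sketch of these quantities, the code below computes the Gini index from class counts and the reduction in impurity for one candidate binary split. The class ImpuritySketch and its methods are hypothetical and not part of the library. The node probabilities q_j are estimated here as the fraction of the parent's cases that fall in each child, which is an assumption (no prior probabilities or case weights are applied).

```java
/** Hypothetical helper, not part of the library: Gini impurity from class counts. */
final class ImpuritySketch {

    /** Gini index 1 - sum_i p(i|S)^2, with p(i|S) estimated as count_i / n. */
    static double gini(double[] counts) {
        double n = 0.0;
        for (double c : counts) {
            n += c;
        }
        if (n == 0.0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumOfSquares = 0.0;
        for (double c : counts) {
            double p = c / n;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /**
     * Reduction in impurity I(S) - q1 I(S1) - q2 I(S2) for a binary split.
     * Both arrays hold per-class counts and must have the same length.
     * Assumption: q_j is estimated as n_j / n.
     */
    static double giniReduction(double[] left, double[] right) {
        double[] parent = new double[left.length];
        double nLeft = 0.0;
        double nRight = 0.0;
        for (int i = 0; i < left.length; i++) {
            parent[i] = left[i] + right[i];
            nLeft += left[i];
            nRight += right[i];
        }
        double n = nLeft + nRight;
        if (n == 0.0) {
            return 0.0;
        }
        return gini(parent) - (nLeft / n) * gini(left) - (nRight / n) * gini(right);
    }

    public static void main(String[] args) {
        // Two classes; the left child holds counts {4, 1}, the right child {1, 4}.
        System.out.println(giniReduction(new double[] {4, 1}, new double[] {1, 4}));
    }
}
```

A split that separates the classes perfectly recovers the full parent impurity, while a split that leaves the class proportions unchanged in both children gives a reduction of zero.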
If Y is an ordered response or continuous, the problem is a regression problem. ALACART generates the tree using the same steps, except that the node-level measures or loss functions are the mean squared error (MSE) or mean absolute deviation (MAD) rather than node impurity measures.
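For the regression case, a minimal sketch of a squared-error split search on a single quantitative predictor is shown below. It is not the library's implementation: the class RegressionSplitSketch is hypothetical, the threshold is placed at the midpoint between adjacent distinct values (an assumption), and the search recomputes each child's error from scratch for clarity rather than speed. Minimizing the summed squared error of the two children is equivalent to maximizing the reduction in the case-weighted MSE.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper, not part of the library: best squared-error split on one predictor. */
final class RegressionSplitSketch {

    /**
     * Returns the threshold t for which the split x <= t versus x > t minimizes the
     * total sum of squared errors of the two children, or NaN if x has no two distinct values.
     */
    static double bestThreshold(double[] x, double[] y) {
        int n = x.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Visit the cases in increasing order of the predictor.
        Arrays.sort(order, Comparator.comparingDouble(i -> x[i]));

        double bestError = Double.POSITIVE_INFINITY;
        double threshold = Double.NaN;
        for (int cut = 1; cut < n; cut++) {
            double low = x[order[cut - 1]];
            double high = x[order[cut]];
            if (low == high) {
                continue; // tied values cannot be separated by a threshold
            }
            double error = sumSquaredError(y, order, 0, cut) + sumSquaredError(y, order, cut, n);
            if (error < bestError) {
                bestError = error;
                threshold = (low + high) / 2.0; // assumption: midpoint rule
            }
        }
        return threshold;
    }

    /** Sum of squared deviations of y from its mean over positions from (inclusive) to to (exclusive). */
    private static double sumSquaredError(double[] y, Integer[] order, int from, int to) {
        double mean = 0.0;
        for (int i = from; i < to; i++) {
            mean += y[order[i]];
        }
        mean /= (to - from);
        double sum = 0.0;
        for (int i = from; i < to; i++) {
            double d = y[order[i]] - mean;
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {12, 1, 10, 2, 11, 3};
        double[] y = {5, 1, 5, 1, 5, 1};
        System.out.println(bestThreshold(x, y));
    }
}
```

Replacing the squared deviations from the mean with absolute deviations from the median would give the corresponding absolute-error variant.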
Any observation or case with a missing response variable is eliminated from the analysis. If a predictor has a missing value, each algorithm skips that case when evaluating the given predictor. When making a prediction for a new case, if the split variable is missing, the prediction function applies surrogate split variables and splitting rules in turn, if they were estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables when requested.
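The prediction rule for missing split variables can be sketched as follows. This is an illustration under stated assumptions, not the library's internal code: the classes SurrogatePredictSketch, Rule and Node are hypothetical, missing values are encoded as NaN, and every split is a simple threshold rule on one quantitative predictor.

```java
import java.util.List;

/** Hypothetical sketch, not part of the library: prediction with surrogate fallback. */
final class SurrogatePredictSketch {

    /** Threshold rule on one predictor: go left when x[var] <= threshold. */
    static final class Rule {
        final int var;
        final double threshold;

        Rule(int var, double threshold) {
            this.var = var;
            this.threshold = threshold;
        }
    }

    static final class Node {
        final double prediction;     // value returned if traversal stops at this node
        final Rule primary;          // null for a terminal node
        final List<Rule> surrogates; // tried in order when the primary variable is missing
        final Node left;
        final Node right;

        private Node(double prediction, Rule primary, List<Rule> surrogates, Node left, Node right) {
            this.prediction = prediction;
            this.primary = primary;
            this.surrogates = surrogates;
            this.left = left;
            this.right = right;
        }

        static Node leaf(double prediction) {
            return new Node(prediction, null, List.of(), null, null);
        }

        static Node split(double prediction, Rule primary, List<Rule> surrogates, Node left, Node right) {
            return new Node(prediction, primary, surrogates, left, right);
        }
    }

    /** Walks the tree; stops early at a non-terminal node when no usable rule exists. */
    static double predict(Node node, double[] x) {
        while (node.primary != null) {
            Rule rule = firstUsableRule(node, x);
            if (rule == null) {
                return node.prediction; // primary and all surrogates are missing
            }
            node = x[rule.var] <= rule.threshold ? node.left : node.right;
        }
        return node.prediction;
    }

    /** Returns the primary rule if its variable is present, else the first usable surrogate. */
    private static Rule firstUsableRule(Node node, double[] x) {
        if (!Double.isNaN(x[node.primary.var])) {
            return node.primary;
        }
        for (Rule surrogate : node.surrogates) {
            if (!Double.isNaN(x[surrogate.var])) {
                return surrogate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node tree = Node.split(2.0, new Rule(0, 5.0), List.of(new Rule(1, 0.5)),
                Node.leaf(1.0), Node.leaf(3.0));
        System.out.println(predict(tree, new double[] {Double.NaN, 0.9}));
    }
}
```

With no surrogates stored, the surrogate list is empty and any case whose primary split variable is missing receives the prediction of the node where traversal stopped.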
DecisionTreeInfoGain.GainCriteria
DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException
PredictiveModel.CloneNotSupportedException, PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType
Constructor and Description |
---|
ALACART(ALACART alacartModel) Constructs a copy of the input ALACART decision tree. |
ALACART(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType) Constructs an ALACART decision tree for a single response variable and multiple predictor variables. |
Modifier and Type | Method and Description |
---|---|
void | addSurrogates(Tree tree, double[] surrogateInfo) Adds the surrogate information to the tree. |
ALACART | clone() Clones an ALACART decision tree. |
int | getNumberOfSurrogateSplits() Returns the number of surrogate splits. |
double[] | getSurrogateInfo() Returns the surrogate split information. |
protected int | selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition) Selects the split variable for the present node using the CART™ method. |
protected void | setConfiguration(PredictiveModel pm) Sets the configuration of PredictiveModel to that of the input model. |
void | setNumberOfSurrogateSplits(int nSplits) Sets the number of surrogate splits. |
getCountXY, getCriteriaValueCategorical, setGainCriteria, setUseRatio, useGainRatio
fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinCostComplexityValue, getMinObsPerChildNode, getMinObsPerNode, getNodeAssigments, getNumberOfComplexityValues, getNumberOfRandomFeatures, isAutoPruningFlag, isRandomFeatureSelection, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode, setNumberOfRandomFeatures, setRandomFeatureSelection
getClassCounts, getClassErrors, getClassLabels, getClassProbabilities, getCostMatrix, getMaxNumberOfCategories, getMaxNumberOfIterations, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getRandomObject, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isConstantSeries, isMustFitModel, isUserFixedNClasses, setClassCounts, setClassLabels, setClassProbabilities, setCostMatrix, setMaxNumberOfCategories, setMaxNumberOfIterations, setMustFitModel, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setRandomObject, setResponseColumnIndex, setTrainingData, setVariableType, setWeights
public ALACART(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
Constructs an ALACART decision tree for a single response variable and multiple predictor variables.
Parameters:
xy - a double matrix containing the training data and associated response values
responseColumnIndex - an int specifying the column index in xy of the response variable
varType - a PredictiveModel.VariableType array containing the type of each variable
public ALACART(ALACART alacartModel)
Constructs a copy of the input ALACART decision tree.
Parameters:
alacartModel - an ALACART decision tree
public ALACART clone()
Clones an ALACART decision tree.
clone in class PredictiveModel
Returns:
an ALACART decision tree
public void addSurrogates(Tree tree, double[] surrogateInfo)
Adds the surrogate information to the tree.
addSurrogates in interface DecisionTreeSurrogateMethod
Parameters:
tree - a Tree containing the decision tree structure
surrogateInfo - a double array containing the surrogate split information
public int getNumberOfSurrogateSplits()
Returns the number of surrogate splits.
getNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
Returns:
an int, the number of surrogate splits
protected final void setConfiguration(PredictiveModel pm)
Description copied from class: DecisionTree
Sets the configuration of PredictiveModel to that of the input model.
setConfiguration in class DecisionTree
Parameters:
pm - a PredictiveModel object
public void setNumberOfSurrogateSplits(int nSplits)
Sets the number of surrogate splits.
setNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
Parameters:
nSplits - an int specifying the number of predictors to consider as surrogate splitting variables
Default: nSplits = 0
public double[] getSurrogateInfo()
Returns the surrogate split information.
getSurrogateInfo in interface DecisionTreeSurrogateMethod
Returns:
a double array containing the surrogate split information
protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Selects the split variable for the present node using the CART™ method.
selectSplitVariable in class DecisionTreeInfoGain
Parameters:
xy - a double matrix containing the data
classCounts - a double array containing the counts for each class of the response variable, when it is categorical
parentFreq - a double array used to determine the subset of the observations that belong to the current node
splitValue - a double array representing the resulting split point if the selected variable is quantitative
splitCriterionValue - a double, the value of the criterion used to determine the splitting variable
splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
Returns:
an int specifying the index of the split variable in this.getPredictorIndexes()
Copyright © 2020 Rogue Wave Software. All rights reserved.