Class ALACART

All Implemented Interfaces:
DecisionTreeSurrogateMethod, Serializable, Cloneable

Generates a decision tree using the CART™ method of Breiman, Friedman, Olshen and Stone (1984). CART™ stands for Classification and Regression Trees; the method applies to both categorical and quantitative variables.

Only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, splits into only two subsets are considered, e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C,D} is not.
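As an illustration of this restriction (not part of the library API), the distinct binary splits of a categorical variable can be enumerated by fixing one category on the left-hand side so that no partition is counted twice; for k categories there are 2^(k−1) − 1 such splits:

```java
import java.util.*;

public class BinarySplits {
    // Enumerate all distinct binary splits of a set of category labels.
    // For k categories there are 2^(k-1) - 1 such splits.
    static List<List<Set<String>>> binarySplits(List<String> categories) {
        int k = categories.size();
        List<List<Set<String>>> splits = new ArrayList<>();
        // Fix the first category on the left side so that {A}|{B,C,D}
        // and {B,C,D}|{A} are not counted as two different splits.
        for (int mask = 0; mask < (1 << (k - 1)) - 1; mask++) {
            Set<String> left = new TreeSet<>(), right = new TreeSet<>();
            left.add(categories.get(0));
            for (int i = 1; i < k; i++) {
                if ((mask & (1 << (i - 1))) != 0) left.add(categories.get(i));
                else right.add(categories.get(i));
            }
            splits.add(List.of(left, right));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<List<Set<String>>> s = binarySplits(List.of("A", "B", "C", "D"));
        System.out.println(s.size()); // 2^(4-1) - 1 = 7 distinct binary splits
        for (List<Set<String>> split : s) System.out.println(split);
    }
}
```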

For classification problems, ALACART uses a criterion similar to information gain, called impurity. The method searches for the split that most reduces the node impurity. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities.

$$I(S)=\phi(p(1|S),p(2|S),\ldots,p(C|S))$$

The measure function \(\phi(\cdot)\) should be 0 for "pure" nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.
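A common choice of \(\phi(\cdot)\) satisfying both properties is the entropy, which is 0 for a pure node and attains its maximum \(\log_2 C\) at the uniform distribution. A minimal, self-contained sketch (not library code):

```java
public class Entropy {
    // Entropy impurity phi(p1,...,pC) = -sum_i p_i * log2(p_i);
    // 0 for a pure node, log2(C) when the classes are uniform.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) h -= pi * (Math.log(pi) / Math.log(2.0));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new double[] {1.0, 0.0})); // 0.0 : pure node
        System.out.println(entropy(new double[] {0.5, 0.5})); // 1.0 : maximal for C = 2
    }
}
```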

As only binary splits of a subset S are considered (S1, S2 such that \(S=S_1\cup S_2\) and \(S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into S1, S2 is

$$\Delta I=I(S)-q_1I\left(S_1\right)-q_2 I\left(S_2\right)$$

where \(q_j = \Pr[S_j]\), \(j = 1, 2\), are the node probabilities.
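To make the formula concrete, the following self-contained sketch (not library code) computes \(\Delta I\) from raw class counts, using entropy as the impurity measure and the child-node case fractions as \(q_1, q_2\):

```java
public class DeltaI {
    // Entropy impurity of a node, computed from raw class counts.
    static double impurity(int[] counts) {
        int n = 0;
        for (int c : counts) n += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    // Delta I = I(S) - q1*I(S1) - q2*I(S2), with q_j the fraction of
    // the node's cases that fall into child S_j.
    static double deltaI(int[] parent, int[] left, int[] right) {
        int n = 0, nL = 0, nR = 0;
        for (int c : parent) n += c;
        for (int c : left) nL += c;
        for (int c : right) nR += c;
        return impurity(parent)
                - ((double) nL / n) * impurity(left)
                - ((double) nR / n) * impurity(right);
    }

    public static void main(String[] args) {
        // Parent node: 10 cases of class 0 and 10 of class 1. A perfect
        // split sends each class to its own (pure) child, so the
        // reduction equals the parent impurity: Delta I = 1 bit.
        System.out.println(deltaI(new int[] {10, 10},
                                  new int[] {10, 0},
                                  new int[] {0, 10})); // 1.0
    }
}
```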

The gain criterion and the reduction in impurity \(\Delta I\) are similar concepts, and they are equivalent when I is entropy and only binary splits are considered. Another popular measure of node impurity is the Gini index, given by

$$I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array}}^C p(i|S)\,p(j|S)=1-\sum^C_{i=1}p^2(i|S)$$
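The right-hand form of the Gini index is straightforward to compute from class counts. A small self-contained sketch (not library code):

```java
public class Gini {
    // Gini index I(S) = 1 - sum_i p(i|S)^2, computed from class counts.
    static double gini(int[] counts) {
        int n = 0;
        for (int c : counts) n += c;
        double sumSq = 0.0;
        for (int c : counts) {
            double p = (double) c / n;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    public static void main(String[] args) {
        System.out.println(gini(new int[] {20, 0}));  // 0.0 : pure node
        System.out.println(gini(new int[] {10, 10})); // 0.5 : uniform over 2 classes
    }
}
```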

If Y is an ordered or continuous response, the problem is a regression problem. ALACART generates the tree using the same steps, except that the node-level loss functions are the mean squared error (MSE) or the mean absolute deviation (MAD) rather than a node impurity measure.
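For reference, these node-level losses measure spread around the node's predicted value: MSE around the node mean and MAD around the node median. A self-contained sketch (not library code):

```java
import java.util.Arrays;

public class RegressionLoss {
    // Mean squared error around the node mean (the value a regression
    // leaf would predict under squared-error loss).
    static double mse(double[] y) {
        double mean = Arrays.stream(y).average().orElse(0.0);
        return Arrays.stream(y).map(v -> (v - mean) * (v - mean)).average().orElse(0.0);
    }

    // Mean absolute deviation around the node median (the minimizer of
    // absolute-error loss).
    static double mad(double[] y) {
        double[] s = y.clone();
        Arrays.sort(s);
        double median = (s.length % 2 == 1)
                ? s[s.length / 2]
                : 0.5 * (s[s.length / 2 - 1] + s[s.length / 2]);
        return Arrays.stream(y).map(v -> Math.abs(v - median)).average().orElse(0.0);
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0, 4.0};
        System.out.println(mse(y)); // 1.25
        System.out.println(mad(y)); // 1.0
    }
}
```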

Missing Values

Any observation or case with a missing response value is eliminated from the analysis. If a predictor has a missing value, the algorithm skips that case when evaluating the given predictor. When making a prediction for a new case whose split variable is missing, the prediction function applies the surrogate split variables and splitting rules in turn, provided they were estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables, and only when requested.
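The fallback order described above can be sketched as follows. This is a hypothetical illustration only; the `Rule` type and `route` method are invented for this sketch and do not appear in the library API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class MissingSplit {
    // Hypothetical splitting rule: which variable to test and which
    // values go to the left child. Not a library type.
    record Rule(int varIndex, Predicate<Double> goLeft) {}

    // Try the primary split variable, then each surrogate in turn.
    // Returns +1 (left child), -1 (right child), or 0 when every rule's
    // variable is missing, meaning: predict from this non-terminal node.
    static int route(double[] x, Rule primary, List<Rule> surrogates) {
        List<Rule> all = new ArrayList<>();
        all.add(primary);
        all.addAll(surrogates);
        for (Rule r : all) {
            double v = x[r.varIndex()];
            if (!Double.isNaN(v)) {           // NaN marks a missing value
                return r.goLeft().test(v) ? 1 : -1;
            }
        }
        return 0; // fall back to the most recent non-terminal node's prediction
    }
}
```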

  • Constructor Details

    • ALACART

      public ALACART(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
      Constructs an ALACART decision tree for a single response variable and multiple predictor variables.
      Parameters:
      xy - a double matrix containing the training data and associated response values
      responseColumnIndex - an int specifying the column index in xy of the response variable
      varType - a PredictiveModel.VariableType array containing the type of each variable
    • ALACART

      public ALACART(ALACART alacartModel)
      Constructs a copy of the input ALACART decision tree.
      Parameters:
      alacartModel - an ALACART decision tree
  • Method Details

    • clone

      public ALACART clone()
      Clones an ALACART decision tree.
      Specified by:
      clone in class PredictiveModel
      Returns:
      a clone of the ALACART decision tree
    • addSurrogates

      public void addSurrogates(Tree tree, double[] surrogateInfo)
      Adds the surrogate information to the tree.
      Specified by:
      addSurrogates in interface DecisionTreeSurrogateMethod
      Parameters:
      tree - a Tree containing the decision tree structure
      surrogateInfo - a double array containing the surrogate split information
    • getNumberOfSurrogateSplits

      public int getNumberOfSurrogateSplits()
      Returns the number of surrogate splits.
      Specified by:
      getNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
      Returns:
      an int, the number of surrogate splits
    • setConfiguration

      protected final void setConfiguration(PredictiveModel pm)
      Description copied from class: DecisionTree
      Sets the configuration of PredictiveModel to that of the input model.
      Overrides:
      setConfiguration in class DecisionTree
      Parameters:
      pm - a PredictiveModel object
    • setNumberOfSurrogateSplits

      public void setNumberOfSurrogateSplits(int nSplits)
      Sets the number of surrogate splits.
      Specified by:
      setNumberOfSurrogateSplits in interface DecisionTreeSurrogateMethod
      Parameters:
      nSplits - an int specifying the number of predictors to consider as surrogate splitting variables

      Default: nSplits = 0

    • getSurrogateInfo

      public double[] getSurrogateInfo()
      Returns the surrogate split information.
      Specified by:
      getSurrogateInfo in interface DecisionTreeSurrogateMethod
      Returns:
      a double array containing the surrogate split information
    • selectSplitVariable

      protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Selects the split variable for the present node using the CART™ method.
      Specified by:
      selectSplitVariable in class DecisionTreeInfoGain
      Parameters:
      xy - a double matrix containing the data
      classCounts - a double array containing the counts for each class of the response variable, when it is categorical
      parentFreq - a double array used to determine the subset of the observations that belong to the current node
      splitValue - a double array representing the resulting split point if the selected variable is quantitative
splitCriterionValue - a double array containing the value of the criterion used to determine the splitting variable
      splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
      Returns:
      an int specifying the index of the split variable in this.getPredictorIndexes()