All Implemented Interfaces:
Serializable, Cloneable

public class C45 extends DecisionTreeInfoGain implements Serializable, Cloneable

Generates a decision tree using the C4.5 algorithm for a categorical response variable and categorical or quantitative predictor variables. The C4.5 procedure (Quinlan, 1995) partitions the sample space using an information gain or a gain ratio as the splitting criterion. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as

$$E(S)=-\sum_{i=1}^Cp_i\mbox{log}\left(p_i\right) $$

where \(p_i=\mbox{Pr}[Y=i|S]\) is the probability that the response takes on category i in the dataset S. This measure is widely known as the Shannon entropy. Partitioning the dataset further can reduce the average entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by

$$E(S,X)=\sum_{k=1}^{K}p\left(S_k \right)E\left(S_k\right)$$

where \(p(S_k)\) is the proportion of the observations in S that fall into the k-th partition \(S_k\).
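The two entropy quantities above can be sketched in plain Java from class counts. This is an illustrative stand-alone computation, not part of the C45 API; the class and method names are hypothetical:

```java
public class EntropyDemo {
    // log base 2, the conventional base for entropy measured in bits
    static double log2(double v) { return Math.log(v) / Math.log(2.0); }

    // E(S) = -sum_i p_i log(p_i), computed from class counts; 0*log(0) is taken as 0
    static double entropy(double[] classCounts) {
        double n = 0.0, e = 0.0;
        for (double c : classCounts) n += c;
        for (double c : classCounts) {
            if (c > 0.0) {
                double p = c / n;
                e -= p * log2(p);
            }
        }
        return e;
    }

    // E(S, X) = sum_k p(S_k) E(S_k), from a K x C table of class counts per partition
    static double partitionEntropy(double[][] countsByPartition) {
        double n = 0.0;
        double[] partSizes = new double[countsByPartition.length];
        for (int k = 0; k < countsByPartition.length; k++) {
            for (double c : countsByPartition[k]) partSizes[k] += c;
            n += partSizes[k];
        }
        double e = 0.0;
        for (int k = 0; k < countsByPartition.length; k++) {
            e += (partSizes[k] / n) * entropy(countsByPartition[k]);
        }
        return e;
    }

    public static void main(String[] args) {
        // Two equally likely classes: one bit of uncertainty
        System.out.println(entropy(new double[]{5, 5}));                           // 1.0
        // A perfect split leaves no uncertainty within either partition
        System.out.println(partitionEntropy(new double[][]{{5, 0}, {0, 5}}));      // 0.0
    }
}
```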

If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:

$$g(S,X)=E(S)-E(S,X)$$
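The information gain of a candidate split can be computed directly from a K x C table of class counts per partition. A minimal sketch in plain Java, with hypothetical names (this is not the C45 API):

```java
public class InfoGainDemo {
    static double log2(double v) { return Math.log(v) / Math.log(2.0); }

    // Shannon entropy of a node, from its class counts
    static double entropy(double[] counts) {
        double n = 0.0, e = 0.0;
        for (double c : counts) n += c;
        for (double c : counts) if (c > 0.0) e -= (c / n) * log2(c / n);
        return e;
    }

    // g(S, X) = E(S) - sum_k p(S_k) E(S_k)
    static double informationGain(double[][] countsByPartition) {
        int numClasses = countsByPartition[0].length;
        double[] totals = new double[numClasses];
        double n = 0.0;
        for (double[] part : countsByPartition) {
            for (int i = 0; i < numClasses; i++) { totals[i] += part[i]; n += part[i]; }
        }
        double weighted = 0.0;
        for (double[] part : countsByPartition) {
            double nk = 0.0;
            for (double c : part) nk += c;
            weighted += (nk / n) * entropy(part);
        }
        return entropy(totals) - weighted;
    }

    public static void main(String[] args) {
        // A perfectly separating split recovers the full entropy of the response
        System.out.println(informationGain(new double[][]{{4, 0}, {0, 4}}));  // 1.0
        // A split whose children mirror the parent distribution yields no gain
        System.out.println(informationGain(new double[][]{{2, 2}, {2, 2}}));  // 0.0
    }
}
```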

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:

$$gR(S,X)=\frac{E(S)-E(S,X)}{E_X\left(S\right)} $$

where

$$E_X\left(S\right)=-\sum_{k=1}^K\nu_k\mbox{log} \left(\nu_k\right)$$

with

$$\nu_k=\mbox{Pr}[X=k|S]$$
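The gain ratio follows the same pattern, dividing the gain by the entropy of the split variable itself. A hedged sketch in plain Java (illustrative names, not the C45 API; it assumes at least two nonempty partitions so the denominator is positive):

```java
public class GainRatioDemo {
    static double log2(double v) { return Math.log(v) / Math.log(2.0); }

    static double entropy(double[] counts) {
        double n = 0.0, e = 0.0;
        for (double c : counts) n += c;
        for (double c : counts) if (c > 0.0) e -= (c / n) * log2(c / n);
        return e;
    }

    // gR(S, X) = (E(S) - E(S, X)) / E_X(S), from a K x C table of class counts per partition
    static double gainRatio(double[][] countsByPartition) {
        int numClasses = countsByPartition[0].length;
        double[] totals = new double[numClasses];
        double[] partSizes = new double[countsByPartition.length];
        double n = 0.0;
        for (int k = 0; k < countsByPartition.length; k++) {
            for (int i = 0; i < numClasses; i++) {
                totals[i] += countsByPartition[k][i];
                partSizes[k] += countsByPartition[k][i];
            }
            n += partSizes[k];
        }
        double weighted = 0.0;
        for (int k = 0; k < countsByPartition.length; k++) {
            weighted += (partSizes[k] / n) * entropy(countsByPartition[k]);
        }
        // E_X(S) is just the entropy of the split variable over S (nu_k = partSizes[k] / n)
        double splitInfo = entropy(partSizes);
        return (entropy(totals) - weighted) / splitInfo;
    }

    public static void main(String[] args) {
        // Perfect two-way split of a balanced binary response: gain 1.0 over split info 1.0
        System.out.println(gainRatio(new double[][]{{4, 0}, {0, 4}}));  // 1.0
    }
}
```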

Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\le d\) and \(X\gt d\) are considered, where d is a value in the range of X on S. The best split is determined by the splitting variable and split point that give the largest criterion value. If no variable meets the threshold for further splitting at the current node, growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. In this way a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops when one or more stopping conditions are met.
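For a quantitative predictor, the binary split \(X\le d\) can be found by sorting on X and scanning the midpoints between consecutive distinct values as candidate split points. A minimal stand-alone sketch of this search, using information gain as the criterion (the class and method names are illustrative, not the C45 API):

```java
import java.util.Arrays;

public class SplitPointDemo {
    static double log2(double v) { return Math.log(v) / Math.log(2.0); }

    static double entropy(double[] counts) {
        double n = 0.0, e = 0.0;
        for (double c : counts) n += c;
        for (double c : counts) if (c > 0.0) e -= (c / n) * log2(c / n);
        return e;
    }

    // Returns the split point d of the binary split x <= d vs. x > d with the
    // largest information gain, scanning midpoints between distinct sorted values.
    static double bestSplitPoint(double[] x, int[] y, int numClasses) {
        int n = x.length;
        double[][] pairs = new double[n][2];
        for (int i = 0; i < n; i++) { pairs[i][0] = x[i]; pairs[i][1] = y[i]; }
        Arrays.sort(pairs, (a, b) -> Double.compare(a[0], b[0]));

        double[] total = new double[numClasses];
        for (double[] p : pairs) total[(int) p[1]]++;
        double eS = entropy(total);

        double[] left = new double[numClasses];
        double bestGain = Double.NEGATIVE_INFINITY, bestD = Double.NaN;
        for (int i = 0; i < n - 1; i++) {
            left[(int) pairs[i][1]]++;
            if (pairs[i][0] == pairs[i + 1][0]) continue;  // not a boundary between distinct values
            double[] right = new double[numClasses];
            for (int c = 0; c < numClasses; c++) right[c] = total[c] - left[c];
            double nl = i + 1.0, nr = n - nl;
            double gain = eS - (nl / n) * entropy(left) - (nr / n) * entropy(right);
            if (gain > bestGain) {
                bestGain = gain;
                bestD = (pairs[i][0] + pairs[i + 1][0]) / 2.0;  // midpoint as candidate d
            }
        }
        return bestD;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 10, 11, 12};
        int[] y = {0, 0, 0, 1, 1, 1};
        // The classes separate cleanly between x = 3 and x = 10
        System.out.println(bestSplitPoint(x, y, 2));  // 6.5
    }
}
```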

  • Constructor Details

    • C45

      public C45(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
      Constructs a C45 object for a single response variable and multiple predictor variables.
      Parameters:
      xy - a double matrix containing the training data and associated response values
      responseColumnIndex - an int specifying the column index of the response variable
      varType - a PredictiveModel.VariableType array containing the type of each variable
    • C45

      public C45(C45 c45Model)
      Constructs a copy of the input C45 decision tree.
      Parameters:
      c45Model - a C45 decision tree
  • Method Details

    • clone

      public C45 clone()
      Clones a C45 decision tree.
      Specified by:
      clone in class PredictiveModel
      Returns:
      a clone of the C45 decision tree
    • selectSplitVariable

      protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
      Selects the split variable for the present node using the C45 method.
      Specified by:
      selectSplitVariable in class DecisionTreeInfoGain
      Parameters:
      xy - a double matrix containing the data
      classCounts - a double array containing the counts for each class of the response variable, when it is categorical
      parentFreq - a double array used to indicate which subset of the observations belong in the current node
      splitValue - a double array representing the resulting split point if the selected variable is quantitative
splitCriterionValue - a double array containing the value of the criterion used to determine the splitting variable
      splitPartition - an int array indicating the resulting split partition if the selected variable is categorical
      Returns:
      an int specifying the column index of the split variable in this.getPredictorIndexes()