public class C45 extends DecisionTreeInfoGain implements Serializable, Cloneable
Generates a decision tree using the C4.5 algorithm for a categorical response variable and categorical or quantitative predictor variables. The C4.5 procedure (Quinlan, 1995) partitions the sample space using an information gain or a gain ratio as the splitting criterion. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as
$$E(S)=-\sum_{i=1}^Cp_i\mbox{log}\left(p_i\right) $$where \(p_i=\mbox{Pr}[Y=i|S]\) is the probability that the response takes on category i in the dataset S. This measure is widely known as the Shannon entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by
$$E(S,X)=\sum_{k=1}^{K}p\left(S_k \right)E\left(S_k\right)$$where \(p(S_k)=|S_k|/|S|\) is the proportion of observations falling into partition \(S_k\). If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:
$$g(S,X)=E(S)-E(S,X)$$The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:
$$gR(S,X)=\frac{E(S)-E(S,X)}{E_X\left(S\right)} $$where
$$E_X\left(S\right)=-\sum_{k=1}^K\nu_k\mbox{log} \left(\nu_k\right)$$with
$$\nu_k=\mbox{Pr}[X=k|S]$$Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\le d\) and \(X\gt d\) are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops after one or more stopping conditions are met.
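The quantities above can be computed directly from a contingency table of X against Y. The following self-contained sketch (the class and method names are illustrative, not part of the library) computes E(S), the information gain g(S,X), and the gain ratio gR(S,X) for a categorical split, using the natural logarithm:

```java
import java.util.Arrays;

public class GainRatioDemo {

    // Shannon entropy E(S) = -sum_i p_i log(p_i), computed from class counts.
    static double entropy(double[] counts) {
        double n = Arrays.stream(counts).sum();
        double e = 0.0;
        for (double c : counts) {
            if (c > 0) {                     // 0 * log(0) is taken to be 0
                double p = c / n;
                e -= p * Math.log(p);
            }
        }
        return e;
    }

    // table[k][i] = number of rows with X = k and Y = i.
    // Information gain g(S,X) = E(S) - sum_k p(S_k) E(S_k).
    static double informationGain(double[][] table) {
        int K = table.length, C = table[0].length;
        double[] classTotals = new double[C];
        double[] partTotals = new double[K];
        double n = 0.0;
        for (int k = 0; k < K; k++) {
            for (int i = 0; i < C; i++) {
                classTotals[i] += table[k][i];
                partTotals[k] += table[k][i];
                n += table[k][i];
            }
        }
        double conditional = 0.0;            // E(S,X)
        for (int k = 0; k < K; k++) {
            conditional += (partTotals[k] / n) * entropy(table[k]);
        }
        return entropy(classTotals) - conditional;
    }

    // Gain ratio gR(S,X) = g(S,X) / E_X(S), where E_X(S) is the entropy of X itself.
    static double gainRatio(double[][] table) {
        int K = table.length;
        double[] partTotals = new double[K];
        for (int k = 0; k < K; k++) {
            partTotals[k] = Arrays.stream(table[k]).sum();
        }
        return informationGain(table) / entropy(partTotals);
    }

    public static void main(String[] args) {
        // X has K = 2 categories, Y has C = 2 classes; the split separates the
        // classes perfectly, so g(S,X) = E(S) and the gain ratio equals 1.
        double[][] table = {{6, 0}, {0, 4}};
        System.out.println(informationGain(table));
        System.out.println(gainRatio(table));
    }
}
```

Because the split is pure, E(S,X) = 0 and the gain equals the parent entropy; in the impure case both criteria shrink toward 0.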
Nested classes/interfaces inherited from class DecisionTreeInfoGain
DecisionTreeInfoGain.GainCriteria
Nested classes/interfaces inherited from class DecisionTree
DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException
Nested classes/interfaces inherited from class PredictiveModel
PredictiveModel.CloneNotSupportedException, PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType
Constructor and Description |
---|
C45(C45 c45Model)
Constructs a copy of the input C45 decision tree. |
C45(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
Constructs a C45 object for a single response variable and multiple predictor variables. |
Modifier and Type | Method and Description |
---|---|
C45 | clone()
Clones a C45 decision tree. |
protected int | selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
Selects the split variable for the present node using the C4.5 method. |
Methods inherited from class DecisionTreeInfoGain
getCountXY, getCriteriaValueCategorical, setGainCriteria, setUseRatio, useGainRatio
Methods inherited from class DecisionTree
fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinCostComplexityValue, getMinObsPerChildNode, getMinObsPerNode, getNodeAssigments, getNumberOfComplexityValues, getNumberOfRandomFeatures, isAutoPruningFlag, isRandomFeatureSelection, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setConfiguration, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode, setNumberOfRandomFeatures, setRandomFeatureSelection
Methods inherited from class PredictiveModel
getClassCounts, getClassErrors, getClassLabels, getClassProbabilities, getCostMatrix, getMaxNumberOfCategories, getMaxNumberOfIterations, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getRandomObject, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isConstantSeries, isMustFitModel, isUserFixedNClasses, setClassCounts, setClassLabels, setClassProbabilities, setCostMatrix, setMaxNumberOfCategories, setMaxNumberOfIterations, setMustFitModel, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setRandomObject, setResponseColumnIndex, setTrainingData, setVariableType, setWeights
public C45(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
Constructs a C45 object for a single response variable and multiple predictor variables.
Parameters:
xy - a double matrix containing the training data and associated response values
responseColumnIndex - an int specifying the column index of the response variable
varType - a PredictiveModel.VariableType array containing the type of each variable

public C45(C45 c45Model)
Constructs a copy of the input C45 decision tree.
Parameters:
c45Model - a C45 decision tree

public C45 clone()
Clones a C45 decision tree.
Overrides:
clone in class PredictiveModel
Returns:
a C45 decision tree
decision treeprotected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, double[] splitCriterionValue, int[] splitPartition)
selectSplitVariable
in class DecisionTreeInfoGain
xy
- a double
matrix containing the dataclassCounts
- a double
array containing the counts for
each class of the response variable, when it is categoricalparentFreq
- a double
array used to indicate which
subset of the observations belong in the current nodesplitValue
- a double
array representing the resulting
split point if the selected variable is quantitativesplitCriterionValue
- a double
, the value of the
criterion used to determine the splitting variablesplitPartition
- an int
array indicating the resulting
split partition if the selected variable is categoricalint
specifying the column index of the split
variable in this.getPredictorIndexes()
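As described above, for a quantitative predictor C4.5 evaluates only binary splits of the form X ≤ d versus X > d. The following standalone sketch, which is illustrative and not the library's selectSplitVariable implementation, scans candidate thresholds at midpoints between consecutive sorted values and returns the one with the largest information gain:

```java
import java.util.Arrays;

public class BestSplitDemo {

    // Shannon entropy computed from class counts; 0 * log(0) is taken to be 0.
    static double entropy(double[] counts) {
        double n = Arrays.stream(counts).sum();
        double e = 0.0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / n;
                e -= p * Math.log(p);
            }
        }
        return e;
    }

    // Return the threshold d maximizing the information gain of the binary
    // split X <= d vs. X > d, where x holds the predictor values and y the
    // class labels 0..numClasses-1. Candidate thresholds are midpoints
    // between consecutive distinct sorted values of x.
    static double bestSplitPoint(double[] x, int[] y, int numClasses) {
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < x.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(x[a], x[b]));

        double[] total = new double[numClasses];
        for (int label : y) total[label]++;
        double parent = entropy(total);          // E(S)

        double[] left = new double[numClasses];  // counts for X <= d
        double[] right = total.clone();          // counts for X > d
        double bestGain = -1.0, bestD = Double.NaN, n = x.length;
        for (int j = 0; j < x.length - 1; j++) {
            int i = order[j];
            left[y[i]]++;
            right[y[i]]--;
            if (x[order[j]] == x[order[j + 1]]) continue; // no threshold between ties
            double nl = j + 1, nr = n - nl;
            double gain = parent - (nl / n) * entropy(left) - (nr / n) * entropy(right);
            if (gain > bestGain) {
                bestGain = gain;
                bestD = 0.5 * (x[order[j]] + x[order[j + 1]]);
            }
        }
        return bestD;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        int[] y = {0, 0, 0, 1, 1, 1};
        // The classes separate perfectly at d = 6.5, the midpoint of 3 and 10.
        System.out.println(bestSplitPoint(x, y, 2));
    }
}
```

Maintaining running left/right class counts makes the scan O(n log n) overall (dominated by the sort), the same incremental-counting idea any split-selection routine relies on.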
Copyright © 2020 Rogue Wave Software. All rights reserved.