public class C45 extends DecisionTreeInfoGain implements Serializable, Cloneable
Generates a decision tree using the C4.5 algorithm for a categorical response variable and categorical or quantitative predictor variables. The C45 procedure (Quinlan, 1995) partitions the sample space using an information gain or a gain ratio as the splitting criterion. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as
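$$E(S) = -\sum_{i=1}^{C} p_i \log(p_i)$$
where $p_i = \Pr[Y = i \mid S]$ is the probability that the
response takes on category i on the dataset S. This measure is
widely known as the Shannon entropy. Splitting the dataset further may either
increase or decrease the entropy in the response variable. For example, the
entropy of Y over a partitioning of S by X, a variable
with K categories, is given by
$$E(S, X) = -\sum_{k=1}^{K} \sum_{i=1}^{C} p(k)\, p(i \mid k) \log(p(i \mid k))$$
where $p(k) = \Pr[X = k \mid S]$ and $p(i \mid k) = \Pr[Y = i \mid X = k, S]$.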
If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:
$$\mathrm{gain}(S, X) = E(S) - E(S, X)$$
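The entropy and information gain of a categorical split can be sketched in a few lines of Java. This is an illustrative sketch only, not the library's implementation: the class `InfoGainSketch`, its method names, the count-table layout `counts[k][i]` (observations with X = k and Y = i), and the choice of base-2 logarithms are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the C45 API.
final class InfoGainSketch {

    /** Shannon entropy (base 2, an assumption) of a vector of class counts. */
    static double entropy(double[] counts) {
        double total = Arrays.stream(counts).sum();
        double e = 0.0;
        for (double c : counts) {
            if (c > 0) {                       // treat 0 * log(0) as 0
                double p = c / total;
                e -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return e;
    }

    /** Entropy of Y over the partition by X, with counts[k][i] = #(X = k, Y = i). */
    static double conditionalEntropy(double[][] counts) {
        double n = 0.0;
        for (double[] row : counts) {
            n += Arrays.stream(row).sum();
        }
        double e = 0.0;
        for (double[] row : counts) {
            double nk = Arrays.stream(row).sum();
            if (nk > 0) {
                e += (nk / n) * entropy(row);  // p(k) times the entropy within category k
            }
        }
        return e;
    }

    /** Information gain: entropy of Y on S minus entropy of Y over the partition by X. */
    static double gain(double[][] counts) {
        int nClasses = counts[0].length;
        double[] totals = new double[nClasses];
        for (double[] row : counts) {
            for (int i = 0; i < nClasses; i++) {
                totals[i] += row[i];
            }
        }
        return entropy(totals) - conditionalEntropy(counts);
    }
}
```

For instance, `InfoGainSketch.gain(new double[][] {{5, 0}, {0, 5}})` describes a predictor that separates two equally frequent classes perfectly, so the gain equals the full entropy of the response, while a table such as `{{2, 2}, {3, 3}}` yields no gain.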
The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:
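$$\mathrm{gainRatio}(S, X) = \frac{\mathrm{gain}(S, X)}{E_X(S)}$$
where
$$E_X(S) = -\sum_{k=1}^{K} \nu_k \log(\nu_k)$$
with
$$\nu_k = \Pr[X = k \mid S]$$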
Note that $E_X(S)$ is just the entropy of the variable
X over S. The gain ratio is thought to be less biased toward
predictors with many categories. C4.5 treats a continuous variable
similarly, except that only binary splits of the form
$X \le d$ and $X > d$ are considered,
where d is a value in the range of X on S. The best
split is determined by the split variable and split point that give the
largest criterion value. It is possible that no variable meets the threshold
for further splitting at the current node, in which case growing stops and
the node becomes a terminal node. Otherwise, the node is split,
creating two or more child nodes. Then, using the dataset partition defined
by the splitting variable and split value, the same procedure is
repeated for each child node. Thus a collection of nodes and child nodes is
generated, or, in other words, the tree is grown. The growth stops
when one or more stopping conditions are met.
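The search for a split point on a continuous predictor can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the class `ContinuousSplitSketch` and the method `bestSplit` are hypothetical, class labels are assumed to be coded 0 to `nClasses - 1`, logarithms are base 2, candidate split points are the observed values of X, and the gain ratio is used as the criterion.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the C45 API.
final class ContinuousSplitSketch {

    /** Shannon entropy (base 2, an assumption) of a vector of counts. */
    static double entropy(double[] counts) {
        double total = Arrays.stream(counts).sum();
        double e = 0.0;
        for (double c : counts) {
            if (c > 0) {                       // treat 0 * log(0) as 0
                double p = c / total;
                e -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return e;
    }

    /**
     * Returns {d, gainRatio} for the best binary split X <= d versus X > d,
     * or null when all values of x are equal and no split exists.
     */
    static double[] bestSplit(double[] x, int[] y, int nClasses) {
        int n = x.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(x[a], x[b]));

        double[] total = new double[nClasses];
        for (int label : y) {
            total[label]++;
        }
        double parentEntropy = entropy(total);

        double[] left = new double[nClasses];  // class counts for X <= d
        double[] right = total.clone();        // class counts for X > d
        double[] best = null;
        for (int j = 0; j < n - 1; j++) {
            int idx = order[j];
            left[y[idx]]++;
            right[y[idx]]--;
            double d = x[idx];
            if (d == x[order[j + 1]]) {
                continue;                      // tied values must stay on the same side
            }
            double nl = j + 1;
            double nr = n - nl;
            double childEntropy = (nl / n) * entropy(left) + (nr / n) * entropy(right);
            double gain = parentEntropy - childEntropy;
            double splitEntropy = entropy(new double[] {nl, nr}); // entropy of the split itself
            double ratio = gain / splitEntropy;
            if (best == null || ratio > best[1]) {
                best = new double[] {d, ratio};
            }
        }
        return best;
    }
}
```

With `x = {4, 1, 3, 2}` and `y = {1, 0, 1, 0}`, the split at d = 2 separates the two classes exactly, so it is the one returned. A full tree-growing routine would apply such a search to every predictor at a node, compare the results against the splitting threshold, and recurse on the resulting child nodes.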
Nested classes/interfaces inherited from class DecisionTreeInfoGain: DecisionTreeInfoGain.GainCriteria

Nested classes/interfaces inherited from class DecisionTree: DecisionTree.MaxTreeSizeExceededException, DecisionTree.PruningFailedToConvergeException, DecisionTree.PureNodeException

Nested classes/interfaces inherited from class PredictiveModel: PredictiveModel.PredictiveModelException, PredictiveModel.StateChangeException, PredictiveModel.SumOfProbabilitiesNotOneException, PredictiveModel.VariableType

| Constructor and Description |
|---|
| C45(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType) Constructs a C45 object for a single response variable and multiple predictor variables. |

| Modifier and Type | Method and Description |
|---|---|
| protected int | selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, int[] splitPartition) Selects the split variable for the present node using the C45 method. |

Methods inherited from class DecisionTreeInfoGain: information, setGainCriteria, setUseRatio, useGainRatio

Methods inherited from class DecisionTree: fitModel, getCostComplexityValues, getDecisionTree, getFittedMeanSquaredError, getMaxDepth, getMaxNodes, getMeanSquaredPredictionError, getMinObsPerChildNode, getMinObsPerNode, getNumberOfComplexityValues, getNumberOfSets, isAutoPruningFlag, predict, predict, predict, printDecisionTree, printDecisionTree, pruneTree, setAutoPruningFlag, setConfiguration, setCostComplexityValues, setMaxDepth, setMaxNodes, setMinCostComplexityValue, setMinObsPerChildNode, setMinObsPerNode

Methods inherited from class PredictiveModel: getClassCounts, getCostMatrix, getMaxNumberOfCategories, getNumberOfClasses, getNumberOfColumns, getNumberOfMissing, getNumberOfPredictors, getNumberOfRows, getNumberOfUniquePredictorValues, getPredictorIndexes, getPredictorTypes, getPrintLevel, getPriorProbabilities, getResponseColumnIndex, getResponseVariableAverage, getResponseVariableMostFrequentClass, getResponseVariableType, getTotalWeight, getVariableType, getWeights, getXY, isMustFitModelFlag, isUserFixedNClasses, setClassCounts, setCostMatrix, setMaxNumberOfCategories, setNumberOfClasses, setPredictorIndex, setPredictorTypes, setPrintLevel, setPriorProbabilities, setWeights

public C45(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)

Constructs a C45 object for a single response variable and multiple predictor variables.

Parameters:

xy - a double matrix with one row per observation and one column per variable, that is, the predictor variables plus one response variable.

responseColumnIndex - an int specifying the column index of the response variable.

varType - a PredictiveModel.VariableType array containing the type of each variable.

protected int selectSplitVariable(double[][] xy, double[] classCounts, double[] parentFreq, double[] splitValue, int[] splitPartition)

Selects the split variable for the present node using the C45 method.

Declared in superclass: selectSplitVariable in class DecisionTreeInfoGain

Parameters:

xy - a double matrix containing the data.

classCounts - a double array containing the counts for each class of the response variable, when it is categorical.

parentFreq - a double array used to determine which subset of the observations belongs in the current node.

splitValue - a double array representing the resulting split point if the selected variable is quantitative.

splitPartition - an int array indicating the resulting split partition if the selected variable is categorical.

Returns:

an int specifying the column index of the split variable in xy.

Copyright © 1970-2015 Rogue Wave Software
Built June 18 2015.