imsl.data_mining.C45DecisionTree
class C45DecisionTree(response_col_idx, var_type, criteria=0, use_gain_ratio=False, min_n_node=7, min_split=21, max_x_cats=10, max_size=100, max_depth=10, priors=None, response_name='Y', var_names=None, class_names=None, categ_names=None)

Generate a decision tree using the C4.5 method.

Generate a decision tree for a single response variable and two or more predictor variables using the C4.5 method.
Parameters:

- response_col_idx (int) – Column index of the response variable.
- var_type ((N,) array_like) –
Array indicating the type of each variable.

    var_type[i]   Type
    0             Categorical
    1             Ordered Discrete (Low, Med., High)
    2             Quantitative or Continuous
    3             Ignore this variable

- criteria (int, optional) –
Specifies which criterion the C4.5 method should use in the gain calculations to determine the best split at each node. (A short computational sketch of the three criteria follows this parameter list.)

    criteria   Measure
    0          Shannon Entropy
    1          Gini Index
    2          Deviance

Default is 0.
Shannon Entropy
Shannon Entropy is a measure of randomness or uncertainty. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Shannon Entropy is defined as \(-\sum_{i=1}^{C}p_i\log(p_i)\), where \(p_i = Pr(Y=i)\) and where \(p_i \log(p_i) = 0\) if \(p_i=0\).
Gini Index
Gini Index is a measure of statistical dispersion. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Gini index is defined as \(I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array}}^C p(i|S)p(j|S)=1-\sum^C_{i=1}p^2(i|S)\), where \(p(i|S)\) denotes the probability that the variable is equal to the state \(i\) on the dataset \(S\).
Deviance
Deviance is a measure of the quality of fit. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Deviance measure is \(\sum_{i=1}^{C}n_i\log(p_i)\), where \(p_i = Pr(Y=i)\) and \(n_i\) is the number of cases with \(Y=i\) on the node.
- use_gain_ratio (bool, optional) –
If True, the C4.5 method uses the gain ratio instead of just the gain to determine the best split.
Default is False.
- min_n_node (int, optional) –
Do not split a node if one of its child nodes will have fewer than min_n_node observations.
Default is 7.
- min_split (int, optional) –
Do not split a node if the node has fewer than min_split observations.
Default is 21.
- max_x_cats (int, optional) –
Allow for up to max_x_cats categories for categorical predictor variables.
Default is 10.
- max_size (int, optional) –
Stop growing the tree once it has reached max_size number of nodes.
Default is 100.
- max_depth (int, optional) –
Stop growing the tree once it has reached max_depth number of levels.
Default is 10.
- priors ((N,) array_like, optional) – An array containing prior probabilities for class membership. The argument is ignored for continuous response variables. By default, the prior probabilities are estimated from the data.
- response_name (string, optional) –
A string representing the name of the response variable.
Default is “Y”.
- var_names (tuple, optional) –
A tuple containing strings representing the names of predictors.
Default is “X0”, “X1”, etc.
- class_names (tuple, optional) –
A tuple containing strings representing the names of the different classes in Y, assuming Y is of categorical type.
Default is “0”, “1”, etc.
- categ_names (tuple, optional) –
A tuple containing strings representing the names of the different category levels for each predictor of categorical type.
Default is “0”, “1”, etc.
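
For reference, the short sketch below evaluates the three splitting criteria listed under the criteria parameter from the class proportions or counts at a single node. It is an illustration of the formulas only, not part of the imsl API; the helper names (shannon_entropy, gini_index, deviance) are invented for the example, and the use of the natural logarithm is an assumption.

import numpy as np

def shannon_entropy(p):
    # -sum(p_i * log(p_i)), with p_i * log(p_i) taken as 0 when p_i = 0
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def gini_index(p):
    # 1 - sum(p_i ** 2)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def deviance(n):
    # sum(n_i * log(p_i)) for class counts n_i at the node (sign as in the formula above)
    n = np.asarray(n, dtype=float)
    p = n / n.sum()
    return np.sum(n[p > 0] * np.log(p[p > 0]))

# Class counts at the root node of the golf example below: 5 "Don't Play", 9 "Play"
counts = np.array([5, 9])
p = counts / counts.sum()
print(shannon_entropy(p), gini_index(p), deviance(counts))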
Notes
The method C4.5 ([1]) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps outlined above, using the information gain or gain ratio as the splitting criterion. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as
\[E(S)=-\sum_{i=1}^Cp_i\log\left(p_i\right)\]

where \(p_i=\Pr[Y=i|S]\) is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon Entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by
\[E(S,X)=\sum_{k=1}^{K}p\left(S_k\right)E\left(S_k\right)\]

where \(p(S_k)\) is the proportion of S that falls in partition \(S_k\). If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:
\[g(S,X)=E(S)-E(S,X)\]

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:
\[gR(S,X)=\frac{E(S)-E(S,X)}{E_X\left(S\right)}\]

where

\[E_X\left(S\right)=-\sum_{k=1}^K\nu_k\log\left(\nu_k\right)\]

with

\[\nu_k=\Pr[X=k|S]\]

Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\le d\) and \(X>d\) are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the very same procedure is repeated for each child node. In this way a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops after one or more stopping conditions are met.
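
To make the gain and gain ratio formulas concrete, the sketch below evaluates them for the Outlook split of the golf data used in the first example. It is written directly against the formulas above and is not the library's internal implementation; the helper names (entropy, information_gain, gain_ratio) are invented for the example, and natural logarithms are assumed.

import numpy as np

def entropy(y):
    # E(S) = -sum_i p_i log(p_i) over the labels in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(x, y):
    # g(S, X) = E(S) - E(S, X), with E(S, X) = sum_k p(S_k) E(S_k)
    e_sx = sum((x == k).mean() * entropy(y[x == k]) for k in np.unique(x))
    return entropy(y) - e_sx

def gain_ratio(x, y):
    # gR(S, X) = g(S, X) / E_X(S), where E_X(S) is the entropy of X itself
    return information_gain(x, y) / entropy(x)

# Outlook (column 0) and Play (column 4) from the golf data in the first example
outlook = np.array([0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2])
play = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
print(information_gain(outlook, play), gain_ratio(outlook, play))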
References
[1] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann. For the latest information on Quinlan’s algorithms see http://www.rulequest.com/.

Examples
In this example, we use a small dataset with the response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (True (0), False (1)). A decision tree is generated by the C4.5 method. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. Notice that C4.5 splits on Outlook first, then on Humidity and Wind.
>>> import numpy as np
>>> import imsl.data_mining as dm
>>> xy = np.array([[0, 85.0, 85.0, 0, 0],
...                [0, 80.0, 90.0, 1, 0],
...                [1, 83.0, 78.0, 0, 1],
...                [2, 70.0, 96.0, 0, 1],
...                [2, 68.0, 80.0, 0, 1],
...                [2, 65.0, 70.0, 1, 0],
...                [1, 64.0, 65.0, 1, 1],
...                [0, 72.0, 95.0, 0, 0],
...                [0, 69.0, 70.0, 0, 1],
...                [2, 75.0, 80.0, 0, 1],
...                [0, 75.0, 70.0, 1, 1],
...                [1, 72.0, 90.0, 1, 1],
...                [1, 81.0, 75.0, 0, 1],
...                [2, 71.0, 80.0, 1, 0]])
>>> response_column_index = 4
>>> var_type = np.array([0, 2, 2, 0, 0], dtype=int)
>>> names = ["Outlook", "Temperature", "Humidity", "Wind"]
>>> class_names = ["Don't Play", "Play"]
>>> var_levels = ["Sunny", "Overcast", "Rainy", "False", "True"]
>>> print("Decision Tree using Method C4.5:\n\n")
...
Decision Tree using Method C4.5:
>>> with dm.C45DecisionTree(response_column_index, var_type,
...                         min_n_node=2, min_split=3, max_x_cats=10,
...                         max_size=50, max_depth=10, var_names=names,
...                         class_names=class_names,
...                         categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     print(decision_tree)
...
Decision Tree:
Node 0: Cost = 0.357, N = 14.0, Level = 0, Child Nodes: 1 4 5
P(Y=0) = 0.357
P(Y=1) = 0.643
Predicted Y: Play
Node 1: Cost = 0.143, N = 5.0, Level = 1, Child Nodes: 2 3
Rule: Outlook in: { Sunny }
P(Y=0) = 0.600
P(Y=1) = 0.400
Predicted Y: Don't Play
Node 2: Cost = 0.000, N = 2.0, Level = 2
Rule: Humidity <= 77.500
P(Y=0) = 0.000
P(Y=1) = 1.000
Predicted Y: Play
Node 3: Cost = 0.000, N = 3.0, Level = 2
Rule: Humidity > 77.500
P(Y=0) = 1.000
P(Y=1) = 0.000
Predicted Y: Don't Play
Node 4: Cost = 0.000, N = 4.0, Level = 1
Rule: Outlook in: { Overcast }
P(Y=0) = 0.000
P(Y=1) = 1.000
Predicted Y: Play
Node 5: Cost = 0.143, N = 5.0, Level = 1, Child Nodes: 6 7
Rule: Outlook in: { Rainy }
P(Y=0) = 0.400
P(Y=1) = 0.600
Predicted Y: Play
Node 6: Cost = 0.000, N = 3.0, Level = 2
Rule: Wind in: { False }
P(Y=0) = 0.000
P(Y=1) = 1.000
Predicted Y: Play
Node 7: Cost = 0.000, N = 2.0, Level = 2
Rule: Wind in: { True }
P(Y=0) = 1.000
P(Y=1) = 0.000
Predicted Y: Don't Play
>>> import numpy as np
>>> import imsl.data_mining as dm
>>> xy = np.array([[2, 0, 2],
...                [1, 0, 0],
...                [2, 1, 3],
...                [0, 1, 0],
...                [1, 2, 0],
...                [2, 2, 3],
...                [2, 2, 3],
...                [0, 1, 0],
...                [0, 0, 0],
...                [0, 1, 0],
...                [1, 2, 0],
...                [2, 0, 2],
...                [0, 2, 0],
...                [2, 0, 1],
...                [0, 0, 0],
...                [2, 0, 1],
...                [1, 0, 0],
...                [0, 2, 0],
...                [2, 0, 1],
...                [1, 2, 0],
...                [0, 2, 2],
...                [2, 1, 3],
...                [1, 1, 0],
...                [2, 2, 3],
...                [1, 2, 0],
...                [2, 2, 3],
...                [2, 0, 1],
...                [2, 1, 3],
...                [1, 2, 0],
...                [1, 1, 0]])
>>> response_column_index = 2
>>> var_type = np.array([0, 0, 0], dtype=int)
>>> names = ['Var1', 'Var2']
>>> class_names = ["c1", "c2", "c3", "c4"]
>>> response_name = 'Response'
>>> var_levels = ["L1", "L2", "L3", "A", "B", "C"]
>>> print("Decision Tree using Method C4.5:\n\n")
...
Decision Tree using Method C4.5:
>>> with dm.C45DecisionTree(response_column_index, var_type,
...                         min_n_node=5, min_split=10, max_x_cats=10,
...                         max_size=50, max_depth=10,
...                         response_name=response_name, var_names=names,
...                         class_names=class_names,
...                         categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     print(decision_tree)
...
Decision Tree:
Node 0: Cost = 0.467, N = 30.0, Level = 0, Child Nodes: 1 2 3
P(Y=0) = 0.533
P(Y=1) = 0.133
P(Y=2) = 0.100
P(Y=3) = 0.233
Predicted Response: c1
Node 1: Cost = 0.033, N = 8.0, Level = 1
Rule: Var1 in: { L1 }
P(Y=0) = 0.875
P(Y=1) = 0.000
P(Y=2) = 0.125
P(Y=3) = 0.000
Predicted Response: c1
Node 2: Cost = 0.000, N = 9.0, Level = 1
Rule: Var1 in: { L2 }
P(Y=0) = 1.000
P(Y=1) = 0.000
P(Y=2) = 0.000
P(Y=3) = 0.000
Predicted Response: c1
Node 3: Cost = 0.200, N = 13.0, Level = 1
Rule: Var1 in: { L3 }
P(Y=0) = 0.000
P(Y=1) = 0.308
P(Y=2) = 0.154
P(Y=3) = 0.538
Predicted Response: c4
Methods

predict(data[, weights])          Compute predicted values using a decision tree.
train(training_data[, weights])   Train a decision tree using training data and weights.

Attributes

categ_names      Return names of category levels for each categorical predictor.
class_names      Return names of different classes in Y.
n_classes        Return number of classes assumed by response variable.
n_levels         Return number of levels or depth of tree.
n_nodes          Return number of nodes or size of tree.
n_preds          Return number of predictors used in the model.
pred_n_values    Return number of values of predictor variables.
pred_type        Return types of predictor variables.
response_name    Return name of the response variable.
response_type    Return type of the response variable.
var_names        Return names of the predictors.
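
The two methods are meant to be used together inside the with block, as in the examples above: train fits the tree, and predict can then be applied to observations. A minimal, hedged sketch follows; it assumes that predict expects an array whose columns are the predictor variables in training order and returns an array of predicted values, details not spelled out on this page.

# Continuing inside the `with` block of the first example, after decision_tree.train(xy):
# (assumption: `data` holds only the predictor columns, in the same order as in training)
predictions = decision_tree.predict(xy[:, :4])
print(predictions)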