decisionTree¶
Generates a decision tree for a single response variable and two or more predictor variables.
Synopsis¶
decisionTree (xy, responseColIdx, varType)
Required Arguments¶
- float xy[[]] (Input)
  Array of size n × nCols containing the data.
- int responseColIdx (Input)
  Column index of the response variable.
- int varType[] (Input)
  Array of length nCols indicating the type of each variable.
varType[i] | Type |
---|---|
0 | Categorical |
1 | Ordered Discrete (Low, Med., High) |
2 | Quantitative or Continuous |
3 | Ignore this variable |
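To make the encoding concrete, here is a small sketch of a varType array for a hypothetical five-column data set (the column layout and names are invented for illustration):

```python
# Hypothetical column layout: [color, rating, price, record_id, label]
varType = [
    0,  # color:     categorical
    1,  # rating:    ordered discrete (e.g., Low, Med., High)
    2,  # price:     quantitative/continuous
    3,  # record_id: ignore this column
    0,  # label:     categorical response
]
responseColIdx = 4  # the response ("label") is in the last column
```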
Return Value¶
A structure containing the decision tree (see Structure Definitions below). If an error occurs, None is returned.
Optional Arguments¶
method, int (Input)
  Specifies the tree generation method. The key for the variable type index is provided in the table above.

method | Method | Response varType | Predictor varType |
---|---|---|---|
0 | C4.5 | 0 | 0, 1, 2 |
1 | ALACART (Breiman, et al.) | 0, 1, 2 | 0, 1, 2 |
2 | CHAID | 0, 1, 2 | 0 |
3 | QUEST | 0 | 0, 1, 2 |

  Default: method = 0.

criteria, int (Input)
  Specifies which criteria the ALACART method and the C4.5 method should use in the gain calculations to determine the best split at each node.

criteria | Measure |
---|---|
0 | Shannon Entropy |
1 | Gini Index |

  Default: criteria = 0.

ratio, (Input)
  If present, the ALACART and C4.5 methods each use a gain ratio instead of just the gain to determine the best split.
  Default: Uses gain.
weights, float[] (Input)
  An array of length n containing case weights.
  Default: weights[i] = 1.0.
costMatrix, float[] (Input)
  An array of length nClasses × nClasses containing the cost matrix for a categorical response variable, where nClasses is the number of classes the response variable may assume. The cost matrix has elements \(C(i,j)\) = cost of misclassifying a response in class j as in class i. The diagonal elements of the cost matrix must be 0.
  Default: costMatrix[i] = 1.0 for i on the off-diagonal; costMatrix[i] = 0.0 for i on the diagonal.

control, int[] (Input)
  Array of length 5 containing parameters to control the maximum size of the tree and other stopping rules.

control[i] | Name | Action |
---|---|---|
0 | minNNode | Do not split a node if one of its child nodes will have fewer than minNNode observations. |
1 | minSplit | Do not split a node if the node has fewer than minSplit observations. |
2 | maxXCats | Allow up to maxXCats categories for categorical predictor variables. |
3 | maxSize | Stop growing the tree once it has reached maxSize nodes. |
4 | maxDepth | Stop growing the tree once it has reached maxDepth levels. |

  Default: control[] = [7, 21, 10, 100, 10]
complexity, float (Input)
  The minimum complexity parameter to use in cross-validation. Complexity must be ≥ 0.
  Default: complexity = 0.0.

nSurrogates, int (Input)
  Indicates the number of surrogate splits. Only used if method = 1.
  Default: nSurrogates = 0.

alphas, float[] (Input)
  An array of length 3 containing the significance levels. alphas[0] = significance level for split variable selection (CHAID and QUEST); alphas[1] = significance level for merging categories of a variable (CHAID); and alphas[2] = significance level for splitting previously merged categories (CHAID). Valid values are in the range 0 < alphas[i] < 1.0, and alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.
  Default: alphas[] = [0.05, 0.05, -1.0]
priors, float[] (Input)
  An array of length nClasses, where nClasses is the number of classes the response variable may assume, containing prior probabilities for class membership. The argument is ignored for continuous response variables (varType[responseColIdx] = 2). By default, the prior probabilities are estimated from the data.

nFolds, int (Input)
  The number of folds to use in cross-validation tree selection. nFolds must be between 1 and n, inclusive. If nFolds = 1, the full data set is used once to generate the decision tree; in other words, no cross-validation is performed. If 1 < n/nFolds ≤ 3, then leave-one-out cross-validation is performed.
  Default: nFolds = 10.

nSamples, int (Input)
  The number of bootstrap samples to use in bootstrap aggregation (bagging) when predicted values are requested. To obtain predictions produced by bagging, set nSamples > 0 and use predicted.
  Default: nSamples = 0.

tolerance, float (Input)
  Error tolerance to use in the algorithm.
  Default: tolerance = 100.0e0 * machine(4).

randomSeed, int (Input)
  The seed of the random number generator used in sampling or cross-validation. By changing the value of randomSeed on different calls to decisionTree with the same data set, calls may produce slightly different results. Setting randomSeed to zero forces the random number seed to be determined by the system clock.
  Default: randomSeed = 0.
printLevel, int (Input)

printLevel | Action |
---|---|
0 | No printing |
1 | Prints final results only |
2 | Prints intermediate and final results |

  Default: printLevel = 0.
testData, float[] (Input)
  An array of size nTest × nCols containing hold-out or test data for which predictions are requested. When this optional argument is present, the number of observations in testData, nTest, must be greater than 0. The response variable may have missing values in testData, but it must be in the same column, and the predictors must be in the same columns, as they are in xy. If the test data is not provided but predictions are requested, then xy is used and the predictions are the fitted values.
  Default: testData = xy

testDataWeights, float[] (Input)
  An array of size nTest containing frequencies or weights for each observation in testData. This argument is ignored if testData is not present.
  Default: testDataWeights = weights.

errorSs (Output)
  The fitted data error mean sum of squares in the absence of test data (testData). When test data is provided, the prediction error mean sum of squares is returned.

predicted (Output)
  An array of length n containing the fitted or predicted value of the response variable for each case in the input data or test data, if provided.
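As a sketch of how several of these optional arguments fit together, the call below uses a tiny invented data set; the keyword names are assumed to match the argument names listed above, consistent with the examples later in this section:

```python
from pyimsl.stat.decisionTree import decisionTree

# Hypothetical data: two categorical predictors, one continuous predictor,
# and a categorical response in the last column.
xy = [[0, 1.2, 0, 0],
      [1, 3.4, 1, 1],
      [0, 2.2, 1, 0],
      [1, 0.8, 0, 1],
      [0, 4.1, 1, 1],
      [1, 2.7, 0, 0]]
varType = [0, 2, 0, 0]
weights = [1.0] * len(xy)       # unit case weights (the default)
costMatrix = [0.0, 1.0,         # C(i,j): zeros on the diagonal; misclassifying
              2.0, 0.0]         # class 1 as class 0 assumed twice as costly
predicted = []                  # output: fitted values

tree = decisionTree(xy, 3, varType,
                    method=1,            # ALACART
                    criteria=1,          # Gini index
                    weights=weights,
                    costMatrix=costMatrix,
                    nFolds=1,            # no cross-validation on this tiny set
                    predicted=predicted)
```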
Description¶
This implementation includes four of the most widely used algorithms for decision trees. Below is a brief summary of each approach.
C4.5¶
The method C4.5 (Quinlan, 1995) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps outlined above, using as splitting criterion the information gain or gain ratio. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as
Where \(p_i=Pr[Y=i|S]\) is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon Entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by
If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:
The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:
where
with
Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\leq d\) and \(X>d\) are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that gives the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops after one or more stopping conditions are met.
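To make the splitting criterion concrete, the following standalone sketch (not the library's implementation) computes the Shannon entropy of a toy response and the information gain of one candidate categorical split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy E(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_values):
    """Gain of splitting S by a categorical predictor X: E(S) - sum_k Pr[X=k] E(S_k)."""
    n = len(labels)
    subsets = {}
    for y, x in zip(labels, split_values):
        subsets.setdefault(x, []).append(y)
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional

# Toy data: binary response Y and one categorical predictor X
y = [1, 1, 0, 0, 1, 0, 1, 1]
x = ['A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']
print(entropy(y))              # entropy of Y over the full sample
print(information_gain(y, x))  # reduction in entropy from splitting on X
```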
ALACART¶
ALACART implements the method of Breiman, Friedman, Olshen and Stone (1984), the original authors and developers of CART™. CART™ is the trademarked name for Classification and Regression Trees. In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, splits into only two subsets are considered, e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C,D} is not.
For classification problems, ALACART uses a similar criterion to information gain called impurity. The method searches for a split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities
The measure function \(\varphi(\cdot)\) should be 0 for “pure” nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.
As only binary splits of a subset S are considered (\(S_1\), \(S_2\) such that \(S=S_1\cup S_2\) and \(S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into \(S_1\), \(S_2\) is
where \(q_j=Pr[S_j]\), j = 1, 2, is the node probability.
The gain criteria and the reduction in impurity ΔI are similar concepts and equivalent when I is entropy and when only binary splits are considered. Another popular measure for the impurity at a node is the Gini index, given by
If Y is an ordered or continuous response, the problem is a regression problem. ALACART generates the tree using the same steps, except that node-level measures or loss functions are the mean squared error (MSE) or mean absolute error (MAD) rather than node impurity measures.
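As a companion sketch (again, not the library's code), the Gini index and the impurity reduction ΔI for one hypothetical binary split can be computed as follows:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_i p_i^2 over the class proportions at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Delta I = I(S) - q1*I(S1) - q2*I(S2), with q_j the node probabilities."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# Toy binary split of a categorical response
parent = [0, 0, 1, 1, 1, 1, 0, 1]
left   = [0, 0, 0]          # cases with, say, X <= d
right  = [1, 1, 1, 1, 1]    # cases with X > d
print(impurity_reduction(parent, left, right))
```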
CHAID¶
The third method is appropriate only for categorical or discrete ordered predictor variables. Due to Kass (1980), CHAID is an acronym for chi-square automatic interaction detection. At each node, as above, CHAID looks for the best splitting variable. The approach is as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X with the categories of Y. The least significant result is noted and, if a threshold is met, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests and determine if there is further merging possible. If a merged category consists of three or more of the original categories of X, CHAID calls for a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result meets a threshold, then the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth. Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories define the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), then the node is not split. This growing procedure continues until one or more stopping conditions are met.
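The pairwise test that drives CHAID's merge step can be sketched with SciPy's chi-squared test of association. SciPy is not used by decisionTree; the snippet below only illustrates the idea on invented data:

```python
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

# Toy data: categorical predictor X with categories {A, B, C} and binary response Y
x = np.array(['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'B', 'C', 'A'])
y = np.array([0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0])

# For every pair of categories of X, test the 2-way association with Y.
# The least significant pair (largest p-value) is the merge candidate.
results = []
for a, b in combinations(np.unique(x), 2):
    mask = (x == a) | (x == b)
    table = np.array([[np.sum((x[mask] == c) & (y[mask] == k)) for k in (0, 1)]
                      for c in (a, b)])
    _, p, _, _ = chi2_contingency(table)
    results.append(((a, b), p))

merge_pair, p_value = max(results, key=lambda r: r[1])
print("Least significant pair:", merge_pair, "p =", round(p_value, 3))
```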
QUEST¶
The fourth method, the QUEST algorithm (Loh and Shih, 1997), is appropriate for a categorical response variable and predictors of either categorical or quantitative type. For each categorical predictor, QUEST performs a multi-way chi-square test of association between the predictor and Y. For every continuous predictor, QUEST performs an ANOVA test to see if the means of the predictor vary among the groups of Y. Among these tests, the variable with the most significant result is selected as a potential splitting variable, say, \(X_j\). If the p-value (adjusted for multiple tests) is less than the specified splitting threshold, then \(X_j\) is the splitting variable for the current node. If not, QUEST performs, for each continuous variable X, a Levene’s test of homogeneity to see if the variance of X varies within the different groups of Y. Among these tests, we again find the predictor with the most significant result, say \(X_i\). If its p-value (adjusted for multiple tests) is less than the splitting threshold, \(X_i\) is the splitting variable. Otherwise, the node is not split.
Assuming a splitting variable is found, the next step is to determine how the variable should be split. If the selected variable \(X_j\) is continuous, a split point d is determined by quadratic discriminant analysis (QDA) of \(X_j\) into two populations determined by a binary partition of the response Y. The goal of this step is to group the classes of Y into two subsets or super classes, A and B. If there are only two classes in the response Y, the super classes are obvious. Otherwise, calculate the means and variances of \(X_j\) in each of the classes of Y. If the means are all equal, put the largest-sized class into group A and combine the rest to form group B. If they are not all equal, use a k-means clustering method (\(k=2\)) on the class means to determine A and B.
\(X_j\) in A and in B is assumed to be normally distributed with estimated means \(\overline{x}_{j|A}\), \(\overline{x}_{j|B}\), and variances \(S^2 _{j|A}\), \(S^2 _{j|B}\), respectively. The quadratic discriminant is the partition \(X_j\leq d\) and \(X_j>d\) such that \(Pr(X_j,A)=Pr(X_j,B)\). The discriminant rule assigns an observation to A if \(x_{ij}\leq d\) and to B if \(x_{ij}>d\). For d to maximally discriminate, the probabilities must be equal.
If the selected variable \(X_j\) is categorical, it is first transformed using the method outlined in Loh and Shih (1997) and then QDA is performed as above. The transformation is related to the discriminant coordinate (CRIMCOORD) approach due to Gnanadesikan (1977).
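QUEST's variable-selection tests can likewise be sketched with SciPy's one-way ANOVA and Levene tests on invented data; the library performs the analogous tests (with multiplicity adjustment) internally:

```python
import numpy as np
from scipy.stats import f_oneway, levene

rng = np.random.default_rng(0)
# Toy data: a continuous predictor X grouped by the classes of a categorical response Y
x_by_class = [rng.normal(0.0, 1.0, 30),   # class 0
              rng.normal(0.8, 1.0, 30),   # class 1
              rng.normal(0.1, 2.0, 30)]   # class 2 (larger variance)

# Step 1: ANOVA tests whether the class means of X differ.
f_stat, p_means = f_oneway(*x_by_class)

# Step 2 (fallback): Levene's test checks whether the class variances of X differ.
w_stat, p_vars = levene(*x_by_class)

alpha = 0.05  # corresponds to alphas[0], the split-variable selection level
print("ANOVA p-value:  %.4f" % p_means)
print("Levene p-value: %.4f" % p_vars)
print("Selected as split variable:", p_means < alpha or p_vars < alpha)
```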
Minimal-Cost Complexity Pruning¶
One way to address overfitting is to grow the tree as large as possible, and then use some logic to prune it back. Let T represent a decision tree generated by any of the methods above. The idea (from Breiman et al.) is to find the smallest sub-tree of T that minimizes the cost complexity measure:
\(\tilde{T}\) denotes the set of terminal nodes, \(|\tilde{T}|\) represents the number of terminal nodes, and δ ≥ 0 is a cost-complexity parameter. For a categorical target variable
and \(C(i|j)\) is the cost for misclassifying the actual class j as i. Note that \(C(j|j)=0\) and \(C(i|j)>0\), for \(i\neq j\).
When the target is continuous (and the problem is a regression problem), the metric is instead the mean squared error
This software implements the optimal pruning algorithm (Algorithm 10.1, page 294) of Breiman et al. (1984). The result of the algorithm is a sequence of sub-trees \(T_{max}\succ T_1\succ T_2\succ\cdots T_{M-1}\succ\{t_0\}\) obtained by pruning the fully generated tree, \(T_{max}\), until the sub-tree consists of the single root node, \(\{t_0\}\). Corresponding to the sequence of sub-trees is the sequence of complexity values, \(0\leq \delta_{min}=\delta_1<\delta_2<\cdots<\delta_{M-1}<\delta_M\), where M is the number of steps it takes in the algorithm to reach the root node. The sub-trees represent the optimally pruned sub-trees for the sequence of complexity values. The minimum complexity \(\delta_{min}\) can be set via the optional argument complexity.
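The trade-off controlled by the complexity parameter can be illustrated with a small calculation; the subtree sizes and resubstitution costs below are invented, not produced by the library:

```python
# R_delta(T) = R(T) + delta * |T~|, where R(T) is the resubstitution cost and
# |T~| is the number of terminal nodes.  Hypothetical pruning sequence:
subtrees = [                  # (number of terminal nodes, R(T))
    (8, 0.10),                # T_max: large tree, small training error
    (5, 0.14),
    (3, 0.20),
    (1, 0.36),                # {t0}: root node only
]

for delta in (0.0, 0.02, 0.05, 0.10):
    best = min(subtrees, key=lambda t: t[1] + delta * t[0])
    print("delta = %.2f -> best subtree has %d terminal nodes" % (delta, best[0]))
```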
V-Fold Cross-Validation¶
In V-fold cross validation, the training data is partitioned randomly into V approximately equally sized sub-samples. The model is then trained V different times with each of the sub-samples removed in turn. The cross-validated estimate of the risk of a decision function \(R(d_k)\) is
where \(L(y,d(x))\) is the loss incurred when the decision is \(d(x)\) for the actual, y. The symbol η denotes the full training data set, and \(d_k^v\) denotes the set of decisions corresponding to \(T_k^v\), the \(k^{th}\) optimally pruned tree using the training sample \(\eta-\eta_v\) and \(\delta'_k=\sqrt{\delta_k \delta_{k+1}}\), where \(\delta_k\), \(\delta_{k+1}\) come from the pruning on the full data set, η.
For example, if the problem is classification, priors are estimated from the data as \(N_j/N\), and T represents any tree,
where
and
is the overall number of j cases misclassified as i. The standard error of \(R^{CV}(d_k)\) is approximated with
where
the “sample variance” of the cross-validated estimates.
Final selection rules include:
- Select \(T_{k_1}\) such that \(R^{CV} \left( T_{k_1} \right)=\min_k R^{CV} \left( T_k \right)\)
- Select \(T_{k_2}\) such that \(k_2\) is the largest k satisfying \(R^{CV} \left( T_{k_2} \right)\leq R^{CV} \left( T_{k_1} \right)+SE \left( R^{CV} \left( T_{k_1} \right) \right)\).
- For a specified complexity parameter \(\delta_{\mathit{fin}}\), select \(T_{k_3}\) such that \(R^{CV} \left( T_{k_3} \right)\leq\min_k R^{CV} \left( T_k \right)+\delta_{\mathit{fin}} | \tilde{T}_k |\).
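The V-fold partitioning itself is easy to sketch; the snippet below is a generic illustration and is independent of how the library implements cross-validation internally:

```python
import numpy as np

def v_fold_indices(n, v, seed=0):
    """Randomly partition n case indices into v approximately equal folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return np.array_split(perm, v)

n, v = 50, 10
folds = v_fold_indices(n, v)
for k, holdout in enumerate(folds):
    train = np.setdiff1d(np.arange(n), holdout)
    # ... grow and prune a tree on `train`, evaluate its loss on `holdout` ...
    print("fold %2d: %d training cases, %d held out" % (k, len(train), len(holdout)))
```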
Bagging¶
Bagging is a resampling approach for generating predictions. In particular, bagging stands for bootstrap aggregating. In the procedure, m bootstrap samples of size n are drawn from the training set of size n. Bootstrap sampling is sampling with replacement so that each sample has repeated observations, and some observations will be left out. A decision tree is grown treating each sample as a separate training set. The tree is then used to generate predictions for the test data. For each test case, the m predictions are combined by averaging the output (regression) or voting (classification) to obtain a final prediction.
Bagging leads to “improvements for unstable procedures,” such as neural nets, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).
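A generic sketch of bagging for classification (bootstrap resampling plus majority voting) is shown below; decisionTree performs the equivalent internally when nSamples > 0 and predicted is requested. The helper train_and_predict is a hypothetical user-supplied callback, not a library function:

```python
import numpy as np

def bagged_vote(train_and_predict, xy, test, m, seed=0):
    """Draw m bootstrap samples of the training data and vote on test predictions.

    train_and_predict(sample, test) is a hypothetical user-supplied function that
    fits a classifier on `sample` and returns an integer label for each row of `test`.
    """
    rng = np.random.default_rng(seed)
    n = len(xy)
    votes = []
    for _ in range(m):
        # Bootstrap sample: n draws with replacement from the n training cases.
        sample = [xy[i] for i in rng.integers(0, n, size=n)]
        votes.append(train_and_predict(sample, test))
    votes = np.asarray(votes)
    # Classification: combine the m predictions by majority vote per test case.
    # (For regression, this step would be a column-wise average instead.)
    return [int(np.bincount(votes[:, j]).argmax()) for j in range(votes.shape[1])]
```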
Missing Values¶
Any observation or case with a missing response variable is eliminated from the analysis. If a predictor has a missing value, each algorithm will skip that case when evaluating the given predictor. When making a prediction for a new case, if the split variable is missing, the prediction function applies surrogate split-variables and splitting rules in turn, if they are estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables when requested.
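For context, dropping cases with a missing response can be reproduced in a couple of lines of NumPy; this is only a sketch of the screening rule, which the library applies itself:

```python
import numpy as np

xy = np.array([[0, 85.0, 0],
               [1, np.nan, 1],      # missing predictor: kept, skipped per-predictor
               [np.nan, 70.0, 1],   # missing response: case dropped entirely
               [0, 64.0, 0]])
responseColIdx = 0
kept = xy[~np.isnan(xy[:, responseColIdx])]
print(kept)
```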
Structure Definitions¶
Name | Type | Description |
---|---|---|
nClasses | int | Number of classes assumed by the response variable, if the response variable is categorical |
nLevels | int | Number of levels or depth of tree |
n_nodes | int | Number of nodes or size of tree |
nodes | Imsls_f tree_node | An array of tree_node structures of size n_nodes |
n_preds | int | Number of predictors used in the model |
nSurrogates | int | Number of surrogate splits searched for at each node. Available for method = 1 |
pred_type | int | An array of length n_preds containing the type of each predictor variable |
pred_n_values | int | An array of length n_preds containing the number of values of each predictor variable |
response_type | int | Type of the response variable |
terminal_nodes | int | An array of length n_nodes indicating which nodes are terminal nodes |
Each element of nodes is a tree_node structure with the following fields:
Name | Type | Description |
---|---|---|
children_ids | int | An array of length n_children containing the IDs of the children nodes |
cost | float | Misclassification cost (in-sample cost measure at the current node) |
n_cases | int | Number of cases of the training data that fall into the current node |
n_children | int | Number of children of the current node |
node_id | int | Node ID, where the root node corresponds to node_id = 0 |
node_prob | float | Estimate of the probability of a new case belonging to the node |
node_split_value | float | Value around which the node will be split, if the node variable is of continuous type |
node_values_ind | int | Values of the split variable for the current node, if node_var_id has type 0 or 1 |
node_var_id | int | ID of the variable that defined the split in the parent node |
parent_id | int | ID of the parent of the node with ID node_id |
predicted_class | int | Predicted class at the current node, for response variables of categorical type |
predicted_val | float | Predicted value of the response variable if the response variable is of continuous type |
surrogate_info | float | Array containing the surrogate split information |
y_probs | float | The array of class probabilities at the current node, if the response variable is of categorical type |
Examples¶
Example 1¶
In this example, we use a small data set with response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (False (0), True (1)). A decision tree is generated by the C4.5 and ALACART methods. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. Notice that C4.5 splits on Outlook, then Humidity and Wind, while ALACART splits on Outlook, then Temperature.
from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree
xy = [[0, 85, 85, 0, 0],
[0, 80, 90, 1, 0],
[1, 83, 78, 0, 1],
[2, 70, 96, 0, 1],
[2, 68, 80, 0, 1],
[2, 65, 70, 1, 0],
[1, 64, 65, 1, 1],
[0, 72, 95, 0, 0],
[0, 69, 70, 0, 1],
[2, 75, 80, 0, 1],
[0, 75, 70, 1, 1],
[1, 72, 90, 1, 1],
[1, 81, 75, 0, 1],
[2, 71, 80, 1, 0]]
responseColIdx = 4
method = 1
varType = [0, 2, 2, 0, 0]
control = [2, 3, 10, 50, 10]
names = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
classNames = ["Don't Play", "Play"]
varLevels = ["Sunny", "Overcast", "Rainy", "False", "True"]
tree = decisionTree(xy, responseColIdx, varType,
nFolds=1, control=control)
print("Decision Tree using Method C4.5:")
decisionTreePrint(tree, varNames=names, classNames=classNames,
categNames=varLevels)
decisionTreeFree(tree)
tree = decisionTree(xy, responseColIdx, varType,
method=method, nFolds=1, control=control)
print("\n\nDecision Tree using Method ALACART:")
decisionTreePrint(tree, varNames=names, classNames=classNames,
categNames=varLevels)
decisionTreeFree(tree)
Output¶
Decision Tree using Method C4.5:
Decision Tree using Method ALACART:
Decision Tree:
Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 4 5
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y: Play
Node 1: Cost = 0.143, N= 5, Level = 1, Child nodes: 2 3
Rule: Outlook in: { Sunny }
P(Y=0)= 0.600
P(Y=1)= 0.400
Predicted Y: Don't Play
Node 2: Cost = 0.000, N= 2, Level = 2
Rule: Humidity <= 77.500
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 3: Cost = 0.000, N= 3, Level = 2
Rule: Humidity > 77.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 4: Cost = 0.000, N= 4, Level = 1
Rule: Outlook in: { Overcast }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 5: Cost = 0.143, N= 5, Level = 1, Child nodes: 6 7
Rule: Outlook in: { Rainy }
P(Y=0)= 0.400
P(Y=1)= 0.600
Predicted Y: Play
Node 6: Cost = 0.000, N= 3, Level = 2
Rule: Wind in: { False }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 7: Cost = 0.000, N= 2, Level = 2
Rule: Wind in: { True }
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Decision Tree:
Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 8
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y: Play
Node 1: Cost = 0.357, N= 10, Level = 1, Child nodes: 2 7
Rule: Outlook in: { Sunny Rainy }
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Y: Don't Play
Node 2: Cost = 0.214, N= 8, Level = 2, Child nodes: 3 6
Rule: Temperature <= 77.500
P(Y=0)= 0.375
P(Y=1)= 0.625
Predicted Y: Play
Node 3: Cost = 0.214, N= 6, Level = 3, Child nodes: 4 5
Rule: Temperature <= 73.500
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Y: Play
Node 4: Cost = 0.071, N= 4, Level = 4
Rule: Temperature <= 70.500
P(Y=0)= 0.250
P(Y=1)= 0.750
Predicted Y: Play
Node 5: Cost = 0.000, N= 2, Level = 4
Rule: Temperature > 70.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 6: Cost = 0.000, N= 2, Level = 3
Rule: Temperature > 73.500
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 7: Cost = 0.000, N= 2, Level = 2
Rule: Temperature > 77.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 8: Cost = 0.000, N= 4, Level = 1
Rule: Outlook in: { Overcast }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Example 2¶
This example applies the QUEST method to a simulated data set with 50 cases and three predictors of mixed type. A maximally grown tree under the default controls and the optimally pruned sub-tree obtained from cross-validation and minimal cost complexity pruning are produced. Notice that the optimally pruned tree consists of just the root node, whereas the maximal tree has five nodes and three levels.
from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree
xy = [[2, 25.928690, 0, 0],
[1, 51.632450, 1, 1],
[1, 25.784321, 0, 2],
[0, 39.379478, 0, 3],
[2, 24.650579, 0, 2],
[2, 45.200840, 0, 2],
[2, 52.679600, 1, 3],
[1, 44.283421, 1, 3],
[2, 40.635231, 1, 3],
[2, 51.760941, 0, 3],
[2, 26.303680, 0, 1],
[2, 20.702299, 1, 0],
[2, 38.742729, 1, 3],
[2, 19.473330, 0, 0],
[1, 26.422110, 0, 0],
[2, 37.059860, 1, 0],
[1, 51.670429, 1, 3],
[0, 42.401562, 0, 3],
[2, 33.900269, 1, 2],
[1, 35.432819, 0, 0],
[1, 44.303692, 0, 1],
[0, 46.723869, 0, 2],
[1, 46.992619, 0, 2],
[0, 36.059231, 0, 3],
[2, 36.831970, 1, 1],
[1, 61.662571, 1, 2],
[0, 25.677139, 0, 3],
[1, 39.085670, 1, 0],
[0, 48.843410, 1, 1],
[1, 39.343910, 0, 3],
[2, 24.735220, 0, 2],
[1, 50.552509, 1, 3],
[0, 31.342630, 1, 3],
[1, 27.157949, 1, 0],
[0, 31.726851, 0, 2],
[0, 25.004080, 0, 3],
[1, 26.354570, 1, 3],
[2, 38.123428, 0, 1],
[0, 49.940300, 0, 2],
[1, 42.457790, 1, 3],
[0, 38.809479, 1, 1],
[0, 43.227989, 1, 1],
[0, 41.876240, 0, 3],
[2, 48.078201, 0, 2],
[0, 43.236729, 1, 0],
[2, 39.412941, 0, 3],
[1, 23.933460, 0, 2],
[2, 42.841301, 1, 3],
[2, 30.406691, 0, 1],
[0, 37.773891, 0, 2]]
responseColIdx = 3
method = 3
varType = [0, 2, 0, 0]
tree = decisionTree(xy, responseColIdx, varType,
method=method, randomSeed=123457,
printLevel=1)
print("\nMaximal tree:")
decisionTreePrint(tree, printMax=True)
print("\n\nOptimally pruned subtree:")
decisionTreePrint(tree)
decisionTreeFree(tree)
Output¶
Maximal tree:
Optimally pruned subtree:
Growing the maximal tree using method QUEST:
Cross-Validation:
Tree complexity CV-err CV-Std.Error
0 0.00000 0.68295 0.07567
1 0.02000 0.70530 0.08069
2 0.04000 0.72814 0.08598
Select tree number 2, cost complexity parameter = 0.04000:
Decision Tree:
Node 0: Cost = 0.620, N= 50, Level = 0, Child nodes: 1 2
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y: 3
Node 1: Cost = 0.220, N= 17, Level = 1
Rule: X1 <= 35.031
P(Y=0)= 0.294
P(Y=1)= 0.118
P(Y=2)= 0.353
P(Y=3)= 0.235
Predicted Y: 2
Node 2: Cost = 0.360, N= 33, Level = 1, Child nodes: 3 4
Rule: X1 > 35.031
P(Y=0)= 0.121
P(Y=1)= 0.212
P(Y=2)= 0.212
P(Y=3)= 0.455
Predicted Y: 3
Node 3: Cost = 0.180, N= 19, Level = 2
Rule: X1 <= 43.265
P(Y=0)= 0.211
P(Y=1)= 0.211
P(Y=2)= 0.053
P(Y=3)= 0.526
Predicted Y: 3
Node 4: Cost = 0.160, N= 14, Level = 2
Rule: X1 > 43.265
P(Y=0)= 0.000
P(Y=1)= 0.214
P(Y=2)= 0.429
P(Y=3)= 0.357
Predicted Y: 2
Decision Tree:
Node 0: Cost = 0.620, N= 50, Level = 0
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y: 3
Pruned at Node id 0.
Example 3¶
This example uses the dataset Kyphosis. The 81 cases represent 81 children who have undergone surgery to correct a type of spinal deformity known as Kyphosis. The response variable is the presence or absence of Kyphosis after the surgery. The three predictors are Age of the patient in months; Start, the number of the vertebra where the surgery started; and Number, the number of vertebrae involved in the surgery. This example uses the QUEST method to produce a maximal tree. It also requests predictions for a test data set consisting of 10 “new” cases.
from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree
xy = [[0, 71, 3, 5],
[0, 158, 3, 14],
[1, 128, 4, 5],
[0, 2, 5, 1],
[0, 1, 4, 15],
[0, 1, 2, 16],
[0, 61, 2, 17],
[0, 37, 3, 16],
[0, 113, 2, 16],
[1, 59, 6, 12],
[1, 82, 5, 14],
[0, 148, 3, 16],
[0, 18, 5, 2],
[0, 1, 4, 12],
[0, 168, 3, 18],
[0, 1, 3, 16],
[0, 78, 6, 15],
[0, 175, 5, 13],
[0, 80, 5, 16],
[0, 27, 4, 9],
[0, 22, 2, 16],
[1, 105, 6, 5],
[1, 96, 3, 12],
[0, 131, 2, 3],
[1, 15, 7, 2],
[0, 9, 5, 13],
[0, 8, 3, 6],
[0, 100, 3, 14],
[0, 4, 3, 16],
[0, 151, 2, 16],
[0, 31, 3, 16],
[0, 125, 2, 11],
[0, 130, 5, 13],
[0, 112, 3, 16],
[0, 140, 5, 11],
[0, 93, 3, 16],
[0, 1, 3, 9],
[1, 52, 5, 6],
[0, 20, 6, 9],
[1, 91, 5, 12],
[1, 73, 5, 1],
[0, 35, 3, 13],
[0, 143, 9, 3],
[0, 61, 4, 1],
[0, 97, 3, 16],
[1, 139, 3, 10],
[0, 136, 4, 15],
[0, 131, 5, 13],
[1, 121, 3, 3],
[0, 177, 2, 14],
[0, 68, 5, 10],
[0, 9, 2, 17],
[1, 139, 10, 6],
[0, 2, 2, 17],
[0, 140, 4, 15],
[0, 72, 5, 15],
[0, 2, 3, 13],
[1, 120, 5, 8],
[0, 51, 7, 9],
[0, 102, 3, 13],
[1, 130, 4, 1],
[1, 114, 7, 8],
[0, 81, 4, 1],
[0, 118, 3, 16],
[0, 118, 4, 16],
[0, 17, 4, 10],
[0, 195, 2, 17],
[0, 159, 4, 13],
[0, 18, 4, 11],
[0, 15, 5, 16],
[0, 158, 5, 14],
[0, 127, 4, 12],
[0, 87, 4, 16],
[0, 206, 4, 10],
[0, 11, 3, 15],
[0, 178, 4, 15],
[1, 157, 3, 13],
[0, 26, 7, 13],
[0, 120, 2, 13],
[1, 42, 7, 6],
[0, 36, 4, 13]]
xyTest = [[0, 71, 3, 5],
[1, 128, 4, 5],
[0, 1, 4, 15],
[0, 61, 6, 10],
[0, 113, 2, 16],
[1, 82, 5, 14],
[0, 148, 3, 16],
[0, 1, 4, 12],
[0, 1, 3, 16],
[0, 175, 5, 13]]
responseColIdx = 0
method = 3
control = [5, 10, 10, 50, 10]
varType = [0, 2, 2, 2]
nTest = 10
names = ["Age", "Number", "Start"]
classNames = ["Absent", "Present"]
responseName = ["Kyphosis"]
predicted = []
errorSs = []
tree = decisionTree(xy, responseColIdx, varType,
method=method, nFolds=1,
control=control, testData=xyTest,
printLevel=2, predicted=predicted,
errorSs=errorSs)
decisionTreePrint(tree, varNames=names, classNames=classNames,
respName=responseName)
print("\nPredictions for test data:")
print("%5s%8s%7s%10s\n" % (names[0], names[1], names[2],
responseName[0]))
for i in range(nTest):
idx = int(predicted[i])
print("%5.0f%8.0f%7.0f%10s" % (
xyTest[i][1],
xyTest[i][2],
xyTest[i][3],
classNames[idx]))
print("Mean squared prediction error: %f" % errorSs[0])
decisionTreeFree(tree)
Output¶
Predictions for test data:
Age Number Start Kyphosis
71 3 5 Absent
128 4 5 Present
1 4 15 Absent
61 6 10 Absent
113 2 16 Absent
82 5 14 Absent
148 3 16 Absent
1 4 12 Absent
1 3 16 Absent
175 5 13 Absent
Mean squared prediction error: 0.100000
The response variable has 0 missing values.
Growing the maximal tree using method QUEST:
Node 2 is a terminal node. It has 7 cases--too few cases to split.
Node 3 is a terminal node. It has 6 cases--too few cases to split.
Node 5 is a terminal node. It has 6 cases--too few cases to split.
Node 8 is an terminal node. The split is too thin having count 2.
Node 10 is a terminal node. It has 6 cases--too few cases to split.
Node 11 is a terminal node, because it is pure.
Node 11 is a terminal node. It has 7 cases--too few cases to split.
Node 13 is a terminal node. It has 5 cases--too few cases to split.
Node 14 is a terminal node, because it is pure.
Decision Tree:
Node 0: Cost = 0.210, N= 81, Level = 0, Child nodes: 1 4
P(Y=0)= 0.790
P(Y=1)= 0.210
Predicted Kyphosis: Absent
Node 1: Cost = 0.074, N= 13, Level = 1, Child nodes: 2 3
Rule: Start <= 5.155
P(Y=0)= 0.538
P(Y=1)= 0.462
Predicted Kyphosis: Absent
Node 2: Cost = 0.025, N= 7, Level = 2
Rule: Age <= 84.030
P(Y=0)= 0.714
P(Y=1)= 0.286
Predicted Kyphosis: Absent
Node 3: Cost = 0.025, N= 6, Level = 2
Rule: Age > 84.030
P(Y=0)= 0.333
P(Y=1)= 0.667
Predicted Kyphosis: Present
Node 4: Cost = 0.136, N= 68, Level = 1, Child nodes: 5 6
Rule: Start > 5.155
P(Y=0)= 0.838
P(Y=1)= 0.162
Predicted Kyphosis: Absent
Node 5: Cost = 0.012, N= 6, Level = 2
Rule: Start <= 8.862
P(Y=0)= 0.167
P(Y=1)= 0.833
Predicted Kyphosis: Present
Node 6: Cost = 0.074, N= 62, Level = 2, Child nodes: 7 12
Rule: Start > 8.862
P(Y=0)= 0.903
P(Y=1)= 0.097
Predicted Kyphosis: Absent
Node 7: Cost = 0.062, N= 28, Level = 3, Child nodes: 8 9
Rule: Start <= 13.092
P(Y=0)= 0.821
P(Y=1)= 0.179
Predicted Kyphosis: Absent
Node 8: Cost = 0.025, N= 15, Level = 4
Rule: Age <= 91.722
P(Y=0)= 0.867
P(Y=1)= 0.133
Predicted Kyphosis: Absent
Node 9: Cost = 0.037, N= 13, Level = 4, Child nodes: 10 11
Rule: Age > 91.722
P(Y=0)= 0.769
P(Y=1)= 0.231
Predicted Kyphosis: Absent
Node 10: Cost = 0.037, N= 6, Level = 5
Rule: Number <= 3.450
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Kyphosis: Absent
Node 11: Cost = 0.000, N= 7, Level = 5
Rule: Number > 3.450
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis: Absent
Node 12: Cost = 0.012, N= 34, Level = 3, Child nodes: 13 14
Rule: Start > 13.092
P(Y=0)= 0.971
P(Y=1)= 0.029
Predicted Kyphosis: Absent
Node 13: Cost = 0.012, N= 5, Level = 4
Rule: Start <= 14.864
P(Y=0)= 0.800
P(Y=1)= 0.200
Predicted Kyphosis: Absent
Node 14: Cost = 0.000, N= 29, Level = 4
Rule: Start > 14.864
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis: Absent
Warning Errors¶
IMSLS_NO_SURROGATES | Use of surrogates is limited to method 1 (ALACART). |
IMSLS_NO_CONVERGENCE | Convergence was not achieved. |
Fatal Errors¶
IMSLS_INVALID_METHOD2 | Choose a valid tree generation method: 0 (C4.5), 1 (ALACART), 2 (CHAID) or 3 (QUEST). |
IMSLS_VALUE_GT_ZERO | The value of # must be strictly positive. |