decisionTree

Generates a decision tree for a single response variable and two or more predictor variables.

Synopsis

decisionTree (xy, responseColIdx, varType)

Required Arguments

float xy[[]] (Input)
Array of size n × nCols containing the data.
int responseColIdx (Input)
Column index of the response variable.
int varType[] (Input)
Array of length nCols indicating the type of each variable.
varType[i] Type
0 Categorical
1 Ordered Discrete (Low, Med., High)
2 Quantitative or Continuous
3 Ignore this variable

Return Value

A structure containing the generated decision tree (see Structure Definitions below). If an error occurs, None is returned.

Optional Arguments

method, int (Input)

Specifies the tree generation method. The key for the variable type index is provided above.

method Method Response varType Predictor varType
0 C4.5 0 0, 1, 2
1 ALACART (Breiman et al.) 0, 1, 2 0, 1, 2
2 CHAID 0, 1, 2 0, 1
3 QUEST 0 0, 1, 2

Default: method = 0.

criteria, int (Input)

Specifies which criterion the ALACART and C4.5 methods should use in the gain calculations to determine the best split at each node.

criteria Measure
0 Shannon Entropy
1 Gini Index

Default: criteria = 0.

ratio (Input)

If present, the ALACART and C4.5 methods each use the gain ratio instead of the gain to determine the best split.

Default: Uses gain.

weights, float[] (Input)

An array of length n containing case weights.

Default: weights[i] = 1.0.

costMatrix, float[] (Input)

An array of length nClasses × nClasses containing the cost matrix for a categorical response variable, where nClasses is the number of classes the response variable may assume. The cost matrix has elements \(C(i,j)\) = cost of misclassifying a response in class j as in class i. The diagonal elements of the cost matrix must be 0.

Default: all off-diagonal elements of costMatrix are 1.0 and all diagonal elements are 0.0.
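
Purely as an illustration (reusing the two-class data and imports of Example 1 below), an asymmetric cost matrix might be supplied as follows. The row-major ordering of the flat array, with row index i (the assigned class) and column index j (the true class), is an assumption to verify against your installation.

# Hypothetical 2-class cost matrix, assuming row-major order:
#   C(0,1) = 5.0  -> misclassifying a true class-1 response as class 0 costs 5
#   C(1,0) = 1.0  -> the reverse error costs 1
cost = [0.0, 5.0,
        1.0, 0.0]
tree = decisionTree(xy, responseColIdx, varType, costMatrix=cost)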

control, int[] (Input)

Array of length 5 containing parameters to control the maximum size of the tree and other stopping rules.

control[i] name Action
0 minNNode Do not split a node if one of its child nodes will have fewer than minNNode observations.
1 minSplit Do not split a node if the node has fewer than minSplit observations.
2 maxXCats Allow up to maxXCats categories for categorical predictor variables.
3 maxSize Stop growing the tree once it has reached maxSize number of nodes.
4 maxDepth Stop growing the tree once it has reached maxDepth number of levels.

Default: control[] = [7, 21, 10, 100, 10]

complexity, float (Input)

The minimum complexity parameter to use in cross-validation. Complexity must be ≥ 0.

Default: complexity = 0.0.

nSurrogates, int (Input)

Indicates the number of surrogate splits. Only used if method = 1.

Default: nSurrogates = 0.

alphas, float [] (Input)

An array of length 3 containing the significance levels. alphas[0] = significance level for split variable selection (CHAID and QUEST); alphas[1] = significance level for merging categories of a variable (CHAID); and alphas[2] = significance level for splitting previously merged categories (CHAID). Valid values are in the range 0 < alphas[i] < 1.0, and alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.

Default: alphas[] = [0.05, 0.05, -1.0]

priors, float[] (Input)

An array of length nClasses, where nClasses is the number of classes the response variable may assume, containing prior probabilities for class membership. The argument is ignored for continuous response variables (varType[responseColIdx] = 2). By default, the prior probabilities are estimated from the data.

nFolds, int (Input)

The number of folds to use in cross-validation tree selection. nFolds must be between 1 and n, inclusive. If nFolds = 1, the full data set is used once to generate the decision tree; in other words, no cross-validation is performed. If 1 < n/nFolds ≤ 3, then leave-one-out cross-validation is performed.

Default: nFolds = 10.

nSamples, int (Input)

The number of bootstrap samples to use in bootstrap aggregation (bagging) when predicted values are requested. To obtain predictions produced by bagging, set nSamples > 0 and use the optional argument predicted.

Default: nSamples = 0.

tolerance, float (Input)

Error tolerance to use in the algorithm.

Default: tolerance = 100.0e0 * machine(4).

randomSeed, int (Input)

The seed of the random number generator used in sampling or cross-validation. Changing randomSeed between calls to decisionTree with the same data set may produce slightly different results. Setting randomSeed to zero forces the random number seed to be determined by the system clock.

Default: randomSeed = 0

printLevel, int (Input)
printLevel Action
0 No printing
1 Prints final results only.
2 Prints intermediate and final results.

Default: printLevel = 0.

testData, float[] (Input)

An array of size nTest × nCols containing hold-out or test data for which predictions are requested. When this optional argument is present, the number of observations in testData, nTest, must be greater than 0. The response variable may have missing values in testData, but it must be in the same column as in xy, and the predictors must be in the same columns as they are in xy. If test data is not provided but predictions are requested, then xy is used and the predictions are the fitted values.

Default: testData = xy

testDataWeights, float[] (Input)

An array of size nTest containing frequencies or weights for each observation in testData. This argument is ignored if testData is not present.

Default: testDataWeights = weights.

errorSs (Output)

The fitted data error mean sum of squares in the absence of test data (testData). When test data is provided, the prediction error mean sum of squares is returned.

predicted (Output)

An array of length n containing the fitted or predicted value of the response variable for each case in the input data or, if provided, the test data.

Description

This implementation includes four of the most widely used algorithms for decision trees. Below is a brief summary of each approach.

C4.5

The method C4.5 (Quinlan, 1995) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps outlined above, using as splitting criterion the information gain or gain ratio. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as

\[E(S) = - \sum_{i=1}^{C} p_i \log\left(p_i\right)\]

where \(p_i=Pr[Y=i|S]\) is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon Entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by

\[E(S,X) = \sum_{k=1}^{K} p\left(S_k\right) E\left(S_k\right)\]

If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:

\[g (S,X) = E(S) - E(S,X)\]

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:

\[gR(S,X) = \frac{E(S) - E(S,X)}{E_X(S)}\]

where

\[E_X(S) = -\sum_{k=1}^{K} \nu_k \log\left(\nu_k\right)\]

with

\[\nu_k = Pr[X = k|S]\]

Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\leq d\) and \(X>d\) are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops after one or more stopping conditions are met.
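
To make the gain and gain-ratio formulas concrete, the short NumPy sketch below computes E(S), E(S,X), g(S,X), and gR(S,X) for the Outlook/Play columns of Example 1. It illustrates the formulas above only, not the library's internal C4.5 implementation; base-2 logarithms are an arbitrary choice (the base only rescales the values).

import numpy as np

def entropy(labels):
    # E(S) = -sum_i p_i log(p_i), using base-2 logs
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_and_ratio(x, y):
    # E(S,X): entropy of y weighted over the partition of S induced by x
    n = len(y)
    e_sx = sum((np.sum(x == k) / n) * entropy(y[x == k]) for k in np.unique(x))
    gain = entropy(y) - e_sx            # g(S,X) = E(S) - E(S,X)
    split_info = entropy(x)             # E_X(S), the entropy of X over S
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Outlook and Play columns from Example 1 below
outlook = np.array([0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2])
play = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
print(gain_and_ratio(outlook, play))    # gain is about 0.247 bits, ratio about 0.156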

ALACART

ALACART implements the method of Breiman, Friedman, Olshen and Stone (1984), the original authors and developers of CART™. CART™ is the trademarked name for Classification and Regression Trees. In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, only splits into two subsets are considered: e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B}, and {C, D} is not.

For classification problems, ALACART uses a criterion similar to information gain, called impurity. The method searches for the split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities

\[I(S) = \varphi\left(p(1|S), p(2|S), \ldots, p(C|S)\right)\]

The measure function \(\varphi(\cdot)\) should be 0 for “pure” nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.

As only binary splits of a subset S are considered (\(S_1\), \(S_2\) such that \(S=S_1\cup S_2\) and \(S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into \(S_1\), \(S_2\) is

\[\Delta I = I(S) - q_1 I(S_1) - q_2 I(S_2)\]

where \(q_j=Pr[S_j]\), j = 1, 2, is the node probability.

The gain criterion and the reduction in impurity \(\Delta I\) are similar concepts and are equivalent when I is entropy and only binary splits are considered. Another popular measure of the impurity at a node is the Gini index, given by

\[\begin{split}I(S) = \sum_{\substack{i,j=1 \\ i \neq j}}^{C} p(i|S) p(j|S) = 1 - \sum_{i=1}^{C} p^2(i|S)\end{split}\]

If Y is an ordered or continuous response, the problem is a regression problem. ALACART generates the tree using the same steps, except that node-level measures or loss functions are the mean squared error (MSE) or mean absolute deviation (MAD) rather than node impurity measures.
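
The sketch below illustrates the Gini index and the impurity reduction ΔI for one candidate binary split on the Example 1 data. It is a worked illustration of the formulas above, not the ALACART code path.

import numpy as np

def gini(labels):
    # I(S) = 1 - sum_i p(i|S)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(y, mask):
    # Delta I = I(S) - q1*I(S1) - q2*I(S2); q_j are the node probabilities
    q1 = mask.mean()
    return gini(y) - q1 * gini(y[mask]) - (1.0 - q1) * gini(y[~mask])

# Candidate split Temperature <= 77.5 on the Example 1 data
temperature = np.array([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])
play = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
print(impurity_reduction(play, temperature <= 77.5))   # about 0.016 for this split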

CHAID

The third method is appropriate only for categorical or discrete ordered predictor variables. CHAID, an acronym for chi-square automatic interaction detection, is due to Kass (1980). At each node, as above, CHAID looks for the best splitting variable. The approach is as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X and the categories of Y. The least significant result is noted and, if its p-value exceeds the merging threshold, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests and determine whether further merging is possible. If a merged category consists of three or more of the original categories of X, CHAID calls for a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result has a p-value below the splitting threshold, the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth. Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its final set of merged categories defines the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than the threshold), the node is not split. This growing procedure continues until one or more stopping conditions are met.
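
The following sketch isolates the CHAID merge step only: it runs pairwise chi-square tests of association between the current category groups of a predictor X and the response Y, merging the least significant pair whenever its p-value exceeds the merge threshold. It is a simplified illustration, not the library's CHAID implementation, which also adjusts p-values and performs the re-split step.

import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

def merge_step(x, y, alpha_merge=0.05):
    # Pairwise merge step only: x holds predictor categories, y response categories.
    groups = {c: [c] for c in set(x)}    # every category starts as its own group
    label = {c: c for c in set(x)}       # category -> current group representative
    while len(groups) > 1:
        least = None                     # (p-value, g1, g2) of least significant pair
        for g1, g2 in combinations(groups, 2):
            # 2 x nClasses contingency table for this pair of groups vs. Y
            table = np.array([[sum(1 for xi, yi in zip(x, y)
                                   if label[xi] == g and yi == c)
                               for c in sorted(set(y))]
                              for g in (g1, g2)])
            if table.sum(axis=0).min() == 0 or table.sum(axis=1).min() == 0:
                continue                 # degenerate table, skip the test
            _, p, _, _ = chi2_contingency(table)
            if least is None or p > least[0]:
                least = (p, g1, g2)
        if least is None or least[0] <= alpha_merge:
            break                        # the most similar pair is still significantly different
        _, g1, g2 = least
        groups[g1] += groups.pop(g2)     # merge g2 into g1
        for c in groups[g1]:
            label[c] = g1
    return list(groups.values())

# Outlook (0=Sunny, 1=Overcast, 2=Rainy) vs. Play from Example 1
outlook = [0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2]
play = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
print(merge_step(outlook, play))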

QUEST

The fourth method, the QUEST algorithm (Loh and Shih, 1997), is appropriate for a categorical response variable and predictors of either categorical or quantitative type. For each categorical predictor, QUEST performs a multi-way chi-square test of association between the predictor and Y. For each continuous predictor, QUEST performs an ANOVA test to see if the means of the predictor vary among the groups of Y. Among these tests, the variable with the most significant result is selected as a potential splitting variable, say \(X_j\). If its p-value (adjusted for multiple tests) is less than the specified splitting threshold, then \(X_j\) is the splitting variable for the current node. If not, QUEST performs, for each continuous variable X, Levene’s test of homogeneity to see if the variance of X varies within the different groups of Y. Among these tests, we again find the predictor with the most significant result, say \(X_i\). If its p-value (adjusted for multiple tests) is less than the splitting threshold, \(X_i\) is the splitting variable. Otherwise, the node is not split.

Assuming a splitting variable is found, the next step is to determine how the variable should be split. If the selected variable \(X_j\) is continuous, a split point d is determined by quadratic discriminant analysis (QDA) of \(X_j\) into two populations determined by a binary partition of the response Y. The goal of this step is to group the classes of Y into two subsets or super classes, A and B. If there are only two classes in the response Y, the super classes are obvious. Otherwise, calculate the means and variances of \(X_j\) in each of the classes of Y. If the means are all equal, put the largest-sized class into group A and combine the rest to form group B. If they are not all equal, use a k-means clustering method (\(k=2\)) on the class means to determine A and B.

\(X_j\) in A and in B is assumed to be normally distributed with estimated means \(\overline{x}_{j|A}\), \(\overline{x}_{j|B}\), and variances \(S^2 _{j|A}\), \(S^2 _{j|B}\), respectively. The quadratic discriminant is the partition \(X_j\leq d\) and \(X_j>d\) such that \(Pr(X_j,A)=Pr(X_j,B)\). The discriminant rule assigns an observation to A if \(x_{ij}\leq d\) and to B if \(x_{ij}>d\). For d to maximally discriminate, the probabilities must be equal.
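
A minimal sketch of the split-point idea under the normality assumption described above: d is taken where the two class densities, weighted by their class probabilities, are equal, which reduces to solving a quadratic in d. The root-selection heuristic below (keep the root nearest the midpoint of the class means) is an assumption for this sketch; QUEST's exact handling of priors and degenerate cases may differ.

import numpy as np

def qda_split_point(x_a, x_b, p_a=0.5, p_b=0.5):
    # Solve p_a * N(d; m_a, v_a) = p_b * N(d; m_b, v_b) for d, a quadratic in d
    m_a, m_b = np.mean(x_a), np.mean(x_b)
    v_a, v_b = np.var(x_a, ddof=1), np.var(x_b, ddof=1)
    a = 1.0 / v_a - 1.0 / v_b
    b = 2.0 * (m_b / v_b - m_a / v_a)
    c = (m_a ** 2 / v_a - m_b ** 2 / v_b
         + np.log(v_a / v_b) - 2.0 * np.log(p_a / p_b))
    if abs(a) < 1e-12:                   # (near-)equal variances: the boundary is linear
        return -c / b
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    # heuristic for this sketch only: keep the root nearest the midpoint of the class means
    return roots[np.argmin(np.abs(roots - 0.5 * (m_a + m_b)))]

# Two well-separated synthetic classes; the split point falls between the class means
rng = np.random.default_rng(0)
x_a = rng.normal(40.0, 5.0, size=30)
x_b = rng.normal(55.0, 8.0, size=30)
print(qda_split_point(x_a, x_b))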

If the selected variable \(X_j\) is categorical, it is first transformed using the method outlined in Loh and Shih (1997) and then QDA is performed as above. The transformation is related to the discriminant coordinate (CRIMCOORD) approach due to Gnanadesikan (1977).

Minimal-Cost Complexity Pruning

One way to address overfitting is to grow the tree as large as possible and then prune it back. Let T represent a decision tree generated by any of the methods above. The idea (from Breiman et al.) is to find the smallest sub-tree of T that minimizes the cost-complexity measure:

\[R_{\delta}(T) = R(T) + \delta\left|\tilde{T}\right|,\]

where \(\tilde{T}\) denotes the set of terminal nodes, \(|\tilde{T}|\) represents the number of terminal nodes, and δ ≥ 0 is a cost-complexity parameter. For a categorical target variable,

\[R(T) = \sum_{t \in \tilde{T}} R(t) = \sum_{t \in \tilde{T}} r(t) p(t)\]
\[r(t) = \min_i \sum_j C(i|j) p(j|t)\]
\[p(t) = Pr[x \in t],\]
\[p(j|t) = Pr[y = j \mid x \in t],\]

and \(C(i|j)\) is the cost for misclassifying the actual class j as i. Note that \(C(j|j)=0\) and \(C(i|j)>0\), for \(i\neq j\).

When the target is continuous (and the problem is a regression problem), the metric is instead the mean squared error

\[R(T) = \sum_{t \in \tilde{T}} R(t) = \tfrac{1}{N} \sum_{t \in \tilde{T}} \sum_{y_n \in t} \left(y_n - \hat{y}(t)\right)^2\]

This software implements the optimal pruning algorithm (Algorithm 10.1, page 294) of Breiman et al. (1984). The result of the algorithm is a sequence of sub-trees \(T_{max}\succ T_1\succ T_2\succ\cdots T_{M-1}\succ\{t_0\}\) obtained by pruning the fully generated tree, \(T_{max}\), until the sub-tree consists of the single root node, \(\{t_0\}\). Corresponding to the sequence of sub-trees is the sequence of complexity values, \(0\leq \delta_{min}=\delta_1<\delta_2<\cdots<\delta_{M-1}<\delta_M\), where M is the number of steps it takes the algorithm to reach the root node. The sub-trees represent the optimally pruned sub-trees for the sequence of complexity values. The minimum complexity \(\delta_{min}\) can be set via the optional argument complexity.
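
To illustrate the cost-complexity measure itself, the sketch below scores a handful of candidate sub-trees, each summarized only by its resubstitution risk R(T) and its number of terminal nodes, and picks the minimizer of R_δ(T) for several δ values. The numbers are made up; the library generates the sub-tree sequence with the optimal algorithm cited above rather than by this brute-force scoring.

# Each candidate subtree is summarized as (name, resubstitution risk R(T), |T~|).
# The numbers are illustrative only.
subtrees = [
    ("T_max", 0.10, 9),
    ("T_1", 0.12, 6),
    ("T_2", 0.16, 3),
    ("root", 0.36, 1),
]

def best_subtree(subtrees, delta):
    # R_delta(T) = R(T) + delta * |T~|; pick the subtree minimizing it
    return min(((name, r + delta * n_term) for name, r, n_term in subtrees),
               key=lambda item: item[1])

for delta in (0.0, 0.02, 0.05, 0.10):
    print(delta, best_subtree(subtrees, delta))

Larger values of δ penalize tree size more heavily, so the minimizer moves toward smaller sub-trees.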

V-Fold Cross-Validation

In V-fold cross validation, the training data is partitioned randomly into V approximately equally sized sub-samples. The model is then trained V different times with each of the sub-samples removed in turn. The cross-validated estimate of the risk of a decision function \(R(d_k)\) is

\[R^{CV} \left(d_k\right) = \tfrac{1}{N} \sum_{v=1}^{V} \sum_{\left(x_n,y_n\right) \in \eta_v} L\left(y_n,d_k^v\left(x_n\right)\right)\]

where \(L(y,d(x))\) is the loss incurred when the decision is \(d(x)\) for the actual, y. The symbol η denotes the full training data set, and \(d_k^v\) denotes the set of decisions corresponding to \(T_k^v\), the \(k^{th}\) optimally pruned tree using the training sample \(\eta-\eta_v\) and \(\delta'_k=\sqrt{\delta_k \delta_{k+1}}\), where \(\delta_k\), \(\delta_{k+1}\) come from the pruning on the full data set, η.

For example, if the problem is classification and the priors are estimated from the data as \(N_j/N\), then for any tree T,

\[\begin{split}\begin{array}{l} R^{CV}(T) = \displaystyle\sum_jR^{CV}(j)\hat{\pi}_j = \sum_j\left(\sum_iC(i|j)Q^{CV}(i|j)\right)\frac{N_j}{N} = \sum_j\left(\sum_iC(i|j)\frac{N^{ij}}{N_j}\right)\frac{N_j}{N} \\ = \frac{1}{N}\displaystyle\sum_j\left(\sum_iC(i|j)N^{ij}\right) \end{array}\end{split}\]

where

\[Q^*(i|j) = P(d(X) = i|Y=j)\]
\[R^*(j) = \sum_i C(i|j) Q^*(i|j)\]
\[R^*(d) = \sum_j R^*(j) \pi_j\]

and

\[N^{ij} = \sum_{v=1}^{V} N_{v}^{ij}\]

is the overall number of j cases misclassified as i. The standard error of \(R^{CV}(d_k)\) is approximated with

\[SE\left(R^{CV}\left(d_k\right)\right) = \sqrt{s^2 / N}\]

where

\[s^2 = \tfrac{1}{N} \sum_{\eta} \left[L\left(Y_n,d_k^{\left(v_n\right)}\left(x_n\right)\right) - R^{CV}\left(d_k\right)\right]^2\]

is the “sample variance” of the cross-validated estimates.

Final selection rules include the following (a sketch of rules 1 and 2 appears after the list):

  1. Select \(T_{k_1}\) such that \(R^{CV} \left( T_{k_1} \right)=\min_k R^{CV} \left( T_k \right)\)
  2. Select \(T_{k_2}\) such that \(k_2\) is the largest k satisfying \(R^{CV} \left( T_{k_2} \right)\leq R^{CV} \left( T_{k_1} \right)+SE \left( R^{CV} \left( T_{k_1} \right) \right)\).
  3. For a specified complexity parameter \(\delta_{\mathit{fin}}\), select \(T_{k_3}\) such that \(R^{CV} \left( T_{k_3} \right)\leq\min_k R^{CV} \left( T_k \right)+\delta_{\mathit{fin}} | \tilde{T}_k |\).
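
Rules 1 and 2 (the minimum-risk rule and the one-standard-error rule) can be illustrated with the cross-validation summary printed in Example 2 below. The helper is a sketch of the rules only, not the selection code used internally.

import numpy as np

def select_tree(cv_risk, cv_se):
    # cv_risk[k] = R^CV(T_k), cv_se[k] = SE(R^CV(T_k)) over the subtree sequence
    cv_risk, cv_se = np.asarray(cv_risk), np.asarray(cv_se)
    k1 = int(np.argmin(cv_risk))                       # rule 1: minimum CV risk
    within_one_se = cv_risk <= cv_risk[k1] + cv_se[k1]
    k2 = int(np.where(within_one_se)[0].max())         # rule 2: largest k within one SE
    return k1, k2

# Cross-validation summary printed in Example 2 (tree, CV-err, CV-Std.Error)
cv_err = [0.68295, 0.70530, 0.72814]
cv_se = [0.07567, 0.08069, 0.08598]
print(select_tree(cv_err, cv_se))    # (0, 2)

For these values, rule 2 picks tree 2, the same tree reported as selected in the Example 2 output.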

Bagging

Bagging is a resampling approach for generating predictions; the name stands for bootstrap aggregating. In the procedure, m bootstrap samples of size n are drawn from the training set of size n. Bootstrap sampling is sampling with replacement, so a sample typically contains repeated observations while other observations are left out. A decision tree is grown treating each sample as a separate training set. The tree is then used to generate predictions for the test data. For each test case, the m predictions are combined by averaging the output (regression) or voting (classification) to obtain a final prediction.
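
Within this function, bagged predictions are requested through the optional arguments documented above (nSamples, testData, predicted). A minimal sketch, reusing the imports, data, and variable types of Example 1 below; xyTest stands for a hypothetical hold-out array with the same columns as xy, and 25 bootstrap samples is an arbitrary choice:

predicted = []
tree = decisionTree(xy, responseColIdx, varType,
                    nSamples=25,            # > 0 requests bootstrap aggregation
                    testData=xyTest,        # hypothetical hold-out data, same columns as xy
                    predicted=predicted)    # receives the bagged predictions
decisionTreeFree(tree)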

Bagging leads to “improvements for unstable procedures,” such as neural nets, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).

Missing Values

Any observation or case with a missing response variable is eliminated from the analysis. If a predictor has a missing value, each algorithm will skip that case when evaluating the given predictor. When making a prediction for a new case, if the split variable is missing, the prediction function applies surrogate split-variables and splitting rules in turn, if they are estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables when requested.
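
Since only ALACART estimates surrogate splits, requesting them looks like the following sketch, which reuses the imports, data, and printing arguments of Example 1 below; two surrogates per node is an arbitrary choice.

# Surrogates are only estimated for method=1 (ALACART); two per node is arbitrary
tree = decisionTree(xy, responseColIdx, varType,
                    method=1, nSurrogates=2,
                    nFolds=1, control=control)
decisionTreePrint(tree, varNames=names, classNames=classNames,
                  categNames=varLevels)
decisionTreeFree(tree)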

Structure Definitions

Table 13.2 — Structure Imsls_d_decisionTree
Name Type Description
nClasses int Number of classes assumed by the response variable, if the response variable is categorical
nLevels int Number of levels or depth of tree
n_nodes int Number of nodes or size of tree
nodes Imsls_d_tree_node An array of tree_node structures of size n_nodes
n_preds int Number of predictors used in the model
nSurrogates int Number of surrogate splits searched for at each node. Available for method=1
pred_type int An array of length n_preds containing the type of each predictor variable
pred_n_values int An array of length n_preds containing the number of values of each predictor variable
response_type int Type of the response variable
terminal_nodes int An array of length n_nodes indicating which nodes are terminal nodes
Table 13.3 — Structure Imsls_d_tree_node
Name Type Description
children_ids int An array of length n_children containing the IDs of the children nodes
cost float Misclassification cost (in-sample cost measure at the current node)
n_cases int Number of cases of the training data that fall into the current node
n_children int Number of children of the current node
node_id int Node ID, where the root node corresponds to node_id= 0
node_prob float Estimate of the probability of a new case belonging to the node
node_split_value float Value around which the node will be split, if the node variable is of continuous type
node_values_ind int Values of the split variable for the current node, if node_var_id has type 0 or 1
node_var_id int ID of the variable that defined the split in the parent node
parent_id int ID of the parent of the node with ID node_id
predicted_class int Predicted class at the current node, for response variables of categorical type
predicted_val float Predicted value of the response variable if the response variable is of continuous type
surrogate_info float Array containing the surrogate split information
y_probs float The array of class probabilities at the current node, if the response variable is of categorical type
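
The exact Python representation of the returned structure is not reproduced here. Purely as an illustration of how the node fields relate to one another (not the library's return type), the following walks a toy tree stored as plain dictionaries keyed by the field names in Table 13.3.

# Toy tree using field names from Table 13.3; the values are invented and this is
# NOT the library's return type, only an illustration of how the fields fit together.
nodes = [
    {"node_id": 0, "parent_id": -1, "n_children": 2, "children_ids": [1, 2],
     "predicted_class": 1},
    {"node_id": 1, "parent_id": 0, "n_children": 0, "children_ids": [],
     "predicted_class": 0},
    {"node_id": 2, "parent_id": 0, "n_children": 0, "children_ids": [],
     "predicted_class": 1},
]

def walk(nodes, node_id=0, depth=0):
    # Depth-first traversal; nodes with no children are terminal nodes
    node = nodes[node_id]
    kind = "terminal" if node["n_children"] == 0 else "internal"
    print("  " * depth + "node %d (%s): predicted class %d"
          % (node_id, kind, node["predicted_class"]))
    for child_id in node["children_ids"]:
        walk(nodes, child_id, depth + 1)

walk(nodes)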

Examples

Example 1

In this example, we use a small data set with the response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (True (0), False (1)). Decision trees are generated by the C4.5 and ALACART methods. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. Notice that C4.5 splits on Outlook, then Humidity and Wind, while ALACART splits on Outlook, then Temperature.

from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree

xy = [[0, 85, 85, 0, 0],
      [0, 80, 90, 1, 0],
      [1, 83, 78, 0, 1],
      [2, 70, 96, 0, 1],
      [2, 68, 80, 0, 1],
      [2, 65, 70, 1, 0],
      [1, 64, 65, 1, 1],
      [0, 72, 95, 0, 0],
      [0, 69, 70, 0, 1],
      [2, 75, 80, 0, 1],
      [0, 75, 70, 1, 1],
      [1, 72, 90, 1, 1],
      [1, 81, 75, 0, 1],
      [2, 71, 80, 1, 0]]

responseColIdx = 4
method = 1
varType = [0, 2, 2, 0, 0]
control = [2, 3, 10, 50, 10]
names = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
classNames = ["Don't Play", "Play"]
varLevels = ["Sunny", "Overcast", "Rainy", "False", "True"]

tree = decisionTree(xy, responseColIdx, varType,
                    nFolds=1, control=control)

print("Decision Tree using Method C4.5:")
decisionTreePrint(tree, varNames=names, classNames=classNames,
                  categNames=varLevels)

decisionTreeFree(tree)

tree = decisionTree(xy, responseColIdx, varType,
                    method=method, nFolds=1, control=control)

print("\n\nDecision Tree using Method ALACART:")
decisionTreePrint(tree, varNames=names, classNames=classNames,
                  categNames=varLevels)

decisionTreeFree(tree)

Output

Decision Tree using Method C4.5:


Decision Tree using Method ALACART:

Decision Tree:

Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes:  1  4  5 
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y:  Play 
   Node 1: Cost = 0.143, N= 5, Level = 1, Child nodes:  2  3 
   Rule:  Outlook  in: { Sunny }
    P(Y=0)= 0.600
    P(Y=1)= 0.400
    Predicted Y:  Don't Play 
      Node 2: Cost = 0.000, N= 2, Level = 2
      Rule:  Humidity     <= 77.500
        P(Y=0)= 0.000
        P(Y=1)= 1.000
        Predicted Y:  Play 
      Node 3: Cost = 0.000, N= 3, Level = 2
      Rule:  Humidity     > 77.500
        P(Y=0)= 1.000
        P(Y=1)= 0.000
        Predicted Y:  Don't Play 
   Node 4: Cost = 0.000, N= 4, Level = 1
   Rule:  Outlook  in: { Overcast }
    P(Y=0)= 0.000
    P(Y=1)= 1.000
    Predicted Y:  Play 
   Node 5: Cost = 0.143, N= 5, Level = 1, Child nodes:  6  7 
   Rule:  Outlook  in: { Rainy }
    P(Y=0)= 0.400
    P(Y=1)= 0.600
    Predicted Y:  Play 
      Node 6: Cost = 0.000, N= 3, Level = 2
      Rule:  Wind  in: { False }
        P(Y=0)= 0.000
        P(Y=1)= 1.000
        Predicted Y:  Play 
      Node 7: Cost = 0.000, N= 2, Level = 2
      Rule:  Wind  in: { True }
        P(Y=0)= 1.000
        P(Y=1)= 0.000
        Predicted Y:  Don't Play 

Decision Tree:

Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes:  1  8 
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y:  Play 
   Node 1: Cost = 0.357, N= 10, Level = 1, Child nodes:  2  7 
   Rule:  Outlook  in: { Sunny  Rainy }
    P(Y=0)= 0.500
    P(Y=1)= 0.500
    Predicted Y:  Don't Play 
      Node 2: Cost = 0.214, N= 8, Level = 2, Child nodes:  3  6 
      Rule:  Temperature     <= 77.500
        P(Y=0)= 0.375
        P(Y=1)= 0.625
        Predicted Y:  Play 
         Node 3: Cost = 0.214, N= 6, Level = 3, Child nodes:  4  5 
         Rule:  Temperature       <= 73.500
            P(Y=0)= 0.500
            P(Y=1)= 0.500
            Predicted Y:  Play 
            Node 4: Cost = 0.071, N= 4, Level = 4
            Rule:  Temperature         <= 70.500
                P(Y=0)= 0.250
                P(Y=1)= 0.750
                Predicted Y:  Play 
            Node 5: Cost = 0.000, N= 2, Level = 4
            Rule:  Temperature         > 70.500
                P(Y=0)= 1.000
                P(Y=1)= 0.000
                Predicted Y:  Don't Play 
         Node 6: Cost = 0.000, N= 2, Level = 3
         Rule:  Temperature       > 73.500
            P(Y=0)= 0.000
            P(Y=1)= 1.000
            Predicted Y:  Play 
      Node 7: Cost = 0.000, N= 2, Level = 2
      Rule:  Temperature     > 77.500
        P(Y=0)= 1.000
        P(Y=1)= 0.000
        Predicted Y:  Don't Play 
   Node 8: Cost = 0.000, N= 4, Level = 1
   Rule:  Outlook  in: { Overcast }
    P(Y=0)= 0.000
    P(Y=1)= 1.000
    Predicted Y:  Play

Example 2

This example applies the QUEST method to a simulated data set with 50 cases and three predictors of mixed type. A maximally grown tree under the default controls and the optimally pruned sub-tree obtained from cross-validation and minimal cost-complexity pruning are produced. Notice that the optimally pruned tree consists of just the root node, whereas the maximal tree has five nodes and three levels.

from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree

xy = [[2, 25.928690, 0, 0],
      [1, 51.632450, 1, 1],
      [1, 25.784321, 0, 2],
      [0, 39.379478, 0, 3],
      [2, 24.650579, 0, 2],
      [2, 45.200840, 0, 2],
      [2, 52.679600, 1, 3],
      [1, 44.283421, 1, 3],
      [2, 40.635231, 1, 3],
      [2, 51.760941, 0, 3],
      [2, 26.303680, 0, 1],
      [2, 20.702299, 1, 0],
      [2, 38.742729, 1, 3],
      [2, 19.473330, 0, 0],
      [1, 26.422110, 0, 0],
      [2, 37.059860, 1, 0],
      [1, 51.670429, 1, 3],
      [0, 42.401562, 0, 3],
      [2, 33.900269, 1, 2],
      [1, 35.432819, 0, 0],
      [1, 44.303692, 0, 1],
      [0, 46.723869, 0, 2],
      [1, 46.992619, 0, 2],
      [0, 36.059231, 0, 3],
      [2, 36.831970, 1, 1],
      [1, 61.662571, 1, 2],
      [0, 25.677139, 0, 3],
      [1, 39.085670, 1, 0],
      [0, 48.843410, 1, 1],
      [1, 39.343910, 0, 3],
      [2, 24.735220, 0, 2],
      [1, 50.552509, 1, 3],
      [0, 31.342630, 1, 3],
      [1, 27.157949, 1, 0],
      [0, 31.726851, 0, 2],
      [0, 25.004080, 0, 3],
      [1, 26.354570, 1, 3],
      [2, 38.123428, 0, 1],
      [0, 49.940300, 0, 2],
      [1, 42.457790, 1, 3],
      [0, 38.809479, 1, 1],
      [0, 43.227989, 1, 1],
      [0, 41.876240, 0, 3],
      [2, 48.078201, 0, 2],
      [0, 43.236729, 1, 0],
      [2, 39.412941, 0, 3],
      [1, 23.933460, 0, 2],
      [2, 42.841301, 1, 3],
      [2, 30.406691, 0, 1],
      [0, 37.773891, 0, 2]]
responseColIdx = 3
method = 3
varType = [0, 2, 0, 0]

tree = decisionTree(xy, responseColIdx, varType,
                    method=method, randomSeed=123457,
                    printLevel=1)

print("\nMaximal tree:")
decisionTreePrint(tree, printMax=True)

print("\n\nOptimally pruned subtree:")
decisionTreePrint(tree)

decisionTreeFree(tree)

Output

Maximal tree:


Optimally pruned subtree:
Growing the maximal tree using method QUEST:

Cross-Validation:
Tree  complexity   CV-err  CV-Std.Error
   0     0.00000  0.68295       0.07567
   1     0.02000  0.70530       0.08069
   2     0.04000  0.72814       0.08598

Select tree number 2, cost complexity parameter = 0.04000:


Decision Tree:

Node 0: Cost = 0.620, N= 50, Level = 0, Child nodes:  1  2 
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y:   3 
   Node 1: Cost = 0.220, N= 17, Level = 1
   Rule: X1   <= 35.031
    P(Y=0)= 0.294
    P(Y=1)= 0.118
    P(Y=2)= 0.353
    P(Y=3)= 0.235
    Predicted Y:   2 
   Node 2: Cost = 0.360, N= 33, Level = 1, Child nodes:  3  4 
   Rule: X1   > 35.031
    P(Y=0)= 0.121
    P(Y=1)= 0.212
    P(Y=2)= 0.212
    P(Y=3)= 0.455
    Predicted Y:   3 
      Node 3: Cost = 0.180, N= 19, Level = 2
      Rule: X1     <= 43.265
        P(Y=0)= 0.211
        P(Y=1)= 0.211
        P(Y=2)= 0.053
        P(Y=3)= 0.526
        Predicted Y:   3 
      Node 4: Cost = 0.160, N= 14, Level = 2
      Rule: X1     > 43.265
        P(Y=0)= 0.000
        P(Y=1)= 0.214
        P(Y=2)= 0.429
        P(Y=3)= 0.357
        Predicted Y:   2 

Decision Tree:

Node 0: Cost = 0.620, N= 50, Level = 0
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y:   3 
Pruned at Node id 0.

Example 3

This example uses the dataset Kyphosis. The 81 cases represent 81 children who have undergone surgery to correct a type of spinal deformity known as Kyphosis. The response variable is the presence or absence of Kyphosis after the surgery. The three predictors are Age of the patient in months, Start, the number of the vertebra where the surgery started, and Number, the number of vertebrae involved in the surgery. This example uses the method QUEST to produce a maximal tree. It also requests predictions for a test data set consisting of 10 “new” cases.

from __future__ import print_function
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePrint import decisionTreePrint
from pyimsl.stat.decisionTreeFree import decisionTreeFree

xy = [[0, 71, 3, 5],
      [0, 158, 3, 14],
      [1, 128, 4, 5],
      [0, 2, 5, 1],
      [0, 1, 4, 15],
      [0, 1, 2, 16],
      [0, 61, 2, 17],
      [0, 37, 3, 16],
      [0, 113, 2, 16],
      [1, 59, 6, 12],
      [1, 82, 5, 14],
      [0, 148, 3, 16],
      [0, 18, 5, 2],
      [0, 1, 4, 12],
      [0, 168, 3, 18],
      [0, 1, 3, 16],
      [0, 78, 6, 15],
      [0, 175, 5, 13],
      [0, 80, 5, 16],
      [0, 27, 4, 9],
      [0, 22, 2, 16],
      [1, 105, 6, 5],
      [1, 96, 3, 12],
      [0, 131, 2, 3],
      [1, 15, 7, 2],
      [0, 9, 5, 13],
      [0, 8, 3, 6],
      [0, 100, 3, 14],
      [0, 4, 3, 16],
      [0, 151, 2, 16],
      [0, 31, 3, 16],
      [0, 125, 2, 11],
      [0, 130, 5, 13],
      [0, 112, 3, 16],
      [0, 140, 5, 11],
      [0, 93, 3, 16],
      [0, 1, 3, 9],
      [1, 52, 5, 6],
      [0, 20, 6, 9],
      [1, 91, 5, 12],
      [1, 73, 5, 1],
      [0, 35, 3, 13],
      [0, 143, 9, 3],
      [0, 61, 4, 1],
      [0, 97, 3, 16],
      [1, 139, 3, 10],
      [0, 136, 4, 15],
      [0, 131, 5, 13],
      [1, 121, 3, 3],
      [0, 177, 2, 14],
      [0, 68, 5, 10],
      [0, 9, 2, 17],
      [1, 139, 10, 6],
      [0, 2, 2, 17],
      [0, 140, 4, 15],
      [0, 72, 5, 15],
      [0, 2, 3, 13],
      [1, 120, 5, 8],
      [0, 51, 7, 9],
      [0, 102, 3, 13],
      [1, 130, 4, 1],
      [1, 114, 7, 8],
      [0, 81, 4, 1],
      [0, 118, 3, 16],
      [0, 118, 4, 16],
      [0, 17, 4, 10],
      [0, 195, 2, 17],
      [0, 159, 4, 13],
      [0, 18, 4, 11],
      [0, 15, 5, 16],
      [0, 158, 5, 14],
      [0, 127, 4, 12],
      [0, 87, 4, 16],
      [0, 206, 4, 10],
      [0, 11, 3, 15],
      [0, 178, 4, 15],
      [1, 157, 3, 13],
      [0, 26, 7, 13],
      [0, 120, 2, 13],
      [1, 42, 7, 6],
      [0, 36, 4, 13]]

xyTest = [[0, 71, 3, 5],
          [1, 128, 4, 5],
          [0, 1, 4, 15],
          [0, 61, 6, 10],
          [0, 113, 2, 16],
          [1, 82, 5, 14],
          [0, 148, 3, 16],
          [0, 1, 4, 12],
          [0, 1, 3, 16],
          [0, 175, 5, 13]]

responseColIdx = 0
method = 3
control = [5, 10, 10, 50, 10]
varType = [0, 2, 2, 2]
nTest = 10
names = ["Age", "Number", "Start"]
classNames = ["Absent", "Present"]
responseName = ["Kyphosis"]

predicted = []
errorSs = []
tree = decisionTree(xy, responseColIdx, varType,
                    method=method, nFolds=1,
                    control=control, testData=xyTest,
                    printLevel=2, predicted=predicted,
                    errorSs=errorSs)


decisionTreePrint(tree, varNames=names, classNames=classNames,
                  respName=responseName)

print("\nPredictions for test data:")
print("%5s%8s%7s%10s\n" % (names[0], names[1], names[2],
                           responseName[0]))
for i in range(nTest):
    idx = int(predicted[i])
    print("%5.0f%8.0f%7.0f%10s" % (
        xyTest[i][1],
        xyTest[i][2],
        xyTest[i][3],
        classNames[idx]))

print("Mean squared prediction error: %f" % errorSs[0])

decisionTreeFree(tree)

Output

Predictions for test data:
  Age  Number  Start  Kyphosis

   71       3      5    Absent
  128       4      5   Present
    1       4     15    Absent
   61       6     10    Absent
  113       2     16    Absent
   82       5     14    Absent
  148       3     16    Absent
    1       4     12    Absent
    1       3     16    Absent
  175       5     13    Absent
Mean squared prediction error: 0.100000
The response variable has 0 missing values.
Growing the maximal tree using method QUEST:
Node 2 is a terminal node. It has 7 cases--too few cases to split.
Node 3 is a terminal node. It has 6 cases--too few cases to split.
Node 5 is a terminal node. It has 6 cases--too few cases to split.
Node 8 is an terminal node. The split is too thin having count 2.
Node 10 is a terminal node. It has 6 cases--too few cases to split.
Node 11 is a terminal node, because it is pure.
Node 11 is a terminal node. It has 7 cases--too few cases to split.
Node 13 is a terminal node. It has 5 cases--too few cases to split.
Node 14 is a terminal node, because it is pure.

Decision Tree:

Node 0: Cost = 0.210, N= 81, Level = 0, Child nodes:  1  4 
P(Y=0)= 0.790
P(Y=1)= 0.210
Predicted Kyphosis:  Absent 
Node 1: Cost = 0.074, N= 13, Level = 1, Child nodes:  2  3 
Rule:  Start <= 5.155
P(Y=0)= 0.538
P(Y=1)= 0.462
Predicted Kyphosis:  Absent 
Node 2: Cost = 0.025, N= 7, Level = 2
Rule:  Age <= 84.030
P(Y=0)= 0.714
P(Y=1)= 0.286
Predicted Kyphosis:  Absent 
Node 3: Cost = 0.025, N= 6, Level = 2
Rule:  Age > 84.030
P(Y=0)= 0.333
P(Y=1)= 0.667
Predicted Kyphosis:  Present 
Node 4: Cost = 0.136, N= 68, Level = 1, Child nodes:  5  6 
Rule:  Start > 5.155
P(Y=0)= 0.838
P(Y=1)= 0.162
Predicted Kyphosis:  Absent 
Node 5: Cost = 0.012, N= 6, Level = 2
Rule:  Start <= 8.862
P(Y=0)= 0.167
P(Y=1)= 0.833
Predicted Kyphosis:  Present 
Node 6: Cost = 0.074, N= 62, Level = 2, Child nodes:  7  12 
Rule:  Start > 8.862
P(Y=0)= 0.903
P(Y=1)= 0.097
Predicted Kyphosis:  Absent 
Node 7: Cost = 0.062, N= 28, Level = 3, Child nodes:  8  9 
Rule:  Start <= 13.092
P(Y=0)= 0.821
P(Y=1)= 0.179
Predicted Kyphosis:  Absent 
Node 8: Cost = 0.025, N= 15, Level = 4
Rule:  Age <= 91.722
P(Y=0)= 0.867
P(Y=1)= 0.133
Predicted Kyphosis:  Absent 
Node 9: Cost = 0.037, N= 13, Level = 4, Child nodes:  10  11 
Rule:  Age > 91.722
P(Y=0)= 0.769
P(Y=1)= 0.231
Predicted Kyphosis:  Absent 
Node 10: Cost = 0.037, N= 6, Level = 5
Rule:  Number <= 3.450
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Kyphosis:  Absent 
Node 11: Cost = 0.000, N= 7, Level = 5
Rule:  Number > 3.450
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis:  Absent 
Node 12: Cost = 0.012, N= 34, Level = 3, Child nodes:  13  14 
Rule:  Start > 13.092
P(Y=0)= 0.971
P(Y=1)= 0.029
Predicted Kyphosis:  Absent 
Node 13: Cost = 0.012, N= 5, Level = 4
Rule:  Start <= 14.864
P(Y=0)= 0.800
P(Y=1)= 0.200
Predicted Kyphosis:  Absent 
Node 14: Cost = 0.000, N= 29, Level = 4
Rule:  Start > 14.864
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis:  Absent

Warning Errors

IMSLS_NO_SURROGATES Use of surrogates is limited to method 1 (ALACART).
IMSLS_NO_CONVERGENCE Convergence was not achieved.

Fatal Errors

IMSLS_INVALID_METHOD2 Choose a valid tree generation method; 0 (C4.5), 1 (ALACART), 2 (CHAID) or 3 (QUEST).
IMSLS_VALUE_GT_ZERO The value of # must be strictly positive.