IMSL C Stat Library
decision_tree
Generates a decision tree for a single response variable and two or more predictor variables.
Synopsis
#include <imsls.h>
Imsls_f_decision_tree* imsls_f_decision_tree (int n, int n_cols, float xy[], int response_col_idx, int var_type[], ..., 0)
The type double function is imsls_d_decision_tree.
Required Arguments
int n (Input)
The number of rows in xy.
int n_cols (Input)
The number of columns in xy.
float xy[] (Input)
Array of size n × n_cols containing the data.
int response_col_idx (Input)
Column index of the response variable.
int var_type[] (Input)
Array of length n_cols indicating the type of each variable.
var_type[i]   Type
0             Categorical
1             Ordered Discrete (Low, Med., High)
2             Quantitative or Continuous
3             Ignore this variable
Return Value
A pointer to a structure of type Imsls_f_decision_tree. If an error occurs, NULL is returned.
Synopsis with Optional Arguments
#include <imsls.h>
Imsls_f_decision_tree *imsls_f_decision_tree (int n, int n_cols, float xy[], int response_col_idx, int var_type[],
IMSLS_METHOD, int method,
IMSLS_CRITERIA, int criteria,
IMSLS_RATIO,
IMSLS_WEIGHTS, float weights[],
IMSLS_COST_MATRIX, int n_classes, float cost_matrix[],
IMSLS_CONTROL, int params[],
IMSLS_COMPLEXITY, float complexity,
IMSLS_N_SURROGATES, int n_surrogates,
IMSLS_ALPHAS, float alphas[],
IMSLS_PRIORS, int n_classes, float priors[],
IMSLS_N_FOLDS, int n_folds,
IMSLS_N_SAMPLE, int n_samples,
IMSLS_TOLERANCE, float tol,
IMSLS_RANDOM_SEED, int seed,
IMSLS_PRINT, int print_level,
IMSLS_TEST_DATA, int n_test, float xy_test[],
IMSLS_TEST_DATA_WEIGHTS, float weights_test[],
IMSLS_ERROR_SS, float *pred_err_ss,
IMSLS_PREDICTED, float **predictions,
IMSLS_PREDICTED_USER, float predictions[],
0)
Optional Arguments
IMSLS_METHOD, int method (Input)
Specifies the tree generation method. The key for the variable type index is provided above.
method   Method                     Response var_type   Predictor var_type
0        C4.5                       0                   0, 1, 2
1        ALACART (Breiman et al.)   0, 1, 2             0, 1, 2
2        CHAID                      0, 1, 2             0, 1
3        QUEST                      0                   0, 1, 2

Default: method = 0.
IMSLS_CRITERIA, int criteria (Input)
Specifies which criterion the ALACART method and the C4.5 method should use in the gain calculations to determine the best split at each node.

criteria   Measure
0          Shannon Entropy
1          Gini Index

Default: criteria = 0.
IMSLS_RATIO, (Input)
If present, the ALACART and C4.5 methods each use a gain ratio instead of the gain to determine the best split.
Default: Uses gain.
IMSLS_WEIGHTS, float weights[] (Input)
An array of length n containing case weights.
Default: weights[i] = 1.0.
IMSLS_COST_MATRIX, int n_classes, float cost_matrix[] (Input)
An array of length n_classes × n_classes containing the cost matrix for a categorical response variable, where n_classes is the number of classes the response variable may assume. The cost matrix has elements C(i, j) = cost of misclassifying a response in class j as class i. The diagonal elements of the cost matrix must be 0.
Default: cost_matrix[i] = 1.0 for off-diagonal elements and cost_matrix[i] = 0.0 for diagonal elements.
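For example, a sketch of supplying an asymmetric cost matrix for a two-class response might look as follows (the arguments n, n_cols, xy, response_col_idx, and var_type are assumed to be declared as in the Examples section; the cost values themselves are illustrative):

/* Illustrative 2 x 2 cost matrix: misclassifying an actual class 1
   case as class 0 is taken to be five times as costly as the reverse
   error.  Row i, column j holds C(i, j). */
int n_classes = 2;
float cost_matrix[4] = {
    0.0, 5.0,   /* predict class 0: C(0,0), C(0,1) */
    1.0, 0.0    /* predict class 1: C(1,0), C(1,1) */
};

Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_COST_MATRIX, n_classes, cost_matrix,
    0);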
IMSLS_CONTROL, int params[] (Input)
Array of length 5 containing parameters to control the maximum size of the tree and other stopping rules.

params[i]   Name         Action
0           min_n_node   Do not split a node if one of its child nodes will have fewer than min_n_node observations.
1           min_split    Do not split a node if the node has fewer than min_split observations.
2           max_x_cats   Allow up to max_x_cats categories for categorical predictor variables.
3           max_size     Stop growing the tree once it has reached max_size nodes.
4           max_depth    Stop growing the tree once it has reached max_depth levels.

Default: params[] = {7, 21, 10, 100, 10}
IMSLS_COMPLEXITY, float complexity (Input)
The minimum complexity parameter to use in cross-validation. complexity must be ≥ 0.
Default: complexity = 0.0.
IMSLS_N_SURROGATES, int n_surrogates (Input)
Indicates the number of surrogate splits. Only used if method = 1.
Default: n_surrogates = 0.
IMSLS_ALPHAS, float alphas[] (Input)
An array of length 3 containing the significance levels. alphas[0] = significance level for split variable selection (CHAID and QUEST); alphas[1]= significance level for merging categories of a variable (CHAID), and alphas[2] = significance level for splitting previously merged categories (CHAID). Valid values are in the range 0 < alphas[i] < 1.0, and alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.
Default: alphas[] = {0.05, 0.05, -1.0}
IMSLS_PRIORS, int n_classes, float priors[] (Input)
An array of length n_classes, where n_classes is the number of classes the response variable may assume, containing prior probabilities for class membership. The argument is ignored for continuous response variables (var_type[response_col_idx]=2). By default, the prior probabilities are estimated from the data.
IMSLS_N_FOLDS, int n_folds (Input)
The number of folds to use in cross-validation tree selection. n_folds must be between 1 and n, inclusive. If n_folds = 1, the full data set is used once to generate the decision tree; in other words, no cross-validation is performed. If 1 < n/n_folds ≤ 3, then leave-one-out cross-validation is performed.
Default: n_folds = 10.
IMSLS_N_SAMPLE, int n_samples (Input)
The number of bootstrap samples to use in bootstrap aggregation (bagging) when predicted values are requested. To obtain predictions produced by bagging, set n_samples > 0 and use one of IMSLS_PREDICTED or IMSLS_PREDICTED_USER.
Default: n_samples = 0.
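As a sketch of requesting bagged predictions (again assuming the data arguments are declared as in the Examples section; the sample count and seed here are arbitrary):

/* Request predictions aggregated over 50 bootstrap samples; fixing
   the seed makes the resampling reproducible. */
float *predictions = NULL;

Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_N_SAMPLE, 50,
    IMSLS_RANDOM_SEED, 123457,
    IMSLS_PREDICTED, &predictions,
    0);
/* ... use predictions ... */
imsls_free(predictions);
imsls_f_decision_tree_free(tree);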
IMSLS_TOLERANCE, float tol (Input)
Error tolerance to use in the algorithm.
Default: tol = 100.0e0 * imsls_f_machine(4).
IMSLS_RANDOM_SEED, int seed (Input)
The seed of the random number generator used in sampling or cross-validation. Different values of seed on separate calls to imsls_f_decision_tree with the same data set may produce slightly different results. Setting seed to zero forces the random number seed to be determined by the system clock.
Default: seed = 0.
IMSLS_PRINT, int print_level (Input)
print_level   Action
0             No printing
1             Prints final results only
2             Prints intermediate and final results

Default: print_level = 0.
IMSLS_TEST_DATA, int n_test, float xy_test[] (Input)
xy_test is an array of size n_test × n_cols containing hold-out or test data for which predictions are requested. When this optional argument is present, the number of observations in xy_test, n_test, must be greater than 0. The response variable may have missing values in xy_test, but it must be in the same column, and the predictors must be in the same columns, as they are in xy. If the test data is not provided but predictions are requested, then xy is used and the predictions are the fitted values.
Default: xy_test = xy
IMSLS_TEST_DATA_WEIGHTS, float weights_test[] (Input)
An array of size n_test containing frequencies or weights for each observation in xy_test. This argument is ignored if IMSLS_TEST_DATA is not present.
Default: weights_test = weights.
IMSLS_ERROR_SS, float *pred_err_ss (Output)
The fitted data error mean sum of squares in the absence of test data (xy_test). When test data is provided, the prediction error mean sum of squares is returned.
IMSLS_PREDICTED, float **predicted (Output)
Address of a pointer to an array of length n containing the fitted or predicted value of the response variable for each case in the input data or test data, if provided.
IMSLS_PREDICTED_USER, float predicted[] (Output)
Storage for the array of the fitted or predicted value for each case is provided by the user.
Description
This implementation includes four of the most widely used algorithms for decision trees. Below is a brief summary of each approach.
C4.5
The method C4.5 (Quinlan, 1995) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps outlined above, using as splitting criterion the information gain or gain ratio. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as

E(S) = -\sum_{i=1}^{C} p_i \log(p_i)

where pi = Pr[Y = i|S] is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by

E(S, X) = \sum_{k=1}^{K} \Pr[X = k] \, E(S_k)

If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:

g(S, X) = E(S) - E(S, X)

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:

g_R(S, X) = \frac{g(S, X)}{E_X(S)}

where

E_X(S) = -\sum_{k=1}^{K} \nu_k \log(\nu_k)

with

\nu_k = \Pr[X = k \mid S]

Note that EX(S) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats the continuous variable similarly, except that only binary splits of the form X ≤ d and X > d are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. The growth stops after one or more stopping conditions are met.
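To make the gain computation concrete, the following sketch (illustrative C, not part of the library) computes the Shannon entropy and the information gain g(S, X) from tabulated class counts:

#include <math.h>

/* Shannon entropy of a node from the class counts of a response with
   n_classes categories; n_total is the number of cases in the node. */
static double entropy(const int counts[], int n_classes, int n_total)
{
    double e = 0.0, p;
    int i;
    for (i = 0; i < n_classes; i++) {
        if (counts[i] > 0) {
            p = (double) counts[i] / n_total;
            e -= p * log(p);
        }
    }
    return e;
}

/* Information gain g(S,X) = E(S) - E(S,X) for a categorical predictor
   X with n_cats categories.  counts is n_cats x n_classes (row-major):
   counts[k*n_classes + j] is the number of cases with X = k and Y = j.
   y_counts holds the class counts of Y over all n_total cases in S. */
static double info_gain(const int counts[], int n_cats, int n_classes,
                        const int y_counts[], int n_total)
{
    double e_s = entropy(y_counts, n_classes, n_total);
    double e_sx = 0.0;
    int k, j, n_k;
    for (k = 0; k < n_cats; k++) {
        n_k = 0;
        for (j = 0; j < n_classes; j++)
            n_k += counts[k * n_classes + j];
        if (n_k > 0)
            e_sx += ((double) n_k / n_total) *
                    entropy(counts + k * n_classes, n_classes, n_k);
    }
    return e_s - e_sx;
}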
ALACART
ALACART implements the method of Breiman, Friedman, Olshen and Stone (1984), the original authors and developers of CART™. CART™ is the trademarked name for Classification and Regression Trees. In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, splits into only two subsets are considered, e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C,D} is not.
For classification problems, ALACART uses a criterion similar to information gain, called impurity. The method searches for a split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities

I(S) = φ(p(1|S), p(2|S), …, p(C|S))

The measure function φ(·) should be 0 for “pure” nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.

Because only binary splits of a subset S are considered (S1, S2 such that S = S1 ∪ S2 and S1 ∩ S2 = ∅), the reduction in impurity when splitting S into S1, S2 is

ΔI = I(S) - q1 I(S1) - q2 I(S2)

where qj = Pr[Sj], j = 1, 2, is the node probability.

The gain criteria and the reduction in impurity ΔI are similar concepts and equivalent when I is entropy and when only binary splits are considered. Another popular measure for the impurity at a node is the Gini index, given by

\phi(p(1|S), \ldots, p(C|S)) = \sum_{i \ne j} p(i|S) \, p(j|S) = 1 - \sum_{i=1}^{C} p^2(i|S)

If Y is an ordered response or continuous, the problem is a regression problem. ALACART generates the tree using the same steps, except that node-level measures or loss functions are the mean squared error (MSE) or mean absolute deviation (MAD) rather than node impurity measures.
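As an illustration (a sketch, not library code), the Gini index and the impurity reduction ΔI for a candidate binary split can be computed from class proportions as follows:

/* Gini index of a node given its class proportions p[0..n_classes-1]. */
static double gini(const double p[], int n_classes)
{
    double sum_sq = 0.0;
    int i;
    for (i = 0; i < n_classes; i++)
        sum_sq += p[i] * p[i];
    return 1.0 - sum_sq;
}

/* Reduction in impurity for a binary split of S into S1 and S2, where
   p, p1, p2 are the class proportions in S, S1, S2 and q1 is the node
   probability of S1 (q2 = 1 - q1). */
static double impurity_reduction(const double p[], const double p1[],
                                 const double p2[], int n_classes,
                                 double q1)
{
    double q2 = 1.0 - q1;
    return gini(p, n_classes) - q1 * gini(p1, n_classes)
                              - q2 * gini(p2, n_classes);
}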
CHAID
The third method is appropriate only for categorical or discrete ordered predictor variables. CHAID, due to Kass (1980), is an acronym for chi-square automatic interaction detection. At each node, as above, CHAID looks for the best splitting variable. The approach is as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X and the categories of Y. The least significant result is noted and, if a threshold is met, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests and determine whether further merging is possible. If a merged category consists of three or more of the original categories of X, CHAID calls for a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result meets a threshold, then the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth. Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories defines the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), then the node is not split. This growing procedure continues until one or more stopping conditions are met.
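The building block of this procedure is the two-way test of association; a sketch of the Pearson chi-squared statistic for a table of observed counts (illustrative only; CHAID additionally adjusts the resulting p-values for the multiple tests performed) is:

/* Pearson chi-squared statistic for an n_rows x n_cols contingency
   table of observed counts (row-major); the degrees of freedom are
   (n_rows - 1) * (n_cols - 1). */
static double chi_squared_stat(const double obs[], int n_rows, int n_cols)
{
    double total = 0.0, stat = 0.0, row_sum, col_sum, expected, diff;
    int i, j, k;
    for (i = 0; i < n_rows * n_cols; i++)
        total += obs[i];
    for (i = 0; i < n_rows; i++) {
        for (j = 0; j < n_cols; j++) {
            row_sum = col_sum = 0.0;
            for (k = 0; k < n_cols; k++) row_sum += obs[i * n_cols + k];
            for (k = 0; k < n_rows; k++) col_sum += obs[k * n_cols + j];
            expected = row_sum * col_sum / total;
            if (expected > 0.0) {
                diff = obs[i * n_cols + j] - expected;
                stat += diff * diff / expected;
            }
        }
    }
    return stat;
}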
QUEST
The fourth method, the QUEST algorithm (Loh and Shih, 1997), is appropriate for a categorical response variable and predictors of either categorical or quantitative type. For each categorical predictor, QUEST performs a multi-way chi-square test of association between the predictor and Y. For every continuous predictor, QUEST performs an ANOVA test to see if the means of the predictor vary among the groups of Y. Among these tests, the variable with the most significant result is selected as a potential splitting variable, say, Xj. If the p-value (adjusted for multiple tests) is less than the specified splitting threshold, then Xj is the splitting variable for the current node. If not, QUEST performs for each continuous variable X a Levene’s test of homogeneity to see if the variance of X varies within the different groups of Y. Among these tests, we again find the predictor with the most significant result, say Xi. If its p-value (adjusted for multiple tests) is less than the splitting threshold, Xi is the splitting variable. Otherwise, the node is not split.
Assuming a splitting variable is found, the next step is to determine how the variable should be split. If the selected variable Xj is continuous, a split point d is determined by quadratic discriminant analysis (QDA) of Xj into two populations determined by a binary partition of the response Y. The goal of this step is to group the classes of Y into two subsets or super classes, A and B. If there are only two classes in the response Y, the super classes are obvious. Otherwise, calculate the means and variances of Xj in each of the classes of Y. If the means are all equal, put the largest-sized class into group A and combine the rest to form group B. If they are not all equal, use a k-means clustering method (k = 2) on the class means to determine A and B.
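A minimal sketch of that grouping step, assuming the class means of Xj have already been computed (illustrative code, not the library's internal implementation):

#include <math.h>

/* One-dimensional 2-means clustering of the class means of Xj.  On
   return, group[i] = 0 assigns class i of Y to super class A and
   group[i] = 1 assigns it to super class B. */
static void two_means(const double class_mean[], int n_classes, int group[])
{
    double center[2], sum[2];
    int cnt[2];
    int i, g, iter, changed;

    /* Start the two centers at the smallest and largest class mean. */
    center[0] = center[1] = class_mean[0];
    for (i = 1; i < n_classes; i++) {
        if (class_mean[i] < center[0]) center[0] = class_mean[i];
        if (class_mean[i] > center[1]) center[1] = class_mean[i];
    }
    for (iter = 0; iter < 100; iter++) {
        sum[0] = sum[1] = 0.0;
        cnt[0] = cnt[1] = 0;
        changed = 0;
        for (i = 0; i < n_classes; i++) {
            g = (fabs(class_mean[i] - center[0]) <=
                 fabs(class_mean[i] - center[1])) ? 0 : 1;
            if (iter == 0 || g != group[i]) changed = 1;
            group[i] = g;
            sum[g] += class_mean[i];
            cnt[g]++;
        }
        for (g = 0; g < 2; g++)
            if (cnt[g] > 0) center[g] = sum[g] / cnt[g];
        if (!changed) break;   /* assignments are stable */
    }
}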
Xj in A and in B is assumed to be normally distributed with estimated means x̄j|A, x̄j|B and variances S²j|A, S²j|B, respectively. The quadratic discriminant is the partition Xj ≤ d and Xj > d such that Pr(Xj ≤ d, A) = Pr(Xj ≤ d, B). The discriminant rule assigns an observation to A if xij ≤ d and to B if xij > d. For d to maximally discriminate, the probabilities must be equal.
If the selected variable Xj is categorical, it is first transformed using the method outlined in Loh and Shih (1997) and then QDA is performed as above. The transformation is related to the discriminant coordinate (CRIMCOORD) approach due to Gnanadesikan (1977).
Minimal-Cost Complexity Pruning
One way to address overfitting is to grow the tree as large as possible, and then use some logic to prune it back. Let T represent a decision tree generated by any of the methods above. The idea (from Breiman et al.) is to find the smallest sub-tree of T that minimizes the cost complexity measure

R_\delta(T) = R(T) + \delta |\tilde{T}|,

where \tilde{T} denotes the set of terminal nodes, |\tilde{T}| represents the number of terminal nodes, and δ ≥ 0 is a cost-complexity parameter. For a categorical target variable,

R(T) = \sum_{t \in \tilde{T}} p(t) \min_i \sum_{j} C(i|j) \, p(j|t),

p(t) = Pr[x ∈ t],

and p(j|t) = Pr[y = j | x ∈ t], and C(i|j) is the cost for misclassifying the actual class j as i. Note that C(j|j) = 0 and C(i|j) > 0, for i ≠ j.

When the target is continuous (and the problem is a regression problem), the metric is instead the mean squared error

R(T) = \frac{1}{N} \sum_{t \in \tilde{T}} \sum_{x_n \in t} (y_n - \bar{y}(t))^2.

This software implements the optimal pruning algorithm (Algorithm 10.1, page 294, of Breiman et al. (1984)). The result of the algorithm is a sequence of sub-trees T_max ⊃ T_1 ⊃ T_2 ⊃ … ⊃ T_{M-1} ⊃ {t_0} obtained by pruning the fully generated tree, T_max, until the sub-tree consists of the single root node, {t_0}. Corresponding to the sequence of sub-trees is the sequence of complexity values, 0 ≤ δ_min = δ_1 < δ_2 < … < δ_{M-1} < δ_M, where M is the number of steps it takes in the algorithm to reach the root node. The sub-trees represent the optimally pruned sub-trees for the sequence of complexity values. The minimum complexity δ_min can be set via an optional argument (IMSLS_COMPLEXITY).
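Evaluating the cost-complexity measure itself is simple once the resubstitution cost of each terminal node is available; the following sketch (illustrative, not the pruning algorithm itself) computes R_δ(T):

/* Cost-complexity measure R_delta(T) = R(T) + delta * |T~|, where
   terminal_cost[i] holds the resubstitution cost R(t) of the i-th
   terminal node and n_terminal is the number of terminal nodes. */
static double cost_complexity(const double terminal_cost[], int n_terminal,
                              double delta)
{
    double r = 0.0;
    int i;
    for (i = 0; i < n_terminal; i++)
        r += terminal_cost[i];
    return r + delta * n_terminal;
}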
V-Fold Cross-Validation
In V-fold cross-validation, the training data is partitioned randomly into V approximately equally sized sub-samples. The model is then trained V different times with each of the sub-samples removed in turn. The cross-validated estimate of the risk of a decision function R(d_k) is

R^{CV}(d_k) = \frac{1}{N} \sum_{v=1}^{V} \sum_{(x_n, y_n) \in \eta_v} L(y_n, d_k^{(v)}(x_n)),

where L(y, d(x)) is the loss incurred when the decision is d(x) for the actual value y. The symbol η denotes the full training data set, d_k^{(v)} denotes the set of decisions corresponding to T_k^{(v)}, the kth optimally pruned tree using the training sample η − η_v, and \delta'_k = \sqrt{\delta_k \delta_{k+1}}, where δ_k, δ_{k+1} come from the pruning on the full data set, η.
For example, if the problem is classification, priors are estimated from the data as N_j/N, and, for any tree T,

R^{CV}(d_k) = \frac{1}{N} \sum_{i} \sum_{j} C(i|j) \, N_{ij},

where N_{ij}^{(v)} is the number of class j cases in η_v misclassified as class i by T_k^{(v)}, and

N_{ij} = \sum_{v=1}^{V} N_{ij}^{(v)}

is the overall number of j cases misclassified as i. The standard error of R^{CV}(d_k) is approximated with
SE(R^{CV}(d_k)) = \sqrt{s^2 / N},

where

s^2 = \frac{1}{N} \sum_{(x_n, y_n) \in \eta} \left[ L(y_n, d_k(x_n)) - R^{CV}(d_k) \right]^2,

the “sample variance” of the cross-validated estimates.
Final selection rules include:
1. Select the tree T_{k1} such that R^{CV}(d_{k1}) = min_k R^{CV}(d_k).
2. Select the tree T_{k2} such that k2 is the largest k satisfying R^{CV}(d_{k2}) ≤ R^{CV}(d_{k1}) + SE(R^{CV}(d_{k1})).
3. For a specified complexity parameter δ_fin, select the tree T_{k3} such that δ_{k3} ≤ δ_fin < δ_{k3+1}.
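A sketch of the first two rules, given the cross-validated risk estimates and their standard errors for the sub-tree sequence (illustrative code; the parameter names are not part of the library):

/* Select a sub-tree index from the cross-validated risks cv_err[k] and
   standard errors cv_se[k], k = 0, ..., m - 1.  Rule 1 returns the
   minimizer of the risk; rule 2 (the "one standard error" rule) returns
   the largest k whose risk is within one standard error of that minimum. */
static int select_subtree(const double cv_err[], const double cv_se[],
                          int m, int use_one_se_rule)
{
    int k, k1 = 0;
    for (k = 1; k < m; k++)
        if (cv_err[k] < cv_err[k1]) k1 = k;
    if (!use_one_se_rule)
        return k1;
    for (k = m - 1; k > k1; k--)
        if (cv_err[k] <= cv_err[k1] + cv_se[k1]) return k;
    return k1;
}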
Bagging
Bagging is a resampling approach for generating predictions. In particular, bagging stands for bootstrap aggregating. In the procedure, m bootstrap samples of size n are drawn from the training set of size n. Bootstrap sampling is sampling with replacement so that each sample has repeated observations, and some observations will be left out. A decision tree is grown treating each sample as a separate training set. The tree is then used to generate predictions for the test data. For each test case, the m predictions are combined by averaging the output (regression) or voting (classification) to obtain a final prediction.
Bagging leads to "improvements for unstable procedures," such as neural nets, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).
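The resampling and aggregation steps are handled internally when IMSLS_N_SAMPLE is used, but they are easy to sketch (illustrative code; rand() stands in for the library's seeded generator):

#include <stdlib.h>

/* Draw one bootstrap sample of size n by sampling row indices with
   replacement. */
static void bootstrap_indices(int n, int idx[])
{
    int i;
    for (i = 0; i < n; i++)
        idx[i] = rand() % n;
}

/* Aggregate m sets of regression predictions by averaging; preds is
   m x n_test (row-major), and the combined prediction for each test
   case is written to out. */
static void aggregate_mean(const float preds[], int m, int n_test,
                           float out[])
{
    int b, i;
    double sum;
    for (i = 0; i < n_test; i++) {
        sum = 0.0;
        for (b = 0; b < m; b++)
            sum += preds[b * n_test + i];
        out[i] = (float) (sum / m);
    }
}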
Missing Values
Any observation or case with a missing response variable is eliminated from the analysis. If a predictor has a missing value, each algorithm will skip that case when evaluating the given predictor. When making a prediction for a new case, if the split variable is missing, the prediction function applies surrogate split-variables and splitting rules in turn, if they are estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables when requested.
Structure Definitions
Table 13.44 — Structure Imsls_f_decision_tree
Name             Type                 Description
n_classes        int                  Number of classes assumed by the response variable, if the response variable is categorical
n_levels         int                  Number of levels or depth of the tree
n_nodes          int                  Number of nodes or size of the tree
nodes            Imsls_f_tree_node*   Pointer to an array of tree_node structures of size n_nodes
n_preds          int                  Number of predictors used in the model
n_surrogates     int                  Number of surrogate splits searched for at each node. Available for method = 1
pred_type        int*                 Pointer to an array of length n_preds containing the type of each predictor variable
pred_n_values    int*                 Pointer to an array of length n_preds containing the number of values of each predictor variable
response_type    int                  Type of the response variable
terminal_nodes   int*                 Pointer to an array of length n_nodes indicating which nodes are terminal nodes
Table 13.45 — Structure tree_node
Name               Type     Description
children_ids       int*     Pointer to an array of length n_children containing the IDs of the children nodes
cost               float    Misclassification cost (in-sample cost measure at the current node)
n_cases            int      Number of cases of the training data that fall into the current node
n_children         int      Number of children of the current node
node_id            int      Node ID, where the root node corresponds to node_id = 0
node_prob          float*   Estimate of the probability of a new case belonging to the node
node_split_value   float    Value around which the node will be split, if the node variable is of continuous type
node_values_ind    int*     Values of the split variable for the current node, if node_var_id has type 0 or 1
node_var_id        int      ID of the variable that defined the split in the parent node
parent_id          int      ID of the parent of the node with ID node_id
predicted_class    int      Predicted class at the current node, for response variables of categorical type
predicted_val      float    Predicted value of the response variable, if the response variable is of continuous type
surrogate_info     float*   Array containing the surrogate split information
y_probs            float*   Pointer to the array of class probabilities at the current node, if the response variable is of categorical type
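These fields can be used to walk the returned tree directly. The sketch below (assuming a tree fitted to a categorical response, and treating terminal_nodes as a 0/1 indicator per node) prints each node's ID, size, predicted class, and children:

#include <imsls.h>
#include <stdio.h>

/* Illustrative traversal of a fitted decision tree for a categorical
   response: print node IDs, case counts, predicted classes, and the
   IDs of each node's children. */
void print_nodes(Imsls_f_decision_tree *tree)
{
    int i, j;
    for (i = 0; i < tree->n_nodes; i++) {
        printf("Node %d: N = %d, predicted class = %d%s\n",
               tree->nodes[i].node_id,
               tree->nodes[i].n_cases,
               tree->nodes[i].predicted_class,
               tree->terminal_nodes[i] ? " (terminal)" : "");
        for (j = 0; j < tree->nodes[i].n_children; j++)
            printf("    child: %d\n", tree->nodes[i].children_ids[j]);
    }
}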
Examples
Example 1
In this example, we use a small data set with response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (True (0), False (1)). A decision tree is generated by the C4.5 and ALACART methods. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. The maximal trees are printed out using imsls_f_decision_tree_print. Notice that C4.5 splits on Outlook, then Humidity and Wind, while ALACART splits on Outlook, then Temperature.
 
#include <imsls.h>
#include <stdio.h>
 
int main(){
 
float xy[] =
{
0, 85, 85, 0, 0,
0, 80, 90, 1, 0,
1, 83, 78, 0, 1,
2, 70, 96, 0, 1,
2, 68, 80, 0, 1,
2, 65, 70, 1, 0,
1, 64, 65, 1, 1,
0, 72, 95, 0, 0,
0, 69, 70, 0, 1,
2, 75, 80, 0, 1,
0, 75, 70, 1, 1,
1, 72, 90, 1, 1,
1, 81, 75, 0, 1,
2, 71, 80, 1, 0
};
 
int n = 14;
int ncols = 5;
int response_col_idx = 4;
int method = 1;
int var_type[] = {0, 2, 2, 0, 0};
int control[] = {2, 3, 10, 50, 10};
 
const char* names[] = {"Outlook", "Temperature", "Humidity", "Wind",
"Play"};
const char* class_names[] = {"Don't Play", "Play"};
const char* var_levels[] = {"Sunny", "Overcast", "Rainy", "False", "True"};
 
Imsls_f_decision_tree *tree = NULL;
 
tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
IMSLS_N_FOLDS, 1,
IMSLS_CONTROL, control,
0);
 
printf("Decision Tree using Method C4.5:\n\n");
imsls_f_decision_tree_print(tree,
IMSLS_VAR_NAMES, names,
IMSLS_CLASS_NAMES, class_names,
IMSLS_CATEG_NAMES, var_levels,
0);
 
imsls_f_decision_tree_free(tree);
 
tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
IMSLS_N_FOLDS, 1,
IMSLS_METHOD, method,
IMSLS_CONTROL, control,
0);
 
printf("Decision Tree using Method ALACART:\n\n");
imsls_f_decision_tree_print(tree,
IMSLS_VAR_NAMES, names,
IMSLS_CLASS_NAMES, class_names,
IMSLS_CATEG_NAMES, var_levels,
0);
 
imsls_f_decision_tree_free(tree);
}
 
Output
Decision Tree using Method C4.5:
 
 
Decision Tree:
 
Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 4 5
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y: Play
Node 1: Cost = 0.143, N= 5, Level = 1, Child nodes: 2 3
Rule: Outlook in: { Sunny }
P(Y=0)= 0.600
P(Y=1)= 0.400
Predicted Y: Don't Play
Node 2: Cost = 0.000, N= 2, Level = 2
Rule: Humidity <= 77.500
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 3: Cost = 0.000, N= 3, Level = 2
Rule: Humidity > 77.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 4: Cost = 0.000, N= 4, Level = 1
Rule: Outlook in: { Overcast }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 5: Cost = 0.143, N= 5, Level = 1, Child nodes: 6 7
Rule: Outlook in: { Rainy }
P(Y=0)= 0.400
P(Y=1)= 0.600
Predicted Y: Play
Node 6: Cost = 0.000, N= 3, Level = 2
Rule: Wind in: { False }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 7: Cost = 0.000, N= 2, Level = 2
Rule: Wind in: { True }
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Decision Tree using Method ALACART:
 
 
Decision Tree:
 
Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 8
P(Y=0)= 0.357
P(Y=1)= 0.643
Predicted Y: Play
Node 1: Cost = 0.357, N= 10, Level = 1, Child nodes: 2 7
Rule: Outlook in: { Sunny Rainy }
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Y: Don't Play
Node 2: Cost = 0.214, N= 8, Level = 2, Child nodes: 3 6
Rule: Temperature <= 77.500
P(Y=0)= 0.375
P(Y=1)= 0.625
Predicted Y: Play
Node 3: Cost = 0.214, N= 6, Level = 3, Child nodes: 4 5
Rule: Temperature <= 73.500
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Y: Don't Play
Node 4: Cost = 0.071, N= 4, Level = 4
Rule: Temperature <= 70.500
P(Y=0)= 0.250
P(Y=1)= 0.750
Predicted Y: Play
Node 5: Cost = 0.000, N= 2, Level = 4
Rule: Temperature > 70.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 6: Cost = 0.000, N= 2, Level = 3
Rule: Temperature > 73.500
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Node 7: Cost = 0.000, N= 2, Level = 2
Rule: Temperature > 77.500
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Y: Don't Play
Node 8: Cost = 0.000, N= 4, Level = 1
Rule: Outlook in: { Overcast }
P(Y=0)= 0.000
P(Y=1)= 1.000
Predicted Y: Play
Example 2
This example applies the QUEST method to a simulated data set with 50 cases and three predictors of mixed type. A maximally grown tree under the default controls and the optimally pruned sub-tree obtained from cross-validation and minimal cost-complexity pruning are produced. Notice that the optimally pruned tree consists of just the root node, whereas the maximal tree has five nodes and three levels.
 
#include <imsls.h>
#include <stdio.h>
 
int main(){
float xy[50*4] =
{
2, 25.928690, 0, 0,
1, 51.632450, 1, 1,
1, 25.784321, 0, 2,
0, 39.379478, 0, 3,
2, 24.650579, 0, 2,
2, 45.200840, 0, 2,
2, 52.679600, 1, 3,
1, 44.283421, 1, 3,
2, 40.635231, 1, 3,
2, 51.760941, 0, 3,
2, 26.303680, 0, 1,
2, 20.702299, 1, 0,
2, 38.742729, 1, 3,
2, 19.473330, 0, 0,
1, 26.422110, 0, 0,
2, 37.059860, 1, 0,
1, 51.670429, 1, 3,
0, 42.401562, 0, 3,
2, 33.900269, 1, 2,
1, 35.432819, 0, 0,
1, 44.303692, 0, 1,
0, 46.723869, 0, 2,
1, 46.992619, 0, 2,
0, 36.059231, 0, 3,
2, 36.831970, 1, 1,
1, 61.662571, 1, 2,
0, 25.677139, 0, 3,
1, 39.085670, 1, 0,
0, 48.843410, 1, 1,
1, 39.343910, 0, 3,
2, 24.735220, 0, 2,
1, 50.552509, 1, 3,
0, 31.342630, 1, 3,
1, 27.157949, 1, 0,
0, 31.726851, 0, 2,
0, 25.004080, 0, 3,
1, 26.354570, 1, 3,
2, 38.123428, 0, 1,
0, 49.940300, 0, 2,
1, 42.457790, 1, 3,
0, 38.809479, 1, 1,
0, 43.227989, 1, 1,
0, 41.876240, 0, 3,
2, 48.078201, 0, 2,
0, 43.236729, 1, 0,
2, 39.412941, 0, 3,
1, 23.933460, 0, 2,
2, 42.841301, 1, 3,
2, 30.406691, 0, 1,
0, 37.773891, 0, 2
};
 
int n = 50;
int ncols = 4;
int method = 3;
int var_type[] = {0, 2, 0, 0};
int response_col_idx = 3;
 
Imsls_f_decision_tree *tree = NULL;
 
tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
IMSLS_METHOD, method,
IMSLS_RANDOM_SEED, 123457,
IMSLS_PRINT, 1,
0);
 
printf("\nMaximal tree: \n\n");
imsls_f_decision_tree_print(tree,
IMSLS_PRINT_MAX,
0);
 
printf("\nOptimally pruned subtree: \n\n");
imsls_f_decision_tree_print(tree, 0);
 
imsls_f_decision_tree_free(tree);
}
Output
Growing the maximal tree using method QUEST:
 
Cross-Validation:
Tree complexity CV-err CV-Std.Error
0 0.00000 0.60000 0.05842
1 0.02000 0.62000 0.06236
2 0.04000 0.62000 0.06233
 
Select tree number 2, cost complexity parameter = 0.04000:
 
 
Maximal tree:
 
 
Decision Tree:
 
Node 0: Cost = 0.620, N= 50, Level = 0, Child nodes: 1 2
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y: 3
Node 1: Cost = 0.220, N= 17, Level = 1
Rule: X1 <= 35.031
P(Y=0)= 0.294
P(Y=1)= 0.118
P(Y=2)= 0.353
P(Y=3)= 0.235
Predicted Y: 2
Node 2: Cost = 0.360, N= 33, Level = 1, Child nodes: 3 4
Rule: X1 > 35.031
P(Y=0)= 0.121
P(Y=1)= 0.212
P(Y=2)= 0.212
P(Y=3)= 0.455
Predicted Y: 3
Node 3: Cost = 0.180, N= 19, Level = 2
Rule: X1 <= 43.265
P(Y=0)= 0.211
P(Y=1)= 0.211
P(Y=2)= 0.053
P(Y=3)= 0.526
Predicted Y: 3
Node 4: Cost = 0.160, N= 14, Level = 2
Rule: X1 > 43.265
P(Y=0)= 0.000
P(Y=1)= 0.214
P(Y=2)= 0.429
P(Y=3)= 0.357
Predicted Y: 2
 
Optimally pruned subtree:
 
 
Decision Tree:
 
Node 0: Cost = 0.620, N= 50, Level = 0
P(Y=0)= 0.180
P(Y=1)= 0.180
P(Y=2)= 0.260
P(Y=3)= 0.380
Predicted Y: 3
Pruned at Node id 0.
Example 3
This example uses the dataset Kyphosis. The 81 cases represent 81 children who have undergone surgery to correct a type of spinal deformity known as Kyphosis. The response variable is the presence or absence of Kyphosis after the surgery. The three predictors are Age of the patient in months; Start, the number of the vertebra where the surgery started; and Number, the number of vertebrae involved in the surgery. This example uses the method QUEST to produce a maximal tree. It also requests predictions for a test data set consisting of 10 “new” cases.
 
#include <imsls.h>
#include <stdio.h>
 
int main()
{
float xy[81*4] =
{
0, 71, 3, 5,
0, 158, 3, 14,
1, 128, 4, 5,
0, 2, 5, 1,
0, 1, 4, 15,
0, 1, 2, 16,
0, 61, 2, 17,
0, 37, 3, 16,
0, 113, 2, 16,
1, 59, 6, 12,
1, 82, 5, 14,
0, 148, 3, 16,
0, 18, 5, 2,
0, 1, 4, 12,
0, 168, 3, 18,
0, 1, 3, 16,
0, 78, 6, 15,
0, 175, 5, 13,
0, 80, 5, 16,
0, 27, 4, 9,
0, 22, 2, 16,
1, 105, 6, 5,
1, 96, 3, 12,
0, 131, 2, 3,
1, 15, 7, 2,
0, 9, 5, 13,
0, 8, 3, 6,
0, 100, 3, 14,
0, 4, 3, 16,
0, 151, 2, 16,
0, 31, 3, 16,
0, 125, 2, 11,
0, 130, 5, 13,
0, 112, 3, 16,
0, 140, 5, 11,
0, 93, 3, 16,
0, 1, 3, 9,
1, 52, 5, 6,
0, 20, 6, 9,
1, 91, 5, 12,
1, 73, 5, 1,
0, 35, 3, 13,
0, 143, 9, 3,
0, 61, 4, 1,
0, 97, 3, 16,
1, 139, 3, 10,
0, 136, 4, 15,
0, 131, 5, 13,
1, 121, 3, 3,
0, 177, 2, 14,
0, 68, 5, 10,
0, 9, 2, 17,
1, 139, 10, 6,
0, 2, 2, 17,
0, 140, 4, 15,
0, 72, 5, 15,
0, 2, 3, 13,
1, 120, 5, 8,
0, 51, 7, 9,
0, 102, 3, 13,
1, 130, 4, 1,
1, 114, 7, 8,
0, 81, 4, 1,
0, 118, 3, 16,
0, 118, 4, 16,
0, 17, 4, 10,
0, 195, 2, 17,
0, 159, 4, 13,
0, 18, 4, 11,
0, 15, 5, 16,
0, 158, 5, 14,
0, 127, 4, 12,
0, 87, 4, 16,
0, 206, 4, 10,
0, 11, 3, 15,
0, 178, 4, 15,
1, 157, 3, 13,
0, 26, 7, 13,
0, 120, 2, 13,
1, 42, 7, 6,
0, 36, 4, 13
};
 
float xy_test[10*4] =
{
0, 71, 3, 5,
1, 128, 4, 5,
0, 1, 4, 15,
0, 61, 6, 10,
0, 113, 2, 16,
1, 82, 5, 14,
0, 148, 3, 16,
0, 1, 4, 12,
0, 1, 3, 16,
0, 175, 5, 13
};
 
int n = 81;
int ncols = 4;
int response_col_idx = 0;
int method = 3;
int control[] = {5, 10, 10, 50, 10};
int var_type[] = {0, 2, 2, 2};
 
int n_test = 10;
int i, idx;
 
float *predictions;
float pred_err_ss;
const char* names[] = {"Age", "Number", "Start"};
const char* class_names[] = {"Absent", "Present"};
const char* response_name[] = {"Kyphosis"};
Imsls_f_decision_tree *tree = NULL;
 
tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
IMSLS_METHOD, method,
IMSLS_N_FOLDS, 1,
IMSLS_CONTROL, control,
IMSLS_TEST_DATA, n_test, xy_test,
IMSLS_PRINT, 2,
IMSLS_PREDICTED, &predictions,
IMSLS_ERROR_SS, &pred_err_ss,
0);
 
imsls_f_decision_tree_print(tree,
IMSLS_RESP_NAME, response_name,
IMSLS_VAR_NAMES, names,
IMSLS_CLASS_NAMES, class_names,
0);
 
printf("\nPredictions for test data:\n");
printf("%5s%8s%7s%10s\n", names[0], names[1], names[2],
response_name[0]);
for(i=0;i<n_test;i++){
printf("%5.0f%8.0f%7.0f",
xy_test[i*ncols+1],
xy_test[i*ncols+2],
xy_test[i*ncols+3]);
idx = (int)predictions[i];
printf("%10s\n", class_names[idx]);
}
printf("\nMean squared prediction error: %f\n", pred_err_ss);
 
imsls_f_decision_tree_free(tree);
imsls_free(predictions);
}
Output
The response variable has 0 missing values.
Growing the maximal tree using method QUEST:
Node 2 is a terminal node. It has 7 cases--too few cases to split.
Node 3 is a terminal node. It has 6 cases--too few cases to split.
Node 5 is a terminal node. It has 6 cases--too few cases to split.
Node 8 is an terminal node. The split is too thin having count 2.
Node 10 is a terminal node. It has 6 cases--too few cases to split.
Node 11 is a terminal node, because it is pure.
Node 11 is a terminal node. It has 7 cases--too few cases to split.
Node 13 is a terminal node. It has 5 cases--too few cases to split.
Node 14 is a terminal node, because it is pure.
 
Decision Tree:
 
Node 0: Cost = 0.210, N= 81, Level = 0, Child nodes: 1 4
P(Y=0)= 0.790
P(Y=1)= 0.210
Predicted Kyphosis Absent
Node 1: Cost = 0.074, N= 13, Level = 1, Child nodes: 2 3
Rule: Start <= 5.155
P(Y=0)= 0.538
P(Y=1)= 0.462
Predicted Kyphosis Absent
Node 2: Cost = 0.025, N= 7, Level = 2
Rule: Age <= 84.030
P(Y=0)= 0.714
P(Y=1)= 0.286
Predicted Kyphosis Absent
Node 3: Cost = 0.025, N= 6, Level = 2
Rule: Age > 84.030
P(Y=0)= 0.333
P(Y=1)= 0.667
Predicted Kyphosis Present
Node 4: Cost = 0.136, N= 68, Level = 1, Child nodes: 5 6
Rule: Start > 5.155
P(Y=0)= 0.838
P(Y=1)= 0.162
Predicted Kyphosis Absent
Node 5: Cost = 0.012, N= 6, Level = 2
Rule: Start <= 8.862
P(Y=0)= 0.167
P(Y=1)= 0.833
Predicted Kyphosis Present
Node 6: Cost = 0.074, N= 62, Level = 2, Child nodes: 7 12
Rule: Start > 8.862
P(Y=0)= 0.903
P(Y=1)= 0.097
Predicted Kyphosis Absent
Node 7: Cost = 0.062, N= 28, Level = 3, Child nodes: 8 9
Rule: Start <= 13.092
P(Y=0)= 0.821
P(Y=1)= 0.179
Predicted Kyphosis Absent
Node 8: Cost = 0.025, N= 15, Level = 4
Rule: Age <= 91.722
P(Y=0)= 0.867
P(Y=1)= 0.133
Predicted Kyphosis Absent
Node 9: Cost = 0.037, N= 13, Level = 4, Child nodes: 10 11
Rule: Age > 91.722
P(Y=0)= 0.769
P(Y=1)= 0.231
Predicted Kyphosis Absent
Node 10: Cost = 0.037, N= 6, Level = 5
Rule: Number <= 3.450
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Kyphosis Absent
Node 11: Cost = 0.000, N= 7, Level = 5
Rule: Number > 3.450
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis Absent
Node 12: Cost = 0.012, N= 34, Level = 3, Child nodes: 13 14
Rule: Start > 13.092
P(Y=0)= 0.971
P(Y=1)= 0.029
Predicted Kyphosis Absent
Node 13: Cost = 0.012, N= 5, Level = 4
Rule: Start <= 14.864
P(Y=0)= 0.800
P(Y=1)= 0.200
Predicted Kyphosis Absent
Node 14: Cost = 0.000, N= 29, Level = 4
Rule: Start > 14.864
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis Absent
 
Predictions for test data:
Age Number Start Kyphosis
71 3 5 Absent
128 4 5 Present
1 4 15 Absent
61 6 10 Absent
113 2 16 Absent
82 5 14 Absent
148 3 16 Absent
1 4 12 Absent
1 3 16 Absent
175 5 13 Absent
 
Mean squared prediction error: 0.010000
Warning Errors
IMSLS_NO_SURROGATES
Use of surrogates is limited to method 1 (ALACART).
IMSLS_NO_CONVERGENCE
Convergence was not achieved.
Fatal Errors
IMSLS_INVALID_METHOD2
Choose a valid tree generation method; 0 (C4.5), 1 (ALACART), 2 (CHAID) or 3 (QUEST).
IMSLS_VALUE_GT_ZERO
The value of # must be strictly positive.