decision_tree

Generates a decision tree for a single response variable and two or more predictor variables.

Synopsis

#include <imsls.h>

Imsls_f_decision_tree* imsls_f_decision_tree (int n, int n_cols, float xy[], int response_col_idx, int var_type[], ..., 0)

The type double function is imsls_d_decision_tree.

Required Arguments

int n (Input)
The number of rows in xy.

int n_cols (Input)
The number of columns in xy.

float xy[] (Input)
Array of size n × n_cols containing the data.

int response_col_idx (Input)
Column index of the response variable.

int var_type[] (Input)
Array of length n_cols indicating the type of each variable.

var_type[i]   Type
0             Categorical
1             Ordered Discrete (Low, Med., High)
2             Quantitative or Continuous
3             Ignore this variable

 

Note: When the variable type (var_type) is specified as Categorical (0), the numbering of the categories must begin at 0. For example, if there are three categories, they must be represented as 0, 1 and 2 in the xy array.

The number of classes for a categorical response variable is determined by the largest value discovered in the data. To set this value in another way, see the optional arguments IMSLS_PRIORS or IMSLS_COST_MATRIX. Also note that a warning message is displayed if a class level in {0, 1, …, n_classes − 1} has a zero count in the data.
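For instance, if a categorical column arrives coded 1, 2, 3, shift it to 0, 1, 2 before calling the function. A minimal sketch (the column index cat_col is hypothetical; n, n_cols, and xy are as defined above), mirroring the recoding used in Examples 5 and 7 below:

/* Shift a 1-based categorical column of xy to the 0-based coding
   required by imsls_f_decision_tree. */
int i, cat_col = 0;  /* hypothetical index of the categorical column */
for (i = 0; i < n; i++) {
    xy[i * n_cols + cat_col] -= 1.0f;  /* recode 1, 2, 3 as 0, 1, 2 */
}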

Return Value

A pointer to a structure of type Imsls_f_decision_tree. If an error occurs, NULL is returned.

Synopsis with Optional Arguments

#include <imsls.h>

Imsls_f_decision_tree *imsls_f_decision_tree (int n, int n_cols, float xy[], int response_col_idx, int var_type[],
IMSLS_METHOD, int method,
IMSLS_CRITERIA, int criteria,
IMSLS_RATIO,
IMSLS_WEIGHTS, float weights[],
IMSLS_COST_MATRIX, int n_classes, float cost_matrix[],
IMSLS_CONTROL, int params[],
IMSLS_COMPLEXITY, float complexity,
IMSLS_N_SURROGATES, int n_surrogates,
IMSLS_ALPHAS, float alphas[],
IMSLS_PRIORS, int n_classes, float priors[],
IMSLS_N_FOLDS, int n_folds,
IMSLS_N_SAMPLE, int n_samples,
IMSLS_RANDOM_FEATURES,
IMSLS_N_RANDOM_FEATURES, int n_features,
IMSLS_TOLERANCE, float tol,
IMSLS_RANDOM_SEED, int seed,
IMSLS_PRINT, int print_level,
IMSLS_TEST_DATA, int n_test, float xy_test[],
IMSLS_TEST_DATA_WEIGHTS, float weights_test[],
IMSLS_ERROR_SS, float *pred_err_ss,
IMSLS_PREDICTED, float **predicted,
IMSLS_PREDICTED_USER, float predicted[],
IMSLS_CLASS_ERROR, float **class_errors,
IMSLS_CLASS_ERROR_USER, float class_errors[],
IMSLS_MEAN_ERROR, float *mean_error,
IMSLS_OUT_OF_BAG_PREDICTED, float **oob_predicted,
IMSLS_OUT_OF_BAG_PREDICTED_USER, float oob_predicted[],
IMSLS_OUT_OF_BAG_MEAN_ERROR, float *out_of_bag_mean_error,
IMSLS_OUT_OF_BAG_CLASS_ERROR, float **out_of_bag_class_errors,
IMSLS_OUT_OF_BAG_CLASS_ERROR_USER, float out_of_bag_class_errors[],
IMSLS_OUT_OF_BAG_VAR_IMPORTANCE, float **out_of_bag_var_importance,
IMSLS_OUT_OF_BAG_VAR_IMPORTANCE_USER, float out_of_bag_var_importance[],
IMSLS_RETURN_TREES, Imsls_f_decision_tree ***bagged_trees,
0)

Optional Arguments

IMSLS_METHOD, int method (Input)
Specifies the tree generation method. The key for the variable type index is provided above.

method   Method                     Response var_type   Predictor var_type
0        C4.5                       0                   0, 1, 2
1        ALACART (Breiman et al.)   0, 1, 2             0, 1, 2
2        CHAID                      0, 1, 2             0
3        QUEST                      0                   0, 1, 2

Default: method = 0.

IMSLS_CRITERIA, int criteria (Input)
Specifies which criteria the ALACART method and the C4.5 method should use in the gain calculations to determine the best split at each node.

criteria   Measure
0          Shannon Entropy
1          Gini Index
2          Deviance

Default: criteria = 0.

Shannon Entropy – A measure of randomness or uncertainty.
For a categorical variable having C distinct values over the data set S, the Shannon Entropy is defined as

E(S) = −Σ pi log(pi),  summed over i = 1, …, C,

where

pi = Pr(Y = i)

and where

pi log(pi) := 0 if pi = 0.

Gini Index – A measure of statistical dispersion.
For a categorical variable having C distinct values over the data set S, the Gini Index is defined as

I(S) = Σ p(i|S) p(j|S),  summed over all pairs i ≠ j,  which equals 1 − Σ p(i|S)²,

where p(i|S) denotes the probability that the variable is equal to the state i on the data set, S.

Deviance – A measure of the quality of fit.
For a categorical variable having C distinct values over a data set S, the Deviance measure is

D(S) = −2 Σ ni log(pi),  summed over i = 1, …, C,

where

pi = Pr(Y = i)

and where ni is the number of cases with Y = i on the node.
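For example, for a node holding 14 cases of a binary response with 9 cases in class 1 and 5 in class 0 (the root node of the golf data in Example 1), and using base-2 logarithms for the entropy (the choice of base only rescales the measure):

p0 = 5/14 ≈ 0.357,  p1 = 9/14 ≈ 0.643

E(S) = −p0 log2(p0) − p1 log2(p1) ≈ 0.940

I(S) = 1 − p0² − p1² ≈ 0.459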

IMSLS_RATIO, (Input)
If present, the ALACART method and C4.5 method each uses a gain ratio instead of just the gain to determine the best split.
Default: Uses gain.

IMSLS_WEIGHTS, float weights[] (Input)
An array of length n containing case weights.
Default: weights[i] = 1.0.

IMSLS_COST_MATRIX, int n_classes, float  cost_matrix[] (Input)
An array of length n_classes x n_classes containing the cost matrix for a categorical response variable, where n_classes is the number of classes the response variable may assume. The cost matrix has elements C(i, j) = cost of misclassifying a response in class j as in class i. The diagonal elements of the cost matrix must be 0.
Default: cost_matrix[i] = 1.0, for i on the off-diagonal, cost_matrix[i] = 0.0, for i on the diagonal.
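For example, a minimal sketch (n, n_cols, xy, response_col_idx, and var_type assumed declared as in the Synopsis; the three-class cost values are hypothetical) of supplying an asymmetric cost matrix in which misclassifying an actual class 2 case is penalized more heavily:

/* Hypothetical 3 x 3 cost matrix: C(i, j) = cost of predicting class i
   when the actual class is j. The diagonal must be 0. */
float cost_matrix[3 * 3] = {
    0.0, 1.0, 4.0,
    1.0, 0.0, 4.0,
    1.0, 1.0, 0.0
};
Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_COST_MATRIX, 3, cost_matrix,
    0);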

IMSLS_CONTROL, int params[] (Input)
Array of length 5 containing parameters to control the maximum size of the tree and other stopping rules.

params[i]   Name         Action
0           min_n_node   Do not split a node if one of its child nodes will have fewer than min_n_node observations.
1           min_split    Do not split a node if the node has fewer than min_split observations.
2           max_x_cats   Allow up to max_x_cats categories for categorical predictor variables.
3           max_size     Stop growing the tree once it has reached max_size number of nodes.
4           max_depth    Stop growing the tree once it has reached max_depth number of levels.

Default: params[] = {7, 21, 10, 100, 10}

IMSLS_COMPLEXITY, float complexity (Input)
The minimum complexity parameter to use in cross-validation. The value of complexity must be ≥ 0.
Default: complexity = 0.0.

IMSLS_N_SURROGATES, int n_surrogates (Input)
Indicates the number of surrogate splits. Only used if method = 1.
Default: n_surrogates = 0.

IMSLS_ALPHAS, float  alphas[] (Input)
An array of length 3 containing the significance levels. alphas[0] = significance level for split variable selection (CHAID and QUEST); alphas[1]= significance level for merging categories of a variable (CHAID), and alphas[2] = significance level for splitting previously merged categories (CHAID). Valid values are in the range 0 < alphas[i] < 1.0, and alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.
Default: alphas[] = {0.05, 0.05, -1.0}
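For example, a minimal sketch (n, n_cols, xy, response_col_idx, and var_type assumed declared as in the Synopsis; the significance levels shown are illustrative) of running CHAID with stricter thresholds and with splitting of merged categories enabled:

/* alphas[0]: split variable selection, alphas[1]: merging categories,
   alphas[2]: splitting previously merged categories (must be <= alphas[1]). */
float alphas[3] = {0.01, 0.01, 0.005};
Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_METHOD, 2,          /* 2 = CHAID */
    IMSLS_ALPHAS, alphas,
    0);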

IMSLS_PRIORS, int n_classes, float priors[] (Input)
An array of length n_classes, where n_classes is the number of classes the response variable may assume, containing prior probabilities for class membership. The argument is ignored for continuous response variables (var_type[response_col_idx]=2). By default, the prior probabilities are estimated from the data.

IMSLS_N_FOLDS, int n_folds (Input)
The number of folds to use in cross-validation tree selection. n_folds must be between 1 and n, inclusive. If n_folds = 1, the full data set is used once to generate the decision tree; in other words, no cross-validation is performed. If 1 < n/n_folds ≤ 3, then leave-one-out cross-validation is performed.
Default: n_folds = 10.

IMSLS_N_SAMPLE, int n_samples (Input)
The number of bootstrap samples to use in bootstrap aggregation (bagging) when predicted values are requested. To obtain predictions produced by bagging, set n_samples  >  0 and use one of IMSLS_PREDICTED or IMSLS_PREDICTED_USER.
Default: n_samples = 0 unless random features or out-of-bag calculations are requested; then, n_samples = 50.

IMSLS_RANDOM_FEATURES, (Input)
If present, the decision tree splitting rules at each node are decided from a random subset of predictors. Use this argument to generate a random forest for predictions. Use the argument IMSLS_N_SAMPLE to control the number of trees and IMSLS_N_RANDOM_FEATURES to control the number of random features.
Default: No random feature selection. The algorithms use all predictors in every selection.

IMSLS_N_RANDOM_FEATURES, int n_random_features (Input)
The number of predictors in each random subset from which to select during random feature selection, when it is activated.
Default: For a categorical response variable, n_random_features = the nearest integer ≤ √p or 1, whichever is larger; for a continuous response variable (regression), n_random_features = the nearest integer ≤ p/3 or 1, whichever is larger, where p = the number of predictor variables.

IMSLS_TOLERANCE, float tol (Input)
Error tolerance to use in the algorithm.
Default: tol = 100.0 * imsls_f_machine(4).

IMSLS_RANDOM_SEED, int seed (Input)
The seed of the random number generator used in sampling or cross-validation. Changing the value of seed on different calls to imsls_f_decision_tree with the same data set may produce slightly different results. Setting seed to zero forces the random number seed to be determined by the system clock.
Default: seed = 0

IMSLS_PRINT, int print_level (Input)

print_level   Action
0             No printing
1             Prints final results only
2             Prints intermediate and final results

Default: print_level = 0.

IMSLS_TEST_DATA, int  n_test, float  xy_test[] (Input)
xy_test is an array of size n_test × n_cols containing hold-out or test data for which predictions are requested. When this optional argument is present, the number of observations in xy_test, n_test, must be greater than 0. The response variable may have missing values in xy_test, but it must be in the same column, and the predictors must be in the same columns, as they are in xy. If the test data is not provided but predictions are requested, then xy is used and the predictions are the fitted values.

Default: xy_test = xy

IMSLS_TEST_DATA_WEIGHTS, float weights_test[] (Input)
An array of size n_test containing frequencies or weights for each observation in xy_test. This argument is ignored if IMSLS_TEST_DATA is not present.
Default: weights_test = weights.

IMSLS_ERROR_SS, float *pred_err_ss (Output)
The fitted data error mean sum of squares in the absence of test data (xy_test). When test data is provided, the prediction error mean sum of squares is returned.

IMSLS_PREDICTED, float **predicted (Output)
Address of a pointer to an array of length n containing the fitted or predicted value of the response variable for each case in the input data or test data, if provided.

IMSLS_PREDICTED_USER, float predicted[] (Output)
Storage for the array of the fitted or predicted value for each case is provided by the user.

IMSLS_CLASS_ERROR, float **class_errors (Output)
Address of a pointer to an array of length 2 × (n_classes + 1) containing classification errors for each level of the categorical response variable, along with the total occurrence in the input data or test data, and overall totals.
To illustrate, if class_errors[2*j] = 20, and class_errors[2*j+1] = 105, then we know there were 20 misclassifications of class level j out of a total of 105 in the data.
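For instance, a minimal sketch (assuming class_errors was returned via IMSLS_CLASS_ERROR, and that n_classes and <stdio.h> are available) of reading the per-class counts; the final pair holds the overall totals:

int j;
for (j = 0; j <= n_classes; j++) {
    /* class_errors[2*j]   = misclassified count for class j,
       class_errors[2*j+1] = total count for class j;
       the last pair (j == n_classes) holds the overall totals. */
    printf("class %d: %g errors out of %g\n",
        j, class_errors[2 * j], class_errors[2 * j + 1]);
}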

IMSLS_CLASS_ERROR_USER, float class_errors[] (Output)
Storage for the array of the class errors is provided by the user.

IMSLS_MEAN_ERROR, float *mean_error (Output)
The fitted data error mean sum of squares (continuous response) or misclassification percentage (categorical response) in the absence of test data (xy_test). When test data is provided, the prediction mean error is returned.

IMSLS_OUT_OF_BAG_PREDICTED, float **oob_predicted (Output)
Address of a pointer to an array of length n containing the out-of-bag predicted value of the response variable for every case in the input data when bagging is performed.

IMSLS_OUT_OF_BAG_PREDICTED_USER, float oob_predicted[] (Output)
Storage for the array of the out-of-bag predicted values is provided by the user.

IMSLS_OUT_OF_BAG_MEAN_ERROR, float *out_of_bag_mean_error (Output)
The out-of-bag predictive error mean sum of squares (continuous response) or misclassification percentage (categorical response) on the input data, when bagging is performed.

IMSLS_OUT_OF_BAG_CLASS_ERROR, float **out_of_bag_class_errors (Output)
Address of a pointer to an array of length 2 × (n_classes + 1) containing out-of-bag classification errors for each level of the categorical response variable, along with the total occurrence in the input data, and overall totals. This output is populated only when bagging (n_samples > 1) is requested.

IMSLS_OUT_OF_BAG_CLASS_ERROR_USER, float out_of_bag_class_errors[] (Output)
Storage for the array of the out-of-bag class errors is provided by the user.

IMSLS_OUT_OF_BAG_VAR_IMPORTANCE, float **out_of_bag_var_importance (Output)
Address of a pointer to an array of length n_preds containing the out-of-bag variable importance measure for each predictor. This output is populated only when bagging (n_samples > 1) is requested.

IMSLS_OUT_OF_BAG_VAR_IMPORTANCE_USER, float out_of_bag_var_importance[] (Output)
Storage for the array of the out-of-bag variable importance is provided by the user.

IMSLS_RETURN_TREES, Imsls_f_decision_tree ***bagged_trees (Output)
Address of a pointer to an array of length n_samples containing the collection of trees generated during the algorithm. To release this space, use imsls_f_bagged_trees_free.

Description

This implementation includes four of the most widely used algorithms for decision trees. Below is a brief summary of each approach.

C4.5

The method C4.5 (Quinlan, 1995) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps outlined above, using as splitting criterion the information gain or gain ratio. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as

E(S) = −Σ pi log(pi),  summed over i = 1, …, C,

where pi = Pr[Y = i|S] is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon Entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by

E(S, X) = Σ Pr[X = k|S] E(Sk),  summed over k = 1, …, K,

where Sk denotes the subset of S on which X = k.

If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:

g(S, X) = E(S) − E(S, X)

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:

gainRatio(S, X) = g(S, X)/EX(S),

where

EX(S) = −Σ νk log(νk),  summed over k = 1, …, K,

with

νk = Pr[X = k|S].

Note that EX(S) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form X ≤ d and X > d are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. The growth stops after one or more stopping conditions are met.
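For example, for the golf data of Example 1 (base-2 logarithms), the Outlook split partitions the 14 cases into Sunny (2 Play, 3 Don't Play), Overcast (4 Play), and Rainy (3 Play, 2 Don't Play), so that

E(S) ≈ 0.940

E(S, Outlook) = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) ≈ 0.694

g(S, Outlook) = E(S) − E(S, Outlook) ≈ 0.247

and Outlook is indeed the first split chosen by C4.5 in that example.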

ALACART

ALACART implements the method of Breiman, Friedman, Olshen and Stone (1984), the original authors and developers of CART™. CART™ is the trademarked name for Classification and Regression Trees. In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, splits into only two subsets are considered, e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C,D} is not.

For classification problems, ALACART uses a similar criterion to information gain called impurity. The method searches for a split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities

I(S) = φ(p(1|S), p(2|S), …, p(C|S))

The measure function φ() should be 0 for “pure” nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.

As only binary splits of a subset S are considered (S1, S2 such that S = S1 ∪ S2 and S1 ∩ S2 = ∅), the reduction in impurity when splitting S into S1, S2 is

ΔI = I(S) − q1 I(S1) − q2 I(S2),

where qj = Pr[Sj], j = 1, 2, is the node probability.

The gain criteria and the reduction in impurity ΔI are similar concepts and equivalent when I is entropy and when only binary splits are considered. Another popular measure for the impurity at a node is the Gini index, given by

I(S) = Σ p(i|S) p(j|S),  summed over all pairs i ≠ j,  which equals 1 − Σ p(i|S)².

If Y is an ordered response or continuous, the problem is a regression problem. ALACART generates the tree using the same steps, except that node-level measures or loss functions are the mean squared error (MSE) or mean absolute deviation (MAD) rather than node impurity measures.

CHAID

The third method is appropriate only for categorical or discrete ordered predictor variables. Due to Kass (1980), CHAID is an acronym for chi-square automatic interaction detection. At each node, as above, CHAID looks for the best splitting variable. The approach is as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X with the categories of Y. The least significant result is noted and, if a threshold is met, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests and determine if there is further merging possible. If a merged category consists of three or more of the original categories of X, CHAID calls for a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result meets a threshold, then the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth. Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories define the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), then the node is not split. This growing procedure continues until one or more stopping conditions are met.

QUEST

The fourth method, the QUEST algorithm (Loh and Shih, 1997), is appropriate for a categorical response variable and predictors of either categorical or quantitative type. For each categorical predictor, QUEST performs a multi-way chi-square test of association between the predictor and Y. For every continuous predictor, QUEST performs an ANOVA test to see if the means of the predictor vary among the groups of Y. Among these tests, the variable with the most significant result is selected as a potential splitting variable, say, Xj. If the p-value (adjusted for multiple tests) is less than the specified splitting threshold, then Xj is the splitting variable for the current node. If not, QUEST performs, for each continuous variable X, a Levene's test of homogeneity to see if the variance of X varies within the different groups of Y. Among these tests, we again find the predictor with the most significant result, say, Xi. If its p-value (adjusted for multiple tests) is less than the splitting threshold, Xi is the splitting variable. Otherwise, the node is not split.

Assuming a splitting variable is found, the next step is to determine how the variable should be split. If the selected variable Xj is continuous, a split point d is determined by quadratic discriminant analysis (QDA) of Xj into two populations determined by a binary partition of the response Y. The goal of this step is to group the classes of Y into two subsets or super classes, A and B. If there are only two classes in the response Y, the super classes are obvious. Otherwise, calculate the means and variances of Xj in each of the classes of Y. If the means are all equal, put the largest-sized class into group A and combine the rest to form group B. If they are not all equal, use a k-means clustering method (k = 2) on the class means to determine A and B.

Xj in A and in B is assumed to be normally distributed with estimated means x̄j|A and x̄j|B, and variances s²j|A and s²j|B, respectively. The quadratic discriminant is the partition Xj ≤ d and Xj > d such that Pr(Xj, A) = Pr(Xj, B). The discriminant rule assigns an observation to A if xij ≤ d and to B if xij > d. For d to maximally discriminate, the probabilities must be equal.

If the selected variable Xj is categorical, it is first transformed using the method outlined in Loh and Shih (1997) and then QDA is performed as above. The transformation is related to the discriminant coordinate (CRIMCOORD) approach due to Gnanadesikan (1977).

Minimal-Cost Complexity Pruning

One way to address overfitting is to grow the tree as large as possible, and then use some logic to prune it back. Let T represent a decision tree generated by any of the methods above. The idea (from Breiman, et. al.) is to find the smallest sub-tree of T that minimizes the cost complexity measure:

Rδ(T) = R(T) + δ|T̃|,

where T̃ denotes the set of terminal nodes of T, |T̃| represents the number of terminal nodes, and δ ≥ 0 is a cost-complexity parameter. For a categorical target variable,

R(T) = Σ R(t),  summed over the terminal nodes t ∈ T̃,  with  R(t) = p(t) min over i of [Σj C(i|j) p(j|t)],

where

p(t) = Pr[x ∈ t],

p(j|t) = Pr[y = j | x ∈ t],

and C(i|j) is the cost for misclassifying the actual class j as i. Note that C(j|j) = 0 and C(i|j) > 0, for i ≠ j.

When the target is continuous (and the problem is a regression problem), the metric is instead the mean squared error

R(T) = (1/n) Σ (t ∈ T̃) Σ (yi ∈ t) (yi − ȳ(t))²,

where ȳ(t) denotes the average response value in node t.

This software implements the optimal pruning algorithm (Algorithm 10.1, page 294) of Breiman et al. (1984). The result of the algorithm is a sequence of sub-trees Tmax ⊃ T1 ⊃ T2 ⊃ … ⊃ TM−1 ⊃ {t0} obtained by pruning the fully generated tree, Tmax, until the sub-tree consists of the single root node, {t0}. Corresponding to the sequence of sub-trees is the sequence of complexity values, 0 ≤ δmin = δ1 < δ2 < … < δM−1 < δM, where M is the number of steps it takes in the algorithm to reach the root node. The sub-trees represent the optimally pruned sub-trees for the sequence of complexity values. The minimum complexity δmin can be set via the optional argument IMSLS_COMPLEXITY.

V-Fold Cross-Validation

In V-fold cross-validation, the training data is partitioned randomly into V approximately equally sized sub-samples η1, …, ηV. The model is then trained V different times, with each of the sub-samples removed in turn. The cross-validated estimate of the risk of a decision function dk is

RCV(dk) = (1/N) Σ (v = 1, …, V) Σ ((xi, yi) ∈ ηv) L(yi, dkv(xi)),

where L(y, d(x)) is the loss incurred when the decision is d(x) for the actual value y. The symbol η denotes the full training data set, dkv denotes the set of decisions corresponding to Tkv, the kth optimally pruned tree using the training sample η − ηv and the complexity value δ′k = √(δk δk+1), where δk and δk+1 come from the pruning on the full data set, η.

For example, if the problem is classification, priors are estimated from the data as Nj/N and, for any tree T, the cross-validated risk estimate can be written as

RCV(dk) = (1/N) Σi Σj C(i|j) Nij,

where Nij, accumulated over the V sub-samples,

is the overall number of j cases misclassified as i. The standard error of RCV(dk) is approximated with

SE(RCV(dk)) = √(s²/N),

where

s² = (1/N) Σ (i = 1, …, N) [L(yi, dk(xi)) − RCV(dk)]²,

the “sample variance” of the cross-validated estimates.

Final selection rules include:

1. Select k* such that RCV(dk*) = min over k of RCV(dk).
2. Select k2 such that k2 is the largest k satisfying RCV(dk2) ≤ RCV(dk*) + SE(RCV(dk*)).
3. For a specified complexity parameter δfin, select k3 such that δk3 ≤ δfin < δk3+1.
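For example, a minimal sketch (data arguments assumed declared as in the Synopsis) of requesting 5-fold cross-validation with a minimum cost-complexity value of 0.01 for the pruning step:

Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_N_FOLDS, 5,          /* 5-fold cross-validation */
    IMSLS_COMPLEXITY, 0.01,    /* minimum cost-complexity parameter */
    0);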

Bagging

Bagging is a resampling approach for generating predictions. In particular, bagging stands for bootstrap aggregating. In the procedure, m bootstrap samples of size n are drawn from the training set of size n. Bootstrap sampling is sampling with replacement so that almost every sample has repeated observations, and some observations will be left out. A decision tree is grown treating each sample as a separate training set. The tree is then used to generate predictions for the test data. For each test case, the m predictions are combined by averaging the output (regression) or voting (classification) to obtain a final prediction.

The bagged predictions are generated for the test data if test data is provided. Otherwise, the bagged predictions are generated for the input data (training data). The out-of-bag predictions are available outputs as well, but these are always for the input data. An out-of-bag prediction of a particular observation is combined using only those bootstrap samples that do not include that observation.

Bagging leads to "improvements for unstable procedures," such as neural nets, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).
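For example, a minimal sketch (data arguments and a test set xy_test of n_test rows assumed declared as described in the Synopsis and IMSLS_TEST_DATA) of requesting bagged predictions from 25 bootstrap samples:

float *predictions = NULL;
float mean_error;
Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_N_SAMPLE, 25,                /* 25 bootstrap samples (bagging) */
    IMSLS_TEST_DATA, n_test, xy_test,
    IMSLS_PREDICTED, &predictions,     /* bagged predictions for xy_test */
    IMSLS_MEAN_ERROR, &mean_error,
    0);
/* ... use predictions ... */
imsls_free(predictions);
imsls_f_decision_tree_free(tree);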

Random Trees

A random forest is an ensemble of decision trees. Like bootstrap aggregation, a tree is fit to each of m bootstrap samples from the training data. Each tree is then used to generate predictions. For a regression problem (continuous response variable), the m predictions are combined into a single predicted value by averaging. For classification (categorical response variable), majority vote is used.

A random forest also randomizes the predictors. That is, in every tree, the splitting variable at every node is selected from a random subset of the predictors. Randomization of the predictors reduces correlation among individual trees. The random forest was invented by Leo Breiman in 2001 (Breiman, 2001). Random Forests™ is the trademark term for this approach. Also see Hastie, Tibshirani, and Friedman, 2009, for further discussion.

To generate predictions or fitted values using a random forest, use the optional argument, IMSLS_RANDOM_FEATURES. The number of trees is equivalent to the number of bootstrap samples and can be set using IMSLS_N_SAMPLE. The number of random features can also be set using an optional argument.

Missing Values

Any observation or case with a missing response variable is eliminated from the analysis. If a predictor has a missing value, each algorithm will skip that case when evaluating the given predictor. When making a prediction for a new case, if the split variable is missing, the prediction function applies surrogate split-variables and splitting rules in turn, if they are estimated with the decision tree. Otherwise, the prediction function returns the prediction from the most recent non-terminal node. In this implementation, only ALACART estimates surrogate split variables when requested.
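For example, a minimal sketch (data arguments assumed declared as in the Synopsis) of requesting two surrogate splits with the ALACART method, so that predictions can fall back on surrogate variables when the primary split variable is missing:

Imsls_f_decision_tree *tree = imsls_f_decision_tree(n, n_cols, xy,
    response_col_idx, var_type,
    IMSLS_METHOD, 1,         /* 1 = ALACART, the only method that estimates surrogates */
    IMSLS_N_SURROGATES, 2,   /* search for two surrogate splits at each node */
    0);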

Structure Definitions

 

Table 1 – Structure Imsls_f_decision_tree

Name             Type                 Description
n_classes        int                  Number of classes assumed by the response variable, if the response variable is categorical
n_levels         int                  Number of levels or depth of tree
n_nodes          int                  Number of nodes or size of tree
nodes            Imsls_f_tree_node*   Pointer to an array of tree_node structures of size n_nodes
n_preds          int                  Number of predictors used in the model
n_surrogates     int                  Number of surrogate splits searched for at each node. Available for method = 1
pred_type        int*                 Pointer to an array of length n_preds containing the type of each predictor variable
pred_n_values    int*                 Pointer to an array of length n_preds containing the number of values of each predictor variable
response_type    int                  Type of the response variable
terminal_nodes   int*                 Pointer to an array of length n_nodes indicating which nodes are terminal nodes

 

 

Table 2 – Structure tree_node

Name               Type     Description
children_ids       int*     Pointer to an array of length n_children containing the IDs of the children nodes
cost               float    Misclassification cost (in-sample cost measure at the current node)
n_cases            int      Number of cases of the training data that fall into the current node
n_children         int      Number of children of the current node
node_id            int      Node ID, where the root node corresponds to node_id = 0
node_prob          float*   Estimate of the probability of a new case belonging to the node
node_split_value   float    Value around which the node will be split, if the node variable is of continuous type
node_values_ind    int*     Values of the split variable for the current node, if node_var_id has type 0 or 1
node_var_id        int      ID of the variable that defined the split in the parent node
parent_id          int      ID of the parent of the node with ID node_id
predicted_class    int      Predicted class at the current node, for response variables of categorical type
predicted_val      float    Predicted value of the response variable, if the response variable is of continuous type
surrogate_info     float*   Array containing the surrogate split information
y_probs            float*   Pointer to the array of class probabilities at the current node, if the response variable is of categorical type
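For example, a minimal sketch (assuming tree was returned by imsls_f_decision_tree for a categorical response, as in Example 1, that <stdio.h> is included, and that a nonzero entry of terminal_nodes marks a terminal node) of walking the node array using the fields above:

int i;
for (i = 0; i < tree->n_nodes; i++) {
    printf("node %d: parent %d, N = %d, %s, predicted class %d\n",
        tree->nodes[i].node_id,
        tree->nodes[i].parent_id,
        tree->nodes[i].n_cases,
        tree->terminal_nodes[i] ? "terminal" : "internal",
        tree->nodes[i].predicted_class);
}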

Examples

Example 1

In this example, we use a small data set with response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (True (0), False (1)). Decision trees are generated by the C4.5 and ALACART methods. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. The maximal trees are printed out using imsls_f_decision_tree_print. Notice that C4.5 splits on Outlook, then Humidity and Wind, while ALACART splits on Outlook, then Temperature.

 

#include <imsls.h>

#include <stdio.h>

 

int main(){

 

float xy[] =

{

0, 85, 85, 0, 0,

0, 80, 90, 1, 0,

1, 83, 78, 0, 1,

2, 70, 96, 0, 1,

2, 68, 80, 0, 1,

2, 65, 70, 1, 0,

1, 64, 65, 1, 1,

0, 72, 95, 0, 0,

0, 69, 70, 0, 1,

2, 75, 80, 0, 1,

0, 75, 70, 1, 1,

1, 72, 90, 1, 1,

1, 81, 75, 0, 1,

2, 71, 80, 1, 0

};

 

int n = 14;

int ncols = 5;

int response_col_idx = 4;

int method = 1;

int var_type[] = {0, 2, 2, 0, 0};

int control[] = {2, 3, 10, 50, 10};

 

const char* names[] = {"Outlook", "Temperature", "Humidity", "Wind",

"Play"};

const char* class_names[] = {"Don't Play", "Play"};

const char* var_levels[] = {"Sunny", "Overcast", "Rainy", "False", "True"};

 

Imsls_f_decision_tree *tree = NULL;

 

tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,

IMSLS_N_FOLDS, 1,

IMSLS_CONTROL, control,

0);

 

printf("Decision Tree using Method C4.5:\n\n");

imsls_f_decision_tree_print(tree,

IMSLS_VAR_NAMES, names,

IMSLS_CLASS_NAMES, class_names,

IMSLS_CATEG_NAMES, var_levels,

0);

 

imsls_f_decision_tree_free(tree);

 

tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,

IMSLS_N_FOLDS, 1,

IMSLS_METHOD, method,

IMSLS_CONTROL, control,

0);

 

printf("Decision Tree using Method ALACART:\n\n");

imsls_f_decision_tree_print(tree,

IMSLS_VAR_NAMES, names,

IMSLS_CLASS_NAMES, class_names,

IMSLS_CATEG_NAMES, var_levels,

0);

 

imsls_f_decision_tree_free(tree);

}

 

Output

Decision Tree using Method C4.5:

 

 

Decision Tree:

 

Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 4 5

P(Y=0)= 0.357

P(Y=1)= 0.643

Predicted Y: Play

Node 1: Cost = 0.143, N= 5, Level = 1, Child nodes: 2 3

Rule: Outlook in: { Sunny }

P(Y=0)= 0.600

P(Y=1)= 0.400

Predicted Y: Don't Play

Node 2: Cost = 0.000, N= 2, Level = 2

Rule: Humidity <= 77.500

P(Y=0)= 0.000

P(Y=1)= 1.000

Predicted Y: Play

Node 3: Cost = 0.000, N= 3, Level = 2

Rule: Humidity > 77.500

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Y: Don't Play

Node 4: Cost = 0.000, N= 4, Level = 1

Rule: Outlook in: { Overcast }

P(Y=0)= 0.000

P(Y=1)= 1.000

Predicted Y: Play

Node 5: Cost = 0.143, N= 5, Level = 1, Child nodes: 6 7

Rule: Outlook in: { Rainy }

P(Y=0)= 0.400

P(Y=1)= 0.600

Predicted Y: Play

Node 6: Cost = 0.000, N= 3, Level = 2

Rule: Wind in: { False }

P(Y=0)= 0.000

P(Y=1)= 1.000

Predicted Y: Play

Node 7: Cost = 0.000, N= 2, Level = 2

Rule: Wind in: { True }

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Y: Don't Play

Decision Tree using Method ALACART:

 

 

Decision Tree:

 

Node 0: Cost = 0.357, N= 14, Level = 0, Child nodes: 1 8

P(Y=0)= 0.357

P(Y=1)= 0.643

Predicted Y: Play

Node 1: Cost = 0.357, N= 10, Level = 1, Child nodes: 2 7

Rule: Outlook in: { Sunny Rainy }

P(Y=0)= 0.500

P(Y=1)= 0.500

Predicted Y: Don't Play

Node 2: Cost = 0.214, N= 8, Level = 2, Child nodes: 3 6

Rule: Temperature <= 77.500

P(Y=0)= 0.375

P(Y=1)= 0.625

Predicted Y: Play

Node 3: Cost = 0.214, N= 6, Level = 3, Child nodes: 4 5

Rule: Temperature <= 73.500

P(Y=0)= 0.500

P(Y=1)= 0.500

Predicted Y: Don't Play

Node 4: Cost = 0.071, N= 4, Level = 4

Rule: Temperature <= 70.500

P(Y=0)= 0.250

P(Y=1)= 0.750

Predicted Y: Play

Node 5: Cost = 0.000, N= 2, Level = 4

Rule: Temperature > 70.500

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Y: Don't Play

Node 6: Cost = 0.000, N= 2, Level = 3

Rule: Temperature > 73.500

P(Y=0)= 0.000

P(Y=1)= 1.000

Predicted Y: Play

Node 7: Cost = 0.000, N= 2, Level = 2

Rule: Temperature > 77.500

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Y: Don't Play

Node 8: Cost = 0.000, N= 4, Level = 1

Rule: Outlook in: { Overcast }

P(Y=0)= 0.000

P(Y=1)= 1.000

Predicted Y: Play

Example 2

This example applies the QUEST method to a simulated data set with 50 cases and three predictors of mixed-type. A maximally grown tree under the default controls and the optimally pruned sub-tree obtained from cross-validation and minimal cost complexity pruning are produced. Notice that the optimally pruned tree consists of just the root node, whereas the maximal tree has five nodes and three levels.

 

#include <imsls.h>

#include <stdio.h>

 

int main(){

float xy[50*4] =

{

2, 25.928690, 0, 0,

1, 51.632450, 1, 1,

1, 25.784321, 0, 2,

0, 39.379478, 0, 3,

2, 24.650579, 0, 2,

2, 45.200840, 0, 2,

2, 52.679600, 1, 3,

1, 44.283421, 1, 3,

2, 40.635231, 1, 3,

2, 51.760941, 0, 3,

2, 26.303680, 0, 1,

2, 20.702299, 1, 0,

2, 38.742729, 1, 3,

2, 19.473330, 0, 0,

1, 26.422110, 0, 0,

2, 37.059860, 1, 0,

1, 51.670429, 1, 3,

0, 42.401562, 0, 3,

2, 33.900269, 1, 2,

1, 35.432819, 0, 0,

1, 44.303692, 0, 1,

0, 46.723869, 0, 2,

1, 46.992619, 0, 2,

0, 36.059231, 0, 3,

2, 36.831970, 1, 1,

1, 61.662571, 1, 2,

0, 25.677139, 0, 3,

1, 39.085670, 1, 0,

0, 48.843410, 1, 1,

1, 39.343910, 0, 3,

2, 24.735220, 0, 2,

1, 50.552509, 1, 3,

0, 31.342630, 1, 3,

1, 27.157949, 1, 0,

0, 31.726851, 0, 2,

0, 25.004080, 0, 3,

1, 26.354570, 1, 3,

2, 38.123428, 0, 1,

0, 49.940300, 0, 2,

1, 42.457790, 1, 3,

0, 38.809479, 1, 1,

0, 43.227989, 1, 1,

0, 41.876240, 0, 3,

2, 48.078201, 0, 2,

0, 43.236729, 1, 0,

2, 39.412941, 0, 3,

1, 23.933460, 0, 2,

2, 42.841301, 1, 3,

2, 30.406691, 0, 1,

0, 37.773891, 0, 2

};

 

int n = 50;

int ncols = 4;

int method = 3;

int var_type[] = {0, 2, 0, 0};

int response_col_idx = 3;

 

Imsls_f_decision_tree *tree = NULL;

 

tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,

IMSLS_METHOD, method,

IMSLS_RANDOM_SEED, 123457,

IMSLS_PRINT, 1,

0);

 

printf("\nMaximal tree: \n\n");

imsls_f_decision_tree_print(tree,

IMSLS_PRINT_MAX,

0);

 

printf("\nOptimally pruned subtree: \n\n");

imsls_f_decision_tree_print(tree, 0);

 

imsls_f_decision_tree_free(tree);

}

Output

Growing the maximal tree using method QUEST:

 

Cross-Validation:

Tree complexity CV-err CV-Std.Error

0 0.00000 0.70406 0.08044

1 0.02000 0.72641 0.08562

2 0.04000 0.72814 0.08598

 

Select tree number 2, cost complexity parameter = 0.04000:

 

Maximal tree:

 

 

Decision Tree:

 

Node 0: Cost = 0.620, N= 50, Level = 0, Child nodes: 1 2

P(Y=0)= 0.180

P(Y=1)= 0.180

P(Y=2)= 0.260

P(Y=3)= 0.380

Predicted Y: 3

Node 1: Cost = 0.220, N= 17, Level = 1

Rule: X1 <= 35.031

P(Y=0)= 0.294

P(Y=1)= 0.118

P(Y=2)= 0.353

P(Y=3)= 0.235

Predicted Y: 2

Node 2: Cost = 0.360, N= 33, Level = 1, Child nodes: 3 4

Rule: X1 > 35.031

P(Y=0)= 0.121

P(Y=1)= 0.212

P(Y=2)= 0.212

P(Y=3)= 0.455

Predicted Y: 3

Node 3: Cost = 0.180, N= 19, Level = 2

Rule: X1 <= 43.265

P(Y=0)= 0.211

P(Y=1)= 0.211

P(Y=2)= 0.053

P(Y=3)= 0.526

Predicted Y: 3

Node 4: Cost = 0.160, N= 14, Level = 2

Rule: X1 > 43.265

P(Y=0)= 0.000

P(Y=1)= 0.214

P(Y=2)= 0.429

P(Y=3)= 0.357

Predicted Y: 2

 

Optimally pruned subtree:

 

 

Decision Tree:

 

Node 0: Cost = 0.620, N= 50, Level = 0

P(Y=0)= 0.180

P(Y=1)= 0.180

P(Y=2)= 0.260

P(Y=3)= 0.380

Predicted Y: 3

Pruned at Node id 0.

Example 3

This example uses the dataset Kyphosis. The 81 cases represent 81 children who have undergone surgery to correct a type of spinal deformity known as Kyphosis. The response variable is the presence or absence of Kyphosis after the surgery. The three predictors are Age of the patient in months; Start, the number of the vertebra where the surgery started; and Number, the number of vertebrae involved in the surgery. This example uses the method QUEST to produce a maximal tree. It also requests predictions for a test data set consisting of 10 “new” cases.

 

#include <imsls.h>

#include <stdio.h>

 

int main()

{

float xy[81*4] =

{

0, 71, 3, 5,

0, 158, 3, 14,

1, 128, 4, 5,

0, 2, 5, 1,

0, 1, 4, 15,

0, 1, 2, 16,

0, 61, 2, 17,

0, 37, 3, 16,

0, 113, 2, 16,

1, 59, 6, 12,

1, 82, 5, 14,

0, 148, 3, 16,

0, 18, 5, 2,

0, 1, 4, 12,

0, 168, 3, 18,

0, 1, 3, 16,

0, 78, 6, 15,

0, 175, 5, 13,

0, 80, 5, 16,

0, 27, 4, 9,

0, 22, 2, 16,

1, 105, 6, 5,

1, 96, 3, 12,

0, 131, 2, 3,

1, 15, 7, 2,

0, 9, 5, 13,

0, 8, 3, 6,

0, 100, 3, 14,

0, 4, 3, 16,

0, 151, 2, 16,

0, 31, 3, 16,

0, 125, 2, 11,

0, 130, 5, 13,

0, 112, 3, 16,

0, 140, 5, 11,

0, 93, 3, 16,

0, 1, 3, 9,

1, 52, 5, 6,

0, 20, 6, 9,

1, 91, 5, 12,

1, 73, 5, 1,

0, 35, 3, 13,

0, 143, 9, 3,

0, 61, 4, 1,

0, 97, 3, 16,

1, 139, 3, 10,

0, 136, 4, 15,

0, 131, 5, 13,

1, 121, 3, 3,

0, 177, 2, 14,

0, 68, 5, 10,

0, 9, 2, 17,

1, 139, 10, 6,

0, 2, 2, 17,

0, 140, 4, 15,

0, 72, 5, 15,

0, 2, 3, 13,

1, 120, 5, 8,

0, 51, 7, 9,

0, 102, 3, 13,

1, 130, 4, 1,

1, 114, 7, 8,

0, 81, 4, 1,

0, 118, 3, 16,

0, 118, 4, 16,

0, 17, 4, 10,

0, 195, 2, 17,

0, 159, 4, 13,

0, 18, 4, 11,

0, 15, 5, 16,

0, 158, 5, 14,

0, 127, 4, 12,

0, 87, 4, 16,

0, 206, 4, 10,

0, 11, 3, 15,

0, 178, 4, 15,

1, 157, 3, 13,

0, 26, 7, 13,

0, 120, 2, 13,

1, 42, 7, 6,

0, 36, 4, 13

};

 

float xy_test[10*4] =

{

0, 71, 3, 5,

1, 128, 4, 5,

0, 1, 4, 15,

0, 61, 6, 10,

0, 113, 2, 16,

1, 82, 5, 14,

0, 148, 3, 16,

0, 1, 4, 12,

0, 1, 3, 16,

0, 175, 5, 13

};

 

int n = 81;

int ncols = 4;

int response_col_idx = 0;

int method = 3;

int control[] = {5, 10, 10, 50, 10};

int var_type[] = {0, 2, 2, 2};

 

int n_test = 10;

int i, idx;

 

float *predictions;

float pred_err_ss;

const char* names[] = {"Age", "Number", "Start"};

const char* class_names[] = {"Absent", "Present"};

const char* response_name[] = {"Kyphosis"};

Imsls_f_decision_tree *tree = NULL;

 

tree=imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,

IMSLS_METHOD, method,

IMSLS_N_FOLDS, 1,

IMSLS_CONTROL, control,

IMSLS_TEST_DATA, n_test, xy_test,

IMSLS_PRINT, 2,

IMSLS_PREDICTED, &predictions,

IMSLS_ERROR_SS, &pred_err_ss,

0);

 

imsls_f_decision_tree_print(tree,

IMSLS_RESP_NAME, response_name,

IMSLS_VAR_NAMES, names,

IMSLS_CLASS_NAMES, class_names,

0);

 

printf("\nPredictions for test data:\n");

printf("%5s%8s%7s%10s\n", names[0], names[1], names[2],

response_name[0]);

for(i=0;i<n_test;i++){

printf("%5.0f%8.0f%7.0f",

xy_test[i*ncols+1],

xy_test[i*ncols+2],

xy_test[i*ncols+3]);

idx = (int)predictions[i];

printf("%10s\n", class_names[idx]);

}

printf("\nMean squared prediction error: %f\n", pred_err_ss);

 

imsls_f_decision_tree_free(tree);

imsls_free(predictions);

}

Output

The response variable has 0 missing values.

Growing the maximal tree using method QUEST:

Node 2 is a terminal node. It has 7 cases--too few cases to split.

Node 3 is a terminal node. It has 6 cases--too few cases to split.

Node 5 is a terminal node. It has 6 cases--too few cases to split.

Node 8 is an terminal node. The split is too thin having count 2.

Node 10 is a terminal node. It has 6 cases--too few cases to split.

Node 11 is a terminal node, because it is pure.

Node 11 is a terminal node. It has 7 cases--too few cases to split.

Node 13 is a terminal node. It has 5 cases--too few cases to split.

Node 14 is a terminal node, because it is pure.

 

Decision Tree:

 

Node 0: Cost = 0.210, N= 81, Level = 0, Child nodes: 1 4

P(Y=0)= 0.790

P(Y=1)= 0.210

Predicted Kyphosis Absent

Node 1: Cost = 0.074, N= 13, Level = 1, Child nodes: 2 3

Rule: Start <= 5.155

P(Y=0)= 0.538

P(Y=1)= 0.462

Predicted Kyphosis Absent

Node 2: Cost = 0.025, N= 7, Level = 2

Rule: Age <= 84.030

P(Y=0)= 0.714

P(Y=1)= 0.286

Predicted Kyphosis Absent

Node 3: Cost = 0.025, N= 6, Level = 2

Rule: Age > 84.030

P(Y=0)= 0.333

P(Y=1)= 0.667

Predicted Kyphosis Present

Node 4: Cost = 0.136, N= 68, Level = 1, Child nodes: 5 6

Rule: Start > 5.155

P(Y=0)= 0.838

P(Y=1)= 0.162

Predicted Kyphosis Absent

Node 5: Cost = 0.012, N= 6, Level = 2

Rule: Start <= 8.862

P(Y=0)= 0.167

P(Y=1)= 0.833

Predicted Kyphosis Present

Node 6: Cost = 0.074, N= 62, Level = 2, Child nodes: 7 12

Rule: Start > 8.862

P(Y=0)= 0.903

P(Y=1)= 0.097

Predicted Kyphosis Absent

Node 7: Cost = 0.062, N= 28, Level = 3, Child nodes: 8 9

Rule: Start <= 13.092

P(Y=0)= 0.821

P(Y=1)= 0.179

Predicted Kyphosis Absent

Node 8: Cost = 0.025, N= 15, Level = 4

Rule: Age <= 91.722

P(Y=0)= 0.867

P(Y=1)= 0.133

Predicted Kyphosis Absent

Node 9: Cost = 0.037, N= 13, Level = 4, Child nodes: 10 11

Rule: Age > 91.722

P(Y=0)= 0.769

P(Y=1)= 0.231

Predicted Kyphosis Absent

Node 10: Cost = 0.037, N= 6, Level = 5

Rule: Number <= 3.450

P(Y=0)= 0.500

P(Y=1)= 0.500

Predicted Kyphosis Absent

Node 11: Cost = 0.000, N= 7, Level = 5

Rule: Number > 3.450

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Kyphosis Absent

Node 12: Cost = 0.012, N= 34, Level = 3, Child nodes: 13 14

Rule: Start > 13.092

P(Y=0)= 0.971

P(Y=1)= 0.029

Predicted Kyphosis Absent

Node 13: Cost = 0.012, N= 5, Level = 4

Rule: Start <= 14.864

P(Y=0)= 0.800

P(Y=1)= 0.200

Predicted Kyphosis Absent

Node 14: Cost = 0.000, N= 29, Level = 4

Rule: Start > 14.864

P(Y=0)= 1.000

P(Y=1)= 0.000

Predicted Kyphosis Absent

 

Predictions for test data:

Age Number Start Kyphosis

71 3 5 Absent

128 4 5 Present

1 4 15 Absent

61 6 10 Absent

113 2 16 Absent

82 5 14 Absent

148 3 16 Absent

1 4 12 Absent

1 3 16 Absent

175 5 13 Absent

 

Mean squared prediction error: 0.010000

Example 4

For the Kyphosis dataset of Example 3, this example produces random forest predictions using the optional arguments for random feature selection.

 

#include <imsls.h>

#include <stdio.h>

 

#define NOBS 81

#define NCLASSES 2

#define NPREDS 3

#define NTEST 10

 

int main(){

float xy[81 * 4] = {

0, 71, 3, 5,

0, 158, 3, 14,

1, 128, 4, 5,

0, 2, 5, 1,

0, 1, 4, 15,

0, 1, 2, 16,

0, 61, 2, 17,

0, 37, 3, 16,

0, 113, 2, 16,

1, 59, 6, 12,

1, 82, 5, 14,

0, 148, 3, 16,

0, 18, 5, 2,

0, 1, 4, 12,

0, 168, 3, 18,

0, 1, 3, 16,

0, 78, 6, 15,

0, 175, 5, 13,

0, 80, 5, 16,

0, 27, 4, 9,

0, 22, 2, 16,

1, 105, 6, 5,

1, 96, 3, 12,

0, 131, 2, 3,

1, 15, 7, 2,

0, 9, 5, 13,

0, 8, 3, 6,

0, 100, 3, 14,

0, 4, 3, 16,

0, 151, 2, 16,

0, 31, 3, 16,

0, 125, 2, 11,

0, 130, 5, 13,

0, 112, 3, 16,

0, 140, 5, 11,

0, 93, 3, 16,

0, 1, 3, 9,

1, 52, 5, 6,

0, 20, 6, 9,

1, 91, 5, 12,

1, 73, 5, 1,

0, 35, 3, 13,

0, 143, 9, 3,

0, 61, 4, 1,

0, 97, 3, 16,

1, 139, 3, 10,

0, 136, 4, 15,

0, 131, 5, 13,

1, 121, 3, 3,

0, 177, 2, 14,

0, 68, 5, 10,

0, 9, 2, 17,

1, 139, 10, 6,

0, 2, 2, 17,

0, 140, 4, 15,

0, 72, 5, 15,

0, 2, 3, 13,

1, 120, 5, 8,

0, 51, 7, 9,

0, 102, 3, 13,

1, 130, 4, 1,

1, 114, 7, 8,

0, 81, 4, 1,

0, 118, 3, 16,

0, 118, 4, 16,

0, 17, 4, 10,

0, 195, 2, 17,

0, 159, 4, 13,

0, 18, 4, 11,

0, 15, 5, 16,

0, 158, 5, 14,

0, 127, 4, 12,

0, 87, 4, 16,

0, 206, 4, 10,

0, 11, 3, 15,

0, 178, 4, 15,

1, 157, 3, 13,

0, 26, 7, 13,

0, 120, 2, 13,

1, 42, 7, 6,

0, 36, 4, 13 };

 

float xytest[10 * 4] = { 0, 71, 3, 5,

1, 128, 4, 5,

0, 1, 4, 15,

0, 61, 6, 10,

0, 113, 2, 16,

1, 82, 5, 14,

0, 148, 3, 16,

0, 1, 4, 12,

0, 1, 3, 16,

0, 175, 5, 13 };

 

int N = NOBS;

int nclasses = NCLASSES;

int npreds = NPREDS;

int ncols = NPREDS + 1;

int response_idx = 0;

int var_type[] = { 0, 2, 2, 2 };

int ntest = NTEST;

int i;

long seed = 123457;

float *predictions = NULL;

Imsls_f_decision_tree *tree = NULL;

 

tree = imsls_f_decision_tree(N, ncols,

xy,

response_idx,

var_type,

IMSLS_METHOD, 1,

IMSLS_N_FOLDS, 1,

IMSLS_PREDICTED, &predictions,

IMSLS_TEST_DATA, ntest, xytest,

0);

 

printf("Single tree predictions vs. actuals:\n\n");

for (i = 0; i < ntest; i++){

printf("%d\t%f \t %f\n", i + 1, predictions[i],

xytest[i*ncols + response_idx]);

}

 

imsls_f_decision_tree_free(tree);

imsls_free(predictions);

 

tree = imsls_f_decision_tree(N, ncols,

xy,

response_idx,

var_type,

IMSLS_METHOD, 1,

IMSLS_N_FOLDS, 1,

IMSLS_PREDICTED, &predictions,

IMSLS_TEST_DATA, ntest, xytest,

IMSLS_RANDOM_FEATURES,

IMSLS_N_RANDOM_FEATURES, 2,

IMSLS_RANDOM_SEED, seed,

IMSLS_N_SAMPLE, 100,

0);

 

printf("\n\nRandom forest predictions vs. actuals:\n\n");

for (i = 0; i < ntest; i++){

printf("%d\t%f\t %f\n", i + 1, predictions[i],

xytest[i*ncols + response_idx]);

}

 

imsls_f_decision_tree_free(tree);

imsls_free(predictions);

}

Output

Single tree predictions vs. actuals:

 

1 0.000000 0.000000

2 0.000000 1.000000

3 0.000000 0.000000

4 1.000000 0.000000

5 0.000000 0.000000

6 0.000000 1.000000

7 0.000000 0.000000

8 0.000000 0.000000

9 0.000000 0.000000

10 0.000000 0.000000

 

 

Random forest predictions vs. actuals:

 

1 0.000000 0.000000

2 1.000000 1.000000

3 0.000000 0.000000

4 0.000000 0.000000

5 0.000000 0.000000

6 0.000000 1.000000

7 0.000000 0.000000

8 0.000000 0.000000

9 0.000000 0.000000

10 0.000000 0.000000

Example 5

In Example 5, a random forest is grown for Fisher's iris data and the out-of-bag error measures are reported.

 

#include <imsls.h>

#include <stdio.h>

 

#define NOBS 150

#define NCLASSES 3

#define NPREDS 4

 

int main(){

int i = 0;

int n = NOBS;

int ncols = NPREDS + 1;

int nclasses = NCLASSES;

int response_idx = 0;

int var_type[] = { 0, 2, 2, 2, 2 };

 

float iris_xy[150 * 5];

float *iris_data = NULL;

char *classLabel[] = { "Setosa", "Versicolour", "Virginica",

"Total" };

char *colLabel[] = { "Species", "Number of Errors",

"Total N" };

 

float out_of_bag_mean_error = 0.0;

float *out_of_bag_class_errors = NULL;

Imsls_f_decision_tree *tree = NULL;

 

iris_data = imsls_f_data_sets(3, 0);

 

for (i = 0; i < n*ncols; i++){

iris_xy[i] = iris_data[i];

}

for (i = 0; i < n; i++){

iris_xy[i*ncols + response_idx] -= 1;

}

 

tree = imsls_f_decision_tree(n, ncols,

iris_xy,

response_idx,

var_type,

IMSLS_METHOD, 1,

IMSLS_N_FOLDS, 1,

IMSLS_OUT_OF_BAG_MEAN_ERROR, &out_of_bag_mean_error,

IMSLS_OUT_OF_BAG_CLASS_ERROR, &out_of_bag_class_errors,

IMSLS_RANDOM_FEATURES,

IMSLS_RANDOM_SEED, 123457,

0);

 

imsls_f_write_matrix("Out of bag errors by class", nclasses + 1, 2,

out_of_bag_class_errors,

IMSLS_ROW_LABELS, classLabel,

IMSLS_COL_LABELS, colLabel, 0);

 

printf("\nOut-of-bag mean error = %3.2f\n", out_of_bag_mean_error);

 

imsls_f_decision_tree_free(tree);

imsls_free(out_of_bag_class_errors);

imsls_free(iris_data);

}

 

Output

 

Out of bag errors by class

Species Number of Errors Total N

 

Setosa 0 50

Versicolour 3 50

Virginica 6 50

Total 9 150

 

Out-of-bag mean error = 0.06

Example 6

In this example, a random forest is used to predict the categorical response on simulated data. The data is random with no real relationship among the predictors and the response variable, reflected by the high number of errors. Furthermore, the variable importance measure is slightly negative for the predictors, another symptom of noisy data.

 

#include <imsls.h>

#include <stdio.h>

 

#define NOBS 50

#define NCLASSES 3

#define NPREDS 3

 

int main(){

int n = NOBS;

int nclasses = NCLASSES;

int ncols = NPREDS + 1;

int var_type[] = { 0, 2, 0, 0 };

int response_idx = 0;

float xy[50 * 4] =

{

2, 25.92869, 0, 0,

1, 51.63245, 1, 1,

1, 25.78432, 0, 2,

0, 39.37948, 0, 3,

2, 24.65058, 0, 2,

2, 45.20084, 0, 2,

2, 52.67960, 1, 3,

1, 44.28342, 1, 3,

2, 40.63523, 1, 3,

2, 51.76094, 0, 3,

2, 26.30368, 0, 1,

2, 20.70230, 1, 0,

2, 38.74273, 1, 3,

2, 19.47333, 0, 0,

1, 26.42211, 0, 0,

2, 37.05986, 1, 0,

1, 51.67043, 1, 3,

0, 42.40156, 0, 3,

2, 33.90027, 1, 2,

1, 35.43282, 0, 0,

1, 44.30369, 0, 1,

0, 46.72387, 0, 2,

1, 46.99262, 0, 2,

0, 36.05923, 0, 3,

2, 36.83197, 1, 1,

1, 61.66257, 1, 2,

0, 25.67714, 0, 3,

1, 39.08567, 1, 0,

0, 48.84341, 1, 1,

1, 39.34391, 0, 3,

2, 24.73522, 0, 2,

1, 50.55251, 1, 3,

0, 31.34263, 1, 3,

1, 27.15795, 1, 0,

0, 31.72685, 0, 2,

0, 25.00408, 0, 3,

1, 26.35457, 1, 3,

2, 38.12343, 0, 1,

0, 49.94030, 0, 2,

1, 42.45779, 1, 3,

0, 38.80948, 1, 1,

0, 43.22799, 1, 1,

0, 41.87624, 0, 3,

2, 48.07820, 0, 2,

0, 43.23673, 1, 0,

2, 39.41294, 0, 3,

1, 23.93346, 0, 2,

2, 42.84130, 1, 3,

2, 30.40669, 0, 1,

0, 37.77389, 0, 2

};

 

float out_of_bag_mean_error = 0.0;

float *out_of_bag_class_errors = NULL;

float *variable_importance = NULL;

Imsls_f_decision_tree *tree = NULL;

 

tree = imsls_f_decision_tree(n, ncols,

xy,

response_idx,

var_type,

IMSLS_METHOD, 0,

IMSLS_N_FOLDS, 1,

IMSLS_OUT_OF_BAG_MEAN_ERROR, &out_of_bag_mean_error,

IMSLS_OUT_OF_BAG_CLASS_ERROR, &out_of_bag_class_errors,

IMSLS_OUT_OF_BAG_VAR_IMPORTANCE, &variable_importance,

IMSLS_RANDOM_FEATURES,

IMSLS_RANDOM_SEED, 123457,

0);

 

imsls_f_write_matrix("Errors by class", nclasses + 1, 2,

out_of_bag_class_errors, 0);

printf("\nout-of-bag mean error = %f\n", out_of_bag_mean_error);

imsls_f_write_matrix("Variable importance", ncols - 1, 1,

variable_importance, 0);

 

imsls_f_decision_tree_free(tree);

imsls_free(out_of_bag_class_errors);

imsls_free(variable_importance);

}

Output

Errors by class

1 2

1 11 15

2 14 16

3 15 19

4 40 50

 

out-of-bag mean error = 0.800000

 

Variable importance

1 -0.01758

2 0.00133

3 -0.00881

Example 7

In this example, the bagged trees generated in a random forest are returned.

 

#include <imsls.h>

#include <stdio.h>

 

#define NOBS 150

#define NPREDS 4

 

int main(){

int i = 0;

int n = NOBS;

int ncols = NPREDS + 1;

int response_idx = 0;

int control[] = { 7, 21, 10, 4, 3 };

int var_type[] = { 0, 2, 2, 2, 2 };

float iris_xy[150 * 5];

float *iris_data = NULL;

Imsls_f_decision_tree *tree = NULL;

Imsls_f_decision_tree **bagged_trees = NULL;

 

iris_data = imsls_f_data_sets(3, 0);

 

for (i = 0; i < n*ncols; i++){

iris_xy[i] = iris_data[i];

}

for (i = 0; i < n; i++){

iris_xy[i*ncols + response_idx] -= 1;

}

 

tree = imsls_f_decision_tree(n, ncols,

iris_xy,

response_idx,

var_type,

IMSLS_METHOD, 1,

IMSLS_N_FOLDS, 1,

IMSLS_CONTROL, control,

IMSLS_RETURN_TREES, &bagged_trees,

IMSLS_RANDOM_FEATURES,

IMSLS_RANDOM_SEED, 123457,

0);

 

/* Print the first and the last bagged tree:*/

imsls_f_decision_tree_print(bagged_trees[0], 0);

imsls_f_decision_tree_print(bagged_trees[49], 0);

 

imsls_f_decision_tree_free(tree);

imsls_f_bagged_trees_free(50, bagged_trees);

imsls_free(iris_data);

}

Output

Decision Tree:

 

Node 0: Cost = 0.633, N= 150, Level = 0, Child nodes: 1 2

P(Y=0)= 0.347

P(Y=1)= 0.287

P(Y=2)= 0.367

Predicted Y: 2

Node 1: Cost = 0.113, N= 67, Level = 1

Rule: X0 <= 5.550

P(Y=0)= 0.746

P(Y=1)= 0.224

P(Y=2)= 0.030

Predicted Y: 0

Node 2: Cost = 0.200, N= 83, Level = 1

Rule: X0 > 5.550

P(Y=0)= 0.024

P(Y=1)= 0.337

P(Y=2)= 0.639

Predicted Y: 2

 

Decision Tree:

 

Node 0: Cost = 0.660, N= 150, Level = 0, Child nodes: 1 2

P(Y=0)= 0.333

P(Y=1)= 0.340

P(Y=2)= 0.327

Predicted Y: 1

Node 1: Cost = 0.000, N= 50, Level = 1

Rule: X2 <= 2.600

P(Y=0)= 1.000

P(Y=1)= 0.000

P(Y=2)= 0.000

Predicted Y: 0

Node 2: Cost = 0.327, N= 100, Level = 1

Rule: X2 > 2.600

P(Y=0)= 0.000

P(Y=1)= 0.510

P(Y=2)= 0.490

Predicted Y: 1

Warning Errors

IMSLS_NO_SURROGATES

Use of surrogates is limited to method 1 (ALACART).

IMSLS_NO_CONVERGENCE

Convergence was not achieved.

IMSLS_EMPTY_CLASS_LEVEL

The count of class level # in the training data is zero.

Fatal Errors

IMSLS_INVALID_METHOD2

Choose a valid tree generation method; 0 (C4.5), 1 (ALACART), 2 (CHAID) or 3 (QUEST).

IMSLS_VALUE_GT_ZERO

The value of # must be strictly positive.