decision_tree_predict
Computes predicted values using a decision tree.
Synopsis
#include <imsls.h>
float* imsls_f_decision_tree_predict (int n, int n_cols, float x[], int var_type[], Imsls_f_decision_tree *tree, ..., 0)
The type double function is imsls_d_decision_tree_predict.
Required Arguments
int n (Input)
The number of rows in x.
int n_cols (Input)
The number of columns in x.
float x[] (Input)
Array of size n × n_cols containing the data.
int var_type[] (Input)
Array of length n_cols indicating the type of each variable.
Value    Type
0        Categorical
1        Ordered Discrete (Low, Med., High)
2        Quantitative or Continuous
3        Ignore this variable
Imsls_f_decision_tree *tree (Input)
An estimated decision tree.
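The layout of x and the meaning of the var_type codes can be sketched in plain C. This is only an illustration: the helper names x_at and var_type_label are hypothetical and not part of the library, and the row-major layout is an assumption inferred from the indexing xy_test[i*ncols+1] used in the Example section.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): return the value in
   row i, column j of a data array stored row by row with n_cols columns.
   Row-major storage is an assumption based on the Example's indexing. */
float x_at(const float x[], int n_cols, int i, int j)
{
    return x[(size_t)i * (size_t)n_cols + (size_t)j];
}

/* Hypothetical helper: translate a var_type code into the label
   listed in the Value/Type table. */
const char *var_type_label(int code)
{
    switch (code) {
    case 0:  return "Categorical";
    case 1:  return "Ordered Discrete";
    case 2:  return "Quantitative or Continuous";
    case 3:  return "Ignore this variable";
    default: return "Invalid";
    }
}
```

For instance, with n_cols = 4 and the first two rows of the kyphosis test data, x_at(x, 4, 1, 1) would return 128, the Age value of the second row.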
Return Value
An array of length n containing the predicted values. If an error occurs, NULL is returned.
Synopsis with Optional Arguments
#include <imsls.h>
float* imsls_f_decision_tree_predict (int n, int n_cols, float x[], int var_type[], Imsls_f_decision_tree *tree,
IMSLS_N_SURROGATES, int n_surrogates,
IMSLS_X_RESPONSE_COL, int response_col_idx,
IMSLS_WEIGHTS, float weights[],
IMSLS_X_NODE_IDS, int **node_ids,
IMSLS_X_NODE_IDS_USER, int node_ids[],
IMSLS_ERROR_SS, float *pred_err_ss,
IMSLS_RETURN_USER, float predictions[],
0)
Optional Arguments
IMSLS_N_SURROGATES, int n_surrogates (Input)
The number of surrogate splits to use. Surrogate splits apply only to methods that find them in order to handle missing values.
Default: n_surrogates = 0.
IMSLS_WEIGHTS, float weights[] (Input)
An array of length n containing case weights.
Default: weights[i]=1.0.
IMSLS_X_RESPONSE_COL, int response_col_idx (Input)
The column index of the response variable, if present in the data. A negative value indicates there is no response column.
Default: response_col_idx = -1.
IMSLS_X_NODE_IDS, int **node_ids (Output)
Address of a pointer to the internally allocated array of length n containing, for each row in x, the terminal node of the tree to which the observation belongs.
IMSLS_X_NODE_IDS_USER, int node_ids[] (Output)
Storage for node_ids is provided by the user.
IMSLS_ERROR_SS, float* pred_err_ss (Output)
The mean squared prediction error, available when values for the response are present in the data.
IMSLS_RETURN_USER, float predictions[] (Output)
Storage for the return value is provided by the user.
Description
To predict a new set of cases using a fitted or estimated decision tree, imsls_f_decision_tree_predict finds the terminal node of the tree to which each new case belongs. The predicted value is then the predicted value of that node. This is a matter of “putting the data through the tree.” For example, suppose the following weather conditions:
Temperature = 70
Humidity = 82
Outlook = Rainy
Wind = FALSE
According to the C4.5 decision tree in Example 1 for imsls_f_decision_tree, will the golfer play golf or not, under these conditions? The tree splits the root node on Outlook into three nodes: {Sunny, Rainy, and Overcast}. Rainy defines node 5. Node 5 is split into child nodes 6 and 7, according to the presence of wind. If there is wind, Node 7, the prediction is “Don’t Play.” If there is no wind, Node 6, the prediction is “Play.” Therefore, the new observation belongs to Node 6, and the tree predicts that the golfer will play under the given weather conditions. In the ALACART decision tree, Node 4 is the terminal node, and the associated prediction is “Play.”
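The walk through the Rainy branch can be sketched as a small routing function. This is a hand-written illustration, not the library's implementation: the function names and the enum coding of Outlook are hypothetical, and only the branch described above is modeled. The Sunny and Overcast subtrees are not described in this section, so the sketch returns -1 for them rather than guessing.

```c
#include <stdbool.h>

/* Outlook categories; the numeric coding is an assumption of this sketch. */
enum outlook { SUNNY, OVERCAST, RAINY };

/* Return the terminal node id for a case, following only the branch the
   text describes: Rainy leads to node 5, which splits on wind into
   node 7 (wind) and node 6 (no wind). Other Outlook values return -1
   because their subtrees are not covered here. */
int golf_terminal_node(enum outlook o, bool windy)
{
    if (o != RAINY)
        return -1;
    return windy ? 7 : 6;
}

/* Map a terminal node id to the prediction stated in the text. */
const char *golf_prediction(int node)
{
    switch (node) {
    case 6:  return "Play";
    case 7:  return "Don't Play";
    default: return "Unknown";
    }
}
```

For the case above (Outlook = Rainy, Wind = FALSE), golf_terminal_node(RAINY, false) gives node 6 and golf_prediction(6) gives "Play". Temperature and Humidity do not enter the decision along this branch.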
Comments
1. Users can request predictions and error sum of squares directly from imsls_f_decision_tree or use this separate prediction function when it is not necessary to re-estimate a decision tree.
2. If requested, the mean squared prediction error is computed when actual response values are available in the data.
3. For cases with missing values in predictors that are involved in the splitting rules of the tree, imsls_f_decision_tree_predict uses surrogate rules if available and when requested. Otherwise, predicted values are missing, and the error sum of squares does not include that case.
Example
Using the kyphosis data of Example 2 for imsls_f_decision_tree, this example illustrates using a separate call to imsls_f_decision_tree_predict to obtain the predicted values for a new set of observations (xy_test).
 
#include <imsls.h>
#include <stdio.h>
 
int main()
{
float xy[81*4] =
{
0, 71, 3, 5,
0, 158, 3, 14,
1, 128, 4, 5,
0, 2, 5, 1,
0, 1, 4, 15,
0, 1, 2, 16,
0, 61, 2, 17,
0, 37, 3, 16,
0, 113, 2, 16,
1, 59, 6, 12,
1, 82, 5, 14,
0, 148, 3, 16,
0, 18, 5, 2,
0, 1, 4, 12,
0, 168, 3, 18,
0, 1, 3, 16,
0, 78, 6, 15,
0, 175, 5, 13,
0, 80, 5, 16,
0, 27, 4, 9,
0, 22, 2, 16,
1, 105, 6, 5,
1, 96, 3, 12,
0, 131, 2, 3,
1, 15, 7, 2,
0, 9, 5, 13,
0, 8, 3, 6,
0, 100, 3, 14,
0, 4, 3, 16,
0, 151, 2, 16,
0, 31, 3, 16,
0, 125, 2, 11,
0, 130, 5, 13,
0, 112, 3, 16,
0, 140, 5, 11,
0, 93, 3, 16,
0, 1, 3, 9,
1, 52, 5, 6,
0, 20, 6, 9,
1, 91, 5, 12,
1, 73, 5, 1,
0, 35, 3, 13,
0, 143, 9, 3,
0, 61, 4, 1,
0, 97, 3, 16,
1, 139, 3, 10,
0, 136, 4, 15,
0, 131, 5, 13,
1, 121, 3, 3,
0, 177, 2, 14,
0, 68, 5, 10,
0, 9, 2, 17,
1, 139, 10, 6,
0, 2, 2, 17,
0, 140, 4, 15,
0, 72, 5, 15,
0, 2, 3, 13,
1, 120, 5, 8,
0, 51, 7, 9,
0, 102, 3, 13,
1, 130, 4, 1,
1, 114, 7, 8,
0, 81, 4, 1,
0, 118, 3, 16,
0, 118, 4, 16,
0, 17, 4, 10,
0, 195, 2, 17,
0, 159, 4, 13,
0, 18, 4, 11,
0, 15, 5, 16,
0, 158, 5, 14,
0, 127, 4, 12,
0, 87, 4, 16,
0, 206, 4, 10,
0, 11, 3, 15,
0, 178, 4, 15,
1, 157, 3, 13,
0, 26, 7, 13,
0, 120, 2, 13,
1, 42, 7, 6,
0, 36, 4, 13
};
 
float xy_test[10*4] =
{
0, 71, 3, 5,
1, 128, 4, 5,
0, 1, 4, 15,
0, 61, 6, 10,
0, 113, 2, 16,
1, 82, 5, 14,
0, 148, 3, 16,
0, 1, 4, 12,
0, 1, 3, 16,
0, 175, 5, 13
};
 
    /* Dimensions and settings for the training call. */
    int n = 81;
    int ncols = 4;
    int response_col_idx = 0;
    int method = 3;
    int control[] = {5, 10, 10, 50, 10};
    int var_type[] = {0, 2, 2, 2};

    int n_test = 10;
    int i, idx;

    float *predictions;
    float pred_err_ss;
    const char* names[] = {"Age", "Number", "Start"};
    const char* classNames[] = {"Absent", "Present"};
    const char* responseName[] = {"Kyphosis"};
    Imsls_f_decision_tree *tree = NULL;

    /* Estimate the decision tree from the training data. */
    tree = imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
        IMSLS_METHOD, method,
        IMSLS_N_FOLDS, 1,
        IMSLS_CONTROL, control,
        IMSLS_TEST_DATA, n_test, xy_test,
        0);

    /* Predict the test cases with the fitted tree. Because the response
       column is present, the prediction error is also returned. */
    predictions = imsls_f_decision_tree_predict(n_test, ncols, xy_test,
        var_type, tree,
        IMSLS_X_RESPONSE_COL, response_col_idx,
        IMSLS_ERROR_SS, &pred_err_ss,
        0);

    /* Print the predictors of each test case next to its predicted class. */
    printf("\nPredictions for test data:\n");
    printf("%5s%8s%7s%10s\n", names[0], names[1], names[2],
        responseName[0]);
    for (i = 0; i < n_test; i++) {
        printf("%5.0f%8.0f%7.0f",
            xy_test[i*ncols+1],
            xy_test[i*ncols+2],
            xy_test[i*ncols+3]);
        idx = (int)predictions[i];
        printf("%10s\n", classNames[idx]);
    }
    printf("\nMean squared prediction error: %f\n", pred_err_ss);

    /* Release the tree and the internally allocated predictions. */
    imsls_f_decision_tree_free(tree);
    imsls_free(predictions);

    return 0;
}
Output
 
Predictions for test data:
  Age  Number  Start  Kyphosis
   71       3      5    Absent
  128       4      5   Present
    1       4     15    Absent
   61       6     10    Absent
  113       2     16    Absent
   82       5     14    Absent
  148       3     16    Absent
    1       4     12    Absent
    1       3     16    Absent
  175       5     13    Absent
 
Mean squared prediction error: 0.100000
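
The reported error can be checked by hand. The sketch below is a plain unweighted mean of squared differences, written for illustration only. The function name mean_squared_error is hypothetical, and it is an assumption that the library computes the value this way when no case weights are supplied. The actual responses are the first column of xy_test, and the predictions are the printed classes coded as Absent = 0 and Present = 1.

```c
#include <stddef.h>

/* Hypothetical helper: unweighted mean of squared differences between
   actual and predicted values. Returns 0.0 when n is 0. */
double mean_squared_error(const float actual[], const float predicted[],
                          size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++) {
        double d = (double)actual[i] - (double)predicted[i];
        sum += d * d;
    }
    return sum / (double)n;
}
```

With actual = {0, 1, 0, 0, 0, 1, 0, 0, 0, 0} and predicted = {0, 1, 0, 0, 0, 0, 0, 0, 0, 0}, only the sixth case (Age 82, Number 5, Start 14) is misclassified, so the result is 1/10 = 0.1, which agrees with the value printed above.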
Warning Errors
IMSLS_NO_SURROGATES
Use of surrogates is limited to method 1 (ALACART).
IMSLS_INVALID_PARAM
The value of # is out of range.