decision_tree_predict

Computes predicted values using a decision tree.

Synopsis

#include <imsls.h>

float* imsls_f_decision_tree_predict (int n, int n_cols, float x[], int var_type[], Imsls_f_decision_tree *tree, ..., 0)

The type double function is imsls_d_decision_tree_predict.

Required Arguments

int n (Input)
The number of rows in x.

int n_cols (Input)
The number of columns in x.

float x[] (Input)
Array of size n × n_cols containing the data.

int var_type[] (Input)
Array of length n_cols indicating the type of each variable.

Value   Type
0       Categorical
1       Ordered Discrete (Low, Med., High)
2       Quantitative or Continuous
3       Ignore this variable

Imsls_f_decision_tree *tree (Input)
An estimated decision tree.

Return Value

An array of length n containing the predicted values. If an error occurs, NULL is returned.

Synopsis with Optional Arguments

#include <imsls.h>

float* imsls_f_decision_tree_predict (int n, int n_cols, float x[], int var_type[], Imsls_f_decision_tree *tree,
IMSLS_N_SURROGATES, int n_surrogates,
IMSLS_X_RESPONSE_COL, int response_col_idx,
IMSLS_WEIGHTS, float weights[],
IMSLS_X_NODE_IDS, int **node_ids,
IMSLS_X_NODE_IDS_USER, int node_ids[],
IMSLS_ERROR_SS, float *pred_err_ss,
IMSLS_PREDICTED_CLASS_PROB, float **predicted_probs,
IMSLS_PREDICTED_CLASS_PROB_USER, float predicted_probs[],
IMSLS_RETURN_USER, float predictions[],
0)

Optional Arguments

IMSLS_N_SURROGATES, int n_surrogates (Input)
The number of surrogate splits to use for handling missing values. This applies only to methods that find surrogate splits.
Default: n_surrogates = 0.

IMSLS_WEIGHTS, float weights[] (Input)
An array of length n containing case weights.
Default: weights[i]=1.0.

IMSLS_X_RESPONSE_COL, int response_col_idx (Input)
The column index of the response variable, if present in the data. A negative value indicates there is no response column.
Default: response_col_idx = -1.

IMSLS_X_NODE_IDS, int **node_ids (Output)
Address of a pointer to the internally allocated array of length n containing, for each row in x, the terminal node of the tree to which the observation belongs.

IMSLS_X_NODE_IDS_USER, int node_ids[] (Output)
Storage for node_ids is provided by the user.

IMSLS_ERROR_SS, float* pred_err_ss (Output)
The prediction error mean sum of squares, available when values for the response are present in the data.

IMSLS_PREDICTED_CLASS_PROB, float** predicted_probs (Output)
The predicted class probabilities for a categorical response variable.

IMSLS_PREDICTED_CLASS_PROB_USER, float predicted_probs[] (Output)
Storage for the predicted class probabilities is provided by the user.

IMSLS_RETURN_USER, float predictions[] (Output)
Storage for the return value is provided by the user.

Description

To predict a new set of cases using a fitted or estimated decision tree, imsls_f_decision_tree_predict finds the terminal node of the tree to which each new case belongs. The predicted value for the case is then the predicted value of that node. This is a matter of “putting the data through the tree.” For example, consider the following weather conditions:

Temperature = 70

Humidity = 82

Outlook = Rainy

Wind = FALSE

According to the C4.5 decision tree in Example 1 for imsls_f_decision_tree, will the golfer play golf or not, under these conditions? The tree splits the root node on Outlook into three nodes: {Sunny, Rainy, and Overcast}. Rainy defines node 5. Node 5 is split into child nodes 6 and 7, according to the presence of wind. If there is wind, Node 7, the prediction is “Don’t Play.” If there is no wind, Node 6, the prediction is “Play.” Therefore, the new observation belongs to Node 6, and the tree predicts that the golfer will play under the given weather conditions. In the ALACART decision tree, Node 4 is the terminal node, and the associated prediction is “Play.”

Comments

1. Users can request predictions and error sum of squares directly from imsls_f_decision_tree or use this separate prediction function when it is not necessary to re-estimate a decision tree.

2. If requested, the prediction mean sum of squared error (mean squared prediction error) is computed when actual response values are available in the data.

3. For cases with missing values in predictors that are involved in the splitting rules of the tree, imsls_f_decision_tree_predict uses surrogate rules if available and when requested. Otherwise, predicted values are missing, and the error sum of squares does include that case.

Example

Using the kyphosis data of Example 2 for imsls_f_decision_tree, this example illustrates using a separate call to imsls_f_decision_tree_predict to obtain the predicted values for a new set of observations (xy_test).

 

#include <imsls.h>
#include <stdio.h>

int main()
{
    /* Training data: Kyphosis (response), Age, Number, Start */
    float xy[81*4] =
    {
        0, 71, 3, 5,
        0, 158, 3, 14,
        1, 128, 4, 5,
        0, 2, 5, 1,
        0, 1, 4, 15,
        0, 1, 2, 16,
        0, 61, 2, 17,
        0, 37, 3, 16,
        0, 113, 2, 16,
        1, 59, 6, 12,
        1, 82, 5, 14,
        0, 148, 3, 16,
        0, 18, 5, 2,
        0, 1, 4, 12,
        0, 168, 3, 18,
        0, 1, 3, 16,
        0, 78, 6, 15,
        0, 175, 5, 13,
        0, 80, 5, 16,
        0, 27, 4, 9,
        0, 22, 2, 16,
        1, 105, 6, 5,
        1, 96, 3, 12,
        0, 131, 2, 3,
        1, 15, 7, 2,
        0, 9, 5, 13,
        0, 8, 3, 6,
        0, 100, 3, 14,
        0, 4, 3, 16,
        0, 151, 2, 16,
        0, 31, 3, 16,
        0, 125, 2, 11,
        0, 130, 5, 13,
        0, 112, 3, 16,
        0, 140, 5, 11,
        0, 93, 3, 16,
        0, 1, 3, 9,
        1, 52, 5, 6,
        0, 20, 6, 9,
        1, 91, 5, 12,
        1, 73, 5, 1,
        0, 35, 3, 13,
        0, 143, 9, 3,
        0, 61, 4, 1,
        0, 97, 3, 16,
        1, 139, 3, 10,
        0, 136, 4, 15,
        0, 131, 5, 13,
        1, 121, 3, 3,
        0, 177, 2, 14,
        0, 68, 5, 10,
        0, 9, 2, 17,
        1, 139, 10, 6,
        0, 2, 2, 17,
        0, 140, 4, 15,
        0, 72, 5, 15,
        0, 2, 3, 13,
        1, 120, 5, 8,
        0, 51, 7, 9,
        0, 102, 3, 13,
        1, 130, 4, 1,
        1, 114, 7, 8,
        0, 81, 4, 1,
        0, 118, 3, 16,
        0, 118, 4, 16,
        0, 17, 4, 10,
        0, 195, 2, 17,
        0, 159, 4, 13,
        0, 18, 4, 11,
        0, 15, 5, 16,
        0, 158, 5, 14,
        0, 127, 4, 12,
        0, 87, 4, 16,
        0, 206, 4, 10,
        0, 11, 3, 15,
        0, 178, 4, 15,
        1, 157, 3, 13,
        0, 26, 7, 13,
        0, 120, 2, 13,
        1, 42, 7, 6,
        0, 36, 4, 13
    };

 

    /* New observations, same column layout as xy */
    float xy_test[10*4] =
    {
        0, 71, 3, 5,
        1, 128, 4, 5,
        0, 1, 4, 15,
        0, 61, 6, 10,
        0, 113, 2, 16,
        1, 82, 5, 14,
        0, 148, 3, 16,
        0, 1, 4, 12,
        0, 1, 3, 16,
        0, 175, 5, 13
    };

    int n = 81;
    int ncols = 4;
    int response_col_idx = 0;
    int method = 3;
    int control[] = {5, 10, 10, 50, 10};
    int var_type[] = {0, 2, 2, 2};

    int n_test = 10;
    int i, idx;

    float *predictions;
    float pred_err_ss;
    const char* names[] = {"Age", "Number", "Start"};
    const char* classNames[] = {"Absent", "Present"};
    const char* responseName[] = {"Kyphosis"};
    Imsls_f_decision_tree *tree = NULL;

 

    /* Estimate the tree from the training data */
    tree = imsls_f_decision_tree(n, ncols, xy, response_col_idx, var_type,
        IMSLS_METHOD, method,
        IMSLS_N_FOLDS, 1,
        IMSLS_CONTROL, control,
        IMSLS_TEST_DATA, n_test, xy_test,
        0);

    /* Predict the new observations with the estimated tree */
    predictions = imsls_f_decision_tree_predict(n_test, ncols, xy_test,
        var_type, tree,
        IMSLS_X_RESPONSE_COL, response_col_idx,
        IMSLS_ERROR_SS, &pred_err_ss,
        0);

    printf("\nPredictions for test data:\n");
    printf("%5s%8s%7s%10s\n", names[0], names[1], names[2],
        responseName[0]);
    for(i=0; i<n_test; i++){
        /* Columns 1..3 are the predictors; column 0 is the response */
        printf("%5.0f%8.0f%7.0f",
            xy_test[i*ncols+1],
            xy_test[i*ncols+2],
            xy_test[i*ncols+3]);
        idx = (int)predictions[i];
        printf("%10s\n", classNames[idx]);
    }
    printf("\nMean squared prediction error: %f\n", pred_err_ss);

    imsls_f_decision_tree_free(tree);
    imsls_free(predictions);

    return 0;
}

Output

 

Predictions for test data:
  Age  Number  Start  Kyphosis
   71       3      5    Absent
  128       4      5   Present
    1       4     15    Absent
   61       6     10    Absent
  113       2     16    Absent
   82       5     14    Absent
  148       3     16    Absent
    1       4     12    Absent
    1       3     16    Absent
  175       5     13    Absent

Mean squared prediction error: 0.100000

Warning Errors

IMSLS_NO_SURROGATES

Use of surrogates is limited to method 1 (ALACART).

IMSLS_INVALID_PARAM

The value of # is out of range.