gradient_boosting_predict

Uses a previously trained gradient boosting model to predict a univariate response variable based on new data.

Synopsis

#include <imsls.h>

float *imsls_f_gradient_boosting_predict (int n_rows, int n_cols, float xy[], int response_col_idx, int var_type[], Imsls_f_gradient_boosting_model *gb_model, …, 0)

The type double function is imsls_d_gradient_boosting_predict.

Required Arguments

int n_rows (Input)
The number of rows in xy.

int n_cols (Input)
The number of columns in xy. It must match the n_cols of the previously trained model.

float xy[] (Input)
Array of size n_rows × n_cols containing the data.

int response_col_idx (Input)
The column index of xy containing the response variable. It must match the response_col_idx of the previously trained model. If the response values are not known, fill this column with imsls_f_machine(6) (NaN).

int var_type[] (Input)
Array of length n_cols indicating the type of each variable. The array values must match the var_type used in the previously trained model.

var_type[i]	Description
0	Categorical
1	Ordered Discrete (e.g., Low, Med, High)
2	Quantitative or Continuous
3	Ignore this variable

When the variable type is specified as Categorical (var_type[i] = 0), the numbering of the categories must begin at 0. For example, if there are three categories, they must be represented as 0, 1, and 2 in the xy array.

The number of classes for a categorical response variable is determined by the largest value discovered in the data. Note that a warning message is displayed if a class level in 0, 1, …, n_classes - 1 has a 0 count in the data.
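The coding rules above can be sketched in plain C. The helpers below are hypothetical and not part of the library: encode_column rewrites a categorical column using a label table that the caller is assumed to have kept from training time, and count_classes derives the class count from the largest code and reports how many levels never occur. The sketch assumes the column holds no missing values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: position of v in labels[], or -1 if absent. */
static int find_code(float v, const float labels[], int n_levels)
{
    for (int j = 0; j < n_levels; j++)
        if (labels[j] == v)
            return j;
    return -1;
}

/* Hypothetical helper: rewrite column col of the row-major
   n_rows x n_cols array xy so that raw value labels[j] becomes code j.
   Returns 0 on success, or -1 (leaving xy unchanged) if any value in
   the column does not appear in labels[]. */
int encode_column(float xy[], int n_rows, int n_cols, int col,
                  const float labels[], int n_levels)
{
    assert(col >= 0 && col < n_cols);
    /* Check every value first so that a failure modifies nothing. */
    for (int i = 0; i < n_rows; i++)
        if (find_code(xy[(size_t)i * n_cols + col], labels, n_levels) < 0)
            return -1;
    for (int i = 0; i < n_rows; i++) {
        float *cell = &xy[(size_t)i * n_cols + col];
        *cell = (float)find_code(*cell, labels, n_levels);
    }
    return 0;
}

/* Number of classes implied by an encoded column (largest code + 1).
   *n_empty receives how many levels in 0..n_classes-1 never occur. */
int count_classes(const float xy[], int n_rows, int n_cols, int col,
                  int *n_empty)
{
    int max_code = -1;
    assert(col >= 0 && col < n_cols);
    for (int i = 0; i < n_rows; i++) {
        int code = (int)xy[(size_t)i * n_cols + col];
        if (code > max_code)
            max_code = code;
    }
    *n_empty = 0;
    for (int level = 0; level <= max_code; level++) {
        int seen = 0;
        for (int i = 0; i < n_rows && !seen; i++)
            seen = ((int)xy[(size_t)i * n_cols + col] == level);
        if (!seen)
            (*n_empty)++;
    }
    return max_code + 1;
}
```

Reusing the training-time label table matters here: recoding new data by order of first appearance could assign different codes than the ones the model was trained on.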

Imsls_f_gradient_boosting_model *gb_model (Input)
Pointer to a structure of type Imsls_f_gradient_boosting_model containing a previously trained gradient boosting model.
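As an illustration of the xy and response_col_idx arguments, the sketch below marks unknown responses as NaN. It assumes xy is stored row by row, which is how the examples on this page pass it (&newXY[0][0]). The helper name is hypothetical, and the standard C macro NAN stands in for imsls_f_machine(6) only so that the sketch compiles without the library.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: wherever the response column of the row-major
   n_rows x n_cols array xy holds the caller's own missing-value
   sentinel (for example -9999.0), replace it with NaN.
   Returns the number of cells replaced. */
int mark_missing_response(float xy[], int n_rows, int n_cols,
                          int response_col_idx, float sentinel)
{
    int replaced = 0;
    assert(response_col_idx >= 0 && response_col_idx < n_cols);
    for (int i = 0; i < n_rows; i++) {
        /* Element (i, response_col_idx) in row-major storage. */
        float *y = &xy[(size_t)i * n_cols + response_col_idx];
        if (*y == sentinel) {
            *y = NAN;   /* stand-in for imsls_f_machine(6) */
            replaced++;
        }
    }
    return replaced;
}
```

Only the response column is touched, so predictor values that happen to equal the sentinel are left alone.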

Return Value

A pointer to an array containing the predicted values for the data in xy. If an error occurs, NULL is returned.

Synopsis with Optional Arguments

#include <imsls.h>

float *imsls_f_gradient_boosting_predict (int n_rows, int n_cols, float xy[], int response_col_idx, int var_type[], Imsls_f_gradient_boosting_model *gb_model,

IMSLS_WEIGHTS, float weights[],

IMSLS_PRINT, int print_level,

IMSLS_LOSS_VALUE, float *loss_value,

IMSLS_PROBABILITIES, float **probs,

IMSLS_PROBABILITIES_USER, float probs[],

IMSLS_RETURN_USER, float predictions[],

0)

Optional Arguments

IMSLS_WEIGHTS, float weights[] (Input)
An array of length n_rows containing frequencies or weights for each observation in xy.
Default: weights[i] = 1.0.

IMSLS_PRINT, int print_level (Input)

print_level	Action
0	No printing
1	Print intermediate loss values, provided the response variable values are not missing

Default: print_level = 0.

IMSLS_LOSS_VALUE, float *loss_value (Output)
The final value of the loss function after applying the previously trained model. This value is calculated only if the response variable is not missing in xy; otherwise, NaN is returned.

IMSLS_PROBABILITIES, float **probs (Output)
Address of a pointer to an array of length n_rows × n_classes containing the predicted class probabilities for each observation in the data.

IMSLS_PROBABILITIES_USER, float probs[] (Output)
Storage for the array of the predicted class probabilities is provided by the user.

IMSLS_RETURN_USER, float predictions[] (Output)
Storage for the array of predicted values is provided by the user.
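The probability output is described only by its total length. A minimal sketch for reading it follows, under the assumption (not confirmed on this page) that the values are stored observation by observation, so that probs[i * n_classes + k] is the probability of class k for row i. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the most probable class for one row.
   ASSUMPTION: probs is laid out row by row, n_classes values per
   observation. Verify the layout before relying on this. */
int most_probable_class(const float probs[], int n_rows, int n_classes,
                        int row)
{
    assert(row >= 0 && row < n_rows && n_classes > 0);
    const float *p = &probs[(size_t)row * n_classes];
    int best = 0;
    for (int k = 1; k < n_classes; k++)
        if (p[k] > p[best])
            best = k;   /* ties keep the lowest class index */
    return best;
}
```

With IMSLS_PROBABILITIES_USER, the caller would need to supply storage for n_rows × n_classes values before the call.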

Description

Stochastic gradient boosting (SGB) is an optimization algorithm for minimizing residual errors to improve the accuracy of predictions. See the description of gradient_boosting for the details of the algorithm. This function uses a previously trained SGB model to predict a response (or target) variable based on one or more predictors given in the data matrix xy.

The SGB model structure Imsls_f_gradient_boosting_model contains the sequence of boosted trees and other salient parameters from SGB model training. Each row of the data matrix xy is passed through the sequence of boosted trees, and the prediction for that row is updated at each tree using the parameter values stored in the model.
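To make the row-by-row processing concrete, here is a toy sketch of prediction with an additive model of boosted trees. It is not the library's implementation or its internal model layout: the Stump type, the shrinkage factor, and the function names are all assumptions chosen for illustration, and each tree is reduced to a single split.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative one-split tree; the library's trees are not exposed here. */
typedef struct {
    int feature;        /* column index tested by the split */
    float threshold;    /* go left when row[feature] <= threshold */
    float left_value;   /* contribution of the left leaf */
    float right_value;  /* contribution of the right leaf */
} Stump;

/* Prediction for one row: start from an initial value and add the
   shrunken contribution of each tree in sequence. */
float boosted_predict_row(const float row[], float init, float shrinkage,
                          const Stump trees[], int n_trees)
{
    float f = init;
    for (int m = 0; m < n_trees; m++) {
        const Stump *t = &trees[m];
        float leaf = (row[t->feature] <= t->threshold) ? t->left_value
                                                       : t->right_value;
        f += shrinkage * leaf;
    }
    return f;
}

/* Apply the row-wise rule to every row of a row-major n_rows x n_cols
   array x and store the results in out[0..n_rows-1]. */
void boosted_predict(const float x[], int n_rows, int n_cols, float init,
                     float shrinkage, const Stump trees[], int n_trees,
                     float out[])
{
    assert(n_cols > 0);
    for (int i = 0; i < n_rows; i++)
        out[i] = boosted_predict_row(&x[(size_t)i * n_cols], init,
                                     shrinkage, trees, n_trees);
}
```

Each row is scored independently, so in this sketch the order of the rows does not affect the predictions.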

Examples

Example 1

In this example an SGB model for a continuous target variable is trained on a data set with six predictor variables. The model is returned and then used to predict the target on a new set of data. In this case, the actual target values are not known in the new data set. Note that custom missing value indicators, such as -9999.0, must be replaced with imsls_f_machine(6) (NaN) in the response column of xy.

#include <imsls.h>
#include <stdio.h>

#define ROW 51
#define COL 7
#define TROW 10
int main() {

    float XY[ROW][COL] = {
        { 1.51599306, 0.2008399745, 0.9003028921, 3.0, 0.0, 2.0, 1.437127559 },
        { 2.72854297, 0.2072261081, 1.2282209327, 2.0, 5.0, 2.0, 0.68596562 },
        { 3.06956138, 0.9067490781, 0.8283077031, 2.0, 0.0, 2.0, 2.862403627 },
        { 1.81659279, 0.4506153886, 1.2822537781, 3.0, 4.0, 2.0, 1.710525684 },
        { 3.75978142, 0.2638894715, 0.4995447062, 0.0, 1.0, 1.0, 1.077172402 },
        { 5.72383445, 0.7682430062, 1.4758595745, 0.0, 3.0, 1.0, 2.365233736 },
        { 3.78155015, 0.6888140934, 0.4809393724, 0.0, 0.0, 1.0, 1.061246069 },
        { 3.60023233, 0.8470419827, 1.6149122352, 1.0, 1.0, 0.0, 0.01120048 },
        { 4.30238917, 0.9484412405, 1.6122899544, 1.0, 4.0, 2.0, 0.782038861 },
        { -0.19206757, 0.7674867723, 0.01665624, 3.0, 5.0, 2.0, 2.924944949 },
        { 3.03246318, 0.8747456241, 1.6051767552, 2.0, 1.0, 0.0, 2.233971364 },
        { 1.56652306, 0.0947128241, 1.470864601, 3.0, 0.0, 1.0, 1.851705944 },
        { 2.77490671, 0.1347932827, 1.3693161067, 1.0, 2.0, 0.0, 0.795709459 },
        { 1.05042043, 0.258093959, 0.4679728113, 3.0, 5.0, 0.0, 2.897785557 },
        { 2.73366469, 0.152943752, 0.5244769375, 1.0, 4.0, 2.0, 2.712871963 },
        { 1.78996951, 0.7921472492, 0.4686144991, 2.0, 4.0, 1.0, 1.295327727 },
        { 1.10343272, 0.123231777, 0.563989053, 2.0, 4.0, 1.0, 0.510414582 },
        { 1.70883743, 0.1931027549, 1.8561577178, 3.0, 5.0, 1.0, 0.165721288 },
        { 2.17977731, 0.316932481, 1.3376214528, 2.0, 2.0, 0.0, 2.366607214 },
        { 2.46127675, 0.9601344266, 0.2090187217, 1.0, 3.0, 1.0, 0.846218965 },
        { 1.92249547, 0.1104206559, 1.739415036, 3.0, 0.0, 0.0, 0.652622544 },
        { 5.81907137, 0.7049566596, 1.6238740934, 0.0, 3.0, 0.0, 1.685337845 },
        { 2.04774497, 0.0480224835, 0.7510998738, 2.0, 5.0, 2.0, 1.400641323 },
        { 4.54023907, 0.0557708007, 1.0864350675, 0.0, 1.0, 1.0, 1.630408823 },
        { 3.66100874, 0.2939440177, 0.9709178614, 0.0, 1.0, 0.0, 0.06970193 },
        { 4.39253655, 0.0982369843, 1.2492676578, 0.0, 2.0, 2.0, 0.138188998 },
        { 3.23303353, 0.3775206071, 0.2937129182, 0.0, 0.0, 2.0, 1.070823081 },
        { 3.13800098, 0.7891691434, 1.90897633, 2.0, 3.0, 0.0, 1.240732062 },
        { 1.49034639, 0.2456938969, 0.9157859818, 3.0, 5.0, 0.0, 0.850803277 },
        { 0.09486277, 0.1240615626, 0.3891524528, 3.0, 5.0, 0.0, 2.532516038 },
        { 3.74460501, 0.0181218453, 1.4921644945, 1.0, 2.0, 1.0, 1.92839241 },
        { 3.24158796, 0.9203409508, 1.1644667462, 2.0, 3.0, 1.0, 1.956283022 },
        { 1.97796767, 0.5977597698, 0.5501609747, 2.0, 5.0, 2.0, 0.39384095 },
        { 4.15214037, 0.1433333508, 1.4292114358, 1.0, 0.0, 0.0, 1.114095218 },
        { 0.7799787, 0.8539819908, 0.7039108537, 3.0, 0.0, 1.0, 1.468978726 },
        { 2.01869009, 0.8919721926, 1.1436212659, 3.0, 4.0, 1.0, 2.09256257 },
        { 0.56311561, 0.0899261576, 0.7989077698, 3.0, 5.0, 0.0, 0.195650739 },
        { 4.74296429, 0.9625684835, 1.5732420743, 0.0, 3.0, 2.0, 2.685061853 },
        { 2.97981809, 0.5511086562, 1.6053283028, 2.0, 5.0, 2.0, 0.906810926 },
        { 2.82187135, 0.3869563073, 0.9321342241, 1.0, 5.0, 1.0, 0.756223386 },
        { 5.24390592, 0.3500950718, 1.7769328682, 0.0, 3.0, 2.0, 1.328165314 },
        { 3.17307157, 0.8798056154, 1.4647966106, 2.0, 5.0, 1.0, 0.561835038 },
        { 0.78246075, 0.1472158518, 0.4658273738, 2.0, 0.0, 0.0, 1.317240539 },
        { 1.57827027, 0.3415432149, 0.7513634153, 2.0, 2.0, 0.0, 1.502675544 },
        { 0.84104905, 0.1501226462, 0.9332020828, 3.0, 1.0, 2.0, 1.083374695 },
        { 2.63627352, 0.1707233109, 1.1676406977, 2.0, 3.0, 0.0, 2.236639737 },
        { 1.30863625, 0.2616807753, 0.8342161868, 3.0, 2.0, 2.0, 1.778402721 },
        { 2.7313073, 0.9616109401, 1.596915911, 3.0, 3.0, 1.0, 0.303127344 },
        { 3.56848173, 0.4072918599, 1.5345127448, 1.0, 2.0, 2.0, 1.47452504 },
        { 5.40152982, 0.7796053565, 1.3659530994, 0.0, 4.0, 1.0, 0.484531098 },
        { 3.94901823, 0.5052344366, 1.9319026601, 1.0, 2.0, 0.0, 2.504392843 }
    };

    float newXY[TROW][COL] = {
       { -9999.0, 0.8587425048, 1.2705688183, 0.0, 0.0, 1.0, 0.836626959 },
       { -9999.0, 0.8928761308, 1.3886538362, 2.0, 1.0, 2.0, 2.155131825 },
       { -9999.0, 0.7385954093, 1.5773203815, 0.0, 4.0, 2.0, 0.075368922 },
       { -9999.0, 0.6227398487, 0.0228797458, 3.0, 4.0, 2.0, 0.070793233 },
       { -9999.0, 0.8519553537, 1.2141886768, 2.0, 4.0, 2.0, 0.762200702 },
       { -9999.0, 0.5578103897, 0.9185446175, 2.0, 4.0, 2.0, 0.085492814 },
       { -9999.0, 0.4178302658, 1.3686663737, 0.0, 0.0, 0.0, 2.573941051 },
       { -9999.0, 0.9829705667, 0.7817731784, 0.0, 5.0, 1.0, 0.865016054 },
       { -9999.0, 0.3859238869, 0.2746516233, 3.0, 4.0, 0.0, 1.908151819 },
       { -9999.0, 0.4165328839, 1.3154437956, 3.0, 4.0, 2.0, 2.752358041 }
    };

    int i = 0;
    int response_col_idx = 0;
    int var_type[] = { 2, 2, 2, 0, 0, 0, 2 };
    float *fitted_values = NULL;
    float *predicted_values = NULL;
    float loss_value = 0.0, new_loss_value = 0.0;
    Imsls_f_gradient_boosting_model *gb_model = NULL;

    fitted_values = imsls_f_gradient_boosting(ROW, COL, &XY[0][0],
        response_col_idx, var_type,
        IMSLS_RANDOM_SEED, 123457,
        IMSLS_LOSS_VALUE, &loss_value,
        IMSLS_RETURN_MODEL, &gb_model,
        0);

    /* Replace custom missing value with NaN */
    for (i = 0; i < TROW; i++) {
        newXY[i][response_col_idx] = imsls_f_machine(6);
    }

    predicted_values = imsls_f_gradient_boosting_predict(TROW, COL,
        &newXY[0][0], response_col_idx, var_type, gb_model,
        IMSLS_LOSS_VALUE, &new_loss_value,
        0);

    /* Write out for future use. */
    imsls_f_gradient_boosting_model_write(gb_model,
        "regression_gb_model.txt", 0);

    printf("Predictions on new data set:\n");
    /* predicted_values is NULL if the prediction call failed. */
    for (i = 0; predicted_values != NULL && i < TROW; i++) {
        printf(" %f\n", predicted_values[i]);
    }
    printf(" \nLoss value on training data: %f\n", loss_value);
    /* NaN is the only value that compares unequal to itself. */
    if (new_loss_value != new_loss_value) {
        printf(" \nNo y actuals on new data. new_loss_value is NAN. \n");
    }
    else {
        printf(" \nLoss value on new data: %f \n", new_loss_value);
    }

    if (fitted_values)
        imsls_free(fitted_values);
    if (predicted_values)
        imsls_free(predicted_values);
    if (gb_model)
        imsls_f_gradient_boosting_model_free(gb_model);
}
#undef ROW
#undef COL
#undef TROW

Output

Predictions on new data set:
 4.230358
 3.356423
 5.551225
 0.904968
 3.054923
 3.138623
 4.719745
 3.648027
 0.742592
 1.733868
 
Loss value on training data: 0.074667
 
No y actuals on new data. new_loss_value is NAN. 

Example 2

In this example the SGB model trained and written out in Example 1 is read in using imsls_f_gradient_boosting_model_read. The model is then used to predict the same data set, newXY. (Hence the predictions match those in Example 1.) In practice, the saved model is used for different or future data sets having the same column definitions.

#include <imsls.h>
#include <stdio.h>

#define COL 7
#define TROW 10
int main() {

    float newXY[TROW][COL] = {
       { -9999.0, 0.8587425048, 1.2705688183, 0.0, 0.0, 1.0, 0.836626959 },
       { -9999.0, 0.8928761308, 1.3886538362, 2.0, 1.0, 2.0, 2.155131825 },
       { -9999.0, 0.7385954093, 1.5773203815, 0.0, 4.0, 2.0, 0.075368922 },
       { -9999.0, 0.6227398487, 0.0228797458, 3.0, 4.0, 2.0, 0.070793233 },
       { -9999.0, 0.8519553537, 1.2141886768, 2.0, 4.0, 2.0, 0.762200702 },
       { -9999.0, 0.5578103897, 0.9185446175, 2.0, 4.0, 2.0, 0.085492814 },
       { -9999.0, 0.4178302658, 1.3686663737, 0.0, 0.0, 0.0, 2.573941051 },
       { -9999.0, 0.9829705667, 0.7817731784, 0.0, 5.0, 1.0, 0.865016054 },
       { -9999.0, 0.3859238869, 0.2746516233, 3.0, 4.0, 0.0, 1.908151819 },
       { -9999.0, 0.4165328839, 1.3154437956, 3.0, 4.0, 2.0, 2.752358041 }
    };

    int i = 0;
    int response_col_idx = 0;
    int var_type[] = { 2, 2, 2, 0, 0, 0, 2 };
    float *predicted_values = NULL;
    float new_loss_value = 0.0;
    Imsls_f_gradient_boosting_model *gb_model = NULL;


    /* Replace custom missing value with NaN */
    for (i = 0; i < TROW; i++) {
        newXY[i][response_col_idx] = imsls_f_machine(6);
    }

    gb_model = imsls_f_gradient_boosting_model_read(
        "regression_gb_model.txt", 0);

    predicted_values = imsls_f_gradient_boosting_predict(TROW, COL,
        &newXY[0][0], response_col_idx, var_type, gb_model,
        IMSLS_LOSS_VALUE, &new_loss_value,
        0);

    printf("Predictions on new data set using model read from file:\n");

    /* predicted_values is NULL if the prediction call failed. */
    for (i = 0; predicted_values != NULL && i < TROW; i++) {
        printf(" %f\n", predicted_values[i]);
    }
    /* NaN is the only value that compares unequal to itself. */
    if (new_loss_value != new_loss_value) {
        printf(" \nNo y actuals on new data. new_loss_value is NAN. \n");
    }
    else {
        printf(" \nLoss value on new data: %f \n", new_loss_value);
    }

    if (predicted_values)
        imsls_free(predicted_values);
    if (gb_model)
        imsls_f_gradient_boosting_model_free(gb_model);
}
#undef COL
#undef TROW

Output

Predictions on new data set using model read from file:
 4.230358
 3.356423
 5.551225
 0.904968
 3.054923
 3.138623
 4.719745
 3.648027
 0.742592
 1.733868
 
No y actuals on new data. new_loss_value is NAN.
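
Because a saved model is only valid for data with the same column definitions (n_cols, response_col_idx, and var_type must all match those used in training), a caller may want to check the layout of a new data set before predicting. The sketch below shows one way to do that. ColumnLayout is a hypothetical record kept by the caller, not the library's model structure, and layout_matches is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical caller-side record of the column layout used at
   training time. */
typedef struct {
    int n_cols;
    int response_col_idx;
    const int *var_type;   /* n_cols entries, coded 0..3 */
} ColumnLayout;

/* Returns 1 when the layout of a new data set agrees with the layout
   recorded at training time, and 0 otherwise. */
int layout_matches(const ColumnLayout *trained, int n_cols,
                   int response_col_idx, const int var_type[])
{
    assert(trained != NULL && trained->var_type != NULL && var_type != NULL);
    if (n_cols != trained->n_cols)
        return 0;
    if (response_col_idx != trained->response_col_idx)
        return 0;
    for (int j = 0; j < n_cols; j++)
        if (var_type[j] != trained->var_type[j])
            return 0;
    return 1;
}
```

In Example 2, such a check would sit just before the call to imsls_f_gradient_boosting_predict, with the recorded layout stored alongside the model file.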