The type double function is imsls_d_gradient_boosting.
Required Arguments
int n (Input) The number of rows in xy.
int n_cols (Input) The number of columns in xy.
float xy[] (Input) Array of size n × n_cols containing the data.
int response_col_idx (Input) The column index of xy containing the response variable.
int var_type[] (Input) Array of length n_cols indicating the type of each variable.
var_type[i]   Description
0             Categorical
1             Ordered Discrete (e.g., Low, Med, High)
2             Quantitative or Continuous
3             Ignore this variable
Note: When the variable type is specified as Categorical (var_type[i] = 0), the numbering of the categories must begin at 0. For example, if there are three categories, they must be represented as 0, 1, and 2 in the xy array.
The number of classes for a categorical response variable is determined by the largest value found in the data. A warning message is displayed if any class level in 0, 1, …, n_classes − 1 has a zero count in the data.
Return Value
A pointer to an array of predicted values on the test data if test data is provided (see optional argument, IMSLS_TEST_DATA). If test data is not provided, the predicted values are the fitted values on the training data. If an error occurs, NULL is returned.
Optional Arguments
IMSLS_TEST_DATA, int n_test, float xy_test[] (Input) xy_test is an array of size n_test × n_cols containing test data for which predictions are requested. When this optional argument is present, the number of observations n_test must be greater than 0. The response variable may have missing values in xy_test, but the response and predictor variables must occupy the same columns as they do in xy. If test data is not provided but predictions are requested, xy is used and the predictions are the fitted values. Default: n_test = n, xy_test = xy.
IMSLS_TEST_DATA_WEIGHTS, float weights_test[] (Input) An array of size n_test containing the frequencies or weights for each observation in xy_test. This argument is ignored if IMSLS_TEST_DATA is not present. Default: weights_test[i] = 1.0.
IMSLS_WEIGHTS, float weights[] (Input) An array of length n containing the frequencies or weights for each observation in xy. Default: weights[i] = 1.0.
IMSLS_N_SAMPLE, int sample_size (Input) The number of examples drawn randomly from the training data in each iteration. Default: sample_size = sample_p*n.
IMSLS_SAMPLE_PROPORTION, float sample_p (Input) The proportion of training examples drawn randomly from the training data in each iteration. Default: sample_p = 0.5.
IMSLS_SHRINKAGE, float shrinkage (Input) The shrinkage parameter used in the boosting algorithm. The parameter must be in the closed interval [0, 1]. Default: shrinkage = 1.0 (no shrinkage).
IMSLS_MAX_ITER, int max_iter (Input) The number of iterations. This value is equivalent to M in the boosting algorithm described below. Default: max_iter = 50.
IMSLS_LOSS_FCN, int loss_fcn_type (Input) An integer specifying the loss function used in the algorithm for regression problems (loss_fcn_type = 0, 1, 2) or binary classification problems (loss_fcn_type = 3, 4). See the Description section for the loss function in the multinomial case (categorical response variables with more than two outcomes). Default: loss_fcn_type = 0.
loss_fcn_type = 0 (Least Squares)
The loss function is the sum of squared errors:
    L = Σ_i w_i (y_i − f(x_i))²

loss_fcn_type = 1 (Least Absolute Deviation)
The loss function is the sum of absolute errors:
    L = Σ_i w_i |y_i − f(x_i)|

loss_fcn_type = 2 (Huber M)
The loss function is a weighted mixture of squared error and absolute error:
    L = Σ_i w_i ψ(y_i − f(x_i))
where
    ψ(u) = ½u² if |u| ≤ δ, and δ(|u| − δ/2) otherwise,
and where δ is the α empirical quantile of the absolute errors |y_i − f(x_i)|.

loss_fcn_type = 3 (Adaboost)
The loss function is the AdaBoost.M1 criterion:
    L = Σ_i w_i exp(−(2y_i − 1) f(x_i)), with y_i ∈ {0, 1}

loss_fcn_type = 4 (Bernoulli or binomial deviance)
The loss function is the binomial or Bernoulli negative log-likelihood:
    L = Σ_i w_i log(1 + exp(−2(2y_i − 1) f(x_i))), with y_i ∈ {0, 1}
IMSLS_ALPHA, float huber_alpha (Input) The quantile value for the Huber-M loss function. Default: huber_alpha = 0.05.
IMSLS_CONTROL, int params[] (Input) Array of length 5 containing parameters to control the size and other characteristics of the decision trees.
params[i]   Name         Action
0           min_n_node   Do not split a node if one of its child nodes will have fewer than min_n_node observations.
1           min_split    Do not split a node if the node has fewer than min_split observations.
2           max_x_cats   Allow up to max_x_cats categories or levels for categorical variables.
3           max_size     Stop growing the tree once it has reached max_size nodes.
4           max_depth    Stop growing the tree once it has reached max_depth levels.
Default: params[] = {10, 21, 10, 4, 10}.
IMSLS_RANDOM_SEED, int seed (Input) Sets the seed of the random number generator used in sampling. Using the same seed in repeated calls will result in the same output. If seed = 0, the random seed is set by the system clock and repeated calls result in slightly different results. Default: seed = 0.
IMSLS_PRINT, int print_level (Input)
print_level   Action
0             No printing
1             Print final results only
2             Print intermediate and final results
Default: print_level = 0.
IMSLS_LOSS_VALUE, float *loss_value (Output) The final value of the loss function after M iterations of the algorithm.
IMSLS_TEST_LOSS_VALUE, float *test_loss_value (Output) The final value of the loss function after M iterations of the algorithm on the test data.
IMSLS_FITTED_VALUES, float **fitted_values (Output) Address of a pointer to an array of length n containing the fitted values on the training data xy after M iterations of the algorithm.
IMSLS_FITTED_VALUES_USER, float fitted_values[] (Output) Storage for the array of fitted values for the training data is provided by the user.
IMSLS_PROBABILITIES, float **probs (Output) Address of a pointer to an array of length n_test × n_classes containing the predicted class probabilities for each observation in the test data (n_test = n when IMSLS_TEST_DATA is not present).
IMSLS_PROBABILITIES_USER, float probs[] (Output) Storage for the array of predicted class probabilities is provided by the user.
IMSLS_FITTED_PROBABILITIES, float **fitted_probabilities (Output) Address of a pointer to an array of length n × n_classes containing the fitted class probabilities on the training data for classification problems.
IMSLS_FITTED_PROBABILITIES_USER, float fitted_probabilities[] (Output) Storage for the array of fitted class probabilities is provided by the user.
IMSLS_RETURN_TREES, Imsls_f_decision_tree ***bagged_trees (Output) Address of a pointer to an array of length M containing the collection of trees generated during the algorithm. To release this space, use imsls_f_bagged_trees_free.
IMSLS_RETURN_USER, float probabilities[] (Output) Storage for the array of the return value is provided by the user.
Description
Stochastic gradient boosting is an optimization algorithm for minimizing residual errors to improve the accuracy of predictions. This function implements the algorithm of Friedman (1999). For further discussion, see Hastie et al. (2009).
In the following, x_i is the vector of predictor variable values and y_i is the response variable value for the observation in row i. The function f_m(x_i) evaluated at x_i is the predicted value at iteration m. This value is iteratively updated to minimize a loss function L. Specifically, the algorithm is:
Initialize the predictor function to the constant
    f_0(x) = argmin_γ Σ_{i=1,…,N} L(y_i, γ)
For each iteration m = 1, 2, …, M:
1. Calculate the pseudo-residuals
    r_im = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}, for i = 1, …, N
2. Fit a regression tree to the pseudo-residuals r_im and use the resulting model to predict the observations in the training data. The resulting terminal nodes define J_m terminal regions R_jm for the response. Compute
    γ_jm = argmin_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + γ), for j = 1, …, J_m
3. Update the prediction function for each observation x_i,
    f_m(x_i) = f_{m−1}(x_i) + λ Σ_{j=1,…,J_m} γ_jm I{x_i ∈ R_jm}
where λ ∈ [0,1] is a shrinkage parameter (λ = 1 means no shrinking, whereas λ = 0 gives just f_M = f_0).
After M iterations, the function f_M(⋅) forms the basis of the predictions for the response variable.
Specifically:
Response variable type: QUANTITATIVE_CONTINUOUS
For the regression problem, the predicted value at a new observation vector x_i is
    ŷ_i = f_M(x_i)
Response variable type: CATEGORICAL with 2 outcomes (binomial)
For a classification problem with 2 outcomes, the predicted probability is
    p̂_i = P[Y = 1 | x_i] = 1 / (1 + exp(−2 f_M(x_i)))
Then the predicted value is
    ŷ_i = I{p̂_i > 0.5}
where I{⋅} is the indicator function.
Response variable type: CATEGORICAL with 3 or more outcomes (multinomial)
For a classification problem with K ≥ 3 outcomes, the predicted probabilities for k = 1, …, K are
    p̂_ik = exp(f_kM(x_i)) / Σ_{l=1,…,K} exp(f_lM(x_i))
Then the predicted value is
    ŷ_i = argmax_k p̂_ik
For regression problems, the algorithm uses the squared error loss by default. For classification problems with two categories, the Bernoulli or binomial loss function is the default (see optional argument IMSLS_LOSS_FCN). For a categorical response with three or more categories, the multinomial deviance (described below) is used.
For a categorical response with K categories, the loss function is the multinomial negative log-likelihood, or multinomial deviance:
    L = −Σ_{i=1,…,N} w_i Σ_{k=1,…,K} I{y_i = k} log p_k(x_i)
where
    p_k(x_i) = exp(f_k(x_i)) / Σ_{l=1,…,K} exp(f_l(x_i))
Examples
Example 1
This example uses stochastic gradient boosting to obtain fitted values for a regression variable on a small data set with six predictor variables.
Example 2
This example uses stochastic gradient boosting to obtain probability estimates for a binary response variable with four predictor variables. An estimate of P[Y = 0] is obtained for each example in the training data as well as for a small test data set.
Probabilities ≤ 0.5 lead to a prediction of Y = 0, while probabilities > 0.5 lead to a prediction of Y = 1.
printf("\nTest data loss value=%f\n", test_loss_value);
imsls_free(predicted_values);
imsls_free(fitted_values);
imsls_free(probabilities);
imsls_free(fitted_probabilities);
}
Output
Training data fitted prob[Y=0] and actuals:
0.35 0
0.82 0
0.87 0
0.25 1
0.90 0
0.24 1
0.26 1
0.90 0
0.30 1
0.84 0
0.23 1
0.35 0
0.24 1
0.85 0
0.84 0
0.26 1
0.82 0
0.85 0
0.85 0
0.22 1
0.83 0
0.85 0
0.87 0
0.75 1
0.83 0
0.35 1
0.26 1
0.35 0
0.81 0
0.18 1
0.24 1
0.23 1
0.30 1
0.17 1
0.83 0
0.76 1
0.85 0
0.83 0
0.90 0
0.35 1
0.83 0
0.21 1
0.84 0
0.83 0
0.75 1
0.81 0
0.90 0
0.82 0
0.87 0
0.76 0
0.26 1
0.85 0
0.82 0
0.24 1
0.24 1
0.89 0
0.16 1
0.23 1
0.83 0
0.24 1
0.83 0
0.90 0
0.85 0
0.78 0
0.35 1
0.22 1
0.35 1
0.83 0
0.76 0
0.78 0
0.83 0
0.87 0
0.18 1
0.22 1
0.26 1
0.35 0
0.90 0
0.77 0
0.87 0
0.89 0
0.90 0
0.83 0
0.35 0
0.84 0
0.83 0
0.77 1
0.90 0
0.75 1
0.23 1
0.85 0
0.84 0
0.22 1
0.18 1
0.35 0
0.81 0
0.32 1
0.90 0
0.85 0
0.16 1
0.24 1
Training data loss_value=0.650631
Test data predicted prob[Y=0] and actuals:
0.83 0
0.75 0
0.22 1
0.17 1
0.18 1
0.85 0
0.89 0
0.76 0
0.83 0
0.30 1
Test data loss value=0.440048
Example 3
This example uses the same data as in Example 2, but switches the response variable to the 4th column of the training data. Because the response is categorical with more than two categories, the multinomial loss function is used.
Note: The response variable is considered to have five categorical levels because its largest value is 4, but the code assumes categorical numbering starts at 0. Since class 0 is not present in the data, a warning message is printed.