Class GradientBoosting

java.lang.Object
com.imsl.datamining.PredictiveModel
com.imsl.datamining.GradientBoosting
All Implemented Interfaces:
Serializable, Cloneable

public class GradientBoosting extends PredictiveModel implements Serializable, Cloneable
Performs stochastic gradient boosting for a single response variable and multiple predictor variables.

The idea behind boosting is to combine the outputs of relatively weak classifiers or predictive models to achieve iteratively improving accuracy, in either regression problems (the response variable is continuous) or classification problems (the response variable has two or more discrete values). This class implements the stochastic gradient tree boosting algorithm of Friedman (1999). A sequence of decision trees is fit to random samples of the training data, iteratively re-weighted to minimize a specified loss function. In each iteration, pseudo-residuals are calculated from a random sample of the original training set and the gradient of the loss function evaluated at the values generated in the previous iteration. New base predictors are fit to the pseudo-residuals, and a new prediction function is then selected to minimize the loss function, completing one iteration. The number of iterations is a parameter of the algorithm.

Gradient boosting is an ensemble method, but instead of using independent trees, it forms a sequence of trees, iteratively and judiciously re-weighted to minimize prediction errors. In particular, the decision tree at iteration m+1 is estimated on pseudo-residuals generated using the decision tree at step m; hence, successive trees are dependent on previous trees. The algorithm iterates for a fixed number of times and then stops, rather than iterating until a convergence criterion is met. The number of iterations is therefore a parameter in the model. Using a randomly selected subset of the training data in each iteration has been shown to substantially improve efficiency and robustness; thus, the method is called stochastic gradient boosting. For further discussion, see Hastie et al. (2008).
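The loop described above can be sketched in plain Java. The following is a minimal, self-contained illustration using least-squares loss and single-split regression "stumps" as the weak base learners; it is not the IMSL implementation, and the subsampling and threshold-search details are simplified assumptions:

```java
import java.util.Arrays;
import java.util.Random;

public class BoostingSketch {

    // One regression "stump": split x at threshold t, predict left or right.
    static final class Stump {
        double t, left, right;
    }

    // Fit a stump to the pseudo-residuals r over the sampled row indices idx
    // by minimizing squared error across candidate thresholds.
    static Stump fitStump(double[] x, double[] r, int[] idx) {
        Stump best = new Stump();
        double bestErr = Double.MAX_VALUE;
        for (int i : idx) {
            double t = x[i];
            double sumL = 0, sumR = 0;
            int nL = 0, nR = 0;
            for (int j : idx) {
                if (x[j] <= t) { sumL += r[j]; nL++; } else { sumR += r[j]; nR++; }
            }
            double left = nL > 0 ? sumL / nL : 0.0;
            double right = nR > 0 ? sumR / nR : 0.0;
            double err = 0;
            for (int j : idx) {
                double p = x[j] <= t ? left : right;
                err += (r[j] - p) * (r[j] - p);
            }
            if (err < bestErr) {
                bestErr = err; best.t = t; best.left = left; best.right = right;
            }
        }
        return best;
    }

    // Stochastic gradient boosting for least squares: start from f_0 = mean(y);
    // in each iteration, draw a random subsample, fit a stump to the
    // pseudo-residuals y - f (the negative gradient of the L2 loss), and take
    // a shrunken step.
    public static double[] boost(double[] x, double[] y, int iterations,
                                 double shrinkage, double sampleProportion, long seed) {
        int n = x.length;
        double mean = 0;
        for (double v : y) mean += v;
        mean /= n;
        double[] f = new double[n];
        Arrays.fill(f, mean);                       // initial value f_0
        Random rng = new Random(seed);
        int m = Math.max(1, (int) (sampleProportion * n));
        double[] r = new double[n];
        for (int it = 0; it < iterations; it++) {
            int[] idx = new int[m];                 // random subsample of rows
            for (int k = 0; k < m; k++) idx[k] = rng.nextInt(n);
            for (int j = 0; j < n; j++) r[j] = y[j] - f[j];
            Stump s = fitStump(x, r, idx);
            for (int j = 0; j < n; j++) {
                f[j] += shrinkage * (x[j] <= s.t ? s.left : s.right);
            }
        }
        return f;
    }

    public static double mse(double[] y, double[] f) {
        double e = 0;
        for (int j = 0; j < y.length; j++) e += (y[j] - f[j]) * (y[j] - f[j]);
        return e / y.length;
    }
}
```

In this sketch the `shrinkage` and `sampleProportion` arguments play the roles of the shrinkage parameter and sample size proportion exposed by setShrinkageParameter(double) and setSampleSizeProportion(double), and `iterations` corresponds to setNumberOfIterations(int).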

  • Constructor Details

    • GradientBoosting

      public GradientBoosting(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
      Constructs a GradientBoosting object for a single response variable and multiple predictor variables.
      Parameters:
      xy - a double matrix containing the training data
      responseColumnIndex - an int, the column index for the response variable
      varType - a PredictiveModel.VariableType array containing the type of each variable
    • GradientBoosting

      public GradientBoosting(PredictiveModel pm)
      Constructs a GradientBoosting object.
      Parameters:
      pm - the PredictiveModel to serve as the base learner

      Note: Currently only regression trees are supported as base learners.

    • GradientBoosting

      public GradientBoosting(GradientBoosting gbModel)
      Constructs a copy of the input GradientBoosting predictive model.
      Parameters:
      gbModel - a GradientBoosting predictive model
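Putting the first constructor together with fitModel() and predict() (documented under Method Details), a typical call sequence might look like the following sketch. The data values are invented for illustration, the variable-type constants are taken from PredictiveModel.VariableType and should be checked against that class's documentation, and the example assumes the JMSL library is on the classpath:

```java
import com.imsl.datamining.GradientBoosting;
import com.imsl.datamining.PredictiveModel;

public class GradientBoostingUsage {
    public static void main(String[] args) throws Exception {
        // Toy training data: two predictors followed by a continuous
        // response in column 2 (values invented for illustration).
        double[][] xy = {
            {1.0, 0.5, 1.2},
            {2.0, 0.7, 1.9},
            {3.0, 0.2, 3.1},
            {4.0, 0.9, 4.2},
            {5.0, 0.4, 4.8}
        };
        PredictiveModel.VariableType[] varType = {
            PredictiveModel.VariableType.QUANTITATIVE_CONTINUOUS,
            PredictiveModel.VariableType.QUANTITATIVE_CONTINUOUS,
            PredictiveModel.VariableType.QUANTITATIVE_CONTINUOUS
        };
        GradientBoosting gb = new GradientBoosting(xy, 2, varType);
        gb.setNumberOfIterations(100);    // default is 50
        gb.fitModel();                    // run the boosting iterations
        double[] fitted = gb.predict();   // fitted values on the training data
        System.out.println(java.util.Arrays.toString(fitted));
    }
}
```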
  • Method Details

    • clone

      public GradientBoosting clone()
      Clones a GradientBoosting predictive model.
      Specified by:
      clone in class PredictiveModel
      Returns:
      a clone of the GradientBoosting predictive model
    • fitModel

      public void fitModel() throws PredictiveModel.PredictiveModelException
      Performs the gradient boosting on the training data.
      Overrides:
      fitModel in class PredictiveModel
      Throws:
PredictiveModel.PredictiveModelException - thrown when an exception occurs in com.imsl.datamining.PredictiveModel. Superclass exceptions, such as com.imsl.datamining.PredictiveModel.StateChangeException and com.imsl.datamining.PredictiveModel.SumOfProbabilitiesNotOneException, should also be considered.
    • predict

      public double[] predict() throws PredictiveModel.PredictiveModelException
      Returns the predicted values on the training data.
      Specified by:
      predict in class PredictiveModel
      Returns:
      a double array containing the predicted values on the training data, i.e., the fitted values
      Throws:
PredictiveModel.PredictiveModelException - thrown when an exception occurs in com.imsl.datamining.PredictiveModel. Superclass exceptions, such as com.imsl.datamining.PredictiveModel.StateChangeException and com.imsl.datamining.PredictiveModel.SumOfProbabilitiesNotOneException, should also be considered.
    • predict

      public double[] predict(double[][] testData) throws PredictiveModel.PredictiveModelException
      Returns the predicted values on the input test data.
      Specified by:
      predict in class PredictiveModel
      Parameters:
      testData - a double matrix containing test data

      Note: testData must have the same number of columns and the columns must be in the same arrangement as xy.

      Returns:
      a double array containing the predicted values
      Throws:
PredictiveModel.PredictiveModelException - thrown when an exception occurs in com.imsl.datamining.PredictiveModel. Superclass exceptions, such as com.imsl.datamining.PredictiveModel.StateChangeException and com.imsl.datamining.PredictiveModel.SumOfProbabilitiesNotOneException, should also be considered.
    • predict

      public double[] predict(double[][] testData, double[] testDataWeights) throws PredictiveModel.PredictiveModelException
      Runs the gradient boosting on the training data and returns the predicted values on the weighted test data.
      Specified by:
      predict in class PredictiveModel
      Parameters:
      testData - a double matrix containing test data

Note: testData must have the same number of columns, arranged in the same order, as xy.

      testDataWeights - a double array containing weights for each row of testData
      Returns:
      a double array containing the predicted values
      Throws:
PredictiveModel.PredictiveModelException - thrown when an exception occurs in com.imsl.datamining.PredictiveModel. Superclass exceptions, such as com.imsl.datamining.PredictiveModel.StateChangeException and com.imsl.datamining.PredictiveModel.SumOfProbabilitiesNotOneException, should also be considered.
    • getMissingTestYFlag

      public boolean getMissingTestYFlag()
      Returns the flag indicating whether the test data is missing the response variable data.
      Returns:
      a boolean, the flag indicating whether the test data is missing the response variable values
    • setMissingTestYFlag

      public void setMissingTestYFlag(boolean missingTestY)
      Sets the flag indicating whether the test data is missing the response variable data.
      Parameters:
missingTestY - a boolean. When true, the response variable in the test data is treated as missing; in that case, the test loss value is Double.NaN. If the response variable is entirely missing in the test data, setting missingTestY=false has no effect.

      Default: missingTestY=false

    • getFittedValues

      public double[] getFittedValues()
      Returns the fitted values \({f(x_i)}\) for a continuous response variable after gradient boosting.
      Returns:
      a double array containing the fitted values on the training data
    • getTestFittedValues

      public double[] getTestFittedValues()
      Returns the fitted values \({f(x_i)}\) for a continuous response variable after gradient boosting on the test data.
      Returns:
      a double array containing the fitted values on the test data
    • getClassFittedValues

      public double[][] getClassFittedValues()
      Returns the fitted values \({f(x_i)}\) for a categorical response variable with two or more levels.

      The underlying loss function is the binomial or multinomial deviance.

      Returns:
      a double matrix containing the fitted values on the training data
    • getGammaList

      public ArrayList<double[]> getGammaList()
      Returns the gradient descent minimizing values calculated at each iteration for continuous response variables.

      When saveTrees=true, an ArrayList of these parameters will be available after running predict(). Also, see getTreeList(). The class GradientBoostingModelObject uses these parameters to predict new data sets.

      Returns:
      an ArrayList of double arrays containing the minimizer for each boosting iteration
    • getGammaListMNL

      public ArrayList<double[][]> getGammaListMNL()
      Returns the gradient descent minimizing values calculated at each iteration for categorical response variables.

      When saveTrees=true, an ArrayList of these parameters will be available after running predict(). Also, see getTreeList(). The class GradientBoostingModelObject uses these parameters to predict new data sets.

      Returns:
      an ArrayList of double matrices containing the minimizer for each boosting iteration
    • getTestClassFittedValues

      public double[][] getTestClassFittedValues()
      Returns the fitted values \({f(x_i)}\) for a categorical response variable with two or more levels on the test data.

      The underlying loss function is the binomial or multinomial deviance.

      Returns:
      a double matrix containing the fitted values on the test data
    • getMultinomialResponse

      public double[][] getMultinomialResponse()
      Returns the multinomial representation of the response variable.

\(Y^*\) is the matrix whose element in row i and column k, for i=0,...,nObservations-1 and k=0,...,nClasses-1, is $$ y^*_{ik} = \left\{ \begin{array}{ll} 1 & {\rm if}\;y_i = k \\ 0 & {\rm otherwise} \end{array} \right.$$

      Note: This representation is not available if the response has only 2 classes (binomial).

      Returns:
      a double matrix containing the response in multinomial representation
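The indicator encoding above is straightforward to reproduce. A minimal self-contained sketch (an illustration of the encoding, not the IMSL implementation):

```java
public class MultinomialEncoding {
    // Build the indicator matrix Y*: ystar[i][k] = 1 if y[i] == k, else 0,
    // for i = 0,...,nObservations-1 and k = 0,...,nClasses-1.
    public static double[][] encode(int[] y, int nClasses) {
        double[][] ystar = new double[y.length][nClasses];
        for (int i = 0; i < y.length; i++) {
            ystar[i][y[i]] = 1.0;
        }
        return ystar;
    }
}
```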
    • getClassProbabilities

      public double[][] getClassProbabilities()
      Returns the predicted probabilities on the training data for a categorical response variable.
      Overrides:
      getClassProbabilities in class PredictiveModel
      Returns:
a double matrix containing the class probabilities fit on the training data. The (i,k)-th element of the matrix is the estimated probability that the observation at row index i belongs to the (k+1)-st class, where k=0,...,nClasses-1.
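For the binomial and multinomial deviance losses, class probabilities are conventionally recovered from the per-class fitted values \(f_k(x_i)\) via the softmax transform (as in Friedman's K-class formulation). The following sketch shows that standard transform; it is an illustration of the usual mapping, not necessarily the exact computation inside this class:

```java
public class ClassProbabilities {
    // Convert per-class fitted values f[i][k] to probabilities with the
    // softmax transform: p_ik = exp(f_ik) / sum_j exp(f_ij).
    public static double[][] softmax(double[][] f) {
        double[][] p = new double[f.length][];
        for (int i = 0; i < f.length; i++) {
            double max = Double.NEGATIVE_INFINITY;
            for (double v : f[i]) max = Math.max(max, v);   // for numerical stability
            double sum = 0;
            p[i] = new double[f[i].length];
            for (int k = 0; k < f[i].length; k++) {
                p[i][k] = Math.exp(f[i][k] - max);
                sum += p[i][k];
            }
            for (int k = 0; k < f[i].length; k++) p[i][k] /= sum;
        }
        return p;
    }
}
```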
    • getTestClassProbabilities

      public double[][] getTestClassProbabilities()
      Returns the predicted probabilities on the test data for a categorical response variable.
      Returns:
      a double matrix containing the class probabilities on the test data. The i,k element is the estimated probability that the i-th pattern belongs to the k-th target class, where k=0,...,nClasses-1.
    • getLossValue

      public double getLossValue()
      Returns the loss function value.
      Returns:
      a double, the loss function value
    • getTestLossValue

      public double getTestLossValue()
      Returns the loss function value on the test data.
      Returns:
      a double, the loss function value
    • getTolerance

      public double getTolerance()
      Returns the tolerance level.

      The tolerance value is used throughout the boosting procedure to avoid underflow.

      Returns:
      a double, the tolerance level
    • getTreeList

      public ArrayList<Tree> getTreeList()
      Returns the list of boosted trees.

      The array of boosted trees will be available if saveTrees=true, and after running predict(). The class GradientBoostingModelObject uses the trees to predict new data sets.

      Returns:
      an ArrayList, the list of boosted trees
    • getNumberOfIterations

      public int getNumberOfIterations()
      Returns the current setting for the number of iterations to use in the gradient boosting algorithm.

      Different values for the number of iterations can be set and used in cross validation. See setIterationsArray(int[]).

      Returns:
      an int, the current setting for the number of iterations
    • setNumberOfIterations

      public void setNumberOfIterations(int numberOfIterations)
      Sets the number of iterations.
      Parameters:
      numberOfIterations - an int, the number of iterations

      Default: numberOfIterations = 50. The numberOfIterations must be positive.

Note: This method sets iterationsArray[0] = numberOfIterations.

    • setIterationsArray

      public void setIterationsArray(int[] iterationsArray)
      Sets the array of different numbers of iterations.

The gradient boosting algorithm iterates for a fixed number of times and then stops, rather than iterating until a convergence criterion is met. The number of iterations is therefore a parameter in the model. After setting iterationsArray to two or more values, cross-validation can be used to help determine the best choice among the values. By default, iterationsArray contains the single value {50}, the default number of iterations; it can also be set using setNumberOfIterations(int).

      Parameters:
      iterationsArray - an int array containing the different numbers of iterations

      Default: iterationsArray = {50}.

    • getHuberDeltas

      public double[] getHuberDeltas()
      Returns the values of the Huber parameter, \(\delta_m\), calculated at each iteration during training with the HUBER_M loss function.

      The array is initialized to \(\{0.0\}\) and then updated during training when LossFunctionType=HUBER_M.

      Returns:
      a double array containing the values
    • getInitialValue

      public double getInitialValue()
      Returns the initial value of the predictor function \(f_0\).

      Before training the gradient boosting model, \(f_0 = 0 \). At the start of training, \(f_0\) is calculated using the training data and the loss function.

      Returns:
      a double, the initial value of the predictor function
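For concreteness, under least-squares loss the loss-minimizing constant is the mean of the training responses (under least absolute deviation it is the median). A sketch of the least-squares case, as an illustration of the general idea rather than the exact internal computation:

```java
public class InitialValue {
    // Under L2 loss, f_0 = argmin_c sum_i (y_i - c)^2, which is mean(y).
    public static double leastSquaresF0(double[] y) {
        double sum = 0;
        for (double v : y) sum += v;
        return sum / y.length;
    }
}
```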
    • getIterationsArray

      public int[] getIterationsArray()
      Returns the array of different values for the number of iterations.

      Different values for the number of iterations can be set and used in cross validation. See setIterationsArray(int[]).

      Returns:
      an int array, containing the values for the number of iterations parameter
    • getShrinkage

public double getShrinkage()
Returns the shrinkage parameter.
Returns:
a double, the shrinkage parameter
    • getLossType

      public GradientBoosting.LossFunctionType getLossType()
      Returns the current loss function type.
      Returns:
      a LossFunctionType, the current setting of the loss function type
    • setLossFunctionType

      public void setLossFunctionType(GradientBoosting.LossFunctionType lossType)
      Sets the loss function type for the gradient boosting algorithm.
      Parameters:
      lossType - a LossFunctionType, the desired loss function type

      Default: lossType=LossFunctionType.LEAST_SQUARES

    • getSampleSizeProportion

      public double getSampleSizeProportion()
      Returns the current setting of the sample size proportion.
      Returns:
      a double, the sample size proportion
    • setSampleSizeProportion

      public void setSampleSizeProportion(double sampleSizeProportion)
      Sets the sample size proportion.
      Parameters:
      sampleSizeProportion - a double in the interval \(\left[0,1\right]\) specifying the desired sampling proportion

      Default: sampleSizeProportion = 0.50. If sampleSizeProportion = 1.0, no sampling is performed.

    • setShrinkageParameter

      public void setShrinkageParameter(double shrinkageParameter)
      Sets the value of the shrinkage parameter.
      Parameters:
      shrinkageParameter - a double in the interval \(\left[0,1\right]\) specifying the shrinkage parameter

      Default: shrinkageParameter=1.0 (no shrinkage)

    • getShrinkageParameter

      public double getShrinkageParameter()
      Returns the current shrinkage parameter.
      Returns:
a double, the value of the shrinkage parameter
    • setTrainingData

      public void setTrainingData(double[][] xy, int responseColumnIndex, PredictiveModel.VariableType[] varType)
      Sets up the training data for the predictive model.

      By calling this method, the problem is either initialized or reset to use the data in the arguments.

      Overrides:
      setTrainingData in class PredictiveModel
      Parameters:
      xy - a double matrix containing the training data and associated response values
      responseColumnIndex - an int specifying the column index in xy of the response variable
      varType - a PredictiveModel.VariableType array of length equal to xy[0].length containing the type of each variable
    • setSaveTrees

      public void setSaveTrees(boolean saveTrees)
      Sets the flag to save the boosted trees. When true, an ArrayList of the boosted trees is available after running predict(). See getTreeList(). The default is saveTrees=true.
      Parameters:
      saveTrees - a boolean