imsl.data_mining.CHAIDDecisionTree

class CHAIDDecisionTree(response_col_idx, var_type, alphas=(0.05, 0.05, -1.0), min_n_node=7, min_split=21, max_x_cats=10, max_size=100, max_depth=10, priors=None, response_name='Y', var_names=None, class_names=None, categ_names=None)

Generate a decision tree using the CHAID method.

Generate a decision tree for a single response variable and two or more predictor variables using the CHAID method.

Parameters:
  • response_col_idx (int) – Column index of the response variable.
  • var_type ((N,) array_like) –

    Array indicating the type of each variable.

    var_type[i] Type
    0 Categorical
    1 Ordered Discrete (Low, Med., High)
    2 Quantitative or Continuous
    3 Ignore this variable
  • alphas (tuple, optional) –

    Tuple containing the significance levels: alphas[0] is the significance level for split variable selection, alphas[1] is the significance level for merging categories of a variable, and alphas[2] is the significance level for splitting previously merged categories. Valid values satisfy 0 < alphas[1] < 1.0 and alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.

    Default is (0.05, 0.05, -1.0).

  • min_n_node (int, optional) –

    Do not split a node if one of its child nodes will have fewer than min_n_node observations.

    Default is 7.

  • min_split (int, optional) –

    Do not split a node if the node has fewer than min_split observations.

    Default is 21.

  • max_x_cats (int, optional) –

    Allow for up to max_x_cats categories for categorical predictor variables.

    Default is 10.

  • max_size (int, optional) –

    Stop growing the tree once it has reached max_size number of nodes.

    Default is 100.

  • max_depth (int, optional) –

    Stop growing the tree once it has reached max_depth number of levels.

    Default is 10.

  • priors ((N,) array_like, optional) – An array containing prior probabilities for class membership. The argument is ignored for continuous response variables. By default, the prior probabilities are estimated from the data.
  • response_name (string, optional) –

    A string representing the name of the response variable.

    Default is “Y”.

  • var_names (tuple, optional) –

    A tuple containing strings representing the names of predictors.

    Default is “X0”, “X1”, etc.

  • class_names (tuple, optional) –

    A tuple containing strings representing the names of the different classes in Y, assuming Y is of categorical type.

    Default is “0”, “1”, etc.

  • categ_names (tuple, optional) –

    A tuple containing strings representing the names of the different category levels for each predictor of categorical type.

    Default is “0”, “1”, etc.

Notes

CHAID, an acronym for chi-square automatic interaction detection, is due to Kass ([1]). The method is appropriate only for categorical or ordered discrete predictor variables.

At each node, imsl.data_mining.CHAIDDecisionTree() looks for the best splitting variable. The approach is as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X and the categories of Y. The least significant result is noted and, if it meets the merge threshold, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests to determine whether further merging is possible.

If a merged category consists of three or more of the original categories of X, imsl.data_mining.CHAIDDecisionTree() calls for a step to test whether the merged category should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result meets a threshold, then the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth.

Once each predictor is processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories defines the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), then the node is not split. This growing procedure continues until one or more stopping conditions are met.
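The merge step can be illustrated with a short, standard-library-only sketch. It is a simplified reading of the description above and not the library's implementation: the function names are hypothetical, X is treated as nominal so that any pair of categories may merge, "meeting the threshold" is assumed to mean a p-value larger than the merge significance level, and both the re-splitting step and the p-value adjustment are left out.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate, integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!
        term = total = math.exp(-half)
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Closed form for odd df, starting from the df = 1 case.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * i + 1)
    return total


def pair_p_value(row_a, row_b):
    """Pearson chi-squared p-value for the 2 x J table of two X categories."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    n_a, n_b = sum(row_a), sum(row_b)
    n = n_a + n_b
    if len(cols) < 2 or n_a == 0 or n_b == 0:
        return 1.0  # degenerate table: no evidence of a difference
    stat = 0.0
    for a, b in cols:
        for observed, row_total in ((a, n_a), (b, n_b)):
            expected = row_total * (a + b) / n
            stat += (observed - expected) ** 2 / expected
    return chi2_sf(stat, len(cols) - 1)


def merge_categories(counts, alpha_merge=0.05):
    """Repeatedly merge the least significantly different pair of categories.

    counts maps each category of X to its list of counts per class of Y.
    Returns the final groups as tuples of original category labels.
    """
    groups = {(cat,): list(row) for cat, row in counts.items()}
    while len(groups) > 1:
        # The least significant pair is the one with the largest p-value.
        p, g1, g2 = max(
            ((pair_p_value(groups[a], groups[b]), a, b)
             for a, b in combinations(groups, 2)),
            key=lambda item: item[0],
        )
        if p <= alpha_merge:
            break  # every remaining pair differs significantly
        merged = [x + y for x, y in zip(groups.pop(g1), groups.pop(g2))]
        groups[g1 + g2] = merged
    return list(groups)


# Made-up counts of a two-class Y within three categories of X.
example_groups = merge_categories(
    {"a": [30, 10], "b": [28, 12], "c": [5, 35]}
)
```

In this made-up table, categories a and b have similar class proportions and end up in one group, while c stays on its own. A fuller sketch would also restrict merging to adjacent categories for ordered predictors, which is how such predictors are usually handled in CHAID.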

References

[1] Kass, G.V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, Vol. 29, No. 2, pp. 119-127.

Methods

predict(data[, weights]) Compute predicted values using a decision tree.
train(training_data[, weights]) Train a decision tree using training data and weights.
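A typical workflow constructs the tree, calls train(), and then calls predict(). The sketch below uses only the constructor arguments and the two methods listed on this page. Everything else is an assumption to check against the library before relying on it: the toy data is made up, the helper train_and_predict is hypothetical, var_type is assumed to hold one entry per column of the data including the response column, and predict() is assumed to accept data in the same column layout as the training data.

```python
# Made-up toy data: columns 0 and 1 are categorical predictors,
# column 2 is a categorical response with classes 0 and 1.
xy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [2, 0, 1],
    [2, 1, 1],
    [2, 0, 1],
    [0, 1, 0],
]
var_type = [0, 0, 0]   # assumed: one entry per column, 0 = categorical
response_col_idx = 2


def train_and_predict(xy, var_type, response_col_idx):
    """Fit a CHAID tree and predict on the training rows.

    Hypothetical helper; requires NumPy and the IMSL Python library,
    which are imported here so the data above can be built without them.
    """
    import numpy as np
    import imsl.data_mining as dm

    data = np.asarray(xy, dtype=float)
    tree = dm.CHAIDDecisionTree(
        response_col_idx,
        np.asarray(var_type, dtype=int),
        min_n_node=2,   # lowered from the defaults so that a
        min_split=4,    # data set this small can be split at all
    )
    tree.train(data)
    return tree, tree.predict(data)
```

Calling train_and_predict(xy, var_type, response_col_idx) returns the fitted tree together with whatever predict() returns; the structure of that return value is not described on this page, so the sketch passes it through unchanged.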

Attributes

categ_names Return names of category levels for each categorical predictor.
class_names Return names of different classes in Y.
n_classes Return number of classes assumed by response variable.
n_levels Return number of levels or depth of tree.
n_nodes Return number of nodes or size of tree.
n_preds Return number of predictors used in the model.
pred_n_values Return number of values of predictor variables.
pred_type Return types of predictor variables.
response_name Return name of the response variable.
response_type Return type of the response variable.
var_names Return names of the predictors.