imsl.data_mining.ALACARTDecisionTree¶

class ALACARTDecisionTree(response_col_idx, var_type, criteria=0, use_gain_ratio=False, n_surrogate_splits=0, min_n_node=7, min_split=21, max_x_cats=10, max_size=100, max_depth=10, priors=None, response_name='Y', var_names=None, class_names=None, categ_names=None)¶

Generate a decision tree using the ALACART method.

Generate a decision tree for a single response variable and two or more predictor variables using the ALACART method.

Parameters:

response_col_idx (int) – Column index of the response variable.
var_type ((N,) array_like) –
Array indicating the type of each variable.

var_type[i] Type

0 Categorical

1 Ordered Discrete (Low, Med., High)

2 Quantitative or Continuous

3 Ignore this variable
criteria (int, optional) –
Specifies which criteria the ALACART method should use in the gain calculations to determine the best split at each node.

criteria Measure

0 Shannon Entropy

1 Gini Index

2 Deviance

Default is 0.

Shannon Entropy

Shannon Entropy is a measure of randomness or uncertainty. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Shannon Entropy is defined as

\(\sum_{i=1}^{C}p_i\log(p_i)\)

where

\(p_i = Pr(Y=i)\)

and where

\(p_i \log(p_i) = 0\)

if \(p_i=0\).

Gini Index

Gini Index is a measure of statistical dispersion. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Gini index is defined as

\(I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array}}^C p(i|S)p(j|S)=1-\sum^C_{i=1}p^2(i|S)\)

where \(p(i|S)\) denotes the probability that the variable is equal to the state \(i\) on the dataset \(S\).

Deviance

Deviance is a measure of the quality of fit. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Deviance measure is

\(\sum_{i=1}^{C}n_i\log(p_i)\)

where

\(p_i = Pr(Y=i)\)

and \(n_i\) is the number of cases with \(Y=i\) on the node.
use_gain_ratio (bool, optional) –
The ALACART method uses a gain ratio instead of just the gain to determine the best split.

Default is False.
n_surrogate_splits (int, optional) –
Indicates the number of surrogate splits.

Default is 0.
min_n_node (int, optional) –
Do not split a node if one of its child nodes will have fewer than min_n_node observations.

Default is 7.
min_split (int, optional) –
Do not split a node if the node has fewer than min_split observations.

Default is 21.
max_x_cats (int, optional) –
Allow for up to max_x_cats for categorical predictor variables.

Default is 10.
max_size (int, optional) –
Stop growing the tree once it has reached max_size number of nodes.

Default is 100.
max_depth (int, optional) –
Stop growing the tree once it has reached max_depth number of levels.

Default is 10.
priors ((N,) array_like, optional) – An array containing prior probabilities for class membership. The argument is ignored for continuous response variables. By default, the prior probabilities are estimated from the data.
response_name (string, optional) –
A string representing the name of the response variable.

Default is “Y”.
var_names (tuple, optional) –
A tuple containing strings representing the names of predictors.

Default is “X0”, “X1”, etc.
class_names (tuple, optional) –
A tuple containing strings representing the names of the different classes in Y, assuming Y is of categorical type.

Default is “0”, “1”, etc.
categ_names (tuple, optional) –
A tuple containing strings representing the names of the different category levels for each predictor of categorical type.

Default is “0”, “1”, etc.

Notes

ALACART implements the method of Breiman, Friedman, Olshen and Stone ([1]), the original authors and developers of CART(TM). CART(TM) is the trademarked name for Classification and Regression Trees. In ALACART, only binary splits are considered for categorical variables. That is, if X has values {A, B, C, D}, splits into only two subsets are considered, e.g., {A} and {B, C, D}, or {A, B} and {C, D}, are allowed, but a three-way split defined by {A}, {B} and {C,D} is not.

For classification problems, ALACART uses a similar criterion to information gain called impurity. The method searches for a split that reduces the node impurity the most. For a given set of data S at a node, the node impurity for a C-class categorical response is a function of the class probabilities

\[I(S)=\phi(p(1|S),p(2|S),\ldots,p(C|S))\]

The measure function \(\phi(\cdot)\) should be 0 for “pure” nodes, where all Y are in the same class, and maximum when Y is uniformly distributed across the classes.

As only binary splits of a subset S are considered (\(S_1\), \(S_2\) such that \(S=S_1\cup S_2\) and \(S=S_1\cap S_2=\emptyset\)), the reduction in impurity when splitting S into \(S_1\), \(S_2\) is

\[\Delta I=I(S)-q_1I\left(S_1\right)-q_2I\left(S_2\right)\]

where

\[q_j = Pr[S_j], j = 1,2\]

is the node probability.

The gain criteria and the reduction in impurity \(\Delta I\) are similar concepts and equivalent when I is entropy and when only binary splits are considered. Another popular measure for the impurity at a node is the Gini index, given by

\[\begin{split}I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array}}^C p(i|S)p(j|S)=1-\sum^C_{i=1}p^2(i|S)\end{split}\]

If Y is an ordered response or continuous, the problem is a regression problem. ALACART generates the tree using the same steps, except that node-level measures or loss-functions are the mean squared error (MSE) or mean absolute error (MAD) rather than node impurity measures.

References

[1]	Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984) Classification and Regression Trees, Chapman & Hall. For the latest information on CART visit: http://www.salford-systems.com/cart.php.

Examples

In this example, we use a small dataset with response variable, Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (True (0), False (1)). A decision tree is generated by the ALACART method. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. Notice that ALACART splits on Outlook, then Temperature.

>>> import numpy as np
>>> import imsl.data_mining as dm
>>> xy = np.array([[0, 85.0, 85.0, 0, 0],
...                [0, 80.0, 90.0, 1, 0],
...                [1, 83.0, 78.0, 0, 1],
...                [2, 70.0, 96.0, 0, 1],
...                [2, 68.0, 80.0, 0, 1],
...                [2, 65.0, 70.0, 1, 0],
...                [1, 64.0, 65.0, 1, 1],
...                [0, 72.0, 95.0, 0, 0],
...                [0, 69.0, 70.0, 0, 1],
...                [2, 75.0, 80.0, 0, 1],
...                [0, 75.0, 70.0, 1, 1],
...                [1, 72.0, 90.0, 1, 1],
...                [1, 81.0, 75.0, 0, 1],
...                [2, 71.0, 80.0, 1, 0]])
>>> response_column_index = 4
>>> var_type = np.array([0, 2, 2, 0, 0], dtype=int)
>>> names = ["Outlook", "Temperature", "Humidity", "Wind"]
>>> class_names = ["Don't Play", "Play"]
>>> var_levels = ["Sunny", "Overcast", "Rainy", "False", "True"]
>>> print("Decision Tree using Method ALACART:\n\n")
... 
Decision Tree using Method ALACART:
>>> with dm.ALACARTDecisionTree(response_column_index, var_type,
...                             min_n_node=2, min_split=3, max_x_cats=10,
...                             max_size=50, max_depth=10, var_names=names,
...                             class_names=class_names,
...                             categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     print(decision_tree)
... 
Decision Tree:
Node 0: Cost = 0.357, N = 14.0, Level = 0, Child Nodes:  1  8
P(Y=0) = 0.357
P(Y=1) = 0.643
Predicted Y: Play
   Node 1: Cost = 0.357, N = 10.0, Level = 1, Child Nodes:  2  7
   Rule: Outlook in: { Sunny  Rainy }
   P(Y=0) = 0.500
   P(Y=1) = 0.500
   Predicted Y: Don't Play
      Node 2: Cost = 0.214, N = 8.0, Level = 2, Child Nodes:  3  6
      Rule: Temperature <= 77.500
      P(Y=0) = 0.375
      P(Y=1) = 0.625
      Predicted Y: Play
         Node 3: Cost = 0.214, N = 6.0, Level = 3, Child Nodes:  4  5
         Rule: Temperature <= 73.500
         P(Y=0) = 0.500
         P(Y=1) = 0.500
         Predicted Y: Play
            Node 4: Cost = 0.071, N = 4.0, Level = 4
            Rule: Temperature <= 70.500
            P(Y=0) = 0.250
            P(Y=1) = 0.750
            Predicted Y: Play
            Node 5: Cost = 0.000, N = 2.0, Level = 4
            Rule: Temperature > 70.500
            P(Y=0) = 1.000
            P(Y=1) = 0.000
            Predicted Y: Don't Play
         Node 6: Cost = 0.000, N = 2.0, Level = 3
         Rule: Temperature > 73.500
         P(Y=0) = 0.000
         P(Y=1) = 1.000
         Predicted Y: Play
      Node 7: Cost = 0.000, N = 2.0, Level = 2
      Rule: Temperature > 77.500
      P(Y=0) = 1.000
      P(Y=1) = 0.000
      Predicted Y: Don't Play
   Node 8: Cost = 0.000, N = 4.0, Level = 1
   Rule: Outlook in: { Overcast }
   P(Y=0) = 0.000
   P(Y=1) = 1.000
   Predicted Y: Play

Methods

`predict`(data[, weights])	Compute predicted values using a decision tree.
`train`(training_data[, weights])	Train a decision tree using training data and weights.

Attributes

`categ_names`	Return names of category levels for each categorical predictor.
`class_names`	Return names of different classes in Y.
`n_classes`	Return number of classes assumed by response variable.
`n_levels`	Return number of levels or depth of tree.
`n_nodes`	Return number of nodes or size of tree.
`n_preds`	Return number of predictors used in the model.
`n_surrogates`	Return number of surrogate splits searched for at each node.
`pred_n_values`	Return number of values of predictor variables.
`pred_type`	Return types of predictor variables.
`response_name`	Return name of the response variable.
`response_type`	Return type of the response variable.
`var_names`	Return names of the predictors.