decisionTreePredict

Computes predicted values using a decision tree.

Synopsis

decisionTreePredict (x, varType, tree)

Required Arguments

float x[[]] (Input)
Array of size n × ncols containing the data.
int varType[] (Input)
Array of length ncols indicating the type of each variable (see the sketch following this argument list).

Value   Type
0       Categorical
1       Ordered Discrete (Low, Med., High)
2       Quantitative or Continuous
3       Ignore this variable
structure tree (Input)
An estimated decision tree.
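For instance, for the kyphosis data used in the example below, with a categorical response in column 0 followed by three continuous predictors, the type array is:

# Column 0 (Kyphosis) is categorical; columns 1-3 (Age, Number, Start)
# are quantitative/continuous.
varType = [0, 2, 2, 2]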

Return Value

An array of length n containing the predicted values. If an error occurs, None is returned.

Optional Arguments

nSurrogates, int (Input)

The number of surrogate splits to use for handling missing values. Surrogate splits are available only for methods that compute them (see Warning Errors).

Default: nSurrogates = 0.

weights, float[] (Input)

An array of length n containing case weights.

Default: weights[i]=1.0.

xResponseCol, int (Input)

The column index of the response variable, if present in the data. A negative value indicates there is no response column.

Default: xResponseCol = -1.

xNodeIds (Output)
An array of length n containing, for each row of x, the ID of the terminal node to which that observation belongs.
errorSs (Output)
The mean squared prediction error, available when response values are present in the data.

Description

To predict a new set of cases using a fitted or estimated decision tree, decisionTreePredict finds the terminal node of the tree to which each new case belongs. The predicted value is then the predicted value of that node. This is a matter of “putting the data through the tree.” For example, suppose the following weather conditions:

Temperature = 70
Humidity = 82
Outlook = Rainy
Wind = FALSE

According to the C4.5 decision tree in Example 1 for decisionTree, will the golfer play golf or not under these conditions? The tree splits the root node on Outlook into three child nodes: {Sunny, Overcast, Rainy}. Rainy defines Node 5. Node 5 is split into child Nodes 6 and 7, according to the presence of wind. If there is wind (Node 7), the prediction is “Don’t Play.” If there is no wind (Node 6), the prediction is “Play.” Since Wind = FALSE, the new observation belongs to Node 6, and the tree predicts that the golfer will play under the given weather conditions. In the ALACART decision tree, Node 4 is the terminal node for this case, and the associated prediction is “Play.”
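
A hedged sketch of this walk-through (the fitted tree golfTree, the column order, and the integer encoding of the categorical levels are illustrative assumptions, not taken from this page):

# Hypothetical sketch: golfTree is assumed to be the structure returned
# by decisionTree for the golf data of Example 1, with columns assumed
# ordered [Outlook, Temperature, Humidity, Wind, Play]; Outlook, Wind,
# and Play are categorical, with Rainy and FALSE encoded as shown.
newCase = [[2, 70, 82, 0, 0]]   # Outlook=Rainy, Temp=70, Humidity=82, Wind=FALSE
golfVarType = [0, 2, 2, 0, 0]
predicted = decisionTreePredict(newCase, golfVarType, golfTree,
                                xResponseCol=4)
print(predicted[0])             # class code corresponding to "Play"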

Comments

1. Users can request predictions and error sum of squares directly from decisionTree or use this separate prediction function when it is not necessary to re-estimate a decision tree.

2. If requested, the mean squared prediction error is computed when actual response values are available in the data.

3. For cases with missing values in predictors that are involved in the splitting rules of the tree, decisionTreePredict uses surrogate rules if they are available and requested. Otherwise, predicted values are missing, and the error sum of squares does not include those cases.
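
For Comment 3, a minimal sketch of requesting surrogate splits. Surrogates apply only to ALACART (method = 1), and passing nSurrogates to decisionTree follows the decisionTree documentation; both the argument combination and the NaN encoding of the missing predictor are assumptions here:

# Sketch only: reuses xy, responseColIdx, and varType from the example
# below, but fits an ALACART tree so that surrogate splits exist.
from numpy import nan

tree1 = decisionTree(xy, responseColIdx, varType,
                     method=1, nSurrogates=2)    # ALACART with surrogates
xMissing = [[0, nan, 3, 5]]                      # Age is missing
predicted = decisionTreePredict(xMissing, varType, tree1,
                                xResponseCol=responseColIdx,
                                nSurrogates=2)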

Example

Using the kyphosis data of Example 2 for decisionTree, this example illustrates using a separate call to decisionTreePredict to obtain the predicted values for a new set of observations (xyTest).

from __future__ import print_function
from pyimsl.stat.decisionTree import decisionTree
from pyimsl.stat.decisionTreePredict import decisionTreePredict
from pyimsl.stat.decisionTreeFree import decisionTreeFree

xy = [[0, 71, 3, 5],
      [0, 158, 3, 14],
      [1, 128, 4, 5],
      [0, 2, 5, 1],
      [0, 1, 4, 15],
      [0, 1, 2, 16],
      [0, 61, 2, 17],
      [0, 37, 3, 16],
      [0, 113, 2, 16],
      [1, 59, 6, 12],
      [1, 82, 5, 14],
      [0, 148, 3, 16],
      [0, 18, 5, 2],
      [0, 1, 4, 12],
      [0, 168, 3, 18],
      [0, 1, 3, 16],
      [0, 78, 6, 15],
      [0, 175, 5, 13],
      [0, 80, 5, 16],
      [0, 27, 4, 9],
      [0, 22, 2, 16],
      [1, 105, 6, 5],
      [1, 96, 3, 12],
      [0, 131, 2, 3],
      [1, 15, 7, 2],
      [0, 9, 5, 13],
      [0, 8, 3, 6],
      [0, 100, 3, 14],
      [0, 4, 3, 16],
      [0, 151, 2, 16],
      [0, 31, 3, 16],
      [0, 125, 2, 11],
      [0, 130, 5, 13],
      [0, 112, 3, 16],
      [0, 140, 5, 11],
      [0, 93, 3, 16],
      [0, 1, 3, 9],
      [1, 52, 5, 6],
      [0, 20, 6, 9],
      [1, 91, 5, 12],
      [1, 73, 5, 1],
      [0, 35, 3, 13],
      [0, 143, 9, 3],
      [0, 61, 4, 1],
      [0, 97, 3, 16],
      [1, 139, 3, 10],
      [0, 136, 4, 15],
      [0, 131, 5, 13],
      [1, 121, 3, 3],
      [0, 177, 2, 14],
      [0, 68, 5, 10],
      [0, 9, 2, 17],
      [1, 139, 10, 6],
      [0, 2, 2, 17],
      [0, 140, 4, 15],
      [0, 72, 5, 15],
      [0, 2, 3, 13],
      [1, 120, 5, 8],
      [0, 51, 7, 9],
      [0, 102, 3, 13],
      [1, 130, 4, 1],
      [1, 114, 7, 8],
      [0, 81, 4, 1],
      [0, 118, 3, 16],
      [0, 118, 4, 16],
      [0, 17, 4, 10],
      [0, 195, 2, 17],
      [0, 159, 4, 13],
      [0, 18, 4, 11],
      [0, 15, 5, 16],
      [0, 158, 5, 14],
      [0, 127, 4, 12],
      [0, 87, 4, 16],
      [0, 206, 4, 10],
      [0, 11, 3, 15],
      [0, 178, 4, 15],
      [1, 157, 3, 13],
      [0, 26, 7, 13],
      [0, 120, 2, 13],
      [1, 42, 7, 6],
      [0, 36, 4, 13]]

xyTest = [[0, 71, 3, 5],
          [1, 128, 4, 5],
          [0, 1, 4, 15],
          [0, 61, 6, 10],
          [0, 113, 2, 16],
          [1, 82, 5, 14],
          [0, 148, 3, 16],
          [0, 1, 4, 12],
          [0, 1, 3, 16],
          [0, 175, 5, 13]]

responseColIdx = 0               # column 0 of xy holds the response
method = 3                       # QUEST tree-generation method
control = [5, 10, 10, 50, 10]    # tree-growth controls (node sizes, max categories, max nodes, max depth)
varType = [0, 2, 2, 2]           # categorical response, continuous predictors
nTest = 10
names = ["Age", "Number", "Start"]
classNames = ["Absent", "Present"]
responseName = ["Kyphosis"]

errorSs = []                     # output: mean squared prediction error
tree = decisionTree(xy, responseColIdx, varType,
                    method=method, nFolds=1,     # nFolds=1: no cross-validation
                    control=control)

predicted = decisionTreePredict(xyTest, varType, tree,
                                xResponseCol=responseColIdx,
                                errorSs=errorSs)

print("\nPredictions for test data:")
print("%5s%8s%7s%10s\n" % (names[0], names[1], names[2],
                           responseName[0]))
for i in range(nTest):
    idx = int(predicted[i])
    print("%5.0f%8.0f%7.0f%10s" % (
        xyTest[i][1],
        xyTest[i][2],
        xyTest[i][3],
        classNames[idx]))

print("Mean squared prediction error: %f" % errorSs[0])

decisionTreeFree(tree)

Output

Predictions for test data:
  Age  Number  Start  Kyphosis

   71       3      5    Absent
  128       4      5   Present
    1       4     15    Absent
   61       6     10    Absent
  113       2     16    Absent
   82       5     14    Absent
  148       3     16    Absent
    1       4     12    Absent
    1       3     16    Absent
  175       5     13    Absent
Mean squared prediction error: 0.100000

Warning Errors

IMSLS_NO_SURROGATES   Use of surrogates is limited to method 1 (ALACART).
IMSLS_INVALID_PARAM   The value of # is out of range.