logisticRegPredict

Predict a binomial or multinomial outcome given an estimated model and new values of the independent variables.

Synopsis

logisticRegPredict (nIndependent, nClasses, coefs, x)

Required Arguments

int nIndependent (Input)
The number of independent variables.
int nClasses (Input)
The number of discrete outcomes, or classes.
float coefs[] (Input)
Array of length nCoefficients × nClasses containing the coefficient estimates of the logistic regression model. nCoefficients is the number of coefficients in the model.
float x[] (Input)
Array of length nObservations × nIndependent containing the values of the independent variables.

Return Value

An array containing the predicted responses. The predicted value is the predicted number of outcomes in each class for each new observation provided in x. If frequencies[i] = 1 for all observations, then the return value is equivalent to the predicted probabilities. If the option confidence is specified, the length of the return array is (nObservations × nClasses × 3) and the array includes the lower and upper prediction limits. Otherwise, the array is of length (nObservations × nClasses). Note that if the data is column-oriented (see columnWise), the return value will also be column-oriented.
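The mapping from the return array to (observation, class) cells can be sketched with numpy. The (nObservations, nClasses, 3) layout below is an assumption for row-oriented data with confidence specified; Example 2 shows that column-oriented input transposes the result, so verify the layout for your call.

```python
import numpy as np

# Hypothetical flat return for 3 observations and 2 classes with confidence:
# length = nObservations * nClasses * 3 (estimate, lower, upper per cell).
n_obs, n_classes = 3, 2
flat = np.arange(n_obs * n_classes * 3, dtype=float)  # stand-in values only

# Assuming a row-oriented layout, reshape to (nObservations, nClasses, 3).
pred = flat.reshape(n_obs, n_classes, 3)
estimates = pred[:, :, 0]   # predicted counts (or probabilities)
lower = pred[:, :, 1]       # lower prediction limits
upper = pred[:, :, 2]       # upper prediction limits
```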

Optional Arguments

y, float[] (Input)

Array containing the actual responses corresponding to the independent variables. If present, the expected length for y is nObservations × nClasses unless one of groups or groupCounts is also present. y is required when prederr is requested.

Default: y is not given.

groupCounts (Input)

or

groups (Input)

These optional arguments specify alternative formats of the input array y. If groupCounts is present, y is of length nObservations × (nClasses - 1), and contains counts for all but one of the classes for each observation. The missing class is treated as the reference class. If groupCounts is present and if any y[i] > 1, frequencies is required. If groups is present, the input array y is of length nObservations and y[i] contains the group number to which the observation belongs. In this case, frequencies[i] is set to 1 for all observations.

Default: Unless one of the arguments is present, the function expects that y is nObservations × nClasses and contains counts for all the classes.
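The three encodings of y are equivalent and can be converted between one another. A sketch in plain numpy, using illustrative data with nObservations = 4 and nClasses = 3 (class numbers are 1-based, and the last class is taken as the reference class, per the defaults above):

```python
import numpy as np

# groups form: one class number per observation.
groups = np.array([1, 3, 2, 3])

# Full-counts form (the default): nObservations x nClasses indicator counts.
counts_full = np.zeros((4, 3), dtype=int)
counts_full[np.arange(4), groups - 1] = 1   # class numbers are 1-based

# groupCounts form: drop the reference (last) class column;
# its counts are implied by the frequencies.
counts_dropped = counts_full[:, :-1]
```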

columnWise (Input)

If present, the input arrays are column-oriented. That is, contiguous elements in x are values of the same independent variable, or column, except at multiples of nObservations.

Default: Input arrays are row-oriented.
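The two layouts are transposes of one another; a small numpy sketch with nObservations = 3 and nIndependent = 2 (values are illustrative):

```python
import numpy as np

# Row-oriented (the default): each row is one observation.
row_wise = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])

# Column-oriented: contiguous elements belong to the same variable,
# so the first nObservations entries are all values of variable 1.
col_wise = row_wise.T
```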

frequencies, int[] (Input)

Array of length nObservations containing the number of replications or trials for each of the observations. This argument is required if groupCounts is present and if any y[i] > 1.

Default: frequencies[i] = 1.

referenceClass, int (Input)

Number specifying which class or outcome category to use as the reference class. The purpose of the reference class is explained in the Description section.

Default: referenceClass = nClasses.

noIntercept (Input)

If present, the model will not include an intercept term.

Default: The intercept term is included.

xIndices, xin (Input)

An array of length nXin containing the indices of the variables in x to include as independent variables in the logistic regression model.

Default: All nIndependent variables are included.

xInteractions, xinteract (Input)

An array of length nXinteract × 2 providing pairs of variable indices of x that define the interaction terms in the model. The two indices within each pair must be distinct.

Default: No interaction terms are included.
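An interaction term is the elementwise product of the two named columns. A sketch of how such a column would augment a design matrix (the pair of indices is illustrative, and 0-based here; check the function's own indexing convention):

```python
import numpy as np

# 2 observations, 3 independent variables (illustrative values).
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Hypothetical xInteractions pair: variables 0 and 1.
pair = (0, 1)
interaction = x[:, pair[0]] * x[:, pair[1]]   # x1 * x2 per observation
x_aug = np.column_stack([x, interaction])     # design matrix with the term added
```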

confidence, float (Input)

This value provides the confidence level to use in the calculation of the prediction intervals. If this argument is present and valid (0 < confidence < 100), confidence% prediction intervals are provided for each predicted value.

Default: Prediction intervals are not provided.

model, structure (Input)

A structure containing information about the logistic regression fit. See logisticRegression. Required when confidence is present.

Default: Not needed if confidence is not present.

prederr (Output)
The mean squared prediction error when y is present.

Description

Function logisticRegPredict calculates the predicted outcomes for a binomial or multinomial response variable given an estimated logistic regression model and new observations of the independent variables.

For a binary response y, the objective is to estimate the conditional probability of success, \(\pi_1(x)=\Pr\left[ y=1 | x \right]\), where \(x=\left( x_1,x_2,\ldots,x_p \right)'\) is a realization of p independent variables. In particular, the estimated probability of success is

\[\hat{\pi}_1(x) = \frac{\exp\left(\hat{\eta}_1\right)} {1 +\exp\left(\hat{\eta}_1\right)}\]

where

\[\hat{\eta}_1 = \hat{\beta}_{01} + x^T \hat{\beta}_1\]

and

\(\hat{\beta}_{01}\) and \(\hat{\beta}_1=\left( \hat{\beta}_{11},\hat{\beta}_{12},\ldots,\hat{\beta}_{1p} \right)'\) are the coefficient estimates. Then \(\hat{y}_i=n_i \hat{\pi}_1 \left( x_i \right)\), where \(n_i\) is the number of trials for observation i. That is, \(\hat{y}_i\) is the expected value of the response under the estimated model given the values of the independent variables.
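The binary formula can be evaluated directly with the coefficient estimates from Example 1 below; a plain-numpy sketch, not a call to the function:

```python
import numpy as np

# Coefficient estimates from Example 1 (intercept, slope).
b0, b = -60.76, np.array([34.30])

x_new = np.array([1.87])   # new log dosage
n_trials = 22              # beetles exposed at that dosage

# eta = b0 + x'b, then pi_1 = exp(eta) / (1 + exp(eta)), and yhat = n * pi_1.
eta = b0 + x_new @ b
pi1 = np.exp(eta) / (1.0 + np.exp(eta))
yhat = n_trials * pi1      # expected number of deaths, ~21.28 as in Example 1
```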

Similarly, for a multinomial response, with class K the reference class,

\[\hat{\pi}_k(x_i) = \frac {\exp\left(\hat{\eta}_{ik}\right)} {\sum\limits_{l=1}^{K} \exp\left(\hat{\eta}_{il}\right)} = \frac {\exp\left(\hat{\eta}_{ik}\right)} {1 + \sum\limits_{l=1}^{K-1} \exp\left(\hat{\eta}_{il}\right)}\]

Then

\[\hat{\pi}_K(x_i) = 1 - \sum_{l=1}^{K-1} \hat{\pi}_l(x_i)\]

and \(\hat{y}_{ik}=n_i \hat{\pi}_k \left( x_i \right)\). If the actual responses are given, the mean squared prediction error is

\[\mathrm{mspe} = \frac{1}{NK} \sum_{k=1}^{K} \sum_{i=1}^{N} \left(\hat{y}_{ik} - y_{ik}\right)^2\]
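The MSPE formula above is a simple average over all N × K cells; a sketch with illustrative predicted and actual counts (N = 2 observations, K = 3 classes):

```python
import numpy as np

# Illustrative predicted counts (rows: observations, columns: classes).
yhat = np.array([[0.2, 0.3, 0.5],
                 [0.6, 0.1, 0.3]])
# Illustrative actual counts for the same observations.
y = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

N, K = y.shape
mspe = np.sum((yhat - y) ** 2) / (N * K)   # mean squared prediction error
```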

If requested, \(100(1-\alpha)\%\) prediction intervals are provided for the predicted values by first finding the prediction standard errors of the logits, \(\hat{\eta}_{ik}=\hat{\beta}_{0k}+x_i^T \hat{\beta}_k\), and then evaluating

\[\frac {\exp \left(\hat{\eta}_{ik} \pm z_{\alpha/2} SE \left(\hat{\eta}_{ik}\right)\right)} {1 + \sum\limits_{l=1}^{K-1} \exp \left(\hat{\eta}_{il} \pm z_{\alpha/2} SE \left(\hat{\eta}_{il}\right)\right)}\]

to obtain the upper and lower limits for \(\hat{\pi}_k \left( x_i \right)\), where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution. Note that the stated properties of the prediction intervals hold only when the new observations are inside the range of the original data used to fit the model. Generally, the model should not be used to extrapolate outside the range of the original data. See Hosmer and Lemeshow (2000) for further details.
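The interval formula can be sketched for the binary case (K = 2), where the denominator reduces to \(1 + \exp(\cdot)\). The logit and its standard error below are illustrative stand-ins, not values produced by the function; only the normal quantile and the transform follow the formula above:

```python
import math
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 normal quantile, ~1.96

eta, se = 1.2, 0.4   # hypothetical logit estimate and its prediction SE

# Shift the logit by +-z*SE, then map through the inverse logit.
lower = math.exp(eta - z * se) / (1.0 + math.exp(eta - z * se))
upper = math.exp(eta + z * se) / (1.0 + math.exp(eta + z * se))
```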

Examples

Example 1

The model fit to the beetle mortality data of Prentice (1976) is used to predict the expected mortality at three new doses. For the original data, see Example 1 in logisticRegression.

Log Dosage   Number of Beetles Exposed   Number of Deaths
   1.66                 16                      ??
   1.87                 22                      ??
   1.71                 11                      ??

from numpy import *
from pyimsl.stat.logisticRegPredict import logisticRegPredict
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.writeMatrix import writeMatrix

y1 = [6, 13, 18, 28, 52, 53, 61, 60]
x1 = [1.69, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883]
x2 = [1.66, 1.87, 1.71]
freqs1 = [59, 60, 62, 56, 63, 59, 62, 60]
freqs2 = [16, 22, 11]

n_classes = 2
n_observations = 8
n_independent = 1
n_coefs = 2
n_new_observations = 3

coefs = logisticRegression(n_independent,
                           n_classes, x1, y1,
                           groupCounts=True,
                           frequencies=freqs1)

rowLabels = ['1', '2']
writeMatrix("Coefficient Estimates", coefs.reshape(4)[0:2],
            transpose=True,
            noColLabels=True,
            rowLabels=rowLabels)


yhat = logisticRegPredict(n_independent, n_classes, coefs, x2,
                          frequencies=freqs2)
tmp = []
colLabels = '              Dose          N      Expected Deaths'
yhatlst = [yhat[0, 0], yhat[1, 0], yhat[2, 0]]
for i in range(0, n_new_observations):
    tmp.extend([[x2[i], freqs2[i], yhatlst[i]]])

writeMatrix(colLabels, tmp, noRowLabels=True, noColLabels=True)

Output

 
Coefficient Estimates
   1       -60.76
   2        34.30
 
              Dose          N      Expected Deaths
              1.66        16.00         0.34
              1.87        22.00        21.28
              1.71        11.00         1.19

Example 2

A logistic regression model is fit to artificial (noisy) data with 4 classes and 3 independent variables and used to predict class probabilities at 10 new values of the independent variables. Also shown are the mean squared prediction error and upper and lower limits of the 95% prediction interval for each predicted value.

from __future__ import print_function
from numpy import *
from pyimsl.stat.logisticRegPredict import logisticRegPredict
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.chiSquaredCdf import chiSquaredCdf
from pyimsl.stat.writeMatrix import writeMatrix

x = [[3, 2, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 1, 3, 2,
      2, 1, 2, 1, 3, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 1, 2,
      1, 1, 1, 3, 1, 3, 2, 3, 3, 1],
     [25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
      52.6796, 44.28342, 40.63523, 51.76094, 26.30368, 20.70230,
      38.74273, 19.47333, 26.42211, 37.05986, 51.67043, 42.40156,
      33.90027, 35.43282, 44.30369, 46.72387, 46.99262, 36.05923,
      36.83197, 61.66257, 25.67714, 39.08567, 48.84341, 39.34391,
      24.73522, 50.55251, 31.34263, 27.15795, 31.72685, 25.00408,
      26.35457, 38.12343, 49.9403, 42.45779, 38.80948, 43.22799,
      41.87624, 48.0782, 43.23673, 39.41294, 23.93346,
      42.8413, 30.40669, 37.77389],
     [1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1,
      1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2,
      1, 1, 2, 1, 1, 2, 1, 1]]

y = [1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 2, 1, 4, 1, 1, 1, 4, 4, 3, 1, 2,
     3, 3, 4, 2, 3, 4, 1, 2, 4, 3, 4, 4, 1, 3, 4, 4, 2, 3, 4, 2, 2,
     4, 3, 1, 4, 3, 4, 2, 3]

newx = [[2, 2, 1, 3, 3, 3, 2, 3, 3, 3],
        [25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
         52.6796, 44.28342, 40.63523, 51.76094],
        [1, 2, 1, 1, 1, 1, 2, 2, 2, 1]]

newy = [3, 2, 1, 1, 4, 3, 2, 2, 1, 2]

model_info = []
lrstat = []
mspe = []
n_classes = 4
n_observations = 50
n_new_obs = 10
n_independent = 3
n_coefs = 4

coefs = logisticRegression(n_independent,
                           n_classes, x, y,
                           groups=True,
                           columnWise=True,
                           lrstat=lrstat,
                           nextResults=model_info)

yhat = logisticRegPredict(n_independent,
                          n_classes, coefs, newx,
                          y=newy,
                          groups=True,
                          columnWise=True,
                          confidence=95.0,
                          model=model_info[0],
                          prederr=mspe)

dof = n_coefs * (n_classes - 1) - (n_classes - 1)
model_pval = 1.0 - chiSquaredCdf(lrstat[0], dof)

print("Model Fit Summary:")
print("Log-likelihood: %5.2f " % model_info[0].loglike)
print("LR test statistic: %5.2f " % lrstat[0])
print("Degrees of freedom: ", dof)
print("P-value: %5.4f " % model_pval)
print("")
print("Prediction Summary:")
print("Mean squared prediction error: %4.2f" % mspe[0])
print("")
print("Obs\tClass   Estimate Lower  Upper")
for i in range(0, n_new_obs):
    for j in range(0, n_classes):
        print("%d\t %d\t %5.2f  %5.2f  %5.2f" %
              (i + 1, j + 1, yhat[j, 0, i], yhat[j, 1, i], yhat[j, 2, i]))

Output

Model Fit Summary:
Log-likelihood: -58.58 
LR test statistic: 16.37 
Degrees of freedom:  9
P-value: 0.0595 

Prediction Summary:
Mean squared prediction error: 0.21

Obs	Class   Estimate Lower  Upper
1	 1	  0.26   0.14   0.35
1	 2	  0.14   0.06   0.20
1	 3	  0.31   0.18   0.36
1	 4	  0.29   0.10   0.62
2	 1	  0.04   0.01   0.14
2	 2	  0.27   0.11   0.39
2	 3	  0.12   0.04   0.25
2	 4	  0.57   0.22   0.85
3	 1	  0.23   0.07   0.38
3	 2	  0.13   0.04   0.20
3	 3	  0.28   0.12   0.34
3	 4	  0.36   0.08   0.77
4	 1	  0.06   0.02   0.14
4	 2	  0.16   0.07   0.24
4	 3	  0.49   0.28   0.54
4	 4	  0.29   0.08   0.63
5	 1	  0.34   0.17   0.41
5	 2	  0.13   0.06   0.19
5	 3	  0.30   0.17   0.34
5	 4	  0.22   0.05   0.60
6	 1	  0.03   0.00   0.09
6	 2	  0.16   0.06   0.24
6	 3	  0.53   0.27   0.60
6	 4	  0.29   0.07   0.67
7	 1	  0.04   0.01   0.13
7	 2	  0.27   0.10   0.40
7	 3	  0.13   0.04   0.26
7	 4	  0.57   0.21   0.86
8	 1	  0.14   0.04   0.26
8	 2	  0.29   0.12   0.37
8	 3	  0.12   0.04   0.21
8	 4	  0.46   0.15   0.80
9	 1	  0.21   0.08   0.33
9	 2	  0.27   0.12   0.35
9	 3	  0.10   0.03   0.19
9	 4	  0.42   0.14   0.77
10	 1	  0.01   0.00   0.05
10	 2	  0.15   0.04   0.24
10	 3	  0.57   0.23   0.67
10	 4	  0.28   0.05   0.73

Warning Errors

IMSLS_NO_ACTUALS The average squared prediction error cannot be calculated because no actual “y” values are given.

Fatal Errors

IMSLS_OVERFLOW The linear predictor = # is too large and will lead to overflow when exponentiated.