logisticRegPredict¶
Predict a binomial or multinomial outcome given an estimated model and new values of the independent variables.
Synopsis¶
logisticRegPredict (nIndependent, nClasses, coefs, x)
Required Arguments¶
- int nIndependent (Input): The number of independent variables.
- int nClasses (Input): The number of discrete outcomes, or classes.
- float coefs[] (Input): Array of length nCoefficients × nClasses containing the coefficient estimates of the logistic regression model. nCoefficients is the number of coefficients in the model.
- float x[] (Input): Array of length nObservations × nIndependent containing the values of the independent variables.
Return Value¶
An array containing the predicted responses. The predicted value is the
predicted number of outcomes in each class for each new observation provided
in x. If frequencies[i] = 1 for all observations, then the return
value is equivalent to the predicted probabilities. If the option
confidence is specified, the length of the return array is
(nObservations × nClasses × 3) and the array includes the lower and
upper prediction limits. Otherwise, the array is of length
(nObservations × nClasses). Note that if the data is column-oriented
(see columnWise), the return value will also be column‑oriented.
Optional Arguments¶
- y, float[] (Input): Array containing the actual responses corresponding to the independent variables. If present, the expected length for y is nObservations × nClasses unless one of groups or groupCounts is also present. y is required when prederr is requested. Default: The function expects that y is not given.
- groupCounts (Input) or groups (Input): These optional arguments specify alternative formats of the input array y. If groupCounts is present, y is of length nObservations × (nClasses - 1) and contains counts for all but one of the classes for each observation. The missing class is treated as the reference class. If groupCounts is present and any y[i] > 1, frequencies is required. If groups is present, the input array y is of length nObservations and y[i] contains the group number to which the observation belongs. In this case, frequencies[i] is set to 1 for all observations. Default: Unless one of these arguments is present, the function expects that y is nObservations × nClasses and contains counts for all the classes.
- columnWise (Input): If present, the input arrays are column-oriented. That is, contiguous elements in x are values of the same independent variable, or column, except at multiples of nObservations. Default: Input arrays are row-oriented.
- frequencies, int[] (Input): Array of length nObservations containing the number of replications or trials for each of the observations. This argument is required if groupCounts is present and any y[i] > 1. Default: frequencies[i] = 1.
- referenceClass, int (Input): Number specifying which class or outcome category to use as the reference class. The purpose of the reference class is explained in the Description section. Default: referenceClass = nClasses.
- noIntercept (Input): If present, the model will not include an intercept term. Default: The intercept term is included.
- xIndices, xin (Input): An array of length nXin providing the variable indices of x that correspond to the independent variables to include in the logistic regression model. Default: All nIndependent variables are included.
- xInteractions, xinteract (Input): An array of length nXinteract × 2 providing pairs of variable indices of x that define the interaction terms in the model. Adjacent indices should be unique. Default: No interaction terms are included.
- confidence, float (Input): The confidence level to use in the calculation of the prediction intervals. If this argument is present and valid (0 < confidence < 100), confidence% prediction intervals are provided for each predicted value. Default: Prediction intervals are not provided.
- model, structure (Input): A structure containing information about the logistic regression fit. See logisticRegression. Required when confidence is present. Default: Not needed if confidence is not present.
- prederr (Output): The mean squared prediction error when y is present.
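As an illustration of the columnWise layout described above (a sketch with made-up data, not part of the pyimsl API): with columnWise, contiguous elements of the flattened x hold one variable's values across all observations, rather than one observation's values across all variables.

```python
import numpy as np

# Illustrative sketch: the same 3x2 data matrix (3 observations,
# 2 independent variables) flattened in the two orientations that
# the x argument may use.
x_matrix = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])

# Row-oriented (the default): observation by observation.
row_wise = x_matrix.flatten(order="C")     # [1, 10, 2, 20, 3, 30]
# Column-oriented (columnWise): variable by variable.
column_wise = x_matrix.flatten(order="F")  # [1, 2, 3, 10, 20, 30]

print(row_wise)
print(column_wise)
```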
Description¶
Function logisticRegPredict calculates the predicted outcomes for a
binomial or multinomial response variable given an estimated logistic
regression model and new observations of the independent variables.
For a binary response y, the objective is to estimate the conditional probability of success, \(\pi_1(x)=\Pr\left[ y=1 | x \right]\), where \(x=\left( x_1,x_2,\ldots,x_p \right)'\) is a realization of p independent variables. In particular, the estimated probability of success is

\[\hat{\pi}_1\left( x_i \right)=\frac{\exp\left( \hat{\eta}_{i1} \right)}{1+\exp\left( \hat{\eta}_{i1} \right)}\]

where

\[\hat{\eta}_{i1}=\hat{\beta}_{01}+x_i^T \hat{\beta}_1\]

and \(\hat{\beta}_{01},\hat{\beta}_1=\left( \hat{\beta}_{11},\hat{\beta}_{12},\ldots,\hat{\beta}_{1p} \right)'\) are the coefficient estimates. Then \(\hat{y}=n_i \hat{\pi}_1 \left( x_i \right)\), where \(n_i\) is the number of trials for observation i. That is, \(\hat{y}\) is the expected value of the response under the estimated model given the values of the independent variables.
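The binary case can be checked by hand. The sketch below uses the rounded coefficient estimates printed in Example 1 (the unrounded estimates differ slightly, so the result is approximate) to reproduce the expected mortality at a new log-dose of 1.66 with 16 exposed beetles.

```python
import math

# Rounded coefficient estimates from Example 1 (beetle mortality data).
beta0, beta1 = -60.76, 34.30
x_new, n_i = 1.66, 16          # new log-dose and number of trials

eta = beta0 + beta1 * x_new                      # estimated logit
pi_hat = math.exp(eta) / (1.0 + math.exp(eta))   # estimated P[y = 1 | x]
y_hat = n_i * pi_hat                             # expected number of deaths

print(round(y_hat, 2))  # close to the 0.34 reported in Example 1
```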
Similarly, for a multinomial response, with class K the reference class,

\[\hat{\pi}_k\left( x_i \right)=\frac{\exp\left( \hat{\eta}_{ik} \right)}{1+\sum_{l=1}^{K-1} \exp\left( \hat{\eta}_{il} \right)}, \quad k=1,\ldots,K-1\]

Then

\[\hat{\pi}_K\left( x_i \right)=1-\sum_{k=1}^{K-1} \hat{\pi}_k\left( x_i \right)\]

and \(\hat{y}_k=n_i \hat{\pi}_k \left( x_i \right)\). If the actual responses are given, the mean squared prediction error is

\[\mathrm{MSPE}=\frac{1}{NK}\sum_{k=1}^{K} \sum_{i=1}^{N} \left( \hat{y}_{ik}-y_{ik} \right)^2\]

where N denotes the number of new observations.
If requested, \(100(1-\alpha)\%\) prediction intervals are provided for the predicted values by first finding the prediction standard errors, \(\hat{\sigma}_{ik}\), of the logits, \(\hat{\eta}_{ik}=\hat{\beta}_{0k}+x_i^T \hat{\beta}_k\), and then evaluating

\[\frac{\exp\left( \hat{\eta}_{ik} \pm z_{\alpha/2} \hat{\sigma}_{ik} \right)}{1+\exp\left( \hat{\eta}_{ik} \pm z_{\alpha/2} \hat{\sigma}_{ik} \right)}\]

to obtain the upper and lower limits for \(\hat{\pi}_k \left( x_i \right)\), where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution. Note that the properties of the prediction intervals are valid only when the new observations are inside the range of the original data used to fit the model. Generally, the model should not be used to extrapolate outside the range of the original data. See Hosmer and Lemeshow (2000) for further details.
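The interval construction can be sketched with hypothetical values (the logit and its standard error below are illustrative, not produced by the function): the logit limits are transformed back to the probability scale by the logistic function.

```python
import math
from statistics import NormalDist

# Hypothetical estimated logit and its prediction standard error.
eta_hat = 0.5
se_eta = 0.3

# Upper alpha/2 quantile of the standard normal, alpha = 0.05 (95% level).
z = NormalDist().inv_cdf(0.975)

def expit(t):
    """Logistic transform mapping a logit back to a probability."""
    return math.exp(t) / (1.0 + math.exp(t))

# Transform the logit limits to probability limits.
lower = expit(eta_hat - z * se_eta)
upper = expit(eta_hat + z * se_eta)

print(lower, expit(eta_hat), upper)
```

Because the logistic transform is monotone, the interval endpoints on the logit scale map directly to endpoints on the probability scale.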
Examples¶
Example 1¶
The model fit to the beetle mortality data of Prentice (1976) is used to
predict the expected mortality at three new doses. For the original data,
see Example 1 in logisticRegression.
| Log Dosage | Number of Beetles Exposed | Number of Deaths |
|---|---|---|
| 1.66 | 16 | ?? |
| 1.87 | 22 | ?? |
| 1.71 | 11 | ?? |
from numpy import *
from pyimsl.stat.logisticRegPredict import logisticRegPredict
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.writeMatrix import writeMatrix
y1 = [6, 13, 18, 28, 52, 53, 61, 60]
x1 = [1.69, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883]
x2 = [1.66, 1.87, 1.71]
freqs1 = [59, 60, 62, 56, 63, 59, 62, 60]
freqs2 = [16, 22, 11]
n_classes = 2
n_observations = 8
n_independent = 1
n_coefs = 2
n_new_observations = 3
coefs = logisticRegression(n_independent, n_classes, x1, y1,
                           groupCounts=True,
                           frequencies=freqs1)
rowLabels = ['1', '2']
writeMatrix("Coefficient Estimates", coefs.reshape(4)[0:2],
            transpose=True,
            noColLabels=True,
            rowLabels=rowLabels)
yhat = logisticRegPredict(n_independent, n_classes, coefs, x2,
                          frequencies=freqs2)
tmp = []
colLabels = ' Dose N Expected Deaths'
yhatlst = [yhat[0, 0], yhat[1, 0], yhat[2, 0]]
for i in range(n_new_observations):
    tmp.extend([[x2[i], freqs2[i], yhatlst[i]]])
writeMatrix(colLabels, tmp, noRowLabels=True, noColLabels=True)
Output¶
Coefficient Estimates
1 -60.76
2 34.30
Dose N Expected Deaths
1.66 16.00 0.34
1.87 22.00 21.28
1.71 11.00 1.19
Example 2¶
A logistic regression model is fit to artificial (noisy) data with 4 classes and 3 independent variables and used to predict class probabilities at 10 new values of the independent variables. Also shown are the mean squared prediction error and upper and lower limits of the 95% prediction interval for each predicted value.
from __future__ import print_function
from numpy import *
from pyimsl.stat.logisticRegPredict import logisticRegPredict
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.chiSquaredCdf import chiSquaredCdf
from pyimsl.stat.writeMatrix import writeMatrix
x = [[3, 2, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 1, 3, 2,
2, 1, 2, 1, 3, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 1, 2,
1, 1, 1, 3, 1, 3, 2, 3, 3, 1],
[25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
52.6796, 44.28342, 40.63523, 51.76094, 26.30368, 20.70230,
38.74273, 19.47333, 26.42211, 37.05986, 51.67043, 42.40156,
33.90027, 35.43282, 44.30369, 46.72387, 46.99262, 36.05923,
36.83197, 61.66257, 25.67714, 39.08567, 48.84341, 39.34391,
24.73522, 50.55251, 31.34263, 27.15795, 31.72685, 25.00408,
26.35457, 38.12343, 49.9403, 42.45779, 38.80948, 43.22799,
41.87624, 48.0782, 43.23673, 39.41294, 23.93346,
42.8413, 30.40669, 37.77389],
[1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1,
1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2,
1, 1, 2, 1, 1, 2, 1, 1]]
y = [1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 2, 1, 4, 1, 1, 1, 4, 4, 3, 1, 2,
3, 3, 4, 2, 3, 4, 1, 2, 4, 3, 4, 4, 1, 3, 4, 4, 2, 3, 4, 2, 2,
4, 3, 1, 4, 3, 4, 2, 3]
newx = [[2, 2, 1, 3, 3, 3, 2, 3, 3, 3],
[25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
52.6796, 44.28342, 40.63523, 51.76094],
[1, 2, 1, 1, 1, 1, 2, 2, 2, 1]]
newy = [3, 2, 1, 1, 4, 3, 2, 2, 1, 2]
model_info = []
lrstat = []
mspe = []
n_classes = 4
n_observations = 50
n_new_obs = 10
n_independent = 3
n_coefs = 4
coefs = logisticRegression(n_independent, n_classes, x, y,
                           groups=True,
                           columnWise=True,
                           lrstat=lrstat,
                           nextResults=model_info)
yhat = logisticRegPredict(n_independent, n_classes, coefs, newx,
                          y=newy,
                          groups=True,
                          columnWise=True,
                          confidence=95.0,
                          model=model_info[0],
                          prederr=mspe)
dof = n_coefs * (n_classes - 1) - (n_classes - 1)
model_pval = 1.0 - chiSquaredCdf(lrstat[0], dof)
print("Model Fit Summary:")
print("Log-likelihood: %5.2f " % model_info[0].loglike)
print("LR test statistic: %5.2f " % lrstat[0])
print("Degrees of freedom: ", dof)
print("P-value: %5.4f " % model_pval)
print("")
print("Prediction Summary:")
print("Mean squared prediction error: %4.2f" % mspe[0])
print("")
print("Obs\tClass Estimate Lower Upper")
for i in range(n_new_obs):
    for j in range(n_classes):
        print("%d\t %d\t %5.2f %5.2f %5.2f" %
              (i + 1, j + 1, yhat[j, 0, i], yhat[j, 1, i], yhat[j, 2, i]))
Output¶
Model Fit Summary:
Log-likelihood: -58.58
LR test statistic: 16.37
Degrees of freedom: 9
P-value: 0.0595
Prediction Summary:
Mean squared prediction error: 0.21
Obs Class Estimate Lower Upper
1 1 0.26 0.14 0.35
1 2 0.14 0.06 0.20
1 3 0.31 0.18 0.36
1 4 0.29 0.10 0.62
2 1 0.04 0.01 0.14
2 2 0.27 0.11 0.39
2 3 0.12 0.04 0.25
2 4 0.57 0.22 0.85
3 1 0.23 0.07 0.38
3 2 0.13 0.04 0.20
3 3 0.28 0.12 0.34
3 4 0.36 0.08 0.77
4 1 0.06 0.02 0.14
4 2 0.16 0.07 0.24
4 3 0.49 0.28 0.54
4 4 0.29 0.08 0.63
5 1 0.34 0.17 0.41
5 2 0.13 0.06 0.19
5 3 0.30 0.17 0.34
5 4 0.22 0.05 0.60
6 1 0.03 0.00 0.09
6 2 0.16 0.06 0.24
6 3 0.53 0.27 0.60
6 4 0.29 0.07 0.67
7 1 0.04 0.01 0.13
7 2 0.27 0.10 0.40
7 3 0.13 0.04 0.26
7 4 0.57 0.21 0.86
8 1 0.14 0.04 0.26
8 2 0.29 0.12 0.37
8 3 0.12 0.04 0.21
8 4 0.46 0.15 0.80
9 1 0.21 0.08 0.33
9 2 0.27 0.12 0.35
9 3 0.10 0.03 0.19
9 4 0.42 0.14 0.77
10 1 0.01 0.00 0.05
10 2 0.15 0.04 0.24
10 3 0.57 0.23 0.67
10 4 0.28 0.05 0.73
Warning Errors¶
| IMSLS_NO_ACTUALS | The average squared prediction error cannot be calculated because no actual "y" values are given. |
Fatal Errors¶
| IMSLS_OVERFLOW | The linear predictor = # is too large and will lead to overflow when exponentiated. |