logisticRegression¶
Fit a binomial or multinomial logistic regression model using iteratively re-weighted least squares.
Synopsis¶
logisticRegression (nIndependent, nClasses, x, y)
Required Arguments¶
- int nIndependent (Input)
  The number of independent variables.
- int nClasses (Input)
  The number of discrete outcomes, or classes.
- float x[[]] (Input)
  An array of length nObservations × nIndependent containing the values of the independent variables corresponding to the responses in y.
- float y[] (Input)
  An array of length nObservations × nClasses containing the binomial (nClasses = 2) or multinomial (nClasses > 2) counts per class. In an alternate format, y is an array of length nObservations × (nClasses − 1) containing the counts for all but one class; the missing class is treated as the reference class. The optional argument groupCounts specifies this format for y. In another alternate format, y is an array of length nObservations containing the class ids. See optional argument groups.
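For reference, here is a minimal sketch of the default y format (the counts below are hypothetical; the alternate formats are illustrated in the examples later in this page):

from pyimsl.stat.logisticRegression import logisticRegression

# Hypothetical data: 3 observations, 1 independent variable, 2 classes.
x = [1.0, 2.0, 3.0]
# Default y format: nObservations x nClasses counts, one row per observation.
y = [4, 6,   # observation 1: 4 counts in class 1, 6 in class 2
     7, 3,   # observation 2
     9, 1]   # observation 3
coefs = logisticRegression(1, 2, x, y)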
Return Value¶
An array of length nCoefficients × nClasses containing the estimated coefficients. The function fits a full model, where nCoefficients = 1 + nIndependent. The optional arguments noIntercept, xIndices, and xInteractions may be used to specify different models. Note that the last column (column nClasses) represents the reference class and is set to all zeros.
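To make the layout concrete, the sketch below uses the estimates reported in Example 1 (nCoefficients = 2, nClasses = 2); the flattened ordering is assumed from that example's usage, where the first nCoefficients entries hold the non-reference-class estimates:

import numpy as np

# Flattened coefficient array, as produced in Example 1 below.
coefs = np.array([-60.76, 34.30, 0.0, 0.0])

estimates = coefs[0:2]   # intercept and slope for class 1
reference = coefs[2:4]   # reference class column, always all zeros
print(estimates, reference)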
Optional Arguments¶
groupCounts (Input)
or
groups (Input)
These optional arguments specify the format of the input array y. If groupCounts is present, y is of length nObservations × (nClasses − 1) and contains counts for all but one of the classes for each observation; the missing class is treated as the reference class. If groups is present, the input array y is of length nObservations, and y[i] contains the group or class number to which the observation belongs. In this case, frequencies[i] is set to 1 for all observations.
Default: y is nObservations × nClasses and contains counts for all the classes.

columnWise (Input)
If present, the input arrays are column-oriented. That is, contiguous elements in x are values of the same independent variable, or column, except at multiples of nObservations.
Default: Input arrays are row-oriented.

frequencies, int[] (Input)
An array of length nObservations containing the number of replications or trials for each of the observations. This argument is required if groupCounts is specified and any element of y > 1.
Default: frequencies[i] = 1.

referenceClass, int (Input)
Number specifying which class or outcome category to use as the reference class. See the Description section for details. Note that the last column of coefficients always represents the reference class, so when referenceClass < nClasses, columns referenceClass and nClasses are swapped in the output coefficients; that is, the coefficients for class nClasses are returned in column referenceClass of coefficients. For example, if referenceClass = 1 and nClasses = 3, the first column of coefficients contains the coefficients for class 3 (nClasses), the second column contains the coefficients for class 2, and the third column contains all zeros for the reference class.
Default: referenceClass = nClasses.

noIntercept (Input)
If present, the model will not include an intercept term.
Default: The intercept term is included.

xIndices (Input)
An array of length nXin providing the column indices of x that correspond to the independent variables the user wishes to be included in the logistic regression model. For example, suppose there are five independent variables x0, x1, …, x4. To fit a model that includes only x2 and x3, set nXin = 2, xin[0] = 2, and xin[1] = 3.
Default: All nIndependent variables are included.

xInteractions (Input)
An array of length nXinteract × 2 providing pairs of column indices of x that define the interaction terms in the model. Adjacent indices should be unique. For example, suppose there are two independent variables x0 and x1. To fit a model that includes their interaction term, x0x1, set nXinteract = 1, xinteract[0] = 0, and xinteract[1] = 1. (A short call sketch using xIndices and xInteractions follows this argument list.)
Default: No interaction terms are included.

tolerance, float (Input)
Convergence error criteria. Iteration completes when the normed difference between successive estimates is less than tolerance or maxIter iterations are reached.
Default: tolerance = 100.0 × machine(4).

maxIter, int (Input)
The maximum number of iterations.
Default: maxIter = 20.

initInput, int (Input)
initInput must be 0 or 1. If initInput = 1, initial values for the coefficient estimates are provided in the user array coefficients. If initInput = 0, initial values are computed by the function.
Default: initInput = 0.

prevResults, structure (Input)
A structure containing information about a previous logistic regression fit. The model is combined with the fit to new data or to nextResults, if provided.

nextResults, structure (Input/Output)
A structure. If present and None, the structure is internally allocated and on output contains the model information. If present and not None, its contents are combined with the fit to new data or to prevResults, if provided. The combined results are returned in nextResults.

coefficients, float[] (Input/Output)
Storage for the coefficient array of length nCoefficients × nClasses is provided by the user. When initInput = 1, coefficients should contain the desired initial values of the estimates.

lrstat (Output)
The value of the likelihood ratio test statistic.
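As a sketch of the model-specification arguments above (the data here are hypothetical and the fit is for illustration only), a model restricted to variables x2 and x3 plus their interaction could be requested as follows:

import numpy as np
from pyimsl.stat.logisticRegression import logisticRegression

# Hypothetical data: 6 observations, 5 independent variables, 2 classes.
rng = np.random.default_rng(0)
x = rng.random((6, 5))
y = [1, 2, 1, 1, 2, 1]        # class ids, using the groups format

# Include only columns 2 and 3 of x, plus the interaction term x2*x3.
coefs = logisticRegression(5, 2, x, y,
                           groups=True,
                           xIndices=[2, 3],
                           xInteractions=[2, 3])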
Description¶
Function logisticRegression
fits a logistic regression model for discrete
dependent variables with two or more mutually exclusive outcomes or classes.
For a binary response y, the objective is to model the conditional
probability of success, \(\pi_1(x)=Pr[y=1| x]\), where
\(x=(x_1,x_2,\ldots,x_p)'\) is a realization of p independent variables. Logistic regression models the conditional probability, \(\pi_1(x)\), using the cdf of the logistic distribution. In particular,

\[\pi_1(x)=\frac{\exp(\eta_1)}{1+\exp(\eta_1)}, \qquad \eta_1=\beta_{01}+x'\beta_1,\]

where \(\beta_{01}\) and \(\beta_1=(\beta_{11},\beta_{12},\ldots,\beta_{1p})'\) are unknown coefficients that are to be estimated.

Solving for the linear component \(\eta_1\) results in the log-odds or logit transformation of \(\pi_1(x)\):

\[\eta_1=\log\left(\frac{\pi_1(x)}{1-\pi_1(x)}\right)\]

Given a set of N observations \((y_i,x_i)\), where \(y_i\) follows a binomial \((n,\pi)\) distribution with parameters \(n=1\) and \(\pi=\pi_1(x_i)\), the likelihood and log-likelihood are, respectively,

\[L=\prod_{i=1}^{N}\pi_1(x_i)^{y_i}\left(1-\pi_1(x_i)\right)^{1-y_i}\]

\[l=\sum_{i=1}^{N}\left[y_i\log\pi_1(x_i)+(1-y_i)\log\left(1-\pi_1(x_i)\right)\right]\]

The log-likelihood in terms of the parameters, \(\{\beta_{01},\beta_1\}\), is therefore

\[l(\beta_{01},\beta_1)=\sum_{i=1}^{N}\left[y_i\eta_{i1}-\log\left(1+\exp(\eta_{i1})\right)\right]\]

where

\[\eta_{i1}=\beta_{01}+x_i'\beta_1\]

With a binary outcome, only one probability needs to be modeled. The second probability can be obtained from the constraint \(\pi_1(x)+\pi_2(x)=1\). If each \(y_i\) is the number of successes in \(n_i\) independent trials, the log-likelihood becomes

\[l=\sum_{i=1}^{N}\left[y_i\log\pi_1(x_i)+(n_i-y_i)\log\left(1-\pi_1(x_i)\right)\right]\]

or

\[l(\beta_{01},\beta_1)=\sum_{i=1}^{N}\left[y_i\eta_{i1}-n_i\log\left(1+\exp(\eta_{i1})\right)\right]\]
See optional argument frequencies
to set frequencies \(n_i\) > 1.
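To make the binomial model concrete, the fitted probability at any x follows directly from the logistic cdf. A small sketch using the estimates reported in Example 1 below (beetle mortality data):

import numpy as np

# Estimates from Example 1: intercept and log-dosage slope.
b0, b1 = -60.76, 34.30

def pi1(x):
    # Fitted probability of death at log dosage x.
    eta = b0 + b1 * x
    return np.exp(eta) / (1.0 + np.exp(eta))

print(pi1(1.690))   # low dosage: roughly 0.06
print(pi1(1.883))   # high dosage: roughly 0.98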
To test the significance of the model, the log-likelihood of the fitted model is compared to that of an intercept-only model. In particular, \(G=-2(l(\beta_{01})-l(\beta_{01},\beta_1))\) is a likelihood-ratio test statistic, and under the null hypothesis, \(H_0:\beta_{11}=\beta_{12}=\ldots=\beta_{1p}=0\), G is distributed as chi-squared with p degrees of freedom. A significant result suggests that at least one parameter in the model is non-zero. See Hosmer and Lemeshow (2000) for further discussion.
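Example 2 below obtains G through the lrstat optional argument; given the two log-likelihoods directly, the test could be sketched as follows (the numeric values here are hypothetical):

from pyimsl.stat.chiSquaredCdf import chiSquaredCdf

l0 = -70.00               # hypothetical log-likelihood, intercept-only model
l1 = -62.92               # hypothetical log-likelihood, fitted model
p = 3                     # number of slope parameters under test
G = -2.0 * (l0 - l1)      # likelihood-ratio test statistic
pvalue = 1.0 - chiSquaredCdf(G, p)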
In the multinomial case, the response vector is \(y_i=(y_{i1},y_{i2},\ldots,y_{iK})'\), where \(y_{ik}=1\) when the i-th observation belongs to class k and \(y_{ik}=0\) otherwise. Furthermore, because the outcomes are mutually exclusive,

\[\sum_{k=1}^{K} y_{ik}=1\]

and \(\pi_1(x)+\pi_2(x)+\cdots+\pi_K(x)=1\). The last class K serves as the baseline or reference class in the sense that it is not modeled directly but found from

\[\pi_K(x)=1-\sum_{k=1}^{K-1}\pi_k(x)\]

If there are multiple trials, \(n_i>1\), then the constraint on the responses is

\[\sum_{k=1}^{K} y_{ik}=n_i\]

The log-likelihood in the multinomial case becomes

\[l=\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log\pi_k(x_i)\]

or

\[l=\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\left(\eta_{ik}-\log\left(\sum_{l=1}^{K}\exp(\eta_{il})\right)\right)\]

The constraint

\[\sum_{k=1}^{K}\pi_k(x_i)=1\]

is handled by setting \(\eta_K=0\) for the K-th class, and then the log-likelihood is

\[l=\sum_{i=1}^{N}\left[\sum_{k=1}^{K-1}y_{ik}\eta_{ik}-n_i\log\left(1+\sum_{l=1}^{K-1}\exp(\eta_{il})\right)\right]\]

Note that for the multinomial case, the log-odds (or logit) is

\[\log\left(\frac{\pi_l(x)}{\pi_K(x)}\right)=\eta_l=\beta_{0l}+x'\beta_l, \qquad l=1,\ldots,K-1\]

Note that each of the logits involves the odds ratio of being in class l versus class K, the reference class. Maximum likelihood estimates can be obtained by solving the score equation for each parameter:

\[\frac{\partial l}{\partial\beta_{jk}}=\sum_{i=1}^{N}x_{ij}\left(y_{ik}-n_i\pi_k(x_i)\right)=0\]
To solve the score equations, the function employs a method known as iteratively re-weighted least squares, or IRLS. In this case IRLS is equivalent to the Newton-Raphson algorithm (Hastie et al., 2009; Thisted, 1988).
Consider the full vector of parameters

\[\beta=\left(\beta_{01},\beta_{11},\ldots,\beta_{p1},\ldots,\beta_{0,K-1},\ldots,\beta_{p,K-1}\right)';\]

the Newton-Raphson iteration is

\[\beta^{n+1}=\beta^{n}-H^{-1}\left(\beta^{n}\right)G\left(\beta^{n}\right),\]

where H denotes the Hessian matrix, i.e., the matrix of second partial derivatives defined by

\[H_{jk}=\frac{\partial^2 l}{\partial\beta_j\,\partial\beta_k},\]

and G denotes the gradient vector, the vector of first partial derivatives,

\[G_j=\frac{\partial l}{\partial\beta_j}.\]
Both the gradient and the Hessian are evaluated at the most recent estimate of the parameters, \(\beta^n\). The iteration continues until convergence or until maximum iterations are reached. Following the theory of maximum likelihood estimation (Kendall and Stuart, 1979), standard errors are obtained from Fisher’s information matrix \((-H)^{-1}\) evaluated at the final estimates.
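For the binary case, the update is easy to sketch in NumPy. The following is a minimal illustration of the Newton-Raphson (IRLS) iteration, not the function's internal implementation:

import numpy as np

def newton_logistic(X, y, max_iter=20, tol=1e-8):
    # X: (N, p+1) design matrix with an intercept column; y: (N,) 0/1 responses.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        G = X.T @ (y - pi)                    # gradient of the log-likelihood
        H = -(X.T * (pi * (1.0 - pi))) @ X    # Hessian, -X' W X
        step = np.linalg.solve(H, G)
        beta = beta - step                    # beta^(n+1) = beta^n - H^(-1) G
        if np.linalg.norm(step) < tol:
            break
    return beta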
When the nextResults option is specified, the function combines estimates of the same model from separate fits using the method presented in Xi, Lin, and Chen (2008). To illustrate, let \(\beta_1\) and \(\beta_2\) be the MLEs from separate fits to two different sets of data, and let \(H_1\) and \(H_2\) be the associated Hessian matrices. Then the combined estimate,

\[\beta=\left(H_1+H_2\right)^{-1}\left(H_1\beta_1+H_2\beta_2\right),\]

approximates the MLE of the combined data set. The model structure, nextResults, contains the combined estimates as well as other elements. See Table 1: Data Structure below.
Table 1: Data Structure

| Parameter | Data Type | Description |
|---|---|---|
| nObs | int | Total number of observations. If the model structure has been updated three times, first with 100 observations, next with 50, and third with 50, then nObs = 200. |
| n_updates | int | Total number of times the model structure has been updated. In the above scenario, n_updates = 3. |
| n_coefs | int | Number of coefficients in the model. This parameter must be the same for each model update. |
| coefs | float[] | An array of length n_coefs × nClasses containing the coefficients. |
| meany | float[] | An array of length nClasses containing the overall means for each class variable. |
| stderrs | float[] | An array of length n_coefs × (nClasses − 1) containing the estimated standard errors for the estimated coefficients. |
| grad | float[] | An array of length n_coefs × (nClasses − 1) containing the estimated gradient at the coefficient estimates. |
| hess | float[] | An array of length n_coefs × (nClasses − 1) × n_coefs × (nClasses − 1) containing the estimated Hessian matrix at the coefficient estimates. |
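Given the coefficient vectors and Hessians from two model structures, the combination formula above can be sketched directly in NumPy (the vector and matrix shapes are assumptions here; the structure members are described in Table 1):

import numpy as np

def combine_fits(beta1, H1, beta2, H2):
    # Combined estimate per Xi, Lin, and Chen (2008):
    # beta = (H1 + H2)^(-1) (H1 beta1 + H2 beta2)
    return np.linalg.solve(H1 + H2, H1 @ beta1 + H2 @ beta2)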
Remarks¶
Iteration stops when the estimates converge within tolerance, when maximum iterations are reached, or when the gradient is within tolerance of 0, whichever occurs first. When the gradient converges before the coefficient estimates converge, a condition in the data known as complete or quasi-complete separation may be present. Separation in the data means that one or more independent variables perfectly predict the response. When detected, the function stops the iteration, issues a warning, and returns the current values of the model estimates. Some of the coefficient estimates and standard errors may not be reliable. Furthermore, overflow issues may occur before the gradient converges. In such cases the program issues a fatal error.
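For intuition, complete separation is simple to construct. In the hypothetical data below, x > 0.5 perfectly predicts class 2, so the maximum likelihood estimate does not exist and the slope estimate grows without bound as the iteration proceeds; fitting such data would trigger the separation warning described under Warning Errors:

# Hypothetical, completely separated data: x > 0.5 perfectly predicts y = 2.
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, 1, 1, 2, 2, 2, 2]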
Examples¶
Example 1¶
The first example is from Prentice (1976) and involves the mortality of beetles after five hours exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths which result (y):
Log Dosage | Number of Beetles Exposed | Number of Deaths |
---|---|---|
1.690 | 59 | 6 |
1.724 | 60 | 13 |
1.755 | 62 | 18 |
1.784 | 56 | 28 |
1.811 | 63 | 52 |
1.836 | 59 | 53 |
1.861 | 62 | 61 |
1.883 | 60 | 60 |
The number of deaths at each concentration level is the binomial response
(nClasses
= 2) and the log-dosage is the single independent variable.
Note that this example illustrates the groupCounts
format for y
and
the optional argument frequencies
.
from numpy import *
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.writeMatrix import writeMatrix

# Log dosage of carbon disulphide
x1 = [1.69, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883]
# Number of deaths at each dosage level
y1 = [6, 13, 18, 28, 52, 53, 61, 60]
# Number of beetles exposed at each dosage level
freqs = [59, 60, 62, 56, 63, 59, 62, 60]
nclasses = 2
nIndependent = 1
coefs = logisticRegression(nIndependent, nclasses, x1, y1,
                           groupCounts=True,
                           frequencies=freqs)
labels = ['1', '2']
tmp = coefs.reshape(4)
writeMatrix("Coefficient Estimates", tmp[0:2],
transpose=True,
noColLabels=True,
rowLabels=labels)
Output¶
Coefficient Estimates
1 -60.76
2 34.30
Example 2¶
In this example the response is a multinomial random variable with 4 outcome classes. The 3 independent variables represent 2 categorical variables and 1 continuous variable. A subset of 2 independent variables along with the intercept defines the logistic regression model. A test of significance is performed.
from __future__ import print_function
from numpy import *
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.chiSquaredCdf import chiSquaredCdf
from pyimsl.stat.writeMatrix import writeMatrix
n_classes = 4
n_observations = 50
n_independent = 3
n_coefs = 3
model = []
xindices = [0, 1]
lrstat = []
x = array([[3, 2, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 1, 3, 2,
2, 1, 2, 1, 3, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 1, 2,
1, 1, 1, 3, 1, 3, 2, 3, 3, 1],
[25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
52.6796, 44.28342, 40.63523, 51.76094, 26.30368, 20.70230,
38.74273, 19.47333, 26.42211, 37.05986, 51.67043, 42.40156,
33.90027, 35.43282, 44.30369, 46.72387, 46.99262, 36.05923,
36.83197, 61.66257, 25.67714, 39.08567, 48.84341, 39.43391,
24.73522, 50.55251, 31.34263, 27.15795, 31.72685, 25.00408,
26.35457, 38.12343, 49.9403, 42.45779, 38.80948, 43.22799,
41.87624, 48.0782, 43.23673, 39.41294, 23.93346,
42.8413, 30.40669, 37.77389],
[1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1,
1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2,
2, 2, 1, 1, 2, 1, 1, 2, 1, 1]])
y = [1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 2, 1, 4, 1, 1, 1, 4, 4, 3, 1,
2, 3, 3, 4, 2, 3, 4, 1, 2, 4, 3, 4, 4, 1, 3, 4, 4, 2, 3, 4,
2, 2, 4, 3, 1, 4, 3, 4, 2, 3]
coefs = logisticRegression(n_independent, n_classes, x, y,
groups=True,
columnWise=True,
xIndices=xindices,
lrstat=lrstat,
nextResults=model)
dof = n_coefs * (n_classes - 1) - (n_classes - 1)
model_pval = 1.0 - chiSquaredCdf(lrstat[0], dof)
rowLabels = [str(i) for i in range(1, 10)]
print("")
tmp = coefs.reshape(16)
writeMatrix("Coefficients", tmp[0:9],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
print("")
writeMatrix("Std Errs", model[0].stderrs[0:9],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
print("")
print("Log-likelihood: %5.2f" % model[0].loglike)
print("LR test statistic: %5.2f" % lrstat[0])
print("%d deg. freedom, p-value: %5.4f" % (dof, model_pval))
Output¶
Coefficients
1 2.294
2 0.408
3 -0.111
4 -1.159
5 0.245
6 -0.002
7 -0.064
8 0.178
9 -0.017
Std Errs
1 2.259
2 0.548
3 0.051
4 2.122
5 0.500
6 0.044
7 1.862
8 0.442
9 0.039
Log-likelihood: -62.92
LR test statistic: 7.69
6 deg. freedom, p-value: 0.2621
Example 3¶
Example 3 uses the same data as in Example 2 and an additional set of 50 observations using the same data generating process. The model structure includes all 3 independent variables and an intercept, and a single model fit is approximated from two separate model fits. Example 3 also includes a fit on the full data set for comparison purposes.
from __future__ import print_function
from numpy import *
from pyimsl.stat.logisticRegression import logisticRegression
from pyimsl.stat.chiSquaredCdf import chiSquaredCdf
from pyimsl.stat.writeMatrix import writeMatrix
x1 = [[3, 2, 2, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 1, 3, 2, 2,
1, 2, 1, 3, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1,
1, 3, 1, 3, 2, 3, 3, 1],
[25.92869, 51.63245, 25.78432, 39.37948, 24.65058, 45.20084,
52.6796, 44.28342, 40.63523, 51.76094, 26.30368, 20.70230,
38.74273, 19.47333, 26.42211, 37.05986, 51.67043, 42.40156,
33.90027, 35.43282, 44.30369, 46.72387, 46.99262, 36.05923,
36.83197, 61.66257, 25.67714, 39.08567, 48.84341, 39.34391,
24.73522, 50.55251, 31.34263, 27.15795, 31.72685, 25.00408,
26.35457, 38.12343, 49.9403, 42.45779, 38.80948, 43.22799,
41.87624, 48.0782, 43.23673, 39.41294, 23.93346,
42.8413, 30.40669, 37.77389],
[1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1,
1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2,
1, 1, 2, 1, 1, 2, 1, 1]]
x2 = [[1, 1, 3, 3, 2, 3, 3, 3, 2, 1, 1, 1, 1, 3, 3, 2, 2, 3, 3, 2, 3,
2, 1, 3, 3, 2, 2, 3, 3, 2, 1, 2, 1, 2, 3, 3, 1, 1, 2, 2, 3, 1,
1, 2, 2, 1, 1, 2, 3, 1],
[35.66064, 26.68771, 23.11251, 58.14765, 44.95038, 42.45634,
34.97379, 53.54269, 32.57257, 46.91201, 30.93306, 51.63743,
34.67712, 53.84584, 14.97474, 44.4485, 47.10448, 43.96467,
55.55741, 36.63123, 32.35164, 55.75668, 36.83637, 46.7913,
44.24153, 49.94011, 41.91916, 24.78584, 50.79019, 39.97886,
34.42149, 41.93271, 28.59433, 38.47255, 32.11676, 37.19347,
52.89337, 34.64874, 48.61935, 33.99104, 38.32489, 35.53967,
29.59645, 21.14665, 51.11257, 34.20155, 44.40374, 49.67626,
58.35377, 28.03744],
[1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2,
1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2,
1, 1, 2, 1, 1, 2, 1, 1]]
y1 = [1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 2, 1, 4, 1, 1, 1, 4, 4, 3, 1, 2,
3, 3, 4, 2, 3, 4, 1, 2, 4, 3, 4, 4, 1, 3, 4, 4, 2, 3, 4, 2, 2,
4, 3, 1, 4, 3, 4, 2, 3]
y2 = [1, 4, 1, 4, 1, 1, 3, 1, 2, 4, 3, 1, 3, 2, 4, 4, 4, 2, 3, 2, 1,
4, 4, 4, 4, 3, 1, 1, 3, 1, 4, 2, 4, 2, 1, 2, 3, 1, 1, 4, 1, 2,
4, 3, 4, 2, 4, 3, 2, 4]
n_classes = 4
n_observations = 50
n_independent = 3
n_coefs = 4
model1 = []
model12 = []
model3 = []
# first call with x1, y1
coefs = logisticRegression(n_independent,
n_classes, x1, y1,
groups=True,
columnWise=True,
nextResults=model1)
rowLabels = [str(i) for i in range(1, 13)]
writeMatrix("First Model Coefficients:", model1[0].coefs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
print("")
writeMatrix("First Model Standard Errors:", model1[0].stderrs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
# second call with x2,y2
coefs = logisticRegression(n_independent,
n_classes, x2, y2,
groups=True,
columnWise=True,
prevResults=model1[0],
nextResults=model12)
print("")
writeMatrix("Combined Model Coefficients:", model12[0].coefs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
writeMatrix("Combined Model Standard Errors:", model12[0].stderrs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
# combine the two data sets (without mutating x1 and y1)
y3 = y1 + y2
x3 = [x1[0] + x2[0], x1[1] + x2[1], x1[2] + x2[2]]
coefs = logisticRegression(n_independent,
n_classes, x3, y3,
groups=True,
columnWise=True,
nextResults=model3)
writeMatrix("Full Data Model Coefficients:", model3[0].coefs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
writeMatrix("Full Data Standard Errors:", model3[0].stderrs[0:12],
transpose=True,
noColLabels=True,
rowLabels=rowLabels)
Output¶
First Model Coefficients:
1 1.691
2 0.350
3 -0.137
4 1.057
5 -1.254
6 0.242
7 -0.004
8 0.115
9 1.032
10 0.278
11 0.016
12 -1.954
First Model Standard Errors:
1 2.389
2 0.565
3 0.061
4 1.025
5 2.197
6 0.509
7 0.047
8 0.885
9 2.007
10 0.461
11 0.043
12 0.958
Combined Model Coefficients:
1 -1.169
2 0.649
3 -0.038
4 0.608
5 -1.935
6 0.435
7 0.002
8 0.215
9 -0.193
10 0.282
11 0.002
12 -0.630
Combined Model Standard Errors:
1 1.489
2 0.359
3 0.029
4 0.588
5 1.523
6 0.358
7 0.030
8 0.584
9 1.461
10 0.344
11 0.030
12 0.596
Full Data Model Coefficients:
1 -1.009
2 0.640
3 -0.051
4 0.764
5 -2.008
6 0.436
7 0.003
8 0.263
9 -0.413
10 0.299
11 0.004
12 -0.593
Full Data Standard Errors:
1 1.466
2 0.350
3 0.029
4 0.579
5 1.520
6 0.357
7 0.029
8 0.581
9 1.389
10 0.336
11 0.028
12 0.577
Warning Errors¶
| Error code | Message |
|---|---|
| IMSLS_NO_CONV_SEP | Convergence did not occur in # iterations. “tolerance” = #, the error between estimates = #, and the gradient has norm = #. Adjust “tolerance” or “maxIter”, or there may be a separation problem in the data. |
| IMSLS_EMPTY_INT_RESULTS | Intermediate results given to the function are empty and may be expected to be non-empty in this scenario. |
Fatal Errors¶
| Error code | Message |
|---|---|
| IMSLS_NO_CONV_OVERFLOW | The linear predictor = # is too large and will lead to overflow when exponentiated. The algorithm fails to converge. |