Fit a binomial or multinomial logistic regression model using iteratively re-weighted least squares.
Synopsis
#include <imsls.h>
float *imsls_f_logistic_regression (int n_observations, int n_independent, int n_classes, float x[], float y[], ..., 0)
The type double function is imsls_d_logistic_regression.
Required Arguments
int n_observations (Input) The number of observations.
int n_independent (Input) The number of independent variables.
int n_classes (Input) The number of discrete outcomes, or classes.
float x[] (Input) An array of length n_observations × n_independent containing the values of the independent variables corresponding to the responses in y.
float y[] (Input) An array of length n_observations × n_classes containing the binomial (n_classes = 2) or multinomial (n_classes > 2) counts per class. In an alternate format, y is an array of length n_observations × (n_classes - 1) containing the counts for all but one class. The missing class is treated as the reference class. The optional argument IMSLS_GROUP_COUNTS specifies this format for y. In another alternative format, y is an array of length n_observations containing the class IDs. See optional argument IMSLS_GROUPS.
Return Value
Pointer to an array of length n_coefficients × n_classes containing the estimated coefficients. The function fits a full model, where n_coefficients = 1 + n_independent. The optional arguments IMSLS_NO_INTERCEPT, IMSLS_X_INDICES, and IMSLS_X_INTERACTIONS may be used to specify different models. Note that the last column (column n_classes) represents the reference class and is set to all zeros.
Synopsis with Optional Arguments
#include <imsls.h>
float *imsls_f_logistic_regression (int n_observations, int n_independent, int n_classes, float x[], float y[],
IMSLS_GROUP_COUNTS, or
IMSLS_GROUPS,
IMSLS_COLUMN_WISE,
IMSLS_FREQUENCIES, int frequencies[],
IMSLS_REFERENCE_CLASS, int ref_class,
IMSLS_NO_INTERCEPT,
IMSLS_X_INDICES, int n_xin, int xin[],
IMSLS_X_INTERACTIONS, int n_xinteract, int xinteract[],
IMSLS_TOLERANCE, float tolerance,
IMSLS_MAX_ITER, int max_iter,
IMSLS_INIT_INPUT, int init,
IMSLS_PREV_RESULTS, Imsls_f_model *prev_model,
IMSLS_NEXT_RESULTS, Imsls_f_model **next_model,
IMSLS_COEFFICIENTS, float coefficients[],
IMSLS_LRSTAT, float *lrstat,
0)
Optional Arguments
IMSLS_GROUP_COUNTS (Input)
or
IMSLS_GROUPS (Input) These optional arguments specify the format of the input array y. If IMSLS_GROUP_COUNTS is present, y is of length n_observations × (n_classes - 1) and contains counts for all but one of the classes for each observation. The missing class is treated as the reference class.
If IMSLS_GROUPS is present, the input array y is of length n_observations, and y[i] contains the group or class number to which the observation belongs. In this case, frequencies[i] is set to 1 for all observations.
Default: y is of length n_observations × n_classes and contains counts for all the classes.
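The three layouts of y carry the same information when each observation is a single trial. The sketch below converts class numbers (the IMSLS_GROUPS layout) into full counts (the default layout) and then into the IMSLS_GROUP_COUNTS layout. The helper names are hypothetical and not part of the library, and the sketch assumes row-oriented storage, class numbers running from 1 to n_classes, and the last class as the omitted reference class.

```c
#include <assert.h>

/* Hypothetical helper: expand class numbers (assumed 1-based) into a
 * row-oriented n_observations x n_classes array of 0/1 counts. */
void ids_to_counts(int n_observations, int n_classes,
                   const float ids[], float counts[])
{
    for (int i = 0; i < n_observations; i++) {
        int k = (int)ids[i];
        assert(k >= 1 && k <= n_classes);
        for (int j = 0; j < n_classes; j++)
            counts[i * n_classes + j] = (j == k - 1) ? 1.0f : 0.0f;
    }
}

/* Hypothetical helper: drop the last column (assumed reference class) to
 * obtain a row-oriented n_observations x (n_classes - 1) array. */
void counts_to_group_counts(int n_observations, int n_classes,
                            const float counts[], float group_counts[])
{
    assert(n_classes >= 2);
    for (int i = 0; i < n_observations; i++)
        for (int j = 0; j < n_classes - 1; j++)
            group_counts[i * (n_classes - 1) + j] = counts[i * n_classes + j];
}
```

For three observations in classes 1, 3, and 2 with n_classes = 3, the full counts are (1,0,0), (0,0,1), (0,1,0) and the group counts are (1,0), (0,0), (0,1).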
IMSLS_COLUMN_WISE (Input) If present, the input arrays are column-oriented. That is, contiguous elements in x are values of the same independent variable, or column, except at multiples of n_observations.
Default: Input arrays are row-oriented.
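The difference between the two orientations is only in how element (i, j) is located in memory. As a minimal sketch (the helper name is hypothetical and not part of the library), the following copies a row-oriented array into column-oriented storage:

```c
#include <assert.h>

/* Hypothetical helper: element (i, j) of an n_rows x n_cols array moves
 * from row_major[i * n_cols + j] to col_major[j * n_rows + i]. */
void to_column_wise(int n_rows, int n_cols,
                    const float row_major[], float col_major[])
{
    assert(n_rows > 0 && n_cols > 0);
    for (int i = 0; i < n_rows; i++)
        for (int j = 0; j < n_cols; j++)
            col_major[j * n_rows + i] = row_major[i * n_cols + j];
}
```

With n_rows = n_observations and n_cols = n_independent, each run of n_observations contiguous elements in the result holds one independent variable.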
IMSLS_FREQUENCIES, int frequencies[] (Input) An array of length n_observations containing the number of replications or trials for each of the observations. This argument is required if IMSLS_GROUP_COUNTS is specified and any element of y > 1.
Default: frequencies[i] = 1.
IMSLS_REFERENCE_CLASS, int ref_class (Input) Number specifying which class or outcome category to use as the reference class. See the Description section for details.
Note that the last column of coefficients always represents the reference class. So when ref_class < n_classes, columns ref_class and n_classes are swapped for the output coefficients, i.e., coefficients for class n_classes will be returned in column ref_class of coefficients. For example, if ref_class = 1 and n_classes = 3, the first column of coefficients contains the coefficients for class 3 (n_classes), the second column contains the coefficients for class 2, and the third column contains all zeros for the reference class.
Default: ref_class = n_classes
IMSLS_NO_INTERCEPT (Input) If present, the model will not include an intercept term.
Default: The intercept term is included.
IMSLS_X_INDICES, int n_xin, int xin[] (Input) An array of length n_xin providing the column indices of x that correspond to the independent variables the user wishes to be included in the logistic regression model. For example, suppose there are five independent variables x0, x1, …, x4. To fit a model that includes only x2 and x3, set n_xin = 2, xin[0] = 2, and xin[1] = 3.
Default: All n_independent variables are included.
IMSLS_X_INTERACTIONS, int n_xinteract, int xinteract[] (Input) An array of length n_xinteract × 2 providing pairs of column indices of x that define the interaction terms in the model. Adjacent indices should be unique. For example, suppose there are two independent variables x0 and x1. To fit a model that includes their interaction term, x0x1, set n_xinteract = 1, xinteract[0] = 0, and xinteract[1] = 1.
Default: No interaction terms are included.
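To make the roles of xin and xinteract concrete, the sketch below assembles the terms of one model row: the intercept, the selected columns of x, and the product for each interaction pair. The helper name and the ordering of the terms are assumptions made for illustration; the sketch does not describe the library's internal layout.

```c
#include <assert.h>

/* Hypothetical helper: fill row[] with 1 (intercept), the columns of x_i
 * selected by xin[], and one product per pair in xinteract[].
 * Returns the number of terms written. */
int build_design_row(const float x_i[], int n_independent,
                     int n_xin, const int xin[],
                     int n_xinteract, const int xinteract[],
                     float row[])
{
    int n = 0;
    row[n++] = 1.0f;                               /* intercept term */
    for (int j = 0; j < n_xin; j++) {
        assert(xin[j] >= 0 && xin[j] < n_independent);
        row[n++] = x_i[xin[j]];                    /* main effect */
    }
    for (int j = 0; j < n_xinteract; j++) {
        int a = xinteract[2 * j], b = xinteract[2 * j + 1];
        assert(a >= 0 && a < n_independent);
        assert(b >= 0 && b < n_independent);
        row[n++] = x_i[a] * x_i[b];                /* interaction term */
    }
    return n;
}
```

For x_i = (2, 3), xin = {0, 1}, and xinteract = {0, 1}, the row holds the four terms 1, 2, 3, and 6.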
IMSLS_TOLERANCE, float tolerance (Input) Convergence criterion. Iteration stops when the normed difference between successive estimates is less than tolerance or when max_iter iterations are reached.
Default: tolerance = 100.00 × imsls_f_machine(4)
IMSLS_MAX_ITER, int max_iter (Input) The maximum number of iterations.
Default: max_iter = 20
IMSLS_INIT_INPUT, int init (Input) init must be 0 or 1. If init = 1, initial values for the coefficient estimates are provided in the user array coefficients. If init = 0, initial values are computed by the function.
Default: init = 0
IMSLS_PREV_RESULTS, Imsls_f_model *prev_model (Input) Pointer to a structure of type Imsls_f_model containing information about a previous logistic regression fit. The model is combined with the fit to new data or to IMSLS_NEXT_RESULTS, if provided.
IMSLS_NEXT_RESULTS, Imsls_f_model **next_model (Input/Output) Address of a pointer to a structure of type Imsls_f_model. If present and NULL, the structure is internally allocated and on output contains the model information. If present and not NULL, its contents are combined with the fit to new data or to IMSLS_PREV_RESULTS, if provided. The combined results are returned in next_model.
IMSLS_COEFFICIENTS, float coefficients[] (Input/Output) Storage for the coefficient array of length n_coefficients × n_classes is provided by the user. When init = 1, coefficients should contain the desired initial values of the estimates.
IMSLS_LRSTAT, float *lrstat (Output) The value of the likelihood ratio test statistic.
Description
Function imsls_f_logistic_regression fits a logistic regression model for discrete dependent variables with two or more mutually exclusive outcomes or classes. For a binary response y, the objective is to model the conditional probability of success, π1(x) = Pr[y = 1 ∣ x], where x = (x1, x2, …, xp)' is a realization of p independent variables. Logistic regression models the conditional probability, π1(x), using the cdf of the logistic distribution. In particular,
$$\pi_1(x) = \frac{\exp(\eta_1)}{1 + \exp(\eta_1)} = \frac{1}{1 + \exp(-\eta_1)},$$
where
$$\eta_1 = \beta_{01} + x'\beta_1$$
and
$$\beta_{01}, \quad \beta_1 = (\beta_{11}, \beta_{12}, \ldots, \beta_{1p})'$$
are unknown coefficients that are to be estimated.
Solving for the linear component η1 results in the log-odds or logit transformation of π1(x):
$$\operatorname{logit}(\pi_1(x)) = \log\frac{\pi_1(x)}{1 - \pi_1(x)} = \eta_1.$$
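Given a set of N observations (yi, xi), where yi follows a binomial (n, π) distribution with parameters n = 1 and π = π1(xi), the likelihood and log-likelihood are, respectively,
$$L = \prod_{i=1}^{N} \pi_1(x_i)^{y_i}\,\bigl(1 - \pi_1(x_i)\bigr)^{1 - y_i}, \qquad l = \sum_{i=1}^{N} \Bigl[ y_i \log \pi_1(x_i) + (1 - y_i)\log\bigl(1 - \pi_1(x_i)\bigr) \Bigr].$$
The log-likelihood in terms of the parameters, {β01, β1}, is therefore
$$l(\beta_{01}, \beta_1) = \sum_{i=1}^{N} \Bigl[ y_i \eta_{i1} - \log\bigl(1 + \exp(\eta_{i1})\bigr) \Bigr],$$
where
$$\eta_{i1} = \beta_{01} + x_i'\beta_1.$$
With a binary outcome, only one probability needs to be modeled. The second probability can be obtained from the constraint, π1(x) + π2(x) = 1. If each yi is the number of successes in ni independent trials, the log-likelihood becomes, up to an additive constant that does not depend on the parameters,
$$l = \sum_{i=1}^{N} \Bigl[ y_i \log \pi_1(x_i) + (n_i - y_i)\log\bigl(1 - \pi_1(x_i)\bigr) \Bigr]$$
or
$$l(\beta_{01}, \beta_1) = \sum_{i=1}^{N} \Bigl[ y_i \eta_{i1} - n_i \log\bigl(1 + \exp(\eta_{i1})\bigr) \Bigr].$$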
See optional argument IMSLS_FREQUENCIES to set frequencies ni > 1.
To test the significance of the model, the log-likelihood of the fitted model is compared to that of an intercept-only model. In particular, G = -2 (l(β01) - l(β01, β1)) is a likelihood-ratio test statistic and, under the null hypothesis H0 : β11 = β12 = … = β1p = 0, G is distributed as chi-squared with p degrees of freedom, one for each coefficient set to zero under H0. A significant result suggests that at least one parameter in the model is non-zero. See Hosmer and Lemeshow (2000) for further discussion.
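In the multinomial case, the response vector is yi = (yi1, yi2, …, yiK)', where yik = 1 when the i-th observation belongs to class k and yik = 0 otherwise. Furthermore, because the outcomes are mutually exclusive,
$$\sum_{k=1}^{K} y_{ik} = 1,$$
and π1(x) + π2(x) + ⋯ + πK(x) = 1. The last class K serves as the baseline or reference class in the sense that it is not modeled directly but found from
$$\pi_K(x) = 1 - \sum_{k=1}^{K-1} \pi_k(x).$$
If there are multiple trials, ni > 1, then the constraint on the responses is
$$\sum_{k=1}^{K} y_{ik} = n_i.$$
The log-likelihood in the multinomial case becomes
$$l = \sum_{i=1}^{N} \sum_{l=1}^{K} y_{il} \log \pi_l(x_i)$$
or
$$l = \sum_{i=1}^{N} \left[ \sum_{l=1}^{K} y_{il}\,\eta_{il} - n_i \log\left( \sum_{l=1}^{K} \exp(\eta_{il}) \right) \right].$$
The constraint
$$\sum_{l=1}^{K} \pi_l(x_i) = 1$$
is handled by setting ηiK = 0 for the K-th class, and then the log-likelihood is
$$l = \sum_{i=1}^{N} \left[ \sum_{l=1}^{K-1} y_{il}\,\eta_{il} - n_i \log\left( 1 + \sum_{l=1}^{K-1} \exp(\eta_{il}) \right) \right].$$
Note that for the multinomial case, the log-odds (or logit) is
$$\log\frac{\pi_l(x_i)}{\pi_K(x_i)} = \eta_{il} = \beta_{0l} + x_i'\beta_l, \qquad l = 1, \ldots, K - 1.$$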
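Note that each of the logits involves the odds ratio of being in class l versus class K, the reference class. Writing x̃i = (1, xi')' and β̃l = (β0l, βl')', so that ηil = x̃i'β̃l, maximum likelihood estimates can be obtained by solving the score equation for each parameter:
$$\frac{\partial l}{\partial \tilde\beta_l} = \sum_{i=1}^{N} \tilde{x}_i \bigl( y_{il} - n_i \pi_l(x_i) \bigr) = 0, \qquad l = 1, \ldots, K - 1.$$
To solve the score equations, the function employs a method known as iteratively re-weighted least squares or IRLS. In this case the IRLS is equivalent to the Newton-Raphson algorithm (Hastie et al., 2009; Thisted, 1988).
Consider the full vector of parameters
$$\beta = (\tilde\beta_1', \tilde\beta_2', \ldots, \tilde\beta_{K-1}')' = (\beta_{01}, \beta_1', \beta_{02}, \beta_2', \ldots, \beta_{0,K-1}, \beta_{K-1}')';$$
the Newton-Raphson iteration is
$$\beta^{n+1} = \beta^{n} - H^{-1}(\beta^{n})\, G(\beta^{n}),$$
where H denotes the Hessian matrix, i.e., the matrix of second partial derivatives defined by
$$\frac{\partial^2 l}{\partial \tilde\beta_l\, \partial \tilde\beta_l'} = -\sum_{i=1}^{N} n_i\, \pi_l(x_i) \bigl( 1 - \pi_l(x_i) \bigr)\, \tilde{x}_i \tilde{x}_i'$$
and
$$\frac{\partial^2 l}{\partial \tilde\beta_l\, \partial \tilde\beta_m'} = \sum_{i=1}^{N} n_i\, \pi_l(x_i)\, \pi_m(x_i)\, \tilde{x}_i \tilde{x}_i', \qquad l \ne m,$$
and G denotes the gradient vector, the vector of first partial derivatives,
$$G = \left( \frac{\partial l}{\partial \tilde\beta_1'}, \ldots, \frac{\partial l}{\partial \tilde\beta_{K-1}'} \right)', \qquad \frac{\partial l}{\partial \tilde\beta_l} = \sum_{i=1}^{N} \tilde{x}_i \bigl( y_{il} - n_i \pi_l(x_i) \bigr).$$
Both the gradient and the Hessian are evaluated at the most recent estimate of the parameters, βn. The iteration continues until convergence or until the maximum number of iterations is reached. Following the theory of maximum likelihood estimation (Kendall and Stuart, 1979), standard errors are obtained from the inverse of Fisher's information matrix, (-H)^-1, evaluated at the final estimates.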
When the IMSLS_NEXT_RESULTS option is specified, the function combines estimates of the same model from separate fits using the method presented in Xi, Lin, and Chen (2008). To illustrate, let β1 and β2 be the MLEs from separate fits to two different sets of data, and let H1 and H2 be the associated Hessian matrices. Then the combined estimate,
$$\hat\beta = (H_1 + H_2)^{-1} (H_1 \beta_1 + H_2 \beta_2),$$
approximates the MLE of the combined data set. The model structure addressed by next_model contains the combined estimates as well as other elements. See Table 5.34, The Imsls_f_model Data Structure, below.
Table 5.34: The Imsls_f_model Data Structure

| Parameter | Data Type | Description |
|---|---|---|
| n_obs | int | Total number of observations. If the model structure has been updated three times, first with 100 observations, next with 50, and third with 50, then n_obs = 200. |
| n_updates | int | Total number of times the model structure has been updated. In the above scenario, n_updates = 3. |
| n_coefs | int | Number of coefficients in the model. This parameter must be the same for each model update. |
| coefs | float[] | An array of length n_coefs × n_classes containing the coefficients. |
| meany | float[] | An array of length n_classes containing the overall means for each class variable. |
| stderrs | float[] | An array of length n_coefs × (n_classes - 1) containing the estimated standard errors for the estimated coefficients. |
| grad | float[] | An array of length n_coefs × (n_classes - 1) containing the estimated gradient at the coefficient estimates. |
| hess | float[] | An array of length [n_coefs × (n_classes - 1)] × [n_coefs × (n_classes - 1)] containing the estimated Hessian matrix at the coefficient estimates. |
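The combination of two separate fits can be sketched for a two-parameter model, assuming the combined estimate takes the Hessian-weighted form (H1 + H2)^-1 (H1 β1 + H2 β2). The function name is hypothetical and not part of the library, and the 2 × 2 matrices are stored row by row.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical sketch: combine estimates b1 and b2 with 2 x 2 Hessians h1
 * and h2 (row-major) as (H1 + H2)^{-1} (H1 b1 + H2 b2). */
void combine_estimates(const double b1[2], const double h1[4],
                       const double b2[2], const double h2[4],
                       double out[2])
{
    double s[4], r[2];
    for (int k = 0; k < 4; k++)
        s[k] = h1[k] + h2[k];                       /* H1 + H2 */
    /* right-hand side H1 b1 + H2 b2 */
    r[0] = h1[0] * b1[0] + h1[1] * b1[1] + h2[0] * b2[0] + h2[1] * b2[1];
    r[1] = h1[2] * b1[0] + h1[3] * b1[1] + h2[2] * b2[0] + h2[3] * b2[1];
    double det = s[0] * s[3] - s[1] * s[2];
    assert(fabs(det) > 0.0);
    /* explicit 2 x 2 inverse applied to the right-hand side */
    out[0] = ( s[3] * r[0] - s[1] * r[1]) / det;
    out[1] = (-s[2] * r[0] + s[0] * r[1]) / det;
}
```

When both Hessians are equal, the result is the simple average of the two estimates; a fit with a larger Hessian (more information) pulls the combined estimate toward its own values.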
Remarks
Iteration stops when the estimates converge within tolerance, when the maximum number of iterations is reached, or when the gradient comes within tolerance of 0, whichever occurs first. When the gradient converges before the coefficient estimates do, a condition in the data known as complete or quasi-complete separation may be present. Separation means that one or more independent variables perfectly predict the response. When separation is detected, the function stops the iteration, issues a warning, and returns the current values of the model estimates. Some of the coefficient estimates and standard errors may not be reliable. Furthermore, overflow may occur before the gradient converges. In such cases the function issues a fatal error.
Examples
Example 1
The first example is from Prentice (1976) and involves the mortality of beetles after five hours of exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths that resulted (y):
| Log Dosage | Number of Beetles Exposed | Number of Deaths |
|---|---|---|
| 1.690 | 59 | 6 |
| 1.724 | 60 | 13 |
| 1.755 | 62 | 18 |
| 1.784 | 56 | 28 |
| 1.811 | 63 | 52 |
| 1.836 | 59 | 53 |
| 1.861 | 62 | 61 |
| 1.883 | 60 | 60 |
The number of deaths at each concentration level is the binomial response (n_classes = 2) and the log dosage is the single independent variable. Note that this example illustrates the IMSLS_GROUP_COUNTS format for y and the optional argument IMSLS_FREQUENCIES.
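#include <imsls.h>
int main()
{
    /* Log dosage, deaths, and beetles exposed, from the table above */
    float x1[] = {1.690, 1.724, 1.755, 1.784, 1.811, 1.836, 1.861, 1.883};
    float y1[] = {6, 13, 18, 28, 52, 53, 61, 60};
    int freqs[] = {59, 60, 62, 56, 63, 59, 62, 60};
    float *coefs;
    int n_classes = 2, n_observations = 8, n_independent = 1, n_coefs = 2;
    /* y1 holds counts for one class only; freqs gives the trials per row */
    coefs = imsls_f_logistic_regression(n_observations,
        n_independent, n_classes, x1, y1,
        IMSLS_GROUP_COUNTS,
        IMSLS_FREQUENCIES, freqs,
        0);
    imsls_f_write_matrix("Coefficient Estimates",
        n_coefs * (n_classes - 1), 1, coefs, 0);
}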
Output
Coefficient Estimates
1 -60.76
2 34.30
Example 2
In this example the response is a multinomial random variable with 4 outcome classes. The 3 independent variables represent 2 categorical variables and 1 continuous variable. A subset of 2 independent variables along with the intercept defines the logistic regression model. A test of significance is performed.
printf("LR test statistic: %5.2f\n%d deg. freedom, "
"p-value: %5.4f\n", lrstat, dof, model_pval);
}
Output
Coefficients
1 2.292
2 0.408
3 -0.111
4 -1.162
5 0.245
6 -0.002
7 -0.067
8 0.178
9 -0.017
Std Errs
1 2.259
2 0.548
3 0.051
4 2.122
5 0.500
6 0.044
7 1.862
8 0.442
9 0.039
Log-likelihood: -62.92
LR test statistic: 7.68
6 deg. freedom, p-value: 0.2623
Example 3
Example 3 uses the same data as in Example 2 and an additional set of 50 observations using the same data generating process. The model structure includes all 3 independent variables and an intercept, and a single model fit is approximated from two separate model fits. Example 3 also includes a fit on the full data set for comparison purposes.
imsls_f_write_matrix("Full Data Model Coefficients:",
n_coefs * (n_classes - 1), 1, model3->coefs,
0);
imsls_f_write_matrix("Full Data Model Standard Errors:",
n_coefs * (n_classes - 1), 1, model3->stderrs,
0);
}
Output
First Model Coefficients:
1 1.691
2 0.350
3 -0.137
4 1.057
5 -1.254
6 0.242
7 -0.004
8 0.115
9 1.032
10 0.278
11 0.016
12 -1.954
First Model Standard Errors:
1 2.389
2 0.565
3 0.061
4 1.025
5 2.197
6 0.509
7 0.047
8 0.885
9 2.007
10 0.461
11 0.043
12 0.958
Combined Model Coefficients:
1 -1.169
2 0.649
3 -0.038
4 0.608
5 -1.935
6 0.435
7 0.002
8 0.215
9 -0.193
10 0.282
11 0.002
12 -0.630
Combined Model Standard Errors:
1 1.489
2 0.359
3 0.029
4 0.588
5 1.523
6 0.358
7 0.030
8 0.584
9 1.461
10 0.344
11 0.030
12 0.596
Full Data Model Coefficients:
1 -1.009
2 0.640
3 -0.051
4 0.764
5 -2.008
6 0.436
7 0.003
8 0.263
9 -0.413
10 0.299
11 0.004
12 -0.593
Full Data Model Standard Errors:
1 1.466
2 0.350
3 0.029
4 0.579
5 1.520
6 0.357
7 0.029
8 0.581
9 1.389
10 0.336
11 0.028
12 0.577
Warning Errors
IMSLS_NO_CONV_SEP
Convergence did not occur in # iterations. “tolerance” = #, the error between estimates = #, and the gradient has norm = #. Adjust “tolerance” or “max_iter”, or there may be a separation problem in the data.
IMSLS_EMPTY_INT_RESULTS
Intermediate results given to the function are empty and may be expected to be non‑empty in this scenario.
Fatal Errors
IMSLS_NO_CONV_OVERFLOW
The linear predictor = # is too large and will lead to overflow when exponentiated. The algorithm fails to converge.