categoricalGlm¶
Analyzes categorical data using logistic, Probit, Poisson, and other generalized linear models.
Synopsis¶
categoricalGlm (nClass, nContinuous, model, x)
Required Arguments¶
- int
nClass
(Input) - Number of classification variables.
- int
nContinuous
(Input) - Number of continuous variables.
- int
model
(Input) - Argument model specifies the model used to analyze the data. The six models are as follows:
model |
Relationship | PDF of Response Variable |
---|---|---|
0 | Exponential | Poisson |
1 | Logistic | Negative Binomial |
2 | Logistic | Logarithmic |
3 | Logistic | Binomial |
4 | Probit | Binomial |
5 | Log-log | Binomial |
Relationship between the parameter, θ or λ, and a linear model of the explanatory variables, X β.
Note that the lower bound of the response variable is 1 for model
= 3 and is 0 for all other models. See the
Description section for more
information about these models. |
- float
x[[]]
(Input) Array of size
nObservations
by (nClass + nContinuous
) + m containing data for the independent variables, dependent variable, and optional parameters.The columns must be ordered such that the first
nClass
columns contain data for the class variables, the nextnContinuous
columns contain data for the continuous variables, and the next column contains the response variable. The final (and optional) m − 1 columns contain the optional parameters.
Return Value¶
An integer value indicating the number of estimated coefficients
(nCoefficients
) in the model.
Optional Arguments¶
xColFrequencies
, int (Input)- Column number
xColFrequencies
ofx
containing the frequency of response for each observation. xColFixedParameter
, int (Input)- Column number
xColFixedParameter
ofx
containing a fixed parameter for each observation that is added to the linear response prior to computing the model parameter. The ‘fixed’ parameter allows one to test hypothesis about the parameters via the log-likelihoods. xColDistParameter
, int (Input)Column number
xColDistParameter
ofx
containing the value of the known distribution parameter for each observation, wherex
[i
][xColDistParameter
] is the known distribution parameter associated with the i-th observation. The meaning of the distributional parameter depends upon model as follows:model
Parameter Meaning of x
[i
] [xColDistParameter
]0 E ln (E) is a fixed intercept to be included in the linear predictor (i.e., the offset). 1 S Number of successes required for the negative binomial distribution. 2 Not used for this model. 3-5 N Number of trials required for the binomial distribution. Default: When
model
≠ 2, each observation is assumed to have a parameter value of 1. Whenmodel
= 2, this parameter is not referenced.xColVariables
,iclass
,icontinuous
,iy
(Input)This keyword allows specification of the variables to be used in the analysis and overrides the default ordering of variables described for input argument
x
.Argument
iclass
is an index vector of lengthnClass
containing the column numbers ofx
that correspond to classification variables.Argument
icontinuous
is an index vector of lengthnContinuous
containing the column numbers ofx
that correspond to continuous variables.Argument
iy
indicates the column ofx
that contains the independent variable.eps
, float (Input)Argument
eps
is the convergence criterion. Convergence is assumed when the maximum relative change in any coefficient estimate is less thaneps
from one iteration to the next, or when the relative change in the log-likelihood criterion from one iteration to the next is less thaneps
/ 100.0.Default:
eps
= 0.001maxIterations
, int (Input)Maximum number of iterations. Use
maxIterations
= 0 to compute the Hessian, stored incov
, and the Newton step, stored inlastStep
, at the initial estimates. (The initial estimates must be input. Use keywordinitialEstInput
).Default:
maxIterations
= 30
intercept
(input)
or
noIntercept
(Input)- By default, or if
intercept
is specified, the intercept is automatically included in the model. IfnoIntercept
is specified, there is no intercept in the model (unless otherwise provided for by the user). effects
,nEffects
,nVarEffects
,indicesEffects
(Input)- Variable
nEffects
is the number of effects (sources of variation) in the model. VariablenVarEffects
is an array of lengthnEffects
containing the number of variables associated with each effect in the model. ArgumentindicesEffects
is an index array of lengthnVarEffects
[0]+ nVarEffects
[1]+
…+ nVarEffects
[nEffects
− 1]. The firstnVarEffects
[0] elements give the column numbers ofx
for each variable in the first effect. The nextnVarEffects
[1] elements give the column numbers for each variable in the second effect. The lastnVarEffects
[nEffects
− 1] elements give the column numbers for each variable in the last effect.
initialEstInternal
(Input)
or
initialEstInput
,nCoefInput
,estimates
(Input)- By default, or if
initialEstInput
is specified, then unweighted linear regression is used to obtain initial estimates. IfinitialEstInput
is specified, then thenCoefInput
elements ofestimates
contain initial estimates of the parameters. This requires that the user know the number of coefficients in the model prior to the call tocategoricalGlm
, which can be obtained by callingregressorsForGlm
. maxClass
, int (Input)An upper bound on the sum of the number of distinct values taken on by each classification variable.
Default:
maxClass
=nObservations
×nClass
classInfo
,nClassValues
,classValues
(Output)Argument
nClassValues
is an array of lengthnClass
containing the number of values taken by each classification variable; the i-th classification variable hasnClassValues
[i] distinct values. ArgumentclassValues
is an array of length\[\sum_{\mathrm{i}=0}^{\mathrm{nClass}-1} \mathrm{nClassValues}[i]\]containing the distinct values of the classification variables in ascending order. The first
nClassValues
[0] elements ofclassValues
contain the values for the first classification variables, the nextnClassValues
[1] elements contain the values for the second classification variable, etc.coefStat
(Output)- An array of size
nCoefficients
× 4 containing the parameter estimates and associated statistics, wherenCoefficients
can be computed by callingregressorsForGlm
.
Column | Parameter |
---|---|
0 | Coefficient Estimate. |
1 | Estimated standard deviation of the estimated coefficient. |
2 | Asymptotic normal score for testing that the coefficient is zero. |
3 | The p-value associated with the normal score in column 2. |
criterion
(Output)- Optimization criterion. The maximized log‑likelihood, i.e., the value of the log‑likelihood at the final parameter estimates.
cov
(Output)- The array of size
nCoefficients
×nCoefficients
containing the estimated asymptotic covariance matrix of the coefficients. FormaxIterations
= 0, this is the Hessian computed at the initial parameter estimates, wherenCoefficients
can be computed by callingregressorsForGlm
. means
(Output)- The array containing the means of the design variables. The array is of
length
nCoefficients
ifnoIntercept
is specified, and of lengthnCoefficients
− 1 otherwise, wherenCoefficients
can be computed by callingregressorsForGlm
. caseAnalysis
(Output)The array of size
nObservations
× 5 containing the case analysis.Column Statistic 0 Predicted mean for the observation if model
= 0. Otherwise, contains the probability of success on a single trial.1 The residual. 2 The estimated standard error of the residual. 3 The estimated influence of the observation. 4 The standardized residual. Case statistics are computed for all observations except where missing values prevent their computation.
lastStep
(Output)- The array of length
nCoefficients
containing the last parameter updates (excluding step halvings). FormaxIterations
= 0,lastStep
contains the inverse of the Hessian times the gradient vector, all computed at the initial parameter estimates. obsStatus
(Output)- The array of length
nObservations
indicating which observations are included in the extended likelihood.
obsStatus
[i] |
Status of observation |
---|---|
0 | Observation i is in the likelihood |
1 | Observation i cannot be in the likelihood
because it contains at least one missing value in
x . |
2 | Observation i is not in the likelihood. Its estimated parameter is infinite. |
iterations
,n
,iterations
(Output)- The array of size
nObservations
× 5 containing information about each iteration of the analysis.
Column | Statistic |
---|---|
0 | Method of iteration. Equal to 0 if a Q-N step was taken. Equal to 1 if a N-R step was taken. |
1 | Iteration number. |
2 | Step Size. |
3 | Maximum scaled coefficient update. |
4 | Log-likelihood. |
nRowsMissing
(Output)- Number of rows of data that contain missing values in one or more of the
following arrays or columns of
x
:xColDistParameter
,iy
,xColFrequencies
,xColFixedParameter
,iclass
,icontinuous
, orindicesEffects
.
Remarks¶
- Dummy variables are generated for the classification variables as
follows: An ascending list of all distinct values of each classification
variable is obtained and stored in
classValues
. Dummy variables are then generated for each but the last of these distinct values. Each dummy variable is zero unless the classification variable equals the list value corresponding to the dummy variable, in which case the dummy variable is one. See keywordleaveOutLast
for optional argumentdummy
in function regressorsForGlm (Chapter 2, Regression). - The “product” of a classification variable with a covariate yields dummy variables equal to the product of the covariate with each of the dummy variables associated with the classification variable.
- The “product” of two classification variables yields dummy variables in the usual manner. Each dummy variable associated with the first classification variable multiplies each dummy variable associated with the second classification variable. The resulting dummy variables are such that the index of the second classification variable varies fastest.
Description¶
Function categoricalGlm
uses iteratively re-weighted least squares to
compute (extended) maximum likelihood estimates in some generalized linear
models involving categorized data. One of several models, including the
probit, logistic, Poisson, logarithmic, and negative binomial models, may be
fit.
Note that each row vector in the data matrix can represent a single
observation; or, through the use of optional argument xColFrequencies
,
each row can represent several observations. Also note that classification
variables and their products are easily incorporated into the models via the
usual regression-type specifications.
The models available in categoricalGlm
are:
model |
PDF of the Response Variable | Parameterization |
---|---|---|
0 |
\[f(y)=\left( \lambda_y \exp\left(-\lambda\right) \right)/y!\]
|
\[\lambda=N\times\exp \left( \omega+\eta\right)\]
|
1 |
\[f(y) = \binom {S+y-1}{y-1} \theta^S (1-\theta)^y\]
|
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
|
2 |
\[f(y)=(1-\theta) y/(y1n \theta)\]
|
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
|
3 |
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
|
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
|
4 |
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
|
\[\theta=\phi( \omega+\eta)\]
|
5 |
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
|
\[\theta=1-\exp(-\exp(\omega+\eta))\]
|
Here, Φ denotes the cumulative normal distribution, N and S are known
distribution parameters specified for each observation via the optional
argument xColDistParameter
, and ω is an optional fixed parameter of the
linear response, \(\gamma_i\), specified for each observation. (If
xColFixedParameter
is not specified, then ω is taken to be 0.) Since the
log-log model (model
= 5) probabilities are not symmetric with respect to
0.5, quantitatively, as well as qualitatively, different models result when
the definitions of “success” and “failure” are interchanged in this
distribution. In this model and all other models involving θ, θ is taken to
be the probability of a “success.”
Computational Details¶
The computations proceed as follows:
- The input parameters are checked for consistency and validity.
- Estimates of the means of the “independent” or design variables are
computed. The frequency or the observation in all but binomial
distribution models is taken from vector frequencies. In binomial
distribution models, the frequency is taken as the product of n =
parameter
[i] andfrequencies
[i]. Means are computed as
By default, and when
initialEstInternal
is specified, initial estimates of the coefficients are obtained (based upon the observation intervals) as multiple regression estimates relating transformed observation probabilities to the observation design vector. For example, in the binomial distribution models, θ may be estimated as\[\hat{\theta} = \mathtt{y}[i] / \mathtt{parameter}[i]\]and, when
model
= 3, the linear relationship is given by\[\ln \left(\hat{\theta}/\left(1-\hat{\theta}\right)\right) \approx X\beta\]while if
model
= 4, \(\Phi^{-1}(\theta)=X\beta\). When computing initial estimates, standard modifications are made to prevent illegal operations such as division by zero. Regression estimates are obtained at this point, as well as later, by use of functionregression
(Chapter 2, Regression). Also, at this step of the computations, the regression function is used to detect linear dependence in the model, by the method described forregression
.Newton-Raphson iteration for the maximum likelihood estimates is implemented via iteratively re-weighted least squares. Let
\[\mathit{\Psi}\left(x_i^T \beta\right)\]denote the log of the probability of the i-th observation for coefficients β. In the least-squares model, the weight of the i-th observation is taken as the absolute value of the second derivative of
\[\mathit{\Psi}\left(x_i^T \beta\right)\]with respect to
\[\gamma_i = x_i^T \beta\](times the frequency of the observation), and the dependent variable is taken as the first derivative Ψ with respect to \(\gamma_i\), divided by the square root of the weight times the frequency. The Newton step is given by
\[\mathit{\Delta} \beta = \left(\sum\nolimits^{\mathit{\Psi}''} \left(\gamma_i\right)| x_i x_i^T\right)^{-1} \sum\nolimits^{\mathit{\Psi}'} \left(\gamma_i\right) x_i\]where all derivatives are evaluated at the current estimate of γ and \(\beta_{n+1}=\beta-\delta\beta\). This step is computed as the estimated regression coefficients in the least-squares model. Step halving is used when necessary to ensure a decrease in the criterion.
Convergence is assumed when the maximum relative change in any coefficient update from one iteration to the next is less than
eps
or when the relative change in the log-likelihood from one iteration to the next is less thaneps
/ 100. Convergence is also assumed aftermaxIterations
or when step halving leads to a step size of less than 0.0001 with no increase in the log-likelihood.Residuals are computed according to methods discussed by Pregibon (1981). Let \(l_i(\gamma_i)\) denote the log-likelihood of the i-th observation evaluated at \(\gamma_i\). Then, the standardized residual is computed as
\[r_i = \frac{l'_i\left(\hat{\gamma}_i\right)}{l'_i\left(\hat{\gamma}_i\right)}\]where
\[\hat{\gamma}_i\]is the value of \(\gamma_i\) when evaluated at the optimal
\[\hat{\beta}\]The denominator of this expression is used as the “standard error of the residual” while the numerator is “raw” residual. Following Cook and Weisberg (1982), the influence of the i-th observation is assumed to be
\[l'_i\left(\hat{\gamma}_i\right)^T l''\left(\hat{\gamma}\right)^{-1} l'_i\left(\hat{\gamma}_i\right)\]This quantity is a one-step approximation to the change in the estimates when the i-th observation is deleted. Here, the partial derivatives are with respect to β.
Programming Notes¶
- Indicator (dummy) variables are created for the classification variables
using function
regressorsForGlm
(see Chapter 2, Regression) using keywordleaveOutLast
as the argument to thedummy
optional argument. - To enhance precision, “centering” of covariates is performed if the model
has an intercept and
nObservations
−nRowsMissing
> 1. In doing so, the sample means of the design variables are subtracted from each observation prior to its inclusion in the model. On convergence, the intercept, its variance, and its covariance with the remaining estimates are transformed to the uncentered estimate values. - Two methods for specifying a binomial distribution model are possible. In
the first method, the
xColFrequencies
column ofx
contains the frequency of the observation whiley
is 0 or 1 depending upon whether the observation is a success or failure. In this case, N (distribution parameter) is always 1. The model is treated as repeated Bernoulli trials, and interval observations are not possible. A second method for specifying binomial models is to usey
to represent the number of successes in N trials. In this case, frequencies will usually be 1.
Examples¶
Example 1¶
The first example is from Prentice (1976) and involves the mortality of beetles after five hours exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths which result (y). The data is given as follows:
Log Dosage | Number of Beetles Exposed | Number of Deaths |
---|---|---|
1.690 | 59 | 6 |
1.724 | 60 | 13 |
1.755 | 62 | 18 |
1.784 | 56 | 28 |
1.811 | 63 | 52 |
1.836 | 59 | 53 |
1.861 | 62 | 61 |
1.883 | 60 | 60 |
The number of deaths at each concentration level are fitted as a binomial
response using logit (model
= 3), probit (model
= 4), and log-log
(model
= 5) models. Note that the log-log model yields a smaller absolute
log likelihood (14.81) than the logit model (18.78) or the probit model
(18.23). This is to be expected since the response curve of the log-log model
has an asymmetric appearance, but both the logit and probit models are
symmetric about \(\theta=0.5\).
from __future__ import print_function
from numpy import *
from pyimsl.stat.categoricalGlm import categoricalGlm
from pyimsl.stat.writeMatrix import writeMatrix
x = array([[1.69, 6, 59],
[1.724, 13, 60],
[1.755, 18, 62],
[1.784, 28, 56],
[1.811, 52, 63],
[1.836, 53, 59],
[1.861, 61, 62],
[1.883, 60, 60]])
coef_statistics = []
criterion = []
n_class = 0
n_continuous = 1
model = 3
ipar = 2
fmt = "%12.4f"
clabels = ["", "coefficients", "s.e", "z", "p"]
n_coef = categoricalGlm(n_class, n_continuous,
model, x,
xColDistParameter=ipar,
coefStat=coef_statistics,
criterion=criterion)
writeMatrix("Coefficient statistics for model 3", coef_statistics,
writeFormat=fmt, noRowLabels=True, colLabels=clabels)
print('\nLog likelihood ', criterion[0])
model = 4
n_coef = categoricalGlm(n_class, n_continuous,
model, x,
xColDistParameter=ipar,
coefStat=coef_statistics,
criterion=criterion)
writeMatrix("Coefficient statistics for model 4", coef_statistics,
writeFormat=fmt, noRowLabels=True, colLabels=clabels)
print('\nLog likelihood ', criterion[0])
model = 5
n_coef = categoricalGlm(n_class, n_continuous,
model, x,
xColDistParameter=ipar,
coefStat=coef_statistics,
criterion=criterion)
writeMatrix("Coefficient statistics for model 5", coef_statistics,
writeFormat=fmt, noRowLabels=True, colLabels=clabels)
print('\nLog likelihood ', criterion[0])
Output¶
Log likelihood -18.77817904232357
Log likelihood -18.232354574384566
Log likelihood -14.807800327311949
Coefficient statistics for model 3
coefficients s.e z p
-60.7569 5.1876 -11.7118 0.0000
34.2985 2.9164 11.7607 0.0000
Coefficient statistics for model 4
coefficients s.e z p
-34.9441 2.6412 -13.2305 0.0000
19.7367 1.4852 13.2888 0.0000
Coefficient statistics for model 5
coefficients s.e z p
-39.6406 3.2392 -12.2378 0.0000
22.0838 1.7991 12.2746 0.0000
Example 2¶
Consider the use of a loglinear model to analyze survival-time data. Laird and Oliver (1981) investigate patient survival post heart valve replacement surgery. Surveillance after surgery of the 109 patients included in the study ranged from 3 to 97 months. All patients were classified by heart valve type (aortic or mitral) and by age (less than 55 years or at least 55 years). The data could be considered as a three-way contingency table where patients are classified by valve type, age, and survival (yes or no). However, it would be inappropriate to analyze this data using the standard methodology associated with contingency tables, since this methodology ignores survival time.
Consider a variable, say exposure time (\(E_{ij}\)), that is defined as the sum of the length of times patients of each cross-classification are at risk. The length of time for a patient that dies is the number of months from surgery until death and for a survivor, the length of time is the number of months from surgery until the study ends or the patient withdraws from the study. Now we can model the effect of A = age and V = valve type on the expected number of deaths conditional on exposure time. Thus, for the data (shown in the table below), assume the number of deaths are independent Poisson random variables with means \(m_{ij}\) and fit the following model,
where u is the overall mean,
is the effect of age, and
is the effect of the valve type.
Age | Heart Valve Type | ||
Aortic (0) | Mitral (1) | ||
< 55 years (Age = 0) | Deaths | 4 | 1 |
Exposure | 1259 | 2082 | |
≥ 55 years (Age = 1) | Deaths | 7 | 9 |
Exposure | 1417 | 1647 |
From the coefficient statistics table of the output, note that the risk is estimated to be \(e^{1.22}=3.39\) times higher for older patients in the study. This increase in risk is significant (\(p=0.02\)). However, the decrease in risk for the mitral valve patients is estimated to be \(e^{-0.33}=0.72\) times that of the aortic valve patients and this risk is not significant (\(p=0.45\)).
from numpy import *
from pyimsl.stat.categoricalGlm import categoricalGlm
from pyimsl.stat.writeMatrix import writeMatrix
n_class = 2
n_cont = 0
model = 0
coef = empty(0)
x = array([[4, 1259, 0, 0],
[1, 2082, 0, 1],
[7, 1417, 1, 0],
[9, 1647, 1, 1]])
iclass = [2, 3]
icont = [-1]
clabels = ["", "coefficient", "std error", "z-statistic", "p-value"]
fmt = "%10.6W"
x_col_variables = {"iclass": iclass, "icontinuous": icont, "iy": 0}
n_coef = categoricalGlm(n_class, n_cont, model, x,
coefStat=coef,
xColVariables=x_col_variables,
xColDistParameter=1)
writeMatrix("Coefficient Statistics", coef,
colLabels=clabels, rowNumberZero=True,
writeFormat=fmt)
Output¶
Coefficient Statistics
coefficient std error z-statistic p-value
0 -5.4210 0.3456 -15.6837 0.0000
1 -1.2209 0.5138 -2.3763 0.0177
2 0.3299 0.4382 0.7528 0.4517
Warning Errors¶
IMSLS_TOO_MANY_HALVINGS |
Too many step halvings. Convergence is assumed. |
IMSLS_TOO_MANY_ITERATIONS |
Too many iterations. Convergence is assumed. |
IMSLS_RANK_DEFICIENT_WARN |
The model is rank deficient (rank
=#). Computations will proceed per
setting of tolerance . Check
results for accuracy. |
Fatal Errors¶
IMSLS_TOO_FEW_COEF |
initialEstInput is specified
and “nCoefInput ” = #. The
model specified requires #
coefficients. |
IMSLS_MAX_CLASS_TOO_SMALL |
The number of distinct values of the
classification variables exceeds
“maxClass ” = #. |
IMSLS_INVALID_DATA_8 |
“nClassValues [#]” = #. The
number of distinct values for each
classification variable must be
greater than one. |
IMSLS_NMAX_EXCEEDED |
The number of observations to be
deleted has exceeded “lp_max ”
= #. Rerun with a different model or
increase the workspace. |
IMSLS_RANK_DEFICIENT_TERM |
The model is rank deficient (rank
=#). No solution will be computed.
Refer to the documentation of
optional argument tolerance for
other options. |