categoricalGlm

Analyzes categorical data using logistic, Probit, Poisson, and other generalized linear models.

Synopsis

categoricalGlm (nClass, nContinuous, model, x)

Required Arguments

int nClass (Input)
Number of classification variables.
int nContinuous (Input)
Number of continuous variables.
int model (Input)
Argument model specifies the model used to analyze the data. The six models are as follows:
model Relationship PDF of Response Variable
0 Exponential Poisson
1 Logistic Negative Binomial
2 Logistic Logarithmic
3 Logistic Binomial
4 Probit Binomial
5 Log-log Binomial

Relationship between the parameter, θ or λ, and a linear model of the explanatory variables, X β.

Note that the lower bound of the response variable is 1 for model = 3 and is 0 for all other models. See the Description section for more information about these models.
float x[[]] (Input)

Array of size nObservations by (nClass + nContinuous) + m containing data for the independent variables, dependent variable, and optional parameters.

The columns must be ordered such that the first nClass columns contain data for the class variables, the next nContinuous columns contain data for the continuous variables, and the next column contains the response variable. The final (and optional) m − 1 columns contain the optional parameters.

Return Value

An integer value indicating the number of estimated coefficients (nCoefficients) in the model.

Optional Arguments

xColFrequencies, int (Input)
Column number xColFrequencies of x containing the frequency of response for each observation.
xColFixedParameter, int (Input)
Column number xColFixedParameter of x containing a fixed parameter for each observation that is added to the linear response prior to computing the model parameter. The ‘fixed’ parameter allows one to test hypothesis about the parameters via the log-likelihoods.
xColDistParameter, int (Input)

Column number xColDistParameter of x containing the value of the known distribution parameter for each observation, where x[i][xColDistParameter] is the known distribution parameter associated with the i-th observation. The meaning of the distributional parameter depends upon model as follows:

model Parameter Meaning of x [i] [xColDistParameter]
0 E ln (E) is a fixed intercept to be included in the linear predictor (i.e., the offset).
1 S Number of successes required for the negative binomial distribution.
2
Not used for this model.
3-5 N Number of trials required for the binomial distribution.

Default: When model ≠ 2, each observation is assumed to have a parameter value of 1. When model = 2, this parameter is not referenced.

xColVariables, iclass, icontinuous, iy (Input)

This keyword allows specification of the variables to be used in the analysis and overrides the default ordering of variables described for input argument x.

Argument iclass is an index vector of length nClass containing the column numbers of x that correspond to classification variables.

Argument icontinuous is an index vector of length nContinuous containing the column numbers of x that correspond to continuous variables.

Argument iy indicates the column of x that contains the independent variable.

eps, float (Input)

Argument eps is the convergence criterion. Convergence is assumed when the maximum relative change in any coefficient estimate is less than eps from one iteration to the next, or when the relative change in the log-likelihood criterion from one iteration to the next is less than eps / 100.0.

Default: eps = 0.001

maxIterations, int (Input)

Maximum number of iterations. Use maxIterations = 0 to compute the Hessian, stored in cov, and the Newton step, stored in lastStep, at the initial estimates. (The initial estimates must be input. Use keyword initialEstInput).

Default: maxIterations = 30

intercept (input)

or

noIntercept (Input)
By default, or if intercept is specified, the intercept is automatically included in the model. If noIntercept is specified, there is no intercept in the model (unless otherwise provided for by the user).
effects, nEffects, nVarEffects, indicesEffects (Input)
Variable nEffects is the number of effects (sources of variation) in the model. Variable nVarEffects is an array of length nEffects containing the number of variables associated with each effect in the model. Argument indicesEffects is an index array of length nVarEffects [0] + nVarEffects [1] ++ nVarEffects [nEffects − 1]. The first nVarEffects [0] elements give the column numbers of x for each variable in the first effect. The next nVarEffects [1] elements give the column numbers for each variable in the second effect. The last nVarEffects [nEffects − 1] elements give the column numbers for each variable in the last effect.

initialEstInternal (Input)

or

initialEstInput, nCoefInput, estimates (Input)
By default, or if initialEstInput is specified, then unweighted linear regression is used to obtain initial estimates. If initialEstInput is specified, then the nCoefInput elements of estimates contain initial estimates of the parameters. This requires that the user know the number of coefficients in the model prior to the call to categoricalGlm, which can be obtained by calling regressorsForGlm.
maxClass, int (Input)

An upper bound on the sum of the number of distinct values taken on by each classification variable.

Default: maxClass = nObservations × nClass

classInfo, nClassValues, classValues (Output)

Argument nClassValues is an array of length nClass containing the number of values taken by each classification variable; the i-th classification variable has nClassValues [i] distinct values. Argument classValues is an array of length

\[\sum_{\mathrm{i}=0}^{\mathrm{nClass}-1} \mathrm{nClassValues}[i]\]

containing the distinct values of the classification variables in ascending order. The first nClassValues [0] elements of classValues contain the values for the first classification variables, the next nClassValues [1] elements contain the values for the second classification variable, etc.

coefStat (Output)
An array of size nCoefficients × 4 containing the parameter estimates and associated statistics, where nCoefficients can be computed by calling regressorsForGlm.
Column Parameter
0 Coefficient Estimate.
1 Estimated standard deviation of the estimated coefficient.
2 Asymptotic normal score for testing that the coefficient is zero.
3 The p-value associated with the normal score in column 2.
criterion (Output)
Optimization criterion. The maximized log‑likelihood, i.e., the value of the log‑likelihood at the final parameter estimates.
cov (Output)
The array of size nCoefficients × nCoefficients containing the estimated asymptotic covariance matrix of the coefficients. For maxIterations = 0, this is the Hessian computed at the initial parameter estimates, where nCoefficients can be computed by calling regressorsForGlm.
means (Output)
The array containing the means of the design variables. The array is of length nCoefficients if noIntercept is specified, and of length nCoefficients − 1 otherwise, where nCoefficients can be computed by calling regressorsForGlm.
caseAnalysis (Output)

The array of size nObservations × 5 containing the case analysis.

Column Statistic
0 Predicted mean for the observation if model = 0. Otherwise, contains the probability of success on a single trial.
1 The residual.
2 The estimated standard error of the residual.
3 The estimated influence of the observation.
4 The standardized residual.

Case statistics are computed for all observations except where missing values prevent their computation.

lastStep (Output)
The array of length nCoefficients containing the last parameter updates (excluding step halvings). For maxIterations = 0, lastStep contains the inverse of the Hessian times the gradient vector, all computed at the initial parameter estimates.
obsStatus (Output)
The array of length nObservations indicating which observations are included in the extended likelihood.
obsStatus [i] Status of observation
0 Observation i is in the likelihood
1 Observation i cannot be in the likelihood because it contains at least one missing value in x.
2 Observation i is not in the likelihood. Its estimated parameter is infinite.
iterations, n, iterations (Output)
The array of size nObservations × 5 containing information about each iteration of the analysis.
Column Statistic
0 Method of iteration. Equal to 0 if a Q-N step was taken. Equal to 1 if a N-R step was taken.
1 Iteration number.
2 Step Size.
3 Maximum scaled coefficient update.
4 Log-likelihood.
nRowsMissing (Output)
Number of rows of data that contain missing values in one or more of the following arrays or columns of x: xColDistParameter, iy, xColFrequencies, xColFixedParameter, iclass, icontinuous, or indicesEffects.

Remarks

  1. Dummy variables are generated for the classification variables as follows: An ascending list of all distinct values of each classification variable is obtained and stored in classValues. Dummy variables are then generated for each but the last of these distinct values. Each dummy variable is zero unless the classification variable equals the list value corresponding to the dummy variable, in which case the dummy variable is one. See keyword leaveOutLast for optional argument dummy in function regressorsForGlm (Chapter 2, Regression).
  2. The “product” of a classification variable with a covariate yields dummy variables equal to the product of the covariate with each of the dummy variables associated with the classification variable.
  3. The “product” of two classification variables yields dummy variables in the usual manner. Each dummy variable associated with the first classification variable multiplies each dummy variable associated with the second classification variable. The resulting dummy variables are such that the index of the second classification variable varies fastest.

Description

Function categoricalGlm uses iteratively re-weighted least squares to compute (extended) maximum likelihood estimates in some generalized linear models involving categorized data. One of several models, including the probit, logistic, Poisson, logarithmic, and negative binomial models, may be fit.

Note that each row vector in the data matrix can represent a single observation; or, through the use of optional argument xColFrequencies, each row can represent several observations. Also note that classification variables and their products are easily incorporated into the models via the usual regression-type specifications.

The models available in categoricalGlm are:

model PDF of the Response Variable Parameterization
0
\[f(y)=\left( \lambda_y \exp\left(-\lambda\right) \right)/y!\]
\[\lambda=N\times\exp \left( \omega+\eta\right)\]
1
\[f(y) = \binom {S+y-1}{y-1} \theta^S (1-\theta)^y\]
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
2
\[f(y)=(1-\theta) y/(y1n \theta)\]
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
3
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
\[\theta=\frac{\exp(\omega+\eta)}{1+\exp(\omega+\eta)}\]
4
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
\[\theta=\phi( \omega+\eta)\]
5
\[f(y)=\binom{N}{y}\theta^y/(1-\theta)^{N-y}\]
\[\theta=1-\exp(-\exp(\omega+\eta))\]

Here, Φ denotes the cumulative normal distribution, N and S are known distribution parameters specified for each observation via the optional argument xColDistParameter, and ω is an optional fixed parameter of the linear response, \(\gamma_i\), specified for each observation. (If xColFixedParameter is not specified, then ω is taken to be 0.) Since the log-log model (model = 5) probabilities are not symmetric with respect to 0.5, quantitatively, as well as qualitatively, different models result when the definitions of “success” and “failure” are interchanged in this distribution. In this model and all other models involving θ, θ is taken to be the probability of a “success.”

Computational Details

The computations proceed as follows:

  1. The input parameters are checked for consistency and validity.
  2. Estimates of the means of the “independent” or design variables are computed. The frequency or the observation in all but binomial distribution models is taken from vector frequencies. In binomial distribution models, the frequency is taken as the product of n = parameter [i] and frequencies [i]. Means are computed as
\[\overline{x} = \frac{\Sigma f_i x_i}{\Sigma f_i}\]
  1. By default, and when initialEstInternal is specified, initial estimates of the coefficients are obtained (based upon the observation intervals) as multiple regression estimates relating transformed observation probabilities to the observation design vector. For example, in the binomial distribution models, θ may be estimated as

    \[\hat{\theta} = \mathtt{y}[i] / \mathtt{parameter}[i]\]

    and, when model = 3, the linear relationship is given by

    \[\ln \left(\hat{\theta}/\left(1-\hat{\theta}\right)\right) \approx X\beta\]

    while if model = 4, \(\Phi^{-1}(\theta)=X\beta\). When computing initial estimates, standard modifications are made to prevent illegal operations such as division by zero. Regression estimates are obtained at this point, as well as later, by use of function regression (Chapter 2, Regression). Also, at this step of the computations, the regression function is used to detect linear dependence in the model, by the method described for regression.

  2. Newton-Raphson iteration for the maximum likelihood estimates is implemented via iteratively re-weighted least squares. Let

    \[\mathit{\Psi}\left(x_i^T \beta\right)\]

    denote the log of the probability of the i-th observation for coefficients β. In the least-squares model, the weight of the i-th observation is taken as the absolute value of the second derivative of

    \[\mathit{\Psi}\left(x_i^T \beta\right)\]

    with respect to

    \[\gamma_i = x_i^T \beta\]

    (times the frequency of the observation), and the dependent variable is taken as the first derivative Ψ with respect to \(\gamma_i\), divided by the square root of the weight times the frequency. The Newton step is given by

    \[\mathit{\Delta} \beta = \left(\sum\nolimits^{\mathit{\Psi}''} \left(\gamma_i\right)| x_i x_i^T\right)^{-1} \sum\nolimits^{\mathit{\Psi}'} \left(\gamma_i\right) x_i\]

    where all derivatives are evaluated at the current estimate of γ and \(\beta_{n+1}=\beta-\delta\beta\). This step is computed as the estimated regression coefficients in the least-squares model. Step halving is used when necessary to ensure a decrease in the criterion.

  3. Convergence is assumed when the maximum relative change in any coefficient update from one iteration to the next is less than eps or when the relative change in the log-likelihood from one iteration to the next is less than eps / 100. Convergence is also assumed after maxIterations or when step halving leads to a step size of less than 0.0001 with no increase in the log-likelihood.

  4. Residuals are computed according to methods discussed by Pregibon (1981). Let \(l_i(\gamma_i)\) denote the log-likelihood of the i-th observation evaluated at \(\gamma_i\). Then, the standardized residual is computed as

    \[r_i = \frac{l'_i\left(\hat{\gamma}_i\right)}{l'_i\left(\hat{\gamma}_i\right)}\]

    where

    \[\hat{\gamma}_i\]

    is the value of \(\gamma_i\) when evaluated at the optimal

    \[\hat{\beta}\]

    The denominator of this expression is used as the “standard error of the residual” while the numerator is “raw” residual. Following Cook and Weisberg (1982), the influence of the i-th observation is assumed to be

    \[l'_i\left(\hat{\gamma}_i\right)^T l''\left(\hat{\gamma}\right)^{-1} l'_i\left(\hat{\gamma}_i\right)\]

    This quantity is a one-step approximation to the change in the estimates when the i-th observation is deleted. Here, the partial derivatives are with respect to β.

Programming Notes

  1. Indicator (dummy) variables are created for the classification variables using function regressorsForGlm (see Chapter 2, Regression) using keyword leaveOutLast as the argument to the dummy optional argument.
  2. To enhance precision, “centering” of covariates is performed if the model has an intercept and nObservationsnRowsMissing > 1. In doing so, the sample means of the design variables are subtracted from each observation prior to its inclusion in the model. On convergence, the intercept, its variance, and its covariance with the remaining estimates are transformed to the uncentered estimate values.
  3. Two methods for specifying a binomial distribution model are possible. In the first method, the xColFrequencies column of x contains the frequency of the observation while y is 0 or 1 depending upon whether the observation is a success or failure. In this case, N (distribution parameter) is always 1. The model is treated as repeated Bernoulli trials, and interval observations are not possible. A second method for specifying binomial models is to use y to represent the number of successes in N trials. In this case, frequencies will usually be 1.

Examples

Example 1

The first example is from Prentice (1976) and involves the mortality of beetles after five hours exposure to eight different concentrations of carbon disulphide. The table below lists the number of beetles exposed (N) to each concentration level of carbon disulphide (x, given as log dosage) and the number of deaths which result (y). The data is given as follows:

Log Dosage Number of Beetles Exposed Number of Deaths
1.690 59 6
1.724 60 13
1.755 62 18
1.784 56 28
1.811 63 52
1.836 59 53
1.861 62 61
1.883 60 60

The number of deaths at each concentration level are fitted as a binomial response using logit (model = 3), probit (model = 4), and log-log (model = 5) models. Note that the log-log model yields a smaller absolute log likelihood (14.81) than the logit model (18.78) or the probit model (18.23). This is to be expected since the response curve of the log-log model has an asymmetric appearance, but both the logit and probit models are symmetric about \(\theta=0.5\).

from __future__ import print_function
from numpy import *
from pyimsl.stat.categoricalGlm import categoricalGlm
from pyimsl.stat.writeMatrix import writeMatrix

x = array([[1.69, 6, 59],
           [1.724, 13, 60],
           [1.755, 18, 62],
           [1.784, 28, 56],
           [1.811, 52, 63],
           [1.836, 53, 59],
           [1.861, 61, 62],
           [1.883, 60, 60]])
coef_statistics = []
criterion = []
n_class = 0
n_continuous = 1

model = 3
ipar = 2
fmt = "%12.4f"
clabels = ["", "coefficients", "s.e", "z", "p"]

n_coef = categoricalGlm(n_class, n_continuous,
                        model, x,
                        xColDistParameter=ipar,
                        coefStat=coef_statistics,
                        criterion=criterion)

writeMatrix("Coefficient statistics for model 3", coef_statistics,
            writeFormat=fmt, noRowLabels=True, colLabels=clabels)

print('\nLog likelihood    ', criterion[0])

model = 4
n_coef = categoricalGlm(n_class, n_continuous,
                        model, x,
                        xColDistParameter=ipar,
                        coefStat=coef_statistics,
                        criterion=criterion)


writeMatrix("Coefficient statistics for model 4", coef_statistics,
            writeFormat=fmt, noRowLabels=True, colLabels=clabels)

print('\nLog likelihood    ', criterion[0])

model = 5
n_coef = categoricalGlm(n_class, n_continuous,
                        model, x,
                        xColDistParameter=ipar,
                        coefStat=coef_statistics,
                        criterion=criterion)

writeMatrix("Coefficient statistics for model 5", coef_statistics,
            writeFormat=fmt, noRowLabels=True, colLabels=clabels)
print('\nLog likelihood    ', criterion[0])

Output

Log likelihood     -18.77817904232357

Log likelihood     -18.232354574384566

Log likelihood     -14.807800327311949
 
          Coefficient statistics for model 3
coefficients           s.e             z             p
    -60.7569        5.1876      -11.7118        0.0000
     34.2985        2.9164       11.7607        0.0000
 
          Coefficient statistics for model 4
coefficients           s.e             z             p
    -34.9441        2.6412      -13.2305        0.0000
     19.7367        1.4852       13.2888        0.0000
 
          Coefficient statistics for model 5
coefficients           s.e             z             p
    -39.6406        3.2392      -12.2378        0.0000
     22.0838        1.7991       12.2746        0.0000

Example 2

Consider the use of a loglinear model to analyze survival-time data. Laird and Oliver (1981) investigate patient survival post heart valve replacement surgery. Surveillance after surgery of the 109 patients included in the study ranged from 3 to 97 months. All patients were classified by heart valve type (aortic or mitral) and by age (less than 55 years or at least 55 years). The data could be considered as a three-way contingency table where patients are classified by valve type, age, and survival (yes or no). However, it would be inappropriate to analyze this data using the standard methodology associated with contingency tables, since this methodology ignores survival time.

Consider a variable, say exposure time (\(E_{ij}\)), that is defined as the sum of the length of times patients of each cross-classification are at risk. The length of time for a patient that dies is the number of months from surgery until death and for a survivor, the length of time is the number of months from surgery until the study ends or the patient withdraws from the study. Now we can model the effect of A = age and V = valve type on the expected number of deaths conditional on exposure time. Thus, for the data (shown in the table below), assume the number of deaths are independent Poisson random variables with means \(m_{ij}\) and fit the following model,

\[\log\left(\frac{m_{ij}}{E_{ij}}\right) = u + \gamma_i^A + \gamma_j^V\]

where u is the overall mean,

\[\lambda_i^A\]

is the effect of age, and

\[\lambda_j^V\]

is the effect of the valve type.

Age   Heart Valve Type
Aortic (0) Mitral (1)
< 55 years (Age = 0) Deaths 4 1
Exposure 1259 2082
≥ 55 years (Age = 1) Deaths 7 9
Exposure 1417 1647

From the coefficient statistics table of the output, note that the risk is estimated to be \(e^{1.22}=3.39\) times higher for older patients in the study. This increase in risk is significant (\(p=0.02\)). However, the decrease in risk for the mitral valve patients is estimated to be \(e^{-0.33}=0.72\) times that of the aortic valve patients and this risk is not significant (\(p=0.45\)).

from numpy import *
from pyimsl.stat.categoricalGlm import categoricalGlm
from pyimsl.stat.writeMatrix import writeMatrix

n_class = 2
n_cont = 0
model = 0
coef = empty(0)
x = array([[4, 1259, 0, 0],
           [1, 2082, 0, 1],
           [7, 1417, 1, 0],
           [9, 1647, 1, 1]])
iclass = [2, 3]
icont = [-1]
clabels = ["", "coefficient", "std error", "z-statistic", "p-value"]
fmt = "%10.6W"
x_col_variables = {"iclass": iclass, "icontinuous": icont, "iy": 0}

n_coef = categoricalGlm(n_class, n_cont, model, x,
                        coefStat=coef,
                        xColVariables=x_col_variables,
                        xColDistParameter=1)

writeMatrix("Coefficient Statistics", coef,
            colLabels=clabels, rowNumberZero=True,
            writeFormat=fmt)

Output

 
              Coefficient Statistics
   coefficient   std error  z-statistic     p-value
0      -5.4210      0.3456     -15.6837      0.0000
1      -1.2209      0.5138      -2.3763      0.0177
2       0.3299      0.4382       0.7528      0.4517

Warning Errors

IMSLS_TOO_MANY_HALVINGS Too many step halvings. Convergence is assumed.
IMSLS_TOO_MANY_ITERATIONS Too many iterations. Convergence is assumed.
IMSLS_RANK_DEFICIENT_WARN The model is rank deficient (rank =#). Computations will proceed per setting of tolerance. Check results for accuracy.

Fatal Errors

IMSLS_TOO_FEW_COEF initialEstInput is specified and “nCoefInput” = #. The model specified requires # coefficients.
IMSLS_MAX_CLASS_TOO_SMALL The number of distinct values of the classification variables exceeds “maxClass” = #.
IMSLS_INVALID_DATA_8 nClassValues[#]” = #. The number of distinct values for each classification variable must be greater than one.
IMSLS_NMAX_EXCEEDED The number of observations to be deleted has exceeded “lp_max” = #. Rerun with a different model or increase the workspace.
IMSLS_RANK_DEFICIENT_TERM The model is rank deficient (rank =#). No solution will be computed. Refer to the documentation of optional argument tolerance for other options.