Usage Notes

Routines for modeling and analyzing a two- or higher‑dimensional contingency table are described in this chapter. Also included are routines for modeling responses from some discrete distributions when discrete or continuous covariates are measured.

The Basic Data Structures

The most common of the three data structures used by the routines in this chapter is a multidimensional (or multi‑way) contingency table input as a real vector with length equal to the product of the number of categories for each dimension. This structure may be obtained from a data matrix X via the routine FREQ in Chapter 1, “Basic Statistics”. Alternatively, multi‑way tables may be created and input directly by the user. The multi‑way structure is used by all of the log‑linear modeling routines (PRPFT, CTLLN, CTPAR, CTASC, and CTSTP), and is also used in the randomization tests routine, CTRAN.
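As a rough illustration of this structure, the following Python sketch counts observations into a flat vector whose length is the product of the category counts. The function name `tabulate` and the storage order (first index varying fastest, as in Fortran arrays) are assumptions made for the example; they are not taken from the FREQ documentation.

```python
def tabulate(rows, ncat):
    """Count observations into a flat vector of length prod(ncat).

    rows : iterable of tuples of 0-based category indices, one per observation
    ncat : number of categories for each dimension
    Assumption: the first index varies fastest (Fortran-style storage).
    """
    size = 1
    for n in ncat:
        size *= n
    table = [0.0] * size
    for row in rows:
        offset, stride = 0, 1
        for idx, n in zip(row, ncat):
            offset += idx * stride  # position of this cell in the flat vector
            stride *= n
        table[offset] += 1.0
    return table


# Four observations on three classification variables with 2, 3, and 2 levels.
rows = [(0, 0, 1), (1, 0, 1), (1, 2, 0), (1, 2, 0)]
table = tabulate(rows, ncat=(2, 3, 2))
```

The resulting vector has 2 × 3 × 2 = 12 elements, one per cell of the table.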

A second data structure, used by the categorical generalized linear models routine CTGLM, is the data matrix X. In CTGLM (and elsewhere), if X has many identical rows, at least on the variables of interest, consider using Chapter 1 routine CSTAT to add a frequency variable to a reduced matrix X. The transposed output from this routine can replace X as input to CTGLM, and CTGLM will perform its computations faster (with a linear speedup) on the reduced matrix.
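The idea of replacing repeated rows by one row plus a frequency can be sketched in a few lines of Python. This is only an illustration of the reduction itself; the helper `collapse_rows` is hypothetical and does not reproduce the output layout of CSTAT.

```python
from collections import Counter


def collapse_rows(x):
    """Return the distinct rows of x, each with its frequency appended."""
    counts = Counter(tuple(row) for row in x)
    # Counter keeps rows in order of first appearance (Python 3.7+).
    return [list(row) + [float(freq)] for row, freq in counts.items()]


x = [[1, 0], [1, 0], [0, 1], [1, 0]]
reduced = collapse_rows(x)
```

Here four rows collapse to two, so a model fit that uses the frequency column touches half as many rows.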

Finally, two‑way tables are input into routines CTCHI, CTTWO, CTPRB, CTEPR, and CTWLS as two‑dimensional real arrays. As with the multidimensional arrays, two‑dimensional arrays may be created via Chapter 1 routine FREQ, in which case the leading dimension must equal the number of categories for the first dimension in the table, or they can be created and input directly by the user. Alternatively, the routine TWFRQ from Chapter 1 may be used to obtain the two‑way frequency table.

Types of Analysis

Routines CTCHI (r × c) and CTTWO (2 × 2) (see Chapter 1, “Basic Statistics”) compute many statistics of interest in a two‑way table. Statistics computed by these routines include the usual chi‑squared statistics, measures of association, Kappa, and many others. Asymptotic statistics for a two‑way table that are not computed by either CTCHI or CTTWO can probably be computed by routines CTRAN or CTWLS, but these latter two routines require more setup because the user must indicate how the statistics are to be computed. Exact probabilities for two‑way tables can be computed by CTPRB, but this routine uses the total enumeration algorithm and thus often uses orders of magnitude more computer time than CTEPR, which computes the same probabilities by use of the network algorithm (but can still be quite expensive).
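For orientation, the usual Pearson chi‑squared statistic for independence in an r × c table can be written as the short Python sketch below. It is a textbook formula, not the CTCHI implementation, and the function name `pearson_chi_squared` is an assumption made for the example. The sketch assumes every row and column total is positive.

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    Assumes all row and column totals are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = float(sum(row_totals))
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = pearson_chi_squared([[10, 20], [30, 40]])
```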

The routines in the second section are all concerned with hierarchical log‑linear models (see, e.g., Bishop, Fienberg, and Holland 1975). The routines in Chapter 1, “Basic Statistics” will often be used to obtain the multi‑dimensional tables input into these routines, or the table will be input directly by the user. If the hierarchical model is not known, routine CTASC will often be the first routine considered. The partial association statistics computed by this routine can be used to obtain a rough estimate of the model to be used. This rough model can then be refined through the use of CTSTP, which does stepwise model building. Of course, both of these routines are subject to the usual problems associated with building models once the data have been collected: the resulting models may not be correct.

Once a model has been selected (provisional or otherwise), routine CTLLN can be used to compute and print many model statistics (parameter estimates, residuals, goodness of fit tests, etc.). If only the parameter estimates and associated variance/covariance matrix are needed, CTPAR can be used instead. Both of these routines can compute estimates when sampling and/or structural zeros (cells in the table with observed or restricted counts of zero, respectively) are present in the table, as can all routines in this section.

The algorithm underlying all of the routines in the second section is the iterative proportional fitting algorithm, which is implemented in routine PRPFT. When structural or sampling zeros are present in the table, this algorithm can be quite slow to converge. Also, only the expected cell counts are returned by PRPFT, and it can be quite difficult to determine degrees of freedom when structural zeros are present in the data. Because a structural zero is a restriction on the parameter space, 1 degree of freedom must be subtracted for each structural zero in the multi‑way table. The difficulty is in determining where the subtraction should occur. All routines in this section use a Cholesky factorization of X^T X, where X is the “design matrix,” to determine which effects should lose degrees of freedom because of structural zeros. Sampling zeros, although they can lead to infinite parameter estimates, do not subtract from the total degrees of freedom. See Clarkson and Jennrich (1991), or Baker, Clarke, and Lane (1985) for details.
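A minimal Python sketch of iterative proportional fitting may make the algorithm concrete. It is a generic textbook version, not the PRPFT code: the function name `ipf`, its arguments, and the convergence test are assumptions made for the example. Structural zeros are handled here simply by starting those cells at zero so that they stay at zero; the degrees‑of‑freedom bookkeeping discussed above is not attempted.

```python
from itertools import product


def ipf(observed, shape, margins, structural_zeros=(), tol=1e-8, max_iter=200):
    """Fit a hierarchical log-linear model by iterative proportional fitting.

    observed         : dict mapping cell index tuples to observed counts
    shape            : number of categories for each dimension
    margins          : tuples of dimension indices, one per fitted marginal
                       table, e.g. [(0, 1), (0, 2), (1, 2)]
    structural_zeros : cells whose expected count is fixed at zero
    Returns a dict of expected cell counts.
    """
    cells = list(product(*(range(n) for n in shape)))
    fitted = {c: 0.0 if c in structural_zeros else 1.0 for c in cells}
    for _ in range(max_iter):
        max_change = 0.0
        for dims in margins:
            obs_m, fit_m = {}, {}
            for c in cells:
                key = tuple(c[d] for d in dims)
                obs_m[key] = obs_m.get(key, 0.0) + observed[c]
                fit_m[key] = fit_m.get(key, 0.0) + fitted[c]
            for c in cells:
                key = tuple(c[d] for d in dims)
                # Rescale so the fitted margin matches the observed margin.
                new = fitted[c] * obs_m[key] / fit_m[key] if fit_m[key] > 0 else 0.0
                max_change = max(max_change, abs(new - fitted[c]))
                fitted[c] = new
        if max_change < tol:
            break
    return fitted


# A 2 x 2 x 2 table fit with all two-factor interactions and no
# three-factor interaction.
counts = [10, 20, 30, 40, 15, 25, 35, 45]
observed = dict(zip(product(range(2), repeat=3), counts))
fitted = ipf(observed, (2, 2, 2), [(0, 1), (0, 2), (1, 2)])
```

At convergence the fitted counts reproduce each of the listed two‑way marginal tables of the observed data.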

Routine CTRAN computes generalized Mantel‑Haenszel statistics in stratified r × c tables. Generalized Mantel‑Haenszel statistics assume that the “direction” of departure from the null hypothesis is consistent from one table to the next. Under this assumption, statistics computed for each table are pooled across all strata, yielding a more powerful test than could be obtained otherwise. The statistics computed include measures of correlation, location, and independence using user‑selected row and/or column scores. Details can be found in Koch, Amara, and Atkinson (1983) or in the “Algorithm” section for CTRAN.
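The pooling idea is easiest to see in the classical special case of stratified 2 × 2 tables. The Python sketch below computes the Mantel‑Haenszel statistic without a continuity correction. It illustrates that special case only, not the generalized statistics or scoring options of CTRAN, and the function name `mantel_haenszel_2x2` is an assumption made for the example.

```python
def mantel_haenszel_2x2(strata):
    """Mantel-Haenszel chi-squared statistic (1 df, no continuity correction).

    strata : list of 2 x 2 tables [[a, b], [c, d]], one per stratum
    Assumes at least one stratum has positive variance.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # such a stratum carries no information
        r1, r2 = a + b, c + d
        c1, c2 = a + c, b + d
        # Deviations of the (1,1) cell from its null expectation are summed
        # over strata before squaring, so consistent departures reinforce
        # each other.
        diff_sum += a - r1 * c1 / n
        var_sum += r1 * r2 * c1 * c2 / (n * n * (n - 1))
    return diff_sum ** 2 / var_sum


strata = [[[10, 5], [4, 12]], [[8, 6], [3, 9]]]
stat = mantel_haenszel_2x2(strata)
```

Because the deviations are summed before squaring, departures in opposite directions in different strata would cancel, which is why the consistency assumption matters.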

The routine CTGLM in the fourth section is concerned with generalized linear models (see McCullagh and Nelder 1983) in discrete data. This routine may be used to compute estimates and associated statistics in probit, logistic, minimum extreme value, Poisson, negative binomial (with known number of successes), and logarithmic models. Classification variables as well as weights, frequencies and additive constants may be used so that quite general linear models can be fit. Residuals, a measure of influence, the coefficient estimates, and other statistics are returned for each model fit. When infinite parameter estimates are required, extended maximum likelihood estimation may be used. Log‑linear models may be fit in CTGLM through the use of Poisson regression models. Results from Poisson regression models involving structural and sampling zeros will be identical to the results obtained from the log‑linear model routines but will be fit by a quasi‑Newton algorithm rather than through iterative proportional fitting.

The weighted least‑squares analysis of Grizzle, Starmer, and Koch (1969) is implemented in routine CTWLS. In this routine, the user first transforms the observed probability estimates (in predefined ways) and then fits a linear model to the transformed estimates using generalized least squares. Multivariate hypotheses associated with the coefficient estimates for the linear model fit may then be tested. In this way, many statistics of interest such as generalized Kappa statistics and parameter estimates in logistic models may be estimated. Of course, the logistic models fit by CTWLS use a generalized least‑squares criterion rather than the maximum likelihood criterion used to compute the logistic model estimates in CTGLM. The generalized least‑squares estimates will generally differ somewhat from estimates computed via maximum likelihood.

Other Routines

The routines in Chapter 1, “Basic Statistics” may be used to create the data structures discussed above. These routines can also create one‑dimensional frequency tables, which may then be used by routine CHIGF (Chapter 7, “Tests of Goodness of Fit and Randomness”) to compute chi‑squared goodness‑of‑fit test statistics or with routines VHSTP or HHSTP (see Chapter 16, “Line Printer Graphics”) to prepare histograms. Routines CTRHO, TETCC, BSCAT, and BSPBS (see Chapter 3, “Correlation”) may be used to compute some measures of correlation in two‑way contingency tables.