CTCHI
Performs a chi‑squared analysis of a two‑way contingency table.
Required Arguments
TABLE — NROW by NCOL matrix containing the observed counts in the contingency table. (Input)
EXPECT — (NROW + 1) by (NCOL + 1) matrix containing the expected values of each cell in TABLE, under the null hypothesis, in the first NROW rows and NCOL columns and the marginal totals in the last row and column. (Output)
CHICTR — (NROW +1) by (NCOL +1) matrix containing the contributions to chi‑squared for each cell in TABLE in the first NROW rows and NCOL columns. (Output)
The last row and column contain the total contribution to chi‑squared for that row or column.
CHISQ — Vector of length 10 containing chi‑squared statistics associated with this contingency table. (Output)
I    CHISQ(I)
1    Pearson chi-squared statistic
2    Probability of a larger Pearson chi-squared
3    Degrees of freedom for chi-squared
4    Likelihood ratio G² (chi-squared)
5    Probability of a larger G²
6    Exact mean
7    Exact standard deviation
The following statistics are based upon the chi‑squared statistic CHISQ(1). If ICMPT = 1, NaN (not a number) is reported.
I    CHISQ(I)
8    Phi
9    Contingency coefficient
10   Cramer’s V
STAT — 23 by 5 matrix containing statistics associated with this table. (Output)
If ICMPT = 1, STAT is not referenced and may be a vector of length 1. Each row of the matrix corresponds to a statistic.
Row   Statistic
1     Gamma
2     Kendall’s τb
3     Stuart’s τc
4     Somers’ D for rows given columns
5     Somers’ D for columns given rows
6     Product moment correlation
7     Spearman rank correlation
8     Goodman and Kruskal τ for rows given columns
9     Goodman and Kruskal τ for columns given rows
10    Uncertainty coefficient U (symmetric)
11    Uncertainty Ur|c (rows)
12    Uncertainty Uc|r (columns)
13    Optimal prediction λ (symmetric)
14    Optimal prediction λr|c (rows)
15    Optimal prediction λc|r (columns)
16    Optimal prediction λ*r|c (rows)
17    Optimal prediction λ*c|r (columns)
18    Test for linear trend in row probabilities if NROW = 2. If NROW is not 2, a test for linear trend in column probabilities if NCOL = 2.
19    Kruskal-Wallis test for no row effect
20    Kruskal-Wallis test for no column effect
21    Kappa (square tables only)
22    McNemar test of symmetry (square tables only)
23    McNemar one degree of freedom test of symmetry (square tables only)
If a statistic cannot be computed, its value is reported as NaN (not a number). The columns are as follows:
Column   Statistic
1        The estimated statistic
2        Its standard error for any parameter value
3        Its standard error under the null hypothesis
4        The t value for testing the null hypothesis
5        p-value of the test in column 4
In the McNemar tests, column 1 contains the statistic, column 2 contains the chi‑squared degrees of freedom, column 4 contains the exact p‑value (one degree of freedom only), and column 5 contains the chi‑squared asymptotic p‑value. The Kruskal‑Wallis test is the same except no exact p‑value is computed.
Optional Arguments
NROW — Number of rows in the table. (Input)
Default: NROW = size (TABLE,1).
NCOL — Number of columns in the table. (Input)
Default: NCOL = size (TABLE,2).
LDTABL — Leading dimension of TABLE exactly as specified in the dimension statement of the calling program. (Input)
Default: LDTABL = size (TABLE,1).
ICMPT — Computing option. (Input)
If ICMPT = 0, all of the values in CHISQ and STAT are computed. If ICMPT = 1, only the first 5 values of CHISQ are computed, and none of the values in STAT. (All values not computed are set to NaN, not a number.)
Default: ICMPT = 0.
IPRINT — Printing option. (Input)
IPRINT = 0 means no printing is performed. If IPRINT = 1, printing is performed.
Default: IPRINT = 0.
LDEXPE — Leading dimension of EXPECT exactly as specified in the dimension statement in the calling program. (Input)
Default: LDEXPE = size (EXPECT,1).
LDCHIC — Leading dimension of CHICTR exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCHIC = size (CHICTR,1).
LDSTAT — Leading dimension of STAT exactly as specified in the dimension statement in the calling program. (Input)
Default: LDSTAT = size (STAT,1).
FORTRAN 90 Interface
Generic: CALL CTCHI (TABLE, EXPECT, CHICTR, CHISQ, STAT [, …])
Specific: The specific interface names are S_CTCHI and D_CTCHI.
FORTRAN 77 Interface
Single: CALL CTCHI (NROW, NCOL, TABLE, LDTABL, ICMPT, IPRINT, EXPECT, LDEXPE, CHICTR, LDCHIC, CHISQ, STAT, LDSTAT)
Double: The double precision name is DCTCHI.
Description
Routine CTCHI computes statistics associated with an r × c (NROW × NCOL) contingency table. The routine CTCHI always computes the chi-squared test of independence, expected values, contributions to chi-squared, and row and column marginal totals. Optionally, when ICMPT = 0, CTCHI can compute some measures of association, correlation, prediction, and uncertainty, the McNemar test for symmetry, a test for linear trend, and the Kappa statistic.
Other IMSL routines that may be of interest include TETCC in Chapter 3, for computing the tetrachoric correlation coefficient, CTTWO, for computing statistics in a 2 × 2 contingency table, and CTPRB, for computing the exact probability of an r × c contingency table.
Notation
Let xij denote the observed cell frequency in the ij cell of the table and n denote the total count in the table. Let pij = pi∙ p∙j denote the predicted cell probabilities under the null hypothesis of independence, where pi∙ and p∙j are the row and column marginal relative frequencies, respectively. Next, compute the expected cell counts as eij = n pij.
Also required in the following are auv and buv for u, v = 1, …, n. Let (rs, cs) denote the row and column response of observation s. Then, auv = 1, 0, or −1, depending upon whether ru < rv, ru = rv, or ru > rv, respectively. The buv are similarly defined in terms of the cs’s.
The Chi-squared Statistics
For each cell in the table, the contribution to X² is given as (xij − eij)²/eij. The Pearson chi-squared statistic (denoted X²) is computed as the sum of the cell contributions to chi-squared. It has (r − 1)(c − 1) degrees of freedom and tests the null hypothesis of independence, i.e., that H0 : pij = pi∙ p∙j. The null hypothesis is rejected if the computed value of X² is too large.
G², the maximum likelihood equivalent of X², is computed as

G² = 2 Σi Σj xij ln(xij/eij)

G² is asymptotically equivalent to X² and tests the same hypothesis with the same degrees of freedom.
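Both statistics can be computed directly from a table of counts. The sketch below is illustrative Python (not the IMSL interface), and the function name `chisq_stats` is ours:

```python
import math

def chisq_stats(table):
    """Pearson X^2 and likelihood-ratio G^2 for a two-way table of counts."""
    r, c = len(table), len(table[0])
    n = sum(map(sum, table))
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    x2 = g2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row[i] * col[j] / n           # expected count e_ij = n p_i. p_.j
            x2 += (table[i][j] - e) ** 2 / e  # cell contribution to chi-squared
            if table[i][j] > 0:               # a zero cell contributes nothing to G^2
                g2 += 2 * table[i][j] * math.log(table[i][j] / e)
    df = (r - 1) * (c - 1)
    return x2, g2, df

x2, g2, df = chisq_stats([[10, 20], [30, 40]])
```

For this 2 × 2 table the expected counts are 12, 18, 28, and 42, so X² = 4(1/12 + 1/18 + 1/28 + 1/42) ≈ 0.794 with 1 degree of freedom.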
Measures Related to Chi-squared (Phi, Contingency Coefficient, and Cramer's V)
Three measures related to chi-squared but that do not depend upon the sample size are

phi, ɸ = √(X²/n),

the contingency coefficient, P = √(X²/(X² + n)),

and Cramer’s V, V = √(X²/(n(m − 1))), where m = min(r, c).
Since these statistics do not depend upon sample size and are large when the hypothesis of independence is rejected, they may be thought of as measures of association and may be compared across tables with different sized samples. While both P and V have a range between 0.0 and 1.0, the upper bound of P is actually somewhat less than 1.0 for any given table (see Kendall and Stuart 1979, page 587). The significance of all three statistics is the same as that of the X2 statistic, CHISQ(1).
The distribution of the X2 statistic in finite samples approximates a chi‑squared distribution. To compute the exact mean and standard deviation of the X2 statistic, Haldane (1939) uses the multinomial distribution with fixed table marginals. The exact mean and standard deviation generally differ little from the mean and standard deviation of the associated chi‑squared distribution.
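These three chi-squared-based measures (ɸ = √(X²/n), P = √(X²/(X² + n)), and Cramer’s V with m = min(r, c)) can be sketched as follows; this is illustrative Python, not the IMSL routine, and the function name is ours:

```python
import math

def chisq_measures(x2, n, r, c):
    """Phi, contingency coefficient P, and Cramer's V from a Pearson X^2 value."""
    m1 = min(r - 1, c - 1)            # m - 1, where m = min(r, c)
    phi = math.sqrt(x2 / n)
    P = math.sqrt(x2 / (x2 + n))
    V = math.sqrt(x2 / (n * m1))
    return phi, P, V

# X^2 = 200/252 for the 2 x 2 table [[10, 20], [30, 40]] with n = 100
phi, P, V = chisq_measures(200/252, 100, 2, 2)
```

For a 2 × 2 table m − 1 = 1, so ɸ and V coincide, and P is always strictly smaller than ɸ.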
Standard Errors and p-values For Some Measures of Association
In rows 1 through 7 of STAT, estimated standard errors and asymptotic p‑values are reported. Estimates of the standard errors are computed in two ways. The first estimate, in column 2 of matrix STAT, is asymptotically valid for any value of the statistic. The second estimate, in column 3 of the matrix, is only correct under the null hypothesis of no association. The z‑scores in column 4 of matrix STAT are computed using this second estimate of the standard errors. The p‑values in column 5 are computed from this z‑score. See Brown and Benedetti (1977) for a discussion and formulas for the standard errors in column 3.
Measures of Association for Ranked Rows and Columns
The measures of association, ɸ, P, and V, do not require any ordering of the row and column categories. Routine CTCHI also computes several measures of association for tables in which the row and column categories correspond to ranked observations. Two of these measures, the product-moment correlation and the Spearman correlation, are correlation coefficients computed using assigned scores for the row and column categories. The cell indices are used for the product-moment correlation, while the average of the tied ranks of the row and column marginals is used for the Spearman rank correlation. Other scores are possible.
Gamma, Kendall’s τb, Stuart’s τc, and Somers’ D are measures of association that are computed like a correlation coefficient in the numerator. In all of these measures, the numerator is computed as the “covariance” between the auv’s and buv’s defined above, i.e., as

Σu Σv auv buv

Recall that auv and buv can take values −1, 0, or 1. Since the product auvbuv = 1 only if auv and buv are both 1 or both −1, it is easy to show that this “covariance” is twice the difference between the number of agreements and disagreements, 2(A − D), where a disagreement occurs when auvbuv = −1.
Kendall’s τb is computed as the correlation between the auv’s and the buv’s (see Kendall and Stuart 1979, page 593). In a rectangular table (r ≠ c), Kendall’s τb cannot be 1.0 (if all marginal totals are positive). For this reason, Stuart suggested a modification to the denominator of τb in which the denominator becomes the largest possible value of the “covariance.” This maximizing value is approximately n²(m − 1)/m, where m = min(r, c). Stuart’s τc uses this approximate value in its denominator. For large n, τc ≈ m τb/(m − 1).
Gamma can be motivated in a slightly different manner. Because the “covariance” of the auv’s and the buv’s can be thought of as twice the number of agreements minus the disagreements (2(A − D), where A is the number of agreements and D is the number of disagreements), gamma is motivated as the probability of agreement minus the probability of disagreement, given that either agreement or disagreement occurred. This is just γ = (A − D)/(A + D).
Two definitions of Somers’ D are possible, one for rows and a second for columns. Somers’ D for rows can be thought of as the regression coefficient for predicting auv from buv. Moreover, Somers’ D for rows is the probability of agreement minus the probability of disagreement, given that the column variable, buv, is not zero. Somers’ D for columns is defined in a similar manner.
A discussion of all of the measures of association in this section can be found in Kendall and Stuart (1979, starting on page 592).
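As an illustration of the pair counting behind these measures, the sketch below tallies agreements A and disagreements D directly from a table and forms gamma and Kendall’s τb from the standard formulas. This is illustrative Python, not the IMSL routine, and the function names are ours:

```python
import math

def pair_counts(table):
    """Agreement (A) and disagreement (D) pair counts from a contingency table."""
    r, c = len(table), len(table[0])
    A = D = 0
    for i in range(r):
        for j in range(c):
            # pairs with cell (i, j): agreements lie below and to the right,
            # disagreements below and to the left
            below_right = sum(table[k][l] for k in range(i + 1, r)
                              for l in range(j + 1, c))
            below_left = sum(table[k][l] for k in range(i + 1, r)
                             for l in range(j))
            A += table[i][j] * below_right
            D += table[i][j] * below_left
    return A, D

def gamma_taub(table):
    """Gamma = (A - D)/(A + D) and tau-b = 2(A - D)/sqrt((n^2 - sum r_i^2)(n^2 - sum c_j^2))."""
    A, D = pair_counts(table)
    n = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(len(table))) for j in range(len(table[0]))]
    gamma = (A - D) / (A + D)
    taub = 2 * (A - D) / math.sqrt((n * n - sum(x * x for x in rows)) *
                                   (n * n - sum(x * x for x in cols)))
    return gamma, taub

g, tb = gamma_taub([[10, 5], [5, 10]])
```

For the table [[10, 5], [5, 10]], A = 100 and D = 25, giving γ = 0.6 and τb = 1/3.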
Measures of Prediction and Uncertainty
The Optimal Prediction Coefficients
The measures in this section do not require any ordering of the row or column variables. They are based entirely upon probabilities. Most are discussed in Bishop, Fienberg, and Holland (1975, page 385).
Consider predicting (or classifying) the column for a given row in the table. Under the null hypothesis of independence, one would choose the column with the highest column marginal probability for all rows. In this case, the probability of misclassification for any row is one minus this marginal probability. If independence is not assumed, then within each row one would choose the column with the highest row conditional probability, and the probability of misclassification for the row becomes one minus this conditional probability.
Define the optimal prediction coefficient λc|r for predicting columns from rows as the proportion of the probability of misclassification that is eliminated because the random variables are not independent. It is estimated by

λc|r = (Σi pim − p∙m)/(1 − p∙m)

where m is the index of the maximum estimated probability in the row (pim) or column margin (p∙m). A similar coefficient is defined for predicting the rows from the columns. The symmetric version of the optimal prediction λ is obtained by summing the numerators and denominators of λr|c and λc|r and then dividing. Standard errors for these coefficients are given in Bishop, Fienberg, and Holland (1975, page 388).
A problem with the optimal prediction coefficients λ is that they vary with the marginal probabilities. One way to correct for this is to use row conditional probabilities. The optimal prediction λ* coefficients are defined as the corresponding λ coefficients in which one first adjusts the row (or column) marginals to the same number of observations. This yields

λ*c|r = (Σi maxj pj|i − maxj Σi pj|i)/(r − maxj Σi pj|i)

where i indexes the rows, j indexes the columns, and pj|i is the (estimated) probability of column j given row i. λ*r|c is similarly defined.
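The optimal prediction idea can be made concrete: sum the largest cell in each row, subtract the largest column total, and scale by the misclassification probability under independence. The sketch below is illustrative Python, not the IMSL routine, and the function name is ours:

```python
def lambda_c_given_r(table):
    """Optimal prediction lambda for predicting the column category from the row.

    lambda = (sum_i max_j x_ij - max_j x_.j) / (n - max_j x_.j)
    """
    n = sum(map(sum, table))
    cols = [sum(table[i][j] for i in range(len(table)))
            for j in range(len(table[0]))]
    # probability of correct classification gained by conditioning on the row
    num = sum(max(row) for row in table) - max(cols)
    return num / (n - max(cols))

lam = lambda_c_given_r([[10, 5], [5, 10]])
```

For the table [[10, 5], [5, 10]], the row maxima sum to 20, the largest column total is 15, and n = 30, so λc|r = (20 − 15)/(30 − 15) = 1/3.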
Goodman and Kruskal
A second kind of prediction measure attempts to explain the proportion of the explained variation of the row (column) measure given the column (row) measure. Define the total variation in the rows to be

(n² − Σi xi∙²)/(2n)

Note that this is 1/(2n) times the sum of squares of the auv’s.
With this definition of variation, the Goodman and Kruskal τ coefficient for rows is computed as the reduction of the total variation for rows accounted for by the columns, divided by the total variation for the rows. To compute the reduction in the total variation of the rows accounted for by the columns, note that the total variation for the rows within column j is defined as

qj = (x∙j² − Σi xij²)/(2x∙j)
The total variation for rows within columns is the sum of the qj’s. Consistent with the usual methods in the analysis of variance, the reduction in the total variation is given as the difference between the total variation for rows and the total variation for rows within the columns.
Goodman and Kruskal’s τ for columns is similarly defined. See Bishop, Fienberg, and Holland (1975, page 391) for the standard errors.
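Algebraically, the ratio of the reduction in variation to the total variation simplifies to the standard closed form τ = (Σij xij²/x∙j − Σi xi∙²/n)/(n − Σi xi∙²/n). The sketch below computes it that way; illustrative Python, not the IMSL routine, with our own function name:

```python
def gk_tau_rows(table):
    """Goodman and Kruskal tau for rows given columns.

    tau = (sum_ij x_ij^2 / x_.j - sum_i x_i.^2 / n) / (n - sum_i x_i.^2 / n)
    """
    r, c = len(table), len(table[0])
    n = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # within-column term; empty columns contribute nothing
    within = sum(table[i][j] ** 2 / cols[j]
                 for i in range(r) for j in range(c) if cols[j] > 0)
    marg = sum(x * x for x in rows) / n
    return (within - marg) / (n - marg)

tau = gk_tau_rows([[10, 5], [5, 10]])
```

For the table [[10, 5], [5, 10]], the within term is 250/15 and the marginal term is 15, giving τ = 1/9.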
The Uncertainty Coefficients
The uncertainty coefficient for rows is the increase in the log-likelihood that is achieved by the most general model over the independence model, divided by the marginal log-likelihood for the rows. This is given by

Ur|c = [Σi Σj xij ln(xi∙ x∙j/(n xij))] / [Σi xi∙ ln(xi∙/n)]
The uncertainty coefficient for columns is similarly defined. The symmetric uncertainty coefficient contains the same numerator as Ur|c and Uc|r but averages the denominators of these two statistics. Standard errors for U are given in Brown (1983).
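Equivalently, Ur|c is the mutual information divided by the row entropy, U = (H(row) + H(column) − H(joint))/H(row). The sketch below uses that entropy form; illustrative Python, not the IMSL routine, with our own function name:

```python
import math

def uncertainty_rows(table):
    """Uncertainty coefficient U(rows|columns) via entropies:
    U = (H(row) + H(col) - H(joint)) / H(row)."""
    n = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(len(table)))
            for j in range(len(table[0]))]

    def H(counts):
        # entropy of the empirical distribution; zero counts contribute nothing
        return -sum(x / n * math.log(x / n) for x in counts if x > 0)

    cells = [x for row in table for x in row]
    return (H(rows) + H(cols) - H(cells)) / H(rows)

u = uncertainty_rows([[10, 5], [5, 10]])
```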
Kruskal-Wallis
The Kruskal-Wallis statistic for rows is a one-way analysis-of-variance-type test that assumes the column variable is monotonically ordered. It tests the null hypothesis that the row populations are identical, using average ranks for the column variable. The Kruskal-Wallis statistic for columns is similarly defined. Conover (1980) discusses the Kruskal-Wallis test.
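As a sketch of the idea (assuming midranks for the tied column categories and the standard Kruskal-Wallis tie correction; this is illustrative Python, not necessarily the exact computation CTCHI performs):

```python
def kruskal_wallis_rows(table):
    """Kruskal-Wallis statistic for rows, treating the columns as an ordered
    response, using midranks for ties and the standard tie correction."""
    r, c = len(table), len(table[0])
    n = sum(map(sum, table))
    cols = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # midrank of each (completely tied) column category
    mid, start = [], 0
    for t in cols:
        mid.append(start + (t + 1) / 2)
        start += t
    rows = [sum(row) for row in table]
    rank_sums = [sum(table[i][j] * mid[j] for j in range(c)) for i in range(r)]
    h = (12 / (n * (n + 1))
         * sum(rs * rs / ni for rs, ni in zip(rank_sums, rows))
         - 3 * (n + 1))
    # tie correction: divide by 1 - sum(t^3 - t)/(n^3 - n)
    tie = 1 - sum(t ** 3 - t for t in cols) / (n ** 3 - n)
    return h / tie

h = kruskal_wallis_rows([[10, 5], [5, 10]])
```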
Test for Linear Trend
When there are two rows, it is possible to test for a linear trend in the row probabilities if one assumes that the column variable is monotonically ordered. In this test, the probabilities for row 1 are predicted by the column index using weighted simple linear regression. This slope is given by

β̂ = Σj x∙j (p1|j − p1∙)(j − j̄) / Σj x∙j (j − j̄)²

where j̄ = Σj j x∙j/n is the average column index. An asymptotic test that the slope is zero may then be obtained (in large samples) as the usual regression test of zero slope.
In two‑column data, a similar test for a linear trend in the column probabilities is computed. This test assumes that the rows are monotonically ordered.
Kappa
Kappa is a measure of agreement computed on square tables only. In the Kappa statistic, the rows and columns correspond to the responses of two judges. The judges agree along the diagonal and disagree off the diagonal. Let

po = Σi pii

denote the probability that the two judges agree, and let

pc = Σi pi∙ p∙i

denote the expected probability of agreement under the independence model. Kappa is then given by κ = (po − pc)/(1 − pc).
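Kappa is a short computation over the diagonal and the marginals. The sketch below is illustrative Python, not the IMSL routine, and the function name is ours:

```python
def kappa(table):
    """Kappa agreement measure for a square table of two judges' responses:
    kappa = (po - pc) / (1 - pc)."""
    r = len(table)
    n = sum(map(sum, table))
    # observed agreement: mass on the diagonal
    po = sum(table[i][i] for i in range(r)) / n
    # chance agreement: product of matching row and column marginals
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(r)) for j in range(r)]
    pc = sum(rows[i] * cols[i] for i in range(r)) / (n * n)
    return (po - pc) / (1 - pc)

k = kappa([[10, 5], [5, 10]])
```

For the table [[10, 5], [5, 10]], po = 2/3 and pc = 1/2, so κ = (2/3 − 1/2)/(1/2) = 1/3.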
McNemar Tests
The McNemar test is a test of symmetry in a square contingency table, that is, it is a test of the null hypothesis H0 : θij = θji. The multiple-degrees-of-freedom version of the McNemar test with r(r − 1)/2 degrees of freedom is computed as

Σi<j (xij − xji)²/(xij + xji)
The single-degree-of-freedom test assumes that the differences xij − xji are all in one direction. The single-degree-of-freedom test will be more powerful than the multiple-degrees-of-freedom test when this is the case. The test statistic is given as

[Σi<j (xij − xji)]² / Σi<j (xij + xji)
Its exact probability may be computed via the binomial distribution.
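Both symmetry statistics run over the off-diagonal pairs (i, j) with i < j. The sketch below is illustrative Python, not the IMSL routine, and the function name is ours:

```python
def mcnemar(table):
    """Multiple-df and single-df McNemar symmetry statistics for a square table."""
    r = len(table)
    # multiple-df: sum of (x_ij - x_ji)^2 / (x_ij + x_ji) over pairs i < j
    multi = sum((table[i][j] - table[j][i]) ** 2 / (table[i][j] + table[j][i])
                for i in range(r) for j in range(i + 1, r)
                if table[i][j] + table[j][i] > 0)
    # single-df: (sum of differences)^2 / (sum of pair totals)
    num = sum(table[i][j] - table[j][i] for i in range(r) for j in range(i + 1, r))
    den = sum(table[i][j] + table[j][i] for i in range(r) for j in range(i + 1, r))
    single = num * num / den
    return multi, single

m, s = mcnemar([[5, 3, 1], [1, 5, 3], [1, 1, 5]])
```

For this 3 × 3 table the pair contributions are 1, 0, and 1, so the multiple-df statistic is 2.0, while the single-df statistic is 4²/10 = 1.6.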
Comments
Informational errors
Type   Code   Description
3      1      Twenty percent of the expected values are less than 5.
3      2      The degrees of freedom for chi-squared are greater than 30. The exact mean, standard deviation, and normal distribution function should be used.
3      3      Some expected table values are less than 2. Some asymptotic p-values may not be good.
3      4      Some expected values are less than 1. Some asymptotic p-values may not be good.
Example
The following example is taken from Kendall and Stuart (1979) and involves distance vision in the right and left eyes. It especially illustrates the use of the Kappa and McNemar tests; most other test statistics are also computed.
 
      USE CTCHI_INT

      IMPLICIT NONE
      INTEGER    IPRINT, LDSTAT, NCOL, NROW
      PARAMETER  (IPRINT=1, LDSTAT=23, NCOL=4, NROW=4)
!
      REAL       CHICTR(NROW+1,NCOL+1), CHISQ(10), &
                 EXPECT(NROW+1,NCOL+1), STAT(LDSTAT,5), &
                 TABLE(NROW,NCOL)
!
      DATA TABLE/821, 116, 72, 43, 112, 494, 151, 34, 85, 145, 583, &
                 106, 35, 27, 87, 331/
!
      CALL CTCHI (TABLE, EXPECT, CHICTR, CHISQ, STAT, IPRINT=IPRINT)
      END
Output
 
Table Values
1 2 3 4
1 821.0 112.0 85.0 35.0
2 116.0 494.0 145.0 27.0
3 72.0 151.0 583.0 87.0
4 43.0 34.0 106.0 331.0
 
Expected Values
row totals in column 5, column totals in row 5
1 2 3 4 5
1 341.69 256.92 298.49 155.90 1053.00
2 253.75 190.80 221.67 115.78 782.00
3 289.77 217.88 253.14 132.21 893.00
4 166.79 125.41 145.70 76.10 514.00
5 1052.00 791.00 919.00 480.00 3242.00
 
Contributions to Chi-squared
row totals in column 5, column totals in row 5
1 2 3 4 5
1 672.36 81.74 152.70 93.76 1000.56
2 74.78 481.84 26.52 68.08 651.21
3 163.66 20.53 429.85 15.46 629.50
4 91.87 66.63 10.82 853.78 1023.10
5 1002.68 650.73 619.88 1031.08 3304.37
 
Chi-square Statistics
Pearson 3304.3682
p-value 0.0000
DF 9.0000
G**2 2781.0188
p-value 0.0000
Exact mean 9.0028
Exact std. 4.2402
Phi 1.0096
P 0.7105
Cramer’s V 0.5829
 
Table Statistics
standard std. error t-value
statistic error under Ho testing Ho p-value
Gamma 0.7757 0.0123 0.0149 52.19 0.0000
Tau B 0.6429 0.0122 0.0123 52.19 0.0000
Tau C 0.6293 0.0121 NaN 52.19 0.0000
D-Row 0.6418 0.0122 0.0123 52.19 0.0000
D-Column 0.6439 0.0122 0.0123 52.19 0.0000
Correlation 0.6926 0.0128 0.0172 40.27 0.0000
Spearman 0.6939 0.0127 0.0127 54.66 0.0000
GK tau rows 0.3420 0.0123 NaN NaN NaN
GK tau col. 0.3430 0.0122 NaN NaN NaN
U - Sym. 0.3171 0.0110 NaN NaN NaN
U - rows 0.3178 0.0110 NaN NaN NaN
U - cols. 0.3164 0.0110 NaN NaN NaN
Lambda-sym. 0.5373 0.0124 NaN NaN NaN
Lambda-row 0.5374 0.0126 NaN NaN NaN
Lambda-col. 0.5372 0.0126 NaN NaN NaN
l-star-rows 0.5506 0.0136 NaN NaN NaN
l-star-col. 0.5636 0.0127 NaN NaN NaN
Lin. trend NaN NaN NaN NaN NaN
Kruskal row 1561.4861 3.0000 NaN NaN 0.0000
Kruskal col 1563.0300 3.0000 NaN NaN 0.0000
Kappa 0.5744 0.0111 0.0106 54.36 0.0000
McNemar 4.7625 6.0000 NaN NaN 0.5746
McNemar df=1 0.9487 1.0000 NaN 0.35 0.3301