CSTAT
Computes cell frequencies, cell means, and cell sums of squares for multivariate data.
Required Arguments
X — ∣NROW∣ by NCOL matrix containing the data. (Input)
Each column of X represents either a classification variable, a response variable, a weight, or a frequency.
KMAX — Maximum number of cells. (Input)
This quantity does not have to be exact, but must be at least as large as the actual number of cells, K.
CELIF — Matrix with min(KMAX, K) columns containing cell information.
(Output, if IDO = 0 or 1; input/output, if IDO = 2.)
The number of rows in CELIF depends on the eight cases tabled below.
Case | Conditions | Rows in CELIF |
---|---|---|
1 | MOPT ≤ 0, IFRQ = 0 and IWT = 0 | NCOL + NR + 1 |
2 | MOPT ≤ 0, IFRQ > 0 and IWT = 0 | NCOL + NR |
3 | MOPT ≤ 0, IFRQ = 0 and IWT > 0 | NCOL + NR + 1 |
4 | MOPT ≤ 0, IFRQ > 0 and IWT > 0 | NCOL + NR |
5 | MOPT > 0, IFRQ = 0 and IWT = 0 | NCOL + 2 * NR + 1 |
6 | MOPT > 0, IFRQ > 0 and IWT = 0 | NCOL + 2 * NR |
7 | MOPT > 0, IFRQ = 0 and IWT > 0 | NCOL + 3 * NR |
8 | MOPT > 0, IFRQ > 0 and IWT > 0 | NCOL + 3 * NR ‑ 1 |
Each column contains information on one unique combination of values of the m classification variables that occurs in the data. The first m rows give the values of the classification variables. Row m + 1 gives the number of observations in the cell. (For cases 2, 4, 6, and 8, this is the sum of the frequencies.) For cases 3 and 4, row m + 2 contains the sum of the weights.
For NR greater than zero, the remaining rows (beginning with row m + 3 in cases 3 and 4 and with row m + 2 otherwise) contain information about the response variables:
Cases 1, 2, 3, and 4: 2 ∗ NR remaining rows, giving the cell (weighted) mean and cell (weighted) sum of squares for each of the NR response variables.
Cases 5 and 6: 3 ∗ NR remaining rows, giving the sample size, the mean, and the sum of squares for each of the NR response variables.
Cases 7 and 8: 4 ∗ NR remaining rows, giving the sample size, the sum of weights, the weighted mean, and the weighted sum of squares for each of the NR response variables.
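The row counts in the table follow from the layout just described, so they can be checked with a few lines of arithmetic. The sketch below is only an illustration: `celif_rows` is a hypothetical helper, not an IMSL routine, and it assumes the layout is exactly as stated above.

```fortran
program celif_rows_demo
  implicit none
  ! Example 1 settings: NCOL=4, NR=2, MOPT=0, IFRQ=0, IWT=0 (case 1)
  print '(a,i0)', 'Case 1 rows: ', celif_rows(4, 2, 0, 0, 0)
  ! Example 2 settings: NCOL=5, NR=2, MOPT=1, IFRQ=5, IWT=0 (case 6)
  print '(a,i0)', 'Case 6 rows: ', celif_rows(5, 2, 1, 5, 0)
contains
  ! Hypothetical helper (not part of IMSL): number of rows of CELIF in use.
  pure integer function celif_rows(ncol, nr, mopt, ifrq, iwt) result(nrows)
    integer, intent(in) :: ncol, nr, mopt, ifrq, iwt
    integer :: m
    ! m = number of classification variables
    m = ncol - nr
    if (ifrq > 0) m = m - 1
    if (iwt > 0) m = m - 1
    ! Classification values plus the cell count in row m + 1
    nrows = m + 1
    if (mopt <= 0) then
      if (iwt > 0) nrows = nrows + 1   ! sum of weights in row m + 2 (cases 3, 4)
      nrows = nrows + 2*nr             ! mean and sum of squares per response
    else if (iwt > 0) then
      nrows = nrows + 4*nr             ! cases 7, 8
    else
      nrows = nrows + 3*nr             ! cases 5, 6
    end if
  end function celif_rows
end program celif_rows_demo
```

With these inputs the helper gives 7 and 9 rows, which agree with NCOL + NR + 1 for case 1 and NCOL + 2 * NR for case 6.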
Optional Arguments
IDO — Processing option. (Input)
Default: IDO = 0.
IDO | Action |
---|---|
0 | This is the only invocation of CSTAT for this data set, and all the data are input at once. |
1 | This is the first invocation, and additional calls to CSTAT will be made. Initialization and updating for the data in X are performed. |
2 | This is an intermediate invocation of CSTAT, and updating for the data in X is performed. |
NROW — The absolute value of NROW is the number of rows of data currently input in X. (Input)
Default: NROW = size (X,1).
NROW may be positive or negative. Negative NROW means that the ‑NROW rows of data are to be deleted from some aspects of the analysis, and this should be done only if IDO is 2. When a negative value is input for NROW, it is assumed that each of the ‑NROW rows of X has been input (with positive NROW) in previous invocations of CSTAT.
NCOL — Number of columns in X. (Input)
Default: NCOL = size (X,2).
LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).
NR — Number of response variables. (Input)
NR = 0 means no response variables are input. Otherwise, cell means and sums of squares are computed for the response variables.
Default: NR = 0.
IRX — Vector of length NR. (Input if NR is greater than 0.)
The IRX(1), …, IRX(NR) columns of X contain the response variables for which cell means and sums of squares are computed.
IFRQ — Frequency option. (Input)
IFRQ = 0 means that all frequencies are 1.0. For positive IFRQ, column number IFRQ of X contains the frequencies.
Default: IFRQ = 0.
IWT — Weighting option. (Input)
IWT = 0 means that all weights are 1.0. For positive IWT, column IWT of X contains the weights.
Default: IWT = 0.
MOPT — Missing value option. (Input)
If MOPT is zero, the exclusion is listwise: any row of X that contains a missing value is excluded entirely. If MOPT is positive, the following occurs:
(1) If a classification variable’s value is missing, the entire case is excluded.
(2) If IFRQ > 0 and the frequency variable’s value is missing, the entire case is excluded.
(3) If IWT > 0 and the weight variable’s value is missing, the case is classified and the cell frequency updated, but no information with regard to the response variables is computed.
(4) If only some response variables’ values are missing, all computations are performed except those pertaining to the response variables with missing values.
Default: MOPT = 0.
K — Number of cells or an upper bound for this number. (Input/Output)
On the first call, K must be input as 0. It should not be changed between calls to CSTAT. K is incremented by one for each new cell, up to KMAX cells. Once KMAX cells have been encountered, K is incremented by one for each observation that does not fall into one of the KMAX cells. In this case, K is an upper bound on the number of cells and can be used for KMAX in a subsequent run.
Default: K = 0.
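The counting rule for K can be illustrated without calling CSTAT. The following is a minimal sketch, assuming a single integer-valued classification variable and a deliberately small KMAX; the program and its variable names are hypothetical and only mimic the rule described above.

```fortran
program k_upper_bound_sketch
  implicit none
  integer, parameter :: n = 7, kmax = 2
  ! Seven observations of one classification variable; four distinct values
  integer, parameter :: cval(n) = [1, 2, 1, 3, 4, 3, 4]
  integer :: stored(kmax), i, k

  k = 0
  do i = 1, n
    ! Observation falls into one of the cells already stored: K is unchanged
    if (any(stored(1:min(k, kmax)) == cval(i))) cycle
    ! Store the new cell only while there is room for it
    if (k < kmax) stored(k + 1) = cval(i)
    ! K grows for every new cell, and for every observation that does
    ! not fit once KMAX cells are in use
    k = k + 1
  end do

  print '(a,i0)', 'K on exit         = ', k
  print '(a,i0)', 'columns of CELIF  = ', min(kmax, k)
end program k_upper_bound_sketch
```

Here K ends at 6 while the data contain only 4 distinct cells, so K overestimates the cell count but is still a safe value for KMAX in a rerun.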
LDCELI — Leading dimension of CELIF exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCELI = size (CELIF,1).
FORTRAN 90 Interface
Generic: CALL CSTAT (X, KMAX, CELIF [, …])
Specific: The specific interface names are S_CSTAT and D_CSTAT.
FORTRAN 77 Interface
Single: CALL CSTAT (IDO, NROW, NCOL, X, LDX, NR, IRX, IFRQ, IWT, MOPT, KMAX, K, CELIF, LDCELI)
Double: The double precision name is DCSTAT.
Description
The routine CSTAT computes cell frequencies, cell means, and cell sums of squares for multivariate data in X. The columns of X can contain data for four types of variables: classification variables, a frequency variable, a weight variable, and response variables. The frequency variable, the weight variable, and the response variables are all designated by indicators in IFRQ, IWT, and IRX. All other variables are considered to be classification variables; hence, there are m classification variables, where m = NCOL ‑ NR if there is no weight or frequency variable, m = NCOL ‑ NR ‑ 1 if there is a weight or frequency variable but not both, and m = NCOL ‑ NR ‑ 2 if there are weight and frequency variables.
Each combination of values of the classification variables is stored in the first m rows of CELIF. For each combination of values of the classification variables, the frequencies are stored in the next row of CELIF. Then, for each combination, means and sums of squares for each of the response variables are computed and stored in the remaining rows of CELIF. If a weighting variable is specified, the sum of the weights for each combination is computed and stored. If missing values are deleted elementwise (that is, if MOPT is positive), the frequencies and sums of weights for each response variable are also stored in CELIF, ahead of that variable's mean and sum of squares.
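To make the quantities concrete, the sketch below computes cell frequencies, means, and sums of squares about the cell mean for the classification columns and first response variable of Example 1, without using IMSL. It is a plain two-pass illustration, not the algorithm CSTAT uses internally. It assumes no missing values, unit weights and frequencies, and at most `kmax` cells (there is no overflow guard).

```fortran
program cell_stats_sketch
  implicit none
  integer, parameter :: n = 10, kmax = 4
  ! Example 1 data: classification variables C1, C2 and response R1
  real, parameter :: c1(n) = [1., 1., 1., 1., 1., 2., 2., 2., 2., 2.]
  real, parameter :: c2(n) = [1., 1., 2., 2., 2., 1., 1., 1., 2., 2.]
  real, parameter :: r1(n) = [5.0, 7.0, 4.3, 3.2, 1.7, 3.8, 5.2, 4.9, 6.5, 3.1]
  real :: key1(kmax), key2(kmax), total(kmax), mean(kmax), ss(kmax)
  integer :: freq(kmax), cell(n), i, j, k

  k = 0
  freq = 0
  total = 0.0
  ss = 0.0
  ! Pass 1: assign each observation to a cell, creating cells as they appear
  do i = 1, n
    cell(i) = 0
    do j = 1, k
      if (key1(j) == c1(i) .and. key2(j) == c2(i)) cell(i) = j
    end do
    if (cell(i) == 0) then
      k = k + 1
      key1(k) = c1(i)
      key2(k) = c2(i)
      cell(i) = k
    end if
    freq(cell(i)) = freq(cell(i)) + 1
    total(cell(i)) = total(cell(i)) + r1(i)
  end do
  mean(1:k) = total(1:k) / real(freq(1:k))
  ! Pass 2: sum of squared deviations from the cell mean
  do i = 1, n
    ss(cell(i)) = ss(cell(i)) + (r1(i) - mean(cell(i)))**2
  end do

  print '(a,4f8.2)', 'C1    ', key1(1:k)
  print '(a,4f8.2)', 'C2    ', key2(1:k)
  print '(a,4i8)',   'Freq. ', freq(1:k)
  print '(a,4f8.2)', 'Mean 1', mean(1:k)
  print '(a,4f8.2)', 'SS 1  ', ss(1:k)
end program cell_stats_sketch
```

The frequencies, means, and sums of squares printed by this sketch can be compared with the Freq., Mean 1, and SS 1 rows in the output of Example 1.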
Comments
1. If no nonmissing observations with positive weights or frequencies exist in a cell for a particular response variable, the mean and sum of squares are set to NaN (not a number).
2. In cases 3 and 7, if a zero weight is encountered, there is no contribution to the means or sums of squares, but the sample sizes are incremented by one for that observation.
Examples
Example 1
In this example, there are two classification variables, C1 and C2, and two response variables, R1 and R2. Their values are shown below.
C2 | C1 = 1: R1 | C1 = 1: R2 | C1 = 2: R1 | C1 = 2: R2 |
---|---|---|---|---|
1 | 5.0 7.0 | 3.4 2.6 | 3.8 5.2 4.9 | 2.4 6.3 1.2 |
2 | 4.3 3.2 1.7 | 9.8 7.1 6.3 | 6.5 3.1 | 3.4 5.1 |
USE CSTAT_INT
USE WRRRL_INT
IMPLICIT NONE
INTEGER KMAX, LDCELI, LDX, NR, NCOL
PARAMETER (KMAX=4, LDCELI=15, LDX=10, NR=2, NCOL=4)
!
INTEGER IDO, IFRQ, IRX(NR), IWT, K, MIN0, MOPT, NROW
REAL CELIF(LDCELI,KMAX), X(LDX,NCOL)
CHARACTER CLABEL(1)*6, FMT*7, RLABEL(7)*6
INTRINSIC MIN0
! Get data for example
DATA X/1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, &
1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 4.3, &
3.2, 1.7, 3.8, 5.2, 4.9, 6.5, 3.1, 3.4, 2.6, 9.8, 7.1, 6.3, &
2.4, 6.3, 1.2, 3.4, 5.1/
! All data are input at once
IDO = 0
NROW = 10
K = 0
! No unequal frequencies or weights
! are used
IFRQ = 0
IWT = 0
! Response variables are in 3rd and 4th
! columns
IRX(1) = 3
IRX(2) = 4
! Delete any row containing a missing
! value
MOPT = 0
!
CALL CSTAT (X, KMAX, CELIF, NR=NR, IRX=IRX, K=K)
! Print the results
CLABEL(1) = 'NONE'
RLABEL(1) = ' '
RLABEL(2) = ' '
RLABEL(3) = 'Freq.'
RLABEL(4) = 'Mean 1'
RLABEL(5) = 'SS 1'
RLABEL(6) = 'Mean 2'
RLABEL(7) = 'SS 2'
FMT = '(W10.4)'
CALL WRRRL ('Statistics for the Cells', CELIF, &
RLABEL, CLABEL, NRA=(NCOL+NR+1), &
NCA=MIN0(KMAX, K), FMT=FMT)
END
Output
Statistics for the Cells
1.00 1.00 2.00 2.00
1.00 2.00 1.00 2.00
Freq. 2.00 3.00 3.00 2.00
Mean 1 6.00 3.07 4.63 4.80
SS 1 2.00 3.41 1.09 5.78
Mean 2 3.00 7.73 3.30 4.25
SS 2 0.32 6.73 14.22 1.44
Example 2
This example uses the same data as in the first example, except some of the data are set to missing values. Also, a frequency variable is used. It is in the fifth column of X. The frequency variable indicates that the values of the classification and response variables in the first observation occur 3 times and that all other frequencies are 1. Since MOPT is greater than zero, statistics on one response variable are accumulated even if the other response variable has a missing value. If the frequency variable has a missing value, however, the entire observation is omitted.
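The missing value is NaN (not a number), which can be obtained by calling the routine AMACH (see Reference Material) with an argument of 6. For this example, we set the first response variable in the first observation of the (C1 = 1, C2 = 1) cell to a missing value; we set the second response variable in the first observation of the (2, 1) cell to a missing value; and we set the frequency variable in the first observation of the (1, 2) cell to a missing value. The data are now as shown below, with “NaN” in place of the missing values. The first observation appears three times because its frequency is 3, and the observation whose frequency is missing is shown with NaN for both responses because it is omitted entirely.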
C2 | C1 = 1: R1 | C1 = 1: R2 | C1 = 2: R1 | C1 = 2: R2 |
---|---|---|---|---|
1 | NaN NaN NaN 7.0 | 3.4 3.4 3.4 2.6 | 3.8 5.2 4.9 | NaN 6.3 1.2 |
2 | NaN 3.2 1.7 | NaN 7.1 6.3 | 6.5 3.1 | 3.4 5.1 |
The first two rows output in CELIF are the values of the classification variables, and the third row is the frequencies of the cells, as before. The next three rows correspond to the first response variable, and the last three rows correspond to the second response variable. (This is “case 6” above, where the argument CELIF is described.)
USE CSTAT_INT
USE WRRRN_INT
IMPLICIT NONE
INTEGER KMAX, LDCELI, LDX, NR, NCOL, NROW
PARAMETER (KMAX=4, LDCELI=15, LDX=10, NR=2, NCOL=5)
!
INTEGER IDO, IFRQ, IRX(NR), IWT, K, MIN0, MOPT
REAL CELIF(LDCELI,KMAX), X(LDX,NCOL), AMACH
INTRINSIC MIN0
! Get data for example.
DATA X/1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, &
1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 4.3, &
3.2, 1.7, 3.8, 5.2, 4.9, 6.5, 3.1, 3.4, 2.6, 9.8, 7.1, 6.3, &
2.4, 6.3, 1.2, 3.4, 5.1, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, &
1.0, 1.0, 1.0/
! All data are input at once.
IDO = 0
NROW = 10
K = 0
! Frequencies are in the 5th column.
! All weights are equal
IFRQ = 5
IWT = 0
! Response variables are in 3rd and 4th
! columns.
IRX(1) = 3
IRX(2) = 4
! Set some values to "missing" for
! this example. Specify elementwise
! deletion of missing values of the
! response variables.
MOPT = 1
X(1,3) = AMACH(6)
X(6,4) = AMACH(6)
X(3,5) = AMACH(6)
!
CALL CSTAT (X, KMAX, CELIF, NR=NR, IRX=IRX, MOPT=MOPT, IFRQ=IFRQ, &
K=K)
! Print the results.
CALL WRRRN ('Statistics for the Cells', CELIF, NRA=(NCOL+2*NR), &
NCA=MIN0(KMAX, K))
END
Output
Statistics for the Cells
1 2 3 4
1 1.00 1.00 2.00 2.00
2 1.00 2.00 1.00 2.00
3 4.00 2.00 3.00 2.00
4 1.00 2.00 3.00 2.00
5 7.00 2.45 4.63 4.80
6 0.00 1.12 1.09 5.78
7 4.00 2.00 2.00 2.00
8 3.20 6.70 3.75 4.25
9 0.48 0.32 13.01 1.44
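As a check on rows 7 through 9 of the first column, the statistics for the second response variable in the (1, 1) cell can be worked by hand. The calculation below assumes that a frequency of 3 acts as three replicates of the observation, which is consistent with the printed output.

```latex
\begin{aligned}
n &= 3 + 1 = 4, \\
\bar{y} &= \frac{3(3.4) + 1(2.6)}{4} = \frac{12.8}{4} = 3.20, \\
SS &= 3(3.4 - 3.2)^2 + 1(2.6 - 3.2)^2 = 0.12 + 0.36 = 0.48 .
\end{aligned}
```

These values match the entries 4.00, 3.20, and 0.48 in column 1 of rows 7, 8, and 9.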