chiSquaredTest

Performs a chi-squared goodness-of-fit test.

Synopsis

chiSquaredTest (userProcCdf, nCategories, x)

Required Arguments

float userProcCdf (y) (Input)
User-supplied function that returns the value of the hypothesized cumulative distribution function at the point y.
int nCategories (Input)
Number of cells into which the observations are to be tallied.
float x[] (Input)
Array with nObservations components containing the vector of data elements for this test.

Return Value

The p-value for the goodness-of-fit chi-squared statistic.

Optional Arguments

nParametersEstimated, int (Input)
Number of parameters estimated in computing the cumulative distribution function.
ido, int (Input)
Processing option. The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input during one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of chiSquaredTest; with this option, it is not a requirement that all observations be memory resident, thus enabling one to handle large data sets.

ido Action
0 This is the only invocation; all the data are input at once. (Default)
1 This is the first invocation with this data; additional calls will be made. Initialization and updating for the nObservations observations of x will be performed.
2 This is an intermediate invocation; updating for the nObservations observations of x will be performed.
3 This is the final invocation of this function. Updating for the data in x and wrap-up computations are performed. Workspace is released. No further invocations of chiSquaredTest with ido greater than 1 should be made without first invoking chiSquaredTest with ido = 1.

Default: ido = 0

cutpoints (Output)
An array of length nCategories − 1 containing the vector of cutpoints defining the cell intervals. The intervals defined by the cutpoints are such that the lower endpoint is not included and the upper endpoint is included in any interval. If cutpointsEqual is specified, equal probability cutpoints are computed and returned in cutpoints.
cutpointsEqual
If cutpointsUser is specified, then equal probability cutpoints can still be used if, in addition, the cutpointsEqual option is specified. If cutpointsUser is not specified, equal probability cutpoints are used by default.
chiSquared (Output)
If specified, the chi-squared test statistic is returned in chiSquared.
degreesOfFreedom (Output)
If specified, the degrees of freedom for the chi-squared goodness-of-fit test are returned in degreesOfFreedom.
frequencies, float[] (Input)
Array with nObservations components containing the vector of frequencies for the observations stored in x.
bounds, float lowerBound, float upperBound (Input)
If bounds is specified, then lowerBound is the lower bound of the range of the distribution and upperBound is the upper bound of this range. If lowerBound = upperBound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside the range of these bounds are ignored. Distributions conditional on a range can be specified when bounds is used. By convention, lowerBound is excluded from the first interval, but upperBound is included in the last interval.
cellCounts (Output)
An array of length nCategories containing the cell counts. The cell counts are the observed frequencies in each of the nCategories cells.
cellExpected (Output)
An array of length nCategories containing the cell expected values. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
cellChiSquared (Output)
An array of length nCategories containing the cell contributions to chi-squared.

Description

Function chiSquaredTest performs a chi-squared goodness-of-fit test that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which can be continuous, discrete, or a mixture of discrete and continuous distributions, is specified by the user-defined function userProcCdf. Because the user is allowed to give a range for the observations, a test that is conditional on the specified range is performed.

Argument nCategories gives the number of intervals into which the observations are to be divided. By default, equiprobable intervals are computed by chiSquaredTest, but intervals that are not equiprobable can be specified through the use of optional argument cutpointsUser.
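As an illustration of what equiprobable intervals mean, the sketch below computes the nCategories − 1 cutpoints for a standard normal distribution. It is not the library's implementation: the helper name equiprobable_cutpoints is hypothetical, and it assumes a closed-form inverse CDF is available (here Python's statistics.NormalDist), whereas chiSquaredTest works from userProcCdf alone.

```python
from statistics import NormalDist


def equiprobable_cutpoints(inverse_cdf, n_categories):
    """Return the n_categories - 1 points that split a distribution
    into cells of equal probability 1 / n_categories.

    inverse_cdf is assumed to be the exact inverse of the hypothesized CDF.
    """
    return [inverse_cdf(k / n_categories) for k in range(1, n_categories)]


# Standard normal distribution tallied into 10 cells.
cutpoints = equiprobable_cutpoints(NormalDist().inv_cdf, 10)
print(["%.3f" % c for c in cutpoints])
```

Rounded to three decimals, the first four values are -1.282, -0.842, -0.524, and -0.253, which agree with the Cut Points reported in Example 2.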

Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used since chiSquaredTest cannot determine the discrete elements in discrete distributions.
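The endpoint convention can be pictured with a small stand-alone tally. This is a sketch under assumptions, not library code: the function tally and the sample values are made up for illustration, and bisect_left from the Python standard library is used because it places a value equal to a cutpoint in the cell that ends at that cutpoint.

```python
from bisect import bisect_left


def tally(observations, cutpoints):
    """Count observations per cell, where cell k covers (c[k-1], c[k]]."""
    counts = [0] * (len(cutpoints) + 1)
    for value in observations:
        # bisect_left returns the number of cutpoints strictly below value,
        # so a value equal to a cutpoint lands in the cell ending there.
        counts[bisect_left(cutpoints, value)] += 1
    return counts


# Half-integer cutpoints keep every integer value strictly inside a cell,
# which is why they suit a discrete distribution.
cutpoints = [1.5, 2.5, 3.5]
counts = tally([0, 1, 2, 2, 3, 7], cutpoints)   # [2, 2, 1, 1]
on_boundary = tally([2.5], cutpoints)           # [0, 1, 0, 0]
```

Placing cutpoints between the support points, as Example 3 does for the Poisson distribution, avoids any dependence on which side of a boundary an observation falls.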

By default, the lower and upper endpoints of the first and last intervals are −∞ and +∞, respectively. If bounds is specified, the endpoints are user-defined by the two arguments lowerBound and upperBound.
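A test that is conditional on a range can be understood through the usual truncation formula, in which the hypothesized CDF is rescaled to the interval between the bounds. The sketch below applies that formula; whether chiSquaredTest computes it in exactly this way is an assumption, and conditional_cdf is a hypothetical helper name.

```python
from statistics import NormalDist


def conditional_cdf(cdf, lower_bound, upper_bound):
    """Return the CDF of the distribution restricted to (lower, upper]."""
    f_lower = cdf(lower_bound)
    f_upper = cdf(upper_bound)
    mass = f_upper - f_lower
    if mass <= 0.0:
        # Mirrors the condition described for IMSLS_INCORRECT_CDF_2.
        raise ValueError("the probability of the range must be positive")

    def restricted(y):
        # Standard truncation: rescale so the range carries probability 1.
        return (cdf(y) - f_lower) / mass

    return restricted


# Standard normal distribution conditioned on the range (-1, 1].
restricted = conditional_cdf(NormalDist().cdf, -1.0, 1.0)
```

With this rescaling the restricted CDF is 0 at lowerBound and 1 at upperBound, so equiprobable cells can be defined on the range in the same way as on the whole real line.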

A tally of counts is maintained for the observations in x as follows:

  • If the cutpoints are specified by the user, the tally is made in the interval to which \(x_i\) belongs, using the user-specified endpoints.
  • If the cutpoints are determined by chiSquaredTest, then the cumulative probability at \(x_i\), \(F(x_i)\), is computed by the function userProcCdf. The tally for \(x_i\) is made in interval number \(\lfloor mF(x_i)+1\rfloor\), where m = nCategories and \(\lfloor\cdot \rfloor\) denotes the greatest integer that is no larger than its argument.

Because the second case evaluates userProcCdf for every observation, user-specified cutpoints may be preferred to reduce the total computing time when the cumulative distribution function is expensive to calculate.
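The interval-number formula can be tried out directly. The following sketch uses a standard normal CDF built from math.erf as a stand-in for userProcCdf; the function names are hypothetical, and the clamp to the last cell when \(F(x_i) = 1\) is an assumption made for the sketch rather than documented library behavior.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, standing in for userProcCdf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def interval_number(cdf, n_categories, x_i):
    """Return the 1-based cell index floor(m * F(x_i) + 1)."""
    index = math.floor(n_categories * cdf(x_i) + 1)
    # Assumption: clamp so that F(x_i) == 1 still maps to the last cell.
    return min(index, n_categories)


cells = [interval_number(normal_cdf, 10, x) for x in (-2.0, 0.1, 3.0)]
# cells == [1, 6, 10]
```

Each call costs one CDF evaluation, which is the per-observation expense that user-specified cutpoints avoid.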

If the expected count in any cell is less than 1, then the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, as well as when an expected value is less than 5.
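For reference, the sketch below computes the Pearson chi-squared statistic from cell counts and expected counts, and flags small expected values. It assumes the usual degrees-of-freedom rule, nCategories − 1 − nParametersEstimated, and reports only the more severe of the two conditions; both choices are assumptions about the library, and chi_squared_summary is a hypothetical name. The counts are the Cell Counts printed in Example 2.

```python
def chi_squared_summary(cell_counts, cell_expected, n_parameters_estimated=0):
    """Return (statistic, degrees of freedom, contributions, warnings)."""
    contributions = [
        (observed - expected) ** 2 / expected
        for observed, expected in zip(cell_counts, cell_expected)
    ]
    statistic = sum(contributions)
    # Assumed rule: cells minus one, minus the number of estimated parameters.
    dof = len(cell_counts) - 1 - n_parameters_estimated
    warnings = []
    if any(expected < 1 for expected in cell_expected):
        warnings.append("an expected value is less than 1")
    elif any(expected < 5 for expected in cell_expected):
        warnings.append("an expected value is less than 5")
    return statistic, dof, contributions, warnings


# Cell counts from the Example 2 output; 1000 observations in 10
# equiprobable cells give an expected count of 100 per cell.
counts = [106, 109, 89, 92, 83, 87, 110, 104, 121, 99]
expected = [100.0] * 10
statistic, dof, contributions, warnings = chi_squared_summary(counts, expected)
```

For these counts the statistic is 13.18 on 9 degrees of freedom, matching the values reported in Example 2, and no warning is raised because every expected count is 100.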

Programming Notes

Function userProcCdf must be supplied with calling sequence userProcCdf(y), which returns the value of the cumulative distribution function at any point y in the (optionally) specified range. Many of the cumulative distribution functions in Chapter 11, Probability Distribution Functions and Inverses, can be used for userProcCdf, either directly if the calling sequence is correct or indirectly if, for example, the sample means and standard deviations are to be used in computing the theoretical cumulative distribution function.
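The indirect case can be handled with a small wrapper that fixes the estimated parameters and exposes the one-argument calling sequence. The sketch below uses statistics.NormalDist from the Python standard library instead of a PyIMSL distribution function so that it runs on its own; make_normal_cdf and the sample values are made up for illustration.

```python
from statistics import NormalDist, fmean, stdev


def make_normal_cdf(sample):
    """Build a one-argument CDF from the sample mean and standard deviation."""
    mean = fmean(sample)
    sd = stdev(sample)
    # The bound method has the calling sequence userProcCdf(y).
    return NormalDist(mean, sd).cdf


# Illustrative values only, not data from this document.
sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]
user_proc_cdf = make_normal_cdf(sample)

# Hypothetical call, not executed here. Two parameters were estimated
# from the data, so nParametersEstimated would be set to 2:
# p_value = chiSquaredTest(user_proc_cdf, n_categories, sample,
#                          nParametersEstimated=2)
```

When parameters are estimated from the same data that is being tested, passing their number through nParametersEstimated lets the degrees of freedom account for the estimation.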

Examples

Example 1

This example illustrates the use of chiSquaredTest on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equiprobable intervals. The null hypothesis, that the sample is from a normal distribution, is specified by using normalCdf (Chapter 11, Probability Distribution Functions and Inverses) as the hypothesized distribution function. In this example, the null hypothesis is not rejected.

from __future__ import print_function
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal

seed = 123457
n_categories = 10
n_observations = 1000
randomSeedSet(seed)

# Generate normal deviates
x = randomNormal(n_observations)

# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x)

# Print results
print("p_value: %7.4f" % p_value)

Output

p_value:  0.1546

Example 2

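This example repeats the test on the data from Example 1 and uses optional arguments to return the cutpoints, the cell counts, the cell contributions to chi-squared, the chi-squared statistic, and the degrees of freedom.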

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix

seed = 123457
n_categories = 10
n_observations = 1000
chi_squared_statistics = empty((3), dtype='double')
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)

# Generate normal deviates
x = randomNormal(n_observations)

# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cutpoints = []
cell_counts = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(normalCdf, n_categories, x,
                   cutpoints=cutpoints,
                   cellCounts=cell_counts,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]

# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("Cut Points", cutpoints, writeFormat="%10.3f")
writeMatrix("Cell Counts", cell_counts, writeFormat="%5i")
writeMatrix("Cell Contributions to Chi-Squared", cell_chi_squared,
            writeFormat="%10.3f")

Output

 
 
    Chi Squared Statistics

chi-squared               13.18
degrees of freedom         9.00
p-value                    0.15
 
                              Cut Points
         1           2           3           4           5           6
    -1.282      -0.842      -0.524      -0.253      -0.000       0.253
 
         7           8           9
     0.524       0.842       1.282
 
                             Cell Counts
    1      2      3      4      5      6      7      8      9     10
  106    109     89     92     83     87    110    104    121     99
 
                   Cell Contributions to Chi-Squared
         1           2           3           4           5           6
     0.360       0.810       1.210       0.640       2.890       1.690
 
         7           8           9          10
     1.000       0.160       4.410       0.010

Example 3

In this example, a discrete Poisson random sample of size 1,000 with parameter \(\theta=5.0\) is generated by function randomPoisson (Chapter 12, Random Number Generation). In the call to chiSquaredTest, function poissonCdf (Chapter 11, Probability Distribution Functions and Inverses) is used as function userProcCdf.

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.poissonCdf import poissonCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomPoisson import randomPoisson
from pyimsl.stat.writeMatrix import writeMatrix


def userProcCdf(k):
    theta = 5.0
    cdf_v = poissonCdf(int(k), theta)
    return cdf_v


seed = 123457
n_categories = 10
n_parameters_estimated = 0
n_numbers = 1000
theta = 5.0
x = empty(n_numbers, dtype='float')
chi_squared_statistics = empty((3), dtype='double')
cell_statistics = empty((3, n_categories), dtype='double')
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
cell_row_labels = ["count", "expected count", "cell chi-squared"]
cell_col_labels = ["Poisson value", "0", "1", "2",
                                    "3", "4", "5", "6", "7", "8", "9"]
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)

# Generate Poisson deviates
poisson = randomPoisson(n_numbers, theta)
for i in range(0, n_numbers):
    x[i] = poisson[i]

# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cell_counts = []
cell_expected = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(userProcCdf, n_categories, x,
                   cutpointsUser=[cutpoints],
                   cellCounts=cell_counts,
                   cellExpected=cell_expected,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
for i in range(0, n_categories):
    cell_statistics[0][i] = cell_counts[i]
    cell_statistics[1][i] = cell_expected[i]
    cell_statistics[2][i] = cell_chi_squared[i]
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]

# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("\nCell Statistics\n", cell_statistics,
            rowLabels=cell_row_labels,
            colLabels=cell_col_labels,
            writeFormat="%9.1f")

Output

 
 
    Chi Squared Statistics

chi-squared               10.48
degrees of freedom         9.00
p-value                    0.31
 
 
                           Cell Statistics

Poisson value             0          1          2          3          4
count                  41.0       94.0      138.0      158.0      150.0
expected count         40.4       84.2      140.4      175.5      175.5
cell chi-squared        0.0        1.1        0.0        1.7        3.7
 
Poisson value             5          6          7          8          9
count                 159.0      116.0       75.0       37.0       32.0
expected count        146.2      104.4       65.3       36.3       31.8
cell chi-squared        1.1        1.3        1.4        0.0        0.0

Example 4

Continuing with Example 1 data, the example below invokes the chiSquaredTest function using values of ido greater than 0. Also, optional arguments are used for the data.

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix

seed = 123457
n_categories = 10
n_observations = 1000
n_observations_block_1 = 300
n_observations_block_2 = 300
n_observations_block_3 = 400
row_labels = ["chi-squared", "degrees of freedom", "p-value"]

randomSeedSet(seed)

# Generate normal deviates
x1 = randomNormal(n_observations_block_1)
x2 = randomNormal(n_observations_block_2)
x3 = randomNormal(n_observations_block_3)


# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x1,
                         ido=1)
p_value = chiSquaredTest(normalCdf, n_categories, x2,
                         ido=2)
cutpoints = []
chiSquared = []
degreesOfFreedom = []
cellCounts = []
cellChiSquared = []
p_value = chiSquaredTest(normalCdf, n_categories, x3,
                         ido=3,
                         cutpoints=cutpoints,
                         chiSquared=chiSquared,
                         degreesOfFreedom=degreesOfFreedom,
                         cellCounts=cellCounts,
                         cellChiSquared=cellChiSquared)

# Print results
chi_squared_statistics = [chiSquared[0], degreesOfFreedom[0], p_value]
writeMatrix("Chi Squared Statistics", chi_squared_statistics,
            rowLabels=row_labels, transpose=True)
writeMatrix("Cut Points", cutpoints)
writeMatrix("Cell Counts", cellCounts)
writeMatrix("Cell Contributions to Chi-Squared", cellChiSquared)

Output

 
    Chi Squared Statistics
                              1
chi-squared               13.18
degrees of freedom         9.00
p-value                    0.15
 
                                 Cut Points
          1            2            3            4            5            6
     -1.282       -0.842       -0.524       -0.253       -0.000        0.253
 
          7            8            9
      0.524        0.842        1.282
 
                                 Cell Counts
          1            2            3            4            5            6
        106          109           89           92           83           87
 
          7            8            9           10
        110          104          121           99
 
                      Cell Contributions to Chi-Squared
          1            2            3            4            5            6
       0.36         0.81         1.21         0.64         2.89         1.69
 
          7            8            9           10
       1.00         0.16         4.41         0.01
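The block-wise processing in Example 4 yields the same results as the single call in Example 2 because the cell tally is additive over blocks. The sketch below shows that property with a stand-alone accumulator; BlockTally is a hypothetical class, the data values are made up, and the mapping of its steps onto ido = 1, 2, and 3 is only an analogy to the library's workflow.

```python
from bisect import bisect_left


class BlockTally:
    """Accumulate cell counts across several blocks of observations."""

    def __init__(self, cutpoints):
        # Comparable to the initialization done when ido = 1.
        self.cutpoints = list(cutpoints)
        self.counts = [0] * (len(self.cutpoints) + 1)

    def update(self, block):
        # Comparable to the updating step performed on every invocation.
        for value in block:
            self.counts[bisect_left(self.cutpoints, value)] += 1
        return self


cutpoints = [-0.5, 0.5]
data = [-1.2, -0.5, 0.0, 0.3, 0.5, 0.9, 1.7, -0.1, 2.4, -2.0]
blocks = [data[:3], data[3:6], data[6:]]

one_shot = BlockTally(cutpoints).update(data).counts
streamed = BlockTally(cutpoints)
for block in blocks:
    streamed.update(block)
```

Both approaches give the counts [3, 4, 3], so the statistics computed from them at wrap-up are identical while only one block needs to be in memory at a time.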

Warning Errors

IMSLS_EXPECTED_VAL_LESS_THAN_1 An expected value is less than 1.
IMSLS_EXPECTED_VAL_LESS_THAN_5 An expected value is less than 5.
IMSLS_X_VALUE_OUT_OF_RANGE Row x contains a value which is out of range.
IMSLS_MISSING_DATA_ELEMENT At least one data element is missing.

Fatal Errors

IMSLS_ALL_OBSERVATIONS_MISSING All observations contain missing values.
IMSLS_INCORRECT_CDF_1 Function userProcCdf is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than 1.
IMSLS_INCORRECT_CDF_2 Function userProcCdf is not a cumulative distribution function. The probability of the range of the distribution is not positive.
IMSLS_INCORRECT_CDF_3 Function userProcCdf is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
IMSLS_INCORRECT_CDF_4 Function userProcCdf is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
IMSLS_INCORRECT_CDF_5 An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
IMSLS_TOO_MANY_CELL_DELETIONS There are more observations deleted from the cell than added.
IMSLS_NO_BOUND_AFTER_100_TRYS After 100 attempts, a bound for the inverse cannot be determined. Try again with a different initial estimate.
IMSLS_NO_UNIQUE_INVERSE_EXISTS No unique inverse exists.
IMSLS_CONVERGENCE_ASSUMED Over 100 iterations have occurred without convergence. Convergence is assumed.
IMSLS_BAD_IDO_6 “ido” = #. Initial allocations must be performed by invoking the function with “ido” = 1.
IMSLS_BAD_IDO_7 “ido” = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with “ido” = 3.
IMSLS_BAD_N_CATEGORIES “nCategories” = #. The number of categories variable, “nCategories”, must be the same in separate function calls.
IMSLS_STOP_USER_FCN Request from user supplied function to stop algorithm. User flag = “#”.