chiSquaredTest

Performs a chi-squared goodness-of-fit test.

Synopsis

chiSquaredTest (userProcCdf, nCategories, x)

Required Arguments

float userProcCdf (y) (Input): User-supplied function that returns the value of the hypothesized cumulative distribution function at the point y.
int nCategories (Input): Number of cells into which the observations are to be tallied.
float x[] (Input): Array with nObservations components containing the vector of data elements for this test.

Return Value

The p-value for the goodness-of-fit chi-squared statistic.

Optional Arguments

nParametersEstimated, int (Input)

Number of parameters estimated in computing the cumulative distribution function.

ido, int (Input)

Processing option. The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input during one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of chiSquaredTest; with this option, it is not a requirement that all observations be memory resident, thus enabling one to handle large data sets.

`ido`	Action
0	This is the only invocation; all the data are input at once. (Default)
1	This is the first invocation with this data; additional calls will be made. Initialization and updating for the `nObservations` observations of `x` will be performed.
2	This is an intermediate invocation; updating for the `nObservations` observations of `x` will be performed.
3	This is the final invocation of this function. Updating for the data in `x` and wrap-up computations are performed. Workspace is released. No further invocations of `chiSquaredTest` with ido greater than 1 should be made without first invoking `chiSquaredTest` with `ido` = 1.

Default: ido = 0
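
The calling pattern implied by the table can be sketched as a small driver. This is only an illustration under stated assumptions: `run_in_blocks` and the stand-in `fake_test` are hypothetical names, and the driver assumes that `test_fn` accepts the same positional arguments and `ido` keyword as chiSquaredTest.

```python
def run_in_blocks(test_fn, cdf, n_categories, blocks):
    """Feed blocks of data to test_fn following the ido protocol.

    test_fn is assumed to have the calling sequence of chiSquaredTest:
    test_fn(cdf, n_categories, x, ido=...).
    """
    blocks = list(blocks)
    if not blocks:
        raise ValueError("at least one block of data is required")
    if len(blocks) == 1:
        # A single block needs no sequential processing.
        return test_fn(cdf, n_categories, blocks[0], ido=0)
    result = None
    last = len(blocks) - 1
    for i, block in enumerate(blocks):
        if i == 0:
            ido = 1  # first invocation: initialize and update
        elif i == last:
            ido = 3  # final invocation: update and wrap up
        else:
            ido = 2  # intermediate invocation: update only
        result = test_fn(cdf, n_categories, block, ido=ido)
    return result


# Hypothetical stand-in that only records the ido values it receives.
seen = []


def fake_test(cdf, n_categories, x, ido=0):
    seen.append(ido)
    return None


run_in_blocks(fake_test, None, 10, [[0.1], [0.2], [0.3], [0.4]])
print(seen)  # [1, 2, 2, 3]
```

Only the value returned by the final call (ido = 3) is kept, since that is the call on which the wrap-up computations are performed.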

cutpoints (Output)

An array of length nCategories − 1 containing the vector of cutpoints defining the cell intervals. The intervals defined by the cutpoints are such that the lower endpoint is not included and the upper endpoint is included in any interval. If cutpointsEqual is specified, equal probability cutpoints are computed and returned in cutpoints.

cutpointsEqual

If cutpointsUser is specified, then equal probability cutpoints can still be used if, in addition, the cutpointsEqual option is specified. If cutpointsUser is not specified, equal probability cutpoints are used by default.

chiSquared (Output)

If specified, the chi-squared test statistic is returned in chiSquared.

degreesOfFreedom (Output)

If specified, the degrees of freedom for the chi-squared goodness-of-fit test are returned in degreesOfFreedom.

frequencies, float[] (Input)

Array with nObservations components containing the vector of frequencies for the observations stored in x.

bounds, float lowerBound, float upperBound (Input)

If bounds is specified, then lowerBound is the lower bound of the range of the distribution and upperBound is the upper bound of this range. If lowerBound = upperBound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside the range of these bounds are ignored. Distributions conditional on a range can be specified when bounds is used. By convention, lowerBound is excluded from the first interval, but upperBound is included in the last interval.

cellCounts (Output)

An array of length nCategories containing the cell counts. The cell counts are the observed frequencies in each of the nCategories cells.

cellExpected (Output)

An array of length nCategories containing the cell expected values. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.

cellChiSquared (Output)

An array of length nCategories containing the cell contributions to chi-squared.

Description

Function chiSquaredTest performs a chi-squared goodness-of-fit test that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which can be continuous, discrete, or a mixture of discrete and continuous distributions, is specified by the user-defined function userProcCdf. Because the user is allowed to give a range for the observations, a test that is conditional on the specified range is performed.

Argument nCategories gives the number of intervals into which the observations are to be divided. By default, equiprobable intervals are computed by chiSquaredTest, but intervals that are not equiprobable can be specified through the use of optional argument cutpointsUser.
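
As a rough illustration of what equiprobable intervals mean, the sketch below places cutpoint i at the point where the hypothesized cumulative distribution equals i / nCategories. It assumes a closed-form inverse is available (here the standard normal inverse from Python's `statistics` module); `equiprobable_cutpoints` is a hypothetical helper, not part of the library, and the library's own method of inverting userProcCdf may differ.

```python
from statistics import NormalDist


def equiprobable_cutpoints(inv_cdf, n_categories):
    """Return the n_categories - 1 points c_i with F(c_i) = i / n_categories."""
    return [inv_cdf(i / n_categories) for i in range(1, n_categories)]


# Standard normal, 10 cells: nine cutpoints, symmetric about zero.
cutpoints = equiprobable_cutpoints(NormalDist().inv_cdf, 10)
print(["%.3f" % c for c in cutpoints])
```

For the standard normal with 10 cells, the values agree to three decimals with the cutpoints reported in Example 2.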

Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used since chiSquaredTest cannot determine the discrete elements in discrete distributions.
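
The endpoint convention can be made concrete with a short sketch. It assumes sorted cutpoints and uses 0-based cell numbers; `cell_index` is a hypothetical helper written for illustration, not a library function.

```python
from bisect import bisect_left


def cell_index(cutpoints, value):
    """Return the 0-based cell of value for sorted cutpoints.

    Cell i is the interval (cutpoints[i-1], cutpoints[i]]: the lower
    endpoint is excluded and the upper endpoint is included.
    """
    return bisect_left(cutpoints, value)


cutpoints = [1.5, 2.5, 3.5]
print([cell_index(cutpoints, v) for v in (1.0, 1.5, 2.0, 2.5, 4.0)])
# [0, 0, 1, 1, 3]
```

A value equal to a cutpoint lands in the cell below it, which is why half-integer cutpoints are a convenient choice for integer-valued data.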

By default, the lower and upper endpoints of the first and last intervals are −∞ and +∞, respectively. If bounds is specified, the endpoints are user-defined by the two arguments lowerBound and upperBound.
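
One way to picture a test that is conditional on a range is to rescale the hypothesized distribution to the interval between the bounds. The sketch below shows that standard construction; it is an assumption about the underlying idea, not a statement of how chiSquaredTest is implemented, and `conditional_cdf` is a hypothetical helper.

```python
from statistics import NormalDist


def conditional_cdf(cdf, lower_bound, upper_bound):
    """Return the CDF of the distribution conditioned on (lower_bound, upper_bound]."""
    f_lo = cdf(lower_bound)
    f_hi = cdf(upper_bound)
    mass = f_hi - f_lo
    if mass <= 0.0:
        raise ValueError("the probability of the range must be positive")

    def conditioned(y):
        if y <= lower_bound:
            return 0.0
        if y >= upper_bound:
            return 1.0
        return (cdf(y) - f_lo) / mass

    return conditioned


# Standard normal restricted to (-1, 1].
cond = conditional_cdf(NormalDist().cdf, -1.0, 1.0)
print(cond(-1.0), cond(1.0))  # 0.0 1.0
```

The check on `mass` mirrors the requirement that the probability of the specified range be positive.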

A tally of counts is maintained for the observations in x as follows:

If the cutpoints are specified by the user, the tally is made in the interval to which \(x_i\) belongs, using the user-specified endpoints.
If the cutpoints are determined by chiSquaredTest, then the cumulative probability at \(x_i\), \(F(x_i)\), is computed by the function userProcCdf. The tally for \(x_i\) is made in interval number \(\lfloor mF(x_i)+1\rfloor\), where m = nCategories and \(\lfloor\cdot \rfloor\) denotes the greatest integer that is no larger than its argument.

Because the second method evaluates userProcCdf once for every observation, user-specified cutpoints may be preferred to reduce the total computing time when the cumulative distribution function is expensive to calculate.
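
The interval-number formula can be tried out directly. The following is a minimal sketch that uses the standard normal CDF from Python's `statistics` module in place of userProcCdf; `equiprobable_cell` is a hypothetical helper, and the clamp to the last interval is an added assumption for the edge case \(F(x_i) = 1\).

```python
from math import floor
from statistics import NormalDist


def equiprobable_cell(cdf, n_categories, value):
    """Return the 1-based interval number floor(m * F(value) + 1)."""
    cell = floor(n_categories * cdf(value) + 1)
    # Assumption: clamp so that F(value) == 1 falls in the last interval.
    return min(cell, n_categories)


cdf = NormalDist().cdf
cells = [equiprobable_cell(cdf, 10, v) for v in (-2.0, 0.1, 1.0)]
print(cells)  # [1, 6, 9]
```

Each tally costs one CDF evaluation, whereas a lookup against fixed cutpoints needs none.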

If the expected count in any cell is less than 1, then the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, as well as when an expected value is less than 5.

Programming Notes

Function userProcCdf must be supplied with calling sequence userProcCdf(y), which returns the value of the cumulative distribution function at any point y in the (optionally) specified range. Many of the cumulative distribution functions in Chapter 11, Probability Distribution Functions and Inverses, can be used for userProcCdf, either directly if the calling sequence is correct or indirectly if, for example, the sample means and standard deviations are to be used in computing the theoretical cumulative distribution function.
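
The indirect case can be sketched as a small factory that fixes the estimated parameters and returns a one-argument function with the required calling sequence. The example uses Python's `statistics.NormalDist` instead of a library CDF so that it runs on its own; `make_fitted_normal_cdf` and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def make_fitted_normal_cdf(sample):
    """Return a function y -> F(y) for a normal fitted to sample."""
    fitted = NormalDist(mean(sample), stdev(sample))
    # The bound method takes a single argument y, matching userProcCdf(y).
    return fitted.cdf


sample = [4.1, 5.3, 4.8, 5.9, 5.2]
user_proc_cdf = make_fitted_normal_cdf(sample)
print(user_proc_cdf(mean(sample)))  # 0.5
```

When the mean and standard deviation are estimated from the data in this way, nParametersEstimated would presumably be set to 2 so that the degrees of freedom account for the two estimated parameters.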

Examples

Example 1

This example illustrates the use of chiSquaredTest on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equiprobable intervals. The null hypothesis, that the sample is from a normal distribution, is specified by using normalCdf (Probability Distribution Functions and Inverses) as the hypothesized distribution function. In this example, the null hypothesis is not rejected.

from __future__ import print_function
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal

seed = 123457
n_categories = 10
n_observations = 1000
randomSeedSet(seed)

# Generate normal deviates
x = randomNormal(n_observations)

# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x)

# Print results
print("p_value: %7.4f" % p_value)

Output

p_value:  0.1546

Example 2

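This example repeats the test of Example 1 and uses optional arguments to return the cutpoints, the cell counts, the cell contributions to chi-squared, the chi-squared statistic, and the degrees of freedom.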

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix

seed = 123457
n_categories = 10
n_observations = 1000
chi_squared_statistics = empty((3), dtype='double')
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)

# Generate normal deviates
x = randomNormal(n_observations)

# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cutpoints = []
cell_counts = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(normalCdf, n_categories, x,
                   cutpoints=cutpoints,
                   cellCounts=cell_counts,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]

# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("Cut Points", cutpoints, writeFormat="%10.3f")
writeMatrix("Cell Counts", cell_counts, writeFormat="%5i")
writeMatrix("Cell Contributions to Chi-Squared", cell_chi_squared,
            writeFormat="%10.3f")

Output

 
 
    Chi Squared Statistics

chi-squared               13.18
degrees of freedom         9.00
p-value                    0.15
 
                              Cut Points
         1           2           3           4           5           6
    -1.282      -0.842      -0.524      -0.253      -0.000       0.253
 
         7           8           9
     0.524       0.842       1.282
 
                             Cell Counts
    1      2      3      4      5      6      7      8      9     10
  106    109     89     92     83     87    110    104    121     99
 
                   Cell Contributions to Chi-Squared
         1           2           3           4           5           6
     0.360       0.810       1.210       0.640       2.890       1.690
 
         7           8           9          10
     1.000       0.160       4.410       0.010
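
The printed figures can be cross-checked by hand. The sketch below recomputes the cell contributions and the statistic from the cell counts shown above, using the fact that ten equiprobable cells give each cell the same expected count. The degrees-of-freedom formula is the usual one for this test and is stated here as an assumption rather than taken from the library.

```python
# Cell counts copied from the output above.
cell_counts = [106, 109, 89, 92, 83, 87, 110, 104, 121, 99]
n_categories = len(cell_counts)
n_parameters_estimated = 0

# Equiprobable cells: every cell has the same expected count (100 here).
expected = sum(cell_counts) / n_categories
cell_chi_squared = [(c - expected) ** 2 / expected for c in cell_counts]
chi_squared = sum(cell_chi_squared)

# Assumed formula: cells minus one, minus the number of estimated parameters.
degrees_of_freedom = n_categories - 1 - n_parameters_estimated

print("%.2f" % chi_squared, degrees_of_freedom)  # 13.18 9
```

Cell 9 contributes 4.41 of the total 13.18, so it accounts for about a third of the statistic.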

Example 3

In this example, a discrete Poisson random sample of size 1,000 with parameter \(\theta=5.0\) is generated by function randomPoisson (Chapter 12, Random Number Generation). In the call to chiSquaredTest, function poissonCdf (Chapter 11, Probability Distribution Functions and Inverses) is used as function userProcCdf.

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.poissonCdf import poissonCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomPoisson import randomPoisson
from pyimsl.stat.writeMatrix import writeMatrix


def userProcCdf(k):
    theta = 5.0
    cdf_v = poissonCdf(int(k), theta)
    return cdf_v


seed = 123457
n_categories = 10
n_parameters_estimated = 0
n_numbers = 1000
theta = 5.0
x = empty(n_numbers, dtype='float')
chi_squared_statistics = empty((3), dtype='double')
cell_statistics = empty((3, n_categories), dtype='double')
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
cell_row_labels = ["count", "expected count", "cell chi-squared"]
cell_col_labels = ["Poisson value", "0", "1", "2",
                                    "3", "4", "5", "6", "7", "8", "9"]
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)

# Generate Poisson deviates
poisson = randomPoisson(n_numbers, theta)
for i in range(0, n_numbers):
    x[i] = poisson[i]

# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cell_counts = []
cell_expected = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(userProcCdf, n_categories, x,
                   cutpointsUser=[cutpoints],
                   cellCounts=cell_counts,
                   cellExpected=cell_expected,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
for i in range(0, n_categories):
    cell_statistics[0][i] = cell_counts[i]
    cell_statistics[1][i] = cell_expected[i]
    cell_statistics[2][i] = cell_chi_squared[i]
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]

# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("\nCell Statistics\n", cell_statistics,
            rowLabels=cell_row_labels,
            colLabels=cell_col_labels,
            writeFormat="%9.1f")

Output

 
 
    Chi Squared Statistics

chi-squared               10.48
degrees of freedom         9.00
p-value                    0.31
 
 
                           Cell Statistics

Poisson value             0          1          2          3          4
count                  41.0       94.0      138.0      158.0      150.0
expected count         40.4       84.2      140.4      175.5      175.5
cell chi-squared        0.0        1.1        0.0        1.7        3.7
 
Poisson value             5          6          7          8          9
count                 159.0      116.0       75.0       37.0       32.0
expected count        146.2      104.4       65.3       36.3       31.8
cell chi-squared        1.1        1.3        1.4        0.0        0.0
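
The expected counts in this table follow from differences of the hypothesized CDF at consecutive cutpoints. The sketch below reproduces them with a small pure-Python Poisson CDF; `poisson_cdf` here is a hypothetical stand-in written for illustration, not the library function poissonCdf.

```python
from math import exp, floor


def poisson_cdf(y, theta):
    """Return P(X <= y) for a Poisson variable with mean theta."""
    if y < 0:
        return 0.0
    term = exp(-theta)  # P(X = 0)
    total = term
    for k in range(1, floor(y) + 1):
        term *= theta / k  # P(X = k) from P(X = k - 1)
        total += term
    return total


theta = 5.0
n_numbers = 1000
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]

# CDF at the cell boundaries, with 0 and 1 for the open-ended outer cells.
cumulative = [0.0] + [poisson_cdf(c, theta) for c in cutpoints] + [1.0]
expected = [n_numbers * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]
print(["%.1f" % e for e in expected])
```

The values agree with the expected count row above. Because the first cutpoint is 1.5, the first cell holds the values 0 and 1 together, and the last cell holds every value of 10 or more, so the column labels 0 through 9 number the cells rather than name single Poisson values.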

Example 4

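Using the same seed and sample size as Example 1, this example invokes chiSquaredTest with values of ido greater than 0 to process the data in three blocks of 300, 300, and 400 observations. Optional arguments on the final invocation return the cutpoints, cell counts, cell contributions, test statistic, and degrees of freedom.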

from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix

seed = 123457
n_categories = 10
n_observations = 1000
n_observations_block_1 = 300
n_observations_block_2 = 300
n_observations_block_3 = 400
row_labels = ["chi-squared", "degrees of freedom", "p-value"]

randomSeedSet(seed)

# Generate normal deviates
x1 = randomNormal(n_observations_block_1)
x2 = randomNormal(n_observations_block_2)
x3 = randomNormal(n_observations_block_3)


# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x1,
                         ido=1)
p_value = chiSquaredTest(normalCdf, n_categories, x2,
                         ido=2)
cutpoints = []
chiSquared = []
degreesOfFreedom = []
cellCounts = []
cellChiSquared = []
p_value = chiSquaredTest(normalCdf, n_categories, x3,
                         ido=3,
                         cutpoints=cutpoints,
                         chiSquared=chiSquared,
                         degreesOfFreedom=degreesOfFreedom,
                         cellCounts=cellCounts,
                         cellChiSquared=cellChiSquared)

# Print results
chi_squared_statistics = [chiSquared[0], degreesOfFreedom[0], p_value]
writeMatrix("Chi Squared Statistics", chi_squared_statistics,
            rowLabels=row_labels, transpose=True)
writeMatrix("Cut Points", cutpoints)
writeMatrix("Cell Counts", cellCounts)
writeMatrix("Cell Contributions to Chi-Squared", cellChiSquared)

Output

 
    Chi Squared Statistics
                              1
chi-squared               13.18
degrees of freedom         9.00
p-value                    0.15
 
                                 Cut Points
          1            2            3            4            5            6
     -1.282       -0.842       -0.524       -0.253       -0.000        0.253
 
          7            8            9
      0.524        0.842        1.282
 
                                 Cell Counts
          1            2            3            4            5            6
        106          109           89           92           83           87
 
          7            8            9           10
        110          104          121           99
 
                      Cell Contributions to Chi-Squared
          1            2            3            4            5            6
       0.36         0.81         1.21         0.64         2.89         1.69
 
          7            8            9           10
       1.00         0.16         4.41         0.01

Warning Errors

`IMSLS_EXPECTED_VAL_LESS_THAN_1`	An expected value is less than 1.
`IMSLS_EXPECTED_VAL_LESS_THAN_5`	An expected value is less than 5.
`IMSLS_X_VALUE_OUT_OF_RANGE`	Row x contains a value which is out of range.
`IMSLS_MISSING_DATA_ELEMENT`	At least one data element is missing.

Fatal Errors

`IMSLS_ALL_OBSERVATIONS_MISSING`	All observations contain missing values.
`IMSLS_INCORRECT_CDF_1`	Function `userProcCdf` is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than 1.
`IMSLS_INCORRECT_CDF_2`	Function `userProcCdf` is not a cumulative distribution function. The probability of the range of the distribution is not positive.
`IMSLS_INCORRECT_CDF_3`	Function `userProcCdf` is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
`IMSLS_INCORRECT_CDF_4`	Function `userProcCdf` is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
`IMSLS_INCORRECT_CDF_5`	An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
`IMSLS_TOO_MANY_CELL_DELETIONS`	There are more observations deleted from the cell than added.
`IMSLS_NO_BOUND_AFTER_100_TRYS`	After 100 attempts, a bound for the inverse cannot be determined. Try again with a different initial estimate.
`IMSLS_NO_UNIQUE_INVERSE_EXISTS`	No unique inverse exists.
`IMSLS_CONVERGENCE_ASSUMED`	Over 100 iterations have occurred without convergence. Convergence is assumed.
`IMSLS_BAD_IDO_6`	“`ido`” = #. Initial allocations must be performed by invoking the function with “`ido`” = 1.
`IMSLS_BAD_IDO_7`	“`ido`” = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with “`ido`” = 3.
`IMSLS_BAD_N_CATEGORIES`	“`nCategories`” = #. The number of categories variable, “`nCategories`”, must be the same in separate function calls.
`IMSLS_STOP_USER_FCN`	Request from user supplied function to stop algorithm. User flag = “#”.