chiSquaredTest

Performs a chi-squared goodness-of-fit test.

Synopsis

chiSquaredTest (userProcCdf, nCategories, x)

Required Arguments

float userProcCdf (float y) (Input)
User-supplied function that returns the value of the hypothesized cumulative distribution function at the point y.
int nCategories (Input)
The number of cells into which the observations are to be tallied.
float x[] (Input)
Array with nObservations components containing the vector of data elements for this test.

Return Value

The p-value for the goodness-of-fit chi-squared statistic.

Optional Arguments

nParametersEstimated, int (Input)
The number of parameters estimated in computing the cumulative distribution function.
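cutpoints (Output)
The cutpoints array, of length nCategories - 1. As Example 3 shows, user-defined cutpoints can also be supplied through this argument.
cutpointsEqual
Equal probability cutpoints are used. This is the default.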
chiSquared (Output)
If specified, the chi-squared test statistic is returned in chiSquared.
degreesOfFreedom (Output)
If specified, the degrees of freedom for the chi-squared goodness-of-fit test are returned in degreesOfFreedom.
frequencies, float[] (Input)
Array with nObservations components containing the frequencies of the observations stored in x.
bounds, float lowerBound, float upperBound (Input)
If bounds is specified, then lowerBound is the lower bound of the range of the distribution, and upperBound is the upper bound of this range. If lowerBound = upperBound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside the range of these bounds are ignored. Distributions conditional on a range can be specified when bounds is used. By convention, lowerBound is excluded from the first interval, but upperBound is included in the last interval.
cellCounts (Output)
An array containing the cell counts. The cell counts are the observed frequencies in each of the nCategories cells.
cellExpected (Output)
The cell expected values. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
cellChiSquared (Output)
An array of length nCategories containing the cell contributions to chi-squared.

Description

The function chiSquaredTest performs a chi-squared goodness-of-fit test that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which may be continuous, discrete, or a mixture of discrete and continuous distributions, is specified via the user-defined function userProcCdf. Because the user is allowed to give a range for the observations, a test conditional upon the specified range is performed.

Argument nCategories gives the number of intervals into which the observations are to be divided. By default, equiprobable intervals are computed by chiSquaredTest, but intervals that are not equiprobable can be specified (through the use of optional argument cutpoints).

Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used since chiSquaredTest cannot determine the discrete elements in discrete distributions.
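The endpoint convention above (lower endpoint excluded, upper endpoint included) can be sketched in plain Python. This is only an illustration of the convention, not the library's implementation; the helper name tally_with_cutpoints is hypothetical.

```python
from bisect import bisect_left


def tally_with_cutpoints(x, cutpoints):
    """Count observations per cell, where cell j is (c[j-1], c[j]].

    The first cell extends to -infinity and the last to +infinity.
    cutpoints must be sorted in increasing order.
    """
    counts = [0] * (len(cutpoints) + 1)
    for value in x:
        # bisect_left returns the index of the first cutpoint >= value,
        # so a value equal to a cutpoint lands in the cell it closes.
        counts[bisect_left(cutpoints, value)] += 1
    return counts


# 1.5 and 2.5 sit exactly on cutpoints and fall in the lower cells.
counts = tally_with_cutpoints([0.0, 1.5, 1.6, 2.5, 9.9], [1.5, 2.5])
print(counts)  # [2, 2, 1]
```

Placing half-integer cutpoints between the support points of a discrete distribution, as Example 3 does, avoids observations falling exactly on a cutpoint.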

By default, the lower and upper endpoints of the first and last intervals are − ∞ and + ∞, respectively. If bounds is specified, the endpoints are defined by the user via the two arguments lowerBound and upperBound.
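A distribution conditional on a range can be illustrated with the standard definition of a conditional cumulative distribution function on (lowerBound, upperBound]. The sketch below uses only the Python standard library. Whether chiSquaredTest normalizes in exactly this way internally is an assumption, and the helper names normal_cdf and conditional_cdf are hypothetical.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def conditional_cdf(cdf, lower_bound, upper_bound):
    """Return the CDF of the distribution restricted to (lower_bound, upper_bound]."""
    f_lower = cdf(lower_bound)
    f_upper = cdf(upper_bound)
    mass = f_upper - f_lower
    if mass <= 0.0:
        # The range must carry positive probability for the test to make sense.
        raise ValueError("the probability of the range is not positive")

    def restricted(y):
        if y <= lower_bound:
            return 0.0
        if y >= upper_bound:
            return 1.0
        return (cdf(y) - f_lower) / mass

    return restricted


truncated = conditional_cdf(normal_cdf, -1.0, 1.0)
print(truncated(0.0))  # 0.5 by symmetry
```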

A tally of counts is maintained for the observations in x as follows. If the cutpoints are specified by the user, the tally is made in the interval to which \(x_i\) belongs using the endpoints specified by the user. If the cutpoints are determined by chiSquaredTest, then the cumulative probability at \(x_i\), \(F(x_i)\), is computed via the function userProcCdf. The tally for \(x_i\) is made in interval number

\[\lfloor mF\left(x_i\right)+1\rfloor \text{ where } m = \mathrm{nCategories}\]

and \(\lfloor\cdot\rfloor\) is the function that takes the greatest integer that is no larger than its argument. Thus, if the computer time required to calculate the cumulative distribution function is large, user-specified cutpoints may be preferred to reduce the total computing time.
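The interval-number formula can be tried directly in Python. The sketch below assumes a standard normal hypothesis and m = 10; the helper names normal_cdf and cell_number are hypothetical, and the code mirrors the formula rather than the library internals.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def cell_number(x_i, cdf, n_categories):
    """1-based interval number floor(m * F(x_i) + 1) for equiprobable cells."""
    return math.floor(n_categories * cdf(x_i) + 1)


# One CDF evaluation is needed per observation.
cells = [cell_number(x_i, normal_cdf, 10) for x_i in (-2.0, -0.3, 0.1, 1.3)]
print(cells)  # [1, 4, 6, 10]
```

These cell numbers agree with the equiprobable cutpoints printed in Example 2: for instance, -0.3 lies between the cutpoints -0.524 and -0.253, which bound the fourth cell.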

If the expected count in any cell is less than 1, then a rule of thumb is that the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, as well as when an expected value is less than 5.
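The rule of thumb can be checked before running a test by computing the expected count of each cell from the hypothesized distribution. The following is a minimal sketch under the assumption of a standard normal hypothesis, 50 observations, and deliberately poor cutpoints; the helper names are hypothetical.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def expected_counts(cdf, cutpoints, n_observations):
    """Expected count per cell: n times the probability mass between cutpoints."""
    edges = [0.0] + [cdf(c) for c in cutpoints] + [1.0]
    return [n_observations * (hi - lo) for lo, hi in zip(edges, edges[1:])]


expected = expected_counts(normal_cdf, [-3.0, -2.0, 0.0, 2.0, 3.0], 50)

# Cells (0-based) whose expected count falls under each threshold.
below_1 = [j for j, e in enumerate(expected) if e < 1.0]
below_5 = [j for j, e in enumerate(expected) if 1.0 <= e < 5.0]
print(below_1)  # [0, 5]
print(below_5)  # [1, 4]
```

Merging the sparse tail cells into their neighbors, or using fewer categories, raises the expected counts.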

On some platforms, chiSquaredTest can evaluate the user-supplied function userProcCdf in parallel. This is done only if the function ompOptions is called to flag user-defined functions as thread-safe. A function is thread-safe if there are no dependencies between calls. Such dependencies are usually the result of writing to global or static variables.
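The distinction can be illustrated with two versions of the same distribution function. This is only a sketch of the kind of dependency meant above, and both function names are hypothetical.

```python
import math

_last_input = None  # module-level state shared by every call


def cdf_with_shared_state(y):
    """Writes to a global on every call, creating a dependency between calls."""
    global _last_input
    _last_input = y
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def cdf_stateless(y):
    """Depends only on its argument, so calls are independent of one another."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


print(cdf_stateless(0.0))  # 0.5
```

Both functions return the same values; only the second is a candidate for being flagged as thread-safe in the sense described above.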

Programming Notes

The user must supply a function userProcCdf with calling sequence userProcCdf(y), that returns the value of the cumulative distribution function at any point y in the (optionally) specified range. Many of the cumulative distribution functions in Special Functions can be used for userProcCdf, either directly, if the calling sequence is correct, or indirectly, if, for example, the sample means and standard deviations are to be used in computing the theoretical cumulative distribution function.
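The indirect case can be sketched as a small wrapper that fixes the sample mean and standard deviation and exposes the one-argument calling sequence userProcCdf(y). The sketch below uses only the Python standard library in place of a Special Functions routine; the helper name make_fitted_normal_cdf and the sample values are hypothetical.

```python
import math
import statistics


def make_fitted_normal_cdf(sample):
    """Return a one-argument normal CDF using the sample mean and standard deviation."""
    mean = statistics.mean(sample)
    std_dev = statistics.stdev(sample)

    def fitted_cdf(y):
        z = (y - mean) / std_dev
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return fitted_cdf


sample = [9.8, 10.4, 10.1, 9.5, 10.2]  # made-up data for illustration
fitted_cdf = make_fitted_normal_cdf(sample)

# Two parameters were estimated from the data; the optional argument
# nParametersEstimated exists to report that number to the test.
print(fitted_cdf(statistics.mean(sample)))  # 0.5
```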

Examples

Example 1

This example illustrates the use of chiSquaredTest on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equiprobable intervals. The null hypothesis that the sample is from a normal distribution is specified by using normalCdf (see Special Functions) as the hypothesized distribution function. In this example, the null hypothesis is not rejected.

from __future__ import print_function
from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.normalCdf import normalCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomNormal import randomNormal

randomSeedSet(123457)

# Generate Normal deviates
n_observations = 1000
n_categories = 10
x = randomNormal(n_observations)

# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x)

# Print results
print("p value %7.4f" % (p_value))

Output

p value  0.1546

Example 2

This example applies several optional arguments to the data from Example 1.

from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.normalCdf import normalCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomNormal import randomNormal
from pyimsl.math.writeMatrix import writeMatrix

randomSeedSet(123457)

# Generate Normal deviates
n_observations = 1000
n_categories = 10
x = randomNormal(n_observations)

# Perform chi squared test
cutpoints = []
cell_counts = []
cell_chi_squared = []
chi_squared_statistics0 = []
chi_squared_statistics1 = []
chi_squared_statistics2 = chiSquaredTest(normalCdf, n_categories, x,
                                         cutpoints=cutpoints,
                                         cellCounts=cell_counts,
                                         cellChiSquared=cell_chi_squared,
                                         chiSquared=chi_squared_statistics0,
                                         degreesOfFreedom=chi_squared_statistics1)
chi_squared_statistics =\
    [chi_squared_statistics0[0],
     chi_squared_statistics1[0],
     chi_squared_statistics2]
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]

# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels,
            column=True)
writeMatrix("Cut Points", cutpoints)
writeMatrix("Cell Counts", cell_counts)
writeMatrix("Cell Contributions to Chi-Squared",
            cell_chi_squared)

Output

 
 
    Chi Squared Statistics

chi-squared               13.18
degrees of freedom         9.00
p-value                    0.15
 
                                 Cut Points
          1            2            3            4            5            6
     -1.282       -0.842       -0.524       -0.253       -0.000        0.253
 
          7            8            9
      0.524        0.842        1.282
 
                                 Cell Counts
          1            2            3            4            5            6
        106          109           89           92           83           87
 
          7            8            9           10
        110          104          121           99
 
                      Cell Contributions to Chi-Squared
          1            2            3            4            5            6
       0.36         0.81         1.21         0.64         2.89         1.69
 
          7            8            9           10
       1.00         0.16         4.41         0.01
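The statistics printed above can be checked by hand from the cell counts. The sketch below assumes the usual Pearson form, the sum over cells of (observed - expected)^2 / expected, and the usual degrees-of-freedom rule (cells minus one minus the number of estimated parameters); it uses plain Python rather than the library.

```python
# Cell counts copied from the output above.
cell_counts = [106, 109, 89, 92, 83, 87, 110, 104, 121, 99]

n_observations = sum(cell_counts)              # 1000
expected = n_observations / len(cell_counts)   # 100.0 per equiprobable cell

# Per-cell contributions and their total.
contributions = [(c - expected) ** 2 / expected for c in cell_counts]
chi_squared = sum(contributions)

n_parameters_estimated = 0  # no parameters were estimated in this example
degrees_of_freedom = len(cell_counts) - 1 - n_parameters_estimated

print([round(c, 2) for c in contributions])
print(round(chi_squared, 2), degrees_of_freedom)  # 13.18 9
```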

Example 3

In this example, a discrete Poisson random sample of size 1000 with parameter \(\theta=5.0\) is generated via function randomPoisson. In the call to chiSquaredTest, function poissonCdf is used as function userProcCdf.

from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.poissonCdf import poissonCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomPoisson import randomPoisson
from pyimsl.math.writeMatrix import writeMatrix

theta = 5.0
n_numbers = 1000
n_categories = 10


def user_proc_cdf(k):
    cdf_v = poissonCdf(k, theta)
    return cdf_v


cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5,
             7.5, 8.5, 9.5]
cell_row_labels = ["count",
                   "expected count",
                   "cell chi-squared"]
cell_col_labels = ["Poisson value", "0", "1", "2",
                   "3", "4", "5", "6", "7", "8", "9"]
stat_row_labels = ["chi-squared",
                   "degrees of freedom",
                   "p-value"]
randomSeedSet(123457)

# Generate the data
poisson = randomPoisson(n_numbers, theta)

# Copy data to a floating point vector
x = array(poisson, dtype=double)

cell_counts = []
cell_expected = []
cell_chi_squared = []
chi_squared_statistics0 = []
chi_squared_statistics1 = []
chi_squared_statistics2 = chiSquaredTest(user_proc_cdf, n_categories, x,
                                         cutpoints=cutpoints,
                                         cellCounts=cell_counts,
                                         cellExpected=cell_expected,
                                         cellChiSquared=cell_chi_squared,
                                         chiSquared=chi_squared_statistics0,
                                         degreesOfFreedom=chi_squared_statistics1)
chi_squared_statistics = [chi_squared_statistics0[0],
                          chi_squared_statistics1[0],
                          chi_squared_statistics2]
cell_statistics = empty((3, n_categories), dtype=double)

# Print results
writeMatrix("\nChi-squared statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels,
            column=True)

for i in range(0, n_categories):
    cell_statistics[0][i] = cell_counts[i]
    cell_statistics[1][i] = cell_expected[i]
    cell_statistics[2][i] = cell_chi_squared[i]

writeMatrix("\nCell Statistics\n", cell_statistics,
            rowLabels=cell_row_labels,
            colLabels=cell_col_labels)

Output

 
 
    Chi-squared statistics

chi-squared               10.48
degrees of freedom         9.00
p-value                    0.31
 
 
                          Cell Statistics

Poisson value               0            1            2            3
count                    41.0         94.0        138.0        158.0
expected count           40.4         84.2        140.4        175.5
cell chi-squared          0.0          1.1          0.0          1.7
 
Poisson value               4            5            6            7
count                   150.0        159.0        116.0         75.0
expected count          175.5        146.2        104.4         65.3
cell chi-squared          3.7          1.1          1.3          1.4
 
Poisson value               8            9
count                    37.0         32.0
expected count           36.3         31.8
cell chi-squared          0.0          0.0
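The expected counts above can be reproduced from the Poisson probabilities. The sketch below uses only the Python standard library and assumes the cells implied by the cutpoints 1.5, 2.5, ..., 9.5 together with the endpoint convention from the Description; the helper name poisson_pmf is hypothetical.

```python
import math

theta = 5.0
n_observations = 1000
# "count" row copied from the output above.
observed = [41, 94, 138, 158, 150, 159, 116, 75, 37, 32]


def poisson_pmf(k, rate):
    """Probability that a Poisson(rate) variable equals k."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Cells implied by the cutpoints: {0, 1}, {2}, {3}, ..., {9}, {10, 11, ...}.
probs = [poisson_pmf(0, theta) + poisson_pmf(1, theta)]
probs += [poisson_pmf(k, theta) for k in range(2, 10)]
probs.append(1.0 - sum(probs))  # remaining upper-tail mass

expected = [n_observations * p for p in probs]
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(e, 1) for e in expected])
print(round(chi_squared, 2))
```

Under these assumptions the computed values agree with the "expected count" row and with the chi-squared statistic of 10.48. The agreement suggests that the first column pools the values 0 and 1 and the last column collects the values 10 and above, so the column labels 0 through 9 read best as cell positions rather than exact Poisson values.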

Warning Errors

IMSL_EXPECTED_VAL_LESS_THAN_1 An expected value is less than 1.
IMSL_EXPECTED_VAL_LESS_THAN_5 An expected value is less than 5.

Fatal Errors

IMSL_ALL_OBSERVATIONS_MISSING All observations contain missing values.
IMSL_INCORRECT_CDF_1 The function userProcCdf is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than one.
IMSL_INCORRECT_CDF_2 The function userProcCdf is not a cumulative distribution function. The probability of the range of the distribution is not positive.
IMSL_INCORRECT_CDF_3 The function userProcCdf is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
IMSL_INCORRECT_CDF_4 The function userProcCdf is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
IMSL_INCORRECT_CDF_5 An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
IMSL_STOP_USER_FCN Request from user-supplied function to stop algorithm. User flag = “#”.