chiSquaredTest¶
Performs a chi-squared goodness-of-fit test.
Synopsis¶
chiSquaredTest (userProcCdf, nCategories, x)
Required Arguments¶
- float userProcCdf(float y) (Input) - User-supplied function that returns the value of the hypothesized cumulative distribution function at the point y.
- int nCategories (Input) - The number of cells into which the observations are to be tallied.
- float x[] (Input) - Array with nObservations components containing the data elements for this test.
Return Value¶
The p-value for the goodness-of-fit chi-squared statistic.
Optional Arguments¶
- nParametersEstimated, int (Input) - The number of parameters estimated in computing the cumulative distribution function.
- cutpoints (Output) - The cutpoints array.
- cutpointsEqual - Equal probability cutpoints.
- chiSquared (Output) - If specified, the chi-squared test statistic is returned in chiSquared.
- degreesOfFreedom (Output) - If specified, the degrees of freedom for the chi-squared goodness-of-fit test are returned in degreesOfFreedom.
- frequencies, float[] (Input) - Array with nObservations components containing the frequencies of the observations stored in x.
- bounds, float lowerBound, float upperBound (Input) - If bounds is specified, then lowerBound is the lower bound of the range of the distribution, and upperBound is the upper bound of this range. If lowerBound = upperBound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside the range of these bounds are ignored. Distributions conditional on a range can be specified when bounds is used. By convention, lowerBound is excluded from the first interval, but upperBound is included in the last interval.
- cellCounts (Output) - An array containing the cell counts. The cell counts are the observed frequencies in each of the nCategories cells.
- cellExpected (Output) - The cell expected values. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
- cellChiSquared (Output) - An array of length nCategories containing the cell contributions to chi-squared.
Description¶
The function chiSquaredTest performs a chi-squared goodness-of-fit test that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which may be continuous, discrete, or a mixture of discrete and continuous distributions, is specified via the user-defined function userProcCdf. Because the user is allowed to give a range for the observations, a test conditional upon the specified range is performed.
Argument nCategories gives the number of intervals into which the observations are to be divided. By default, equiprobable intervals are computed by chiSquaredTest, but intervals that are not equiprobable can be specified through the optional argument cutpoints.
Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used, since chiSquaredTest cannot determine the discrete elements in discrete distributions.
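The endpoint convention can be illustrated with a small standalone sketch. It does not call chiSquaredTest; the helper name tally_with_cutpoints and the sample data are hypothetical, and the code only mimics the "lower endpoint excluded, upper endpoint included" rule described above.

```python
from bisect import bisect_left


def tally_with_cutpoints(x, cutpoints):
    """Count observations per cell, where cell j covers (c[j-1], c[j]].

    Hypothetical helper for illustration only; cutpoints must be sorted.
    """
    counts = [0] * (len(cutpoints) + 1)
    for value in x:
        # bisect_left returns the index of the first cutpoint >= value,
        # so a value equal to a cutpoint lands in the cell that ends there.
        counts[bisect_left(cutpoints, value)] += 1
    return counts


cutpoints = [1.5, 2.5, 3.5]
sample = [0, 1, 1.5, 2, 2.5, 3, 4, 7]
counts = tally_with_cutpoints(sample, cutpoints)
print(counts)  # [3, 2, 1, 2]
```

Placing cutpoints at half-integers, as Example 3 does for Poisson data, keeps every integer value strictly inside a cell, so the endpoint rule never decides where a discrete observation is tallied.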
By default, the lower and upper endpoints of the first and last intervals are \(-\infty\) and \(+\infty\), respectively. If bounds is specified, the endpoints are defined by the user via the two arguments lowerBound and upperBound.
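As a rough illustration of what a test "conditional on a range" means, the sketch below computes cell probabilities for a distribution restricted to (lowerBound, upperBound] by renormalizing with the probability of the range. This is the textbook conditional distribution, not necessarily the library's internal computation; the names normal_cdf, conditional_cell_probs, and in_range are hypothetical.

```python
import math


def normal_cdf(y):
    """Standard normal CDF via the error function (stand-in for userProcCdf)."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def conditional_cell_probs(cdf, lower, upper, cutpoints):
    """Cell probabilities given that the observation lies in (lower, upper]."""
    edges = [lower] + list(cutpoints) + [upper]
    total = cdf(upper) - cdf(lower)  # probability of the whole range
    return [(cdf(b) - cdf(a)) / total for a, b in zip(edges, edges[1:])]


def in_range(x, lower, upper):
    """Keep only points inside the range: lower excluded, upper included."""
    return [v for v in x if lower < v <= upper]


probs = conditional_cell_probs(normal_cdf, -2.0, 2.0, [-1.0, 0.0, 1.0])
kept = in_range([-3.0, -2.0, 0.5, 2.0, 2.5], -2.0, 2.0)
print(probs)
print(kept)  # [0.5, 2.0]
```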
A tally of counts is maintained for the observations in x as follows. If the cutpoints are specified by the user, the tally is made in the interval to which \(x_i\) belongs using the endpoints specified by the user. If the cutpoints are determined by chiSquaredTest, then the cumulative probability at \(x_i\), \(F(x_i)\), is computed via the function userProcCdf. The tally for \(x_i\) is made in interval number \(\lfloor m F(x_i) + 1 \rfloor\), where \(m\) = nCategories and \(\lfloor \cdot \rfloor\) is the function that takes the greatest integer that is no larger than the argument of the function. Thus, if the computer time required to calculate the cumulative distribution function is large, user-specified cutpoints may be preferred to reduce the total computing time.
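The equiprobable tally rule, interval number equal to the greatest integer not exceeding nCategories times \(F(x_i)\) plus 1, can be sketched in a few lines. The helper names are hypothetical, the normal CDF is a stand-in for userProcCdf, and the clamp to nCategories when \(F(x_i) = 1\) is an assumption of this sketch rather than documented behavior.

```python
import math


def normal_cdf(y):
    """Standard normal CDF via the error function (stand-in for userProcCdf)."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def equiprobable_cell(cdf, value, n_categories):
    """1-based cell number floor(m * F(x) + 1) for m equiprobable cells."""
    cell = math.floor(n_categories * cdf(value) + 1)
    # Assumption: if F(x) is exactly 1, keep the point in the last cell.
    return min(cell, n_categories)


cells = [equiprobable_cell(normal_cdf, v, 10) for v in (-2.0, -0.1, 0.1, 2.0)]
print(cells)  # [1, 5, 6, 10]
```

Each observation costs one evaluation of the CDF under this rule, which is why user-specified cutpoints can be cheaper when the CDF is expensive to compute.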
If the expected count in any cell is less than 1, then a rule of thumb is that the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, as well as when an expected value is less than 5.
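The cell contributions and the expected-count rule of thumb can be reproduced by hand. The sketch below uses the cell counts printed in Example 2 with an expected count of 100 per cell (1000 observations in 10 equiprobable cells); the helper names are hypothetical, and the contribution formula \((O - E)^2 / E\) is the standard Pearson form, assumed here to match what the library reports.

```python
def cell_contributions(counts, expected):
    """Pearson contribution (observed - expected)^2 / expected per cell."""
    return [(o - e) ** 2 / e for o, e in zip(counts, expected)]


def small_expected_cells(expected, threshold):
    """1-based numbers of the cells whose expected count is below threshold."""
    return [j + 1 for j, e in enumerate(expected) if e < threshold]


counts = [106, 109, 89, 92, 83, 87, 110, 104, 121, 99]
expected = [sum(counts) / len(counts)] * len(counts)  # 100.0 per cell
contributions = cell_contributions(counts, expected)
chi_squared = sum(contributions)

print(round(chi_squared, 2))               # 13.18
print(small_expected_cells(expected, 5))   # []
print(small_expected_cells([40.4, 4.2, 0.7], 5))  # [2, 3]
print(small_expected_cells([40.4, 4.2, 0.7], 1))  # [3]
```

The total of 13.18 agrees with the chi-squared statistic shown in the Example 2 output.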
On some platforms, chiSquaredTest can evaluate the user-supplied function userProcCdf in parallel. This is done only if the function ompOptions is called to flag user-defined functions as thread-safe. A function is thread-safe if there are no dependencies between calls. Such dependencies are usually the result of writing to global or static variables.
Programming Notes¶
The user must supply a function userProcCdf with calling sequence userProcCdf(y) that returns the value of the cumulative distribution function at any point y in the (optionally) specified range. Many of the cumulative distribution functions in Special Functions can be used for userProcCdf, either directly, if the calling sequence is correct, or indirectly, if, for example, the sample means and standard deviations are to be used in computing the theoretical cumulative distribution function.
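The "indirect" case might look like the following sketch, which builds a one-argument CDF from a sample mean and standard deviation. The factory make_fitted_normal_cdf and the sample values are hypothetical, and the normal CDF is written with math.erf so that the sketch runs without the library; in practice a function from Special Functions could be wrapped the same way.

```python
import math
import statistics


def make_fitted_normal_cdf(sample):
    """Return cdf(y) for a normal distribution fitted to the sample."""
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)

    def cdf(y):
        # Standardize y with the estimated parameters, then apply the
        # standard normal CDF.
        return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))

    return cdf


sample = [9.1, 10.4, 10.0, 11.2, 9.3]
fitted_cdf = make_fitted_normal_cdf(sample)
print(fitted_cdf(statistics.mean(sample)))

# Untested sketch of the corresponding call; two parameters were estimated
# from the data, so nParametersEstimated would presumably be set to 2:
# p_value = chiSquaredTest(fitted_cdf, n_categories, x, nParametersEstimated=2)
```

The resulting function has the calling sequence userProcCdf(y) described above, with the estimated parameters captured in the closure instead of in global variables.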
Examples¶
Example 1¶
This example illustrates the use of chiSquaredTest on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equiprobable intervals. The null hypothesis that the sample is from a normal distribution is specified by use of normalCdf (see Special Functions) as the hypothesized distribution function. In this example, the null hypothesis is not rejected.
from __future__ import print_function
from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.normalCdf import normalCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomNormal import randomNormal
randomSeedSet(123457)
# Generate Normal deviates
n_observations = 1000
n_categories = 10
x = randomNormal(n_observations)
# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x)
# Print results
print("p value %7.4f" % (p_value))
Output¶
p value 0.1546
Example 2¶
This example reuses the data from Example 1 and requests several of the optional outputs.
from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.normalCdf import normalCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomNormal import randomNormal
from pyimsl.math.writeMatrix import writeMatrix
randomSeedSet(123457)
# Generate Normal deviates
n_observations = 1000
n_categories = 10
x = randomNormal(n_observations)
# Perform chi squared test
cutpoints = []
cell_counts = []
cell_chi_squared = []
chi_squared_statistics0 = []
chi_squared_statistics1 = []
chi_squared_statistics2 = chiSquaredTest(normalCdf, n_categories, x,
                                         cutpoints=cutpoints,
                                         cellCounts=cell_counts,
                                         cellChiSquared=cell_chi_squared,
                                         chiSquared=chi_squared_statistics0,
                                         degreesOfFreedom=chi_squared_statistics1)
chi_squared_statistics = [chi_squared_statistics0[0],
                          chi_squared_statistics1[0],
                          chi_squared_statistics2]
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels,
            column=True)
writeMatrix("Cut Points", cutpoints)
writeMatrix("Cell Counts", cell_counts)
writeMatrix("Cell Contributions to Chi-Squared",
            cell_chi_squared)
Output¶
Chi Squared Statistics
chi-squared          13.18
degrees of freedom    9.00
p-value               0.15
Cut Points
     1       2       3       4       5       6
-1.282  -0.842  -0.524  -0.253  -0.000   0.253
     7       8       9
 0.524   0.842   1.282
Cell Counts
  1    2    3    4    5    6
106  109   89   92   83   87
  7    8    9   10
110  104  121   99
Cell Contributions to Chi-Squared
   1     2     3     4     5     6
0.36  0.81  1.21  0.64  2.89  1.69
   7     8     9    10
1.00  0.16  4.41  0.01
Example 3¶
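In this example, a discrete Poisson random sample of size 1000 with parameter \(\theta=5.0\) is generated via function randomPoisson. In the call to chiSquaredTest, function poissonCdf, wrapped in user_proc_cdf so that \(\theta\) is fixed, is used as function userProcCdf. Because the distribution is discrete, the cutpoints are supplied by the user.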
from numpy import *
from pyimsl.math.chiSquaredTest import chiSquaredTest
from pyimsl.math.poissonCdf import poissonCdf
from pyimsl.math.randomSeedSet import randomSeedSet
from pyimsl.math.randomPoisson import randomPoisson
from pyimsl.math.writeMatrix import writeMatrix
theta = 5.0
n_numbers = 1000
n_categories = 10
def user_proc_cdf(k):
    # Hypothesized CDF: Poisson with parameter theta, evaluated at k
    cdf_v = poissonCdf(k, theta)
    return cdf_v
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5,
             7.5, 8.5, 9.5]
cell_row_labels = ["count",
                   "expected count",
                   "cell chi-squared"]
cell_col_labels = ["Poisson value", "0", "1", "2",
                   "3", "4", "5", "6", "7", "8", "9"]
stat_row_labels = ["chi-squared",
                   "degrees of freedom",
                   "p-value"]
randomSeedSet(123457)
# Generate the data
poisson = randomPoisson(n_numbers, theta)
# Copy data to a floating point vector
x = array(poisson, dtype=double)
cell_counts = []
cell_expected = []
cell_chi_squared = []
chi_squared_statistics0 = []
chi_squared_statistics1 = []
chi_squared_statistics2 = chiSquaredTest(user_proc_cdf, n_categories, x,
                                         cutpoints=cutpoints,
                                         cellCounts=cell_counts,
                                         cellExpected=cell_expected,
                                         cellChiSquared=cell_chi_squared,
                                         chiSquared=chi_squared_statistics0,
                                         degreesOfFreedom=chi_squared_statistics1)
chi_squared_statistics = [chi_squared_statistics0[0],
                          chi_squared_statistics1[0],
                          chi_squared_statistics2]
cell_statistics = empty((3, n_categories), dtype=double)
# Print results
writeMatrix("\nChi-squared statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels,
            column=True)
# Collect the per-cell outputs into one 3 x n_categories matrix
for i in range(0, n_categories):
    cell_statistics[0][i] = cell_counts[i]
    cell_statistics[1][i] = cell_expected[i]
    cell_statistics[2][i] = cell_chi_squared[i]
writeMatrix("\nCell Statistics\n", cell_statistics,
            rowLabels=cell_row_labels,
            colLabels=cell_col_labels)
Output¶
Chi-squared statistics
chi-squared          10.48
degrees of freedom    9.00
p-value               0.31
Cell Statistics
Poisson value          0      1      2      3
count               41.0   94.0  138.0  158.0
expected count      40.4   84.2  140.4  175.5
cell chi-squared     0.0    1.1    0.0    1.7
Poisson value          4      5      6      7
count              150.0  159.0  116.0   75.0
expected count     175.5  146.2  104.4   65.3
cell chi-squared     3.7    1.1    1.3    1.4
Poisson value          8      9
count               37.0   32.0
expected count      36.3   31.8
cell chi-squared     0.0    0.0
Warning Errors¶
IMSL_EXPECTED_VAL_LESS_THAN_1 | An expected value is less than 1.
IMSL_EXPECTED_VAL_LESS_THAN_5 | An expected value is less than 5.
Fatal Errors¶
IMSL_ALL_OBSERVATIONS_MISSING | All observations contain missing values.
IMSL_INCORRECT_CDF_1 | The function userProcCdf is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than one.
IMSL_INCORRECT_CDF_2 | The function userProcCdf is not a cumulative distribution function. The probability of the range of the distribution is not positive.
IMSL_INCORRECT_CDF_3 | The function userProcCdf is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
IMSL_INCORRECT_CDF_4 | The function userProcCdf is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
IMSL_INCORRECT_CDF_5 | An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
IMSL_STOP_USER_FCN | Request from user supplied function to stop algorithm. User flag = “#”.