chiSquaredTest
Performs a chi-squared goodness-of-fit test.
Synopsis
chiSquaredTest (userProcCdf, nCategories, x)
Required Arguments
- float userProcCdf(y) (Input) - User-supplied function that returns the hypothesized, cumulative distribution function at the point y.
- int nCategories (Input) - Number of cells into which the observations are to be tallied.
- float x[] (Input) - Array with nObservations components containing the vector of data elements for this test.
Return Value
The p-value for the goodness-of-fit chi-squared statistic.
Optional Arguments
- nParametersEstimated, int (Input) - Number of parameters estimated in computing the cumulative distribution function.
- ido, int (Input) - Processing option. The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input during one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of chiSquaredTest; with this option, it is not a requirement that all observations be memory resident, thus enabling one to handle large data sets.
  ido | Action
  0 | This is the only invocation; all the data are input at once. (Default)
  1 | This is the first invocation with this data; additional calls will be made. Initialization and updating for the nObservations observations of x will be performed.
  2 | This is an intermediate invocation; updating for the nObservations observations of x will be performed.
  3 | This is the final invocation of this function. Updating for the data in x and wrap-up computations are performed. Workspace is released. No further invocations of chiSquaredTest with ido greater than 1 should be made without first invoking chiSquaredTest with ido = 1.
  Default: ido = 0
- cutpoints (Output) - An array of length nCategories − 1 containing the vector of cutpoints defining the cell intervals. The intervals defined by the cutpoints are such that the lower endpoint is not included and the upper endpoint is included in any interval. If cutpointsEqual is specified, equal probability cutpoints are computed and returned in cutpoints.
- cutpointsEqual - If cutpointsUser is specified, then equal probability cutpoints can still be used if, in addition, the cutpointsEqual option is specified. If cutpointsUser is not specified, equal probability cutpoints are used by default.
- chiSquared (Output) - If specified, the chi-squared test statistic is returned in chiSquared.
- degreesOfFreedom (Output) - If specified, the degrees of freedom for the chi-squared goodness-of-fit test is returned in degreesOfFreedom.
- frequencies, float[] (Input) - Array with nObservations components containing the vector frequencies for the observations stored in x.
- bounds, float lowerBound, float upperBound (Input) - If bounds is specified, then lowerBound is the lower bound of the range of the distribution and upperBound is the upper bound of this range. If lowerBound = upperBound, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside the range of these bounds are ignored. Distributions conditional on a range can be specified when bounds is used. By convention, lowerBound is excluded from the first interval, but upperBound is included in the last interval.
- cellCounts (Output) - An array of length nCategories containing the cell counts. The cell counts are the observed frequencies in each of the nCategories cells.
- cellExpected (Output) - An array of length nCategories containing the cell expected values. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
- cellChiSquared (Output) - An array of length nCategories containing the cell contributions to chi-squared.
Description
Function chiSquaredTest performs a chi-squared goodness-of-fit test that
a random sample of observations is distributed according to a specified
theoretical cumulative distribution. The theoretical distribution, which can
be continuous, discrete, or a mixture of discrete and continuous
distributions, is specified by the user-defined function userProcCdf.
Because the user is allowed to give a range for the observations, a test
that is conditional on the specified range is performed.
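One way to read "conditional on the specified range" is through the textbook truncation formula F(y | a < X ≤ b) = (F(y) − F(a)) / (F(b) − F(a)). The sketch below only illustrates that idea and is not taken from the library's internals; the helper names `normal_cdf` and `conditional_cdf` are hypothetical, and the standard normal merely stands in for a user-supplied userProcCdf.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, standing in for a user-supplied userProcCdf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def conditional_cdf(cdf, lower, upper):
    """Return the CDF of the distribution restricted to (lower, upper].

    Assumption: this is the usual truncation formula, shown only to
    illustrate what "conditional on a range" means.
    """
    f_lo, f_hi = cdf(lower), cdf(upper)
    mass = f_hi - f_lo
    if mass <= 0.0:
        raise ValueError("the range must have positive probability")

    def truncated(y):
        if y <= lower:
            return 0.0
        if y >= upper:
            return 1.0
        return (cdf(y) - f_lo) / mass

    return truncated


cdf_on_range = conditional_cdf(normal_cdf, -1.0, 1.0)
print(cdf_on_range(0.0))   # midpoint of a symmetric range
```

The check on `mass` mirrors the requirement that the probability of the range be positive.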
Argument nCategories gives the number of intervals into which the
observations are to be divided. By default, equiprobable intervals are
computed by chiSquaredTest, but intervals that are not equiprobable can
be specified through the use of optional argument cutpointsUser.
Regardless of the method used to obtain the cutpoints, the intervals are
such that the lower endpoint is not included in the interval, while the
upper endpoint is always included. If the cumulative distribution function
has discrete elements, then user-provided cutpoints should always be used
since chiSquaredTest cannot determine the discrete elements in discrete
distributions.
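The endpoint convention can be made concrete with a short sketch. It is an illustration in plain Python, not the library's code, and the helper name `cell_index` is hypothetical: a value is assigned to the cell (c[j-1], c[j]], so a value exactly equal to a cutpoint falls in the cell below it.

```python
from bisect import bisect_left


def cell_index(cutpoints, value):
    """Return the 0-based cell for value, with cells of the form (c[j-1], c[j]]."""
    # bisect_left counts the cutpoints strictly below value, so a value equal
    # to a cutpoint stays in the cell whose upper endpoint is that cutpoint.
    return bisect_left(cutpoints, value)


cutpoints = [1.5, 2.5, 3.5]           # 3 cutpoints define 4 cells
for value in (0.0, 1.5, 2.0, 3.5, 7.0):
    print(value, "->", cell_index(cutpoints, value))
```

For integer-valued data, placing cutpoints at half-integers (as Example 3 does) keeps every observation away from a cell boundary, so the convention never has to break a tie.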
By default, the lower and upper endpoints of the first and last intervals
are −∞ and +∞, respectively. If bounds is specified, the endpoints are
user-defined by the two arguments lowerBound and upperBound.
A tally of counts is maintained for the observations in x as follows:
- If the cutpoints are specified by the user, the tally is made in the interval to which \(x_i\) belongs, using the user-specified endpoints.
- If the cutpoints are determined by chiSquaredTest, then the cumulative probability at \(x_i\), \(F(x_i)\), is computed by the function userProcCdf.
The tally for \(x_i\) is made in interval number \(\lfloor mF(x_i)+1\rfloor\),
where m = nCategories and \(\lfloor\cdot\rfloor\) is the function that takes
the greatest integer that is no larger than the argument of the function.
Thus, if the computer time required to calculate the cumulative distribution
function is large, user-specified cutpoints may be preferred to reduce the
total computing time.
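The equiprobable tally rule can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `normal_cdf` and `equiprobable_cell` are hypothetical helpers, and the clamp for F(x) = 1 is an assumption added so the index never exceeds m.

```python
import math


def normal_cdf(y):
    """Standard normal CDF, standing in for userProcCdf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def equiprobable_cell(cdf, n_categories, value):
    """1-based interval number floor(m * F(x) + 1) for m equiprobable cells."""
    cell = math.floor(n_categories * cdf(value) + 1)
    # Assumption: clamp so that F(x) == 1.0 falls in the last cell.
    return min(cell, n_categories)


counts = [0] * 10
for value in (-2.0, -0.1, 0.1, 0.3, 3.0):
    counts[equiprobable_cell(normal_cdf, 10, value) - 1] += 1
print(counts)
```

Each observation costs one CDF evaluation under this rule, which is why an expensive userProcCdf can make user-specified cutpoints the cheaper option.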
If the expected count in any cell is less than 1, then the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, as well as when an expected value is less than 5.
Programming Notes
Function userProcCdf must be supplied with calling sequence
userProcCdf(y), which returns the value of the cumulative
distribution function at any point y in the (optionally) specified
range. Many of the cumulative distribution functions in Chapter 11,
Probability Distribution Functions and Inverses, can be used for
userProcCdf, either directly if the calling sequence is correct or
indirectly if, for example, the sample means and standard deviations are
to be used in computing the theoretical cumulative distribution function.
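The "indirect" case can be handled with a small wrapper that fixes the estimated parameters and exposes the one-argument calling sequence. The sketch below uses only the Python standard library so that it runs on its own; `make_normal_cdf` is a hypothetical helper, the data are made up, and in practice the body of the wrapper could call a library CDF instead of `math.erf`.

```python
import math
import statistics


def make_normal_cdf(sample):
    """Build a one-argument CDF from the sample mean and standard deviation."""
    mean = statistics.mean(sample)
    std_dev = statistics.stdev(sample)   # sample standard deviation (n - 1)

    def user_proc_cdf(y):
        z = (y - mean) / std_dev
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return user_proc_cdf


sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]   # made-up data for illustration
user_proc_cdf = make_normal_cdf(sample)
print(user_proc_cdf(statistics.mean(sample)))   # CDF at the sample mean
```

Because two parameters are estimated from the data here, a call using such a wrapper would presumably also set nParametersEstimated to 2.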
Examples
Example 1
This example illustrates the use of chiSquaredTest on a randomly
generated sample from the normal distribution. One thousand randomly
generated observations are tallied into 10 equiprobable intervals. The null
hypothesis, that the sample is from a normal distribution, is specified by
use of normalCdf (Probability Distribution Functions and Inverses) as the
hypothesized distribution function. In this example, the null hypothesis
is not rejected.
from __future__ import print_function
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
seed = 123457
n_categories = 10
n_observations = 1000
randomSeedSet(seed)
# Generate normal deviates
x = randomNormal(n_observations)
# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x)
# Print results
print("p_value: %7.4f" % p_value)
Output
p_value: 0.1546
Example 2
In this example, optional arguments are used for the data in the initial example.
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix
seed = 123457
n_categories = 10
n_observations = 1000
chi_squared_statistics = empty((3), dtype='double')
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)
# Generate normal deviates
x = randomNormal(n_observations)
# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cutpoints = []
cell_counts = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(normalCdf, n_categories, x,
                   cutpoints=cutpoints,
                   cellCounts=cell_counts,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]
# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("Cut Points", cutpoints, writeFormat="%10.3f")
writeMatrix("Cell Counts", cell_counts, writeFormat="%5i")
writeMatrix("Cell Contributions to Chi-Squared", cell_chi_squared,
            writeFormat="%10.3f")
Output
Chi Squared Statistics
chi-squared 13.18
degrees of freedom 9.00
p-value 0.15
Cut Points
1 2 3 4 5 6
-1.282 -0.842 -0.524 -0.253 -0.000 0.253
7 8 9
0.524 0.842 1.282
Cell Counts
1 2 3 4 5 6 7 8 9 10
106 109 89 92 83 87 110 104 121 99
Cell Contributions to Chi-Squared
1 2 3 4 5 6
0.360 0.810 1.210 0.640 2.890 1.690
7 8 9 10
1.000 0.160 4.410 0.010
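The figures above can be checked by hand. With 1,000 observations in 10 equiprobable cells, each expected count is 100, and the printed contributions are consistent with the usual Pearson form (observed − expected)² / expected. The sketch below is a plain-Python verification of the printed output, not a call into the library; the degrees-of-freedom line assumes the standard rule of cells minus one when no parameters are estimated.

```python
cell_counts = [106, 109, 89, 92, 83, 87, 110, 104, 121, 99]
n_observations = sum(cell_counts)                # 1000
expected = n_observations / len(cell_counts)     # equiprobable cells: 100 each

# Contribution of each cell: (observed - expected)^2 / expected
contributions = [(c - expected) ** 2 / expected for c in cell_counts]
chi_squared = sum(contributions)

# Assumption: no parameters were estimated, so df = cells - 1.
degrees_of_freedom = len(cell_counts) - 1

print(["%.3f" % c for c in contributions])
print("chi-squared = %.2f, df = %d" % (chi_squared, degrees_of_freedom))
```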
Example 3
In this example, a discrete Poisson random sample of size 1,000 with
parameter \(\theta=5.0\) is generated by function randomPoisson
(Chapter 12, Random Number Generation). In the call to chiSquaredTest,
function poissonCdf (Chapter 11, Probability Distribution Functions and
Inverses) is used as function userProcCdf.
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.poissonCdf import poissonCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomPoisson import randomPoisson
from pyimsl.stat.writeMatrix import writeMatrix
def userProcCdf(k):
    theta = 5.0
    cdf_v = poissonCdf(int(k), theta)
    return cdf_v
seed = 123457
n_categories = 10
n_parameters_estimated = 0
n_numbers = 1000
theta = 5.0
x = empty(n_numbers, dtype='float')
chi_squared_statistics = empty((3), dtype='double')
cell_statistics = empty((3, n_categories), dtype='double')
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
cell_row_labels = ["count", "expected count", "cell chi-squared"]
cell_col_labels = ["Poisson value", "0", "1", "2",
"3", "4", "5", "6", "7", "8", "9"]
stat_row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)
# Generate Poisson deviates
poisson = randomPoisson(n_numbers, theta)
for i in range(0, n_numbers):
    x[i] = poisson[i]
# Perform chi squared test
cell_chi_squared = []
chi_squared = []
cell_counts = []
cell_expected = []
degrees_of_freedom = []
chi_squared_statistics[2] = \
    chiSquaredTest(userProcCdf, n_categories, x,
                   cutpointsUser=[cutpoints],
                   cellCounts=cell_counts,
                   cellExpected=cell_expected,
                   cellChiSquared=cell_chi_squared,
                   chiSquared=chi_squared,
                   degreesOfFreedom=degrees_of_freedom)
for i in range(0, n_categories):
    cell_statistics[0][i] = cell_counts[i]
    cell_statistics[1][i] = cell_expected[i]
    cell_statistics[2][i] = cell_chi_squared[i]
chi_squared_statistics[0] = chi_squared[0]
chi_squared_statistics[1] = degrees_of_freedom[0]
# Print results
writeMatrix("\nChi Squared Statistics\n",
            chi_squared_statistics,
            rowLabels=stat_row_labels, column=True)
writeMatrix("\nCell Statistics\n", cell_statistics,
            rowLabels=cell_row_labels,
            colLabels=cell_col_labels,
            writeFormat="%9.1f")
Output
Chi Squared Statistics
chi-squared 10.48
degrees of freedom 9.00
p-value 0.31
Cell Statistics
Poisson value 0 1 2 3 4
count 41.0 94.0 138.0 158.0 150.0
expected count 40.4 84.2 140.4 175.5 175.5
cell chi-squared 0.0 1.1 0.0 1.7 3.7
Poisson value 5 6 7 8 9
count 159.0 116.0 75.0 37.0 32.0
expected count 146.2 104.4 65.3 36.3 31.8
cell chi-squared 1.1 1.3 1.4 0.0 0.0
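The expected counts in this table can be reproduced from the cutpoints and the Poisson distribution function alone. The sketch below is a standalone check in plain Python; `poisson_cdf` is a hypothetical stand-in for poissonCdf, written out so the block runs without the library. Each expected count is the sample size times the probability mass between consecutive cutpoints.

```python
import math


def poisson_cdf(k, theta):
    """P(X <= k) for a Poisson(theta) variable; 0 for k < 0."""
    if k < 0:
        return 0.0
    return sum(math.exp(-theta) * theta ** j / math.factorial(j)
               for j in range(int(k) + 1))


theta = 5.0
n_numbers = 1000
cutpoints = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]

# CDF at the cell boundaries, with -inf and +inf as the outer endpoints.
boundaries = [0.0] + [poisson_cdf(c, theta) for c in cutpoints] + [1.0]
expected = [n_numbers * (hi - lo) for lo, hi in zip(boundaries, boundaries[1:])]

print(["%.1f" % e for e in expected])
```

Note that with these cutpoints the first cell is (−∞, 1.5] and so collects the values 0 and 1, while the last cell collects every value of 10 or more; the column labels 0 through 9 therefore appear to index the cells rather than name individual Poisson values.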
Example 4
Continuing with Example 1 data, the example below invokes the
chiSquaredTest function using values of ido greater than 0. Also,
optional arguments are used for the data.
from numpy import *
from pyimsl.stat.chiSquaredTest import chiSquaredTest
from pyimsl.stat.normalCdf import normalCdf
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomNormal import randomNormal
from pyimsl.stat.writeMatrix import writeMatrix
seed = 123457
n_categories = 10
n_observations = 1000
n_observations_block_1 = 300
n_observations_block_2 = 300
n_observations_block_3 = 400
row_labels = ["chi-squared", "degrees of freedom", "p-value"]
randomSeedSet(seed)
# Generate normal deviates
x1 = randomNormal(n_observations_block_1)
x2 = randomNormal(n_observations_block_2)
x3 = randomNormal(n_observations_block_3)
# Perform chi squared test
p_value = chiSquaredTest(normalCdf, n_categories, x1,
                         ido=1)
p_value = chiSquaredTest(normalCdf, n_categories, x2,
                         ido=2)
cutpoints = []
chiSquared = []
degreesOfFreedom = []
cellCounts = []
cellChiSquared = []
p_value = chiSquaredTest(normalCdf, n_categories, x3,
                         ido=3,
                         cutpoints=cutpoints,
                         chiSquared=chiSquared,
                         degreesOfFreedom=degreesOfFreedom,
                         cellCounts=cellCounts,
                         cellChiSquared=cellChiSquared)
# Print results
chi_squared_statistics = [chiSquared[0], degreesOfFreedom[0], p_value]
writeMatrix("Chi Squared Statistics", chi_squared_statistics,
            rowLabels=row_labels, transpose=True)
writeMatrix("Cut Points", cutpoints)
writeMatrix("Cell Counts", cellCounts)
writeMatrix("Cell Contributions to Chi-Squared", cellChiSquared)
Output
Chi Squared Statistics
1
chi-squared 13.18
degrees of freedom 9.00
p-value 0.15
Cut Points
1 2 3 4 5 6
-1.282 -0.842 -0.524 -0.253 -0.000 0.253
7 8 9
0.524 0.842 1.282
Cell Counts
1 2 3 4 5 6
106 109 89 92 83 87
7 8 9 10
110 104 121 99
Cell Contributions to Chi-Squared
1 2 3 4 5 6
0.36 0.81 1.21 0.64 2.89 1.69
7 8 9 10
1.00 0.16 4.41 0.01
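Example 4 reports the same statistics as Example 2 because a tally is additive: counting the data block by block gives the same cell counts as counting it all at once. The sketch below illustrates that property in plain Python. `BlockTally` is a hypothetical class that only mimics the ido = 1, 2, 3 call pattern; it is not part of the library and says nothing about how the library manages its workspace.

```python
from bisect import bisect_left


class BlockTally:
    """Hypothetical accumulator mimicking the ido = 1, 2, 3 call pattern."""

    def __init__(self, cutpoints):
        self.cutpoints = cutpoints
        self.counts = [0] * (len(cutpoints) + 1)

    def update(self, block):
        # Only the current block needs to be in memory.
        for value in block:
            self.counts[bisect_left(self.cutpoints, value)] += 1


cutpoints = [-0.5, 0.5]
data = [-1.2, -0.5, 0.0, 0.4, 0.5, 0.9, 1.7, -0.1]

all_at_once = BlockTally(cutpoints)
all_at_once.update(data)                 # analogous to ido = 0

in_blocks = BlockTally(cutpoints)
in_blocks.update(data[:3])               # first block (ido = 1)
in_blocks.update(data[3:6])              # intermediate block (ido = 2)
in_blocks.update(data[6:])               # final block (ido = 3)

print(all_at_once.counts, in_blocks.counts)
```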
Warning Errors
IMSLS_EXPECTED_VAL_LESS_THAN_1 | An expected value is less than 1.
IMSLS_EXPECTED_VAL_LESS_THAN_5 | An expected value is less than 5.
IMSLS_X_VALUE_OUT_OF_RANGE | Row x contains a value which is out of range.
IMSLS_MISSING_DATA_ELEMENT | At least one data element is missing.
Fatal Errors
IMSLS_ALL_OBSERVATIONS_MISSING | All observations contain missing values.
IMSLS_INCORRECT_CDF_1 | Function userProcCdf is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than 1.
IMSLS_INCORRECT_CDF_2 | Function userProcCdf is not a cumulative distribution function. The probability of the range of the distribution is not positive.
IMSLS_INCORRECT_CDF_3 | Function userProcCdf is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
IMSLS_INCORRECT_CDF_4 | Function userProcCdf is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
IMSLS_INCORRECT_CDF_5 | An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
IMSLS_TOO_MANY_CELL_DELETIONS | There are more observations deleted from the cell than added.
IMSLS_NO_BOUND_AFTER_100_TRYS | After 100 attempts, a bound for the inverse cannot be determined. Try again with a different initial estimate.
IMSLS_NO_UNIQUE_INVERSE_EXISTS | No unique inverse exists.
IMSLS_CONVERGENCE_ASSUMED | Over 100 iterations have occurred without convergence. Convergence is assumed.
IMSLS_BAD_IDO_6 | “ido” = #. Initial allocations must be performed by invoking the function with “ido” = 1.
IMSLS_BAD_IDO_7 | “ido” = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with “ido” = 3.
IMSLS_BAD_N_CATEGORIES | “nCategories” = #. The number of categories variable, “nCategories”, must be the same in separate function calls.
IMSLS_STOP_USER_FCN | Request from user supplied function to stop algorithm. User flag = “#”.