kolmogorovTwo

Performs a Kolmogorov-Smirnov two-sample test.

Synopsis

kolmogorovTwo (x, y)

Required Arguments

float x[] (Input)
Array of size nObservationsX containing the observations from sample one.
float y[] (Input)
Array of size nObservationsY containing the observations from sample two.

Return Value

An array of length 3 containing the test statistic Z, the one-sided p-value p1, and the two-sided p-value p2.

Optional Arguments

differences (Output)
The array of length 3 containing Dmn, D+mn, and D-mn.
nMissingX (Output)
Number of missing values in the x sample is returned in nMissingX.
nMissingY (Output)
Number of missing values in the y sample is returned in nMissingY.
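
As in the example below, the output-only optional arguments are supplied as empty Python lists that kolmogorovTwo fills in place. A minimal sketch of the calling convention (the sample values here are made up for illustration):

from pyimsl.stat.kolmogorovTwo import kolmogorovTwo

x = [0.62, 0.44, 0.80, 0.91, 0.74]   # hypothetical sample one
y = [0.35, 0.19, 0.57, 0.86]         # hypothetical sample two
differences = []                     # filled with [D, D+, D-]
nMissingX = []                       # filled with the missing count for x
nMissingY = []                       # filled with the missing count for y

stats = kolmogorovTwo(x, y, differences=differences,
                      nMissingX=nMissingX, nMissingY=nMissingY)
# stats[0] is Z, stats[1] the one-sided p-value, stats[2] the two-sided p-value
print(stats, differences, nMissingX[0], nMissingY[0])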

Description

Function kolmogorovTwo computes Kolmogorov-Smirnov two-sample test statistics for testing that two continuous cumulative distribution functions (CDFs) are identical based upon two random samples. One- or two-sided alternatives are allowed. Exact p-values are computed for the two-sided test when nObservationsX × nObservationsY is less than 10⁴.

Let Fn(x) denote the empirical CDF in the X sample, let Gm(y) denote the empirical CDF in the Y sample, where n = nObservationsX - nMissingX and m = nObservationsY - nMissingY, and let the corresponding population distribution functions be denoted by F(x) and G(y), respectively. Then, the hypotheses tested by kolmogorovTwo are as follows:

H0: F(x) = G(x)     H1: F(x) ≠ G(x)     (two-sided)
H0: F(x) ≤ G(x)     H1: F(x) > G(x)     (one-sided)
H0: F(x) ≥ G(x)     H1: F(x) < G(x)     (one-sided)

The test statistics are given as follows:

Dmn  = max(D+mn, D-mn)                    (differences[0])
D+mn = max over x of (Fn(x) - Gm(x))      (differences[1])
D-mn = max over x of (Gm(x) - Fn(x))      (differences[2])

Asymptotically, the distribution of the statistic

Z = Dmn * sqrt(mn / (m + n))

(returned as the first element of the return value) converges to a distribution given by Smirnov (1939).
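
The statistics above can be checked directly against the two empirical CDFs. The following sketch (plain NumPy, independent of kolmogorovTwo) evaluates Fn and Gm at the pooled sample points, where the extremes of the step-function difference are attained, and forms D, D+, D-, and Z as defined above:

import numpy as np

def ks_two_sample_sketch(x, y):
    # Sort the samples and evaluate both empirical CDFs at every pooled point.
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    n, m = x.size, y.size
    grid = np.concatenate([x, y])
    Fn = np.searchsorted(x, grid, side="right") / n   # Fn at each pooled point
    Gm = np.searchsorted(y, grid, side="right") / m   # Gm at each pooled point
    d_plus = np.max(Fn - Gm)              # D+ = max over x of (Fn(x) - Gm(x))
    d_minus = np.max(Gm - Fn)             # D- = max over x of (Gm(x) - Fn(x))
    d = max(d_plus, d_minus)              # D  = max(D+, D-)
    z = d * np.sqrt(n * m / (n + m))      # asymptotic statistic Z
    return d, d_plus, d_minus, z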

Exact probabilities for the two-sided test are computed when n*m is less than or equal to 10⁴, according to an algorithm given by Kim and Jennrich (1973). When n*m is greater than 10⁴, the very good approximations given by Kim and Jennrich are used to obtain the two-sided p-values. The one-sided probability is taken as one half the two-sided probability. This is a very good approximation when the p-value is small (say, less than 0.10), but not very good for large p-values.
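
To illustrate the halving approximation (this is not the exact Kim and Jennrich algorithm used by kolmogorovTwo), the asymptotic two-sided p-value can be evaluated from Smirnov's limiting distribution; for small p-values, half of it is close to the asymptotic one-sided value exp(-2z²), because the k = 1 term dominates the alternating series:

import math

def smirnov_two_sided_p(z, terms=100):
    # Asymptotic two-sided p-value: 2 * sum over k >= 1 of (-1)^(k-1) * exp(-2 k^2 z^2)
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                     for k in range(1, terms + 1))

z = 1.36                              # hypothetical value of the statistic Z
p_two = smirnov_two_sided_p(z)        # asymptotic two-sided p-value
p_one_half = 0.5 * p_two              # the halving approximation described above
p_one_asym = math.exp(-2.0 * z * z)   # asymptotic one-sided p-value
print(p_two, p_one_half, p_one_asym)  # the last two agree closely for this z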

Example

This example illustrates the kolmogorovTwo routine with two randomly generated samples from a uniform(0,1) distribution. Since the two theoretical distributions are identical, we would not expect to reject the null hypothesis.

from __future__ import print_function
from pyimsl.stat.kolmogorovTwo import kolmogorovTwo
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomUniform import randomUniform

nobsx = 100
nobsy = 60
randomSeedSet(123457)
x = randomUniform(nobsx)
y = randomUniform(nobsy)
nMissingX = []
nMissingY = []
differences = []

statistics = kolmogorovTwo(x, y,
                           nMissingX=nMissingX, nMissingY=nMissingY,
                           differences=differences)

print("D      = %8.4f" % (differences[0]))
print("D+     = %8.4f" % (differences[1]))
print("D-     = %8.4f" % (differences[2]))
print("Z      = %8.4f" % (statistics[0]))
print("Prob greater D one sided  = %8.4f" % (statistics[1]))
print("Prob greater D two sided  = %8.4f" % (statistics[2]))
print("Missing X = %d" % (nMissingX[0]))
print("Missing Y = %d" % (nMissingY[0]))

Output

D      =   0.1800
D+     =   0.1800
D-     =   0.0100
Z      =   1.1023
Prob greater D one sided  =   0.0720
Prob greater D two sided  =   0.1440
Missing X = 0
Missing Y = 0