exactNetwork

Computes Fisher exact probabilities and a hybrid approximation of the Fisher exact method for a two-way contingency table using the network algorithm.

Synopsis

exactNetwork (table)

Required Arguments

float table[[]] (Input)
Array of length nRows × nColumns containing the observed counts in the contingency table.

Return Value

The p-value for independence of rows and columns. The p-value represents the probability of a more extreme table where “extreme” is taken in the Neyman-Pearson sense. The p-value is “two-sided”.

Optional Arguments

probTable (Output)
Probability of the observed table occurring given that the null hypothesis of independent rows and columns is true.
pValue (Output)

The p-value for independence of rows and columns. The p-value represents the probability of a more extreme table where “extreme” is in the Neyman-Pearson sense. The pValue is “two-sided”. The p-value is also returned in functional form (see “Return Value”).

A table is more extreme if its probability (for fixed marginals) is less than or equal to probTable.

approximationParameters, expect, percent, expectedMinimum (Input)

Parameter expect is the expected value used in the hybrid approximation to Fisher’s exact test algorithm for deciding when to use asymptotic probabilities when computing path lengths. Parameter percent is the percentage of remaining cells that must have estimated expected values greater than expect before asymptotic probabilities can be used in computing path lengths. Parameter expectedMinimum is the minimum cell estimated value allowed for asymptotic chi-squared probabilities to be used.

Asymptotic probabilities are used in computing path lengths whenever percent or more of the cells in the table have estimated expected values of expect or more, with no cell having expected value less than expectedMinimum. See the Description section for details.

Defaults: expect = 5.0, percent = 80.0, expectedMinimum = 1.0

Note that these defaults correspond to the “Cochran” condition.
noApproximation,
The Fisher exact test is used. Arguments expect, percent, and expectedMinimum are ignored.
workspace, factor1, factor2, maxAttempts, nAttempts (Input/Output)

The network algorithm requires a large amount of workspace. Some of the workspace requirements are well-defined, while most of the workspace requirements can only be estimated. The estimate is based primarily on table size.

Function exactEnumeration allocates a default amount of workspace suitable for small problems. If the algorithm determines that this initial allocation of workspace is inadequate, the memory is freed, a larger amount of memory allocated (twice as much as the previous allocation), and the network algorithm is re-started. The algorithm allows for up to maxAttempts attempts to complete the algorithm.

Because each attempt requires computer time, it is suggested that factor1 and factor2 be set to some large numbers (like 1,000 and 30,000) if the problem to be solved is large. It is suggested that factor2 be 30 times larger than factor1. Although exactEnumeration will eventually work its way up to a large enough memory allocation, it is quicker to allocate enough memory initially.

The known (well-defined) workspace requirements are as follows: Define \(f_{\bullet\bullet }=\Sigma\Sigma f_{ij}\) equal to the sum of all cell frequencies in the observed table, \(nt=f_{\bullet\bullet }+1\), mx = max (nRows, nColumns), mn = min (nRows, nColumns), t1 = max (800 + 7mx, (5 + 2mx) (nRows + nColumns + 1) ), and t2 = max (400 + mx, + 1, nRows + nColumns + 1).

The following amount of integer workspace is allocated: \(3mx+2mn+t1\).

The following amount of float (or double, if using exactNetwork) work­space is allocated: \(nt+t2\).

The remainder of the workspace that is required must be estimated and allocated based on factor1 and factor2. The amount of integer workspace allocated is 6n (factor1 + factor2). The amount of real workspace allocated is n (6factor1 + 2factor2). Variable n is the index for the attempt, 1 < nmaxAttempts.

Defaults: factor1 = 100, factor2 = 3000, maxAttempts = 10

Description

Function exactNetwork computes Fisher exact probabilities or a hybrid algorithm approximation to Fisher exact probabilities for an r × c contingency table with fixed row and column marginals (a marginal is the number of counts in a row or column), where r = nRows and c = nColumns. Let \(f_{ij}\) denote the count in row i and column j of a table, and let \(f_{i\bullet}\) and \(f_{\bullet j}\) denote the row and column marginals. Under the hypothesis of independence, the (conditional) probability of the fixed marginals of the observed table is given by

\[P_f = \frac{\prod\limits_{i=1}^{r} f_{i\cdot}! \prod\limits_{j=1}^{c} f_{\cdot j}!} {f_{\cdot\cdot}! \prod\limits_{i=1}^{r} \prod\limits_{j=1}^{c} f_{ij}!}\]

where \(f_{\bullet\bullet }\) is the total number of counts in the table. \(P_f\) corresponds to output argument probTable.

A “more extreme” table X is defined in the probabilistic sense as more extreme than the observed table if the conditional probability computed for table X (for the same marginal sums) is less than the conditional probability computed for the observed table. The user should note that this definition can be considered “two-sided” in the cell counts.

See Example 1 for a comparison of execution times for the various algorithms. Note that the Fisher exact probability and the usual asymptotic chi-squared probability will usually be different. (The network approximation is often 10 times faster than the Fisher exact test, and even faster when compared to the total enumeration method.)

Examples

Example 1

The following example demonstrates and compares the various methods of computing the chi-squared p-value with respect to accuracy and execution time. As seen in the output of this example, the Fisher exact probability and the usual asymptotic chi-squared probability (generated using function contingencyTable) can be different. Also, note that the network algorithm with approximation can be up to 10 times faster than the network algorithm without approximation, and up to 100 times faster than the total enumeration method.

from __future__ import print_function
from numpy import *
from pyimsl.stat.contingencyTable import contingencyTable
from pyimsl.stat.exactEnumeration import exactEnumeration
from pyimsl.stat.exactNetwork import exactNetwork

print('\n==== exactNetwork1:\n')
table = array([[20, 20, 0, 0, 0],
               [10, 10, 2, 2, 1],
               [20, 20, 0, 0, 0]])

print("Asymptotic Chi-Squared p-value")
p = contingencyTable(table)
print("p-value = %9.4f" % (p))

print("\nNetwork Algorithm with Approximation")
p = exactNetwork(table)
print("p-value = %9.4f" % p)

print("\nNetwork Algorithm without Approximation")
p = exactNetwork(table, noApproximation=True)
print("p-value = %9.4f" % p)

print("\nTotal enumeration method")
p = exactEnumeration(table)
print("p-value = %9.4f" % p)

Output

==== exactNetwork1:

Asymptotic Chi-Squared p-value
***
*** Warning error issued from IMSL function contingencyTable:
*** Some expected values are less than 1.  Some asymptotic p-values may not be good.
***
p-value =    0.0323

Network Algorithm with Approximation
p-value =    0.0601

Network Algorithm without Approximation
p-value =    0.0597

Total enumeration method
p-value =    0.0597

Example 2

This document example demonstrates the optional keyword workspace and how different workspace settings affect execution time. Setting the workspace available too low results in poor performance since the algorithm will fail, re-allocate a larger amount of workspace (a factor of 10 larger) and re-start the calculations. (See Test #3, for which nAttempts is returned with a value of 2.) Setting the workspace available very large will provide no improvement in performance.

from __future__ import print_function
from numpy import *
from pyimsl.math.ctime import ctime
from pyimsl.stat.exactEnumeration import exactEnumeration
from pyimsl.stat.exactNetwork import exactNetwork

p = 0.0
n_attempts = 0
table = array([[20, 20, 0, 0, 0],
               [10, 10, 2, 2, 1],
               [20, 20, 0, 0, 0]])
simulation_size = 10

print("Test #1, factor1 = 1000, factor2 = 30000")
a = ctime()
workspace = {'factor1': 1000, 'factor2': 3000, 'maxAttempts': 10}
for i in range(0, simulation_size, 1):
    p = exactNetwork(table, noApproximation=True,
                     workspace=workspace)
b = ctime()
print("n_attempts = %2d" % workspace['nAttempts'])
print("Execution time = %10.4f" % (b - a))

print("\nTest #2, factor1 = 100, factor2 = 3000")
a = ctime()
workspace = {'factor1': 100, 'factor2': 3000, 'maxAttempts': 10}
for i in range(0, simulation_size, 1):
    p = exactNetwork(table, noApproximation=True,
                     workspace=workspace)
b = ctime()
print("n_attempts = %2d" % workspace['nAttempts'])
print("Execution time = %10.4f" % (b - a))

print("\nTest #3, factor1 = 10, factor2 = 300")
a = ctime()
workspace = {'factor1': 10, 'factor2': 300, 'maxAttempts': 10}
for i in range(0, simulation_size, 1):
    p = exactNetwork(table,
                     noApproximation=True,
                     workspace=workspace)
b = ctime()
print("n_attempts = %2d" % workspace["nAttempts"])
print("Execution time = %10.4f" % (b - a))

Output

Test #1, factor1 = 1000, factor2 = 30000
n_attempts =  1
Execution time =     0.0099

Test #2, factor1 = 100, factor2 = 3000
n_attempts =  1
Execution time =     0.0094

Test #3, factor1 = 10, factor2 = 300
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
***
*** Warning error issued from IMSL function exactNetwork:
*** The value "ldkey" = 10 is too small.  "ldkey" is calculated as "factor1" * pow(10,"n_attempt"-1).  Ending this execution attempt.
***
n_attempts =  2
Execution time =     0.0180

Warning Errors

IMSLS_HASH_TABLE_ERROR_2 The value “ldkey” = # is too small. “ldkey” is calculated as “factor1”*pow(10,”nAttempt”−1) ending this execution attempt.
IMSLS_HASH_TABLE_ERROR_3 The value “ldstp” = # is too small. “ldstp” is calculated as “factor2”*pow(10,”nAttempt”−1) ending this execution attempt.

Fatal Errors

IMSLS_HASH_TABLE_ERROR_1 The hash table key cannot be computed because the largest key is larger than the largest representable integer. The algorithm cannot proceed.