falseDiscoveryRates¶
Calculates the False Discovery Rate (FDR) q‑values corresponding to a set of p‑values resulting from multiple simultaneous hypothesis tests.
Synopsis¶
falseDiscoveryRates (nTests, pvalues)
Required Arguments¶
- int nTests (Input) - The number of hypothesis tests.
- float pvalues[] (Input) - An array of length nTests containing the p-values associated with the tests.
Return Value¶
An array of length nTests containing the calculated q‑values.
Optional Arguments¶
- lambdas, float (Input)
  An array of length nLambdas containing the grid values on [0,1) used in the estimate of the null probability.
  Default: lambdas = {0.0, 0.05, …, 0.90}, len(lambdas) = 19.
- gammaParam, float (Input)
  Size of the rejection region (0 ≤ gammaParam ≤ 1.0) used in the calculation of the FDR and pFDR measures.
  Default: gammaParam = 0.05.
- method, int (Input)
  Specifies the method used to remove the dependence of the null probability estimate on the lambda variable: when method = 0, a bootstrap with nSample samples is used; when method = 1, a cubic spline smoother is used.
  Default: method = 0.
- smoothingPar, float (Input)
  Smoothing parameter (0 ≤ smoothingPar ≤ 1.0) for the cubic spline smoother. Used only if method = 1.
  Default: If method = 1 and this optional argument is not provided, the smoothing parameter is selected by cross‑validation.
- nSample, int (Input)
  Number of bootstrap samples to generate when method = 0, or when estimating upper confidence limits because confidence is present.
  Default: nSample = 100.
- randomSeed, int (Input)
  The seed of the random number generator used in generating the bootstrap samples. If randomSeed is 0, a value is computed using the system clock; hence, the results may differ between calls with the same inputs.
  Default: randomSeed = 0.
- confidence, float (Input)
  Confidence level (0.0 < confidence < 100.0). See upperLimits.
  Default: confidence = 95.0.
- nullProb (Output)
  The null probability estimate.
- upperLimits (Output)
  An array of length 2 containing the (confidence)% upper bounds for pFDR and FDR.
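As a quick orientation to the calling convention, the sketch below passes a few of the optional arguments listed above. The p‑values are placeholders, and it assumes, as in the examples further down, that plain Python lists are accepted for the array arguments.

from pyimsl.stat.falseDiscoveryRates import falseDiscoveryRates

pvalues = [0.02, 0.15, 0.47, 0.68, 0.91]   # placeholder p-values from 5 tests

nullProb = []                              # receives the null probability estimate
qvalues = falseDiscoveryRates(pvalues,
                              gammaParam=0.10,                     # size of the rejection region
                              lambdas=[0.0, 0.2, 0.4, 0.6, 0.8],   # custom grid on [0, 1)
                              nullProb=nullProb)

print("pi0 estimate:", nullProb[0])
print("q-values:", qvalues)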
Description¶
Let \(\{p_1,p_2,\ldots,p_m\}\) be the p‑values associated with m independent tests of a statistical hypothesis. The following table summarizes the possible outcomes of the m tests. Note that the only known quantities in the table are W, R, and m.
| Hypothesis | Accept | Reject | Total |
|---|---|---|---|
| Null is true | U | V | \(m_0\) |
| Alternative is true | T | S | \(m_1\) |
| Total | W | R | m |
In the above table, V is the number of false discoveries (and the number of type I errors). Whereas the type I error rate is the probability of rejecting at least one true null hypothesis, the false discovery rate (FDR) (Benjamini and Hochberg (1995)) is the expected proportion of falsely rejected true nulls. In other words, the FDR is the expected proportion of “false positives” among the tests that are deemed significant. Using the notation from the table,

\[\textrm{FDR} = E\left[\frac{V}{R \vee 1}\right]\]
The denominator \(R\lor 1=\max(R,1)\) avoids division by 0 in case there are no significant results (\(R=0\)). The positive false discovery rate, or pFDR, defined in Storey (2001), is conditional on there being at least one significant test (\(R>0\)):

\[\textrm{pFDR} = E\left[\frac{V}{R}\,\middle|\,R>0\right]\]
From Theorem 1 in Storey (2001),

\[\textrm{pFDR}(\gamma) = \frac{\pi_0 \Pr[P\leq\gamma \mid H=0]}{\Pr[P\leq\gamma]} = \frac{\pi_0\,\gamma}{\Pr[P\leq\gamma]}\]

where \(\pi_0\) is the probability that an individual hypothesis is null, and \(\Pr[P\leq\gamma \mid H=0]=\gamma\) is the probability that an individual hypothesis is rejected given that it is null. The parameter γ is the predefined size of the rejection region. The denominator is the probability that a test is rejected, given γ. This relationship arises from Bayes' Theorem and the assumption that the p‑values are independent.
An estimator for \(m_0\), the number of true null hypotheses, is

\[\hat{m}_0(\lambda) = \frac{W(\lambda)}{1-\lambda}\]

where λ is a significance level on the interval \(\left[0,1\right)\) and \(W(\lambda)\) is the number of tests that are accepted at level λ. That is, \(W(\lambda)=\#\{p_i> \lambda\}\). An estimator for the probability of a null hypothesis, \(\pi_0\), is then

\[\hat{\pi}_0(\lambda) = \frac{\hat{m}_0(\lambda)}{m} = \frac{W(\lambda)}{(1-\lambda)\,m}\]
The parameter λ is a tuning parameter used to estimate the true null distribution. The rationale is that the p-values of the null hypotheses are uniformly distributed and most of the larger p‑values (> λ) will be from the null distribution. See Storey and Tibshirani (2003) for further details.
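As a concrete illustration of the estimate above, \(\hat{\pi}_0(\lambda)\) can be computed directly from the count \(W(\lambda)\). This is a minimal NumPy sketch, not the code used by falseDiscoveryRates; the helper name pi0_hat is only for illustration.

import numpy as np

def pi0_hat(pvalues, lam):
    # pi0_hat(lambda) = W(lambda) / ((1 - lambda) * m), with
    # W(lambda) = #{p_i > lambda}, as in the formula above.
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    W = np.count_nonzero(p > lam)
    return W / ((1.0 - lam) * m)

# Evaluate the estimate over a grid like the default lambdas = 0.0, 0.05, ..., 0.90.
lambdas = [0.05 * i for i in range(19)]
p = np.random.uniform(size=20)
estimates = [pi0_hat(p, lam) for lam in lambdas]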
Now using

\[\hat{\Pr}[P\leq\gamma] = \frac{R(\gamma)\vee 1}{m}\]

where \(R(\gamma)=\#\{p_i\leq\gamma\}\), Storey (2002) gives the following estimators for pFDR and FDR:

\[\widehat{\textrm{pFDR}}_{\lambda}(\gamma) = \frac{\hat{\pi}_0(\lambda)\,\gamma}{\hat{\Pr}[P\leq\gamma]\left\{1-(1-\gamma)^m\right\}}\]

and

\[\widehat{\textrm{FDR}}_{\lambda}(\gamma) = \frac{\hat{\pi}_0(\lambda)\,\gamma}{\hat{\Pr}[P\leq\gamma]}\]
Note that \(1-(1-\gamma)^m\) is a lower bound for \(P(R>0)\), the probability of at least one significant test, and

\[\frac{\gamma}{1-(1-\gamma)^m}\]

is a conservative estimate for the type I error, given that \(R>0\).
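A plug-in sketch of the two estimators above follows the same pattern. Again this is illustrative only, not the routine's internals; the function name fdr_estimates is hypothetical.

import numpy as np

def fdr_estimates(pvalues, gamma, lam):
    # Plug-in FDR and pFDR estimates at rejection size gamma and tuning
    # parameter lambda, following the formulas above.
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0 = np.count_nonzero(p > lam) / ((1.0 - lam) * m)    # pi0_hat(lambda)
    pr_reject = max(np.count_nonzero(p <= gamma), 1) / m   # Pr_hat[P <= gamma]
    fdr = pi0 * gamma / pr_reject
    pfdr = fdr / (1.0 - (1.0 - gamma) ** m)                # condition on R > 0
    return pfdr, fdr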
In falseDiscoveryRates, the estimates above are calculated on a grid of λ values on [0,1) and the minimizer is noted. The calculations are then
repeated on B bootstrap samples of the p‑values. The dependence on λ is
removed by one of two methods. In Algorithm 3 of Storey (2002),
\(\hat{\lambda}\) minimizes the mean squared error between the bootstrap
estimates and the original minimum of the estimates over the grid of λ
values. Then, \(\hat{\pi}_0=\hat{\pi}_0 \left( \hat{\lambda} \right)\) .
A second method, suggested by Storey and Tibshirani (2003), uses a cubic
spline smoother \(\hat{f}(\lambda)\) on the set of values, \(\left(
\lambda_i,\hat{\pi}_0 \left( \lambda_i \right) \right)\). Then the smoothed
value \(\hat{\pi}_0=\hat{f}(1)\). Upper confidence bounds are determined
by taking the (1 ‑ α) quantile of the bootstrapped
pFDR and FDR values, using the smoothed value \(\hat{\pi}_0\)
obtained by either method above.
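The bootstrap selection of λ (Algorithm 3 of Storey (2002)) can be sketched as follows. This is an illustration of the idea under the same assumptions as the snippets above, not the routine's implementation; the names select_lambda_bootstrap and pfdr_hat, the bootstrap size, and the random number generator are all illustrative.

import numpy as np

def select_lambda_bootstrap(pvalues, lambdas, gamma, n_boot=100, seed=None):
    # Pick the lambda whose bootstrap pFDR estimates have the smallest mean
    # squared error about the minimum of the original estimates over the grid.
    rng = np.random.default_rng(seed)
    p = np.asarray(pvalues, dtype=float)
    m = p.size

    def pfdr_hat(pv, lam):
        pi0 = np.count_nonzero(pv > lam) / ((1.0 - lam) * m)
        pr = max(np.count_nonzero(pv <= gamma), 1) / m
        return pi0 * gamma / (pr * (1.0 - (1.0 - gamma) ** m))

    est = np.array([pfdr_hat(p, lam) for lam in lambdas])
    target = est.min()                        # minimum of the original estimates
    boot = np.empty((n_boot, len(lambdas)))
    for b in range(n_boot):
        pb = rng.choice(p, size=m, replace=True)    # bootstrap sample of the p-values
        boot[b] = [pfdr_hat(pb, lam) for lam in lambdas]
    mse = ((boot - target) ** 2).mean(axis=0)
    return lambdas[int(np.argmin(mse))]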
The q‑value, introduced in Storey (2001), is the pFDR analogue of the p‑value. For independent tests, the q‑value of the observed p‑value \(p_i\) is

\[q(p_i) = \inf_{\gamma\geq p_i} \textrm{pFDR}(\gamma)\]
Whereas a p‑value is the minimum type I error rate that can occur while
rejecting a test with a given value of the test statistic, the q‑value
is the minimum pFDR that can occur while still rejecting the test. For
more details see Storey (2001) and Storey (2002). To find the q‑values, falseDiscoveryRates implements the algorithm given in Storey and Tibshirani (2003).
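The step-down construction in Storey and Tibshirani (2003) can be sketched as below. This is a simplified illustration that assumes the null probability estimate \(\hat{\pi}_0\) has already been obtained; the function name q_values is hypothetical and this is not the source of falseDiscoveryRates.

import numpy as np

def q_values(pvalues, pi0):
    # Order the p-values, start from the largest with q(m) = pi0 * p(m), and
    # work downward with q(i) = min(pi0 * m * p(i) / i, q(i+1)) so that the
    # q-values are monotone in the p-values.
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                    # indices of p sorted ascending
    q = np.empty(m)
    q[order[-1]] = min(pi0 * p[order[-1]], 1.0)
    for k in range(m - 2, -1, -1):
        rank = k + 1                         # rank of the (k+1)-th smallest p-value
        q[order[k]] = min(pi0 * m * p[order[k]] / rank, q[order[k + 1]])
    return q

As a rough check, with \(\hat{\pi}_0 = 0.5\) and the 20 p‑values of Example 1 below, this recursion gives 0.5 × 20 × 0.0223 / 1 ≈ 0.223 for the smallest p‑value, matching the printed q‑value.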
Examples¶
Example 1¶
The p‑values are 20 independent realizations of a uniform (0,1) random variable. The null probability estimate, q‑values, and upper confidence limits are returned.
from __future__ import print_function
from numpy import *
from pyimsl.stat.falseDiscoveryRates import falseDiscoveryRates
from pyimsl.stat.writeMatrix import writeMatrix
from pyimsl.stat.sortData import sortData
iseed = 123457
p1 = [0.7897864, 0.5600287, 0.04625103, 0.4892959, 0.598915,
      0.2149330, 0.9683629, 0.1449932, 0.4999971, 0.2820091,
      0.3489318, 0.479333, 0.9786092, 0.02232179, 0.2329003,
      0.3600357, 0.1341173, 0.5148499, 0.5693829, 0.9914673]
ntests = len(p1)
upperlimits = []
pi0 = []
qvals = falseDiscoveryRates(p1,
                            nullProb=pi0,
                            randomSeed=iseed,
                            upperLimits=upperlimits)
sortData(p1, 1)
print("Null Probability Estimate: %4.3f" % pi0[0])
writeMatrix("Upper Limits for pFDR and FDR:", upperlimits)
print("\n\tP-Value\t Q-Value")
for i in range(0, ntests):
    print("\t %4.3f \t %4.3f" % (p1[i], qvals[i]))
Output¶
Null Probability Estimate: 0.500
P-Value Q-Value
0.022 0.223
0.046 0.231
0.134 0.362
0.145 0.362
0.215 0.374
0.233 0.374
0.282 0.374
0.349 0.374
0.360 0.374
0.479 0.374
0.489 0.374
0.500 0.374
0.515 0.374
0.560 0.374
0.569 0.374
0.599 0.374
0.790 0.465
0.968 0.496
0.979 0.496
0.991 0.496
Upper Limits for pFDR and FDR:
1 2
1.000 0.869
Example 2¶
This example applies the cubic spline smoothing method to get an estimate of the null probability. Note that the p‑values are the same as in Example 1, but the null probability estimate is larger in this case.
from __future__ import print_function
from numpy import *
from pyimsl.stat.falseDiscoveryRates import falseDiscoveryRates
from pyimsl.stat.writeMatrix import writeMatrix
from pyimsl.stat.sortData import sortData
iseed = 123457
p1 = [0.7897864, 0.5600287, 0.04625103, 0.4892959, 0.598915,
      0.2149330, 0.9683629, 0.1449932, 0.4999971, 0.2820091,
      0.3489318, 0.479333, 0.9786092, 0.02232179, 0.2329003,
      0.3600357, 0.1341173, 0.5148499, 0.5693829, 0.9914673]
ntests = len(p1)
upperlimits = []
pi0 = []
qvals = falseDiscoveryRates(p1,
                            nullProb=pi0,
                            method=1,
                            randomSeed=iseed,
                            upperLimits=upperlimits)
sortData(p1, 1)
# print(p1)
print("Null Probability Estimate: %4.3f" % pi0[0])
writeMatrix("Upper Limits for pFDR and FDR:", upperlimits)
print("\n\tP-Value\t Q-Value")
for i in range(0, ntests):
    print("\t %4.3f \t %4.3f" % (p1[i], qvals[i]))
Output¶
Null Probability Estimate: 0.870
P-Value Q-Value
0.022 0.388
0.046 0.402
0.134 0.631
0.145 0.631
0.215 0.651
0.233 0.651
0.282 0.651
0.349 0.651
0.360 0.651
0.479 0.651
0.489 0.651
0.500 0.651
0.515 0.651
0.560 0.651
0.569 0.651
0.599 0.651
0.790 0.808
0.968 0.863
0.979 0.863
0.991 0.863
Upper Limits for pFDR and FDR:
1 2
1 1
Warning Errors¶
IMSLS_NULL_PROBABILITY_0 | The null probability estimate is ≤ 0. Check that the p‑values are correct, or try lowering the maximum “lambda” value, which is currently = #.