simpleStatistics¶
Computes basic univariate statistics.
Synopsis¶
simpleStatistics (x)
Required Arguments¶
- float x[[]] (Input)
Array of size nObservations × nVariables containing the data matrix.
Return Value¶
An array containing some simple statistics for each of the columns in x. If neither median nor medianAndScale is specified as an optional argument, the size of the matrix is 14 × nVariables. The columns of this matrix correspond to the columns of x, and the rows contain the following statistics:
Row | Statistic |
---|---|
0 | mean |
1 | variance |
2 | standard deviation |
3 | coefficient of skewness |
4 | coefficient of excess (kurtosis) |
5 | minimum value |
6 | maximum value |
7 | range |
8 | coefficient of variation (when defined). If the coefficient of variation is not defined, 0 is returned. |
9 | number of observations (the counts) |
10 | lower confidence limit for the mean (assuming normality). The default is a 95-percent confidence interval. |
11 | upper confidence limit for the mean (assuming normality) |
12 | lower confidence limit for the variance (assuming normality). The default is a 95-percent confidence interval. |
13 | upper confidence limit for the variance (assuming normality) |
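Because the statistics are addressed by row number, a small name-to-row lookup can make client code easier to read. The sketch below is illustrative only: `ROW_NAMES`, `ROW_INDEX`, and `stat_for` are hypothetical helpers (not part of pyimsl), and the stand-in `stats` matrix holds the first column of the Example 1 output rather than the result of a live call.

```python
# Hypothetical helper names; the row order follows the table above.
ROW_NAMES = [
    "mean", "variance", "standard deviation", "skewness", "kurtosis",
    "minimum", "maximum", "range", "coefficient of variation", "count",
    "lower mean", "upper mean", "lower variance", "upper variance",
]
ROW_INDEX = {name: row for row, name in enumerate(ROW_NAMES)}


def stat_for(stats, name, variable):
    """Return the named statistic for one variable (column) of stats."""
    return stats[ROW_INDEX[name]][variable]


# Stand-in for a 14 x 1 result: first column of the Example 1 output.
stats = [[7.462], [34.603], [5.882], [0.688], [0.075], [1.0], [21.0],
         [20.0], [0.788], [13.0], [3.907], [11.016], [17.793], [94.289]]

print(stat_for(stats, "maximum", 0))         # 21.0
print(stat_for(stats, "upper variance", 0))  # 94.289
```

If median or medianAndScale is requested, the extra rows (14 to 16) would need to be appended to the name list.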
Optional Arguments¶
confidenceMeans, float (Input)
Confidence level, in percent, for a two-sided interval estimate of the means (assuming normality). Argument confidenceMeans must be between 0.0 and 100.0 and is often 90.0, 95.0, or 99.0. For a one-sided confidence interval with confidence level c, set confidenceMeans = \(100.0-2(100-c)\). If confidenceMeans is not specified, a 95-percent confidence interval is computed.
confidenceVariances, float (Input)
Confidence level, in percent, for a two-sided interval estimate of the variances (assuming normality). The confidence intervals are symmetric in probability (rather than in length). For a one-sided confidence interval with confidence level c, set confidenceVariances = \(100.0-2(100-c)\). If confidenceVariances is not specified, a 95-percent confidence interval is computed.
ido, int (Input)
Processing option. The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input during one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of simpleStatistics; with this option, the observations need not all be memory resident, which makes it possible to handle large data sets.
ido | Action |
---|---|
0 | This is the only invocation; all the data are input at once. (Default) |
1 | This is the first invocation with this data; additional calls will be made. Initialization and updating for the nObservations observations of x will be performed. |
2 | This is an intermediate invocation; updating for the nObservations observations of x will be performed. |
3 | This is the final invocation of this function. Updating for the data in x and wrap-up computations are performed. Workspace is released. No further invocations of simpleStatistics with ido greater than 1 should be made without first invoking simpleStatistics with ido = 1. |
Default: ido = 0
median, or medianAndScale
Exactly one of these optional arguments can be specified to indicate the additional simple robust statistics to be computed. If median is specified, the medians are computed and stored in one additional row (row 14) of the returned matrix of simple statistics. If medianAndScale is specified, the medians, the medians of the absolute deviations from the medians, and a simple robust estimate of scale are computed and stored in three additional rows (rows 14, 15, and 16) of the returned matrix of simple statistics. median or medianAndScale can be specified only when ido is equal to 0.
missingListwise, or missingElementwise
If missingElementwise is specified, all nonmissing data for any variable are used in computing the statistics for that variable. If missingListwise is specified and an observation (row of x) contains a missing value, the observation is excluded from computations for all variables. The default is missingListwise. In either case, if weights and/or frequencies are specified and the value of the weight and/or frequency is missing, the observation is excluded from computations for all variables.
frequencies, float[] (Input)
Array of length nObservations containing the frequency for each observation.
Default: Each observation has a frequency of 1.
weights, float[] (Input)
Array of length nObservations containing the weight for each observation.
Default: Each observation has a weight of 1.
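Two of the options above lend themselves to a short illustration in plain Python. The helper names below (`two_sided_level`, `listwise_columns`, `elementwise_columns`) are hypothetical and not part of pyimsl. The first applies the one-sided conversion formula; the other two only mimic which values each missing-value policy would keep, with NaN standing for a missing value.

```python
import math


def two_sided_level(c):
    """Two-sided level (percent) to request for a one-sided level c."""
    return 100.0 - 2.0 * (100.0 - c)


def listwise_columns(x):
    """Drop every row containing a NaN, then return the columns."""
    kept = [row for row in x if not any(math.isnan(v) for v in row)]
    return [list(col) for col in zip(*kept)]


def elementwise_columns(x):
    """Return the columns, dropping only the NaN entries of each one."""
    return [[v for v in col if not math.isnan(v)] for col in zip(*x)]


nan = float("nan")
x = [[1.0, 10.0],
     [2.0, nan],
     [3.0, 30.0]]

print(two_sided_level(95.0))   # 90.0
print(listwise_columns(x))     # [[1.0, 3.0], [10.0, 30.0]]
print(elementwise_columns(x))  # [[1.0, 2.0, 3.0], [10.0, 30.0]]
```

Under the listwise policy the second row is lost for both variables, whereas the elementwise policy keeps its valid entry for the first variable.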
Description¶
For the data in each column of x, simpleStatistics computes the sample mean, variance, minimum, maximum, and other basic statistics. This function also computes confidence intervals for the mean and variance (under the hypothesis that the sample is from a normal population).
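As a rough check of the interval rows, the sketch below recomputes the 95-percent limits for the first variable of Example 1 using the textbook normal-theory intervals (Student's t for the mean, chi-squared for the variance). That simpleStatistics uses exactly these formulas is an assumption. The critical values for 12 degrees of freedom are hard-coded from standard tables, because the Python standard library has no t or chi-squared quantile function.

```python
import math
import statistics

# First column of the Example 1 data.
col = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
n = len(col)
mean = statistics.mean(col)
var = statistics.variance(col)  # sample variance, divisor n - 1

# Table values for n - 1 = 12 degrees of freedom (looked up, not computed).
T_975 = 2.1788       # 97.5th percentile of Student's t
CHI2_975 = 23.3367   # 97.5th percentile of chi-squared
CHI2_025 = 4.4038    # 2.5th percentile of chi-squared

half_width = T_975 * math.sqrt(var / n)
mean_limits = (mean - half_width, mean + half_width)
var_limits = ((n - 1) * var / CHI2_975, (n - 1) * var / CHI2_025)

print("mean limits:     %.3f %.3f" % mean_limits)
print("variance limits: %.3f %.3f" % var_limits)
```

The printed limits should agree closely with rows 10 through 13 of the Example 1 output for the first variable; small differences in the last digit can come from the rounded table values.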
Frequencies are interpreted as multiple occurrences of the other values in the observations. In other words, a row of x with a frequency variable having a value of 2 has the same effect as two rows with frequencies of 1. The total of the frequencies is used in computing all the statistics based on moments (mean, variance, skewness, and kurtosis). Weights are not viewed as replication factors. The sum of the weights is used only in computing the mean (the weighted mean is used in computing the central moments). Both weights and frequencies can be 0, but neither can be negative. In general, a 0 frequency means that the row is to be eliminated from the analysis; no further processing or error checking is done on the row. A weight of 0 results in the row being counted, and the statistics are updated.
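The replication reading of frequencies can be checked with a small sketch in plain Python. The function `weighted_mean_var` is hypothetical, not the library routine. It assumes the mean is the average weighted by frequency times weight, and that the variance divides the weighted sum of squared deviations by the total frequency minus one.

```python
def weighted_mean_var(x, freq=None, weights=None):
    """Mean and variance of x with optional frequencies and weights.

    Assumed formulas: mean = sum(f*w*x) / sum(f*w) and
    variance = sum(f*w*(x - mean)**2) / (sum(f) - 1).
    """
    freq = [1.0] * len(x) if freq is None else freq
    weights = [1.0] * len(x) if weights is None else weights
    # A zero frequency removes the row from the analysis entirely.
    rows = [(xi, f, w) for xi, f, w in zip(x, freq, weights) if f > 0]
    n = sum(f for _, f, _ in rows)
    sum_fw = sum(f * w for _, f, w in rows)
    mean = sum(f * w * xi for xi, f, w in rows) / sum_fw
    var = sum(f * w * (xi - mean) ** 2 for xi, f, w in rows) / (n - 1)
    return mean, var


# A frequency of 2 behaves like listing the row twice ...
print(weighted_mean_var([1.0, 4.0, 7.0], freq=[2.0, 1.0, 1.0]))  # (3.25, 8.25)
print(weighted_mean_var([1.0, 1.0, 4.0, 7.0]))                   # (3.25, 8.25)
# ... while a weight of 2 shifts the mean without raising the count.
print(weighted_mean_var([1.0, 4.0, 7.0], weights=[2.0, 1.0, 1.0]))  # (3.25, 12.375)
```

In the weighted call the divisor stays at 3 - 1 = 2 because only frequencies contribute to the count, which is why the variance differs from the frequency case.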
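The definitions of some of the statistics are given below in terms of a single variable x, of which the i-th datum is \(x_i\), with corresponding frequency \(f_i\) and weight \(w_i\) (both identically 1 when not specified). Each sum runs over the valid observations, and \(n=\sum f_i\) is the number of nonmissing observations.
Mean¶
\[\bar{x}_w = \frac{\sum f_i w_i x_i}{\sum f_i w_i}\]
Variance¶
\[s_w^2 = \frac{\sum f_i w_i \left(x_i-\bar{x}_w\right)^2}{n-1}\]
Skewness¶
\[\frac{\sum f_i w_i \left(x_i-\bar{x}_w\right)^3/n}{\left[\sum f_i w_i \left(x_i-\bar{x}_w\right)^2/n\right]^{3/2}}\]
Excess or Kurtosis¶
\[\frac{\sum f_i w_i \left(x_i-\bar{x}_w\right)^4/n}{\left[\sum f_i w_i \left(x_i-\bar{x}_w\right)^2/n\right]^{2}}-3\]
Minimum¶
\[x_{\min} = \min_i(x_i)\]
Maximum¶
\[x_{\max} = \max_i(x_i)\]
Range¶
\[x_{\max}-x_{\min}\]
Coefficient of Variation¶
\[\frac{s_w}{\bar{x}_w} \quad \text{for } \bar{x}_w \neq 0\]
Median¶
\[\operatorname{median}\{x_i\} = \begin{cases} \text{middle } x_i \text{ after sorting} & \text{if } n \text{ is odd} \\ \text{average of the two middle } x_i \text{ after sorting} & \text{if } n \text{ is even} \end{cases}\]
Median Absolute Deviation¶
\[\mathrm{MAD} = \operatorname{median}\{\,|x_i-\operatorname{median}\{x_j\}|\,\}\]
Simple Robust Estimate of Scale¶
\[\frac{\mathrm{MAD}}{\Phi^{-1}(3/4)}\]
where \(\Phi^{-1}(3/4)\approx 0.6745\) is the inverse of the standard normal distribution function evaluated at 3/4. This standardizes MAD in order to make the scale estimate consistent at the normal distribution for estimating the standard deviation (Huber 1981, pp. 107-108).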
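The moment-based statistics can be checked in plain Python against the Example 1 output. The sketch below assumes unit weights and frequencies and the usual forms: the variance divides by n - 1, and the skewness and kurtosis are built from central moments divided by n. These assumed forms reproduce the printed values for the first variable. The variable names are illustrative only.

```python
import math
from statistics import NormalDist, median

# First column of the Example 1 data.
x = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
n = len(x)

mean = sum(x) / n


def central_sum(power):
    """Sum of (x_i - mean)**power over all observations."""
    return sum((xi - mean) ** power for xi in x)


variance = central_sum(2) / (n - 1)
std_dev = math.sqrt(variance)
skewness = (central_sum(3) / n) / (central_sum(2) / n) ** 1.5
kurtosis = (central_sum(4) / n) / (central_sum(2) / n) ** 2 - 3.0
coef_var = std_dev / mean if mean != 0 else 0.0

# Robust statistics (rows 14 to 16 when medianAndScale is requested).
med = median(x)
mad = median(abs(xi - med) for xi in x)
robust_scale = mad / NormalDist().inv_cdf(0.75)

print("%.3f %.3f %.3f" % (mean, variance, std_dev))       # 7.462 34.603 5.882
print("%.3f %.3f %.3f" % (skewness, kurtosis, coef_var))  # 0.688 0.075 0.788
print(med, mad)                                           # 7.0 4.0
```

The first two printed lines match rows 0 to 4 and row 8 of the Example 1 output for variable 1.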
Examples¶
Example 1¶
Data from Draper and Smith (1981) are used in this example, which includes 5 variables and 13 observations.
from pyimsl.stat.simpleStatistics import simpleStatistics
from pyimsl.stat.writeMatrix import writeMatrix

# Data matrix: 13 observations (rows) on 5 variables (columns)
x = [[7., 26., 6., 60., 78.5],
     [1., 29., 15., 52., 74.3],
     [11., 56., 8., 20., 104.3],
     [11., 31., 8., 47., 87.6],
     [7., 52., 6., 33., 95.9],
     [11., 55., 9., 22., 109.2],
     [3., 71., 17., 6., 102.7],
     [1., 31., 22., 44., 72.5],
     [2., 54., 18., 22., 93.1],
     [21., 47., 4., 26., 115.9],
     [1., 40., 23., 34., 83.8],
     [11., 66., 9., 12., 113.3],
     [10., 68., 8., 12., 109.4]]

# One label per row of the returned 14 x nVariables matrix
rowLabels = ["means", "variances", "std. dev",
             "skewness", "kurtosis",
             "minima", "maxima", "ranges", "C.V.",
             "counts", "lower mean",
             "upper mean", "lower var", "upper var"]

# ido defaults to 0, so all observations are processed in this one call
stats = simpleStatistics(x)
writeMatrix("* * * Statistics * * *", stats,
            rowLabels=rowLabels, writeFormat="%7.3f")
Output¶
* * * Statistics * * *
1 2 3 4 5
means 7.462 48.154 11.769 30.000 95.423
variances 34.603 242.141 41.026 280.167 226.314
std. dev 5.882 15.561 6.405 16.738 15.044
skewness 0.688 -0.047 0.611 0.330 -0.195
kurtosis 0.075 -1.323 -1.079 -1.014 -1.342
minima 1.000 26.000 4.000 6.000 72.500
maxima 21.000 71.000 23.000 60.000 115.900
ranges 20.000 45.000 19.000 54.000 43.400
C.V. 0.788 0.323 0.544 0.558 0.158
counts 13.000 13.000 13.000 13.000 13.000
lower mean 3.907 38.750 7.899 19.885 86.332
upper mean 11.016 57.557 15.640 40.115 104.514
lower var 17.793 124.512 21.096 144.065 116.373
upper var 94.289 659.816 111.792 763.434 616.688
Example 2¶
Continuing with the Example 1 data, the example below invokes the simpleStatistics function using values of ido greater than 0.
from pyimsl.stat.simpleStatistics import simpleStatistics
from pyimsl.stat.writeMatrix import writeMatrix

# The Example 1 data split into three blocks of rows
x1 = [[7., 26., 6., 60., 78.5],
      [1., 29., 15., 52., 74.3]]
x2 = [[11., 56., 8., 20., 104.3],
      [11., 31., 8., 47., 87.6],
      [7., 52., 6., 33., 95.9],
      [11., 55., 9., 22., 109.2],
      [3., 71., 17., 6., 102.7],
      [1., 31., 22., 44., 72.5],
      [2., 54., 18., 22., 93.1],
      [21., 47., 4., 26., 115.9]]
x3 = [[1., 40., 23., 34., 83.8],
      [11., 66., 9., 12., 113.3],
      [10., 68., 8., 12., 109.4]]

rowLabels = ["means", "variances", "std. dev",
             "skewness", "kurtosis",
             "minima", "maxima", "ranges", "C.V.",
             "counts", "lower mean",
             "upper mean", "lower var", "upper var"]

# First, intermediate, and final invocations (ido = 1, 2, 3)
stats = simpleStatistics(x1, ido=1)
stats = simpleStatistics(x2, ido=2)
stats = simpleStatistics(x3, ido=3)
writeMatrix("* * * Statistics * * *", stats,
            rowLabels=rowLabels, writeFormat="%7.3f")
Output¶
* * * Statistics * * *
1 2 3 4 5
means 7.462 48.154 11.769 30.000 95.423
variances 34.603 242.141 41.026 280.167 226.314
std. dev 5.882 15.561 6.405 16.738 15.044
skewness 0.688 -0.047 0.611 0.330 -0.195
kurtosis 0.075 -1.323 -1.079 -1.014 -1.342
minima 1.000 26.000 4.000 6.000 72.500
maxima 21.000 71.000 23.000 60.000 115.900
ranges 20.000 45.000 19.000 54.000 43.400
C.V. 0.788 0.323 0.544 0.558 0.158
counts 13.000 13.000 13.000 13.000 13.000
lower mean 3.907 38.750 7.899 19.885 86.332
upper mean 11.016 57.557 15.640 40.115 104.514
lower var 17.793 124.512 21.096 144.065 116.373
upper var 94.289 659.816 111.792 763.434 616.688
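One way to see why blockwise processing can give the same answers as a single call is to carry running moments from block to block. The sketch below uses a generic streaming technique (a pairwise merge of count, mean, and sum of squared deviations). It is not a description of how simpleStatistics is implemented, and `RunningMoments` is a hypothetical class. It is applied to the first variable of the three blocks used above.

```python
class RunningMoments:
    """Count, mean, and sum of squared deviations, updated block by block."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, block):
        """Merge the values of one block into the running totals."""
        nb = len(block)
        if nb == 0:
            return
        mean_b = sum(block) / nb
        m2_b = sum((v - mean_b) ** 2 for v in block)
        delta = mean_b - self.mean
        total = self.n + nb
        self.m2 += m2_b + delta ** 2 * self.n * nb / total
        self.mean += delta * nb / total
        self.n = total

    def variance(self):
        """Sample variance (divisor n - 1)."""
        return self.m2 / (self.n - 1)


# First variable of blocks x1, x2, and x3 from Example 2.
blocks = [[7., 1.],
          [11., 11., 7., 11., 3., 1., 2., 21.],
          [1., 11., 10.]]

acc = RunningMoments()
for block in blocks:          # plays the role of ido = 1, 2, 3
    acc.update(block)

print("%.3f %.3f" % (acc.mean, acc.variance()))  # 7.462 34.603
```

Only three numbers are kept between blocks, so no block needs to stay in memory once it has been merged.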
Warning Errors¶
IMSLS_ROW_OF_X_CONTAINED_NAN |
At least one row of x contained NaN (a missing value). |
IMSLS_VAR_IN_X_CONTAINED_NAN |
At least one observation for a variable in x contained NaN (a missing value). Missing observations were excluded from calculations for those variables. |
IMSLS_CONSTANT_OBSERVATIONS |
The observations on variable(s) are constant. |
IMSLS_LESS_THAN_TWO_VALID_OBS |
Fewer than two valid observations are present. The corresponding statistics are set to NaN (not a number), except for the mean, which is not correct if no valid observations are present. |
IMSLS_VARIANCE_UNDERFLOW |
The variance for this variable underflows. Therefore, the variance and standard deviation are set to 0, and the skewness and kurtosis are set to NaN (not a number). |
IMSLS_NEGATIVE_VARIANCE |
The variance is negative for the variable. The corresponding confidence limits for the variance are set to NaN (not a number). |
IMSLS_NOT_ENOUGH_OBSERVATIONS |
Fewer than two valid observations are present for the variable. The corresponding statistics are set to NaN (not a number), except for the mean, which is not correct if no valid observations are present but is correct if one observation is present. |
IMSLS_MIN_GREATER_THAN_MAX |
The maximum value is less than the minimum value. The corresponding statistics are set to NaN (not a number). |
IMSLS_MAX_LESS_THAN_MIN |
The maximum value is less than the minimum value. The corresponding statistics are set to NaN (not a number). |
IMSLS_SUM_OF_WEIGHTS_ZERO |
The sum of the weights for the variable is zero. The statistics, except for the minima, maxima, ranges, and counts, are set to NaN (not a number). |
IMSLS_ZERO_SUM_OF_WEIGHTS |
The sum of the weights is zero. The statistics, except for the minima, maxima, ranges, and counts, are set to NaN (not a number). |
IMSLS_FOURTH_ORDER_UNDERFLOW |
Because the range of the variable is very small, the fourth-order moment for this variable underflows. Therefore, the kurtosis is set to NaN (not a number). |
IMSLS_HIGH_ORDER_UNDERFLOW |
Because the range of the variable is very small, the higher-order moments for this variable underflow. Therefore, the skewness and kurtosis are set to NaN (not a number). |
IMSLS_CHI_SQUARED_STAT_ERROR |
An error occurred in determining the chi-squared statistic. The lower confidence limit for the variance is set to NaN (not a number). |
Fatal Errors¶
IMSLS_BAD_IDO_6 |
ido = #. Initial allocations must be performed by invoking the function with ido = 1. |
IMSLS_BAD_IDO_7 |
ido = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with ido = 3. |
IMSLS_BAD_N_VARIABLES |
nVariables = #. The number of variables must be the same in separate function invocations. |
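The calling sequence implied by these errors can be summarized as a small state check. The sketch below is a hypothetical validator (`IdoSequence` is not part of pyimsl) that raises a ValueError where the library would report the corresponding fatal error. Treating ido = 0 like ido = 1 while an analysis is already in progress is an assumption.

```python
class IdoSequence:
    """Track whether a sequence of ido values and widths is legal."""

    def __init__(self):
        self.in_progress = False
        self.n_variables = None

    def check(self, ido, n_variables):
        if ido not in (0, 1, 2, 3):
            raise ValueError("ido must be 0, 1, 2, or 3")
        if ido in (2, 3) and not self.in_progress:
            # Mirrors IMSLS_BAD_IDO_6: no prior call with ido = 1.
            raise ValueError("call with ido = 1 first")
        if ido in (0, 1) and self.in_progress:
            # Mirrors IMSLS_BAD_IDO_7 (the ido = 0 case is an assumption).
            raise ValueError("finish the previous analysis with ido = 3")
        if self.in_progress and n_variables != self.n_variables:
            # Mirrors IMSLS_BAD_N_VARIABLES.
            raise ValueError("nVariables changed between invocations")
        if ido == 1:
            self.in_progress, self.n_variables = True, n_variables
        elif ido == 3:
            self.in_progress, self.n_variables = False, None


seq = IdoSequence()
for ido in (1, 2, 3):         # the legal pattern used in Example 2
    seq.check(ido, n_variables=5)
print(seq.in_progress)        # False
```

A wrapper could run such a check before each call so that sequencing mistakes surface as ordinary Python exceptions.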