simpleStatistics

Computes basic univariate statistics.

Synopsis

simpleStatistics (x)

Required Arguments

float x[[]] (Input)
Array of size nObservations × nVariables containing the data matrix.

Return Value

An array containing some simple statistics for each of the columns in x. If median and medianAndScale are not used as optional arguments, the size of the matrix is 14 × nVariables. The columns of this matrix correspond to the columns of x, and the rows contain the following statistics:

Row Statistic
0 mean
1 variance
2 standard deviation
3 coefficient of skewness
4 coefficient of excess (kurtosis)
5 minimum value
6 maximum value
7 range
8 coefficient of variation (when defined). If the coefficient of variation is not defined, 0 is returned.
9 number of observations (the counts)
10 lower confidence limit for the mean (assuming normality). The default is a 95-percent confidence interval.
11 upper confidence limit for the mean (assuming normality)
12 lower confidence limit for the variance (assuming normality). The default is a 95-percent confidence interval.
13 upper confidence limit for the variance (assuming normality)
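Because the result is addressed by row number, a small name-to-row mapping can make calling code easier to read. The sketch below is illustrative only: the names in `ROW` and the helper `stat_for` are hypothetical and not part of pyimsl, and `stats` is a stand-in list holding the first two columns of the Example 1 output rather than a real return value.

```python
# Hypothetical name-to-row mapping for the 14 default rows (not part of pyimsl).
ROW = {
    "mean": 0, "variance": 1, "std_dev": 2, "skewness": 3, "kurtosis": 4,
    "minimum": 5, "maximum": 6, "range": 7, "cv": 8, "count": 9,
    "mean_lower": 10, "mean_upper": 11, "var_lower": 12, "var_upper": 13,
}

# Stand-in for a returned matrix: rows are statistics, columns are variables.
stats = [
    [7.462, 48.154], [34.603, 242.141], [5.882, 15.561], [0.688, -0.047],
    [0.075, -1.323], [1.0, 26.0], [21.0, 71.0], [20.0, 45.0],
    [0.788, 0.323], [13.0, 13.0], [3.907, 38.750], [11.016, 57.557],
    [17.793, 124.512], [94.289, 659.816],
]


def stat_for(stats, name, variable):
    """Return the named statistic for one variable (0-based column index)."""
    return stats[ROW[name]][variable]


# Confidence limits for the mean of the first variable.
mean_ci = (stat_for(stats, "mean_lower", 0), stat_for(stats, "mean_upper", 0))
print(mean_ci)
```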

Optional Arguments

confidenceMeans, float (Input)
Confidence level for a two-sided interval estimate of the means (assuming normality) in percent. Argument confidenceMeans must be between 0.0 and 100.0 and is often 90.0, 95.0, or 99.0. For a one-sided confidence interval with confidence level c, set confidenceMeans = \(100.0-2(100-c)\). If confidenceMeans is not specified, a 95-percent confidence interval is computed.
confidenceVariances, float (Input)
The confidence level for a two-sided interval estimate of the variances (assuming normality) in percent. The confidence intervals are symmetric in probability (rather than in length). For a one-sided confidence interval with confidence level c, set confidenceVariances = \(100.0-2(100-c)\). If confidenceVariances is not specified, a 95-percent confidence interval is computed.
ido, int (Input)

Processing option.

The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input during one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of simpleStatistics; with this option, it is not a requirement that all observations be memory resident, thus enabling one to handle large data sets.

ido Action
0 This is the only invocation; all the data are input at once. (Default)
1 This is the first invocation with this data; additional calls will be made. Initialization and updating for the nObservations observations of x will be performed.

2 This is an intermediate invocation; updating for the nObservations observations of x will be performed.
3 This is the final invocation of this function. Updating for the data in x and wrap-up computations are performed. Workspace is released. No further invocations of simpleStatistics with ido greater than 1 should be made without first invoking simpleStatistics with ido = 1.

Default: ido = 0

median, or

medianAndScale

Exactly one of these optional arguments can be specified in order to indicate the additional simple robust statistics to be computed. If median is specified, the medians are computed and stored in one additional row (row number 14) in the returned matrix of simple statistics. If medianAndScale is specified, the medians, the medians of the absolute deviations from the medians, and a simple robust estimate of scale are computed, then stored in three additional rows (rows 14, 15, and 16) in the returned matrix of simple statistics.

median or medianAndScale can be specified only when ido is equal to 0.

missingListwise, or

missingElementwise
If missingElementwise is specified, all nonmissing data for a variable are used in computing the statistics for that variable. If missingListwise is specified and an observation (row of x) contains a missing value, the observation is excluded from computations for all variables. The default is missingListwise. In either case, if weights and/or frequencies are specified and the value of the weight and/or frequency is missing, the observation is excluded from computations for all variables.
frequencies, float[] (Input)

Array of length nObservations containing the frequency for each observation.

Default: Each observation has a frequency of 1

weights, float[] (Input)

Array of length nObservations containing the weight for each observation.

Default: Each observation has a weight of 1
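The difference between missingListwise and missingElementwise can be seen on a tiny data set. The following is a plain-Python sketch of the two exclusion rules as described above, applied to column means only; it does not call pyimsl, and the helper names are hypothetical.

```python
import math


def listwise_means(rows):
    """Drop every row that contains a NaN, then average each column."""
    kept = [r for r in rows if not any(math.isnan(v) for v in r)]
    n_cols = len(rows[0])
    return [sum(r[j] for r in kept) / len(kept) for j in range(n_cols)]


def elementwise_means(rows):
    """For each column, average whatever values in it are not NaN."""
    means = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(col) / len(col))
    return means


nan = float("nan")
data = [[1.0, 10.0],
        [5.0, nan],
        [3.0, 30.0]]

# Listwise: the second row is excluded for both variables.
print(listwise_means(data))
# Elementwise: the second row is excluded only for the second variable.
print(elementwise_means(data))
```

With listwise exclusion the first variable loses the value 5.0 even though it is present, so the two rules give different means for that column.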

Description

For the data in each column of x, simpleStatistics computes the sample mean, variance, minimum, maximum, and other basic statistics. This function also computes confidence intervals for the mean and variance (under the hypothesis that the sample is from a normal population).
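The confidence limits are two-sided by default. The conversion given under confidenceMeans and confidenceVariances for obtaining a one-sided limit is simple arithmetic, sketched below with a hypothetical helper name: a one-sided limit at level c is an endpoint of the equal-tailed two-sided interval that leaves 100 - c percent in each tail.

```python
def two_sided_level_for_one_sided(c):
    """Two-sided level (percent) whose endpoints are one-sided limits at level c."""
    return 100.0 - 2.0 * (100.0 - c)


# A one-sided 95-percent limit corresponds to a two-sided 90-percent interval.
print(two_sided_level_for_one_sided(95.0))
```

The returned value is what would be passed as confidenceMeans or confidenceVariances.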

Frequencies are interpreted as multiple occurrences of the other values in the observations. In other words, a row of x with a frequency variable having a value of 2 has the same effect as two rows with frequencies of 1. The total of the frequencies is used in computing all the statistics based on moments (mean, variance, skewness, and kurtosis). Weights are not viewed as replication factors. The sum of the weights is used only in computing the mean (the weighted mean is used in computing the central moments). Both weights and frequencies can be 0, but neither can be negative. In general, a 0 frequency means that the row is to be eliminated from the analysis; no further processing or error checking is done on the row. A weight of 0 results in the row being counted, and updates are made of the statistics.
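The replication interpretation of frequencies can be checked directly with the weighted-mean formula used in this section. The sketch below is plain Python rather than a pyimsl call, and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(x, f=None, w=None):
    """Mean as sum(f*w*x) / sum(f*w); f and w default to 1 for every row."""
    f = [1.0] * len(x) if f is None else f
    w = [1.0] * len(x) if w is None else w
    num = sum(fi * wi * xi for fi, wi, xi in zip(f, w, x))
    den = sum(fi * wi for fi, wi in zip(f, w))
    return num / den


x = [2.0, 4.0, 9.0]

# A frequency of 2 on the middle row behaves like listing that row twice.
with_frequency = weighted_mean(x, f=[1.0, 2.0, 1.0])
replicated = weighted_mean([2.0, 4.0, 4.0, 9.0])
print(with_frequency, replicated)

# A frequency of 0 removes the row from the computation.
dropped = weighted_mean(x, f=[1.0, 0.0, 1.0])
print(dropped)
```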

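The definitions of some of the statistics are given below in terms of a single variable x of which the i-th datum is \(x_i\), with frequency \(f_i\) and weight \(w_i\); \(n = \Sigma f_i\) denotes the total of the frequencies.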

Mean

\[\overline{x}_w = \frac{\Sigma f_i w_i x_i}{\Sigma f_i w_i}\]

Variance

\[s_w^2 = \frac {\Sigma f_i w_i \Sigma f_i w_i \left(x_i - \overline{x}_w\right)^2} {\left(\Sigma f_i w_i\right)^2 - \Sigma f_i^2 w_i^2}\]
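As a check on the notation, the variance formula above can be transcribed literally into plain Python. This is a sketch, not the library's implementation; `weighted_variance` is a hypothetical name, and the demonstration uses unit frequencies and weights, for which the expression reduces to the usual sample variance with divisor n - 1.

```python
import statistics


def weighted_variance(x, f=None, w=None):
    """Literal transcription of the variance formula; f and w default to 1."""
    f = [1.0] * len(x) if f is None else f
    w = [1.0] * len(x) if w is None else w
    fw = [fi * wi for fi, wi in zip(f, w)]
    total = sum(fw)
    mean = sum(c * xi for c, xi in zip(fw, x)) / total
    ss = sum(c * (xi - mean) ** 2 for c, xi in zip(fw, x))
    return total * ss / (total ** 2 - sum(c * c for c in fw))


# First column of the Example 1 data.
col1 = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
print(round(weighted_variance(col1), 3))
print(round(statistics.variance(col1), 3))
```

Both printed values should agree with the first entry of the variances row in Example 1.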

Skewness

\[\frac {\Sigma f_i w_i \left(x_i - \overline{x}_w\right)^3/n} {\left[\Sigma f_i w_i \left(x_i - \overline{x}_w\right)^2/n\right]^{3/2}}\]

Excess or Kurtosis

\[\frac {\Sigma f_i w_i \left(x_i - \overline{x}_w\right)^4/n} {\left[\Sigma f_i w_i \left(x_i - \overline{x}_w\right)^2/n\right]^2} - 3\]
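The skewness and excess formulas can be sketched the same way. The code below assumes that n is the total of the frequencies, as the discussion of frequencies suggests; it is a plain-Python illustration with a hypothetical function name, not the library's code.

```python
def skewness_and_excess(x, f=None, w=None):
    """Return (skewness, excess kurtosis) following the formulas above."""
    f = [1.0] * len(x) if f is None else f
    w = [1.0] * len(x) if w is None else w
    fw = [fi * wi for fi, wi in zip(f, w)]
    n = sum(f)  # assumption: n is the total of the frequencies
    mean = sum(c * xi for c, xi in zip(fw, x)) / sum(fw)

    def moment(k):
        # k-th central moment about the weighted mean, divided by n
        return sum(c * (xi - mean) ** k for c, xi in zip(fw, x)) / n

    m2, m3, m4 = moment(2), moment(3), moment(4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# First column of the Example 1 data, with unit frequencies and weights.
col1 = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
skew, excess = skewness_and_excess(col1)
print(round(skew, 3), round(excess, 3))
```

Note that the second moment in these ratios is divided by n, not by n - 1 as in the variance row.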

Minimum

\[x_{\min} = \min\left(x_i\right)\]

Maximum

\[x_{\max} = \max\left(x_i\right)\]

Range

\[x_{\max} - x_{\min}\]

Coefficient of Variation

\[\frac{s_{\mathrm{w}}}{\overline{x}_w} \text{ for } \overline{x}_w \neq 0\]
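The range and the coefficient of variation follow directly from quantities already defined. The sketch below uses only the Python standard library, assumes unit frequencies and weights, and mirrors the convention from the table of returned rows that 0 is reported when the coefficient is not defined; `range_and_cv` is a hypothetical helper name.

```python
import statistics


def range_and_cv(x):
    """Return (range, coefficient of variation) for a list of values."""
    value_range = max(x) - min(x)
    mean = statistics.mean(x)
    # The coefficient is undefined for a zero mean; 0 is reported in that case.
    cv = statistics.stdev(x) / mean if mean != 0 else 0.0
    return value_range, cv


# First column of the Example 1 data.
col1 = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
value_range, cv = range_and_cv(col1)
print(value_range, round(cv, 3))
```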

Median

\[\begin{split}\mathrm{median} \left\{ x_i \right\} = \begin{cases} \text{middle } x_i \text{ after sorting if } n \text{ is odd} \\ \text{average of middle two } x_i \text{'s if } n \text{ is even} \\ \end{cases}\end{split}\]

Median Absolute Deviation

\[\mathrm{MAD} = \mathrm{median} \left\{ |x_i - \mathrm{median} \left\{x_j\right\}| \right\}\]

Simple Robust Estimate of Scale

\[\frac{\mathrm{MAD}}{\Phi^{-1}(3/4)}\]

where \(\Phi^{-1}(3/4)\approx 0.6745\) is the inverse of the standard normal distribution function evaluated at 3/4. This standardizes MAD so that the scale estimate is consistent for the standard deviation at the normal distribution (Huber 1981, pp. 107-108).
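The three robust statistics can be reproduced with the Python standard library. This is a sketch of the definitions above for unweighted data, not a call into pyimsl; `robust_summary` is a hypothetical name, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import statistics


def robust_summary(x):
    """Return (median, MAD, robust scale estimate) for a list of values."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    # Inverse standard normal distribution function at 3/4, about 0.6745.
    z75 = statistics.NormalDist().inv_cdf(0.75)
    return med, mad, mad / z75


# First column of the Example 1 data.
col1 = [7., 1., 11., 11., 7., 11., 3., 1., 2., 21., 1., 11., 10.]
med, mad, scale = robust_summary(col1)
print(med, mad, round(scale, 3))
```

For this column the robust scale estimate is close to the ordinary standard deviation, which is what the normalizing constant is designed to achieve for roughly normal data.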

Examples

Example 1

Data from Draper and Smith (1981) are used in this example, which includes 5 variables and 13 observations.

# Only the two pyimsl functions are needed; the data are plain Python lists.
from pyimsl.stat.simpleStatistics import simpleStatistics
from pyimsl.stat.writeMatrix import writeMatrix

x = [[7., 26., 6., 60., 78.5],
     [1., 29., 15., 52., 74.3],
     [11., 56., 8., 20., 104.3],
     [11., 31., 8., 47., 87.6],
     [7., 52., 6., 33., 95.9],
     [11., 55., 9., 22., 109.2],
     [3., 71., 17., 6., 102.7],
     [1., 31., 22., 44., 72.5],
     [2., 54., 18., 22., 93.1],
     [21., 47., 4., 26., 115.9],
     [1., 40., 23., 34., 83.8],
     [11., 66., 9., 12., 113.3],
     [10., 68., 8., 12., 109.4]]

rowLabels = ["means", "variances", "std. dev",
             "skewness", "kurtosis",
             "minima", "maxima", "ranges", "C.V.",
             "counts", "lower mean",
             "upper mean", "lower var", "upper var"]

stats = simpleStatistics(x)

writeMatrix("* * * Statistics * * *", stats,
            rowLabels=rowLabels, writeFormat="%7.3f")

Output

 
                * * * Statistics * * *
                  1        2        3        4        5
means         7.462   48.154   11.769   30.000   95.423
variances    34.603  242.141   41.026  280.167  226.314
std. dev      5.882   15.561    6.405   16.738   15.044
skewness      0.688   -0.047    0.611    0.330   -0.195
kurtosis      0.075   -1.323   -1.079   -1.014   -1.342
minima        1.000   26.000    4.000    6.000   72.500
maxima       21.000   71.000   23.000   60.000  115.900
ranges       20.000   45.000   19.000   54.000   43.400
C.V.          0.788    0.323    0.544    0.558    0.158
counts       13.000   13.000   13.000   13.000   13.000
lower mean    3.907   38.750    7.899   19.885   86.332
upper mean   11.016   57.557   15.640   40.115  104.514
lower var    17.793  124.512   21.096  144.065  116.373
upper var    94.289  659.816  111.792  763.434  616.688

Example 2

Continuing with the data from Example 1, this example invokes simpleStatistics with values of ido greater than 0, passing the observations in three blocks.

# Only the two pyimsl functions are needed; the data are plain Python lists.
from pyimsl.stat.simpleStatistics import simpleStatistics
from pyimsl.stat.writeMatrix import writeMatrix

x1 = [[7., 26., 6., 60., 78.5],
      [1., 29., 15., 52., 74.3]]
x2 = [[11., 56., 8., 20., 104.3],
      [11., 31., 8., 47., 87.6],
      [7., 52., 6., 33., 95.9],
      [11., 55., 9., 22., 109.2],
      [3., 71., 17., 6., 102.7],
      [1., 31., 22., 44., 72.5],
      [2., 54., 18., 22., 93.1],
      [21., 47., 4., 26., 115.9]]
x3 = [[1., 40., 23., 34., 83.8],
      [11., 66., 9., 12., 113.3],
      [10., 68., 8., 12., 109.4]]

rowLabels = ["means", "variances", "std. dev",
             "skewness", "kurtosis",
             "minima", "maxima", "ranges", "C.V.",
             "counts", "lower mean",
             "upper mean", "lower var", "upper var"]

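# Process the data in three blocks: ido=1 initializes, ido=2 updates,
# and ido=3 performs the final update and the wrap-up computations.
stats = simpleStatistics(x1, ido=1)
stats = simpleStatistics(x2, ido=2)
stats = simpleStatistics(x3, ido=3)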

writeMatrix("* * * Statistics * * *", stats,
            rowLabels=rowLabels, writeFormat="%7.3f")

Output

 
                * * * Statistics * * *
                  1        2        3        4        5
means         7.462   48.154   11.769   30.000   95.423
variances    34.603  242.141   41.026  280.167  226.314
std. dev      5.882   15.561    6.405   16.738   15.044
skewness      0.688   -0.047    0.611    0.330   -0.195
kurtosis      0.075   -1.323   -1.079   -1.014   -1.342
minima        1.000   26.000    4.000    6.000   72.500
maxima       21.000   71.000   23.000   60.000  115.900
ranges       20.000   45.000   19.000   54.000   43.400
C.V.          0.788    0.323    0.544    0.558    0.158
counts       13.000   13.000   13.000   13.000   13.000
lower mean    3.907   38.750    7.899   19.885   86.332
upper mean   11.016   57.557   15.640   40.115  104.514
lower var    17.793  124.512   21.096  144.065  116.373
upper var    94.289  659.816  111.792  763.434  616.688
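The output matches Example 1 because block-wise processing only needs a small running summary per variable. The sketch below shows one well-known way to merge block summaries for the mean and variance (a pairwise update of count, mean, and sum of squared deviations). It illustrates the idea only and is not a description of what simpleStatistics does internally; the function names are hypothetical.

```python
import statistics


def block_summary(values):
    """Count, mean, and sum of squared deviations for one block."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Combine two block summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# First column of x1, x2, and x3 from this example.
blocks = [[7., 1.],
          [11., 11., 7., 11., 3., 1., 2., 21.],
          [1., 11., 10.]]

state = block_summary(blocks[0])
for block in blocks[1:]:
    state = merge(state, block_summary(block))

n, mean, m2 = state
variance = m2 / (n - 1)
print(n, round(mean, 3), round(variance, 3))
```

Only the three numbers in `state` are carried between blocks, which is why the full data set never has to be in memory at once.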

Warning Errors

IMSLS_ROW_OF_X_CONTAINED_NAN At least one row of “x” contained NaN (a missing value).
IMSLS_VAR_IN_X_CONTAINED_NAN At least one observation for a variable in “x” contained NaN (a missing value). Missing observations were excluded from calculations for those variables.
IMSLS_CONSTANT_OBSERVATIONS The observations on variable(s) are constant.
IMSLS_LESS_THAN_TWO_VALID_OBS Fewer than two valid observations are present. The corresponding statistics are set to NaN (not a number), except for the mean, which is not correct if no valid observations are present.
IMSLS_VARIANCE_UNDERFLOW The variance for this variable underflows. Therefore, the variance and standard deviation are set to 0, and the skewness and kurtosis are set to NaN (not a number).
IMSLS_NEGATIVE_VARIANCE The variance is negative for the variable. The corresponding confidence limits for the variance are set to NaN (not a number).
IMSLS_NOT_ENOUGH_OBSERVATIONS Fewer than two valid observations are present for the variable. The corresponding statistics are set to NaN (not a number), except for the mean, which is not correct if no valid observations are present and is correct if one observation is present.
IMSLS_MIN_GREATER_THAN_MAX The maximum value is less than the minimum value. The corresponding statistics are set to NaN (not a number).
IMSLS_MAX_LESS_THAN_MIN The maximum value is less than the minimum value. The corresponding statistics are set to NaN (not a number).
IMSLS_SUM_OF_WEIGHTS_ZERO The sum of the weights for the variable is zero. The statistics, except for the minima, maxima, ranges and counts, are set to NaN (not a number).
IMSLS_ZERO_SUM_OF_WEIGHTS The sum of the weights is zero. The statistics, except for the minima, maxima, ranges and counts, are set to NaN (not a number).
IMSLS_FOURTH_ORDER_UNDERFLOW Since the range of the variable is very small, the fourth order moment for this variable underflows. Therefore, the kurtosis is set to NaN (not a number).
IMSLS_HIGH_ORDER_UNDERFLOW Since the range of the variable is very small, the higher order moments for this variable underflow. Therefore, the skewness and kurtosis are set to NaN (not a number).
IMSLS_CHI_SQUARED_STAT_ERROR An error occurred in determining the chi-squared statistic. The lower confidence limit for the variance is set to NaN (not a number).

Fatal Errors

IMSLS_BAD_IDO_6 “ido” = #. Initial allocations must be performed by invoking the function with “ido” = 1.
IMSLS_BAD_IDO_7 “ido” = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with “ido” = 3.
IMSLS_BAD_N_VARIABLES “nVariables” = #. The number of variables must be the same in separate function invocations.