sortData

Sorts observations by specified keys, with an option to tally cases into a multi-way frequency table.

Synopsis

sortData (x, nKeys)

Required Arguments

float x[[]] (Input/Output)
An nObservations × nVariables matrix containing the observations to be sorted. The sorted matrix is returned in x (exception: see optional argument passive).
int nKeys (Input)
Number of columns of x on which to sort. The first nKeys columns of x are used as the sorting keys (exception: see optional argument indicesKeys).

Optional Arguments

indicesKeys, int[] (Input)

Array of length nKeys giving the column numbers of x which are to be used in the sort.

Default: indicesKeys [ ] = 0, 1, …, nKeys − 1

frequencies, float[] (Input)

Array of length nObservations containing the frequency for each observation in x.

Default: frequencies [ ] = 1

ascending, or

descending
By default, or if ascending is specified, the sort is in ascending order. If descending is specified, the sort is in descending order.

active, or

passive
By default, or if active is specified, the sorted matrix is returned in x. If passive is specified, x is unchanged by sortData (i.e., x becomes input only).
permutation (Output)
An array of length nObservations specifying the rearrangement (permutation) of the observations (rows).
table, nValues, values, table (Output)

Argument nValues is an array of length nKeys containing in its i-th element (i = 0, 1, …, nKeys − 1), the number of levels or categories of the i-th classification variable (column).

Argument values is an array of length nValues [0] + nValues [1] + … + nValues [nKeys − 1] containing the values of the classification variables. The first nValues [0] elements of values contain the values for the first classification variable. The next nValues [1] contain the values for the second variable. The last nValues [nKeys − 1] positions contain the values for the last classification variable.

Argument table is an array of length nValues [0] × nValues [1] × … × nValues [nKeys − 1] containing the frequencies in the cells of the table to be fit.

Empty cells are included in table, and each element of table is nonnegative. The cells of table are sequenced so that the first variable cycles through its nValues [0] categories one time, the second variable cycles through its nValues [1] categories nValues [0] times, the third variable cycles through its nValues [2] categories nValues [0] × nValues [1] times, etc., up to the nKeys-th variable, which cycles through its nValues [nKeys − 1] categories nValues [0] × nValues [1] × … × nValues [nKeys − 2] times.
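
The cell sequence described above corresponds to a row-major layout, in which the last variable varies fastest. The following is a minimal sketch in plain NumPy, assuming table is held as a flat array; the names n_values, flat_table, and flat_index are illustrative only and are not part of sortData.

```python
import numpy as np

# Hypothetical level counts for three classification variables.
n_values = (2, 3, 4)

# Stand-in for a flat table of length 2 * 3 * 4. Each entry equals its own
# position so that the indexing is easy to check.
flat_table = np.arange(int(np.prod(n_values)))


def flat_index(levels, n_values):
    """Position in the flat table of the cell with the given level indices."""
    index = 0
    for level, count in zip(levels, n_values):
        index = index * count + level
    return index


cell = (1, 2, 3)
i = flat_index(cell, n_values)

# The same position obtained through NumPy's row-major helpers.
assert i == np.ravel_multi_index(cell, n_values)
assert flat_table.reshape(n_values)[cell] == flat_table[i]
```

Reshaping the flat array to the shape given by nValues is usually the most convenient way to address a cell by its level indices.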

listCells, nCells, listCells, tableUnbalanced (Output)

The number of nonempty cells is returned in nCells. Argument listCells is an array of size nCells × nKeys containing, in each row, the levels of the nKeys classification variables that describe one cell.

Argument tableUnbalanced is an array of length nCells containing the frequency for each cell.

n, nCells, n (Output)

The integer nCells returns the number of groups of different observations. A group contains observations (rows) in x that are equal with respect to the method of comparison.

Argument n is an array of length nCells containing the number of observations (rows) in each group.

The first n [0] rows of the sorted x are group number 1. The next n [1] rows of the sorted x are group number 2, etc. The last n [nCells − 1] rows of the sorted x are group number nCells.
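
The group layout above can be recovered from n with a cumulative sum. The sketch below uses plain NumPy with the sorted rows from Example 1 and the group sizes from Example 2; in that example n sums to 8 while x has 10 rows, so the two rows with NaN keys appear not to be counted and are left out here. The names ends and groups are illustrative.

```python
import numpy as np

# First eight rows of the sorted x from Example 1 (rows with NaN keys are
# omitted) and the group sizes n reported in Example 2.
sorted_x = np.array([[1.0, 1.0, 1.0],
                     [1.0, 1.0, 9.0],
                     [1.0, 1.0, 3.0],
                     [1.0, 1.0, 4.0],
                     [1.0, 1.0, 8.0],
                     [1.0, 2.0, 6.0],
                     [2.0, 1.0, 2.0],
                     [2.0, 2.0, 9.0]])
n = np.array([5, 1, 1, 1])

# Group g ends just before row ends[g]; splitting at the interior
# boundaries yields one block of rows per group.
ends = np.cumsum(n)
groups = np.split(sorted_x, ends[:-1])

for g, rows in enumerate(groups, start=1):
    print("group", g, "keys", rows[0, :2], "rows", len(rows))
```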

Description

Function sortData can perform a key sort, a tabulation of frequencies into a multi-way frequency table, or both.

Sorting

Function sortData sorts the rows of real matrix x using particular columns of x as the keys. The sort is algebraic with the first key as the most significant, the second key as the next most significant, etc. When x is sorted in ascending order, the resulting sorted array is such that the following is true:

  • For i = 0, 1, …, nObservations − 2, x [i] [indicesKeys [0]] ≤ x [i + 1] [indicesKeys [0]]
  • For k = 1, …, nKeys − 1, if x [i] [indicesKeys [j]] = x [i + 1] [indicesKeys [j]] for j = 0, 1, …, k − 1, then x [i] [indicesKeys [k]] ≤ x [i + 1] [indicesKeys [k]]
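
The two conditions above describe a lexicographic ordering on the key columns. As a minimal sketch, the same ordering can be produced with NumPy's lexsort, used here only as a stand-in: it is not the algorithm used by sortData, and the relative order of rows with equal keys may differ. The data and the names indices_keys, order, and sorted_x are illustrative.

```python
import numpy as np

x = np.array([[2.0, 1.0, 10.0],
              [1.0, 2.0, 20.0],
              [1.0, 1.0, 30.0],
              [2.0, 2.0, 40.0],
              [1.0, 1.0, 50.0]])
indices_keys = [0, 1]

# np.lexsort treats its LAST key as the most significant, so the key
# columns are passed in reverse order.
order = np.lexsort(tuple(x[:, j] for j in reversed(indices_keys)))
sorted_x = x[order]

# Check the ordering conditions: comparing adjacent rows on the key
# columns lexicographically, each row is <= the next.
keys = sorted_x[:, indices_keys]
for a, b in zip(keys[:-1], keys[1:]):
    assert tuple(a) <= tuple(b)
```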

The observations also can be sorted in descending order.

The rows of x containing the missing value code NaN in at least one of the specified columns are considered as an additional group. These rows are moved to the end of the sorted x.
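
The treatment of missing values can be mimicked in plain NumPy by sorting only the rows whose keys are complete and appending the remaining rows afterwards. This is a sketch under that assumption, using the data of Example 1; the order of the NaN rows among themselves, and of rows with equal keys, is not claimed to match sortData.

```python
import numpy as np

x = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 2.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 4.0],
              [2.0, 2.0, 5.0],
              [1.0, 2.0, 6.0],
              [1.0, 2.0, 7.0],
              [1.0, 1.0, 8.0],
              [2.0, 2.0, 9.0],
              [1.0, 1.0, 9.0]])
x[4, 1] = np.nan
x[6, 0] = np.nan
indices_keys = [0, 1]

keys = x[:, indices_keys]
has_nan = np.isnan(keys).any(axis=1)      # rows with NaN in any key column
clean = np.flatnonzero(~has_nan)
missing = np.flatnonzero(has_nan)

# Sort the complete rows lexicographically (last lexsort key is primary).
sub_order = np.lexsort(tuple(keys[clean, j]
                             for j in reversed(range(len(indices_keys)))))

# Complete rows first, in sorted order; rows with missing keys at the end.
order = np.concatenate((clean[sub_order], missing))
sorted_x = x[order]
```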

The sorting algorithm is based on a quicksort method given by Singleton (1969) with modifications by Griffin and Redish (1970) and Petro (1970).

Frequency Tabulation

Function sortData determines the distinct values in multivariate data and computes frequencies for the data. This function accepts the data in the matrix x, but performs computations only for the variables (columns) in the first nKeys columns of x (exception: see optional argument indicesKeys). In general, the variables for which frequencies are computed should be discrete; they should take on a relatively small number of different values. Continuous variables can be grouped first. The tableOneway function can be used to group variables and determine the frequencies of groups.

When table is specified, sortData fills the vector values with the unique values of the variables and tallies the number of unique values of each variable in the vector nValues. Each combination of one value from each variable forms a cell in a multi-way table. The frequencies of these cells are entered in table so that the first variable cycles through its values exactly once, and the last variable cycles through its values most rapidly. Some cells may not correspond to any observations in the data; in other words, “missing cells” are included in table and have a value of 0.
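
The tabulation described above can be sketched in plain NumPy. This is an illustration of the output layout only, not the implementation used by sortData; the function tally_table and the sample data are hypothetical.

```python
import numpy as np


def tally_table(x, n_keys):
    """Return (n_values, values, table) for the first n_keys columns of x."""
    values, level_index = [], []
    for j in range(n_keys):
        # Distinct values of column j and, per row, the index of its level.
        v, inv = np.unique(x[:, j], return_inverse=True)
        values.append(v)
        level_index.append(inv)
    n_values = [len(v) for v in values]

    # One counter per cell, including cells that never occur (left at 0).
    table = np.zeros(n_values, dtype=int)
    np.add.at(table, tuple(level_index), 1)

    # Row-major flattening makes the last variable cycle most rapidly.
    return n_values, np.concatenate(values), table.ravel()


x = np.array([[0.5, 1.5],
              [1.5, 3.5],
              [0.5, 3.5],
              [1.5, 2.5],
              [0.5, 1.5]])
n_values, values, table = tally_table(x, 2)
print(n_values)   # levels per variable
print(values)     # values of variable 1, then values of variable 2
print(table)      # cell frequencies, empty cells included
```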

When listCells is specified, the frequency of each cell is entered in tableUnbalanced so that the first variable cycles through its values exactly once and the last variable cycles through its values most rapidly. All cells have a frequency of at least 1, i.e., there is no “missing cell.” The listCells array can be considered “parallel” to tableUnbalanced because row i of listCells is the set of nKeys values that describes the cell for which element i of tableUnbalanced contains the corresponding frequency.
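
A comparable list of nonempty cells can be sketched with NumPy's unique applied to whole rows. This is an assumption-laden illustration: it stores the key values of each cell, and the names list_cells, table_unbalanced, and n_cells simply mirror the argument names above.

```python
import numpy as np

x = np.array([[0.5, 1.5],
              [1.5, 3.5],
              [0.5, 3.5],
              [1.5, 2.5],
              [0.5, 1.5]])
n_keys = 2

# Distinct key combinations, sorted lexicographically, with their counts.
# Only combinations that occur are listed, so every count is at least 1.
list_cells, table_unbalanced = np.unique(x[:, :n_keys], axis=0,
                                         return_counts=True)
n_cells = len(list_cells)

# Row i of list_cells describes the cell counted in table_unbalanced[i].
for cell, count in zip(list_cells, table_unbalanced):
    print(cell, count)
```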

Examples

Example 1

The rows of a 10 × 3 matrix x are sorted in ascending order using Columns 0 and 1 as the keys. There are two missing values (NaNs) in the keys. The observations containing these values are moved to the end of the sorted array.

from numpy import *
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix

n_keys = 2
x = array([[1.0, 1.0, 1.0],
           [2.0, 1.0, 2.0],
           [1.0, 1.0, 3.0],
           [1.0, 1.0, 4.0],
           [2.0, 2.0, 5.0],
           [1.0, 2.0, 6.0],
           [1.0, 2.0, 7.0],
           [1.0, 1.0, 8.0],
           [2.0, 2.0, 9.0],
           [1.0, 1.0, 9.0]])
# machine(6) supplies the missing value code NaN.
x[4][1] = machine(6)
x[6][0] = machine(6)
sortData(x, n_keys)
writeMatrix("Sorted x", x)

Output

 
                Sorted x
              1            2            3
 1            1            1            1
 2            1            1            9
 3            1            1            3
 4            1            1            4
 5            1            1            8
 6            1            2            6
 7            2            1            2
 8            2            2            9
 9  ...........            2            7
10            2  ...........            5

Example 2

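This example uses the same data as the previous example. Because passive is specified, x is left unchanged. The permutation of the rows is output in the array permutation, and the number of rows in each group is output in n.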

from numpy import *
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix

n_keys = 2
x = array([[1.0, 1.0, 1.0],
           [2.0, 1.0, 2.0],
           [1.0, 1.0, 3.0],
           [1.0, 1.0, 4.0],
           [2.0, 2.0, 5.0],
           [1.0, 2.0, 6.0],
           [1.0, 2.0, 7.0],
           [1.0, 1.0, 8.0],
           [2.0, 2.0, 9.0],
           [1.0, 1.0, 9.0]])
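# machine(6) supplies the missing value code NaN.
x[4][1] = machine(6)
x[6][0] = machine(6)
# Containers that sortData fills with its optional outputs.
n = {}
permutation = []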
sortData(x, n_keys, passive=True,
         permutation=permutation, n=n)
writeMatrix("Unchanged x", x)
writeMatrix("Permutation", permutation, writeFormat="%10i")
nn = n["n"]
writeMatrix("n", nn, writeFormat="%10i")

Output

 
               Unchanged x
              1            2            3
 1            1            1            1
 2            2            1            2
 3            1            1            3
 4            1            1            4
 5            2  ...........            5
 6            1            2            6
 7  ...........            2            7
 8            1            1            8
 9            2            2            9
10            1            1            9
 
                              Permutation
         1           2           3           4           5           6
         0           9           2           3           7           5
 
         7           8           9          10
         1           8           6           4
 
                       n
         1           2           3           4
         5           1           1           1
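
The outputs of Examples 1 and 2 can be cross-checked without calling sortData. The sketch below assumes that permutation [i] is the zero-based index of the original row that ends up in position i of the sorted matrix, which is consistent with the two outputs shown; np.nan stands in for the value returned by machine(6).

```python
import numpy as np

# Unchanged x from Example 2.
x = np.array([[1.0, 1.0, 1.0],
              [2.0, 1.0, 2.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 4.0],
              [2.0, np.nan, 5.0],
              [1.0, 2.0, 6.0],
              [np.nan, 2.0, 7.0],
              [1.0, 1.0, 8.0],
              [2.0, 2.0, 9.0],
              [1.0, 1.0, 9.0]])

# Permutation printed in Example 2.
permutation = np.array([0, 9, 2, 3, 7, 5, 1, 8, 6, 4])

# Sorted x printed in Example 1.
expected = np.array([[1.0, 1.0, 1.0],
                     [1.0, 1.0, 9.0],
                     [1.0, 1.0, 3.0],
                     [1.0, 1.0, 4.0],
                     [1.0, 1.0, 8.0],
                     [1.0, 2.0, 6.0],
                     [2.0, 1.0, 2.0],
                     [2.0, 2.0, 9.0],
                     [np.nan, 2.0, 7.0],
                     [2.0, np.nan, 5.0]])

# Indexing the unchanged matrix by the permutation reproduces the sorted one.
sorted_x = x[permutation]
np.testing.assert_array_equal(sorted_x, expected)  # NaNs compare as equal here
```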

Example 3

The table of frequencies for a data matrix of size 30 × 2 is output in the array table.

from __future__ import print_function
from numpy import *
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix

n_observations = 30
n_variables = 2
n_keys = 2
x = array([[0.5, 1.5],
           [1.5, 3.5],
           [0.5, 3.5],
           [1.5, 2.5],
           [1.5, 3.5],
           [1.5, 4.5],
           [0.5, 1.5],
           [1.5, 3.5],
           [3.5, 6.5],
           [2.5, 3.5],
           [2.5, 4.5],
           [3.5, 6.5],
           [1.5, 2.5],
           [2.5, 4.5],
           [0.5, 3.5],
           [1.5, 2.5],
           [1.5, 3.5],
           [0.5, 3.5],
           [0.5, 1.5],
           [0.5, 2.5],
           [2.5, 5.5],
           [1.5, 2.5],
           [1.5, 3.5],
           [1.5, 4.5],
           [4.5, 5.5],
           [2.5, 4.5],
           [0.5, 3.5],
           [1.5, 2.5],
           [0.5, 2.5],
           [2.5, 5.5]])
table = {}
sortData(x, n_keys, passive=True, table=table)
writeMatrix("Unchanged x", x)
nValues = table["nValues"]
n_rows = nValues[0]
n_columns = nValues[1]
print("n_rows: ", n_rows)
print("n_columns: ", n_columns)
tableValues = table["values"]
rowValues = tableValues[0:n_rows]
colValues = tableValues[n_rows:]
writeMatrix("Row values", rowValues, writeFormat="%10.1f")
writeMatrix("Column values", colValues, writeFormat="%10.1f")
tabtab = table["table"]
writeMatrix("Table", tabtab, writeFormat="%8i")

Output

n_rows:  5
n_columns:  6
 
         Unchanged x
              1            2
 1          0.5          1.5
 2          1.5          3.5
 3          0.5          3.5
 4          1.5          2.5
 5          1.5          3.5
 6          1.5          4.5
 7          0.5          1.5
 8          1.5          3.5
 9          3.5          6.5
10          2.5          3.5
11          2.5          4.5
12          3.5          6.5
13          1.5          2.5
14          2.5          4.5
15          0.5          3.5
16          1.5          2.5
17          1.5          3.5
18          0.5          3.5
19          0.5          1.5
20          0.5          2.5
21          2.5          5.5
22          1.5          2.5
23          1.5          3.5
24          1.5          4.5
25          4.5          5.5
26          2.5          4.5
27          0.5          3.5
28          1.5          2.5
29          0.5          2.5
30          2.5          5.5
 
                        Row values
         1           2           3           4           5
       0.5         1.5         2.5         3.5         4.5
 
                             Column values
         1           2           3           4           5           6
       1.5         2.5         3.5         4.5         5.5         6.5
 
                            Table
          1         2         3         4         5         6
1         3         2         4         0         0         0
2         0         5         5         2         0         0
3         0         0         1         3         2         0
4         0         0         0         0         0         2
5         0         0         0         0         1         0