sortData¶
Sorts observations by specified keys, with option to tally cases into a multi-way frequency table.
Synopsis¶
sortData (x, nKeys)
Required Arguments¶
- float x[[]] (Input/Output)
  An nObservations × nVariables matrix containing the observations to be sorted. The sorted matrix is returned in x (exception: see optional argument passive).
- int nKeys (Input)
  Number of columns of x on which to sort. The first nKeys columns of x are used as the sorting keys (exception: see optional argument indicesKeys).
Optional Arguments¶
indicesKeys, int[] (Input)
  Array of length nKeys giving the column numbers of x which are to be used in the sort.
  Default: indicesKeys[ ] = 0, 1, …, nKeys − 1
frequencies, float[] (Input)
  Array of length nObservations containing the frequency for each observation in x.
  Default: frequencies[ ] = 1
ascending, or descending
  By default, or if ascending is specified, the sort is in ascending order. If descending is specified, the sort is in descending order.
active, or passive
  By default, or if active is specified, the sorted matrix is returned in x. If passive is specified, x is unchanged by sortData (i.e., x becomes input only).
permutation (Output)
  An array of length nObservations specifying the rearrangement (permutation) of the observations (rows).
table, nValues, values, table (Output)
  Argument nValues is an array of length nKeys containing in its i-th element (i = 0, 1, …, nKeys − 1) the number of levels or categories of the i-th classification variable (column).
  Argument values is an array of length nValues[0] + nValues[1] + … + nValues[nKeys − 1] containing the values of the classification variables. The first nValues[0] elements of values contain the values for the first classification variable. The next nValues[1] contain the values for the second variable. The last nValues[nKeys − 1] positions contain the values for the last classification variable.
  Argument table is an array of length nValues[0] × nValues[1] × … × nValues[nKeys − 1] containing the frequencies in the cells of the table to be fit.
  Empty cells are included in table, and each element of table is nonnegative. The cells of table are sequenced so that the first variable cycles through its nValues[0] categories one time, the second variable cycles through its nValues[1] categories nValues[0] times, the third variable cycles through its nValues[2] categories nValues[0] × nValues[1] times, etc., up to the nKeys-th variable, which cycles through its nValues[nKeys − 1] categories nValues[0] × nValues[1] × … × nValues[nKeys − 2] times.
listCells, nCells, listCells, tableUnbalanced (Output)
  The number of nonempty cells is returned by nCells. Argument listCells is an array of size nCells × nKeys containing, for each row, a list of the levels of the nKeys corresponding classification variables that describe a cell.
  Argument tableUnbalanced is an array of length nCells containing the frequency for each cell.
n, nCells, n (Output)
  The integer nCells returns the number of groups of different observations. A group contains observations (rows) in x that are equal with respect to the method of comparison.
  Argument n is an array of length nCells containing the number of observations (rows) in each group.
  The first n[0] rows of the sorted x are group number 1. The next n[1] rows of the sorted x are group number 2, etc. The last n[nCells − 1] rows of the sorted x are group number nCells.
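The cell sequencing described for table above amounts to row-major ordering of the cells. The following is a minimal sketch of that rule in plain Python, not a call into sortData; cell_position is a hypothetical helper, and zero-based level indices are assumed.

```python
import numpy as np


def cell_position(levels, n_values):
    """Position in the flat table of the cell with the given zero-based
    level index for each classification variable (hypothetical helper)."""
    position = 0
    for level, size in zip(levels, n_values):
        # Each later variable cycles faster, so earlier indices are
        # scaled by the number of categories of every later variable.
        position = position * size + level
    return position


n_values = (2, 3, 4)   # assumed category counts for three key columns
levels = (1, 2, 3)     # last category of every variable
print(cell_position(levels, n_values))
# NumPy's row-major flattening gives the same position.
print(int(np.ravel_multi_index(levels, n_values)))
```

Under this ordering a flat table of length nValues[0] × … × nValues[nKeys − 1] can be viewed as a multi-way array with a row-major reshape.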
Description¶
Function sortData can perform a key sort, a tabulation of
frequencies into a multi-way frequency table, or both.
Sorting¶
Function sortData sorts the rows of real matrix x using particular
columns of x as the keys. The sort is algebraic with the first key as the
most significant, the second key as the next most significant, etc. When
x is sorted in ascending order, the resulting sorted array is such that
the following is true:
- For i = 0, 1, …, nObservations − 2, x[i][indicesKeys[0]] ≤ x[i + 1][indicesKeys[0]]
- For k = 1, …, nKeys − 1, if x[i][indicesKeys[j]] = x[i + 1][indicesKeys[j]] for j = 0, 1, …, k − 1, then x[i][indicesKeys[k]] ≤ x[i + 1][indicesKeys[k]]
The observations also can be sorted in descending order.
The rows of x containing the missing value code NaN in at least one of
the specified columns are considered as an additional group. These rows are
moved to the end of the sorted x.
The sorting algorithm is based on a quicksort method given by Singleton (1969) with modifications by Griffin and Redish (1970) and Petro (1970).
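The ordering rules above can be illustrated with a short NumPy sketch. It is not the library's implementation: sort_rows_by_keys is a hypothetical helper, and it uses NumPy's stable lexsort rather than quicksort, so rows with identical keys may come out in a different order than sortData returns them.

```python
import numpy as np


def sort_rows_by_keys(x, indices_keys, descending=False):
    """Sort rows of x on the given key columns, first key most significant.
    Rows with NaN in any key column are moved to the end.
    Returns (sorted_x, permutation). Hypothetical helper, not pyimsl."""
    x = np.asarray(x, dtype=float)
    keys = x[:, indices_keys]
    has_nan = np.isnan(keys).any(axis=1)
    complete = np.flatnonzero(~has_nan)   # rows with no missing key
    missing = np.flatnonzero(has_nan)     # rows forming the extra NaN group
    k = keys[complete]
    if descending:
        k = -k
    # np.lexsort treats its LAST key as the primary one, so the key
    # columns are passed in reverse order.
    order = np.lexsort(k.T[::-1])
    permutation = np.concatenate([complete[order], missing])
    return x[permutation], permutation


x = np.array([[2.0, 1.0, 10.0],
              [1.0, 2.0, 20.0],
              [np.nan, 1.0, 30.0],
              [1.0, 1.0, 40.0]])
sorted_x, permutation = sort_rows_by_keys(x, [0, 1])
print(permutation)
print(sorted_x)
```

The returned permutation plays the same role as the optional argument permutation: row i of the sorted matrix is row permutation[i] of the input.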
Frequency Tabulation¶
Function sortData determines the distinct values in multivariate data
and computes frequencies for the data. This function accepts the data in the
matrix x, but performs computations only for the variables (columns) in
the first nKeys columns of x (exception: see optional argument
indicesKeys). In general, the variables for which frequencies should be
computed are discrete; they should take on a relatively small number of
different values. Variables that are continuous can be grouped first. The
tableOneway function can be used to group variables and
determine the frequencies of groups.
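As an illustration of grouping a continuous variable before tabulation, the sketch below uses NumPy's digitize as a stand-in; it does not call tableOneway, and the data, interval boundaries, and midpoint labels are all assumed for the example.

```python
import numpy as np

# Hypothetical continuous variable and assumed interval boundaries.
values = np.array([0.1, 0.4, 0.6, 0.9, 1.2])
edges = np.array([0.5, 1.0])

# codes[i] is the index of the interval containing values[i]:
# 0 for values < 0.5, 1 for 0.5 <= values < 1.0, 2 for values >= 1.0.
codes = np.digitize(values, edges)

# Replace each value by an assumed label for its interval so the column
# takes only a few distinct values and is suitable as a sorting key.
midpoints = np.array([0.25, 0.75, 1.25])
grouped = midpoints[codes]
print(codes)
print(grouped)
```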
When table is specified, sortData fills the vector values with
the unique values of the variables and records the number of unique values
of each variable in the vector nValues. Each combination of one value from
each variable forms a cell in a multi-way table. The frequencies of these
cells are entered in table so that the first variable cycles through its
values exactly once, and the last variable cycles through its values most
rapidly. Some cells may not correspond to any observations in the data;
such “missing cells” are included in table with a value of 0.
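The tabulation just described can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the library's code: multiway_table is a hypothetical helper, the key columns are assumed to contain no NaN, and the small data set is made up.

```python
import numpy as np


def multiway_table(x, indices_keys, frequencies=None):
    """Return (n_values, values, table) for the key columns of x.
    table is flat, with the last variable cycling most rapidly.
    Hypothetical helper; assumes no NaN in the key columns."""
    x = np.asarray(x, dtype=float)
    if frequencies is None:
        frequencies = np.ones(x.shape[0])
    levels, codes = [], []
    for col in indices_keys:
        # Sorted distinct values of the column and, for each row,
        # the index of its value among them.
        vals, inverse = np.unique(x[:, col], return_inverse=True)
        levels.append(vals)
        codes.append(inverse)
    n_values = [len(v) for v in levels]
    table = np.zeros(n_values)
    # Accumulate each observation's frequency into its cell;
    # cells that are never hit keep the value 0.
    np.add.at(table, tuple(codes), frequencies)
    return n_values, np.concatenate(levels), table.ravel()


x = np.array([[0.5, 1.5],
              [1.5, 3.5],
              [0.5, 3.5],
              [1.5, 3.5],
              [0.5, 1.5]])
n_values, values, table = multiway_table(x, [0, 1])
print(n_values)
print(values)
print(table)
```

The row-major ravel at the end is what makes the first variable cycle once and the last variable cycle fastest; the combination (1.5, 1.5) never occurs in this data, so its cell is 0.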
When listCells is specified, the frequency of each cell is entered in
tableUnbalanced so that the first variable cycles through its values
exactly once and the last variable cycles through its values most rapidly.
All cells have a frequency of at least 1, i.e., there is no “missing cell.”
The listCells array can be considered “parallel” to tableUnbalanced
because row i of listCells is the set of nKeys values that
describes the cell for which element i of tableUnbalanced contains the
corresponding frequency.
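The parallel layout of listCells and tableUnbalanced can be mimicked with NumPy's unique over rows. The sketch below is an illustration only: list_nonempty_cells is a hypothetical helper, every observation is assumed to have frequency 1, and the key columns are assumed to contain no NaN.

```python
import numpy as np


def list_nonempty_cells(x, indices_keys):
    """Return (n_cells, list_cells, table_unbalanced) for the key columns.
    Hypothetical helper; unit frequencies and no NaN are assumed."""
    keys = np.asarray(x, dtype=float)[:, indices_keys]
    # Distinct key rows come back sorted with the first column most
    # significant; counts[i] is the frequency of the cell in row i.
    list_cells, counts = np.unique(keys, axis=0, return_counts=True)
    return len(counts), list_cells, counts


x = np.array([[0.5, 1.5],
              [1.5, 3.5],
              [0.5, 3.5],
              [1.5, 3.5],
              [0.5, 1.5]])
n_cells, list_cells, table_unbalanced = list_nonempty_cells(x, [0, 1])
print(n_cells)
print(list_cells)
print(table_unbalanced)
```

Only combinations that actually occur are listed, so every frequency is at least 1.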
Examples¶
Example 1¶
The rows of a 10 × 3 matrix x are sorted in ascending order using
Columns 0 and 1 as the keys. There are two missing values (NaNs) in the
keys. The observations containing these values are moved to the end of the
sorted array.
from numpy import array
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
# Sort on the first two columns (columns 0 and 1)
n_keys = 2
x = array([[1.0, 1.0, 1.0],
           [2.0, 1.0, 2.0],
           [1.0, 1.0, 3.0],
           [1.0, 1.0, 4.0],
           [2.0, 2.0, 5.0],
           [1.0, 2.0, 6.0],
           [1.0, 2.0, 7.0],
           [1.0, 1.0, 8.0],
           [2.0, 2.0, 9.0],
           [1.0, 1.0, 9.0]])
# machine(6) supplies the missing value code (NaN)
x[4][1] = machine(6)
x[6][0] = machine(6)
# By default (active), the sorted matrix is returned in x
sortData(x, n_keys)
writeMatrix("Sorted x", x)
Output¶
Sorted x
1 2 3
1 1 1 1
2 1 1 9
3 1 1 3
4 1 1 4
5 1 1 8
6 1 2 6
7 2 1 2
8 2 2 9
9 ........... 2 7
10 2 ........... 5
Example 2¶
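This example uses the same data as the previous example. Because passive is
specified, x is left unchanged. The permutation of the rows is output in the
array permutation, and the number of rows in each group is output in n.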
from numpy import array
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
# Sort on the first two columns (columns 0 and 1)
n_keys = 2
x = array([[1.0, 1.0, 1.0],
           [2.0, 1.0, 2.0],
           [1.0, 1.0, 3.0],
           [1.0, 1.0, 4.0],
           [2.0, 2.0, 5.0],
           [1.0, 2.0, 6.0],
           [1.0, 2.0, 7.0],
           [1.0, 1.0, 8.0],
           [2.0, 2.0, 9.0],
           [1.0, 1.0, 9.0]])
# machine(6) supplies the missing value code (NaN)
x[4][1] = machine(6)
x[6][0] = machine(6)
# Containers that receive the optional outputs
n = {}
permutation = []
sortData(x, n_keys, passive=True,
         permutation=permutation, n=n)
writeMatrix("Unchanged x", x)
writeMatrix("Permutation", permutation, writeFormat="%10i")
# Group sizes are stored under the key "n"
nn = n["n"]
writeMatrix("n", nn, writeFormat="%10i")
Output¶
Unchanged x
1 2 3
1 1 1 1
2 2 1 2
3 1 1 3
4 1 1 4
5 2 ........... 5
6 1 2 6
7 ........... 2 7
8 1 1 8
9 2 2 9
10 1 1 9
Permutation
1 2 3 4 5 6
0 9 2 3 7 5
7 8 9 10
1 8 6 4
n
1 2 3 4
5 1 1 1
Example 3¶
The table of frequencies for a data matrix of size 30 × 2 is output in the
array table.
from __future__ import print_function
from numpy import array
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
n_observations = 30
n_variables = 2
n_keys = 2
x = array([[0.5, 1.5],
[1.5, 3.5],
[0.5, 3.5],
[1.5, 2.5],
[1.5, 3.5],
[1.5, 4.5],
[0.5, 1.5],
[1.5, 3.5],
[3.5, 6.5],
[2.5, 3.5],
[2.5, 4.5],
[3.5, 6.5],
[1.5, 2.5],
[2.5, 4.5],
[0.5, 3.5],
[1.5, 2.5],
[1.5, 3.5],
[0.5, 3.5],
[0.5, 1.5],
[0.5, 2.5],
[2.5, 5.5],
[1.5, 2.5],
[1.5, 3.5],
[1.5, 4.5],
[4.5, 5.5],
[2.5, 4.5],
[0.5, 3.5],
[1.5, 2.5],
[0.5, 2.5],
[2.5, 5.5]])
table = {}
sortData(x, n_keys, passive=True, table=table)
writeMatrix("Unchanged x", x)
nValues = table["nValues"]
n_rows = nValues[0]
n_columns = nValues[1]
print("n_rows: ", n_rows)
print("n_columns: ", n_columns)
tableValues = table["values"]
rowValues = tableValues[0:n_rows]
colValues = tableValues[n_rows:]
writeMatrix("Row values", rowValues, writeFormat="%10.1f")
writeMatrix("Column values", colValues, writeFormat="%10.1f")
tabtab = table["table"]
writeMatrix("Table", tabtab, writeFormat="%8i")
Output¶
n_rows: 5
n_columns: 6
Unchanged x
1 2
1 0.5 1.5
2 1.5 3.5
3 0.5 3.5
4 1.5 2.5
5 1.5 3.5
6 1.5 4.5
7 0.5 1.5
8 1.5 3.5
9 3.5 6.5
10 2.5 3.5
11 2.5 4.5
12 3.5 6.5
13 1.5 2.5
14 2.5 4.5
15 0.5 3.5
16 1.5 2.5
17 1.5 3.5
18 0.5 3.5
19 0.5 1.5
20 0.5 2.5
21 2.5 5.5
22 1.5 2.5
23 1.5 3.5
24 1.5 4.5
25 4.5 5.5
26 2.5 4.5
27 0.5 3.5
28 1.5 2.5
29 0.5 2.5
30 2.5 5.5
Row values
1 2 3 4 5
0.5 1.5 2.5 3.5 4.5
Column values
1 2 3 4 5 6
1.5 2.5 3.5 4.5 5.5 6.5
Table
1 2 3 4 5 6
1 3 2 4 0 0 0
2 0 5 5 2 0 0
3 0 0 1 3 2 0
4 0 0 0 0 0 2
5 0 0 0 0 1 0