sortData¶
Sorts observations by specified keys, with an option to tally cases into a multi-way frequency table.
Synopsis¶
sortData (x, nKeys)
Required Arguments¶
- float x[[]] (Input/Output)
  An nObservations × nVariables matrix containing the observations to be sorted. The sorted matrix is returned in x (exception: see optional argument passive).
- int nKeys (Input)
  Number of columns of x on which to sort. The first nKeys columns of x are used as the sorting keys (exception: see optional argument indicesKeys).
Optional Arguments¶
- indicesKeys, int[] (Input)
  Array of length nKeys giving the column numbers of x which are to be used in the sort.
  Default: indicesKeys[ ] = 0, 1, …, nKeys − 1
- frequencies, float[] (Input)
  Array of length nObservations containing the frequency for each observation in x.
  Default: frequencies[ ] = 1
- ascending, or descending
  By default, or if ascending is specified, the sort is in ascending order. If descending is specified, the sort is in descending order.
- active, or passive
  By default, or if active is specified, the sorted matrix is returned in x. If passive is specified, x is unchanged by sortData (i.e., x becomes input only).
- permutation (Output)
  An array of length nObservations specifying the rearrangement (permutation) of the observations (rows).
- table, nValues, values, table (Output)
  Argument nValues is an array of length nKeys containing in its i-th element (i = 0, 1, …, nKeys − 1) the number of levels or categories of the i-th classification variable (column).
  Argument values is an array of length nValues[0] + nValues[1] + … + nValues[nKeys − 1] containing the values of the classification variables. The first nValues[0] elements of values contain the values for the first classification variable. The next nValues[1] contain the values for the second variable. The last nValues[nKeys − 1] positions contain the values for the last classification variable.
  Argument table is an array of length nValues[0] × nValues[1] × … × nValues[nKeys − 1] containing the frequencies in the cells of the table to be fit.
  Empty cells are included in table, and each element of table is nonnegative. The cells of table are sequenced so that the first variable cycles through its nValues[0] categories one time, the second variable cycles through its nValues[1] categories nValues[0] times, the third variable cycles through its nValues[2] categories nValues[0] × nValues[1] times, etc., up to the nKeys-th variable, which cycles through its nValues[nKeys − 1] categories nValues[0] × nValues[1] × … × nValues[nKeys − 2] times.
- listCells, nCells, listCells, tableUnbalanced (Output)
  Number of nonempty cells is returned by nCells. Argument listCells is an array of size nCells × nKeys containing, for each row, a list of the levels of nKeys corresponding classification variables that describe a cell.
  Argument tableUnbalanced is an array of length nCells containing the frequency for each cell.
- n, nCells, n (Output)
  The integer nCells returns the number of groups of different observations. A group contains observations (rows) in x that are equal with respect to the method of comparison.
  Argument n is an array of length nCells containing the number of observations (rows) in each group.
  The first n[0] rows of the sorted x are group number 1. The next n[1] rows of the sorted x are group number 2, etc. The last n[nCells − 1] rows of the sorted x are group number nCells.
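The grouping reported through optional argument n can be illustrated with a short pure-Python sketch. It does not call pyimsl; the helper name group_counts_sketch is hypothetical. It uses the data from Example 2 below and assumes, as the output of that example suggests, that rows with a NaN in a key column are not counted in any group.

```python
import math
from itertools import groupby


def group_counts_sketch(x, n_keys):
    """Hypothetical helper: count rows that share the same first n_keys values."""
    keys = [tuple(row[:n_keys]) for row in x]
    # Assumption: rows with NaN in a key column are left out of the groups.
    complete = sorted(k for k in keys if not any(math.isnan(v) for v in k))
    # After sorting, equal keys are adjacent, so groupby yields one run per group.
    return [len(list(run)) for _, run in groupby(complete)]


nan = float("nan")
x = [[1.0, 1.0, 1.0], [2.0, 1.0, 2.0], [1.0, 1.0, 3.0], [1.0, 1.0, 4.0],
     [2.0, nan, 5.0], [1.0, 2.0, 6.0], [nan, 2.0, 7.0], [1.0, 1.0, 8.0],
     [2.0, 2.0, 9.0], [1.0, 1.0, 9.0]]

n = group_counts_sketch(x, 2)
print(len(n), n)  # 4 [5, 1, 1, 1]
```

The four counts agree with the n array printed in Example 2; their sum is 8 because two of the 10 rows have a missing key.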
Description¶
Function sortData can perform a key sort, a tabulation of frequencies into a
multi-way frequency table, or both.
Sorting¶
Function sortData sorts the rows of real matrix x using particular columns
of x as the keys. The sort is algebraic with the first key as the most
significant, the second key as the next most significant, etc. When x is
sorted in ascending order, the resulting sorted array is such that the
following is true:
- For i = 0, 1, …, nObservations − 2, x[i][indicesKeys[0]] ≤ x[i + 1][indicesKeys[0]]
- For k = 1, …, nKeys − 1, if x[i][indicesKeys[j]] = x[i + 1][indicesKeys[j]] for j = 0, 1, …, k − 1, then x[i][indicesKeys[k]] ≤ x[i + 1][indicesKeys[k]]
The observations also can be sorted in descending order.
The rows of x containing the missing value code NaN in at least one of the
specified columns are considered as an additional group. These rows are
moved to the end of the sorted x.
The sorting algorithm is based on a quicksort method given by Singleton (1969) with modifications by Griffin and Redish (1970) and Petro (1970).
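The ordering rules above can be sketched in pure Python. This is not the library's quicksort: the hypothetical helper sort_rows_sketch relies on Python's built-in stable sort, so rows with identical keys may come out in a different order than sortData produces (compare the tied rows in Example 1). It also assumes that rows with NaN keys go last for a descending sort as well.

```python
import math


def sort_rows_sketch(x, n_keys, indices_keys=None, descending=False):
    """Hypothetical helper: return (sorted_rows, permutation); x is not modified."""
    keys = list(indices_keys) if indices_keys is not None else list(range(n_keys))

    def key_of(i):
        return tuple(x[i][k] for k in keys)

    def has_nan(i):
        return any(math.isnan(v) for v in key_of(i))

    complete = [i for i in range(len(x)) if not has_nan(i)]
    missing = [i for i in range(len(x)) if has_nan(i)]

    # Tuples compare lexicographically, so the first key is the most significant.
    complete.sort(key=key_of, reverse=descending)

    # Rows with a NaN in any key column are moved to the end.
    permutation = complete + missing
    return [list(x[i]) for i in permutation], permutation


nan = float("nan")
x = [[2.0, 1.0, 10.0],
     [1.0, 2.0, 20.0],
     [nan, 1.0, 30.0],
     [1.0, 1.0, 40.0]]

sorted_x, permutation = sort_rows_sketch(x, 2)
print(permutation)  # [3, 1, 0, 2]
```

Returning a new list together with the permutation mirrors the combination of passive and permutation: the input stays untouched, and row permutation[i] of the input becomes row i of the sorted result.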
Frequency Tabulation¶
Function sortData determines the distinct values in multivariate data and
computes frequencies for the data. This function accepts the data in the
matrix x, but performs computations only for the variables (columns) in the
first nKeys columns of x (exception: see optional argument indicesKeys). In
general, the variables for which frequencies should be computed are
discrete; they should take on a relatively small number of different
values. Variables that are continuous can be grouped first. The
tableOneway function can be used to group variables and determine the
frequencies of groups.
When table is specified, sortData fills the vector values with the unique
values of the variables and tallies the number of unique values of each
variable in the vector nValues. Each combination of one value from each
variable forms a cell in a multi-way table. The frequencies of these cells
are entered in table so that the first variable cycles through its values
exactly once, and the last variable cycles through its values most rapidly.
Some cells may not correspond to any observations in the data; such
“missing cells” are included in table and have a value of 0.
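The cell ordering can be made concrete with a small pure-Python sketch. The helper tabulate_sketch is hypothetical and does not call pyimsl. It assumes the keys are the first n_keys columns, that no key contains NaN, that every observation has frequency 1, and that the levels are listed in ascending order (as the Row values and Column values in Example 3 are).

```python
from itertools import product


def tabulate_sketch(x, n_keys):
    """Hypothetical helper: return (n_values, values, table) for the key columns."""
    # Distinct levels of each classification variable, in ascending order.
    levels = [sorted({row[j] for row in x}) for j in range(n_keys)]
    n_values = [len(lv) for lv in levels]
    values = [v for lv in levels for v in lv]  # levels of variable 0, then 1, ...

    counts = {}
    for row in x:
        cell = tuple(row[:n_keys])
        counts[cell] = counts.get(cell, 0) + 1

    # product() varies its last argument fastest, so the last variable cycles
    # most rapidly and the first variable cycles exactly once.
    table = [counts.get(cell, 0) for cell in product(*levels)]
    return n_values, values, table


x = [[0.5, 1.5], [1.5, 3.5], [0.5, 3.5], [1.5, 2.5], [1.5, 3.5], [0.5, 1.5]]
n_values, values, table = tabulate_sketch(x, 2)
print(n_values)  # [2, 3]
print(values)    # [0.5, 1.5, 1.5, 2.5, 3.5]
print(table)     # [2, 0, 1, 0, 1, 2]
```

With two keys, the flat table read in chunks of n_values[1] gives the rows of a two-way table: here [2, 0, 1] for the level 0.5 and [0, 1, 2] for the level 1.5. The two zeros are the missing cells.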
When listCells is specified, the frequency of each cell is entered in
tableUnbalanced so that the first variable cycles through its values
exactly once and the last variable cycles through its values most rapidly.
All cells have a frequency of at least 1, i.e., there is no “missing cell.”
The listCells array can be considered “parallel” to tableUnbalanced because
row i of listCells is the set of nKeys values that describes the cell for
which element i of tableUnbalanced contains the corresponding frequency.
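The parallel layout of listCells and tableUnbalanced can be sketched in the same spirit. The helper list_cells_sketch is hypothetical, does not call pyimsl, and assumes the keys are the first n_keys columns with no NaN values and unit frequencies.

```python
from collections import Counter


def list_cells_sketch(x, n_keys):
    """Hypothetical helper: return (n_cells, list_cells, table_unbalanced)."""
    counts = Counter(tuple(row[:n_keys]) for row in x)
    # Sorting the observed cells lexicographically keeps the first variable
    # slowest and the last variable fastest; only nonempty cells appear.
    cells = sorted(counts)
    list_cells = [list(cell) for cell in cells]
    table_unbalanced = [counts[cell] for cell in cells]
    return len(cells), list_cells, table_unbalanced


x = [[0.5, 1.5], [1.5, 3.5], [0.5, 3.5], [1.5, 2.5], [1.5, 3.5], [0.5, 1.5]]
n_cells, list_cells, table_unbalanced = list_cells_sketch(x, 2)
print(n_cells)           # 4
print(list_cells)        # [[0.5, 1.5], [0.5, 3.5], [1.5, 2.5], [1.5, 3.5]]
print(table_unbalanced)  # [2, 1, 1, 2]
```

Row i of list_cells describes the cell whose frequency is table_unbalanced[i], and no entry is zero because empty cells are never listed.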
Examples¶
Example 1¶
The rows of a 10 × 3 matrix x are sorted in ascending order using Columns 0
and 1 as the keys. There are two missing values (NaNs) in the keys. The
observations containing these values are moved to the end of the sorted
array.
from numpy import array
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
n_keys = 2
x = array([[1.0, 1.0, 1.0],
[2.0, 1.0, 2.0],
[1.0, 1.0, 3.0],
[1.0, 1.0, 4.0],
[2.0, 2.0, 5.0],
[1.0, 2.0, 6.0],
[1.0, 2.0, 7.0],
[1.0, 1.0, 8.0],
[2.0, 2.0, 9.0],
[1.0, 1.0, 9.0]])
# machine(6) returns NaN; mark two key entries as missing
x[4][1] = machine(6)
x[6][0] = machine(6)
sortData(x, n_keys)
writeMatrix("Sorted x", x)
Output¶
Sorted x
1 2 3
1 1 1 1
2 1 1 9
3 1 1 3
4 1 1 4
5 1 1 8
6 1 2 6
7 2 1 2
8 2 2 9
9 ........... 2 7
10 2 ........... 5
Example 2¶
This example uses the same data as the previous example. The permutation of
the rows is output in the array permutation.
from numpy import array
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
n_keys = 2
x = array([[1.0, 1.0, 1.0],
[2.0, 1.0, 2.0],
[1.0, 1.0, 3.0],
[1.0, 1.0, 4.0],
[2.0, 2.0, 5.0],
[1.0, 2.0, 6.0],
[1.0, 2.0, 7.0],
[1.0, 1.0, 8.0],
[2.0, 2.0, 9.0],
[1.0, 1.0, 9.0]])
# machine(6) returns NaN; mark two key entries as missing
x[4][1] = machine(6)
x[6][0] = machine(6)
# Output containers filled by sortData
n = {}
permutation = []
sortData(x, n_keys, passive=True,
permutation=permutation, n=n)
writeMatrix("Unchanged x", x)
writeMatrix("Permutation", permutation, writeFormat="%10i")
nn = n["n"]
writeMatrix("n", nn, writeFormat="%10i")
Output¶
Unchanged x
1 2 3
1 1 1 1
2 2 1 2
3 1 1 3
4 1 1 4
5 2 ........... 5
6 1 2 6
7 ........... 2 7
8 1 1 8
9 2 2 9
10 1 1 9
Permutation
1 2 3 4 5 6
0 9 2 3 7 5
7 8 9 10
1 8 6 4
n
1 2 3 4
5 1 1 1
Example 3¶
The table of frequencies for a data matrix of size 30 × 2 is output in the
array table.
from __future__ import print_function
from numpy import array
from pyimsl.stat.sortData import sortData
from pyimsl.stat.writeMatrix import writeMatrix
n_observations = 30
n_variables = 2
n_keys = 2
x = array([[0.5, 1.5],
[1.5, 3.5],
[0.5, 3.5],
[1.5, 2.5],
[1.5, 3.5],
[1.5, 4.5],
[0.5, 1.5],
[1.5, 3.5],
[3.5, 6.5],
[2.5, 3.5],
[2.5, 4.5],
[3.5, 6.5],
[1.5, 2.5],
[2.5, 4.5],
[0.5, 3.5],
[1.5, 2.5],
[1.5, 3.5],
[0.5, 3.5],
[0.5, 1.5],
[0.5, 2.5],
[2.5, 5.5],
[1.5, 2.5],
[1.5, 3.5],
[1.5, 4.5],
[4.5, 5.5],
[2.5, 4.5],
[0.5, 3.5],
[1.5, 2.5],
[0.5, 2.5],
[2.5, 5.5]])
# Output container; sortData fills in "nValues", "values", and "table"
table = {}
sortData(x, n_keys, passive=True, table=table)
writeMatrix("Unchanged x", x)
nValues = table["nValues"]
n_rows = nValues[0]
n_columns = nValues[1]
print("n_rows: ", n_rows)
print("n_columns: ", n_columns)
tableValues = table["values"]
rowValues = tableValues[0:n_rows]
colValues = tableValues[n_rows:]
writeMatrix("Row values", rowValues, writeFormat="%10.1f")
writeMatrix("Column values", colValues, writeFormat="%10.1f")
tabtab = table["table"]
writeMatrix("Table", tabtab, writeFormat="%8i")
Output¶
n_rows: 5
n_columns: 6
Unchanged x
1 2
1 0.5 1.5
2 1.5 3.5
3 0.5 3.5
4 1.5 2.5
5 1.5 3.5
6 1.5 4.5
7 0.5 1.5
8 1.5 3.5
9 3.5 6.5
10 2.5 3.5
11 2.5 4.5
12 3.5 6.5
13 1.5 2.5
14 2.5 4.5
15 0.5 3.5
16 1.5 2.5
17 1.5 3.5
18 0.5 3.5
19 0.5 1.5
20 0.5 2.5
21 2.5 5.5
22 1.5 2.5
23 1.5 3.5
24 1.5 4.5
25 4.5 5.5
26 2.5 4.5
27 0.5 3.5
28 1.5 2.5
29 0.5 2.5
30 2.5 5.5
Row values
1 2 3 4 5
0.5 1.5 2.5 3.5 4.5
Column values
1 2 3 4 5 6
1.5 2.5 3.5 4.5 5.5 6.5
Table
1 2 3 4 5 6
1 3 2 4 0 0 0
2 0 5 5 2 0 0
3 0 0 1 3 2 0
4 0 0 0 0 0 2
5 0 0 0 0 1 0