sort_data

Sorts observations by specified keys, with option to tally cases into a multi-way frequency table.

Synopsis

#include <imsls.h>

void imsls_f_sort_data (int n_observations, int n_variables, float x[], int n_keys, ..., 0)

The type double function is imsls_d_sort_data.

Required Arguments

int n_observations (Input)
Number of observations (rows) in x.

int n_variables (Input)
Number of variables (columns) in x.

float x[] (Input/Output)
An n_observations × n_variables matrix containing the observations to be sorted. The sorted matrix is returned in x (exception: see optional argument IMSLS_PASSIVE).

int n_keys (Input)
Number of columns of x on which to sort. The first n_keys columns of x are used as the sorting keys (exception: see optional argument IMSLS_INDICES_KEYS).

Synopsis with Optional Arguments

#include <imsls.h>

void imsls_f_sort_data (int n_observations, int n_variables, float x[], int n_keys,

IMSLS_X_COL_DIM, int x_col_dim,

IMSLS_INDICES_KEYS, int indices_keys[],

IMSLS_FREQUENCIES, float frequencies[],

IMSLS_ASCENDING, or

IMSLS_DESCENDING,

IMSLS_ACTIVE, or

IMSLS_PASSIVE,

IMSLS_PERMUTATION, int **permutation,

IMSLS_PERMUTATION_USER, int permutation[],

IMSLS_TABLE, int **n_values, float **values, float **table,

IMSLS_TABLE_USER, int n_values[], float values[], float table[],

IMSLS_LIST_CELLS, int *n_cells, float **list_cells, float **table_unbalanced,

IMSLS_LIST_CELLS_USER, int *n_cells, float list_cells[], float table_unbalanced[],

IMSLS_N, int *n_cells, int **n,

IMSLS_N_USER, int *n_cells, int n[],

0)

Optional Arguments

IMSLS_X_COL_DIM, int x_col_dim (Input)
Column dimension of x.

Default: x_col_dim = n_variables

IMSLS_INDICES_KEYS, int indices_keys[] (Input)
Array of length n_keys giving the column numbers of x which are to be used in the sort.

Default: indices_keys [ ] = 0, 1, n_keys  1

IMSLS_FREQUENCIES, float frequencies[] (Input)
Array of length n_observations containing the frequency for each observation in x.

Default: frequencies [ ] = 1

IMSLS_ASCENDING, or

IMSLS_DESCENDING
By default, or if IMSLS_ASCENDING is specified, the sort is in ascending order. If IMSLS_DESCENDING is specified, the sort is in descending order.

IMSLS_ACTIVE, or

IMSLS_PASSIVE
By default, or if IMSLS_ACTIVE is specified, the sorted matrix is returned in x. If IMSLS_PASSIVE is specified, x is unchanged by imsls_f_sort_data (i.e., x becomes input only).

IMSLS_PERMUTATION, int **permutation (Output)
Address of a pointer to an internally allocated array of length n_observations specifying the rearrangement (permutation) of the observations (rows).

IMSLS_PERMUTATION_USER, int permutation[] (Output)
Storage for array permutation is provided by the user. See IMSLS_PERMUTATION.

IMSLS_TABLE, int **n_values, float **values, float **table (Output)
Argument n_values is the address of a pointer to an internally allocated array of length n_keys containing in its i-th element (i = 0, 1, n_keys  1), the number of levels or categories of the i-th classification variable (column).

Argument values is the address of a pointer to an internally allocated array of length n_values [0] + n_values [1] +  + n_values [n_keys  1] containing the values of the classification variables. The first n_values [0] elements of values contain the values for the first classification variable. The next n_values [1] contain the values for the second variable. The last n_values [n_keys  1] positions contain the values for the last classification variable.

Argument table is the address of a pointer to an internally allocated array of length n_values [0× n_values [1×  × n_values [n_keys  1] containing the frequencies in the cells of the table to be fit.

Empty cells are included in table, and each element of table is nonnegative. The cells of table are sequenced so that the first variable cycles through its n_values [0] categories one time, the second variable cycles through its n_values [1] categories n_values [0] times, the third variable cycles through its n_values [2] categories n_values [0× n_values [1] times, etc., up to the n_keys-th variable, which cycles through its n_values [n_keys  1] categories n_values [0× n_values [1×  × n_values [n_keys  2] times.

IMSLS_TABLE_USER, int n_values[], float values[], float table[] (Output)
Storage for arrays n_values, values, and table is provided by the user. If the length of table is not known in advance, the upper bound for this length can be taken to be the product of the number of distinct values taken by all of the classification variables (since table includes the empty cells).

IMSLS_LIST_CELLS, int *n_cells, float **list_cells, float **table_unbalanced (Output)
Number of nonempty cells is returned by n_cells. Argument list_cells is an internally allocated array of size n_cells × n_keys containing, for each row, a list of the levels of n_keys corresponding classification variables that describe a cell.

Argument table_unbalanced is the address of a pointer to an array of length n_cells containing the frequency for each cell.

IMSLS_LIST_CELLS_USER, int *n_cells, float list_cells[], float table_unbalanced[] (Output)
Storage for arrays list_cells and table_unbalanced is provided by the user. See IMSLS_LIST_CELLS.

IMSLS_N, int *n_cells, int **n (Output)
The integer n_cells returns the number of groups of different observations. A group contains observations (rows) in x that are equal with respect to the method of comparison.

Argument n is the address of the pointer to an internally allocated array of length n_cells containing the number of observations (rows) in each group.

The first n [0] rows of the sorted x are group number 1. The next n [1]rows of the sorted x are group number 2, etc. The last n [n_cells  1] rows of the sorted x are group number n_cells.

IMSLS_N_USER, int *n_cells, int n[] (Output)
Storage for array n_cells is provided by the user. If the value of n_cells is not known, n_observations can be used as an upper bound for the length of n. See IMSLS_N.

Description

Function imsls_f_sort_data can perform both a key sort and/or tabulation of frequencies into a multi-way frequency table.

Sorting

Function imsls_f_sort_data sorts the rows of real matrix x using a particular row in x as the keys. The sort is algebraic with the first key as the most significant, the second key as the next most significant, etc. When x is sorted in ascending order, the resulting sorted array is such that the following is true:

For i = 0, 1, n_observations  2, x [i] [indices_keys [0]]  x [i + 1] [indices_keys [0]]
For k = 1, n_keys  1, if x [i] [indices_keys [j]] = x [i + 1] [indices_keys [j]] for j = 0, 1, k  1, then x [i] [indices_keys [k]] = x [i + 1] [indices_keys [k]]

The observations also can be sorted in descending order.

The rows of x containing the missing value code NaN in at least one of the specified columns are considered as an additional group. These rows are moved to the end of the sorted x.

The sorting algorithm is based on a quicksort method given by Singleton (1969) with modifications by Griffen and Redish (1970) and Petro (1970).

Frequency Tabulation

Function imsls_f_sort_data determines the distinct values in multivariate data and computes frequencies for the data. This function accepts the data in the matrix x, but performs computations only for the variables (columns) in the first n_keys columns of x (Exception: see optional argument IMSLS_INDICES_KEYS). In general, the variables for which frequencies should be computed are discrete; they should take on a relatively small number of different values. Variables that are continuous can be grouped first. The imsls_f_table_oneway function can be used to group variables and determine the frequencies of groups.

When IMSLS_TABLE is specified, imsls_f_sort_data fills the vector values with the unique values of the variables and tallies the number of unique values of each variable in the vector table. Each combination of one value from each variable forms a cell in a multi-way table. The frequencies of these cells are entered in table so that the first variable cycles through its values exactly once, and the last variable cycles through its values most rapidly. Some cells cannot correspond to any observations in the data; in other words, “missing cells” are included in table and have a value of 0.

When IMSLS_LIST_CELLS is specified, the frequency of each cell is entered in table_unbalanced so that the first variable cycles through its values exactly once and the last variable cycles through its values most rapidly. All cells have a frequency of at least 1, i.e., there is no “missing cell.” The list_cells array can be considered “parallel” to table_unbalanced because row i of list_cells is the set of n_keys values that describes the cell for which row i of table_unbalanced contains the corresponding frequency.

Examples

Example 1

The rows of a 10 × 3 matrix x are sorted in ascending order using Columns 0 and 1 as the keys. There are two missing values (NaNs) in the keys. The observations containing these values are moved to the end of the sorted array.

 

#include <imsls.h>

 

#define N_OBSERVATIONS 10

#define N_VARIABLES 3

 

int main()

{

int n_keys=2;

float x[N_OBSERVATIONS][N_VARIABLES] = {

1.0, 1.0, 1.0,

2.0, 1.0, 2.0,

1.0, 1.0, 3.0,

1.0, 1.0, 4.0,

2.0, 2.0, 5.0,

1.0, 2.0, 6.0,

1.0, 2.0, 7.0,

1.0, 1.0, 8.0,

2.0, 2.0, 9.0,

1.0, 1.0, 9.0

};

 

x[4][1]=imsls_f_machine(6);

x[6][0]=imsls_f_machine(6);

 

imsls_f_sort_data (N_OBSERVATIONS, N_VARIABLES,

&x[0][0], n_keys, 0);

 

imsls_f_write_matrix("sorted x", N_OBSERVATIONS, N_VARIABLES,

(float *)x, 0);

}

Output

 

sorted x

1 2 3

1 1 1 1

2 1 1 9

3 1 1 3

4 1 1 4

5 1 1 8

6 1 2 6

7 2 1 2

8 2 2 9

9 ......... 2 7

10 2 ......... 5

Example 2

This example uses the same data as the previous example. The permutation of the rows is output in the array permutation.

 

#include <imsls.h>

#define N_OBSERVATIONS 10

#define N_VARIABLES 3

int main()

{

int n_keys=2;

int n_cells;

int *n;

int *permutation;

float x[N_OBSERVATIONS][N_VARIABLES]={1.0, 1.0, 1.0,

2.0, 1.0, 2.0,

1.0, 1.0, 3.0,

1.0, 1.0, 4.0,

2.0, 2.0, 5.0,

1.0, 2.0, 6.0,

1.0, 2.0, 7.0,

1.0, 1.0, 8.0,

2.0, 2.0, 9.0,

1.0, 1.0, 9.0};

x[4][1]=imsls_f_machine(6);

x[6][0]=imsls_f_machine(6);

imsls_f_sort_data (N_OBSERVATIONS, N_VARIABLES,

(float *)x, n_keys,

IMSLS_PASSIVE,

IMSLS_PERMUTATION, &permutation,

IMSLS_N, &n_cells, &n, 0);

imsls_f_write_matrix("unchanged x ", N_OBSERVATIONS, N_VARIABLES,

(float *)x, 0);

imsls_i_write_matrix("permutation", 1, N_OBSERVATIONS, permutation,

0);

imsls_i_write_matrix("n", 1, n_cells, n, 0);

}

Output

 

unchanged x

1 2 3

1 1 1 1

2 2 1 2

3 1 1 3

4 1 1 4

5 2 .......... 5

6 1 2 6

7 .......... 2 7

8 1 1 8

9 2 2 9

10 1 1 9

 

permutation

1 2 3 4 5 6 7 8 9 10

0 9 2 3 7 5 1 8 6 4

 

n

1 2 3 4

5 1 1 1

Example 3

The table of frequencies for a data matrix of size 30 × 2 is output in the array table.

 

#include <imsls.h>

 

int main()

{

int n_observations=30;

int n_variables=2;

int n_keys=2;

int *n_values;

int n_rows, n_columns;

float *values;

float *table;

float x[] = {0.5, 1.5,

1.5, 3.5,

0.5, 3.5,

1.5, 2.5,

1.5, 3.5,

1.5, 4.5,

0.5, 1.5,

1.5, 3.5,

3.5, 6.5,

2.5, 3.5,

2.5, 4.5,

3.5, 6.5,

1.5, 2.5,

2.5, 4.5,

0.5, 3.5,

1.5, 2.5,

1.5, 3.5,

0.5, 3.5,

0.5, 1.5,

0.5, 2.5,

2.5, 5.5,

1.5, 2.5,

1.5, 3.5,

1.5, 4.5,

4.5, 5.5,

2.5, 4.5,

0.5, 3.5,

1.5, 2.5,

0.5, 2.5,

2.5, 5.5};

 

imsls_f_sort_data (n_observations, n_variables, x, n_keys,

IMSLS_PASSIVE,

IMSLS_TABLE, &n_values, &values, &table,

0);

imsls_f_write_matrix("unchanged x", n_observations, n_variables,

x, 0);

n_rows = n_values[0];

n_columns = n_values[1];s

imsls_f_write_matrix("row values", 1, n_rows, values, 0);

   imsls_f_write_matrix("column values", 1, n_columns, &values[n_rows],

                     0);

imsls_f_write_matrix("table", n_rows, n_columns, table, 0);

}

Output

 

unchanged x

1 2

1 0.5 1.5

2 1.5 3.5

3 0.5 3.5

4 1.5 2.5

5 1.5 3.5

6 1.5 4.5

7 0.5 1.5

8 1.5 3.5

9 3.5 6.5

10 2.5 3.5

11 2.5 4.5

12 3.5 6.5

13 1.5 2.5

14 2.5 4.5

15 0.5 3.5

16 1.5 2.5

17 1.5 3.5

18 0.5 3.5

19 0.5 1.5

20 0.5 2.5

21 2.5 5.5

22 1.5 2.5

23 1.5 3.5

24 1.5 4.5

25 4.5 5.5

26 2.5 4.5

27 0.5 3.5

28 1.5 2.5

29 0.5 2.5

30 2.5 5.5

row values

1 2 3 4 5

0.5 1.5 2.5 3.5 4.5

 

column values

1 2 3 4 5 6

1.5 2.5 3.5 4.5 5.5 6.5

 

Table

1 2 3 4 5 6

1 3 2 4 0 0 0

2 0 5 5 2 0 0

3 0 0 1 3 2 0

4 0 0 0 0 0 2

5 0 0 0 0 1 0