data_read

Reads column-oriented data from a delimited ASCII file and returns a structure with the number of rows and columns and a double matrix containing the data.

Synopsis

#include <imsls.h>

Imsls_data_matrix *imsls_data_read (char *data_filename, char *dictionary_filename, ..., 0)

Required Arguments

char *data_filename (Input)
The name of the file to read.

char *dictionary_filename (Input)
The name of a comma-delimited file containing the data dictionary. The data dictionary specifies for each column a name, a data type, and an additional format for date/time columns. If every column is numeric, use NULL for the dictionary file name.

Return Value

A pointer to a structure Imsls_data_matrix containing the number of rows and columns and a matrix containing the data transformed to double values. To release this space, use imsls_free_data_matrix. If no value can be computed, then NULL is returned.

Synopsis with Optional Arguments

#include <imsls.h>

Imsls_data_matrix *imsls_data_read (char *data_filename, char *dictionary_filename,

IMSLS_DATA_HEADER,

IMSLS_DELIM, char user_delim,

IMSLS_IGNORE, int n_ignore, char **strings_to_ignore,

IMSLS_FILTERS, int n_filter, char *chars_to_filter,

IMSLS_MAX_NUM_CATEGORIES, int max_num_categories,

IMSLS_REPLACEMENT_NUMBERS, int n_replacements, char **strings_to_replace, float *replacement_numbers,

IMSLS_REPLACEMENT_STRINGS, int n_replacements, char **strings_to_replace, char **replacement_strings,

IMSLS_RETURN_COLUMN_NAMES, char ***column_names,

IMSLS_RETURN_COLUMN_INFO, Imsls_column_info **column_info,

0)

Optional Arguments

IMSLS_DATA_HEADER, (Input)
Indicates that the first row of the data file contains column names or other information. This row will be skipped.
Default: The first row is assumed not to be a header.

IMSLS_DELIM, char user_delim (Input)
A character specifying the (single) file delimiter. user_delim should be one of comma, { ‘,’ }, semicolon, { ‘;’ }, or space {‘ ‘}.
Default: One of the delimiters {‘,’, ‘;’, ‘ ’} is determined by searching the first few rows of the data file. The file should have only one delimiter throughout.

IMSLS_IGNORE, int n_ignore, char **strings_to_ignore (Input)
The number of strings and the array of length n_ignore containing strings. When any string in strings_to_ignore is found, the entire row is ignored. This is one way to detect and remove invalid rows, such as rows with missing values.
Default: No rows are ignored.

IMSLS_FILTERS, int n_filter, char *chars_to_filter (Input)
The number of chars and the array of length n_filter containing characters. When any character in chars_to_filter is found, they are filtered out of the field.
Default: No characters are filtered.

IMSLS_MAX_NUM_CATEGORIES, int max_num_categories (Input)
The maximum number of categories or class levels to allow or expect in string variables.
Default: max_num_categories = 100.

IMSLS_REPLACEMENT_NUMBERS, int n_replacements, char **strings_to_replace, float *replacement_numbers (Input)
The number of replacement strings, an array of length n_replacements containing strings to be replaced, and an array of length n_replacements containing numeric values that replace each of the strings as they are found in the data. This is one way to replace missing values with some default numeric values.

Note: If IMSLS_REPLACEMENTS_STRINGS is also used, strings_to_replace and n_replacements should be the same in each instance.

Default: Blanks are replaced by 0.0.

IMSLS_REPLACEMENT_STRINGS, int n_replacements, char **strings_to_replace, char **replacement_strings (Input)
The number of replacement strings, an array of length n_replacements containing strings to be replaced, and an array of length n_replacements containing strings that replace each of the strings as they are found in the data. This is one way to replace missing values with some default strings.

Note: If IMSLS_REPLACEMENTS_NUMBERS is also used, strings_to_replace and n_replacements should be the same in each instance.

Default: Blanks are replaced by “BLANK”.

IMSLS_RETURN_COLUMN_NAMES, char ***column_names (Output)
The address of a pointer to an array containing the column names.

IMSLS_RETURN_COLUMN_INFO, Imsls_column_info **column_info (Output)
Returns the address of a pointer to a structure containing column information for each column, such as name, and for string variables, categorical class levels, and class counts. Use imsls_free_column_info to free the memory.

Description

Function imsls_data_read reads column-oriented data from a delimited ASCII file and returns a structure with the number of rows and columns and a double matrix containing the data.

Data Dictionary

If every column is numeric (float or double), no data dictionary is needed. Otherwise, if the data contains strings or dates, imsls_data_read requires a simple dictionary file to specify which columns are strings, dates, or numeric. The dictionary file is a comma delimited flat file with the following information for each column of the data file:

<Column_name>, <data type> [,<date time format>]

Note that all data types are returned as double in the return matrix. Column names should not contain any of the common delimiters, comma, semicolon, or space ( {“,”,”;”,” “}).

For example, a dictionary file with 6 column names and 6 different data types is shown below:

Column1, float
Column2, date, YYYY/MM/DD
Column3, time, HH:MnMn:SS
Column4, string
Column5, int
Column6, datetime, YYYY/MM/DD HHMnMn

The corresponding data file might look like

1700, 2012/05/03, 17:32:12, XYZ, 3, 2011/05/03 1035
1200, 2017/04/01, 11:20:10, EFG, 2, 2009/04/01 1152
                   .
                   .
                   .
2263, 2021/06/03, 20:17:05, BCG, 2, 2013/06/03 835
3200, 2022/03/16, 13:21:42, ANY, 3, 2014/03/16 713

Data Types

Table 1 contains the permitted data types.

Table 1 Data types permitted in the data dictionary

Data type

Behavior of
imsls_data_read

float

Reads in as float, converts to double in return.

double

Reads in as double, converts to double in return.

int

Reads in as int, converts to double in return.

string

Reads in as string, converts to double in return. The string values are encoded into 1, 2, 3,.. and returned as doubles.

date

Reads in as date in a specified format, converts to double (JULIAN date number) in return.

time

Reads in as time in a specified format, converts to double (JULIAN date number) in return.

datetime

Reads in as datetime in a specified format, converts to double (JULIAN date number) in return.

skip

Skips the column.

Date and Time Formats

Tables 2, 3, and 4 contain the permitted date, time, and datetime formats.

Table 2 Date formats

Data formats to be specified in the dictionary file

Example

YYYY/MM/DD

2019/04/28

MM/DD/YYYY

11/23/2020

DD/MM/YYYY

23/11/2020

DD-mmm-YYYY

23-Nov-2020

DD-mmmmmm-YYYY

23-November-2020

mmmmmm-DD-YYYY

November-23-2020

Table 3 Time formats

Time formats

Examples

HH:MnMn:SS

13:45:13

HHMnMn

1345

Datetime fields have both a date and time. The datetime formats are simply a concatenation (with a space in between) of the date and time formats.

Table 4 Datetime formats

Datetime formats

YYYY/MM/DD HH:MnMn:Ss

YYYY/MM/DD HHMnMn

MM/DD/YYYY HH:MnMn:SS

MM/DD/YYYY HHMnMn

DD/MM/YYYY HH:MnMn:SS

DD/MM/YYYY HHMnMn

DD-mmm-YYYY HH:MnMn:SS

DD-mmm-YYYY HHMnMn

DD-mmmmmm-YYYY HH:MnMn:SS

DD-mmmmmm-YYYY HHMnMn

Missing Values

By default, imsls_data_read expects missing values to be represented by blank or empty values between delimiters. In numeric columns, blanks will be treated as 0.0. In string columns, blanks are replaced with the string “BLANK” and treated as any other category when the column is encoded into integers. The user has an option to define other missing values and replacements for both numbers and strings via the optional arguments, IMSLS_REPLACEMENT_NUMBERS and IMSLS_REPLACEMENT_STRINGS. See examples 3 and 4.

Remarks

Function imsls_data_read is a user-friendly interface to imsls_ascii_read and exposes many, but not all, of its features. Refer to the IMSL function imsls_ascii_read for more information.

Examples

Example 1

The contents of a space delimited file are the integers 1 to 20.

1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20

This example reads the file and displays the contents of the file.

#include <imsls.h>

int main()
{
    Imsls_data_matrix *result = NULL;
 
    result = imsls_data_read("monotonic.dat", NULL,0);

    imsls_d_write_matrix("Monotonic.dat", result->n_rows,
        result->n_cols, result->double_data_matrix, 0);

    imsls_free_data_matrix(result);
}

Output

                            Example 1 data
             1            2            3            4            5
1            1            2            3            4            5
2            6            7            8            9           10
3           11           12           13           14           15
4           16           17           18           19           20

Example 2

The contents of the comma delimited data file for example 2 are:

1,2,3,4,1981/11/22 5:45:12
6,7,8,9,1981/11/23 5:45:12
11,12,13,15,1981/11/24 5:45:12
16,17,18,20,1981/11/25 5:45:12

The data includes a datetime field and so we need to use a dictionary file:

Col1, int
Col2, int
Col3, int
Col4, int
Col5, datetime, YYYY/MM/DD HH:MnMn:SS

The example reads the file and prints out the resulting data matrix, using the column names as column labels in the printout.

#include <imsls.h>

int main()

    char fname[] = "ex2data.dat";
    char dictionary[] = "ex2dictionary.dat";
    char **column_names = NULL;
    char *column_labels[6];
    int i;
    Imsls_data_matrix *result = NULL;

    result = imsls_data_read(fname, dictionary,
        IMSLS_RETURN_COLUMN_NAMES, &column_names, 0);

    column_labels[0] = "Row";
    
    for (i = 1; i <= 7; i++) {
        column_labels[i] = column_names[i - 1];
    }

    imsls_d_write_matrix("Example 2 data", result->n_rows, result->n_cols,
        result->double_data_matrix,
        IMSLS_COL_LABELS, column_labels,
        0);
    
    if (column_names) {
        for (i = 0; i < result->n_cols; i++) {
            if (column_names[i]) {
                imsls_free(column_names[i]);
                column_names[i] = NULL;
            }
        }
        imsls_free(column_names);
        column_names = NULL;
    }    

    imsls_free_data_matrix(result);

}

Output

                           Example 2 data
Row         Col1         Col2         Col3         Col4         Col5
  1            1            2            3            4        83710
  2            6            7            8            9        83711
  3           11           12           13           15        83712
  4           16           17           18           20        83713

Example 3

The contents of the file for example 3 are

name, phone, zipcode, age, income, education
Foghorn Leghorn, 555-777-8888, 759xx, 47, 47500, GED
Henny Penny, 555-555-9999, 692xx, 35, 62000, BS
Rumplestiltskin, 555-888-2222, 775xx, 102, 13000,  
Red Ridinghood, ,578xx, 12,, Elem
Pinnochio, 555-999-5555,821xx, 21, 10000000, PHD
Goldilocks, 555-888-7711, 879xx, 10, 200, Elem
Mama Bear, 555-111-9999, 879xx, 4, , 
Papa Bear, 555-111-9999, 879xx, 4.5,,
Baby Bear, 555-111-9999, 879xx, .333,,

and the dictionary file is

name, skip
phone, skip
zipcode, string
age, double
income, double
education, string

Function imsls_data_read by default replaces the empty (blank) fields in the data with “BLANK” for string data and 0.0 for numeric data. Alternatively, the user can specify replacement values. For any missing values other than blanks, the default call to imsls_data_read results in an error. That is, the user MUST specify missing values instructions for any values other than blank.

In this example, imsls_data_read is called both ways, using default replacements and user supplied replacements. The example also shows how string columns are encoded within imsls_data_read, using integer encoding, and follows that with one-hot encoding using imsls_unsupervised_filter.

#include <imsls.h>

int main()
{
    char fname[] = "names.dat";
    char dictionary[] = "namesdictionary.dat";
    char **column_names = NULL;
    char *print_column_labels[5];

    char *miss_vals[] = { "null", "#N/A", "", " " };
    char *repl_str[] = { "MISSING","MISSING","EMPTY","EMPTY" };
    float repl_nums[] = { -99999.0, -99999.0, 500.0, 500.0 };

    char delim = ‘,’;
    int i, j;

    Imsls_data_matrix *result = NULL;
    Imsls_column_info *column_info = NULL;

    result = imsls_data_read(fname, dictionary,
        IMSLS_DELIM, delim,
        IMSLS_DATA_HEADER,
        IMSLS_RETURN_COLUMN_NAMES, &column_names,
        0);

    print_column_labels[0] = "Row";
    for (i = 1; i <= 4; i++) {
        print_column_labels[i] = column_names[i - 1];
    }

    imsls_d_write_matrix("Resulting data matrix, default replacements:",
        result->n_rows,
        result->n_cols, result->double_data_matrix,
        IMSLS_COL_LABELS, print_column_labels,
        IMSLS_WRITE_FORMAT, "%12.2f", 0);

    imsls_free_data_matrix(result);

    result = imsls_data_read(fname, dictionary,
        IMSLS_DELIM, delim,
        IMSLS_DATA_HEADER,
        IMSLS_REPLACEMENT_NUMBERS, 4, miss_vals, repl_nums,
        IMSLS_REPLACEMENT_STRINGS, 4, miss_vals, repl_str,
        IMSLS_RETURN_COLUMN_INFO, &column_info,
        0);

    imsls_d_write_matrix("Resulting data matrix, user specified replacements:",
        result->n_rows,
        result->n_cols, result->double_data_matrix,
        IMSLS_COL_LABELS, print_column_labels,
        IMSLS_WRITE_FORMAT, "%12.2f", 0);

    for (i = 0; i < result->n_cols; i++) {
        if (column_info[i].n_levels > 0) {
            printf("\nInteger encoding for column %s:\n", column_info[i].name);
            printf("\nEncoded Value       String       Count\n");
            for (j = 0; j < column_info[i].n_levels; j++) {
                printf("%8d%17s%11d\n", j + 1, column_info[i].class_levels[j],
                column_info[i].class_counts[j]);
            }
        }
    }

    /* one-hot encoding for zipcode*/
    int temp_column[100], n_levels;

    /* zipcode is the first column.*/
    for (i = 0; i < result->n_rows; i++) {
        temp_column[i] = (int)(result->double_data_matrix[i*result->n_cols]);
    }

    /* imsls_unsupervised_nominal_filter accepts integer array
     * and returns integer matrix. 
    */
    int *dummy_cols_zipcode = 
         imsls_unsupervised_nominal_filter(result->n_rows, 
         &n_levels, temp_column, 0);

    imsls_i_write_matrix("One hot encoding for zipcode:", result->n_rows,
        n_levels, dummy_cols_zipcode,
        0);

    if (column_names) {
        for (i = 0; i < result->n_cols; i++) {
            if (column_names[i]) {
                imsls_free(column_names[i]);
                column_names[i] = NULL;
            }
        }
        imsls_free(column_names);
        column_names = NULL;
    }
    
    if (dummy_cols_zipcode) {
        imsls_free(dummy_cols_zipcode);
        dummy_cols_zipcode = NULL;
    }
    
    imsls_free_column_info(result->n_cols, column_info);
    imsls_free_data_matrix(result);
    
}

Output

       Resulting data matrix, default replacements:
Row       zipcode           age        income     education
  1          1.00         47.00      47500.00          1.00
  2          2.00         35.00      62000.00          2.00
  3          3.00        102.00      13000.00          3.00
  4          4.00         12.00          0.00          4.00
  5          5.00         21.00   10000000.00          5.00
  6          6.00         10.00        200.00          4.00
  7          6.00          4.00          0.00          3.00
  8          6.00          4.50          0.00          3.00
  9          6.00          0.33          0.00          3.00

    Resulting data matrix, user specified replacements:
Row       zipcode           age        income     education
  1          1.00         47.00      47500.00          1.00
  2          2.00         35.00      62000.00          2.00
  3          3.00        102.00      13000.00          3.00
  4          4.00         12.00        500.00          4.00
  5          5.00         21.00   10000000.00          5.00
  6          6.00         10.00        200.00          4.00
  7          6.00          4.00        500.00          3.00
  8          6.00          4.50        500.00          3.00
  9          6.00          0.33        500.00          3.00

Integer encoding for column zipcode:

Encoded Value       String       Count
       1            759xx          1
       2            692xx          1
       3            775xx          1
       4            578xx          1
       5            821xx          1
       6            879xx          4

Integer encoding for column education:

Encoded Value       String       Count
       1              GED          1
       2               BS          1
       3            EMPTY          4
       4             Elem          2
       5              PHD          1

One hot encoding for zipcode:
      1   2   3   4   5   6
  1   1   0   0   0   0   0
  2   0   1   0   0   0   0
  3   0   0   1   0   0   0
  4   0   0   0   1   0   0
  5   0   0   0   0   1   0
  6   0   0   0   0   0   1
  7   0   0   0   0   0   1
  8   0   0   0   0   0   1
  9   0   0   0   0   0   1

Example 4

A comma delimited file contains one year of bitcoin (BTC) stock prices (source: Yahoo Finance). The first few rows look like:

Date,Open,High,Low,Close,AdjClose,Volume
4/27/2021,54030.30469,55416.96484,53319.1875,55033.11719,55033.11719,49448222757
4/28/2021,55036.63672,56227.20703,53887.91797,54824.70313,54824.70313,48000572955
4/29/2021,54858.08984,55115.84375,52418.02734,53555.10938,53555.10938,46088929780   

An associated dictionary file has the following rows:

Date,date,MM/DD/YYYY
Open,double,
High,double,
Low,double,
Close,double,
AdjClose,double,
Volume,double

The example reads the data, ignores any rows with “null” prices, and runs sample statistics on each numeric column after converting the units to thousands (for prices), and billions (for volume).

#include <imsls.h>

int main()
{
    char fname[] = "BTC-USD.csv";
    char dictionary[] = "BTC-USDDictionary.csv";
    char *ignore_vals[] = { "null" };
    char **column_names = NULL;
    char *column_labels[8];
    int i, j;

    Imsls_data_matrix *result = NULL;
    double *simple_statistics = NULL;

    char *row_labels[] = {
       "means", "variances", "std. dev", "skewness", "kurtosis",
       "minima", "maxima", "ranges", "C.V.", "counts", "lower mean",
       "upper mean", "lower var", "upper var"};

    result = imsls_data_read(fname, dictionary,
        IMSLS_DATA_HEADER,
        IMSLS_IGNORE, 1, ignore_vals,
        IMSLS_RETURN_COLUMN_NAMES, &column_names,
        0);
   
    /* financial data expressing price in (thousands) and 
     * the volume in (billions) 
     */
    float units[] = { 1, 1000, 1000, 1000, 1000, 1000, 1000000000 };

    for (i = 0; i < result->n_rows; i++) {
        for (j = 1; j < result->n_cols; j++) {
            result->double_data_matrix[i*result->n_cols + j] /= units[j];
        }
    }
    

 /* test on the column statistics. */
    simple_statistics = imsls_d_simple_statistics(result->n_rows,
        result->n_cols, result->double_data_matrix, 0);

    column_labels[0] = "Statistic";

    for (i = 1; i <= 7; i++) {
        column_labels[i] = column_names[i - 1];
    }

    imsls_d_write_matrix("* * * Statistics * * *\n", 14, result->n_cols,
            simple_statistics,
            IMSLS_ROW_LABELS, row_labels,
            IMSLS_COL_LABELS, column_labels,
            IMSLS_WRITE_FORMAT, "%18.3f", 0);
    }
    if (column_names) {
        for (i = 0; i < result->n_cols; i++) {
            if (column_names[i]) {
                imsls_free(column_names[i]);
                column_names[i] = NULL;
            }
        }
        imsls_free(column_names);
        column_names = NULL;
    }    
    
    imsls_free_data_matrix(result);
    
    if (simple_statistics)
        imsls_free(simple_statistics);

    return;
}

Output

                       * * * Statistics * * *

Statistic                 Date                Open                High
means                98293.003              45.376              46.493
variances            11133.503              74.114              75.826
std. dev               105.515               8.609               8.708
skewness                 0.000               0.585               0.593
kurtosis                -1.200              -0.436              -0.440
minima               98111.000              29.796              31.006
maxima               98476.000              67.550              68.790
ranges                 365.000              37.753              37.783
C.V.                     0.001               0.190               0.187
counts                 365.000             365.000             365.000
lower mean           98282.142              44.490              45.596
upper mean           98303.864              46.262              47.389
lower var             9677.820              64.424              65.912
upper var            12946.086              86.180              88.171

Statistic                  Low               Close            AdjClose
means                   44.079              45.329              45.329
variances               72.613              73.738              73.738
std. dev                 8.521               8.587               8.587
skewness                 0.548               0.597               0.597
kurtosis                -0.455              -0.411              -0.411
minima                  28.894              29.807              29.807
maxima                  66.382              67.567              67.567
ranges                  37.488              37.759              37.759
C.V.                     0.193               0.189               0.189
counts                 365.000             365.000             365.000
lower mean              43.202              44.445              44.445
upper mean              44.956              46.213              46.213
lower var               63.119              64.097              64.097
upper var               84.435              85.743              85.743

Statistic               Volume
means                   34.516
variances              185.261
std. dev                13.611
skewness                 2.111
kurtosis                 7.837
minima                  13.737
maxima                 126.358
ranges                 112.621
C.V.                     0.394
counts                 365.000
lower mean              33.115
upper mean              35.917
lower var              161.038
upper var              215.422

Warning Errors

IMSLS_CANNOT_FIND_DELIMITER

A valid delimiter was not found in the first row of the data file. A single column of data is assumed.