data_read
Reads column-oriented data from a delimited ASCII file and returns a structure with the number of rows and columns and a double matrix containing the data.
Synopsis
#include <imsls.h>
Imsls_data_matrix *imsls_data_read (char *data_filename, char *dictionary_filename, ..., 0)
Required Arguments
char *data_filename (Input)
The name of the file to read.
char *dictionary_filename (Input)
The name of a comma-delimited file containing the data dictionary. The data dictionary specifies for each column a name, a data type, and an additional format for date/time columns. If every column is numeric, use NULL for the dictionary file name.
Return Value
A pointer to a structure Imsls_data_matrix containing the number of rows and columns and a matrix containing the data transformed to double values. To release this space, use imsls_free_data_matrix. If no value can be computed, then NULL is returned.
Synopsis with Optional Arguments
#include <imsls.h>
Imsls_data_matrix *imsls_data_read (char *data_filename, char *dictionary_filename,
IMSLS_DATA_HEADER,
IMSLS_DELIM, char user_delim,
IMSLS_IGNORE, int n_ignore, char **strings_to_ignore,
IMSLS_FILTERS, int n_filter, char *chars_to_filter,
IMSLS_MAX_NUM_CATEGORIES, int max_num_categories,
IMSLS_REPLACEMENT_NUMBERS, int n_replacements, char **strings_to_replace, float *replacement_numbers,
IMSLS_REPLACEMENT_STRINGS, int n_replacements, char **strings_to_replace, char **replacement_strings,
IMSLS_RETURN_COLUMN_NAMES, char ***column_names,
IMSLS_RETURN_COLUMN_INFO, Imsls_column_info **column_info,
0)
Optional Arguments
IMSLS_DATA_HEADER, (Input)
Indicates that the first row of the data file contains column names or other information. This row will be skipped.
Default: The first row is assumed not to be a header.
IMSLS_DELIM, char user_delim (Input)
A character specifying the (single) file delimiter. user_delim should be one of comma, { ‘,’ }, semicolon, { ‘;’ }, or space {‘ ‘}.
Default: One of the delimiters {‘,’, ‘;’, ‘ ’} is determined by searching the first few rows of the data file. The file should have only one delimiter throughout.
IMSLS_IGNORE, int n_ignore, char **strings_to_ignore (Input)
The number of strings and the array of length n_ignore containing strings. When any string in strings_to_ignore is found, the entire row is ignored. This is one way to detect and remove invalid rows, such as rows with missing values.
Default: No rows are ignored.
IMSLS_FILTERS, int n_filter, char *chars_to_filter (Input)
The number of chars and the array of length n_filter containing characters. When any character in chars_to_filter is found, they are filtered out of the field.
Default: No characters are filtered.
IMSLS_MAX_NUM_CATEGORIES, int max_num_categories (Input)
The maximum number of categories or class levels to allow or expect in string variables.
Default: max_num_categories = 100.
IMSLS_REPLACEMENT_NUMBERS, int n_replacements, char **strings_to_replace, float *replacement_numbers (Input)
The number of replacement strings, an array of length n_replacements containing strings to be replaced, and an array of length n_replacements containing numeric values that replace each of the strings as they are found in the data. This is one way to replace missing values with some default numeric values.
Note: If IMSLS_REPLACEMENTS_STRINGS is also used, strings_to_replace and n_replacements should be the same in each instance.
Default: Blanks are replaced by 0.0.
IMSLS_REPLACEMENT_STRINGS, int n_replacements, char **strings_to_replace, char **replacement_strings (Input)
The number of replacement strings, an array of length n_replacements containing strings to be replaced, and an array of length n_replacements containing strings that replace each of the strings as they are found in the data. This is one way to replace missing values with some default strings.
Note: If IMSLS_REPLACEMENTS_NUMBERS is also used, strings_to_replace and n_replacements should be the same in each instance.
Default: Blanks are replaced by “BLANK”.
IMSLS_RETURN_COLUMN_NAMES, char ***column_names (Output)
The address of a pointer to an array containing the column names.
IMSLS_RETURN_COLUMN_INFO, Imsls_column_info **column_info (Output)
Returns the address of a pointer to a structure containing column information for each column, such as name, and for string variables, categorical class levels, and class counts. Use imsls_free_column_info to free the memory.
Description
Function imsls_data_read reads column-oriented data from a delimited ASCII file and returns a structure with the number of rows and columns and a double matrix containing the data.
Data Dictionary
If every column is numeric (float or double), no data dictionary is needed. Otherwise, if the data contains strings or dates, imsls_data_read requires a simple dictionary file to specify which columns are strings, dates, or numeric. The dictionary file is a comma delimited flat file with the following information for each column of the data file:
<Column_name>, <data type> [,<date time format>]
Note that all data types are returned as double in the return matrix. Column names should not contain any of the common delimiters, comma, semicolon, or space ( {“,”,”;”,” “}).
For example, a dictionary file with 6 column names and 6 different data types is shown below:
Column1, float
Column2, date, YYYY/MM/DD
Column3, time, HH:MnMn:SS
Column4, string
Column5, int
Column6, datetime, YYYY/MM/DD HHMnMn
The corresponding data file might look like
1700, 2012/05/03, 17:32:12, XYZ, 3, 2011/05/03 1035
1200, 2017/04/01, 11:20:10, EFG, 2, 2009/04/01 1152
.
.
.
2263, 2021/06/03, 20:17:05, BCG, 2, 2013/06/03 835
3200, 2022/03/16, 13:21:42, ANY, 3, 2014/03/16 713
Data Types
Table 1 contains the permitted data types.
Table 1 Data types permitted in the data dictionary
Data type |
Behavior of |
float |
Reads in as float, converts to double in return. |
double |
Reads in as double, converts to double in return. |
int |
Reads in as int, converts to double in return. |
string |
Reads in as string, converts to double in return. The string values are encoded into 1, 2, 3,.. and returned as doubles. |
date |
Reads in as date in a specified format, converts to double (JULIAN date number) in return. |
time |
Reads in as time in a specified format, converts to double (JULIAN date number) in return. |
datetime |
Reads in as datetime in a specified format, converts to double (JULIAN date number) in return. |
skip |
Skips the column. |
Date and Time Formats
Tables 2, 3, and 4 contain the permitted date, time, and datetime formats.
Table 2 Date formats
Data formats to be specified in the dictionary file |
Example |
YYYY/MM/DD |
2019/04/28 |
MM/DD/YYYY |
11/23/2020 |
DD/MM/YYYY |
23/11/2020 |
DD-mmm-YYYY |
23-Nov-2020 |
DD-mmmmmm-YYYY |
23-November-2020 |
mmmmmm-DD-YYYY |
November-23-2020 |
Table 3 Time formats
Time formats |
Examples |
HH:MnMn:SS |
13:45:13 |
HHMnMn |
1345 |
Datetime fields have both a date and time. The datetime formats are simply a concatenation (with a space in between) of the date and time formats.
Table 4 Datetime formats
Datetime formats |
YYYY/MM/DD HH:MnMn:Ss YYYY/MM/DD HHMnMn |
MM/DD/YYYY HH:MnMn:SS MM/DD/YYYY HHMnMn |
DD/MM/YYYY HH:MnMn:SS DD/MM/YYYY HHMnMn |
DD-mmm-YYYY HH:MnMn:SS DD-mmm-YYYY HHMnMn |
DD-mmmmmm-YYYY HH:MnMn:SS DD-mmmmmm-YYYY HHMnMn |
Missing Values
By default, imsls_data_read expects missing values to be represented by blank or empty values between delimiters. In numeric columns, blanks will be treated as 0.0. In string columns, blanks are replaced with the string “BLANK” and treated as any other category when the column is encoded into integers. The user has an option to define other missing values and replacements for both numbers and strings via the optional arguments, IMSLS_REPLACEMENT_NUMBERS and IMSLS_REPLACEMENT_STRINGS. See examples 3 and 4.
Remarks
Function imsls_data_read is a user-friendly interface to imsls_ascii_read and exposes many, but not all, of its features. Refer to the IMSL function imsls_ascii_read for more information.
Examples
Example 1
The contents of a space delimited file are the integers 1 to 20.
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
This example reads the file and displays the contents of the file.
#include <imsls.h>
int main()
{
Imsls_data_matrix *result = NULL;
result = imsls_data_read("monotonic.dat", NULL,0);
imsls_d_write_matrix("Monotonic.dat", result->n_rows,
result->n_cols, result->double_data_matrix, 0);
imsls_free_data_matrix(result);
}
Output
Example 1 data
1 2 3 4 5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
4 16 17 18 19 20
Example 2
The contents of the comma delimited data file for example 2 are:
1,2,3,4,1981/11/22 5:45:12
6,7,8,9,1981/11/23 5:45:12
11,12,13,15,1981/11/24 5:45:12
16,17,18,20,1981/11/25 5:45:12
The data includes a datetime field and so we need to use a dictionary file:
Col1, int
Col2, int
Col3, int
Col4, int
Col5, datetime, YYYY/MM/DD HH:MnMn:SS
The example reads the file and prints out the resulting data matrix, using the column names as column labels in the printout.
#include <imsls.h>
int main()
{
char fname[] = "ex2data.dat";
char dictionary[] = "ex2dictionary.dat";
char **column_names = NULL;
char *column_labels[6];
int i;
Imsls_data_matrix *result = NULL;
result = imsls_data_read(fname, dictionary,
IMSLS_RETURN_COLUMN_NAMES, &column_names, 0);
column_labels[0] = "Row";
for (i = 1; i <= 7; i++) {
column_labels[i] = column_names[i - 1];
}
imsls_d_write_matrix("Example 2 data", result->n_rows, result->n_cols,
result->double_data_matrix,
IMSLS_COL_LABELS, column_labels,
0);
if (column_names) {
for (i = 0; i < result->n_cols; i++) {
if (column_names[i]) {
imsls_free(column_names[i]);
column_names[i] = NULL;
}
}
imsls_free(column_names);
column_names = NULL;
}
imsls_free_data_matrix(result);
}
Output
Example 2 data
Row Col1 Col2 Col3 Col4 Col5
1 1 2 3 4 83710
2 6 7 8 9 83711
3 11 12 13 15 83712
4 16 17 18 20 83713
Example 3
The contents of the file for example 3 are
name, phone, zipcode, age, income, education
Foghorn Leghorn, 555-777-8888, 759xx, 47, 47500, GED
Henny Penny, 555-555-9999, 692xx, 35, 62000, BS
Rumplestiltskin, 555-888-2222, 775xx, 102, 13000,
Red Ridinghood, ,578xx, 12,, Elem
Pinnochio, 555-999-5555,821xx, 21, 10000000, PHD
Goldilocks, 555-888-7711, 879xx, 10, 200, Elem
Mama Bear, 555-111-9999, 879xx, 4, ,
Papa Bear, 555-111-9999, 879xx, 4.5,,
Baby Bear, 555-111-9999, 879xx, .333,,
and the dictionary file is
name, skip
phone, skip
zipcode, string
age, double
income, double
education, string
Function imsls_data_read by default replaces the empty (blank) fields in the data with “BLANK” for string data and 0.0 for numeric data. Alternatively, the user can specify replacement values. For any missing values other than blanks, the default call to imsls_data_read results in an error. That is, the user MUST specify missing values instructions for any values other than blank.
In this example, imsls_data_read is called both ways, using default replacements and user supplied replacements. The example also shows how string columns are encoded within imsls_data_read, using integer encoding, and follows that with one-hot encoding using imsls_unsupervised_filter.
#include <imsls.h>
int main()
{
char fname[] = "names.dat";
char dictionary[] = "namesdictionary.dat";
char **column_names = NULL;
char *print_column_labels[5];
char *miss_vals[] = { "null", "#N/A", "", " " };
char *repl_str[] = { "MISSING","MISSING","EMPTY","EMPTY" };
float repl_nums[] = { -99999.0, -99999.0, 500.0, 500.0 };
char delim = ‘,’;
int i, j;
Imsls_data_matrix *result = NULL;
Imsls_column_info *column_info = NULL;
result = imsls_data_read(fname, dictionary,
IMSLS_DELIM, delim,
IMSLS_DATA_HEADER,
IMSLS_RETURN_COLUMN_NAMES, &column_names,
0);
print_column_labels[0] = "Row";
for (i = 1; i <= 4; i++) {
print_column_labels[i] = column_names[i - 1];
}
imsls_d_write_matrix("Resulting data matrix, default replacements:",
result->n_rows,
result->n_cols, result->double_data_matrix,
IMSLS_COL_LABELS, print_column_labels,
IMSLS_WRITE_FORMAT, "%12.2f", 0);
imsls_free_data_matrix(result);
result = imsls_data_read(fname, dictionary,
IMSLS_DELIM, delim,
IMSLS_DATA_HEADER,
IMSLS_REPLACEMENT_NUMBERS, 4, miss_vals, repl_nums,
IMSLS_REPLACEMENT_STRINGS, 4, miss_vals, repl_str,
IMSLS_RETURN_COLUMN_INFO, &column_info,
0);
imsls_d_write_matrix("Resulting data matrix, user specified replacements:",
result->n_rows,
result->n_cols, result->double_data_matrix,
IMSLS_COL_LABELS, print_column_labels,
IMSLS_WRITE_FORMAT, "%12.2f", 0);
for (i = 0; i < result->n_cols; i++) {
if (column_info[i].n_levels > 0) {
printf("\nInteger encoding for column %s:\n", column_info[i].name);
printf("\nEncoded Value String Count\n");
for (j = 0; j < column_info[i].n_levels; j++) {
printf("%8d%17s%11d\n", j + 1, column_info[i].class_levels[j],
column_info[i].class_counts[j]);
}
}
}
/* one-hot encoding for zipcode*/
int temp_column[100], n_levels;
/* zipcode is the first column.*/
for (i = 0; i < result->n_rows; i++) {
temp_column[i] = (int)(result->double_data_matrix[i*result->n_cols]);
}
/* imsls_unsupervised_nominal_filter accepts integer array
* and returns integer matrix.
*/
int *dummy_cols_zipcode =
imsls_unsupervised_nominal_filter(result->n_rows,
&n_levels, temp_column, 0);
imsls_i_write_matrix("One hot encoding for zipcode:", result->n_rows,
n_levels, dummy_cols_zipcode,
0);
if (column_names) {
for (i = 0; i < result->n_cols; i++) {
if (column_names[i]) {
imsls_free(column_names[i]);
column_names[i] = NULL;
}
}
imsls_free(column_names);
column_names = NULL;
}
if (dummy_cols_zipcode) {
imsls_free(dummy_cols_zipcode);
dummy_cols_zipcode = NULL;
}
imsls_free_column_info(result->n_cols, column_info);
imsls_free_data_matrix(result);
}
Output
Resulting data matrix, default replacements:
Row zipcode age income education
1 1.00 47.00 47500.00 1.00
2 2.00 35.00 62000.00 2.00
3 3.00 102.00 13000.00 3.00
4 4.00 12.00 0.00 4.00
5 5.00 21.00 10000000.00 5.00
6 6.00 10.00 200.00 4.00
7 6.00 4.00 0.00 3.00
8 6.00 4.50 0.00 3.00
9 6.00 0.33 0.00 3.00
Resulting data matrix, user specified replacements:
Row zipcode age income education
1 1.00 47.00 47500.00 1.00
2 2.00 35.00 62000.00 2.00
3 3.00 102.00 13000.00 3.00
4 4.00 12.00 500.00 4.00
5 5.00 21.00 10000000.00 5.00
6 6.00 10.00 200.00 4.00
7 6.00 4.00 500.00 3.00
8 6.00 4.50 500.00 3.00
9 6.00 0.33 500.00 3.00
Integer encoding for column zipcode:
Encoded Value String Count
1 759xx 1
2 692xx 1
3 775xx 1
4 578xx 1
5 821xx 1
6 879xx 4
Integer encoding for column education:
Encoded Value String Count
1 GED 1
2 BS 1
3 EMPTY 4
4 Elem 2
5 PHD 1
One hot encoding for zipcode:
1 2 3 4 5 6
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 0 0
5 0 0 0 0 1 0
6 0 0 0 0 0 1
7 0 0 0 0 0 1
8 0 0 0 0 0 1
9 0 0 0 0 0 1
Example 4
A comma delimited file contains one year of bitcoin (BTC) stock prices (source: Yahoo Finance). The first few rows look like:
Date,Open,High,Low,Close,AdjClose,Volume
4/27/2021,54030.30469,55416.96484,53319.1875,55033.11719,55033.11719,49448222757
4/28/2021,55036.63672,56227.20703,53887.91797,54824.70313,54824.70313,48000572955
4/29/2021,54858.08984,55115.84375,52418.02734,53555.10938,53555.10938,46088929780
An associated dictionary file has the following rows:
Date,date,MM/DD/YYYY
Open,double,
High,double,
Low,double,
Close,double,
AdjClose,double,
Volume,double
The example reads the data, ignores any rows with “null” prices, and runs sample statistics on each numeric column after converting the units to thousands (for prices), and billions (for volume).
#include <imsls.h>
int main()
{
char fname[] = "BTC-USD.csv";
char dictionary[] = "BTC-USDDictionary.csv";
char *ignore_vals[] = { "null" };
char **column_names = NULL;
char *column_labels[8];
int i, j;
Imsls_data_matrix *result = NULL;
double *simple_statistics = NULL;
char *row_labels[] = {
"means", "variances", "std. dev", "skewness", "kurtosis",
"minima", "maxima", "ranges", "C.V.", "counts", "lower mean",
"upper mean", "lower var", "upper var"};
result = imsls_data_read(fname, dictionary,
IMSLS_DATA_HEADER,
IMSLS_IGNORE, 1, ignore_vals,
IMSLS_RETURN_COLUMN_NAMES, &column_names,
0);
/* financial data expressing price in (thousands) and
* the volume in (billions)
*/
float units[] = { 1, 1000, 1000, 1000, 1000, 1000, 1000000000 };
for (i = 0; i < result->n_rows; i++) {
for (j = 1; j < result->n_cols; j++) {
result->double_data_matrix[i*result->n_cols + j] /= units[j];
}
}
/* test on the column statistics. */
simple_statistics = imsls_d_simple_statistics(result->n_rows,
result->n_cols, result->double_data_matrix, 0);
column_labels[0] = "Statistic";
for (i = 1; i <= 7; i++) {
column_labels[i] = column_names[i - 1];
}
imsls_d_write_matrix("* * * Statistics * * *\n", 14, result->n_cols,
simple_statistics,
IMSLS_ROW_LABELS, row_labels,
IMSLS_COL_LABELS, column_labels,
IMSLS_WRITE_FORMAT, "%18.3f", 0);
}
if (column_names) {
for (i = 0; i < result->n_cols; i++) {
if (column_names[i]) {
imsls_free(column_names[i]);
column_names[i] = NULL;
}
}
imsls_free(column_names);
column_names = NULL;
}
imsls_free_data_matrix(result);
if (simple_statistics)
imsls_free(simple_statistics);
return;
}
Output
* * * Statistics * * *
Statistic Date Open High
means 98293.003 45.376 46.493
variances 11133.503 74.114 75.826
std. dev 105.515 8.609 8.708
skewness 0.000 0.585 0.593
kurtosis -1.200 -0.436 -0.440
minima 98111.000 29.796 31.006
maxima 98476.000 67.550 68.790
ranges 365.000 37.753 37.783
C.V. 0.001 0.190 0.187
counts 365.000 365.000 365.000
lower mean 98282.142 44.490 45.596
upper mean 98303.864 46.262 47.389
lower var 9677.820 64.424 65.912
upper var 12946.086 86.180 88.171
Statistic Low Close AdjClose
means 44.079 45.329 45.329
variances 72.613 73.738 73.738
std. dev 8.521 8.587 8.587
skewness 0.548 0.597 0.597
kurtosis -0.455 -0.411 -0.411
minima 28.894 29.807 29.807
maxima 66.382 67.567 67.567
ranges 37.488 37.759 37.759
C.V. 0.193 0.189 0.189
counts 365.000 365.000 365.000
lower mean 43.202 44.445 44.445
upper mean 44.956 46.213 46.213
lower var 63.119 64.097 64.097
upper var 84.435 85.743 85.743
Statistic Volume
means 34.516
variances 185.261
std. dev 13.611
skewness 2.111
kurtosis 7.837
minima 13.737
maxima 126.358
ranges 112.621
C.V. 0.394
counts 365.000
lower mean 33.115
upper mean 35.917
lower var 161.038
upper var 215.422
Warning Errors
IMSLS_CANNOT_FIND_DELIMITER |
A valid delimiter was not found in the first row of the data file. A single column of data is assumed. |