Data Mining Usage Notes

Chapter 13: Data Mining > Data Mining Usage Notes

Data Mining Usage Notes

Data mining is a collection of statistical and analytical methods for extracting useful information from large databases. The problem of extracting information from large databases is common to government, industry, engineering and sciences.

Data Filtering

The first step in this process is to filter data from its raw form into formats required by sophisticated analytical algorithms.

Data fall into two major categories: continuous and categorical. Many algorithms, such as neural network forecasting, perform better if continuous data are mapped into a common scale. The function imsls_f_scale_filter implements several techniques for automatically scaling continuous data, including several variations of z-score scaling. If the continuous data represent a time series, imsls_f_time_series_filter and imsls_f_time_series_class_filter can be used to create a matrix of lagged values required as input to forecasting neural networks.

Categorical data must also be mapped into a corresponding numerical representation before they can be used in solving forecasting and classification problems. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color. The function imsls_f_unsupervised_ordinal_filter encodes and decodes ordinal data into the range [0, 1] using cumulative percentages. The function imsls_f_unsupervised_nominal_filter uses binary encoding to map nominal data into a matrix of zeros and ones.

Genetic Algorithms

The original genetic algorithm is generally attributed to John Holland and his students from the University of Michigan. During the 1970s they investigated the use of concepts in genetics and biology in optimizing a function. Since that original work, many variations of the original algorithm have been developed by pioneers working in the interface between genetics, computer science and statistics to solve complex problems. These include traditional optimization and search problems in engineering, decision making, game solutions, engineering and pattern recognition.

The genetic algorithm operates on a population of individuals designed to represent the problem being solved. Each individual is rated according to a fitness function designed to measure how close that individual is to solving the problem. For optimization problems, the fitness function is usually constructed from the function being optimized. For other problems, the fitness function can be more complex defined only by the algorithm being investigated. A chess program, for example, might use a fitness function that scores the quality of a board position represented by each individual.

The solution represented by each individual in a population is encoded into the individual chromosome. The fitness function calculates a fitness value for each individual from the information in the individual chromosome. An investor might search for the best set of trading rules for optimizing the returns from the individual investment.

In this case, chromosomes would contain encoded representations of different variations of candidate trading rules. One binary bit might indicate whether a particular technical indicator was being used. Another part of the chromosome might be encoded to indicate how that indicator would be used to buy and sell investments. The fitness function would calculate a rate of return for each individual based upon actual historical data.

Several functions are available for building, cloning and copying chromosomes and individuals:

C Numerical Library Function	Description
imsls_f_ga_chromosome	Creates the structure for a chromosome.
imsls_f_ga_clone_chromosome	Creates an exact duplicate of an existing chromosome.
imsls_f_ga_copy_chromosome	Copies the information contained in one chromosome into another.
imsls_f_ga_individual	Creates an individual using an existing chromosome.
imsls_f_ga_clone_individual	Creates an exact duplicate of an existing individual.
imsls_f_ga_copy_individual	Copies the information from one individual into another.

Solving any problem using a genetic algorithm always begins by creating a chromosome used for representing the problem. Four data types can be represented in a chromosome: binary, nominal, integer and real, or continuous attributes. Binary attributes are mapped directly into a chromosome as zeros and ones. A nominal attribute is represented as integers 0, 1, …, k-1, whesre k is the maximum number of classes for that attribute. Integer and real attributes are mapped into a binary representation by dividing the range of the attribute into a finite number of subintervals. The range and number of intervals is supplied by the user when the chromosome is constructed. Either base-2 or Gray encoding can be used to encode integer and real attributes.

By default, encoding and decoding of chromosomes is automatic. That is each individual not only carries the chromosome but it also carries the original phenotype values encoded in the chromosome. Before the fitness of an individual is evaluated by calling the user’s fitness function, the information in its chromosome is decoded into phenotype values. If this is too time consuming, automatic encoding and decoding can be disabled and done within the fitness functions. The functions imsls_f_ga_encode and imsls_f_ga_decode have been provided to encode and decode the chromosome of individuals, if needed. The routine imsls_f_ga_mutate has been provided to allow users to create their own genetic algorithm instead of using imsls_f_genetic_algorithm.

The memory allocated to a chromosome data structure can be released using the imsls_free function. However, the function imsls_f_ga_free_individual has been provided to release memory allocated to an individual.

The genetic algorithm implemented in imsls_f_genetic_algorithm evolves an initial population of individuals through several generations, searching for the optimum individuals. The initial population can be created using one of several functions. The easiest approach is to create a population of randomly selected individuals using imsls_f_ga_random_population. However, in some cases it might be better to initialize the population using an array of individuals selected based upon their fitness or diversity. The function imsls_f_ga_population can create a population data structure from an array of individuals.

In some cases it might be useful to restart a genetic algorithm search using a previous generation. The function imsls_f_ga_clone_population can be used to create an exact duplicate of a population. The function imsls_f_ga_copy_population replaces the individuals in one population with those from another. Two populations can be merged using imsls_f_ga_merge_population and individuals can be added to an existing population using imsls_f_ga_grow_population. Memory allocated to a population can be released using imsls_f_ga_free_population.

The actual search or optimization using an initial population is conducted using imsls_f_genetic_algorithm. This function returns the fittest individual found during the search. Also available are convergence statistics, including generation statistics, and the final population. This information can be used to evaluate the quality of the solution and if an additional search is warranted, the final population might be used as an initial population for that search.

For a detailed overview, see Genetic Algorithm - An Overview.

Naive Bayes

Naive Bayes is a classification algorithm. First a classifier is trained using imsls_f_naive_bayes_trainer. Once this is done imsls_f_naive_bayes_classification can be used to classify patterns with unknown classifications using the trained classifier represented in an Imsls_f_naive_bayes data structure.

In addition, imsls_f_nb_classifier_write can be used to store the data structure created by imsls_f_naive_bayes_trainer for future use. The function imsls_f_nb_classifier_read restores a Naive Bayes data structure from a file written using imsls_f_nb_classifier_write.

Classification problems can be solved using other algorithms such as discriminant analysis and neural networks. In general these alternatives have smaller classification error rates, but they are too slow for large classification problems. During training imsls_f_naive_bayes_trainer uses the non-missing training data to estimate two-way correlations among the attributes. Higher order correlations are assumed to be zero. This can increase the classification error rate, but it significantly reduces the time needed to train the classifier.

In addition, the Naive Bayes algorithm is the only classification algorithm that can handle data with missing values. Other algorithms such as discriminant analysis do not allow missing values in the data. This is a significant limitation for applying other techniques to a larger database.

Memory allocated to the Naive Bayes data structure created by imsls_f_naive_bayes_trainer and imsls_f_nb_classifier_read, can be released using imsls_f_nb_classifier_free.

For a detailed overview, see Naive Bayes - An Overview.

Neural Networks

Neural networks can be used for forecasting and classification. A neural network data structure is first created using imsls_f_mlff_network_init and imsls_f_mlff_network. Although forecasting and classification neural networks are initialized and represented using the same data structure, separate functions are provided for forecasting and classification in order to make them easier to use and to reduce network training times.

Once the network architecture is created, the network can be trained using imsls_f_mlff_network_trainer for forecasting problems and imsls_f_mlff_classification_trainer for classification problems. By default these algorithms initialize the network weights, but weight initialization can be controlled using imsls_f_mlff_initialize_weights.

Once a network is trained either imsls_mlff_network_forecast or imsls_f_mlff_pattern_classification is used to produce forecasts or to classify unknown patterns.

In many cases, network training will be completed on one computer and forecasting or classification might be done on another. The function imsls_f_mlff_network_write stores a trained network to a file which can be restored using imsls_f_mlff_network_read.

Memory allocated to a neural network data structure can be released using imsls_f_mlff_network_free.

For a detailed overview, see Neural Networks - An Overview.

Contact Support