Data Mining Usage Notes
Data mining is a collection of statistical and analytical methods for extracting useful information from large databases. The problem is common to government, industry, engineering, and the sciences.
Apriori
The Apriori algorithm is used for association rule discovery. Association rules are statements of the form "if X, then Y," given with some measure of confidence. The main application for association rule discovery is market basket analysis, in which X and Y are products or groups of products, and the occurrences are individual transactions, or "market baskets." The results help sellers learn relationships between the products they sell, supporting better marketing decisions. Besides market basket analysis, association rule discovery has been used in the areas of text mining and bioinformatics. The function imsls_f_apriori implements the Apriori algorithm. The function imsls_f_aggr_apriori performs the Apriori algorithm on subsets of transactions and aggregates the results.
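As an illustrative sketch of the calling sequence (the transaction data layout and the argument order shown here are assumptions; see the Description section of imsls_f_apriori for the exact signature):

    #include <imsls.h>

    int main(void)
    {
        /* Each row of x is assumed to be one (transaction ID, item ID)
           pair; five rows describe three market baskets over items 1-4. */
        int x[] = {1, 3,
                   1, 2,
                   2, 1,
                   2, 3,
                   3, 3};
        int n = 5;                /* number of rows in x         */
        int max_num_products = 4; /* items are labeled 1, ..., 4 */

        /* Mine frequent itemsets; optional arguments (the list is
           terminated by 0) control support/confidence thresholds. */
        Imsls_f_apriori_itemsets *itemsets =
            imsls_f_apriori(n, x, max_num_products, 0);

        /* itemsets now holds the frequent itemsets; association rules
           can be derived from them (see the Description section). */
        return 0;
    }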
For more details, see the Description section of imsls_f_apriori.
Decision Trees
Decision trees are data mining methods for predicting a single response variable based on multiple predictor variables. If the response variable is categorical or discrete, the problem is a classification problem; if the response is continuous, it is a regression problem. Decision trees are applicable in both situations. The function imsls_f_decision_tree includes four of the most widely used decision tree algorithms: C4.5, ALACART, CHAID, and QUEST. The function imsls_f_decision_tree_predict applies a decision tree to a new set of data.
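A minimal sketch of the train-and-predict cycle follows; the column type codes and the prediction data layout are assumptions here, so consult the Description sections for the exact signatures:

    #include <imsls.h>

    int main(void)
    {
        /* Six cases: a continuous predictor in column 0 and a categorical
           response in column 1.  The type codes (0 = categorical,
           2 = continuous) are assumed. */
        float xy[] = {2.0, 0.0,
                      3.5, 0.0,
                      4.1, 1.0,
                      5.0, 1.0,
                      1.2, 0.0,
                      6.3, 1.0};
        int var_type[] = {2, 0};
        int response_col_idx = 1;

        /* Fit a tree with the default method. */
        Imsls_f_decision_tree *tree =
            imsls_f_decision_tree(6, 2, xy, response_col_idx, var_type, 0);

        /* Apply the fitted tree to two new cases (assumed layout: same
           columns as xy, with the response values ignored). */
        float x_new[] = {3.0, 0.0,
                         5.5, 0.0};
        float *predictions =
            imsls_f_decision_tree_predict(2, 2, x_new, tree, 0);

        imsls_f_decision_tree_free(tree);
        return 0;
    }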
For a detailed overview, see Decision Trees – An Overview and the Description section of imsls_f_decision_tree.
Random Decision Trees
The ensemble method known as random forest (Breiman, 2001) fits a collection of decision trees on bootstrap samples of the training data. In addition, a random subset of the predictor variables is considered at each branching or splitting decision within the tree-growing algorithm. This extra randomization reduces the correlation among the trees in the ensemble. The method is available in the decision tree function imsls_f_decision_tree.
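The bootstrap part of the idea can be sketched independently of any particular library call; in the sketch below, the commented fit_tree is a placeholder for a tree-fitting routine such as imsls_f_decision_tree, not a library function:

    #include <stdlib.h>

    /* For each of n_trees rounds, draw n training rows with replacement
       into resample and fit a tree to that resample.  Inside each tree,
       a random subset of predictors would also be drawn at every split. */
    void random_forest_sketch(int n_trees, int n, int n_cols,
                              const float xy[], float resample[])
    {
        for (int t = 0; t < n_trees; t++) {
            for (int i = 0; i < n; i++) {
                int j = rand() % n;  /* bootstrap: row j may repeat */
                for (int c = 0; c < n_cols; c++)
                    resample[i * n_cols + c] = xy[j * n_cols + c];
            }
            /* fit_tree(n, n_cols, resample);  placeholder for the fit */
        }
    }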
Gradient Boosting
The function imsls_f_gradient_boosting implements the stochastic gradient tree boosting algorithm of Friedman (1999). The algorithm combines the outputs of relatively weak classifiers or predictive models to achieve iteratively better accuracy in either regression problems (the response variable is continuous) or classification problems (the response variable has two or more discrete values). Gradient boosting is an ensemble method, but instead of using independent trees, it forms a sequence of trees, iteratively and judiciously re-weighted to minimize prediction errors. In particular, the decision tree at iteration m+1 is estimated on pseudo-residuals generated using the decision tree at step m, so successive trees depend on their predecessors. The algorithm iterates for a fixed number of times and stops, rather than iterating until a convergence criterion is met; the number of iterations is therefore a parameter of the model. Using a randomly selected subset of the training data in each iteration has been shown to substantially improve efficiency and robustness, hence the name stochastic gradient boosting. For further discussion, see Hastie et al. (2009).
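For squared-error loss, the recursion can be sketched as follows; fit_tree and its prediction h_m are placeholders for the weak learner, not library functions:

    /* One boosting run: F holds the current fit F_m(x_i) for each of the
       n training cases, r receives the pseudo-residuals, nu is the
       shrinkage (step size), and n_iter is the fixed iteration count. */
    void gradient_boosting_sketch(int n_iter, float nu, int n,
                                  const float y[], float F[], float r[])
    {
        for (int m = 0; m < n_iter; m++) {
            /* Pseudo-residuals from the step-m fit (squared-error loss). */
            for (int i = 0; i < n; i++)
                r[i] = y[i] - F[i];

            /* Fit the next tree h_m to r on a random subsample of the
               training data (the "stochastic" part), then take a damped
               step:  F[i] += nu * h_m(x_i);  -- placeholder call. */
        }
    }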
Genetic Algorithms
The original genetic algorithm is generally attributed to John Holland and his students at the University of Michigan, who during the 1970s investigated the use of concepts from genetics and biology in optimizing a function. Since that original work, many variations of the algorithm have been developed by researchers working at the interface of genetics, computer science, and statistics to solve complex problems, including traditional optimization and search problems in engineering, decision making, game playing, and pattern recognition.
The genetic algorithm operates on a population of individuals designed to represent the problem being solved. Each individual is rated according to a fitness function designed to measure how close that individual is to solving the problem. For optimization problems, the fitness function is usually constructed from the function being optimized. For other problems, the fitness function can be more complex, defined only by the problem being investigated. A chess program, for example, might use a fitness function that scores the quality of the board position represented by each individual.
The solution represented by each individual in a population is encoded into the individual's chromosome. The fitness function calculates a fitness value for each individual from the information in that chromosome. An investor, for example, might search for the set of trading rules that optimizes the return on an investment.
In this case, chromosomes would contain encoded representations of different variations of candidate trading rules. One binary bit might indicate whether a particular technical indicator was being used. Another part of the chromosome might be encoded to indicate how that indicator would be used to buy and sell investments. The fitness function would calculate a rate of return for each individual based upon actual historical data.
Several functions are available for building, cloning and copying chromosomes and individuals:
C Stat Library Function        Description
imsls_f_ga_chromosome          Creates the structure for a chromosome.
imsls_f_ga_clone_chromosome    Creates an exact duplicate of an existing chromosome.
imsls_f_ga_copy_chromosome     Copies the information contained in one chromosome into another.
imsls_f_ga_individual          Creates an individual using an existing chromosome.
imsls_f_ga_clone_individual    Creates an exact duplicate of an existing individual.
imsls_f_ga_copy_individual     Copies the information from one individual into another.
Solving any problem using a genetic algorithm always begins by creating a chromosome used for representing the problem. Four types of attributes can be represented in a chromosome: binary, nominal, integer, and real (continuous). Binary attributes are mapped directly into a chromosome as zeros and ones. A nominal attribute is represented as the integers 0, 1, …, k-1, where k is the maximum number of classes for that attribute. Integer and real attributes are mapped into a binary representation by dividing the range of the attribute into a finite number of subintervals. The range and number of intervals are supplied by the user when the chromosome is constructed. Either base-2 or Gray encoding can be used to encode integer and real attributes.
By default, encoding and decoding of chromosomes is automatic. That is, each individual carries not only the chromosome but also the original phenotype values encoded in it. Before the fitness of an individual is evaluated by calling the user's fitness function, the information in its chromosome is decoded into phenotype values. If this is too time-consuming, automatic encoding and decoding can be disabled and done within the fitness function instead. The functions imsls_f_ga_encode and imsls_f_ga_decode have been provided to encode and decode the chromosome of an individual, if needed. The routine imsls_f_ga_mutate has been provided to allow users to create their own genetic algorithm instead of using imsls_f_genetic_algorithm.
The memory allocated to a chromosome data structure can be released using the imsls_free function. However, the function imsls_f_ga_free_individual has been provided to release memory allocated to an individual.
The genetic algorithm implemented in imsls_f_genetic_algorithm evolves an initial population of individuals through several generations, searching for the optimum individuals. The initial population can be created using one of several functions. The easiest approach is to create a population of randomly selected individuals using imsls_f_ga_random_population. However, in some cases it might be better to initialize the population using an array of individuals selected based upon their fitness or diversity. The function imsls_f_ga_population can create a population data structure from an array of individuals.
In some cases it might be useful to restart a genetic algorithm search using a previous generation. The function imsls_f_ga_clone_population can be used to create an exact duplicate of a population. The function imsls_f_ga_copy_population replaces the individuals in one population with those from another. Two populations can be merged using imsls_f_ga_merge_population and individuals can be added to an existing population using imsls_f_ga_grow_population. Memory allocated to a population can be released using imsls_f_ga_free_population.
The actual search or optimization using an initial population is conducted using imsls_f_genetic_algorithm. This function returns the fittest individual found during the search. Also available are convergence statistics, including generation statistics, and the final population. This information can be used to evaluate the quality of the solution and, if an additional search is warranted, the final population can be used as the initial population for that search.
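A compressed sketch of that workflow follows; the chromosome flag and argument layout, the fitness callback signature, and the realPhenotype field name are all assumptions made for illustration, so check the individual function descriptions before use:

    #include <imsls.h>

    /* Fitness callback: maximize -(x - 1)^2 over a single real attribute.
       The callback signature and the realPhenotype field are assumed. */
    static float fitness(Imsls_f_individual *individual)
    {
        float x = individual->realPhenotype[0];
        return -(x - 1.0) * (x - 1.0);
    }

    int main(void)
    {
        float bounds[]  = {-10.0, 10.0}; /* search interval           */
        int   intervals = 1024;          /* subintervals for encoding */

        /* One real attribute, binary-encoded over 1024 subintervals. */
        Imsls_f_chromosome *chromosome =
            imsls_f_ga_chromosome(IMSLS_REAL, 1, &intervals, bounds, 0);

        /* Random initial population of 100 individuals. */
        Imsls_f_population *population =
            imsls_f_ga_random_population(100, chromosome, 0);

        /* Evolve the population; the fittest individual is returned. */
        Imsls_f_individual *best =
            imsls_f_genetic_algorithm(fitness, population, 0);

        imsls_f_ga_free_individual(best);
        imsls_f_ga_free_population(population);
        return 0;
    }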
For a detailed overview, see Genetic Algorithms – An Overview.
Naive Bayes
Naive Bayes is a classification algorithm. First, a classifier is trained using imsls_f_naive_bayes_trainer. Once this is done, imsls_f_naive_bayes_classification can be used to classify patterns with unknown classifications using the trained classifier represented in an Imsls_f_nb_classifier data structure.
In addition, imsls_f_nb_classifier_write can be used to store the data structure created by imsls_f_naive_bayes_trainer for future use. The function imsls_f_nb_classifier_read restores a Naive Bayes data structure from a file written using imsls_f_nb_classifier_write.
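A compressed sketch of the training and classification calls (the optional-argument names and layouts shown are assumptions; the Description sections give the exact signatures):

    #include <imsls.h>

    int main(void)
    {
        /* Six training patterns, two classes, one continuous attribute. */
        int   classification[] = {0, 0, 0, 1, 1, 1};
        float continuous[]     = {1.0, 1.2, 0.8, 3.1, 2.9, 3.3};

        /* Train; the classifier structure is assumed to be returned
           through an optional argument. */
        Imsls_f_nb_classifier *nbc = NULL;
        imsls_f_naive_bayes_trainer(6, 2, classification,
                                    IMSLS_CONTINUOUS, 1, continuous,
                                    IMSLS_NB_CLASSIFIER, &nbc,
                                    0);

        /* Classify two new patterns with the trained classifier. */
        float x_new[] = {1.1, 3.0};
        int *predicted =
            imsls_f_naive_bayes_classification(nbc, 2,
                                               IMSLS_CONTINUOUS, x_new,
                                               0);

        imsls_f_nb_classifier_free(nbc);  /* release the classifier */
        return 0;
    }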
Classification problems can be solved using other algorithms, such as discriminant analysis and neural networks. In general, these alternatives have smaller classification error rates, but they can be too slow for large classification problems. During training, imsls_f_naive_bayes_trainer uses the non-missing training data to estimate two-way correlations among the attributes; higher-order correlations are assumed to be zero. This can increase the classification error rate, but it significantly reduces the time needed to train the classifier.
In addition, the Naive Bayes algorithm handles data with missing values directly. Other algorithms, such as discriminant analysis, do not allow missing values in the data, which is a significant limitation when applying those techniques to a larger database.
Memory allocated to the Naive Bayes data structure created by imsls_f_naive_bayes_trainer or imsls_f_nb_classifier_read can be released using imsls_f_nb_classifier_free.
For a detailed overview, see Naive Bayes – An Overview.
Neural Networks
Neural networks can be used for forecasting and classification. A neural network data structure is first created using imsls_f_mlff_network_init and imsls_f_mlff_network. Although forecasting and classification neural networks are initialized and represented using the same data structure, separate functions are provided for forecasting and classification in order to make them easier to use and to reduce network training times.
Once the network architecture is created, the network can be trained using imsls_f_mlff_network_trainer for forecasting problems and imsls_f_mlff_classification_trainer for classification problems. By default, these algorithms initialize the network weights, but weight initialization can be controlled using imsls_f_mlff_initialize_weights.
Once a network is trained, either imsls_f_mlff_network_forecast or imsls_f_mlff_pattern_classification is used to produce forecasts or to classify unknown patterns.
In many cases, network training will be completed on one computer and forecasting or classification might be done on another. The function imsls_f_mlff_network_write stores a trained network to a file which can be restored using imsls_f_mlff_network_read.
Memory allocated to a neural network data structure can be released using imsls_f_mlff_network_free.
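The full cycle can be sketched as follows for a forecasting network; the hidden-layer flag and the trainer/forecast argument layouts are assumptions, so consult the individual function descriptions:

    #include <imsls.h>

    int main(void)
    {
        /* A network with 2 continuous inputs, one hidden layer of 3
           perceptrons, and 1 output. */
        Imsls_f_NN_Network *network = imsls_f_mlff_network_init(2, 1);
        imsls_f_mlff_network(network, IMSLS_CREATE_HIDDEN_LAYER, 3, 0);

        /* Three training patterns: no nominal attributes, two continuous
           attributes per pattern, one target output per pattern. */
        float continuous[] = {0.1, 0.9,
                              0.8, 0.2,
                              0.5, 0.5};
        float output[] = {1.0, 0.0, 0.5};
        imsls_f_mlff_network_trainer(network, 3, 0, 2,
                                     NULL, continuous, output, 0);

        /* Forecast for one new pattern. */
        float x_new[] = {0.3, 0.7};
        float *forecast =
            imsls_f_mlff_network_forecast(network, 0, 2, NULL, x_new, 0);

        imsls_f_mlff_network_free(network);  /* release the network */
        return 0;
    }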
For a detailed overview, see Neural Networks – An Overview.
Data Filtering
The first step in a typical data mining application is to filter raw data into the formats required by sophisticated analytical algorithms.
Data fall into two major categories: continuous and categorical. Many algorithms, such as neural network forecasting, perform better if continuous data are mapped into a common scale. The function imsls_f_scale_filter implements several techniques for automatically scaling continuous data, including several variations of z-score scaling. If the continuous data represent a time series, imsls_f_time_series_filter and imsls_f_time_series_class_filter can be used to create a matrix of lagged values required as input to forecasting neural networks.
Categorical data must also be mapped into a corresponding numerical representation before they can be used in solving forecasting and classification problems. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color. The function imsls_f_unsupervised_ordinal_filter encodes and decodes ordinal data into the range [0, 1] using cumulative percentages. The function imsls_f_unsupervised_nominal_filter uses binary encoding to map nominal data into a matrix of zeros and ones.
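For example (the scale_filter method code and the argument orders shown are assumptions; see the Description sections for the exact signatures):

    #include <imsls.h>

    int main(void)
    {
        /* Scale a continuous variable; the method code 1 is assumed to
           select one of the z-score variants. */
        float x[] = {10.0, 20.0, 30.0, 40.0};
        float *scaled = imsls_f_scale_filter(4, x, 1, 0);

        /* Binary-encode a nominal variable with three categories into a
           matrix of zeros and ones; n_classes is returned by the filter. */
        int z[] = {1, 3, 2, 1};
        int n_classes;
        int *encoded =
            imsls_f_unsupervised_nominal_filter(&n_classes, 4, z, 0);
        return 0;
    }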
Kohonen Self-organizing Map
A self-organizing map (SOM), also known as a Kohonen map or Kohonen SOM, is a technique for gathering high-dimensional data into clusters that are constrained to lie in a low-dimensional space, usually two dimensions. It is widely used for feature extraction and visualization of very high-dimensional data. The Kohonen SOM is equivalent to an artificial neural network having inputs linked to every node in the network. Creating a Kohonen map involves two steps: training and forecasting. Training builds the map using input examples, and forecasting classifies new inputs. The functions imsls_f_kohonenSOM_trainer and imsls_f_kohonenSOM_forecast perform these two steps.
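Independently of the library interface, the update at the heart of SOM training moves each node near the best-matching unit toward the current input; a minimal sketch:

    /* w is one node's weight vector of length dim, x the current input,
       alpha the learning rate, and h the neighborhood weight (1 at the
       winning node, decaying with grid distance from it). */
    void som_update(int dim, float w[], const float x[],
                    float alpha, float h)
    {
        for (int k = 0; k < dim; k++)
            w[k] += alpha * h * (x[k] - w[k]);
    }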
For more details, see the Description sections of imsls_f_kohonenSOM_trainer and imsls_f_kohonenSOM_forecast.
Support Vector Machines
Support vector machines (SVMs) are a class of learning algorithms that can be used for classification, regression, and distribution estimation. First, a classifier is trained using imsls_f_support_vector_trainer, specifying either a classification, distribution, or regression model type. Then imsls_f_support_vector_classification can be used for classification, distribution estimation, or regression on patterns with unknown classifications, using the trained model represented in an Imsls_f_svm_classifier data structure.
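A compressed sketch of the two calls (the argument orders shown are assumptions; see the Description section of imsls_f_support_vector_trainer for the exact signature):

    #include <imsls.h>

    int main(void)
    {
        /* Four training patterns, two attributes, two classes. */
        float classification[] = {0.0, 0.0, 1.0, 1.0};
        float x[] = {0.0, 0.1,
                     0.2, 0.0,
                     1.0, 0.9,
                     0.8, 1.0};

        /* Train a classification-type model (assumed argument order:
           n_patterns, n_classes, n_attributes, classification, x). */
        Imsls_f_svm_classifier *svm =
            imsls_f_support_vector_trainer(4, 2, 2, classification, x, 0);

        /* Classify one new pattern with the trained model. */
        float x_new[] = {0.9, 0.8};
        float *predicted =
            imsls_f_support_vector_classification(svm, 1, x_new, 0);
        return 0;
    }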
For a detailed overview, see Support Vector Machines – An Overview.