Data Mining Usage Notes

Data mining is a collection of statistical and analytical methods for extracting useful information from large databases. The problem of extracting information from large databases is common to government, industry, engineering, and the sciences.

Apriori

The Apriori algorithm is used for association rule discovery. Association rules are statements of the form “if X, then Y”, reported with some measure of confidence. The main application of association rule discovery is market basket analysis, in which X and Y are products or groups of products and each observation is an individual transaction, or “market basket.” The results help sellers learn the relationships between the products they sell, supporting better marketing decisions. Beyond market basket analysis, association rule discovery has been applied in text mining and bioinformatics. The function apriori implements the Apriori algorithm. The function aggrApriori performs the Apriori algorithm on subsets of transactions and aggregates the results.

For more details, see the Description section of apriori.
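
The two quantities behind any association rule are its support and its confidence. As an illustration of the arithmetic only (not the apriori API), the following sketch computes both for a single rule, representing each transaction as a bitmask over at most 32 items; the bitmask representation is an assumption made for brevity:

```c
#include <assert.h>
#include <math.h>

/* Illustration only: each transaction is a bitmask over at most 32
   items, and a rule "if X, then Y" is given by the item masks x, y. */

/* support(s): fraction of transactions containing every item in s */
static double support(const unsigned *trans, int n, unsigned s)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if ((trans[i] & s) == s)
            hits++;
    return (double)hits / n;
}

/* confidence(x => y) = support(x and y together) / support(x) */
static double confidence(const unsigned *trans, int n, unsigned x, unsigned y)
{
    double sx = support(trans, n, x);
    return sx > 0.0 ? support(trans, n, x | y) / sx : 0.0;
}
```

Apriori itself prunes the search over candidate item sets by noting that a set can be frequent only if all of its subsets are frequent; the measures above are what it reports for the rules that survive.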

Decision Trees

Decision trees are data mining methods for predicting a single response variable from multiple predictor variables. If the response variable is categorical or discrete, the data mining problem is a classification problem; if the response is continuous, the problem is a type of regression problem. Decision trees are generally applicable in both situations. The function decisionTree includes four of the most widely used decision tree algorithms: the C4.5 method, ALACART, CHAID, and QUEST. The function decisionTreePredict applies a decision tree to a new set of data.
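
Applying a fitted tree to a new case is a walk from the root to a leaf. The sketch below uses a hypothetical node layout invented for illustration; it is not the structure decisionTreePredict actually operates on:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout for a binary classification tree:
   internal nodes test one continuous predictor against a split
   value; leaves carry the predicted class.                     */
typedef struct Node {
    int    split_var;    /* index of predictor tested; -1 at a leaf */
    double split_value;  /* go left if x[split_var] <= split_value  */
    int    leaf_class;   /* predicted class when split_var == -1    */
    const struct Node *left, *right;
} Node;

static int tree_predict(const Node *t, const double *x)
{
    while (t->split_var >= 0)          /* descend until a leaf */
        t = (x[t->split_var] <= t->split_value) ? t->left : t->right;
    return t->leaf_class;
}
```

For a regression tree the leaves would instead carry a fitted mean response, but the traversal is identical.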

For a detailed overview, see Decision Trees – An Overview and the Description section of decisionTree.

Genetic Algorithms

The original genetic algorithm is generally attributed to John Holland and his students at the University of Michigan, who during the 1970s investigated the use of concepts from genetics and biology in optimizing a function. Since that original work, many variations of the algorithm have been developed by researchers working at the interface of genetics, computer science, and statistics to solve complex problems. These include traditional optimization and search problems in engineering, decision making, game playing, and pattern recognition.

The genetic algorithm operates on a population of individuals designed to represent the problem being solved. Each individual is rated according to a fitness function designed to measure how close that individual is to solving the problem. For optimization problems, the fitness function is usually constructed from the function being optimized. For other problems, the fitness function can be more complex, defined only by the problem being investigated. A chess program, for example, might use a fitness function that scores the quality of the board position represented by each individual.

The solution represented by each individual in a population is encoded into the individual’s chromosome. The fitness function calculates a fitness value for each individual from the information in that chromosome. An investor, for example, might search for the best set of trading rules for optimizing the return on an investment.

In this case, chromosomes would contain encoded representations of different variations of candidate trading rules. One binary bit might indicate whether a particular technical indicator was being used. Another part of the chromosome might be encoded to indicate how that indicator would be used to buy and sell investments. The fitness function would calculate a rate of return for each individual based upon actual historical data.

Several functions are available for building, cloning and copying chromosomes and individuals:

C Stat Library Function   Description
gaChromosome              Creates the structure for a chromosome.
gaCloneChromosome         Creates an exact duplicate of an existing chromosome.
gaCopyChromosome          Copies the information contained in one chromosome into another.
gaIndividual              Creates an individual using an existing chromosome.
gaCloneIndividual         Creates an exact duplicate of an existing individual.
gaCopyIndividual          Copies the information from one individual into another.

Solving any problem using a genetic algorithm always begins by creating a chromosome to represent the problem. Four data types can be represented in a chromosome: binary, nominal, integer, and real (continuous) attributes. Binary attributes are mapped directly into a chromosome as zeros and ones. A nominal attribute is represented as the integers 0, 1, …, k-1, where k is the maximum number of classes for that attribute. Integer and real attributes are mapped into a binary representation by dividing the range of the attribute into a finite number of subintervals; the range and number of intervals are supplied by the user when the chromosome is constructed. Either base-2 or Gray encoding can be used to encode integer and real attributes.
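
The mapping for real attributes can be sketched independently of the library: bin the value into one of n equal subintervals, then Gray-encode the interval index so that adjacent intervals differ in a single bit (the property that motivates Gray over base-2 encoding). This is a conceptual illustration, not the internal layout gaChromosome builds:

```c
#include <assert.h>

/* Map a real value in [lo, hi] to one of n equal subintervals. */
static unsigned interval_index(double x, double lo, double hi, unsigned n)
{
    unsigned k = (unsigned)((x - lo) / (hi - lo) * n);
    return k >= n ? n - 1 : k;      /* clamp x == hi into the last bin */
}

/* Gray code: adjacent integers differ in exactly one bit. */
static unsigned gray_encode(unsigned b) { return b ^ (b >> 1); }

static unsigned gray_decode(unsigned g)
{
    unsigned b = 0;
    for (; g; g >>= 1)              /* bit i of b = XOR of bits >= i of g */
        b ^= g;
    return b;
}
```

Under base-2 encoding, a small change in a real attribute can flip many bits at once (7 to 8 flips four), which makes mutation and crossover more disruptive; Gray encoding avoids that.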

By default, encoding and decoding of chromosomes is automatic. That is, each individual carries not only its chromosome but also the original phenotype values encoded in that chromosome. Before the fitness of an individual is evaluated by calling the user’s fitness function, the information in its chromosome is decoded into phenotype values. If this is too time consuming, automatic encoding and decoding can be disabled and performed within the fitness function instead. The functions gaEncode and gaDecode are provided to encode and decode the chromosomes of individuals when needed. The routine gaMutate is provided to allow users to create their own genetic algorithm instead of using geneticAlgorithm.

The memory allocated to a chromosome data structure can be released using the free function. However, the function gaFreeIndividual has been provided to release memory allocated to an individual.

The genetic algorithm implemented in geneticAlgorithm evolves an initial population of individuals through several generations, searching for the optimum individuals. The initial population can be created using one of several functions. The easiest approach is to create a population of randomly selected individuals using gaRandomPopulation. However, in some cases it might be better to initialize the population using an array of individuals selected based upon their fitness or diversity. The function gaPopulation can create a population data structure from an array of individuals.

In some cases it might be useful to restart a genetic algorithm search using a previous generation. The function gaClonePopulation can be used to create an exact duplicate of a population. The function gaCopyPopulation replaces the individuals in one population with those from another. Two populations can be merged using gaMergePopulation and individuals can be added to an existing population using gaGrowPopulation. Memory allocated to a population can be released using gaFreePopulation.

The actual search or optimization using an initial population is conducted by geneticAlgorithm. This function returns the fittest individual found during the search. Also available are convergence statistics, including generation statistics, and the final population. This information can be used to judge the quality of the solution and, if an additional search is warranted, the final population can serve as the initial population for that search.
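
The evolve loop described above can be sketched as a self-contained toy. The fitness function (“one-max”, counting one-bits), the population size, the tournament selection, one-point crossover, mutation rate, and elitism below are all arbitrary choices made for illustration; they are not geneticAlgorithm’s defaults or its API:

```c
#include <assert.h>

#define POP  20
#define BITS 16
#define GENS 50

static unsigned lcg_state = 12345u;             /* small deterministic PRNG */
static unsigned lcg(void) { return lcg_state = lcg_state * 1664525u + 1013904223u; }

/* Toy fitness: number of one-bits in the chromosome ("one-max"). */
static int fitness(unsigned c)
{
    int f = 0;
    for (int i = 0; i < BITS; i++)
        f += (c >> i) & 1u;
    return f;
}

/* Minimal generational GA: tournament selection, one-point
   crossover, bit-flip mutation, elitism. Returns best fitness. */
static int evolve_best(void)
{
    unsigned pop[POP], next[POP];
    for (int i = 0; i < POP; i++)
        pop[i] = lcg() & 0xFFFFu;               /* random initial population */

    for (int g = 0; g < GENS; g++) {
        int elite = 0;
        for (int i = 1; i < POP; i++)
            if (fitness(pop[i]) > fitness(pop[elite]))
                elite = i;
        next[0] = pop[elite];                   /* elitism: keep the fittest */

        for (int i = 1; i < POP; i++) {
            unsigned a = pop[lcg() % POP], b = pop[lcg() % POP];
            unsigned p1 = fitness(a) >= fitness(b) ? a : b;   /* tournament */
            a = pop[lcg() % POP]; b = pop[lcg() % POP];
            unsigned p2 = fitness(a) >= fitness(b) ? a : b;

            unsigned cut  = lcg() % BITS;                     /* crossover  */
            unsigned mask = (1u << cut) - 1u;
            unsigned child = (p1 & mask) | (p2 & ~mask);

            if (lcg() % 100 < 5)                              /* mutation   */
                child ^= 1u << (lcg() % BITS);
            next[i] = child & 0xFFFFu;
        }
        for (int i = 0; i < POP; i++)
            pop[i] = next[i];
    }

    int best = fitness(pop[0]);
    for (int i = 1; i < POP; i++)
        if (fitness(pop[i]) > best)
            best = fitness(pop[i]);
    return best;
}
```

Because the elite individual is copied into every generation, the best fitness never decreases, which mirrors the convergence statistics geneticAlgorithm reports per generation.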

For a detailed overview, see Genetic Algorithms – An Overview.

Naive Bayes

Naive Bayes is a classification algorithm. First, a classifier is trained using naiveBayesTrainer. Once this is done, naiveBayesClassification can be used to classify patterns with unknown classifications using the trained classifier, which is represented in an Imsls_f_naive_bayes data structure.

In addition, nbClassifierWrite can be used to store the data structure created by naiveBayesTrainer for future use. The function nbClassifierRead restores a Naive Bayes data structure from a file written using nbClassifierWrite.

Classification problems can also be solved using other algorithms such as discriminant analysis and neural networks. In general, these alternatives have smaller classification error rates, but they can be too slow for large classification problems. During training, naiveBayesTrainer uses the non-missing training data to estimate two-way correlations among the attributes; higher-order correlations are assumed to be zero. This can increase the classification error rate, but it significantly reduces the time needed to train the classifier.
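
The computation that makes Naive Bayes fast is simple: score each class as its prior times a product of per-attribute conditional probabilities, then take the argmax. The sketch below does this for nominal attributes with Laplace-smoothed counts; the flat count-array layout is an assumption for illustration, not the Imsls_f_naive_bayes structure:

```c
#include <assert.h>

/* score(c) = P(c) * prod_j P(x_j | c), attributes assumed
   conditionally independent given the class. Counts are
   Laplace-smoothed; returns the argmax class.             */
static int nb_classify(int n_classes, int n_attrs, int n_cats,
                       const int *class_count,         /* [n_classes] */
                       const int *cond_count, /* [n_classes][n_attrs][n_cats] */
                       const int *x)                   /* [n_attrs]   */
{
    int total = 0, best = 0;
    double best_score = -1.0;
    for (int c = 0; c < n_classes; c++)
        total += class_count[c];

    for (int c = 0; c < n_classes; c++) {
        double score = (double)class_count[c] / total;      /* prior */
        for (int j = 0; j < n_attrs; j++) {
            int cnt = cond_count[(c * n_attrs + j) * n_cats + x[j]];
            score *= (cnt + 1.0) / (class_count[c] + n_cats);  /* smoothed */
        }
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    return best;
}
```

A missing attribute value can simply be skipped in the product, which is why missing data are easy to accommodate in this framework.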

In addition, among these algorithms Naive Bayes is the only one that can handle data with missing values. Algorithms such as discriminant analysis do not allow missing values in the data, which is a significant limitation when applying those techniques to larger databases.

Memory allocated to the Naive Bayes data structure created by naiveBayesTrainer or nbClassifierRead can be released using nbClassifierFree.

For a detailed overview, see Naive Bayes – An Overview.

Neural Networks

Neural networks can be used for forecasting and classification. A neural network data structure is first created using mlffNetworkInit and mlffNetwork. Although forecasting and classification neural networks are initialized and represented using the same data structure, separate functions are provided for forecasting and classification in order to make them easier to use and to reduce network training times.

Once the network architecture is created, the network can be trained using mlffNetworkTrainer for forecasting problems and mlffClassificationTrainer for classification problems. By default these algorithms initialize the network weights, but weight initialization can be controlled using mlffInitializeWeights.

Once a network is trained, either mlffNetworkForecast or mlffPatternClassification is used to produce forecasts or to classify unknown patterns.

In many cases, network training will be completed on one computer while forecasting or classification is done on another. The function mlffNetworkWrite stores a trained network to a file, which can then be restored using mlffNetworkRead.

Memory allocated to a neural network data structure can be released using mlffNetworkFree.

For a detailed overview, see Neural Networks – An Overview.

Data Filtering

The first step in mining data is to filter it from its raw form into the formats required by sophisticated analytical algorithms.

Data fall into two major categories: continuous and categorical. Many algorithms, such as neural network forecasting, perform better if continuous data are mapped into a common scale. The function scaleFilter implements several techniques for automatically scaling continuous data, including several variations of z-score scaling. If the continuous data represent a time series, timeSeriesFilter and timeSeriesClassFilter can be used to create a matrix of lagged values required as input to forecasting neural networks.

Categorical data must also be mapped into a corresponding numerical representation before they can be used in solving forecasting and classification problems. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color. The function unsupervisedOrdinalFilter encodes and decodes ordinal data into the range [0, 1] using cumulative percentages. The function unsupervisedNominalFilter uses binary encoding to map nominal data into a matrix of zeros and ones.
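
Binary encoding of a nominal attribute is the familiar “one-hot” scheme: category k becomes a row of zeros with a single 1 in column k. A sketch of the idea (not unsupervisedNominalFilter’s actual signature):

```c
#include <assert.h>

/* One-hot encode a nominal attribute with n_classes categories:
   x[i] in {0, ..., n_classes-1} becomes row i of the n x n_classes
   0/1 matrix z, stored row-major.                                  */
static void encode_nominal(const int *x, int n, int n_classes, int *z)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n_classes; k++)
            z[i * n_classes + k] = (x[i] == k) ? 1 : 0;
}
```

Ordinal encoding differs in that the natural ordering is preserved, e.g. by mapping each category to a cumulative percentage in [0, 1] rather than to an independent indicator column.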

Kohonen Self-organizing Map

A self-organizing map (SOM), also known as a Kohonen map or Kohonen SOM, is a technique for gathering high-dimensional data into clusters that are constrained to lie in a low-dimensional space, usually two dimensions. It is widely used for feature extraction and visualization of very high-dimensional data. The Kohonen SOM is equivalent to an artificial neural network whose inputs are linked to every node in the network. The creation of a Kohonen map involves two steps: training and forecasting. Training builds the map using input examples, and forecasting classifies new input. The functions kohonenSOMTrainer and kohonenSOMForecast perform these two steps.
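
One training step of a SOM can be sketched directly: find the grid node whose code vector is closest to the input (the best-matching unit), then pull that node and its grid neighbours toward the input. The square neighbourhood and fixed learning rate below are simplifying assumptions, not kohonenSOMTrainer’s parameters:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* One SOM update on a rows x cols grid of d-dimensional code
   vectors w (row-major): find the best-matching unit for input x,
   then move it and its neighbours within `radius` grid cells a
   fraction alpha of the way toward x.                             */
static void som_step(double *w, int rows, int cols, int d,
                     const double *x, double alpha, int radius)
{
    int br = 0, bc = 0;
    double bdist = -1.0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            double dist = 0.0;
            for (int k = 0; k < d; k++) {
                double diff = w[(r * cols + c) * d + k] - x[k];
                dist += diff * diff;
            }
            if (bdist < 0.0 || dist < bdist) {
                bdist = dist; br = r; bc = c;   /* best-matching unit */
            }
        }
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            if (abs(r - br) <= radius && abs(c - bc) <= radius)
                for (int k = 0; k < d; k++) {
                    double *wk = &w[(r * cols + c) * d + k];
                    *wk += alpha * (x[k] - *wk);
                }
}
```

In practice both alpha and the neighbourhood radius shrink over the course of training, so the map first organizes globally and then fine-tunes locally; forecasting is just the best-matching-unit search without the update.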

For more details, see the Description sections of kohonenSOMTrainer and kohonenSOMForecast.

Support Vector Machines

Support vector machines (SVMs) are a class of learning algorithms that can be used for classification, regression, and distribution estimation. First, a classifier is trained using supportVectorTrainer, specifying a classification, regression, or distribution estimation model. Then supportVectorClassification can be used to perform classification, regression, or distribution estimation on patterns with unknown classifications using the trained classifier model, which is represented in an Imsls_f_svm_classifier data structure.

For a detailed overview, see Support Vector Machines – An Overview.