Package com.imsl.datamining
Data mining refers to the process of using statistical and analytical methods to extract useful information from large databases. The problem of extracting information from data is prevalent in government, business, education, industry, engineering, medicine, and the sciences. The methods and algorithms used in data mining have been invented and developed, with considerable overlap, in machine learning, statistical learning, and statistics. While there are theoretical and philosophical differences between these fields of study, the nuances are not important from a practical standpoint. Whether a method comes from statistics or from machine learning, the goal is the same: learning from data.
Data Types
In general, data fall into two major categories: continuous and categorical. A continuous variable can assume any real number within a certain range. Examples of continuous variables include temperature, height, weight, circumference, body mass index, rate of return, etc. Count data are often treated as continuous variables in data mining algorithms. Even though they only assume discrete values, their set of possible values is infinite. Examples of count data include the number of accidents per year, the number of units sold, the number of insurance claims, and so on.
Categorical variables take on values from a finite list of categories. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color.
Sometimes continuous variables are binned into a finite set. Two examples are income level: {less than 25K, between 25K and 50K, over 50K}, and body weight: {underweight, normal, overweight, obese}. Binned continuous variables are often treated as ordinal-categorical type variables for modeling purposes.
Other types of data deserve special mention: transaction (or invoice) data and text data. A transaction (think of a grocery store receipt) has attributes such as date and time, total amount, the set of products purchased, their quantities and prices, and possibly attributes of the individual customer making the purchase. Text data are discrete and unordered, but the association of words in sentences, which forms context, makes text an important data subtype for data mining applications.
Data Mining Problem Types
The primary types of data mining problems are pattern recognition and prediction. Prediction includes the subtypes classification, regression, and forecasting.
Pattern recognition algorithms are designed to detect patterns in large, high-dimensional, and complex data sets. Pattern recognition problems fall under two broad categories: supervised and unsupervised. In supervised problems, the number of groups or categories is known and each example (observation) in the training data has a known outcome (or response). The set of attributes or predictor variables measured on each example may relate to the response variable, and may then be used to predict the outcomes of future or new examples. Supervised learning algorithms try to detect the relationship between the set of attributes and the outcome of the response variable.
Prediction problems are supervised problems concerned with predicting the outcome of a variable using known attributes as inputs to a statistical model. In prediction problems, there is a single variable of interest, called the dependent variable, the response variable, or, when it is categorical, sometimes the label. The set of attributes consists of other variables that may have some relationship with the variable of interest. These variables are variously referred to as independent variables, explanatory variables, predictors, attributes, or features.
Classification is a prediction problem in which the response is categorical; regression is a prediction problem in which the response is continuous; and forecasting is a prediction problem in which the response variable and predictor variables are indexed by time. Most algorithms in this package have methods for either classification or regression. For time series, neural networks and support vector machines have both been used successfully.
In unsupervised problems, there is no known outcome or response. Each example
in the training data is a vector of measurements on a number (often a very
large number) of variables. The problem is to detect any patterns or
structure that might exist in the high-dimensional space spanned by the
variables. With a smaller set of natural groupings (clusters) or structures
in lower dimensions, stronger inferences can be made about the population or
the distribution of the variables. In this package, the Kohonen
self-organizing map (KohonenSOM) is an example of
an unsupervised learning algorithm. Many of the algorithms described in the
Multivariate Analysis package are unsupervised learning algorithms.
For any data mining model to be useful, it must be given data it can learn from. These data contain examples with known values of the predictor variables and known values of the dependent variable, and are called training data. Once the model is trained or fitted to the training data, new examples (ones not used to train the model) are run through the model to obtain an estimated (predicted) value of the dependent variable. If the true value of the dependent variable is known, the predicted value is compared to the known value and a prediction error can be recorded. New examples with known (or realized) values of the dependent variable comprise what is often called a test data set, which is used to evaluate how accurately the model predicts the dependent variable. For the purpose of such model assessment, a complete data set can be randomly partitioned into a training set and a test set. See the summary of cross-validation below.
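To make the partitioning concrete, the following minimal plain-Java sketch (independent of the JMSL API; the class and variable names are illustrative only) randomly splits the row indices of a data set into a training set and a test set:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: randomly partitions row indices of a data set
// into a training set and a test set (here, 70% / 30%).
public class TrainTestSplit {
    public static void main(String[] args) {
        int n = 10;                  // number of examples
        double trainFraction = 0.7;  // fraction used for training

        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(42)); // reproducible shuffle

        int nTrain = (int) Math.round(trainFraction * n);
        List<Integer> train = indices.subList(0, nTrain);
        List<Integer> test = indices.subList(nTrain, n);

        System.out.println("training rows: " + train);
        System.out.println("test rows:     " + test);
    }
}
```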
Data Filtering and Scaling
Regardless of the type or the source of the data, it must be filtered from its raw form into formats required by data mining algorithms. Categorical data must be mapped into a corresponding numerical representation. Some algorithms treat categorical data and continuous data differently while other algorithms interpret all data as continuous. In algorithms that interpret all data as continuous, categorical variables first must be transformed. Below is an example of transforming a categorical variable.
Dummy variables are used to indicate the distinct values of categorical variables without implying order. To illustrate, suppose in a survey of consumer preferences, the categorical variable x indicates the type of chocolate, {White, Milk, Dark}, and these values are encoded to \(\{1, 2, 3\}\) in the data. If x is used directly, the algorithm presumes a scale \((1 \lt 2 \lt 3)\) that is not true or meaningful and will lead to invalid inferences about the relationships between the variables. The reason for creating dummy variables is to represent a categorical variable with a set of indicators that puts each distinct value on the same numerical level. The table below shows the binary encoding of x into three dummy variables.
| Chocolate | x-value | dummy1 | dummy2 | dummy3 |
|---|---|---|---|---|
| White | \(1\) | \(1\) | \(0\) | \(0\) |
| Milk | \(2\) | \(0\) | \(1\) | \(0\) |
| Dark | \(3\) | \(0\) | \(0\) | \(1\) |
The class UnsupervisedNominalFilter uses
binary encoding to map nominal data into a matrix of zeros and ones. The
resulting columns are often referred to as dummy variables.
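The encoding itself is straightforward; the following plain-Java sketch illustrates the concept (it is not the UnsupervisedNominalFilter API):

```java
// Minimal sketch of binary (one-hot) encoding for a nominal variable.
// Values are assumed to be encoded 1..k, as in the chocolate example.
public class DummyEncoding {
    public static void main(String[] args) {
        int[] x = {1, 3, 2, 1};  // White, Dark, Milk, White
        int k = 3;               // number of distinct categories

        int[][] dummies = new int[x.length][k];
        for (int i = 0; i < x.length; i++) {
            dummies[i][x[i] - 1] = 1;  // set the column for this category
        }
        for (int[] row : dummies) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```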
If the categorical variable is ordinal, so that order between the levels is
meaningful, the class
UnsupervisedOrdinalFilter encodes and
decodes ordinal data into the range [0, 1] using cumulative percentages.
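As an illustration of the general technique (the exact transformation used by UnsupervisedOrdinalFilter may differ in detail), the following sketch encodes each ordinal level by the cumulative proportion of observations at or below that level:

```java
// Sketch: encode ordinal levels 1..k into [0, 1] using the cumulative
// proportion of observations at or below each level.
public class OrdinalEncoding {
    public static void main(String[] args) {
        int[] x = {1, 2, 2, 3, 3, 3};  // ordinal data with levels 1..3
        int k = 3;

        int[] counts = new int[k];
        for (int v : x) counts[v - 1]++;

        double[] cumulative = new double[k];
        double running = 0.0;
        for (int j = 0; j < k; j++) {
            running += counts[j];
            cumulative[j] = running / x.length;  // cumulative percentage
        }
        for (int v : x) {
            System.out.printf("%d -> %.3f%n", v, cumulative[v - 1]);
        }
    }
}
```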
Continuous data may need to be scaled. Many algorithms, such as neural
networks and support vector machines, perform better if continuous data are
mapped into a common scale. Class
ScaleFilter implements several techniques
for automatically scaling continuous data, including several variations of
z-score scaling. If the continuous data represent a time series,
TimeSeriesFilter and
TimeSeriesClassFilter can be used to
create a matrix of lagged values required as input to neural networks. More
details on these methods can be found in the
com.imsl.datamining.neural package.
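As an example of such scaling, standard z-scores map each value to its signed distance from the sample mean in units of the sample standard deviation. A plain-Java sketch of the computation (not the ScaleFilter API):

```java
// Sketch: z-score scaling of one continuous column, z = (x - mean) / s.
public class ZScore {
    public static void main(String[] args) {
        double[] x = {12.0, 15.0, 9.0, 20.0, 14.0};

        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;

        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double s = Math.sqrt(ss / (x.length - 1));  // sample std deviation

        for (double v : x) {
            System.out.printf("%.1f -> %+.3f%n", v, (v - mean) / s);
        }
    }
}
```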
Apriori Market Basket Analysis
Market basket analysis is an unsupervised data mining problem for detecting strong associations between products or items in transactional data. The problem arose especially with the advent of digital scanners and scanner data in supermarkets, hence the name "market basket". The class Apriori performs the Apriori algorithm for
finding strong association rules. Association rules are statements of the
form, "if X, then Y", given with some measure of confidence. For example, in
a supermarket X and Y are different products such as bread and butter.
Learning which products are strongly associated helps managers make more
profitable marketing decisions, such as product placement or sales
promotions. There are other applications for association rule discovery, such
as in text mining and bioinformatics.
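The two basic quantities behind a rule "if X, then Y" are its support (how often X and Y occur together in the transactions) and its confidence (how often Y occurs among the transactions containing X). A plain-Java sketch of these computations, independent of the Apriori class:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: support and confidence for the rule {bread} -> {butter}.
public class RuleStats {
    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "butter", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")),
            new HashSet<>(Arrays.asList("bread", "jam")),
            new HashSet<>(Arrays.asList("milk", "eggs")));

        int countX = 0, countXY = 0;
        for (Set<String> t : transactions) {
            if (t.contains("bread")) {
                countX++;
                if (t.contains("butter")) countXY++;
            }
        }
        double support = (double) countXY / transactions.size();
        double confidence = (double) countXY / countX;
        System.out.printf("support = %.2f, confidence = %.2f%n",
                          support, confidence);
    }
}
```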
PrefixSpan Sequential Pattern Mining
Sequential pattern mining (SPM) is an unsupervised data mining problem that involves finding sequential patterns within a set of sequences. Sequential patterns are frequently occurring subsequences of items. Such items could be products, genes, behaviors, symptoms of disease, virtually any observable, discrete event. Applications for SPM include problems in medicine, bioinformatics, economics, psychiatry, retail and e-commerce, and many others. The PrefixSpan algorithm for SPM is a depth-first search that "discovers" or grows sequential patterns by way of projection and recursion.
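The basic quantity in SPM is the support of a candidate pattern, i.e., the number of sequences that contain it as an ordered (not necessarily contiguous) subsequence. A plain-Java sketch of that count (not the PrefixSpan API, which additionally grows patterns efficiently by prefix projection):

```java
// Sketch: count how many sequences contain the pattern as an ordered
// (not necessarily contiguous) subsequence.
public class SequenceSupport {
    static boolean contains(String[] sequence, String[] pattern) {
        int j = 0;
        for (String item : sequence) {
            if (j < pattern.length && item.equals(pattern[j])) j++;
        }
        return j == pattern.length;
    }

    public static void main(String[] args) {
        String[][] database = {
            {"a", "b", "c", "d"},
            {"a", "c", "b"},
            {"b", "a", "b", "c"}
        };
        String[] pattern = {"a", "b", "c"};

        int support = 0;
        for (String[] seq : database) {
            if (contains(seq, pattern)) support++;
        }
        System.out.println("support of <a b c> = " + support);  // 2
    }
}
```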
Kohonen Self-organizing Map
A self-organizing map (SOM), also known as a Kohonen map or Kohonen SOM, is a
technique for gathering high-dimensional data into clusters that are
constrained to lie in a low-dimensional space, usually two dimensions. It is
a widely used technique for feature extraction and visualization of very
high-dimensional data. The Kohonen SOM is equivalent to a neural
network having inputs linked to every node in the network. The classes
KohonenSOM and
KohonenSOMTrainer provide the methods for
creating, training, and forecasting with a Kohonen map. Training builds the
map using input examples, and forecasting classifies new input.
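To make the training idea concrete: each step finds the node whose weight vector is closest to the input (the best matching unit, or BMU) and pulls that node's weights toward the input; a full SOM also updates the BMU's neighbors using a decaying neighborhood function. A minimal plain-Java sketch of a single step (not the KohonenSOM API):

```java
// Sketch: one SOM training step on a tiny 1-D grid of nodes.
public class SomStep {
    public static void main(String[] args) {
        double[][] weights = {{0.1, 0.9}, {0.5, 0.5}, {0.9, 0.1}};
        double[] input = {0.8, 0.2};
        double learningRate = 0.3;

        // Find the best matching unit (smallest squared distance).
        int bmu = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < weights.length; i++) {
            double d = 0.0;
            for (int j = 0; j < input.length; j++) {
                double diff = input[j] - weights[i][j];
                d += diff * diff;
            }
            if (d < best) { best = d; bmu = i; }
        }

        // Move the BMU's weights toward the input.
        for (int j = 0; j < input.length; j++) {
            weights[bmu][j] += learningRate * (input[j] - weights[bmu][j]);
        }
        System.out.println("BMU = " + bmu + ", new weights = "
                + java.util.Arrays.toString(weights[bmu]));
    }
}
```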
Naive Bayes
Naive Bayes is an algorithm for supervised learning that is built upon Bayes'
rule for conditional probability. The class
NaiveBayesClassifier can be used to train the
classifier using continuous or categorical predictors, or a mixture of both,
to classify a categorical target variable.
Classification problems can be solved using other algorithms such as
discriminant analysis and neural networks. In general, these alternatives
have smaller classification error rates, but they are too slow for large
classification problems. During training,
NaiveBayesClassifier uses the non-missing
training data to estimate two-way correlations among the attributes. Higher
order correlations are assumed to be zero. This can increase the
classification error rate, but it significantly reduces the time needed to
train the classifier.
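At prediction time, the naive Bayes score of each class is the class prior multiplied by the per-attribute conditional probabilities, and the class with the largest score is chosen. A minimal plain-Java sketch with a single categorical attribute and made-up probabilities (not the NaiveBayesClassifier API):

```java
// Sketch: naive Bayes with a single categorical attribute.
// P(class | x) is proportional to P(class) * P(x | class).
public class TinyNaiveBayes {
    public static void main(String[] args) {
        String[] classes = {"spam", "ham"};
        double[] prior = {0.4, 0.6};           // P(class)
        // P(x = "offer" | class), as estimated from training counts
        double[] likelihood = {0.30, 0.05};

        int best = 0;
        double bestScore = -1.0;
        for (int c = 0; c < classes.length; c++) {
            double score = prior[c] * likelihood[c];
            System.out.printf("%s: %.3f%n", classes[c], score);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        System.out.println("predicted class: " + classes[best]);
    }
}
```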
Neural Networks
An artificial neural network, or neural network, is a flexible modeling
framework that can be used in many of the applications and problem areas in
data mining. Using terms inspired by the biological brain, the elements of a
neural network are nodes, layers, and activation functions. There are many
options for setting up a network. Once the architecture of the network is
specified, it can be trained on the training data and evaluated on test data,
similar to other supervised learning algorithms. For more details, see the
com.imsl.datamining.neural package.
Predictive Models
Class PredictiveModel is the abstract base class
for predictive models like decision trees. It contains methods and class members
common to different predictive models in regression or classification
problems. Users can leverage the abstract class and its methods to create
customized predictive models. JMSL includes two major packages,
com.imsl.datamining.decisionTree and
com.imsl.datamining.supportvectormachine, that extend
PredictiveModel.
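The following plain-Java sketch illustrates the abstract-base-class pattern with hypothetical names (the actual PredictiveModel API differs): the base class fixes the train/predict contract, and subclasses supply the algorithm.

```java
// Sketch of the abstract-base-class pattern (hypothetical names, not the
// actual PredictiveModel API).
abstract class SimplePredictiveModel {
    abstract void fit(double[][] x, double[] y);  // train on data
    abstract double[] predict(double[][] x);      // score new rows
}

// A deliberately trivial model that always predicts the training mean.
class MeanModel extends SimplePredictiveModel {
    private double mean;

    @Override
    void fit(double[][] x, double[] y) {
        double sum = 0.0;
        for (double v : y) sum += v;
        mean = sum / y.length;
    }

    @Override
    double[] predict(double[][] x) {
        double[] out = new double[x.length];
        java.util.Arrays.fill(out, mean);
        return out;
    }

    public static void main(String[] args) {
        SimplePredictiveModel model = new MeanModel();
        model.fit(new double[][] {{1}, {2}, {3}}, new double[] {2, 4, 6});
        System.out.println(model.predict(new double[][] {{5}})[0]); // 4.0
    }
}
```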
Decision trees are predictive models for classification or regression. The
com.imsl.datamining.decisionTree package includes four specific
algorithms, ALACART,
C45,
CHAID, and
QUEST, for generating a decision
tree. For more details and examples, see the
com.imsl.datamining.decisionTree package.
Support Vector Machines (SVMs) are a widely used machine learning method for
regression, classification and other learning tasks. More information on the
available SVMs and kernels can be found in the
com.imsl.datamining.supportvectormachine package.
Cross-Validation
Cross-validation is an important resampling method that can be used for model
assessment or model selection. Class
CrossValidation performs
k-fold cross-validation on predictive models, like decision trees or
support vector machines. In k-fold cross-validation, the set of
observations is randomly split into k disjoint folds of approximately
equal size. The first fold is then treated as a test set, and the model
trained on the remaining k-1 folds. This procedure is repeated k
times, with each of the folds serving once as a test set. Applying the fitted
models to their test sets results in k estimates for the test error.
The cross-validated error is computed by averaging these values. For
classification problems, stratified cross-validation can be performed by
setting a flag via the
CrossValidation.setStratifiedCrossValidation(boolean)
method.
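The fold assignment itself can be sketched in plain Java (illustrative only, not the CrossValidation API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: assign each observation to one of k folds at random; fold f
// serves once as the test set while the rest form the training set.
public class KFold {
    public static void main(String[] args) {
        int n = 10, k = 3;

        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(7));

        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int pos = 0; pos < n; pos++) {
                // every k-th shuffled index lands in this fold
                if (pos % k == fold) test.add(indices.get(pos));
                else train.add(indices.get(pos));
            }
            System.out.println("fold " + fold + ": test=" + test);
            // train the model on 'train', evaluate it on 'test',
            // then average the k test errors afterwards
        }
    }
}
```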
Ensemble Methods
An ensemble method involves fitting a collection of predictive models and combining their collective outputs. The approach helps reduce variability and overfitting and improves predictive accuracy. In particular, the predictions of decision trees have been shown to improve dramatically when the trees are used in ensembles.
Bootstrap Aggregation
Bootstrap aggregation (bagging) is a statistical technique designed to
improve the accuracy of predictive models by reducing variability. Given a
specific predictive model and a single training data set of size N,
class BootstrapAggregation performs bagging by
taking repeated bootstrap samples of size N from the training data
set. The predictive model is then trained on each bootstrap sample
separately, and predictions are generated. The predictions are finally
combined into a single value by averaging (for regression problems) or
majority vote (for classification problems).
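Both steps, drawing a bootstrap sample and combining votes, are simple to sketch in plain Java (illustrative only, not the BootstrapAggregation API):

```java
import java.util.Random;

// Sketch: draw a bootstrap sample (size N, with replacement) and
// combine an ensemble's classification votes by majority.
public class BaggingSketch {
    public static void main(String[] args) {
        Random rng = new Random(1);
        int n = 8;

        // Bootstrap sample: row indices drawn with replacement.
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) sample[i] = rng.nextInt(n);
        System.out.println(java.util.Arrays.toString(sample));

        // Majority vote over the ensemble's predicted class labels
        // (0/1) for one test example.
        int[] votes = {1, 0, 1, 1, 0};
        int ones = 0;
        for (int v : votes) ones += v;
        int prediction = (2 * ones > votes.length) ? 1 : 0;
        System.out.println("majority vote: " + prediction);
    }
}
```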
Gradient Boosting
Like bagging, gradient boosting is an approach to iteratively improving
the predictions of a predictive model, specifically a decision tree. Class
GradientBoosting implements a special form of
gradient boosting, the stochastic gradient tree boosting algorithm of
Friedman (1999). Class GradientBoosting can be
applied to regression and classification problems. For classification
problems, the binomial or multinomial deviance loss function must be used.
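For squared-error regression, the boosting idea reduces to repeatedly fitting a base learner to the current residuals and adding a damped (shrunken) version of its fit to the running model. A plain-Java sketch with a deliberately trivial constant base learner (not the GradientBoosting API):

```java
// Sketch: gradient boosting for squared-error loss with a trivial base
// learner (the residual mean) and a shrinkage (learning rate) factor.
public class BoostingSketch {
    public static void main(String[] args) {
        double[] y = {3.0, 5.0, 7.0, 9.0};
        double[] pred = new double[y.length];  // model starts at 0
        double shrinkage = 0.5;

        for (int m = 0; m < 10; m++) {
            // Residuals are the negative gradient of squared error.
            double meanResidual = 0.0;
            for (int i = 0; i < y.length; i++) {
                meanResidual += y[i] - pred[i];
            }
            meanResidual /= y.length;

            // Add the damped base-learner fit to the running model.
            for (int i = 0; i < pred.length; i++) {
                pred[i] += shrinkage * meanResidual;
            }
        }
        System.out.println(java.util.Arrays.toString(pred));
        // predictions converge toward the mean of y (6.0)
    }
}
```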
Random Forest
The class RandomTrees implements another
ensemble method, the random forest (Breiman, 2001). A
random forest is a collection of decision trees on bootstrap samples. In
addition, the set of predictor variables is randomized before each branching
or splitting decision within the decision tree algorithm. This extra
randomization reduces correlation among the different trees in the ensemble.
The class RandomTrees is in the
com.imsl.datamining.decisionTree package.
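The predictor randomization can be sketched in a few lines of plain Java (illustrative only, not the RandomTrees API); a common default for classification considers about the square root of the number of predictors at each split:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: before each split, a random forest considers only a random
// subset of the p predictors (commonly about sqrt(p) for classification).
public class FeatureSubset {
    public static void main(String[] args) {
        int p = 10;
        int mtry = (int) Math.max(1, Math.round(Math.sqrt(p)));

        List<Integer> predictors = new ArrayList<>();
        for (int j = 0; j < p; j++) predictors.add(j);
        Collections.shuffle(predictors, new Random(3));

        List<Integer> candidates = predictors.subList(0, mtry);
        System.out.println("split candidates: " + candidates);
    }
}
```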
Class Summary

| Class | Description |
|---|---|
| Apriori | Performs the Apriori algorithm for association rule discovery. |
| AssociationRule | Contains association rules discovered by the Apriori algorithm. |
| BootstrapAggregation | Performs bootstrap aggregation to generate predictions using predictive models. |
| CrossValidation | Performs V-fold cross-validation for predictive models. |
| GradientBoosting | Performs stochastic gradient boosting for a single response variable and multiple predictor variables. |
| GradientBoosting.LossFunctionType | The loss function type as specified by the error measure. |
| | Predicts a data set using a trained gradient boosting model. |
| Itemsets | Object containing a set of frequent items and the number of transactions examined to obtain the frequent item set. |
| KohonenSOM | A Kohonen self-organizing map. |
| KohonenSOMTrainer | Trains a Kohonen network. |
| LogisticRegression | Performs binomial or multinomial logistic regression. |
| | Predicts a data set using a previously trained logistic regression model object. |
| NaiveBayesClassifier | Trains a naive Bayes classifier. |
| PredictiveModel | Specifies a predictive model. |
| PredictiveModel.CloneNotSupportedException | Wraps the java.lang.CloneNotSupportedException to indicate that the clone method in class Object has been called to clone an object, but that the object's class does not implement the Cloneable interface. |
| PredictiveModel.PredictiveModelException | An exception class intended to be the parent of all nested Exception classes where the enclosing class extends PredictiveModel. |
| PredictiveModel.StateChangeException | Exception thrown when an input parameter has changed that might affect the model estimates or predictions. |
| PredictiveModel.SumOfProbabilitiesNotOneException | Exception thrown when the sum of probabilities is not approximately one. |
| PredictiveModel.VariableType | Enumerates different variable types. |
| PrefixSpan | Performs the PrefixSpan algorithm for sequential pattern mining. |
| | Defines a sequence database for use with the PrefixSpan algorithm. |