Class | Description |
---|---|
Apriori | Performs the Apriori algorithm for association rule discovery. |
AssociationRule | Contains association rules discovered by the Apriori algorithm. |
BootstrapAggregation | Performs bootstrap aggregation to generate predictions using predictive models. |
CrossValidation | Performs V-Fold cross-validation for predictive models. |
GradientBoosting | Performs stochastic gradient boosting for a single response variable and multiple predictor variables. |
Itemsets | Object containing a set of frequent items and the number of transactions examined to obtain the frequent item set. |
KohonenSOM | A Kohonen self-organizing map. |
KohonenSOMTrainer | Trains a Kohonen network. |
NaiveBayesClassifier | Trains a naive Bayes classifier. |
PredictiveModel | Specifies a predictive model. |
Enum | Description |
---|---|
GradientBoosting.LossFunctionType | The loss function type as specified by the error measure. |
PredictiveModel.VariableType | Enumerates different variable types. |
Exception | Description |
---|---|
PredictiveModel.CloneNotSupportedException | Wraps the java.lang.CloneNotSupportedException to indicate that the clone method in class Object has been called to clone an object, but that the object's class does not implement the Cloneable interface. |
PredictiveModel.PredictiveModelException | An exception class intended to be the parent of all nested Exception classes where the enclosing class extends PredictiveModel. |
PredictiveModel.StateChangeException | Exception thrown when an input parameter has changed that might affect the model estimates or predictions. |
PredictiveModel.SumOfProbabilitiesNotOneException | Exception thrown when the sum of probabilities is not approximately one. |
Data mining refers to the process of using statistical and analytical methods to extract useful information from large databases. The problem of extracting information from data is prevalent in government, business, education, industry, engineering, medicine, and the sciences. The methods and algorithms used in data mining have been invented and developed, with considerable overlap, in machine learning, statistical learning, and statistics. While there are theoretical and some philosophical differences between these fields of study, the nuances are not important from a practical standpoint. Whether a method comes from statistics or from machine learning, the goal is the same: learning from data.
In general, data fall into two major categories: continuous and categorical. A continuous variable can assume any real number within a certain range. Examples of continuous variables include temperature, height, weight, circumference, body mass index, rate of return, etc. Count data are often treated as continuous variables in data mining algorithms. Even though they only assume discrete values, their set of possible values is infinite. Examples of count data include the number of accidents per year, the number of units sold, the number of insurance claims, and so on.
Categorical variables take on values from a finite list of categories. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color.
Sometimes continuous variables are binned into a finite set. Two examples are income level: {less than 25K, between 25K and 50K, over 50K}, and body weight: {underweight, normal, overweight, obese}. Binned continuous variables are often treated as ordinal-categorical type variables for modeling purposes.
Other types of data deserve a special mention: transaction or invoice data and text data. A transaction (think of a grocery store receipt) has attributes such as date and time, total amount, the set of products that were purchased, their quantities and prices, and possibly attributes of the individual customer making the purchases. Text data is discrete and not ordered, but the association of text or words in sentences, forming context, makes text data an important subtype for data mining applications.
The primary types of data mining problems are pattern recognition and prediction. Prediction includes the subtypes classification, regression, and forecasting.
Pattern recognition algorithms are designed to detect patterns in large, high-dimensional, and complex data sets. Pattern recognition problems fall under two broad categories: supervised and unsupervised. In supervised problems, the number of groups or categories is known and each example (observation) in the training data has a known outcome (or response). The set of attributes or predictor variables measured on each example may relate to the response variable, and may then be used to predict the outcomes of future or new examples. Supervised learning algorithms try to detect the relationship between the set of attributes and the outcome of the response variable.
Prediction problems are supervised problems concerned with predicting an outcome of a variable using known attributes as inputs into a statistical model. In prediction problems, there is a single variable of interest called a dependent variable, or a response variable, or sometimes a label when the variable is categorical. The set of attributes consists of other variables that may have some relationship with the variable of interest. These variables are variously referred to as independent, explanatory, predictor variables, attributes, or features.
Classification is a prediction problem in which the response is categorical; regression is a prediction problem in which the response is continuous; and forecasting is a prediction problem in which the response variable and predictor variables are indexed by time. Most algorithms in this package have methods for either classification or regression. For time series, neural networks and support vector machines have both been used successfully.
In unsupervised problems, there is no known outcome or response. Each example
in the training data is a vector of measurements on a number (often a very
large number) of variables. The problem is to detect any patterns or
structure that might exist in the high-dimensional space spanned by the
variables. With a smaller set of natural groupings (clusters) or structures
in lower dimensions, stronger inferences can be made about the population or
the distribution of the variables. In this package, the Kohonen
self-organizing map (KohonenSOM) is an example of an unsupervised learning
algorithm. Many of the algorithms described in the
Multivariate Analysis package are unsupervised learning algorithms.
For any data mining model to be useful, it must be given data it can learn from. This data, called training data, contains examples with known values of the predictor variables and known values of the dependent variable. Once the model is trained or fitted to the training data, new examples (ones not used to train the model) are run through the model to obtain an estimated (predicted) value of the dependent variable. If the true value of the dependent variable is known, the predicted value is compared to the known value and a prediction error can be recorded. New examples with known (or realized) values of the dependent variable comprise what is often called a test data set, which is used to evaluate how accurately the model predicts the dependent variable. For the purpose of such model assessment, a complete data set can be randomly partitioned into a training set and a test set. See the summary of cross-validation below.
Regardless of the type or the source of the data, it must be filtered from its raw form into formats required by data mining algorithms. Categorical data must be mapped into a corresponding numerical representation. Some algorithms treat categorical data and continuous data differently while other algorithms interpret all data as continuous. In algorithms that interpret all data as continuous, categorical variables first must be transformed. Below is an example of transforming a categorical variable.
Dummy variables are used to indicate the distinct values of categorical variables without implying order. To illustrate, suppose in a survey of consumer preferences, the categorical variable x indicates the type of chocolate, {White, Milk, Dark}, and these values are encoded to \(\{1, 2, 3\}\) in the data. If x is used directly, the algorithm presumes a scale \((1 < 2 < 3)\) that is not meaningful and can lead to invalid inferences about the relationships between the variables. Creating dummy variables represents a categorical variable with a set of indicators that puts each distinct value on the same numerical level. The table below shows the binary encoding of x into \(3\) dummy variables.
Chocolate | x-value | dummy1 | dummy2 | dummy3 |
---|---|---|---|---|
White | \(1\) | \(1\) | \(0\) | \(0\) |
Milk | \(2\) | \(0\) | \(1\) | \(0\) |
Dark | \(3\) | \(0\) | \(0\) | \(1\) |
The class UnsupervisedNominalFilter uses binary encoding to map nominal data
into a matrix of zeros and ones. The resulting columns are often referred to
as dummy variables.
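For illustration, the following minimal sketch applies the same binary encoding to the chocolate example by hand; it demonstrates the transformation only and does not use the UnsupervisedNominalFilter API.

```java
// One-hot (dummy-variable) encoding of a categorical variable by hand.
// Categories are assumed to be encoded 1..numCategories, as in the table
// above (White=1, Milk=2, Dark=3).
public class DummyEncoding {
    public static void main(String[] args) {
        int[] x = {1, 3, 2, 1};            // observed chocolate types
        int numCategories = 3;
        int[][] dummies = new int[x.length][numCategories];
        for (int i = 0; i < x.length; i++) {
            dummies[i][x[i] - 1] = 1;      // one indicator column per category
        }
        for (int[] row : dummies) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```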
If the categorical variable is ordinal, so that order between the levels is
meaningful, the class UnsupervisedOrdinalFilter encodes and decodes ordinal
data into the range [0, 1] using cumulative percentages.
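One common cumulative-percentage scheme maps each level to the proportion of observations at or below it. The sketch below illustrates that idea; the exact transform used by UnsupervisedOrdinalFilter may differ in detail.

```java
// Ordinal encoding via cumulative percentages: each level is mapped to the
// fraction of observations at or below it, yielding values in (0, 1].
public class OrdinalEncoding {
    public static void main(String[] args) {
        int[] x = {0, 0, 1, 1, 1, 2};      // ordinal levels 0 < 1 < 2
        int numLevels = 3;
        int[] counts = new int[numLevels];
        for (int level : x) counts[level]++;
        double[] cumulative = new double[numLevels];
        int running = 0;
        for (int k = 0; k < numLevels; k++) {
            running += counts[k];
            cumulative[k] = (double) running / x.length;
        }
        for (int level : x) {
            System.out.println(level + " -> " + cumulative[level]);
        }
    }
}
```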
Continuous data may need to be scaled. Many algorithms, such as neural
networks and support vector machines, perform better if continuous data are
mapped into a common scale. Class ScaleFilter implements several techniques
for automatically scaling continuous data, including several variations of
z-score scaling. If the continuous data represent a time series,
TimeSeriesFilter and TimeSeriesClassFilter can be used to create a matrix of
lagged values required as input to neural networks. More details on these
methods can be found in the com.imsl.datamining.neural package.
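For reference, the basic z-score transform standardizes each value by subtracting the sample mean and dividing by the sample standard deviation. A minimal sketch, independent of the ScaleFilter API:

```java
// Basic z-score scaling: z = (x - mean) / stdDev, using the sample
// standard deviation. ScaleFilter provides this and other variants.
public class ZScore {
    public static void main(String[] args) {
        double[] x = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(ss / (x.length - 1));
        for (double v : x) {
            System.out.println((v - mean) / stdDev);
        }
    }
}
```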
Apriori performs the Apriori algorithm for finding strong association rules.
Association rules are statements of the form "if X, then Y", given with some
measure of confidence. For example, in a supermarket, X and Y might be
different products, such as bread and butter. Learning which products are
strongly associated helps managers make more profitable marketing decisions,
such as product placement or sales promotions. There are other applications
for association rule discovery, such as in text mining and bioinformatics.
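To make the measures concrete: for a rule X ⇒ Y, the support is the fraction of transactions containing both X and Y, and the confidence is the fraction of transactions containing X that also contain Y. The sketch below computes both for a single candidate rule by hand; it does not use the JMSL Apriori or AssociationRule classes.

```java
import java.util.List;
import java.util.Set;

// Support and confidence for the single rule {bread} => {butter}.
public class RuleMeasures {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
            Set.of("bread", "butter", "milk"),
            Set.of("bread", "butter"),
            Set.of("bread", "jam"),
            Set.of("milk", "eggs"));
        int countX = 0, countXY = 0;
        for (Set<String> t : transactions) {
            if (t.contains("bread")) {
                countX++;
                if (t.contains("butter")) countXY++;
            }
        }
        double support = (double) countXY / transactions.size();  // 0.5
        double confidence = (double) countXY / countX;            // ~0.667
        System.out.println("support = " + support
            + ", confidence = " + confidence);
    }
}
```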
A self-organizing map (SOM), also known as a Kohonen map or Kohonen SOM, is a
technique for gathering high-dimensional data into clusters that are
constrained to lie in a low-dimensional space, usually two dimensions. It is
a widely used technique for feature extraction and visualization of very
high-dimensional data. The Kohonen SOM is equivalent to a neural network
having inputs linked to every node in the network. The classes KohonenSOM and
KohonenSOMTrainer provide the methods for creating, training, and forecasting
with a Kohonen map. Training builds the map using input examples, and
forecasting classifies new input.
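The core of SOM training is simple: find the map node whose weight vector is closest to the input, then pull that node (and, in a full implementation, its neighbors) toward the input. A minimal single-step sketch, assuming Euclidean distance and updating only the winning node:

```java
// One simplified SOM training step: locate the best-matching node and move
// its weights toward the input. A complete trainer also updates neighboring
// nodes and decays the learning rate and neighborhood radius over time.
public class SomStep {
    public static void main(String[] args) {
        double[][] nodes = {{0.0, 0.0}, {0.5, 0.5}, {1.0, 1.0}};
        double[] input = {0.6, 0.4};
        double learningRate = 0.1;
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < nodes.length; i++) {
            double d = 0.0;
            for (int j = 0; j < input.length; j++) {
                double diff = input[j] - nodes[i][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        for (int j = 0; j < input.length; j++) {
            nodes[best][j] += learningRate * (input[j] - nodes[best][j]);
        }
        System.out.println("winner after update: "
            + java.util.Arrays.toString(nodes[best]));
    }
}
```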
Naive Bayes is an algorithm for supervised learning that is built upon Bayes'
rule for conditional probability. The class NaiveBayesClassifier can be used
to train the classifier using continuous or categorical predictors, or a
mixture of both, to classify a categorical target variable.
Classification problems can be solved using other algorithms, such as
discriminant analysis and neural networks. In general, these alternatives
have smaller classification error rates, but they are too slow for large
classification problems. During training, NaiveBayesClassifier uses the
non-missing training data to estimate two-way correlations among the
attributes. Higher-order correlations are assumed to be zero. This can
increase the classification error rate, but it significantly reduces the time
needed to train the classifier.
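Bayes' rule combined with the conditional-independence assumption gives the score P(class) × ∏ P(attribute | class), normalized over the classes. A minimal sketch with made-up probabilities (not estimates from real training data):

```java
// Naive Bayes scoring for a binary class with two categorical predictors.
// The defining assumption: predictors are conditionally independent given
// the class, so their likelihoods multiply. All values are illustrative.
public class NaiveBayesSketch {
    public static void main(String[] args) {
        double[] prior = {0.6, 0.4};   // P(class = 0), P(class = 1)
        double[] p1 = {0.3, 0.8};      // P(x1 = observed value | class)
        double[] p2 = {0.5, 0.2};      // P(x2 = observed value | class)
        double[] score = new double[2];
        double total = 0.0;
        for (int c = 0; c < 2; c++) {
            score[c] = prior[c] * p1[c] * p2[c];
            total += score[c];
        }
        for (int c = 0; c < 2; c++) {
            System.out.println("P(class=" + c + " | x) = " + score[c] / total);
        }
    }
}
```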
An artificial neural network, or neural network, is a flexible modeling
framework that can be used in many of the applications and problem areas in
data mining. Using terms inspired by the biological brain, the elements of a
neural network are nodes, layers, and activation functions. There are many
options for setting up a network. Once the architecture of the network is
specified, it can be trained on the training data and evaluated on test data,
similar to other supervised learning algorithms. For more details, see the
com.imsl.datamining.neural package.
PredictiveModel is the abstract base class for predictive models like
decision trees. It contains methods and class members common to different
predictive models in regression or classification problems. Users can
leverage the abstract class and its methods to create customized predictive
models. JMSL includes two major packages, com.imsl.datamining.decisionTree
and com.imsl.datamining.supportvectormachine, that extend PredictiveModel.
Decision trees are predictive models for classification or regression. The
com.imsl.datamining.decisionTree package includes four specific algorithms,
ALACART, C45, CHAID, and QUEST, for generating a decision tree. For more
details and examples, see the com.imsl.datamining.decisionTree package.
Support Vector Machines (SVMs) are a widely used machine learning method for
regression, classification, and other learning tasks. More information on the
available SVMs and kernels can be found in the
com.imsl.datamining.supportvectormachine package.
Cross-validation is an important resampling method that can be used for model
assessment or model selection. Class CrossValidation performs k-fold
cross-validation on predictive models, like decision trees or support vector
machines. In k-fold cross-validation, the set of observations is randomly
split into k disjoint folds of approximately equal size. The first fold is
then treated as a test set, and the model is trained on the remaining k-1
folds. This procedure is repeated k times, with each of the folds serving
once as a test set. Applying the fitted models to their test sets results in
k estimates for the test error. The cross-validated error is computed by
averaging these values. For classification problems, stratified
cross-validation can be performed by setting a flag via the
CrossValidation.setStratifiedCrossValidation(boolean) method.
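The bookkeeping behind that procedure is sketched below: shuffle the observation indices, assign them to k folds, and average the per-fold test errors. The train-and-score step is a hypothetical placeholder for any predictive model; this is not the JMSL CrossValidation API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// k-fold cross-validation bookkeeping: each observation lands in exactly
// one test fold, and the k per-fold errors are averaged.
public class KFoldSketch {
    public static void main(String[] args) {
        int n = 20, k = 5;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(42));
        double errorSum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) test.add(indices.get(i));
                else train.add(indices.get(i));
            }
            errorSum += trainAndScore(train, test);
        }
        System.out.println("cross-validated error = " + errorSum / k);
    }

    // Hypothetical stand-in: fit a model on the training rows and return
    // its error on the test rows.
    static double trainAndScore(List<Integer> train, List<Integer> test) {
        return 0.0;
    }
}
```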
Bootstrap aggregation (bagging) is a statistical technique designed to
improve the accuracy of predictive models by reducing variability. Given a
specific predictive model and a single training data set of size N,
class BootstrapAggregation performs bagging by
taking repeated bootstrap samples of size N from the training data
set. The predictive model is then trained on each bootstrap sample
separately, and predictions are generated. The predictions are finally
combined into a single value by averaging (for regression problems) or
majority vote (for classification problems).
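The bagging loop itself is short: draw bootstrap samples by sampling row indices with replacement, train one model per sample, and combine the predictions. A sketch with a hypothetical train-and-predict placeholder (not the BootstrapAggregation API):

```java
import java.util.Random;

// Bagging for regression: average the predictions of models trained on
// repeated bootstrap samples. For classification, replace the average
// with a majority vote over the models' predicted classes.
public class BaggingSketch {
    public static void main(String[] args) {
        int n = 100, numBootstraps = 25;
        Random rng = new Random(7);
        double predictionSum = 0.0;
        for (int b = 0; b < numBootstraps; b++) {
            int[] sample = new int[n];
            for (int i = 0; i < n; i++) {
                sample[i] = rng.nextInt(n); // sample with replacement
            }
            predictionSum += trainAndPredict(sample);
        }
        System.out.println("bagged prediction = "
            + predictionSum / numBootstraps);
    }

    // Hypothetical stand-in: fit a model on the sampled rows and predict
    // the response for a new observation.
    static double trainAndPredict(int[] bootstrapRows) {
        return 1.0;
    }
}
```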
Like bagging, gradient boosting is an approach that iteratively improves the
predictions from a predictive model, specifically a decision tree. Class
GradientBoosting implements a special form of gradient boosting, the
stochastic gradient tree boosting algorithm of Friedman (1999). Class
GradientBoosting can be applied to regression and classification problems.
For classification problems, the binomial or multinomial deviance loss
function must be used.
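For squared-error regression, the idea reduces to repeatedly fitting a weak learner to the current residuals and adding a damped copy of its fit to the running prediction. A deliberately trivial sketch in which the weak learner is just the mean residual (a real implementation fits a small tree instead):

```java
import java.util.Arrays;

// Gradient boosting with squared-error loss: the negative gradient is the
// residual, so each round fits the residuals and adds a shrunken update.
public class BoostingSketch {
    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0, 4.0};
        double[] prediction = new double[y.length]; // initialized to 0
        double shrinkage = 0.5;
        for (int iter = 0; iter < 10; iter++) {
            double meanResidual = 0.0;
            for (int i = 0; i < y.length; i++) {
                meanResidual += y[i] - prediction[i];
            }
            meanResidual /= y.length;
            for (int i = 0; i < y.length; i++) {
                prediction[i] += shrinkage * meanResidual; // damped update
            }
        }
        System.out.println(Arrays.toString(prediction));
    }
}
```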
Another ensemble method, random forest (Breiman, 2001), is implemented by the
class RandomTrees. A random forest is a collection of decision trees on
bootstrap samples. In addition, the set of predictor variables is randomized
before each branching or splitting decision within the decision tree
algorithm. This extra randomization reduces correlation among the different
trees in the ensemble. The class RandomTrees is in the
com.imsl.datamining.decisionTree package.
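The per-split randomization is the distinctive step: before each split, only a random subset of the predictors is considered as candidates. A minimal sketch, assuming the common default of roughly the square root of the number of predictors:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Selecting the random candidate predictors for one split in a random
// forest. Subset size sqrt(p) is a common default, not a fixed rule.
public class FeatureSubsetSketch {
    public static void main(String[] args) {
        int numPredictors = 16;
        int subsetSize = (int) Math.sqrt(numPredictors); // 4 here
        List<Integer> features = new ArrayList<>();
        for (int j = 0; j < numPredictors; j++) features.add(j);
        Collections.shuffle(features, new Random(11));
        List<Integer> candidates = features.subList(0, subsetSize);
        System.out.println("candidates for this split: " + candidates);
    }
}
```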
Copyright © 2020 Rogue Wave Software. All rights reserved.