Package com.imsl.datamining

Data mining and machine learning.


Data mining refers to the process of using statistical and analytical methods to extract useful information from large databases. The problem of extracting information from data is prevalent in government, business, education, industry, engineering, medicine, and the sciences. The methods and algorithms used in data mining have been invented and developed, with considerable overlap, in machine learning, statistical learning, and statistics. While there are theoretical and philosophical differences between these fields of study, the nuances are not important from a practical standpoint. Whether a method comes from statistics or from machine learning, the goal is the same: learning from data.

Data Types

In general, data fall into two major categories: continuous and categorical. A continuous variable can assume any real number within a certain range. Examples of continuous variables include temperature, height, weight, circumference, body mass index, and rate of return. Count data are often treated as continuous variables in data mining algorithms: although they assume only discrete values, their set of possible values is infinite. Examples of count data include the number of accidents per year, the number of units sold, the number of insurance claims, and so on.

Categorical variables take on values from a finite list of categories. There are two types of categorical data: ordinal and nominal. Ordinal data have a natural ordering among the categories, such as a school grade. Nominal data are categories without a natural ordering, such as eye color.

Sometimes continuous variables are binned into a finite set. Two examples are income level: {less than 25K, between 25K and 50K, over 50K}, and body weight: {underweight, normal, overweight, obese}. Binned continuous variables are often treated as ordinal-categorical type variables for modeling purposes.

Other types of data deserve special mention: transaction (or invoice) data and text data. A transaction (think of a grocery store receipt) has attributes such as date and time, total amount, the set of products purchased, their quantities and prices, and possibly attributes of the individual customer making the purchase. Text data is discrete and unordered, but the association of words in sentences, which forms context, makes text an important data subtype for data mining applications.

Data Mining Problem Types

The primary types of data mining problems are pattern recognition and prediction. Prediction includes the subtypes classification, regression, and forecasting.

Pattern recognition algorithms are designed to detect patterns in large, high-dimensional, and complex data sets. Pattern recognition problems fall under two broad categories: supervised and unsupervised. In supervised problems, the number of groups or categories is known and each example (observation) in the training data has a known outcome (or response). The set of attributes or predictor variables measured on each example may relate to the response variable, and may then be used to predict the outcomes of future or new examples. Supervised learning algorithms try to detect the relationship between the set of attributes and the outcome of the response variable.

Prediction problems are supervised problems concerned with predicting an outcome of a variable using known attributes as inputs into a statistical model. In prediction problems, there is a single variable of interest called a dependent variable, or a response variable, or sometimes a label when the variable is categorical. The set of attributes consists of other variables that may have some relationship with the variable of interest. These variables are variously referred to as independent, explanatory, predictor variables, attributes, or features.

Classification is a prediction problem in which the response is categorical; regression is a prediction problem in which the response is continuous; and forecasting is a prediction problem in which the response variable and predictor variables are indexed by time. Most algorithms in this package have methods for either classification or regression. For time series, neural networks and support vector machines have both been used successfully.

In unsupervised problems, there is no known outcome or response. Each example in the training data is a vector of measurements on a number (often a very large number) of variables. The problem is to detect any patterns or structure that might exist in the high-dimensional space spanned by the variables. With a smaller set of natural groupings (clusters) or structures in lower dimensions, stronger inferences can be made about the population or the distribution of the variables. In this package, the Kohonen self-organizing map (KohonenSOM) is an example of an unsupervised learning algorithm. Many of the algorithms described in the Multivariate Analysis package are unsupervised learning algorithms.

For any data mining model to be useful, it must be given data it can learn from: examples with known values of the predictor variables and known values of the dependent variable. This data set is called the training data. Once the model is trained (fitted) on the training data, new examples, ones not used to train the model, are run through the model to obtain an estimated (predicted) value of the dependent variable. If the true value of the dependent variable is known, the predicted value can be compared to it and a prediction error recorded. New examples with known (or realized) values of the dependent variable comprise what is often called a test data set, which is used to evaluate how accurately the model predicts the dependent variable. For the purpose of such model assessment, a complete data set can be randomly partitioned into a training set and a test set, as sketched below. See also the summary of cross-validation later in this description.
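
As a concrete illustration, here is a minimal Java sketch of randomly partitioning a data set into training and test sets. The 70/30 split ratio, the synthetic data, and the class name are illustrative choices, not part of the JMSL API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** A minimal sketch of a random train/test partition of a data set. */
public class TrainTestSplit {
    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        Random rng = new Random(42);
        for (int i = 0; i < 100; i++) {
            data.add(new double[] { rng.nextGaussian(), rng.nextGaussian() });
        }
        Collections.shuffle(data, rng);          // randomize the order of the examples
        int cut = (int) (0.7 * data.size());     // e.g., 70% training, 30% test
        List<double[]> train = data.subList(0, cut);
        List<double[]> test = data.subList(cut, data.size());
        System.out.println("training examples: " + train.size()
                + ", test examples: " + test.size());
    }
}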

Data Filtering and Scaling

Regardless of the type or the source of the data, it must be filtered from its raw form into formats required by data mining algorithms. Categorical data must be mapped into a corresponding numerical representation. Some algorithms treat categorical data and continuous data differently while other algorithms interpret all data as continuous. In algorithms that interpret all data as continuous, categorical variables first must be transformed. Below is an example of transforming a categorical variable.

Dummy variables are used to indicate the distinct values of categorical variables without implying order. To illustrate, suppose in a survey of consumer preferences, the categorical variable x indicates the type of chocolate, {White, Milk, Dark}, and these values are encoded to \(\{1, 2, 3\}\) in the data. If x is used directly, the algorithm presumes a scale \((1 < 2 < 3)\) that is not true or meaningful and will lead to invalid inferences about the relationships between the variables. Creating dummy variables instead represents a categorical variable with a set of indicators that puts each distinct value on the same numerical level. The table below shows the binary encoding of x into \(3\) dummy variables.

Chocolate   x-value   dummy1   dummy2   dummy3
White       \(1\)     \(1\)    \(0\)    \(0\)
Milk        \(2\)     \(0\)    \(1\)    \(0\)
Dark        \(3\)     \(0\)    \(0\)    \(1\)

The class UnsupervisedNominalFilter uses binary encoding to map nominal data into a matrix of zeros and ones. The resulting columns are often referred to as dummy variables.
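
For illustration, the following plain-Java sketch produces the binary (one-hot) encoding shown in the table. It is written independently of the UnsupervisedNominalFilter API; the data and class name are hypothetical.

/** A minimal sketch of binary (one-hot) encoding of a nominal variable. */
public class DummyEncoding {
    public static void main(String[] args) {
        int numCategories = 3;                       // White, Milk, Dark
        int[] x = { 1, 3, 2, 1 };                    // observed category codes (1-based)
        int[][] dummies = new int[x.length][numCategories];
        for (int i = 0; i < x.length; i++) {
            dummies[i][x[i] - 1] = 1;                // one indicator column per category
        }
        for (int[] row : dummies) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}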

If the categorical variable is ordinal, so that order between the levels is meaningful, the class UnsupervisedOrdinalFilter encodes and decodes ordinal data into the range [0, 1] using cumulative percentages.
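
The cumulative-percentage idea can be sketched in a few lines of plain Java. The level counts and class name below are illustrative, and the exact conventions of UnsupervisedOrdinalFilter may differ.

/** A minimal sketch of cumulative-percentage encoding for an ordinal variable. */
public class OrdinalEncoding {
    public static void main(String[] args) {
        int numLevels = 4;                  // e.g., grades 0..3
        int[] x = { 0, 1, 1, 2, 3, 3, 3 };  // observed ordinal codes (0-based)
        int[] counts = new int[numLevels];
        for (int v : x) counts[v]++;
        double[] encoding = new double[numLevels];
        double cum = 0.0;
        for (int k = 0; k < numLevels; k++) {
            cum += counts[k];
            encoding[k] = cum / x.length;   // cumulative proportion in [0, 1]
        }
        // each observation is replaced by the encoding of its level
        for (int v : x) System.out.println(v + " -> " + encoding[v]);
    }
}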

Continuous data may need to be scaled. Many algorithms, such as neural networks and support vector machines, perform better if continuous data are mapped into a common scale. Class ScaleFilter implements several techniques for automatically scaling continuous data, including several variations of z-score scaling. If the continuous data represent a time series, TimeSeriesFilter and TimeSeriesClassFilter can be used to create a matrix of lagged values required as input to neural networks. More details on these methods can be found in the com.imsl.datamining.neural package.
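
As an example of one such technique, this sketch applies z-score scaling to a single variable. It is written independently of the ScaleFilter API, with made-up data.

/** A minimal sketch of z-score scaling for a continuous variable. */
public class ZScore {
    public static void main(String[] args) {
        double[] x = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double sd = Math.sqrt(ss / (x.length - 1));        // sample standard deviation
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) z[i] = (x[i] - mean) / sd;
        System.out.println(java.util.Arrays.toString(z));  // scaled values
    }
}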

Apriori Market Basket Analysis

Market basket analysis is an unsupervised data mining problem for detecting strong associations between products or items in transactional data. The problem arose especially with the advent of digital scanners and scanner data in supermarkets, hence the name "market basket". The class Apriori performs the Apriori algorithm for finding strong association rules. Association rules are statements of the form, "if X, then Y", given with some measure of confidence. For example, in a supermarket X and Y are different products such as bread and butter. Learning which products are strongly associated helps managers make more profitable marketing decisions, such as product placement or sales promotions. There are other applications for association rule discovery, such as in text mining and bioinformatics.
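
The two measures at the heart of association rules, support and confidence, can be sketched directly. The items and transactions below are made up for illustration and do not use the Apriori API.

import java.util.List;
import java.util.Set;

/** A minimal sketch of support and confidence for a rule "if X, then Y". */
public class RuleConfidence {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "butter", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "jam"),
                Set.of("butter", "milk"),
                Set.of("bread", "butter", "jam"));
        String x = "bread", y = "butter";
        long countX = transactions.stream().filter(t -> t.contains(x)).count();
        long countXY = transactions.stream()
                .filter(t -> t.contains(x) && t.contains(y)).count();
        double support = (double) countXY / transactions.size();
        double confidence = (double) countXY / countX;   // estimated P(Y | X)
        System.out.printf("support(X,Y)=%.2f confidence(X->Y)=%.2f%n",
                support, confidence);
    }
}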

Kohonen Self-organizing Map

A self-organizing map (SOM), also known as a Kohonen map or Kohonen SOM, is a technique for gathering high-dimensional data into clusters that are constrained to lie in a low-dimensional space, usually two dimensions. It is widely used for feature extraction and visualization of very high-dimensional data. The Kohonen SOM is equivalent to a neural network having inputs linked to every node in the network. The classes KohonenSOM and KohonenSOMTrainer provide the methods for creating, training, and forecasting with a Kohonen map. Training builds the map using input examples, and forecasting classifies new input.
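
The core training step of a SOM can be sketched in plain Java: find the node whose weight vector best matches the input and move it toward the input. The grid size, learning rate, and the omitted neighborhood function (which would also update the winner's grid neighbors) are illustrative simplifications, not the KohonenSOM API.

import java.util.Random;

/** A minimal sketch of one SOM training step on a small 2-D grid. */
public class SomStep {
    public static void main(String[] args) {
        Random rng = new Random(0);
        int rows = 3, cols = 3, dim = 2;
        double[][][] w = new double[rows][cols][dim];      // node weight vectors
        for (double[][] r : w)
            for (double[] node : r)
                for (int d = 0; d < dim; d++) node[d] = rng.nextDouble();

        double[] x = { 0.2, 0.8 };                         // one training input
        int bi = 0, bj = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < rows; i++)                     // find best-matching node
            for (int j = 0; j < cols; j++) {
                double d2 = 0.0;
                for (int d = 0; d < dim; d++)
                    d2 += (x[d] - w[i][j][d]) * (x[d] - w[i][j][d]);
                if (d2 < best) { best = d2; bi = i; bj = j; }
            }
        double alpha = 0.5;                                // learning rate
        for (int d = 0; d < dim; d++)
            w[bi][bj][d] += alpha * (x[d] - w[bi][bj][d]); // move winner toward x
        System.out.println("winner=(" + bi + "," + bj + ")");
    }
}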

Naive Bayes

Naive Bayes is an algorithm for supervised learning that is built upon Bayes' rule for conditional probability. The class NaiveBayesClassifier can be used to train the classifier using continuous or categorical predictors, or a mixture of both, to classify a categorical target variable.

Classification problems can be solved using other algorithms such as discriminant analysis and neural networks. In general, these alternatives have smaller classification error rates, but they are too slow for large classification problems. During training, NaiveBayesClassifier uses the non-missing training data to estimate two-way correlations among the attributes. Higher order correlations are assumed to be zero. This can increase the classification error rate, but it significantly reduces the time needed to train the classifier.
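
The classification rule itself can be sketched compactly: the posterior of each class is proportional to the class prior times the product of the per-attribute conditional probabilities. The probability tables below are made-up numbers, and the sketch is independent of the NaiveBayesClassifier API.

/** A minimal sketch of the naive Bayes classification rule. */
public class NaiveBayesSketch {
    public static void main(String[] args) {
        double[] prior = { 0.6, 0.4 };    // P(class)
        // P(attribute value | class) for two attributes of one example
        double[][] likelihood = {
                { 0.3, 0.7 },             // attribute 1: one entry per class
                { 0.8, 0.2 }              // attribute 2: one entry per class
        };
        double[] score = new double[prior.length];
        double total = 0.0;
        for (int c = 0; c < prior.length; c++) {
            score[c] = prior[c];
            for (double[] attr : likelihood) score[c] *= attr[c];
            total += score[c];
        }
        for (int c = 0; c < score.length; c++)
            System.out.printf("P(class=%d | x) = %.3f%n", c, score[c] / total);
    }
}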

Neural Networks

An artificial neural network, or neural network, is a flexible modeling framework that can be used in many of the applications and problem areas in data mining. Using terms inspired by the biological brain, the elements of a neural network are nodes, layers, and activation functions. There are many options for setting up a network. Once the architecture of the network is specified, it can be trained on the training data and evaluated on test data, similar to other supervised learning algorithms. For more details, see the com.imsl.datamining.neural package.

Predictive Models

Class PredictiveModel is the abstract base class for predictive models such as decision trees. It contains methods and class members common to predictive models for regression and classification problems. Users can extend the abstract class and its methods to create customized predictive models. JMSL includes two major packages, com.imsl.datamining.decisionTree and com.imsl.datamining.supportvectormachine, whose classes extend PredictiveModel.

Decision trees are predictive models for classification or regression. The com.imsl.datamining.decisionTree package includes four specific algorithms, ALACART, C45, CHAID, and QUEST, for generating a decision tree. For more details and examples, see the com.imsl.datamining.decisionTree package.

Support Vector Machines (SVMs) are a widely used machine learning method for regression, classification and other learning tasks. More information on the available SVMs and kernels can be found in the com.imsl.datamining.supportvectormachine package.

Cross-Validation

Cross-validation is an important resampling method that can be used for model assessment or model selection. Class CrossValidation performs k-fold cross-validation on predictive models such as decision trees or support vector machines. In k-fold cross-validation, the set of observations is randomly split into k disjoint folds of approximately equal size. The first fold is treated as a test set, and the model is trained on the remaining k-1 folds. This procedure is repeated k times, with each fold serving once as the test set. Applying the fitted models to their test sets yields k estimates of the test error, and the cross-validated error is their average. For classification problems, stratified cross-validation can be performed by calling the CrossValidation.setStratifiedCrossValidation(boolean) method.
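
The k-fold procedure can be sketched in plain Java. Here the "model" is simply the training-fold mean, a stand-in for a decision tree or support vector machine; the data and class name are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** A minimal sketch of the k-fold cross-validation loop. */
public class KFold {
    public static void main(String[] args) {
        Random rng = new Random(1);
        int n = 20, k = 5;
        double[] y = new double[n];
        for (int i = 0; i < n; i++) y[i] = rng.nextGaussian();

        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, rng);                 // random assignment to folds

        double errorSum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            double trainSum = 0.0;
            int trainCount = 0;
            for (int i = 0; i < n; i++)                // "train" on the other k-1 folds
                if (i % k != fold) { trainSum += y[idx.get(i)]; trainCount++; }
            double model = trainSum / trainCount;      // fitted model: the mean

            double sse = 0.0;
            int testCount = 0;
            for (int i = 0; i < n; i++)                // evaluate on the held-out fold
                if (i % k == fold) { sse += Math.pow(y[idx.get(i)] - model, 2); testCount++; }
            errorSum += sse / testCount;               // test error for this fold
        }
        System.out.println("cross-validated error: " + errorSum / k);
    }
}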

Ensemble Methods

An ensemble method fits a collection of predictive models and combines their outputs. The approach helps reduce variability and overfitting and improves predictive accuracy. Decision trees, in particular, have been shown to improve dramatically when used in ensembles.

Bootstrap Aggregation

Bootstrap aggregation (bagging) is a statistical technique designed to improve the accuracy of predictive models by reducing variability. Given a specific predictive model and a single training data set of size N, class BootstrapAggregation performs bagging by taking repeated bootstrap samples of size N from the training data set. The predictive model is then trained on each bootstrap sample separately, and predictions are generated. The predictions are finally combined into a single value by averaging (for regression problems) or majority vote (for classification problems).
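
A minimal sketch of the bootstrap-and-average loop for a regression problem follows. The sample mean stands in for a real predictive model, and the data and class name are illustrative.

import java.util.Random;

/** A minimal sketch of bagging for a regression problem. */
public class BaggingSketch {
    public static void main(String[] args) {
        Random rng = new Random(7);
        int n = 10, numBootstraps = 25;
        double[] y = new double[n];
        for (int i = 0; i < n; i++) y[i] = rng.nextGaussian();

        double predictionSum = 0.0;
        for (int b = 0; b < numBootstraps; b++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += y[rng.nextInt(n)];          // bootstrap: sample with replacement
            predictionSum += sum / n;              // "model" fit to this bootstrap sample
        }
        // for regression, the bootstrap predictions are combined by averaging;
        // for classification, a majority vote would be taken instead
        System.out.println("bagged prediction: " + predictionSum / numBootstraps);
    }
}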

Gradient Boosting

Like bagging, gradient boosting iteratively improves the predictions of a predictive model, specifically a decision tree. Class GradientBoosting implements a special form of gradient boosting, the stochastic gradient tree boosting algorithm of Friedman (1999). It can be applied to both regression and classification problems; for classification problems, the binomial or multinomial deviance loss function must be used.
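
The additive, residual-fitting idea behind gradient boosting can be sketched for squared-error loss. The weak learner here is a one-split stump with a fixed split point, a deliberate simplification: a real implementation searches for the best split and, in the stochastic variant, subsamples the training data at each stage. The data and class name are illustrative.

/** A minimal sketch of gradient boosting for regression with squared-error loss. */
public class BoostingSketch {
    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4, 5, 6, 7, 8 };
        double[] y = { 1.2, 0.9, 1.1, 1.0, 3.1, 2.9, 3.2, 3.0 };
        int n = x.length, stages = 50;
        double shrinkage = 0.1;
        double[] f = new double[n];                    // current model predictions

        for (int m = 0; m < stages; m++) {
            double[] r = new double[n];                // residuals = negative gradient
            for (int i = 0; i < n; i++) r[i] = y[i] - f[i];

            // fit a stump: split at x <= 4, predict the residual mean on each side
            double leftSum = 0, rightSum = 0;
            int leftN = 0, rightN = 0;
            for (int i = 0; i < n; i++) {
                if (x[i] <= 4) { leftSum += r[i]; leftN++; }
                else { rightSum += r[i]; rightN++; }
            }
            for (int i = 0; i < n; i++) {
                double step = (x[i] <= 4) ? leftSum / leftN : rightSum / rightN;
                f[i] += shrinkage * step;              // shrunken stage update
            }
        }
        for (int i = 0; i < n; i++)
            System.out.printf("x=%.0f y=%.1f f=%.2f%n", x[i], y[i], f[i]);
    }
}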

Random Forest

Class RandomTrees implements another ensemble method, the random forest (Breiman, 2001). A random forest is a collection of decision trees fit on bootstrap samples. In addition, the set of predictor variables is randomized before each branching or splitting decision within the decision tree algorithm. This extra randomization reduces the correlation among the trees in the ensemble. The class RandomTrees is in the com.imsl.datamining.decisionTree package.
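
The extra randomization step can be sketched in a few lines: before each split, only a random subset of the predictors is considered as split candidates. The mtry name and the square-root rule below are conventional random forest choices, not necessarily the RandomTrees defaults.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** A minimal sketch of choosing random split candidates in a random forest. */
public class FeatureSubset {
    public static void main(String[] args) {
        int numPredictors = 10;
        int mtry = (int) Math.ceil(Math.sqrt(numPredictors));  // candidates per split
        List<Integer> predictors = new ArrayList<>();
        for (int j = 0; j < numPredictors; j++) predictors.add(j);
        Collections.shuffle(predictors, new Random(3));        // randomize the predictors
        List<Integer> candidates = predictors.subList(0, mtry);
        System.out.println("split candidates: " + candidates);
    }
}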

Copyright © 2020 Rogue Wave Software. All rights reserved.