Class NaiveBayesClassifier
- All Implemented Interfaces:
Serializable
NaiveBayesClassifier trains a Naive Bayes classifier for
classifying data into one of nClasses target classes. Input
attributes can be a combination of both nominal and continuous data. Ordinal
data can be treated as either nominal attributes or continuous. If the
distribution of the ordinal data is known or can be approximated using one of
the continuous distributions, then associating them with continuous
attributes allows a user to specify that distribution. Missing values are
allowed.
Before training the classifier the input attributes must be specified. Call
method createNominalAttribute once for each of the nNominal
nominal attributes to specify its number of categories.
Specify the input attributes in the same column order that they will be
supplied to the train method. For example, if the first two
columns of the nominal input data, nominalData, represent the
first two nominal attributes and have two and three categories respectively,
then the first call to the createNominalAttribute method would
specify two categories and the second call to
createNominalAttribute would specify three categories.
Likewise, for each continuous attribute, the method
createContinuousAttribute can be used to specify a
ProbabilityDistribution other than the default
NormalDistribution. A second
createContinuousAttribute is provided to allow specification of
a different distribution for each target class (see Example 3). Create each
continuous attribute in the same column order they will be supplied to the
train method. If createContinuousAttribute is not
invoked for all nContinuous attributes, the
NormalDistribution ProbabilityDistribution will be
used. For example, if five continuous attributes have been specified in the
constructor, but only three calls to createContinuousAttribute
have been invoked, the last two attributes, or columns of
continuousData in the train method, will use the
NormalDistribution ProbabilityDistribution.
Nominal only, continuous only, and a combination of both nominal and
continuous input attributes are allowed. The three train methods
allow for a combination of input attribute types.
Let C be the classification attribute with target categories
\(0, 1, \ldots, \mbox{nClasses}-1\), and let
\(X = \{x_1, x_2, \ldots, x_k\}\) be a vector valued array
of k=nNominal+nContinuous input attributes,
where nNominal is the number of nominal attributes and
nContinuous is the number of continuous attributes. See methods
createNominalAttribute(int) to specify the number of categories for
each nominal attribute and createContinuousAttribute(com.imsl.stat.ProbabilityDistribution) to specify
the distribution for each continuous attribute. The classification problem
reduces to estimating the conditional probability
P(C|X) from a set of training patterns. Bayes' rule states that
this probability can be expressed as the ratio:
$$P(C = c|X = \{x_1, x_2, \ldots, x_k\}) =
\frac{P(C=c)P(X=\{x_1, x_2, \ldots, x_k\}|C=c)}{P(X=\{x_1, x_2,\ldots,x_k
\})}
$$
where c is equal to one of the target classes
\(0, 1, \ldots, \mbox{nClasses}-1\). In practice, the
denominator of this expression is constant across all target classes since it
is only a function of the given values of X. As a result, the Naive
Bayes algorithm does not expend computational time estimating
\(P(X=\{x_1, x_2,\ldots,x_k\})\) for every pattern.
Instead, a Naive Bayes classifier calculates the numerator
\(P(C=c)P(X=\{x_1, x_2, \ldots, x_k\}|C=c)\) for each
target class and then classifies X to the target class with the
largest value, i.e.,
$$X\xleftarrow[{\max (c = 0,1,\ldots, \mbox{nClasses}
- 1)}]{}P(C = c)P(X|C = c)
$$
The classifier simplifies this calculation by assuming conditional independence. That is, it assumes that:
$$P(X = \{x_1, x_2, \ldots, x_k\}|C=c) = \prod_{j=1}^{k} P(x_j|C=c)
$$
This is equivalent to assuming that the values of the input attributes, given C, are independent of one another, i.e.,
$$P(x_i|x_j,C=c)=P(x_i|C=c),\,\,\, \mbox{for all}\,\,\,i \neq j
$$
In real-world data this assumption rarely holds, yet in many cases this approach results in surprisingly low classification error rates. Although the estimate of \(P(C=c|X=\{x_1,x_2,\ldots,x_k\})\) from a Naive Bayes classifier is generally only an approximation, classifying patterns based upon the Naive Bayes algorithm can still have acceptably low classification error rates.
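The decision rule and the independence assumption can be illustrated with a small, self-contained sketch. This is not the library's implementation; the class NaiveBayesScoreSketch and its methods are hypothetical names, and the probabilities in main are made-up illustration values. It multiplies the prior by the per-attribute conditional probabilities, picks the class with the largest numerator, and normalizes the numerators to obtain posterior probabilities.

```java
/** Illustrative sketch only; names and values are assumptions, not part of the library. */
final class NaiveBayesScoreSketch {

    /**
     * Returns P(C=c) * prod_j P(x_j | C=c) for every class c.
     * prior[c] holds P(C=c); conditional[c][j] holds P(x_j | C=c)
     * already evaluated at the observed value of attribute j.
     */
    static double[] numerators(double[] prior, double[][] conditional) {
        double[] score = new double[prior.length];
        for (int c = 0; c < prior.length; c++) {
            double product = prior[c];
            for (double p : conditional[c]) {
                product *= p; // conditional independence: multiply the factors
            }
            score[c] = product;
        }
        return score;
    }

    /** Index of the largest score; ties resolve to the lowest class index. */
    static int argMax(double[] score) {
        int best = 0;
        for (int c = 1; c < score.length; c++) {
            if (score[c] > score[best]) {
                best = c;
            }
        }
        return best;
    }

    /** Divides each numerator by their sum, which stands in for P(X). */
    static double[] normalize(double[] score) {
        double sum = 0.0;
        for (double s : score) {
            sum += s;
        }
        double[] posterior = new double[score.length];
        for (int c = 0; c < score.length; c++) {
            posterior[c] = score[c] / sum;
        }
        return posterior;
    }

    public static void main(String[] args) {
        double[] prior = {0.6, 0.4};                        // made-up priors
        double[][] conditional = {{0.5, 0.2}, {0.25, 0.8}}; // made-up P(x_j | C=c)
        double[] score = numerators(prior, conditional);
        System.out.println(java.util.Arrays.toString(normalize(score)));
        System.out.println("predicted class: " + argMax(score));
    }
}
```

Because the denominator is the same for every class, argMax applied to the unnormalized numerators selects the same class as argMax applied to the normalized posteriors.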
For nominal attributes, this implementation of the Naive Bayes classifier estimates conditional probabilities using a smoothed estimate:
$$P(x_j|C=c)= \frac{ \# N \{ x_j \, \cap\, C=c \} + \lambda }{ \# N \{ C=c \} + \lambda J} \mbox{,}
$$
where #N{Z} is the number of training patterns with attribute Z and J is equal to the number of categories associated with the j-th attribute.
The probability P(C=c) is also estimated using a smoothed estimate:
$$P(C=c)= \frac{\# N\{C=c\} + \lambda }{\mbox{nPatterns} + \lambda (\mbox{nClasses})} \,\,\, \mbox{.}
$$
These estimates correspond to the maximum a posteriori (MAP) estimates for a
Dirichlet prior assuming equal priors. The smoothing parameter can be any
non-negative value. Setting \(\lambda=0\) corresponds to no
smoothing. The default smoothing used in this algorithm,
\(\lambda=1\), is commonly referred to as Laplace smoothing.
The smoothing parameter can be specified using the optional method
setDiscreteSmoothingValue.
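As a worked illustration of the two smoothed estimates, the following sketch evaluates them directly from counts. The class SmoothedEstimateSketch, its method names, and the counts used in main are hypothetical; only the formulas come from the text above.

```java
/** Illustrative sketch only; names and counts are assumptions, not part of the library. */
final class SmoothedEstimateSketch {

    /**
     * Smoothed P(x_j | C=c):
     * (#N{x_j and C=c} + lambda) / (#N{C=c} + lambda * nCategories).
     */
    static double conditional(int jointCount, int classCount, int nCategories, double lambda) {
        return (jointCount + lambda) / (classCount + lambda * nCategories);
    }

    /** Smoothed P(C=c): (#N{C=c} + lambda) / (nPatterns + lambda * nClasses). */
    static double prior(int classCount, int nPatterns, int nClasses, double lambda) {
        return (classCount + lambda) / (nPatterns + lambda * nClasses);
    }

    public static void main(String[] args) {
        double lambda = 1.0; // Laplace smoothing
        // A category never observed with class c still gets a non-zero probability.
        System.out.println(conditional(0, 4, 3, lambda));
        System.out.println(prior(4, 10, 2, lambda));
    }
}
```

With lambda set to 0 both methods reduce to plain relative frequencies, so an unseen category would receive probability zero and eliminate its class from consideration.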
For continuous attributes, the same conditional probability
\(P(x_j|C=c)\) in the Naive Bayes formula is replaced with
the conditional probability density function \(f(x_j|C=c)\).
By default, the density function for continuous attributes is the normal
(Gaussian) probability density function (see
NormalDistribution):
$$f(x_j|C=c) = \frac{1}{\sigma
\sqrt{2\pi}}e^{-\frac{{\left(x_j - \mu\right)}^2}{2{\sigma}^2}}
$$ where \(\mu\) and \(\sigma\)
are the conditional mean and standard deviation, i.e. the mean and standard
deviation of
\(x_j\) when C = c. For convenience, methods
getMeans and getStandardDeviations are provided to
calculate the conditional mean and standard deviations of the training
patterns.
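The normal density and the conditional means can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: GaussianConditionalSketch and its methods are hypothetical names, missing values (Double.NaN) and out-of-range classifications are skipped, and a cell with no data is reported as Double.NaN. The result is laid out as attributes by classes, following the description of getMeans; conditional standard deviations would be accumulated the same way.

```java
/** Illustrative sketch only; names and missing-value handling are assumptions. */
final class GaussianConditionalSketch {

    /** Normal density with mean mu and standard deviation sigma, evaluated at x. */
    static double normalPdf(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    /**
     * means[j][c] is the mean of attribute j over the patterns whose class is c.
     * NaN inputs and classifications outside 0..nClasses-1 are skipped (assumption).
     */
    static double[][] conditionalMeans(double[][] data, int[] classes, int nClasses) {
        int nAttributes = data[0].length;
        double[][] means = new double[nAttributes][nClasses];
        int[][] counts = new int[nAttributes][nClasses];
        for (int i = 0; i < data.length; i++) {
            int c = classes[i];
            if (c < 0 || c >= nClasses) {
                continue; // missing classification
            }
            for (int j = 0; j < nAttributes; j++) {
                if (Double.isNaN(data[i][j])) {
                    continue; // missing attribute value
                }
                means[j][c] += data[i][j];
                counts[j][c]++;
            }
        }
        for (int j = 0; j < nAttributes; j++) {
            for (int c = 0; c < nClasses; c++) {
                // Empty cells are reported as NaN in this sketch (assumption).
                means[j][c] = counts[j][c] > 0 ? means[j][c] / counts[j][c] : Double.NaN;
            }
        }
        return means;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0}, {3.0}, {10.0}};
        int[] classes = {0, 0, 1};
        double[][] means = conditionalMeans(data, classes, 2);
        System.out.println(java.util.Arrays.deepToString(means));
        System.out.println(normalPdf(2.5, means[0][0], 1.0));
    }
}
```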
In addition to the default normal pdf, users can select any continuous
distribution to model the continuous attribute by providing an implementation
of the com.imsl.stat.ProbabilityDistribution interface. See
NormalDistribution, LogNormalDistribution,
GammaDistribution, and PoissonDistribution for
classes that implement the ProbabilityDistribution interface.
Smoothing conditional probability calculations for continuous attributes is
controlled by the methods setContinuousSmoothingValue and
setZeroCorrection. By default, conditional probability
calculations for continuous attributes are unadjusted for calculations near
zero. The value specified in the setContinuousSmoothingValue
method will be added to each continuous probability calculation. This is
similar to the effect of using setDiscreteSmoothingValue for the
corresponding discrete calculations.
The value specified in the setZeroCorrection method is used when
\((f(x|C=c) + \lambda)=0\), where
\(\lambda\) is the smoothing parameter setting. If this
condition occurs, the conditional probability is replaced with the value set
in setZeroCorrection.
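The interaction between the two settings can be sketched as follows. The class ContinuousSmoothingSketch and the boolean flag correctZero are hypothetical; the flag merely models the statement that no correction is performed by default. The logic follows the description above: add the smoothing value to the density, then substitute the zero-correction value only if the sum is exactly zero.

```java
/** Illustrative sketch only; names and the correctZero flag are assumptions. */
final class ContinuousSmoothingSketch {

    /**
     * Returns density + lambda, or zeroCorrection when that sum is exactly zero
     * and zero correction has been requested.
     */
    static double adjust(double density, double lambda, boolean correctZero, double zeroCorrection) {
        double smoothed = density + lambda;
        if (correctZero && smoothed == 0.0) {
            return zeroCorrection;
        }
        return smoothed;
    }

    public static void main(String[] args) {
        // A density that underflowed to zero, no smoothing, zero correction enabled.
        System.out.println(adjust(0.0, 0.0, true, 1.0e-10));
        // The same density with zero correction disabled stays at zero.
        System.out.println(adjust(0.0, 0.0, false, 1.0e-10));
    }
}
```

Note that with any positive smoothing value the sum can never be zero, so the zero correction only matters when the continuous smoothing value is left at 0.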
Methods getClassificationErrors, getPredictedClass,
getProbabilities, and getTrainingErrors provide
information on how well the trained NaiveBayesClassifier
predicts the known target classifications of the training patterns.
Methods probabilities and predictClass estimate
classification probabilities and predict classification of the input pattern
using the trained Naive Bayes Classifier. The predicted classification
returned by predictClass is the class with the largest estimated
classification probability. Method classError predicts the
classification from the trained Naive Bayes classifier and compares the
predicted classifications with the known target classification provided. This
allows verification of the classifier with a set of patterns other than the
training patterns.
Constructor Summary
Constructors
Constructor | Description
NaiveBayesClassifier(int nContinuous, int nNominal, int nClasses) | Constructs a NaiveBayesClassifier.
-
Method Summary
Modifier and Type | Method | Description
double | classError(double[] continuous, int[] nominal, int classification) | Returns the classification probability error for the input pattern and known target classification.
void | createContinuousAttribute(ProbabilityDistribution pdf) | Create a continuous variable and the associated distribution function.
void | createContinuousAttribute(ProbabilityDistribution[] pdf) | Create a continuous variable and the associated distribution functions for each target classification.
void | createNominalAttribute(int nCategories) | Create a nominal attribute with the specified number of categories.
int[] | getClassCounts(int[] classificationData) | Returns the number of patterns for each target classification.
double[] | getClassificationErrors() | Returns the classification probability errors for each pattern in the training data.
double[][] | getMeans(double[][] continuousData, int[] classificationData) | Returns a table of means for each continuous attribute in continuousData segmented by the target classes in classificationData.
int[] | getPredictedClass() | Returns the predicted classification for each training pattern.
double[][] | getProbabilities() | Returns the predicted classification probabilities for each target class.
double[][] | getStandardDeviations(double[][] continuousData, int[] classificationData) | Returns a table of standard deviations for each continuous attribute in continuousData segmented by the target classes in classificationData.
int[][] | getTrainingErrors() | Returns a table of classification errors of non-missing classifications for each target classification plus the overall total of classification errors.
void | ignoreMissingValues(boolean ignoreMissing) | Specifies whether or not missing values will be ignored during the training process.
int | predictClass(double[] continuous, int[] nominal) | Predicts the classification for the input pattern using the trained Naive Bayes classifier.
double[] | probabilities(double[] continuous, int[] nominal) | Predicts the classification probabilities for the input pattern using the trained Naive Bayes classifier.
void | setContinuousSmoothingValue(double clambda) | Parameter for calculating smoothed estimates of conditional probabilities for continuous attributes.
void | setDiscreteSmoothingValue(double dlambda) | Parameter for calculating smoothed estimates of conditional probabilities for discrete (nominal) attributes.
void | setZeroCorrection(double zeroCorrection) | Specifies the replacement value to be used for conditional probabilities equal to zero.
void | train(double[][] continuousData, int[] classificationData) | Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
void | train(double[][] continuousData, int[][] nominalData, int[] classificationData) | Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
void | train(int[][] nominalData, int[] classificationData) | Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
-
Constructor Details
-
NaiveBayesClassifier
public NaiveBayesClassifier(int nContinuous, int nNominal, int nClasses)
Constructs a NaiveBayesClassifier.
- Parameters:
nContinuous - an int containing the number of continuous attributes
nNominal - an int containing the number of nominal attributes
nClasses - an int containing the number of target classifications
-
-
Method Details
-
createContinuousAttribute
public void createContinuousAttribute(ProbabilityDistribution pdf)
Create a continuous variable and the associated distribution function.
- Parameters:
pdf - a ProbabilityDistribution to be applied to the continuous attribute. The distribution function will be applied to all classes. By default, NormalDistribution is used.
-
createContinuousAttribute
public void createContinuousAttribute(ProbabilityDistribution[] pdf)
Create a continuous variable and the associated distribution functions for each target classification.
- Parameters:
pdf - an array of ProbabilityDistributions containing nClasses distribution functions for a continuous attribute. This allows a different distribution function to be applied to each classification. By default, NormalDistribution is used.
-
createNominalAttribute
public void createNominalAttribute(int nCategories)
Create a nominal attribute with the specified number of categories.
- Parameters:
nCategories - an int containing the number of categories in the nominal attribute. The category values are expected to be encoded with integers ranging from 0 to nCategories-1. No default is used for nCategories. If nNominal is not zero and createNominalAttribute is not invoked for each of the nNominal attributes, an IllegalStateException will be thrown when the train method is invoked.
-
train
public void train(double[][] continuousData, int[] classificationData)
Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
- Parameters:
continuousData - a double matrix containing the training values for the nContinuous continuous attributes. The i-th row contains the input attributes for the i-th training pattern. The j-th column contains the values for the j-th continuous attribute. Missing values should be set to Double.NaN. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValues method has been set to true.
classificationData - an int array containing the target classifications for the training patterns. These must be encoded from zero to nClasses-1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
-
train
public void train(int[][] nominalData, int[] classificationData)
Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
- Parameters:
nominalData - an int matrix containing the training values for the nNominal nominal attributes. The i-th row contains the input attributes for the i-th training pattern. The j-th column contains the classifications for the j-th nominal attribute. The values for the j-th nominal attribute are expected to be encoded with integers starting from 0 to nCategories - 1, where nCategories is specified in the createNominalAttribute method. Any value outside this range is treated as a missing value. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValues method has been set to true.
classificationData - an int array containing the target classifications for the training patterns. These must be encoded from zero to nClasses-1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
-
train
public void train(double[][] continuousData, int[][] nominalData, int[] classificationData)
Trains a Naive Bayes classifier for classifying data into one of nClasses target classifications.
- Parameters:
continuousData - a double matrix containing the training values for the nContinuous continuous attributes. The i-th row contains the input attributes for the i-th training pattern. The j-th column contains the values for the j-th continuous attribute. Missing values should be set to Double.NaN. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValues method has been set to true.
nominalData - an int matrix containing the training values for the nNominal nominal attributes. The i-th row contains the input attributes for the i-th training pattern. The j-th column contains the classifications for the j-th nominal attribute. The values for the j-th nominal attribute are expected to be encoded with integers starting from 0 to nCategories - 1, where nCategories is specified in the createNominalAttribute method. Any value outside this range is treated as a missing value. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValues method has been set to true.
classificationData - an int array containing the target classifications for the training patterns. These must be encoded from zero to nClasses-1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
-
getClassificationErrors
public double[] getClassificationErrors()
Returns the classification probability errors for each pattern in the training data.
- Returns:
a double array containing the classification probability errors for each pattern in the training data. The classification error for the i-th training pattern is equal to 1 - predictedClassProbability[i][k], where predictedClassProbability is returned from getProbabilities and k is equal to classificationData[i].
-
getMeans
public double[][] getMeans(double[][] continuousData, int[] classificationData)
Returns a table of means for each continuous attribute in continuousData segmented by the target classes in classificationData. This method is provided as a utility; prior training is not necessary.
- Parameters:
continuousData - a double matrix containing training values for the continuous attributes.
classificationData - an int array containing the target classifications for the training patterns.
- Returns:
a continuousData[0].length by nClasses double matrix, means, containing the means segmented by the target classes. The i-th row contains the means of the i-th continuous attribute for each value of the target classification. That is, means[i][j] is the mean for the i-th continuous attribute when the target classification equals j, unless there are no training patterns for this condition.
-
getStandardDeviations
public double[][] getStandardDeviations(double[][] continuousData, int[] classificationData)
Returns a table of standard deviations for each continuous attribute in continuousData segmented by the target classes in classificationData. This method is provided as a utility; prior training is not necessary.
- Parameters:
continuousData - a double matrix containing training values for the continuous attributes.
classificationData - an int array containing the target classifications for the training patterns.
- Returns:
a continuousData[0].length by nClasses double matrix, stdev, containing the standard deviations segmented by the target classes. The i-th row contains the standard deviation of the i-th continuous attribute for each value of the target classification. That is, stdev[i][j] is the standard deviation for the i-th continuous attribute when the target classification equals j, unless there are no training patterns for this condition.
-
setDiscreteSmoothingValue
public void setDiscreteSmoothingValue(double dlambda)
Parameter for calculating smoothed estimates of conditional probabilities for discrete (nominal) attributes.
- Parameters:
dlambda - a double containing the smoothing parameter to be used for calculating smoothed estimates of conditional probabilities for discrete attributes. dlambda must be non-negative. By default, dlambda = 1.0, i.e. Laplace smoothing of conditional probabilities.
-
setContinuousSmoothingValue
public void setContinuousSmoothingValue(double clambda)
Parameter for calculating smoothed estimates of conditional probabilities for continuous attributes.
- Parameters:
clambda - a double containing the smoothing parameter to be used for calculating smoothed estimates of conditional probabilities for continuous attributes. clambda must be non-negative. By default, clambda = 0, i.e. no smoothing is done.
-
setZeroCorrection
public void setZeroCorrection(double zeroCorrection)
Specifies the replacement value to be used for conditional probabilities equal to zero.
- Parameters:
zeroCorrection - a double containing the value to replace conditional probabilities equal to zero. zeroCorrection must be non-negative. By default, no correction will be performed.
-
ignoreMissingValues
public void ignoreMissingValues(boolean ignoreMissing)
Specifies whether or not missing values will be ignored during the training process.
- Parameters:
ignoreMissing - a boolean specifying whether or not to ignore patterns during training when one or more input attributes are missing. By default, ignoreMissing = false, so patterns with both missing and non-missing values are used to train the classifier. Classification predictions are still returned for all patterns even when set to true.
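The missing-value conventions described for the train methods (Double.NaN for continuous attributes, an out-of-range category code for nominal attributes) can be checked with a small helper. This is a hypothetical sketch, not part of the library: MissingValueSketch, hasMissing, and the nCategories array are names introduced here for illustration.

```java
/** Illustrative sketch only; names are assumptions, not part of the library. */
final class MissingValueSketch {

    /**
     * Returns true when any continuous value is NaN or any nominal code lies
     * outside 0..nCategories[j]-1. Either array may be null when the pattern
     * has no attributes of that type.
     */
    static boolean hasMissing(double[] continuous, int[] nominal, int[] nCategories) {
        if (continuous != null) {
            for (double value : continuous) {
                if (Double.isNaN(value)) {
                    return true;
                }
            }
        }
        if (nominal != null) {
            for (int j = 0; j < nominal.length; j++) {
                if (nominal[j] < 0 || nominal[j] >= nCategories[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] nCategories = {2, 3}; // two nominal attributes with 2 and 3 categories
        System.out.println(hasMissing(new double[]{1.5, Double.NaN}, new int[]{0, 2}, nCategories));
        System.out.println(hasMissing(new double[]{1.5, 0.2}, new int[]{1, 2}, nCategories));
    }
}
```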
-
getTrainingErrors
public int[][] getTrainingErrors()
Returns a table of classification errors of non-missing classifications for each target classification plus the overall total of classification errors.
- Returns:
an int matrix containing nClasses + 1 rows and two columns. The first column contains the number of misclassifications and the second column contains the total number of classifications for the i-th row target class. The last row of the matrix contains the total number of misclassifications in column one and the total non-missing classifications in column two.
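The layout of this table can be reproduced from known and predicted classifications as in the sketch below. It is an illustration of the described layout, not the library's code; TrainingErrorTableSketch and errorTable are hypothetical names, and the data in main are made up.

```java
/** Illustrative sketch only; names and data are assumptions, not part of the library. */
final class TrainingErrorTableSketch {

    /**
     * Builds an (nClasses + 1) by 2 table: column 0 counts misclassifications,
     * column 1 counts non-missing classifications, and the last row holds totals.
     */
    static int[][] errorTable(int[] actual, int[] predicted, int nClasses) {
        int[][] table = new int[nClasses + 1][2];
        for (int i = 0; i < actual.length; i++) {
            int c = actual[i];
            if (c < 0 || c >= nClasses) {
                continue; // missing classification: not counted
            }
            table[c][1]++;
            table[nClasses][1]++;
            if (predicted[i] != c) {
                table[c][0]++;
                table[nClasses][0]++;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1, -1};   // -1 marks a missing classification
        int[] predicted = {0, 1, 1, 1, 0};
        System.out.println(java.util.Arrays.deepToString(errorTable(actual, predicted, 2)));
    }
}
```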
-
getPredictedClass
public int[] getPredictedClass()
Returns the predicted classification for each training pattern.
- Returns:
an int array containing the predicted classification for each training pattern.
-
predictClass
public int predictClass(double[] continuous, int[] nominal)
Predicts the classification for the input pattern using the trained Naive Bayes classifier.
- Parameters:
continuous - a double array containing an input pattern of nContinuous continuous attributes. If nContinuous = 0, a null is allowed.
nominal - an int array of length nNominal containing an input pattern of nominal attributes. If nNominal = 0, a null is allowed.
- Returns:
an int containing the predicted classification for the input pattern using the trained Naive Bayes classifier. The predicted classification returned is the class with the largest estimated classification probability. The classification probabilities can be predicted using the probabilities method.
-
probabilities
public double[] probabilities(double[] continuous, int[] nominal)
Predicts the classification probabilities for the input pattern using the trained Naive Bayes classifier.
- Parameters:
continuous - a double array containing an input pattern of nContinuous continuous attributes. If nContinuous = 0, a null is allowed.
nominal - an int array of length nNominal containing an input pattern of nominal attributes. If nNominal = 0, a null is allowed.
- Returns:
a double array of length nClasses containing the predicted classification probabilities for each target class.
-
getProbabilities
public double[][] getProbabilities()
Returns the predicted classification probabilities for each target class.
- Returns:
a double matrix, prob, of size nPatterns by nClasses containing the predicted classification probabilities for each target class, where nPatterns is the number of patterns trained. prob[i][j] is the estimated probability that the i-th pattern belongs to the j-th target class.
-
classError
public double classError(double[] continuous, int[] nominal, int classification)
Returns the classification probability error for the input pattern and known target classification.
- Parameters:
continuous - a double array of length nContinuous containing an input pattern of continuous attributes. If nContinuous = 0, a null is allowed.
nominal - an int array of length nNominal containing an input pattern of nominal attributes. If nNominal = 0, a null is allowed.
classification - an int containing the target classification.
- Returns:
a double containing the classification probability error for the input pattern. The classification error for the input pattern is equal to 1 - p, where p is the predicted class probability of the input classification. The predicted class probability of the input classification can be obtained from the method probabilities. If p is the array returned by probabilities and k is equal to classification, then the classification error is 1 - p[k].
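The relationship between the predicted probabilities and the classification error can be written out as a short sketch. ClassErrorSketch and its method are hypothetical names, and the probability array in main is a made-up stand-in for the values that the probabilities method would return.

```java
/** Illustrative sketch only; names and values are assumptions, not part of the library. */
final class ClassErrorSketch {

    /** Classification probability error: 1 - p[k], where k is the known target class. */
    static double classError(double[] probabilities, int classification) {
        return 1.0 - probabilities[classification];
    }

    public static void main(String[] args) {
        double[] p = {0.25, 0.75}; // made-up predicted class probabilities
        System.out.println(classError(p, 1)); // known class is the likelier one
        System.out.println(classError(p, 0)); // known class is the less likely one
    }
}
```

An error near 0 means the classifier assigned almost all probability to the known class; an error near 1 means it assigned almost none.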
-
getClassCounts
public int[] getClassCounts(int[] classificationData)
Returns the number of patterns for each target classification.
- Parameters:
classificationData - an int array containing the target classifications for the training patterns. These must be encoded from zero to nClasses-1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
- Returns:
an int array containing the class counts.
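Counting patterns per class while treating out-of-range values as missing can be sketched as follows. ClassCountSketch is a hypothetical name, and nClasses is passed explicitly here only because the sketch has no classifier instance to read it from.

```java
/** Illustrative sketch only; names are assumptions, not part of the library. */
final class ClassCountSketch {

    /** counts[c] is the number of patterns whose classification equals c. */
    static int[] classCounts(int[] classificationData, int nClasses) {
        int[] counts = new int[nClasses];
        for (int c : classificationData) {
            if (c >= 0 && c < nClasses) {
                counts[c]++; // values outside 0..nClasses-1 are treated as missing
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] classificationData = {0, 1, 1, 2, 5, -1}; // 5 and -1 are out of range
        System.out.println(java.util.Arrays.toString(classCounts(classificationData, 3)));
    }
}
```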
-