naiveBayesTrainer¶
Trains a Naive Bayes classifier.
Synopsis¶
naiveBayesTrainer (nClasses, classification)
Required Arguments¶
int nClasses (Input)
Number of target classifications.

int classification[] (Input)
Array of size nPatterns containing the target classifications for the training patterns. These must be encoded from zero to nClasses-1. Any value outside this range is considered a missing value. In this case, the data in that pattern are not used to train the Naive Bayes classifier. However, any pattern with missing values is still classified after the classifier is trained.
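For example, string labels can be mapped into this encoding before training (a minimal sketch; raw_labels is a hypothetical list of label strings):

# Unknown labels map to -1, which the trainer treats as missing.
labels = ["setosa", "versicolour", "virginica"]
raw_labels = ["setosa", "virginica", "unknown"]  # hypothetical input
classification = [labels.index(lab) if lab in labels else -1
                  for lab in raw_labels]
# classification == [0, 2, -1]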
Return Value¶
An array of size (nClasses+1) by 2 containing the number of classification errors and the number of non-missing classifications for each target classification plus the overall totals for these errors. For i < nClasses, the i-th row contains the number of classification errors for the i-th class and the number of patterns with non-missing classifications for that class. The last row contains the number of classification errors totaled over all target classifications and the total number of patterns with non-missing target classifications.
If training is unsuccessful, None
is returned.
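For example, the overall totals can be read from the last row (a minimal sketch, assuming classErrors is the array returned by a successful call with n_classes target classes):

total_errors, total_patterns = classErrors[n_classes]
print("overall error rate: %.1f%%" % (100.0 * total_errors / total_patterns))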
Optional Arguments¶
continuous, float continuous[[]] (Input)
nContinuous is the number of continuous attributes and continuous is an array of size nPatterns by nContinuous containing the training values for the continuous attributes. The i-th row contains the input attributes for the i-th training pattern. The j-th column of continuous contains the values for the j-th continuous attribute. Missing values should be set equal to machine(6) = NaN. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValuePatterns argument is supplied. If the continuous argument is not supplied, nContinuous is assumed equal to zero.
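A minimal sketch of marking a missing continuous value, assuming two continuous attributes:

from numpy import array, double, nan

# NaN marks a missing value; the pattern is still used for training
# unless ignoreMissingValuePatterns is supplied.
continuous = array([[5.1, 3.5],
                    [nan, 3.0]], dtype=double)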
nominal, int nCategories[], int nominal[] (Input)
nNominal is the number of nominal attributes. nCategories is an array of length nNominal containing the number of categories associated with each nominal attribute. These must all be greater than zero. nominal is an array of size nPatterns by nNominal containing the training values for the nominal attributes. The i-th row contains the nominal input attributes for the i-th pattern. The j-th column of this matrix contains the classifications for the j-th nominal attribute. The values for the j-th nominal attribute are expected to be encoded with integers from 0 to nCategories[j]-1. Any value outside this range is treated as a missing value. Patterns with both non-missing and missing values are used to train the classifier unless the ignoreMissingValuePatterns option is supplied. If the nominal argument is not supplied, nNominal is assumed equal to zero.
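In PyIMSL the nominal argument is passed as a dict holding the two sub-arrays, as Example 2 below illustrates:

# Two nominal attributes: the first with 3 categories, the second with 2.
nominal = {'nCategories': [3, 2],
           'nominal': [[0, 1],    # pattern 0
                       [2, 0]]}   # pattern 1; values run 0..nCategories[j]-1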
printLevel, int (Input)
Print level for data warnings and final results. printLevel should be set to one of the following values:

NONE - Printing of data warnings and final results is suppressed.
FINAL - Prints a final summary of the Naive Bayes classifier training.
DATA_WARNINGS - Prints information about missing values and PDF calculations equal to zero.
TRACE_ALL - Prints the final summary plus all data warnings associated with missing values and PDF calculations equal to zero.

Default: NONE.
ignoreMissingValuePatterns (Input)
By default, patterns with both missing and non-missing values are used to train the classifier. This option causes the algorithm to ignore patterns with one or more missing input attributes during training. However, classification predictions are still returned for all patterns.
discreteSmoothingParm, float (Input)
Parameter for calculating smoothed estimates of conditional probabilities for discrete attributes. This parameter must be non-negative.
Default: Laplace smoothing of conditional probabilities, i.e., discreteSmoothingParm=1.

continuousSmoothingParm, float (Input)
Parameter for calculating smoothed estimates of conditional probabilities for continuous attributes. This parameter must be non-negative.
Default: No smoothing of conditional probabilities for continuous attributes, i.e., continuousSmoothingParm=0.

zeroCorrection, float (Input)
Parameter used to replace conditional probabilities equal to zero numerically. This parameter must be non-negative.
Default: No correction, i.e., zeroCorrection=0.
selectedPdf, int[] (Input)
An array of length nContinuous specifying the distribution for each continuous input attribute. If this argument is not supplied, conditional probabilities for all continuous attributes are calculated using the Gaussian probability density function with its parameters estimated from the training patterns, i.e., selectedPdf[i] = GAUSSIAN. This argument allows users to select other distributions using the following encoding:

GAUSSIAN - Gaussian (see gaussianPdf).
LOG_NORMAL - Log-normal (see logNormalPdf).
GAMMA - Gamma (see gammaPdf).
POISSON - Poisson (see poissonPdf).
USER - User defined (see userPdf).

selectedPdf[i] specifies the probability density function for the i-th continuous input attribute.
gaussianPdf, float means[[]], float stdev[[]] (Input)
means and stdev are two arrays, each of size nGauss by nClasses, where nGauss represents the number of Gaussian attributes as specified by the optional argument selectedPdf (i.e., the number of elements in selectedPdf equal to GAUSSIAN). The i-th row of means and stdev contains the means and standard deviations, respectively, for the i-th Gaussian attribute in continuous for each value of the target classification. means[i*nClasses+j] is used as the mean for the i-th Gaussian attribute when the target classification equals j, and stdev[i*nClasses+j] is used as the standard deviation for the i-th Gaussian attribute when the target classification equals j. This argument is ignored if nContinuous = 0.
Default: The means and standard deviations for all Gaussian attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional BLUE (Best Linear Unbiased Estimates) for the parameters of a Gaussian distribution.
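A hedged sketch of supplying these parameters follows; the dict form with keys 'means' and 'stdev' is an assumption, by analogy with the dict used for the nominal argument in Example 2:

# Assumed dict form: one row per Gaussian attribute, one column per class.
gaussian = {'means': [[5.0, 5.9, 6.6]],       # attribute 0, classes 0..2
            'stdev': [[0.35, 0.52, 0.64]]}
classErrors = naiveBayesTrainer(n_classes, classification,
                                continuous=continuous,
                                gaussianPdf=gaussian)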
logNormalPdf, float logMean[[]], float logStdev[[]] (Input)
Two arrays, each of size nLogNormal by nClasses, where nLogNormal represents the number of log-normal attributes as specified by the optional argument selectedPdf (i.e., the number of elements in selectedPdf equal to LOG_NORMAL). The i-th row of logMean and logStdev contains the means and standard deviations, respectively, for the i-th log-normal attribute for each value of the target classification. logMean[i*nClasses+j] is used as the mean for the i-th log-normal attribute when the target classification equals j, and logStdev[i*nClasses+j] is used as the standard deviation for the i-th log-normal attribute when the target classification equals j. This argument is ignored if nContinuous = 0.
Default: The means and standard deviations for all log-normal attributes are estimated from the means and standard deviations of the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a log-normal distribution.
gammaPdf, float a[[]], float b[[]] (Input)
Two arrays, each of size nGamma by nClasses, containing the shape and scale parameters for the gamma continuous attributes, where nGamma represents the number of gamma distributed continuous variables as specified by the optional argument selectedPdf (i.e., the number of elements in selectedPdf equal to GAMMA). The i-th row of a and b contains the shape and scale parameters for the i-th gamma attribute for each value of the target classification. a[i*nClasses+j] is used as the shape parameter for the i-th gamma attribute when the target classification equals j, and b[i*nClasses+j] is used as the scale parameter for the i-th gamma attribute when the target classification equals j. This argument is ignored if nContinuous = 0.
Default: The shape and scale parameters for all gamma attributes are estimated from the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a gamma distribution.
poissonPdf, float theta[[]] (Input)
An array of size nPoisson by nClasses containing the means for the Poisson attributes, where nPoisson represents the number of Poisson distributed continuous variables as specified by the optional argument selectedPdf (i.e., the number of elements in selectedPdf equal to POISSON). The i-th row of theta contains the means for the i-th Poisson attribute for each value of the target classification. theta[i*nClasses+j] is used as the mean for the i-th Poisson attribute when the target classification equals j. This argument is ignored if nContinuous = 0.
Default: The means (theta) for all Poisson attributes are estimated from the means of the training patterns. These estimates are the traditional MLE (Maximum Likelihood Estimates) for the parameters of a Poisson distribution.
userPdf, float pdf(int index[], float x) (Input)
The user-supplied probability density function and parameters used to calculate the conditional probability density for continuous input attributes. It is required when selectedPdf[i] = USER.

When pdf is called, x will equal continuous[i*nContinuous+j], and index is an array of length 3 containing the following values for i, j, and k:

index[0] - i = pattern index
index[1] - j = attribute index
index[2] - k = target classification

The pattern index ranges from 0 to nPatterns-1 and identifies the pattern index for x. The attribute index ranges from 0 to nContinuous-1, and k = classification[i].

This argument is ignored if nContinuous = 0. By default, the Gaussian PDF is used for calculating the conditional probability densities, using either the means and variances calculated from the training patterns or those supplied in gaussianPdf.

On some platforms, naiveBayesTrainer can evaluate the user-supplied function pdf in parallel. This is done only if the function ompOptions is called to flag user-defined functions as thread-safe. A function is thread-safe if there are no dependencies between calls. Such dependencies are usually the result of writing to global or static variables.
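A minimal sketch of such a callback follows, assuming it receives (index, x) as described above and returns the density as a float; the per-class exponential rates used here are purely illustrative:

from math import exp

def my_pdf(index, x):
    # index[0] = pattern index, index[1] = attribute index,
    # index[2] = target classification
    rate = 1.0 if index[2] == 0 else 2.0   # illustrative per-class rates
    return rate * exp(-rate * x) if x >= 0.0 else 0.0

The function is then passed via userPdf, with the corresponding entry of selectedPdf set to USER.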
statistics, float means[[]], float stdev[[]] (Output)
Two arrays of size nContinuous by nClasses containing the means and standard deviations for the continuous attributes segmented by the target classes. The structure of these matrices is identical to the structure described for the gaussianPdf argument. The i-th row of means and stdev contains the means and standard deviations, respectively, of the i-th continuous attribute for each value of the target classification. That is, means[i*nClasses+j] is the mean for the i-th continuous attribute when the target classification equals j, and stdev[i*nClasses+j] is the standard deviation for the i-th continuous attribute when the target classification equals j, unless there are no training patterns for this condition. If there are no training patterns in the i, j-th cell, then the mean and standard deviation for that cell is computed using the mean and standard deviation for the i-th continuous attribute calculated using all of its non-missing values. Standard deviations are estimated using the minimum variance unbiased estimator.

predictedClass (Output)
An array of size nPatterns containing the predicted classification for each training pattern.

predictedClassProb (Output)
An array of size nPatterns by nClasses. The values in the i-th row are the predicted classification probabilities associated with the target classes. predictedClassProb[i*nClasses+j] is the estimated probability that the i-th pattern belongs to the j-th target class.

classError (Output)
An array of size nPatterns containing the classification probability errors for each pattern in the training data. The classification error for the i-th training pattern is equal to 1 - predictedClassProb[i*nClasses+k], where k = classification[i].
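In other words, the classError output matches this sketch, assuming predictedClassProb is returned as a 2-D array:

class_error = [1.0 - predictedClassProb[i][classification[i]]
               for i in range(n_patterns)]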
countTable (Output)
An array of size

\[(m+1)\left(\mathrm{nClasses} + \mathrm{nClasses} \sum_{i=0}^{m} \mathrm{nCategories}[i]\right)\]

where m = nNominal-1. countTable[i*nNominal*nClasses + j*nClasses + k] is equal to the number of training patterns for the i-th nominal attribute for which the target classification equals j and the value of the i-th nominal attribute equals k.
nbClassifier (Output)
An Imsls_d_nb_classifier structure. Upon return, the structure is populated with the trained Naive Bayes classifier. This is required input to naiveBayesClassification. Memory allocated to this structure is released using nbClassifierFree.
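A hedged sketch of retrieving the trained classifier: nbClassifier is passed as an empty list that the trainer populates (as in Example 1); the naiveBayesClassification call shown is an assumed usage, not taken from this page:

from pyimsl.stat.naiveBayesClassification import naiveBayesClassification

nb_classifier = []
naiveBayesTrainer(n_classes, classification,
                  continuous=continuous,
                  nbClassifier=nb_classifier)
# Assumed call: classify n_new previously unseen patterns.
predicted = naiveBayesClassification(nb_classifier[0], n_new,
                                     continuous=new_continuous)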
Description¶
Function naiveBayesTrainer
trains a Naive Bayes classifier for
classifying data into one of nClasses
target classes. Input attributes
can be a combination of both nominal and continuous data. Ordinal data can be treated as either nominal or continuous attributes. If the distribution of the ordinal data is known or can be approximated using one of the continuous distributions, then associating them with continuous attributes allows a user to specify that distribution. Missing values are allowed.
Let C be the classification attribute with target categories 0, 1, …, nClasses-1, and let \(X^T=\{ x_1,x_2,\ldots,x_k\}\) be a vector valued array of k = nNominal+nContinuous input attributes. The classification problem simplifies to estimating the conditional probability \(P(C|X)\) from a set of training patterns. The Bayes rule states that this probability can be expressed as the ratio:

\[P\left(C=c\,|\,X=\{x_1,x_2,\ldots,x_k\}\right) = \frac{P(C=c)\,P\left(X=\{x_1,x_2,\ldots,x_k\}\,|\,C=c\right)}{P\left(X=\{x_1,x_2,\ldots,x_k\}\right)}\]
where c is equal to one of the target classes 0, 1, …, nClasses
-1
.
In practice, the denominator of this expression is constant across all target
classes since it is only a function of the given values of X. As a result,
the Naive Bayes algorithm does not expend computational time estimating
\(P \left( X=\left\{ x_1,x_2,\ldots x_k \right\} \right)\) for every
pattern. Instead, a Naive Bayes classifier calculates the numerator \(P
(C=c) P\left( X=\left\{ x_1,x_2,\ldots x_k \right\} | C=c \right)\) for each
target class and then classifies X to the target class with the largest
value, i.e.,

\[\text{classify}(X) = \operatorname*{argmax}_{c\,\in\,\{0,1,\ldots,\mathrm{nClasses}-1\}} P(C=c)\,P\left(X=\{x_1,x_2,\ldots,x_k\}\,|\,C=c\right)\]
The classifier simplifies this calculation by assuming conditional independence. That is, it assumes that:

\[P\left(X=\{x_1,x_2,\ldots,x_k\}\,|\,C=c\right) = \prod_{j=1}^{k} P(x_j|C=c)\]
This is equivalent to assuming that the values of the input attributes, given C, are independent of one another, i.e.,

\[P(x_i\,|\,x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_k,\,C=c) = P(x_i|C=c) \quad \text{for all } i\]
In real world data this assumption rarely holds, yet in many cases the approach results in surprisingly low classification error rates. Thus, although the estimate of \(P\left(C=c\,|\,X=\{x_1,x_2,\ldots,x_k\}\right)\) from a Naive Bayes classifier is generally an approximation, classifying patterns based upon the Naive Bayes algorithm can have acceptably low classification error rates.
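A minimal sketch of this decision rule, assuming the prior and the per-attribute conditional probabilities have already been estimated:

from math import prod

def classify(x, prior, cond):
    # prior[c] = P(C=c); cond[j][c] maps x[j] to P(x_j | C=c).
    scores = [prior[c] * prod(cond[j][c](x[j]) for j in range(len(x)))
              for c in range(len(prior))]
    return scores.index(max(scores))   # argmax over the target classes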
For nominal attributes, this implementation of the Naive Bayes classifier estimates conditional probabilities using a smoothed estimate:

\[P(x_j|C=c) = \frac{\#N\{x_j \wedge C=c\} + \lambda}{\#N\{C=c\} + \lambda J}\]

where \(\#N\{Z\}\) is the number of training patterns with attribute Z and J is equal to the number of categories associated with the j-th nominal attribute.
The probability \(P(C=c)\) is also estimated using a smoothed estimate:

\[P(C=c) = \frac{\#N\{C=c\} + \lambda}{\mathrm{nPatterns} + \lambda\,\mathrm{nClasses}}\]
These estimates correspond to the maximum a posteriori (MAP) estimates for a Dirichlet prior assuming equal priors. The smoothing parameter can be any non-negative value. Setting \(\lambda=0\) corresponds to no smoothing. The default smoothing used in this algorithm, \(\lambda=1\), is commonly referred to as Laplace smoothing. This can be changed using the optional argument discreteSmoothingParm.
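As a numeric illustration: with \(\lambda=1\), J = 4 categories, and 3 of 10 class-c patterns taking the given attribute value, the smoothed estimate is (3+1)/(10+1·4) ≈ 0.286 rather than the unsmoothed 0.3. Turning smoothing off is a one-argument change (a sketch reusing the setup of Example 2):

classErrors = naiveBayesTrainer(n_classes, classification,
                                nominal=nominal,
                                discreteSmoothingParm=0.0)   # no smoothing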
For continuous attributes, the same conditional probability \(P(x_j|C=c)\) in the Naive Bayes formula is replaced with the conditional probability density function \(f(x_j|C=c)\). By default, the density function for continuous attributes is the Gaussian density function:

\[f(x_j|C=c) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}}\]
where μ and σ² are the conditional mean and variance, i.e., the mean and variance of \(x_j\) when \(C=c\). By default, the conditional means and standard deviations are estimated using the sample means and standard deviations of the training patterns. These are returned in the optional argument statistics.
In addition to the default GAUSSIAN, users can select three other continuous distributions to model the continuous attributes using the argument selectedPdf. These are the log-normal, gamma, and Poisson distributions, selected by setting the entries in selectedPdf to LOG_NORMAL, GAMMA, or POISSON. Their probability density functions are equal to:

\[f(x_j|C=c;\,\mu,\sigma) = \frac{1}{x_j\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x_j-\mu)^2}{2\sigma^2}}\]

\[f(x_j|C=c;\,a,b) = \frac{x_j^{a-1}\, e^{-x_j/b}}{\Gamma(a)\,b^a}\]

and

\[f(x_j|C=c;\,\theta) = \frac{\theta^{x_j}\, e^{-\theta}}{x_j!}\]
By default parameters for these distributions are estimated from the
training patterns using the maximum likelihood method. However, they can
also be supplied using the optional input arguments gaussianPdf
,
logNormalPdf
, gammaPdf
and poissonPdf.
The default Gaussian PDF can be changed and each continuous attribute can be
assigned a different density function using the argument selectedPdf
. If
any entry in selectedPdf
is equal to USER
, the user must supply
their own PDF calculation using the userPdf
argument. Each continuous
attribute can be modeled using a different distribution if appropriate.
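A short sketch of mixing distributions, assuming three continuous attributes where the third is a count best modeled as Poisson (POISSON is assumed to be importable from the trainer module, like GAUSSIAN and LOG_NORMAL in Example 3):

from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer, \
    GAUSSIAN, POISSON

selected_pdf = [GAUSSIAN, GAUSSIAN, POISSON]
classErrors = naiveBayesTrainer(n_classes, classification,
                                continuous=continuous,
                                selectedPdf=selected_pdf)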
Smoothing conditional probability calculations for continuous attributes is controlled by the continuousSmoothingParm and zeroCorrection optional arguments. By default, conditional probability calculations for continuous attributes are unadjusted for calculations near zero. If continuousSmoothingParm is supplied, the algorithm adds its value to each continuous probability calculation. This is similar to the effect of discreteSmoothingParm for the corresponding discrete calculations. By default, continuousSmoothingParm=0.
The value supplied in the zeroCorrection argument is used when \(\left( f(x | C=c)+\mathrm{continuousSmoothingParm} \right)=0\). If this condition occurs, the conditional probability is replaced with the value of zeroCorrection. By default, zeroCorrection = 0.
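A hedged sketch combining both adjustments:

classErrors = naiveBayesTrainer(n_classes, classification,
                                continuous=continuous,
                                continuousSmoothingParm=1.0e-3,  # added to each f(x|C=c)
                                zeroCorrection=1.0e-9)           # replaces exact zeros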
Examples¶
Example 1¶
Fisher’s (1936) Iris data is often used for benchmarking classification algorithms. It is one of the IMSL data sets and consists of the following continuous input attributes and classification target:
Continuous Attributes: X0(sepal length), X1(sepal width), X2(petal length), and X3(petal width)
Classification (Iris Type): Setosa, Versicolour, or Virginica.
This example trains a Naive Bayes classifier using 150 training patterns with these data.
from __future__ import print_function
from numpy import empty, double
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.ompOptions import ompOptions
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer
n_patterns = 150 # 150 training patterns
n_continuous = 4 # four continuous input attributes
n_classes = 3 # three classification categories
classification = empty([150], dtype=int)
continuous = empty([150, 4], dtype=double)
classLabel = ["Setosa ", "Versicolour", "Virginica "]
ompOptions(setFunctionsThreadSafe=True)
# irisData[]: The raw data matrix. This is a 2-D matrix
# with 150 rows and 5 columns. The last 4 columns are the
# continuous input attributes and the 1st column is the
# classification category (1-3). These data contain no
# nominal input attributes.
irisData = dataSets(3)
# Data corrections described in the KDD data mining archive
irisData[34][4] = 0.1
irisData[37][2] = 3.1
irisData[37][3] = 1.5
# setup the required input arrays from the data matrix
for i in range(0, n_patterns):
classification[i] = int(irisData[i][0] - 1)
for j in range(1, n_continuous + 1):
continuous[i][j - 1] = irisData[i][j]
nb_classifier = []
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=continuous,
nbClassifier=nb_classifier)
print(" Iris Classification Error Rates")
print("----------------------------------------------")
print(" Setosa Versicolour Virginica | TOTAL")
print(" %d/%d %d/%d %d/%d | %d/%d\n"
% (classErrors[0][0], classErrors[0][1],
classErrors[1][0], classErrors[1][1],
classErrors[2][0], classErrors[2][1],
classErrors[3][0], classErrors[3][1]))
print("----------------------------------------------\n")
Output¶
For Fisher’s data, the Naive Bayes classifier incorrectly classified 6 of the 150 training patterns.
Iris Classification Error Rates
----------------------------------------------
Setosa Versicolour Virginica | TOTAL
0/50 3/50 3/50 | 6/150
----------------------------------------------
Example 2¶
This example trains a Naive Bayes classifier using 24 training patterns with
four nominal input attributes. It illustrates the output available from the
optional argument printLevel
.
The first nominal attribute has three categories and the others each have two. The target classifications are contact lenses prescription: hard, soft, or neither recommended. These data are benchmark data from the Knowledge Discovery Databases archive maintained at the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/Lenses.
from numpy import empty, double
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.ompOptions import ompOptions
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer, FINAL
# Data matrix
inputData = [[1, 1, 1, 1, 3], [1, 1, 1, 2, 2], [1, 1, 2, 1, 3], [1, 1, 2, 2, 1],
             [1, 2, 1, 1, 3], [1, 2, 1, 2, 2], [1, 2, 2, 1, 3], [1, 2, 2, 2, 1],
             [2, 1, 1, 1, 3], [2, 1, 1, 2, 2], [2, 1, 2, 1, 3], [2, 1, 2, 2, 1],
             [2, 2, 1, 1, 3], [2, 2, 1, 2, 2], [2, 2, 2, 1, 3], [2, 2, 2, 2, 3],
             [3, 1, 1, 1, 3], [3, 1, 1, 2, 3], [3, 1, 2, 1, 3], [3, 1, 2, 2, 1],
             [3, 2, 1, 1, 3], [3, 2, 1, 2, 2], [3, 2, 2, 1, 3], [3, 2, 2, 2, 3]]
n_patterns = 24 # 24 training patterns
n_nominal = 4  # four nominal input attributes
n_classes = 3 # three classification categories
n_categories = [3, 2, 2, 2]
nominal_arr = empty([24, 4], dtype=int)
classification = empty([24], dtype=int)
classLabel = ["Hard ", "Soft ", "Neither"]
ompOptions(setFunctionsThreadSafe=True)
# setup the required input arrays from the data matrix
# subtract 1 from the data to ensure classes start at zero
for i in range(0, n_patterns):
classification[i] = inputData[i][4] - 1
for j in range(0, n_nominal):
nominal_arr[i][j] = inputData[i][j] - 1
nominal = {'nCategories': n_categories,
'nominal': nominal_arr}
classErrors = naiveBayesTrainer(n_classes, classification,
nominal=nominal,
printLevel=FINAL)
Output¶
For these data, only one of the 24 training patterns is misclassified, pattern 17. The target classification for that pattern is 2 = "Neither". However, since P(class=2) = 0.3491 < P(class=1) = 0.5085, pattern 17 is classified as class = 1, "Soft Contacts" recommended. The classification error for this probability is calculated as 1.0 - 0.3491 = 0.6509.
--------UNCONDITIONAL TARGET CLASS PROBABILITIES---------
P(Class=0) = 0.1852 P(Class=1) = 0.2222 P(Class=2) = 0.5926
---------------------------------------------------------
----------------CONDITIONAL PROBABILITIES----------------
----------NOMINAL ATTRIBUTE 0 WITH 3 CATEGORIES----------
P(X(0)=0|Class=0) = 0.4286 P(X(0)=1|Class=0) = 0.2857 P(X(0)=2|Class=0) = 0.2857
P(X(0)=0|Class=1) = 0.3750 P(X(0)=1|Class=1) = 0.3750 P(X(0)=2|Class=1) = 0.2500
P(X(0)=0|Class=2) = 0.2778 P(X(0)=1|Class=2) = 0.3333 P(X(0)=2|Class=2) = 0.3889
---------------------------------------------------------
----------NOMINAL ATTRIBUTE 1 WITH 2 CATEGORIES----------
P(X(1)=0|Class=0) = 0.6667 P(X(1)=1|Class=0) = 0.3333
P(X(1)=0|Class=1) = 0.4286 P(X(1)=1|Class=1) = 0.5714
P(X(1)=0|Class=2) = 0.4706 P(X(1)=1|Class=2) = 0.5294
---------------------------------------------------------
----------NOMINAL ATTRIBUTE 2 WITH 2 CATEGORIES----------
P(X(2)=0|Class=0) = 0.1667 P(X(2)=1|Class=0) = 0.8333
P(X(2)=0|Class=1) = 0.8571 P(X(2)=1|Class=1) = 0.1429
P(X(2)=0|Class=2) = 0.4706 P(X(2)=1|Class=2) = 0.5294
---------------------------------------------------------
----------NOMINAL ATTRIBUTE 3 WITH 2 CATEGORIES----------
P(X(3)=0|Class=0) = 0.1667 P(X(3)=1|Class=0) = 0.8333
P(X(3)=0|Class=1) = 0.1429 P(X(3)=1|Class=1) = 0.8571
P(X(3)=0|Class=2) = 0.7647 P(X(3)=1|Class=2) = 0.2353
---------------------------------------------------------
TRAINING PREDICTED CLASS
PATTERN P(class= 0) P(class= 1) P(class= 2) CLASS CLASS ERROR
-----------------------------------------------------------------------
0 0.0436 0.1297 0.8267 2 2 0.1733
1 0.1743 0.6223 0.2034 1 1 0.3777
2 0.1863 0.0185 0.7952 2 2 0.2048
3 0.7238 0.0861 0.1901 0 0 0.2762
4 0.0194 0.1537 0.8269 2 2 0.1731
5 0.0761 0.7242 0.1997 1 1 0.2758
6 0.0920 0.0243 0.8836 2 2 0.1164
7 0.5240 0.1663 0.3096 0 0 0.4760
8 0.0253 0.1127 0.8621 2 2 0.1379
9 0.1182 0.6333 0.2484 1 1 0.3667
10 0.1132 0.0168 0.8699 2 2 0.1301
11 0.6056 0.1081 0.2863 0 0 0.3944
12 0.0111 0.1327 0.8562 2 2 0.1438
13 0.0500 0.7138 0.2362 1 1 0.2862
14 0.0535 0.0212 0.9252 2 2 0.0748
15 0.3937 0.1875 0.4188 2 2 0.5812
16 0.0228 0.0679 0.9092 2 2 0.0908
17 0.1424 0.5085 0.3491 2 1 0.6509
18 0.0994 0.0099 0.8907 2 2 0.1093
19 0.5986 0.0712 0.3301 0 0 0.4014
20 0.0101 0.0805 0.9093 2 2 0.0907
21 0.0624 0.5937 0.3439 1 1 0.4063
22 0.0467 0.0123 0.9410 2 2 0.0590
23 0.3909 0.1241 0.4850 2 2 0.5150
-----------------------------------------------------------------------
CLASSIFICATION ERRORS
Classification 0: 0/4
Classification 1: 0/5
Classification 2: 1/15
Total Errors: 1/24
Example 3¶
This example illustrates the power of Naive Bayes classification for text mining applications. It uses the spam benchmark data available from the Knowledge Discovery Databases archive maintained at the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/Spambase, which is also one of the IMSL data sets.
These data consist of 4601 patterns with 57 continuous attributes and one binary classification attribute. 41% of these patterns are classified as spam and the remainder as non-spam. The first 54 continuous attributes are word or symbol percentages. That is, they are percents scaled from 0 to 100% representing the percentage of words or characters in the email that contain a particular word or character. The last three continuous attributes are word lengths. For a detailed description of these data, visit the KDD archive at the above link.
In this example, the program was written to evaluate alternatives for modeling the continuous attributes. Since some are percentages and others are lengths with widely different ranges, the classification error rate can be influenced by scaling. Percentages are transformed using the arcsin/square root transformation \(y=\sin^{-1} \left( \sqrt{p} \right)\). This transformation often produces a continuous attribute that is more closely approximated by a Gaussian distribution. There are a variety of possible transformations for the word length attributes. In this example, the square root transformation is compared to a classifier with no transformation.
In addition, since this Naive Bayes algorithm allows users to select individual statistical distributions for modeling continuous attributes, the Gaussian and Log Normal distributions are investigated for modeling the continuous attributes.
from __future__ import print_function
from numpy import empty, double
from math import asin, sin, sqrt
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.ompOptions import ompOptions
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer, \
FINAL, GAUSSIAN, LOG_NORMAL
def printErrorRates(classErrors):
p0 = 100.0 * classErrors[0][0] / classErrors[0][1]
p1 = 100.0 * classErrors[1][0] / classErrors[1][1]
p2 = 100.0 * classErrors[2][0] / classErrors[2][1]
print("----------------------------------------------------")
print(" Not Spam Spam | TOTAL")
print(" %d/%d=%4.1f%% %d/%d=%4.1f%% | %d/%d=%4.1f%%"
% (classErrors[0][0], classErrors[0][1], p0,
classErrors[1][0], classErrors[1][1], p1,
classErrors[2][0], classErrors[2][1], p2))
print("----------------------------------------------------\n")
# Inputs assuming all attributes, except family history,
# are continuous
# n_patterns = 4601
# n_variables = 57 + 1 classification
n_classes = 2 # (spam or no spam)
n_continuous = 57
selected_pdf = empty(57, dtype=int)
# additional variables
n_spam = 0
fmt = "%10.2f"
ompOptions(setFunctionsThreadSafe=True)
n_patterns = []
n_variables = []
spamData = dataSets(11,
nObservations=n_patterns,
nVariables=n_variables)
continuous = empty([n_patterns[0], n_variables[0] - 1], dtype=double)
unscaledContinuous = empty([n_patterns[0], n_variables[0] - 1], dtype=double)
classification = empty(n_patterns[0], dtype=int)
for i in range(0, n_patterns[0]):
for j in range(0, n_variables[0] - 1):
if (j < 54):
continuous[i][j] = asin(sqrt(spamData[i][j] / 100))
else:
continuous[i][j] = spamData[i][j]
unscaledContinuous[i][j] = spamData[i][j]
classification[i] = int(spamData[i][n_variables[0] - 1])
if (classification[i] == 1):
n_spam += 1
print("Number of Patterns = %d" % (n_patterns[0]))
print(" Number Classified as Spam = %d\n" % (n_spam))
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=unscaledContinuous)
print(" Unscaled Gaussian Classification Error Rates ")
print(" No Attribute Transformations ")
print(" All Attributes Modeled as Gaussian Variates.")
printErrorRates(classErrors)
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=continuous)
print(" Scaled Gaussian Classification Error Rates ")
print(" Arsin(sqrt) transformation of first 54 Vars. ")
print(" All Attributes Modeled as Gaussian Variates. ")
printErrorRates(classErrors)
for i in range(0, 54):
selected_pdf[i] = GAUSSIAN
for i in range(54, 57):
selected_pdf[i] = LOG_NORMAL
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=continuous,
selectedPdf=selected_pdf)
print(" Gaussian/Log Normal Classification Error Rates ")
print(" Arsin(sqrt) transformation of 1st 54 Attributes. ")
print(" Gaussian - 1st 54 & Log Normal - last 3 Attributes")
printErrorRates(classErrors)
# transform the last three continuous attributes using the square root
for i in range(0, n_patterns[0]):
for j in range(54, 57):
continuous[i][j] = sqrt(unscaledContinuous[i][j])
for i in range(0, 57):
selected_pdf[i] = GAUSSIAN
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=continuous,
selectedPdf=selected_pdf)
print(" Scaled Classification Error Rates ")
print(" Arsin(sqrt) transformation of 1st 54 Attributes")
print(" sqrt() transformation for last 3 Attributes ")
print(" All Attributes Modeled as Gaussian Variates. ")
printErrorRates(classErrors)
for i in range(54, 57):
selected_pdf[i] = LOG_NORMAL
classErrors = naiveBayesTrainer(n_classes, classification,
continuous=continuous,
selectedPdf=selected_pdf)
print(" Scaled Classification Error Rates")
print(" Arsin(sqrt) transformation of 1st 54 Attributes ")
print(" and sqrt() transformation for last 3 Attributes ")
print(" Gaussian - 1st 54 & Log Normal - last 3 Attributes")
printErrorRates(classErrors)
Output¶
If the continuous attributes are left untransformed and modeled using the Gaussian distribution, the overall classification error rate is 18.0%, with most errors occurring when non-spam email is misclassified as spam. The error rate for non-spam email is 27.0%, compared with only 4.2% for spam.
The lowest overall classification error rate, 15.9%, occurs when the percentages are transformed using the arc-sin/square root transformation, the length attributes are left untransformed, and all attributes are modeled as Gaussian variates. In every transformed run the error rate for spam remains low, between 5% and 6%, but the error rate for non-spam email stays above 22%.
In the end, the best model to identify spam may depend upon which type of error is more important, incorrectly classifying non-spam email or incorrectly classifying spam.
Number of Patterns = 4601
Number Classified as Spam = 1813
Unscaled Gaussian Classification Error Rates
No Attribute Transformations
All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
Not Spam Spam | TOTAL
753/2788=27.0% 76/1813= 4.2% | 829/4601=18.0%
----------------------------------------------------
Scaled Gaussian Classification Error Rates
Arsin(sqrt) transformation of first 54 Vars.
All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
Not Spam Spam | TOTAL
628/2788=22.5% 103/1813= 5.7% | 731/4601=15.9%
----------------------------------------------------
Gaussian/Log Normal Classification Error Rates
Arsin(sqrt) transformation of 1st 54 Attributes.
Gaussian - 1st 54 & Log Normal - last 3 Attributes
----------------------------------------------------
Not Spam Spam | TOTAL
652/2788=23.4% 92/1813= 5.1% | 744/4601=16.2%
----------------------------------------------------
Scaled Classification Error Rates
Arsin(sqrt) transformation of 1st 54 Attributes
sqrt() transformation for last 3 Attributes
All Attributes Modeled as Gaussian Variates.
----------------------------------------------------
Not Spam Spam | TOTAL
650/2788=23.3% 96/1813= 5.3% | 746/4601=16.2%
----------------------------------------------------
Scaled Classification Error Rates
Arsin(sqrt) transformation of 1st 54 Attributes
and sqrt() transformation for last 3 Attributes
Gaussian - 1st 54 & Log Normal - last 3 Attributes
----------------------------------------------------
Not Spam Spam | TOTAL
652/2788=23.4% 92/1813= 5.1% | 744/4601=16.2%
----------------------------------------------------
Fatal Errors¶
IMSLS_STOP_USER_FCN - Request from user supplied function to stop algorithm. User flag = "#".
IMSLS_N_OBS_PER_CLASS - Class # has # observation(s). All classes must have at least 2 observations.