naiveBayesClassification¶
Classifies unknown patterns using a previously trained Naive Bayes classifier. The classifier is contained in an Imsls_d_nb_classifier data structure, which is optional output from naiveBayesTrainer.
Synopsis¶
naiveBayesClassification (nbClassifier, nPatterns)
Required Arguments¶
Imsls_d_nb_classifier nbClassifier (Input)
A structure of the type Imsls_d_nb_classifier from naiveBayesTrainer.
int nPatterns (Input)
Number of patterns to classify.
Return Value¶
An array of size nPatterns
containing the predicted classification
associated with each input pattern.
Optional Arguments¶
nominal, int[[]] (Input)
nominal is an array of size nPatterns by nbClassifier.nNominal containing values for the nominal input attributes. The i-th row contains the nominal input attributes for the i-th pattern. The j-th column of this matrix contains the values for the j-th nominal attribute. They must be encoded with integers from 0 to nbClassifier.nCategories[j]-1. Any value outside this range is treated as a missing value. If nbClassifier.nNominal = 0, this array is ignored.
continuous, float[[]] (Input)
continuous is an array of size nPatterns by nbClassifier.nContinuous containing values for the continuous input attributes. The i-th row contains the input attributes for the i-th pattern. The j-th column of this matrix contains the values for the j-th continuous attribute. Missing values should be set equal to machine(6) = NaN. Patterns with missing values are still classified unless the ignoreMissingValues option is supplied. If nbClassifier.nContinuous = 0, this matrix is ignored.
printLevel, int (Input)
Print level for printing data warnings and final results. printLevel should be set to one of the following values:

printLevel      Description
NONE            Printing of data warnings and final results is suppressed.
FINAL           Prints the final summary of the Naive Bayes classification.
DATA_WARNINGS   Prints information about missing values and PDF calculations equal to zero.
TRACE_ALL       Prints the final summary plus all data warnings associated with missing values and PDF calculations equal to zero.

Default: NONE.
userPdf, float pdf(int index[], float x) (Input)
A user-supplied probability density function used to calculate the conditional probability density for continuous input attributes. It is required when the classifier was trained with selectedPdf[i] = USER. When pdf is called, x equals continuous[i*nContinuous+j], and index contains the following values for i, j, and k:

Index       Value
index[0]    i = pattern index
index[1]    j = attribute index
index[2]    k = target classification

The pattern index ranges from 0 to nPatterns-1 and identifies the pattern associated with x. The attribute index ranges from 0 to nContinuous-1, and k is the target classification being evaluated. This argument is ignored if nContinuous = 0. By default the Gaussian PDF is used for calculating the conditional probability densities, using either the means and variances calculated from the training patterns or those supplied in gaussianPdf. An illustrative sketch of such a callback appears at the end of this list.
predictedClassProb (Output)
An array of size nPatterns by nClasses, where nClasses is the number of target classifications. The values in the i-th row are the predicted classification probabilities associated with the target classes. predictedClassProb[i*nClasses+j] is the estimated probability that the i-th pattern belongs to the j-th target class.
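The following sketch shows the shape of a userPdf callback. The exponential density and its rate table are invented purely for illustration, and the sketch assumes the classifier was trained with selectedPdf set to USER and compatible parameters; any valid density for the attribute could be returned.
from math import exp

# Hypothetical rate parameters indexed by attribute (j) and target
# classification (k); these values are invented for illustration.
rates = [[1.0, 2.0],
         [0.5, 1.5]]

def pdf(index, x):
    # index[0] = pattern index i, index[1] = attribute index j,
    # index[2] = target classification k
    j, k = index[1], index[2]
    lam = rates[j][k]
    # exponential conditional density for attribute j given class k
    return lam * exp(-lam * x) if x >= 0.0 else 0.0

# The callback would then be supplied as:
# naiveBayesClassification(nb_classifier, n_patterns,
#                          continuous=continuous, userPdf=pdf)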
Description¶
Function naiveBayesClassification
estimates classification probabilities
from a previously trained Naive Bayes classifier. Two arrays are used to
describe the values of the nominal and continuous attributes used for
calculating these probabilities. The predicted classification returned by
this function is the class with the largest estimated classification
probability. The classification probability estimates for each pattern can
be obtained using the optional argument predictedClassProb
.
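The following minimal sketch illustrates this relationship; the two-class training data are made up purely for illustration. The predicted class returned for each pattern equals the index of the largest value in the corresponding row of predictedClassProb.
from numpy import array, argmax, double
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer
from pyimsl.stat.naiveBayesClassification import naiveBayesClassification

# Made-up two-class data with two continuous attributes
classification = array([0, 0, 1, 1])
continuous = array([[1.0, 2.0], [1.2, 1.8],
                    [3.0, 4.1], [2.9, 3.8]], dtype=double)

nb_classifier = []
naiveBayesTrainer(2, classification,
                  continuous=continuous,
                  nbClassifier=nb_classifier)

pred_class_prob = []
predicted = naiveBayesClassification(nb_classifier, 4,
                                     continuous=continuous,
                                     predictedClassProb=pred_class_prob)

# Each prediction is the argmax of that pattern's probability row
for i in range(4):
    assert predicted[i] == argmax(pred_class_prob[i])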
Examples¶
Example 1¶
Fisher’s (1936) Iris data is often used for benchmarking classification algorithms. It is one of the IMSL data sets and consists of the following continuous input attributes and classification target:
Continuous Attributes: X0(sepal length), X1(sepal width), X2(petal length), and X3(petal width)
Classification (Iris Type): Setosa, Versicolour or Virginica.
This example trains a Naive Bayes classifier using the 150 training patterns from Fisher’s data, then uses the trained classifier to reclassify those same 150 plants from their sepal and petal measurements.
from __future__ import print_function
from numpy import empty, double
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.ompOptions import ompOptions
from pyimsl.stat.naiveBayesClassification import naiveBayesClassification
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer
from pyimsl.stat.nbClassifierFree import nbClassifierFree
n_patterns = 150 # 150 training patterns
n_continuous = 4 # four continuous input attributes
n_classes = 3 # three classification categories
dashes = "------------------------------------------------------"
classification = empty([150], dtype=int)
continuous = empty([150, 4], dtype=double)
classLabel = ["Setosa ", "Versicolour", "Virginica "]
ompOptions(setFunctionsThreadSafe=True)
# irisData[]: The raw data matrix. This is a 2-D matrix
# with 150 rows and 5 columns. The last 4 columns are the
# continuous input attributes and the 1st column is the
# classification category (1-3). These data contain no
# nominal input attributes.
irisData = dataSets(3)
# Data corrections described in the KDD data mining archive
irisData[34][4] = 0.1
irisData[37][2] = 3.1
irisData[37][3] = 1.5
# setup the required input arrays from the data matrix
for i in range(0, n_patterns):
    classification[i] = int(irisData[i][0] - 1)
    for j in range(1, n_continuous + 1):
        continuous[i][j - 1] = irisData[i][j]
nb_classifier = []
classErrors = naiveBayesTrainer(n_classes, classification,
                                continuous=continuous,
                                nbClassifier=nb_classifier)
print(" Iris Classification Error Rates")
print("----------------------------------------------")
print(" Setosa Versicolour Virginica | TOTAL")
print(" %d/%d %d/%d %d/%d | %d/%d\n"
% (classErrors[0][0], classErrors[0][1],
classErrors[1][0], classErrors[1][1],
classErrors[2][0], classErrors[2][1],
classErrors[3][0], classErrors[3][1]))
print("----------------------------------------------\n")
# CALL NAIVE_BAYES_CLASSIFICATION ***************************
pred_class_prob = []
predictedClass = naiveBayesClassification(nb_classifier, n_patterns,
                                          continuous=continuous,
                                          predictedClassProb=pred_class_prob)
print(" PROBABILITIES FOR INCORRECT CLASSIFICATIONS")
print(dashes)
print("\nTRAINING PATTERNS| PREDICTED\t|")
print(" X1 X2 X3 X4 | CLASS\t| CLASS\tP(0) P(1) P(2)|")
print(dashes)
for i in range(0, n_patterns):
    if (classification[i] == predictedClass[i]):
        continue
    print(" %4.1f%4.1f%4.1f%4.1f| %s\t| %s\t%4.2f %4.2f %4.2f|"
          % (continuous[i][0], continuous[i][1],
             continuous[i][2], continuous[i][3],
             classLabel[classification[i]], classLabel[predictedClass[i]],
             pred_class_prob[i][0], pred_class_prob[i][1],
             pred_class_prob[i][2]))
    print(dashes)
nbClassifierFree(nb_classifier)
Output¶
For Fisher’s data, the Naive Bayes classifier incorrectly classified 6 of the 150 training patterns.
Iris Classification Error Rates
----------------------------------------------
Setosa Versicolour Virginica | TOTAL
0/50 3/50 3/50 | 6/150
----------------------------------------------
PROBABILITIES FOR INCORRECT CLASSIFICATIONS
------------------------------------------------------
TRAINING PATTERNS| PREDICTED |
X1 X2 X3 X4 | CLASS | CLASS P(0) P(1) P(2)|
------------------------------------------------------
6.9 3.1 4.9 1.5| Versicolour | Virginica 0.00 0.46 0.54|
------------------------------------------------------
5.9 3.2 4.8 1.8| Versicolour | Virginica 0.00 0.16 0.84|
------------------------------------------------------
6.7 3.0 5.0 1.7| Versicolour | Virginica 0.00 0.08 0.92|
------------------------------------------------------
4.9 2.5 4.5 1.7| Virginica | Versicolour 0.00 0.97 0.03|
------------------------------------------------------
6.0 2.2 5.0 1.5| Virginica | Versicolour 0.00 0.96 0.04|
------------------------------------------------------
6.3 2.8 5.1 1.5| Virginica | Versicolour 0.00 0.71 0.29|
------------------------------------------------------
Example 2¶
This example uses the spam benchmark data available from the Knowledge Discovery Databases archive maintained at the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/Spambase.
These data consist of 4601 patterns, each with 57 continuous attributes and one classification; 41% of the patterns are classified as spam and the remainder as non-spam. The first 54 continuous attributes are word or symbol percentages. That is, they are percentages, scaled from 0 to 100, representing the proportion of words or characters in the email that match a particular word or character. The last three continuous attributes are word lengths. For a detailed description of these data, visit the KDD archive at the link above.
In this example, percentages are transformed using the arcsin/square root transformation \(y=\sin^{-1} \left( \sqrt{p} \right)\). The last three attributes, word lengths, are transformed using square roots. Transformed percentages and the first word length attribute are modeled using the Gaussian distribution. The last two word lengths are modeled using the log normal distribution.
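As a small standalone illustration of these transformations (separate from the full example below, with made-up values), a percentage p is mapped by \(y=\sin^{-1}\left(\sqrt{p/100}\right)\) and a word length w by \(\sqrt{w}\):
from numpy import arcsin, sqrt, array

pct = array([0.0, 12.5, 41.0, 100.0])   # hypothetical percentages (0-100)
y = arcsin(sqrt(pct / 100.0))           # arcsin/square root transform

lengths = array([2.3, 9.0, 61.0])       # hypothetical word lengths
t = sqrt(lengths)                       # square root transform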
from __future__ import print_function
from numpy import empty, double, zeros
from math import asin, sqrt
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.ompOptions import ompOptions
from pyimsl.stat.randomSampleIndices import randomSampleIndices
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.naiveBayesClassification import naiveBayesClassification
from pyimsl.stat.naiveBayesTrainer import naiveBayesTrainer, \
FINAL, GAUSSIAN, LOG_NORMAL
def printErrorRates(classificationErrors, n, label):
    p0 = 100.0 * classificationErrors[0][0] / classificationErrors[0][1]
    p1 = 100.0 * classificationErrors[1][0] / classificationErrors[1][1]
    p2 = 100.0 * classificationErrors[2][0] / classificationErrors[2][1]
    print(" Classification Error Rates Reported by")
    print(label % (n))
    print("----------------------------------------------------")
    print(" Not Spam Spam | TOTAL")
    print(" %d/%d=%4.1f%% %d/%d=%4.1f%% | %d/%d=%4.1f%%"
          % (classificationErrors[0][0], classificationErrors[0][1], p0,
             classificationErrors[1][0], classificationErrors[1][1], p1,
             classificationErrors[2][0], classificationErrors[2][1], p2))
    print("----------------------------------------------------\n")
condPdfTableLength = 0
n_sample = 2000
n_classes = 2 # (spam or no spam)
n_continuous = 57
classSample = empty([2000], dtype=int)
label1 = " Trainer from Training Dataset of %d Observations "
label2 = " Classifier for Entire Dataset of %d Observations "
n_spam = 0
n_patterns = []
n_variables = []
spamData = dataSets(11,
                    nObservations=n_patterns,
                    nVariables=n_variables)
continuous = empty([n_patterns[0], n_continuous], dtype=double)
continuousSample = empty([n_sample, n_continuous], dtype=double)
classification = empty(n_patterns[0], dtype=int)
# map continuous attributes into transformed representation
for i in range(0, n_patterns[0]):
    for j in range(0, n_continuous):
        if (j < 54):
            continuous[i][j] = asin(sqrt(spamData[i][j] / 100))
        else:
            continuous[i][j] = spamData[i][j]
    classification[i] = int(spamData[i][n_variables[0] - 1])
    if (classification[i] == 1):
        n_spam += 1
print("Number of Patterns = %d Number Classified as Spam = %d \n"
      % (n_patterns[0], n_spam))
# select random sample for training Naive Bayes Classifier
randomSeedSet(1234567)
rndSampleIndex = randomSampleIndices(n_sample, n_patterns[0])
for k in range(0, n_sample):
    i = rndSampleIndex[k] - 1
    classSample[k] = classification[i]
    for j in range(0, n_continuous):
        continuousSample[k, j] = continuous[i, j]
# Train Naive Bayes Classifier
nb_classifier = []
classErrors = naiveBayesTrainer(n_classes, classSample,
                                continuous=continuousSample,
                                nbClassifier=nb_classifier)
# print error rates for training sample
printErrorRates(classErrors, n_sample, label1)
# CALL NAIVE_BAYES_CLASSIFICATION TO CLASSIFY ENTIRE DATASET
predictedClass = naiveBayesClassification(
    nb_classifier, n_patterns[0], continuous=continuous)
# calculate classification error rates for entire dataset
classification_errors = zeros([3, 2], dtype=int)
for i in range(0, n_patterns[0]):
    if (classification[i] == 0):
        classification_errors[0][1] += 1
        if (classification[i] != predictedClass[i]):
            classification_errors[0][0] += 1
    elif (classification[i] == 1):
        classification_errors[1][1] += 1
        if (classification[i] != predictedClass[i]):
            classification_errors[1][0] += 1
classification_errors[2][1] = \
    classification_errors[0][1] + classification_errors[1][1]
classification_errors[2][0] = \
    classification_errors[0][0] + classification_errors[1][0]
# print error rates for entire dataset
printErrorRates(classification_errors, n_patterns[0], label2)
Output¶
It is interesting to note that the classification error rates obtained by
training a classifier from a random sample are slightly lower than those
obtained from training a classifier with all 4601 patterns. When the
classifier is trained using all 4601 patterns, the overall classification
error rate is 12.9% (see Example 3 for
naiveBayesTrainer
); it is 12.4% for a random sample of 2000 patterns.
Number of Patterns = 4601 Number Classified as Spam = 1813
Classification Error Rates Reported by
Trainer from Training Dataset of 2000 Observations
----------------------------------------------------
Not Spam Spam | TOTAL
236/1202=19.6% 41/798= 5.1% | 277/2000=13.8%
----------------------------------------------------
Classification Error Rates Reported by
Classifier for Entire Dataset of 4601 Observations
----------------------------------------------------
Not Spam Spam | TOTAL
589/2788=21.1% 99/1813= 5.5% | 688/4601=15.0%
----------------------------------------------------
Fatal Errors¶
IMSLS_STOP_USER_FCN    Request from user-supplied function to stop algorithm. User flag = "#".