Naive Bayes – An Overview

Classification problems are characterized by the need to classify unknown patterns or data into one of m categories based upon the values of k attributes \(x_1,x_2,\ldots,x_k\). There are many algorithms for solving classification problems, including discriminant analysis, neural networks, and Naive Bayes. Each algorithm has its strengths and weaknesses. Discriminant analysis is robust, but it requires \(x_1,x_2,\ldots,x_k\) to be continuous, and since it uses a simple linear equation for the discriminant function, its error rate can be higher than that of the other algorithms. See function discriminantAnalysis.

Neural networks provide linear or non-linear classification algorithms that accept both nominal and continuous input attributes. However, network training can be unacceptably slow for problems with a large number of attributes, typically when k > 1000. Naive Bayes, on the other hand, is a simple algorithm that is very fast. A Naive Bayes classifier can be trained to classify patterns involving thousands of attributes and applied to thousands of patterns. As a result, Naive Bayes is a preferred algorithm for text mining and other large classification problems. However, its computational efficiency comes at a price. The error rate for a Naive Bayes classifier is typically higher than that of an equivalent neural network classifier, although it is usually low enough for many applications such as text mining.

If C is the classification attribute and \(X^T=\{x_1,x_2,\ldots,x_k\}\) is the vector-valued array of input attributes, the classification problem simplifies to estimating the conditional probability \(P(c|X)\) from a set of training patterns. Bayes' rule states that this probability can be expressed as the ratio:

\[P\left(C = c | X = \left\{x_1,x_2,\ldots,x_k\right\}\right) = \frac {P(C = c)P\left(X = \left\{x_1,x_2,\ldots,x_k\right\}|C = c\right)} {P\left(X = \left\{x_1,x_2,\ldots,x_k\right\}\right)}\]

where c is equal to one of the target classes 0, 1, …, nClasses-1. In practice, the denominator of this expression is constant across all target classes since it is only a function of the given values of X. As a result, the Naive Bayes algorithm does not expend computational time estimating \(P \left( X=\left\{ x_1,x_2,\ldots,x_k \right\} \right)\) for every pattern. Instead, a Naive Bayes classifier calculates the numerator \(P (C=c) P\left( X=\left\{ x_1,x_2,\ldots,x_k \right\} | C=c \right)\) for each target class and then classifies X to the target class with the largest value, i.e.,

\[X \xleftarrow[\max (c=0,1,\ldots,\mathtt{nClasses} - 1)]{} P(C=c) P(X|C=c)\]
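As a hypothetical numerical illustration (the values below are illustrative only, not taken from the library documentation), suppose there are two target classes with \(P(C=0)=0.6\), \(P(C=1)=0.4\), \(P(X|C=0)=0.05\), and \(P(X|C=1)=0.20\). Bayes' rule gives \(P(C=1|X)=0.08/0.11\approx 0.73\) and \(P(C=0|X)=0.03/0.11\approx 0.27\); the denominator 0.11 is identical for both classes, so the classifier can select class 1 directly from the numerators 0.08 and 0.03. A minimal sketch of this decision rule in Python follows; the function and variable names are illustrative and are not part of the library's API.

```python
def classify(priors, likelihoods):
    """Return the class index maximizing P(C=c) * P(X | C=c).

    priors      -- list of P(C=c) for c = 0, 1, ..., nClasses-1
    likelihoods -- list of P(X | C=c) for the same classes, evaluated
                   at the pattern X being classified
    """
    scores = [p * l for p, l in zip(priors, likelihoods)]
    # The denominator P(X) is omitted: it is the same for every class,
    # so it cannot change which score is largest.
    return max(range(len(scores)), key=lambda c: scores[c])

# Using the hypothetical numbers above:
print(classify([0.6, 0.4], [0.05, 0.20]))  # prints 1
```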

The classifier simplifies this calculation by assuming conditional independence:

\[P\left(X = \left\{x_1,x_2,\ldots,x_k\right\}|C=c\right) = \prod_{j=1}^{k} P\left(x_j|C=c\right)\]

This is equivalent to assuming that the values of the input attributes, given C, are independent of one another, i.e.,

\[P\left(x_i|x_j,C=c\right) = P\left(x_i|C=c\right) \quad \text{for all } i \neq j.\]
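The sketch below illustrates, for nominal attributes, how the conditional-independence assumption reduces the joint likelihood \(P(X|C=c)\) to a product of per-attribute estimates. It is a from-scratch illustration of the factorization under the stated assumption, not the library's implementation; all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_tables(patterns, targets):
    """Estimate P(x_j = v | C = c) for nominal attributes by relative frequency."""
    counts = defaultdict(Counter)          # (class, attribute index) -> value counts
    class_sizes = Counter(targets)
    for x, c in zip(patterns, targets):
        for j, v in enumerate(x):
            counts[(c, j)][v] += 1
    return {key: {v: n / class_sizes[key[0]] for v, n in ctr.items()}
            for key, ctr in counts.items()}

def joint_likelihood(tables, x, c):
    """P(X = x | C = c) as the product of per-attribute conditional probabilities."""
    # With thousands of attributes, sums of log-probabilities are normally used
    # instead of this raw product to avoid numerical underflow.
    p = 1.0
    for j, v in enumerate(x):
        p *= tables[(c, j)].get(v, 0.0)
    return p

# Toy data: two nominal attributes, two classes.
tbl = conditional_tables([("a", "x"), ("a", "y"), ("b", "y")], [0, 0, 1])
print(joint_likelihood(tbl, ("a", "y"), 0))  # 1.0 * 0.5 = 0.5
```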

In real-world data this assumption rarely holds, yet in many cases the approach results in surprisingly low classification error rates. Thus, although the estimate of \(P \left(C=c | X=\left\{ x_1,x_2,\ldots,x_k \right\} \right)\) from a Naive Bayes classifier is generally an approximation, classifying patterns with the Naive Bayes algorithm can still achieve acceptably low classification error rates.

Function naiveBayesTrainer is used to train a classifier from a set of training patterns containing values for both the input and target attributes. This routine stores the trained classifier in an Imsls_d_nb_classifier data structure. The trained classifier can in turn be written to a file using nbClassifierWrite and later retrieved using nbClassifierRead.

The classifications of new patterns with unknown target values can be predicted by passing the trained classifier data structure, Imsls_d_nb_classifier, to naiveBayesClassification. If necessary, memory allocated to the trained classifier can be released using nbClassifierFree.
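The train, store, retrieve, and classify workflow described above can be sketched end to end. The following is a self-contained illustration using a tiny hypothetical data set and Python's pickle module for persistence; it mirrors the spirit of the workflow only and does not reproduce the naiveBayesTrainer, nbClassifierWrite, nbClassifierRead, or naiveBayesClassification interfaces.

```python
import pickle
from collections import Counter, defaultdict

# Train: estimate priors and per-attribute conditional tables from hypothetical
# nominal-attribute training patterns with known target values.
patterns = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "x")]
targets  = [0, 0, 1, 1]
priors = {c: n / len(targets) for c, n in Counter(targets).items()}
counts = defaultdict(Counter)              # (class, attribute index) -> value counts
for x, c in zip(patterns, targets):
    for j, v in enumerate(x):
        counts[(c, j)][v] += 1
tables = {k: {v: n / sum(ctr.values()) for v, n in ctr.items()}
          for k, ctr in counts.items()}

# Store the trained classifier to a file (loosely analogous to nbClassifierWrite) ...
with open("nb_classifier.pkl", "wb") as f:
    pickle.dump({"priors": priors, "tables": tables}, f)

# ... and retrieve it later (loosely analogous to nbClassifierRead).
with open("nb_classifier.pkl", "rb") as f:
    model = pickle.load(f)

# Classify a new pattern with an unknown target value by maximizing the
# numerator P(C=c) * prod_j P(x_j | C=c).
def score(m, x, c):
    p = m["priors"][c]
    for j, v in enumerate(x):
        p *= m["tables"][(c, j)].get(v, 0.0)
    return p

new_x = ("a", "x")
print(max(model["priors"], key=lambda c: score(model, new_x, c)))  # prints 0
```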