## Introduction

A Naive Bayes classifier is a simple probabilistic classifier based on the Bayes’ theorem along with some strong (naive) assumptions regarding the independence of features. Others have suggested the name “independent feature model” as more fit. For example, a pet may be considered a dog, in a pet classifier context, if it has 4 legs, a tail, and barks. These features (presence of 4 legs, a tail, and barking) may depend on each other. However, the naive Bayes classifier assumes they contribute independently to the probability that a pet is a dog. Naive Bayes classifier is used heavily in text classification, e.g., assigning topics on text, detecting spam, identifying age/gender from text, performing sentiment analysis. Given that there are many well-written introductory articles on this topic, we won’t spend much time in theory.

## The mathematical formulation

Given a set of features $$x_1, x_2, \ldots, x_n$$ and a set of classes $$C$$, we want to build a model that yields the value of $$P(C_k\mid x_1, x_2, \ldots, x_n)$$. Then, by taking the maximum probability over this range of probabilities, we come up with our best estimate for the correct class:

\begin{align*} C_\text{predicted} &= \underset{c_k \in \mathcal{C}}{\text{arg max}} \,P(C_k | x_1, x_2, \ldots, x_n)\\ &= \underset{c_k \in \mathcal{C}}{\text{arg max}} \,\frac{P(x_1, x_2, \ldots, x_n|C_k) P(C_k)}{P(x_1, x_2, \ldots, x_n)}\\ &= \underset{c_k \in \mathcal{C}}{\text{arg max}} \, P(x_1, x_2, \ldots, x_n|C_k) P(C_k) \end{align*}

In the second line we applied the Bayes’ theorem $$P(A\mid B) = P(B\mid A) P(A) / P(B)$$. In the last line, we omitted the denominator since it is the same across all classes, i.e., acts merely as a scaling factor. The estimation of $$P(C_k)$$ is straightforward; we just compute each class’s relative frequency in the training set. However, the calculation of $$P(x_1, x_2, \ldots, x_n\mid C_k)$$ is more demanding. Here comes the “naive” part of the Naive Bayes classifier. We make the assumption that $$x_1, x_2, \ldots, x_n$$ features are independent. Then, it holds that $$P(x_1, x_2, \ldots, x_n\mid C_k) = P(x_1\mid C_k)P(x_2\mid C_k)\ldots P(x_n\mid C_k)$$ or just $$\prod_{i=1}^n P(x_i\mid C_k)$$. This greatly reduces the number of the model’s parameters and simplifies their estimation. So, to sum up, the naive Bayes classifier is the solution to the following optimization problem:

$C_\text{predicted} = \underset{c_k \in \mathcal{C}}{\text{arg max}} \, P(C_k) \prod_{i=1}^n P(x_i|C_k)$

In the pet example, assuming we had two classes, $$C_\text{dog}$$ and $$C_\text{monkey}$$, we would write:

\begin{align*} P\left(\text{dog} \mid \text{4-legs}, \text{tail}, \text{barks}\right) &\sim P(\text{4-legs|dog}) P(\text{tail|dog}) P(\text{barks|dog}) \,\, P(C_\text{dog})\\ P\left(\text{monkey} \mid \text{4-legs}, \text{tail}, \text{barks}\right) &\sim \underbrace{P(\text{4-legs|monkey}) P(\text{tail|monkey}) P(\text{barks|monkey})}_{\text{feature distributions}} \,\, \underbrace{P(C_\text{monkey})}_{\text{prior}} \end{align*}

Finally, we would compare the two calculated probabilities to infer whether the pet was a dog or a monkey. Believe it or not, monkeys may bark as well!

All the model’s parameters (the priors for each class and the feature probability distributions) need to be estimated from the training set. The priors can be calculated by the relative frequency of each class in the training set, e.g. $$P(C_k) = \frac{\text{# of samples in class }C_k}{\text{total # of samples}}$$. The feature probability distributions (or class conditionals) can be approximated with maximum likelihood estimation.

In this post, we will create some trainable Gaussian distributions for the iris data set’s features. We will then have Tensorflow estimate their parameters ($$\mu, \sigma$$) by minimizing the negative log-likelihood, which is equivalent to maximizing of log-likelihood. We have already done this in a previous post. Note though, that the feature distributions need not be Gaussian. For instance, in Mathematica’s current implementation, the feature distributions are modeled using a piecewise-constant function:

Another example is when the features are assumed to be a binary-valued variable (True/False). In this case, the samples would be represented as binary-valued feature vectors, and we would use a Bernoulli distribution instead of Gaussian. The bottom line is that no matter what distribution we use to model our features, maximum likelihood estimation can be applied to estimate the distribution’s parameters (let it be $$p$$ in Bernoulli, $$\lambda$$ in Poisson, $$\mu, \sigma$$ in Gaussian, etc.).

## Tensorflow example with the iris dataset

### Load and preprocess the data

The iris data set consists of 50 samples from Iris’ three species (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For further information, the reader may refer to R. A. Fisher. “The use of multiple measurements in taxonomic problems”. Annals of Eugenics. 7 (2): 179–188, 1936. Our goal is to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features (so, just 2 out of 4 features). This blog post is inspired by a weekly assignment of the course “Probabilistic Deep Learning with TensorFlow 2” from Imperial College London.

First, we load the iris dataset, select the features that we will be using, create a training and a test set, and plot the data to get a sense of it.

### Construct the custom training loop

This code block is probably the single most important of the classifier. Here, we create trainable Gaussian distributions for the data set’s features with tfd.MultivariateNormalDiag(). The distributions’ parameters will be estimated by minimizing the negative log-likelihood, which is equivalent to maximizing the log-likelihood.

### Train the model

Next, we assign some initial values to the model’s parameters, instantiate an Adam optimizer, and call the learn_parameters() function to actually perform the training.

After the training has been completed, we plot the loss vs. epoch to ensure that our optimizer converged. We also plot for the same reasons the values of our model’s parameters.

Indeed, our training loop achieved convergence, and the model’s parameters settled on their best estimates, even if they all started with the same initial conditions. The following code plots the three bivariate Gaussian distributions for the three classes.

We may also print the model’s parameters:

### Measure model’s accuracy

To measure the model’s accuracy, we need to be able to make some predictions, which in turn means that we must be able to calculate:

$C_\text{predicted} = \underset{c_k \in \mathcal{C}}{\text{arg max}} \, P(C_k) \prod_{i=1}^n P(x_i|C_k)$

Up until now, we have calculated the feature distributions $$P(x_i\mid C_k)$$, which is the hardest part. We will now estimate the priors as the relative frequencies of each class in the data set.

Since we have 3 classes in the data set, we calculated 3 priors: $$P(C_1) = 0.342, P(C_2) = 0.333$$, and $$P(C_3) = 0.325$$. Next, we will setup a predict_class(), that will act as the $$\text{arg max}$$ opetator.

So, our model achieved an accuracy of 0.8667 on the test set, which is pretty good, considering that we used only 2 out of the 4 features of the dataset!

### Plot the decision regions

In a classification problem with $$N$$ classes, a decision boundary is a hypersurface partitioning the feature space into $$N$$ sets, one for each class. All the points on the same side of the decision boundary are taken to belong to the same category. In our case, we had 2 features and 3 classes; therefore, the classifier partitions the 2D plane into 3 regions. The decision boundary is the area where the predicted label becomes ambiguous.

## A note regarding Gaussian distributions

When the feature distributions are Gaussian, there exist analytic solutions for the distributions’ parameters:

\begin{align} \hat{\mu}_{ik} &= \frac{\sum_n x_i^{(n)} \delta(y^{(n)}=y_k)}{\sum_n \delta(y^{(n)}=y_k)} \\ \hat{\sigma}_{ik} &= \sqrt{\frac{\sum_n (x_i^{(n)} - \hat{\mu}_{ik})^2 \delta(y^{(n)}=y_k)}{\sum_n \delta(y^{(n)}=y_k)}} \end{align}

Where the superscript $$(n)$$ denotes the $$n$$-th example of the data set, $$\delta(y^{(n)}=y_k) = 1$$ if $$y^{(n)}=y_k$$ and 0 otherwise, and $$N$$ is the total number of examples. Note that the above are just the means and standard deviations of the sample data points for each class.

Indeed, the values we got via maximum likelihood estimation are pretty close to the values derived from the analytic solutions!