## Introduction

You probably have heard the saying, “If all you have is a hammer, everything looks like a nail”. This proverb applies to many cases, deterministic classification neural networks not being an exception. Consider, for instance, a typical neural network that classifies images from the CIFAR-10 dataset. This dataset consists of 60.000 color images, all of which belong to 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Naturally, no matter what image we feed this network, say a pencil, it will always assign it to one of the 10 known classes.

However, it would be handy if the model conveyed its uncertainty for the predictions it made. So, given a “pencil” image, it would probably label it as a bird or ship or whatever. At the same time, we’d like it to assign a large uncertainty to this prediction. To reach such an inference level, we need to rethink the traditional deterministic neural network paradigm and take a leap of faith towards probabilistic modeling. Instead of having a model parameterized by its point weights, each weight will now be sampled from a posterior distribution whose parameters will be tuned during the training process.

Left: Deterministic neural network with point estimates for weights. Right: Probabilistic neural network with weights sampled from probability distributions. Image taken from Blundell, et al. Weight Uncertainty in Neural Networks. arXiv (2015)

## Aleatoric and epistemic uncertainty

Probabilistic modeling is intimately related to the concept of uncertainty. The latter is sometimes divided into two categories, aleatoric (also known as statistical) and epistemic (also known as systematic). Aleatoric is derived from the Latin word “alea” which means die. You might be familiar with the phrase “alea iact est”, meaning “the die has been cast”. Hence, aleatoric uncertainty relates to the data itself and captures the inherent randomness when running the same experiment or performing the same task. For instance, if a person draws the number “4” repeatedly, its shape will be slightly different every time. Aleatoric uncertainty is irreducible in the sense that no matter how much data we collect, there will always be some noise in them. Finally, we model aleatoric uncertainty by having our model’s output be a distribution, where we sample from, rather than having point estimates as output values.

Epistemic uncertainty, on the other hand, refers to a model’s uncertainty. I.e., the uncertainty regarding which parameters (out of the set of all possible parameters) accurately model the experimental data. Epistemic uncertainty is decreased as we collect more training examples. Its modeling is realized by enabling a neural network’s weights to be probabilistic rather than deterministic.

Image taken from Kendall, A. & Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv [cs.CV] (2017)

## Tensorflow example

### Summary objective

In the following example, we will generate some non-linear noisy training data, and then we will develop a probabilistic regression neural network to fit the data. To do so, we will provide appropriate prior and posterior trainable probability distributions. This blog post is inspired by a weekly assignment of the course “Probabilistic Deep Learning with TensorFlow 2” from Imperial College London. We start by importing the Python modules that we will need.

### Data generation

We generate some training data $$\mathcal{D}=\{(x_i, y_i)\}$$ using the quintic equation $$y_i = x_i^5 + 0.4 \, x_i \,\epsilon_i$$ where $$\epsilon_i \sim \mathcal{N}(0,1)$$ means that the noise $$\epsilon_i$$ is sampled from a normal distribution with zero mean and standard deviation equal to one.

Notice how the data points are squeezed near $$x=0$$, and how they diverge as $$x$$ deviates from zero. When aleatoric uncertainty (read: measurement noise) differs across the input points $$x$$, we call it heteroscedastic, from the Greek words “hetero” (different) and “scedastic” (dispersive). When it is the same for all input, we call it “homoscedastic”.

### Setup prior and posterior distributions

#### Bayes’ rule

At the core of probabilistic predictive modeling lies the Bayes’ rule. To estimate a full posterior distribution of the parameters $$\mathbf{w}$$, i.e., to estimate $$p(\mathbf{w}\mid \mathcal{D})$$, given some training data $$\mathcal{D} = \{(x_i, y_i)\}$$, the Bayes’ rule assumes the following form:

$P(\mathbf{\mathbf{w}|\mathcal{D}}) = \frac{P(\mathcal{D}|\mathbf{w})P(\mathbf{w})}{P(\mathcal{D})}$

In the following image, you see a sketch of the various probability distributions that the Bayes’ rule entangles. In plain terms, it holds that $$\text{Prior beliefs} \oplus \text{Evidence} = \text{Posterior beliefs}$$, i.e., we start with some assumptions inscribed in the prior distribution, then we observe the “Evidence”, and we update our prior beliefs accordingly, to yield the posterior distribution. Subsequently, the posterior distribution acts as the next iteration’s prior distribution, and the whole cycle is repeated.

To let all these sink, let us elaborate on the essence of the posterior distribution by marginalizing the model’s parameters. The probability of predicting $$y$$ given an input $$\mathbf{x}$$ and the training data $$\mathcal{D}$$ is:

$P(y\mid \mathbf{x},\mathcal{D})= \int P(y\mid \mathbf{x},\mathbf{w}) \, P(\mathbf{w}\mid\mathcal{D}) \mathop{\mathrm{d}\mathbf{w}}$

This is equivalent to having an ensemble of models with different parameters $$\mathbf{w}$$, and taking their average weighted by the posterior probabilities of their parameters, $$P(\mathbf{w}\mid \mathcal{D})$$. Neat?

There are two problems with this approach, however. First, it is computationally intractable to calculate an exact solution for the posterior distribution. Second, the averaging implies that our equation is not differentiable, so we can’t use good old backpropagation to update the model’s parameters! The answer to these hindrances is variational inference, a method of formulating inference as an optimization problem. We won’t dive deep into the theoretical background, but the inquiring reader may google for the Kullback-Leibler divergence. The basic idea is to approximate the true posterior with another function by using the KL divergence as a “metric” of how much the two distributions differ. I promise to blog about all the juicy mathematical details of the KL divergence concept in a future post.

#### Prior distribution

We start by defining a prior distribution for our model’s weights. I haven’t researched the matter a lot, but in the absence of any evidence, adopting a normal distribution as a prior is a fair way to initialize a probabilistic neural network. After all, the central limit theorem asserts that a properly normalized sum of samples will approximate a normal distribution no matter the actual underlying distribution. We use the DistributionLambda() function to inject a distribution into our model, which you can think of as the “lambda function” analog for distributions. The distribution we use is a multivariate normal with a diagonal covariance matrix:

$\Sigma = \left( \begin{matrix} \sigma_1^2 & 0 & 0 & \ldots \\ 0 & \sigma_2^2 & 0 & \ldots \\ 0 & 0 & \sigma_3^2 & \ldots\\ \vdots & \vdots & \vdots & \ddots \end{matrix} \right)$

The mean values are initialized to zero and the $$\sigma_i^2$$ to one.

#### Posterior distribution

The case of the posterior distribution is a bit more complex. We again use a multivariate Gaussian distribution, but we will now allow off-diagonal elements in the covariance matrix to be non-zero. There are three ways to parameterize such a distribution. First, in terms of a positive definite covariance matrix $$\mathbf{\Sigma}$$, second via a positive definite precision matrix $$\mathbf{\Sigma}^{-1}$$, and last with a lower-triangular matrix $$\mathbf{L}\mathbf{L}^⊤$$ with positive-valued diagonal entries, such that $$\mathbf{\Sigma} = \mathbf{L}\mathbf{L}^⊤$$. This triangular matrix can be obtained via, e.g., Cholesky decomposition of the covariance matrix. In our case, we are going for the last method with MultivariateNormalTriL(), where “TriL” stands for “triangular lower”.

$\mathbf{L}= \left( \begin{matrix} L_{11} & 0 & 0 & \ldots\\ L_{21} & L_{22} & 0 & \ldots\\ L_{31} & L_{32} & L_{33} & \ldots\\ \vdots & \vdots & \vdots & \ddots \end{matrix} \right)$

The following parenthetical code shows how one can sample from a multinormal distribution, by setting $$\mathbf{z} = \mathbf{\mu} + \mathbf{L} \mathbf{x}$$, where $$\mathbf{\mu}$$ is the mean vector, and $$\mathbf{L}$$ is the lower triangular matrix derived via $$\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^⊤$$ decomposition. The rationale is that sampling from a univariate normal distribution (e.g., zero mean and unit variance) is easy and fast since efficient vectorized algorithms exist for doing this. Whereas sampling directly from a multinormal distribution is not as efficient. Hence, we turn the problem of sampling a multinormal distribution into sampling a univariate normal.

So, instead of parameterizing the neural network with point estimates for weights $$\mathbf{w}$$, we will instead parameterize it with $$\mathbf{\mu}$$’s and $$\sigma$$’s. Notice that for a lower triangular matrix there are $$(n^2 - n)/2 + n = n(n+1)/2$$ non-zero elements. Adding the $$n$$ $$\mu$$’s we end up with a distribution with $$n(n+1)/2 + n = n(n+3)/2$$ parameters.

By the way, let us create some prior and posterior distributions, print the number of their trainable variables, and sample from them. Note that every time we run this cell block, we get different results for the samples.

### Define the model, loss function, and optimizer

To define probabilistic layers in a neural network, we use the DenseVariational() function, specifying the input and output shape, along with the prior and posterior distributions that we have previously defined. We use a sigmoid activation function to enable the network model non-linear data, along with an IndependentNormal() output layer, to capture aleatoric uncertainty, with an event shape equal to 1 (since our $$y$$ is just a scalar). Regarding the kl_weight parameter, you may refer to the original paper “Weight Uncertainty in Neural Networks” for further information. For now, just take for granted that it is a scaling factor.

Let’s practice and calculate by hand the model’s parameters. The first dense variational layer has 1 input, 8 outputs and 8 biases. Therefore, there are $$1\cdot 8 + 8 = 16$$ weights. Since each weight is going to be modelled by a normal distribution, we need 16 $$\mu$$’s, and $$(16^2 - 16)/2 + 16 = 136$$ $$\sigma$$’s. The latter is just the number of elements of a lower triangular matrix $$16\times 16$$. Therefore, in total we need $$16 + 136 = 152$$ parameters.

What about the second variational layer? This one has 8 inputs (since the previous had 8 outputs), 2 outputs (the $$\mu, \sigma$$ of the independent normal distribution), and 2 biases. Therefore, it has $$8\times 2 + 2 = 18$$ weights. For 18 weights, we need 18 $$\mu$$’s and $$(18^2 - 18)/2 + 18 = 171$$ $$\sigma$$’s. Therefore, in total we need $$18 + 171 = 189$$ parameters. The tfpl.MultivariateNormalTriL.params_size(n) static function calculates the number of parameters needed to parameterize a multivariate normal distribution, so that we don’t have to bother with it.

### Train the model and make predictions

We train the model for 1000 epochs and plot the loss function vs. to confirm that the algorithm has converged.

Indeed RMSprop converged, and now we proceed by making some predictions:

Notice how the models’ outputs converge around $$x=0$$ where there is very little variation in the data. The following plot was generated by taking the average of 100 models:

In the following two images, you can see how going from known inputs to extrapolating to unknown inputs “forces” our model to spread out its predictions.

Known input Unknown input