# Trainable probability distributions with TensorFlow

In the previous post, we fit a Gaussian curve to data with maximum likelihood estimation (MLE). For that, we subclassed `tf.keras.layers.Layer` and wrapped up the model’s parameters in our custom layer. Then, we used negative log-likelihood minimization to have TensorFlow figure out the optimal values for the distribution’s parameters. In today’s short post, we will again fit a Gaussian curve to normally distributed data with MLE. However, we will use the trainable probability distributions of TensorFlow Probability rather than a custom layer. TensorFlow Probability is a separate library for probabilistic reasoning and statistical analysis.

As before, we generate some Gaussian data with \(\mu = 2, \sigma = 1\):
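A minimal sketch of this step, assuming NumPy as the sampler. The seed, the sample size of 1000, and the name `x_train` are illustrative choices rather than values taken from the original post:

```python
import numpy as np

# Ground-truth parameters stated in the text.
TRUE_MU, TRUE_SIGMA = 2.0, 1.0

# Seed and sample size are assumptions, chosen only for reproducibility.
rng = np.random.default_rng(seed=42)
x_train = rng.normal(loc=TRUE_MU, scale=TRUE_SIGMA, size=1000).astype(np.float32)

# Quick sanity check: both values should be close to the true parameters.
print(x_train.shape, x_train.mean(), x_train.std())
```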

We now use a `tfp.distributions.Normal` distribution with trainable parameters for `loc` and `scale`. We assign some arbitrary initial values to them, which will be updated during the training loop. The initial values are purposely off, to test whether the gradient descent optimizer will converge. Also, notice how the two distributions (ground truth *vs.* predicted with the initial parameters) are misaligned.

Compare the following figure with the previous one, and see how well-aligned the predicted distribution is with the ground truth distribution.

We print the final estimates for the distribution’s parameters, and we see that they are pretty close to the ones we used when we generated our training data.

Of course, for the normal distribution there exist analytic solutions yielding the optimal parameters. You start from the log-likelihood:

\[\begin{align*} \log \mathcal{L}(\mu,\sigma^2 \mid x_1,\ldots,x_N) &= \log \prod_{i=1}^N f(x_i) \\ &=\log\left[\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left( -\frac{ \sum_{i=1}^N (x_i-\mu)^2}{2\sigma^2}\right)\right]\\ &=-\frac{N}{2} \log \left( 2\pi \sigma^2 \right) - \sum_{i=1}^{N} \left( \frac{(x_i - \mu)^2}{2\sigma^2}\right) \end{align*}\]

And then solve the following set of equations that maximize the log-likelihood (and, therefore, the likelihood):

\[\left\{\frac{\partial \log\mathcal{L}}{\partial \mu}=0, \frac{\partial \log\mathcal{L}}{\partial \sigma}=0\right\}\]

That is, solve the following for \(\mu, \sigma\):

\[\begin{align*} \frac{\partial \log\mathcal{L}}{\partial\mu} &= \sum_{i=1}^{N} \frac{2 x_i - 2\mu}{2 \sigma^2}=0\\ \frac{\partial \log\mathcal{L}}{\partial\sigma} &=-\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i-\mu)^2}{\sigma^3}=0 \end{align*}\]

The solutions are the mean and standard deviation of the sample:

\[\begin{align*} \mu_\text{MLE} &= \frac{1}{N} \sum_{i=1}^{N} x_i\\ \sigma_\text{MLE} &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_\text{MLE})^2} \end{align*}\]

Indeed:
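A quick numerical check of the closed-form solutions, sketched with only the Python standard library. The seed, the sample size, and the variable names are assumptions made for illustration; the post's own check would run on its training data:

```python
import random
import statistics

random.seed(0)  # assumed seed, for reproducibility only
samples = [random.gauss(2.0, 1.0) for _ in range(10_000)]

# MLE of the mean: the sample average.
mu_mle = statistics.fmean(samples)
# pstdev divides by N (not N - 1), which matches the MLE formula.
sigma_mle = statistics.pstdev(samples, mu=mu_mle)

print(f"mu_MLE = {mu_mle:.3f}, sigma_MLE = {sigma_mle:.3f}")
```

Note the divisor: the maximum likelihood estimate uses \(N\), so it corresponds to the population standard deviation rather than the \(N-1\) sample estimator.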

However, in most cases, this optimization problem cannot be solved analytically, and therefore we need to attack it numerically.
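To make the numerical route concrete, here is a minimal sketch in plain Python that climbs the Gaussian log-likelihood using the two partial derivatives derived earlier, each divided by \(N\) so the step size does not depend on the sample size. The starting point, step size, and iteration count are assumptions, and the Gaussian is used only because its closed-form answer is available for comparison:

```python
import random

random.seed(0)  # assumed seed
xs = [random.gauss(2.0, 1.0) for _ in range(1_000)]
n = len(xs)

mu, sigma = 0.0, 3.0  # deliberately poor starting point
lr = 0.1              # assumed step size

for _ in range(500):
    # Partial derivatives of the log-likelihood, divided by N.
    d_mu = sum(x - mu for x in xs) / (n * sigma**2)
    d_sigma = -1.0 / sigma + sum((x - mu) ** 2 for x in xs) / (n * sigma**3)
    # Gradient ascent on the log-likelihood (equivalent to descent on the NLL).
    mu += lr * d_mu
    sigma += lr * d_sigma

# Closed-form MLE on the same data, for comparison.
mu_closed = sum(xs) / n
sigma_closed = (sum((x - mu_closed) ** 2 for x in xs) / n) ** 0.5

print(f"numerical:   mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"closed form: mu = {mu_closed:.4f}, sigma = {sigma_closed:.4f}")
```

For models without a closed-form solution only the loop is available, and automatic differentiation replaces the hand-derived gradients.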