Custom training loops and subclassing with Tensorflow
Tags: machine learning, mathematics, optimization, statistics, Tensorflow

Contents
- Introduction
- Fit linear regression model to data by minimizing MSE
- Fit Gaussian curve to data with maximum likelihood estimation
Introduction
The most straightforward way to train a model in Tensorflow is to use the model.fit() and model.fit_generator() Keras functions. These functions may seem opaque at first, but they accept callbacks that make them versatile. Such callbacks enable early stopping, saving the model to disk periodically, writing logs for TensorBoard after every batch, accumulating statistics, and so on. However, it may be the case that one needs even finer control of the training loop. In this post, we will see a couple of examples of how to construct a custom training loop, define a custom loss function, have Tensorflow automatically compute the gradients of the loss function with respect to the trainable parameters, and then update the model.
Fit linear regression model to data by minimizing MSE
Generate training data
The “Hello World” of data science is arguably fitting a linear regression model. Indeed, in the first example, we will first generate some noisy data, and then we will fit a linear regression model of the form \(y = m x + b\). The model’s parameters are the scalars \(m, b\), and their optimal values will be figured out by Tensorflow.
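A minimal sketch of this data-generation step follows. The ground-truth values \(m = 1\), \(b = 2\), the sample size, and the noise level are assumptions chosen to be consistent with the fitted values quoted later; the original post may have used different numbers.

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Assumed ground-truth parameters: y = 1.0 * x + 2.0, plus Gaussian noise
true_m, true_b = 1.0, 2.0
N = 200

rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=N).astype(np.float32)
y_train = (true_m * x_train + true_b + rng.normal(0.0, 0.5, size=N)).astype(np.float32)
```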
Create a custom Keras layer
We then subclass the tf.keras.layers.Layer class to create a new layer. The new layer accepts as input a one dimensional tensor of \(x\)’s and outputs a one dimensional tensor of \(y\)’s, after mapping the input to \(m x + b\). This layer’s trainable parameters are \(m, b\), which are initialized to random values drawn from the normal distribution and to zeros, respectively.
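One possible implementation of such a layer (the class name is my own choice):

```python
class LinearRegressionLayer(tf.keras.layers.Layer):
    """Maps a 1-D tensor of x's to m * x + b."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # m is drawn from a normal distribution, b starts at zero
        self.m = self.add_weight(name="m", shape=(), initializer="random_normal", trainable=True)
        self.b = self.add_weight(name="b", shape=(), initializer="zeros", trainable=True)

    def call(self, x):
        return self.m * x + self.b

linear_layer = LinearRegressionLayer()
```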
Define a custom loss function
Here, we define our custom loss function. For this particular case, the mean squared error (MSE) is appropriate, but conceivably we could use whatever loss function we’d like.
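A sketch of an MSE loss written with TensorFlow ops:

```python
def mse_loss(y_true, y_pred):
    # Mean squared error between observations and model predictions
    return tf.reduce_mean(tf.square(y_true - y_pred))
```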
We calculate our loss function for the newly instantiated linear regression layer. At the moment, the parameter \(m\) has a random value and \(b=0\), so we expect the MSE to be high.
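Continuing with the layer and loss defined above, this could look like:

```python
initial_loss = mse_loss(y_train, linear_layer(x_train))
print("MSE before training:", float(initial_loss))
```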
Train the model with a custom training loop
Here comes the custom training loop. What is essential in the following code is the tf.GradientTape() context. Every operation that is performed on the input inside this context is recorded by Tensorflow. We will then use this record for automatic differentiation.
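A sketch of what such a loop could look like; the optimizer, learning rate, and number of epochs are arbitrary choices, not necessarily those of the original post.

```python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)
n_epochs = 100
loss_history = []

for epoch in range(n_epochs):
    with tf.GradientTape() as tape:
        # Every operation on the trainable weights inside this context is recorded
        y_pred = linear_layer(x_train)
        loss = mse_loss(y_train, y_pred)
    # Differentiate the loss w.r.t. m, b and take one gradient descent step
    grads = tape.gradient(loss, linear_layer.trainable_weights)
    optimizer.apply_gradients(zip(grads, linear_layer.trainable_weights))
    loss_history.append(float(loss))
```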
We confirm that the algorithm converged by looking at the MSE vs. epoch:
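For instance, using the `loss_history` list recorded above:

```python
plt.plot(loss_history)
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.show()
```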
We print the optimal values for the model’s parameters, \(m, b\), after the training has completed. Indeed, \(m_\text{opt} = 1.05, b_\text{opt} = 1.91\) are very close to the ground truth values (the small discrepancy is due to the noise we added).
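For example:

```python
m_opt, b_opt = float(linear_layer.m), float(linear_layer.b)
print(f"m_opt = {m_opt:.2f}, b_opt = {b_opt:.2f}")
```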
Finally, we superimpose the training data with the best linear regression model Tensorflow converged to:
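Something along these lines:

```python
plt.scatter(x_train, y_train, s=10, label="training data")
xs = np.linspace(x_train.min(), x_train.max(), 100)
plt.plot(xs, m_opt * xs + b_opt, color="red", label="fitted line")
plt.legend()
plt.show()
```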
Fit Gaussian curve to data with maximum likelihood estimation
What is likelihood?
Likelihood measures how well a statistical model fits a sample of data given a set of values for the unknown parameters. It is considered a function of the parameters only, treating the random variables as fixed at their observed values.
For instance, suppose that we are given a sample \(x_1, x_2, \ldots, x_N\) and we are told that the underlying distribution is normal, but the model’s parameters, \(\mu, \sigma^2\), are unknown to us. How could we estimate those parameters? We start by considering the probability density function (PDF) of the normal distribution:
\[f(x\mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } \exp\left(-\frac {(x-\mu)^2}{2\sigma^2} \right)\]Then, we define as likelihood \(\mathcal{L}\) the corresponding PDF for the whole sample (assuming independent identically distributed) of normal random variables:
\[\mathcal{L}(\mu,\sigma^2 \mid x_1,\ldots,x_N) = \prod_{i=1}^N f(x_i) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left( -\frac{ \sum_{i=1}^N (x_i-\mu)^2}{2\sigma^2}\right)\]Notice how we treat \(\mathcal{L}(\mu,\sigma^2 \mid x_1,\ldots,x_N)\) as a function of the model’s parameters \(\mu, \sigma^2\), and the \(x_i\) as fixed. Therefore, given a set of observations and a candidate model parameterized by some parameters (here \(\mu,\sigma^2\)), likelihood measures how well the model accounts for the observation of these data.
Here is a more concrete example. Suppose we observe just three values: 0.5, 2, and 1. Assuming the underlying distribution is Gaussian, what would \(\mathcal{L}\) be equal to?
\[\mathcal{L}(\mu,\sigma ^2\mid x=0.5, 2, 1) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(0.5 -\mu)^2}{2 \sigma^2}}\times \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(2 -\mu)^2}{2 \sigma^2}}\times \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(1 -\mu)^2}{2 \sigma^2}}\]Now that we have a formula for \(\mathcal{L}\), we can plug in different values of \(\mu, \sigma^2\) and calculate the likelihood. The combination of \(\mu,\sigma^2\) that yields the largest likelihood will constitute our best estimate. By the way, it’s easier to work with logarithms every time we deal with a product, and this is how the log-likelihood concept emerges:
\[\begin{align*} \log \mathcal{L}(\mu,\sigma^2 \mid x_1,\ldots,x_N) &= \log \prod_{i=1}^N f(x_i) \\ &=\log\left[\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left( -\frac{ \sum_{i=1}^N (x_i-\mu)^2}{2\sigma^2}\right)\right]\\ &=-\frac{N}{2} \log \left( 2\pi \sigma^2 \right) - \sum_{i=1}^{N} \left( \frac{(x_i - \mu)^2}{2\sigma^2}\right) \end{align*}\]Ok, let’s do some experimentation with Python now:
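Here is one way to do this, using `scipy.stats.norm` (an assumption on my part; any implementation of the normal PDF works). We evaluate the PDF of each observation separately for one candidate pair \(\mu, \sigma\):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.5, 2.0, 1.0])

# Evaluate the normal PDF of every observation for one candidate (mu, sigma)
mu, sigma = 1.0, 0.7
pdf_per_observation = norm.pdf(x, loc=mu, scale=sigma)
print(pdf_per_observation)
```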
At this point we have considered every observation on its own. Now we will view the sample as a whole and calculate the joint PDF of all \(x_i\):
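Assuming the observations are independent, the joint PDF is just the product of the individual PDFs:

```python
# The likelihood of the whole (i.i.d.) sample is the product of the individual PDFs
likelihood = np.prod(norm.pdf(x, loc=mu, scale=sigma))
print(likelihood)
```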
Since it’s easier to work with logarithms every time we deal with a product, as we are in this case, we take the log of the likelihood:
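With the same candidate parameters as above:

```python
# Summing log-PDFs is also numerically safer than multiplying many small PDFs
log_likelihood = np.sum(norm.logpdf(x, loc=mu, scale=sigma))
print(log_likelihood)
```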
Our ultimate goal is to find the optimal values for the model’s parameters, \(\mu, \sigma^2\), that maximize the log-likelihood. If they maximize the log-likelihood, they will also maximize the likelihood. However, it’s typical for an optimizer to minimize a cost function by default rather than maximize a utility function. Therefore, we usually define our cost function to be the negative log-likelihood, and have our optimizer minimize it. By minimizing the negative log-likelihood, we maximize the log-likelihood (and therefore the likelihood).
A concrete example of maximum likelihood estimation
In the second example, we will generate some data sampled from a normal distribution with known parameters. We will then use Tensorflow to figure out the model’s optimal parameters, i.e., those that minimize the negative log-likelihood of our data.
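A sketch of the data generation; the ground-truth values \(\mu = 5\), \(\sigma = 2\) and the sample size are assumptions for illustration.

```python
# Assumed ground-truth parameters of the normal distribution
true_mu, true_sigma = 5.0, 2.0
N = 1000

samples = np.random.default_rng(0).normal(true_mu, true_sigma, size=N).astype(np.float32)
```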
The same as before, we subclass tf.keras.layers.Layer and add our model’s parameters as weights, initializing them to whatever values the user supplies. We then instantiate a new layer with some estimated values for the parameters that are purposely off.
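One possible implementation; the class name, the PDF returned by `call()`, and the initial guesses are illustrative choices.

```python
class GaussianLayer(tf.keras.layers.Layer):
    """Holds mu and sigma as trainable weights, initialized to user-supplied guesses."""

    def __init__(self, mu_init, sigma_init, **kwargs):
        super().__init__(**kwargs)
        self.mu = self.add_weight(
            name="mu", shape=(),
            initializer=tf.keras.initializers.Constant(mu_init), trainable=True)
        # Note: sigma is unconstrained here; in practice you may want to enforce sigma > 0
        self.sigma = self.add_weight(
            name="sigma", shape=(),
            initializer=tf.keras.initializers.Constant(sigma_init), trainable=True)

    def call(self, x):
        # Return the Gaussian PDF evaluated at every x
        return (1.0 / (tf.sqrt(2.0 * np.pi) * self.sigma)) * tf.exp(
            -tf.square(x - self.mu) / (2.0 * tf.square(self.sigma)))

# Deliberately poor initial guesses for the parameters
gaussian_layer = GaussianLayer(mu_init=1.0, sigma_init=5.0)
```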
We then define a function that calculates the negative log-likelihood, given some observed values and a set of parameters.
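A sketch that implements the closed-form negative log-likelihood derived above:

```python
def negative_log_likelihood(x, mu, sigma):
    # -log L = (N/2) log(2 pi sigma^2) + sum((x - mu)^2) / (2 sigma^2)
    n = tf.cast(tf.size(x), tf.float32)
    return (n / 2.0) * tf.math.log(2.0 * np.pi * tf.square(sigma)) \
        + tf.reduce_sum(tf.square(x - mu)) / (2.0 * tf.square(sigma))
```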
Here comes the custom training loop. The code is almost identical to the previous case. It’s just the loss function that differs.
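A sketch of the loop, mirroring the linear regression case; the optimizer and the number of epochs are again arbitrary choices.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)
nll_history = []

for epoch in range(500):
    with tf.GradientTape() as tape:
        loss = negative_log_likelihood(samples, gaussian_layer.mu, gaussian_layer.sigma)
    grads = tape.gradient(loss, gaussian_layer.trainable_weights)
    optimizer.apply_gradients(zip(grads, gaussian_layer.trainable_weights))
    nll_history.append(float(loss))
```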
We confirm that the algorithm converged, and also we print the best estimates for the model’s parameters.
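For example:

```python
plt.plot(nll_history)
plt.xlabel("Epoch")
plt.ylabel("Negative log-likelihood")
plt.show()

print(f"mu_hat = {float(gaussian_layer.mu):.2f}, sigma_hat = {float(gaussian_layer.sigma):.2f}")
```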
In a future blog post, we will discuss how to structure our custom training loops so that we can use the tf.function decorator to speed things up!
How is mean squared error related to log-likelihood?
There is a fundamental mind-blowing connection between MSE and log-likelihood on a linear Gaussian model. Let us assume that our data are modelled by the linear model \(\mathbf{Y} = \mathbf{X} \mathbf{\Theta} + \epsilon\), where \(\epsilon_i \sim N(0,\sigma_e^2)\). Therefore:
\[\mathbf{Y} - \mathbf{X}\mathbf{\Theta} = \epsilon\]And the log-likelihood is:
\[\begin{align*} \log \mathcal{L}(\mathbf{\Theta} \mid \mathbf{Y}, \mathbf{X}) &= \log\prod_{i=1}^N \text{PDF}(\epsilon_i)= \sum_{i=1}^N \log \text{PDF}(\epsilon_i)\\ &= \sum_{i=1}^N \log \left[ \frac{1}{\sqrt{2\pi \sigma_e^2}} \exp\left(-\frac{(\mathbf{Y}_i - \mathbf{X}_i \mathbf{\Theta})^2}{2\sigma_e^2}\right)\right]\\ &=-\frac{N}{2} \log \left( 2\pi \sigma_e^2 \right) - \sum_{i=1}^{N} \left( \frac{(\mathbf{Y}_i - \mathbf{X}_i \mathbf{\Theta})^2}{2\sigma_e^2}\right) \end{align*}\]In order to maximize \(\log \mathcal{L}(\mathbf{\Theta} \mid \mathbf{Y}, \mathbf{X})\), we need to minimize \(\sum_{i=1}^{N} \frac{(\mathbf{Y}_i - \mathbf{X}_i \mathbf{\Theta})^2}{2\sigma_e^2}\), but this is equivalent to minimizing \(N \cdot \text{MSE}\), i.e., to minimizing the MSE. In the following plot we see the values of log-likelihood and MSE for various values of the parameter \(\mu\) in a linear regression model. Notice how the value of \(\mu\) that minimizes MSE is the same as the one that maximizes the log-likelihood!
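A quick numerical check of this equivalence; the data, the parameter grid, and the fixed noise variance \(\sigma_e^2\) are assumptions chosen purely for illustration (here the "model" is just a constant \(\mu\) plus Gaussian noise):

```python
import numpy as np

sigma_e = 1.0
y = np.random.default_rng(1).normal(3.0, sigma_e, size=500)

mus = np.linspace(1.0, 5.0, 201)
mse = np.array([np.mean((y - m) ** 2) for m in mus])
log_lik = np.array([
    -len(y) / 2 * np.log(2 * np.pi * sigma_e**2) - np.sum((y - m) ** 2) / (2 * sigma_e**2)
    for m in mus])

print("mu minimizing MSE:           ", mus[np.argmin(mse)])
print("mu maximizing log-likelihood:", mus[np.argmax(log_lik)])  # same value
```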