<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ekamperi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ekamperi.github.io/" rel="alternate" type="text/html" /><updated>2026-01-04T19:18:02+00:00</updated><id>https://ekamperi.github.io/feed.xml</id><title type="html">Let’s talk about science!</title><subtitle>A blog on things I’m interested in such as mathematics, physics, programming, machine learning, data science, and radiation oncology.
</subtitle><author><name>Stathis Kamperis</name></author><entry><title type="html">How to install Jupyter Lab in FreeBSD 15.0</title><link href="https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd.html" rel="alternate" type="text/html" title="How to install Jupyter Lab in FreeBSD 15.0" /><published>2026-01-04T00:00:00+00:00</published><updated>2026-01-04T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd.html"><![CDATA[<p>We used to have an implicit rule in this blog: we write only about things that stand the test of time, such as concepts, algorithms, math theorems, and so on. However, we now break this rule for two reasons. First, the Internet is getting spammed with auto-generated content of ~zero value. Let’s try to increase the SNR and, at the same time, keep the knowledge decentralized, as it was meant to be. Second, LLMs are particularly bad at solving problems on less common operating systems, such as <a href="https://www.freebsd.org/">FreeBSD</a>. So, hopefully, during the next scraping run the spider bots will parse this little post, and we will get to influence, even if in a minuscule way, the training of the next generation of LLMs.</p>

<p>The key to successfully installing Jupyter Lab on FreeBSD 15.0 is to use as many FreeBSD packages as possible and resort to pip only for what’s left. Here is the exact recipe that worked for me. Mind the option <code class="language-plaintext highlighter-rouge">--system-site-packages</code>:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">sudo </span>pkg <span class="nb">install </span>pkgconf python311 py311-pip py311-setuptools py311-wheel py311-cython <span class="se">\</span>
    py311-maturin py311-pyzmq py311-scikit-build-core cmake ninja rust
<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> venvs
<span class="nv">$ </span>python3.11 <span class="nt">-m</span> venv <span class="nt">--system-site-packages</span> ~/venvs/jupyter
<span class="nv">$ </span><span class="nb">source</span> ~/venvs/jupyter/bin/activate
<span class="nv">$ </span>pip <span class="nb">install </span>jupyterlab notebook
<span class="nv">$ </span>jupyter lab</code></pre></figure>

<p>Here is the proof:</p>

<p align="center">
    <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/jupyter_lab.png" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="freebsd" /><category term="machine learning" /><category term="FreeBSD" /><summary type="html"><![CDATA[How to install Jupyter Lab in FreeBSD 15.0 using mostly system packages]]></summary></entry><entry><title type="html">Random thoughts on ChatGPT</title><link href="https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt.html" rel="alternate" type="text/html" title="Random thoughts on ChatGPT" /><published>2023-01-16T00:00:00+00:00</published><updated>2023-01-16T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt.html"><![CDATA[<p><em>Shout out to the kind person somewhere on the globe who donated 20 coffees on “Buy me a coffee”. Whoever you are, I thank you! I promise that I will try to deliver high-value content in the following months.</em></p>

<p>In <a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies">Superintelligence</a>, Nick Bostrom talks about an “Oracle AI,” i.e., an AI system that, by design, does not act but merely answers questions, akin to having a genie in a bottle. Arguably, this is the safest advanced AI we can build and have it confined. However, even in this case, we could still be vulnerable to Oracle’s social engineering dexterity should it find the right arguments to persuade us for a matter. So Bostrom makes the following suggestions.</p>

<ol>
  <li>He proposes limiting the number of interactions between humans and the Oracle; contrast this with how many of us treat ChatGPT as an infinite-capacity system, interrogating it repeatedly.</li>
  <li>He makes a case for reducing its output to “yes/no/undetermined” instead of free text responses so that a social engineering attack would take much longer. Again, ChatGPT works differently since it produces a great deal of narrative text.</li>
  <li>Another precaution is resetting the Oracle’s state after each answer so the system does not contemplate long-term goals (ChatGPT remembers previous prompts given to it in the same conversation).</li>
  <li>Last, it should be motivated by something other than human rewards via reinforcement learning, or social engineering becomes inevitable. This could be done via the fascinating idea of injecting “calculated indifference” into the Oracle’s utility function, making it apathetic to whether its replies are read. However, modern AI systems in social media work in the opposite direction: they get rewarded for maximizing user engagement.</li>
</ol>

<p>To be clear, <strong>I’m not implying that ChatGPT is an Oracle or that it somehow possesses agency</strong>, but still, it makes you think about the safety of forthcoming AI systems.</p>

<p>The above are relevant for when fully autonomous AI arrives, if ever. Until then, people misusing advanced AI in politics pose significant dangers to society <em>already</em>. One major concern is the potential for manipulation and disinformation. ChatGPT can generate compelling and sophisticated text, making it easy for bad actors to spread false information and propaganda. This can be particularly dangerous in politics, since misinformation there can have serious real-world consequences (e.g., regarding climate change, pandemics, or nuclear energy).</p>

<p>Another concern is the potential for AI to be used to influence public opinion and sway elections. With its ability to generate vast amounts of content and target specific individuals, ChatGPT could be used to spread disinformation in a highly targeted and effective manner. This could significantly impact the outcome of elections and undermine the democratic process.</p>

<p>Moreover, the use of AI in politics could also perpetuate and amplify societal biases. Machine learning algorithms are only as unbiased as the data they are trained on. This could severely affect marginalized groups and further entrench existing power imbalances.</p>

<p>The future is as dangerous as it is fascinating.</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="machine learning" /><category term="neural networks" /><category term="philosophy" /><summary type="html"><![CDATA[Random thoughts on ChatGPT]]></summary></entry><entry><title type="html">Custom training loops with Pytorch</title><link href="https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops.html" rel="alternate" type="text/html" title="Custom training loops with Pytorch" /><published>2022-09-25T00:00:00+00:00</published><updated>2022-09-25T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#fit-quadratic-regression-model-to-data-by-minimizing-mse" id="markdown-toc-fit-quadratic-regression-model-to-data-by-minimizing-mse">Fit quadratic regression model to data by minimizing MSE</a>    <ul>
      <li><a href="#generate-training-data" id="markdown-toc-generate-training-data">Generate training data</a></li>
      <li><a href="#define-a-model-with-trainable-parameters" id="markdown-toc-define-a-model-with-trainable-parameters">Define a model with trainable parameters</a></li>
      <li><a href="#define-a-custom-loss-function" id="markdown-toc-define-a-custom-loss-function">Define a custom loss function</a></li>
      <li><a href="#define-a-custom-training-loop" id="markdown-toc-define-a-custom-training-loop">Define a custom training loop</a></li>
      <li><a href="#run-the-custom-training-loop" id="markdown-toc-run-the-custom-training-loop">Run the custom training loop</a></li>
      <li><a href="#final-results" id="markdown-toc-final-results">Final results</a></li>
    </ul>
  </li>
</ul>

<h2 id="introduction">Introduction</h2>
<p><a href="https://ekamperi.github.io/mathematics/2020/12/20/tensorflow-custom-training-loops.html">In a previous post</a>, we saw a couple of examples on how to construct a linear regression model, define a custom loss function, have Tensorflow automatically compute the gradients of the loss function with respect to the trainable parameters, and then update the model’s parameters. We will do the same in this post, but we will use PyTorch this time. It’s been a while since I wanted to switch from Tensorflow to Pytorch, and what better way than start from the basics?</p>

<h2 id="fit-quadratic-regression-model-to-data-by-minimizing-mse">Fit quadratic regression model to data by minimizing MSE</h2>
<h3 id="generate-training-data">Generate training data</h3>
<p>First, we will generate some data coming from a quadratic model, i.e., \(y = a x^2 + b x + c\), and we will add some noise to make the setup a bit more realistic.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="k">def</span> <span class="nf">generate_dataset</span><span class="p">(</span><span class="n">npts</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">npts</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="mi">20</span><span class="o">*</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">5</span><span class="o">*</span><span class="n">x</span> <span class="o">-</span> <span class="mi">3</span>
    <span class="n">y</span> <span class="o">+=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">npts</span><span class="p">)</span>  <span class="c1"># Add some noise
</span>    <span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>

<span class="n">x</span><span class="p">,</span> <span class="n">y_true</span> <span class="o">=</span> <span class="n">generate_dataset</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_true</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'$y_{true}$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Dataset'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/dataset.png" alt="Dataset for regression" />
</p>

<h3 id="define-a-model-with-trainable-parameters">Define a model with trainable parameters</h3>
<p>In this step, we define the model, specifically \(y = f(x) = a x^2 + b x + c\). Given the model’s parameters \(a, b, c\) and an input tensor \(x\), we calculate the output tensor \(y_\text{pred}\):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
    <span class="s">"""Calculate the model's output given a set of parameters and input x"""</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">params</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">c</span></code></pre></figure>

<h3 id="define-a-custom-loss-function">Define a custom loss function</h3>
<p>Here we define a custom loss function that calculates the mean squared error between the model’s predictions and the actual target values in the dataset.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mse</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="p">):</span>
    <span class="s">"""Returns the mean squared error between y_pred and y_true tensors"""</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">y_pred</span> <span class="o">-</span> <span class="n">y_true</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span></code></pre></figure>

<p>We then assign some initial random values to the parameters \(a, b, c\), and also tell PyTorch that we want it to compute the gradients for this tensor (the parameters tensor).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">params</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">).</span><span class="n">requires_grad_</span><span class="p">()</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span></code></pre></figure>

<p>Here is a helper function that draws the predictions and the actual targets in the same plot. Before training the model, we expect a considerable discrepancy between the two.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plot_pred_vs_true</span><span class="p">(</span><span class="n">title</span><span class="p">):</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_true</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'y_true'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.75</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">label</span><span class="o">=</span><span class="s">'y_pred'</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span>

<span class="n">plot_pred_vs_true</span><span class="p">(</span><span class="s">'Before training'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/before_training.png" alt="Regression with Pytorch" />
</p>

<h3 id="define-a-custom-training-loop">Define a custom training loop</h3>
<p>This is the heart of our setup. Given the old values of the model’s parameters, we construct a function that calculates the model’s predictions, measures how much they deviate from the actual targets, and updates the parameters via <a href="https://ekamperi.github.io/machine%20learning/2019/07/28/gradient-descent.html">gradient descent</a>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">apply_step</span><span class="p">():</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="mf">1e-3</span>                                   <span class="c1"># Set learning rate to 0.001
</span>    <span class="n">y_pred</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>                       <span class="c1"># Calculate the y given x and a set of parameters' values
</span>    <span class="n">loss</span> <span class="o">=</span> <span class="n">mse</span><span class="p">(</span><span class="n">y_pred</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="o">=</span><span class="n">y_true</span><span class="p">)</span>    <span class="c1"># Calculate the loss between y_pred and y_true
</span>    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>                             <span class="c1"># Calculate the gradient of loss tensor w.r.t. graph leaves
</span>    <span class="n">params</span><span class="p">.</span><span class="n">data</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">params</span><span class="p">.</span><span class="n">grad</span><span class="p">.</span><span class="n">data</span>        <span class="c1"># Update parameters' values using gradient descent
</span>    <span class="n">params</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">None</span>                          <span class="c1"># Zero grad since backward() accumulates by default gradient in leaves
</span>    <span class="k">return</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>                  <span class="c1"># Return the y_pred, along with the loss as a standard Python number</span></code></pre></figure>

<h3 id="run-the-custom-training-loop">Run the custom training loop</h3>
<p>We repeatedly apply the previous step until the training process converges to a particular combination of \(a, b, c\).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">epochs</span> <span class="o">=</span> <span class="mi">15000</span>
<span class="n">history</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
    <span class="n">y_pred</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">apply_step</span><span class="p">()</span>
    <span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">history</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Epoch'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'MSE vs. Epoch'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/history.png" alt="History of MSE loss" />
</p>
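<p>As a sanity check (not part of the original figures), we can also inspect the learned parameters and the final loss. Here is a self-contained sketch of the whole pipeline; the exact numbers depend on the random seed, but the final MSE should approach the noise floor, and \(a, b, c\) should drift toward the true values \(20, 5, -3\):</p>

```python
import torch

torch.manual_seed(0)  # Arbitrary seed, for reproducibility

# Regenerate the dataset so the snippet runs on its own
x = torch.linspace(0, 1, 100)
y_true = 20 * x**2 + 5 * x - 3 + torch.randn(100)

params = torch.randn(3, requires_grad=True)
lr, epochs = 1e-3, 15000
for _ in range(epochs):
    y_pred = params[0] * x**2 + params[1] * x + params[2]
    loss = ((y_pred - y_true)**2).mean()
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad   # Gradient descent step
    params.grad = None               # Zero the accumulated gradients

a, b, c = params.tolist()
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, final MSE={loss.item():.3f}")
```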

<h3 id="final-results">Final results</h3>
<p>Finally, we superimpose the dataset with the best quadratic regression model PyTorch converged to:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plot_pred_vs_true</span><span class="p">(</span><span class="s">'After training'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/after_training.png" alt="Regression with Pytorch" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="mathematics" /><category term="neural networks" /><category term="pytorch" /><category term="statistics" /><summary type="html"><![CDATA[How to create custom training loops with Pytorch]]></summary></entry><entry><title type="html">Applications of autoencoders</title><link href="https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders.html" rel="alternate" type="text/html" title="Applications of autoencoders" /><published>2022-09-17T00:00:00+00:00</published><updated>2022-09-17T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#applications-of-autoencoders" id="markdown-toc-applications-of-autoencoders">Applications of autoencoders</a>    <ul>
      <li><a href="#dimensionality-reduction" id="markdown-toc-dimensionality-reduction">Dimensionality reduction</a></li>
      <li><a href="#feature-extraction" id="markdown-toc-feature-extraction">Feature extraction</a></li>
      <li><a href="#object-matching" id="markdown-toc-object-matching">Object matching</a></li>
      <li><a href="#denoising" id="markdown-toc-denoising">Denoising</a></li>
      <li><a href="#anomaly-detection" id="markdown-toc-anomaly-detection">Anomaly detection</a></li>
      <li><a href="#synthetic-data-generation" id="markdown-toc-synthetic-data-generation">Synthetic data generation</a></li>
      <li><a href="#data-imputation" id="markdown-toc-data-imputation">Data imputation</a></li>
      <li><a href="#image-colorization" id="markdown-toc-image-colorization">Image colorization</a></li>
    </ul>
  </li>
</ul>

<h2 id="introduction">Introduction</h2>
<p>Hello, world! It’s been nine months since my last post! I was so engaged working at Chronicles Health that I couldn’t find time to reserve for blogging. However, the previous week was my last one there. Now I’ll wear my medical hat again and work as a <a href="https://en.wikipedia.org/wiki/Radiation_therapy">radiation oncology consultant</a>, hopefully enjoying a more predictable work schedule. I will probably write a blog post about my experience of working at a startup. But for now, all I wanted was to make a soft comeback by writing a short post on <strong>the applications of autoencoders</strong>, one of my favorite machine learning topics.</p>

<p>In the future, I expect to find time to expand on these topics via separate posts, with in-depth analysis and coding examples.</p>

<h2 id="applications-of-autoencoders">Applications of autoencoders</h2>
<h3 id="dimensionality-reduction">Dimensionality reduction</h3>
<p><a href="https://ekamperi.github.io/machine%20learning/2021/01/21/encoder-decoder-model.html">We have already used autoencoders as a dimensionality reduction technique before</a>, and judging from Google Analytics, this post has been quite a success! So, the idea here is to compress the input
by learning some efficient low-dimensional data representation encoded onto the latent layer. To the extent that we accomplish that, we can then replace the original input \(x\) with the new \(x_\text{latent}\),
just like we can replace \(x\) with the first couple of principal components when doing PCA.</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/autoencoder/autoencoder_schematic.png" alt="Schematic representation of an autoencoder" />
</p>

<p>As it turns out, though, there are quite a few more applications that we will present here briefly.</p>
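<p>To make the architecture concrete, here is a minimal, hypothetical autoencoder sketch in PyTorch. The sizes (a 784-dimensional input, e.g., a flattened 28×28 image, squeezed down to a 2D latent space through illustrative hidden layers) are assumptions of mine, not prescriptions:</p>

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """A toy autoencoder: input -> 2-D latent space -> reconstruction."""
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # x -> x_latent
        return self.decoder(z)     # x_latent -> reconstruction of x

model = AutoEncoder()
x = torch.randn(8, 784)            # A dummy batch of 8 "images"
x_latent = model.encoder(x)        # The low-dimensional representation
x_recon = model(x)
print(x_latent.shape, x_recon.shape)  # torch.Size([8, 2]) torch.Size([8, 784])
```

<p>Training it amounts to minimizing the reconstruction loss between <code class="language-plaintext highlighter-rouge">x_recon</code> and <code class="language-plaintext highlighter-rouge">x</code>; dimensionality reduction then means keeping <code class="language-plaintext highlighter-rouge">x_latent</code> in place of <code class="language-plaintext highlighter-rouge">x</code>.</p>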

<h3 id="feature-extraction">Feature extraction</h3>
<p>To the uninitiated, feature extraction is the process of transforming some data so that the new variables are more informative and less redundant than the original ones. Also, the new derived values (features) can hopefully differentiate between different classes of things in a classification task or predict some target value in a regression task. This application is tightly related to dimensionality reduction. Here’s how we do it. We take raw (unlabelled) data and train an autoencoder with it to force the model to learn efficient data representations (the so-called latent space). Once we have trained the autoencoder network, we <strong>ignore the decoder part of the model</strong>. Instead, we use only the encoder to convert new raw input data into the latent space representation. This new representation can then be used for supervised learning tasks. So, instead of training a supervised model to learn how to map \(x\) to \(y\), we ask it to map \(x_\text{latent}\) to \(y\).</p>

<h3 id="object-matching">Object matching</h3>
<p>Again, this application is connected to the previous one. Say we’d like to build a search engine for images or songs. We could save all the items in a database and then go through each one, comparing it with our target. But that would be very time-consuming if we did the comparison pixel-by-pixel (or beat-by-beat). Instead, we could run the entire thing in the latent space. Concretely, we would first pass all the known images (or songs) through a trained autoencoder and save their latent space representation (which, by definition, is low-dimensional and cheap!) in a database. The position of the input in the latent space is akin to a “signature”. Assuming we used a 2D latent space, every song in the database would be characterized by just two numbers! Then, given an image (or song) to search for, we would convert it into a latent space representation (again, two numbers), and <em>then</em> we would search the database for it. The comparison could be made via, for instance, the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> between the target and the \(i\)-th element in the database. The rationale is that operating in the low-dimensional latent space is much more economical, computation-wise, than in the original high-dimensional space. What if this method doesn’t work? Well, we could try increasing the latent space dimensionality from 2D to 3D and try again, until we find the minimum number of latent dimensions that suffices to separate the images (or songs) in our database.</p>

<p>To be a bit more concrete, this is the hypothetical database of known songs along with their latent encoding:</p>

<table>
  <thead>
    <tr>
      <th>Song name</th>
      <th>Coordinate of latent dim 1</th>
      <th>Coordinate of latent dim 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Enter Sandman</td>
      <td>0.65</td>
      <td>0.12</td>
    </tr>
    <tr>
      <td>Fear of the Dark</td>
      <td>0.44</td>
      <td>0.99</td>
    </tr>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>Land of the free</td>
      <td>0.81</td>
      <td>0.03</td>
    </tr>
  </tbody>
</table>

<p>And suppose we are given an unknown song with \(\text{coord latent dim}_1 = 0.45, \text{coord latent dim}_2 = 0.97\). We would then calculate its distance from every song in the database and pick the one with the minimum distance. Neat, right?</p>
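<p>The lookup itself is a one-liner. Here is a sketch using the (made-up) signatures from the table above:</p>

```python
import math

# The hypothetical song database from the table above:
# song name -> 2-D latent "signature"
database = {
    "Enter Sandman":    (0.65, 0.12),
    "Fear of the Dark": (0.44, 0.99),
    "Land of the free": (0.81, 0.03),
}

def nearest_song(query, db):
    """Return the song whose latent signature is closest (Euclidean) to query."""
    return min(db, key=lambda name: math.dist(query, db[name]))

print(nearest_song((0.45, 0.97), database))  # -> Fear of the Dark
```

<p>In a real system, the database would be larger and the latent space possibly higher-dimensional, but the principle is identical.</p>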

<h3 id="denoising">Denoising</h3>
<p>Autoencoders can be trained in such a way that they learn how to perform efficient denoising of the source. Contrary to conventional denoising techniques, they do not actively look for noise in the data. Instead, they extract the source from the noisy input by learning a representation of it. The representation is subsequently used to reconstruct the input as noise-free data. A concrete example is training an autoencoder to remove noise from images. The key to accomplishing this is to take the training images, <em>add some noise</em> to them, and use them as the \(x\). Then use the original images (without the noise) as the \(y\). So, to put it a bit more formally, we are asking the network to learn the mapping \((x+\text{noise}) \to x\). The following figure is taken from Keras’s documentation on autoencoders. The upper row consists of the original untainted images (the \(y\)), and the lower row contains the images with some noise added by us (the \(x\)).</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/noisy_digits.png" alt="Noisy digits for training a denoising autoencoder" />
</p>
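<p>Constructing the training pairs is straightforward. Here is a toy Python sketch, using plain lists of pixel intensities instead of real images, and an assumed noise level of 0.2:</p>

```python
import random

random.seed(0)

# Toy stand-ins for flattened training images: pixel intensities in [0, 1].
clean = [[random.random() for _ in range(784)] for _ in range(10)]

def corrupt(image, noise_level=0.2):
    """Add Gaussian noise to every pixel and clip back to [0, 1]."""
    return [min(1.0, max(0.0, p + random.gauss(0.0, noise_level))) for p in image]

# Training pairs for the denoising autoencoder: the network sees noisy[i]
# as input (the x) and is asked to output clean[i] (the y), i.e. it learns
# the mapping (x + noise) -> x.
noisy = [corrupt(img) for img in clean]
```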

<h3 id="anomaly-detection">Anomaly detection</h3>
<p>Since autoencoders are trained to reconstruct their input as well as they can, naturally, if they are given an <em>out of distribution</em> example, the reconstruction will not be as good as if this example was <em>from the training distribution</em>. So, by using some proper threshold for the reconstruction loss, one can build an anomaly detector: any outlier \(x\) will be reconstructed as \(x'\), where \(\left|x' - x\right| \gt \text{thresh}\).</p>
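<p>A sketch of this thresholding logic, with a stand-in for the trained autoencoder (the actual model is beside the point here, so we fake one that maps every input to the training mean):</p>

```python
def reconstruction_error(x, x_prime):
    """Mean squared reconstruction error between x and its reconstruction x'."""
    return sum((a - b) ** 2 for a, b in zip(x, x_prime)) / len(x)

def is_anomaly(x, autoencoder, thresh):
    """Flag x as an outlier if the autoencoder reconstructs it poorly."""
    return reconstruction_error(x, autoencoder(x)) > thresh

# Stand-in "autoencoder" for illustration only: it maps every input to the
# mean of some training data. A real trained model would go here.
train_mean = [0.5, 0.5, 0.5]
fake_autoencoder = lambda x: train_mean

print(is_anomaly([0.51, 0.49, 0.50], fake_autoencoder, thresh=0.1))  # False
print(is_anomaly([5.00, -3.00, 9.00], fake_autoencoder, thresh=0.1))  # True
```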

<h3 id="synthetic-data-generation">Synthetic data generation</h3>
<p>Variational autoencoders can generate new synthetic data, primarily images but also time series. The way to do this is by first training an autoencoder with some data and then <em>randomly sampling the latent space</em> of the autoencoder. These random samples are then handed over to the decoder part of the network, leading to new data generation. The following image shows the results of sampling an autoencoder trained on the MNIST dataset. These digits do not exist in the training dataset; they are <em>generated</em> by the network.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/latent_sample.png" alt="Image generation with variational autoencoders" />
</p>

<p>Variational autoencoders differ from vanilla autoencoders because the network learns a (typically) normal <strong>distribution for the latent vectors</strong>. This acts as some sort of <em>regularization</em> since autoencoders tend to memorize their input.</p>

<h3 id="data-imputation">Data imputation</h3>
<p>This is similar to the previous application. The idea here is to take a dataset <em>without</em> any missing entries and randomly <em>delete</em> some of the values in some of the columns, pretending they are missing. However, we know the ground-truth values and train the autoencoder to output those. Once trained, we can present a <em>really missing</em> entry to the network, and assuming that it has been trained robustly, it should perform efficient imputation. Again, to be a bit more concrete, given a dataset with \(x\) values <em>without</em> any missing data, we artificially remove some values and then train an autoencoder to learn the mapping \(x_{\text{missing}} \to x\).</p>
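<p>A toy Python sketch of the masking step; the dataset size, masking probability, and fill value are all made up for illustration:</p>

```python
import random

random.seed(42)

# Complete toy dataset: 8 rows of 5 numeric features, no missing values.
full_rows = [[random.random() for _ in range(5)] for _ in range(8)]

def mask_row(row, p_missing=0.3, fill=0.0):
    """Randomly hide some values, replacing them with `fill`."""
    return [fill if random.random() < p_missing else v for v in row]

# Training pairs for the imputing autoencoder: x_missing -> x.
# The ground truth (the unmasked row) is the training target.
pairs = [(mask_row(row), row) for row in full_rows]
```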

<h3 id="image-colorization">Image colorization</h3>
<p>Image colorization is the process of assigning colors to a grayscale image.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/colorized_einstein.jpeg" alt="Colorized Albert Einstein" />
</p>

<p>This task can be achieved by taking a dataset with colored images and creating a new dataset with pairs of grayscale and colored images. We then train an autoencoder to learn the mapping \(x_\text{grayscale} \to x_\text{colored}\).</p>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="mathematics" /><category term="neural networks" /><category term="statistics" /><summary type="html"><![CDATA[A high-level summary of autoencoders' applications]]></summary></entry><entry><title type="html">The joy of not google’ing: Short to long stick ratio in broken rods</title><link href="https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio.html" rel="alternate" type="text/html" title="The joy of not google’ing: Short to long stick ratio in broken rods" /><published>2021-12-20T00:00:00+00:00</published><updated>2021-12-20T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio.html"><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Hola! Long time no see! In the past months, I’ve been swamped working as a machine learning engineer at <a href="https://www.chronicles.health/">Chronicles Health</a>, a digital health company, on a course to revolutionize the management of inflammatory bowel disease.</p>

<p>But I’m back! However, today’s post won’t cover some fancy machine learning algorithm or data science topic. Instead, let me tell you about a neat little problem I found on the Internet (credits to <a href="https://www.wikiwand.com/en/Gianni_A._Sarcone">Gianni Sarcone</a>). It turns out that, like many people, I’ve become extremely good at googling stuff but less so at thinking for myself. So, I decided to solve this cute little puzzle in the traditional “analog” way, with pen and paper, without any online help :) And as a matter of fact, I encourage you to do the same. Every now and then, try to solve a relatively simple science problem without referencing online resources. If you need some formula or a theorem, look it up in a paper book, seriously. You will be amazed at how beneficial this approach will be to your problem-solving skills.</p>

<h3 id="problem-statement">Problem statement</h3>
<p style="border:2px; border-style:solid; border-color:#1C6EA4; border-radius: 5px; padding: 20px;">
Suppose that we throw 10,000 rods against a rock, and they break at random places. What is the average ratio of the length of the short piece to the length of the long piece?
</p>

<h3 id="solution">Solution</h3>
<p>We start by modeling the problem, which is probably the most critical part of the problem-solving process. The way we set it up will largely define the next steps. So, we need to assign symbols to the various components involved. There’s a rod and two pieces, a <em>short</em> and a <em>long</em> one. Let’s say that the rod has length \(L\). Then, if we agree that the short piece is of size \(x\), the remainder will be the long one, with length \(L-x\). Mind that \(x\) is not fixed; it’s a random variable since we have 10,000 rods, and so is \(L-x\).</p>

<p align="center">
 <img style="width: 60%; height: 60%" src="https://ekamperi.github.io/images/short-to-long-stick-sketch.png" alt="Average short to long stick ratio" />
</p>

<p>Every time we translate the statement of a problem into mathematical symbols and expressions, we need to constrain the values that our variables assume so that our setup always “makes sense”. Since \(x\) is the short part, it really can’t be larger than half the rod, because then it would be the long one! So, \(x\in[0,L/2]\). Also, \(L&gt;0\), or there wouldn’t be any rod to begin with. So, we are interested in the average ratio of the short to long pieces, i.e.:</p>

\[\text{avg. ratio} = \left\langle x/(L-x)\right\rangle\]

<p>At this point, we need to invoke the concept of <a href="https://www.wikiwand.com/en/Expected_value"><strong>expected value</strong></a>. The expected value of a random variable \(X\), often denoted \(\mathbb{E}[X]\), can be thought of as a generalized version of the weighted average, where the weights are given by the probabilities. Consider, for example, a fair die: the probability of each outcome is \(p=1/6\), and the expected value after many throws is given by \(1 \times 1/6 + 2 \times 1/6 + \ldots + 6 \times 1/6 = 7/2\). This is easily demonstrated by simulating, say, 1,000 throws and taking the mean of the outcomes:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">Mean</span><span class="o">@</span><span class="nb">RandomInteger</span><span class="p">[{</span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">6</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="m">1000</span><span class="p">]</span><span class="w"> </span><span class="o">//</span><span class="w"> </span><span class="nb">N</span><span class="w">
</span><span class="c">(* 3.58 *)</span></code></pre></figure>

<p>Alright, back to our problem! Here we don’t throw dice. Instead, we crack rods and look at the number \(x/(L-x)\). To calculate the <em>expected value</em> of this ratio, we write:</p>

\[\begin{align*}
\mathbb{E}(x/(L-x)) = \int_{0}^{L/2} \frac{x}{L-x} p(x) \mathrm{d}x
\end{align*}\]

<p>Where \(x/(L-x)\) is the <em>value of the ratio</em> when the rod breaks at short length \(x\), and \(p(x)\) is the <em>probability</em> of this particular break happening. We assume that the rod is equally likely to break at any point \(x\), since the problem doesn’t state any specific probability distribution. In <a href="https://ekamperi.github.io/mathematics/2021/01/29/why-is-normal-distribution-so-ubiquitous.html#information-theoretic-arguments">another blog post</a> I talk about how the uniform distribution is maximally noncommittal with respect to missing information. Check it out! The information-theoretic arguments are so mind-opening.</p>

<p>Therefore, \(p(x) = 1/(L/2)=2/L\). Does this make sense? Yes, because the longer the rod, the less probable it is for a <em>particular</em> break of short length \(x\) to happen. Imagine if we had a die with 1,000,000 faces; what would be the probability of getting the number “3” after a throw? 1/1,000,000. What if it was a regular one with 6 faces? The probability would be 1/6.</p>

\[\begin{align*}
\mathbb{E}(x/(L-x)) = \int_{0}^{L/2} \left( \frac{x}{L-x} \cdot\frac{2}{L} \right) \mathrm{d}x = 
2\int_{0}^{L/2} \frac{x}{L(L-x)} \mathrm{d}x 
\end{align*}\]

<p>From this point onwards, it’s just about computing the integral. Such integrals are usually calculated by breaking up the fraction into a sum of simple fractions, e.g.,</p>

\[\frac{x}{L(L-x)}=\frac{A}{L} + \frac{B}{L-x}\]

<p>and solving for \(A, B\). Since this is a simple one, we could just see that:</p>

\[\frac{x}{L(L-x)}=-\frac{1}{L} + \frac{1}{L-x}\]

<p>Therefore:</p>

\[\begin{align*}
\mathbb{E}(x/(L-x))
&amp;= 2\int_{0}^{L/2} \left( -\frac{1}{L} + \frac{1}{L-x} \right) \mathrm{d}x\\
&amp;= -\frac{2}{L} \left(\frac{L}{2}-0\right) - 2\left[\ln{(L-x)}\right]_{0}^{L/2}\\
&amp;=-1 - 2\left[\ln\left({L}-\frac{L}{2}\right) - \ln{L}\right]\\
&amp;=-1-2(\ln{1/2}) = -1+\ln{4}
\end{align*}\]
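<p>As a quick sanity check of the closed-form result, we can also approximate the integral numerically; here is a small Python sketch using the midpoint rule with \(L=1\):</p>

```python
import math

# Numerically check the closed-form result: for L = 1,
#   E[x/(L-x)] = (2/L) * integral_0^{L/2} x/(L-x) dx  =  ln4 - 1 ~ 0.386
L = 1.0
n = 100_000          # number of midpoint-rule subdivisions
h = (L / 2) / n      # width of each subdivision

integral = (2 / L) * h * sum(
    ((k + 0.5) * h) / (L - (k + 0.5) * h) for k in range(n)
)

print(integral, math.log(4) - 1)  # both ~ 0.3863
```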

<h3 id="simulation">Simulation</h3>
<p>Here is a simple simulation in <em>Mathematica</em> for a rod of length \(L=1\). Notice how the average ratio converges to \(-1 + \ln{4} \simeq 0.386\).</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">L</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">;</span><span class="w">
</span><span class="nv">f</span><span class="p">[</span><span class="nv">x</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nv">x</span><span class="o">/</span><span class="p">(</span><span class="nv">L</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">x</span><span class="p">)</span><span class="w">
</span><span class="nv">sim</span><span class="p">[</span><span class="nv">n</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="nb">Mean</span><span class="p">[</span><span class="w">
  </span><span class="nv">f</span><span class="w"> </span><span class="o">/@</span><span class="w"> </span><span class="nb">RandomReal</span><span class="p">[{</span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="nv">L</span><span class="o">/</span><span class="m">2</span><span class="p">}</span><span class="o">,</span><span class="w">   </span><span class="nv">n</span><span class="p">]</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="nb">ListPlot</span><span class="p">[</span><span class="w">
 </span><span class="nb">Table</span><span class="p">[{</span><span class="nv">n</span><span class="o">,</span><span class="w"> </span><span class="nv">sim</span><span class="p">[</span><span class="nv">n</span><span class="p">]}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">n</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">20000</span><span class="o">,</span><span class="w"> </span><span class="m">1000</span><span class="p">}]</span><span class="o">,</span><span class="w"> </span><span class="nb">Joined</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">InterpolationOrder</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotRange</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">All</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"# of throws"</span><span class="o">,</span><span class="w"> </span><span class="s">"Value of ratio"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">GridLines</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Automatic</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotRange</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">All</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 60%; height: 60%" src="https://ekamperi.github.io/images/short-stick-ratio.png" alt="Average short to long stick ratio" />
</p>

<h3 id="stuff-to-think-about">Stuff to think about</h3>
<ul>
  <li>Why is the result <em>independent</em> of the length \(L\)? Is there any intuitive answer to this?</li>
  <li>Why was it enough to integrate from \(x=0\) to \(x=L/2\) and not do something like:</li>
</ul>

\[\int_0^{L/2} \left(\frac{x}{L-x} \cdot \frac{1}{L} \right) \mathrm{d}x + \int_{L/2}^{L} \left(\frac{L-x}{x} \cdot \frac{1}{L} \right) \mathrm{d}x\]

<p>Is there any <em>symmetry</em> in the problem that allows us to shortcut it? (Always look for symmetries!)</p>
<ul>
  <li>What would happen if the probability of the rod breaking at some point weren’t the same along the rod? Say, because the rod was weaker as we moved toward its left end. How would this affect the symmetry of the initial problem?</li>
</ul>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="mathematics" /><summary type="html"><![CDATA[How to calculate the average short-to-long stick ratio when breaking rods at random points.]]></summary></entry><entry><title type="html">The expectation-maximization algorithm - Part 1</title><link href="https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1.html" rel="alternate" type="text/html" title="The expectation-maximization algorithm - Part 1" /><published>2021-07-03T00:00:00+00:00</published><updated>2021-07-03T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a>    <ul>
      <li><a href="#what-is-em-about" id="markdown-toc-what-is-em-about">What is EM about?</a>        <ul>
          <li><a href="#maximum-likelihood-estimation-mle" id="markdown-toc-maximum-likelihood-estimation-mle">Maximum likelihood estimation (MLE)</a></li>
          <li><a href="#-in-the-presence-of-hidden-variables" id="markdown-toc--in-the-presence-of-hidden-variables">… in the presence of hidden variables</a></li>
          <li><a href="#what-are-the-basic-steps-of-em" id="markdown-toc-what-are-the-basic-steps-of-em">What are the basic steps of EM?</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#a-1-dimensional-example" id="markdown-toc-a-1-dimensional-example">A 1-dimensional example</a>    <ul>
      <li><a href="#setting-up-the-problem" id="markdown-toc-setting-up-the-problem">Setting up the problem</a></li>
      <li><a href="#writing-down-the-likelihood-function" id="markdown-toc-writing-down-the-likelihood-function">Writing down the likelihood function</a></li>
      <li><a href="#brute-forcing-one-parameter-at-a-time" id="markdown-toc-brute-forcing-one-parameter-at-a-time">Brute forcing one parameter at a time</a></li>
      <li><a href="#reformulating-the-problem-as-a-latent-variable-problem" id="markdown-toc-reformulating-the-problem-as-a-latent-variable-problem">Reformulating the problem as a latent variable problem</a></li>
      <li><a href="#em-algorithm" id="markdown-toc-em-algorithm">EM algorithm</a></li>
    </ul>
  </li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h1 id="introduction">Introduction</h1>
<h2 id="what-is-em-about">What is EM about?</h2>
<h3 id="maximum-likelihood-estimation-mle">Maximum likelihood estimation (MLE)</h3>
<p>The expectation-maximization (EM) algorithm is an iterative method to find the local <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood</a> of parameters in statistical models. So what is the maximum likelihood? It’s the maximum value of the likelihood function! And <strong>what is a likelihood function?</strong> It’s a function of the model’s parameters treating the observed data as fixed points, i.e., we write \(\mathcal{L}(\theta\mid x)\) meaning that we vary the parameters \(\theta\) while taking the \(x\)’s as given. If \(\mathcal{L}(\theta_1\mid x) &gt; \mathcal{L}(\theta_2 \mid x)\) then the sample we observed is more likely to have occurred if \(\theta = \theta_1\) rather than if \(\theta = \theta_2\). So, given the data that we have observed, the likelihood function points to a model’s most plausible parameterization that might have generated the observed data.</p>

<p>Here is an elementary example. Suppose that we have some data and want to fit a model of the form \(y = a x\). In this case, \(\theta\) is essentially the coefficient \(a\), but usually, there will be many unknown parameters. In the left image, there’s the likelihood function for several values of the parameter \(a\) (actually, it’s the logarithm of the likelihood function, but we will talk about this later). In the right image, we plot \(y = a x, \, a = -3, \ldots 7\) with a step size of 0.5, superimposed with the observed data. As you can see, \(a = 2\) maximizes the log-likelihood <em>and</em> fits the data better than any other line. So, <strong>fitting data to models can be done via maximum likelihood estimation</strong>.</p>
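<p>The grid search from the figure can be sketched in a few lines of Python; the data here are synthetic, generated from \(y = 2x\) plus Gaussian noise with an assumed \(\sigma = 1\):</p>

```python
import math
import random

random.seed(0)

# Synthetic observations from the "true" model y = 2x, plus unit Gaussian noise.
xs = [k / 10 for k in range(1, 51)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

def log_likelihood(a, sigma=1.0):
    """Gaussian log-likelihood of the data under the candidate model y = a*x."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (y - a * x) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

# Grid of candidate slopes, a = -3, -2.5, ..., 7, as in the figure.
grid = [-3 + 0.5 * k for k in range(21)]
best = max(grid, key=log_likelihood)
print(best)  # the maximizer lands at the grid point nearest the true slope
```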

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/linear_regression_mle.png" alt="Log likelihood of linear regression model" />
</p>

<p>By the way, in a <a href="https://ekamperi.github.io/mathematics/2020/12/20/tensorflow-custom-training-loops.html#how-is-mean-squared-error-related-to-log-likelihood">previous blog post</a> we have proven that by <strong>maximizing the likelihood in the linear regression case, this is equivalent to minimizing the mean squared error</strong>.</p>

<h3 id="-in-the-presence-of-hidden-variables">… in the presence of hidden variables</h3>
<p>The EM algorithm is particularly useful when there are missing data in the data set or when the model depends on <strong>hidden</strong> or so-called <a href="https://en.wikipedia.org/wiki/Latent_variable"><strong>latent variables</strong></a>. These are variables that affect our observed data but in ways that we can’t know directly. So what’s so special about latent parameters? Typically, if we know all the parameters, we can take the derivatives of the likelihood function with respect to them, solve the system of equations and find the values that maximize the likelihood. Like:</p>

\[\left\{\frac{\partial \mathcal{L}}{\partial \theta_1}=0, \frac{\partial \mathcal{L}}{\partial \theta_2}=0, \ldots \right\}\]

<p>This is precisely what we did when we wanted to <a href="https://ekamperi.github.io/mathematics/2020/12/26/tensorflow-trainable-probability-distributions.html">fit some data to a normal distribution</a>. However, in statistical models with latent variables, this typically results in a set of equations where the solutions to the parameters mandate the values of the latent variables and vice versa. By substituting one set of equations into the other, an unsolvable equation is produced. That’s why we need the expectation-maximization algorithm. Concretely, EM can be used in any of the following scenarios:</p>

<ul>
  <li>Estimating parameters of (usually Gaussian) mixture models</li>
  <li>Estimating parameters of Hidden Markov Models</li>
  <li>Unsupervised learning of clusters</li>
  <li>Filling missing data in samples</li>
</ul>

<h3 id="what-are-the-basic-steps-of-em">What are the basic steps of EM?</h3>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/EM_schematic.png" alt="Expectation-Maximization algorithm schematic" />
</p>

<p>EM takes its name from the alternation between two algorithmic steps. The first is the <strong>expectation step</strong>, where we form a function for the expectation of the log-likelihood, using the current best estimates of the model’s parameters. In the <strong>maximization step</strong>, we calculate new parameter values by maximizing the expected log-likelihood. These new estimates of the parameters are then used to determine the distribution of the latent variables in the next expectation step. Don’t worry if it doesn’t make sense now; we will show an example in a minute, and we will also delve into it in subsequent blog posts.</p>

<h1 id="a-1-dimensional-example">A 1-dimensional example</h1>
<h2 id="setting-up-the-problem">Setting up the problem</h2>
<p>Let us consider some observed 1-dimensional data points, \(x_i\). We assume they are generated by <em>two</em> normal distributions \(N(\mu_1, \sigma_1^2)\) and \(N(\mu_2, \sigma_2^2)\), with probabilities \(\pi\) and \(1-\pi\), respectively. In this setup, we have 5 unknown parameters: the mixing probability \(\pi\), the mean and standard deviation of the first distribution, and the mean and standard deviation of the second distribution. Let us gather all these under a vector called \(\theta = [\pi, \mu_1, \sigma_1, \mu_2, \sigma_2]\).</p>

<p align="center">
 <img style="width: 90%; height: 90%" src="https://ekamperi.github.io/images/em_algorithm/histogram_broken_by_dist.png" alt="Histogram of mixed gaussian distribution" />
</p>

<h2 id="writing-down-the-likelihood-function">Writing down the likelihood function</h2>
<p>Suppose that we observed a datapoint with value \(x_i\). What is the probability of \(x_i\) occurring? Assuming \(\varphi_1(x)\) is the <a href="https://en.wikipedia.org/wiki/Probability_density_function">probability density function</a> of the 1st distribution, and \(\varphi_2(x)\) of the second, the probability of observing \(x_i\) is:</p>

\[p(x_i) = \pi \varphi_1(x_i) + (1-\pi)\varphi_2(x_i)\]

<p>To be more pedantic we would write:</p>

\[p(x_i\mid \theta) = \pi \varphi_1(x_i \mid \mu_1,\sigma_1^2) + (1-\pi)\varphi_2(x_i \mid \mu_2,\sigma_2^2)\]

<p>Which means that the PDFs are parameterized by \(\mu_1,\sigma_1^2\) and \(\mu_2, \sigma_2^2\), respectively. Ok, but this is just for a single observation \(x_i\). What if we have a bunch of \(x_i\)’s, say for \(i=1,\ldots,N\)? To find the joint probability of \(N\) independent events (which, by the way, is the likelihood function!) we just multiply the individual probabilities:</p>

\[\mathcal{L}(\theta \mid x) = \prod_{i=1}^N p(x_i \mid \theta)\]

<p>But since it’s easier to work with sums rather than products, we take the logarithm of the likelihood, \(\ell(\theta\mid x)\):</p>

\[\begin{align*}\ell(\theta \mid x) &amp;= \log \prod_{i=1}^N p(x_i \mid \theta) =\sum_{i=1}^N \log p(x_i \mid \theta)\\&amp;=\sum_{i=1}^N \log \left[\pi \varphi_1(x_i\mid \mu_1,\sigma_1^2) + (1-\pi)\varphi_2(x_i|\mu_2,\sigma_2^2)\right]\end{align*}\]

<p>So, our objective is to maximize likelihood \(\mathcal{L}(\theta\mid x)\), which is equivalent to maximizing the log-likelihood \(\ell(\theta\mid x)\), with respect to the model’s parameters \(\theta = [\pi, \mu_1, \sigma_1, \mu_2, \sigma_2]\), <em>given</em> the data points \(\{x_i\}\).</p>
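<p>The log-likelihood above translates almost line-by-line into code. Here is a small Python sketch (the data points and parameter values below are made up purely for illustration):</p>

```python
import math

def normal_pdf(x, mu, sigma):
    """phi(x | mu, sigma^2): the normal probability density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, p, mu1, s1, mu2, s2):
    """l(theta | x) = sum_i log[ p*phi1(x_i) + (1-p)*phi2(x_i) ]."""
    return sum(
        math.log(p * normal_pdf(x, mu1, s1) + (1 - p) * normal_pdf(x, mu2, s2))
        for x in data
    )

# Toy check: points clustered near mu=1 and mu=9 are more likely under
# well-matched component means than under a badly mis-specified one.
data = [0.5, 1.2, 0.8, 9.3, 8.7, 9.1]
good = log_likelihood(data, 0.5, 1.0, 1.0, 9.0, 1.0)
bad = log_likelihood(data, 0.5, 4.0, 1.0, 9.0, 1.0)
print(good > bad)  # True
```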

<h2 id="brute-forcing-one-parameter-at-a-time">Brute forcing one parameter at a time</h2>
<p>In the following examples, we will generate some synthetic observed data from a mixture distribution with known parameters \(\mu_1, \sigma_1, \mu_2, \sigma_2\) and mixing probability \(\pi\). We will then calculate \(\ell(\theta\mid x)\) for various values of one parameter, while keeping the rest of the parameters fixed. Each time we do that, we will see how \(\ell(\theta\mid x)\) is maximized when the parameter becomes equal to its ground-truth value.</p>

<p>Let’s create a mixture distribution of two Gaussian distributions with known parameters \(\mu_1, \sigma_1, \mu_2, \sigma_2\) and known mixing probability \(\pi=0.3\). Normally, we won’t know the values of these parameters, and as a matter of fact, <strong>finding them will be the very objective of the EM algorithm</strong>. But for now, let’s <em>pretend</em> we don’t know them.</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">ClearAll</span><span class="p">[</span><span class="s">"Global`*"</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">2</span><span class="p">}</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">9</span><span class="o">,</span><span class="w"> </span><span class="m">3</span><span class="p">}</span><span class="o">;</span><span class="w">

</span><span class="nv">npts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5000</span><span class="o">;</span><span class="w">
</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m</span><span class="o">_,</span><span class="w"> </span><span class="nv">s</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nb">NormalDistribution</span><span class="p">[</span><span class="nv">m</span><span class="o">,</span><span class="w"> </span><span class="nv">s</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="nv">mixdist</span><span class="p">[</span><span class="nv">p</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="nb">MixtureDistribution</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]}]</span><span class="w">
</span><span class="nv">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">RandomVariate</span><span class="p">[</span><span class="nv">mixdist</span><span class="p">[</span><span class="m">0.3</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="nb">Histogram</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/histogram.png" alt="Histogram of mixture distribution" />
</p>

<p>Let’s plot the probability density functions of the mixture distribution for various mixing probabilities \(\pi\). We notice how for \(\pi\to 0\) the mixture distribution approaches the 1st distribution, and for \(\pi\to 1\), the 2nd distribution. For in-between values, it’s a mixture! ;)</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">Style</span><span class="p">[</span><span class="nb">Grid</span><span class="p">[{</span><span class="w">
   </span><span class="nb">Table</span><span class="p">[</span><span class="w">
    </span><span class="nb">Plot</span><span class="p">[</span><span class="nb">PDF</span><span class="p">[</span><span class="nv">mixdist</span><span class="p">[</span><span class="nv">p</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">x</span><span class="o">,</span><span class="w"> </span><span class="o">-</span><span class="m">10</span><span class="o">,</span><span class="w"> </span><span class="m">20</span><span class="p">}</span><span class="o">,</span><span class="w"> 
     </span><span class="nb">PlotLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s">"p="</span><span class="w"> </span><span class="o">&lt;&gt;</span><span class="w"> </span><span class="nb">ToString</span><span class="o">@</span><span class="nv">p</span><span class="o">,</span><span class="w">
     </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"x"</span><span class="o">,</span><span class="w"> </span><span class="s">"PDF(x)"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
     </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w">
     </span><span class="nb">AxesOrigin</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="o">-</span><span class="m">10</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="nb">Filling</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Axis</span><span class="p">]</span><span class="o">,</span><span class="w">
    </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">0.3</span><span class="p">}]</span><span class="w">
   </span><span class="p">}]</span><span class="o">,</span><span class="w">
 </span><span class="nb">ImageSizeMultipliers</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="m">0.7</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/varying_mixing_prob.png" alt="PDF of mixture distribution for varying mixing probability" />
</p>

<p>Let us now define the log-likelihood function:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">logLikelihood</span><span class="p">[</span><span class="nv">data</span><span class="o">_,</span><span class="w"> </span><span class="nv">p</span><span class="o">_,</span><span class="w"> </span><span class="nv">m1</span><span class="o">_,</span><span class="w"> </span><span class="nv">s1</span><span class="o">_,</span><span class="w"> </span><span class="nv">m2</span><span class="o">_,</span><span class="w"> </span><span class="nv">s2</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{}</span><span class="o">,</span><span class="w">
  </span><span class="nb">Sum</span><span class="p">[</span><span class="w">
   </span><span class="nb">Log</span><span class="p">[</span><span class="w">
    </span><span class="nv">p</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="w"> </span><span class="o">/.</span><span class="w"> 
     </span><span class="nv">x</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w">
    </span><span class="p">]</span><span class="o">,</span><span class="w">
   </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nb">Length</span><span class="o">@</span><span class="nv">data</span><span class="p">}]</span><span class="w">
  </span><span class="p">]</span><span class="w">
  </span></code></pre></figure>

<p>Ok, we are ready to go. We will first vary the mixing probability \(\pi\), keeping the rest of the model’s parameters fixed. In some sense, we are brute-forcing \(\pi\): we scan a grid of candidate values and keep the one that maximizes the log-likelihood:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">llvalues</span><span class="w"> </span><span class="o">=</span><span class="w"> 
  </span><span class="nb">Table</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">logLikelihood</span><span class="p">[</span><span class="nv">data</span><span class="o">,</span><span class="w"> </span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">}]</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">pmax</span><span class="o">,</span><span class="w"> </span><span class="nv">llmax</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> 
 </span><span class="nv">llvalues</span><span class="p">[[</span><span class="nb">Ordering</span><span class="p">[</span><span class="nv">llvalues</span><span class="p">[[</span><span class="nb">All</span><span class="o">,</span><span class="w"> </span><span class="m">2</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="o">-</span><span class="m">1</span><span class="p">][[</span><span class="m">1</span><span class="p">]]]]</span><span class="w">
</span><span class="c">(* {0.3, -14437.1} *)</span><span class="w">

</span><span class="nv">plot1</span><span class="w"> </span><span class="o">=</span><span class="w">
 </span><span class="nb">Show</span><span class="p">[</span><span class="w">
  </span><span class="nb">ListPlot</span><span class="p">[</span><span class="nv">llvalues</span><span class="o">,</span><span class="w"> </span><span class="nb">Joined</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"Probability p"</span><span class="o">,</span><span class="w"> </span><span class="s">"Log-Likelihood"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">GridLines</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{{</span><span class="nv">pmax</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">llmax</span><span class="p">}}</span><span class="o">,</span><span class="w"> </span><span class="nb">GridLinesStyle</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Dashed</span><span class="p">]</span><span class="o">,</span><span class="w">
  </span><span class="nb">ListPlot</span><span class="p">[</span><span class="nv">llvalues</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotStyle</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">Red</span><span class="o">,</span><span class="w"> </span><span class="nb">AbsolutePointSize</span><span class="p">[</span><span class="m">5</span><span class="p">]}]</span><span class="w">
  </span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/log_likelihood_p.png" alt="Log likelihood for varying mixing probability" />
</p>

<p>Do you see how \(\ell(\theta\mid x)\) is maximized at \(\pi = 0.3\)? By the same token, we can try other model parameters, but we will always come to the same conclusion: <strong>the log-likelihood, therefore the likelihood, is maximized when our guesses become equal to the ground-truth values for the model’s parameters</strong>.</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/log_likelihood_combined.png" alt="Log likelihood for varying mixing probability, mean and standard deviation" />
</p>

<h2 id="reformulating-the-problem-as-a-latent-variable-problem">Reformulating the problem as a latent variable problem</h2>
<p>Previously, we varied one parameter at a time, keeping the rest at their ground-truth values. We will now get serious and seek to <strong>estimate the values of <em>all</em> parameters simultaneously</strong>. If we attempt to directly maximize \(\ell(\theta|x)\), it will be tough due to the sum of terms inside the logarithm. For those of you who doubt it, just calculate the partial derivatives of \(\ell(\theta|x)\) with respect to \(\pi, \mu_1, \sigma_1, \mu_2, \sigma_2\) and contemplate solving the system where all these derivatives are required to become zero. Good luck with that! :P</p>

<p>There’s another way to go, though. We will reformulate our problem as a problem of <strong>maximum likelihood estimation with latent variables</strong>. For this, we will introduce a set of latent variables \(\Delta_i \in \{0,1\}\), one per observation. If \(\Delta_i = 0\), then \(x_i\) was sampled from the 1st distribution; if \(\Delta_i = 1\), then it came from the 2nd distribution. In this case, the log-likelihood \(\ell(\theta\mid x,\Delta)\) is given by:</p>

\[\begin{align*}
\ell(\theta\mid x,\Delta) = &amp;\sum_{i=1}^N \left[ (1-\Delta_i) \log \varphi_1(x_i) + \Delta_i \log\varphi_2(x_i)\right] +\\
&amp;\sum_{i=1}^N \left[ (1-\Delta_i)\log\pi + \Delta_i\log(1-\pi)\right]
\end{align*}\]

<p>When we write \(\varphi_1(x_i)\), we really mean \(\varphi_1(x_i\mid \mu_1, \sigma_1^2)\), and similarly \(\varphi_2(x_i)\) stands for \(\varphi_2(x_i\mid \mu_2, \sigma_2^2)\). We omitted the parameters to keep the log-likelihood expression easily readable. Feel free to check that the above formula equals the previous expression of \(\ell(\theta\mid x)\), by first letting \(\Delta_i = 0\) and then \(\Delta_i = 1\).</p>
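<p>To see the check in action, fix some \(i\) and substitute each value of \(\Delta_i\) into the two sums; each observation then contributes exactly the term it would contribute in the original mixture likelihood:</p>

\[\Delta_i = 0:\quad \log\varphi_1(x_i) + \log\pi = \log\left[\pi\,\varphi_1(x_i)\right],\qquad
\Delta_i = 1:\quad \log\varphi_2(x_i) + \log(1-\pi) = \log\left[(1-\pi)\,\varphi_2(x_i)\right]\]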

<p>But, we don’t actually know the values \(\Delta_i\). After all, these are the latent variables that we introduced into the problem! If you feel that we ain’t making any progress, hold on. Here’s where the EM algorithm kicks in. Even though we don’t know the exact values \(\Delta_i\), we will use their <em>expected</em> values given our current best estimates for the model’s parameters! <strong>This is the expectation step of the EM algorithm</strong>. So, instead of \(\Delta_i\), we will use \(\gamma_i\) defined as:</p>

\[\gamma_i(\theta) = \mathbb{E}(\Delta_i\mid \theta,x) = \text{Pr}(\Delta_i = 1\mid \theta,x)\]

<p>Once we have calculated the \(\gamma_i\), we have an estimate of which distribution each \(x_i\) belongs to. Therefore, we can update the model’s parameters via weighted maximum-likelihood fits; for Gaussian distributions, these are just \(\gamma_i\)-weighted means and standard deviations of the \(x_i\). <strong>This is the maximization step!</strong> Note that \(\gamma_i\) doesn’t take discrete values like the \(\Delta_i\). Instead, it lies in the interval \([0,1]\) and, therefore, the EM algorithm does a soft membership assignment. I.e., for every \(x_i\), it assigns a probability that it comes from the 1st or the 2nd distribution. That’s why, when we calculate the Gaussians’ parameters, we use a \(\gamma_i\)-weighted average.</p>

<h2 id="em-algorithm">EM algorithm</h2>

<p>So, here’s the EM algorithm for our particular problem:</p>

<ul>
  <li>Initialize the unknown parameters (e.g., \(\hat{\pi} = 0.5\), \(\hat{\mu}_1, \hat{\mu}_2 = \text{random }x_i\), \(\hat{\sigma}_1 = \hat{\sigma}_2 = \sqrt{\sum_{i=1}^N(x_i-\bar{x})^2/N}\), and so on)</li>
  <li><strong>Expectation step</strong>:</li>
</ul>

\[\hat{\gamma_i} = \frac{(1-\pi) \varphi_2(x_i)}{\pi \varphi_1(x_i) + (1-\pi)\varphi_2(x_i)}\]
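<p>If it's not obvious where this formula comes from, it is just Bayes’ rule applied to the latent variable, with \(\text{Pr}(\Delta_i = 1) = 1-\pi\) as the prior and \(\varphi_2\) as the likelihood of \(x_i\) under the 2nd component:</p>

\[\hat{\gamma_i} = \text{Pr}(\Delta_i = 1\mid \theta, x_i) = \frac{\text{Pr}(x_i\mid \Delta_i = 1)\,\text{Pr}(\Delta_i = 1)}{\text{Pr}(x_i)} = \frac{(1-\pi)\,\varphi_2(x_i)}{\pi\,\varphi_1(x_i) + (1-\pi)\,\varphi_2(x_i)}\]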

<ul>
  <li><strong>Maximization step</strong>:</li>
</ul>

\[\begin{align*}
\hat{\mu_1} &amp;= \frac{\sum_{i=1}^N (1-\hat{\gamma_i})x_i}{\sum_{i=1}^N (1-\hat{\gamma_i})}\hspace{3cm}\hat{\mu_2} = \frac{\sum_{i=1}^N \hat{\gamma_i} x_i}{\sum_{i=1}^N \hat{\gamma_i}}\\
\hat{\sigma_1} = &amp;\sqrt{\frac{\sum_{i=1}^N (1-\hat{\gamma_i})(x_i-\hat{\mu_1})^2}{\sum_{i=1}^N (1-\hat{\gamma_i})}}\hspace{1cm}
\hat{\sigma_2} = \sqrt{\frac{\sum_{i=1}^N \hat{\gamma_i}(x_i-\hat{\mu_2})^2}{\sum_{i=1}^N \hat{\gamma_i}}}\\
\hat{\pi} &amp;= \sum_{i=1}^N(1-\hat{\gamma_i})/N
\end{align*}\]

<ul>
  <li>Repeat until convergence or maximum number of iterations reached.</li>
</ul>
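<p>For readers who want to sanity-check the updates outside Mathematica, here is a minimal pure-Python sketch of the same loop. All names are ours (hypothetical), and the ground-truth values are our own synthetic choices, not necessarily those used elsewhere in this post:</p>

```python
# Minimal EM sketch for a two-component Gaussian mixture (pure stdlib).
# p is the weight of component 1; g[i] is the responsibility of component 2,
# mirroring the gamma_i convention of the E-step above.
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, p, m1, s1, m2, s2):
    """One expectation step followed by one maximization step."""
    # E-step: responsibility of component 2 for each observation (Bayes' rule).
    g = [(1 - p) * normal_pdf(x, m2, s2) /
         (p * normal_pdf(x, m1, s1) + (1 - p) * normal_pdf(x, m2, s2))
         for x in data]
    w1 = sum(1 - gi for gi in g)   # total weight assigned to component 1
    w2 = sum(g)                    # total weight assigned to component 2
    # M-step: gamma-weighted means and standard deviations.
    m1 = sum((1 - gi) * x for gi, x in zip(g, data)) / w1
    m2 = sum(gi * x for gi, x in zip(g, data)) / w2
    s1 = math.sqrt(sum((1 - gi) * (x - m1) ** 2 for gi, x in zip(g, data)) / w1)
    s2 = math.sqrt(sum(gi * (x - m2) ** 2 for gi, x in zip(g, data)) / w2)
    p = w1 / len(data)
    return p, m1, s1, m2, s2

random.seed(0)
# Synthetic data: pi = 0.3 for N(0, 1), the rest from N(9, 2).
data = [random.gauss(0, 1) if random.random() < 0.3 else random.gauss(9, 2)
        for _ in range(2000)]

# Crude initialization: means at the extremes, a wide common sigma.
p, m1, s1, m2, s2 = 0.5, min(data), 3.0, max(data), 3.0
for _ in range(40):
    p, m1, s1, m2, s2 = em_step(data, p, m1, s1, m2, s2)
```

With this setup the loop recovers values close to the ground truth (\(\pi \approx 0.3\), \(\mu_1 \approx 0\), \(\mu_2 \approx 9\)), just like the Mathematica run below.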

<p>Here is sample code that implements the EM algorithm for our particular problem. The code doesn’t look pretty without Mathematica’s syntax color highlighting and the Notebook’s formatting, but anyway.</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">em</span><span class="p">[</span><span class="nv">data</span><span class="o">_,</span><span class="w"> </span><span class="nv">p</span><span class="o">_,</span><span class="w"> </span><span class="nv">m1</span><span class="o">_,</span><span class="w"> </span><span class="nv">s1</span><span class="o">_,</span><span class="w"> </span><span class="nv">m2</span><span class="o">_,</span><span class="w"> </span><span class="nv">s2</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{</span><span class="nv">newp</span><span class="o">,</span><span class="w"> </span><span class="nv">newm1</span><span class="o">,</span><span class="w"> </span><span class="nv">news1</span><span class="o">,</span><span class="w"> </span><span class="nv">newm2</span><span class="o">,</span><span class="w"> </span><span class="nv">news2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}</span><span class="o">,</span><span class="w">
  </span><span class="nv">npts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Length</span><span class="o">@</span><span class="nv">data</span><span class="o">;</span><span class="w">
  </span><span class="nv">g</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Table</span><span class="p">[((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]])</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nv">p</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]])</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span 
class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newm1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">*</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newm2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">*</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">news1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sqrt</span><span class="p">[</span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">*</span><span class="p">(</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">m1</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]]</span><span class="o">;</span><span class="w">
  </span><span class="nv">news2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sqrt</span><span class="p">[</span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">*</span><span class="p">(</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">m2</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">/</span><span class="nv">npts</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="p">{</span><span class="nv">newp</span><span class="o">,</span><span class="w"> </span><span class="nv">newm1</span><span class="o">,</span><span class="w"> </span><span class="nv">news1</span><span class="o">,</span><span class="w"> </span><span class="nv">newm2</span><span class="o">,</span><span class="w"> </span><span class="nv">news2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">

</span><span class="nv">doEM</span><span class="p">[</span><span class="nv">data</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="o">,</span><span class="w">
  </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">0.5</span><span class="o">,</span><span class="w"> </span><span class="nb">RandomChoice</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">StandardDeviation</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">RandomChoice</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">StandardDeviation</span><span class="p">[</span><span class="nv">data</span><span class="p">]}</span><span class="o">;</span><span class="w">
  </span><span class="nb">Print</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nb">For</span><span class="w"> </span><span class="p">[</span><span class="nv">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">40</span><span class="o">,</span><span class="w"> </span><span class="nv">i</span><span class="o">++,</span><span class="w">
   </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">em</span><span class="p">[</span><span class="nv">data</span><span class="o">,</span><span class="w"> </span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">;</span><span class="w">
   </span><span class="nb">If</span><span class="p">[</span><span class="nb">Mod</span><span class="p">[</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="nb">Print</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}]]</span><span class="w">
   </span><span class="p">]</span><span class="o">;</span><span class="w">
  </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span></code></pre></figure>

<p>This is a short test run, where we confirm that the algorithm converges to the ground-truth values (shown as red lines). As we mentioned in the introduction, EM is a local algorithm, meaning it can get stuck at a local maximum. Therefore, we sometimes need to restart it from different random initializations to ensure a near-globally optimal solution.</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/parameters_convergence.png" alt="Expectation Maximization algorithm for Gaussian mixture models" />
</p>

<p>In the following plot, we see how the \(\gamma_i\)’s vary as the observed data transition from the 1st to the 2nd distribution. E.g., when we look at observed data around x=1 (or less), the \(\gamma_i\)’s are equal to zero. This means that the EM algorithm doesn’t cast any doubt on the source of these values. They were sampled from the 1st distribution. When we look at observed data around x=9 (or more), EM is confident that these values originate from the second distribution (\(\gamma_i=1\)). However, when we are in between, \(\gamma_i\)’s assume intermediate values around 0.5, conveying the uncertainty regarding which distribution each \(x_i\) belongs to. So, by applying the EM algorithm, <strong>we discovered the membership of each observed value (with some uncertainty), <em>and</em> we estimated the model’s unknown parameters!</strong> Neat?</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/em_algorithm/x_vs_gamma.png" alt="Expectation Maximization algorithm for Gaussian mixture models" />
</p>

<h1 id="references">References</h1>
<ol>
  <li>The Elements of Statistical Learning, Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.</li>
</ol>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="Mathematica" /><category term="mathematics" /><category term="optimization" /><summary type="html"><![CDATA[An introduction to the expectation-maximization algorithm focusing on the concept of maximum likelihood estimation]]></summary></entry><entry><title type="html">Acquisition functions in Bayesian Optimization</title><link href="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html" rel="alternate" type="text/html" title="Acquisition functions in Bayesian Optimization" /><published>2021-06-11T00:00:00+00:00</published><updated>2021-06-11T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#a-schematic-bayesian-optimization-algorithm" id="markdown-toc-a-schematic-bayesian-optimization-algorithm">A schematic Bayesian Optimization algorithm</a></li>
  <li><a href="#acquisition-functions" id="markdown-toc-acquisition-functions">Acquisition Functions</a>    <ul>
      <li><a href="#upper-confidence-bound-ucb" id="markdown-toc-upper-confidence-bound-ucb">Upper Confidence Bound (UCB)</a></li>
      <li><a href="#probability-of-improvement-pi" id="markdown-toc-probability-of-improvement-pi">Probability of Improvement (PI)</a></li>
      <li><a href="#expected-improvement-ei" id="markdown-toc-expected-improvement-ei">Expected Improvement (EI)</a></li>
    </ul>
  </li>
</ul>

<h1 id="introduction">Introduction</h1>
<p>In a <a href="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html">previous blog post</a>, we talked about Bayesian Optimization (BO) as a generic method for optimizing a black-box function, \(f(x)\), that is, a function whose formula we don’t know. The only thing we can do in this setup is to evaluate \(f\) at some \(x\) and observe the output.</p>

<p align="center">
 <img style="width: 40%; height: 40%" src="https://ekamperi.github.io/images/acquisition_functions/blackbox.png" alt="Blackbox function" />
</p>

<h1 id="a-schematic-bayesian-optimization-algorithm">A schematic Bayesian Optimization algorithm</h1>
<p>The essential ingredients of a BO algorithm are the <strong>surrogate model</strong> (SM) and the <strong>acquisition function</strong> (AF). The surrogate model is often a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">Gaussian Process</a> that can fit the observed data points and quantify the uncertainty of unobserved areas. So, SM is our effort to approximate the unknown black-box function \(f(x)\).</p>

<p>Next, the acquisition function “looks” at the SM and determines what areas in the domain of \(f(x)\) are worth exploiting and what areas are worth exploring. Accordingly, in areas where \(f(x)\) is optimal or areas that we haven’t yet looked at, AF assumes a high value. On the contrary, in areas where \(f(x)\) is suboptimal or areas that we have already sampled from, AF’s value is small. By finding the \(x\) that maximizes the acquisition function, we identify the next best guess for \(f\) to try. That’s right: instead of directly maximizing \(f(x)\), whose analytic form we don’t even know, we maximize another function, the AF, which is much easier and far less expensive to optimize. So, the steps that a BO algorithm follows are the following.</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/acquisition_functions/bo_flow.png" alt="Blackbox function" />
</p>

<p>In the following video, we demonstrate the <strong>exploitation</strong> (trying slightly different things that have already been proven to be good solutions) vs. <strong>exploration</strong> (trying totally different things from areas that have not yet been probed) tradeoff. Although here \(f(x)\) is known, in the general case, it is not.</p>

<p align="center">
<video id="movie" width="80%" height="80%" preload="" controls="">
   <source id="srcMp4" src="https://ekamperi.github.io/images/acquisition_functions/ucb_acq.mp4#t=0.2" />
</video>
</p>

<h1 id="acquisition-functions">Acquisition Functions</h1>
<h2 id="upper-confidence-bound-ucb">Upper Confidence Bound (UCB)</h2>
<p>Probably as simple as an acquisition function can get, upper confidence bound contains explicit exploitation (\(\mu(x)\)) and exploration (\(\sigma(x)\)) terms:</p>

\[a(x;\lambda) = \mu(x) + \lambda \sigma (x)\]

<p>With UCB, the exploitation vs. exploration tradeoff is straightforward and easy to tune via the parameter \(\lambda\). Concretely, UCB is a weighted sum of the expected performance captured by \(\mu(x)\) of the Gaussian Process, and of the uncertainty \(\sigma(x)\), captured by the standard deviation of the GP. When \(\lambda\) is small, BO will favor solutions that are expected to be high-performing, i.e., have high \(\mu(x)\). On the contrary, when \(\lambda\) is large, BO rewards the exploration of currently uncharted areas in the search space.</p>
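<p>As a minimal sketch of the UCB rule (the 1-D grid, the Gaussian-shaped mean, and the uncertainty profile below are made-up toy data standing in for a real GP posterior; all names are illustrative):</p>

```python
import numpy as np

def ucb(mu, sigma, lam):
    """Upper Confidence Bound: a(x; lambda) = mu(x) + lambda * sigma(x)."""
    return mu + lam * sigma

# Toy posterior over a 1-D grid: high mean near x = 2, high uncertainty near x = 8
x = np.linspace(0.0, 10.0, 101)
mu = np.exp(-(x - 2.0) ** 2)                  # expected performance, mu(x)
sigma = 0.1 + 0.9 * np.exp(-(x - 8.0) ** 2)   # model uncertainty, sigma(x)

# Small lambda exploits the known optimum; large lambda explores uncharted areas
x_exploit = x[np.argmax(ucb(mu, sigma, lam=0.1))]
x_explore = x[np.argmax(ucb(mu, sigma, lam=10.0))]
print(x_exploit, x_explore)  # near 2.0 and 8.0, respectively
```

<p>Note how the same posterior yields two different “next points” depending solely on \(\lambda\).</p>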

<p>Here is an example with a large value for \(\lambda\). UCB favors areas from which we don’t have any samples.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_large_lambda.png" alt="UCB function" />
</p>

<p>This is an example with a value for \(\lambda\) around 1 (I set \(\lambda=1.2\) so that the AF and upper confidence interval curves don’t coincide). UCB balances between known good values and unexplored areas.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_medium_lambda.png" alt="UCB function" />
</p>

<p>Finally, here is an example with a small value for \(\lambda\). UCB is very conservative in this case and will cause aggressive sampling around the current best solution.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_small_lambda.png" alt="UCB function" />
</p>

<h2 id="probability-of-improvement-pi">Probability of Improvement (PI)</h2>

<p>Suppose that we’d like to maximize \(f(x)\), and the best solution we have so far is \(x^\star\). Then, we can define “improvement”, \(I(x)\), as:</p>

\[I(x) = \max(f(x) - f(x^\star), 0)\]

<p>Therefore, if the new \(x\) we are looking at has an associated value \(f(x)\) that is less than \(f(x^\star)\), then \(f(x) - f(x^\star)\) is negative. So we aren’t improving at all, and the above formula returns 0, since the maximum of any negative number and 0 is 0. On the contrary, if the new value \(f(x)\) is larger than our current best estimate, then \(f(x) - f(x^\star)\) is positive. In this case, \(I(x)\) returns the difference, which is how much we would improve over our current best solution if we evaluated \(f\) at the new point \(x\).</p>
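<p>As a tiny sketch (function and variable names are illustrative), the improvement function behaves as follows:</p>

```python
def improvement(f_x, f_best):
    """I(x) = max(f(x) - f(x*), 0): positive only when x beats the incumbent."""
    return max(f_x - f_best, 0.0)

print(improvement(3.0, 5.0))  # 0.0: worse than the current best, no improvement
print(improvement(7.5, 5.0))  # 2.5: beats the current best by 2.5
```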

<p>In the probability of improvement acquisition function, we assign to each candidate \(x\) the probability that \(I(x)&gt;0\), i.e., that \(f(x)\) is larger than our current best \(f(x^\star)\). Let us recall that in a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">Gaussian Process</a>, there is a Gaussian distribution attached to each point. Therefore, at point \(x\) the value of the function \(f(x)\) is sampled from a normal distribution with mean \(\mu(x)\) and variance \(\sigma^2(x)\):</p>

\[f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))\]

<p>Now, let us use a reparameterization trick. If \(z \sim \mathcal{N}(0, 1)\), then \(f(x) = \mu(x) + \sigma(x) z\) is a normal distribution with mean \(\mu(x)\) and variance \(\sigma^2(x)\). Therefore, we can rewrite the improvement function, \(I(x)\), as:</p>

\[I(x) = \max(f(x) - f(x^\star), 0) = \max(\mu(x) + \sigma(x) z - f(x^\star), 0), \quad z \sim \mathcal{N}(0,1)\]

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/probability_of_improvement.png" alt="Probability of Improvement function" />
</p>

<p>Let us pause here and make sure that we really understand what’s going on. Here \(x\) is some point that we want to check whether it is worth evaluating \(f\) at. So, we assign a value \(I(x)\) to it. However, \(I(x)\)’s value depends on \(f(x)\), which is <strong>sampled</strong> from a normal distribution \(\mathcal{N}(\mu(x), \sigma^2(x))\). So, here’s how we calculate:</p>

\[\text{PI}(x) = \text{Pr}(I(x) &gt; 0) = \text{Pr}(f(x) &gt; f(x^\star))\]

<p>If you look at the image above, it’s clear that the probability of improvement is the shaded area under the Gaussian curve for \(z&gt;z_0\). Therefore:</p>

\[\text{PI}(x) = 1 - \Phi(z_0) = \Phi(-z_0) = \Phi\left(\frac{\mu(x)-f(x^\star)}{\sigma(x)}\right)\]

<p>Where \(\Phi(z) \equiv \text{CDF}(z)\) and \(z_0 = \frac{f(x^\star) - \mu(x)}{\sigma(x)}\).</p>
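<p>The closed-form expression for \(\text{PI}(x)\) is a one-liner. Here is a sketch using only the standard library, with \(\Phi\) implemented via the error function (all names are illustrative):</p>

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    """PI(x) = Phi((mu(x) - f(x*)) / sigma(x)), valid for sigma(x) > 0."""
    return std_normal_cdf((mu - f_best) / sigma)

# If the GP's mean at x equals the incumbent best, improving is a coin flip
print(probability_of_improvement(1.0, 0.5, 1.0))  # 0.5
# A mean two standard deviations above f(x*) makes improvement very likely
print(probability_of_improvement(2.0, 0.5, 1.0))  # Phi(2) ~ 0.977
```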

<h2 id="expected-improvement-ei">Expected Improvement (EI)</h2>
<p>PI considers only the probability of improving our current best estimate, but it does not factor in the magnitude of the improvement. This is where the expected improvement acquisition function is different. Instead of looking at the improvement \(I(x)\), which is a random variable, we will instead calculate the “Expected Improvement”, which is the expected value of \(I(x)\):</p>

\[\text{EI}(x)\equiv\mathbb{E}\left[I(x)\right] = \int_{-\infty}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}\]

<p>Where \(\varphi(z)\) is the probability density function of the normal distribution \(\mathcal{N}(0,1)\), i.e., \(\varphi(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-z^2/2\right)\). In case you aren’t familiar with the <a href="https://www.wikiwand.com/en/Expected_value">expected value</a> of a random variable, it’s kind of a weighted average of “value” times “probability of getting that value”.</p>

<p>Ok, so:</p>

\[\text{EI}(x) = \int_{-\infty}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}=\int_{-\infty}^{\infty}\underbrace{\max(f(x) - f(x^\star), 0)}_{I(x)}\varphi(z)\mathop{\mathrm{d}z}\]

<p>How do we calculate this integral? We need to get rid of the \(max\) operator. In order to do that, we are going to break up the integral into two components, one where \(f(x) - f(x^\star)\) is positive and one where it is negative. The point where the switch happens is given by:</p>

\[f(x) = f(x^\star) \Rightarrow \mu + \sigma z = f(x^\star) \Rightarrow z = \frac{f(x^\star) - \mu}{\sigma}\]

<p>Let’s call this point \(z_0 = \frac{f(x^\star) - \mu}{\sigma}\), and break up the integral as:</p>

\[\text{EI}(x) = \underbrace{\int_{-\infty}^{z_0} I(x)\varphi(z) \mathop{\mathrm{d}z}}_{\text{Zero since }I(x)=0} + \int_{z_0}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}\]

<p>Ok, so we are good to go now:</p>

\[\begin{aligned}
\text{EI}(x)
&amp;=\int_{z_0}^{\infty} \max(f(x)-f(x^\star),0) \varphi(z)\mathop{\mathrm{d}z} =
\int_{z_0}^{\infty} \left(\mu+\sigma z - f(x^\star)\right)\varphi(z) \mathop{\mathrm{d}z}\\
&amp;= \int_{z_0}^{\infty} \left(\mu - f(x^\star) \right)\varphi(z)\mathop{\mathrm{d}z} +
\int_{z_0}^{\infty} \sigma z \frac{1}{\sqrt{2\pi}}e^{-z^2/2}\mathop{\mathrm{d}z} \\\\
&amp;=\left(\mu- f(x^\star)\right) \underbrace{\int_{z_0}^{\infty}\varphi(z)\mathop{\mathrm{d}z}}_{1-\Phi(z_0)\equiv 1-\text{CDF}(z_0)} + \frac{\sigma}{\sqrt{2\pi}}\int_{z_0}^{\infty}  z e^{-z^2/2}\mathop{\mathrm{d}z}\\
&amp;=\left(\mu- f(x^\star)\right) (1-\Phi(z_0)) - \frac{\sigma}{\sqrt{2\pi}}\int_{z_0}^{\infty}  \left(e^{-z^2/2}\right)' \mathop{\mathrm{d}z}\\
&amp;=\left(\mu- f(x^\star)\right) (1-\Phi(z_0)) - \frac{\sigma}{\sqrt{2\pi}} \left[e^{-z^2/2}\right]_{z_0}^{\infty}\\
&amp;=\left(\mu- f(x^\star)\right) \underbrace{(1-\Phi(z_0))}_{\Phi(-z_0)} + \sigma \varphi(z_0) \\
&amp;=\left(\mu- f(x^\star)\right) \Phi\left(\frac{\mu-f(x^\star)}{\sigma}\right) + \sigma \varphi\left(\frac{\mu - f(x^\star)}{\sigma}\right)
\end{aligned}\]

<p>At the last step, we used the fact that the PDF of the normal distribution is symmetric, therefore \(\varphi(z_0) = \varphi(-z_0)\). Alright, so this equation might seem intimidating, but it’s really not. So, when does \(\text{EI}(x)\) take high values? When \(\mu &gt; f(x^\star)\), i.e., when the mean value of the Gaussian Process is high at \(x\). Expected improvement also increases when there’s lots of uncertainty, i.e., when \(\sigma(x)\) is large. By the way, the formula above works for \(\sigma(x)&gt;0\); otherwise, if \(\sigma(x) = 0\) (as happens at the observed data points), it holds that \(\text{EI}(x)=0\).</p>
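<p>The closed-form expression for \(\text{EI}(x)\), including the \(\sigma(x)=0\) edge case, can be sketched with the standard library alone (all names are illustrative):</p>

```python
import math

def std_normal_pdf(z):
    """Standard normal PDF, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f*) Phi((mu - f*)/sigma) + sigma phi((mu - f*)/sigma)."""
    if sigma == 0.0:  # at observed points there is no uncertainty left
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * std_normal_cdf(z) + sigma * std_normal_pdf(z)

# Same mean, more uncertainty: larger expected improvement
print(expected_improvement(1.2, 0.1, 1.0))  # ~0.201
print(expected_improvement(1.2, 1.0, 1.0))  # ~0.507
```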

<p>There’s one last thing before we conclude. By injecting a (hyper)parameter \(\xi\) into the formula for \(\text{EI}(x)\), we can fine-tune how much exploitation vs. exploration the BO algorithm will do. So, the full formula is:</p>

\[\text{EI}(x;\xi) = \left(\mu- f(x^\star) - \xi\right) \Phi\left(\frac{\mu-f(x^\star)-\xi}{\sigma}\right) + \sigma \varphi\left(\frac{\mu - f(x^\star)-\xi}{\sigma}\right)\]

<p>For \(\xi=0\), we just end up with the previous formula. However, for large values of \(\xi\), you can think of it as if we pretend to have a larger current best value than we actually do! Therefore, this steers the BO algorithm towards more exploration.</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="algorithms" /><category term="Bayes theorem" /><category term="optimization" /><category term="programming" /><summary type="html"><![CDATA[An introduction to acquisition function in the context of Bayesian Optimization]]></summary></entry><entry><title type="html">Bayesian optimization for hyperparameter tuning</title><link href="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html" rel="alternate" type="text/html" title="Bayesian optimization for hyperparameter tuning" /><published>2021-05-08T00:00:00+00:00</published><updated>2021-05-08T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#the-ingredients-of-bayesian-optimization" id="markdown-toc-the-ingredients-of-bayesian-optimization">The ingredients of Bayesian Optimization</a>    <ul>
      <li><a href="#surrogate-model" id="markdown-toc-surrogate-model">Surrogate model</a></li>
      <li><a href="#acquisition-function" id="markdown-toc-acquisition-function">Acquisition function</a></li>
    </ul>
  </li>
  <li><a href="#hyperparameter-tuning-of-an-svm" id="markdown-toc-hyperparameter-tuning-of-an-svm">Hyperparameter tuning of an SVM</a>    <ul>
      <li><a href="#create-a-dataset" id="markdown-toc-create-a-dataset">Create a dataset</a></li>
      <li><a href="#objective-function-definition" id="markdown-toc-objective-function-definition">Objective function definition</a></li>
      <li><a href="#optimization" id="markdown-toc-optimization">Optimization</a></li>
      <li><a href="#brute-force-evaluation-of-objective-function" id="markdown-toc-brute-force-evaluation-of-objective-function">Brute-force evaluation of objective function</a></li>
      <li><a href="#references" id="markdown-toc-references">References</a></li>
    </ul>
  </li>
</ul>

<h3 id="introduction">Introduction</h3>
<p>Plot: We died and ended up in <a href="https://en.wikipedia.org/wiki/Inferno_(Dante)">Dante’s inferno</a> – the optimization version. So, what does it mean to be in an optimization hell?</p>

<p>We are asked to optimize a function <strong>we don’t have an analytic expression</strong> for. It follows that <strong>we don’t have access to the first or second derivatives</strong>, hence using <a href="https://ekamperi.github.io/machine%20learning/2019/07/28/gradient-descent.html">gradient descent</a> or <a href="https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization">Newton’s method</a> is a no-go. Also, <strong>we don’t have any convexity guarantees</strong> about \(f(x)\). Therefore, methods from the convex optimization field are also not available to us. The only thing we can do is to evaluate \(f(x)\) at some \(x\)’s. However, as if the situation was not bad enough, <strong>the function we want to optimize is very costly</strong>. So, we can’t just go ahead and massively evaluate \(f(x)\) in, say, 100 billion random points and keep the one \(x\) that optimizes \(f(x)\)’s value.</p>

<p align="center">
<img style="width: 35%; height: 35%" src="https://ekamperi.github.io/images/bayesian_optimization/dante_inferno.png" alt="Dante inferno" />
</p>

<p>To summarize, we want to optimize an expensive, black-box, derivative-free, possibly non-convex function. And for this kind of problem, <strong>Bayesian Optimization (BO)</strong> is a universal and robust method.</p>

<p>Mind that <strong>the evaluation of the objective function is not necessarily computational</strong>! Let me give you a couple of examples, where \(f(x)\) is not something you can calculate with a computer:</p>

<ol>
  <li>You are a researcher investigating mixtures of chemotherapeutic drugs for their ability to kill cancer cells. You have narrowed it down to three candidate molecules, and you need to find the best combination of concentrations \(c_1, c_2, c_3\) of the three drugs. Evaluating the objective function \(f(c_1,c_2,c_3)\) in this context entails conducting actual experiments in the lab requiring personnel, consumables, and waiting for hours or days for the cell cultures to grow. Therefore, considering all possible concentration combinations is not a realistic approach. Instead, you need to begin with a few random drug concentrations, test them, and then use the experimental outcomes to predict the most promising drug combination to use next. Makes sense?</li>
<li>You work as a consultant for an oil company, and you want to maximize a probability density function \(f({\tiny\text{LAT}, \tiny\text{LONG}})\) of finding oil if you drill at \(({\tiny\text{LAT}, \tiny\text{LONG}})\) coordinates. Here, evaluating the function at a point requires conducting actual drilling. And this costs lots of money; therefore, you need to make good educated guesses, and you need to do so with only a few trials.</li>
</ol>

<p><strong>In other cases, however, \(f(x)\) is indeed computational</strong>. For instance, we may define it as the k-fold cross-validation error of a machine-learning model whose hyperparameters we want to tune. As a matter of fact, we will do precisely this later on.</p>

<h3 id="the-ingredients-of-bayesian-optimization">The ingredients of Bayesian Optimization</h3>
<h4 id="surrogate-model">Surrogate model</h4>
<p>Since we lack an expression for the objective function, the first step is to <strong>use a surrogate model to approximate \(f(x)\)</strong>. It is typical in this context to use Gaussian Processes (GPs), as we have already discussed in a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">previous blog post</a>. It’s vital that you grasp the concept of GPs, and then BO will require almost no mental effort to sink in. There are other choices for surrogate models, but let’s stick to GPs for now. Once we have built a proxy model for \(f(x)\), we want to decide which point \(x\) to sample next. This is the responsibility of the acquisition function (AF), which kind of “peeks” at the GP and generates the best guess \(x\). So, in BO, there are two main components: the <em>surrogate model</em>, which most often is a Gaussian Process modeling \(f(x)\), and the <em>acquisition function</em> that yields the next \(x\) to evaluate. Having said that, a BO algorithm would look like this in pseudocode:</p>

<ol>
  <li>Evaluate \(f(x)\) at \(n\) initial points</li>
  <li>While \(n \le N\) repeat:
    <ul>
      <li>Update the surrogate model (e.g., the GP posterior) using all available data \(\mathcal{D}_{1:n}\)</li>
<li>Compute the acquisition function, \(u(x\mid\mathcal{D}_{1:n})\), using the current surrogate model</li>
      <li>Let \(x_{n+1}\) be the maximizer of the acquisition function, i.e. \(x_{n+1} = \text{argmax}_x u(x\mid\mathcal{D}_{1:n})\)</li>
      <li>Evaluate \(y_{n+1} = f(x_{n+1})\)</li>
      <li>Augment the data \(\mathcal{D}_{1:n+1} = \{\mathcal{D}_{1:n}, (x_{n+1}, y_{n+1})\}\) and increment \(n\)</li>
    </ul>
  </li>
  <li>Return either the \(x\) evaluated with the largest \(f(x)\), or the point with the largest posterior mean.</li>
</ol>
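<p>The pseudocode above can be sketched as a loop. Mind that the “acquisition function” below is a toy, purely explorative stand-in (distance to the nearest evaluated point), not a real GP-based EI/PI/UCB, and all names are illustrative:</p>

```python
import random

def bayesian_optimization(f, candidates, n_init=3, n_total=10, seed=0):
    """Skeleton of the BO loop: fit surrogate, maximize acquisition, evaluate f."""
    rng = random.Random(seed)
    # Step 1: evaluate f at a few initial points
    data = [(x, f(x)) for x in rng.sample(candidates, n_init)]

    def acquisition(x, data):
        # Toy stand-in for a real acquisition function: pure exploration,
        # scoring a candidate by its distance to the nearest evaluated point.
        # A real implementation would use a GP posterior with EI, PI, or UCB.
        return min(abs(x - xi) for xi, _ in data)

    # Step 2: repeatedly pick the acquisition maximizer and evaluate f there
    while len(data) < n_total:
        sampled = {xi for xi, _ in data}
        remaining = [x for x in candidates if x not in sampled]
        x_next = max(remaining, key=lambda x: acquisition(x, data))
        data.append((x_next, f(x_next)))

    # Step 3: return the best observed (x, f(x))
    return max(data, key=lambda p: p[1])

best_x, best_y = bayesian_optimization(lambda x: -(x - 3) ** 2,
                                       candidates=list(range(10)))
print(best_x, best_y)  # recovers x = 3, f = 0 on this toy problem
```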

<h4 id="acquisition-function">Acquisition function</h4>
<p>As we have already noted, the purpose of the acquisition function is to guide the choice of the next best point at which to sample \(f(x)\). Acquisition functions are constructed so that a high value corresponds to potentially high values of the objective function, either because the prediction is high or because the uncertainty is high. This is why they favor regions that already correspond to optimal values or areas that haven’t been explored yet, a balance known as the <strong>exploration-exploitation trade-off</strong>.</p>

<p>If you have played strategy games, like <a href="https://en.wikipedia.org/wiki/Age_of_Empires">Age of Empires</a> or <a href="https://en.wikipedia.org/wiki/Command_%26_Conquer">Command &amp; Conquer</a>, you are already familiar with the concept. Initially, we are placed at some part of the map, and only the immediate area is visible to us. We may choose to sit there and mine any resources we already have access to or send a scouter to explore the invisible part of the map. By exploring the map, we risk meeting the enemy and getting killed, but also, we may find some high-value resources.</p>

<p align="center">
<img style="width: 90%; height: 90%" src="https://ekamperi.github.io/images/bayesian_optimization/age_of_empires.png" alt="Exploitation vs exploration tradeoff" />
</p>

<p>To find the next point to evaluate, we optimize the acquisition function. This an optimization problem itself, but luckily it does not require the evaluation of the objective function. In some cases, we may even derive an exact equation for the AF and find a solution with, say, gradient-based optimization. There are three often cited acquisition functions: <strong>expected improvement</strong> (EI), <strong>maximum probability of improvement</strong> (MPI), and <strong>upper confidence bound</strong> (UCB). Although often mentioned last, I think it’s best to talk about UCB because it contains explicit exploitation and exploration terms:</p>

\[a_{\text{UCB}}(x;\lambda) = \mu(x) + \lambda \sigma(x)\]

<p>With UCB, the exploitation <em>vs.</em> exploration trade-off is explicit and easy to tune via the parameter \(\lambda\). Concretely, we construct a weighted sum of the expected performance captured by \(\mu(x)\) of the Gaussian Process, and of the uncertainty \(\sigma(x)\), captured by the standard deviation of the GP. Assuming a small \(\lambda\), BO will favor solutions that are expected to be high-performing, i.e., have high \(\mu(x)\). Conversely, high values of \(\lambda\) will make BO favor the exploration of currently uncharted areas in the search space.</p>

<p>Here is an example of a Gaussian Process along with a corresponding acquisition function. This is a 1-dimensional optimization problem, but the idea is the same for more variables. The <strong>black dots</strong> are our measurements, i.e., the \(x\)’s where we have already sampled \(f(x)\). The <strong>black dotted line</strong> is the objective function, and the <strong>black solid line</strong> is our surrogate model of it, i.e., our posterior Gaussian Process. The <strong>blue shaded area</strong> represents the uncertainty of our surrogate model, \(\sigma(x)\), corresponding to regions in the domain of the objective function for which we don’t have any observations. The <strong>green line</strong> is the acquisition function, which informs us what point \(x\) to sample next. Notice that it takes high values in regions where our GP’s \(\mu(x)\) is high and \(\sigma(x)\) is high.</p>

<p align="center">
<img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/bayesian_optimization/gaussian_process_acquision_function.png" alt="Exploitation vs exploration tradeoff" />
</p>
<p>Image taken <a href="https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083">from here</a>.</p>

<p>This was a lightweight introduction to how a Bayesian Optimization algorithm works under the hood. Next, we will use a third-party library to tune an SVM’s hyperparameters and compare the results with some ground-truth data acquired via brute force. In the future, we will talk more about BO, perhaps by implementing our own algorithm with GPs, acquisition functions, and all.</p>

<h3 id="hyperparameter-tuning-of-an-svm">Hyperparameter tuning of an SVM</h3>
<p>Let’s import some of the stuff we will be using:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_classification</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>

<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib.tri</span> <span class="k">as</span> <span class="n">tri</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">hyperopt</span> <span class="kn">import</span> <span class="n">fmin</span><span class="p">,</span> <span class="n">tpe</span><span class="p">,</span> <span class="n">Trials</span><span class="p">,</span> <span class="n">hp</span><span class="p">,</span> <span class="n">STATUS_OK</span></code></pre></figure>

<h4 id="create-a-dataset">Create a dataset</h4>
<p>Then, we construct an artificial training dataset with many classes, where some of the features are informative, and some are not:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Create a random n-class classification problem.
</span>
<span class="c1"># n_features is the total number of features
# n_informative is the number of informative features 
# n_redundant features are generated as random linear combinations of the informative features
</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">make_classification</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">2500</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">n_informative</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">n_redundant</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>

<h4 id="objective-function-definition">Objective function definition</h4>

<p>In this example, we will be using the <code class="language-plaintext highlighter-rouge">hyperopt</code> package to perform the hyperparameter tuning. First, we define our objective/cost/loss function. This is the \(f(\mathbf{x})\) that we talked about in the introduction, and \(\mathbf{x} = [C, \gamma]\) is the parameter space. Therefore, we want to find the best combination of \(C, \gamma\) values that minimizes \(f(\mathbf{x})\). The machine learning model that we will be using is a <a href="https://en.wikipedia.org/wiki/Support-vector_machine">Support Vector Machine (SVM)</a>, and the loss will be derived from the average 3-fold cross-validation score.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">objective</span><span class="p">(</span><span class="n">args</span><span class="p">):</span>
    <span class="s">'''Define the loss function / objective of our model.

    We will be using an SVM parameterized by the regularization parameter C
    and the parameter gamma.
    
    The C parameter trades off correct classification of training examples
    against maximization of the decision function's margin. For larger values
    of C, a smaller margin will be accepted.

    The gamma parameter defines how far the influence of a single training
    example reaches, with larger values meaning 'close'. 
    '''</span>
    <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">args</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">12345</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="p">{</span><span class="s">'C'</span><span class="p">:</span> <span class="n">C</span><span class="p">,</span> <span class="s">'gamma'</span><span class="p">:</span> <span class="n">gamma</span><span class="p">},</span> <span class="s">'loss'</span><span class="p">:</span> <span class="n">loss</span><span class="p">,</span> <span class="s">'status'</span><span class="p">:</span> <span class="n">STATUS_OK</span> <span class="p">}</span></code></pre></figure>

<h4 id="optimization">Optimization</h4>
<p>Now, we will use the <code class="language-plaintext highlighter-rouge">fmin()</code> function from the <code class="language-plaintext highlighter-rouge">hyperopt</code> package. In this step, we need to specify the search space for our parameters, the database in which we will be storing the evaluation points of the search, and finally, the search algorithm to use. The careful reader might notice that we are doing 1000 evaluations, although we said that evaluating \(f(x)\) is expensive. That’s correct; the only reason we do so is that we want to exaggerate the effect of exploitation <em>vs.</em> exploration, as you shall see in the plots.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">trials</span> <span class="o">=</span> <span class="n">Trials</span><span class="p">()</span>
<span class="n">best</span> <span class="o">=</span> <span class="n">fmin</span><span class="p">(</span><span class="n">objective</span><span class="p">,</span>
    <span class="n">space</span><span class="o">=</span><span class="p">[</span><span class="n">hp</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="s">'C'</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">),</span> <span class="n">hp</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="s">'gamma'</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">)],</span>
    <span class="n">algo</span><span class="o">=</span><span class="n">tpe</span><span class="p">.</span><span class="n">suggest</span><span class="p">,</span>
    <span class="n">max_evals</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
    <span class="n">trials</span><span class="o">=</span><span class="n">trials</span><span class="p">)</span></code></pre></figure>

<p>Let’s print the results:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">best</span><span class="p">)</span>
<span class="mi">100</span><span class="o">%|</span><span class="err">██████████</span><span class="o">|</span> <span class="mi">1000</span><span class="o">/</span><span class="mi">1000</span> <span class="p">[</span><span class="mi">13</span><span class="p">:</span><span class="mi">01</span><span class="o">&lt;</span><span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">,</span>  <span class="mf">1.28</span><span class="n">trial</span><span class="o">/</span><span class="n">s</span><span class="p">,</span> <span class="n">best</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.046323449153816476</span><span class="p">]</span>
<span class="p">{</span><span class="s">'C'</span><span class="p">:</span> <span class="mf">0.7280999882033379</span><span class="p">,</span> <span class="s">'gamma'</span><span class="p">:</span> <span class="o">-</span><span class="mf">1.6752085795502363</span><span class="p">}</span></code></pre></figure>
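<p>Keep in mind that, because the objective raises 10 to the sampled values, the dictionary returned by <code class="language-plaintext highlighter-rouge">fmin()</code> holds the <em>exponents</em> of \(C\) and \(\gamma\), not the hyperparameters themselves. A quick sketch of the conversion, reusing the numbers printed above:</p>

```python
# fmin() returned the exponents (numbers taken from the run above)
best = {'C': 0.7280999882033379, 'gamma': -1.6752085795502363}

# The objective trains SVC(C=10**C, gamma=10**gamma), so the actual
# hyperparameters are recovered by exponentiating:
C_actual = 10 ** best['C']
gamma_actual = 10 ** best['gamma']

print(f"C = {C_actual:.4f}, gamma = {gamma_actual:.6f}")
```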

<p>Let us now extract the value of our objective function for every \(C, \gamma\) pair:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Extract the loss for every combination of C, gamma
</span><span class="n">results</span> <span class="o">=</span> <span class="n">trials</span><span class="p">.</span><span class="n">results</span>
<span class="n">ar</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
    <span class="n">C</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'params'</span><span class="p">][</span><span class="s">'C'</span><span class="p">]</span>
    <span class="n">gamma</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'params'</span><span class="p">][</span><span class="s">'gamma'</span><span class="p">]</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'loss'</span><span class="p">]</span>
    <span class="n">ar</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span></code></pre></figure>

<p>And then use it to plot the loss surface:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">tricontour</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">linewidths</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">colors</span><span class="o">=</span><span class="s">'k'</span><span class="p">)</span>
<span class="n">cntr</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">tricontourf</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"RdBu_r"</span><span class="p">)</span>

<span class="n">fig</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">cntr</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="s">'ko'</span><span class="p">,</span> <span class="n">ms</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">r'Loss as a function of $10^C$, $10^\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'C'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'gamma'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p align="center">
<img style="width: 65%; height: 65%" src="https://ekamperi.github.io/images/bayesian_optimization/bayesian_optimization.png" alt="Bayesian optimization" />
</p>

<h4 id="brute-force-evaluation-of-objective-function">Brute-force evaluation of objective function</h4>
<p>Since the parameter space is just 2-dimensional, the dataset relatively small, and the SVM training fast, we can brute-force compute the value of the objective function for all possible values of \(C\) and \(\gamma\). These will be our ground-truth data against which we will compare the results from the BO run.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sample_loss</span><span class="p">(</span><span class="n">args</span><span class="p">):</span>
    <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">args</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">12345</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span>

<span class="n">lambdas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">25</span><span class="p">)</span>
<span class="n">gammas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">]</span> <span class="k">for</span> <span class="n">gamma</span> <span class="ow">in</span> <span class="n">gammas</span> <span class="k">for</span> <span class="n">C</span> <span class="ow">in</span> <span class="n">lambdas</span><span class="p">])</span>

<span class="n">real_loss</span> <span class="o">=</span> <span class="p">[</span><span class="n">sample_loss</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="k">for</span> <span class="n">params</span> <span class="ow">in</span> <span class="n">param_grid</span><span class="p">]</span></code></pre></figure>

<p>And here is the respective contour plot:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C</span><span class="p">,</span> <span class="n">G</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">lambdas</span><span class="p">,</span> <span class="n">gammas</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">cp</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">contourf</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">G</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">real_loss</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">C</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"RdBu_r"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">cp</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">r'Loss as a function of $10^C$, $10^\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$C$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">r'$\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p align="center">
<img style="width: 65%; height: 65%" src="https://ekamperi.github.io/images/bayesian_optimization/ground_truth.png" alt="Bayesian optimization" />
</p>

<p>Let’s place the two plots side-by-side and talk about the results. In the <strong>left image</strong>, we see the ground-truth values of the loss function that we acquired by computing the value \(\ell(C, \gamma)\) for every possible pair of \((C, \gamma)\) via a grid-search. The blue shaded region corresponds to low values of the loss function (good!), and the red stripe at the top to high values (bad!). In the <strong>right image</strong>, the black points correspond to the values we tried. Do you notice the high density of points near the blue shaded area where \(\ell(C,\gamma)\) is minimized? That’s <strong>exploitation</strong>! The BO algorithm found some good solutions in that area and then sampled aggressively around that region. By contrast, it tried some values near the top red stripe, and since those trials yielded bad results, it didn’t bother sampling any further there.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Ground-truth values</th>
      <th style="text-align: center">Bayesian Optimization</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><img src="https://ekamperi.github.io/images/bayesian_optimization/ground_truth.png" alt="" /></td>
      <td style="text-align: center"><img src="https://ekamperi.github.io/images/bayesian_optimization/bayesian_optimization.png" alt="" /></td>
    </tr>
  </tbody>
</table>
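<p>The exploitation effect can also be quantified: with the trials array <code class="language-plaintext highlighter-rouge">ar</code> from above, one could count what fraction of the evaluated points falls within, say, unit distance of the best point, and compare it against the roughly 12.6% that a uniform sampler would place there. Below is a self-contained sketch, where <em>synthetic</em> points stand in for the real trials:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the (C, gamma, loss) trials built earlier:
# most points cluster near the minimum (exploitation), while a minority
# is spread uniformly over the box [-4, 1] x [-4, 1] (exploration).
minimum = np.array([0.73, -1.68])   # roughly where BO found the best loss
exploit = minimum + 0.3 * rng.standard_normal((800, 2))
explore = rng.uniform(-4.0, 1.0, size=(200, 2))
points = np.vstack([exploit, explore])

# Fraction of trials within unit distance of the minimum, vs. the
# fraction a uniform sampler would place there (circle area / box area).
dist = np.linalg.norm(points - minimum, axis=1)
frac_near = (dist < 1.0).mean()
uniform_frac = np.pi * 1.0**2 / 5.0**2

print(f"near minimum: {frac_near:.1%} vs uniform baseline {uniform_frac:.1%}")
```

An exploitation-heavy run concentrates far more trials near the optimum than the uniform baseline predicts.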

<h4 id="references">References</h4>
<ol>
  <li><a href="https://thuijskens.github.io/2016/12/29/bayesian-optimisation/">https://thuijskens.github.io/2016/12/29/bayesian-optimisation/</a></li>
</ol>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="algorithms" /><category term="Bayes theorem" /><category term="neural networks" /><category term="optimization" /><category term="programming" /><category term="Python" /><summary type="html"><![CDATA[An introduction to Bayesian-based optimization for tuning hyperparameters in machine learning models]]></summary></entry><entry><title type="html">Longest substring with non-repeating characters</title><link href="https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring.html" rel="alternate" type="text/html" title="Longest substring with non-repeating characters" /><published>2021-04-14T00:00:00+00:00</published><updated>2021-04-14T00:00:00+00:00</updated><id>https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring</id><content type="html" xml:base="https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring.html"><![CDATA[<p>I have been doing some interviews for job positions like data scientist, machine learning engineer, and software developer during the past months. To prepare for the coding part of these interviews and brush up on my algorithmic thinking and programming skills, I decided to do some ad-hoc practicing. There are lots of websites with coding challenges of varying difficulty. Some examples include <a href="https://leetcode.com/">Leetcode</a>, <a href="https://www.hackerrank.com/">HackerRank</a>, <a href="https://www.topcoder.com/">Topcoder</a>, and others. Although I kind of dislike the contrived nature of these quizzes, I joined Leetcode nonetheless. Anyway, I picked a problem under the “medium” difficulty category that I’ll blog about today. The problem is about <strong>finding the longest substring with non-repeating characters in a string</strong>.</p>

<h3 id="problem-formulation">Problem formulation</h3>
<p>Given a string <em>s</em>, find the length of the longest substring without repeating characters.</p>

<p><strong>Example 1</strong>:
Input: s = “abcabcbb”
Output: 3
Explanation: The answer is “abc”, with the length of 3.</p>

<p><strong>Example 2</strong>:
Input: s = “bbbbb”
Output: 1
Explanation: The answer is “b”, with the length of 1.</p>

<p><strong>Example 3</strong>:
Input: s = “pwwkew”
Output: 3
Explanation: The answer is “wke”, with the length of 3.
Notice that the answer must be a substring, “pwke” is a subsequence and not a substring.</p>

<p><strong>Example 4</strong>:
Input: s = “”
Output: 0</p>

<p><strong>Constraints</strong>:
\(0 \le \text{s.length} \le 5 \times 10^4\)
<em>s</em> consists of English letters, digits, symbols and spaces.</p>

<h3 id="solutions">Solutions</h3>
<p>We import some libraries that we will need later on.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">string</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span></code></pre></figure>

<p>For starters, we will write a function that generates random strings consisting of lowercase letters, digits, and whitespace characters of varying lengths. We will use it to see how our different solutions scale with increasing input size. When coding such problems, it’s essential to have abundant examples that cover all edge cases. By the way, I’ve found it easier to write and run my code in a Jupyter Notebook inside Visual Studio Code and then paste it to Leetcode.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> <span class="n">chars</span><span class="o">=</span><span class="n">string</span><span class="p">.</span><span class="n">ascii_lowercase</span> <span class="o">+</span> <span class="n">string</span><span class="p">.</span><span class="n">digits</span> <span class="o">+</span> <span class="n">string</span><span class="p">.</span><span class="n">whitespace</span><span class="p">):</span>
    <span class="k">return</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">chars</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">size</span><span class="p">))</span>

<span class="c1"># Print 10 random strings of random length [0,20) 
</span><span class="n">input_str</span> <span class="o">=</span> <span class="p">[</span><span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">20</span><span class="p">))</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
<span class="k">print</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>

<span class="c1">#    ['75ypzflfi85wgbe', 'k4dogu\x0c14ckj', 'zcj8aoquhzfsh1g7uyh', '\x0cce\r\tt48nq1gio', 'c58',
#     'ol\tnfq7', 'i', 'jsjn\t8', '2tj\x0bb413', '']</span></code></pre></figure>

<h3 id="the-horrible-solution">The horrible solution</h3>

<p>My first attempt resulted in the following readable yet, complexity-wise, absolutely horrible solution. The <code class="language-plaintext highlighter-rouge">rep()</code> function is actually fine, and we will be using it in the other solutions as well. It uses a dictionary to track whether a character has already been seen inside a substring. It has the advantage that it iterates over the substring only once, so it has \(\mathcal{O}(N)\) time complexity. Had we used a nested loop to search for repeating characters, that would have led us to \(\mathcal{O}(N^2)\) complexity from the get-go!</p>

<p>So, the following algorithm starts with the entire string and checks whether it has any repeating characters. If it doesn’t, then this is the longest such substring, of length N! Return its length, and we are done. If it does have repeating characters, though, we slice it into two substrings of length N-1. If only one of the two contains repeating characters, we know that the other is the longest substring, with length N-1. Return it immediately, and we are done. Lastly, if both substrings of length N-1 contain repeating characters, we need to dig deeper, and therefore we return the maximum length found by recursing into the two substrings.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">rep</span><span class="p">(</span><span class="n">s</span><span class="p">:</span><span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">'''Returns True if str has repeating characters in it and False otherwise'''</span>
    <span class="n">freq</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">freq</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="n">freq</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">helper</span><span class="p">(</span><span class="n">s</span><span class="p">:</span><span class="nb">str</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span><span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''The most horrible solution in terms of time and space complexity.
    It uses recursion to generate the substrings, starting from the full
    string and generating substrings.'''</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">n</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">n</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">s</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
    <span class="n">rep_a</span><span class="p">,</span> <span class="n">rep_b</span> <span class="o">=</span> <span class="n">rep</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="n">rep</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="p">(</span><span class="n">rep_a</span> <span class="ow">and</span> <span class="n">rep_b</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">helper</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">helper</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">verySlowLLS</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">helper</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">))</span></code></pre></figure>

<p>So, why does this algorithm perform so poorly? As I understand, there are two reasons: 1. Recursion is expensive because each time we call the <code class="language-plaintext highlighter-rouge">helper()</code> function, a new stack frame needs to be allocated, and 2. When we are calling <code class="language-plaintext highlighter-rouge">max(helper(a, n-1), helper(b, n-1))</code>, we don’t really <em>divide</em> the input, let alone <em>conquer</em> it! We merely go from N to N-1. It’s not as if we reduced the search space from N to N/2 or something. <strong>So, remember: if you recurse, you better be dividing the search space at each step, otherwise don’t recurse!</strong></p>
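<p>To see just how quickly this recursion blows up without memoization, we can instrument <code class="language-plaintext highlighter-rouge">helper()</code> with a call counter. The sketch below is self-contained, duplicating the recursion from above with a set-based <code class="language-plaintext highlighter-rouge">rep()</code>:</p>

```python
def rep(s):
    """True if s contains any repeated character (set-based variant)."""
    seen = set()
    for c in s:
        if c in seen:
            return True
        seen.add(c)
    return False

calls = 0

def helper(s, n):
    """Same recursion as above, instrumented with a call counter."""
    global calls
    calls += 1
    if n < 2:
        return n
    if not rep(s):
        return n
    a, b = s[:-1], s[1:]
    if not (rep(a) and rep(b)):
        return n - 1
    return max(helper(a, n - 1), helper(b, n - 1))

# Repeats spread throughout the string force both branches to recurse:
# an 18-character string already triggers tens of thousands of calls.
s = "abc" * 6
result = helper(s, len(s))
print(result, calls)
```

Since identical substrings are reached via many different paths and nothing is memoized, the number of calls grows roughly like \(2^N\) rather than \(N\).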

<h3 id="a-decent-solution-of-mathcalon2-complexity">A decent solution of \(\mathcal{O}(N^2)\) complexity</h3>
<p>The next two solutions use sliding windows, either forward or backward, to enumerate all possible substrings of a string. The forward method examines windows of increasing length, from 1 up to N, so it must keep track of the maximum length found so far; windows longer than the answer all contain repeats, so we cannot simply return the last window we checked.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">slowLLS_forward</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''It uses sliding windows of length 1, 2, ..., N-1, N.
    That's why we need to keep track of the currently maximum
    length.'''</span>
    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">L</span>
    <span class="n">max_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">L</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span> <span class="o">-</span> <span class="n">w</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">sub</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">:(</span><span class="n">i</span><span class="o">+</span><span class="n">w</span><span class="p">)]</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">sub</span><span class="p">):</span>
                <span class="n">current_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sub</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">current_len</span> <span class="o">&gt;</span> <span class="n">max_len</span><span class="p">:</span>
                    <span class="n">max_len</span> <span class="o">=</span> <span class="n">current_len</span>
    <span class="k">return</span> <span class="n">max_len</span></code></pre></figure>

<p>On the other hand, if we are moving backward, i.e., if we are examining substrings of decreasing length, we know that the first substring without any repeating characters is the one with the maximum length.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">slowLLS_backward</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''It uses sliding windows of length N, N-1, N-2, ..., 1.
    That's why we don't need to keep track of the currently
    maximum length. The first non-repeating substring we encounter
    is the one with the maximum length.'''</span>
    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">L</span>
    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span> <span class="o">-</span> <span class="n">w</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">sub</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">:(</span><span class="n">i</span><span class="o">+</span><span class="n">w</span><span class="p">)]</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">sub</span><span class="p">):</span>
                <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">sub</span><span class="p">)</span></code></pre></figure>

<h3 id="the-best-solution-of-mathcalon-complexity">The best solution of \(\mathcal{O}(N)\) complexity</h3>

<p>This is actually the best solution I could come up with. We use two variables to keep track of the start and the end of the currently maximal substring. Every time we see a non-repeating character, we advance the <em>end</em> of the current substring. Conversely, every time we encounter a repeating character, we advance the <em>start</em> of the current substring. In the latter case, however, we also need to remove from our dictionary all the characters that were discarded when we advanced the start of the substring.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">fastLLS</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''Calculate the longest non-repeating substring
    on one go, by keeping track of the start (variable a) and
    end (variable b) of the currently maximum such substring.'''</span>
    <span class="n">max_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">a</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">track</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">track</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="n">b</span> <span class="o">=</span> <span class="n">i</span>
            <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">start</span> <span class="o">=</span> <span class="n">a</span>
            <span class="n">end</span> <span class="o">=</span> <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">s</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]:</span>
                <span class="k">del</span> <span class="n">track</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
            <span class="n">a</span> <span class="o">=</span> <span class="n">end</span>
            <span class="n">b</span> <span class="o">=</span> <span class="n">i</span>
            <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">if</span> <span class="n">m</span> <span class="o">&gt;</span> <span class="n">max_len</span><span class="p">:</span> <span class="n">max_len</span> <span class="o">=</span> <span class="n">m</span>
    <span class="k">return</span> <span class="n">max_len</span></code></pre></figure>

<p>Indeed, after submitting this solution to Leetcode I got:</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/runtime_vs_others.png" alt="Longest non-repeating substring" />
</p>
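
<p>For reference, the same sliding-window idea can also be written without the explicit deletion loop, by remembering only the index of each character’s most recent occurrence. The following is a sketch of this equivalent variant (not the submitted code; <code>longest_unique</code> is a name chosen here):</p>

```python
def longest_unique(s: str) -> int:
    '''Length of the longest substring without repeating characters,
    keeping only the most recent index of each character.'''
    last_seen = {}   # character -> index of its most recent occurrence
    a = 0            # start of the current window
    best = 0
    for i, c in enumerate(s):
        if c in last_seen and last_seen[c] >= a:
            a = last_seen[c] + 1   # jump the window past the repeated character
        last_seen[c] = i
        best = max(best, i - a + 1)
    return best
```

<p>E.g., <code>longest_unique("abcabcbb")</code> returns 3, in agreement with <code>fastLLS</code>, while touching each character only once.</p>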

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">step</span> <span class="o">=</span> <span class="mi">2</span>
<span class="k">def</span> <span class="nf">profile_function</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="s">'''Profile `f' by applying it on input strings of
    progressively increasing length up to `n'.'''</span>
    <span class="n">runtimes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">step</span><span class="p">):</span>
        <span class="n">input_str</span> <span class="o">=</span> <span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
        <span class="n">f</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
        <span class="n">runtimes</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">i</span><span class="p">,</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">runtimes</span>

<span class="k">def</span> <span class="nf">plot_runtimes</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">fitDegree</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="s">'''Plot runtimes along with a polynomial fit of `fitDegree' degree.
    By default don't create figure / show the plot, so that we can call
    this function inside a subplot() context.'''</span>
    <span class="c1">#plt.figure()
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Input string length'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Execution time in sec'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>

    <span class="c1"># Add a polynomial fit
</span>    <span class="k">if</span> <span class="n">fitDegree</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">poly1d</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">polyfit</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">),</span> <span class="n">fitDegree</span><span class="p">))</span>
        <span class="n">polyline</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">*</span> <span class="n">step</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">polyline</span><span class="p">,</span> <span class="n">model</span><span class="p">(</span><span class="n">polyline</span><span class="p">),</span> <span class="s">'r'</span><span class="p">)</span>
    <span class="c1">#plt.show()</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_very_slow</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">verySlowLLS</span><span class="p">,</span> <span class="mi">34</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_very_slow</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">verySlowLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_7_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_slow_forward</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">slowLLS_forward</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_forward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_forward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_8_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_slow_backward</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">slowLLS_backward</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_backward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_backward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_9_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_fast</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">fastLLS</span><span class="p">,</span> <span class="mi">10000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_fast</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fastLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_10_0.svg" alt="Longest non-repeating substring" />
</p>

<p>As a sanity check, we verify that all algorithms return the same result for strings of various lengths. We can’t really go past a length of 30 characters because the recursive algorithm takes ages to run.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Sanity check -- all algorithms should agree
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">3</span><span class="p">):</span>
    <span class="n">input_str</span> <span class="o">=</span> <span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
    <span class="n">y1</span> <span class="o">=</span> <span class="n">fastLLS</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y2</span> <span class="o">=</span> <span class="n">slowLLS_forward</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y3</span> <span class="o">=</span> <span class="n">slowLLS_backward</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y4</span> <span class="o">=</span> <span class="n">verySlowLLS</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">y1</span> <span class="o">!=</span> <span class="n">y2</span> <span class="ow">or</span> <span class="n">y2</span> <span class="o">!=</span> <span class="n">y3</span> <span class="ow">or</span> <span class="n">y3</span> <span class="o">!=</span> <span class="n">y4</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">y3</span><span class="p">,</span> <span class="n">y4</span><span class="p">)</span>
        <span class="k">break</span></code></pre></figure>

<p>In this plot we combine the running times of all algorithms side by side.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Plot the runtimes of all algorithms side by side
</span><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_fast</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fastLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_backward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_backward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_forward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_forward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_very_slow</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">verySlowLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_12_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="c1">#plt.xscale('log')
</span><span class="n">plt</span><span class="p">.</span><span class="n">yscale</span><span class="p">(</span><span class="s">'log'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_fast</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_slow_backward</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_slow_forward</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_very_slow</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Input string length'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Execution time in sec'</span><span class="p">);</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_13_0.svg" alt="Longest non-repeating substring" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="programming" /><category term="algorithms" /><category term="Leetcode" /><category term="programming" /><category term="Python" /><summary type="html"><![CDATA[How to find the longest substring with non-repeating characters in a string]]></summary></entry><entry><title type="html">Decision Trees: Gini index vs entropy</title><link href="https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees.html" rel="alternate" type="text/html" title="Decision Trees: Gini index vs entropy" /><published>2021-04-13T00:00:00+00:00</published><updated>2021-04-13T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees.html"><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Decision trees are tree-based methods that are used for both regression and classification. They work by segmenting the feature space into several simple subregions. To make a prediction, a tree returns either the mean <em>or</em> the most frequent class of the training points inside the region our observation falls into, depending on whether we are doing regression or classification, respectively. Decision trees are straightforward to interpret; as a matter of fact, they can be even easier to interpret than linear or logistic regression models, perhaps because they resemble how the human decision-making process works. On the downside, trees usually lack the predictive accuracy of other methods. Also, they can be sensitive to changes in the training dataset, where a slight change may cause a dramatic change in the final tree. That’s why <em>bagging</em>, <em>random forests</em> and <em>boosting</em> are used to construct more robust tree-based prediction models. But that’s for another day. Today we are going to talk about how the split happens.</p>

<h3 id="gini-impurity-and-information-entropy">Gini impurity and information entropy</h3>
<p>Trees are constructed via <strong>recursive binary splitting of the feature space</strong>. In the classification scenarios that we will be discussing today, the criteria typically used to decide which feature to split on are the <strong>Gini index</strong> and <strong>information entropy</strong>. The two measures are numerically quite similar. They take small values if most observations in a node fall into the same class. Conversely, they are maximized when there is an equal number of observations across all classes in a node. A node with mixed classes is called impure, and the Gini index is also known as <strong>Gini impurity</strong>.</p>

<p>Concretely, for a set of items with \(K\) classes, and \(p_k\) being the fraction of items labeled with class \(k\in {1,2,\ldots,K}\), the <strong>Gini impurity</strong> is defined as:</p>

\[G = \sum_{k=1}^K p_k (1 - p_k) = 1 - \sum_{k=1}^K p_k^2\]

<p>And <strong>information entropy</strong> as:</p>

\[H = -\sum_{k=1}^K p_k \log p_k\]

<p>In the following plot, the two metrics are plotted against each other for a set of \(K=2\) classes with probabilities \(p\) and \(1-p\), respectively. Notice how, for small values of \(p\), Gini is consistently lower than entropy; therefore, it penalizes small impurities less. <strong>This is a crucial observation that will prove helpful in the context of imbalanced datasets</strong>.</p>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_vs_entropy.png" alt="Gini vs entropy" />
</p>
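
<p>We can confirm this behavior numerically (a quick sketch; the helper names <code>gini</code> and <code>entropy</code> are chosen here, and natural logarithms are used, so entropy is in nats):</p>

```python
import math

def gini(p):
    # Gini impurity of a two-class node with class probabilities p and 1-p
    return 1 - p**2 - (1 - p)**2

def entropy(p):
    # Information entropy (in nats) of the same two-class node
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

for p in (0.01, 0.1, 0.5):
    print(f"p={p}: gini={gini(p):.3f}, entropy={entropy(p):.3f}")
```

<p>For small \(p\), Gini stays well below entropy, which is precisely the lenience toward small impurities mentioned above.</p>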

<p>The Gini index is used by the CART (classification and regression tree) algorithm, whereas information gain via entropy reduction is used by algorithms like <a href="https://en.wikipedia.org/wiki/C4.5_algorithm">C4.5</a>. In the following image, we see a part of a decision tree for predicting whether a person receiving a loan will be able to pay it back. The left node is an example of a low impurity node since most of the observations fall into the same class. Contrast this with the node on the right where observations of different classes are mixed in.</p>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/pure_vs_impure_node.png" alt="Decision trees: pure vs impure nodes" />
</p>

<p>Image taken from “Provost, Foster; Fawcett, Tom. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking”.</p>

<p>Let’s calculate the <strong>Gini impurity of the left node</strong>:</p>

\[\begin{align}
G\left(\text{Balance &lt; 50K}\right)
&amp;= 1-\sum_{k=1}^{2} p_k^2 = 1-p_1^2 - p_2^2\\
&amp;=1-\left(\frac{12}{13}\right)^2 -\left(\frac{1}{13}\right)^2
\simeq 0.14
\end{align}\]

<p>And the <strong>Gini impurity of the right node</strong>:</p>

\[\begin{align}
G\left(\text{Balance} \ge \text{50K}\right)
&amp;= 1-\sum_{k=1}^{2} p_k^2 = 1-p_1^2 - p_2^2\\
&amp;=1-\left(\frac{4}{17}\right)^2 -\left(\frac{13}{17}\right)^2
\simeq 0.36
\end{align}\]

<p>We notice that the left node has a lower Gini impurity index, which we’d expect since \(G\) measures impurity and the left node is purer relative to the right one. Let’s now calculate the <strong>entropy of the left node</strong>:</p>

\[\begin{align}
H\left(\text{Balance &lt; 50K}\right)
&amp;= -\sum_{k=1}^{2} p_k \log{p}_k = -p_1 \log{p}_1 -p_2 \log{p}_2\\
&amp;=-\frac{12}{13}\log\left(\frac{12}{13}\right) -\frac{1}{13}\log\left(\frac{1}{13}\right)
\simeq 0.27\ \text{nats}
\end{align}\]

<p>Depending on whether we use \(\log_2\) or \(\log_e\) in the entropy formula, we get the result in <em>bits</em> or <em>nats</em>, respectively. For instance, here it’s \(H \simeq 0.39\ \text{bits}\). Let’s calculate the <strong>entropy of the right node</strong> as well:</p>

\[\begin{align}
H\left(\text{Balance}\ge\text{50K}\right)
&amp;= -\sum_{k=1}^{2} p_k \log{p}_k = -p_1 \log{p}_1 -p_2 \log{p}_2\\
&amp;=-\frac{4}{17}\log\left(\frac{4}{17}\right) -\frac{13}{17}\log\left(\frac{13}{17}\right)
\simeq 0.55\ \text{nats}
\end{align}\]

<p>Again, if we use base 2 in the entropy’s logarithm, we get \(H \simeq 0.79\ \text{bits}\). Units aside, we see that the left node has lower entropy than the right one, which is expected since the left one is in a more <em>ordered</em> state and entropy measures <em>disorder</em>. So, \(H_\text{left} \simeq 0.27\ \text{nats}\) and \(H_\text{right} \simeq 0.55\ \text{nats}\). <strong>The various algorithms for constructing decision trees pick the feature to split on next so that the maximum impurity reduction is achieved.</strong></p>

<p>Let’s calculate how much entropy is reduced by splitting on the “Balance” feature:</p>

\[\begin{align*}
H(\text{Parent}) &amp;= -\frac{16}{30} \log\left(\frac{16}{30}\right) -\frac{14}{30}\log\left(\frac{14}{30}\right)\simeq 0.69\ \text{nats}\\
H(\text{Balance}) &amp;= \frac{13}{30} \times 0.27 + \frac{17}{30} \times 0.55 \simeq 0.43\ \text{nats}
\end{align*}\]

<p>Therefore, the information gain by splitting on the “Balance” feature is:</p>

\[\text{IG} = H(\text{Parent}) - H(\text{Balance}) = 0.69 - 0.43 = 0.26\ \text{nats}\]

<p>If we were to choose between “Balance” and some other feature, say “Education”, we would make up our mind based on the IG of both. If the IG of “Balance” was 0.26 nats and the IG of “Education” was 0.14 nats, we would pick the former to split on.</p>
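
<p>The hand calculations above can be cross-checked in a few lines (a sketch using natural logarithms, so the results are in nats; <code>H</code> is a helper name chosen here):</p>

```python
import math

def H(*counts):
    '''Information entropy (in nats) of a node with the given class counts.'''
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c)

h_left   = H(12, 1)    # Balance < 50K
h_right  = H(4, 13)    # Balance >= 50K
h_parent = H(16, 14)
# Entropy after the split, weighted by the fraction of observations per child
h_split  = 13/30 * h_left + 17/30 * h_right
print(f"H_left={h_left:.2f}  H_right={h_right:.2f}  "
      f"H_parent={h_parent:.2f}  IG={h_parent - h_split:.2f}")
```

<p>Running it reproduces the numbers above: \(H_\text{left}\simeq 0.27\), \(H_\text{right}\simeq 0.55\), \(H(\text{Parent})\simeq 0.69\), and \(\text{IG}\simeq 0.26\) nats.</p>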

<p>So when do we use Gini impurity versus information gain via entropy reduction? Both metrics work more or less the same, and in only a few cases do the results differ considerably. Having said that, <strong>there’s a scenario where entropy might be more prudent: imbalanced datasets.</strong></p>

<h3 id="an-example-of-an-imbalanced-dataset">An example of an imbalanced dataset</h3>

<p>The package <a href="https://cran.r-project.org/web/packages/ROSE/ROSE.pdf">ROSE</a> comes with a built-in imbalanced dataset named <em>hacide</em>, consisting of <em>hacide.train</em> and <em>hacide.test</em>. The dataset has three variables in it for a total of \(N=10^3\) observations. The <em>cls</em>, short for “class”, is the response categorical variable, and \(x_1\) and \(x_2\) are the predictor variables. For building our classification trees, we will use the <a href="https://cran.r-project.org/web/packages/rpart/rpart.pdf">rpart</a> package.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Load the necessary libraries and the dataset </span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ROSE</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rpart</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rpart.plot</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">hacide</span><span class="p">)</span><span class="w">

</span><span class="c1"># Check imbalance on training set</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">hacide.train</span><span class="o">$</span><span class="n">cls</span><span class="p">)</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#   0   1 </span><span class="w">
</span><span class="c1"># 980  20 </span></code></pre></figure>

<p>As you may see from the output above, this is a very imbalanced dataset. The vast majority, 980, of the 1000 observations belong to the “0” class, and only 20 belong to the “1” class. We will now fit a decision tree by using Gini as the split criterion.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Use gini as the split criterion</span><span class="w">
</span><span class="n">tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpart</span><span class="p">(</span><span class="n">cls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.train</span><span class="p">,</span><span class="w"> </span><span class="n">parms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gini"</span><span class="p">))</span><span class="w">
</span><span class="n">pred.tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.test</span><span class="p">)</span><span class="w">
</span><span class="n">accuracy.meas</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Call: </span><span class="w">
</span><span class="c1"># accuracy.meas(response = hacide.test$cls, predicted = pred.tree.imb[, 2])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Examples are labelled as positive when predicted is greater than 0.5 </span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># precision: 1.000</span><span class="w">
</span><span class="c1"># recall: 0.200</span><span class="w">
</span><span class="c1"># F: 0.167</span></code></pre></figure>

<p>Things don’t look all that great. Although we have perfect precision (reminder: \(\text{Precision} = TP/(TP+FP)\)), meaning that we don’t produce any false positives, our recall is very low (reminder: \(\text{Recall} = TP/(TP+FN)\)), meaning that we miss many positives as false negatives. In effect, our classifier almost always outputs the majority class “0”. The F metric is also very low. And the ROC curve below shows just how poor our performance is.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">roc.curve</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">plotit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gini index"</span><span class="p">)</span><span class="w">
</span><span class="c1"># Area under the curve (AUC): 0.600</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_auc.png" alt="ROC curve of the tree grown with the Gini criterion" />
</p>
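<p>As a quick sanity check, the numbers reported by <code>accuracy.meas()</code> above are consistent with the tree catching just 1 of the 5 positive test examples while raising no false alarms. The counts below are an assumption reverse-engineered from the printed output, not something the package reports directly; note also that the printed F value matches \(PR/(P+R)\), i.e. half the conventional F1 score \(2PR/(P+R)\).</p>

```python
# Assumed confusion counts, consistent with precision = 1.000 and recall = 0.200:
# 1 true positive, 0 false positives, 4 false negatives.
TP, FP, FN = 1, 0, 4

precision = TP / (TP + FP)   # 1/1 = 1.000: no false positives
recall    = TP / (TP + FN)   # 1/5 = 0.200: 4 of 5 positives missed

# The F reported above matches P*R/(P+R), half the usual F1 = 2*P*R/(P+R).
F = precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(F, 3))  # 1.0 0.2 0.167
```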

<p>So what went wrong here? Let’s take a look at the decision tree itself. Notice that the left node has 10 observations of the minority class and 979 of the majority class. From the perspective of the Gini impurity index, that’s a very pure node, because \(G_L = 1 - (10/989)^2 - (979/989)^2 \simeq 0.02\). The same applies, albeit to a lesser degree, to the right node: \(G_R = 1 - (1/11)^2 - (10/11)^2\simeq 0.17\). Therefore, Gini regards these nodes as nearly pure and sees little to gain from splitting them further, which is exactly the wrong behavior for our imbalanced dataset.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rpart.plot</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gini Index"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">extra</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_tree.png" alt="Decision tree grown with the Gini criterion" />
</p>
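<p>The node impurities quoted above are easy to check numerically. Here is a minimal Python sketch of the arithmetic, with the class counts taken from the tree plot:</p>

```python
# Gini impurity of a node: gini(p) = 1 - sum(p_k^2) over class proportions p_k.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

G_L = gini([10, 979])  # left node: 10 minority vs 979 majority observations
G_R = gini([1, 10])    # right node: 1 minority vs 10 majority observations

print(round(G_L, 2), round(G_R, 2))  # 0.02 0.17
```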

<p>Let’s repeat the fitting, but now we will use entropy as the split criterion for growing our tree.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Use information gain as the split criterion</span><span class="w">
</span><span class="n">tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpart</span><span class="p">(</span><span class="n">cls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.train</span><span class="p">,</span><span class="w"> </span><span class="n">parms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"information"</span><span class="p">))</span><span class="w">
</span><span class="n">pred.tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.test</span><span class="p">)</span><span class="w">
</span><span class="n">accuracy.meas</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Call: </span><span class="w">
</span><span class="c1"># accuracy.meas(response = hacide.test$cls, predicted = pred.tree.imb[, 2])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#  Examples are labelled as positive when predicted is greater than 0.5 </span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># precision: 1.000</span><span class="w">
</span><span class="c1"># recall: 0.400</span><span class="w">
</span><span class="c1"># F: 0.286</span></code></pre></figure>

<p>The precision is still perfect, i.e. we aren’t predicting any false positives, and we doubled the recall. The improvement is also reflected in the F metric. Moreover, the ROC curve of the new decision tree is markedly better than that of the previous run.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">roc.curve</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">plotit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="c1"># Area under the curve (AUC): 0.883</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/entropy_auc.png" alt="ROC curve of the tree grown with the information gain criterion" />
</p>
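<p>Some intuition for why entropy fares better here: for the same class counts, entropy assigns relatively more impurity to a nearly-pure node contaminated by a few minority observations than Gini does, so the “information” criterion retains a stronger incentive to keep splitting such nodes. This is a rough sketch of the effect, not a full account; the node counts below are the ones from the Gini tree above.</p>

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# Same two nodes as in the Gini tree: entropy/Gini ratio is larger for the
# nearly-pure left node, i.e. entropy penalizes its contamination relatively more.
for name, counts in [("left (10 vs 979)", [10, 979]), ("right (1 vs 10)", [1, 10])]:
    print(f"{name}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```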

<p>Here is the decision tree itself. Admittedly, it’s a bit more complex than the one we got with Gini, but overall the classifier is more performant and useful.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rpart.plot</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Information Gain"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">extra</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/entropy_tree.png" alt="Decision tree grown with the information gain criterion" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="decision trees" /><category term="machine learning" /><category term="mathematics" /><category term="R language" /><summary type="html"><![CDATA[Gini index vs entropy in decision trees with imbalanced datasets]]></summary></entry></feed>