<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ekamperi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ekamperi.github.io/" rel="alternate" type="text/html" /><updated>2026-01-04T19:18:02+00:00</updated><id>https://ekamperi.github.io/feed.xml</id><title type="html">Let’s talk about science!</title><subtitle>A blog on things I’m interested in such as mathematics, physics, programming, machine learning, data science, and radiation oncology.
</subtitle><author><name>Stathis Kamperis</name></author><entry><title type="html">How to install Jupyter Lab in FreeBSD 15.0</title><link href="https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd.html" rel="alternate" type="text/html" title="How to install Jupyter Lab in FreeBSD 15.0" /><published>2026-01-04T00:00:00+00:00</published><updated>2026-01-04T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/freebsd/2026/01/04/installing-jupyter-lab-in-freebsd.html"><![CDATA[<p>We used to have an implicit rule in this blog: we write only about things that stand the test of time, such as concepts, algorithms, math theorems, and so on. However, we now break this rule for two reasons. First, the Internet is getting spammed with auto-generated content of ~zero value. Let’s try to increase the SNR and, at the same time, keep the knowledge decentralized, as it was meant to be. Second, LLMs are particularly bad at solving problems on less common operating systems, such as <a href="https://www.freebsd.org/">FreeBSD</a>. So, hopefully, during the next scraping run the spider bots will parse this little post, and we will get to influence, even if in a minuscule way, the training of the next generation of LLMs.</p>

<p>The key to successfully installing Jupyter Lab on FreeBSD 15.0 is to use as many FreeBSD packages as possible and resort to pip only for what’s left. Here is the exact recipe that worked for me. Mind the option <code class="language-plaintext highlighter-rouge">--system-site-packages</code>:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">sudo </span>pkg <span class="nb">install </span>pkgconf python311 py311-pip py311-setuptools py311-wheel py311-cython <span class="se">\</span>
    py311-maturin py311-pyzmq py311-scikit-build-core cmake ninja rust
<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> venvs
<span class="nv">$ </span>python3.11 <span class="nt">-m</span> venv <span class="nt">--system-site-packages</span> ~/venvs/jupyter
<span class="nv">$ </span><span class="nb">source</span> ~/venvs/jupyter/bin/activate
<span class="nv">$ </span>pip <span class="nb">install </span>jupyterlab notebook
<span class="nv">$ </span>jupyter lab</code></pre></figure>

<p>Here is the proof:</p>

<p align="center">
    <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/jupyter_lab.png" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="freebsd" /><category term="machine learning" /><category term="FreeBSD" /><summary type="html"><![CDATA[How to install Jupyter Lab in FreeBSD 15.0 using mostly system packages]]></summary></entry><entry><title type="html">Random thoughts on ChatGPT</title><link href="https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt.html" rel="alternate" type="text/html" title="Random thoughts on ChatGPT" /><published>2023-01-16T00:00:00+00:00</published><updated>2023-01-16T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2023/01/16/random-thoughts-on-chatgpt.html"><![CDATA[<p><em>Shout out to the kind person somewhere on the globe who donated 20 coffees on “Buy me a coffee”. Whoever you are, I thank you! I promise that I will try to deliver high-value content in the following months.</em></p>

<p>In <a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies">Superintelligence</a>, Nick Bostrom talks about an “Oracle AI,” i.e., an AI system that, by design, does not act but merely answers questions, akin to having a genie in a bottle. Arguably, this is the safest advanced AI we can build and have it confined. However, even in this case, we could still be vulnerable to Oracle’s social engineering dexterity should it find the right arguments to persuade us for a matter. So Bostrom makes the following suggestions.</p>

<ol>
  <li>He proposes limiting the number of interactions between humans and the Oracle; contrast this with how many of us treat ChatGPT as an infinite-capacity system, interrogating it repeatedly.</li>
  <li>He makes a case for reducing its output to “yes/no/undetermined” instead of free text responses so that a social engineering attack would take much longer. Again, ChatGPT works differently since it produces a great deal of narrative text.</li>
  <li>Another precaution is resetting the Oracle’s state after each answer so the system does not contemplate long-term goals (ChatGPT remembers previous prompts given to it in the same conversation).</li>
  <li>Last, it should be motivated by something other than human rewards via reinforcement learning, or social engineering becomes inevitable. This could be done via the fascinating idea of injecting “calculated indifference” into the Oracle’s utility function, making it apathetic to whether its replies are read. However, modern AI systems in social media work in the opposite direction: they get rewarded for maximizing user engagement.</li>
</ol>

<p>To be clear, <strong>I’m not implying that ChatGPT is an Oracle or that it somehow possesses agency</strong>, but still, it makes you think about the safety of forthcoming AI systems.</p>

<p>The above are relevant for when fully autonomous AI arrives, if ever. Until then, people misusing advanced AI in politics pose significant dangers to society <em>already</em>. One major concern is the potential for manipulation and disinformation. ChatGPT can generate compelling and sophisticated text, making it easy for bad actors to spread false information and propaganda. This can be particularly dangerous in politics, since misinformation there can have serious real-world consequences (e.g., regarding climate change, pandemics, or nuclear energy).</p>

<p>Another concern is the potential for AI to be used to influence public opinion and sway elections. With its ability to generate vast amounts of content and target specific individuals, ChatGPT could be used to spread disinformation in a highly targeted and effective manner. This could significantly impact the outcome of elections and undermine the democratic process.</p>

<p>Moreover, the use of AI in politics could also perpetuate and amplify societal biases. Machine learning algorithms are only as unbiased as the data they are trained on. This could severely affect marginalized groups and further entrench existing power imbalances.</p>

<p>The future is as dangerous as it is fascinating.</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="machine learning" /><category term="neural networks" /><category term="philosophy" /><summary type="html"><![CDATA[Random thoughts on ChatGPT]]></summary></entry><entry><title type="html">Custom training loops with Pytorch</title><link href="https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops.html" rel="alternate" type="text/html" title="Custom training loops with Pytorch" /><published>2022-09-25T00:00:00+00:00</published><updated>2022-09-25T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2022/09/25/pytorch-custom-training-loops.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#fit-quadratic-regression-model-to-data-by-minimizing-mse" id="markdown-toc-fit-quadratic-regression-model-to-data-by-minimizing-mse">Fit quadratic regression model to data by minimizing MSE</a>    <ul>
      <li><a href="#generate-training-data" id="markdown-toc-generate-training-data">Generate training data</a></li>
      <li><a href="#define-a-model-with-trainable-parameters" id="markdown-toc-define-a-model-with-trainable-parameters">Define a model with trainable parameters</a></li>
      <li><a href="#define-a-custom-loss-function" id="markdown-toc-define-a-custom-loss-function">Define a custom loss function</a></li>
      <li><a href="#define-a-custom-training-loop" id="markdown-toc-define-a-custom-training-loop">Define a custom training loop</a></li>
      <li><a href="#run-the-custom-training-loop" id="markdown-toc-run-the-custom-training-loop">Run the custom training loop</a></li>
      <li><a href="#final-results" id="markdown-toc-final-results">Final results</a></li>
    </ul>
  </li>
</ul>

<h2 id="introduction">Introduction</h2>
<p><a href="https://ekamperi.github.io/mathematics/2020/12/20/tensorflow-custom-training-loops.html">In a previous post</a>, we saw a couple of examples on how to construct a linear regression model, define a custom loss function, have Tensorflow automatically compute the gradients of the loss function with respect to the trainable parameters, and then update the model’s parameters. We will do the same in this post, but we will use PyTorch this time. It’s been a while since I wanted to switch from Tensorflow to Pytorch, and what better way than start from the basics?</p>

<h2 id="fit-quadratic-regression-model-to-data-by-minimizing-mse">Fit quadratic regression model to data by minimizing MSE</h2>
<h3 id="generate-training-data">Generate training data</h3>
<p>First, we will generate some data coming from a quadratic model, i.e., \(y = a x^2 + b x + c\), and we will add some noise to make the setup a bit more realistic.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="k">def</span> <span class="nf">generate_dataset</span><span class="p">(</span><span class="n">npts</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">npts</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="mi">20</span><span class="o">*</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">5</span><span class="o">*</span><span class="n">x</span> <span class="o">-</span> <span class="mi">3</span>
    <span class="n">y</span> <span class="o">+=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">npts</span><span class="p">)</span>  <span class="c1"># Add some noise
</span>    <span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>

<span class="n">x</span><span class="p">,</span> <span class="n">y_true</span> <span class="o">=</span> <span class="n">generate_dataset</span><span class="p">()</span>

<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_true</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$x$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'$y_{true}$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Dataset'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/dataset.png" alt="Dataset for regression" />
</p>

<h3 id="define-a-model-with-trainable-parameters">Define a model with trainable parameters</h3>
<p>In this step, we define the model, specifically \(y = f(x) = a x^2 + b x + c\). Given the model’s parameters \(a, b, c\) and an input tensor \(x\), we calculate the output tensor \(y_\text{pred}\):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
    <span class="s">"""Calculate the model's output given a set of parameters and input x"""</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">params</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">c</span></code></pre></figure>

<h3 id="define-a-custom-loss-function">Define a custom loss function</h3>
<p>Here we define a custom loss function that calculates the mean squared error between the model’s predictions and the actual target values in the dataset.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mse</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="p">):</span>
    <span class="s">"""Returns the mean squared error between y_pred and y_true tensors"""</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">y_pred</span> <span class="o">-</span> <span class="n">y_true</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span></code></pre></figure>

<p>We then assign some initial random values to the parameters \(a, b, c\), and also tell PyTorch that we want it to compute the gradients for this tensor (the parameters tensor).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">params</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">).</span><span class="n">requires_grad_</span><span class="p">()</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span></code></pre></figure>

<p>Here is a helper function that draws the predictions and the actual targets in the same plot. Before training the model, we expect a considerable discrepancy between the two.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">plot_pred_vs_true</span><span class="p">(</span><span class="n">title</span><span class="p">):</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_true</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'y_true'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.75</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">label</span><span class="o">=</span><span class="s">'y_pred'</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span>

<span class="n">plot_pred_vs_true</span><span class="p">(</span><span class="s">'Before training'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/before_training.png" alt="Regression with Pytorch" />
</p>

<h3 id="define-a-custom-training-loop">Define a custom training loop</h3>
<p>This is the heart of our setup. Given the old values of the model’s parameters, we construct a function that calculates the model’s predictions, measures how much they deviate from the actual targets, and updates the parameters via <a href="https://ekamperi.github.io/machine%20learning/2019/07/28/gradient-descent.html">gradient descent</a>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">apply_step</span><span class="p">():</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="mf">1e-3</span>                                   <span class="c1"># Set learning rate to 0.001
</span>    <span class="n">y_pred</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>                       <span class="c1"># Calculate the y given x and a set of parameters' values
</span>    <span class="n">loss</span> <span class="o">=</span> <span class="n">mse</span><span class="p">(</span><span class="n">y_pred</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y_true</span><span class="o">=</span><span class="n">y_true</span><span class="p">)</span>    <span class="c1"># Calculate the loss between y_pred and y_true
</span>    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>                             <span class="c1"># Calculate the gradient of loss tensor w.r.t. graph leaves
</span>    <span class="n">params</span><span class="p">.</span><span class="n">data</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">params</span><span class="p">.</span><span class="n">grad</span><span class="p">.</span><span class="n">data</span>        <span class="c1"># Update parameters' values using gradient descent
</span>    <span class="n">params</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">None</span>                          <span class="c1"># Zero grad since backward() accumulates by default gradient in leaves
</span>    <span class="k">return</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>                  <span class="c1"># Return the y_pred, along with the loss as a standard Python number</span></code></pre></figure>

<h3 id="run-the-custom-training-loop">Run the custom training loop</h3>
<p>We repeatedly apply the previous step until the training process converges to a particular combination of \(a, b, c\).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">epochs</span> <span class="o">=</span> <span class="mi">15000</span>
<span class="n">history</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
    <span class="n">y_pred</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">apply_step</span><span class="p">()</span>
    <span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">history</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Epoch'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'MSE vs. Epoch'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/history.png" alt="History of MSE loss" />
</p>
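<p>As a sanity check (not part of the original figures), we can also inspect the learned parameters and the final loss. Here is a self-contained sketch of the whole pipeline; the exact numbers depend on the random seed, but the final MSE should approach the noise floor, and \(a, b, c\) should drift toward the true values \(20, 5, -3\):</p>

```python
import torch

torch.manual_seed(0)  # Arbitrary seed, for reproducibility

# Regenerate the dataset so the snippet runs on its own
x = torch.linspace(0, 1, 100)
y_true = 20 * x**2 + 5 * x - 3 + torch.randn(100)

params = torch.randn(3, requires_grad=True)
lr, epochs = 1e-3, 15000
for _ in range(epochs):
    y_pred = params[0] * x**2 + params[1] * x + params[2]
    loss = ((y_pred - y_true)**2).mean()
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad   # Gradient descent step
    params.grad = None               # Zero the accumulated gradients

a, b, c = params.tolist()
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, final MSE={loss.item():.3f}")
```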

<h3 id="final-results">Final results</h3>
<p>Finally, we superimpose the dataset with the best quadratic regression model PyTorch converged to:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plot_pred_vs_true</span><span class="p">(</span><span class="s">'After training'</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/pytorch_custom_loop/after_training.png" alt="Regression with Pytorch" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="mathematics" /><category term="neural networks" /><category term="pytorch" /><category term="statistics" /><summary type="html"><![CDATA[How to create custom training loops with Pytorch]]></summary></entry><entry><title type="html">Applications of autoencoders</title><link href="https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders.html" rel="alternate" type="text/html" title="Applications of autoencoders" /><published>2022-09-17T00:00:00+00:00</published><updated>2022-09-17T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2022/09/17/applications-of-autoencoders.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#applications-of-autoencoders" id="markdown-toc-applications-of-autoencoders">Applications of autoencoders</a>    <ul>
      <li><a href="#dimensionality-reduction" id="markdown-toc-dimensionality-reduction">Dimensionality reduction</a></li>
      <li><a href="#feature-extraction" id="markdown-toc-feature-extraction">Feature extraction</a></li>
      <li><a href="#object-matching" id="markdown-toc-object-matching">Object matching</a></li>
      <li><a href="#denoising" id="markdown-toc-denoising">Denoising</a></li>
      <li><a href="#anomaly-detection" id="markdown-toc-anomaly-detection">Anomaly detection</a></li>
      <li><a href="#synthetic-data-generation" id="markdown-toc-synthetic-data-generation">Synthetic data generation</a></li>
      <li><a href="#data-imputation" id="markdown-toc-data-imputation">Data imputation</a></li>
      <li><a href="#image-colorization" id="markdown-toc-image-colorization">Image colorization</a></li>
    </ul>
  </li>
</ul>

<h2 id="introduction">Introduction</h2>
<p>Hello, world! It’s been nine months since my last post! I was so engaged working at Chronicles Health that I couldn’t find time to reserve for blogging. However, the previous week was my last one there. Now I’ll wear my medical hat again and work as a <a href="https://en.wikipedia.org/wiki/Radiation_therapy">radiation oncology consultant</a>, hopefully enjoying a more predictable work schedule. I will probably write a blog post about my experience of working at a startup. But for now, all I wanted was to make a soft comeback by writing a short post on <strong>the applications of autoencoders</strong>, one of my favorite machine learning topics.</p>

<p>In the future, I expect to find time to expand on these topics via separate posts, with in-depth analysis and coding examples.</p>

<h2 id="applications-of-autoencoders">Applications of autoencoders</h2>
<h3 id="dimensionality-reduction">Dimensionality reduction</h3>
<p><a href="https://ekamperi.github.io/machine%20learning/2021/01/21/encoder-decoder-model.html">We have already used autoencoders as a dimensionality reduction technique before</a>, and judging from Google Analytics, this post has been quite a success! So, the idea here is to compress the input
by learning some efficient low-dimensional data representation encoded onto the latent layer. To the extent that we accomplish that, we can then replace the original input \(x\) with the new \(x_\text{latent}\),
just like we can replace \(x\) with the first couple of principal components when doing PCA.</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/autoencoder/autoencoder_schematic.png" alt="Schematic representation of an autoencoder" />
</p>

<p>As it turns out, though, there are quite a few more applications that we will present here briefly.</p>
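<p>To make the architecture concrete, here is a minimal, hypothetical autoencoder sketch in PyTorch. The sizes (a 784-dimensional input, e.g., a flattened 28×28 image, squeezed down to a 2D latent space through illustrative hidden layers) are assumptions of mine, not prescriptions:</p>

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """A toy autoencoder: input -> 2-D latent space -> reconstruction."""
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # x -> x_latent
        return self.decoder(z)     # x_latent -> reconstruction of x

model = AutoEncoder()
x = torch.randn(8, 784)            # A dummy batch of 8 "images"
x_latent = model.encoder(x)        # The low-dimensional representation
x_recon = model(x)
print(x_latent.shape, x_recon.shape)  # torch.Size([8, 2]) torch.Size([8, 784])
```

<p>Training it amounts to minimizing the reconstruction loss between <code class="language-plaintext highlighter-rouge">x_recon</code> and <code class="language-plaintext highlighter-rouge">x</code>; dimensionality reduction then means keeping <code class="language-plaintext highlighter-rouge">x_latent</code> in place of <code class="language-plaintext highlighter-rouge">x</code>.</p>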

<h3 id="feature-extraction">Feature extraction</h3>
<p>To the uninitiated, feature extraction is the process of transforming some data so that the new variables are more informative and less redundant than the original ones. Also, the new derived values (features) can hopefully differentiate between different classes of things in a classification task or predict some target value in a regression task. This application is tightly related to dimensionality reduction. Here’s how we do it. We take raw (unlabelled) data and train an autoencoder with it to force the model to learn efficient data representations (the so-called latent space). Once we have trained the autoencoder network, we <strong>ignore the decoder part of the model</strong>. Instead, we use only the encoder to convert new raw input data into the latent space representation. This new representation can then be used for supervised learning tasks. So, instead of training a supervised model to learn how to map \(x\) to \(y\), we ask it to map \(x_\text{latent}\) to \(y\).</p>

<h3 id="object-matching">Object matching</h3>
<p>Again, this application is connected to the previous one. Say we’d like to build a search engine for images or songs. We could save all the items in a database and then go through each one, comparing it with our target. But that would be very time-consuming if we did the comparison pixel-by-pixel (or beat-by-beat). Instead, we could run the entire thing in the latent space. Concretely, we would first pass all the known images (or songs) through a trained autoencoder and save their latent space representation (which, by definition, is low-dimensional and cheap!) in a database. The position of the input in the latent space is akin to a “signature”. Assuming we used a 2D latent space, every song in the database would be characterized by just two numbers! Then, given an image (or song) to search for, we would convert it into a latent space representation (again, two numbers), and <em>then</em> we would search the database for it. The comparison could be made via, for instance, the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> between the target and the \(i\)-th element in the database. The rationale is that operating in the low-dimensional latent space is much more economical, computation-wise, than in the original high-dimensional space. What if this method doesn’t work? Well, we could try increasing the latent space dimensionality from 2D to 3D and try again, until we find the minimum number of latent dimensions that suffices to separate the images (or songs) in our database.</p>

<p>To be a bit more concrete, this is the hypothetical database of known songs along with their latent encoding:</p>

<table>
  <thead>
    <tr>
      <th>Song name</th>
      <th>Coordinate of latent dim 1</th>
      <th>Coordinate of latent dim 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Enter Sandman</td>
      <td>0.65</td>
      <td>0.12</td>
    </tr>
    <tr>
      <td>Fear of the Dark</td>
      <td>0.44</td>
      <td>0.99</td>
    </tr>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>Land of the free</td>
      <td>0.81</td>
      <td>0.03</td>
    </tr>
  </tbody>
</table>

<p>And suppose we are given an unknown song with \(\text{coord latent dim}_1 = 0.45, \text{coord latent dim}_2 = 0.97\). We would then calculate its distance from every song in the database and pick the one with the minimum distance. Neat, right?</p>
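<p>The lookup itself is a one-liner. Here is a sketch using the (made-up) signatures from the table above:</p>

```python
import math

# The hypothetical song database from the table above:
# song name -> 2-D latent "signature"
database = {
    "Enter Sandman":    (0.65, 0.12),
    "Fear of the Dark": (0.44, 0.99),
    "Land of the free": (0.81, 0.03),
}

def nearest_song(query, db):
    """Return the song whose latent signature is closest (Euclidean) to query."""
    return min(db, key=lambda name: math.dist(query, db[name]))

print(nearest_song((0.45, 0.97), database))  # -> Fear of the Dark
```

<p>In a real system, the database would be larger and the latent space possibly higher-dimensional, but the principle is identical.</p>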

<h3 id="denoising">Denoising</h3>
<p>Autoencoders can be trained in such a way that they learn how to perform efficient denoising of the source. Contrary to conventional denoising techniques, they do not actively look for noise in the data. Instead, they extract the source from the noisy input by learning a representation of it. The representation is subsequently used to reconstruct the input as noise-free data. A concrete example is training an autoencoder to remove noise from images. The key to accomplishing this is to take the training images, <em>add some noise</em> to them, and use them as the \(x\). Then use the original images (without the noise) as the \(y\). So, to put it a bit more formally, we are asking the network to learn the mapping \((x+\text{noise}) \to x\). The following figure is taken from Keras’s documentation on autoencoders. The upper row consists of the original untainted images (the \(y\)), and the lower row contains the images with some noise added by us (the \(x\)).</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/noisy_digits.png" alt="Noisy digits for training a denoising autoencoder" />
</p>
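<p>Constructing the training pairs is straightforward. Here is a toy Python sketch, using plain lists of pixel intensities instead of real images, and an assumed noise level of 0.2:</p>

```python
import random

random.seed(0)

# Toy stand-ins for flattened training images: pixel intensities in [0, 1].
clean = [[random.random() for _ in range(784)] for _ in range(10)]

def corrupt(image, noise_level=0.2):
    """Add Gaussian noise to every pixel and clip back to [0, 1]."""
    return [min(1.0, max(0.0, p + random.gauss(0.0, noise_level))) for p in image]

# Training pairs for the denoising autoencoder: the network sees noisy[i]
# as input (the x) and is asked to output clean[i] (the y), i.e. it learns
# the mapping (x + noise) -> x.
noisy = [corrupt(img) for img in clean]
```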

<h3 id="anomaly-detection">Anomaly detection</h3>
<p>Since autoencoders are trained to reconstruct their input as well as they can, naturally, if they are given an <em>out of distribution</em> example, the reconstruction will not be as good as if this example was <em>from the training distribution</em>. So, by using some proper threshold for the reconstruction loss, one can build an anomaly detector: any outlier \(x\) will be reconstructed as \(x'\), where \(\left|x' - x\right| \gt \text{thresh}\).</p>
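<p>A sketch of this thresholding logic, with a stand-in for the trained autoencoder (the actual model is beside the point here, so we fake one that maps every input to the training mean):</p>

```python
def reconstruction_error(x, x_prime):
    """Mean squared reconstruction error between x and its reconstruction x'."""
    return sum((a - b) ** 2 for a, b in zip(x, x_prime)) / len(x)

def is_anomaly(x, autoencoder, thresh):
    """Flag x as an outlier if the autoencoder reconstructs it poorly."""
    return reconstruction_error(x, autoencoder(x)) > thresh

# Stand-in "autoencoder" for illustration only: it maps every input to the
# mean of some training data. A real trained model would go here.
train_mean = [0.5, 0.5, 0.5]
fake_autoencoder = lambda x: train_mean

print(is_anomaly([0.51, 0.49, 0.50], fake_autoencoder, thresh=0.1))  # False
print(is_anomaly([5.00, -3.00, 9.00], fake_autoencoder, thresh=0.1))  # True
```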

<h3 id="synthetic-data-generation">Synthetic data generation</h3>
<p>Variational autoencoders can generate new synthetic data, primarily images but also time series. The way to do this is by first training an autoencoder with some data and then <em>randomly sampling the latent space</em> of the autoencoder. These random samples are then handed over to the decoder part of the network, leading to new data generation. The following image shows the results of sampling an autoencoder trained on the MNIST dataset. These digits do not exist in the training dataset; they are <em>generated</em> by the network.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/latent_sample.png" alt="Image generation with variational autoencoders" />
</p>

<p>Variational autoencoders differ from vanilla autoencoders because the network learns a (typically) normal <strong>distribution for the latent vectors</strong>. This acts as some sort of <em>regularization</em> since autoencoders tend to memorize their input.</p>

<h3 id="data-imputation">Data imputation</h3>
<p>This is similar to the previous application. The idea here is to take a dataset <em>without</em> any missing entries and randomly <em>delete</em> some of the values in some of the columns, pretending they are missing. However, we know the ground-truth values and train the autoencoder to output those. Once trained, we can present a <em>really missing</em> entry to the network, and assuming that it has been trained robustly, it should perform efficient imputation. Again, to be a bit more concrete, given a dataset with \(x\) values <em>without</em> any missing data, we artificially remove some values and then train an autoencoder to learn the mapping \(x_{\text{missing}} \to x\).</p>
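<p>A toy Python sketch of the masking step; the dataset size, masking probability, and fill value are all made up for illustration:</p>

```python
import random

random.seed(42)

# Complete toy dataset: 8 rows of 5 numeric features, no missing values.
full_rows = [[random.random() for _ in range(5)] for _ in range(8)]

def mask_row(row, p_missing=0.3, fill=0.0):
    """Randomly hide some values, replacing them with `fill`."""
    return [fill if random.random() < p_missing else v for v in row]

# Training pairs for the imputing autoencoder: x_missing -> x.
# The ground truth (the unmasked row) is the training target.
pairs = [(mask_row(row), row) for row in full_rows]
```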

<h3 id="image-colorization">Image colorization</h3>
<p>Image colorization is the process of assigning colors to a grayscale image.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/autoencoder/colorized_einstein.jpeg" alt="Colorized Albert Einstein" />
</p>

<p>This task can be achieved by taking a dataset with colored images and creating a new dataset with pairs of grayscale and colored images. We then train an autoencoder to learn the mapping \(x_\text{grayscale} \to x_\text{colored}\).</p>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="mathematics" /><category term="neural networks" /><category term="statistics" /><summary type="html"><![CDATA[A high-level summary of autoencoders' applications]]></summary></entry><entry><title type="html">The joy of not google’ing: Short to long stick ratio in broken rods</title><link href="https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio.html" rel="alternate" type="text/html" title="The joy of not google’ing: Short to long stick ratio in broken rods" /><published>2021-12-20T00:00:00+00:00</published><updated>2021-12-20T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2021/12/20/short-to-long-stick-ratio.html"><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Hola! Long time no see! In the past months, I’ve been swamped working as a machine learning engineer at <a href="https://www.chronicles.health/">Chronicles Health</a>, a digital health company, on a course to revolutionize the management of inflammatory bowel disease.</p>

<p>But I’m back! However, today’s post won’t cover some fancy machine learning algorithm or data science topic. Instead, let me tell you about a neat little problem I found on the Internet (credits to <a href="https://www.wikiwand.com/en/Gianni_A._Sarcone">Gianni Sarcone</a>). It turns out that, like many people, I’ve become extremely good at googling stuff but less so at thinking for myself. So, I decided to solve this cute little puzzle in the traditional “analog” way, with pen and paper, without any online help :) And as a matter of fact, I encourage you to do the same. Every now and then, try to solve a relatively simple science problem without referencing online resources. If you need some formula or a theorem, look it up in a paper book, seriously. You will be amazed at how beneficial this approach will be to your problem-solving skills.</p>

<h3 id="problem-statement">Problem statement</h3>
<p style="border:2px; border-style:solid; border-color:#1C6EA4; border-radius: 5px; padding: 20px;">
Suppose that we throw 10,000 rods against a rock, and they break at random places. What is the average ratio of the length of the short piece to the length of the long piece?
</p>

<h3 id="solution">Solution</h3>
<p>We start by modeling the problem, which is probably the most critical part of the problem-solving process. The way we set it up will largely define the next steps. So, we need to assign symbols to the various components involved. There’s a rod and two pieces, a <em>short</em> and a <em>long</em> one. Let’s say that the rod has length \(L\). Then, if we agree that the short piece is of size \(x\), the remainder will be the long one, with length \(L-x\). Mind that \(x\) is not fixed; it’s a random variable since we have 10,000 rods, and so is \(L-x\).</p>

<p align="center">
 <img style="width: 60%; height: 60%" src="https://ekamperi.github.io/images/short-to-long-stick-sketch.png" alt="Average short to long stick ratio" />
</p>

<p>Every time we translate the statement of a problem into mathematical symbols and expressions, we need to constrain the values that our variables assume so that our setup always “makes sense”. Since \(x\) is the short part, it really can’t be larger than half the rod, because then it would be the long one! So, \(x\in[0,L/2]\). Also, \(L&gt;0\), or there wouldn’t be any rod to begin with. So, we are interested in the average ratio of the short to long pieces, i.e.:</p>

\[\text{avg. ratio} = \left\langle x/(L-x)\right\rangle\]

<p>At this point, we need to invoke the concept of <a href="https://www.wikiwand.com/en/Expected_value"><strong>expected value</strong></a>. The expected value of a random variable \(X\), often denoted \(\mathbb{E}[X]\), can be thought of as a generalized version of the weighted average, where the weights are given by the probabilities. Consider, for example, a fair die: the probability of each outcome is \(p=1/6\), and the expected value after many throws is given by \(1 \times 1/6 + 2 \times 1/6 + \ldots + 6 \times 1/6 = 7/2\). This is easily demonstrated by simulating, say, 1,000 throws and taking the mean of the outcomes:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">Mean</span><span class="o">@</span><span class="nb">RandomInteger</span><span class="p">[{</span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">6</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="m">1000</span><span class="p">]</span><span class="w"> </span><span class="o">//</span><span class="w"> </span><span class="nb">N</span><span class="w">
</span><span class="c">(* 3.58 *)</span></code></pre></figure>

<p>Alright, back to our problem! Here we don’t throw dice. Instead, we crack rods and look at the number \(x/(L-x)\). To calculate the <em>expected value</em> of this ratio, we write:</p>

\[\begin{align*}
\mathbb{E}(x/(L-x)) = \int_{0}^{L/2} \frac{x}{L-x} p(x) \mathrm{d}x
\end{align*}\]

<p>Where \(x/(L-x)\) is the <em>value of the ratio</em> when the rod breaks at short length \(x\), and \(p(x)\) is the <em>probability</em> of this particular break happening. We assume that the rod is equally likely to break at any point \(x\), since the problem doesn’t state any specific probability distribution. In <a href="https://ekamperi.github.io/mathematics/2021/01/29/why-is-normal-distribution-so-ubiquitous.html#information-theoretic-arguments">another blog post</a> I talk about how the uniform distribution is maximally noncommittal with respect to missing information. Check it out! The information-theoretic arguments are so mind-opening.</p>

<p>Therefore, \(p(x) = 1/(L/2)=2/L\). Does this make sense? Yes, because the longer the rod, the less probable it is for a <em>particular</em> break of short length \(x\) to happen. Imagine if we had a die with 1,000,000 faces; what would be the probability of getting the number “3” after a throw? 1/1,000,000. What if it was a regular one with 6 faces? The probability would be 1/6.</p>

\[\begin{align*}
\mathbb{E}(x/(L-x)) = \int_{0}^{L/2} \left( \frac{x}{L-x} \cdot\frac{2}{L} \right) \mathrm{d}x = 
2\int_{0}^{L/2} \frac{x}{L(L-x)} \mathrm{d}x 
\end{align*}\]

<p>From this point onwards, it’s just about computing the integral. Such integrals are usually calculated by breaking up the fraction into a sum of simple fractions, e.g.,</p>

\[\frac{x}{L(L-x)}=\frac{A}{L} + \frac{B}{L-x}\]

<p>and solving for \(A, B\). Since this is a simple one, we could just see that:</p>

\[\frac{x}{L(L-x)}=-\frac{1}{L} + \frac{1}{L-x}\]

<p>Therefore:</p>

\[\begin{align*}
\mathbb{E}(x/(L-x))
&amp;= 2\int_{0}^{L/2} \left( -\frac{1}{L} + \frac{1}{L-x} \right) \mathrm{d}x\\
&amp;= -\frac{2}{L} \left(\frac{L}{2}-0\right) - 2\left[\ln{(L-x)}\right]_{0}^{L/2}\\
&amp;=-1 - 2\left[\ln\left({L}-\frac{L}{2}\right) - \ln{L}\right]\\
&amp;=-1-2(\ln{1/2}) = -1+\ln{4}
\end{align*}\]
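<p>As a quick sanity check of the closed-form result, we can also approximate the integral numerically; here is a small Python sketch using the midpoint rule with \(L=1\):</p>

```python
import math

# Numerically check the closed-form result: for L = 1,
#   E[x/(L-x)] = (2/L) * integral_0^{L/2} x/(L-x) dx  =  ln4 - 1 ~ 0.386
L = 1.0
n = 100_000          # number of midpoint-rule subdivisions
h = (L / 2) / n      # width of each subdivision

integral = (2 / L) * h * sum(
    ((k + 0.5) * h) / (L - (k + 0.5) * h) for k in range(n)
)

print(integral, math.log(4) - 1)  # both ~ 0.3863
```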

<h3 id="simulation">Simulation</h3>
<p>Here is a simple simulation in <em>Mathematica</em> for a rod of length \(L=1\). Notice how the average ratio converges to \(-1 + \ln{4} \simeq 0.386\).</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">L</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">;</span><span class="w">
</span><span class="nv">f</span><span class="p">[</span><span class="nv">x</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nv">x</span><span class="o">/</span><span class="p">(</span><span class="nv">L</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">x</span><span class="p">)</span><span class="w">
</span><span class="nv">sim</span><span class="p">[</span><span class="nv">n</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="nb">Mean</span><span class="p">[</span><span class="w">
  </span><span class="nv">f</span><span class="w"> </span><span class="o">/@</span><span class="w"> </span><span class="nb">RandomReal</span><span class="p">[{</span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="nv">L</span><span class="o">/</span><span class="m">2</span><span class="p">}</span><span class="o">,</span><span class="w">   </span><span class="nv">n</span><span class="p">]</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="nb">ListPlot</span><span class="p">[</span><span class="w">
 </span><span class="nb">Table</span><span class="p">[{</span><span class="nv">n</span><span class="o">,</span><span class="w"> </span><span class="nv">sim</span><span class="p">[</span><span class="nv">n</span><span class="p">]}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">n</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">20000</span><span class="o">,</span><span class="w"> </span><span class="m">1000</span><span class="p">}]</span><span class="o">,</span><span class="w"> </span><span class="nb">Joined</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">InterpolationOrder</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotRange</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">All</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"# of throws"</span><span class="o">,</span><span class="w"> </span><span class="s">"Value of ratio"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
 </span><span class="nb">GridLines</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Automatic</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotRange</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">All</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 60%; height: 60%" src="https://ekamperi.github.io/images/short-stick-ratio.png" alt="Average short to long stick ratio" />
</p>

<h3 id="stuff-to-think-about">Stuff to think about</h3>
<ul>
  <li>Why is the result <em>independent</em> of the length \(L\)? Is there any intuitive answer to this?</li>
  <li>Why was it enough to integrate from \(x=0\) to \(x=L/2\) and not do something like:</li>
</ul>

\[\int_0^{L/2} \left(\frac{x}{L-x} \cdot \frac{1}{L} \right) \mathrm{d}x + \int_{L/2}^{L} \left(\frac{L-x}{x} \cdot \frac{1}{L} \right) \mathrm{d}x\]

<p>Is there any <em>symmetry</em> in the problem that allows us to shortcut it? (Always look for symmetries!)</p>
<ul>
  <li>What would happen if the probability of the rod breaking at some point weren’t the same along the rod? Say, because the rod was weaker as we moved toward its left end. How would this affect the symmetry of the initial problem?</li>
</ul>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="mathematics" /><summary type="html"><![CDATA[How to calculate the average short-to-long stick ratio when breaking rods at random points.]]></summary></entry><entry><title type="html">The expectation-maximization algorithm - Part 1</title><link href="https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1.html" rel="alternate" type="text/html" title="The expectation-maximization algorithm - Part 1" /><published>2021-07-03T00:00:00+00:00</published><updated>2021-07-03T00:00:00+00:00</updated><id>https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1</id><content type="html" xml:base="https://ekamperi.github.io/mathematics/2021/07/03/expectation-maximization-part1.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a>    <ul>
      <li><a href="#what-is-em-about" id="markdown-toc-what-is-em-about">What is EM about?</a>        <ul>
          <li><a href="#maximum-likelihood-estimation-mle" id="markdown-toc-maximum-likelihood-estimation-mle">Maximum likelihood estimation (MLE)</a></li>
          <li><a href="#-in-the-presence-of-hidden-variables" id="markdown-toc--in-the-presence-of-hidden-variables">… in the presence of hidden variables</a></li>
          <li><a href="#what-are-the-basic-steps-of-em" id="markdown-toc-what-are-the-basic-steps-of-em">What are the basic steps of EM?</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#a-1-dimensional-example" id="markdown-toc-a-1-dimensional-example">A 1-dimensional example</a>    <ul>
      <li><a href="#setting-up-the-problem" id="markdown-toc-setting-up-the-problem">Setting up the problem</a></li>
      <li><a href="#writing-down-the-likelihood-function" id="markdown-toc-writing-down-the-likelihood-function">Writing down the likelihood function</a></li>
      <li><a href="#brute-forcing-one-parameter-at-a-time" id="markdown-toc-brute-forcing-one-parameter-at-a-time">Brute forcing one parameter at a time</a></li>
      <li><a href="#reformulating-the-problem-as-a-latent-variable-problem" id="markdown-toc-reformulating-the-problem-as-a-latent-variable-problem">Reformulating the problem as a latent variable problem</a></li>
      <li><a href="#em-algorithm" id="markdown-toc-em-algorithm">EM algorithm</a></li>
    </ul>
  </li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<h1 id="introduction">Introduction</h1>
<h2 id="what-is-em-about">What is EM about?</h2>
<h3 id="maximum-likelihood-estimation-mle">Maximum likelihood estimation (MLE)</h3>
<p>The expectation-maximization (EM) algorithm is an iterative method to find the local <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood</a> of parameters in statistical models. So what is the maximum likelihood? It’s the maximum value of the likelihood function! And <strong>what is a likelihood function?</strong> It’s a function of the model’s parameters treating the observed data as fixed points, i.e., we write \(\mathcal{L}(\theta\mid x)\) meaning that we vary the parameters \(\theta\) while taking the \(x\)’s as given. If \(\mathcal{L}(\theta_1\mid x) &gt; \mathcal{L}(\theta_2 \mid x)\) then the sample we observed is more likely to have occurred if \(\theta = \theta_1\) rather than if \(\theta = \theta_2\). So, given the data that we have observed, the likelihood function points to a model’s most plausible parameterization that might have generated the observed data.</p>

<p>Here is an elementary example. Suppose that we have some data and want to fit a model of the form \(y = a x\). In this case, \(\theta\) is essentially the coefficient \(a\), but usually, there will be many unknown parameters. In the left image, there’s the likelihood function for several values of the parameter \(a\) (actually, it’s the logarithm of the likelihood function, but we will talk about this later). In the right image, we plot \(y = a x, \, a = -3, \ldots 7\) with a step size of 0.5, superimposed with the observed data. As you can see, \(a = 2\) maximizes the log-likelihood <em>and</em> fits the data better than any other line. So, <strong>fitting data to models can be done via maximum likelihood estimation</strong>.</p>
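<p>The grid search from the figure can be sketched in a few lines of Python; the data here are synthetic, generated from \(y = 2x\) plus Gaussian noise with an assumed \(\sigma = 1\):</p>

```python
import math
import random

random.seed(0)

# Synthetic observations from the "true" model y = 2x, plus unit Gaussian noise.
xs = [k / 10 for k in range(1, 51)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

def log_likelihood(a, sigma=1.0):
    """Gaussian log-likelihood of the data under the candidate model y = a*x."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (y - a * x) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

# Grid of candidate slopes, a = -3, -2.5, ..., 7, as in the figure.
grid = [-3 + 0.5 * k for k in range(21)]
best = max(grid, key=log_likelihood)
print(best)  # the maximizer lands at the grid point nearest the true slope
```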

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/linear_regression_mle.png" alt="Log likelihood of linear regression model" />
</p>

<p>By the way, in a <a href="https://ekamperi.github.io/mathematics/2020/12/20/tensorflow-custom-training-loops.html#how-is-mean-squared-error-related-to-log-likelihood">previous blog post</a> we have proven that by <strong>maximizing the likelihood in the linear regression case, this is equivalent to minimizing the mean squared error</strong>.</p>

<h3 id="-in-the-presence-of-hidden-variables">… in the presence of hidden variables</h3>
<p>The EM algorithm is particularly useful when there are missing data in the data set or when the model depends on <strong>hidden</strong> or so-called <a href="https://en.wikipedia.org/wiki/Latent_variable"><strong>latent variables</strong></a>. These are variables that affect our observed data but in ways that we can’t know directly. So what’s so special about latent parameters? Typically, if we know all the parameters, we can take the derivatives of the likelihood function with respect to them, solve the system of equations and find the values that maximize the likelihood. Like:</p>

\[\left\{\frac{\partial \mathcal{L}}{\partial \theta_1}=0, \frac{\partial \mathcal{L}}{\partial \theta_2}=0, \ldots \right\}\]

<p>This is precisely what we did when we wanted to <a href="https://ekamperi.github.io/mathematics/2020/12/26/tensorflow-trainable-probability-distributions.html">fit some data to a normal distribution</a>. However, in statistical models with latent variables, this typically results in a set of equations where the solutions to the parameters mandate the values of the latent variables and vice versa. By substituting one set of equations into the other, an unsolvable equation is produced. That’s why we need the expectation-maximization algorithm. Concretely, EM can be used in any of the following scenarios:</p>

<ul>
  <li>Estimating parameters of (usually Gaussian) mixture models</li>
  <li>Estimating parameters of Hidden Markov Models</li>
  <li>Unsupervised learning of clusters</li>
  <li>Filling missing data in samples</li>
</ul>

<h3 id="what-are-the-basic-steps-of-em">What are the basic steps of EM?</h3>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/EM_schematic.png" alt="Expectation-Maximization algorithm schematic" />
</p>

<p>EM takes its name from the alternation between two algorithmic steps. The first is the <strong>expectation step</strong>, where we form a function for the expectation of the log-likelihood, using the current best estimates of the model’s parameters. In the <strong>maximization step</strong>, we calculate new parameter values by maximizing the expected log-likelihood. These new estimates of the parameters are then used to determine the distribution of the latent variables in the next expectation step. Don’t worry if it doesn’t make sense now; we will show an example in a minute, and we will also delve into it in subsequent blog posts.</p>

<h1 id="a-1-dimensional-example">A 1-dimensional example</h1>
<h2 id="setting-up-the-problem">Setting up the problem</h2>
<p>Let us consider some observed 1-dimensional data points, \(x_i\). We assume they are generated by <em>two</em> normal distributions \(N(\mu_1, \sigma_1^2)\) and \(N(\mu_2, \sigma_2^2)\), with probabilities \(\pi\) and \(1-\pi\), respectively. In this setup, we have 5 unknown parameters: the mixing probability \(\pi\), the mean and standard deviation of the first distribution, and the mean and standard deviation of the second distribution. Let us gather all these under a vector called \(\theta = [\pi, \mu_1, \sigma_1, \mu_2, \sigma_2]\).</p>

<p align="center">
 <img style="width: 90%; height: 90%" src="https://ekamperi.github.io/images/em_algorithm/histogram_broken_by_dist.png" alt="Histogram of mixed gaussian distribution" />
</p>

<h2 id="writing-down-the-likelihood-function">Writing down the likelihood function</h2>
<p>Suppose that we observed a datapoint with value \(x_i\). What is the probability of \(x_i\) occurring? Assuming \(\varphi_1(x)\) is the <a href="https://en.wikipedia.org/wiki/Probability_density_function">probability density function</a> of the 1st distribution, and \(\varphi_2(x)\) of the second, the probability of observing \(x_i\) is:</p>

\[p(x_i) = \pi \varphi_1(x_i) + (1-\pi)\varphi_2(x_i)\]

<p>To be more pedantic we would write:</p>

\[p(x_i\mid \theta) = \pi \varphi_1(x_i \mid \mu_1,\sigma_1^2) + (1-\pi)\varphi_2(x_i \mid \mu_2,\sigma_2^2)\]

<p>Which means that the PDFs are parameterized by \(\mu_1,\sigma_1^2\) and \(\mu_2, \sigma_2^2\), respectively. Ok, but this is just for a single observation \(x_i\). What if we have a bunch of \(x_i\)’s, say for \(i=1,\ldots,N\)? To find the joint probability of \(N\) independent events (which, by the way, is the likelihood function!) we just multiply the individual probabilities:</p>

\[\mathcal{L}(\theta \mid x) = \prod_{i=1}^N p(x_i \mid \theta)\]

<p>But since it’s easier to work with sums rather than products, we take the logarithm of the likelihood, \(\ell(\theta\mid x)\):</p>

\[\begin{align*}\ell(\theta \mid x) &amp;= \log \prod_{i=1}^N p(x_i \mid \theta) =\sum_{i=1}^N \log p(x_i \mid \theta)\\&amp;=\sum_{i=1}^N \log \left[\pi \varphi_1(x_i\mid \mu_1,\sigma_1^2) + (1-\pi)\varphi_2(x_i|\mu_2,\sigma_2^2)\right]\end{align*}\]

<p>So, our objective is to maximize likelihood \(\mathcal{L}(\theta\mid x)\), which is equivalent to maximizing the log-likelihood \(\ell(\theta\mid x)\), with respect to the model’s parameters \(\theta = [\pi, \mu_1, \sigma_1, \mu_2, \sigma_2]\), <em>given</em> the data points \(\{x_i\}\).</p>
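<p>The log-likelihood above translates almost line-by-line into code. Here is a small Python sketch (the data points and parameter values below are made up purely for illustration):</p>

```python
import math

def normal_pdf(x, mu, sigma):
    """phi(x | mu, sigma^2): the normal probability density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, p, mu1, s1, mu2, s2):
    """l(theta | x) = sum_i log[ p*phi1(x_i) + (1-p)*phi2(x_i) ]."""
    return sum(
        math.log(p * normal_pdf(x, mu1, s1) + (1 - p) * normal_pdf(x, mu2, s2))
        for x in data
    )

# Toy check: points clustered near mu=1 and mu=9 are more likely under
# well-matched component means than under a badly mis-specified one.
data = [0.5, 1.2, 0.8, 9.3, 8.7, 9.1]
good = log_likelihood(data, 0.5, 1.0, 1.0, 9.0, 1.0)
bad = log_likelihood(data, 0.5, 4.0, 1.0, 9.0, 1.0)
print(good > bad)  # True
```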

<h2 id="brute-forcing-one-parameter-at-a-time">Brute forcing one parameter at a time</h2>
<p>In the following examples, we will generate some synthetic observed data from a mixture distribution with known parameters \(\mu_1, \sigma_1, \mu_2, \sigma_2\) and mixing probability \(\pi\). We will then calculate \(\ell(\theta\mid x)\) for various values of one parameter, while keeping the rest of the parameters fixed. Each time we do that, we will see how \(\ell(\theta\mid x)\) is maximized when the parameter becomes equal to its ground-truth value.</p>

<p>Let’s create a mixture distribution of two Gaussian distributions with known parameters \(\mu_1, \sigma_1, \mu_2, \sigma_2\) and known mixing probability \(\pi=0.3\). Normally, we won’t know the values of these parameters, and as a matter of fact, <strong>finding them will be the very objective of the EM algorithm</strong>. But for now, let’s <em>pretend</em> we don’t know them.</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">ClearAll</span><span class="p">[</span><span class="s">"Global`*"</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">2</span><span class="p">}</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">9</span><span class="o">,</span><span class="w"> </span><span class="m">3</span><span class="p">}</span><span class="o">;</span><span class="w">

</span><span class="nv">npts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5000</span><span class="o">;</span><span class="w">
</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m</span><span class="o">_,</span><span class="w"> </span><span class="nv">s</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nb">NormalDistribution</span><span class="p">[</span><span class="nv">m</span><span class="o">,</span><span class="w"> </span><span class="nv">s</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="nv">mixdist</span><span class="p">[</span><span class="nv">p</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="nb">MixtureDistribution</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]}]</span><span class="w">
</span><span class="nv">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">RandomVariate</span><span class="p">[</span><span class="nv">mixdist</span><span class="p">[</span><span class="m">0.3</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">]</span><span class="o">;</span><span class="w">
</span><span class="nb">Histogram</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/histogram.png" alt="Histogram of mixture distribution" />
</p>

<p>Let’s plot the probability density functions of the mixture distribution for various mixing probabilities \(\pi\). We notice how for \(\pi\to 0\) the mixture distribution approaches the 1st distribution, and for \(\pi\to 1\), the 2nd distribution. For in-between values, it’s a mixture! ;)</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nb">Style</span><span class="p">[</span><span class="nb">Grid</span><span class="p">[{</span><span class="w">
   </span><span class="nb">Table</span><span class="p">[</span><span class="w">
    </span><span class="nb">Plot</span><span class="p">[</span><span class="nb">PDF</span><span class="p">[</span><span class="nv">mixdist</span><span class="p">[</span><span class="nv">p</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">x</span><span class="o">,</span><span class="w"> </span><span class="o">-</span><span class="m">10</span><span class="o">,</span><span class="w"> </span><span class="m">20</span><span class="p">}</span><span class="o">,</span><span class="w"> 
     </span><span class="nb">PlotLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s">"p="</span><span class="w"> </span><span class="o">&lt;&gt;</span><span class="w"> </span><span class="nb">ToString</span><span class="o">@</span><span class="nv">p</span><span class="o">,</span><span class="w">
     </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"x"</span><span class="o">,</span><span class="w"> </span><span class="s">"PDF(x)"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
     </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w">
     </span><span class="nb">AxesOrigin</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="o">-</span><span class="m">10</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="nb">Filling</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Axis</span><span class="p">]</span><span class="o">,</span><span class="w">
    </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">0.3</span><span class="p">}]</span><span class="w">
   </span><span class="p">}]</span><span class="o">,</span><span class="w">
 </span><span class="nb">ImageSizeMultipliers</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="m">0.7</span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/varying_mixing_prob.png" alt="PDF of mixture distribution for varying mixing probability" />
</p>

<p>Let us now define the log-likelihood function:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">logLikelihood</span><span class="p">[</span><span class="nv">data</span><span class="o">_,</span><span class="w"> </span><span class="nv">p</span><span class="o">_,</span><span class="w"> </span><span class="nv">m1</span><span class="o">_,</span><span class="w"> </span><span class="nv">s1</span><span class="o">_,</span><span class="w"> </span><span class="nv">m2</span><span class="o">_,</span><span class="w"> </span><span class="nv">s2</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{}</span><span class="o">,</span><span class="w">
  </span><span class="nb">Sum</span><span class="p">[</span><span class="w">
   </span><span class="nb">Log</span><span class="p">[</span><span class="w">
    </span><span class="nv">p</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">x</span><span class="p">]</span><span class="w"> </span><span class="o">/.</span><span class="w"> 
     </span><span class="nv">x</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w">
    </span><span class="p">]</span><span class="o">,</span><span class="w">
   </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nb">Length</span><span class="o">@</span><span class="nv">data</span><span class="p">}]</span><span class="w">
  </span><span class="p">]</span><span class="w">
  </span></code></pre></figure>

<p>Ok, we are ready to go. We will first vary the mixing probability \(\pi\), keeping the rest of the model’s parameters fixed. In some sense, we are brute-forcing \(\pi\): we scan a grid of candidate values and keep the one that maximizes the log-likelihood:</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">llvalues</span><span class="w"> </span><span class="o">=</span><span class="w"> 
  </span><span class="nb">Table</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">logLikelihood</span><span class="p">[</span><span class="nv">data</span><span class="o">,</span><span class="w"> </span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">}]</span><span class="o">;</span><span class="w">
</span><span class="p">{</span><span class="nv">pmax</span><span class="o">,</span><span class="w"> </span><span class="nv">llmax</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> 
 </span><span class="nv">llvalues</span><span class="p">[[</span><span class="nb">Ordering</span><span class="p">[</span><span class="nv">llvalues</span><span class="p">[[</span><span class="nb">All</span><span class="o">,</span><span class="w"> </span><span class="m">2</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="o">-</span><span class="m">1</span><span class="p">][[</span><span class="m">1</span><span class="p">]]]]</span><span class="w">
</span><span class="c">(* {0.3, -14437.1} *)</span><span class="w">

</span><span class="nv">plot1</span><span class="w"> </span><span class="o">=</span><span class="w">
 </span><span class="nb">Show</span><span class="p">[</span><span class="w">
  </span><span class="nb">ListPlot</span><span class="p">[</span><span class="nv">llvalues</span><span class="o">,</span><span class="w"> </span><span class="nb">Joined</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">FrameLabel</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="s">"Probability p"</span><span class="o">,</span><span class="w"> </span><span class="s">"Log-Likelihood"</span><span class="p">}</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">Frame</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">True</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="o">,</span><span class="w"> </span><span class="nb">False</span><span class="p">}</span><span class="o">,</span><span class="w"> 
   </span><span class="nb">GridLines</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{{</span><span class="nv">pmax</span><span class="p">}</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">llmax</span><span class="p">}}</span><span class="o">,</span><span class="w"> </span><span class="nb">GridLinesStyle</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="nb">Dashed</span><span class="p">]</span><span class="o">,</span><span class="w">
  </span><span class="nb">ListPlot</span><span class="p">[</span><span class="nv">llvalues</span><span class="o">,</span><span class="w"> </span><span class="nb">PlotStyle</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="p">{</span><span class="nb">Red</span><span class="o">,</span><span class="w"> </span><span class="nb">AbsolutePointSize</span><span class="p">[</span><span class="m">5</span><span class="p">]}]</span><span class="w">
  </span><span class="p">]</span></code></pre></figure>

<p align="center">
 <img style="width: 50%; height: 50%" src="https://ekamperi.github.io/images/em_algorithm/log_likelihood_p.png" alt="Log likelihood for varying mixing probability" />
</p>

<p>Do you see how \(\ell(\theta\mid x)\) is maximized at \(\pi = 0.3\)? By the same token, we can try other model parameters, but we will always come to the same conclusion: <strong>the log-likelihood, therefore the likelihood, is maximized when our guesses become equal to the ground-truth values for the model’s parameters</strong>.</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/log_likelihood_combined.png" alt="Log likelihood for varying mixing probability, mean and standard deviation" />
</p>

<h2 id="reformulating-the-problem-as-a-latent-variable-problem">Reformulating the problem as a latent variable problem</h2>
<p>Previously, we varied one parameter at a time, keeping the rest at their ground-truth values. We will now get serious and seek to <strong>estimate the values of <em>all</em> parameters simultaneously</strong>. If we attempt to directly maximize \(\ell(\theta|x)\), it will be tough due to the sum of terms inside the logarithm. For those of you who doubt it, just calculate the partial derivatives of \(\ell(\theta|x)\) with respect to \(\pi, \mu_1, \sigma_1, \mu_2, \sigma_2\) and contemplate solving the system where all these derivatives are required to become zero. Good luck with that! :P</p>

<p>There’s another way to go, though. We will reformulate our problem as a problem of <strong>maximum likelihood estimation with latent variables</strong>. For this, we will introduce a set of latent variables \(\Delta_i \in \{0,1\}\), one per observation. If \(\Delta_i = 0\), then \(x_i\) was sampled from the 1st distribution; if \(\Delta_i = 1\), then it came from the 2nd distribution. In this case, the log-likelihood \(\ell(\theta\mid x,\Delta)\) is given by:</p>

\[\begin{align*}
\ell(\theta\mid x,\Delta) = &amp;\sum_{i=1}^N \left[ (1-\Delta_i) \log \varphi_1(x_i) + \Delta_i \log\varphi_2(x_i)\right] +\\
&amp;\sum_{i=1}^N \left[ (1-\Delta_i)\log\pi + \Delta_i\log(1-\pi)\right]
\end{align*}\]

<p>When we write \(\varphi_1(x_i)\), we really mean \(\varphi_1(x_i\mid \mu_1, \sigma_1^2)\), and similarly \(\varphi_2(x_i)\) stands for \(\varphi_2(x_i\mid \mu_2, \sigma_2^2)\). We omitted the parameters to keep the log-likelihood expression easily readable. Feel free to check that the above formula equals the previous expression of \(\ell(\theta\mid x)\), by first letting \(\Delta_i = 0\) and then \(\Delta_i = 1\).</p>
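<p>To see the check in action, fix some \(i\) and substitute each value of \(\Delta_i\) into the two sums; each observation then contributes exactly the term it would contribute in the original mixture likelihood:</p>

\[\Delta_i = 0:\quad \log\varphi_1(x_i) + \log\pi = \log\left[\pi\,\varphi_1(x_i)\right],\qquad
\Delta_i = 1:\quad \log\varphi_2(x_i) + \log(1-\pi) = \log\left[(1-\pi)\,\varphi_2(x_i)\right]\]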

<p>But, we don’t actually know the values \(\Delta_i\). After all, these are the latent variables that we introduced into the problem! If you feel that we ain’t making any progress, hold on. Here’s where the EM algorithm kicks in. Even though we don’t know the exact values \(\Delta_i\), we will use their <em>expected</em> values given our current best estimates for the model’s parameters! <strong>This is the expectation step of the EM algorithm</strong>. So, instead of \(\Delta_i\), we will use \(\gamma_i\) defined as:</p>

\[\gamma_i(\theta) = \mathbb{E}(\Delta_i\mid \theta,x) = \text{Pr}(\Delta_i = 1\mid \theta,x)\]

<p>Once we have calculated the \(\gamma_i\), we have an estimate of which distribution each \(x_i\) belongs to. Therefore, we can update the model’s parameters via weighted maximum-likelihood fits; for Gaussian distributions, these are just \(\gamma_i\)-weighted means and standard deviations of the \(x_i\). <strong>This is the maximization step!</strong> Note that \(\gamma_i\) doesn’t take discrete values like the \(\Delta_i\). Instead, it lies in the interval \([0,1]\) and, therefore, the EM algorithm does a soft membership assignment. I.e., for every \(x_i\), it assigns a probability that it comes from the 1st or the 2nd distribution. That’s why, when we calculate the Gaussians’ parameters, we use a \(\gamma_i\)-weighted average.</p>

<h2 id="em-algorithm">EM algorithm</h2>

<p>So, here’s the EM algorithm for our particular problem:</p>

<ul>
  <li>Initialize the unknown parameters (e.g., \(\hat{\pi} = 0.5\), \(\hat{\mu}_1, \hat{\mu}_2 = \text{random }x_i\), \(\hat{\sigma}_1 = \hat{\sigma}_2 = \sqrt{\sum_{i=1}^N(x_i-\bar{x})^2/N}\), and so on)</li>
  <li><strong>Expectation step</strong>:</li>
</ul>

\[\hat{\gamma_i} = \frac{(1-\pi) \varphi_2(x_i)}{\pi \varphi_1(x_i) + (1-\pi)\varphi_2(x_i)}\]
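<p>If it's not obvious where this formula comes from, it is just Bayes’ rule applied to the latent variable, with \(\text{Pr}(\Delta_i = 1) = 1-\pi\) as the prior and \(\varphi_2\) as the likelihood of \(x_i\) under the 2nd component:</p>

\[\hat{\gamma_i} = \text{Pr}(\Delta_i = 1\mid \theta, x_i) = \frac{\text{Pr}(x_i\mid \Delta_i = 1)\,\text{Pr}(\Delta_i = 1)}{\text{Pr}(x_i)} = \frac{(1-\pi)\,\varphi_2(x_i)}{\pi\,\varphi_1(x_i) + (1-\pi)\,\varphi_2(x_i)}\]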

<ul>
  <li><strong>Maximization step</strong>:</li>
</ul>

\[\begin{align*}
\hat{\mu_1} &amp;= \frac{\sum_{i=1}^N (1-\hat{\gamma_i})x_i}{\sum_{i=1}^N (1-\hat{\gamma_i})}\hspace{3cm}\hat{\mu_2} = \frac{\sum_{i=1}^N \hat{\gamma_i} x_i}{\sum_{i=1}^N \hat{\gamma_i}}\\
\hat{\sigma_1} = &amp;\sqrt{\frac{\sum_{i=1}^N (1-\hat{\gamma_i})(x_i-\hat{\mu_1})^2}{\sum_{i=1}^N (1-\hat{\gamma_i})}}\hspace{1cm}
\hat{\sigma_2} = \sqrt{\frac{\sum_{i=1}^N \hat{\gamma_i}(x_i-\hat{\mu_2})^2}{\sum_{i=1}^N \hat{\gamma_i}}}\\
\hat{\pi} &amp;= \sum_{i=1}^N(1-\hat{\gamma_i})/N
\end{align*}\]

<ul>
  <li>Repeat until convergence or maximum number of iterations reached.</li>
</ul>
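<p>For readers who want to sanity-check the updates outside Mathematica, here is a minimal pure-Python sketch of the same loop. All names are ours (hypothetical), and the ground-truth values are our own synthetic choices, not necessarily those used elsewhere in this post:</p>

```python
# Minimal EM sketch for a two-component Gaussian mixture (pure stdlib).
# p is the weight of component 1; g[i] is the responsibility of component 2,
# mirroring the gamma_i convention of the E-step above.
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, p, m1, s1, m2, s2):
    """One expectation step followed by one maximization step."""
    # E-step: responsibility of component 2 for each observation (Bayes' rule).
    g = [(1 - p) * normal_pdf(x, m2, s2) /
         (p * normal_pdf(x, m1, s1) + (1 - p) * normal_pdf(x, m2, s2))
         for x in data]
    w1 = sum(1 - gi for gi in g)   # total weight assigned to component 1
    w2 = sum(g)                    # total weight assigned to component 2
    # M-step: gamma-weighted means and standard deviations.
    m1 = sum((1 - gi) * x for gi, x in zip(g, data)) / w1
    m2 = sum(gi * x for gi, x in zip(g, data)) / w2
    s1 = math.sqrt(sum((1 - gi) * (x - m1) ** 2 for gi, x in zip(g, data)) / w1)
    s2 = math.sqrt(sum(gi * (x - m2) ** 2 for gi, x in zip(g, data)) / w2)
    p = w1 / len(data)
    return p, m1, s1, m2, s2

random.seed(0)
# Synthetic data: pi = 0.3 for N(0, 1), the rest from N(9, 2).
data = [random.gauss(0, 1) if random.random() < 0.3 else random.gauss(9, 2)
        for _ in range(2000)]

# Crude initialization: means at the extremes, a wide common sigma.
p, m1, s1, m2, s2 = 0.5, min(data), 3.0, max(data), 3.0
for _ in range(40):
    p, m1, s1, m2, s2 = em_step(data, p, m1, s1, m2, s2)
```

With this setup the loop recovers values close to the ground truth (\(\pi \approx 0.3\), \(\mu_1 \approx 0\), \(\mu_2 \approx 9\)), just like the Mathematica run below.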

<p>Here is sample code that implements the EM algorithm for our particular problem. The code doesn’t look pretty without Mathematica’s syntax color highlighting and the Notebook’s formatting, but anyway.</p>

<figure class="highlight"><pre><code class="language-mathematica" data-lang="mathematica"><span class="nv">em</span><span class="p">[</span><span class="nv">data</span><span class="o">_,</span><span class="w"> </span><span class="nv">p</span><span class="o">_,</span><span class="w"> </span><span class="nv">m1</span><span class="o">_,</span><span class="w"> </span><span class="nv">s1</span><span class="o">_,</span><span class="w"> </span><span class="nv">m2</span><span class="o">_,</span><span class="w"> </span><span class="nv">s2</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{</span><span class="nv">newp</span><span class="o">,</span><span class="w"> </span><span class="nv">newm1</span><span class="o">,</span><span class="w"> </span><span class="nv">news1</span><span class="o">,</span><span class="w"> </span><span class="nv">newm2</span><span class="o">,</span><span class="w"> </span><span class="nv">news2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}</span><span class="o">,</span><span class="w">
  </span><span class="nv">npts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Length</span><span class="o">@</span><span class="nv">data</span><span class="o">;</span><span class="w">
  </span><span class="nv">g</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Table</span><span class="p">[((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]])</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nv">p</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">p</span><span class="p">)</span><span class="w"> </span><span class="nb">PDF</span><span class="p">[</span><span class="nv">dist</span><span class="p">[</span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]])</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span 
class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newm1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">*</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newm2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">*</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nv">news1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sqrt</span><span class="p">[</span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">*</span><span class="p">(</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">m1</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]]</span><span class="o">;</span><span class="w">
  </span><span class="nv">news2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sqrt</span><span class="p">[</span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">*</span><span class="p">(</span><span class="nv">data</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">m2</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[</span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]]</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]]</span><span class="o">;</span><span class="w">
  </span><span class="nv">newp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Sum</span><span class="p">[(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">g</span><span class="p">[[</span><span class="nv">i</span><span class="p">]])</span><span class="o">/</span><span class="nv">npts</span><span class="o">,</span><span class="w"> </span><span class="p">{</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">npts</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="p">{</span><span class="nv">newp</span><span class="o">,</span><span class="w"> </span><span class="nv">newm1</span><span class="o">,</span><span class="w"> </span><span class="nv">news1</span><span class="o">,</span><span class="w"> </span><span class="nv">newm2</span><span class="o">,</span><span class="w"> </span><span class="nv">news2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">

</span><span class="nv">doEM</span><span class="p">[</span><span class="nv">data</span><span class="o">_</span><span class="p">]</span><span class="w"> </span><span class="o">:=</span><span class="w">
 </span><span class="bp">Module</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="o">,</span><span class="w">
  </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="m">0.5</span><span class="o">,</span><span class="w"> </span><span class="nb">RandomChoice</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">StandardDeviation</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">RandomChoice</span><span class="p">[</span><span class="nv">data</span><span class="p">]</span><span class="o">,</span><span class="w"> </span><span class="nb">StandardDeviation</span><span class="p">[</span><span class="nv">data</span><span class="p">]}</span><span class="o">;</span><span class="w">
  </span><span class="nb">Print</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}]</span><span class="o">;</span><span class="w">
  </span><span class="nb">For</span><span class="w"> </span><span class="p">[</span><span class="nv">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">,</span><span class="w"> </span><span class="nv">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">40</span><span class="o">,</span><span class="w"> </span><span class="nv">i</span><span class="o">++,</span><span class="w">
   </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">em</span><span class="p">[</span><span class="nv">data</span><span class="o">,</span><span class="w"> </span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">]</span><span class="o">;</span><span class="w">
   </span><span class="nb">If</span><span class="p">[</span><span class="nb">Mod</span><span class="p">[</span><span class="nv">i</span><span class="o">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="o">,</span><span class="w"> </span><span class="nb">Print</span><span class="p">[{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="p">}]]</span><span class="w">
   </span><span class="p">]</span><span class="o">;</span><span class="w">
  </span><span class="p">{</span><span class="nv">p</span><span class="o">,</span><span class="w"> </span><span class="nv">m1</span><span class="o">,</span><span class="w"> </span><span class="nv">s1</span><span class="o">,</span><span class="w"> </span><span class="nv">m2</span><span class="o">,</span><span class="w"> </span><span class="nv">s2</span><span class="o">,</span><span class="w"> </span><span class="nv">g</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span></code></pre></figure>

<p>This is a short test run, where we confirm that the algorithm converges to the ground-truth values (shown as red lines). As we mentioned in the introduction, EM is a local algorithm, meaning it can get stuck at a local maximum. Therefore, we sometimes need to restart it from different random initializations to ensure a near-globally optimal solution.</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/em_algorithm/parameters_convergence.png" alt="Expectation Maximization algorithm for Gaussian mixture models" />
</p>

<p>In the following plot, we see how the \(\gamma_i\)’s vary as the observed data transition from the 1st to the 2nd distribution. E.g., when we look at observed data around x=1 (or less), the \(\gamma_i\)’s are equal to zero. This means that the EM algorithm doesn’t cast any doubt on the source of these values. They were sampled from the 1st distribution. When we look at observed data around x=9 (or more), EM is confident that these values originate from the second distribution (\(\gamma_i=1\)). However, when we are in between, \(\gamma_i\)’s assume intermediate values around 0.5, conveying the uncertainty regarding which distribution each \(x_i\) belongs to. So, by applying the EM algorithm, <strong>we discovered the membership of each observed value (with some uncertainty), <em>and</em> we estimated the model’s unknown parameters!</strong> Neat?</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/em_algorithm/x_vs_gamma.png" alt="Expectation Maximization algorithm for Gaussian mixture models" />
</p>

<h1 id="references">References</h1>
<ol>
  <li>The Elements of Statistical Learning, Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.</li>
</ol>]]></content><author><name>Stathis Kamperis</name></author><category term="mathematics" /><category term="machine learning" /><category term="Mathematica" /><category term="mathematics" /><category term="optimization" /><summary type="html"><![CDATA[An introduction to the expectation-maximization algorithm focusing on the concept of maximum likelihood estimation]]></summary></entry><entry><title type="html">Acquisition functions in Bayesian Optimization</title><link href="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html" rel="alternate" type="text/html" title="Acquisition functions in Bayesian Optimization" /><published>2021-06-11T00:00:00+00:00</published><updated>2021-06-11T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#a-schematic-bayesian-optimization-algorithm" id="markdown-toc-a-schematic-bayesian-optimization-algorithm">A schematic Bayesian Optimization algorithm</a></li>
  <li><a href="#acquisition-functions" id="markdown-toc-acquisition-functions">Acquisition Functions</a>    <ul>
      <li><a href="#upper-confidence-bound-ucb" id="markdown-toc-upper-confidence-bound-ucb">Upper Confidence Bound (UCB)</a></li>
      <li><a href="#probability-of-improvement-pi" id="markdown-toc-probability-of-improvement-pi">Probability of Improvement (PI)</a></li>
      <li><a href="#expected-improvement-ei" id="markdown-toc-expected-improvement-ei">Expected Improvement (EI)</a></li>
    </ul>
  </li>
</ul>

<h1 id="introduction">Introduction</h1>
<p>In a <a href="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html">previous blog post</a>, we talked about Bayesian Optimization (BO) as a generic method for optimizing a black-box function, \(f(x)\), that is, a function whose formula we don’t know. The only thing we can do in this setup is to evaluate \(f\) at some \(x\) and observe the output.</p>

<p align="center">
 <img style="width: 40%; height: 40%" src="https://ekamperi.github.io/images/acquisition_functions/blackbox.png" alt="Blackbox function" />
</p>

<h1 id="a-schematic-bayesian-optimization-algorithm">A schematic Bayesian Optimization algorithm</h1>
<p>The essential ingredients of a BO algorithm are the <strong>surrogate model</strong> (SM) and the <strong>acquisition function</strong> (AF). The surrogate model is often a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">Gaussian Process</a> that can fit the observed data points and quantify the uncertainty of unobserved areas. So, SM is our effort to approximate the unknown black-box function \(f(x)\).</p>

<p>Next, the acquisition function “looks” at the SM and determines what areas in the domain of \(f(x)\) are worth exploiting and what areas are worth exploring. Accordingly, in areas where \(f(x)\) is optimal or areas that we haven’t yet looked at, AF assumes a high value. On the contrary, in areas where \(f(x)\) is suboptimal or areas that we have already sampled from, AF’s value is small. By finding the \(x\) that maximizes the acquisition function, we identify the next best guess for \(f\) to try. That’s right: instead of directly maximizing \(f(x)\), whose analytic form we don’t even know, we maximize another function, the AF, which is much easier and far less expensive to optimize. So, the steps that a BO algorithm follows are the following.</p>

<p align="center">
 <img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/acquisition_functions/bo_flow.png" alt="Blackbox function" />
</p>

<p>In the following video, we demonstrate the <strong>exploitation</strong> (trying slightly different things that have already been proven to be good solutions) vs. <strong>exploration</strong> (trying totally different things from areas that have not yet been probed) tradeoff. Although here \(f(x)\) is known, in the general case, it is not.</p>

<p align="center">
<video id="movie" width="80%" height="80%" preload="" controls="">
   <source id="srcMp4" src="https://ekamperi.github.io/images/acquisition_functions/ucb_acq.mp4#t=0.2" />
</video>
</p>

<h1 id="acquisition-functions">Acquisition Functions</h1>
<h2 id="upper-confidence-bound-ucb">Upper Confidence Bound (UCB)</h2>
<p>Probably as simple as an acquisition function can get, upper confidence bound contains explicit exploitation (\(\mu(x)\)) and exploration (\(\sigma(x)\)) terms:</p>

\[a(x;\lambda) = \mu(x) + \lambda \sigma (x)\]

<p>With UCB, the exploitation vs. exploration tradeoff is straightforward and easy to tune via the parameter \(\lambda\). Concretely, UCB is a weighted sum of the expected performance captured by \(\mu(x)\) of the Gaussian Process, and of the uncertainty \(\sigma(x)\), captured by the standard deviation of the GP. When \(\lambda\) is small, BO will favor solutions that are expected to be high-performing, i.e., have high \(\mu(x)\). On the contrary, when \(\lambda\) is large, BO rewards the exploration of currently uncharted areas in the search space.</p>
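<p>As a minimal sketch of the UCB rule (the 1-D grid, the Gaussian-shaped mean, and the uncertainty profile below are made-up toy data standing in for a real GP posterior; all names are illustrative):</p>

```python
import numpy as np

def ucb(mu, sigma, lam):
    """Upper Confidence Bound: a(x; lambda) = mu(x) + lambda * sigma(x)."""
    return mu + lam * sigma

# Toy posterior over a 1-D grid: high mean near x = 2, high uncertainty near x = 8
x = np.linspace(0.0, 10.0, 101)
mu = np.exp(-(x - 2.0) ** 2)                  # expected performance, mu(x)
sigma = 0.1 + 0.9 * np.exp(-(x - 8.0) ** 2)   # model uncertainty, sigma(x)

# Small lambda exploits the known optimum; large lambda explores uncharted areas
x_exploit = x[np.argmax(ucb(mu, sigma, lam=0.1))]
x_explore = x[np.argmax(ucb(mu, sigma, lam=10.0))]
print(x_exploit, x_explore)  # near 2.0 and 8.0, respectively
```

<p>Note how the same posterior yields two different “next points” depending solely on \(\lambda\).</p>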

<p>Here is an example with a large value for \(\lambda\). UCB favors areas from which we don’t have any samples.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_large_lambda.png" alt="UCB function" />
</p>

<p>This is an example with a value for \(\lambda\) around 1 (I set \(\lambda=1.2\) so that the AF and upper confidence interval curves don’t coincide). UCB balances between known good values and unexplored areas.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_medium_lambda.png" alt="UCB function" />
</p>

<p>Finally, here is an example with a small value for \(\lambda\). UCB is very conservative in this case and will cause aggressive sampling around the current best solution.</p>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/ucb_small_lambda.png" alt="UCB function" />
</p>

<h2 id="probability-of-improvement-pi">Probability of Improvement (PI)</h2>

<p>Suppose that we’d like to maximize \(f(x)\), and the best solution we have so far is \(x^\star\). Then, we can define “improvement”, \(I(x)\), as:</p>

\[I(x) = \max(f(x) - f(x^\star), 0)\]

<p>Therefore, if the new \(x\) we are looking at has an associated value \(f(x)\) that is less than \(f(x^\star)\), then \(f(x) - f(x^\star)\) is negative. So we aren’t improving at all, and the above formula returns 0, since the maximum of any negative number and 0 is 0. On the contrary, if the new value \(f(x)\) is larger than our current best estimate, then \(f(x) - f(x^\star)\) is positive. In this case, \(I(x)\) returns the difference, which is how much we would improve over our current best solution if we evaluated \(f\) at the new point \(x\).</p>
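<p>As a tiny sketch (function and variable names are illustrative), the improvement function behaves as follows:</p>

```python
def improvement(f_x, f_best):
    """I(x) = max(f(x) - f(x*), 0): positive only when x beats the incumbent."""
    return max(f_x - f_best, 0.0)

print(improvement(3.0, 5.0))  # 0.0: worse than the current best, no improvement
print(improvement(7.5, 5.0))  # 2.5: beats the current best by 2.5
```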

<p>In the probability of improvement acquisition function, we assign to each candidate \(x\) the probability that \(I(x)&gt;0\), i.e., that \(f(x)\) is larger than our current best \(f(x^\star)\). Let us recall that in a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">Gaussian Process</a>, there is a Gaussian distribution attached to each point. Therefore, at point \(x\) the value of the function \(f(x)\) is sampled from a normal distribution with mean \(\mu(x)\) and variance \(\sigma^2(x)\):</p>

\[f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))\]

<p>Now, let us use a reparameterization trick. If \(z \sim \mathcal{N}(0, 1)\), then \(f(x) = \mu(x) + \sigma(x) z\) is a normal distribution with mean \(\mu(x)\) and variance \(\sigma^2(x)\). Therefore, we can rewrite the improvement function, \(I(x)\), as:</p>

\[I(x) = \max(f(x) - f(x^\star), 0) = \max(\mu(x) + \sigma(x) z - f(x^\star), 0), \quad z \sim \mathcal{N}(0,1)\]

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/acquisition_functions/probability_of_improvement.png" alt="Probability of Improvement function" />
</p>

<p>Let us pause here and make sure that we really understand what’s going on. Here \(x\) is some point that we want to check whether it is worth evaluating \(f\) at. So, we assign a value \(I(x)\) to it. However, \(I(x)\)’s value depends on \(f(x)\), which is <strong>sampled</strong> from a normal distribution \(\mathcal{N}(\mu(x), \sigma^2(x))\). So, here’s how we calculate:</p>

\[\text{PI}(x) = \text{Pr}(I(x) &gt; 0) = \text{Pr}(f(x) &gt; f(x^\star))\]

<p>If you look at the image above, it’s clear that the probability of improvement is the shaded area under the Gaussian curve for \(z&gt;z_0\). Therefore:</p>

\[\text{PI}(x) = 1 - \Phi(z_0) = \Phi(-z_0) = \Phi\left(\frac{\mu(x)-f(x^\star)}{\sigma(x)}\right)\]

<p>Where \(\Phi(z) \equiv \text{CDF}(z)\) and \(z_0 = \frac{f(x^\star) - \mu(x)}{\sigma(x)}\).</p>
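<p>The closed-form expression for \(\text{PI}(x)\) is a one-liner. Here is a sketch using only the standard library, with \(\Phi\) implemented via the error function (all names are illustrative):</p>

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    """PI(x) = Phi((mu(x) - f(x*)) / sigma(x)), valid for sigma(x) > 0."""
    return std_normal_cdf((mu - f_best) / sigma)

# If the GP's mean at x equals the incumbent best, improving is a coin flip
print(probability_of_improvement(1.0, 0.5, 1.0))  # 0.5
# A mean two standard deviations above f(x*) makes improvement very likely
print(probability_of_improvement(2.0, 0.5, 1.0))  # Phi(2) ~ 0.977
```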

<h2 id="expected-improvement-ei">Expected Improvement (EI)</h2>
<p>PI considers only the probability of improving our current best estimate, but it does not factor in the magnitude of the improvement. This is where the expected improvement acquisition function is different. Instead of looking at the improvement \(I(x)\), which is a random variable, we will instead calculate the “Expected Improvement”, which is the expected value of \(I(x)\):</p>

\[\text{EI}(x)\equiv\mathbb{E}\left[I(x)\right] = \int_{-\infty}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}\]

<p>Where \(\varphi(z)\) is the probability density function of the normal distribution \(\mathcal{N}(0,1)\), i.e., \(\varphi(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-z^2/2\right)\). In case you aren’t familiar with the <a href="https://www.wikiwand.com/en/Expected_value">expected value</a> of a random variable, it’s kind of a weighted average of “value” times “probability of getting that value”.</p>

<p>Ok, so:</p>

\[\text{EI}(x) = \int_{-\infty}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}=\int_{-\infty}^{\infty}\underbrace{\max(f(x) - f(x^\star), 0)}_{I(x)}\varphi(z)\mathop{\mathrm{d}z}\]

<p>How do we calculate this integral? We need to get rid of the \(max\) operator. In order to do that, we are going to break up the integral into two components, one where \(f(x) - f(x^\star)\) is positive and one where it is negative. The point where the switch happens is given by:</p>

\[f(x) = f(x^\star) \Rightarrow \mu + \sigma z = f(x^\star) \Rightarrow z = \frac{f(x^\star) - \mu}{\sigma}\]

<p>Let’s call this point \(z_0 = \frac{f(x^\star) - \mu}{\sigma}\), and break up the integral as:</p>

\[\text{EI}(x) = \underbrace{\int_{-\infty}^{z_0} I(x)\varphi(z) \mathop{\mathrm{d}z}}_{\text{Zero since }I(x)=0} + \int_{z_0}^{\infty} I(x)\varphi(z) \mathop{\mathrm{d}z}\]

<p>Ok, so we are good to go now:</p>

\[\begin{aligned}
\text{EI}(x)
&amp;=\int_{z_0}^{\infty} \max(f(x)-f(x^\star),0) \varphi(z)\mathop{\mathrm{d}z} =
\int_{z_0}^{\infty} \left(\mu+\sigma z - f(x^\star)\right)\varphi(z) \mathop{\mathrm{d}z}\\
&amp;= \int_{z_0}^{\infty} \left(\mu - f(x^\star) \right)\varphi(z)\mathop{\mathrm{d}z} +
\int_{z_0}^{\infty} \sigma z \frac{1}{\sqrt{2\pi}}e^{-z^2/2}\mathop{\mathrm{d}z} \\\\
&amp;=\left(\mu- f(x^\star)\right) \underbrace{\int_{z_0}^{\infty}\varphi(z)\mathop{\mathrm{d}z}}_{1-\Phi(z_0)\equiv 1-\text{CDF}(z_0)} + \frac{\sigma}{\sqrt{2\pi}}\int_{z_0}^{\infty}  z e^{-z^2/2}\mathop{\mathrm{d}z}\\
&amp;=\left(\mu- f(x^\star)\right) (1-\Phi(z_0)) - \frac{\sigma}{\sqrt{2\pi}}\int_{z_0}^{\infty}  \left(e^{-z^2/2}\right)' \mathop{\mathrm{d}z}\\
&amp;=\left(\mu- f(x^\star)\right) (1-\Phi(z_0)) - \frac{\sigma}{\sqrt{2\pi}} \left[e^{-z^2/2}\right]_{z_0}^{\infty}\\
&amp;=\left(\mu- f(x^\star)\right) \underbrace{(1-\Phi(z_0))}_{\Phi(-z_0)} + \sigma \varphi(z_0) \\
&amp;=\left(\mu- f(x^\star)\right) \Phi\left(\frac{\mu-f(x^\star)}{\sigma}\right) + \sigma \varphi\left(\frac{\mu - f(x^\star)}{\sigma}\right)
\end{aligned}\]

<p>At the last step, we used the fact that the PDF of the normal distribution is symmetric, therefore \(\varphi(z_0) = \varphi(-z_0)\). Alright, so this equation might seem intimidating, but it’s really not. So, when does \(\text{EI}(x)\) take high values? When \(\mu &gt; f(x^\star)\), i.e., when the mean value of the Gaussian Process is high at \(x\). Expected improvement also increases when there’s lots of uncertainty, i.e., when \(\sigma(x)\) is large. By the way, the formula above works for \(\sigma(x)&gt;0\); otherwise, if \(\sigma(x) = 0\) (as happens at the observed data points), it holds that \(\text{EI}(x)=0\).</p>
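<p>The closed-form expression for \(\text{EI}(x)\), including the \(\sigma(x)=0\) edge case, can be sketched with the standard library alone (all names are illustrative):</p>

```python
import math

def std_normal_pdf(z):
    """Standard normal PDF, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f*) Phi((mu - f*)/sigma) + sigma phi((mu - f*)/sigma)."""
    if sigma == 0.0:  # at observed points there is no uncertainty left
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * std_normal_cdf(z) + sigma * std_normal_pdf(z)

# Same mean, more uncertainty: larger expected improvement
print(expected_improvement(1.2, 0.1, 1.0))  # ~0.201
print(expected_improvement(1.2, 1.0, 1.0))  # ~0.507
```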

<p>There’s one last thing before we conclude. By injecting a (hyper)parameter \(\xi\) into the formula for \(\text{EI}(x)\), we can fine-tune how much exploitation vs. exploration the BO algorithm will do. So, the full formula is:</p>

\[\text{EI}(x;\xi) = \left(\mu- f(x^\star) - \xi\right) \Phi\left(\frac{\mu-f(x^\star)-\xi}{\sigma}\right) + \sigma \varphi\left(\frac{\mu - f(x^\star)-\xi}{\sigma}\right)\]

<p>For \(\xi=0\), we just end up with the previous formula. However, for large values of \(\xi\), you can think of it as if we pretend to have a larger current best value than we actually do! Therefore, this steers the BO algorithm towards more exploration.</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="algorithms" /><category term="Bayes theorem" /><category term="optimization" /><category term="programming" /><summary type="html"><![CDATA[An introduction to acquisition function in the context of Bayesian Optimization]]></summary></entry><entry><title type="html">Bayesian optimization for hyperparameter tuning</title><link href="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html" rel="alternate" type="text/html" title="Bayesian optimization for hyperparameter tuning" /><published>2021-05-08T00:00:00+00:00</published><updated>2021-05-08T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html"><![CDATA[<h3 class="no_toc" id="contents">Contents</h3>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#the-ingredients-of-bayesian-optimization" id="markdown-toc-the-ingredients-of-bayesian-optimization">The ingredients of Bayesian Optimization</a>    <ul>
      <li><a href="#surrogate-model" id="markdown-toc-surrogate-model">Surrogate model</a></li>
      <li><a href="#acquisition-function" id="markdown-toc-acquisition-function">Acquisition function</a></li>
    </ul>
  </li>
  <li><a href="#hyperparameter-tuning-of-an-svm" id="markdown-toc-hyperparameter-tuning-of-an-svm">Hyperparameter tuning of an SVM</a>    <ul>
      <li><a href="#create-a-dataset" id="markdown-toc-create-a-dataset">Create a dataset</a></li>
      <li><a href="#objective-function-definition" id="markdown-toc-objective-function-definition">Objective function definition</a></li>
      <li><a href="#optimization" id="markdown-toc-optimization">Optimization</a></li>
      <li><a href="#brute-force-evaluation-of-objective-function" id="markdown-toc-brute-force-evaluation-of-objective-function">Brute-force evaluation of objective function</a></li>
      <li><a href="#references" id="markdown-toc-references">References</a></li>
    </ul>
  </li>
</ul>

<h3 id="introduction">Introduction</h3>
<p>Plot: We died and ended up in <a href="https://en.wikipedia.org/wiki/Inferno_(Dante)">Dante’s inferno</a> – the optimization version. So, what does it mean to be in an optimization hell?</p>

<p>We are asked to optimize a function <strong>we don’t have an analytic expression</strong> for. It follows that <strong>we don’t have access to the first or second derivatives</strong>, hence using <a href="https://ekamperi.github.io/machine%20learning/2019/07/28/gradient-descent.html">gradient descent</a> or <a href="https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization">Newton’s method</a> is a no-go. Also, <strong>we don’t have any convexity guarantees</strong> about \(f(x)\). Therefore, methods from the convex optimization field are also not available to us. The only thing we can do is to evaluate \(f(x)\) at some \(x\)’s. However, as if the situation was not bad enough, <strong>the function we want to optimize is very costly</strong>. So, we can’t just go ahead and massively evaluate \(f(x)\) in, say, 100 billion random points and keep the one \(x\) that optimizes \(f(x)\)’s value.</p>

<p align="center">
<img style="width: 35%; height: 35%" src="https://ekamperi.github.io/images/bayesian_optimization/dante_inferno.png" alt="Dante inferno" />
</p>

<p>To summarize, we want to optimize an expensive, black-box, derivative-free, possibly non-convex function. And for this kind of problem, <strong>Bayesian Optimization (BO)</strong> is a universal and robust method.</p>

<p>Mind that <strong>the evaluation of the objective function is not necessarily computational</strong>! Let me give you a couple of examples, where \(f(x)\) is not something you can calculate with a computer:</p>

<ol>
  <li>You are a researcher investigating mixtures of chemotherapeutic drugs for their ability to kill cancer cells. You have narrowed it down to three candidate molecules, and you need to find the best combination of concentrations \(c_1, c_2, c_3\) of the three drugs. Evaluating the objective function \(f(c_1,c_2,c_3)\) in this context entails conducting actual experiments in the lab requiring personnel, consumables, and waiting for hours or days for the cell cultures to grow. Therefore, considering all possible concentration combinations is not a realistic approach. Instead, you need to begin with a few random drug concentrations, test them, and then use the experimental outcomes to predict the most promising drug combination to use next. Makes sense?</li>
<li>You work as a consultant for an oil company, and you want to maximize a probability density function \(f({\tiny\text{LAT}, \tiny\text{LONG}})\) of finding oil if you drill at \(({\tiny\text{LAT}, \tiny\text{LONG}})\) coordinates. Here, evaluating the function at a point requires conducting actual drilling. And this costs lots of money; therefore, you need to make good educated guesses, and you need to do so with only a few trials.</li>
</ol>

<p><strong>In other cases, however, \(f(x)\) is indeed computational</strong>. For instance, we may define it as the k-fold cross-validation error of a machine-learning model whose hyperparameters we want to tune. As a matter of fact, we will do precisely this later on.</p>

<h3 id="the-ingredients-of-bayesian-optimization">The ingredients of Bayesian Optimization</h3>
<h4 id="surrogate-model">Surrogate model</h4>
<p>Since we lack an expression for the objective function, the first step is to <strong>use a surrogate model to approximate \(f(x)\)</strong>. It is typical in this context to use Gaussian Processes (GPs), as we have already discussed in a <a href="https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html">previous blog post</a>. It’s vital that you grasp the concept of GPs, and then BO will require almost no mental effort to sink in. There are other choices for surrogate models, but let’s stick to GPs for now. Once we have built a proxy model for \(f(x)\), we want to decide which point \(x\) to sample next. This is the responsibility of the acquisition function (AF), which kind of “peeks” at the GP and generates the best guess \(x\). So, in BO, there are two main components: the <em>surrogate model</em>, which most often is a Gaussian Process modeling \(f(x)\), and the <em>acquisition function</em> that yields the next \(x\) to evaluate. Having said that, a BO algorithm would look like this in pseudocode:</p>

<ol>
  <li>Evaluate \(f(x)\) at \(n\) initial points</li>
  <li>While \(n \le N\) repeat:
    <ul>
      <li>Update the surrogate model (e.g., the GP posterior) using all available data \(\mathcal{D}_{1:n}\)</li>
<li>Compute the acquisition function, \(u(x\mid\mathcal{D}_{1:n})\), using the current surrogate model</li>
      <li>Let \(x_{n+1}\) be the maximizer of the acquisition function, i.e. \(x_{n+1} = \text{argmax}_x u(x\mid\mathcal{D}_{1:n})\)</li>
      <li>Evaluate \(y_{n+1} = f(x_{n+1})\)</li>
      <li>Augment the data \(\mathcal{D}_{1:n+1} = \{\mathcal{D}_{1:n}, (x_{n+1}, y_{n+1})\}\) and increment \(n\)</li>
    </ul>
  </li>
  <li>Return either the \(x\) evaluated with the largest \(f(x)\), or the point with the largest posterior mean.</li>
</ol>
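<p>The pseudocode above can be sketched as a loop. Mind that the “acquisition function” below is a toy, purely explorative stand-in (distance to the nearest evaluated point), not a real GP-based EI/PI/UCB, and all names are illustrative:</p>

```python
import random

def bayesian_optimization(f, candidates, n_init=3, n_total=10, seed=0):
    """Skeleton of the BO loop: fit surrogate, maximize acquisition, evaluate f."""
    rng = random.Random(seed)
    # Step 1: evaluate f at a few initial points
    data = [(x, f(x)) for x in rng.sample(candidates, n_init)]

    def acquisition(x, data):
        # Toy stand-in for a real acquisition function: pure exploration,
        # scoring a candidate by its distance to the nearest evaluated point.
        # A real implementation would use a GP posterior with EI, PI, or UCB.
        return min(abs(x - xi) for xi, _ in data)

    # Step 2: repeatedly pick the acquisition maximizer and evaluate f there
    while len(data) < n_total:
        sampled = {xi for xi, _ in data}
        remaining = [x for x in candidates if x not in sampled]
        x_next = max(remaining, key=lambda x: acquisition(x, data))
        data.append((x_next, f(x_next)))

    # Step 3: return the best observed (x, f(x))
    return max(data, key=lambda p: p[1])

best_x, best_y = bayesian_optimization(lambda x: -(x - 3) ** 2,
                                       candidates=list(range(10)))
print(best_x, best_y)  # recovers x = 3, f = 0 on this toy problem
```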

<h4 id="acquisition-function">Acquisition function</h4>
<p>As we have already noted, the purpose of the acquisition function is to guide the choice of the next best point at which to sample \(f(x)\). Acquisition functions are constructed so that a high value corresponds to potentially high values of the objective function, either because the prediction is high or because the uncertainty is high. This is why they favor regions that already correspond to optimal values or areas that haven’t been explored yet, a balance known as the <strong>exploration-exploitation trade-off</strong>.</p>

<p>If you have played strategy games, like <a href="https://en.wikipedia.org/wiki/Age_of_Empires">Age of Empires</a> or <a href="https://en.wikipedia.org/wiki/Command_%26_Conquer">Command &amp; Conquer</a>, you are already familiar with the concept. Initially, we are placed at some part of the map, and only the immediate area is visible to us. We may choose to sit there and mine any resources we already have access to or send a scouter to explore the invisible part of the map. By exploring the map, we risk meeting the enemy and getting killed, but also, we may find some high-value resources.</p>

<p align="center">
<img style="width: 90%; height: 90%" src="https://ekamperi.github.io/images/bayesian_optimization/age_of_empires.png" alt="Exploitation vs exploration tradeoff" />
</p>

<p>To find the next point to evaluate, we optimize the acquisition function. This an optimization problem itself, but luckily it does not require the evaluation of the objective function. In some cases, we may even derive an exact equation for the AF and find a solution with, say, gradient-based optimization. There are three often cited acquisition functions: <strong>expected improvement</strong> (EI), <strong>maximum probability of improvement</strong> (MPI), and <strong>upper confidence bound</strong> (UCB). Although often mentioned last, I think it’s best to talk about UCB because it contains explicit exploitation and exploration terms:</p>

\[a_{\text{UCB}}(x;\lambda) = \mu(x) + \lambda \sigma(x)\]

<p>With UCB, the exploitation <em>vs.</em> exploration trade-off is explicit and easy to tune via the parameter \(\lambda\). Concretely, we construct a weighted sum of the expected performance captured by \(\mu(x)\) of the Gaussian Process, and of the uncertainty \(\sigma(x)\), captured by the standard deviation of the GP. Assuming a small \(\lambda\), BO will favor solutions that are expected to be high-performing, i.e., have high \(\mu(x)\). Conversely, high values of \(\lambda\) will make BO favor the exploration of currently uncharted areas in the search space.</p>

<p>Here is an example of a Gaussian Process along with a corresponding acquisition function. This is a 1-dimensional optimization problem, but the idea is the same for more variables. The <strong>black dots</strong> are our measurements, i.e., the \(x\)’s where we have already sampled \(f(x)\). The <strong>black dotted line</strong> is the objective function, and the <strong>black solid line</strong> is our surrogate model of it, i.e., our posterior Gaussian Process. The <strong>blue shaded area</strong> represents the uncertainty of our surrogate model, \(\sigma(x)\), corresponding to regions in the domain of the objective function for which we don’t have any observations. The <strong>green line</strong> is the acquisition function, which informs us what point \(x\) to sample next. Notice that it takes high values in regions where our GP’s \(\mu(x)\) is high and \(\sigma(x)\) is high.</p>

<p align="center">
<img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/bayesian_optimization/gaussian_process_acquision_function.png" alt="Exploitation vs exploration tradeoff" />
</p>
<p>Image taken <a href="https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083">from here</a>.</p>

<p>This was a lightweight introduction to how a Bayesian Optimization algorithm works under the hood. Next, we will use a third-party library to tune an SVM’s hyperparameters and compare the results with some ground-truth data acquired via brute force. In the future, we will talk more about BO, perhaps by implementing our own algorithm with GPs, acquisition functions, and all.</p>

<h3 id="hyperparameter-tuning-of-an-svm">Hyperparameter tuning of an SVM</h3>
<p>Let’s import some of the stuff we will be using:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_classification</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>

<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib.tri</span> <span class="k">as</span> <span class="n">tri</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">hyperopt</span> <span class="kn">import</span> <span class="n">fmin</span><span class="p">,</span> <span class="n">tpe</span><span class="p">,</span> <span class="n">Trials</span><span class="p">,</span> <span class="n">hp</span><span class="p">,</span> <span class="n">STATUS_OK</span></code></pre></figure>

<h4 id="create-a-dataset">Create a dataset</h4>
<p>Then, we construct an artificial training dataset with many classes, where some of the features are informative, and some are not:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Create a random n-class classification problem.
</span>
<span class="c1"># n_features is the total number of features
# n_informative is the number of informative features 
# n_redundant features are generated as random linear combinations of the informative features
</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">make_classification</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">2500</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">n_informative</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">n_redundant</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>

<h4 id="objective-function-definition">Objective function definition</h4>

<p>In this example, we will be using the <code class="language-plaintext highlighter-rouge">hyperopt</code> package to perform the hyperparameter tuning. First, we define our objective/cost/loss function. This is the \(f(\mathbf{x})\) that we talked about in the introduction, and \(\mathbf{x} = [C, \gamma]\) is the parameter space. Therefore, we want to find the best combination of \(C, \gamma\) values that minimizes \(f(\mathbf{x})\). The machine learning model that we will be using is a <a href="https://en.wikipedia.org/wiki/Support-vector_machine">Support Vector Machine (SVM)</a>, and the loss will be derived from the average 3-fold cross-validation score.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">objective</span><span class="p">(</span><span class="n">args</span><span class="p">):</span>
    <span class="s">'''Define the loss function / objective of our model.

    We will be using an SVM parameterized by the regularization parameter C
    and the parameter gamma.
    
    The C parameter trades off correct classification of training examples
    against maximization of the decision function's margin. For larger values
    of C, a smaller margin will be accepted.

    The gamma parameter defines how far the influence of a single training
    example reaches, with larger values meaning 'close'. 
    '''</span>
    <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">args</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">12345</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">'params'</span><span class="p">:</span> <span class="p">{</span><span class="s">'C'</span><span class="p">:</span> <span class="n">C</span><span class="p">,</span> <span class="s">'gamma'</span><span class="p">:</span> <span class="n">gamma</span><span class="p">},</span> <span class="s">'loss'</span><span class="p">:</span> <span class="n">loss</span><span class="p">,</span> <span class="s">'status'</span><span class="p">:</span> <span class="n">STATUS_OK</span> <span class="p">}</span></code></pre></figure>

<h4 id="optimization">Optimization</h4>
<p>Now, we will use the <code class="language-plaintext highlighter-rouge">fmin()</code> function from the <code class="language-plaintext highlighter-rouge">hyperopt</code> package. In this step, we need to specify the search space for our parameters, the database in which we will be storing the evaluation points of the search, and finally, the search algorithm to use. The careful reader might notice that we are doing 1000 evaluations, although we said that evaluating \(f(x)\) is expensive. That’s correct; the only reason we do so is that we want to exaggerate the effect of exploitation <em>vs.</em> exploration, as you shall see in the plots.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">trials</span> <span class="o">=</span> <span class="n">Trials</span><span class="p">()</span>
<span class="n">best</span> <span class="o">=</span> <span class="n">fmin</span><span class="p">(</span><span class="n">objective</span><span class="p">,</span>
    <span class="n">space</span><span class="o">=</span><span class="p">[</span><span class="n">hp</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="s">'C'</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">),</span> <span class="n">hp</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="s">'gamma'</span><span class="p">,</span> <span class="o">-</span><span class="mf">4.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">)],</span>
    <span class="n">algo</span><span class="o">=</span><span class="n">tpe</span><span class="p">.</span><span class="n">suggest</span><span class="p">,</span>
    <span class="n">max_evals</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
    <span class="n">trials</span><span class="o">=</span><span class="n">trials</span><span class="p">)</span></code></pre></figure>

<p>Let’s print the results:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">best</span><span class="p">)</span>
<span class="mi">100</span><span class="o">%|</span><span class="err">██████████</span><span class="o">|</span> <span class="mi">1000</span><span class="o">/</span><span class="mi">1000</span> <span class="p">[</span><span class="mi">13</span><span class="p">:</span><span class="mi">01</span><span class="o">&lt;</span><span class="mi">00</span><span class="p">:</span><span class="mi">00</span><span class="p">,</span>  <span class="mf">1.28</span><span class="n">trial</span><span class="o">/</span><span class="n">s</span><span class="p">,</span> <span class="n">best</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.046323449153816476</span><span class="p">]</span>
<span class="p">{</span><span class="s">'C'</span><span class="p">:</span> <span class="mf">0.7280999882033379</span><span class="p">,</span> <span class="s">'gamma'</span><span class="p">:</span> <span class="o">-</span><span class="mf">1.6752085795502363</span><span class="p">}</span></code></pre></figure>
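<p>Keep in mind that, because the objective raises 10 to the sampled values, the dictionary returned by <code class="language-plaintext highlighter-rouge">fmin()</code> holds the <em>exponents</em> of \(C\) and \(\gamma\), not the hyperparameters themselves. A quick sketch of the conversion, reusing the numbers printed above:</p>

```python
# fmin() returned the exponents (numbers taken from the run above)
best = {'C': 0.7280999882033379, 'gamma': -1.6752085795502363}

# The objective trains SVC(C=10**C, gamma=10**gamma), so the actual
# hyperparameters are recovered by exponentiating:
C_actual = 10 ** best['C']
gamma_actual = 10 ** best['gamma']

print(f"C = {C_actual:.4f}, gamma = {gamma_actual:.6f}")
```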

<p>Let us now extract the value of our objective function for every \(C, \gamma\) pair:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Extract the loss for every combination of C, gamma
</span><span class="n">results</span> <span class="o">=</span> <span class="n">trials</span><span class="p">.</span><span class="n">results</span>
<span class="n">ar</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
    <span class="n">C</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'params'</span><span class="p">][</span><span class="s">'C'</span><span class="p">]</span>
    <span class="n">gamma</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'params'</span><span class="p">][</span><span class="s">'gamma'</span><span class="p">]</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="s">'loss'</span><span class="p">]</span>
    <span class="n">ar</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span></code></pre></figure>

<p>And then use it to plot the loss surface:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">ar</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">tricontour</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">linewidths</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">colors</span><span class="o">=</span><span class="s">'k'</span><span class="p">)</span>
<span class="n">cntr</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">tricontourf</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">levels</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"RdBu_r"</span><span class="p">)</span>

<span class="n">fig</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">cntr</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">,</span> <span class="s">'ko'</span><span class="p">,</span> <span class="n">ms</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">r'Loss as a function of $10^C$, $10^\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'C'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'gamma'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p align="center">
<img style="width: 65%; height: 65%" src="https://ekamperi.github.io/images/bayesian_optimization/bayesian_optimization.png" alt="Bayesian optimization" />
</p>

<h4 id="brute-force-evaluation-of-objective-function">Brute-force evaluation of objective function</h4>
<p>Since the parameter space is just 2-dimensional, the dataset relatively small, and the SVM training fast, we can brute-force compute the value of the objective function for all possible values of \(C\) and \(\gamma\). These will be our ground-truth data against which we will compare the results from the BO run.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sample_loss</span><span class="p">(</span><span class="n">args</span><span class="p">):</span>
    <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">args</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">10</span> <span class="o">**</span> <span class="n">gamma</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">12345</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span>

<span class="n">lambdas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">25</span><span class="p">)</span>
<span class="n">gammas</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="n">C</span><span class="p">,</span> <span class="n">gamma</span><span class="p">]</span> <span class="k">for</span> <span class="n">gamma</span> <span class="ow">in</span> <span class="n">gammas</span> <span class="k">for</span> <span class="n">C</span> <span class="ow">in</span> <span class="n">lambdas</span><span class="p">])</span>

<span class="n">real_loss</span> <span class="o">=</span> <span class="p">[</span><span class="n">sample_loss</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="k">for</span> <span class="n">params</span> <span class="ow">in</span> <span class="n">param_grid</span><span class="p">]</span></code></pre></figure>

<p>And here is the respective contour plot:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C</span><span class="p">,</span> <span class="n">G</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">lambdas</span><span class="p">,</span> <span class="n">gammas</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">cp</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">contourf</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">G</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">real_loss</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">C</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"RdBu_r"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">cp</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">r'Loss as a function of $10^C$, $10^\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'$C$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">r'$\gamma$'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p align="center">
<img style="width: 65%; height: 65%" src="https://ekamperi.github.io/images/bayesian_optimization/ground_truth.png" alt="Bayesian optimization" />
</p>

<p>Let’s place the two plots side-by-side and talk about the results. In the <strong>left image</strong>, we see the ground-truth values of the loss function that we acquired by computing the value \(\ell(C, \gamma)\) for every possible pair of \((C, \gamma)\) via a grid-search. The blue shaded region corresponds to low values of the loss function (good!), and the red stripe at the top to high values (bad!). In the <strong>right image</strong>, the black points correspond to the values we tried. Do you notice the high density of points near the blue shaded area where \(\ell(C,\gamma)\) is minimized? That’s <strong>exploitation</strong>! The BO algorithm found some good solutions in that area and then sampled aggressively around that region. By contrast, it tried some values near the top red stripe, and since those trials yielded bad results, it didn’t bother sampling any further there.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Ground-truth values</th>
      <th style="text-align: center">Bayesian Optimization</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><img src="https://ekamperi.github.io/images/bayesian_optimization/ground_truth.png" alt="" /></td>
      <td style="text-align: center"><img src="https://ekamperi.github.io/images/bayesian_optimization/bayesian_optimization.png" alt="" /></td>
    </tr>
  </tbody>
</table>
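<p>The exploitation effect can also be quantified: with the trials array <code class="language-plaintext highlighter-rouge">ar</code> from above, one could count what fraction of the evaluated points falls within, say, unit distance of the best point, and compare it against the roughly 12.6% that a uniform sampler would place there. Below is a self-contained sketch, where <em>synthetic</em> points stand in for the real trials:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the (C, gamma, loss) trials built earlier:
# most points cluster near the minimum (exploitation), while a minority
# is spread uniformly over the box [-4, 1] x [-4, 1] (exploration).
minimum = np.array([0.73, -1.68])   # roughly where BO found the best loss
exploit = minimum + 0.3 * rng.standard_normal((800, 2))
explore = rng.uniform(-4.0, 1.0, size=(200, 2))
points = np.vstack([exploit, explore])

# Fraction of trials within unit distance of the minimum, vs. the
# fraction a uniform sampler would place there (circle area / box area).
dist = np.linalg.norm(points - minimum, axis=1)
frac_near = (dist < 1.0).mean()
uniform_frac = np.pi * 1.0**2 / 5.0**2

print(f"near minimum: {frac_near:.1%} vs uniform baseline {uniform_frac:.1%}")
```

An exploitation-heavy run concentrates far more trials near the optimum than the uniform baseline predicts.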

<h4 id="references">References</h4>
<ol>
  <li><a href="https://thuijskens.github.io/2016/12/29/bayesian-optimisation/">https://thuijskens.github.io/2016/12/29/bayesian-optimisation/</a></li>
</ol>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="algorithms" /><category term="Bayes theorem" /><category term="neural networks" /><category term="optimization" /><category term="programming" /><category term="Python" /><summary type="html"><![CDATA[An introduction to Bayesian-based optimization for tuning hyperparameters in machine learning models]]></summary></entry><entry><title type="html">Longest substring with non-repeating characters</title><link href="https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring.html" rel="alternate" type="text/html" title="Longest substring with non-repeating characters" /><published>2021-04-14T00:00:00+00:00</published><updated>2021-04-14T00:00:00+00:00</updated><id>https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring</id><content type="html" xml:base="https://ekamperi.github.io/programming/2021/04/14/longest-non-repeating-substring.html"><![CDATA[<p>I have been doing some interviews for job positions like data scientist, machine learning engineer, and software developer during the past months. To prepare for the coding part of these interviews and brush up on my algorithmic thinking and programming skills, I decided to do some ad-hoc practicing. There are lots of websites with coding challenges of varying difficulty. Some examples include <a href="https://leetcode.com/">Leetcode</a>, <a href="https://www.hackerrank.com/">HackerRank</a>, <a href="https://www.topcoder.com/">Topcoder</a>, and others. Although I kind of dislike the contrived nature of these quizzes, I joined Leetcode nonetheless. Anyway, I picked a problem under the “medium” difficulty category that I’ll blog about today. The problem is about <strong>finding the longest substring with non-repeating characters in a string</strong>.</p>

<h3 id="problem-formulation">Problem formulation</h3>
<p>Given a string <em>s</em>, find the length of the longest substring without repeating characters.</p>

<p><strong>Example 1</strong>:
Input: s = “abcabcbb”
Output: 3
Explanation: The answer is “abc”, with the length of 3.</p>

<p><strong>Example 2</strong>:
Input: s = “bbbbb”
Output: 1
Explanation: The answer is “b”, with the length of 1.</p>

<p><strong>Example 3</strong>:
Input: s = “pwwkew”
Output: 3
Explanation: The answer is “wke”, with the length of 3.
Notice that the answer must be a substring, “pwke” is a subsequence and not a substring.</p>

<p><strong>Example 4</strong>:
Input: s = “”
Output: 0</p>

<p><strong>Constraints</strong>:
\(0 \le \text{s.length} \le 5 \times 10^4\)
<em>s</em> consists of English letters, digits, symbols and spaces.</p>

<h3 id="solutions">Solutions</h3>
<p>We import some libraries that we will need later on.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">string</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span></code></pre></figure>

<p>For starters, we will write a function that generates random strings consisting of lowercase letters, digits, and whitespace characters of varying lengths. We will use it to see how our different solutions scale with increasing input size. When coding such problems, it’s essential to have abundant examples that cover all edge cases. By the way, I’ve found it easier to write and run my code in a Jupyter Notebook inside Visual Studio Code and then paste it to Leetcode.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> <span class="n">chars</span><span class="o">=</span><span class="n">string</span><span class="p">.</span><span class="n">ascii_lowercase</span> <span class="o">+</span> <span class="n">string</span><span class="p">.</span><span class="n">digits</span> <span class="o">+</span> <span class="n">string</span><span class="p">.</span><span class="n">whitespace</span><span class="p">):</span>
    <span class="k">return</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">chars</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">size</span><span class="p">))</span>

<span class="c1"># Print 10 random strings of random length [0,20) 
</span><span class="n">input_str</span> <span class="o">=</span> <span class="p">[</span><span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">20</span><span class="p">))</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
<span class="k">print</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>

<span class="c1">#    ['75ypzflfi85wgbe', 'k4dogu\x0c14ckj', 'zcj8aoquhzfsh1g7uyh', '\x0cce\r\tt48nq1gio', 'c58',
#     'ol\tnfq7', 'i', 'jsjn\t8', '2tj\x0bb413', '']</span></code></pre></figure>

<h3 id="the-horrible-solution">The horrible solution</h3>

<p>My first attempt resulted in the following readable yet, complexity-wise, absolutely horrible solution. The <code class="language-plaintext highlighter-rouge">rep()</code> function is actually fine, and we will be using it in the other solutions as well. It uses a dictionary to track whether a character has already been seen inside a substring. It has the advantage that it iterates over the substring only once, so it has \(\mathcal{O}(N)\) time complexity. Had we used a nested loop to search for repeating characters, that would have led us to \(\mathcal{O}(N^2)\) complexity from the get-go!</p>

<p>So, the following algorithm starts with the entire string and checks whether it has any repeating characters. If it doesn’t, then this is the longest such substring, of length N! Return its length, and we are done. If it does have repeating characters, though, we slice it into two substrings of length N-1. If only one of the two contains repeating characters, we know that the other is the longest substring, with length N-1. Return it immediately, and we are done. Lastly, if both substrings of length N-1 contain repeating characters, we need to dig deeper, and therefore we return the maximum length found by recursing into the two substrings.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">rep</span><span class="p">(</span><span class="n">s</span><span class="p">:</span><span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">'''Returns True if str has repeating characters in it and False otherwise'''</span>
    <span class="n">freq</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">s</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">freq</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="n">freq</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">helper</span><span class="p">(</span><span class="n">s</span><span class="p">:</span><span class="nb">str</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span><span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''The most horrible solution in terms of time and space complexity.
    It uses recursion to generate the substrings, starting from the full
    string and generating substrings.'''</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">n</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">n</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">s</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
    <span class="n">rep_a</span><span class="p">,</span> <span class="n">rep_b</span> <span class="o">=</span> <span class="n">rep</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="n">rep</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="p">(</span><span class="n">rep_a</span> <span class="ow">and</span> <span class="n">rep_b</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">helper</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">helper</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">verySlowLLS</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">helper</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">))</span></code></pre></figure>

<p>So, why does this algorithm perform so poorly? As I understand, there are two reasons: 1. Recursion is expensive because each time we call the <code class="language-plaintext highlighter-rouge">helper()</code> function, a new stack frame needs to be allocated, and 2. When we are calling <code class="language-plaintext highlighter-rouge">max(helper(a, n-1), helper(b, n-1))</code>, we don’t really <em>divide</em> the input, let alone <em>conquer</em> it! We merely go from N to N-1. It’s not as if we reduced the search space from N to N/2 or something. <strong>So, remember: if you recurse, you better be dividing the search space at each step, otherwise don’t recurse!</strong></p>
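<p>To see just how quickly this recursion blows up without memoization, we can instrument <code class="language-plaintext highlighter-rouge">helper()</code> with a call counter. The sketch below is self-contained, duplicating the recursion from above with a set-based <code class="language-plaintext highlighter-rouge">rep()</code>:</p>

```python
def rep(s):
    """True if s contains any repeated character (set-based variant)."""
    seen = set()
    for c in s:
        if c in seen:
            return True
        seen.add(c)
    return False

calls = 0

def helper(s, n):
    """Same recursion as above, instrumented with a call counter."""
    global calls
    calls += 1
    if n < 2:
        return n
    if not rep(s):
        return n
    a, b = s[:-1], s[1:]
    if not (rep(a) and rep(b)):
        return n - 1
    return max(helper(a, n - 1), helper(b, n - 1))

# Repeats spread throughout the string force both branches to recurse:
# an 18-character string already triggers tens of thousands of calls.
s = "abc" * 6
result = helper(s, len(s))
print(result, calls)
```

Since identical substrings are reached via many different paths and nothing is memoized, the number of calls grows roughly like \(2^N\) rather than \(N\).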

<h3 id="a-decent-solution-of-mathcalon2-complexity">A decent solution of \(\mathcal{O}(N^2)\) complexity</h3>
<p>The next two solutions use sliding windows, either forward or backward, to enumerate all possible substrings of a string. The forward method examines windows of increasing length, from 1 up to N, so it must keep track of the maximum length found so far; windows longer than the answer all contain repeats, so we cannot simply return the last window we checked.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">slowLLS_forward</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''It uses sliding windows of length 1, 2, ..., N-1, N.
    That's why we need to keep track of the currently maximum
    length.'''</span>
    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">L</span>
    <span class="n">max_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">L</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span> <span class="o">-</span> <span class="n">w</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">sub</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">:(</span><span class="n">i</span><span class="o">+</span><span class="n">w</span><span class="p">)]</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">sub</span><span class="p">):</span>
                <span class="n">current_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sub</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">current_len</span> <span class="o">&gt;</span> <span class="n">max_len</span><span class="p">:</span>
                    <span class="n">max_len</span> <span class="o">=</span> <span class="n">current_len</span>
    <span class="k">return</span> <span class="n">max_len</span></code></pre></figure>

<p>On the other hand, if we are moving backward, i.e., if we are examining substrings of decreasing length, we know that the first substring without any repeating characters is the one with the maximum length.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">slowLLS_backward</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''It uses sliding windows of length N, N-1, N-2, ..., 1.
    That's why we don't need to keep track of the currently
    maximum length. The first non-repeating substring we encounter
    is the one with the maximum length.'''</span>
    <span class="n">L</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">L</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">L</span>
    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">L</span> <span class="o">-</span> <span class="n">w</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">sub</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">:(</span><span class="n">i</span><span class="o">+</span><span class="n">w</span><span class="p">)]</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">rep</span><span class="p">(</span><span class="n">sub</span><span class="p">):</span>
                <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">sub</span><span class="p">)</span></code></pre></figure>

<h3 id="the-best-solution-of-mathcalon-complexity">The best solution of \(\mathcal{O}(N)\) complexity</h3>

<p>This is actually the best solution I could come up with. We use two variables to keep track of the start and the end of the currently maximal substring. Every time we see a non-repeating character, we advance the <em>end</em> of the current substring. Conversely, every time we encounter a repeating character, we advance the <em>start</em> of the current substring. In the latter case, however, we also need to remove from our dictionary all the characters that were discarded when we advanced the start of the substring.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">fastLLS</span><span class="p">(</span><span class="n">s</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">'''Calculate the longest non-repeating substring
    on one go, by keeping track of the start (variable a) and
    end (variable b) of the currently maximum such substring.'''</span>
    <span class="n">max_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">a</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">track</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">track</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="n">b</span> <span class="o">=</span> <span class="n">i</span>
            <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">start</span> <span class="o">=</span> <span class="n">a</span>
            <span class="n">end</span> <span class="o">=</span> <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">s</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]:</span>
                <span class="k">del</span> <span class="n">track</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
            <span class="n">a</span> <span class="o">=</span> <span class="n">end</span>
            <span class="n">b</span> <span class="o">=</span> <span class="n">i</span>
            <span class="n">track</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">if</span> <span class="n">m</span> <span class="o">&gt;</span> <span class="n">max_len</span><span class="p">:</span> <span class="n">max_len</span> <span class="o">=</span> <span class="n">m</span>
    <span class="k">return</span> <span class="n">max_len</span></code></pre></figure>

<p>Indeed, after submitting this solution to Leetcode I got:</p>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/runtime_vs_others.png" alt="Longest non-repeating substring" />
</p>
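
<p>For reference, the same sliding-window idea can also be written without the explicit deletion loop, by remembering only the index of each character’s most recent occurrence. The following is a sketch of this equivalent variant (not the submitted code; <code>longest_unique</code> is a name chosen here):</p>

```python
def longest_unique(s: str) -> int:
    '''Length of the longest substring without repeating characters,
    keeping only the most recent index of each character.'''
    last_seen = {}   # character -> index of its most recent occurrence
    a = 0            # start of the current window
    best = 0
    for i, c in enumerate(s):
        if c in last_seen and last_seen[c] >= a:
            a = last_seen[c] + 1   # jump the window past the repeated character
        last_seen[c] = i
        best = max(best, i - a + 1)
    return best
```

<p>E.g., <code>longest_unique("abcabcbb")</code> returns 3, in agreement with <code>fastLLS</code>, while touching each character only once.</p>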

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">step</span> <span class="o">=</span> <span class="mi">2</span>
<span class="k">def</span> <span class="nf">profile_function</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="s">'''Profile `f' by applying it on input strings of
    progressively increasing length up to `n'.'''</span>
    <span class="n">runtimes</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">step</span><span class="p">):</span>
        <span class="n">input_str</span> <span class="o">=</span> <span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
        <span class="n">f</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
        <span class="n">runtimes</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">i</span><span class="p">,</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">runtimes</span>

<span class="k">def</span> <span class="nf">plot_runtimes</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">fitDegree</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
    <span class="s">'''Plot runtimes along with a polynomial fit of `fitDegree' degree.
    By default don't create figure / show the plot, so that we can call
    this function inside a subplot() context.'''</span>
    <span class="c1">#plt.figure()
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Input string length'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Execution time in sec'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>

    <span class="c1"># Add a polynomial fit
</span>    <span class="k">if</span> <span class="n">fitDegree</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">poly1d</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">polyfit</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">),</span> <span class="n">fitDegree</span><span class="p">))</span>
        <span class="n">polyline</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">*</span> <span class="n">step</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">polyline</span><span class="p">,</span> <span class="n">model</span><span class="p">(</span><span class="n">polyline</span><span class="p">),</span> <span class="s">'r'</span><span class="p">)</span>
    <span class="c1">#plt.show()</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_very_slow</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">verySlowLLS</span><span class="p">,</span> <span class="mi">34</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_very_slow</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">verySlowLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_7_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_slow_forward</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">slowLLS_forward</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_forward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_forward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_8_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_slow_backward</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">slowLLS_backward</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_backward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_backward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_9_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">runtimes_fast</span> <span class="o">=</span> <span class="n">profile_function</span><span class="p">(</span><span class="n">fastLLS</span><span class="p">,</span> <span class="mi">10000</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_fast</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fastLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 80%; height: 80%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_10_0.svg" alt="Longest non-repeating substring" />
</p>

<p>As a sanity check, we verify that all algorithms return the same result for strings of various lengths. We can’t really go past a length of 30 characters because the recursive algorithm takes ages to run.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Sanity check -- all algorithms should agree
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">3</span><span class="p">):</span>
    <span class="n">input_str</span> <span class="o">=</span> <span class="n">str_generator</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>
    <span class="n">y1</span> <span class="o">=</span> <span class="n">fastLLS</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y2</span> <span class="o">=</span> <span class="n">slowLLS_forward</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y3</span> <span class="o">=</span> <span class="n">slowLLS_backward</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="n">y4</span> <span class="o">=</span> <span class="n">verySlowLLS</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">y1</span> <span class="o">!=</span> <span class="n">y2</span> <span class="ow">or</span> <span class="n">y2</span> <span class="o">!=</span> <span class="n">y3</span> <span class="ow">or</span> <span class="n">y3</span> <span class="o">!=</span> <span class="n">y4</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">input_str</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">y3</span><span class="p">,</span> <span class="n">y4</span><span class="p">)</span>
        <span class="k">break</span></code></pre></figure>

<p>In this plot we combine the running times of all algorithms side by side.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Plot the runtimes of all algorithms side by side
</span><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_fast</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fastLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_backward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_backward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_slow_forward</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">slowLLS_forward</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span>
<span class="n">plot_runtimes</span><span class="p">(</span><span class="n">runtimes_very_slow</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">verySlowLLS</span><span class="p">.</span><span class="n">__name__</span><span class="p">)</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_12_0.svg" alt="Longest non-repeating substring" />
</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="c1">#plt.xscale('log')
</span><span class="n">plt</span><span class="p">.</span><span class="n">yscale</span><span class="p">(</span><span class="s">'log'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_fast</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_slow_backward</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_slow_forward</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">runtimes_very_slow</span><span class="p">),</span> <span class="n">s</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Input string length'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Execution time in sec'</span><span class="p">);</span></code></pre></figure>

<p align="center">
 <img style="width: 100%; height: 100%" src="https://ekamperi.github.io/images/Leetcode/longest_nonrepeating_substring/output_13_0.svg" alt="Longest non-repeating substring" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="programming" /><category term="algorithms" /><category term="Leetcode" /><category term="programming" /><category term="Python" /><summary type="html"><![CDATA[How to find the longest substring with non-repeating characters in a string]]></summary></entry><entry><title type="html">Decision Trees: Gini index vs entropy</title><link href="https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees.html" rel="alternate" type="text/html" title="Decision Trees: Gini index vs entropy" /><published>2021-04-13T00:00:00+00:00</published><updated>2021-04-13T00:00:00+00:00</updated><id>https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees</id><content type="html" xml:base="https://ekamperi.github.io/machine%20learning/2021/04/13/gini-index-vs-entropy-decision-trees.html"><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Decision trees are tree-based methods that are used for both regression and classification. They work by segmenting the feature space into several simple subregions. To make a prediction, a tree returns either the mean <em>or</em> the most frequent class of the training points inside the region our observation falls into, depending on whether we are doing regression or classification, respectively. Decision trees are straightforward to interpret; as a matter of fact, they can be even easier to interpret than linear or logistic regression models, perhaps because they resemble how the human decision-making process works. On the downside, trees usually lack the predictive accuracy of other methods. Also, they can be sensitive to changes in the training dataset, where a slight change may cause a dramatic change in the final tree. That’s why <em>bagging</em>, <em>random forests</em> and <em>boosting</em> are used to construct more robust tree-based prediction models. But that’s for another day. Today we are going to talk about how the split happens.</p>

<h3 id="gini-impurity-and-information-entropy">Gini impurity and information entropy</h3>
<p>Trees are constructed via <strong>recursive binary splitting of the feature space</strong>. In the classification scenarios that we will be discussing today, the criteria typically used to decide which feature to split on are the <strong>Gini index</strong> and <strong>information entropy</strong>. The two measures are numerically quite similar. They take small values if most observations in a node fall into the same class. Conversely, they are maximized when there is an equal number of observations across all classes in a node. A node with mixed classes is called impure, and the Gini index is also known as <strong>Gini impurity</strong>.</p>

<p>Concretely, for a set of items with \(K\) classes, and \(p_k\) being the fraction of items labeled with class \(k\in {1,2,\ldots,K}\), the <strong>Gini impurity</strong> is defined as:</p>

\[G = \sum_{k=1}^K p_k (1 - p_k) = 1 - \sum_{k=1}^K p_k^2\]

<p>And <strong>information entropy</strong> as:</p>

\[H = -\sum_{k=1}^K p_k \log p_k\]

<p>In the following plot, the two metrics are plotted against each other for a set of \(K=2\) classes with probabilities \(p\) and \(1-p\), respectively. Notice how, for small values of \(p\), Gini is consistently lower than entropy; therefore, it penalizes small impurities less. <strong>This is a crucial observation that will prove helpful in the context of imbalanced datasets</strong>.</p>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_vs_entropy.png" alt="Gini vs entropy" />
</p>
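
<p>We can confirm this behavior numerically (a quick sketch; the helper names <code>gini</code> and <code>entropy</code> are chosen here, and natural logarithms are used, so entropy is in nats):</p>

```python
import math

def gini(p):
    # Gini impurity of a two-class node with class probabilities p and 1-p
    return 1 - p**2 - (1 - p)**2

def entropy(p):
    # Information entropy (in nats) of the same two-class node
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

for p in (0.01, 0.1, 0.5):
    print(f"p={p}: gini={gini(p):.3f}, entropy={entropy(p):.3f}")
```

<p>For small \(p\), Gini stays well below entropy, which is precisely the lenience toward small impurities mentioned above.</p>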

<p>The Gini index is used by the CART (classification and regression tree) algorithm, whereas information gain via entropy reduction is used by algorithms like <a href="https://en.wikipedia.org/wiki/C4.5_algorithm">C4.5</a>. In the following image, we see a part of a decision tree for predicting whether a person receiving a loan will be able to pay it back. The left node is an example of a low impurity node since most of the observations fall into the same class. Contrast this with the node on the right where observations of different classes are mixed in.</p>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/pure_vs_impure_node.png" alt="Decision trees: pure vs impure nodes" />
</p>

<p>Image taken from “Provost, Foster; Fawcett, Tom. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking”.</p>

<p>Let’s calculate the <strong>Gini impurity of the left node</strong>:</p>

\[\begin{align}
G\left(\text{Balance &lt; 50K}\right)
&amp;= 1-\sum_{k=1}^{2} p_k^2 = 1-p_1^2 - p_2^2\\
&amp;=1-\left(\frac{12}{13}\right)^2 -\left(\frac{1}{13}\right)^2
\simeq 0.14
\end{align}\]

<p>And the <strong>Gini impurity of the right node</strong>:</p>

\[\begin{align}
G\left(\text{Balance} \ge \text{50K}\right)
&amp;= 1-\sum_{k=1}^{2} p_k^2 = 1-p_1^2 - p_2^2\\
&amp;=1-\left(\frac{4}{17}\right)^2 -\left(\frac{13}{17}\right)^2
\simeq 0.36
\end{align}\]

<p>We notice that the left node has a lower Gini impurity index, which we’d expect since \(G\) measures impurity and the left node is purer relative to the right one. Let’s now calculate the <strong>entropy of the left node</strong>:</p>

\[\begin{align}
H\left(\text{Balance &lt; 50K}\right)
&amp;= -\sum_{k=1}^{2} p_k \log{p}_k = -p_1 \log{p}_1 -p_2 \log{p}_2\\
&amp;=-\frac{12}{13}\log\left(\frac{12}{13}\right) -\frac{1}{13}\log\left(\frac{1}{13}\right)
\simeq 0.27\ \text{nats}
\end{align}\]

<p>Depending on whether we use \(\log_2\) or \(\log_e\) in the entropy formula, we get the result in <em>bits</em> or <em>nats</em>, respectively. For instance, here it’s \(H \simeq 0.39\ \text{bits}\). Let’s calculate the <strong>entropy of the right node</strong> as well:</p>

\[\begin{align}
H\left(\text{Balance}\ge\text{50K}\right)
&amp;= -\sum_{k=1}^{2} p_k \log{p}_k = -p_1 \log{p}_1 -p_2 \log{p}_2\\
&amp;=-\frac{4}{17}\log\left(\frac{4}{17}\right) -\frac{13}{17}\log\left(\frac{13}{17}\right)
\simeq 0.55\ \text{nats}
\end{align}\]

<p>Again, if we use base 2 in the entropy’s logarithm, we get \(H \simeq 0.79\ \text{bits}\). Units aside, we see that the left node has lower entropy than the right one, which is expected since the left one is in a more <em>ordered</em> state and entropy measures <em>disorder</em>. So, \(H_\text{left} \simeq 0.27\ \text{nats}\) and \(H_\text{right} \simeq 0.55\ \text{nats}\). <strong>The various algorithms for constructing decision trees pick the feature to split on next so that the maximum impurity reduction is achieved.</strong></p>

<p>Let’s calculate how much entropy is reduced by splitting on the “Balance” feature:</p>

\[\begin{align*}
H(\text{Parent}) &amp;= -\frac{16}{30} \log\left(\frac{16}{30}\right) -\frac{14}{30}\log\left(\frac{14}{30}\right)\simeq 0.69\ \text{nats}\\
H(\text{Balance}) &amp;= \frac{13}{30} \times 0.27 + \frac{17}{30} \times 0.55 \simeq 0.43\ \text{nats}
\end{align*}\]

<p>Therefore, the information gain by splitting on the “Balance” feature is:</p>

\[\text{IG} = H(\text{Parent}) - H(\text{Balance}) = 0.69 - 0.43 = 0.26\ \text{nats}\]

<p>If we were to choose between “Balance” and some other feature, say “Education”, we would make up our mind based on the IG of both. If the IG of “Balance” was 0.26 nats and the IG of “Education” was 0.14 nats, we would pick the former to split on.</p>
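
<p>The hand calculations above can be cross-checked in a few lines (a sketch using natural logarithms, so the results are in nats; <code>H</code> is a helper name chosen here):</p>

```python
import math

def H(*counts):
    '''Information entropy (in nats) of a node with the given class counts.'''
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c)

h_left   = H(12, 1)    # Balance < 50K
h_right  = H(4, 13)    # Balance >= 50K
h_parent = H(16, 14)
# Entropy after the split, weighted by the fraction of observations per child
h_split  = 13/30 * h_left + 17/30 * h_right
print(f"H_left={h_left:.2f}  H_right={h_right:.2f}  "
      f"H_parent={h_parent:.2f}  IG={h_parent - h_split:.2f}")
```

<p>Running it reproduces the numbers above: \(H_\text{left}\simeq 0.27\), \(H_\text{right}\simeq 0.55\), \(H(\text{Parent})\simeq 0.69\), and \(\text{IG}\simeq 0.26\) nats.</p>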

<p>So when do we use Gini impurity versus information gain via entropy reduction? Both metrics work more or less the same, and in only a few cases do the results differ considerably. Having said that, <strong>there’s a scenario where entropy might be more prudent: imbalanced datasets.</strong></p>

<h3 id="an-example-of-an-imbalanced-dataset">An example of an imbalanced dataset</h3>

<p>The package <a href="https://cran.r-project.org/web/packages/ROSE/ROSE.pdf">ROSE</a> comes with a built-in imbalanced dataset named <em>hacide</em>, consisting of <em>hacide.train</em> and <em>hacide.test</em>. The dataset has three variables in it for a total of \(N=10^3\) observations. The <em>cls</em>, short for “class”, is the response categorical variable, and \(x_1\) and \(x_2\) are the predictor variables. For building our classification trees, we will use the <a href="https://cran.r-project.org/web/packages/rpart/rpart.pdf">rpart</a> package.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Load the necessary libraries and the dataset </span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ROSE</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rpart</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rpart.plot</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">hacide</span><span class="p">)</span><span class="w">

</span><span class="c1"># Check imbalance on training set</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">hacide.train</span><span class="o">$</span><span class="n">cls</span><span class="p">)</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#   0   1 </span><span class="w">
</span><span class="c1"># 980  20 </span></code></pre></figure>

<p>As you may see from the output above, this is a very imbalanced dataset. The vast majority, 980, of the 1000 observations belong to the “0” class, and only 20 belong to the “1” class. We will now fit a decision tree by using Gini as the split criterion.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Use gini as the split criterion</span><span class="w">
</span><span class="n">tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpart</span><span class="p">(</span><span class="n">cls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.train</span><span class="p">,</span><span class="w"> </span><span class="n">parms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gini"</span><span class="p">))</span><span class="w">
</span><span class="n">pred.tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.test</span><span class="p">)</span><span class="w">
</span><span class="n">accuracy.meas</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Call: </span><span class="w">
</span><span class="c1"># accuracy.meas(response = hacide.test$cls, predicted = pred.tree.imb[, 2])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Examples are labelled as positive when predicted is greater than 0.5 </span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># precision: 1.000</span><span class="w">
</span><span class="c1"># recall: 0.200</span><span class="w">
</span><span class="c1"># F: 0.167</span></code></pre></figure>

<p>Things don’t look all that great. Although we have perfect precision (reminder: \(\text{Precision} = TP/(TP+FP)\)), meaning that we don’t produce any false positives, our recall is very low (reminder: \(\text{Recall} = TP/(TP+FN)\)), meaning that we miss many positives as false negatives. In effect, our classifier almost always outputs the majority class “0”. The F metric is also very low. And the ROC curve below shows just how poor our performance is.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">roc.curve</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">plotit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gini index"</span><span class="p">)</span><span class="w">
</span><span class="c1"># Area under the curve (AUC): 0.600</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_auc.png" alt="ROC curve of the tree grown with the Gini criterion" />
</p>
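<p>As a quick sanity check, the numbers reported by <code>accuracy.meas()</code> above are consistent with the tree catching just 1 of the 5 positive test examples while raising no false alarms. The counts below are an assumption reverse-engineered from the printed output, not something the package reports directly; note also that the printed F value matches \(PR/(P+R)\), i.e. half the conventional F1 score \(2PR/(P+R)\).</p>

```python
# Assumed confusion counts, consistent with precision = 1.000 and recall = 0.200:
# 1 true positive, 0 false positives, 4 false negatives.
TP, FP, FN = 1, 0, 4

precision = TP / (TP + FP)   # 1/1 = 1.000: no false positives
recall    = TP / (TP + FN)   # 1/5 = 0.200: 4 of 5 positives missed

# The F reported above matches P*R/(P+R), half the usual F1 = 2*P*R/(P+R).
F = precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(F, 3))  # 1.0 0.2 0.167
```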

<p>So what went wrong here? Let’s take a look at the decision tree itself. Notice that the left node has 10 observations of the minority class and 979 of the majority class. From the perspective of the Gini impurity index, that’s a very pure node, because \(G_L = 1 - (10/989)^2 - (979/989)^2 \simeq 0.02\). The same applies, albeit to a lesser degree, to the right node: \(G_R = 1 - (1/11)^2 - (10/11)^2\simeq 0.17\). Therefore, Gini regards these nodes as nearly pure and sees little to gain from splitting them further, which is exactly the wrong behavior for our imbalanced dataset.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rpart.plot</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gini Index"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">extra</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/gini_tree.png" alt="Decision tree grown with the Gini criterion" />
</p>
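<p>The node impurities quoted above are easy to check numerically. Here is a minimal Python sketch of the arithmetic, with the class counts taken from the tree plot:</p>

```python
# Gini impurity of a node: gini(p) = 1 - sum(p_k^2) over class proportions p_k.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

G_L = gini([10, 979])  # left node: 10 minority vs 979 majority observations
G_R = gini([1, 10])    # right node: 1 minority vs 10 majority observations

print(round(G_L, 2), round(G_R, 2))  # 0.02 0.17
```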

<p>Let’s repeat the fitting, but now we will use entropy as the split criterion for growing our tree.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Use information gain as the split criterion</span><span class="w">
</span><span class="n">tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rpart</span><span class="p">(</span><span class="n">cls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.train</span><span class="p">,</span><span class="w"> </span><span class="n">parms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">split</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"information"</span><span class="p">))</span><span class="w">
</span><span class="n">pred.tree.imb</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hacide.test</span><span class="p">)</span><span class="w">
</span><span class="n">accuracy.meas</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># Call: </span><span class="w">
</span><span class="c1"># accuracy.meas(response = hacide.test$cls, predicted = pred.tree.imb[, 2])</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1">#  Examples are labelled as positive when predicted is greater than 0.5 </span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># precision: 1.000</span><span class="w">
</span><span class="c1"># recall: 0.400</span><span class="w">
</span><span class="c1"># F: 0.286</span></code></pre></figure>

<p>The precision is still perfect, i.e. we aren’t predicting any false positives, and we doubled the recall. The improvement is also reflected in the F metric. Moreover, the ROC curve of the new decision tree is markedly better than that of the previous run.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">roc.curve</span><span class="p">(</span><span class="n">hacide.test</span><span class="o">$</span><span class="n">cls</span><span class="p">,</span><span class="w"> </span><span class="n">pred.tree.imb</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">plotit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="c1"># Area under the curve (AUC): 0.883</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/entropy_auc.png" alt="ROC curve of the tree grown with the information gain criterion" />
</p>
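<p>Some intuition for why entropy fares better here: for the same class counts, entropy assigns relatively more impurity to a nearly-pure node contaminated by a few minority observations than Gini does, so the “information” criterion retains a stronger incentive to keep splitting such nodes. This is a rough sketch of the effect, not a full account; the node counts below are the ones from the Gini tree above.</p>

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# Same two nodes as in the Gini tree: entropy/Gini ratio is larger for the
# nearly-pure left node, i.e. entropy penalizes its contamination relatively more.
for name, counts in [("left (10 vs 979)", [10, 979]), ("right (1 vs 10)", [1, 10])]:
    print(f"{name}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```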

<p>Here is the decision tree itself. Admittedly, it’s a bit more complex than the one we got with Gini, but overall the classifier is more performant and useful.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rpart.plot</span><span class="p">(</span><span class="n">tree.imb</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Information Gain"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">extra</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>

<p align="center">
<img style="width: 70%; height: 70%" src="https://ekamperi.github.io/images/decision_trees/entropy_tree.png" alt="Decision tree grown with the information gain criterion" />
</p>]]></content><author><name>Stathis Kamperis</name></author><category term="machine learning" /><category term="decision trees" /><category term="machine learning" /><category term="mathematics" /><category term="R language" /><summary type="html"><![CDATA[Gini index vs entropy in decision trees with imbalanced datasets]]></summary></entry></feed>