How to reduce the amount of labelled data required for your Deep Learning model with a human in the loop?#

In this blog post, we will explore how Bayesian active learning can be used to reduce the amount of labeled data required for a deep learning model with a human in the loop in the context of a classification task.

Active learning is a machine learning approach in which a model iteratively selects the most informative points from a large pool of unlabeled data to be labeled and added to the training set. The goal is to maximize performance on new, unseen data while minimizing the amount of labeled data required.

This iterative process is illustrated below.

Active Learning cycle#

The active learning cycle involves iteratively selecting the most informative unlabeled samples, requesting labels from a human annotator, incorporating those labels into the training data, and retraining the model to improve its performance.

[Animation: the active learning cycle]
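This loop can be sketched in a few lines of Python. The sketch below is illustrative only: `model`, `acquire`, and `label_fn` are hypothetical placeholders standing in for the classifier, the informativeness score, and the human annotator.

```python
import numpy as np

def active_learning_loop(model, acquire, label_fn,
                         X_train, y_train, X_pool, n_rounds=5, k=10):
    """Pool-based active learning loop (sketch).

    `acquire(model, X)` scores each pool point by informativeness;
    `label_fn` plays the role of the human annotator.
    """
    for _ in range(n_rounds):
        model.fit(X_train, y_train)                  # (re)train on labeled data
        scores = acquire(model, X_pool)              # score unlabeled samples
        idx = np.argsort(scores)[-k:]                # k most informative points
        X_new, y_new = X_pool[idx], label_fn(X_pool[idx])
        X_train = np.concatenate([X_train, X_new])   # add new labels to the
        y_train = np.concatenate([y_train, y_new])   # training set
        X_pool = np.delete(X_pool, idx, axis=0)      # remove them from the pool
    return model, X_train, y_train
```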

Mathematical problem formulation#

As described in [Houlsby et al., 2011], within a Bayesian framework we assume the existence of some latent parameters, \(\omega\), that control the dependence between inputs and outputs, \(p(y|x, \omega)\).

Having observed data \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}\), a posterior distribution over the parameters is inferred, \(p(\omega|\mathcal{D})\).

To quantify the uncertainty about \(\omega\), we use Shannon’s entropy from information theory: \(H(\omega|\mathcal{D}) = -\int p(\omega|\mathcal{D})\log p(\omega|\mathcal{D})\,d\omega\).

Our problem consists in selecting the points whose labels most reduce this entropy.

A greedy approach to this problem consists in selecting the next record that leads to the most immediate benefit, without considering the potential consequences of that decision in the long run.

Therefore, we seek the data point x that maximises the expected decrease in posterior entropy:

\[\underset{x}{\mathrm{argmax}} H(\omega|\mathcal{D}) − E_{y \sim p(y|x,\mathcal{D})} H(\omega|y,x,\mathcal{D})\]

However, the parameter posteriors are high dimensional for deep learning models and computing their entropies is intractable.

So we are going to see an equivalent formulation of this problem that leads to more tractable estimations thanks to the concept of mutual information.

Bayesian Active Learning by Disagreement#

Mutual Information#

In information theory, mutual information is a non-negative quantity that is symmetric in its arguments. Given two random variables A and B, the mutual information measures how much information A carries about B, and vice versa:

\[\begin{split}\begin{flalign*} I[A, B] & := H[A] − E_{p(B)} H[A|B] &\\ & = H[B] − E_{p(A)} H[B|A] &\\ & = H[A] + H[B] − H[A, B] &\\ \end{flalign*}\end{split}\]
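As a quick sanity check, the third identity can be computed numerically for a made-up joint distribution of two binary variables (the numbers below are purely illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint distribution p(A, B) of two binary variables.
p_ab = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_a = p_ab.sum(axis=1)  # marginal of A
p_b = p_ab.sum(axis=0)  # marginal of B

# I[A, B] = H[A] + H[B] - H[A, B]  (third identity)
mi = entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())
```

Since the two variables here are correlated, `mi` comes out strictly positive, and the same value is obtained from the first identity \(H[A] - E_{p(B)} H[A|B]\).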

Problem reformulation#

Setting \(A = \omega|\mathcal{D}\) and \(B = y|\mathcal{D}, x\), we notice that:

\[H(\omega|\mathcal{D}) − E_{y \sim p(y|x,\mathcal{D})} H(\omega|y,x,\mathcal{D}) = H(y|x,\mathcal{D}) − E_{\omega \sim p(\omega|\mathcal{D})} H(y|x, \omega)\]

We get the BALD formulation:

\[\underset{x}{\mathrm{argmax}} H(y|x,\mathcal{D}) − E_{\omega \sim p(\omega|\mathcal{D})} H(y|x, \omega)\]

Interpretation of this problem formulation:

  • The first term favours records x for which the model is marginally most uncertain about its output y (high \(H(y|x,\mathcal{D})\)).

  • The second term penalizes records x for which the individual sampled models are themselves uncertain about their predictions: a high \(E_{\omega \sim p(\omega|\mathcal{D})} H(y|x, \omega)\) lowers the score, so we favour records where each sampled model is confident.

This keeps records where the models disagree, hence the term disagreement in BALD.

How to compute BALD#

To calculate \(H(y|x,\mathcal{D}) − E_{\omega \sim p(\omega|\mathcal{D})} H(y|x, \omega)\), a fast and scalable approach is to use dropout uncertainty, as described in [Gal et al., 2017].

We approximate \(p(\omega|\mathcal{D})\) by \(q_{\theta}(\omega)\), the dropout distribution, where \(\theta\) denotes the variational parameters learned at training time (for more information about this approximation, please refer to this previous post).

We then get:

\[\begin{split}\begin{flalign*} H(y|x,\mathcal{D}) & := - \sum_{c} p(y=c | x,\mathcal{D}) \log p(y=c | x,\mathcal{D}) &\\ & = - \sum_{c} \left( \int p(y=c | x, \omega) p(\omega | \mathcal{D}) d\omega \right) \log \left( \int p(y=c | x, \omega) p(\omega | \mathcal{D}) d\omega \right) &\\ & \approx - \sum_{c} \left( \int p(y=c | x, \omega) q_{\theta}(\omega) d\omega \right) \log \left( \int p(y=c | x, \omega) q_{\theta}(\omega) d\omega \right) &\\ & \approx - \sum_{c} \left( \frac{1}{T} \sum_{t=1}^{T} \hat{p}_{c}(y | x, \hat{\omega}_{t}) \right) \log \left( \frac{1}{T} \sum_{t=1}^{T} \hat{p}_{c}(y | x, \hat{\omega}_{t}) \right) \quad \text{ with } \hat{\omega}_{t} \sim q_{\theta}(\omega) &\\ \end{flalign*}\end{split}\]

Similarly:

\[\begin{split}\begin{flalign*} - E_{\omega \sim p(\omega|\mathcal{D})} H(y|x, \omega) & := \sum_{c} \int p(y=c | x, \omega) \log p(y=c | x, \omega) \, p(\omega | \mathcal{D}) d\omega &\\ & \approx \frac{1}{T} \sum_{t=1}^{T} \sum_{c} \hat{p}_{c}(y | x, \hat{\omega}_{t}) \log \hat{p}_{c}(y | x, \hat{\omega}_{t}) &\\ \end{flalign*}\end{split}\]
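Putting the two Monte Carlo estimates together, the BALD score can be computed from the stacked softmax outputs of the T stochastic forward passes. `bald_scores` below is a hypothetical helper, assuming the predictions are stored as a `(T, N, C)` array:

```python
import numpy as np

def bald_scores(probs, eps=1e-12):
    """BALD acquisition scores from MC-dropout samples.

    probs: array of shape (T, N, C) -- softmax outputs of T stochastic
    forward passes (dropout enabled) for N pool points and C classes.
    Returns one score per pool point (the estimated mutual information).
    """
    mean_p = probs.mean(axis=0)                                   # ~ p(y|x, D)
    predictive = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)  # H(y|x, D)
    expected = -np.sum(probs * np.log(probs + eps),
                       axis=-1).mean(axis=0)                      # E_w H(y|x, w)
    return predictive - expected
```

With these scores in hand, `np.argsort(scores)[-k:]` gives the k pool points to send for labeling. Note that when every pass agrees, the two entropy terms cancel and the score is near zero; the score is large only when the passes are individually confident but mutually inconsistent.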

Conclusion#

By doing T stochastic forward passes of our deep learning model with dropout activated, we can estimate the BALD score.

We can then ask annotators to label the k highest-scoring predictions from our set of unlabeled data.
