Intro to Online Learning March 14, 2018

Author: James Robinson

Read Time: 16 minutes


What is Online Learning?

Online learning (OL) is a method of machine learning that follows a very particular model. A typical machine learning scenario involves a (possibly very large) set of training data and a set of test data (often called batch learning). In OL, by contrast, the data is not available at the start of learning, but rather arrives sequentially, one example at a time, and our learning algorithm must improve as it goes along, learning from past examples in order to make accurate predictions on future ones.

OL has some very attractive qualities. In batch learning, it is assumed that the training data is sampled i.i.d. (independently and identically distributed) from some unknown distribution, and that the test data is drawn from the same distribution. A hypothesis (such as a classifier or regressor) is then formed from the training data with the goal of having low error on the unseen test data - that is, low generalisation error. OL, on the other hand, makes no statistical assumptions about the data received! OL algorithms are also usually very fast and highly scalable, since they typically require very little memory. This makes them well-suited to big data, as the data is processed one example at a time.

In OL, learning proceeds in trials, or rounds, \(t = 1,2,...\). On round \(t\) the algorithm is presented with a piece of data (usually a vector \(x_{t} \in \mathbb{R}^{n}\)) and a question (to predict the value of \(y_{t}\)). It makes its prediction \(\hat{y}_{t}\), then receives the true answer and suffers a loss. This loss could even be negative (a reward) if the answer given was correct, but in most cases the loss is simply zero when \(\hat{y}_{t} = y_{t}\). The actual function we use to measure the loss depends on the situation and the type of prediction we are making. More generally, the typical metric we use to measure the success of our algorithm is the cumulative loss: the total loss accrued over all trials.

OL is a very natural way of connecting the fields of machine learning, game theory, and decision theory. You can see why: the whole situation we just described can be viewed as a game between your algorithm and an adversary. Your adversary gives you \(x_{t}\), you must predict \(\hat{y}_{t}\), and you then receive \(y_{t}\) and your loss. We will have a lot more to say about this situation later.

A Natural OL Problem - Learning with Expert Advice

Consider the following problem: you want to try your luck on the stock market, and have invested in Fake-Company-Ltd. Now you're smart about your investment and realize that you have 5 friends who have each had their own success investing, and who all claim to be experts. Being the smart person you are, you decide that every day you will ask each of them to predict what will happen to the stock you own that day (rise or fall). You of course must make your own decision based on their predictions. If we combine these predictions into a vector, setting \(rise = 1\) and \(fall = 0\), we have a binary vector \(\mathbf{x}_{t} \in \{0,1\}^5\) (because we have 5 experts). For example, on your first day you receive

\[ \mathbf{x}_{1} = \left[ 1,1,0,0,1\right]\]

and you must predict \(\hat{y}_{1}\), which might result in you buying more stock or selling what you own. With no other information, you would hopefully predict \(\hat{y}_{1} = 1\), but at the end of the day you receive the 'true label': your stock went down! \(y_{1} = 0\)! The next day your experts offer the following advice:

\[ \mathbf{x}_{2} = \left[ 0,1,1,0,1\right]\]

Now what do you do? Over time you will surely notice that some of your experts are better than others and make fewer mistakes. Our algorithm should of course try to identify which expert (or experts) is best, so that we can weight their advice more heavily. Before we do that, let's formalise the setting of this algorithm.


For t = 1,2,...,T do:
    receive x_t
    predict pred_yt
    receive true label y_t
    receive loss |pred_yt - y_t|

Here our loss on a particular round is the "0-1 loss": we receive a loss of 1 on each mistake and 0 otherwise, so the cumulative loss essentially counts the number of mistakes we make.

If our algorithm \(A\) receives a sequence \(S = \left\{\left(\mathbf{x}_{1},y_{1}\right),\left(\mathbf{x}_{2},y_{2}\right), \dots ,\left(\mathbf{x}_{T},y_{T}\right)\right\}\), our total loss is then

\[L_{A}\left(S\right) = \sum_{t = 1}^{T}\vert\hat{y}_{t} - y_{t}\vert\]

So that is our goal, to produce an algorithm \(A\) that has a small \(L_{A}\left(S\right)\) after \(T\) trials.
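
To make the protocol concrete, here is a minimal Python sketch of the loop above and its cumulative loss. The stream of advice/label pairs is made up purely for illustration, and the plain (unweighted) majority vote stands in for whatever prediction rule we end up using.

import numpy as np

# Hypothetical stream of (expert advice, true label) pairs for 5 experts.
stream = [(np.array([1, 1, 0, 0, 1]), 0),
          (np.array([0, 1, 1, 0, 1]), 1)]

cumulative_loss = 0
for x_t, y_t in stream:
    y_hat = int(x_t.sum() >= len(x_t) / 2)   # predict with a plain majority vote
    cumulative_loss += abs(y_hat - y_t)      # 0-1 loss on this round

print(cumulative_loss)                       # L_A(S) over the stream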

A Natural Solution - The Halving Algorithm

Let's assume the very idealized setting in which one expert is going to be correct on all trials. Our goal would be to find that expert as quickly as possible and start following their advice. This wouldn't take too long among our 5 financial friends, so let's start with a pool of \(n\) experts. The Halving algorithm is devilishly simple and very effective at finding the best expert.

All we need is a Consistency Class, \(C\), which is just the set of experts who have predicted correctly on every trial so far. Initially all experts belong to \(C\). When we receive our first instance \(\mathbf{x}_{1}\), we predict with the 'majority vote' of the experts in \(C\), and whenever an expert predicts incorrectly, they are removed from \(C\). It's really that simple.

Concretely then, if we define \(\text{maj}_{C}\left(\mathbf{x}_{t}\right) \in \{0,1\}\) to be the 'majority vote' of the predictions in \(\mathbf{x}_{t}\) made by the experts in \(C\), we can define the Halving algorithm in pseudo-code as follows:


Initialize: C := set of all experts
For t = 1,...,T do:
  receive x_t
  predict pred_yt = majority vote of experts in C
  receive y_t
  for each expert i in C:
    if x_t[i] != y_t:
      remove expert i from C
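
As a rough, runnable Python version of this pseudo-code (my own sketch; the stream is again a list of NumPy advice vectors paired with their true labels), the Halving algorithm might look like:

import numpy as np

def halving(stream, n):
    """Run the Halving algorithm; return the number of mistakes made."""
    C = set(range(n))                             # consistency class: no mistakes yet
    mistakes = 0
    for x_t, y_t in stream:
        votes = [x_t[i] for i in C]
        y_hat = int(sum(votes) >= len(votes) / 2) # majority vote of experts in C
        mistakes += int(y_hat != y_t)
        C = {i for i in C if x_t[i] == y_t}       # drop every expert that was wrong
    return mistakes

In the idealized setting where one expert is always correct, \(C\) never becomes empty, so the majority vote is always well defined.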

A Mistake Bound on the Halving Algorithm

We can now derive a mistake bound: the maximum number of mistakes the algorithm can make. Since we predict with the majority vote of the experts in \(C\), whenever we make a mistake at least half of the experts in \(C\) were wrong and can be eliminated. If we define \(\vert C_{t} \vert\) as the cardinality (size) of the set \(C\) at time \(t\), then we know that

\[\vert C_{t+1} \vert \leq \frac{1}{2}\vert C_{t} \vert\]

and since \(\vert C_{1} \vert = n\) and we finish with at least 1 expert remaining in \(C\) (the always-correct expert is never removed), we have after \(M\) mistakes:

\[ 1 \leq \vert C_{T+1} \vert \leq \vert C_{1}\vert\left(\frac{1}{2}\right)^{M} = n 2^{-M}\]

which after re-arranging and taking a logarithm of base \(2\) gives us:

\[M \leq \log_{2}\left(n\right)\]
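
For example, with our \(n = 5\) financial friends this gives \(M \leq \log_{2}(5) \approx 2.32\): the Halving algorithm makes at most 2 mistakes before it is left following only perfect experts.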

Of course, the situation in which one expert is correct on all trials is very unrealistic; if we applied the Halving algorithm in a realistic setting, our Consistency class \(C\) would quickly become the empty set. So we must move beyond the Halving algorithm and find a way to 'punish' experts for making a mistake without discarding them entirely. This leads us to a slightly different framing of our goal, and introduces a central idea in OL algorithms: regret.

Since our algorithm's loss is defined as \(L_{A}\left(S\right) = \sum_{t=1}^{T}\vert\hat{y}_{t} - y_{t}\vert\), we can define a similar loss for each individual expert \(i\):

\[L_{i}\left(S\right) = \sum_{t=1}^{T}\vert\mathbf{x}_{t}\left(i\right) - y_{t}\vert\]

then our goal should be to find an algorithm that performs only slightly worse than the best possible expert in hindsight. This difference is called the regret:

\[\text{Regret} = L_{A}\left(S\right) - \min_{i} L_{i}\left(S\right)\]

This will eventually lead us to bounds of the form \(L_{A}\left(S\right) \leq \alpha \min_{i} L_{i}\left(S\right) + \beta \log\left(n\right)\), where \(\alpha, \beta\) are small (or really we should say not large) constants.

The Weighted Majority Algorithm

The problem with the Halving algorithm should now be clear: we discarded experts as soon as they made a mistake. A solution to this problem is the Weighted Majority (WM) algorithm. In this algorithm each expert is assigned a weight. Whenever an expert makes a mistake, their weight is diminished, and on each trial we combine the predictions of the experts with their weights to get the (wait for it...) Weighted Majority! (sorry).

To keep track of the weights of each expert, we must maintain a weight vector \(\mathbf{w}_{t}\). The initial value of the weights is arbitrary, so let's set the initial weight vector to

\[\mathbf{w}_{1} = \left[1,1,...,1\right]\]

then whenever expert \(i\) makes a mistake, we can multiply that expert's weight by \(\beta \in \left[0,1\right)\), decreasing it. Then the WM algorithm will look something like this:

Initialize: w = [1,1,...,1];  0 <= beta < 1
For t = 1,...,T do:
    receive x_t
    p_0 = w[x_t == 0].sum()
    p_1 = w[x_t == 1].sum()
    if p_1 >= p_0:
        predict 1
    else:
        predict 0
    receive y_t
    for i = 1,...,n:
        if x_t[i] != y_t:
            w[i] = beta*w[i]
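
Here is a minimal runnable NumPy sketch of the WM algorithm (my own illustration, using the same made-up stream format as before), together with a quick comparison against the best expert in hindsight:

import numpy as np

def weighted_majority(stream, n, beta=0.5):
    """Run Weighted Majority over a stream of (x_t, y_t) pairs; return the mistake count."""
    w = np.ones(n)                      # every expert starts with weight 1
    mistakes = 0
    for x_t, y_t in stream:
        p0 = w[x_t == 0].sum()          # total weight behind predicting 0
        p1 = w[x_t == 1].sum()          # total weight behind predicting 1
        y_hat = 1 if p1 >= p0 else 0
        mistakes += int(y_hat != y_t)
        w[x_t != y_t] *= beta           # shrink the weights of the experts that were wrong
    return mistakes

# Hypothetical data: 5 experts, 3 rounds.
stream = [(np.array([1, 1, 0, 0, 1]), 0),
          (np.array([0, 1, 1, 0, 1]), 1),
          (np.array([1, 0, 1, 0, 1]), 1)]

M = weighted_majority(stream, n=5)
M_i = np.sum([x_t != y_t for x_t, y_t in stream], axis=0)   # each expert's mistakes
print(M, M_i.min())                     # our mistakes vs the best expert's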

Notice that if \(\beta = 0\) we have exactly the halving algorithm!

A Mistake Bound on the WM Algorithm

The analysis of the WM algorithm is slightly more interesting than that of the Halving algorithm, but it follows the same idea. Let \(M\) be the total number of mistakes of the algorithm, and \(M_{i}\) be the total number of mistakes made by expert \(i\). Let \(W_{t} = \sum_{i}w_{t}(i)\) be the total weight at a given time \(t\). It's clear that the total 'majority weight' (the weight of the experts agreeing with our prediction) satisfies \(W_{maj} \geq \frac{1}{2}W_{t}\), and the 'minority weight' satisfies \(W_{min} \leq \frac{1}{2}W_{t}\).

If we don't make a mistake on round \(t\), then the 'minority weights' are decreased, giving

\[ W_{t+1} \leq W_{t}\]

whereas if we do make a mistake, then the 'majority weights' are decreased by a factor of \(\beta\), giving

\[ W_{t+1} \leq \frac{1}{2}W_{t} + \beta\frac{1}{2}W_{t}\]

where the first term on the RHS can be seen as the 'minority weights' and the second term as the decreased 'majority weights'. This simplifies to

\[ W_{t+1} \leq \frac{1+\beta}{2}W_{t}\]

Then after \(M\) mistakes we have

\[ W_{T+1} \leq \left(\frac{1+\beta}{2}\right)^{M}W_{1}\]

Now, since no weight ever reaches zero or becomes negative, the total weight \(W_{T+1} = \sum_{i}w_{T+1}(i)\) is always at least as large as any individual weight, and the final weight of expert \(i\) is exactly \(w_{T+1}(i) = \beta^{M_{i}}\). Putting this together gives

\[\beta^{M_{i}} \leq \left(\frac{1+\beta}{2}\right)^{M}W_{1} = \left(\frac{1+\beta}{2}\right)^{M} n\]

Taking logarithms of both sides gives \(M_{i}\ln\left(\beta\right) \leq M\ln\left(\frac{1+\beta}{2}\right) + \ln\left(n\right)\), and rearranging (noting that both logarithms are negative) yields

\[M \leq \frac{\ln\left(\frac{1}{\beta}\right)}{\ln\left(\frac{2}{1+\beta}\right)}M_{i} + \frac{1}{\ln\left(\frac{2}{1+\beta}\right)}\ln\left( n\right)\]

Note that this is true for any expert \(i\), and therefore it is true for the best expert, which finally gives us

\[M \leq a\min_{i} M_{i} + b\ln\left( n\right)\]

where \(a =\frac{\ln\left(\frac{1}{\beta}\right)}{\ln\left(\frac{2}{1+\beta}\right)}\) and \(b =\frac{1}{\ln\left(\frac{2}{1+\beta}\right)}\). Choosing \(\beta = 0.5\) gives \(a \approx 2.41\) and \(b \approx 3.48\). This is exactly what we wanted: the number of mistakes we make is at most a small constant times that of the best expert, plus a term that grows only logarithmically in the number of experts!
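
To see the trade-off between \(a\) and \(b\), here is a quick (purely illustrative) computation of the two constants for a few values of \(\beta\); larger \(\beta\) pushes \(a\) down towards 2 but inflates the additive \(\ln(n)\) term, and vice versa.

import numpy as np

for beta in [0.1, 0.5, 0.9]:
    a = np.log(1 / beta) / np.log(2 / (1 + beta))
    b = 1 / np.log(2 / (1 + beta))
    print(f"beta={beta}: a={a:.2f}, b={b:.2f}")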

Departing Thoughts

It is worth mentioning that there is nothing special about the loss function we used \(\left(L = \vert\hat{y} - y\vert\right)\). In fact many other loss functions exist, such as the square loss \(\left( L = \left(\hat{y} - y\right)^{2}\right)\) or entropic loss \(\left( L = y\ln\left(\frac{y}{\hat{y}}\right) + \left( 1 - y\right)\ln\left(\frac{1-y}{1-\hat{y}}\right)\right)\).
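
As a small, self-contained illustration (not tied to any particular algorithm in this post), these three losses are easy to write down in Python:

import numpy as np

def absolute_loss(y_hat, y):
    return abs(y_hat - y)

def square_loss(y_hat, y):
    return (y_hat - y) ** 2

def entropic_loss(y_hat, y):
    # Relative-entropy loss; uses the convention 0 * log(0) = 0,
    # and assumes the prediction y_hat lies strictly between 0 and 1.
    loss = 0.0
    if y > 0:
        loss += y * np.log(y / y_hat)
    if y < 1:
        loss += (1 - y) * np.log((1 - y) / (1 - y_hat))
    return loss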

The use of different loss functions can serve various purposes; as we'll see in future posts on OL, different loss functions can lead to different algorithms and different performance. The WM algorithm itself has led to many variants and techniques with different mistake bounds.