Exploring regret bounds for Thompson sampling using Information Theory
22 May 2022 | Thompson Sampling · Online Learning · Information Theory

Thompson sampling dates back to a research paper from 1933 and is still an active topic of research in many different settings. This is indicative of the far-reaching utility that keeps both researchers and practitioners interested in it. So, what is Thompson sampling and why do we need it? It is easier to answer the second question first. In many settings we must choose an action from a set of actions, and playing that action earns us a reward. This reward can be stochastic, and our objective is to find an action that maximizes our reward in the long term. This is also known as the "multi-armed bandit" problem. What makes the problem interesting is that we need to explore each action a number of times before we can actually estimate its expected reward. This is the famous exploration-exploitation tradeoff, which even humans navigate when searching for the best option, for example when choosing a restaurant. A real example of the multi-armed bandit problem is movie recommendation at Netflix: Netflix needs to choose an action (a movie recommendation) that the user will like (the reward). We are usually interested in an algorithm that minimizes the expected "regret" in such a setting:
$$\text{Regret}(T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(R_{t,A^*} - R_{t,A_t}\right)\right],$$

where $A^*$ is the action that is optimal in the true (unknown) environment, $A_t$ is the action the algorithm plays at time $t$, and $R_{t,a}$ is the reward obtained from playing action $a$ at time $t$. The expectation is over the randomness in the rewards and in the algorithm's choices.
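To make the quantity concrete, here is a tiny sketch (with made-up numbers, not from the original post) of how the expected regret of a sequence of plays would be computed once the true action means are known:

```python
import numpy as np

# Toy illustration with made-up values: three actions with known expected rewards,
# and the actions a hypothetical policy happened to play over five rounds.
true_means = np.array([0.2, 0.5, 0.7])        # E[reward] of each action
actions_played = np.array([0, 2, 1, 2, 2])    # A_1, ..., A_5 chosen by the policy

# Expected regret: best achievable expected reward minus what the policy collected.
regret = true_means.max() * len(actions_played) - true_means[actions_played].sum()
print(regret)   # 3.5 - 2.8 = 0.7
```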
Thompson sampling is one such algorithm that aims to achieve minimal regret. We are now in a good place to understand what the algorithm is. The only obstacle to minimizing regret is that we do not know which environment we are currently in. Knowing the environment amounts to knowing the reward each action gives, which would let us select the best action deterministically. Thompson sampling therefore maintains an estimate of the environment: it starts from a prior distribution over environments and refines this belief as rewards are observed.
Algorithm: Thompson Sampling
- Sample an environment $\hat{\mathcal{E}}_t$ from the posterior distribution $P(\mathcal{E} \mid H_t)$, where $H_t$ is the history of observed action-reward pairs up to time $t$.
- Play, in the true environment, the action that is optimal for the sampled environment. Since we have chosen the sampled environment ourselves, we know its optimal action.
- Observe the reward from the true environment.
- Update the posterior distribution over environments based on the history of observed action-reward pairs (see the code sketch below).
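The following is a minimal sketch of the algorithm for the special case of Bernoulli rewards with independent arms, where the posterior over each arm's mean is a Beta distribution (the arm means and horizon below are illustrative choices, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# True environment (unknown to the algorithm): Bernoulli mean of each action.
true_means = np.array([0.3, 0.5, 0.7])
n_actions, horizon = len(true_means), 5000

# Beta(1, 1) prior on each arm's mean; (alpha, beta) summarize the history H_t.
alpha = np.ones(n_actions)
beta = np.ones(n_actions)

cumulative_regret = 0.0
for t in range(horizon):
    # 1. Sample an environment (here: one mean per arm) from the posterior.
    sampled_means = rng.beta(alpha, beta)
    # 2. Play the action that is optimal in the sampled environment.
    action = int(np.argmax(sampled_means))
    # 3. Observe the reward from the true environment.
    reward = rng.binomial(1, true_means[action])
    # 4. Update the posterior for the played arm.
    alpha[action] += reward
    beta[action] += 1 - reward
    cumulative_regret += true_means.max() - true_means[action]

print("expected cumulative regret:", cumulative_regret)
```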
Regret bound for Thompson Sampling
In this blog, our focus is to analyze the behavior of the algorithm and derive a bound on its regret. This blog is based on the paper An Information-Theoretic Analysis of Thompson Sampling by Russo and Van Roy. We will use the following notation: $\mathcal{E}$ denotes the (unknown) environment, $A_t$ the action played at time $t$, $R_t$ the observed reward at time $t$, $H_t$ the history of action-reward pairs observed before time $t$, and $A^*$ the action that is optimal in the true environment.
We also need some information-theoretic quantities going forward in the analysis; these are best clarified at this point.
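These are the standard definitions, stated here for reference in the discrete case used throughout this post:

```latex
\begin{aligned}
H(X) &= -\sum_{x} P(x)\,\log P(x)
  && \text{entropy: the uncertainty in } X \\
D_{\mathrm{KL}}\!\left(P \,\|\, Q\right) &= \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}
  && \text{KL divergence between distributions } P \text{ and } Q \\
I(X; Y) &= H(X) - H(X \mid Y)
  && \text{mutual information: how much observing } Y \text{ reduces the uncertainty in } X
\end{aligned}
```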
First, let's get comfortable with the notion of Bayesian regret:

$$\text{BayesRegret}(T) = \mathbb{E}_{\mathcal{E}}\left[\mathbb{E}\left[\sum_{t=1}^{T} \left(R_{t,A^*} - R_{t,A_t}\right) \,\middle|\, \mathcal{E}\right]\right]$$
The outer expectation is over the possible environments. Once we have picked an environment, choosing different actions gives us rewards whose distribution is fully specified by that environment. The algorithm uses the history of observed action-reward pairs up to time $t$, denoted $H_t$, to choose the action $A_t$. We can rewrite the Bayesian regret in two equivalent ways:

$$\text{BayesRegret}(T) = \mathbb{E}_{(\mathcal{E},\, H_T)}\left[\sum_{t=1}^{T} \left(R_{t,A^*} - R_{t,A_t}\right)\right] = \mathbb{E}_{H_T}\left[\mathbb{E}_{\mathcal{E} \mid H_T}\left[\sum_{t=1}^{T} \left(R_{t,A^*} - R_{t,A_t}\right)\right]\right]$$
The first equation above just rewrites our original definition of Bayesian regret using a joint probability distribution over environments and histories. The second takes the outer expectation over histories and the inner expectation over environments conditioned on the history. All of these definitions are mathematically equivalent, although the original definition of Bayesian regret is the most intuitive.
Now we define a new quantity called the **information ratio**, which is of much importance in the exploration-exploitation literature. We denote the information ratio at time $t$ by $\Gamma_t$ and define it as

$$\Gamma_t = \frac{\left(\mathbb{E}\left[R_{t,A^*} - R_{t,A_t} \mid H_t\right]\right)^2}{I_t\!\left(A^*;\, (A_t, R_t)\right)},$$

where $I_t(\cdot\,; \cdot)$ denotes mutual information evaluated under the posterior given the history $H_t$.
Let’s break down what this quantity means.
- The numerator, $\left(\mathbb{E}\left[R_{t,A^*} - R_{t,A_t} \mid H_t\right]\right)^2$, is the squared gap between the best reward we could get and the reward we get when we play based only on the knowledge in the history $H_t$. Note that since the only given quantity is the history $H_t$, the true environment is uncertain, and consequently so is the best action to play.
- The denominator, $I_t\!\left(A^*;\, (A_t, R_t)\right)$, tells us "how much does the action I take at time $t$, together with its observed reward, reduce my entropy over the optimal action $A^*$?"
Now suppose this information ratio is bounded by a small constant. What does that mean?
- Either the algorithm picks an action that keeps the numerator small, i.e. an action close to the best one given the information (history) $H_t$ it has,
- or the algorithm picks an action that makes the denominator large, i.e. an action that substantially decreases the uncertainty over the optimal action $A^*$.
We are typically interested in algorithms whose information ratio is bounded by a small constant $\overline{\Gamma}$, i.e. $\Gamma_t \le \overline{\Gamma}$ for all $t$.
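To build intuition, here is a small numerical sketch (my own toy example, not from the paper) that computes the information ratio exactly for the first step of a two-armed Bernoulli bandit with two equally likely candidate environments, under a probability-matching policy like Thompson sampling:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Two candidate environments, two actions, Bernoulli rewards.
# In environment e, the optimal action happens to be a = e.
means = np.array([[0.9, 0.1],   # environment 0: action 0 is best
                  [0.1, 0.9]])  # environment 1: action 1 is best
prior = np.array([0.5, 0.5])    # current posterior over environments (no data yet)

# Probability each action is optimal under the posterior.
opt_action = means.argmax(axis=1)
p_star = np.array([prior[opt_action == a].sum() for a in range(2)])
p_play = p_star   # probability matching: play a with probability P(A* = a | H_t)

# Numerator: squared expected one-step regret under the posterior.
exp_reward_opt = float((prior * means[np.arange(2), opt_action]).sum())
exp_reward_play = float((p_play * (prior @ means)).sum())
numerator = (exp_reward_opt - exp_reward_play) ** 2

# Denominator: I(A*; (A_t, R_t)). Since A_t is drawn independently of A* given H_t,
# this equals sum_a P(A_t = a) * I(A*; R_a), and for Bernoulli rewards
# I(A*; R_a) = H(R_a) - H(R_a | A*).
info_gain = 0.0
for a in range(2):
    p_r1_given_env = means[:, a]                  # P(R_a = 1 | environment)
    p_r1 = float((prior * p_r1_given_env).sum())  # marginal P(R_a = 1)
    h_marginal = entropy([p_r1, 1 - p_r1])
    h_conditional = float((prior * np.array(
        [entropy([p, 1 - p]) for p in p_r1_given_env])).sum())
    info_gain += p_play[a] * (h_marginal - h_conditional)

print("information ratio:", numerator / info_gain)   # ~0.43, a small constant
```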
We can now derive a general regret bound that applies to any algorithm with a bounded information ratio. Later, we show that Thompson sampling has a bounded information ratio, so the bound applies to it as well. Starting with our definition of Bayesian regret:

$$\text{BayesRegret}(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}\left[R_{t,A^*} - R_{t,A_t} \mid H_t\right]\right] = \mathbb{E}\left[\sum_{t=1}^{T} \sqrt{\Gamma_t\, I_t\!\left(A^*;\, (A_t, R_t)\right)}\right] \le \sqrt{\overline{\Gamma}}\;\, \mathbb{E}\left[\sum_{t=1}^{T} \sqrt{I_t\!\left(A^*;\, (A_t, R_t)\right)}\right]$$
All we have done above is plug the definition of the information ratio into the definition of Bayesian regret and use the assumption that the information ratio is bounded by $\overline{\Gamma}$. We can proceed further with some additional algebraic manipulation: by the Cauchy-Schwarz inequality,

$$\sum_{t=1}^{T} \sqrt{I_t\!\left(A^*;\, (A_t, R_t)\right)} \le \sqrt{T \sum_{t=1}^{T} I_t\!\left(A^*;\, (A_t, R_t)\right)}.$$
Recall that the expected per-step information gains about $A^*$ add up, by the chain rule of mutual information, to the total information the whole history carries about $A^*$, which can never exceed its entropy:

$$\sum_{t=1}^{T} \mathbb{E}\left[I_t\!\left(A^*;\, (A_t, R_t)\right)\right] = I\!\left(A^*;\, H_{T+1}\right) \le H(A^*).$$
Summing these terms up to time $T$ and combining the displays above (using Jensen's inequality to move the expectation inside the square root), we obtain

$$\text{BayesRegret}(T) \le \sqrt{\overline{\Gamma}\; H(A^*)\; T}.$$
Thus the regret bound depends on the entropy of the optimal-action distribution and on the information ratio, while scaling sublinearly with $T$. The uncertainty in the random variable $A^*$ enters through its entropy $H(A^*)$, which for a finite action set is at most $\log|\mathcal{A}|$.
Any algorithm with a bounded information ratio therefore has per-round regret that vanishes as $T$ grows, because it performs intelligent exploration and exploitation. Thompson sampling is precisely one such algorithm.
Thompson Sampling has a Bounded Information Ratio
We will discuss a proof that produces a relatively loose bound on the information ratio for Thompson sampling but applies in a wide variety of settings. When there is a richer information structure, i.e. playing one action reveals information about other actions, even tighter bounds can be derived; we leave that exploration to interested readers in the original paper.
We can concisely write the Thompson sampling algorithm in one line: it selects each action with the probability that it is optimal given the history. Mathematically,

$$P\!\left(A_t = a \mid H_t\right) = P\!\left(A^* = a \mid H_t\right) \quad \text{for every action } a.$$
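This probability-matching property falls straight out of the algorithm: sampling an environment from the posterior and acting greedily is the same as sampling $A_t$ from the posterior distribution of $A^*$. A quick numerical check, with a made-up posterior and made-up reward means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three candidate environments with known reward means and a hypothetical posterior.
means = np.array([[0.8, 0.3, 0.2],
                  [0.1, 0.6, 0.4],
                  [0.2, 0.1, 0.9]])
posterior = np.array([0.5, 0.3, 0.2])   # P(environment | history)

# Exact probability-matching distribution: P(A* = a | history).
opt_action = means.argmax(axis=1)
p_star = np.array([posterior[opt_action == a].sum() for a in range(3)])

# Thompson sampling, simulated many times from the same posterior.
samples = rng.choice(3, size=100_000, p=posterior)   # sampled environments
actions = opt_action[samples]                         # greedy action in each sample
p_play = np.bincount(actions, minlength=3) / len(actions)

print("P(A* = a | H_t):", p_star)
print("P(A_t = a | H_t):", p_play)   # matches up to Monte Carlo error
```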
As usual we will need some mathematical tools to proceed.
Useful fact 1: KL divergence form of mutual information

$$I(X; Y) = \mathbb{E}_{Y}\!\left[D_{\mathrm{KL}}\!\left(P(X \mid Y)\,\|\,P(X)\right)\right] = \sum_{y} P(y)\, D_{\mathrm{KL}}\!\left(P(X \mid Y = y)\,\|\,P(X)\right)$$
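As a sanity check (with my own toy numbers), the identity can be verified numerically on any small discrete joint distribution:

```python
import numpy as np

# A small numerical check of I(X;Y) = E_Y[ D_KL( P(X|Y) || P(X) ) ]
# on an arbitrary 2x3 joint distribution (values chosen for illustration only).
joint = np.array([[0.10, 0.20, 0.05],
                  [0.25, 0.10, 0.30]])   # P(X = x, Y = y)
p_x = joint.sum(axis=1)                   # P(X)
p_y = joint.sum(axis=0)                   # P(Y)

# Mutual information from its definition.
mi = sum(joint[x, y] * np.log(joint[x, y] / (p_x[x] * p_y[y]))
         for x in range(2) for y in range(3))

# Expected KL divergence between the posterior P(X|Y=y) and the prior P(X).
kl_form = 0.0
for y in range(3):
    p_x_given_y = joint[:, y] / p_y[y]
    kl_form += p_y[y] * np.sum(p_x_given_y * np.log(p_x_given_y / p_x))

print(mi, kl_form)   # identical up to floating-point error
```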
Useful fact 2: Simplified expressions for the information ratio

Writing $\mathbb{E}_t[\cdot]$ and $P_t(\cdot)$ for expectation and probability conditioned on the history $H_t$, and using the probability-matching property $P_t(A_t = a) = P_t(A^* = a)$, both parts of the information ratio can be written as sums over actions.

Numerator:
$$\mathbb{E}_t\!\left[R_{t,A^*} - R_{t,A_t}\right] = \sum_{a} P_t(A^* = a)\left(\mathbb{E}_t\!\left[R_{t,a} \mid A^* = a\right] - \mathbb{E}_t\!\left[R_{t,a}\right]\right)$$

Denominator:
$$I_t\!\left(A^*;\, (A_t, R_t)\right) = \sum_{a} P_t(A^* = a) \sum_{a^*} P_t(A^* = a^*)\, D_{\mathrm{KL}}\!\left(P_t\!\left(R_{t,a} \mid A^* = a^*\right)\,\|\,P_t\!\left(R_{t,a}\right)\right)$$
Now with these simplified expressions, we can bound the numerator of the information ratio by its denominator. Applying Pinsker's inequality to each inner term (for rewards bounded in $[0, 1]$) and then the Cauchy-Schwarz inequality over actions gives

$$\left(\mathbb{E}_t\!\left[R_{t,A^*} - R_{t,A_t}\right]\right)^2 \le \frac{|\mathcal{A}|}{2}\; I_t\!\left(A^*;\, (A_t, R_t)\right).$$
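The step above uses a standard consequence of Pinsker's inequality for $[0,1]$-valued rewards (the paper works with the more general sub-Gaussian condition, but the bounded case already conveys the idea):

```latex
\left|\,\mathbb{E}_{P}[R] - \mathbb{E}_{Q}[R]\,\right|
  \;\le\; \mathrm{TV}(P, Q)
  \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\!\left(P \,\|\, Q\right)}
  \qquad \text{for } R \in [0, 1].
```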
Thus we have shown that the information ratio of Thompson sampling is bounded by $\overline{\Gamma} = |\mathcal{A}|/2$. Plugging this into the general regret bound derived earlier gives

$$\text{BayesRegret}(T) \le \sqrt{\tfrac{1}{2}\, |\mathcal{A}|\; H(A^*)\; T}.$$
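As a concrete, purely illustrative instantiation: with $K$ actions and a uniform prior over which one is optimal, $H(A^*) = \log K$, so

```latex
\text{BayesRegret}(T) \;\le\; \sqrt{\tfrac{1}{2}\, K\, T \log K},
\qquad \text{e.g. } K = 10,\; T = 10{,}000 \;\Rightarrow\; \text{bound} \approx 339,
```

which grows only as $\sqrt{T}$, so the average per-round regret of Thompson sampling vanishes as the horizon grows.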