Exploring regret bounds for Thompson sampling using information theory

Thompson sampling dates back to a 1933 paper by William R. Thompson, and it is still an active topic of research under many different settings. This longevity is indicative of its far-reaching utility, which keeps both researchers and practitioners interested in the topic. So, what is Thompson sampling and why do we need it? It is easier to answer the second question first. In many settings we must repeatedly choose an action from a set of actions, and playing an action earns us a reward. This reward can be stochastic, and our objective is to find an action that maximizes our reward in the long term. This is known as the “multi-armed bandit” problem. What makes the problem interesting is that we need to try each action a number of times before we can actually estimate its expected reward. This is the famous exploration-exploitation tradeoff, which even humans face when searching for the best action (choosing a restaurant, for example). A real-world instance of the multi-armed bandit problem is movie recommendation at Netflix: Netflix needs to choose an action (a movie to recommend) that the user will like (the reward). We are usually interested in an algorithm that minimizes the expected “regret” in such a setting:

\[\text{Regret}(T) = \mathbb{E}\left[\sum_{t=1}^T\left[R(A^*,\theta)-R(A_t,\theta)\right]\right]\]

where \(A^*\) denotes the optimal action, \(A_t\) denotes the action played by the algorithm, and \(\theta\) denotes the environment. Intuitively, regret measures the best reward we could have gotten versus the reward we actually got. The environment determines what reward we get for each action. This notion of regret can be generalized by considering a distribution \(P\) over environments under which we aim to minimize regret (we aim to do well on average across different environments). This is called Bayesian regret.

\[\text{Bayesian-Regret}(T) = \mathbb{E}_{\theta\sim P}\left[\mathbb{E}\left[\sum_{t=1}^T[R(A^*,\theta)-R(A_t,\theta)]\right]\right]\]
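
To make these definitions concrete, consider a hypothetical two-armed example (the numbers are illustrative, not from any particular paper). Suppose the environment \(\theta\) gives expected rewards \(\mathbb{E}[R(a_1,\theta)]=0.7\) and \(\mathbb{E}[R(a_2,\theta)]=0.5\), so \(A^*=a_1\). A policy that always plays \(a_2\) accumulates

\[\text{Regret}(T)=\sum_{t=1}^T(0.7-0.5)=0.2\,T,\]

which grows linearly in \(T\). The goal of a good bandit algorithm is to make its per-step regret shrink as it learns, so that total regret grows sublinearly in \(T\); the bounds derived below establish exactly this for Thompson sampling.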

Thompson sampling is one such algorithm that aims to achieve minimum regret. We are now in a good place to understand what the algorithm is. The only unknown standing between us and minimum regret is the environment we are currently in. Knowing the environment amounts to knowing the reward each action gives, which lets us select the best action deterministically. Thompson sampling starts with an estimate of the environment distribution; this estimate is set to the distribution over environments \(P\) (the same \(P\) as in the definition of Bayesian regret). As it gathers more information about the environment it is truly in, the algorithm refines its estimate. Overall, the algorithm repeats the following steps (a minimal code sketch follows the list):

Algorithm: Thompson Sampling

  • Sample an environment from the posterior distribution \(P_k\), where \(P_0=P\).
  • Play, in the true environment, the action that is optimal for the sampled environment. Since we chose the sampled environment, we know its optimal action \(a_k\).
  • Observe the reward from the true environment.
  • Update the posterior distribution over environments based on the history \(\mathcal{H}\) of observed action-reward pairs: \(P_{k+1}=P_k(\theta\mid a_k,r_k)\).
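
To make the loop above concrete, here is a minimal runnable sketch for a Bernoulli bandit, where the prior over environments factorizes into an independent Beta prior on each arm's success probability (the arm means and horizon below are illustrative choices, not values from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: true (unknown) success probability of each arm.
true_means = np.array([0.45, 0.55, 0.60])
n_arms, T = len(true_means), 5000

# Prior over environments: independent Beta(1, 1) on each arm's mean.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

regret = 0.0
for t in range(T):
    # 1. Sample an environment from the current posterior.
    theta_sample = rng.beta(alpha, beta)
    # 2. Play the action that is optimal in the sampled environment.
    a = int(np.argmax(theta_sample))
    # 3. Observe the reward from the true environment.
    r = rng.binomial(1, true_means[a])
    # 4. Update the posterior with the observed (action, reward) pair.
    alpha[a] += r
    beta[a] += 1 - r
    regret += true_means.max() - true_means[a]

print(f"cumulative regret after {T} steps: {regret:.1f}")
```

The four numbered comments correspond to the four steps in the list above; the Beta update is the standard conjugate posterior update for Bernoulli rewards.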

Regret bound for Thompson Sampling

In this blog, our focus is to analyze the behavior of the algorithm and derive a bound on its regret. This blog is based on the paper An Information-Theoretic Analysis of Thompson Sampling by Russo and Van Roy. We will use the following notation: \(A_t\) is the action played by the algorithm at time \(t\), \(A^*\) is the optimal action that could have been played, \(\theta\) denotes the environment parameters, \(R(A_t,\theta)\) is the reward obtained from playing action \(A_t\) in environment \(\theta\), and \(\mathcal{H}_t := (A_1,R_1,A_2,R_2,...,A_t,R_t)\) is the history of actions and rewards observed up to time \(t\).

We also need some information-theoretic quantities for the analysis, which are best clarified at this point. \(H(X)\) denotes the entropy of a discrete random variable \(X\). \(I(X;Y)\) denotes the mutual information between \(X\) and \(Y\), and can be written as \(I(X;Y) = H(X) - H(X|Y)\). Intuitively, mutual information measures how much knowing \(Y\) reduces our uncertainty (entropy) about \(X\). If knowing \(Y\) reveals exactly what \(X\) is, then \(H(X|Y)=0\) and the mutual information is maximized, since the entropy of a discrete random variable is always nonnegative. Mutual information also satisfies the chain rule: \(I(X;(Z_1,Z_2,...,Z_T))=I(X;Z_1)+I(X;Z_2|Z_1)+I(X;Z_3|Z_1,Z_2)+...+I(X;Z_T|Z_1,...,Z_{T-1})\).
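
As a quick sanity check of these definitions, the following small snippet (with a made-up joint distribution) computes \(I(X;Y)=H(X)-H(X|Y)\) numerically; for an independent pair the mutual information comes out to zero:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))   # entropy in nats

def mutual_info(joint):
    # I(X;Y) = H(X) - H(X|Y) for a joint table with rows indexing X and columns indexing Y.
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    h_x_given_y = sum(p_y[j] * entropy(joint[:, j] / p_y[j]) for j in range(joint.shape[1]))
    return entropy(p_x) - h_x_given_y

# A made-up joint distribution over X (rows) and Y (columns).
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
print(mutual_info(joint))                                            # ~0.126 nats: Y tells us a lot about X
print(mutual_info(np.outer(joint.sum(axis=1), joint.sum(axis=0))))   # ~0: independent X and Y
```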

First, let's get comfortable with the notion of Bayesian regret: \(\text{Bayesian-Regret}(T) = \mathbb{E}_{\theta\sim P}\left[\mathbb{E}_{\mathcal{H}_T}\left[\sum_{t=1}^T[R(A^*,\theta)-R(A_t,\theta)]\right]\right]\)

The outer expectation is over the possible environments. Once we have picked an environment, the reward distribution of each action is fully specified by that environment. The algorithm uses the history of observed action-reward pairs up to time \(t\), denoted by \(\mathcal{H}_t\), to produce the next action to play, and its decision can be stochastic. The inner expectation is over the stochastic decisions of the algorithm (and the resulting rewards) given the observed history. We can rewrite the Bayesian regret as follows:

\[\begin{align} \text{Bayesian-Regret}(T) &= \mathbb{E}_{\theta\sim P,\,\mathcal{H}_T}\left[\sum_{t=1}^T[R(A^*,\theta)-R(A_t,\theta)]\right]\\ &=\sum_{t=1}^T\mathbb{E}_{h_{t-1}\sim\mathcal{H}_{t-1}}\left[\mathbb{E}_{\theta\sim P\mid h_{t-1}}\left[R(A^*,\theta)-R(A_t,\theta)\right]\right] \end{align}\]

The first equation above just rewrites our original definition of Bayesian regret using the joint probability distribution over environments and histories. The second equation takes, for each step, the outer expectation over histories and the inner expectation over environments conditioned on the observed history. All of these forms are mathematically equivalent, although the original definition of Bayesian regret is the most intuitive.

Now we define a new quantity called the \(\textbf{information ratio}\), which is of much importance in the exploration-exploitation literature. We denote the information ratio at time \(t\) by \(\Gamma_t\); it is given by:

\[\Gamma_t = \frac{\left(\mathbb{E}_{\theta|h_{t-1}}[R(A^*,\theta)-R(A_t,\theta)]\right)^2}{I(A^*;(A_t,R(A_t,\theta))|h_{t-1})}\]

Let’s break down what this quantity means.

\(\textbf{Numerator: }\left(\mathbb{E}_{\theta|h_{t-1}}[R(A^*,\theta)-R(A_t,\theta)]\right)^2\)

  • This term is the squared expected gap between the best reward we could get and the reward we actually get when we play based only on the history \(h_{t-1}\). Note that since the only given quantity is the history \(h_{t-1}\), the true environment is uncertain, and consequently so is the best action to play.

\(\textbf{Denominator: }I(A^*;(A_t,R(A_t,\theta))|h_{t-1})\)

  • This term tells us how much the action we take at time \(t\), together with its observed reward, reduces our entropy over \(A^*\) given \(h_{t-1}\).

Now suppose this information ratio is bounded by a small constant. What does that mean?

  • Either the algorithm picks an action with a small numerator, i.e., an action that is nearly the best given the information (history) it has (\(\textbf{exploit}\)).
  • Or the algorithm picks an action with a large denominator, i.e., an action that decreases the uncertainty over the optimal action \(A^*\) (\(\textbf{explore}\)).

We are typically interested in algorithms whose information ratio is bounded by a small constant \(\bar{\Gamma}\), i.e., \(\Gamma_t\le\bar{\Gamma}\) for all \(t\).
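
To build intuition, here is a small sketch (my own toy construction, not from the paper) that computes \(\Gamma_t\) exactly for the probability-matching action distribution used by Thompson sampling, in a setting with two Bernoulli arms and three candidate environments whose posterior is known:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))   # nats

# Toy posterior P(theta | h_{t-1}) over 3 candidate environments, each a pair of Bernoulli means.
THETA = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.4]])
posterior = np.array([1/3, 1/3, 1/3])
n_env, n_arms = THETA.shape

a_star = THETA.argmax(axis=1)                                              # optimal arm in each environment
p_opt = np.array([posterior[a_star == a].sum() for a in range(n_arms)])   # P(A* = a | h_{t-1})
p_play = p_opt      # Thompson sampling is probability matching: P(A_t = a | h) = P(A* = a | h)

# Numerator: (E[R(A*, theta) - R(A_t, theta) | h_{t-1}])^2.
exp_best = np.sum(posterior * THETA[np.arange(n_env), a_star])
exp_play = np.sum(p_play * (posterior @ THETA))    # A_t is independent of theta given the history
numerator = (exp_best - exp_play) ** 2

# Denominator: I(A*; (A_t, R_t) | h_{t-1}) = sum_a P(A_t = a) * [H(R(a)) - H(R(a) | A*)].
denominator = 0.0
for a in range(n_arms):
    p_r1 = posterior @ THETA[:, a]                              # P(R(a) = 1 | h_{t-1})
    h_marginal = entropy(np.array([p_r1, 1 - p_r1]))
    h_conditional = 0.0
    for opt in range(n_arms):
        w = posterior[a_star == opt].sum()                      # P(A* = opt | h_{t-1})
        if w == 0:
            continue
        p_r1_c = posterior[a_star == opt] @ THETA[a_star == opt, a] / w
        h_conditional += w * entropy(np.array([p_r1_c, 1 - p_r1_c]))
    denominator += p_play[a] * (h_marginal - h_conditional)

print(f"Gamma_t = {numerator / denominator:.3f}, |A|/2 = {n_arms / 2}")
```

In this toy setup the printed ratio comes out well below \(|\mathcal{A}|/2\), which is exactly the bound we derive for Thompson sampling later in the post.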

We can now derive a general regret bound that applies to every algorithm with a bounded information ratio. Later, we show that Thompson sampling has a bounded information ratio, so the bound applies to it as well. Starting with our definition of Bayesian regret:

\[\begin{align} \text{Bayesian-Regret}(T) &= \sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[\mathbb{E}_{\theta\sim P\mid h_{t-1}}\left[R(A^*,\theta)-R(A_t,\theta)\right]\right]\\ &= \sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[\sqrt{\Gamma_t\, I(A^*;(A_t,R(A_t,\theta))|h_{t-1})}\right]~~~\text{(information ratio definition)}\\ &\le \sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[\sqrt{\bar{\Gamma}\, I(A^*;(A_t,R(A_t,\theta))|h_{t-1})}\right]~~~\text{(bounded information ratio)} \end{align}\]

All we have done above is plug the definition of the information ratio into the definition of Bayesian regret and use the assumption that the information ratio is bounded. We can proceed further with some additional algebraic manipulations:

\[\begin{align} \text{Bayesian-Regret}(T) &\le \sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[\sqrt{\bar{\Gamma}\, I(A^*;(A_t,R(A_t,\theta))|h_{t-1})}\right]~~~\text{(bounded information ratio)}\\ &\le \sqrt{\bar{\Gamma}\,T\,\sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[I(A^*;(A_t,R(A_t,\theta))|h_{t-1})\right]}~~~\text{(Jensen's and Cauchy-Schwarz inequalities)} \end{align}\]

Recall that \(h_{t-1}=(a_1,r_1,a_2,r_2,...,a_{t-1},r_{t-1})\) is a particular history that was observed, while \(\mathcal{H}_{t-1}=(A_1,R_1,A_2,R_2,...,A_{t-1},R_{t-1})\) denotes the corresponding random variable, i.e., the possible histories that can be observed. Averaging the mutual information conditioned on a particular history over all possible histories gives exactly the conditional mutual information:

\[\mathbb{E}_{\mathcal{H}_{t-1}}\left[I(A^*;(A_t,R(A_t,\theta))|h_{t-1})\right]= I(A^*;(A_t,R(A_t,\theta))|\mathcal{H}_{t-1})\]

Summing these terms up to time \(T\):

\[\begin{align} \sum_{t=1}^T\mathbb{E}_{\mathcal{H}_{t-1}}\left[I(A^*;(A_t,R(A_t,\theta))|h_{t-1})\right]&=\sum_{t=1}^T I(A^*;(A_t,R(A_t,\theta))|\mathcal{H}_{t-1})\\ &=\sum_{t=1}^T I(A^*;(A_t,R(A_t,\theta))|(A_1,R_1,A_2,R_2,...,A_{t-1},R_{t-1}))\\ &= I(A^*;(A_1,R_1,A_2,R_2,...,A_T,R_T))~~~\text{(by the chain rule of mutual information)}\\ &= I(A^*;\mathcal{H}_T)\\ &\le H(A^*)~~~\text{(since } I(A^*;\mathcal{H}_T)=H(A^*)-H(A^*|\mathcal{H}_T)\le H(A^*)\text{)} \end{align}\]

Plugging this bound back into the expression for Bayesian regret, we obtain the general regret bound that holds whenever the information ratio is upper bounded:

\[\text{Bayesian-Regret}(T)\le \sqrt{\bar{\Gamma}\,H(A^*)\,T}\]

Thus the regret bound depends on the entropy of the optimal-action distribution and on the information ratio, while scaling sublinearly with \(T\). The uncertainty in the random variable \(A^*\) comes from the prior distribution over environments \(P\). Bayesian regret can also be thought of as the total cost of the mistakes made up to time \(T\). So the average per-step mistake as \(T\) goes to \(\infty\) is:

\[\lim_{T\to\infty}\frac{\text{Bayesian-Regret}(T)}{T} \le \lim_{T\to\infty}\frac{\sqrt{\bar{\Gamma}H(A^*)T}}{T} =0\]

Any algorithm with a bounded information ratio eventually makes vanishing per-step mistakes by performing intelligent exploration and exploitation. Thompson sampling is precisely one such algorithm.

Thompson Sampling has a Bounded Information Ratio

We will discuss a proof that produces a relatively loose bound on the information ratio of Thompson sampling but applies in a wide variety of settings. When there is a richer information structure, i.e., when playing one action reveals information about other actions, even tighter bounds can be derived. We leave that exploration to interested readers via the original paper.

We can concisely write the Thompson sampling algorithm in one line: it selects each action with exactly the probability that the action is optimal given the history. Mathematically,

\[\textbf{Thompson Sampling Rule: } P(A_t=a\mid h_{t-1})=P(A^*=a\mid h_{t-1})~~~\forall a,\,t\]
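
This rule is just probability matching, which we can check with a quick Monte Carlo simulation (the posterior and environments below are made up for illustration): sampling an environment from the posterior and playing its best arm reproduces \(P(A^*=a\mid h_{t-1})\) as the action distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up posterior over 3 candidate environments, each a pair of Bernoulli arm means.
THETA = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.4]])
posterior = np.array([0.5, 0.3, 0.2])

a_star = THETA.argmax(axis=1)                                               # best arm per environment
p_opt = np.array([posterior[a_star == a].sum() for a in range(THETA.shape[1])])

# Thompson sampling step, repeated many times: sample an environment, play its best arm.
envs = rng.choice(len(posterior), size=100_000, p=posterior)
freq = np.bincount(a_star[envs], minlength=THETA.shape[1]) / len(envs)

print("P(A* = a | h):           ", p_opt)   # [0.7, 0.3]
print("empirical P(A_t = a | h):", freq)    # close to [0.7, 0.3]
```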

As usual we will need some mathematical tools to proceed.

Useful fact 1: KL-divergence form of mutual information

\[I(X;Y) = \mathbb{E}_{X}\left[D\big(P(Y|X)\,\|\,P(Y)\big)\right]\]
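
Here is a quick numerical check of this identity on a made-up joint distribution: the entropy-based and KL-based computations of mutual information agree.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Made-up joint distribution P(X, Y): rows index X, columns index Y.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# I(X;Y) via entropies: H(Y) - H(Y|X).
mi_entropy = entropy(p_y) - sum(p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x)))
# I(X;Y) via useful fact 1: E_X[ D( P(Y|X) || P(Y) ) ].
mi_kl = sum(p_x[i] * kl(joint[i] / p_x[i], p_y) for i in range(len(p_x)))

print(mi_entropy, mi_kl)   # both ~0.126 nats
```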

Useful fact 2: Pinsker-type inequality for bounded rewards

For any two distributions \(P\) and \(Q\) over rewards taking values in \([0,1]\),

\[\left(\mathbb{E}_{P}[R]-\mathbb{E}_{Q}[R]\right)^2 \le \frac{1}{2}D(P\,\|\,Q)\]

Armed with these facts, we first simplify the denominator of the information ratio:

\[\begin{align} \text{Denominator} &= I(A^*;(A_t,R(A_t,\theta))|h_{t-1})\\ &= I(A^*;A_t|h_{t-1}) + I(A^*;R(A_t,\theta)|A_t,h_{t-1})~~~\text{(chain rule)}\\ &= I(A^*;R(A_t,\theta)|A_t,h_{t-1})~~~(A^*~\text{and}~A_t~\text{are independent given}~h_{t-1})\\ &= \sum_{a\in\mathcal{A}} P(A_t=a|h_{t-1})\, I(A^*;R(a,\theta)|h_{t-1})~~~\text{(conditioning on }A_t=a\text{, which is independent of }A^*\text{ and }R(a,\theta)\text{)}\\ &= \sum_{a\in\mathcal{A}} P(A_t=a|h_{t-1})\sum_{a^*\in\mathcal{A}} P(A^*=a^*|h_{t-1})\, D\big(P(R(a,\theta)|A^*=a^*,h_{t-1})\,\big\|\,P(R(a,\theta)|h_{t-1})\big)~~~\text{(useful fact 1)}\\ &= \sum_{a,a^*\in\mathcal{A}} P(A^*=a|h_{t-1})\,P(A^*=a^*|h_{t-1})\, D\big(P(R(a,\theta)|A^*=a^*,h_{t-1})\,\big\|\,P(R(a,\theta)|h_{t-1})\big)~~~\text{(Thompson sampling rule)} \end{align}\]

Similarly, we can simplify the (unsquared) numerator:

\[\begin{align} \mathbb{E}_{\theta|h_{t-1}}[R(A^*,\theta)-R(A_t,\theta)] &= \sum_{a\in\mathcal{A}} P(A^*=a|h_{t-1})\, \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)\mid A^*=a] - \sum_{a\in\mathcal{A}} P(A_t=a|h_{t-1})\, \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)]~~~(A_t~\text{is independent of}~\theta~\text{given}~h_{t-1})\\ &= \sum_{a\in\mathcal{A}} P(A^*=a|h_{t-1})\left( \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)\mid A^*=a] - \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)]\right)~~~\text{(Thompson sampling rule)} \end{align}\]

Now, with these simplified expressions, we can bound the numerator of the information ratio in terms of its denominator:

\[\begin{align} \left(\mathbb{E}_{\theta|h_{t-1}}[R(A^*,\theta)-R(A_t,\theta)]\right)^2 &= \left( \sum_{a\in\mathcal{A}} P(A^*=a|h_{t-1})\left( \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)\mid A^*=a] - \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)]\right)\right)^2~~~\text{(simplified numerator expression)}\\ &\le |\mathcal{A}| \sum_{a\in\mathcal{A}} P(A^*=a|h_{t-1})^2 \left( \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)\mid A^*=a] - \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)]\right)^2~~~\text{(Cauchy-Schwarz inequality)}\\ &\le |\mathcal{A}| \sum_{a,a^*\in\mathcal{A}} P(A^*=a|h_{t-1})\,P(A^*=a^*|h_{t-1}) \left( \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)\mid A^*=a^*] - \mathbb{E}_{\theta|h_{t-1}}[R(a,\theta)]\right)^2~~~\text{(adding nonnegative terms)}\\ &\le \frac{|\mathcal{A}|}{2} \sum_{a,a^*\in\mathcal{A}} P(A^*=a|h_{t-1})\,P(A^*=a^*|h_{t-1})\, D\big(P(R(a,\theta)|A^*=a^*,h_{t-1})\,\big\|\,P(R(a,\theta)|h_{t-1})\big)~~~\text{(useful fact 2)}\\ &= \frac{|\mathcal{A}|}{2}\, I(A^*;(A_t,R(A_t,\theta))|h_{t-1})~~~\text{(simplified denominator expression)} \end{align}\]

Thus we have shown that the information ratio is bounded by \(\frac{|\mathcal{A}|}{2}\) when using Thompson sampling with finitely many actions and rewards bounded in \([0,1]\) (as required by useful fact 2). Plugging this information ratio into our general bound, the Bayesian regret of Thompson sampling is bounded by:

\[\text{Bayesian-Regret}(T)\le \sqrt{\frac{|\mathcal{A}|\,H(A^*)\,T}{2}}\]
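
Since \(A^*\) takes values among the \(|\mathcal{A}|\) actions, its entropy satisfies \(H(A^*)\le\log|\mathcal{A}|\), so the bound is at most \(\sqrt{\tfrac{|\mathcal{A}|\,T\log|\mathcal{A}|}{2}}\): sublinear in \(T\) and only mildly dependent on the number of actions.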