
Questions tagged [thompson-sampling]

For questions about Thompson sampling, a technique for choosing actions (one that addresses the exploration-exploitation dilemma) in multi-armed bandit and reinforcement learning problems.
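For readers new to the tag, the core idea can be sketched in a few lines. The following is a minimal Bernoulli Thompson sampling loop with Beta(1, 1) priors; the `pull` callback and arm count are hypothetical stand-ins for whatever environment you have:

```python
import random

def thompson_sampling(pull, n_arms, n_rounds):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.

    `pull(arm)` is a hypothetical environment callback returning a
    0/1 reward for the chosen arm.
    """
    successes = [1] * n_arms  # Beta alpha parameters
    failures = [1] * n_arms   # Beta beta parameters
    total_reward = 0
    for _ in range(n_rounds):
        # Sample a plausible mean reward for each arm from its posterior...
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        # ...and act greedily with respect to the sampled values.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = pull(arm)
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward
```

Exploration comes for free: arms with little data have wide posteriors, so their samples occasionally exceed the current best arm's.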

2 votes
1 answer
223 views

When considering multi-armed bandits in different formats, UCB, $\epsilon$-greedy, Thompson sampling, etc. seem greedy/myopic in the sense that they solely consider the reward for the current timestep. ...
hugh • 53
1 vote
0 answers
49 views

The posterior sampling lemma was introduced in "(More) Efficient RL via Posterior Sampling" and looks like this: $M^*$ here is the true MDP, while $M_k$ is the MDP sampled from the posterior in episode $...
pecey • 353
2 votes
0 answers
34 views

Suppose that I'm training a machine learning model to predict people's age from a picture of their faces. Let's say that I have a dataset of people from 1-year-olds to 100-year-olds. But I want to choose ...
noone • 123
1 vote
0 answers
94 views

I'm working with the Online Logistic Regression Algorithm (Algorithm 3) of Chapelle and Li in their paper, "An Empirical Evaluation of Thompson Sampling" (https://papers.nips.cc/paper/2011/...
MABQ • 11
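For context on this question, here is a sketch in the spirit of Chapelle and Li's online Bayesian logistic regression (their Algorithm 3 uses a diagonal Laplace approximation to the weight posterior). The inner MAP optimizer here is a simplified gradient loop, and all hyperparameters are illustrative assumptions, not the paper's settings:

```python
import math
import random

class OnlineLogisticRegression:
    """Sketch of online Bayesian logistic regression with a diagonal
    Laplace approximation, in the spirit of Chapelle & Li's Algorithm 3.
    The inner optimizer and hyperparameters are simplifications."""

    def __init__(self, dim, lam=1.0):
        self.m = [0.0] * dim  # posterior means
        self.q = [lam] * dim  # posterior precisions (inverse variances)

    def sample_weights(self):
        # Thompson step: draw one plausible weight vector from the posterior.
        return [random.gauss(m_i, q_i ** -0.5)
                for m_i, q_i in zip(self.m, self.q)]

    def predict(self, w, x):
        z = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y, steps=50, lr=0.1):
        """MAP estimate via a few gradient steps (a stand-in for the
        paper's inner optimization), then a Laplace precision update."""
        w = list(self.m)
        for _ in range(steps):
            p = self.predict(w, x)
            for i in range(len(w)):
                # Gradient of the Gaussian prior term plus the logistic loss.
                grad = self.q[i] * (w[i] - self.m[i]) + (p - y) * x[i]
                w[i] -= lr * grad
        p = self.predict(w, x)
        self.m = w
        for i in range(len(x)):
            # Diagonal Hessian of the logistic loss at the MAP point.
            self.q[i] += x[i] * x[i] * p * (1.0 - p)
```

To run Thompson sampling with this model, call `sample_weights()` once per round, score each candidate action's feature vector with `predict`, play the argmax, and feed the observed click/no-click back through `update`.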
4 votes
0 answers
158 views

I am looking at the different existing methods of action selection in reinforcement learning. I found several methods like epsilon-greedy, softmax, upper confidence bound and Thompson sampling. I ...
user14053977
0 votes
1 answer
509 views

I often see Thompson Sampling in the RL literature; however, I am not able to relate it to any current RL techniques. How exactly does it fit with RL?
desert_ranger
3 votes
3 answers
1k views

Why aren't exploration techniques that are typically used in bandit problems, such as UCB or Thompson sampling, used in full RL problems? Monte Carlo Tree Search may use the above-mentioned methods in its ...
Mika • 371
1 vote
0 answers
65 views

Agrawal and Goyal (http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf page 3) discussed how we can extend Thompson sampling for Bernoulli bandits to Thompson sampling for stochastic bandits in ...
Felix P. • 295
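The extension the question refers to is, as I read the cited page, a reduction: each observed reward $r \in [0, 1]$ is replaced by the outcome of a Bernoulli trial with success probability $r$, so the Beta posterior updates from the Bernoulli case carry over unchanged. A minimal sketch, where `pull(arm)` is a hypothetical environment returning a reward in $[0, 1]$:

```python
import random

def thompson_stochastic(pull, n_arms, n_rounds):
    """Thompson sampling for general [0, 1]-valued rewards via the
    binarization trick from Agrawal & Goyal (2012): each reward r is
    replaced by a Bernoulli(r) sample before the Beta update."""
    alpha = [1] * n_arms
    beta = [1] * n_arms
    chosen = []
    for _ in range(n_rounds):
        samples = [random.betavariate(alpha[a], beta[a])
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        r = pull(arm)                        # real-valued reward in [0, 1]
        b = 1 if random.random() < r else 0  # Bernoulli trial with mean r
        alpha[arm] += b
        beta[arm] += 1 - b
        chosen.append(arm)
    return chosen
```

The trial preserves the mean reward of each arm, which is all the Beta-Bernoulli analysis needs.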
1 vote
1 answer
5k views

I ran a test using 3 strategies for multi-armed bandit: UCB, $\epsilon$-greedy, and Thompson sampling. The results for the rewards I got are as follows: Thompson sampling had the highest average ...
Java coder
4 votes
2 answers
3k views

In policy gradient algorithms, the output is a stochastic policy: a probability for each action. I believe that if I follow the policy (sample an action from the policy) I make use of exploration ...
gnikol • 177
8 votes
0 answers
174 views

In my implementation of Thompson Sampling (TS) for online Reinforcement Learning, my distribution for selecting $a$ is $\mathcal{N}(Q(s, a), \frac{1}{C(s,a)+1})$, where $C(s,a)$ is the number of times ...
Kevin • 81
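The questioner's Gaussian selection rule can be sketched directly (reading $\frac{1}{C(s,a)+1}$ as the variance, which the notation leaves ambiguous). The `Q` and `C` dict-of-dicts tables are hypothetical:

```python
import random

def select_action(Q, C, state, actions):
    """Gaussian Thompson-style rule as described in the question:
    score each action by a draw from N(Q(s, a), 1/(C(s, a) + 1)),
    where C counts visits, so rarely tried actions get noisier
    (more exploratory) draws. Q and C are hypothetical tables."""
    def score(a):
        mean = Q[state][a]
        std = (1.0 / (C[state][a] + 1)) ** 0.5  # variance 1/(C+1)
        return random.gauss(mean, std)
    return max(actions, key=score)
```

As `C(s, a)` grows, the draw concentrates on `Q(s, a)` and the rule becomes greedy.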
5 votes
1 answer
833 views

In some implementations of off-policy Q-learning, we need to know the action probabilities given by the behavior policy $\mu(a)$ (e.g., if we want to use importance sampling). In my case, I am using ...
nicolas • 53