8
$\begingroup$

In my implementation of Thompson Sampling (TS) for online Reinforcement Learning, my distribution for selecting $a$ is $\mathcal{N}(Q(s, a), \frac{1}{C(s,a)+1})$, where $C(s,a)$ is the number of times $a$ has been picked in $s$.

However, I found that this does not work well in some cases depending on the magnitude of $Q(s,a)$. For example, if $Q(s_i,a_1) = 100$, and $C(s_i,a_1) = 1$, then then this gives a standard deviation of 0.5, which is extremely confident even though the action has only been picked once. Compare that to $a_2$ which may be the optimal action but has never been picked, so $Q(s_i, a_2) = 0$ and $C(s_i,a_2) = 0$. It is unlikely that TS will ever pick $a_2$.

So, how do I solve this problem?

I tried normalizing the Q-values such that they range from 0 to 1, but the algorithm returns much lower total returns. I think I have to adapt the magnitude of the standard deviations relative to the Q-values as well. Doing it for 1 normal distribution is pretty straightforward, but I can't figure out how to do it for multiple distributions which have to take into consideration of the other distributions.

Edit: Counts should be $C(s,a)$ instead of $C(s)$ as Neil pointed out

$\endgroup$
4
  • $\begingroup$ "if $Q(s_i,a_1) = 100$, and $C(a_1) = 1$, then..." in your imagined scenario, you have made a very large update to $Q(s_i,a_1)$ with either 0 or 1 samples from it - is this actually possible with the learning mechanism you are using, e.g. are you processing rewards such that $\alpha \times r_t$ could make a $+100$ value update in a single step? This is salient, because there is no need to worry about and attempt to fix numerical problems that don't occur in practice. $\endgroup$ Commented Nov 22, 2019 at 13:35
  • $\begingroup$ Counts should be s,a. I have edited the question. I am selecting the action with the highest sampled Q-value. $\endgroup$ Commented Nov 22, 2019 at 13:59
  • $\begingroup$ I have a learning rate of 0.1, I didn't include it, perhaps I should have. even then, it doesn't change the issue where $\mathcal{N}(10, 0.5)$ for $a_1$ is still going to be picked over $a_2$ with distribution $\mathcal{N}(0, 1)$. $\endgroup$ Commented Nov 22, 2019 at 14:02
  • $\begingroup$ What about using the advantage function $A(s,a)=Q(s,a) - V(s)$ instead of $Q(s,a)$? It should have a much lower variance, and, indeed, you should be able to compute the value function to do so. $\endgroup$ Commented Apr 17, 2023 at 19:25

0

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.