Normalizing Normal Distributions in Thompson Sampling for online Reinforcement Learning

Question

In my implementation of Thompson Sampling (TS) for online Reinforcement Learning, my distribution for selecting $a$ is $\mathcal{N}(Q(s, a), \frac{1}{C(s,a)+1})$, where $C(s,a)$ is the number of times $a$ has been picked in $s$.

However, I found that this does not work well in some cases depending on the magnitude of $Q(s,a)$. For example, if $Q(s_i,a_1) = 100$, and $C(s_i,a_1) = 1$, then then this gives a standard deviation of 0.5, which is extremely confident even though the action has only been picked once. Compare that to $a_2$ which may be the optimal action but has never been picked, so $Q(s_i, a_2) = 0$ and $C(s_i,a_2) = 0$. It is unlikely that TS will ever pick $a_2$.

So, how do I solve this problem?

I tried normalizing the Q-values such that they range from 0 to 1, but the algorithm returns much lower total returns. I think I have to adapt the magnitude of the standard deviations relative to the Q-values as well. Doing it for 1 normal distribution is pretty straightforward, but I can't figure out how to do it for multiple distributions which have to take into consideration of the other distributions.

Edit: Counts should be $C(s,a)$ instead of $C(s)$ as Neil pointed out

"if $Q(s_i,a_1) = 100$, and $C(a_1) = 1$, then..." in your imagined scenario, you have made a very large update to $Q(s_i,a_1)$ with either 0 or 1 samples from it - is this actually possible with the learning mechanism you are using, e.g. are you processing rewards such that $\alpha \times r_t$ could make a $+100$ value update in a single step? This is salient, because there is no need to worry about and attempt to fix numerical problems that don't occur in practice. — Neil Slater
– Neil Slater, Commented Nov 22, 2019 at 13:35
Counts should be s,a. I have edited the question. I am selecting the action with the highest sampled Q-value. — Kevin
– Kevin, Commented Nov 22, 2019 at 13:59
I have a learning rate of 0.1, I didn't include it, perhaps I should have. even then, it doesn't change the issue where $\mathcal{N}(10, 0.5)$ for $a_1$ is still going to be picked over $a_2$ with distribution $\mathcal{N}(0, 1)$. — Kevin
– Kevin, Commented Nov 22, 2019 at 14:02
What about using the advantage function $A(s,a)=Q(s,a) - V(s)$ instead of $Q(s,a)$? It should have a much lower variance, and, indeed, you should be able to compute the value function to do so. — Luca Anzalone
– Luca Anzalone, Commented Apr 17, 2023 at 19:25

Stack Exchange Network

Normalizing Normal Distributions in Thompson Sampling for online Reinforcement Learning

0

You must log in to answer this question.

Hot Network Questions

Normalizing Normal Distributions in Thompson Sampling for online Reinforcement Learning

0

You must log in to answer this question.

Related

Hot Network Questions