
I would like to ask if someone can refer me to a paper containing the proof of convergence of $Q$-learning/SARSA (either or both of these learning algorithms in reinforcement learning).

The iterative algorithm for SARSA is as follows:

$$ Q(s_t, a_t) \leftarrow Q(s_t,a_t) + \alpha[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)],$$ where $r_t$ is the reward, $\gamma$ is the discount factor, $\alpha$ is the learning rate, $s_t$ is the state, $a_t$ is the action, and $t$ is the time step.
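
To make sure I am reading the update correctly, here is roughly how I would implement a single step of it (a minimal sketch in Python; `Q`, `alpha` and `gamma` are placeholder names of my own, not from any particular library):

```python
import numpy as np

# Placeholder Q-table and hyperparameters for illustration only.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def sarsa_update(s, a, r, s_next, a_next):
    """Move Q(s, a) a small step towards the sampled target r + gamma * Q(s', a')."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
```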

I have seen this used many times, but I have never understood why it works.

Can someone explain, or give an insight into, why this algorithm converges, i.e. why $Q$-learning/SARSA works at all?

Thanks

  • It's all in the Sutton & Barto textbook. – Commented Jul 19, 2017 at 13:38
  • See Watkins & Dayan (1992) for a detailed proof of convergence of Q-learning, or Watkins's PhD thesis Learning from Delayed Rewards (page 220). The second one was more understandable for me and I can recommend reading it. – Commented Aug 27, 2017 at 16:01
  • There is a proof for Q-learning in Proposition 5.5 of the book Neuro-Dynamic Programming by Bertsekas and Tsitsiklis. Sutton and Barto refer to Singh, Jaakkola, Littman, and Szepesvári (2000) for the proof of SARSA. – Commented Sep 16, 2018 at 6:11

1 Answer


The SARSA algorithm is a stochastic approximation to the Bellman equations for Markov Decision Processes. One way of writing the Bellman equation for $q_{\pi}(s,a)$ is:

$$q_{\pi}(s,a) = \sum_{s',r}p(s',r|s,a)( r + \gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a'))$$

where $p(s',r|s,a)$ is the probability of transitioning to state $s'$ with reward $r$, given current state $s$ and action $a$.

In Dynamic Programming solutions to MDPs, the Bellman equation is used directly as an update rule that converges to the correct values of $q_{\pi}(s,a)$. For estimating the value of a fixed policy this works because the only stationary points of the update are where the two sides are equal, and there is one such equation for each $q_{\pi}(s,a)$, relating it linearly to one or more other values $q_{\pi}(s',a')$, so the system has a computable solution.
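
As an illustration (my own sketch, not part of the original argument), here is what that Dynamic Programming update looks like for a fixed policy when the transition model is available as explicit arrays; `P`, `R` and `pi` are placeholder names:

```python
import numpy as np

def policy_evaluation_q(P, R, pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup until the Q-values stop changing.

    P[s, a, s2]  : probability of moving to state s2 after taking action a in state s
    R[s, a, s2]  : expected reward for that transition
    pi[s, a]     : probability the fixed policy picks action a in state s
    """
    n_states, n_actions = pi.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # v_pi(s') = sum_a' pi(a'|s') q_pi(s', a')
        V = (pi * Q).sum(axis=1)
        # q_pi(s,a) = sum_s' P(s'|s,a) * ( r(s,a,s') + gamma * v_pi(s') )
        Q_new = (P * (R + gamma * V)).sum(axis=2)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```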

When you add the search for an optimal policy, as in SARSA, the policy $\pi$ changes over time. There is a proof that changing the policy to pick the action $\pi'(s) = \operatorname{argmax}_a q_{\pi}(s,a)$ will always either improve the policy or leave it unchanged if it is already optimal. This is called the Policy Improvement Theorem, and it is based on the inequality $v_{\pi}(s) \le q_{\pi}(s,\pi'(s))$. There is an extension to the theorem that covers the $\epsilon$-greedy policies used in learners such as Monte Carlo Control or SARSA.
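
For example (again a sketch of my own, not part of the theorem itself), an $\epsilon$-greedy policy derived from a Q-table could look like this:

```python
import numpy as np

def epsilon_greedy_probs(Q, s, epsilon=0.1):
    """Return action probabilities for an epsilon-greedy policy with respect to Q.

    With probability epsilon an action is chosen uniformly at random; otherwise
    the greedy action argmax_a Q(s, a) is taken. Placeholder names throughout.
    """
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s])] += 1.0 - epsilon
    return probs
```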

TD learning, including SARSA and Q-Learning, uses the ideas of Dynamic Programming in a sample-based setting, where the equalities hold in expectation. Essentially, you can see how the update $q_{\pi}(s,a) = \sum_{s',r}p(s',r|s,a)( r + \gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a'))$ has turned into SARSA's update:

  • The weighted sum over state-transition and reward probabilities happens in expectation as you take many samples, so $Q(S,A) = \mathbb{E}[ \text{Sampled}(R) + \gamma \sum_{a'}\pi(a'|S') q_{\pi}(S',a')]$ (technically, $R$ and $S'$ have to be sampled together).

  • Likewise, the weighting by the current policy happens in expectation, so $Q(S,A) = \mathbb{E}[ \text{Sampled}(R + \gamma Q(S',A'))]$.

  • To change this expectation into an incremental update, allowing for non-stationarity as the policy improves over time, we add a learning rate and move each estimate towards the latest sampled value: $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S',A') - Q(S,A)]$ (see the sketch after this list).
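
Putting those pieces together, a tabular SARSA loop might look like the sketch below. It assumes an environment object with `reset()`/`step()` methods that work on integer states (an illustrative interface, not a specific library's API), and it uses the $\epsilon$-greedy idea from above:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=5000,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch: the sample-based, incremental version of the Bellman backup."""
    Q = np.zeros((n_states, n_actions))

    def choose(s):
        # epsilon-greedy with respect to the current Q estimates
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()                      # assumed to return an integer state
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)    # assumed interface: next state, reward, done flag
            a_next = choose(s_next)
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```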

For a more thorough explanation of the building blocks of algorithms like SARSA and Q-Learning, you can read Reinforcement Learning: An Introduction (Sutton & Barto). For a more concise and mathematically rigorous treatment, you can read Algorithms for Reinforcement Learning (Szepesvári).

  • Better than a textbook answer. – Commented Sep 16, 2018 at 6:44
