I have a question about an alternative Q-learning approach. I'd like to know whether it already exists and I am simply not aware of it, or whether it doesn't exist because there are theoretical problems with it. <br>
**Traditional Q-Learning**<br>
In traditional Q-learning, the Q-value is updated at every step. The agent is in state `s`, performs action `a`, reaches state `s'` and obtains reward `r`. The Q-value for that state-action pair is then updated according to the Bellman equation. As an example, consider an agent exploring a grid-world in which one specific cell gives a positive reward. Until the agent finds that cell, it gets small or no rewards. After the agent reaches the high-reward cell, the preceding cells get positive Q-values only iteratively, one step further back each time they are visited again. This means that many iterations are needed, and if the path leading to the positive reward is very long or requires a specific sequence of actions, it can take a large number of episodes for the value to propagate (or the path may never be fully learned, even though it was explored previously).<br>
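
For concreteness, here is a minimal sketch of the one-step update described above. It assumes a tabular `Q` stored as a NumPy array indexed by integer states and actions; the helper name `q_update` and the default values of `alpha` and `gamma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One-step tabular Q-learning update (hypothetical helper)."""
    # Target: the reward, plus the discounted best value of the next state
    # unless the episode ended on this transition.
    target = r if done else r + gamma * np.max(Q[s_next])
    # Move Q(s, a) a fraction alpha of the way toward the target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Note that only `Q[s, a]` changes on each call, which is why the reward needs repeated visits to travel back along a long path.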
**Alternative approach**<br>
My question is: why can't we update all the Q-values visited by the agent at the end of the episode? That is, the transitions are cached during the episode, and when the episode terminates, the Q-values get updated for all the cells that led to that reward (or did not lead to a reward). In this case, the agent doesn't need a large number of episodes to update the Q-values along that path. <br>
Is this approach legitimate? Which problems would we face by following it? <br>
The only drawback I can think of is a lack of exploration, which can be solved by choosing random actions according to a decreasing epsilon value.
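
A minimal sketch of how such an end-of-episode update could look. It assumes that the cached transitions are replayed from last to first, so that the reward can reach the start of the path in a single pass; that ordering, the name `update_at_episode_end`, and the parameter values are assumptions made for illustration only.

```python
import numpy as np

def update_at_episode_end(Q, episode, alpha=0.1, gamma=0.9):
    """Replay cached (s, a, r, s_next, done) transitions in reverse order."""
    for s, a, r, s_next, done in reversed(episode):
        # Same one-step target as in ordinary Q-learning.
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy chain 0 -> 1 -> 2 -> 3 with a single action and reward only at the end.
Q = np.zeros((4, 1))
episode = [(0, 0, 0.0, 1, False), (1, 0, 0.0, 2, False), (2, 0, 1.0, 3, True)]
update_at_episode_end(Q, episode, alpha=1.0, gamma=0.9)
```

In this toy chain, every state on the path receives a nonzero value after one episode, whereas step-by-step updates in visiting order would only change the value of the last state-action pair.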
**EDIT**
My concern is not computational efficiency; it is mainly that I am using a decaying learning rate, equal to $$\alpha = \frac{1}{1+number\_visits}$$
For this reason, my intuition says that updating the Q-values at the end of the episode should work better. Since the learning rate depends on the number of visits, the contribution of the early visits is much higher. Updating the whole path as soon as the reward is found should reduce the negative impact of this effect.
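
For reference, a small sketch of the visit-count-based learning rate. It assumes a per-(state, action) counter `visits` that holds the number of previous updates and is incremented after each one; the counter, the helper name `decayed_alpha`, and that counting convention are hypothetical.

```python
import numpy as np

def decayed_alpha(visits, s, a):
    """alpha = 1 / (1 + number of previous visits to (s, a))."""
    return 1.0 / (1.0 + visits[s, a])

visits = np.zeros((2, 2), dtype=int)
alphas = []
for _ in range(4):
    alphas.append(decayed_alpha(visits, 0, 1))
    visits[0, 1] += 1  # count the visit after using it
```

Under this convention the step sizes for successive visits to the same pair are 1, 1/2, 1/3, 1/4, so the first few updates move the Q-value the most.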