Search Results
Results tagged with q-learning
A model-free reinforcement learning technique.
2 votes
Accepted
Q table creation and update for dynamic action space
It is a finite MDP with states represented as 6 dimensional vectors of integers. The number of discrete values in each index of the state vector varies from 24 to 90. The action space varies from sta …
4 votes
Accepted
Q-learning: why do we subtract the Q(s, a) term during the update?
The Wikipedia formulation does indeed show you a better view of how the update rule for action values is constructed: $$Q(s_t, a_t) \leftarrow (1-\alpha)\cdot Q(s_t, a_t) + \alpha\left[ r_t + \gamma …
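The "mixing" form of the update quoted above can be sketched in code. This is a minimal illustration with made-up sizes and hyperparameters, not the asker's setup; the greedy `max` over next-state values is assumed as the bootstrap target, as in standard Q-learning.

```python
import numpy as np

# Hypothetical toy problem: 5 states, 2 actions, all values made up for illustration.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.9  # learning rate and discount factor (assumed values)

def q_update(Q, s, a, r, s_next):
    """One tabular Q-learning step written in the (1 - alpha) mixing form,
    which is algebraically identical to Q += alpha * (target - Q)."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Expanding the mixing form shows why the $Q(s_t, a_t)$ term appears with a minus sign in the other common notation: $(1-\alpha)Q + \alpha T = Q + \alpha(T - Q)$.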
1 vote
Choosing the right parameters for SARSA and Q-Learning & Comparing Models
As you are building policies in simulation, and can avoid the need to use approximate methods (the state space is small enough to fit in a table in memory), then your goal is to converge on the optima …
1 vote
Accepted
Is my understanding of On-Policy and Off-Policy TD algorithms correct?
1) With an on-policy algorithm we use the current policy (a regression model with weights W, and ε-greedy selection) to generate the next state's Q. Yes. To avoid confusion, it may be better to u …
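The on-policy vs off-policy distinction in the answer above comes down to which action value is bootstrapped from. A small sketch, with a made-up Q table and discount factor, contrasting the two TD targets:

```python
# Hypothetical Q table over one state with two actions (values made up).
gamma = 0.9
Q = {(0, 0): 0.5, (0, 1): 1.0}

r, s_next = 1.0, 0
a_next = 0  # the action the epsilon-greedy behaviour policy actually picked

# SARSA (on-policy): bootstrap from the action actually taken next.
sarsa_target = r + gamma * Q[(s_next, a_next)]

# Q-learning (off-policy): bootstrap from the greedy action, whatever was taken.
q_learning_target = r + gamma * max(Q[(s_next, a)] for a in (0, 1))
```

When the behaviour policy happens to pick the greedy action, the two targets coincide; they differ exactly when exploration picks a non-greedy `a_next`.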
1 vote
Accepted
Can you interpolate with QLearning or Reinforcement learning in general?
Since the convergence of QLearning is so slow I am wondering if it is possible with QLearning to interpolate the QValue of unexplored states since QLearning does not use a model? When Q learning …
1 vote
Accepted
Why does Q-learning use an actor model and critic model?
The book you are reading is being somewhat lax with terms. It uses the terms "actor" and "critic", but there is another algorithm called actor-critic which is very popular recently and is quite differ …
2 votes
Accepted
Dueling DQN: what does a' mean?
It is just a type of namespacing, because $a$ is already assigned the chosen action. There are two contexts of action being considered in the equation, so there needs to be a symbol for each context. …
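The two contexts can be made concrete with a small sketch of the dueling aggregation, using made-up numbers: $a$ indexes the action being evaluated, while $a'$ ranges over all actions inside the max (or mean) term.

```python
import numpy as np

# Hypothetical state: scalar value V(s) and one advantage A(s, a) per action.
V = 1.5
A = np.array([0.2, -0.1, 0.4])

# Q(s, a) = V(s) + A(s, a) - max_{a'} A(s, a')
# a' is a bound variable inside the max; a indexes the resulting vector.
Q_max = V + A - A.max()

# The dueling DQN paper's more stable variant averages over a' instead.
Q_mean = V + A - A.mean()
```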
2 votes
Accepted
What is the immediate reward in value iteration?
what is $R_a(s,s')$ ? In this case, it appears to represent the expected immediate reward received when taking action $a$ and transitioning from state $s$ to state $s'$. It is written this way so …
3 votes
What's going wrong with my Tic Tac Toe Q-Learning Algorithm?
You have a couple of mistakes around assigning reward, and the update mechanism. You intend to grant 0 reward for a loss, 0.5 reward for a tie and 1 reward for a win. And you place those rewards as f …
2 votes
Accepted
If the set of all possible states changes each time, how can Q-learning "learn" anything?
if the length and height of the rectangle are random, as well as the starting position and the location of the Treasure, how can the bot apply the knowledge acquired to the new problem? You have …
11 votes
Accepted
Reinforcement learning: decreasing loss without increasing reward
How should I interpret this? If a lower loss means more accurate predictions of value, naively I would have expected the agent to take more high-reward actions. A lower loss means more accurate p …
4 votes
Accepted
Exploration in Q learning: Epsilon greedy vs Exploration function
Any exploration function that ensures the behaviour policy covers all possible actions will work in theory with Q learning. By covers I mean that there is a non-zero probability of selecting each acti …
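The coverage requirement described above is exactly what $\epsilon$-greedy provides. A minimal sketch (function name and signature are my own, not from the answer):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the
    greedy one. Every action retains probability >= epsilon / len(q_values),
    so the behaviour policy covers all actions, as Q-learning requires."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Any other exploration scheme with the same non-zero-probability property (e.g. Boltzmann/softmax over Q values) would satisfy the convergence conditions equally well in theory.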
58 votes
Accepted
What is "experience replay" and what are its benefits?
The key part of the quoted text is: To perform experience replay we store the agent's experiences $e_t = (s_t,a_t,r_t,s_{t+1})$ This means instead of running Q-learning on state/action pairs as …
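The mechanism described in the quoted answer can be sketched as a minimal buffer; the class name and capacity are illustrative, not from the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions (s, a, r, s_next) and
    sample uniformly at random. Sampling uniformly breaks the temporal
    correlation between consecutive transitions, which is the main benefit
    discussed above."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Each stored transition can be replayed many times, which also improves data efficiency compared with discarding each experience after a single update.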
1 vote
Accepted
Neural network q learning for tic tac toe - how to use the threshold
You are effectively implementing $\epsilon$-greedy action selection. The usual way to represent this in RL, at least the one I am familiar with, is not as a "threshold" for probability of choosing the …
2 votes
Accepted
Q learning Neural network Tic tac toe - When to train net
This update scheme: Q(s,a) += reward * gamma^(inverse position in game state) has a couple of problems: You are, apparently, incrementing Q values rather than training them to a reference targe …