Search Results
Results tagged with q-learning
A model-free reinforcement learning technique.
2 votes
Accepted
Q table creation and update for dynamic action space
It is a finite MDP with states represented as 6 dimensional vectors of integers. The number of discrete values in each index of the state vector varies from 24 to 90. The action space varies from sta …
4 votes
Accepted
Q-learning: why do we subtract the Q(s, a) term during the update?
The Wikipedia formulation does indeed show you a better view of how the update rule for action values is constructed: $$Q(s_t, a_t) \leftarrow (1-\alpha)\cdot Q(s_t, a_t) + \alpha\left[ r_t + \gamma …
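The "mixing" form of the update quoted above can be sketched in code. This is a minimal illustration with made-up sizes and hyperparameters, not the asker's setup; the greedy `max` over next-state values is assumed as the bootstrap target, as in standard Q-learning.

```python
import numpy as np

# Hypothetical toy problem: 5 states, 2 actions, all values made up for illustration.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.9  # learning rate and discount factor (assumed values)

def q_update(Q, s, a, r, s_next):
    """One tabular Q-learning step written in the (1 - alpha) mixing form,
    which is algebraically identical to Q += alpha * (target - Q)."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Expanding the mixing form shows why the $Q(s_t, a_t)$ term appears with a minus sign in the other common notation: $(1-\alpha)Q + \alpha T = Q + \alpha(T - Q)$.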
1 vote
Choosing the right parameters for SARSA and Q-Learning & Comparing Models
As you are building policies in simulation, and can avoid the need to use approximate methods (the state space is small enough to fit in a table in memory), then your goal is to converge on the optima …
1 vote
Accepted
Is my understanding of On-Policy and Off-Policy TD algorithms correct?
1) With an on-policy algorithm we use the current policy (a regression model with weights W, and ε-greedy selection) to generate the next state's Q. Yes. To avoid confusion, it may be better to u …
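The on-policy vs off-policy distinction in the answer above comes down to which action value is bootstrapped from. A small sketch, with a made-up Q table and discount factor, contrasting the two TD targets:

```python
# Hypothetical Q table over one state with two actions (values made up).
gamma = 0.9
Q = {(0, 0): 0.5, (0, 1): 1.0}

r, s_next = 1.0, 0
a_next = 0  # the action the epsilon-greedy behaviour policy actually picked

# SARSA (on-policy): bootstrap from the action actually taken next.
sarsa_target = r + gamma * Q[(s_next, a_next)]

# Q-learning (off-policy): bootstrap from the greedy action, whatever was taken.
q_learning_target = r + gamma * max(Q[(s_next, a)] for a in (0, 1))
```

When the behaviour policy happens to pick the greedy action, the two targets coincide; they differ exactly when exploration picks a non-greedy `a_next`.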
1 vote
Accepted
Can you interpolate with QLearning or Reinforcement learning in general?
Since the convergence of QLearning is so slow I am wondering if it is possible with QLearning to interpolate the QValue of unexplored states since QLearning does not use a model? When Q learning …
1 vote
Accepted
Why does Q-learning use an actor model and critic model?
The book you are reading is being somewhat lax with terms. It uses the terms "actor" and "critic", but there is another algorithm called actor-critic which is very popular recently and is quite differ …
2 votes
Accepted
Dueling DQN: what does a' mean?
It is just a type of namespacing, because $a$ is already assigned the chosen action. There are two contexts of action being considered in the equation, so there needs to be a symbol for each context. …
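The two contexts can be made concrete with a small sketch of the dueling aggregation, using made-up numbers: $a$ indexes the action being evaluated, while $a'$ ranges over all actions inside the max (or mean) term.

```python
import numpy as np

# Hypothetical state: scalar value V(s) and one advantage A(s, a) per action.
V = 1.5
A = np.array([0.2, -0.1, 0.4])

# Q(s, a) = V(s) + A(s, a) - max_{a'} A(s, a')
# a' is a bound variable inside the max; a indexes the resulting vector.
Q_max = V + A - A.max()

# The dueling DQN paper's more stable variant averages over a' instead.
Q_mean = V + A - A.mean()
```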
2 votes
Accepted
What is the immediate reward in value iteration?
what is $R_a(s,s')$ ? In this case, it appears to represent the expected immediate reward received when taking action $a$ and transitioning from state $s$ to state $s'$. It is written this way so …
3 votes
What's going wrong with my Tic Tac Toe Q-Learning Algorithm?
You have a couple of mistakes around assigning reward, and the update mechanism. You intend to grant 0 reward for a loss, 0.5 reward for a tie and 1 reward for a win. And you place those rewards as f …
2 votes
Accepted
If the set of all possible states changes each time, how can Q-learning "learn" anything?
if the length and height of the rectangle are random, as well as the starting position and the location of the Treasure, how can the bot apply the knowledge acquired to the new problem? You have …
11 votes
Accepted
Reinforcement learning: decreasing loss without increasing reward
How should I interpret this? If a lower loss means more accurate predictions of value, naively I would have expected the agent to take more high-reward actions. A lower loss means more accurate p …
4 votes
Accepted
Exploration in Q learning: Epsilon greedy vs Exploration function
Any exploration function that ensures the behaviour policy covers all possible actions will work in theory with Q learning. By covers I mean that there is a non-zero probability of selecting each acti …
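The coverage requirement described above is exactly what $\epsilon$-greedy provides. A minimal sketch (function name and signature are my own, not from the answer):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the
    greedy one. Every action retains probability >= epsilon / len(q_values),
    so the behaviour policy covers all actions, as Q-learning requires."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Any other exploration scheme with the same non-zero-probability property (e.g. Boltzmann/softmax over Q values) would satisfy the convergence conditions equally well in theory.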
58 votes
Accepted
What is "experience replay" and what are its benefits?
The key part of the quoted text is: To perform experience replay we store the agent's experiences $e_t = (s_t,a_t,r_t,s_{t+1})$ This means instead of running Q-learning on state/action pairs as …
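The mechanism described in the quoted answer can be sketched as a minimal buffer; the class name and capacity are illustrative, not from the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions (s, a, r, s_next) and
    sample uniformly at random. Sampling uniformly breaks the temporal
    correlation between consecutive transitions, which is the main benefit
    discussed above."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Each stored transition can be replayed many times, which also improves data efficiency compared with discarding each experience after a single update.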
1 vote
Accepted
Neural network q learning for tic tac toe - how to use the threshold
You are effectively implementing $\epsilon$-greedy action selection. The usual way to represent this in RL, at least the one I am familiar with, is not as a "threshold" for probability of choosing the …
2 votes
Accepted
Q learning Neural network Tic tac toe - When to train net
This update scheme: Q(s,a) += reward * gamma^(inverse position in game state) has a couple of problems: You are, apparently, incrementing Q values rather than training them to a reference targe …