The main requirement of on-policy policy gradient methods is that they use a parametric policy $\pi(a|s, \theta)$ that is differentiable with respect to the parameters $\theta$.
This is not restricted to describing only discrete probability distribution functions (e.g. the softmax output layer of a neural network). Any probability distribution function that is differentiable and can be sampled from is all that is required. This is true for the Normal distribution, for instance, so one relatively common solution in continuous spaces is for a neural network to output the mean and standard deviation of a distribution for each continuous-valued component of the action vector.
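For instance, a minimal sketch of such a network in PyTorch might look like the following (the `GaussianPolicy` name and layer sizes are just illustrative assumptions, not a canonical implementation):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to the mean and standard deviation of a Normal
    distribution over each continuous action component."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.hidden = nn.Linear(state_dim, hidden_dim)
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        h = torch.tanh(self.hidden(state))
        mean = self.mean_head(h)
        # Exponentiate the log-std so the standard deviation is always positive
        std = torch.exp(self.log_std_head(h))
        return mean, std
```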
Typically the neural network does not perform the sampling to choose an action. This is also true for a softmax output - it is only additional code, outside of the NN, that interprets the values and selects the action. In addition, and unlike softmax, the NN does not need to directly represent the probability distribution function, just enough data to drive the sampling process. However, the nature of the distribution function does need to be taken into account when calculating the gradient in policy gradient methods.
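Continuing the sketch above, the sampling and the distribution-dependent log-probability (which is what the policy gradient calculation actually needs) would sit in helper code outside the network. A rough REINFORCE-style sketch, not a complete training loop:

```python
from torch.distributions import Normal

def select_action(policy, state):
    """Sample an action outside the network; the NN only supplies
    the distribution parameters."""
    mean, std = policy(state)
    dist = Normal(mean, std)
    action = dist.sample()                  # e.g. "turn +19.2345 degrees"
    log_prob = dist.log_prob(action).sum()  # needed for the policy gradient
    return action, log_prob

# REINFORCE-style update: the gradient of log pi(a|s, theta) depends on the
# chosen distribution (here, Normal), which is why log_prob appears in the loss:
#   loss = -(log_prob * discounted_return)
#   loss.backward(); optimizer.step()
```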
Thus, it's not obvious how my action can be something continuous like "turn +19.2345 degrees clockwise" - surely such an action must already have been pre-defined with the value 19.2345?
What the policy outputs here is effectively the distribution $\mathcal{N}(\mu, \sigma)$, which you sample to get an action like "turn x degrees clockwise". The neural network might output just $(25, 7)$, and additional code then interprets those values as describing the distribution so it can be sampled. E.g. with a mean of 25 and a standard deviation of 7, sampling an action from that distribution might give you "turn +19.2345 degrees clockwise".
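As a tiny illustration of that interpretation step (the numbers are chosen to match the example above; in practice they come from the network's output):

```python
from torch.distributions import Normal

dist = Normal(25.0, 7.0)       # mean and std the network might output
action = dist.sample().item()  # e.g. 19.2345
print(f"turn {action:+.4f} degrees clockwise")
```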