I was wondering whether there is any literature on training systems that may exhibit exploding gradients (e.g. recurrent neural networks) with non-gradient-based optimization methods. As far as I understand the problem, exploding gradients correspond to regions of very steep ascent (ridges or cliffs) in the error surface, which cause gradient-based methods to jump far away from a minimum (see e.g. Figure 6 here). The reason is that the parameter update is proportional to the gradient itself.
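To make concrete what I mean by "the update is proportional to the gradient", here is a toy 1-D sketch; the loss function, the starting point and the learning rate are made up purely for illustration:

```python
import numpy as np

# Toy 1-D loss: almost flat on the left, with a steep "cliff" starting at x = 1.
def loss(x):
    return 0.01 * x**2 + 50.0 * np.maximum(0.0, x - 1.0)**2

def grad(x):
    return 0.02 * x + 100.0 * np.maximum(0.0, x - 1.0)

x = 1.5   # start on the steep part of the cliff (made-up starting point)
lr = 0.1  # made-up learning rate
for step in range(5):
    g = grad(x)
    x = x - lr * g  # the step length is proportional to the gradient magnitude
    print(f"step {step}: grad = {g:8.2f}, new x = {x:7.3f}, loss = {loss(x):.3f}")
# The first step sees a huge gradient and throws x far past the minimum near x = 0;
# afterwards the gradient is tiny and the iterate only crawls back slowly.
```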
Would similar jumps also affect other learning/optimization algorithms?
For example, I am thinking of particle swarm optimization. There, each particle keeps a velocity vector that determines the direction in which it is currently searching. What would happen if this vector points towards a ridge caused by an exploding gradient? If there really is a minimum close to the ridge, the velocities of most particles should point towards the ridge at some point during the optimization. This could make the particles cross over the ridge, which would suddenly increase the error value being optimized. I cannot tell whether this would also cause the particles to jump away from the ridge, or whether they would still cluster close to the actual minimum.
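Just to make the mechanics I am asking about explicit, here is a minimal 1-D PSO sketch using the standard velocity update; the loss surface (same made-up cliff as above) and the hyperparameters `w`, `c1`, `c2` are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up loss surface with a steep ridge starting at x = 1.
def loss(x):
    return 0.01 * x**2 + 50.0 * np.maximum(0.0, x - 1.0)**2

n_particles = 20
x = rng.uniform(-5.0, 5.0, n_particles)   # positions
v = rng.uniform(-1.0, 1.0, n_particles)   # per-particle "direction" (velocity) vectors
pbest_x, pbest_f = x.copy(), loss(x)      # personal bests
gbest_x = pbest_x[np.argmin(pbest_f)]     # global best

w, c1, c2 = 0.7, 1.5, 1.5                 # made-up inertia/acceleration coefficients
for _ in range(100):
    r1, r2 = rng.random(n_particles), rng.random(n_particles)
    # Velocity update: inertia + pull towards personal best + pull towards global best.
    v = w * v + c1 * r1 * (pbest_x - x) + c2 * r2 * (gbest_x - x)
    x = x + v                             # note: the step does not use the gradient at all
    f = loss(x)
    improved = f < pbest_f
    pbest_x[improved], pbest_f[improved] = x[improved], f[improved]
    gbest_x = pbest_x[np.argmin(pbest_f)]

print("global best:", gbest_x, "loss:", loss(gbest_x))
```

In this update the step length depends only on the distances to the personal and global bests, not on the local slope, so the ridge enters only through the error values the particles happen to see; that is exactly why I am unsure whether crossing it would cause jumps comparable to the gradient-based case.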
How about other non-gradient-based algorithms, such as genetic optimization? Would those also be affected by exploding gradients, or are they somehow immune to them?