I was wondering whether there is any literature on training systems that may exhibit exploding gradients (e.g. recurrent neural networks) with non-gradient-based optimization methods. As far as I understand the problem, exploding gradients correspond to regions of very steep ascent (ridges or cliffs) in the error surface, which cause gradient-based methods to jump far away from a minimum (see e.g. Figure 6 here). The reason is that the parameter update is proportional to the gradient itself.
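To make concrete what I mean by "the update is proportional to the gradient", here is a toy 1-D sketch; the loss function, the starting point and the learning rate are made up purely for illustration:

```python
import numpy as np

# Toy 1-D loss: almost flat on the left, with a steep "cliff" starting at x = 1.
def loss(x):
    return 0.01 * x**2 + 50.0 * np.maximum(0.0, x - 1.0)**2

def grad(x):
    return 0.02 * x + 100.0 * np.maximum(0.0, x - 1.0)

x = 1.5   # start on the steep part of the cliff (made-up starting point)
lr = 0.1  # made-up learning rate
for step in range(5):
    g = grad(x)
    x = x - lr * g  # the step length is proportional to the gradient magnitude
    print(f"step {step}: grad = {g:8.2f}, new x = {x:7.3f}, loss = {loss(x):.3f}")
# The first step sees a huge gradient and throws x far past the minimum near x = 0;
# afterwards the gradient is tiny and the iterate only crawls back slowly.
```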
Would similar jumps also affect other learning/optimization algorithms?
For example, I am thinking of particle swarm optimization. There, each particle keeps a velocity vector that determines the direction in which it is currently searching. What would happen if this vector points towards a ridge caused by an exploding gradient? If there really is a minimum close to the ridge, the velocities of most particles should point towards the ridge at some point during the optimization. This could make the particles cross over the ridge, which would suddenly increase the error value being optimized. I cannot tell whether this would also cause the particles to jump away from the ridge, or whether they would still cluster close to the actual minimum.
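Just to make the mechanics I am asking about explicit, here is a minimal 1-D PSO sketch using the standard velocity update; the loss surface (same made-up cliff as above) and the hyperparameters `w`, `c1`, `c2` are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up loss surface with a steep ridge starting at x = 1.
def loss(x):
    return 0.01 * x**2 + 50.0 * np.maximum(0.0, x - 1.0)**2

n_particles = 20
x = rng.uniform(-5.0, 5.0, n_particles)   # positions
v = rng.uniform(-1.0, 1.0, n_particles)   # per-particle "direction" (velocity) vectors
pbest_x, pbest_f = x.copy(), loss(x)      # personal bests
gbest_x = pbest_x[np.argmin(pbest_f)]     # global best

w, c1, c2 = 0.7, 1.5, 1.5                 # made-up inertia/acceleration coefficients
for _ in range(100):
    r1, r2 = rng.random(n_particles), rng.random(n_particles)
    # Velocity update: inertia + pull towards personal best + pull towards global best.
    v = w * v + c1 * r1 * (pbest_x - x) + c2 * r2 * (gbest_x - x)
    x = x + v                             # note: the step does not use the gradient at all
    f = loss(x)
    improved = f < pbest_f
    pbest_x[improved], pbest_f[improved] = x[improved], f[improved]
    gbest_x = pbest_x[np.argmin(pbest_f)]

print("global best:", gbest_x, "loss:", loss(gbest_x))
```

In this update the step length depends only on the distances to the personal and global bests, not on the local slope, so the ridge enters only through the error values the particles happen to see; that is exactly why I am unsure whether crossing it would cause jumps comparable to the gradient-based case.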
How about other non-gradient-based algorithms, such as genetic optimization? Would those also be affected by exploding gradients, or are they somehow immune to them?