Here are some of the things you could do:
- When using the SoftMax cross-entropy function:
the SoftMax numerator should never be exactly zero, because the exponential is always positive. However, due to limited floating-point precision, the numerator can be a very small value, say exp(-50000), which underflows to zero; the log in the cross-entropy then turns that zero into -inf/NaN. (ref.)
- Quick fixes could be to either increase the precision of your model (using 64-bit floats instead of, presumably, 32-bit floats), or just introduce a function that clamps your values, so anything at or below zero is nudged up to a small positive number the computer can handle. For example, use x = np.log(np.maximum(x, 1e-9)) before going into the SoftMax, as sketched below. (ref.)
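A minimal NumPy sketch of both stability tricks (subtracting the row max inside the exponent, and clamping with a small eps before the log); the function name and the eps value are illustrative, not taken from the original posts:

```python
import numpy as np

def stable_softmax_cross_entropy(logits, labels, eps=1e-9):
    """Cross-entropy over a numerically stabilised softmax.

    logits: (batch, classes) raw scores; labels: (batch,) integer class ids.
    """
    # Subtract the row-wise max so exp() never overflows to inf.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Clamp before the log so an underflowed probability of exactly 0
    # does not become -inf / NaN.
    probs = np.maximum(probs, eps)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```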
You can use methods like "FastNorm", which improves numerical stability and reduces accuracy variance, enabling a higher learning rate and offering better convergence. (ref.)
Check weight initialization: if unsure, use Xavier or He initialization (a sketch of both is below). Also, your initialization might be leading you to a bad local minimum, so try a different initialization and see if it helps.
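As a rough NumPy illustration of the two schemes (the function names, seed, and layer sizes are just examples):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid layers.
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: variance 2 / fan_in, suited to ReLU activations.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = he_init(784, 256)     # e.g. first layer of a ReLU MLP on MNIST-sized input
W2 = xavier_init(256, 10)  # e.g. a tanh/softmax output layer
```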
Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations.
NaNs can arise from division by zero, or from taking the natural log of zero or of a negative number.
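The usual guard is a small epsilon added to denominators and a clamp on anything that feeds a log; a quick NumPy illustration (the eps value is arbitrary, pick one that suits your dtype):

```python
import numpy as np

eps = 1e-12

num = np.array([1.0, 0.5, 2.0])
denom = np.array([3.0, 0.0, 1e-30])   # a zero or underflowed value here gives inf/NaN
probs = np.array([0.9, 0.0, 1e-45])   # a zero here gives -inf under the log

safe_ratio = num / (denom + eps)               # guard the denominator
safe_log = np.log(np.clip(probs, eps, None))   # clamp to eps before the log
```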
Try evaluating your network layer by layer and see where the NaNs appear.
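One way to do that with a hand-rolled network; the list-of-callables interface here is an assumption, so adapt it to your framework (e.g. forward hooks in PyTorch or tf.debugging.check_numerics in TensorFlow):

```python
import numpy as np

def find_first_nonfinite_layer(x, layers):
    """Run the forward pass one layer at a time and report where NaNs/Infs
    first show up. `layers` is assumed to be a list of callables, each
    mapping one activation array to the next (hypothetical interface)."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if not np.all(np.isfinite(x)):
            print(f"Non-finite values first appear in the output of layer {i}")
            return i
    print("All layer outputs are finite.")
    return None
```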
Some of these suggestions were taken from two great posts on StackOverflow and KDnuggets.