Intro
Sometimes we encounter NaN (Not a Number) during training. The reason behind this is overflow and underflow caused by limited numerical precision. The exponential function $e^x$ is a good example: when $x$ is very large, $e^x$ yields overflow ($\infty$); when $x$ is very negative, $e^x$ yields underflow ($0$). Subsequent operations on an overflowed or underflowed number then produce NaN, e.g. $\infty - \infty = \text{NaN}$, $0 / 0 = \text{NaN}$. Fortunately, there exist numerical techniques to avoid this from happening.
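As a quick illustration (a NumPy sketch, not from the original post), here is how overflow and underflow turn into NaN:

```python
import numpy as np

# Overflow: exp of a large argument exceeds the float64 range.
print(np.exp(1000.0))                      # inf (with a RuntimeWarning)

# Underflow: exp of a very negative argument rounds to zero.
print(np.exp(-1000.0))                     # 0.0

# Follow-up operations on inf or 0 then produce NaN.
print(np.inf - np.inf)                     # nan
print(np.float64(0.0) / np.float64(0.0))   # nan (with a RuntimeWarning)
```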
Softmax
Softmax is frequently used in deep learning, either as a non-linear activation function or to normalize logits into probabilities:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}}$$

Calculating softmax directly often leads to underflow or overflow. The trick is to leverage its shift-invariance property:

$$\text{softmax}(x)_i = \text{softmax}(x - c)_i = \frac{e^{x_i - c}}{\sum_{j=1}^{C} e^{x_j - c}}$$

In practice, $c$ is taken as $\max_i x_i$. After shifting, every exponent $x_i - c$ is guaranteed to be no larger than $0$, so $e^{x_i - c} \le 1$ and overflow is prevented. Also, at least one term of the denominator equals $1$ (the one with $x_i = c$), thus preventing division by zero.
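A minimal NumPy sketch of the shifted softmax (illustrative, not the original post's code):

```python
import numpy as np

def stable_softmax(x):
    """Numerically stable softmax along the last axis.

    Shifting by the maximum keeps every exponent <= 0, so exp() cannot
    overflow, and the largest term of the denominator is exactly 1,
    so the denominator cannot be 0.
    """
    x = np.asarray(x, dtype=np.float64)
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Naive softmax overflows on large logits; the shifted version does not.
logits = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(logits))  # [0.09003057 0.24472847 0.66524096]
```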
Cross Entropy
In deep learning, cross-entropy is often calculated between ground-truth labels and softmax-normalized logits:

$$H(y, p) = -\sum_{i=1}^{C} y_i \log p_i, \qquad p = \text{softmax}(x)$$

where $y$ is the ground-truth label, $x$ is the vector of logits and $C$ is the number of classes. The $\log$ operation here brings a new failure mode: when it takes an input of $0$, $\log(0)$ yields $-\infty$. The shifting trick for softmax alone is not safe anymore, because a shifted probability can still underflow to $0$. That is why deep learning libraries such as TensorFlow and PyTorch implement cross-entropy functions that take logits directly as input.
Combining the $\log$ and the softmax gives log-softmax:

$$\log \text{softmax}(x)_i = x_i - \log \sum_{j=1}^{C} e^{x_j}$$

So far, $\sum_{j} e^{x_j}$ still suffers from overflow. The trick here is still shifting the exponential function, with $m = \max_j x_j$:

$$\log \sum_{j=1}^{C} e^{x_j} = m + \log \sum_{j=1}^{C} e^{x_j - m}$$
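A sketch of cross-entropy computed directly from logits with this shift (illustrative; the function names are mine, not a library API):

```python
import numpy as np

def log_softmax(x):
    """Stable log-softmax: log softmax(x)_i = x_i - logsumexp(x)."""
    x = np.asarray(x, dtype=np.float64)
    m = np.max(x, axis=-1, keepdims=True)
    # log-sum-exp with the shift m: m + log(sum(exp(x - m)))
    lse = m + np.log(np.sum(np.exp(x - m), axis=-1, keepdims=True))
    return x - lse

def cross_entropy_from_logits(logits, label):
    """Cross entropy for one example with an integer class label."""
    return -log_softmax(logits)[label]

logits = np.array([1000.0, 1001.0, 1002.0])
print(cross_entropy_from_logits(logits, 0))  # ~2.4076, no inf or NaN
```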
Since this trick is widely used in machine learning, it has a famous name: the log-sum-exp trick. The first time I met this trick, deep learning was not as popular as it is today; I was implementing the EM algorithm to train a Gaussian Mixture Model, and NaN occurred while calculating the posterior probabilities. After a lot of searching, I found the log-sum-exp trick.
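For what it's worth, the same fix applies to that GMM E-step; here is a hypothetical sketch (the helper name and shapes are my own, purely illustrative):

```python
import numpy as np

def responsibilities(log_prior, log_likelihood):
    """E-step posteriors computed in the log domain.

    gamma_k = prior_k * lik_k / sum_j(prior_j * lik_j); exponentiating
    the likelihoods directly can underflow to 0 and give 0/0 = NaN,
    so the normalizer is computed with the log-sum-exp trick instead.
    """
    log_joint = log_prior + log_likelihood                 # log(prior_k * lik_k)
    m = np.max(log_joint)
    log_norm = m + np.log(np.sum(np.exp(log_joint - m)))   # logsumexp(log_joint)
    return np.exp(log_joint - log_norm)

# Likelihoods this small would underflow to 0 if exponentiated naively.
print(responsibilities(np.log([0.5, 0.5]), np.array([-1000.0, -1001.0])))
# [0.73105858 0.26894142]
```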
Sigmoid
The sigmoid function has two algebraically equivalent forms:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$

$$\sigma(x) = \frac{e^{x}}{1 + e^{x}} \qquad (2)$$

Form (1) leads to overflow when $x$ is very negative (since $e^{-x}$ blows up), while form (2) leads to overflow when $x$ is very large. So the solution is: when $x \ge 0$, use (1); otherwise, use (2).
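A NumPy sketch of this branching (illustrative, assuming array inputs):

```python
import numpy as np

def stable_sigmoid(x):
    """Numerically stable sigmoid, elementwise over an array.

    Uses form (1) = 1 / (1 + exp(-x)) where x >= 0 and
    form (2) = exp(x) / (1 + exp(x)) where x < 0, so the argument
    passed to exp() is never positive and cannot overflow.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # form (1)
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)          # form (2)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]
```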
Binary Cross Entropy
Binary cross-entropy is often used in binary classification or multi-label classification problems:

$$L = -\left[\, y \log \sigma(x) + (1 - y) \log\big(1 - \sigma(x)\big) \,\right]$$

where $x$ is the logit and $y \in \{0, 1\}$ is the label. Again, due to the $\log$ operation, simply calculating the sigmoid and then the cross entropy is not safe: $\sigma(x)$ can saturate to $0$ or $1$. After expansion and merging, the formula writes

$$L = x - xy + \log\!\left(1 + e^{-x}\right)$$

The key is to calculate $\log(1 + e^{-x})$ safely. If $x$ is positive, everything is OK since $e^{-x} \le 1$. If $x < 0$, $e^{-x}$ can overflow, and the log-sum-exp trick will do the job: $\log(1 + e^{-x}) = -x + \log(1 + e^{x})$. The two cases combine into $\log(1 + e^{-x}) = \max(-x, 0) + \log(1 + e^{-|x|})$, which gives the stable form $L = \max(x, 0) - xy + \log(1 + e^{-|x|})$.
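A sketch of binary cross-entropy taking logits directly (illustrative; `bce_with_logits` is my own name, not a library function):

```python
import numpy as np

def bce_with_logits(x, y):
    """Numerically stable binary cross-entropy from logits.

    L = x - x*y + log(1 + exp(-x))
      = max(x, 0) - x*y + log(1 + exp(-|x|))   # safe for both signs of x

    The exponent -|x| is never positive, so exp() cannot overflow, and
    log1p keeps precision when exp(-|x|) is tiny.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))

# Confidently wrong predictions give a large but finite loss, never NaN.
print(bce_with_logits(np.array([1000.0, -1000.0]), np.array([0.0, 1.0])))
# [1000. 1000.]
```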
Other tricks
Add a small epsilon $\epsilon$, e.g. $10^{-8}$, to keep risky operations away from their singular points:

- division by zero: $\dfrac{x}{y + \epsilon}$
- log of zero: $\log(x + \epsilon)$
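A tiny sketch of the epsilon padding (the value $10^{-8}$ is just a common choice, not a universal constant):

```python
import numpy as np

EPS = 1e-8  # small constant; the exact value is a modelling choice

def safe_divide(x, y):
    """Avoid x/0 producing inf or NaN by padding the denominator."""
    return x / (y + EPS)

def safe_log(x):
    """Avoid log(0) = -inf by padding the argument."""
    return np.log(x + EPS)

print(safe_divide(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # [1e+08, 0.0]
print(safe_log(np.array([0.0, 1.0])))                           # [~-18.42, ~1e-08]
```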