
Common Numerical Techniques in Deep Learning

Posted on: April 21, 2023 at 12:00 AM


Intro

Sometimes we encounter NaN (Not a Number) during training. The cause is usually overflow or underflow due to limited numerical precision. The function $\exp(x)$ is a good example: when $x$ is very large, $\exp(x)$ overflows to inf; when $x$ is very negative, $\exp(x)$ underflows to 0. Subsequent operations on an overflowed or underflowed number can produce NaN, e.g. $\frac{\infty}{\infty} = \text{nan}$ and $\frac{0}{0} = \text{nan}$. Fortunately, there are numerical techniques to prevent this from happening.
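
A small NumPy snippet makes the failure mode concrete (for float64, $\exp$ overflows for arguments above roughly 709):

```python
import numpy as np

# Overflow and underflow of exp, and how they turn into NaN.
big = np.exp(1000.0)     # inf: overflow (float64 overflows near exp(709))
tiny = np.exp(-1000.0)   # 0.0: underflow

print(big, tiny)         # inf 0.0
print(big / big)         # nan, because inf / inf is undefined
print(tiny / tiny)       # nan, because 0 / 0 is undefined
```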

Softmax

$$softmax(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}$$

Softmax is used frequently in deep learning, either as a non-linear activation function or to normalize logits into probabilities. Computing softmax directly often leads to underflow or overflow. The trick is to exploit its shift-invariance property.

$$softmax(x - c) = softmax(x)$$

In practice, $c$ is taken as $\max(x)$. After shifting, $\exp(\cdot)$ is guaranteed to receive an argument no larger than $0$, which prevents overflow. Also, at least one term of the denominator equals $1$, which prevents division by zero.
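
A minimal sketch of the shifted computation with NumPy (the function name `stable_softmax` is just for illustration):

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-shift trick; x is a 1-D array of logits."""
    shifted = x - np.max(x)        # largest exponent becomes 0, so no overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)     # denominator >= 1, so no division by zero

logits = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(logits))      # finite probabilities that sum to 1
# The naive np.exp(logits) / np.sum(np.exp(logits)) returns [nan, nan, nan] here.
```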

Cross Entropy

In deep learning, cross-entropy is often calculated between ground truth labels and softmax-normalized logits.

$$CE(x, y) = -\sum_{i=1}^{N} y_i \log \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}$$

where $y$ is the ground truth label, $x$ is the logits, and $N$ is the number of classes. The $\log$ operation introduces a new failure mode: when its input is $0$, $\log(0)$ yields -inf, so the softmax shifting trick alone is no longer safe. That is why deep learning libraries such as TensorFlow and PyTorch implement cross-entropy functions that take logits directly as input.

$$CE(x, y) = \sum_{i=1}^{N} y_i \log \sum_{j=1}^{N} \exp(x_j) - \sum_{i=1}^{N} y_i x_i$$

So far, $\sum_{j=1}^{N} \exp(x_j)$ still suffers from overflow. The trick is again to shift the argument of the exponential.

$$\log \sum_{j=1}^{N} \exp(x_j) = \max(x) + \log \sum_{j=1}^{N} \exp(x_j - \max(x))$$

Since this trick is widely used in machine learning, it has a well-known name: the log-sum-exp trick. The first time I came across it, deep learning was not as popular as it is today. I was implementing the EM algorithm to train a Gaussian Mixture Model, and NaN kept appearing while computing the posterior probabilities. After a lot of searching, I found the log-sum-exp trick.
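
A minimal sketch of the log-sum-exp trick and of cross-entropy computed directly from logits, assuming a one-hot label vector (the helper names are illustrative, not a library API):

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))) computed with the max shift, so exp never overflows."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def cross_entropy_with_logits(y, x):
    """Cross entropy between a one-hot label y and logits x, without explicit softmax."""
    return logsumexp(x) - np.sum(y * x)   # valid because sum(y) == 1

y = np.array([0.0, 1.0, 0.0])
x = np.array([1000.0, 1002.0, 1001.0])
print(cross_entropy_with_logits(y, x))    # ~0.41, finite despite the huge logits
```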

Sigmoid

$$Sigmoid(x) = \frac{1}{1+e^{-x}} \tag{1}$$
$$Sigmoid(x) = \frac{e^{x}}{e^{x}+1} \tag{2}$$

(1) overflows when $x$ is very negative, while (2) overflows when $x$ is very large. The solution: when $x > 0$, use (1); otherwise, use (2).
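
A minimal branching sketch for a scalar input (the function name is illustrative):

```python
import numpy as np

def stable_sigmoid(x):
    """Piecewise sigmoid: form (1) for x >= 0, form (2) otherwise."""
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))   # exp(-x) <= 1, no overflow
    z = np.exp(x)                          # exp(x) <= 1, no overflow
    return z / (z + 1.0)

print(stable_sigmoid(1000.0))    # 1.0
print(stable_sigmoid(-1000.0))   # 0.0, without any overflow warning
```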

Binary Cross Entropy

Binary cross entropy is often used in binary classification or multi-label classification problems.

$$BCE(y, x) = -\left( y \log \frac{1}{1+e^{-x}} + (1-y) \log \frac{e^{-x}}{1+e^{-x}} \right)$$

Again, due to the $\log$ operation, simply computing the sigmoid and then the cross entropy is not safe. After expanding and merging terms, the formula becomes

$$BCE(y, x) = x - xy + \log(1+e^{-x})$$

The key is to compute $\log(1+e^{-x})$ safely. If $x$ is positive, everything is fine. If $x < 0$, $e^{-x}$ overflows, and the log-sum-exp trick does the job.

$$\log(1+e^{-x}) = \max(0, -x) + \log\left(e^{-\max(0,-x)} + e^{-x-\max(0,-x)}\right)$$
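
Putting the pieces together, a minimal sketch of binary cross-entropy computed from a raw logit (an illustrative helper, not the TensorFlow/PyTorch API):

```python
import numpy as np

def bce_with_logits(y, x):
    """BCE(y, x) = x - x*y + log(1 + exp(-x)), evaluated without overflow."""
    m = np.maximum(0.0, -x)                              # max(0, -x)
    log_term = m + np.log(np.exp(-m) + np.exp(-x - m))   # stable log(1 + exp(-x))
    return x - x * y + log_term

print(bce_with_logits(1.0, 100.0))    # ~0: confident and correct prediction
print(bce_with_logits(1.0, -100.0))   # ~100: confident and wrong, but still finite
```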

Other tricks

Add a small epsilon $\epsilon$, e.g. 1e-8, to denominators and to the arguments of $\log$ so they stay away from zero.
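
For example, a couple of hypothetical helpers (the epsilon value is a common convention, not a universal constant):

```python
import numpy as np

EPS = 1e-8

def safe_log(x):
    """Keep the argument of log strictly positive."""
    return np.log(x + EPS)

def safe_divide(a, b):
    """Keep the denominator away from exactly zero."""
    return a / (b + EPS)

print(safe_log(0.0))          # about -18.4 instead of -inf
print(safe_divide(1.0, 0.0))  # 1e8 instead of inf
```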