
Common Numerical Techniques in Deep Learning

Posted on: April 21, 2023 at 12:00 AM


Intro

Sometimes we encounter NaN (Not a Number) during training. The cause is usually overflow or underflow due to limited numerical precision. The function $\exp(x)$ is a good example: when $x$ is very large, $\exp(x)$ overflows to inf; when $x$ is very negative, $\exp(x)$ underflows to 0. Subsequent operations on an overflowed or underflowed number can produce NaN, e.g. $\frac{\infty}{\infty} = \text{nan}$ and $\frac{0}{0} = \text{nan}$. Fortunately, there are numerical techniques to prevent this from happening.
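
A small NumPy snippet makes the failure mode concrete (for float64, $\exp$ overflows for arguments above roughly 709):

```python
import numpy as np

# Overflow and underflow of exp, and how they turn into NaN.
big = np.exp(1000.0)     # inf: overflow (float64 overflows near exp(709))
tiny = np.exp(-1000.0)   # 0.0: underflow

print(big, tiny)         # inf 0.0
print(big / big)         # nan, because inf / inf is undefined
print(tiny / tiny)       # nan, because 0 / 0 is undefined
```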

Softmax

$$softmax(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}$$

Softmax is used frequently in deep learning, either as a non-linear activation function or to normalize logits into probabilities. Computing softmax directly often leads to underflow or overflow. The trick is to exploit its shift-invariance property.

$$softmax(x - c) = softmax(x)$$

In practice, $c$ is taken as $\max(x)$. After shifting, $\exp(\cdot)$ is guaranteed to receive an argument no larger than $0$, which prevents overflow. Also, at least one term of the denominator equals $1$, which prevents division by zero.
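
A minimal sketch of the shifted computation with NumPy (the function name `stable_softmax` is just for illustration):

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-shift trick; x is a 1-D array of logits."""
    shifted = x - np.max(x)        # largest exponent becomes 0, so no overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)     # denominator >= 1, so no division by zero

logits = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(logits))      # finite probabilities that sum to 1
# The naive np.exp(logits) / np.sum(np.exp(logits)) returns [nan, nan, nan] here.
```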

Cross Entropy

In deep learning, cross-entropy is often calculated between ground truth labels and softmax-normalized logits.

$$CE(x, y) = -\sum_{i=1}^{N} y_i \log \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}$$

where $y$ is the ground truth label, $x$ is the logits, and $N$ is the number of classes. The $\log$ operation introduces a new failure mode: when its input is $0$, $\log(0)$ yields -inf, so the softmax shifting trick alone is no longer safe. That is why deep learning libraries such as TensorFlow and PyTorch implement cross-entropy functions that take logits directly as input.

$$CE(x, y) = \sum_{i=1}^{N} y_i \log \sum_{j=1}^{N} \exp(x_j) - \sum_{i=1}^{N} y_i x_i$$

So far, $\sum_{j=1}^{N} \exp(x_j)$ still suffers from overflow. The trick is again to shift the argument of the exponential.

$$\log \sum_{j=1}^{N} \exp(x_j) = \max(x) + \log \sum_{j=1}^{N} \exp(x_j - \max(x))$$

Since this trick is widely used in machine learning, it has a well-known name: the log-sum-exp trick. The first time I came across it, deep learning was not as popular as it is today. I was implementing the EM algorithm to train a Gaussian Mixture Model, and NaN kept appearing while computing the posterior probabilities. After a lot of searching, I found the log-sum-exp trick.
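
A minimal sketch of the log-sum-exp trick and of cross-entropy computed directly from logits, assuming a one-hot label vector (the helper names are illustrative, not a library API):

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))) computed with the max shift, so exp never overflows."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def cross_entropy_with_logits(y, x):
    """Cross entropy between a one-hot label y and logits x, without explicit softmax."""
    return logsumexp(x) - np.sum(y * x)   # valid because sum(y) == 1

y = np.array([0.0, 1.0, 0.0])
x = np.array([1000.0, 1002.0, 1001.0])
print(cross_entropy_with_logits(y, x))    # ~0.41, finite despite the huge logits
```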

Sigmoid

$$Sigmoid(x) = \frac{1}{1+e^{-x}} \tag{1}$$
$$Sigmoid(x) = \frac{e^{x}}{e^{x}+1} \tag{2}$$

(1) overflows when $x$ is very negative, while (2) overflows when $x$ is very large. The solution: when $x > 0$, use (1); otherwise, use (2).
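
A minimal branching sketch for a scalar input (the function name is illustrative):

```python
import numpy as np

def stable_sigmoid(x):
    """Piecewise sigmoid: form (1) for x >= 0, form (2) otherwise."""
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))   # exp(-x) <= 1, no overflow
    z = np.exp(x)                          # exp(x) <= 1, no overflow
    return z / (z + 1.0)

print(stable_sigmoid(1000.0))    # 1.0
print(stable_sigmoid(-1000.0))   # 0.0, without any overflow warning
```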

Binary Cross Entropy

Binary cross entropy is often used in binary classification or multi-label classification problems.

$$BCE(y, x) = -\left( y \log \frac{1}{1+e^{-x}} + (1-y) \log \frac{e^{-x}}{1+e^{-x}} \right)$$

Again, due to the $\log$ operation, simply computing the sigmoid and then the cross entropy is not safe. After expanding and merging terms, the formula becomes

$$BCE(y, x) = x - xy + \log(1+e^{-x})$$

The key is to compute $\log(1+e^{-x})$ safely. If $x$ is positive, everything is fine. If $x < 0$, $e^{-x}$ overflows, and the log-sum-exp trick does the job.

$$\log(1+e^{-x}) = \max(0, -x) + \log\left(e^{-\max(0,-x)} + e^{-x-\max(0,-x)}\right)$$
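
Putting the pieces together, a minimal sketch of binary cross-entropy computed from a raw logit (an illustrative helper, not the TensorFlow/PyTorch API):

```python
import numpy as np

def bce_with_logits(y, x):
    """BCE(y, x) = x - x*y + log(1 + exp(-x)), evaluated without overflow."""
    m = np.maximum(0.0, -x)                              # max(0, -x)
    log_term = m + np.log(np.exp(-m) + np.exp(-x - m))   # stable log(1 + exp(-x))
    return x - x * y + log_term

print(bce_with_logits(1.0, 100.0))    # ~0: confident and correct prediction
print(bce_with_logits(1.0, -100.0))   # ~100: confident and wrong, but still finite
```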

Other tricks

Add a small epsilon $\epsilon$, e.g. 1e-8, to denominators and to the arguments of $\log$ so they stay away from zero.
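
For example, a couple of hypothetical helpers (the epsilon value is a common convention, not a universal constant):

```python
import numpy as np

EPS = 1e-8

def safe_log(x):
    """Keep the argument of log strictly positive."""
    return np.log(x + EPS)

def safe_divide(a, b):
    """Keep the denominator away from exactly zero."""
    return a / (b + EPS)

print(safe_log(0.0))          # about -18.4 instead of -inf
print(safe_divide(1.0, 0.0))  # 1e8 instead of inf
```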