Introduction
In my previous post on deep learning, I discussed how to create feed-forward neural network layers and how to create models of arbitrary size from those layers.
I also mentioned that a variety of activation functions can be used in the hidden layers of a neural network, and that various loss functions can be used depending on the desired goals for a model.
In this post, I will go over some common activation and loss functions and derive their local gradients.
Common Activation Functions
Sigmoid Function
The sigmoid function takes in real-valued input $x$ and returns a real-valued output in the range $[0, 1]$.
This activation function is most often used in output layers for binary classification models, although it could technically be used in hidden layers as well.
$$
\begin{aligned}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
\frac{\partial \sigma(x)}{\partial x} &= \frac{\partial}{\partial x} (1+e^{-x})^{-1}\\
&= -(1+e^{-x})^{-2} \frac{\partial}{\partial x} (1 + e^{-x})\\
&= -(1+e^{-x})^{-2} \frac{\partial}{\partial x} e^{-x}\\
&= -(1+e^{-x})^{-2} e^{-x} \frac{\partial}{\partial x} (-x)\\
&= (1+e^{-x})^{-2} e^{-x}\\
&= \frac{e^{-x}}{(1+e^{-x})^{2}}\\
&= \frac{1}{1+e^{-x}} \frac{e^{-x}}{1+e^{-x}}\\
&= \frac{1}{1+e^{-x}} \frac{1 + e^{-x} - 1}{1+e^{-x}}\\
&= \frac{1}{1+e^{-x}} \left[\frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right]\\
&= \frac{1}{1+e^{-x}} \left[1 - \frac{1}{1+e^{-x}}\right]\\
&= \sigma(x) \left[1 - \sigma(x)\right]
\end{aligned}
$$
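To make this concrete, here is a minimal NumPy sketch of the sigmoid and its local gradient; the function names `sigmoid` and `sigmoid_grad` are just illustrative choices, not from any particular library.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: maps real-valued inputs into (0, 1), element-wise.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient derived above: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)
```

Reusing the forward value `s` when computing the gradient is the usual trick here: the backward pass never has to recompute the exponential.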
Tanh Function
The tanh function takes in real-valued input $x$ and returns a real-valued output in the range $[-1, 1]$.
This function is most often used as an activation for hidden layers.
$$
\begin{aligned}
\tanh(x) &= \frac{\sinh(x)}{\cosh(x)}\\
&= \frac{e^x - e^{-x}}{e^x + e^{-x}} \\
\frac{\partial \tanh(x)}{\partial x} &=
\frac{
\left[\frac{\partial}{\partial x} (e^x - e^{-x})\right](e^x + e^{-x}) -
(e^x - e^{-x})\left[\frac{\partial}{\partial x} (e^x + e^{-x})\right]
}{
(e^x + e^{-x})^2
}\\
&= \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
&= 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
&= 1 - \tanh^2(x)
\end{aligned}
$$
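As a quick sketch of how this derivative is typically computed (assuming NumPy; the helper name `tanh_grad` is my own):

```python
import numpy as np

def tanh_grad(x):
    # Local gradient of tanh: 1 - tanh(x)^2, element-wise.
    t = np.tanh(x)
    return 1.0 - t * t
```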
Rectified Linear Activation Function (ReLU)
The ReLU function takes in real-valued input $x$ and returns real-valued output in the range $[0, \infty)$.
The ReLU function is most often seen as an activation function for hidden layers.
It should be noted that, formally, the derivative of the ReLU function at 0 is undefined.
However, in practice, it is often set to 0 when $x$ is 0.
$$
\begin{aligned}
r(x) &=
\begin{cases}
x & x > 0\\
0 & x < 0
\end{cases}\\
\frac{\partial r(x)}{\partial x} &=
\begin{cases}
1 & x > 0\\
0 & x < 0
\end{cases}
\end{aligned}
$$
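Here is a minimal NumPy sketch of ReLU and its gradient, using the convention above of treating the gradient at 0 as 0; the function names are illustrative only.

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), element-wise.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Local gradient: 1 where x > 0, else 0 (including at x = 0, by convention).
    return np.where(x > 0, 1.0, 0.0)
```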
Softmax Function
The softmax function takes in a vector of real values and returns a vector of real values, each in the range $[0, 1]$.
The vector $\vec{s(x)}$ obtained from running a vector $x$ through the softmax function always sums to 1.
Due to this, the softmax function is most often used as the activation function for output layers in multinomial classification problems.
$$
\begin{aligned}
s(x_i) &= \frac{e^{x_i}}{\sum^K_{k=1} e^{x_k}} \\
\frac{\partial s(x_j)}{\partial x_i} &= \frac{\partial}{\partial x_i} \frac{e^{x_j}}{\sum^K_{k=1} e^{x_k}}\\
&= \frac{
\left(\frac{\partial}{\partial x_i} e^{x_j}\right)\sum^K_{k=1}e^{x_k} -
e^{x_j}\left(\frac{\partial}{\partial x_i}\sum^K_{k=1}e^{x_k}\right)
}{
(\sum^K_{k=1}e^{x_k})^2
}\\
&=
\begin{cases}
\frac{
\left(\frac{\partial}{\partial x_i} e^{x_j}\right)\sum^K_{k=1}e^{x_k} -
e^{x_j}\left(\frac{\partial}{\partial x_i}\sum^K_{k=1}e^{x_k}\right)
}{
(\sum^K_{k=1}e^{x_k})^2
} & i=j\\
\frac{
\left(\frac{\partial}{\partial x_i} e^{x_j}\right)\sum^K_{k=1}e^{x_k} -
e^{x_j}\left(\frac{\partial}{\partial x_i}\sum^K_{k=1}e^{x_k}\right)
}{
(\sum^K_{k=1}e^{x_k})^2
} & i\neq j\\
\end{cases}\\
&=
\begin{cases}
\frac{
e^{x_i}\sum^K_{k=1}e^{x_k} - e^{x_i}e^{x_i}
}{
(\sum^K_{k=1}e^{x_k})^2
} & i=j\\
\frac{
0 \cdot \sum^K_{k=1}e^{x_k} - e^{x_j}e^{x_i}
}{
(\sum^K_{k=1}e^{x_k})^2
} & i\neq j\\
\end{cases}\\
&=
\begin{cases}
\frac{
e^{x_i}\left(\sum^K_{k=1}e^{x_k} - e^{x_i}\right)
}{
(\sum^K_{k=1}e^{x_k})^2
} & i=j\\
\frac{
e^{x_i}(-e^{x_j})
}{
(\sum^K_{k=1}e^{x_k})^2
} & i\neq j\\
\end{cases}\\
&=
\begin{cases}
s(x_i)(1 - s(x_j)) & i=j\\
s(x_i)(0 - s(x_j)) & i\neq j\\
\end{cases}\\
&= s(x_i)(\delta_{i=j} - s(x_j))
\end{aligned}
$$
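The last line is compact enough to implement directly. Below is a minimal NumPy sketch that returns the full $K \times K$ Jacobian $J_{ij} = s(x_i)(\delta_{i=j} - s(x_j))$; subtracting the max before exponentiating is a standard numerical-stability trick, and the function names are my own.

```python
import numpy as np

def softmax(x):
    # Softmax over a 1-D vector; subtracting the max avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = s_i * (delta_ij - s_j), i.e. diag(s) - outer(s, s).
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)
```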
Common Loss Functions
Binary Cross-Entropy
The binary cross-entropy loss function is used for models where the desired output is a binary probability.
This is necessary in binary classification models.
$$
\begin{aligned}
L_{\text{BCE}}(y, \hat{y}) &= -\left[y\log \hat{y} + (1-y)\log(1-\hat{y})\right]\\
\frac{\partial L}{\partial \hat{y}} &=
\frac{\partial}{\partial \hat{y}} -\left[y\log \hat{y} + (1-y)\log(1-\hat{y})\right]\\
&= -\left[
\frac{\partial}{\partial \hat{y}} y\log \hat{y} +
\frac{\partial}{\partial \hat{y}}(1-y)\log(1-\hat{y})
\right]\\
&= -\left[ \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right]\\
&= \frac{1-y}{1-\hat{y}} - \frac{y}{\hat{y}}
\end{aligned}
$$
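A minimal sketch of this loss and its gradient in NumPy, assuming `y_hat` is strictly between 0 and 1 so the logs and divisions are well defined (the function names are illustrative):

```python
import numpy as np

def bce_loss(y, y_hat):
    # Binary cross-entropy for a target y in {0, 1} and a prediction y_hat in (0, 1).
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_grad(y, y_hat):
    # Gradient with respect to the prediction: (1 - y)/(1 - y_hat) - y/y_hat.
    return (1 - y) / (1 - y_hat) - y / y_hat
```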
Cross-Entropy
The cross-entropy function is used for models where the desired output is a probability distribution over $K$ possible classes.
This is necessary in multinomial classification models.
$$
\begin{aligned}
L_{\text{CE}}(y, \hat{y}) &= - \left[ \sum^{K}_{k=1} y_k \log(\hat{y}_k) \right] \\
\frac{\partial L}{\partial \hat{y}_i} &=
\frac{\partial}{\partial \hat{y}_i} - \left[ \sum^{K}_{k=1} y_k \log(\hat{y}_k) \right]\\
&= -\left[ \frac{\partial}{\partial \hat{y}_i} y_i \log(\hat{y}_i) \right]\\
&= -\frac{y_i}{\hat{y}_i}
\end{aligned}
$$
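As a sketch, assuming `y` is a one-hot (or soft) target vector and `y_hat` contains strictly positive predicted probabilities (the names are mine):

```python
import numpy as np

def ce_loss(y, y_hat):
    # Cross-entropy between the target distribution y and the prediction y_hat.
    return -np.sum(y * np.log(y_hat))

def ce_grad(y, y_hat):
    # Element-wise gradient with respect to each prediction: -y_i / y_hat_i.
    return -y / y_hat
```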
Mean Squared Error
The mean squared error loss function is used for models where the desired output is a real number.
This is necessary for regression models.
$$
\begin{aligned}
L_{\text{MSE}}(y,\hat{y}) &= (y - \hat{y})^2\\
\frac{\partial L}{\partial \hat{y}} &= \frac{\partial}{\partial \hat{y}} (y - \hat{y})^2\\
&= 2(y-\hat{y})\frac{\partial}{\partial \hat{y}} (y - \hat{y})\\
&= 2(\hat{y}-y)\\
&\propto \hat{y}-y
\end{aligned}
$$
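And a final sketch for the squared error on a single prediction (the names are mine; the constant factor of 2 is often dropped or absorbed into the learning rate, which is why the gradient above is only stated up to proportionality):

```python
def mse_loss(y, y_hat):
    # Squared error for a single target/prediction pair.
    return (y - y_hat) ** 2

def mse_grad(y, y_hat):
    # Gradient with respect to the prediction: 2 * (y_hat - y).
    return 2.0 * (y_hat - y)
```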