Logistic Regression

Introduction

Logistic regression is a parametric classification method that estimates a binary class label from real-valued input vectors. In this post, cross entropy loss is used as the loss function and gradient descent (or batched gradient descent) is used to learn the parameters.

The computation graph below shows how logistic regression works. The dot product of each input, a vector $\vec{x}$ of size $D$, and the transposed weight vector $\vec{w}$ is taken, resulting in $z$. $z$ is passed through the sigmoid function, and the loss is calculated using the cross entropy between label $y$ and $\sigma(z)$.

The bias term $\beta$ is ignored for the purposes of this post, but it can easily be appended to the weight vector $\vec{w}$ after appending a $1$ to each input vector $\vec{x}$, as sketched in the snippet after the figure.

[Computation graph: $\vec{x} \rightarrow z = \vec{w}\cdot\vec{x} \rightarrow \sigma(z) \rightarrow L_{CE}(y, \sigma(z))$]
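As a concrete illustration of the bias trick, the snippet below (with made-up tensor shapes, not code from the classifier later in this post) appends a $1$ to each input row and $\beta$ to the weight vector, so a single dot product absorbs the bias:

```python
import torch

# Hypothetical shapes: N = 4 inputs with D = 3 features each.
x = torch.rand((4, 3))        # input matrix, one example per row
w = torch.rand((1, 3))        # weight vector
beta = torch.tensor([[0.5]])  # bias term

# Append a column of ones to x and beta to w; the dot product now includes the bias.
x_aug = torch.cat([x, torch.ones((4, 1))], dim=1)  # shape (4, 4)
w_aug = torch.cat([w, beta], dim=1)                # shape (1, 4)

z = x_aug @ w_aug.T                                # shape (4, 1)
assert torch.allclose(z, x @ w.T + beta)
```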

Learning

Given a data set $X$ of size $N$ with $D$ dimensions, parameters $w$ must be learned that minimize our loss function $L_{CE}(y, \hat{y})$. The weight vector is learned using gradient descent. The derivation for the term $\frac{\partial L}{\partial w}$ in the weight update is displayed in the derivations section of this post.

$$
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}} & \text{[Logistic function]}\\
L_{\text{CE}}(y, \hat{y}) &= -\left[ y\log \hat{y} + (1-y)\log(1-\hat{y}) \right] & \text{[Cross entropy loss]}\\
w_i &= w_i - \alpha \frac{\partial L}{\partial w_i} & \text{[Weight update]}\\
&= w_i - \alpha \left[x_i (\sigma(z) - y)\right] &\\
w_i &= w_i - \alpha \frac{1}{B}\sum^B_{j=1}x_{j,i}(\sigma(z_j)-y_j) & \text{[Batch weight update]}\\
\end{aligned}
$$
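To make the batch weight update concrete, here is a minimal sketch of a single update step on a randomly generated batch (the shapes and learning rate are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)

# Toy batch: B = 8 examples with D = 5 features (made-up data for illustration).
B, D, alpha = 8, 5, 0.1
x = torch.rand((B, D))
y = (torch.rand((B, 1)) > 0.5).float()  # binary labels, shape (B, 1)
w = torch.rand((1, D))

z = x @ w.T                              # linear combination, shape (B, 1)
sigma = 1 / (1 + torch.exp(-z))          # logistic function
loss = -(y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma)).mean()

# Batch weight update: w_i <- w_i - alpha * (1/B) * sum_j x_{j,i} * (sigma(z_j) - y_j)
grad = (x * (sigma - y)).sum(dim=0) / B  # shape (D,)
w = w - alpha * grad
print(loss.item())
```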

Code

Code for a logistic regression classifier is shown in the block below:

```python
from tqdm import trange
import torch


def ErrorRate(y: torch.Tensor, yhat: torch.Tensor) -> float:
    """ Calculate error rate (1 - accuracy)

    Args:
        y: true labels
        yhat: predicted labels

    Returns:
        error rate
    """
    return torch.sum((y != yhat).float()) / y.shape[0]


class LogisticRegressionClassifier:

    def __init__(self) -> None:
        """ Instantiate logistic regression classifier """
        self.w = None
        self.calcError = ErrorRate

    def fit(self, x: torch.Tensor, y: torch.Tensor, alpha: float = 1e-4,
            epochs: int = 1000, batch: int = 32) -> None:
        """ Fit logistic regression classifier to dataset

        Args:
            x: input data
            y: input labels
            alpha: alpha parameter for weight update
            epochs: number of epochs to train
            batch: size of batches for training
        """
        self.w = torch.rand((1, x.shape[1]))
        epochs = trange(epochs, desc='Error')

        for epoch in epochs:
            start, end = 0, batch

            # Ceiling division covers a final partial batch without producing
            # an empty batch when the dataset size divides evenly
            for b in range((x.shape[0] + batch - 1) // batch):
                hx = self.probs(x[start:end])
                dw = self.calcGradient(x[start:end], y[start:end], hx)
                self.w = self.w - (alpha * dw)
                start += batch
                end += batch

            hx = self.predict(x)
            error = self.calcError(y, hx)
            epochs.set_description('Err: %.4f' % error)

    def probs(self, x: torch.Tensor) -> torch.Tensor:
        """ Determine probability of label being 1

        Args:
            x: input data

        Returns:
            probability for each member of input
        """
        hx = 1 / (1 + torch.exp(-torch.einsum('ij,kj->i', x, self.w)))[:, None]
        return hx

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """ Predict labels

        Args:
            x: input data

        Returns:
            labels for each member of input
        """
        hx = self.probs(x)
        hx = (hx >= 0.5).float()
        return hx

    def calcGradient(self, x: torch.Tensor, y: torch.Tensor, hx: torch.Tensor) -> torch.Tensor:
        """ Calculate weight gradient

        Args:
            x: input data
            y: input labels
            hx: predicted probabilities

        Returns:
            tensor of gradient values the same size as weights
        """
        return torch.sum(x * (hx - y), dim=0) / x.shape[0]
```
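As a usage sketch, the classifier can be exercised on a synthetic, linearly separable dataset (the data, learning rate, and epoch count below are arbitrary choices, not part of the classifier itself):

```python
import torch

torch.manual_seed(0)

# Synthetic data: label is 1 when the features sum to more than D / 2.
N, D = 1000, 5
x = torch.rand((N, D))
y = (x.sum(dim=1, keepdim=True) > D / 2).float()  # shape (N, 1)

clf = LogisticRegressionClassifier()
clf.fit(x, y, alpha=1e-2, epochs=200, batch=32)

yhat = clf.predict(x)
print('Training error: %.4f' % ErrorRate(y, yhat))
```

Labels are kept as a column vector of shape $(N, 1)$ so they broadcast correctly against the probabilities returned by `probs`.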

Derivations

Derivative of loss function $L$ with respect to sigmoid output $a=\sigma(z)$:

$$
\begin{aligned}
\frac{\partial L}{\partial a} &= \frac{\partial}{\partial a} - \left[y\log a + (1-y) \log (1-a)\right]\\
&= - \left[ y\frac{\partial}{\partial a}\log a + (1-y)\frac{\partial}{\partial a}\log (1-a) \right]\\
&= - \left[ \frac{y}{a}\frac{\partial}{\partial a}a + \frac{(1-y)}{(1-a)}\frac{\partial}{\partial a}(1-a) \right]\\
&= -\left[\frac{y}{a} - \frac{(1-y)}{(1-a)}\right]
\end{aligned}
$$
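A quick sanity check of this derivative with `torch.autograd` (the values of $y$ and $a$ below are arbitrary):

```python
import torch

# Spot check dL/da = -[y/a - (1 - y)/(1 - a)] at an arbitrary point.
y = torch.tensor(1.0)
a = torch.tensor(0.7, requires_grad=True)

L = -(y * torch.log(a) + (1 - y) * torch.log(1 - a))
L.backward()

analytic = -(y / a - (1 - y) / (1 - a))
assert torch.allclose(a.grad, analytic)  # both equal -1/0.7
```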

Derivative of sigmoid function $\sigma(z)$ with respect to sigmoid input $z$:

$$
\begin{aligned}
\frac{\partial \sigma(z)}{\partial z} &= \frac{\partial}{\partial z}\frac{1}{1 + e^{-z}}\\
&= \frac{\partial}{\partial z}(1 + e^{-z})^{-1}\\
&= -(1 + e^{-z})^{-2}\frac{\partial}{\partial z}(1 + e^{-z})\\
&= -(1 + e^{-z})^{-2}\frac{\partial}{\partial z}(e^{-z})\\
&= \frac{-e^{-z}}{(1 + e^{-z})^{2}}\frac{\partial}{\partial z}(-z)\\
&= \frac{e^{-z}}{(1 + e^{-z})^{2}}\\
&= \frac{e^{-z}}{1 + e^{-z}} \frac{1}{1 + e^{-z}}\\
&= \left( \frac{1 + e^{-z}}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} \right)\frac{1}{(1 + e^{-z})}\\
&= (1 - \sigma(z))\sigma(z)
\end{aligned}
$$
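The identity $\frac{\partial \sigma(z)}{\partial z} = (1 - \sigma(z))\sigma(z)$ can also be verified numerically with autograd (the sample points below are arbitrary):

```python
import torch

# Check dσ/dz = (1 - σ(z)) σ(z) at a few arbitrary points via autograd.
z = torch.linspace(-3, 3, 7, requires_grad=True)
s = torch.sigmoid(z)

# Summing makes backward() place the elementwise derivative in z.grad.
s.sum().backward()

assert torch.allclose(z.grad, (1 - s) * s)
```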

Derivative of linear combination $z$ with respect to weight $w_i$:

$$
\begin{aligned}
\frac{\partial z}{\partial w_i} &= \frac{\partial}{\partial w_i} \sum^D_{j=1}w_j \times x_j\\
&= x_i
\end{aligned}
$$

Derivative of loss function $L$ with respect to weight $w_i$:

$$
\begin{aligned}
\frac{\partial L}{\partial w_i} &= \frac{\partial z}{\partial w_i} \frac{\partial \sigma(z)}{\partial z} \frac{\partial L}{\partial \sigma(z)}\\
&= x_i \left[(1-\sigma(z))\sigma(z)\right]\left[\frac{1-y}{1-\sigma(z)}-\frac{y}{\sigma(z)}\right]\\
&= x_i \left[(1-y)\sigma(z) - (1-\sigma(z))y\right] \\
&= x_i \left[\sigma(z)-y\sigma(z) - y + y\sigma(z)\right] \\
&= x_i \left[\sigma(z)- y\right]
\end{aligned}
$$
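As a final check, the closed-form gradient $x_i(\sigma(z) - y)$ can be compared against a finite-difference estimate for a single example (the tensors and step size below are arbitrary illustration values):

```python
import torch

torch.manual_seed(0)

# Finite-difference check of dL/dw_i = x_i (sigma(z) - y) for one example.
D, eps = 4, 1e-4
x = torch.rand(D, dtype=torch.double)
y = torch.tensor(1.0, dtype=torch.double)
w = torch.rand(D, dtype=torch.double)

def loss(w: torch.Tensor) -> torch.Tensor:
    sigma = torch.sigmoid(w @ x)  # sigma(z) with z = w . x
    return -(y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma))

analytic = x * (torch.sigmoid(w @ x) - y)
numeric = torch.stack([
    (loss(w + eps * torch.eye(D, dtype=torch.double)[i])
     - loss(w - eps * torch.eye(D, dtype=torch.double)[i])) / (2 * eps)
    for i in range(D)
])
assert torch.allclose(analytic, numeric, atol=1e-6)
```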

Resources

  • Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson, 2020.
  • Russell, Stuart J., et al. Artificial Intelligence: A Modern Approach. 3rd ed., Prentice Hall, 2010.