Logistic Regression Gradient Descent

Logistic regression gradient descent is the iterative procedure that actually trains a logistic regression model. It starts from initial coefficients (zeros or small random values) and nudges them downhill, step by step, until the log-loss cost stops decreasing. This guide walks through every piece of the math: the sigmoid prediction, the log-loss cost, the surprisingly clean gradient, and the update rule that ties them together.

Figure: Gradient descent walks the coefficients downhill on the log-loss surface.

The model: from inputs to a probability

Before we can train anything, we need the prediction. Logistic regression takes a linear combination of the inputs and squeezes it through the sigmoid (logistic) function so the output is a probability between 0 and 1:

$$p_i = \sigma(z_i), \qquad z_i = \beta_0 + \beta_1 x_i, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Here $z_i$ is the log-odds (the linear part), and $p_i$ is the predicted probability that example $i$ belongs to class 1. The coefficients $\beta_0$ and $\beta_1$ are exactly what training has to learn. You can experiment with the model interactively in our logistic regression calculator, and compare it to its cousin in logistic regression vs linear regression.
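
As a minimal sketch of the prediction step, the snippet below computes $z$ and then $p = \sigma(z)$ for one input. The function names and the coefficient values are assumptions chosen for illustration, and the two-branch sigmoid is just one common way to avoid overflow for large $|z|$.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), split in two branches to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict_proba(beta0, beta1, x):
    """Predicted probability of class 1 for a single input x."""
    z = beta0 + beta1 * x         # log-odds (the linear part)
    return sigmoid(z)

# Hypothetical coefficients, picked so that z = 0 and therefore p = 0.5
p = predict_proba(beta0=-1.0, beta1=0.5, x=2.0)
```

Whatever the coefficients are, the output stays between 0 and 1, which is what lets it be read as a probability.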

The cost function: log loss (binary cross-entropy)

To train the model we need a number that says how wrong the current coefficients are. For logistic regression that number is the log loss, also called binary cross-entropy:

$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\big[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\big]$$

Each term punishes confident wrong answers harshly: if the true label $y_i = 1$ but the model predicts $p_i$ near 0, then $\ln p_i \to -\infty$ and the loss explodes. When the prediction matches the label, the term goes to 0. Averaging over all $n$ examples gives a single cost to minimize.
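
The cost can be sketched directly from the formula. The `eps` clipping below is an assumption added here (it is not part of the definition) so that a prediction of exactly 0 or 1 does not produce $\ln 0$.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)    # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# True label is 1 in both cases; only the predicted probability differs
confident_right = log_loss([1], [0.99])    # small penalty
confident_wrong = log_loss([1], [0.01])    # large penalty
```

Comparing the two values shows how much more a confident wrong answer costs than a confident right one.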

📊 Why log loss instead of MSE?

You might expect to reuse mean squared error from linear regression. The problem is that MSE combined with the sigmoid is non-convex: its surface has bumps and local minima where gradient descent can get stuck. Log loss, by contrast, is convex in the coefficients, so it has a single global minimum that gradient descent reliably reaches.

Log loss comes from maximum likelihood

Log loss is not arbitrary. Minimizing it is mathematically identical to maximizing the likelihood of the observed labels under the model. The likelihood of all the data is the product $\prod_i p_i^{y_i}(1 - p_i)^{1 - y_i}$; taking the negative logarithm and dividing by $n$ turns that product into the average above. So minimizing log loss = maximizing likelihood, which is why logistic regression is described as a maximum-likelihood method.
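
Written out as a short worked step, using only the symbols already defined, the link between the likelihood and the cost looks like this:

$$-\ln \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = -\sum_{i=1}^{n}\big[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\big] = n\,J(\beta)$$

The factor $n$ is a positive constant, so it rescales the cost without moving the coefficients that minimize it.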

The gradient: an elegant, familiar result

Gradient descent needs the partial derivative of the cost with respect to each coefficient. The sigmoid has the neat property $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, and when you push the log loss through the chain rule, almost everything cancels. What remains is strikingly clean:

$$\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}$$

Read it in words: the gradient is the average of the prediction error $(p_i - y_i)$ weighted by the feature value $x_{ij}$. For the intercept $\beta_0$, treat $x_{i0} = 1$, so its gradient is just the mean error. The remarkable thing is that this is the same clean form as the gradient of linear regression: only $p_i$ (a sigmoid output) replaces $\hat{y}_i$ (a raw linear output).
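
For readers who want to see the cancellation, here is one way to lay out the chain rule for a single example. The per-example loss symbol $\ell_i$ is introduced here only for this sketch.

$$\begin{aligned}
\ell_i &= -\big[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\big] \\
\frac{\partial \ell_i}{\partial p_i} &= -\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i} = \frac{p_i - y_i}{p_i(1 - p_i)} \\
\frac{\partial p_i}{\partial z_i} &= p_i(1 - p_i), \qquad \frac{\partial z_i}{\partial \beta_j} = x_{ij} \\
\frac{\partial \ell_i}{\partial \beta_j} &= \frac{p_i - y_i}{p_i(1 - p_i)} \cdot p_i(1 - p_i) \cdot x_{ij} = (p_i - y_i)\,x_{ij}
\end{aligned}$$

The factor $p_i(1 - p_i)$ from the sigmoid derivative cancels the denominator from the log terms, and averaging over the $n$ examples gives the gradient of $J$.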

✅ Same shape, different model

Both linear and logistic regression land on the identical gradient template $\frac{1}{n}\sum (\text{prediction} - \text{actual})\,x_{ij}$. That shared structure is no accident: it falls out of pairing each model with its natural loss (squared error with the identity link, log loss with the logit link). Learn it once and it transfers.
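
One way to see the shared template is a single gradient function that takes the prediction function as a parameter. This is a sketch under assumptions: the name `gradient`, the `link` argument, and the toy data are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(beta, X, y, link):
    """(1/n) * sum_i (prediction_i - y_i) * x_ij for every coefficient j.

    Each row of X starts with 1.0 so that beta[0] acts as the intercept;
    link maps the linear part z to a prediction.
    """
    n = len(X)
    grad = [0.0] * len(beta)
    for row, target in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        error = link(z) - target
        for j, x in enumerate(row):
            grad[j] += error * x / n
    return grad

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]    # toy data, not from the article
y = [0.0, 0.0, 1.0]
beta = [0.0, 0.0]

grad_linear = gradient(beta, X, y, link=lambda z: z)    # identity link: linear regression
grad_logistic = gradient(beta, X, y, link=sigmoid)      # sigmoid link: logistic regression
```

Only the `link` argument changes between the two calls; the loop that builds the gradient is the same.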

The gradient descent update rule

With the gradient in hand, the update rule moves each coefficient a small step in the direction that decreases the cost — the opposite of the gradient:

$$\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j} = \beta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}$$

The symbol $\alpha$ is the learning rate — the size of each downhill step. Too small and training crawls; too large and the steps overshoot the minimum and the cost may diverge. Repeat the update over many passes (epochs) and the coefficients settle at the bottom of the convex log-loss bowl.
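
To get a feel for the learning rate, the sketch below runs the update rule for a fixed number of epochs with two values of $\alpha$ and records the cost after each epoch. The dataset, the helper names, and the two rates are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(b0, b1, xs, ys, eps=1e-12):
    """Log loss for a one-feature model; eps guards against log(0)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(b0 + b1 * x), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)

def run(alpha, xs, ys, epochs=50):
    """Return the cost after each epoch, starting from zero coefficients."""
    b0, b1, n = 0.0, 0.0, len(xs)
    history = []
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / n                                  # gradient for the intercept
        g1 = sum(e * x for e, x in zip(errors, xs)) / n       # gradient for the slope
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1             # simultaneous update
        history.append(cost(b0, b1, xs, ys))
    return history

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]    # toy data, not linearly separable
ys = [0, 0, 1, 0, 1, 1]
slow = run(alpha=0.001, xs=xs, ys=ys)     # tiny steps: the cost barely moves
steady = run(alpha=0.5, xs=xs, ys=ys)     # moderate steps: the cost falls much further
```

Plotting `slow` and `steady` against the epoch number gives the kind of cost curve the text describes. Trying a much larger `alpha` in the same function is a quick way to look for overshooting on your own data.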

The algorithm step by step

Here is logistic regression gradient descent written out as a procedure you could implement directly:

  1. Initialize the coefficients. Set $\beta_0, \beta_1, \dots, \beta_k$ to zeros or small random numbers.
  2. Predict. For every example compute $z_i = \beta_0 + \beta_1 x_{i1} + \dots$ and then $p_i = \sigma(z_i)$.
  3. Compute the gradient. For each coefficient evaluate $\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\sum_i (p_i - y_i)\,x_{ij}$.
  4. Update. Apply $\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j}$ for every coefficient simultaneously.
  5. Repeat until convergence. Loop steps 2 to 4 until the cost stops dropping or the gradient is near zero.

⚠ Practical notes that matter

Three things make training smoother. First, feature scaling (standardizing inputs to similar ranges) reshapes the cost surface so gradient descent converges far faster. Second, choose $\alpha$ carefully: plot the cost per epoch, and if it rises, $\alpha$ is too big. Third, unlike linear regression’s normal equation, logistic regression has no closed-form solution, so it has to be fit iteratively, with gradient descent or a similar optimizer.
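
The five steps of the procedure can be sketched as a short batch gradient descent loop. This is an illustrative implementation under assumptions, not a reference one: the function name `fit_logistic`, the default `alpha`, `max_epochs`, and `tol`, and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, alpha=0.1, max_epochs=5000, tol=1e-6):
    """Batch gradient descent for logistic regression.

    X is a list of feature rows (without the intercept column) and
    y is a list of 0/1 labels. Returns [beta_0, beta_1, ..., beta_k].
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)                          # step 1: initialize to zeros
    for _ in range(max_epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            features = [1.0] + list(row)            # x_i0 = 1 for the intercept
            z = sum(b * x for b, x in zip(beta, features))
            error = sigmoid(z) - target             # step 2: predict, then p_i - y_i
            for j, x in enumerate(features):
                grad[j] += error * x / n            # step 3: accumulate the gradient
        if max(abs(g) for g in grad) < tol:         # step 5: stop when the gradient is near zero
            break
        beta = [b - alpha * g for b, g in zip(beta, grad)]   # step 4: simultaneous update
    return beta

X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]   # toy data, not linearly separable
y = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(X, y, alpha=0.5)
```

The stopping test here uses the gradient size. Watching the cost per epoch, as the practical notes suggest, is an equally reasonable choice.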

Linear vs logistic: the parallel in one table

Putting the two side by side shows how much logistic regression gradient descent borrows from linear regression — the gradient row is identical in shape:

| Component | Linear regression | Logistic regression |
|---|---|---|
| Model | $\hat{y}_i = \beta_0 + \beta_1 x_i$ | $p_i = \sigma(\beta_0 + \beta_1 x_i)$ |
| Cost | $\frac{1}{2n}\sum (\hat{y}_i - y_i)^2$ | $-\frac{1}{n}\sum [y_i\ln p_i + (1-y_i)\ln(1-p_i)]$ |
| Gradient | $\frac{1}{n}\sum (\hat{y}_i - y_i)\,x_{ij}$ | $\frac{1}{n}\sum (p_i - y_i)\,x_{ij}$ |
| Solved by | Normal equation or gradient descent | Gradient descent or a similar iterative optimizer (no closed form) |

The cost functions differ, the models differ, but the gradient has the same form — that is the single most useful fact to carry forward into neural networks, where this same prediction-minus-target signal drives backpropagation.

Convex vs non-convex

A convex cost is bowl-shaped: every downhill path leads to the one global minimum, so gradient descent cannot get trapped. Log loss with a sigmoid is convex; squared error with a sigmoid is not. Together with its maximum-likelihood interpretation, that property is the main reason logistic regression uses log loss for its cost function.
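
A small numerical check can make the difference concrete. The sketch below uses a single hypothetical training example ($x = 1$, $y = 1$) and tests whether each loss ever rises above the straight line between two points, which a convex function never does. The test points $-6$ and $-2$ are chosen for illustration. The check shows non-convexity, not the presence of a local minimum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training example with x = 1 and y = 1, so the prediction is sigmoid(b).
def squared_error(b):
    return (sigmoid(b) - 1.0) ** 2

def log_loss(b):
    return -math.log(sigmoid(b))

def violates_midpoint_convexity(f, a, c):
    """True if f at the midpoint lies above the chord from a to c."""
    mid = 0.5 * (a + c)
    return f(mid) > 0.5 * (f(a) + f(c))

mse_violation = violates_midpoint_convexity(squared_error, -6.0, -2.0)
log_loss_violation = violates_midpoint_convexity(log_loss, -6.0, -2.0)
```

On this interval the squared-error curve bulges above its chord while the log-loss curve stays below it.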

A small worked feel for the update

Imagine one training point with $x = 2$, true label $y = 1$, and current prediction $p = 0.3$. The error is $p - y = 0.3 - 1 = -0.7$. This point's contribution to the gradient for the slope $\beta_1$ is $(-0.7)(2) = -1.4$. With learning rate $\alpha = 0.1$, the update nudges $\beta_1$ by $-\alpha(-1.4) = +0.14$, so it raises the coefficient, pushing the next prediction higher toward the true label of 1. Multiply this intuition across all points and epochs and you have the entire training loop. See it on real numbers in our logistic regression example.
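
The same arithmetic can be replayed in a few lines. To show the prediction actually moving, the sketch makes one extra assumption that is not in the text: the intercept is 0 and held fixed, so the starting $\beta_1$ is whatever value gives $p = 0.3$ at $x = 2$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, p = 2.0, 1.0, 0.3          # the single training point from the text
alpha = 0.1

error = p - y                     # about -0.7
grad_beta1 = error * x            # about -1.4
step = -alpha * grad_beta1        # about +0.14

# Assumption for illustration: beta_0 = 0 and fixed, so beta_1 * x = ln(p / (1 - p)).
beta1_old = math.log(p / (1.0 - p)) / x
beta1_new = beta1_old + step
p_new = sigmoid(beta1_new * x)    # prediction after the update
```

Under that assumption `p_new` comes out higher than 0.3, which is the move toward the true label that the text describes.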

🤖 ML context

Logistic regression is the simplest neural network — a single neuron with a sigmoid activation trained by gradient descent. The exact gradient you derived here, the prediction minus target signal, is the seed of backpropagation in deep networks. Master logistic regression gradient descent and you have the core optimization idea behind nearly all of modern machine learning. Try it hands-on in the logistic regression calculator.

Frequently asked questions

How does gradient descent train logistic regression?
It starts with initial coefficients, predicts probabilities with the sigmoid, computes the log-loss gradient $\frac{1}{n}\sum_i (p_i - y_i)\,x_{ij}$, and repeatedly updates each coefficient by subtracting the learning rate times that gradient until the cost stops decreasing.
Why does logistic regression use log loss instead of mean squared error?
Mean squared error paired with the sigmoid is non-convex, so gradient descent can get stuck in local minima. Log loss is convex in the coefficients, giving a single global minimum, and it also equals the negative log-likelihood, so minimizing it maximizes likelihood.
What is the gradient of the logistic regression cost function?
The partial derivative of log loss with respect to coefficient $\beta_j$ is $\frac{1}{n}\sum_i (p_i - y_i)\,x_{ij}$. It is the average prediction error weighted by the feature value, the same clean form as linear regression’s gradient.
Does logistic regression have a closed-form solution like linear regression?
No. Linear regression has the normal equation, but logistic regression has no closed-form solution because the sigmoid makes the equations nonlinear. It must be fit iteratively with gradient descent or a similar optimizer.
How do I choose the learning rate for gradient descent?
Plot the cost against epochs. If the cost decreases smoothly, the learning rate is fine; if it diverges or oscillates, lower it; if it falls very slowly, raise it. Scaling the features first also makes a wider range of learning rates work well.

Key takeaways

Logistic regression gradient descent reduces to four moving parts: predict with the sigmoid $p_i = \sigma(z_i)$, measure error with convex log loss, compute the clean gradient $\frac{1}{n}\sum(p_i - y_i)x_{ij}$, and step downhill with $\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j}$. Because log loss is convex and equals the negative log-likelihood, this loop reliably finds the maximum-likelihood coefficients with no closed-form shortcut. Continue with the logistic regression calculator, the logistic regression example, or the formal reference on Wikipedia.
