Logistic regression gradient descent is the iterative procedure that actually trains a logistic regression model: it starts from initial coefficients (zeros or small random values) and nudges them downhill, step by step, until the log-loss cost stops decreasing. This guide walks through every piece of the math: the sigmoid prediction, the log-loss cost, the surprisingly clean gradient, and the update rule that ties them together.

The model: from inputs to a probability
Before we can train anything, we need the prediction. Logistic regression takes a linear combination of the inputs and squeezes it through the sigmoid (logistic) function so the output is a probability between 0 and 1:
$$p_i = \sigma(z_i), \qquad z_i = \beta_0 + \beta_1 x_i, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
Here $z_i$ is the log-odds (the linear part), and $p_i$ is the predicted probability that example $i$ belongs to class 1. The coefficients $\beta_0$ and $\beta_1$ are exactly what training has to learn. You can experiment with the model interactively in our logistic regression calculator, and compare it to its cousin in logistic regression vs linear regression.
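The prediction step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `predict_proba` and the coefficient values are assumptions chosen for the example, and the sign branch inside `sigmoid` is a common numerical safeguard that the formula itself does not require.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), branched on sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z is negative here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict_proba(x, beta0, beta1):
    """Predicted probability of class 1 for a single feature value x."""
    z = beta0 + beta1 * x     # the linear part (log-odds)
    return sigmoid(z)

# Hypothetical coefficients, chosen only so that z works out to exactly 0.
p = predict_proba(x=2.0, beta0=-1.0, beta1=0.5)
print(p)  # 0.5, because sigmoid(0) = 1 / (1 + 1)
```

A log-odds of 0 maps to a probability of 0.5; positive log-odds push the prediction toward 1 and negative log-odds push it toward 0.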
The cost function: log loss (binary cross-entropy)
To train the model we need a number that says how wrong the current coefficients are. For logistic regression that number is the log loss, also called binary cross-entropy:
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\big[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\big]$$
Each term punishes confident wrong answers harshly: if the true label $y_i = 1$ but the model predicts $p_i$ near 0, then $\ln p_i \to -\infty$ and the loss explodes. When the prediction matches the label, the term goes to 0. Averaging over all $n$ examples gives a single cost to minimize.
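To see how strongly the cost separates good predictions from confident mistakes, here is a small sketch of the log loss in plain Python. The function name `log_loss` and the clipping constant `eps` are assumptions added for illustration; clipping is a practical guard against $\ln 0$ and is not part of the formula.

```python
import math

def log_loss(y, p, eps=1e-15):
    """Mean binary cross-entropy for labels y (0 or 1) and probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)   # assumption: clip so log(0) never occurs
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -total / len(y)

# Two made-up examples with labels 1 and 0.
confident_right = log_loss([1, 0], [0.99, 0.01])   # small cost
confident_wrong = log_loss([1, 0], [0.01, 0.99])   # large cost
print(confident_right, confident_wrong)
```

The first call gives a cost of $-\ln 0.99$ and the second a cost of $-\ln 0.01$, several hundred times larger, even though both sets of predictions are equally confident.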
Log loss comes from maximum likelihood
Log loss is not arbitrary. Minimizing it is mathematically identical to maximizing the likelihood of the observed labels under the model. The likelihood of all the data is the product $\prod_i p_i^{y_i}(1 - p_i)^{1 - y_i}$; taking the negative logarithm turns that product into a sum, and dividing by $n$ gives exactly the cost above. So minimizing log loss = maximizing likelihood, which is why logistic regression is described as a maximum-likelihood method.
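The equivalence is easy to check numerically. The sketch below uses made-up labels and probabilities, computes the likelihood as a literal product, and compares its negative log (averaged over the examples) with the log loss computed as a sum.

```python
import math

y = [1, 0, 1, 1]            # made-up labels
p = [0.9, 0.2, 0.6, 0.7]    # made-up predicted probabilities

# Likelihood as a product of p_i^{y_i} * (1 - p_i)^{1 - y_i}.
likelihood = 1.0
for yi, pi in zip(y, p):
    likelihood *= pi ** yi * (1.0 - pi) ** (1 - yi)

neg_log_likelihood = -math.log(likelihood) / len(y)

# Log loss written directly as the averaged sum.
log_loss = -sum(
    yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi) for yi, pi in zip(y, p)
) / len(y)

print(neg_log_likelihood, log_loss)  # the two values agree up to rounding
```

In practice the sum form is preferred, because a product of many probabilities quickly underflows to zero.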
The gradient: an elegant, familiar result
Gradient descent needs the partial derivative of the cost with respect to each coefficient. The sigmoid has the neat property $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, and when you push the log loss through the chain rule, almost everything cancels. What remains is strikingly clean:
$$\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}$$
Read it in words: the gradient is the average of the prediction error $(p_i - y_i)$ weighted by the feature value $x_{ij}$. For the intercept $\beta_0$, treat $x_{i0} = 1$, so its gradient is just the mean error. The remarkable thing is that this is the same clean form as the gradient of linear regression; only $p_i$ (a sigmoid output) replaces $\hat{y}_i$ (a raw linear output).
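For readers who want to see the cancellation, here is the chain rule written out for a single example. The per-example loss $\ell_i$ is notation introduced only for this derivation; it is the $i$-th term of the cost, so that $J = \frac{1}{n}\sum_i \ell_i$.
$$\ell_i = -\big[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\big]$$
$$\frac{\partial \ell_i}{\partial p_i} = -\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i} = \frac{p_i - y_i}{p_i(1 - p_i)}, \qquad \frac{\partial p_i}{\partial z_i} = p_i(1 - p_i), \qquad \frac{\partial z_i}{\partial \beta_j} = x_{ij}$$
$$\frac{\partial \ell_i}{\partial \beta_j} = \frac{p_i - y_i}{p_i(1 - p_i)} \cdot p_i(1 - p_i) \cdot x_{ij} = (p_i - y_i)\,x_{ij}$$
The factor $p_i(1 - p_i)$ from the sigmoid derivative cancels the denominator from the log loss, and averaging over the $n$ examples gives the gradient formula above.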
The gradient descent update rule
With the gradient in hand, the update rule moves each coefficient a small step in the direction that decreases the cost — the opposite of the gradient:
$$\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j} = \beta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}$$
The symbol $\alpha$ is the learning rate, the size of each downhill step. Too small and training crawls; too large and the steps overshoot the minimum and the cost may diverge. Repeat the update over many passes (epochs) and the coefficients settle at the bottom of the convex log-loss bowl.
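A single update can be sketched as follows. The helper names `gradient` and `gradient_step`, the tiny data set, and the convention of storing a leading 1.0 in every row (so that `beta[0]` plays the role of the intercept) are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    # Plain formula; adequate for the moderate z values used here.
    return 1.0 / (1.0 + math.exp(-z))

def gradient(X, y, beta):
    """Average of (p_i - y_i) * x_ij for each coefficient j."""
    n = len(X)
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        error = sigmoid(z) - yi              # p_i - y_i
        for j, xij in enumerate(row):
            grad[j] += error * xij / n
    return grad

def gradient_step(X, y, beta, alpha):
    """One update: beta_j <- beta_j - alpha * dJ/dbeta_j, for all j at once."""
    grad = gradient(X, y, beta)
    return [b - alpha * g for b, g in zip(beta, grad)]

# Made-up data: the leading 1.0 is the intercept column, the second value is the feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]]
y = [0, 0, 1]
beta = gradient_step(X, y, beta=[0.0, 0.0], alpha=0.1)
print(beta)
```

Starting from zeros, every prediction is 0.5, so the first step lowers the intercept slightly and raises the slope slightly, each by $\alpha \cdot 0.5 / 3$.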
The algorithm step by step
Here is logistic regression gradient descent written out as a procedure you could implement directly:
1. Initialize the coefficients. Set $\beta_0, \beta_1, \dots, \beta_k$ to zeros or small random numbers.
2. Predict. For every example compute $z_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$ and then $p_i = \sigma(z_i)$.
3. Compute the gradient. For each coefficient evaluate $\frac{\partial J}{\partial \beta_j} = \frac{1}{n}\sum_i (p_i - y_i)\,x_{ij}$.
4. Update. Apply $\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j}$ for every coefficient simultaneously.
5. Repeat until convergence. Loop steps 2 to 4 until the cost stops dropping or the gradient is near zero.
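The five steps above can be put together as a short training loop. This is a sketch under stated assumptions: the function name `train`, the default learning rate, the epoch cap, the tolerance, and the data set are all invented for illustration, and each row of `X` carries a leading 1.0 so the first coefficient acts as the intercept.

```python
import math

def sigmoid(z):
    # Branch on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(X, y, alpha=0.1, max_epochs=10000, tol=1e-6):
    """Batch gradient descent; each row of X must start with 1.0 (intercept)."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k                                          # step 1: initialize
    for _ in range(max_epochs):
        p = [sigmoid(sum(b * x for b, x in zip(beta, row)))   # step 2: predict
             for row in X]
        grad = [sum((p[i] - y[i]) * X[i][j] for i in range(n)) / n
                for j in range(k)]                            # step 3: gradient
        beta = [b - alpha * g for b, g in zip(beta, grad)]    # step 4: update
        if max(abs(g) for g in grad) < tol:                   # step 5: stop when flat
            break
    return beta

# Made-up data with overlapping classes (feature value -> label).
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
beta = train(X, y)
print(beta)
```

The classes in this toy data overlap on purpose. If the two classes could be separated perfectly, the log loss would keep shrinking as the coefficients grow, and the loop would stop only at the epoch cap.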
Linear vs logistic: the parallel in one table
Putting the two side by side shows how much logistic regression gradient descent borrows from linear regression — the gradient row is identical in shape:
| Component | Linear regression | Logistic regression |
|---|---|---|
| Model | $\hat{y}_i = \beta_0 + \beta_1 x_i$ | $p_i = \sigma(\beta_0 + \beta_1 x_i)$ |
| Cost | $\frac{1}{2n}\sum (\hat{y}_i - y_i)^2$ | $-\frac{1}{n}\sum [y_i\ln p_i + (1-y_i)\ln(1-p_i)]$ |
| Gradient | $\frac{1}{n}\sum (\hat{y}_i - y_i)\,x_{ij}$ | $\frac{1}{n}\sum (p_i - y_i)\,x_{ij}$ |
| Solved by | Normal equation or gradient descent | Iterative methods such as gradient descent or Newton's method (no closed form) |
The cost functions differ, the models differ, but the gradient has the same form — that is the single most useful fact to carry forward into neural networks, where this same prediction-minus-target signal drives backpropagation.
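The shared form of the two gradients can be made concrete with one function that takes the prediction rule as an argument. The names `gradient`, `identity`, and the tiny data set are assumptions for illustration; only the prediction function changes between the two calls.

```python
import math

def gradient(X, y, beta, predict):
    """(1/n) * sum_i (prediction_i - y_i) * x_ij, for any prediction function."""
    n = len(X)
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        error = predict(z) - yi
        for j, xij in enumerate(row):
            grad[j] += error * xij / n
    return grad

def identity(z):
    return z                                # linear regression: y_hat = z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))       # logistic regression: p = sigma(z)

X = [[1.0, 2.0], [1.0, -1.0]]   # made-up rows; the leading 1.0 is the intercept column
y = [1, 0]
beta = [0.0, 0.0]

linear_grad = gradient(X, y, beta, identity)
logistic_grad = gradient(X, y, beta, sigmoid)
print(linear_grad, logistic_grad)
```

The numbers differ because the predictions differ (0 for the linear model, 0.5 for the logistic one at zero coefficients), but the code path that turns errors into a gradient is identical.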
A small worked feel for the update
Imagine one training point with $x = 2$, true label $y = 1$, and current prediction $p = 0.3$. The error is $p - y = 0.3 - 1 = -0.7$. The contribution to the slope gradient is $(-0.7)(2) = -1.4$. With learning rate $\alpha = 0.1$, the update nudges $\beta_1$ by $-\alpha(-1.4) = +0.14$: it raises the coefficient, pushing the next prediction higher toward the true label of 1. Multiply this intuition across all points and epochs and you have the entire training loop. See it on real numbers in our logistic regression example.
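The same arithmetic, written as a few lines of Python so each intermediate quantity has a name. The variable names are chosen for this illustration; the numbers are the ones from the paragraph above.

```python
x, y, p = 2.0, 1.0, 0.3       # the single training point from the text
alpha = 0.1                   # learning rate

error = p - y                             # about -0.7
slope_gradient = error * x                # about -1.4
delta_beta1 = -alpha * slope_gradient     # about +0.14

print(error, slope_gradient, delta_beta1)
```

A negative error (prediction too low) combined with a positive feature value gives a negative gradient, so subtracting it raises $\beta_1$.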
🤖 ML context
Logistic regression is the simplest neural network — a single neuron with a sigmoid activation trained by gradient descent. The exact gradient you derived here, the prediction minus target signal, is the seed of backpropagation in deep networks. Master logistic regression gradient descent and you have the core optimization idea behind nearly all of modern machine learning. Try it hands-on in the logistic regression calculator.
Key takeaways
Logistic regression gradient descent reduces to four moving parts: predict with the sigmoid $p_i = \sigma(z_i)$, measure error with convex log loss, compute the clean gradient $\frac{1}{n}\sum(p_i - y_i)x_{ij}$, and step downhill with $\beta_j \leftarrow \beta_j - \alpha\,\frac{\partial J}{\partial \beta_j}$. Because log loss is convex and equals the negative log-likelihood (averaged over the examples), this loop, run with a suitable learning rate, converges toward the maximum-likelihood coefficients, which have no closed-form solution. Continue with the logistic regression calculator, the logistic regression example, or the formal reference on Wikipedia.