Multinomial Logistic Regression

Multinomial logistic regression is the model you reach for when the outcome has three or more unordered categories — like predicting whether an animal is a cat, a dog, or a bird — and it works by generalizing ordinary binary logistic regression with the softmax function. This guide explains the math, walks through a worked example, and shows exactly when to use it.


What is multinomial logistic regression?

Multinomial logistic regression (also called multiclass logistic regression or softmax regression) extends logistic regression to problems with more than two possible outcomes that have no natural order. Binary logistic regression answers a yes/no question; the multinomial version answers a “which one of these $K$ classes?” question. Classic examples include predicting the chosen transport mode among {car, bus, train}, the species among {cat, dog, bird}, or which product a customer buys from a catalogue of three.

The crucial requirement is that the categories are unordered (nominal). If the categories have a ranking — like {low, medium, high} — you should use ordinal logistic regression instead, which exploits that ordering. For the two-class case you simply fall back to binary logistic regression.

The softmax generalization

In binary logistic regression a single linear score is squeezed through the sigmoid. Multinomial logistic regression keeps one linear score per class and squeezes the whole set through the softmax function, which turns $K$ raw scores into $K$ probabilities that are all positive and sum to one:

$$P(y=c\mid x)=\frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_c=\beta_{c0}+\beta_{c1}x_1+\beta_{c2}x_2+\cdots$$

Here each class $c$ gets its own set of coefficients $\beta_{c0}, \beta_{c1}, \dots$, so $z_c$ is the linear score that class earns for a given input $x$. The exponentials make every score positive, and dividing by the sum $\sum_{j=1}^{K} e^{z_j}$ normalizes them into a valid probability distribution. The predicted class is simply the one with the highest probability.
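The formula above can be sketched in a few lines of plain Python. This is only an illustration: the three input scores are hypothetical, and subtracting the largest score before exponentiating is a common numerical-stability step that is not part of the formula itself (the common factor cancels, so the probabilities are unchanged).

```python
import math

def softmax(scores):
    """Turn a list of K raw scores into K probabilities that sum to 1."""
    # Shifting by the largest score avoids overflow in exp() and does not
    # change the result, because the shared factor cancels in the ratio.
    top = max(scores)
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes, chosen only for illustration.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)          # roughly [0.659, 0.242, 0.099]

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because softmax is monotonic in each score, the predicted class is also simply the class with the largest raw score; the probabilities matter when you need a measure of confidence.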

Reference-category (baseline) formulation
Because the probabilities must sum to 1, one class's set of coefficients is redundant: adding the same amount to every class's score leaves the softmax output unchanged. The standard fix is to pick a reference (baseline) class and fix its coefficients to zero, then estimate coefficients only for the other $K-1$ classes relative to that baseline. Each coefficient then describes how a feature changes the log-odds of a class versus the reference class.

📊 One-vs-rest as an alternative
Instead of a single softmax model, you can train $K$ separate binary logistic regressions, each one “this class vs. all others”, and pick the class with the highest score. This one-vs-rest (OvR) scheme is simpler and is the default in some libraries, but its probabilities are not guaranteed to sum to 1 unless they are renormalized, whereas the true multinomial (softmax) model is fit jointly and produces probabilities that sum to 1 by construction.
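To see why one-vs-rest outputs need not sum to 1, here is a minimal sketch. The three scores are hypothetical stand-ins for the outputs of three separately fitted “this class vs. all others” models; dividing by the total afterwards is one common workaround, not something the OvR scheme guarantees by itself.

```python
import math

def sigmoid(z):
    """Binary logistic function used by each one-vs-rest model."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores from three independently trained binary models.
ovr_scores = [1.2, 0.4, -0.5]

# Each model reports its own "this class vs. the rest" probability.
ovr_raw = [sigmoid(z) for z in ovr_scores]
ovr_total = sum(ovr_raw)         # about 1.74 here, not 1

# Renormalizing forces the values to sum to 1 after the fact.
ovr_normalized = [p / ovr_total for p in ovr_raw]

# The predicted class is still the one with the highest score.
predicted = max(range(len(ovr_raw)), key=ovr_raw.__getitem__)
print(ovr_raw, ovr_total, ovr_normalized, predicted)
```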

Binary vs multinomial vs ordinal

The three flavours of logistic regression differ only in how many classes they handle and whether order matters. Here is the comparison in one table:

| Model | Number of classes | Ordered? | Core mechanism |
| --- | --- | --- | --- |
| Binary logistic regression | Exactly 2 | N/A | Sigmoid on one linear score |
| Multinomial logistic regression | 3 or more | No (nominal) | Softmax over $K$ linear scores |
| Ordinal logistic regression | 3 or more | Yes (ranked) | Cumulative log-odds with shared slope |

Worked example: predicting a customer’s chosen product

Imagine an online store with three products — A, B, and C — and you want to predict which one a customer buys from their age ($x_1$) and income ($x_2$). The outcome has three unordered categories, so multinomial logistic regression is the right tool. We pick product A as the reference class, so its coefficients are fixed to zero, and we estimate coefficients for B and C relative to A.

Suppose fitting the model gives these linear scores for a customer with $x_1=35$ (age) and $x_2=6$ (income in tens of thousands):

  1. Reference class A: $z_A = 0$ by construction.
  2. Class B: $z_B = \beta_{B0}+\beta_{B1}x_1+\beta_{B2}x_2 = -2.0 + 0.04(35) + 0.10(6) = 0.0$.
  3. Class C: $z_C = \beta_{C0}+\beta_{C1}x_1+\beta_{C2}x_2 = -4.0 + 0.02(35) + 0.50(6) = -0.3$.

Now apply softmax. The denominator is $e^{0}+e^{0}+e^{-0.3}\approx 1+1+0.741=2.741$, so:

$$P(A)=\frac{1}{2.741}\approx 0.365,\quad P(B)=\frac{1}{2.741}\approx 0.365,\quad P(C)=\frac{0.741}{2.741}\approx 0.270$$
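The arithmetic above can be reproduced with a short script. The coefficients and the customer's values are the ones from the worked example; the function and variable names are illustrative choices, not part of any library.

```python
import math

# Coefficients from the worked example as (intercept, age, income).
# Product A is the reference class, so all of its coefficients are zero.
coefs = {
    "A": (0.0, 0.0, 0.0),
    "B": (-2.0, 0.04, 0.10),
    "C": (-4.0, 0.02, 0.50),
}

def linear_score(beta, age, income):
    """Linear score z_c = b0 + b1 * age + b2 * income for one class."""
    b0, b_age, b_income = beta
    return b0 + b_age * age + b_income * income

def predict_proba(age, income):
    """Softmax over the per-class linear scores."""
    scores = {c: linear_score(b, age, income) for c, b in coefs.items()}
    total = sum(math.exp(z) for z in scores.values())
    return {c: math.exp(z) / total for c, z in scores.items()}

# Customer aged 35 with income 6 (in tens of thousands).
probs = predict_proba(age=35, income=6)
for product, p in probs.items():
    print(f"P({product}) = {p:.3f}")
# P(A) = 0.365
# P(B) = 0.365
# P(C) = 0.270
```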

The three probabilities sum to 1, and A and B tie as the most likely purchase at about 36.5% each, so the model has no clear preference between them for this customer. Interpreting the coefficients: because A is the reference, $\beta_{C2}=0.50$ means that each extra unit of income (ten thousand, in the units used here) multiplies the odds of choosing C over A by $e^{0.50}\approx 1.65$, holding age fixed. A positive coefficient always reads as “this feature pushes the customer toward this class relative to the reference class,” which is why choosing a sensible baseline makes the whole model easier to explain.

✅ Reading multinomial coefficients
Every coefficient in multinomial logistic regression is relative to the reference class, never absolute. Exponentiating a coefficient, $e^{\beta_{ck}}$, gives the odds ratio for class $c$ versus the baseline per one-unit rise in feature $k$. Change the reference class and the numbers change, but the predicted probabilities stay identical.
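Both claims can be checked numerically with the worked example's coefficients. In this sketch, `rebase` is a hypothetical helper that switches the reference class by subtracting the new reference's coefficients from every class; it is an illustration, not a function from any library.

```python
import math

# Worked-example coefficients as (intercept, age, income), reference = A.
coefs_ref_a = {
    "A": (0.0, 0.0, 0.0),
    "B": (-2.0, 0.04, 0.10),
    "C": (-4.0, 0.02, 0.50),
}

def rebase(coefs, new_ref):
    """Re-express all coefficients relative to a different reference class."""
    ref = coefs[new_ref]
    return {c: tuple(b - r for b, r in zip(beta, ref))
            for c, beta in coefs.items()}

def predict_proba(coefs, x):
    """Softmax over linear scores; x = (1, age, income) with a leading 1."""
    scores = {c: sum(b * xi for b, xi in zip(beta, x))
              for c, beta in coefs.items()}
    top = max(scores.values())
    exps = {c: math.exp(z - top) for c, z in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

x = (1.0, 35.0, 6.0)
coefs_ref_b = rebase(coefs_ref_a, "B")   # B's coefficients are now all zero

p_ref_a = predict_proba(coefs_ref_a, x)
p_ref_b = predict_proba(coefs_ref_b, x)  # same probabilities as p_ref_a

# Odds ratio of C versus A per one-unit rise in income: e^0.50, about 1.65.
odds_ratio_c_vs_a = math.exp(coefs_ref_a["C"][2])
print(coefs_ref_b, p_ref_a, p_ref_b, odds_ratio_c_vs_a)
```

The coefficient values differ between the two parameterizations, yet the predicted probabilities agree up to floating-point rounding, which is what makes the choice of baseline a matter of interpretability rather than fit.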

When to use multinomial logistic regression

Reach for multinomial logistic regression when both of these are true:

  1. The target has three or more categories. With exactly two, use binary logistic regression — the softmax with $K=2$ reduces to the ordinary sigmoid anyway.
  2. The categories are unordered (nominal). Brands, species, transport modes, and product choices have no inherent ranking, so each class deserves its own coefficients.

⚠ Ordered classes? Use ordinal instead
If your categories carry a natural order, such as survey ratings like {poor, fair, good, excellent} or sizes {small, medium, large}, multinomial logistic regression throws that ordering information away. In those cases ordinal logistic regression is more efficient because it assumes a single shared slope across the ordered cut-points.
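
The first condition above notes that softmax with $K=2$ reduces to the sigmoid. A short derivation makes this concrete. Label the two classes 1 and 2, and write $\sigma(t)=1/(1+e^{-t})$ for the sigmoid (the symbol $\sigma$ is introduced here only for this sketch). Dividing the numerator and denominator by $e^{z_1}$ gives:

$$P(y=1\mid x)=\frac{e^{z_1}}{e^{z_1}+e^{z_2}}=\frac{1}{1+e^{-(z_1-z_2)}}=\sigma(z_1-z_2)$$

Only the difference between the two scores matters, so taking class 2 as the reference ($z_2=0$) leaves $P(y=1\mid x)=\sigma(z_1)$, which is ordinary binary logistic regression.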

How it is trained

Like binary logistic regression, the multinomial model has no closed-form solution. It is fit by maximum likelihood, which is equivalent to minimizing the multiclass cross-entropy (log) loss, using gradient descent or a solver like L-BFGS. The same loss is used to train the softmax output layer of modern neural networks, which is why multinomial logistic regression is often described as a single-layer neural net with a softmax activation. You can build intuition for the binary backbone first with our logistic regression calculator.
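As a rough illustration of the training loop, the sketch below fits a tiny multinomial model with plain batch gradient descent on the cross-entropy loss. Everything specific here is an assumption made for the example: the toy data, the learning rate, the number of steps, and the choice of class 0 as the reference. A production fit would normally use a library solver such as L-BFGS instead.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scores_for(W, x):
    """One linear score per class; x carries a leading 1 for the intercept."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_entropy(W, X, y):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(softmax(scores_for(W, x))[label])
                for x, label in zip(X, y)) / len(X)

def fit(X, y, n_classes, lr=0.1, steps=500):
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(steps):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = softmax(scores_for(W, x))
            for c in range(n_classes):
                # Derivative of the loss w.r.t. score c: p_c - 1[c == label].
                err = probs[c] - (1.0 if c == label else 0.0)
                for k in range(n_features):
                    grad[c][k] += err * x[k] / len(X)
        # Class 0 is the reference, so its row is never updated (stays zero).
        for c in range(1, n_classes):
            for k in range(n_features):
                W[c][k] -= lr * grad[c][k]
    return W

# Hypothetical toy data: rows are (1, feature); labels are class indices.
X = [(1.0, -2.0), (1.0, -1.5), (1.0, 0.0), (1.0, 0.2), (1.0, 1.8), (1.0, 2.2)]
y = [0, 0, 1, 1, 2, 2]

W_start = [[0.0, 0.0] for _ in range(3)]
W = fit(X, y, n_classes=3)

loss_before = cross_entropy(W_start, X, y)   # log(3): uniform predictions
loss_after = cross_entropy(W, X, y)          # lower after training
print(loss_before, loss_after, W)
```

Starting from all-zero coefficients, every class gets probability 1/3, so the initial loss is $\log 3$; each gradient step then shifts probability toward the observed classes.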

🤖 ML context

Multinomial logistic regression is the bridge from classic statistics to deep learning: its softmax output is the final layer of nearly every neural-network classifier, from image recognition to language models. Once the binary model from the logistic regression hub is familiar, the leap to multiclass becomes one short step: you are simply stacking one linear model per class and normalizing with softmax.

Frequently asked questions

What is multinomial logistic regression?
Multinomial logistic regression is a classification model for outcomes with three or more unordered categories. It generalizes binary logistic regression by fitting one linear score per class and converting those scores into probabilities with the softmax function.
How is multinomial logistic regression different from binary logistic regression?
Binary logistic regression handles exactly two classes using a sigmoid on a single linear score. Multinomial logistic regression handles three or more unordered classes using softmax over one linear score per class. With two classes, softmax reduces to the ordinary sigmoid.
What is the softmax function in multinomial logistic regression?
Softmax turns $K$ raw linear scores into $K$ probabilities that are all positive and sum to 1. For class $c$ the probability is $e^{z_c}$ divided by the sum of $e^{z_j}$ over all classes $j$. The predicted class is the one with the highest probability.
What is the reference category in multinomial logistic regression?
Because the class probabilities must sum to 1, one class is redundant. The reference or baseline class has its coefficients fixed to zero, and every other class’s coefficients are interpreted as effects relative to that baseline class.
When should I use multinomial instead of ordinal logistic regression?
Use multinomial logistic regression when the three or more categories are unordered, such as brands or species. Use ordinal logistic regression when the categories have a natural ranking, such as small/medium/large, because it uses the ordering for a more efficient model.

Key takeaways

Multinomial logistic regression generalizes binary logistic regression to three or more unordered classes by giving each class its own linear score and normalizing with softmax, $P(y=c\mid x)=e^{z_c}/\sum_j e^{z_j}$. Coefficients are read relative to a chosen reference class, and you can fit it jointly or approximate it with one-vs-rest. Continue with the logistic regression calculator, the two-class case in binary logistic regression, the ranked-class case in ordinal logistic regression, or the formal reference on Wikipedia.
