Multinomial Logistic Regression

Author

Julius Ndung’u

Published

June 2, 2024

Introduction to Multinomial Logistic Regression

Multinomial logistic regression is a statistical method used to model the relationship between a categorical dependent variable with more than two levels and one or more independent variables. Unlike binary logistic regression, which deals with dichotomous outcomes, multinomial logistic regression can handle outcomes with three or more categories. This method is widely used in various fields such as social sciences, marketing, and healthcare to predict categorical outcomes based on predictor variables.

Assumptions of Multinomial Logistic Regression

Multinomial logistic regression relies on several assumptions to ensure the validity of the model:

  • Independence of Irrelevant Alternatives (IIA): The odds of preferring one category over another are not influenced by the presence or absence of other categories.

  • Linearity of Logits: The log-odds of each outcome category relative to the baseline category is a linear function of the predictors.

  • No Multicollinearity: The predictor variables are not highly correlated with each other.

  • Sufficient Sample Size: Reliable estimates require enough observations in every outcome category, and the requirement grows with the number of categories and predictors.
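The linearity and IIA assumptions can be written out explicitly. The following is a sketch in notation introduced here for illustration (it does not appear elsewhere in this post): $K$ outcome categories with category 1 as the baseline, predictor vector $\mathbf{x}$, and one intercept $\beta_{k0}$ and coefficient vector $\boldsymbol{\beta}_k$ per non-baseline category.

```latex
% Linearity of logits: one linear predictor per non-baseline category
\log \frac{P(Y = k \mid \mathbf{x})}{P(Y = 1 \mid \mathbf{x})}
  = \beta_{k0} + \boldsymbol{\beta}_k^{\top} \mathbf{x},
  \qquad k = 2, \dots, K

% Implied category probabilities
P(Y = k \mid \mathbf{x})
  = \frac{\exp\!\left(\beta_{k0} + \boldsymbol{\beta}_k^{\top} \mathbf{x}\right)}
         {1 + \sum_{j=2}^{K} \exp\!\left(\beta_{j0} + \boldsymbol{\beta}_j^{\top} \mathbf{x}\right)},
  \qquad
P(Y = 1 \mid \mathbf{x})
  = \frac{1}{1 + \sum_{j=2}^{K} \exp\!\left(\beta_{j0} + \boldsymbol{\beta}_j^{\top} \mathbf{x}\right)}

% IIA: the odds between categories k and l involve only their own parameters
\frac{P(Y = k \mid \mathbf{x})}{P(Y = l \mid \mathbf{x})}
  = \exp\!\left((\beta_{k0} - \beta_{l0})
  + (\boldsymbol{\beta}_k - \boldsymbol{\beta}_l)^{\top} \mathbf{x}\right)
```

The last line shows why IIA is built into the model: the ratio between any two categories contains no term for the remaining categories.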

Applications of Multinomial Logistic Regression

Multinomial logistic regression finds applications across various domains, including:

  • Healthcare: Predicting the type of treatment a patient will choose based on demographic and clinical characteristics.

  • Marketing: Understanding consumer choice among different brands or products.

  • Education: Analyzing factors influencing the choice of major or field of study.

  • Social Sciences: Examining voting behavior or preferences among multiple candidates or parties.

Performing Multinomial Logistic Regression in R

Let’s walk through an example of performing multinomial logistic regression in R using the built-in iris dataset to predict the species of iris flowers based on their sepal and petal measurements.

Sample Data

The iris dataset contains measurements of iris flowers from three different species: setosa, versicolor, and virginica. The variables include sepal length, sepal width, petal length, and petal width.

# Load necessary libraries
library(nnet)
library(tidyverse)
library(broom)

# Load the iris dataset
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
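Before fitting, it can help to check the data against the sample-size and multicollinearity assumptions listed earlier. The snippet below is a minimal sketch using only base R; the object names `species_counts` and `predictor_cor` are chosen here for illustration.

```r
# iris ships with base R (datasets package), so no extra libraries are needed
data(iris)

# Observations per outcome category: very small groups make estimates unstable
species_counts <- table(iris$Species)
print(species_counts)

# Pairwise Pearson correlations among the four numeric predictors;
# values near 1 or -1 point to possible multicollinearity
predictor_cor <- cor(iris[, c("Sepal.Length", "Sepal.Width",
                              "Petal.Length", "Petal.Width")])
print(round(predictor_cor, 2))
```

In this dataset the classes are balanced, but petal length and petal width are strongly correlated, so the individual petal coefficients should be read with some care.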

Fitting the Model

# Fit the multinomial logistic regression model
iris_model <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
# weights:  18 (10 variable)
initial  value 164.791843 
iter  10 value 16.177348
iter  20 value 7.111438
iter  30 value 6.182999
iter  40 value 5.984028
iter  50 value 5.961278
iter  60 value 5.954900
iter  70 value 5.951851
iter  80 value 5.950343
iter  90 value 5.949904
iter 100 value 5.949867
final  value 5.949867 
stopped after 100 iterations
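The log above ends with "stopped after 100 iterations", which means the optimizer hit its iteration cap instead of meeting its own stopping criterion. One possible follow-up, sketched below, is to refit with a larger `maxit` and inspect the `convergence` component of the fitted object. The name `iris_model_long` and the value 1000 are illustrative choices, not part of the original analysis.

```r
library(nnet)

# Refit with a higher iteration cap (the default maxit is 100);
# trace = FALSE suppresses the per-iteration log
iris_model_long <- multinom(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris, maxit = 1000, trace = FALSE
)

# 0 means the optimizer stopped on its own, 1 means it reached maxit
iris_model_long$convergence

# Residual deviance of the refit, for comparison with the 100-iteration fit
deviance(iris_model_long)
```

If the refit still reports 1, or the coefficients keep growing as `maxit` increases, that usually signals a separation problem in the data and not an optimizer setting that needs tuning.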
# Display the summary of the model
summary(iris_model)
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
    Petal.Width, data = iris)

Coefficients:
           (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor    18.69037    -5.458424   -8.707401     14.24477   -3.097684
virginica    -23.83628    -7.923634  -15.370769     23.65978   15.135301

Std. Errors:
           (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor    34.97116     89.89215    157.0415     60.19170    45.48852
virginica     35.76649     89.91153    157.1196     60.46753    45.93406

Residual Deviance: 11.89973 
AIC: 31.89973 
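`summary()` for a `multinom` fit reports coefficients and standard errors but no test statistics. A common way to obtain them, sketched here, is to form Wald z-scores by dividing each coefficient by its standard error and to take two-sided p-values from the normal distribution. The object names are illustrative, and the model is refit inside the block only so that the snippet runs on its own.

```r
library(nnet)

# Same model as above, refit quietly so this block is self-contained
iris_model <- multinom(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris, trace = FALSE
)

model_summary <- summary(iris_model)

# Wald z-scores: coefficient divided by its standard error
z_scores <- model_summary$coefficients / model_summary$standard.errors

# Two-sided p-values from the standard normal distribution
p_values <- 2 * pnorm(abs(z_scores), lower.tail = FALSE)

round(z_scores, 3)
round(p_values, 3)
```

With the standard errors printed above, every coefficient is smaller in absolute value than its standard error, so no single predictor is distinguishable from zero by this test, even though the overall deviance is very low.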

Interpretation of Coefficients

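The coefficients are expressed on the log-odds scale. Each row compares one species, “versicolor” or “virginica,” with the baseline species, “setosa,” and each column corresponds to the intercept or to one of the predictor variables: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

Two features of the output call for caution before reading the numbers closely: the optimizer stopped at the 100-iteration limit instead of converging, and every standard error is larger than the coefficient it belongs to. Both are consistent with “setosa” being almost perfectly separable from the other two species, a situation in which maximum-likelihood estimates tend to drift toward very large magnitudes. The interpretations below therefore illustrate how to read the table and should not be treated as precise effect sizes.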

  • For the “versicolor” category:

Intercept: The intercept is the log-odds of an iris flower being “versicolor” rather than “setosa” when all predictor variables are zero.

Sepal.Length, Sepal.Width, Petal.Length, Petal.Width: Each coefficient is the change in the log-odds of “versicolor” versus “setosa” for a one-unit increase in that predictor, holding the other variables constant. For example, a one-unit increase in Sepal.Length corresponds to a decrease of approximately 5.46 in these log-odds.

  • For the “virginica” category:

Intercept: As with “versicolor,” the intercept is the log-odds of an iris flower being “virginica” rather than “setosa” when all predictor variables are zero.

Sepal.Length, Sepal.Width, Petal.Length, Petal.Width: Each coefficient is the change in the log-odds of “virginica” versus “setosa” for a one-unit increase in that predictor, holding the other variables constant. For instance, a one-unit increase in Petal.Length corresponds to an increase of approximately 23.66 in these log-odds.

The summary also reports an AIC of 31.90. AIC is used to compare candidate models fitted to the same data: lower values indicate a better fit after penalizing for the number of parameters in the model.
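Log-odds are hard to read directly, so the coefficients are often exponentiated into odds ratios, and the fitted model is often checked by comparing its predicted classes with the observed ones. The sketch below does both. The object names are illustrative, and the model is refit inside the block only so that the snippet runs on its own.

```r
library(nnet)

# Same model as above, refit quietly so this block is self-contained
iris_model <- multinom(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris, trace = FALSE
)

# Odds ratios relative to setosa: exp() of the log-odds coefficients
odds_ratios <- exp(coef(iris_model))
print(signif(odds_ratios, 3))

# Predicted probabilities (one column per species) and most likely class
predicted_probs <- predict(iris_model, newdata = iris, type = "probs")
predicted_class <- predict(iris_model, newdata = iris, type = "class")

# Confusion matrix and accuracy on the training data
confusion <- table(Predicted = predicted_class, Actual = iris$Species)
print(confusion)
mean(predicted_class == iris$Species)
```

The accuracy here is computed on the same rows used for fitting, so it is an optimistic estimate. A held-out test set or cross-validation would give a fairer picture.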

Conclusion

Multinomial logistic regression is a powerful tool for modeling categorical outcomes with more than two levels. By understanding the assumptions and applications of this method, and by using R for implementation, we can gain valuable insights into complex relationships between variables. This example with the iris dataset demonstrates the practical use of multinomial logistic regression in predictive modeling.