Logistic Regression

Author

Julius

Published

June 20, 2024

Introduction

Unlike linear regression, which predicts continuous outcomes, logistic regression is designed for binary outcomes. It models the probability that a binary response variable takes one of its two values as a function of one or more predictor variables. The model passes a linear combination of the predictors through the logistic function (also known as the sigmoid function), which constrains the predicted values to lie between 0 and 1.
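To make the transformation concrete, here is a minimal sketch of the logistic function in base R. The input values in `z` are hypothetical log odds chosen only for illustration; `sigmoid` is a helper name introduced here, not something from the original analysis.

```r
# Logistic (sigmoid) function: maps any real number to the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical linear-predictor values (log odds), for illustration only
z <- c(-4, -1, 0, 1, 4)
p <- sigmoid(z)
print(round(p, 3))

# Base R's plogis() is the same function; qlogis() is its inverse (the logit)
print(all.equal(p, plogis(z)))
print(all.equal(z, qlogis(p)))
```

A log odds of 0 corresponds to a probability of exactly 0.5, and large negative or positive values approach 0 and 1 without ever reaching them.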

Assumptions of Logistic Regression

While logistic regression is robust to violations of some assumptions, it does rely on certain key assumptions:

  • Linearity of Log Odds: The log odds of the outcome is a linear combination of the predictors.

  • Independence of Observations: The observations are independent of each other.

  • Absence of Multicollinearity: The predictors are not highly correlated with each other.

  • Sufficient Sample Size: A minimum number of observations is required for stable parameter estimation.
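Two of these assumptions can be checked with a few lines of base R. The sketch below uses simulated data (not the contraceptive data set), and the variable names `x1`, `x2`, and `y` are made up for illustration. It computes a variance inflation factor by hand and the number of events per estimated slope. A commonly cited heuristic asks for roughly 10 events per parameter, but treat that as a rough guide rather than a rule.

```r
set.seed(1)  # simulated data only, not the contraceptive data set
n  <- 500
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)  # mildly correlated with x1 by construction
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 - 0.4 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

# Multicollinearity: variance inflation factor of x1, computed as
# 1 / (1 - R^2) from regressing x1 on the other predictor(s)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
print(vif_x1)

# Sample size: count of the rarer outcome per estimated slope
n_events <- min(sum(y == 1), sum(y == 0))
epv <- n_events / (length(coef(fit)) - 1)
print(epv)
```

A VIF close to 1 indicates little collinearity; values well above that suggest the predictors carry overlapping information.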

Applications of Logistic Regression

Logistic regression finds applications across various domains, including:

  • Medical Research: Predicting the likelihood of disease occurrence based on risk factors.

  • Marketing: Predicting customer churn or response to marketing campaigns.

  • Finance: Assessing the likelihood of loan default based on financial indicators.

  • Social Sciences: Analyzing survey data to predict voting behavior.

Performing Logistic Regression in R

Let’s walk through an example of performing logistic regression in R using data on contraceptive use.

The data were used to investigate the determinants of contraceptive use among individuals in a certain population based on demographic, socioeconomic, and geographic variables, including residence type, education level, and wealth index.

Variables in the Data

uses_contraceptive: This is the outcome variable and represents whether an individual uses contraception or not. It’s a factor variable with levels ‘yes’ and ‘no’.

Residence: This variable represents the type of residence where the individual lives. It’s a factor variable with levels ‘rural’ and ‘urban’, indicating whether the individual lives in a rural or urban area.

Education_level: This variable represents the education level of the individual. It’s a factor variable with levels such as ‘no education’, ‘primary’, ‘secondary’, and ‘higher’, indicating the highest level of education attained by the individual.

Wealth_index: This variable represents the wealth index or economic status of the individual. It’s a factor variable with levels such as ‘poor’, ‘middle’, and ‘rich’, indicating the economic status or wealth level of the individual.

Loading and Preprocessing Data

Code
# Import the data
contraceptive <- read.csv("contraceptive.csv")
# Convert the numeric codes to labelled factors; the first level is the reference category
contraceptive$uses_contraceptive <- factor(contraceptive$uses_contraceptive, levels = c(0, 1), labels = c("no", "yes"))
contraceptive$Residence <- factor(contraceptive$Residence, levels = c(0, 1), labels = c("urban", "rural"))
contraceptive$Education_level <- factor(contraceptive$Education_level, levels = c(0, 1, 2, 3), labels = c("no_education", "primary", "secondary", "higher"))
contraceptive$Wealth_index <- factor(contraceptive$Wealth_index, levels = c(0, 1, 2), labels = c("poor", "middle", "rich"))
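Before fitting, it can help to confirm that the numeric codes were mapped to the intended labels. The following sketch uses a small made-up vector `codes` in place of a real column from the CSV, so the values are hypothetical; the same checks could be run on the actual columns.

```r
# Toy codes standing in for one column of the CSV (hypothetical values)
codes <- c(0, 1, 1, 0, 2, 3, 1)
education <- factor(codes, levels = c(0, 1, 2, 3),
                    labels = c("no_education", "primary", "secondary", "higher"))

# The first level is the reference category that glm() will use
print(levels(education))

# Cross-tabulate codes against labels to confirm the mapping
print(table(code = codes, label = education))

# Codes not listed in `levels` silently become NA, so count them
print(sum(is.na(education)))

# relevel() changes the reference category if another baseline is preferred
education2 <- relevel(education, ref = "primary")
print(levels(education2)[1])
```

Checking for unexpected `NA` values matters because `glm()` drops incomplete rows by default, which can quietly shrink the sample.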

Fitting Logistic Regression Model

Code
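# Regress uses_contraceptive on all other columns; the binomial family uses the logit link by default
model <- glm(uses_contraceptive ~ ., data = contraceptive, family = "binomial")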
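To see how a factor predictor appears in the fitted model, here is a minimal sketch on simulated data. The data frame `toy`, the intercept of 0.5, and the rural effect of -0.7 are all invented for illustration and are not estimates from the contraceptive data.

```r
set.seed(42)  # simulated data; the effect sizes below are invented
n <- 2000
Residence <- factor(sample(c("urban", "rural"), n, replace = TRUE),
                    levels = c("urban", "rural"))

# Assumed true model: log odds = 0.5 for urban and 0.5 - 0.7 for rural
log_odds <- 0.5 - 0.7 * (Residence == "rural")
uses <- rbinom(n, 1, plogis(log_odds))
toy <- data.frame(uses, Residence)

# Same call pattern as above: outcome ~ all other columns, binomial family
toy_model <- glm(uses ~ ., data = toy, family = binomial)
print(coef(toy_model))
```

The reference level (`urban`) is absorbed into the intercept, and the other level gets its own coefficient named by pasting the variable name and the level together (`Residencerural`). This is why the regression table reports an estimate for every level except the first.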

Interpret Coefficients

Code
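library(gtsummary)  # tbl_regression(), style_pvalue(), bold_p(), as_flex_table(), and the %>% pipe
# Summarise the model (log odds ratios), show p-values to 3 digits, bold significant ones
tbl_regression(model, pvalue_fun = ~style_pvalue(.x, digits = 3)) %>% 
  bold_p() %>% 
  as_flex_table()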

Characteristic        log(OR)¹    95% CI¹         p-value
Residence
  urban               reference
  rural               -0.19       -0.27, -0.12    <0.001
Education_level
  no_education        reference
  primary             0.46        0.40, 0.52      <0.001
  secondary           0.59        0.48, 0.70      <0.001
  higher              0.75        0.60, 0.89      <0.001
Wealth_index
  poor                reference
  middle              0.94        0.87, 1.0       <0.001
  rich                1.1         1.0, 1.2        <0.001

¹ OR = Odds Ratio, CI = Confidence Interval

Each coefficient is a log odds ratio relative to the reference category of its variable, holding the other predictors constant. Exponentiating it gives the odds ratio.

Residence: For individuals living in rural areas, the log odds ratio (log OR) is -0.19. Compared to individuals living in urban areas, those living in rural areas have lower odds of using contraceptives, by a factor of about exp(-0.19) = 0.83.

Education level: For individuals with primary education, the log OR is 0.46, so their odds of using contraceptives are about exp(0.46) = 1.58 times those of individuals with no education. For secondary education, the log OR is 0.59 (about exp(0.59) = 1.80 times the odds), and for higher education it is 0.75 (about exp(0.75) = 2.12 times the odds), both relative to no education.

Wealth index: For individuals in the middle wealth index category, the log OR is 0.94, so their odds of using contraceptives are about exp(0.94) = 2.56 times those of individuals in the poor category. For the rich category, the log OR is 1.1, which corresponds to about exp(1.1) = 3.00 times the odds of the poor category.
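The conversions above can be reproduced in a few lines. This sketch types in the rounded log odds ratios and confidence limits from the table by hand, so the results are only as precise as those rounded inputs; `log_or`, `ci_low`, `ci_high`, and `or_table` are names introduced here for illustration.

```r
# Rounded log odds ratios and 95% CI limits copied from the regression table
log_or  <- c(rural = -0.19, primary = 0.46, secondary = 0.59,
             higher = 0.75, middle = 0.94, rich = 1.1)
ci_low  <- c(-0.27, 0.40, 0.48, 0.60, 0.87, 1.0)
ci_high <- c(-0.12, 0.52, 0.70, 0.89, 1.0, 1.2)

# Exponentiate to move from the log odds scale to the odds ratio scale
or_table <- data.frame(OR = exp(log_or),
                       lower = exp(ci_low),
                       upper = exp(ci_high))
print(round(or_table, 2))
```

With the fitted model available, `exp(coef(model))` gives the same odds ratios at full precision, and `tbl_regression()` accepts `exponentiate = TRUE` to report odds ratios directly in the table.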

Goodness of Fit of the Model Using the ROC Curve and AUC

Code
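library(pROC)  # provides roc() and auc()
# Compare the observed 0/1 outcome with the fitted probabilities
roc_curve <- roc(model$y, fitted(model))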
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Code
# Calculate the AUC-ROC
auc_roc <- auc(roc_curve)

# Print AUC-ROC
print(auc_roc)
Area under the curve: 0.6652

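The AUC measures the model’s ability to distinguish between the two outcome classes: 0.5 corresponds to chance-level discrimination and 1 to perfect separation, so a higher value indicates better performance. By a common rule of thumb, an AUC between 0.6 and 0.7 is considered fair, so the value of 0.6652 obtained here indicates only modest discrimination.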
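One way to build intuition for the AUC is to compute it by hand: it equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below uses simulated scores (not the contraceptive model) and base R only. The pairwise and rank-based calculations are standard equivalents of the AUC, shown here as an illustration rather than as what `pROC` does internally.

```r
set.seed(7)  # simulated outcome and scores, not the contraceptive model
y <- rbinom(300, 1, 0.4)
score <- plogis(-0.5 + 1.0 * y + rnorm(300))  # cases tend to score higher

cases    <- score[y == 1]
controls <- score[y == 0]

# Pairwise definition: share of case/control pairs ranked correctly (ties count 1/2)
auc_pairs <- mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))

# Equivalent rank-based (Mann-Whitney) formula
r  <- rank(score)
n1 <- length(cases)
n0 <- length(controls)
auc_rank <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(pairs = auc_pairs, rank = auc_rank))
```

Both calculations give the same number, which makes clear that the AUC depends only on how the model ranks observations, not on how well calibrated the predicted probabilities are.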

Conclusion

Logistic regression is a powerful tool for predicting binary outcomes and understanding the influence of various predictors. This blog post demonstrated how to perform logistic regression in R, interpret the results, and assess model performance using the ROC curve and AUC value.