Unlike linear regression, which predicts continuous outcomes, logistic regression is designed for binary outcomes. It models the probability that the response falls in one of two categories as a function of one or more predictor variables. The model passes the linear combination of predictors through the logistic function (also known as the sigmoid function), which constrains the predicted values to lie between 0 and 1.
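As a minimal sketch of that transformation, the snippet below applies the logistic function to a few values of the linear predictor. The helper name `logistic` and the input values are illustrative and not taken from the original analysis; base R's `plogis()` computes the same function.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))

# Illustrative values of the linear predictor (the log odds)
eta <- c(-4, -1, 0, 1, 4)
p <- logistic(eta)
print(round(p, 3))

# Base R provides the same function as plogis()
print(all.equal(p, plogis(eta)))
```

A linear predictor of 0 corresponds to a probability of 0.5, and large negative or positive values approach 0 and 1 without ever reaching them.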
Assumptions of Logistic Regression
While logistic regression is robust to violations of some assumptions, it does rely on certain key assumptions:
Linearity of Log Odds: The log odds of the outcome is a linear combination of the predictors.
Independence of Observations: The observations are independent of each other.
Absence of Multicollinearity: The predictors are not highly correlated with each other.
Sufficient Sample Size: A minimum number of observations is required for stable parameter estimation.
Applications of Logistic Regression
Logistic regression finds applications across various domains, including:
Medical Research: Predicting the likelihood of disease occurrence based on risk factors.
Marketing: Predicting customer churn or response to marketing campaigns.
Finance: Assessing the likelihood of loan default based on financial indicators.
Social Sciences: Analyzing survey data to predict voting behavior.
Performing Logistic Regression in R
Let’s walk through an example of performing logistic regression in R using contraceptive use data.
The data were used to investigate the determinants of contraceptive use among individuals in a certain population based on demographic, socioeconomic, and geographic variables, including residence type, education level, and wealth index.
Variables in the Data
uses_contraceptive: This is the outcome variable and represents whether an individual uses contraception or not. It’s a factor variable with levels ‘yes’ and ‘no’.
Residence: This variable represents the type of residence where the individual lives. It’s a factor variable with levels ‘rural’ and ‘urban’, indicating whether the individual lives in a rural or urban area.
Education_level: This variable represents the education level of the individual. It’s a factor variable with levels such as ‘no education’, ‘primary’, ‘secondary’, and ‘higher’, indicating the highest level of education attained by the individual.
Wealth_index: This variable represents the wealth index or economic status of the individual. It’s a factor variable with levels such as ‘poor’, ‘middle’, and ‘rich’, indicating the economic status or wealth level of the individual.
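A model with these variables could be fit along the following lines. This is only a sketch: the original data set is not reproduced here, so the data frame `dat` is simulated with made-up effect sizes, and the choice of reference levels ('urban', 'no education', 'poor') is an assumption based on the comparisons used in the interpretation. The estimates it prints will not match the values reported in this post.

```r
set.seed(42)  # simulated data for illustration; NOT the data from the post
n <- 2000

# Hypothetical data frame using the variable names described above.
# The first level of each factor is the reference category.
dat <- data.frame(
  Residence = factor(sample(c("urban", "rural"), n, replace = TRUE),
                     levels = c("urban", "rural")),
  Education_level = factor(
    sample(c("no education", "primary", "secondary", "higher"), n, replace = TRUE),
    levels = c("no education", "primary", "secondary", "higher")),
  Wealth_index = factor(sample(c("poor", "middle", "rich"), n, replace = TRUE),
                        levels = c("poor", "middle", "rich"))
)

# Simulate the outcome from arbitrary, made-up log odds
eta <- -1 +
  0.5 * (dat$Education_level != "no education") +
  0.8 * (dat$Wealth_index != "poor") -
  0.2 * (dat$Residence == "rural")
dat$uses_contraceptive <- factor(ifelse(runif(n) < plogis(eta), "yes", "no"),
                                 levels = c("no", "yes"))

# With a two-level factor outcome, glm() treats the first level ("no")
# as failure and the second level ("yes") as success
model <- glm(uses_contraceptive ~ Residence + Education_level + Wealth_index,
             data = dat, family = binomial)

print(summary(model))    # coefficients are on the log-odds scale
print(exp(coef(model)))  # exponentiated coefficients are odds ratios
```

Setting the factor levels explicitly matters because every coefficient is a comparison against the first level of its factor.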
For individuals living in rural areas, the log of the odds ratio (log OR) is -0.19. This means that, compared to individuals living in urban areas, those living in rural areas have lower odds of using contraceptives (about exp(-0.19) = 0.83 times the odds).
For individuals with primary education, the log OR is 0.46. This means that compared to individuals with no education, those with primary education have higher odds (about exp(0.46) = 1.58 times) of using contraceptives.
For individuals with secondary education, the log OR is 0.59. This suggests higher odds (about exp(0.59) = 1.80 times) of contraceptive usage compared to those with no education.
For individuals with higher education, the log OR is 0.75. This implies higher odds (about exp(0.75) = 2.12 times) of contraceptive usage compared to those with no education.
For individuals with middle wealth index, the log OR is 0.94. This means that compared to individuals in the poor wealth index category, those in the middle wealth index category have higher odds (about exp(0.94) = 2.56 times) of using contraceptives.
For individuals with rich wealth index, the log OR is 1.1. This implies higher odds (about exp(1.1) = 3.00 times) of contraceptive usage compared to those in the poor wealth index category.
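The odds ratios quoted above can be reproduced by exponentiating the reported log odds ratios. The sketch below types the coefficients in by hand from the text; the vector names are illustrative labels, not the coefficient names that R would print for the fitted model.

```r
# Log odds ratios as reported in the interpretation above
log_or <- c(
  rural     = -0.19,
  primary   = 0.46,
  secondary = 0.59,
  higher    = 0.75,
  middle    = 0.94,
  rich      = 1.1
)

# Exponentiate to move from the log-odds scale to odds ratios
odds_ratio <- exp(log_or)
print(round(odds_ratio, 2))

# Percentage change in odds relative to each reference category
pct_change <- 100 * (odds_ratio - 1)
print(round(pct_change))
```

An odds ratio below 1 (as for rural residence) means lower odds than the reference category, and an odds ratio above 1 means higher odds. Note that these are ratios of odds, not of probabilities.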
# Calculate the AUC-ROC
auc_roc <- auc(roc_curve)
# Print AUC-ROC
print(auc_roc)
Area under the curve: 0.6652
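The AUC value indicates the model’s ability to distinguish between the two outcome classes. An AUC of 0.5 corresponds to random guessing and an AUC of 1 to perfect discrimination, so a higher value indicates better model performance. The value of 0.6652 obtained here falls between 0.6 and 0.7, a range usually described as fair at best, so the model separates users from non-users only modestly.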
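To make the meaning of the AUC concrete, the sketch below computes it in base R as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The function name `auc_by_rank` and the scores and labels are made up for illustration; this is not the pROC implementation used above, although both approaches estimate the same quantity.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_by_rank <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true classes (1 = yes, 0 = no)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

auc_value <- auc_by_rank(scores, labels)
print(auc_value)  # 13 of the 16 positive-negative pairs are ordered correctly: 0.8125
```

For a fitted model, `scores` would be the predicted probabilities from `predict(model, type = "response")` and `labels` the observed outcomes coded as 0 and 1.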
Conclusion
Logistic regression is a powerful tool for predicting binary outcomes and understanding the influence of various predictors. This blog post demonstrated how to perform logistic regression in R, interpret the results, and assess model performance using the ROC curve and AUC value.