Some data come with multinomial outcomes, in which case the outcome variable is a nominal or polychotomous variable with more than two levels. In multinomial outcome data, the outcome has no natural ordering; if it has one, it is best treated as an ordinal outcome. For these data, we can modify the binary logistic model to perform estimation and inference (D. Kleinbaum 2010; Hosmer, Lemeshow, and Sturdivant 2013).
Variables with more than two levels are known as either:
If we employ logistic regression on such data, the analysis is known as polytomous or multinomial logistic regression. Again, the polychotomous outcome does not have any natural order; when the categories of the outcome variable do have a natural order, ordinal logistic regression is preferred.
Examples of data with polychotomous outcome variables (with more than two levels) include:
A numerical outcome can be categorized based on different cut-off points. The newly created categorical variable can then be treated either as a nominal (polychotomous or multinomial) outcome or as an ordinal outcome. In the former case, the use of multinomial logistic regression is justified.
With multinomial outcome data, an extension of logistic regression known as the multinomial logit or multinomial logistic regression can be performed. In multinomial logistic regression, one of the categories of the outcome variable is designated as the reference category, and each of the other levels is compared with this reference. The choice of reference category can be arbitrary and is at the discretion of the researcher. Most software sets the first category as the reference (also known as the baseline category) by default.
Other models that can analyze data with polychotomous outcome include:
Remember, interpreting and assessing the significance of the estimated coefficients are the main objectives in regression analysis. In multinomial logistic regression, we model the relationship between the covariates and an outcome variable that has more than two categories but no ordering or ranking.
The actual values taken by the dependent variable are irrelevant. In Stata, the exponentiated coefficient \(\exp(\hat{\beta})\) is reported as the so-called relative-risk ratio. The dependent variable, again, is a discrete variable, and the model is estimated using maximum likelihood.
In multinomial logistic regression, for example with three categories of the outcome, the probabilities of the three outcome categories must sum to 1 (the total probability). Each comparison, however, is made between two categories at a time. Because of that, each comparison considers only two probabilities, and the probabilities in the ratio do not sum to 1. Thus, the two odds-like expressions in multinomial logistic regression are not true odds.
Multinomial logistic regression can be thought of as simultaneously fitting binary logits for all comparisons among the alternatives.
In the special case where the covariate is binary, coded 0 or 1, we simplify the notation to \(OR_j = OR_j(1,0)\). Let us use an example where the data have three outcome categories: 0, 1, and 2.
Let’s say we have a dataset in which the outcome variable, \(Y\), is coded as \(0, 1,\) or \(2\). In practice, one should check that the software package to be used allows a \(0\) code as the smallest category. We have used packages that require the codes to begin with \(1\).
So the logit functions (log odds), when the outcome is a discrete variable taking the values 0, 1, and 2, are as below.
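In the notation of Hosmer, Lemeshow, and Sturdivant (2013), with \(Y=0\) as the reference category and covariates \(x = (x_1, \ldots, x_p)\), the two logit functions are:

\[
g_1(x) = \ln\left[\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}\right] = \beta_{10} + \beta_{11}x_1 + \cdots + \beta_{1p}x_p
\]

\[
g_2(x) = \ln\left[\frac{P(Y=2 \mid x)}{P(Y=0 \mid x)}\right] = \beta_{20} + \beta_{21}x_1 + \cdots + \beta_{2p}x_p
\]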
Assume, for example, that the outcome labelled \(Y=0\) is the reference outcome. The subscript on the odds ratio indicates which outcome is being compared with the reference outcome. The odds ratio of outcome \(Y = j\) versus \(Y = 0\) for the covariate values \(x = a\) versus \(x = b\) is:
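In symbols, this odds ratio is the ratio of the two odds-like expressions:

\[
OR_j(a, b) = \frac{P(Y=j \mid x=a) \,/\, P(Y=0 \mid x=a)}{P(Y=j \mid x=b) \,/\, P(Y=0 \mid x=b)}
\]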
Each odds ratio is calculated in a manner similar to that used in standard logistic regression. That is:
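For a single covariate \(x\), writing \(\beta_{j1}\) for its coefficient in the logit for outcome \(j\), the odds ratio takes the same exponential form as in standard logistic regression:

\[
OR_j(a, b) = e^{\beta_{j1}(a - b)}
\]

so for a binary covariate coded 0 or 1, \(OR_j = OR_j(1, 0) = e^{\beta_{j1}}\).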
The conditional probabilities for the multinomial logit model are:
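Writing \(g_1(x)\) and \(g_2(x)\) for the two logit functions above (with \(Y=0\) as the reference, so that \(g_0(x) = 0\)):

\[
P(Y=0 \mid x) = \frac{1}{1 + e^{g_1(x)} + e^{g_2(x)}}, \qquad
P(Y=1 \mid x) = \frac{e^{g_1(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}, \qquad
P(Y=2 \mid x) = \frac{e^{g_2(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}
\]

Note that the three probabilities sum to 1 for every \(x\).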
Start brand new analysis with a new project in RStudio. To do so,
library(here)
## here() starts at D:/Data_Analysis_CRC_multivar_data_analysis_codes
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package () to force all conflicts to become errors
library(haven)
library(gtsummary)
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
##     group_rows
The dataset contains all the variables of our interest. For the purpose of this analysis, we want to explore the association of hypertension status, weight, and total cholesterol with the result of the fasting blood sugar (FBS) screening. The variables in the dataset are as follows:
We will use the haven::read_dta() function to read Stata .dta data into memory. Remember, we can also use the foreign::read.dta() function for the same purpose; it is your choice.
We use here() to point to the folder data that contains the dataset, and then specify metabolic_syndrome.dta as the file to be read.
ms <- read_dta(here('data', 'metabolic_syndrome.dta'))
We can then quickly glance at the number of observations and the type of variables in the ms dataset.
glimpse(ms)
## Rows: 4,340
## Columns: 21
## $ codesub  <chr> "R-S615112", "MAA615089", "M-M616372", "MFM615361", "R-A61578…
## $ age      <dbl> 70, 20, 29, 25, 37, 43, 26, 28, 48, 20, 56, 55, 26, 39, 18, 2…
## $ hpt      <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ smoking  <chr> "never smoked", "still smoking", "never smoked", "still smoki…
## $ dmdx     <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
## $ height   <dbl> 1.54, 1.74, 1.54, 1.60, 1.44, 1.46, 1.47, 1.61, 1.68, 1.55, 1…
## $ weight   <dbl> 40.0, 54.6, 37.0, 48.4, 44.5, 45.5, 48.0, 40.0, 48.4, 41.0, 5…
## $ waist    <dbl> 76.0, 83.0, 83.0, 83.5, 85.0, 90.0, 91.0, 80.0, 82.0, 79.0, 9…
## $ hip      <dbl> 61, 62, 63, 64, 64, 64, 64, 64, 65, 66, 67, 67, 67, 67, 67, 6…
## $ msbpr    <dbl> 135.0, 105.0, 91.0, 117.0, 102.0, 124.0, 120.0, 85.0, 112.0, …
## $ mdbpr    <dbl> 80.0, 58.0, 60.0, 68.5, 78.0, 65.5, 77.0, 60.0, 74.0, 52.0, 1…
## $ hba1c    <dbl> 5.2, 5.3, 4.8, 4.8, 5.1, 5.1, 4.8, 4.9, 5.6, 4.2, 5.1, 5.3, 4…
## $ fbs      <dbl> 3.99, 4.26, 4.94, 4.60, 4.60, 4.42, 3.82, 4.40, 4.80, 3.68, 6…
## $ mogtt1h  <dbl> 7.06, 8.63, 6.26, 4.31, 9.49, 6.29, NA, 6.43, 9.23, 6.70, 8.7…
## $ mogtt2h  <dbl> 3.22, 6.49, 5.15, 3.85, 7.71, 5.65, 5.88, 4.89, 4.29, 2.59, 7…
## $ totchol  <dbl> 5.43, 5.13, 5.55, 4.01, 5.21, 6.19, 4.33, 5.84, 6.14, 6.02, 6…
## $ ftrigliz <dbl> 1.06, 1.17, 0.72, 1.12, 0.78, 1.11, 0.73, 0.79, 1.63, 0.81, 1…
## $ hdl      <dbl> 1.65, 1.59, 2.24, 1.21, 1.43, 2.18, 0.98, 1.81, 1.63, 1.47, 1…
## $ ldl      <dbl> 2.69, 2.79, 2.55, 1.83, 2.40, 2.93, 1.82, 3.43, 3.71, 2.77, 3…
## $ gender   <chr> "female", "male", "female", "male", "female", "female", "fema…
## $ crural   <chr> "rural", "rural", "rural", "rural", "rural", "rural", "rural"…
ms <- ms %>% 
  select(fbs, totchol, hpt, weight) %>% 
  mutate(across(where(is.labelled), as_factor))
glimpse(ms)
## Rows: 4,340
## Columns: 4
## $ fbs     <dbl> 3.99, 4.26, 4.94, 4.60, 4.60, 4.42, 3.82, 4.40, 4.80, 3.68, 6.…
## $ totchol <dbl> 5.43, 5.13, 5.55, 4.01, 5.21, 6.19, 4.33, 5.84, 6.14, 6.02, 6.…
## $ hpt     <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
## $ weight  <dbl> 40.0, 54.6, 37.0, 48.4, 44.5, 45.5, 48.0, 40.0, 48.4, 41.0, 52…
Let us create a categorical variable (also known as a factor variable) based on this classification:
ms <- ms %>% 
  mutate(cat_fbs = cut(fbs, 
                       breaks = c(2.50, 6.10, 6.90, 28.01),
                       labels = c("Normal", "IFG", "DM")))
ms %>% count(cat_fbs)
## # A tibble: 4 × 2
##   cat_fbs     n
##   <fct>   <int>
## 1 Normal   3130
## 2 IFG       364
## 3 DM        595
## 4 <NA>      251
We notice that 251 observations have no recorded values (NA). Thus, we decide to remove the observations with missing values for the variable cat_fbs.
ms <- ms %>% 
  filter(!is.na(cat_fbs))
ms %>% count(cat_fbs)
## # A tibble: 3 × 2
##   cat_fbs     n
##   <fct>   <int>
## 1 Normal   3130
## 2 IFG       364
## 3 DM        595
Next, we generate the table of summary statistics:
ms %>% 
  tbl_summary(
    by = cat_fbs, 
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    type = list(where(is.logical) ~ "categorical"),
    label = list(fbs ~ "Fasting Blood Sugar (mmol/L)", 
                 totchol ~ "Total Cholesterol (mmol/L)",
                 hpt ~ "Hypertension", 
                 weight ~ "Weight (kg)"),
    missing_text = "Missing") %>% 
  modify_caption("**Table 1. Survey Participant Characteristic**") %>% 
  modify_header(label ~ "**Variable**") %>% 
  modify_spanning_header(c("stat_1", "stat_2", "stat_3") ~ "**Glycemic Control Status**") %>% 
  modify_footnote(all_stat_cols() ~ "Mean (SD) or Frequency (%)") %>% 
  bold_labels() %>% 
  as_gt()
**Table 1. Survey Participant Characteristic**

**Glycemic Control Status**

| **Variable** | Normal, N = 3,130¹ | IFG, N = 364¹ | DM, N = 595¹ |
|---|---|---|---|
| Fasting Blood Sugar (mmol/L) | 4.76 (0.82) | 6.44 (0.22) | 10.41 (3.63) |
| Total Cholesterol (mmol/L) | 5.69 (1.24) | 6.08 (1.37) | 6.16 (1.31) |
| Missing | 4 | 0 | 1 |
| Hypertension | 281 (9.0%) | 75 (21%) | 126 (21%) |
| Weight (kg) | 63 (14) | 68 (14) | 68 (15) |
| Missing | 1 | 0 | 0 |

¹ Mean (SD) or Frequency (%)
levels(ms$cat_fbs)
## [1] "Normal" "IFG" "DM"
However, we would like DM to be the first level of cat_fbs. To do that, we will use the fct_relevel() function.
ms <- ms %>% 
  mutate(cat_fbs = fct_relevel(cat_fbs, c("DM", "IFG", "Normal")))
levels(ms$cat_fbs)
## [1] "DM" "IFG" "Normal"
Our intention is to investigate the relationship of totchol, hpt, and weight with the outcome variable cat_fbs. Thus, we will perform multinomial logistic regression to estimate the relationship for 2 models:
In both models, the reference group is Normal
For the independent variable totchol:
log_chol <- vglm(cat_fbs ~ totchol, multinomial, data = ms)
summary(log_chol)
## 
## Call:
## vglm(formula = cat_fbs ~ totchol, family = multinomial, data = ms)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -3.33584    0.21306 -15.656  < 2e-16 ***
## (Intercept):2 -3.59935    0.25764 -13.970  < 2e-16 ***
## totchol:1      0.28357    0.03446   8.229  < 2e-16 ***
## totchol:2      0.24661    0.04176   5.905 3.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 5634.468 on 8164 degrees of freedom
## 
## Log-likelihood: -2817.234 on 8164 degrees of freedom
## 
## Number of Fisher scoring iterations: 4
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
## 
## 
## Reference group is level 3 of the response
This is the model where hpt is the independent variable:
log_hpt <- vglm(cat_fbs ~ hpt, multinomial, data = ms)
summary(log_hpt)
## 
## Call:
## vglm(formula = cat_fbs ~ hpt, family = multinomial, data = ms)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -1.80412    0.04983 -36.205  < 2e-16 ***
## (Intercept):2 -2.28830    0.06173 -37.067  < 2e-16 ***
## hptyes:1       1.00205    0.11823   8.475  < 2e-16 ***
## hptyes:2       0.96743    0.14389   6.724 1.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 5637.16 on 8174 degrees of freedom
## 
## Log-likelihood: -2818.58 on 8174 degrees of freedom
## 
## Number of Fisher scoring iterations: 4
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level 3 of the response
And lastly, the independent variable is weight:
log_wt <- vglm(cat_fbs ~ weight, multinomial, data = ms)
summary(log_wt)
## 
## Call:
## vglm(formula = cat_fbs ~ weight, family = multinomial, data = ms)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -3.214303   0.204303 -15.733  < 2e-16 ***
## (Intercept):2 -3.672632   0.249121 -14.742  < 2e-16 ***
## weight:1       0.023860   0.002988   7.984 1.42e-15 ***
## weight:2       0.023372   0.003627   6.444 1.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 5638.998 on 8172 degrees of freedom
## 
## Log-likelihood: -2819.499 on 8172 degrees of freedom
## 
## Number of Fisher scoring iterations: 4
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
## 
## 
## Reference group is level 3 of the response
We feel that totchol, hpt and weight are all important independent variables. Hence, we want to fit a model with the three independent variables as the covariates.
mlog <- vglm(cat_fbs ~ totchol + hpt + weight, multinomial, data = ms)
summary(mlog)
## 
## Call:
## vglm(formula = cat_fbs ~ totchol + hpt + weight, family = multinomial, 
##     data = ms)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -4.907874   0.303768 -16.157  < 2e-16 ***
## (Intercept):2 -5.112990   0.366325 -13.958  < 2e-16 ***
## totchol:1      0.277217   0.035055   7.908 2.62e-15 ***
## totchol:2      0.239413   0.042391   5.648 1.63e-08 ***
## hptyes:1       0.899987   0.120451   7.472 7.91e-14 ***
## hptyes:2       0.867136   0.145569   5.957 2.57e-09 ***
## weight:1       0.022678   0.003074   7.378 1.61e-13 ***
## weight:2       0.021994   0.003710   5.928 3.07e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 5476.397 on 8158 degrees of freedom
## 
## Log-likelihood: -2738.199 on 8158 degrees of freedom
## 
## Number of Fisher scoring iterations: 5
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', '(Intercept):2'
## 
## 
## Reference group is level 3 of the response
Then, we hypothesize that there could be a significant interaction between totchol and weight. To test this hypothesis, we extend the multivariable logistic regression model by adding an interaction term, the product of the variables weight and totchol.
mlogi <- vglm(cat_fbs ~ totchol + hpt + weight + totchol*weight, multinomial, data = ms)
summary(mlogi)
## 
## Call:
## vglm(formula = cat_fbs ~ totchol + hpt + weight + totchol * weight, 
##     family = multinomial, data = ms)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1    -5.1253325  0.9671480  -5.299 1.16e-07 ***
## (Intercept):2    -7.0506874  1.1195704  -6.298 3.02e-10 ***
## totchol:1         0.3155740  0.1605324   1.966  0.04932 *  
## totchol:2         0.5747241  0.1878902   3.059  0.00222 ** 
## hptyes:1          0.8984459  0.1204877   7.457 8.87e-14 ***
## hptyes:2          0.8643701  0.1455478   5.939 2.87e-09 ***
## weight:1          0.0260899  0.0142834   1.827  0.06776 .  
## weight:2          0.0514503  0.0165054   3.117  0.00183 ** 
## totchol:weight:1 -0.0006015  0.0023804  -0.253  0.80052    
## totchol:weight:2 -0.0051091  0.0028023  -1.823  0.06828 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 5473.004 on 8156 degrees of freedom
## 
## Log-likelihood: -2736.502 on 8156 degrees of freedom
## 
## Number of Fisher scoring iterations: 5
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', '(Intercept):2'
## 
## 
## Reference group is level 3 of the response
The interaction terms in our model showed p-values above 0.05 (p-values of 0.80 and 0.07, respectively). As the p-values are larger than the level of significance at \(5\%\) and the values of the regression parameters for the interaction terms are likely not clinically meaningful, we have decided not to use the model with the interaction term.
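The same conclusion can be reached by comparing the two nested models with a likelihood-ratio test. Below is a minimal sketch using only the residual deviances and degrees of freedom reported in the two summaries above; in practice, VGAM's lrtest(mlogi, mlog) would perform this comparison directly on the fitted objects.

```r
# Likelihood-ratio test for the two interaction terms:
# the drop in residual deviance is compared against a chi-square
# distribution with df equal to the number of extra parameters.
lr_stat <- 5476.397 - 5473.004   # deviance without minus with interaction
df_diff <- 8158 - 8156           # the two interaction coefficients
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
round(p_value, 3)                # 0.183
```

The overall test therefore also fails to reject the simpler model without the interaction.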
For the inference, we will
There is no facility inside the broom::tidy() function to generate confidence intervals for objects of class vglm. Because of that, we will use the coef(), confint(), and cbind() functions to produce a rather nice table of inferences.
We are going to follow these steps:
b_mlog <- coef(mlog)
ci_mlog <- confint(mlog)
b_ci_mlog <- data.frame(b_mlog, ci_mlog) %>% 
  rename("log odds" = b_mlog, 
         "Lower CI" = X2.5..,
         "Upper CI" = X97.5..)
b_ci_mlog %>% 
  kbl(digits = 2, booktabs = TRUE, 
      caption = "Log odds from multinomial logistic regression") %>% 
  kable_styling(position = "center")
| | log odds | Lower CI | Upper CI |
|---|---|---|---|
| (Intercept):1 | -4.91 | -5.50 | -4.31 |
| (Intercept):2 | -5.11 | -5.83 | -4.40 |
| totchol:1 | 0.28 | 0.21 | 0.35 |
| totchol:2 | 0.24 | 0.16 | 0.32 |
| hptyes:1 | 0.90 | 0.66 | 1.14 |
| hptyes:2 | 0.87 | 0.58 | 1.15 |
| weight:1 | 0.02 | 0.02 | 0.03 |
| weight:2 | 0.02 | 0.01 | 0.03 |
Afterwards, we will exponentiate the coefficients to obtain the relative-risk ratios and combine the results with the previous table. Finally, we will name the columns of the resulting object tab_mlog.
rrr_mlog <- exp(b_ci_mlog)
tab_mlog <- cbind(b_ci_mlog, rrr_mlog)
colnames(tab_mlog) <- c('b', 'lower b', 'upper b', 'RRR', 'lower RRR', 'upper RRR')
tab_mlog %>% 
  kbl(digits = 2, booktabs = TRUE, 
      caption = "Log odds and RRR from multinomial logistic regression") %>% 
  kable_styling(position = "center")
| | b | lower b | upper b | RRR | lower RRR | upper RRR |
|---|---|---|---|---|---|---|
| (Intercept):1 | -4.91 | -5.50 | -4.31 | 0.01 | 0.00 | 0.01 |
| (Intercept):2 | -5.11 | -5.83 | -4.40 | 0.01 | 0.00 | 0.01 |
| totchol:1 | 0.28 | 0.21 | 0.35 | 1.32 | 1.23 | 1.41 |
| totchol:2 | 0.24 | 0.16 | 0.32 | 1.27 | 1.17 | 1.38 |
| hptyes:1 | 0.90 | 0.66 | 1.14 | 2.46 | 1.94 | 3.11 |
| hptyes:2 | 0.87 | 0.58 | 1.15 | 2.38 | 1.79 | 3.17 |
| weight:1 | 0.02 | 0.02 | 0.03 | 1.02 | 1.02 | 1.03 |
| weight:2 | 0.02 | 0.01 | 0.03 | 1.02 | 1.01 | 1.03 |
The results from our multivariable multinomial logistic regression model can be interpreted as below:
We can calculate the predicted probability of each outcome category by using the predict() function. Below are the results for the first 10 observations.
predict(mlog, type = 'response') %>% head(10)
##            DM        IFG    Normal
## 1  0.15254733 0.09528434 0.7521683
## 2  0.09000250 0.05817369 0.8518238
## 3  0.07042058 0.04534241 0.8842370
## 4  0.06047226 0.04095046 0.8985773
## 5  0.07525337 0.04882982 0.8759168
## 6  0.09712040 0.06068516 0.8421944
## 7  0.06497274 0.04348092 0.8915463
## 8  0.08025639 0.05100727 0.8687363
## 9  0.10144228 0.06337974 0.8351780
## 10 0.08548801 0.05392687 0.8605851
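Each row of this matrix sums to 1, as the conditional probabilities require. A minimal self-contained sketch of this property, using simulated data rather than the ms dataset (the variables x and y below are hypothetical):

```r
library(VGAM)

set.seed(123)
n <- 200
x <- rnorm(n)                                  # simulated covariate
y <- factor(sample(c("DM", "IFG", "Normal"),   # hypothetical 3-level outcome
                   size = n, replace = TRUE))

fit <- vglm(y ~ x, multinomial, data = data.frame(y, x))

# predicted probabilities: an n x 3 matrix, one column per outcome category
p <- predict(fit, type = "response")
range(rowSums(p))   # both endpoints equal 1, up to floating-point error
```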
We can make a better table using the knitr and kableExtra packages, or we can save the results as a .csv file so we can edit them in a spreadsheet.
tab_mlog %>% write_csv("results_multinomial.csv")
In this chapter, we extend binary logistic regression to data with a categorical outcome variable that has three or more levels. We describe the multinomial logistic regression model, the log odds, and the conditional probabilities. We then show estimation for single and multiple multinomial logistic regression, including models with an interaction. Following that, we describe how to make inferences and perform predictions. Lastly, we make use of the kableExtra package to produce a nice table for our model. We are big fans of two excellent books on logistic regression: Applied Logistic Regression (Hosmer, Lemeshow, and Sturdivant 2013) and Logistic Regression: A Self-Learning Text (D. Kleinbaum 2010). You will acquire a strong understanding of the concepts and the model-building process from these two books. To see which packages and functions are relevant to logistic regression, you can refer to A Handbook of Statistical Analyses Using R (Everitt and Hothorn 2017).