Variables selected in GLM being used in GLMM

I would like to ask a question about generalized linear mixed models (GLMMs). I know that the hierarchical structure of the observations, their lack of independence, and unbalanced samples, among other things, can prevent me from using a GLM instead of a GLMM. However, among the models I have tested in my work, the one with the best result (AIC, BIC, etc.) is not the GLMM that started from all the variables I have, but the GLMM that starts from the variables that were most statistically significant in a logistic regression.

Given this, I am afraid of being questioned about whether this use of logistic regression is valid. Or would the error in the logistic regression lie only in the estimates of the betas, which would not prevent me from using the variables it pre-selected?

I appreciate any guidance you can give me to justify the use of logistic regression in the selection of variables for a GLMM.

1 answer

Among the models I have tested in my work, the one with the best result (AIC, BIC, etc.) is not the GLMM that started from all the variables I have, but the GLMM that starts from the variables that were most statistically significant in a logistic regression.

Attention: a smaller AIC or BIC does not necessarily mean that the fitted model is the best one. The AIC formula is given by

AIC = 2k - 2 log(L-hat)

where k is the number of parameters estimated in the model and L-hat is the maximized value of the likelihood function. Since we want the model with the lower AIC, a GLMM, having more parameters to estimate, will always be penalized relative to a GLM.

The logic is similar for BIC, whose formula is given by

BIC = k log(n) - 2 log(L-hat)

where n is the number of observations. That is, in both cases, the more parameters a model has, the larger the penalty term 2k (or k log(n)) becomes, and GLMMs have more parameters being estimated. In statistics, we always want the simplest possible model that describes our data.
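
To make these penalties concrete, here is a minimal sketch that computes AIC and BIC by hand and checks them against R's built-in functions (the logistic regression on mtcars is a toy example, purely illustrative):

fit <- glm(am ~ mpg, data = mtcars, family = binomial)

k  <- length(coef(fit))        # number of estimated parameters
n  <- nobs(fit)                # number of observations
ll <- as.numeric(logLik(fit))  # maximized log-likelihood

c(manual = 2 * k - 2 * ll, builtin = AIC(fit))       # AIC
c(manual = log(n) * k - 2 * ll, builtin = BIC(fit))  # BIC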

Besides, the value of AIC or BIC alone doesn’t say much about the fitted model. After all, saying that the mean of one sample is larger than that of another, without taking the standard deviation into account, for example, guarantees nothing about the population means; we always need a hypothesis test to confirm it. Likewise, the ideal is to accompany AIC or BIC with a hypothesis test (more on that later, in this same answer).

I appreciate any guidance you can give me to justify the use of logistic regression in the selection of variables for a GLMM.

As Padre Quevedo would say, that "non ecziste" (it does not exist). The results of a logistic regression (one of the models in the GLM class) cannot be used to select variables for a GLMM. These are different classes of models, with different assumptions. What you can do is find the best model for your data using only logistic regression, or only a GLMM.

In particular, what I suggest when fitting a GLMM to a data set is to start with the most complex model possible, with all the predictor variables defined in the experiment, and simplify the model from there. My model-fitting philosophy, based essentially on my experience and on what I learned during my training, is to end up with the simplest possible model. The procedure I describe below does exactly that.

I’m going to analyze data on a treatment for toenail fungus. The data can be obtained in R, in the data frame HSAUR2::toenail:

library(HSAUR2)
str(toenail)
'data.frame':   1908 obs. of  5 variables:
 $ patientID: Factor w/ 294 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ outcome  : Factor w/ 2 levels "none or mild",..: 2 2 2 1 1 1 1 1 1 2 ...
 $ treatment: Factor w/ 2 levels "itraconazole",..: 2 2 2 2 2 2 2 1 1 1 ...
 $ time     : num  0 0.857 3.536 4.536 7.536 ...
 $ visit    : int  1 2 3 4 5 6 7 1 2 3 ...

The variables considered in this example were:

  • outcome: degree of separation of the toenail from the nail bed, with two levels (response variable)

  • treatment: which drug was used in the treatment (fixed effect)

  • visit: the number of the patient’s visit during treatment (fixed effect)

  • patientID: patient identification number (random effect)
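
Before fitting anything, a quick cross-tabulation (a sketch that pools all visits and, for now, ignores the repeated measures) gives a first feel for how the outcomes split by treatment:

# Raw counts of outcome by treatment, pooling all visits
with(toenail, table(treatment, outcome))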

First, I’m going to do an exploratory analysis of the data. I limited it to just 12 patients so that the visualization is clearer:

library(ggplot2)
library(dplyr)

toenail %>%
  filter(patientID %in% head(unique(toenail$patientID), 12)) %>%
  ggplot(aes(x=visit, y=outcome, colour=treatment)) +
  geom_point() +
  facet_wrap(~ patientID) +
  labs(x="Visit", y="Outcome", colour="Treatment") +
  scale_x_continuous(breaks=c(1, 3, 5, 7))

[Figure: outcome at each visit for the first 12 patients, faceted by patientID and colored by treatment]

Clearly, the response variable should be treated as binomial. Moreover, this experiment has repeated measures, because each patient made up to seven visits during treatment. So it is best to fit a GLMM to this data.
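
To confirm the repeated-measures structure, a quick sketch is to tabulate how many rows each patient contributes:

# Distribution of the number of visits per patient:
# each patient appears up to seven times in the data
table(table(toenail$patientID))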

First, let’s fit a model with an interaction between treatment and visit, with patientID as a random effect:

library(lme4)

modelo_treatment.visit.interacao <- glmer(outcome ~ treatment*visit + (1|patientID), data=toenail, family=binomial)

Next, let’s fit a model without the interaction between treatment and visit:

modelo_treatment.visit <- glmer(outcome ~ treatment + visit + (1|patientID), data=toenail, family=binomial)

As these models are nested, we can compare them through a likelihood ratio test:

anova(modelo_treatment.visit.interacao, modelo_treatment.visit, test="Chisq")

Data: toenail
Models:
modelo_treatment.visit: outcome ~ treatment + visit + (1 | patientID)
modelo_treatment.visit.interacao: outcome ~ treatment * visit + (1 | patientID)
                                 Df    AIC    BIC  logLik deviance  Chisq
modelo_treatment.visit            4 1260.3 1282.6 -626.17   1252.3       
modelo_treatment.visit.interacao  5 1258.8 1286.5 -624.39   1248.8 3.5627
                                 Chi Df Pr(>Chisq)  
modelo_treatment.visit                              
modelo_treatment.visit.interacao      1    0.05909 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The above command is testing the hypotheses

  • H_0: the models modelo_treatment.visit.interacao and modelo_treatment.visit are the same

  • H_1: the models modelo_treatment.visit.interacao and modelo_treatment.visit are not equal

Since the p-value for the comparison between the models was greater than 0.05 (it was 0.05909), we fail to reject H_0 and can treat the models as equivalent. Thus, the interaction is not necessary in this case, as it is preferable to have a simpler model. Next, we need to find out whether treatment or visit is necessary for the model.
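
For reference, here is a sketch of how this likelihood ratio test could be computed by hand from the log-likelihoods of the two fitted models:

# LRT by hand: twice the difference in log-likelihoods is compared to a
# chi-squared distribution with df equal to the difference in the number
# of estimated parameters
ll_full <- logLik(modelo_treatment.visit.interacao)
ll_red  <- logLik(modelo_treatment.visit)
lrt     <- as.numeric(2 * (ll_full - ll_red))        # about 3.56
df_diff <- attr(ll_full, "df") - attr(ll_red, "df")  # 1
pchisq(lrt, df = df_diff, lower.tail = FALSE)        # about 0.059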

So far, the chosen model has treatment and visit, with no interaction. Let’s fit a model without treatment and compare it to our current model:

modelo_visit <- glmer(outcome ~ visit + (1|patientID), data=toenail, family=binomial)

anova(modelo_treatment.visit, modelo_visit, test="Chisq")
Data: toenail
Models:
modelo_visit: outcome ~ visit + (1 | patientID)
modelo_treatment.visit: outcome ~ treatment + visit + (1 | patientID)
                       Df    AIC    BIC  logLik deviance  Chisq Chi Df
modelo_visit            3 1259.4 1276.0 -626.69   1253.4              
modelo_treatment.visit  4 1260.3 1282.6 -626.17   1252.3 1.0427      1
                       Pr(>Chisq)
modelo_visit                     
modelo_treatment.visit     0.3072

Again, the p-value was greater than 0.05, so the models are not distinguishable: we do not need treatment to explain the response variable. Now let’s test the model with only treatment and see what happens:

modelo_treatment <- glmer(outcome ~ treatment + (1|patientID), data=toenail, family=binomial)

anova(modelo_treatment.visit, modelo_treatment, test="Chisq")
Data: toenail
Models:
modelo_treatment: outcome ~ treatment + (1 | patientID)
modelo_treatment.visit: outcome ~ treatment + visit + (1 | patientID)
                       Df    AIC    BIC  logLik deviance  Chisq Chi Df
modelo_treatment        3 1572.6 1589.2 -783.29   1566.6              
modelo_treatment.visit  4 1260.3 1282.6 -626.17   1252.3 314.25      1
                       Pr(>Chisq)    
modelo_treatment                     
modelo_treatment.visit  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now we see that there is a difference between the model with treatment and visit and the model with only treatment.

Combining the last two results, we choose to keep the model with only visit.

Finally, we just need to find out whether visit is significant. For this, let’s fit the intercept-only model and proceed as in all the previous cases:

modelo_intercepto <- glmer(outcome ~ 1 + (1|patientID), data=toenail, family=binomial)

anova(modelo_visit, modelo_intercepto, test="Chisq")
Data: toenail
Models:
modelo_intercepto: outcome ~ 1 + (1 | patientID)
modelo_visit: outcome ~ visit + (1 | patientID)
                  Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
modelo_intercepto  2 1571.5 1582.6 -783.74   1567.5                         
modelo_visit       3 1259.4 1276.0 -626.69   1253.4 314.09      1  < 2.2e-16

modelo_intercepto    
modelo_visit      ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since the p-value is less than 0.05, we conclude that there is a difference between the fitted models. Therefore, visit is significant. That is, what matters in this case is visiting the doctor regularly, regardless of which drug is used.
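
To report the final model, we can inspect its estimates; a quick sketch:

summary(modelo_visit)
fixef(modelo_visit)    # estimated fixed effects (intercept and visit)
VarCorr(modelo_visit)  # variance of the per-patient random intercept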

The method I illustrated above, using likelihood ratio tests (LRTs), can be used to compare any type of nested models, from multiple linear regression to GLMMs. I recommend the book Mixed Effects Models and Extensions in Ecology with R, by Zuur et al. (2009), as a more in-depth reference on this. The book even discusses the fact that the p-values obtained with this method are not entirely reliable (in fact, almost no p-value in the context of GLMMs is reliable).

The very discussion of how to fit generalized linear mixed models is quite sophisticated. Some people, like Douglas Bates, argue against calculating p-values at all. In some cases, the results from the lme4 package do not provide p-values for the tests performed. This is because a p-value is, in essence, a probability associated with the distribution of some test statistic. In the case of traditional ANOVA, that distribution is the F distribution. In the case of a mixed model, we often do not know the distribution of the test statistic. Therefore, in some model fits with missing or unbalanced data, it does not make sense to calculate p-values for the tests performed.
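
If you still want to report some measure of uncertainty without relying on questionable p-values, lme4 offers likelihood-profile and parametric-bootstrap confidence intervals; a sketch (profiling a GLMM can take a while):

# Profile likelihood confidence intervals for the final model
confint(modelo_visit, method = "profile")

# Parametric bootstrap alternative (slower)
# confint(modelo_visit, method = "boot", nsim = 500)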

In your specific case, if the journal’s editor requires p-values to be reported, I would survey the latest articles published in the area and see how their results were reported. I would try to reproduce the analyses performed in those articles, even asking the authors for the original data sets if necessary, and see how to apply those articles’ approaches in the context of your research.

A good free source to learn more about this is Ben Bolker’s GLMM FAQ. There are books that deal with the subject, but all the ones I have read are very theoretical, requiring fairly advanced statistical knowledge.
