GLM with non-significant P values

My dataset has four variables: a dependent (Y), and three independent (X1, X2 and X3):

> dput(dados)
structure(list(Y = c(29.1, 27.7, 28.2, 28.1, 27.3, 25.9, 27.2, 
30.6, 27.6, 28.4, 26.6, 28.1), X1 = c(29, 27.8, 27, 27.8, 27.7, 
26.6, 26.8, 30.7, 27.6, 25.4, 26.7, 26.7), X2 = c(28.3, 27.2, 
26.6, 27.5, 27.1, 26.2, 24.8, 27.2, 26.3, 23.9, 24.3, 24.1), 
    X3 = c(28.4, 26.5, 26.8, 27.4, 27.4, 26.3, 25.8, 29.2, 27.1, 
    25, 24.8, 25.3), z_X1 = structure(c(1.12601061016992, 0.235101116409105, 
    -0.358838546098108, 0.235101116409105, 0.160858658595702, 
    -0.655808377351713, -0.507323461724911, 2.38813239299775, 
    0.0866162007823022, -1.54671787111253, -0.581565919538313, 
    -0.581565919538313), .Dim = c(12L, 1L), "`scaled:center`" = 27.4833333333333, "`scaled:scale`" = 1.34693816645102), 
    z_X2 = structure(c(1.46389155419594, 0.723532607246269, 0.319700454364632, 
    0.925448683687089, 0.656227248432664, 0.0504790191102044, 
    -0.891796004280285, 0.723532607246269, 0.117784377923812, 
    -1.49754423360274, -1.22832279834832, -1.36293351597553), .Dim = c(12L, 
    1L), "`scaled:center`" = 26.125, "`scaled:scale`" = 1.48576579581036), 
    z_X3 = structure(c(1.29369391058265, -0.124393645248333, 
    0.0995149161986654, 0.54733203909266, 0.54733203909266, -0.273666019546331, 
    -0.646846955291328, 1.89078340777465, 0.323423477645664, 
    -1.24393645248332, -1.39320882678132, -1.02002789103632), .Dim = c(12L, 
    1L), "`scaled:center`" = 26.6666666666667, "`scaled:scale`" = 1.33983264445658)), row.names = c(NA, 
-12L), class = "data.frame")

My goal is to use a GLM to assess the influence of X1, X2 and X3 on Y. Since I intend to assess effect sizes later, I standardized the explanatory variables before building the GLM formula:

> dados$z_X1 <- scale(dados$X1)
> dados$z_X2 <- scale(dados$X2)
> dados$z_X3 <- scale(dados$X3)

Then I did the GLM, and I noticed that there were no statistically significant relationships:

> fórmula <- dados$Y ~ dados$z_X1 + dados$z_X2 + dados$z_X3

> modelo <- glm(fórmula, data = dados, family = 'gaussian')

> summary(modelo)

Call:
glm(formula = fórmula, family = "gaussian", data = dados)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3441  -0.5775   0.1998   0.5659   1.2082  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.9000     0.2511 111.132  4.8e-14 ***
dados$z_X1    0.3932     0.6634   0.593    0.570    
dados$z_X2   -0.9897     0.5763  -1.717    0.124    
dados$z_X3    1.2718     0.9426   1.349    0.214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.7563262)

    Null deviance: 15.8200  on 11  degrees of freedom
Residual deviance:  6.0506  on  8  degrees of freedom
AIC: 35.838

Number of Fisher Scoring iterations: 2

However, when out of curiosity I fitted a separate linear regression for each explanatory variable, I noticed that the p values for X1 and X3 (standardized) were significant:

> summary(lm(dados$Y ~ dados$z_X1))

Call:
lm(formula = dados$Y ~ dados$z_X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4573 -0.4792 -0.1374  0.6180  1.7799 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.9000     0.2628 106.160   <2e-16 ***
dados$z_X1    0.8275     0.2745   3.014    0.013 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9104 on 10 degrees of freedom
Multiple R-squared:  0.4761,    Adjusted R-squared:  0.4237 
F-statistic: 9.087 on 1 and 10 DF,  p-value: 0.01302

> summary(lm(dados$Y ~ dados$z_X2))

Call:
lm(formula = dados$Y ~ dados$z_X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0207 -0.5717 -0.2569  0.6391  2.4029 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.9000     0.3411  81.783 1.83e-15 ***
dados$z_X2    0.4106     0.3563   1.152    0.276    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.182 on 10 degrees of freedom
Multiple R-squared:  0.1172,    Adjusted R-squared:  0.02893 
F-statistic: 1.328 on 1 and 10 DF,  p-value: 0.276

> summary(lm(dados$Y ~ dados$z_X3))

Call:
lm(formula = dados$Y ~ dados$z_X3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7915 -0.3154 -0.1562  0.4124  1.4478 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.9000     0.2804  99.508 2.57e-16 ***
dados$z_X3    0.7620     0.2928   2.602   0.0264 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9713 on 10 degrees of freedom
Multiple R-squared:  0.4037,    Adjusted R-squared:  0.3441 
F-statistic:  6.77 on 1 and 10 DF,  p-value: 0.0264

What could explain this difference between the GLM results and the separate regressions? I suppose that, in the GLM, multicollinearity between X1, X2 and X3 might explain the lack of significance, but I'm not sure about that.

1 answer

I think you’re complicating the problem.

lm and glm

In your case you have a generalized linear model with Gaussian family and identity link, which is exactly equivalent to the ordinary linear model. There is no point in using glm, which fits the model with an iterative algorithm, when lm gives the same results:

formula_lm <- Y ~ X1 + X2 + X3

fit_lm <- lm(formula_lm, dados)
fit_glm <- glm(formula_lm, dados, family = "gaussian")

coef(fit_lm)
#(Intercept)          X1          X2          X3 
# 11.9657747   0.2919041  -0.6660981   0.9492577 
coef(fit_glm)
#(Intercept)          X1          X2          X3 
# 11.9657747   0.2919041  -0.6660981   0.9492577 

deviance(fit_lm)
#[1] 6.050609
deviance(fit_glm)
#[1] 6.050609

summary(fit_lm)   # equivalent
summary(fit_glm)  # results
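
A quick way to confirm the equivalence (a small check, not part of the original answer) is to compare the two coefficient tables directly:

# estimates, standard errors, t statistics and p values are identical
all.equal(coef(summary(fit_lm)), coef(summary(fit_glm)))
# should be TRUE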

Multicollinearity

To check for multicollinearity, start by computing the variance inflation factor (VIF). In practice the following rule of thumb is used:

  • Less than 5: moderate;
  • Between 5 and 10: high;
  • More than 10: very high.

If a VIF is high or very high, remove that regressor from the model. In R, the car package provides the vif function.

car::vif(fit_lm)
#      X1        X2        X3
#6.400063  4.830177 12.923282
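
As a sanity check (a minimal sketch using only base R, assuming the dados data frame above), the VIF can also be computed by hand: regress each predictor on the other two and take 1/(1 - R²):

vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # regress predictor p on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(dados, c("X1", "X2", "X3"))
# should reproduce the car::vif values above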

You should therefore remove the variable X3.

fit_lm_12 <- lm(Y ~ X1 + X2, dados)
summary(fit_lm_12)

rmse <- function(object, na.rm = FALSE){
  sqrt(mean(resid(object)^2, na.rm = na.rm))
}

rmse(fit_lm)      # 0.7100827
rmse(fit_lm_12)   # 0.7867373

AIC(fit_lm)       # 35.83755
AIC(fit_lm_12)    # 36.29786

BIC(fit_lm)       # 38.26209
BIC(fit_lm_12)    # 38.23749

z-Scores

In this case there is no reason to compute the z-scores: the coefficients change, but the model statistics stay the same, as you can see from the AIC, the deviance and the summary.

formula_z <- Y ~ z_X1 + z_X2 + z_X3
fit_z <- lm(formula_z, dados)

coef(fit_z)
AIC(fit_z)
deviance(fit_z)
summary(fit_z)
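
For instance (a quick check using the fits above, not part of the original answer), each standardized slope is just the corresponding raw slope multiplied by that predictor's standard deviation, which is why the t statistics and p values do not change:

# raw slopes rescaled by each predictor's standard deviation ...
coef(fit_lm)[-1] * sapply(dados[c("X1", "X2", "X3")], sd)
# ... equal the slopes of the standardized fit
coef(fit_z)[-1]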

More generally, it is not always necessary to compute z-scores, even when the variables are on very different scales, for example in PCA. This question on R-Help is precisely about this topic; the replies by Prof. Koenker and Prof. Heiberger give a (very) good idea of why not to compute them.

  • Thanks a lot for the explanation, @Rui Barradas, it helped me a lot! I still have one question: when I remove X3 from the model, the other variables still show no significance in the GLM, even though when I test the effect of each explanatory variable separately I get significant p values. What, then, would be the best way to detect this significance in the GLM?

  • @Arthurfilipe Instead of a GLM, when there is multicollinearity try lasso regression or ridge regression. If you follow the R-bloggers steps for the LASSO you will see that it is the variable X1 that gets eliminated (see the sketch after these comments).

  • Great stuff, thank you very much!
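
A minimal sketch of the LASSO suggestion from the comments, using the glmnet package (not part of the original answer; with only 12 observations the cross-validation is fragile, so treat the chosen lambda and the dropped variable as illustrative only):

library(glmnet)

x <- as.matrix(dados[, c("X1", "X2", "X3")])  # glmnet wants a numeric matrix
y <- dados$Y

set.seed(1)
cv <- cv.glmnet(x, y, alpha = 1, nfolds = 5)  # alpha = 1 -> LASSO
coef(cv, s = "lambda.min")                    # coefficients at the selected lambda;
                                              # rows shrunk to zero are dropped variables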
