My dataset has four variables: a dependent (Y) and three independent (X1, X2, and X3):
> dput(dados)
structure(list(Y = c(29.1, 27.7, 28.2, 28.1, 27.3, 25.9, 27.2,
30.6, 27.6, 28.4, 26.6, 28.1), X1 = c(29, 27.8, 27, 27.8, 27.7,
26.6, 26.8, 30.7, 27.6, 25.4, 26.7, 26.7), X2 = c(28.3, 27.2,
26.6, 27.5, 27.1, 26.2, 24.8, 27.2, 26.3, 23.9, 24.3, 24.1),
X3 = c(28.4, 26.5, 26.8, 27.4, 27.4, 26.3, 25.8, 29.2, 27.1,
25, 24.8, 25.3), z_X1 = structure(c(1.12601061016992, 0.235101116409105,
-0.358838546098108, 0.235101116409105, 0.160858658595702,
-0.655808377351713, -0.507323461724911, 2.38813239299775,
0.0866162007823022, -1.54671787111253, -0.581565919538313,
-0.581565919538313), .Dim = c(12L, 1L), `scaled:center` = 27.4833333333333, `scaled:scale` = 1.34693816645102),
z_X2 = structure(c(1.46389155419594, 0.723532607246269, 0.319700454364632,
0.925448683687089, 0.656227248432664, 0.0504790191102044,
-0.891796004280285, 0.723532607246269, 0.117784377923812,
-1.49754423360274, -1.22832279834832, -1.36293351597553), .Dim = c(12L,
1L), `scaled:center` = 26.125, `scaled:scale` = 1.48576579581036),
z_X3 = structure(c(1.29369391058265, -0.124393645248333,
0.0995149161986654, 0.54733203909266, 0.54733203909266, -0.273666019546331,
-0.646846955291328, 1.89078340777465, 0.323423477645664,
-1.24393645248332, -1.39320882678132, -1.02002789103632), .Dim = c(12L,
1L), `scaled:center` = 26.6666666666667, `scaled:scale` = 1.33983264445658)), row.names = c(NA,
-12L), class = "data.frame")
My goal is to use a GLM to assess the influence of X1, X2, and X3 on Y. Since I intend to assess the effect sizes later, I standardized the explanatory variables before building the GLM formula:
> dados$z_X1 <- scale(dados$X1)
> dados$z_X2 <- scale(dados$X2)
> dados$z_X3 <- scale(dados$X3)
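As a side note, scale() returns a one-column matrix and stores the centering and scaling values as attributes, which is why the dput() output above shows .Dim, scaled:center, and scaled:scale entries. If plain numeric columns are preferred, one small variation (just a sketch, not required for the model) is:
dados$z_X1 <- as.numeric(scale(dados$X1))  # as.numeric() drops the matrix dims and attributes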
Then I fitted the GLM and noticed that none of the relationships were statistically significant:
> fórmula <- dados$Y ~ dados$z_X1 + dados$z_X2 + dados$z_X3
> modelo <- glm(fórmula, data = dados, family = 'gaussian')
> summary(modelo)
Call:
glm(formula = fórmula, family = "gaussian", data = dados)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3441 -0.5775 0.1998 0.5659 1.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.9000 0.2511 111.132 4.8e-14 ***
dados$z_X1 0.3932 0.6634 0.593 0.570
dados$z_X2 -0.9897 0.5763 -1.717 0.124
dados$z_X3 1.2718 0.9426 1.349 0.214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.7563262)
Null deviance: 15.8200 on 11 degrees of freedom
Residual deviance: 6.0506 on 8 degrees of freedom
AIC: 35.838
Number of Fisher Scoring iterations: 2
However, when out of curiosity I ran a separate linear regression for each explanatory variable, I found that the p-values for X1 and X3 (standardized) were significant:
> summary(lm(dados$Y ~ dados$z_X1))
Call:
lm(formula = dados$Y ~ dados$z_X1)
Residuals:
Min 1Q Median 3Q Max
-1.4573 -0.4792 -0.1374 0.6180 1.7799
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.9000 0.2628 106.160 <2e-16 ***
dados$z_X1 0.8275 0.2745 3.014 0.013 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9104 on 10 degrees of freedom
Multiple R-squared: 0.4761, Adjusted R-squared: 0.4237
F-statistic: 9.087 on 1 and 10 DF, p-value: 0.01302
> summary(lm(dados$Y ~ dados$z_X2))
Call:
lm(formula = dados$Y ~ dados$z_X2)
Residuals:
Min 1Q Median 3Q Max
-2.0207 -0.5717 -0.2569 0.6391 2.4029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.9000 0.3411 81.783 1.83e-15 ***
dados$z_X2 0.4106 0.3563 1.152 0.276
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.182 on 10 degrees of freedom
Multiple R-squared: 0.1172, Adjusted R-squared: 0.02893
F-statistic: 1.328 on 1 and 10 DF, p-value: 0.276
> summary(lm(dados$Y ~ dados$z_X3))
Call:
lm(formula = dados$Y ~ dados$z_X3)
Residuals:
Min 1Q Median 3Q Max
-1.7915 -0.3154 -0.1562 0.4124 1.4478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.9000 0.2804 99.508 2.57e-16 ***
dados$z_X3 0.7620 0.2928 2.602 0.0264 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9713 on 10 degrees of freedom
Multiple R-squared: 0.4037, Adjusted R-squared: 0.3441
F-statistic: 6.77 on 1 and 10 DF, p-value: 0.0264
What would be the reason for this difference between the GLM results and the separate regressions? I suspect that multicollinearity between X1, X2, and X3 might explain the lack of significance in the GLM, but I'm not sure about that.
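One way to check that suspicion is to look at the pairwise correlations among the predictors and at their variance inflation factors. A minimal sketch, assuming the car package is available for vif():

cor(dados[, c("X1", "X2", "X3")])  # pairwise correlations among the predictors

library(car)  # assumed installed; provides vif()
# VIFs well above roughly 5-10 are a common rule of thumb for problematic collinearity
vif(glm(Y ~ z_X1 + z_X2 + z_X3, data = dados, family = gaussian))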
Thanks a lot for the explanation, @Rui Barradas, it helped me a lot! I still have one question: when I remove X3 from the model, the other variables still show no significance in the GLM, although when I test the effect of each explanatory variable separately I obtain significant p-values. What, then, would be the best way to observe this significance in the GLM? – Arthur Filipe
@Arthurfilipe Instead of a GLM, when there is multicollinearity try lasso regression or ridge regression. If you follow the R-bloggers steps on the lasso, you will see that it is the variable X1 that gets eliminated. – Rui Barradas
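For reference, a minimal sketch of the lasso approach suggested above, using the glmnet package (assumed installed; the fold count and seed are illustrative choices, not part of the original R-bloggers post):

library(glmnet)

x <- as.matrix(dados[, c("z_X1", "z_X2", "z_X3")])
y <- dados$Y

set.seed(123)                    # cross-validation shuffles the folds
cv <- cv.glmnet(x, y,
                alpha = 1,       # alpha = 1 selects the lasso penalty
                nfolds = 5)      # few observations, so use fewer folds
coef(cv, s = "lambda.min")       # predictors shrunk exactly to zero are dropped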
Great stuff, thank you very much! – Arthur Filipe