Linear regression in a DIC (completely randomized design) with split plots

Hello, good afternoon!

I would like to know how to perform a linear regression in a DIC (completely randomized design) with split plots. One detail: I need the "betas", because I intend to use the fit to predict a curve!

sample data:

dados<-structure(list(Fator1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L), .Label = c("AV1", "AV2", "AV3"), class = "factor"), Fator2 = c(10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 20L, 20L, 
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 40L, 40L, 40L, 
40L, 40L, 40L, 40L, 40L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 20L, 
20L, 20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 40L, 40L, 
40L, 40L, 40L, 40L, 40L, 40L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 
50L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 
20L, 20L, 20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 40L, 
40L, 40L, 40L, 40L, 40L, 40L, 40L, 50L, 50L, 50L, 50L, 50L, 50L, 
50L, 50L), REP = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), resposta = c(1.7, 
1.8, 1.7, 1.4, 1.7, 1.8, 1.7, 1.8, 1.6, 1.5, 1.7, 1.8, 1.8, 1.6, 
1.7, 1.6, 1.7, 1.6, 1.6, 1.5, 1.6, 1.8, 1.6, 1.7, 1.9, 1.8, 1.7, 
1.7, 1.7, 1.7, 1.7, 1.8, 1.6, 2, 1.9, 2.1, 1.7, 1.8, 1.8, 2, 
1.6, 1.6, 1.7, 1.6, 1.5, 1.6, 1.9, 1.8, 1.6, 1.4, 1.6, 1.6, 1.7, 
1.5, 1.6, 1.6, 1.8, 1.7, 1.8, 1.6, 1.7, 1.7, 1.7, 1.8, 1.6, 1.7, 
1.7, 1.7, 1.7, 1.6, 1.8, 1.8, 1.7, 1.9, 1.6, 1.8, 1.9, 1.9, 1.6, 
1.8, 1.5, 1.8, 1.6, 1.6, 1.6, 1.4, 1.6, 1.5, 1.6, 1.7, 1.6, 1.6, 
1.7, 1.6, 1.6, 1.6, 1.4, 1.3, 1.4, 1.2, 1.2, 1.3, 1.4, 1.2, 1.5, 
1.6, 1.6, 1.6, 1.4, 1.6, 1.5, 1.5, 1.6, 1.4, 1.7, 1.7, 1.6, 1.7, 
1.6, 1.8)), .Names = c("Fator1", "Fator2", "REP", "resposta"), class = "data.frame", row.names = c(NA, 
120L))

I have seen a model, but for a DBC (randomized block design):

m1<-aov(resposta~bloco+Fator1+Error(bloco:Fator1)+Fator2+Fator1:Fator2, data=dados)

I could not get the same answer for the DIC just by replacing the block with the repetition, and/or by removing the block from the code.

To check my answer, I compared it with the result produced by ExpDes.pt::psub.dic.

I need the "Betas" because I intend to use the command for prediction

predicao<-expand.grid(Fator1=levels(dados$Fator1), Fator2=unique(dados$Fator2))
predicao<-cbind(predicao, predict(m1, newdata=predicao, interval="confidence"))

in order to build that 95% confidence interval around the curve.

1 answer

The first thing to do when analyzing an experiment is exploratory data analysis. It was not explicit in your question, but I am assuming that Fator1 refers to the plots and Fator2 to the sub-plots.

library(ggplot2)
ggplot(dados, aes(x=Fator2, y=resposta)) + geom_point() + facet_grid(~ Fator1)

[Figure: scatterplot of resposta versus Fator2, faceted by Fator1 (split plot)]

This graph already gives us an idea of what to expect from the test results. As I do not know the ExpDes.pt package, I chose to perform the analysis with the aov command, native to R.

modelo <- aov(resposta ~ Fator1*Fator2 + Error(REP:Fator1), data=dados)
summary(modelo)

Error: REP:Fator1
          Df Sum Sq Mean Sq F value Pr(>F)
Fator1     2 0.7300  0.3650   31.54  0.125
Residuals  1 0.0116  0.0116               

Error: Within
               Df Sum Sq Mean Sq F value  Pr(>F)   
Fator1          2 0.0905 0.04523   2.598 0.07894 . 
Fator2          1 0.1707 0.17067   9.804 0.00223 **
Fator1:Fator2   2 0.0646 0.03229   1.855 0.16128   
Residuals     111 1.9324 0.01741                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I fitted a model with the two main effects and the interaction between them, hence the formula Fator1*Fator2. In addition, I specified that the error term depends on the repetition within Fator1; that is why I put Error(REP:Fator1).

Note that the interaction term was not significant at the 5% level. Therefore, I will fit a new model without the interaction term:

modelo.reduzido <- aov(resposta ~ Fator1+Fator2 + Error(REP:Fator1), data=dados)
summary(modelo.reduzido)

Error: REP:Fator1
          Df Sum Sq Mean Sq F value Pr(>F)
Fator1     2 0.7300  0.3650   31.54  0.125
Residuals  1 0.0116  0.0116               

Error: Within
           Df Sum Sq Mean Sq F value  Pr(>F)   
Fator1      2 0.0905 0.04523   2.559 0.08184 . 
Fator2      1 0.1707 0.17067   9.657 0.00239 **
Residuals 113 1.9970 0.01767                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Only Fator2 was significant at the 5% level.
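
Since you need the "betas", a quick side note: a model fitted with an Error() term comes back as an aovlist, and you can inspect the coefficients estimated within each error stratum, although R provides no predict() method for aovlist objects. Something like:

# Coefficients per error stratum of the aovlist (no predict() is available)
lapply(modelo.reduzido, coef)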

To check whether the model's assumptions have been met, we run the code below:

modelo.plot <- aov(resposta ~ Fator1+Fator2 + REP:Fator1, data=dados)
par(mfrow=c(2,2))
plot(modelo.plot)

[Figure: diagnostic plots for modelo.plot (Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage)]

Notice that I fitted a new model, without the Error() term, because of a limitation of R: plot() does not work directly on models fitted with Error(). If you run summary(modelo.plot) you will see that the inferences are different, but do not worry: what matters here are the residuals.

Residuals which, by the way, look odd. There seems to be a pattern in the Residuals vs Fitted and Scale-Location plots, suggesting a lack of randomness in these data. I would investigate further to find out where these data came from and why this is happening.

The Q-Q plot indicates normally distributed residuals and, according to the Residuals vs Leverage plot, there are no influential points in this data set.
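
As a small complement (not strictly necessary), the visual impression from the Q-Q plot can be backed up with a formal normality test:

# H0: the residuals come from a normal distribution
shapiro.test(residuals(modelo.plot))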

I will have to owe you the prediction part. I have never seen anything like this done in this context. It may be ignorance on my part, but I looked for something similar in some books here at home and found nothing about it.
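
That said, one possible route, which I offer only as a sketch and have not validated against ExpDes.pt, is to refit the same split-plot structure as a mixed model with the nlme package; that object exposes the betas and supports population-level prediction:

# Sketch only: the same split-plot structure as a mixed model
library(nlme)
dados$REPf <- as.factor(dados$REP)   # grouping variables should be factors
modelo.lme <- lme(resposta ~ Fator1*Fator2, random = ~ 1 | REPf/Fator1, data=dados)
fixef(modelo.lme)                    # the "betas" of the fixed part
predicao <- expand.grid(Fator1=levels(dados$Fator1), Fator2=seq(10, 50, 10))
predicao$ajuste <- predict(modelo.lme, newdata=predicao, level=0)  # population level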


Edit, made to expand on my previous comment:

The hypotheses I tested for the sub-plots in my code above were:

H_0: the slopes of the regression lines within the sub-plots are all equal
H_1: at least one pair of sub-plot slopes is different

I thought more about the problem and I now find that these hypotheses do not make sense. Look at the first graph I drew. The slope of the responses within AV1 and AV2 seems to me to be different from zero, while the line within AV3 seems to have a slope equal to zero.
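
To eyeball this, a quick sketch (it ignores the split-plot error structure, so it is good for inspection only, not for inference) fits one slope per level of Fator1:

# One intercept and one slope per level of Fator1, via the nesting operator
m.inclinacoes <- lm(resposta ~ Fator1/Fator2, data=dados)
coef(m.inclinacoes)   # the Fator1AVx:Fator2 terms are the per-level slopes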

And now comes an important part, which does not depend on any calculations.

Let us think about it. Can we really use only 5 points to determine whether there is a relationship between a predictor variable and a response variable? See, in the first graph, how poorly the points within AV1 and AV2 line up in a straight line.

The case of the responses in AV3 is even worse. Do we have linear behavior there, like that of AV1 or AV2, with a straight line? Or is it a line with zero slope? Or a parabola? Or a cubic? Mathematically, having 5 levels of the predictor variable, we can fit up to a fourth-degree polynomial.
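
Just to illustrate that this is mechanically possible (an illustration, not a recommendation), a quartic can indeed be fitted within AV3:

# 5 dose levels allow at most a degree-4 polynomial
av3 <- subset(dados, Fator1 == "AV3")
m.quartico <- lm(resposta ~ poly(Fator2, 4), data=av3)
summary(m.quartico)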

But does this make sense? On second thought, I do not think so. In fact, I am starting to think that fitting a straight line to these data is not a good idea at all, and I comment on this further below.

However, the hypothesis test I ran reports that there is a difference between the slopes of at least one pair of these lines. I have not run the subsequent comparison tests, but I imagine that the lines within AV1 and AV2 have similar slopes and the line within AV3 has zero slope. I see no point in testing higher-order polynomials.

I argue that we always have to look for real-world explanations for what we find using statistics. After running the tests, it is critical that we explain what we are seeing in this pile of numbers. Your problem, after your explanation yesterday, became clear to me. It seems to make sense to say that as the dose of Organic Fertilization (Fator2) increases, the Diameter (resposta) may increase at two evaluation times (Fator1, levels AV1 and AV2) or not increase at all (Fator1, level AV3). We will only know this for sure if we run the post-hoc tests.

Still, I believe that assuming a linear relationship between the variables resposta and Fator2 is a very strong hypothesis. In my opinion, there are too few points for this to actually be assumed. So I agree with you that Fator2 should have four degrees of freedom. But then we lose the predictive power that a regression could give us in this case. The code, therefore, would be

modelo.completo <- aov(resposta ~ Fator1*as.factor(Fator2) + 
    Error(as.factor(REP)/Fator1), data=dados)
summary(modelo.completo)

Error: as.factor(REP)
          Df  Sum Sq  Mean Sq F value Pr(>F)
Residuals  7 0.03967 0.005667               

Error: as.factor(REP):Fator1
          Df Sum Sq Mean Sq F value   Pr(>F)    
Fator1     2 0.7952  0.3976    91.5 9.16e-09 ***
Residuals 14 0.0608  0.0043                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Within
                         Df Sum Sq Mean Sq F value   Pr(>F)    
as.factor(Fator2)         4 0.5263 0.13158  10.359 7.22e-07 ***
Fator1:as.factor(Fator2)  8 0.5107 0.06383   5.025 4.10e-05 ***
Residuals                84 1.0670 0.01270                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that, unlike my original code, this time I took the necessary care to convert to factors those terms that were stored as numbers.

Note that we now have a significant interaction between Fator1 and Fator2. That is, Evaluation Time and Organic Fertilization interact with each other. I do not know exactly what Evaluation Time means here; does it have something to do with the season?

Note that the residuals still look a little odd:

[Figure: diagnostic residual plots for the full factorial model]

Comparing the results of my model with yours, they are now much more alike. They still differ quantitatively, but those are rounding errors. Apparently, after my revision, we arrive at the same result.

But this result does not serve to make the prediction you want. It serves to run the post-hoc tests, comparing the means between the sub-plots, but we cannot predict with it. Moreover, as I said above, I think it would be incorrect to make predictions in this case, because we have too little data to establish a relationship between Diameter and Organic Fertilization. Comparing the means obtained for Organic Fertilization seems much more suitable to me.
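
For that comparison of means, here is a sketch of the post-hoc step, using the emmeans package as one option among several; converting the variables beforehand avoids the as.factor() calls inside the formula:

# Estimated marginal means of the doses within each evaluation time
library(emmeans)
dados$Fator2f <- as.factor(dados$Fator2)
dados$REPf <- as.factor(dados$REP)
modelo.posthoc <- aov(resposta ~ Fator1*Fator2f + Error(REPf/Fator1), data=dados)
emm <- emmeans(modelo.posthoc, ~ Fator2f | Fator1)
pairs(emm, adjust="tukey")   # Tukey-adjusted pairwise comparisons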

  • The degrees of freedom for Fator2 should be 4, but in the case of your model it is 1.

  • I disagree. It would be 4 degrees of freedom if the variable Fator2 were qualitative. In your problem it is being treated as quantitative; therefore, it gets only one degree of freedom. Compare, for example, with two regression models like these: x <- 1:10; y <- x+rnorm(10); anova(lm(y ~ x)) and x <- 1:100; y <- x+rnorm(100); anova(lm(y ~ x)). Note that the predictor variable in each case is x. In the first case it has length 10; in the second, 100. In both cases the ANOVA table assigns only one degree of freedom to x, because it is not qualitative.

  • I worked a little harder on my answer. Take a look at it.
