One way to solve this problem is with the dplyr and broom packages:
library(dplyr)
library(broom)
iris.regressao <- iris %>%
group_by(Species) %>%
do(regressao =
lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=.))
tidy(iris.regressao, regressao)
# A tibble: 12 x 6
# Groups: Species [3]
Species term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa (Intercept) 2.3518898 0.39286751 5.9864707 3.034183e-07
2 setosa Sepal.Width 0.6548350 0.09244742 7.0833236 6.834434e-09
3 setosa Petal.Length 0.2375602 0.20801921 1.1420107 2.593594e-01
4 setosa Petal.Width 0.2521257 0.34686362 0.7268727 4.709870e-01
5 versicolor (Intercept) 1.8955395 0.50705524 3.7383295 5.112246e-04
6 versicolor Sepal.Width 0.3868576 0.20454490 1.8913091 6.488965e-02
7 versicolor Petal.Length 0.9083370 0.16543248 5.4906811 1.666695e-06
8 versicolor Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
9 virginica (Intercept) 0.6998830 0.53360089 1.3116227 1.961563e-01
10 virginica Sepal.Width 0.3303370 0.17432873 1.8949086 6.439972e-02
11 virginica Petal.Length 0.9455356 0.09072204 10.4223360 1.074269e-13
12 virginica Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01
I took the iris dataset and treated the Species column as the equivalent of the products in your problem; it was my grouping variable. I then fit a regression with Sepal.Length as the response and the other columns as predictors. The do function applied this regression to each level of Species, that is, three regressions were fit at once. Finally, I organized the result with the tidy function to make everything more presentable.
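As a side note, if your versions of dplyr and broom no longer accept the do() call on grouped data used above, the same idea can be written with tidyr::nest() and purrr::map(). This is only a sketch of the equivalent code, assuming tidyr >= 1.0 and purrr are installed:
library(tidyr)
library(purrr)
iris %>%
  nest(data = -Species) %>%  # one row per species, with its data in a list column
  mutate(regressao = map(data, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = .x)),
         coeficientes = map(regressao, tidy)) %>%  # tidy() each fitted model
  select(Species, coeficientes) %>%
  unnest(coeficientes)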
Edit, made after the dataset was shared in the comments: I used the original dataset and could not reproduce the reported error.
library(dplyr)
library(broom)
dados <- read.csv(file="arquivo.csv", sep=";", dec=",")
Just to be on the safe side, I wanted to see how many times each product appears in the dataset:
produtos.contagem <- dados %>%
select(Produto) %>%
group_by(Produto) %>%
count() %>%
arrange(n)
# A tibble: 2,418 x 2
# Groups: Produto [2,418]
Produto n
<fctr> <int>
1 Produto 1 1
2 Produto 100 1
3 Produto 1000 1
4 Produto 1006 1
5 Produto 1011 1
6 Produto 1012 1
7 Produto 1013 1
8 Produto 1021 1
9 Produto 1022 1
10 Produto 1023 1
# ... with 2,408 more rows
We can see that there are many products that appear only once, which will create problems for the regression. After all, geometrically speaking, it takes at least 3 points in space to fit an equation of the form y = β0 + β1*x1 + β2*x2, which is the case here, since the model has three coefficients. And with exactly 3 points an equation of this type is uniquely determined, leaving no degrees of freedom to estimate the variability.
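To make this concrete, here is a tiny made-up example (the numbers are hypothetical, only to illustrate the point): with exactly 3 observations and 3 coefficients, lm() fits the plane exactly and no residual degrees of freedom are left, so the variability of the estimates cannot be computed.
exemplo <- data.frame(y = c(1, 2, 3), x1 = c(1, 2, 4), x2 = c(2, 1, 3))
ajuste <- lm(y ~ x1 + x2, data = exemplo)
df.residual(ajuste)  # 0: the fitted plane passes exactly through the 3 points
tidy(ajuste)         # the standard errors are not estimable here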
But leaving aside the theoretical assumptions of linear regression, it is still possible to fit the desired regression:
dados.regressao <- dados %>%
group_by(Produto) %>%
do(regressao =
lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))
tidy(dados.regressao, regressao)
# A tibble: 3,725 x 6
# Groups: Produto [2,418]
Produto term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Produto 1 (Intercept) 2.000000 NaN NaN NaN
2 Produto 10 (Intercept) -73.045872 75.5475262 -0.9668863 0.4356077
3 Produto 10 Preco.Meu.Site 1.834862 1.8348624 1.0000000 0.4226497
4 Produto 100 (Intercept) 1.000000 NaN NaN NaN
5 Produto 1000 (Intercept) 2.000000 NaN NaN NaN
6 Produto 1001 (Intercept) 3.500000 1.5000000 2.3333333 0.2577621
7 Produto 1002 (Intercept) 38.970000 21.3485433 1.8254173 0.3190534
8 Produto 1002 Menor.Preco.Concorrente -0.300000 0.1732051 -1.7320508 0.3333333
9 Produto 1003 (Intercept) 110.760000 NaN NaN NaN
10 Produto 1003 Menor.Preco.Concorrente -0.800000 NaN NaN NaN
# ... with 3,715 more rows
The NaN values appear precisely in the cases where it is impossible to fit the desired model. For example, take Produto 1. It occurs only once in the dataset. It is impossible to fit a single plane through this one point, because there are infinitely many planes with that property.
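The same behavior can be reproduced with a made-up product that has a single row (the values below are hypothetical, just to mimic Produto 1):
um.produto <- data.frame(Vendas.Diarias = 2,
                         Menor.Preco.Concorrente = 10,
                         Preco.Meu.Site = 12)
# With one observation the intercept absorbs the single value, the two slopes
# cannot be identified, and the standard errors come out as NaN
tidy(lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data = um.produto))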
What I recommend is removing the products with few observations from the dataset. In the example below I keep only the products with 5 or more observations:
produtos.definitivos <- produtos.contagem %>%
filter(n >= 5)
dados.limpos <- dados %>%
filter(Produto %in% produtos.definitivos$Produto)
dados.regressao <- dados.limpos %>%
group_by(Produto) %>%
do(regressao =
lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))
tidy(dados.regressao, regressao)
# A tibble: 1,153 x 6
# Groups: Produto [526]
Produto term estimate std.error statistic
<fctr> <chr> <dbl> <dbl> <dbl>
1 Produto 1004 (Intercept) 3.626591760 55.74608778 0.06505554
2 Produto 1004 Menor.Preco.Concorrente 0.048689139 0.06657824 0.73130714
3 Produto 1004 Preco.Meu.Site -0.087390762 0.84020377 -0.10401139
4 Produto 1005 (Intercept) 22.974217287 68.67298304 0.33454521
5 Produto 1005 Menor.Preco.Concorrente 0.008938733 0.09727655 0.09188990
6 Produto 1005 Preco.Meu.Site -0.117537498 0.34017868 -0.34551694
7 Produto 1007 (Intercept) 38.987889938 77.15185884 0.50533961
8 Produto 1007 Menor.Preco.Concorrente -0.015792461 0.03091735 -0.51079615
9 Produto 1007 Preco.Meu.Site -0.175931611 0.39215747 -0.44862492
10 Produto 101 (Intercept) -45.326666667 23.83058282 -1.90203769
# ... with 1,143 more rows, and 1 more variables: p.value <dbl>
Note that the NaN values have disappeared, precisely because each product now has more than the minimum number of observations needed to fit the desired model.
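Finally, if besides the coefficients you also want a per-product summary of each fit (R², residual standard error, and so on), broom's glance() can be used in the same way as tidy() above. This is just a sketch, assuming a broom version that still accepts the rowwise data frames created by do():
glance(dados.regressao, regressao)  # one row per product, with r.squared, sigma, etc.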