One way to solve this problem is with the dplyr and broom packages:
library(dplyr)
library(broom)
iris.regressao <- iris %>%
group_by(Species) %>%
do(regressao =
lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=.))
tidy(iris.regressao, regressao)
# A tibble: 12 x 6
# Groups: Species [3]
Species term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa (Intercept) 2.3518898 0.39286751 5.9864707 3.034183e-07
2 setosa Sepal.Width 0.6548350 0.09244742 7.0833236 6.834434e-09
3 setosa Petal.Length 0.2375602 0.20801921 1.1420107 2.593594e-01
4 setosa Petal.Width 0.2521257 0.34686362 0.7268727 4.709870e-01
5 versicolor (Intercept) 1.8955395 0.50705524 3.7383295 5.112246e-04
6 versicolor Sepal.Width 0.3868576 0.20454490 1.8913091 6.488965e-02
7 versicolor Petal.Length 0.9083370 0.16543248 5.4906811 1.666695e-06
8 versicolor Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
9 virginica (Intercept) 0.6998830 0.53360089 1.3116227 1.961563e-01
10 virginica Sepal.Width 0.3303370 0.17432873 1.8949086 6.439972e-02
11 virginica Petal.Length 0.9455356 0.09072204 10.4223360 1.074269e-13
12 virginica Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01
I took the iris dataset and treated the Species column as the equivalent of the products in your problem; it was my grouping variable. I then fit a regression with Sepal.Length as the response and the other columns as predictors. The do function applied this regression to each level of Species, that is, three regressions were fit at once. Finally, I organized the result with the tidy function to make everything more presentable.
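As a side note, if your versions of dplyr and broom no longer accept the do() call on grouped data used above, the same idea can be written with tidyr::nest() and purrr::map(). This is only a sketch of the equivalent code, assuming tidyr >= 1.0 and purrr are installed:
library(tidyr)
library(purrr)
iris %>%
  nest(data = -Species) %>%  # one row per species, with its data in a list column
  mutate(regressao = map(data, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = .x)),
         coeficientes = map(regressao, tidy)) %>%  # tidy() each fitted model
  select(Species, coeficientes) %>%
  unnest(coeficientes)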
Edit, made after the dataset was shared in the comments: I used the original dataset and could not reproduce the reported error.
library(dplyr)
library(broom)
dados <- read.csv(file="arquivo.csv", sep=";", dec=",")
Just to be on the safe side, I wanted to see how many times each product appears in the dataset:
produtos.contagem <- dados %>%
select(Produto) %>%
group_by(Produto) %>%
count() %>%
arrange(n)
# A tibble: 2,418 x 2
# Groups: Produto [2,418]
Produto n
<fctr> <int>
1 Produto 1 1
2 Produto 100 1
3 Produto 1000 1
4 Produto 1006 1
5 Produto 1011 1
6 Produto 1012 1
7 Produto 1013 1
8 Produto 1021 1
9 Produto 1022 1
10 Produto 1023 1
# ... with 2,408 more rows
We can see that there are many products that appear only once, which will create problems for the regression. After all, geometrically speaking, it takes at least 3 points in space to fit an equation of the form y = β0 + β1*x1 + β2*x2, which is the case here, since the model has three coefficients. And with exactly 3 points an equation of this type is uniquely determined, leaving no degrees of freedom to estimate the variability.
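To make this concrete, here is a tiny made-up example (the numbers are hypothetical, only to illustrate the point): with exactly 3 observations and 3 coefficients, lm() fits the plane exactly and no residual degrees of freedom are left, so the variability of the estimates cannot be computed.
exemplo <- data.frame(y = c(1, 2, 3), x1 = c(1, 2, 4), x2 = c(2, 1, 3))
ajuste <- lm(y ~ x1 + x2, data = exemplo)
df.residual(ajuste)  # 0: the fitted plane passes exactly through the 3 points
tidy(ajuste)         # the standard errors are not estimable here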
But leaving aside the theoretical assumptions of linear regression, it is still possible to fit the desired regression:
dados.regressao <- dados %>%
group_by(Produto) %>%
do(regressao =
lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))
tidy(dados.regressao, regressao)
# A tibble: 3,725 x 6
# Groups: Produto [2,418]
Produto term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Produto 1 (Intercept) 2.000000 NaN NaN NaN
2 Produto 10 (Intercept) -73.045872 75.5475262 -0.9668863 0.4356077
3 Produto 10 Preco.Meu.Site 1.834862 1.8348624 1.0000000 0.4226497
4 Produto 100 (Intercept) 1.000000 NaN NaN NaN
5 Produto 1000 (Intercept) 2.000000 NaN NaN NaN
6 Produto 1001 (Intercept) 3.500000 1.5000000 2.3333333 0.2577621
7 Produto 1002 (Intercept) 38.970000 21.3485433 1.8254173 0.3190534
8 Produto 1002 Menor.Preco.Concorrente -0.300000 0.1732051 -1.7320508 0.3333333
9 Produto 1003 (Intercept) 110.760000 NaN NaN NaN
10 Produto 1003 Menor.Preco.Concorrente -0.800000 NaN NaN NaN
# ... with 3,715 more rows
The NaN values appear precisely in the cases where it is impossible to fit the desired model. For example, take Produto 1. It occurs only once in the dataset. It is impossible to fit a single plane through this one point, because there are infinitely many planes with that property.
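The same behavior can be reproduced with a made-up product that has a single row (the values below are hypothetical, just to mimic Produto 1):
um.produto <- data.frame(Vendas.Diarias = 2,
                         Menor.Preco.Concorrente = 10,
                         Preco.Meu.Site = 12)
# With one observation the intercept absorbs the single value, the two slopes
# cannot be identified, and the standard errors come out as NaN
tidy(lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data = um.produto))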
What I recommend is removing the products with few observations from the dataset. In the example below I keep only the products with 5 or more observations:
produtos.definitivos <- produtos.contagem %>%
filter(n >= 5)
dados.limpos <- dados %>%
filter(Produto %in% produtos.definitivos$Produto)
dados.regressao <- dados.limpos %>%
group_by(Produto) %>%
do(regressao =
lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))
tidy(dados.regressao, regressao)
# A tibble: 1,153 x 6
# Groups: Produto [526]
Produto term estimate std.error statistic
<fctr> <chr> <dbl> <dbl> <dbl>
1 Produto 1004 (Intercept) 3.626591760 55.74608778 0.06505554
2 Produto 1004 Menor.Preco.Concorrente 0.048689139 0.06657824 0.73130714
3 Produto 1004 Preco.Meu.Site -0.087390762 0.84020377 -0.10401139
4 Produto 1005 (Intercept) 22.974217287 68.67298304 0.33454521
5 Produto 1005 Menor.Preco.Concorrente 0.008938733 0.09727655 0.09188990
6 Produto 1005 Preco.Meu.Site -0.117537498 0.34017868 -0.34551694
7 Produto 1007 (Intercept) 38.987889938 77.15185884 0.50533961
8 Produto 1007 Menor.Preco.Concorrente -0.015792461 0.03091735 -0.51079615
9 Produto 1007 Preco.Meu.Site -0.175931611 0.39215747 -0.44862492
10 Produto 101 (Intercept) -45.326666667 23.83058282 -1.90203769
# ... with 1,143 more rows, and 1 more variables: p.value <dbl>
Note that the NaN values have disappeared, precisely because each product now has more than the minimum number of observations needed to fit the desired model.
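Finally, if besides the coefficients you also want a per-product summary of each fit (R², residual standard error, and so on), broom's glance() can be used in the same way as tidy() above. This is just a sketch, assuming a broom version that still accepts the rowwise data frames created by do():
glance(dados.regressao, regressao)  # one row per product, with r.squared, sigma, etc.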