Multiple linear regression in R across different groups

I have a data set with 4 columns:

Product - My Site Price - Competitor Site Price - Quantities Sold

Where Product is a specific product (Refrigerator 1, Refrigerator 2, etc.); My Site Price is the price charged by my site; Competitor Site Price is the lowest price a competitor charges for the same product; and Quantities Sold is the number of units sold on my website. Each row is a different scenario combining these variables. There are several records for each product, since I charged different prices over time, and so did my competitor.
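
For illustration, a few rows of the data might look like this (a minimal sketch with made-up values; the column names follow the ones used in the answer below):

# purely illustrative rows; the real prices and quantities differ
dados <- data.frame(
  Produto                 = c("Refrigerator 1", "Refrigerator 1", "Refrigerator 2"),
  Preco.Meu.Site          = c(1999.90, 1899.90, 2499.00),
  Menor.Preco.Concorrente = c(1950.00, 1999.00, 2450.00),
  Vendas.Diarias          = c(5, 8, 2)
)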

I want to build a multiple linear regression model relating these variables for each product. For example: Refrigerator 1, when I charged price x1 and my competitor charged price x2, sold y units.

The problem is that I want to do this in bulk, for many products at once. How can I write R code to do this?

  • Have you tried writing your own code?

1 answer

One way to solve this problem is with the dplyr and broom packages:

library(dplyr)
library(broom)

# fit one regression per level of Species
iris.regressao <- iris %>%
  group_by(Species) %>%
  do(regressao = 
    lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=.))

# extract the coefficient table of each fitted model
tidy(iris.regressao, regressao)

# A tibble: 12 x 6
# Groups:   Species [3]
      Species         term   estimate  std.error  statistic      p.value
       <fctr>        <chr>      <dbl>      <dbl>      <dbl>        <dbl>
 1     setosa  (Intercept)  2.3518898 0.39286751  5.9864707 3.034183e-07
 2     setosa  Sepal.Width  0.6548350 0.09244742  7.0833236 6.834434e-09
 3     setosa Petal.Length  0.2375602 0.20801921  1.1420107 2.593594e-01
 4     setosa  Petal.Width  0.2521257 0.34686362  0.7268727 4.709870e-01
 5 versicolor  (Intercept)  1.8955395 0.50705524  3.7383295 5.112246e-04
 6 versicolor  Sepal.Width  0.3868576 0.20454490  1.8913091 6.488965e-02
 7 versicolor Petal.Length  0.9083370 0.16543248  5.4906811 1.666695e-06
 8 versicolor  Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
 9  virginica  (Intercept)  0.6998830 0.53360089  1.3116227 1.961563e-01
10  virginica  Sepal.Width  0.3303370 0.17432873  1.8949086 6.439972e-02
11  virginica Petal.Length  0.9455356 0.09072204 10.4223360 1.074269e-13
12  virginica  Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01

I took the iris dataset and treated the Species column as the product in your problem. It was my grouping variable.

I ran a regression with Sepal.Length as the response and the other variables as predictors. The do function applied this regression to each level of Species; in other words, three regressions were fitted at once.

I organized the final result with the tidy function, to make everything more presentable.
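
If you are on a more recent version of dplyr (where do() is superseded) and broom, a minimal sketch of the same idea using group_modify() would be:

library(dplyr)
library(broom)

# one coefficient table per level of Species, without do()
iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                         data = .x)))

The result is the same 12-row table of per-species coefficients shown above.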


Edit made after the data set was shared in the comments.

I used the original data set and could not reproduce the reported error.

library(dplyr)
library(broom)

# read the data set shared in the comments
dados <- read.csv(file="arquivo.csv", sep=";", dec=",")

Just to be on the safe side, I wanted to see how many times each product appears in the data set:

# count how many rows each product has, from least to most frequent
produtos.contagem <- dados %>% 
  select(Produto) %>%
  group_by(Produto) %>%
  count() %>%
  arrange(n)

# A tibble: 2,418 x 2
# Groups:   Produto [2,418]
        Produto     n
         <fctr> <int>
 1    Produto 1     1
 2  Produto 100     1
 3 Produto 1000     1
 4 Produto 1006     1
 5 Produto 1011     1
 6 Produto 1012     1
 7 Produto 1013     1
 8 Produto 1021     1
 9 Produto 1022     1
10 Produto 1023     1
# ... with 2,408 more rows

We can see that many products appear only once, which will create problems for the regression. Geometrically speaking, it takes at least 3 points in space to fit an equation of the form y = b0 + b1*x1 + b2*x2, which is the case here. And even with exactly 3 points, the equation is uniquely determined, leaving no residual degrees of freedom to estimate variability.
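
To see why, here is a tiny sketch with invented numbers: with an intercept and two predictors, three points determine the plane exactly, leaving zero residual degrees of freedom, so the standard errors cannot be computed:

# three points, three coefficients: the fit is exact, but the standard
# errors, t statistics and p-values come out as NaN
exemplo <- data.frame(y = c(2, 5, 7), x1 = c(1, 2, 3), x2 = c(4, 1, 6))
summary(lm(y ~ x1 + x2, data = exemplo))$coefficients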

But leaving aside the theoretical requirements of linear regression, it is still possible to fit the desired regression:

# fit one regression per product
dados.regressao <- dados %>%
  group_by(Produto) %>%
  do(regressao = 
       lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))

# coefficient table of each fitted model
tidy(dados.regressao, regressao)

# A tibble: 3,725 x 6
# Groups:   Produto [2,418]
        Produto                    term   estimate  std.error  statistic   p.value
         <fctr>                   <chr>      <dbl>      <dbl>      <dbl>     <dbl>
 1    Produto 1             (Intercept)   2.000000        NaN        NaN       NaN
 2   Produto 10             (Intercept) -73.045872 75.5475262 -0.9668863 0.4356077
 3   Produto 10          Preco.Meu.Site   1.834862  1.8348624  1.0000000 0.4226497
 4  Produto 100             (Intercept)   1.000000        NaN        NaN       NaN
 5 Produto 1000             (Intercept)   2.000000        NaN        NaN       NaN
 6 Produto 1001             (Intercept)   3.500000  1.5000000  2.3333333 0.2577621
 7 Produto 1002             (Intercept)  38.970000 21.3485433  1.8254173 0.3190534
 8 Produto 1002 Menor.Preco.Concorrente  -0.300000  0.1732051 -1.7320508 0.3333333
 9 Produto 1003             (Intercept) 110.760000        NaN        NaN       NaN
10 Produto 1003 Menor.Preco.Concorrente  -0.800000        NaN        NaN       NaN
# ... with 3,715 more rows

The NaN values appear precisely in the cases where it is impossible to fit the desired model. Take Produto 1, for example: it occurs only once in the data set. There is no way to fit a unique plane through a single point, because infinitely many planes pass through it.
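
If you want to list exactly which products fall into this situation, one possibility (a small sketch reusing the objects created above) is to filter the tidied results for NaN standard errors:

# products whose fitted models have undefined standard errors
tidy(dados.regressao, regressao) %>%
  filter(is.nan(std.error)) %>%
  distinct(Produto)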

What I recommend is removing the products with few observations from the data set. In the example below I keep only products with 5 or more observations:

# keep only products with at least 5 observations
produtos.definitivos <- produtos.contagem %>% 
  filter(n >= 5)

dados.limpos <- dados %>%
  filter(Produto %in% produtos.definitivos$Produto)

# refit the per-product regressions on the cleaned data
dados.regressao <- dados.limpos %>%
  group_by(Produto) %>%
  do(regressao = 
       lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))

tidy(dados.regressao, regressao)

# A tibble: 1,153 x 6
# Groups:   Produto [526]
        Produto                    term      estimate   std.error   statistic
         <fctr>                   <chr>         <dbl>       <dbl>       <dbl>
 1 Produto 1004             (Intercept)   3.626591760 55.74608778  0.06505554
 2 Produto 1004 Menor.Preco.Concorrente   0.048689139  0.06657824  0.73130714
 3 Produto 1004          Preco.Meu.Site  -0.087390762  0.84020377 -0.10401139
 4 Produto 1005             (Intercept)  22.974217287 68.67298304  0.33454521
 5 Produto 1005 Menor.Preco.Concorrente   0.008938733  0.09727655  0.09188990
 6 Produto 1005          Preco.Meu.Site  -0.117537498  0.34017868 -0.34551694
 7 Produto 1007             (Intercept)  38.987889938 77.15185884  0.50533961
 8 Produto 1007 Menor.Preco.Concorrente  -0.015792461  0.03091735 -0.51079615
 9 Produto 1007          Preco.Meu.Site  -0.175931611  0.39215747 -0.44862492
10  Produto 101             (Intercept) -45.326666667 23.83058282 -1.90203769
# ... with 1,143 more rows, and 1 more variables: p.value <dbl>

Note that the NaN values disappeared, precisely because each product now has more than the minimum number of observations needed to fit the desired model.
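
If you prefer, the counting and filtering can also be done in a single pipeline, without the intermediate objects (a sketch equivalent to the steps above):

dados.regressao <- dados %>%
  group_by(Produto) %>%
  filter(n() >= 5) %>%          # keep only products with 5+ observations
  do(regressao = 
       lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))

tidy(dados.regressao, regressao)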

  • Hello Marcus, how are you? Thanks for the help! There was only one problem: when I run the command tidy(dados.regressao, regressao), I get this error: Column dados$SKU.MKP must be length 2333 (the number of rows) or one, not 11726. I believe this happens because of the number of rows in that column, SKU.MKP (which holds the products I mentioned in the question). Any idea how to solve this? I need the coefficients of each variable per product, because I will build a spreadsheet to calculate the scenarios in bulk.

  • Unfortunately, without access to the data set, it is very difficult for me to give you a definitive answer to this question. I don't know how many rows your data set has, or how many different products are registered.

  • Searching for the error, I found that it happens because of the relationship between the number of observations and the number of different products. I have 2333 different products, totaling 11726 observations. That is, for Product 1 I have 5 observations, for Product 2, 7 observations, and so on. That is what is causing the error.

  • Unbalanced group sizes are not a problem for the code I posted. Take the following dataset: iris2 <- iris[c(1:10, 51:70, 101:150), ]. It is the original iris, but with 10 plants of the first species, 20 of the second and 50 of the third. The 3 groups are unbalanced, yet the code works just the same.

  • I've tried everything, but it keeps giving this error. I am attaching a link to download the data: https://drive.google.com/file/d/0B0RV-qxx0Y51T1Fhc0t0Zk96TU/view?usp=sharing My code was this: dados.regressao <- dados %>% group_by(dados$SKU.MKP) %>% do(regressao = lm(dados$Vendas.Diarias ~ dados$Menor.Preco.Concorrente + dados$Preco.Meu.Site, data=.)); tidy(dados.regressao, regressao). Thanks a lot for your help!

  • Just to update: I had typed something wrong. The code works perfectly. Thank you very much!!

  • What good news! If possible, accept the answer so that others can benefit from it in the future.
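
Since the comments above mention taking the coefficients of each product to a spreadsheet, a minimal sketch of that export (the file name coeficientes.csv is just an example) would be:

# write the per-product coefficient table to a CSV that a spreadsheet can
# open; write.csv2 uses ";" and "," like the input file read above
coeficientes <- tidy(dados.regressao, regressao)
write.csv2(coeficientes, file = "coeficientes.csv", row.names = FALSE)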
