How to winsorize database by group

Question

How do I winsorize data when the data set is organized by group? Example:

Empresas  Setores    Receita
Empres1   Comercio     ###
Empres2   Comercio     ###
Empres3   Comercio     ###
Empres21  Industria    ###
Empres22  Industria    ###
Empres23  Industria    ###

Please create a reproducible example. https://answall.com/questions/264168/quais-as-principais-fun%C3%A7%C3%B5es-to-create-an-example-m%C3%Adnimo-reproduce%C3%Advel-em-r

– bbiasi

2019/05/20 at 01:46

1 answer

by Rui Barradas • **15,422** points · Answer 1 · 2019-05-20T15:26:59+00:00

I'll use the function winsor from the psych package to winsorize the data.
To answer the question, just use one of the ways R has of applying a function by group.

Base R.

The base R function ave was made for this.

library(psych)

with(dados, ave(Receita, Setores, FUN = winsor))
#[1] 518.6 857.0 899.0 318.8 632.0 801.2

Or, creating a new column in a copy of the data set.

dados_a <- dados
dados_a$ReceitaWinsor <- with(dados, ave(Receita, Setores, FUN = winsor))
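
To see where the numbers above come from, here is a minimal sketch in base R only. It assumes winsor's default of trim = 0.2, which clamps each group's values to that group's 20% and 80% quantiles. The helper winsorize_vec is a hypothetical name used for illustration, not part of psych, and the Receita values are the ones shown in this answer's output.

```r
# Hypothetical helper (not part of psych): clamp x to its
# trim and 1 - trim quantiles, computed on x itself.
winsorize_vec <- function(x, trim = 0.2) {
  limits <- quantile(x, probs = c(trim, 1 - trim), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Receita values and groups as shown in this answer.
receita <- c(293, 857, 927, 110, 632, 914)
setores <- c("Comercio", "Comercio", "Comercio",
             "Industria", "Industria", "Industria")

# ave() applies the function within each group and returns
# the results in the original row order.
resultado <- ave(receita, setores, FUN = winsorize_vec)
resultado
#[1] 518.6 857.0 899.0 318.8 632.0 801.2
```

Because the quantiles are computed separately for each group, 293 is pulled up to the Comercio lower limit (518.6) while 110 is pulled up to the Industria lower limit (318.8).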

Package dplyr.

With the dplyr package it is not difficult either.

library(psych)
library(dplyr)

dados_b <- dados %>%
  group_by(Setores) %>%
  mutate(ReceitaWinsor = winsor(Receita))

dados_b
## A tibble: 6 x 4
## Groups:   Setores [2]
#  Empresas Setores   Receita ReceitaWinsor
#  <fct>    <fct>       <int>         <dbl>
#1 Empres1  Comercio      293          519.
#2 Empres2  Comercio      857          857 
#3 Empres3  Comercio      927          899 
#4 Empres21 Industria     110          319.
#5 Empres22 Industria     632          632 
#6 Empres23 Industria     914          801.

Although the values shown by this second solution do not seem to match the base R results, the difference is only rounding applied when the tibble is printed. The stored values are the same.

identical(dados_a$ReceitaWinsor, dados_b$ReceitaWinsor)
#[1] TRUE

Data.

As the data in the question is incomplete, here is a data set with the three columns. The first two columns are those of the question and the third is generated randomly.

dados <-
structure(list(Empresas = structure(c(1L, 2L, 6L, 
3L, 4L, 5L), .Label = c("Empres1", "Empres2", 
"Empres21", "Empres22", "Empres23", "Empres3"), 
class = "factor"), Setores = structure(c(1L, 1L, 
1L, 2L, 2L, 2L), .Label = c("Comercio", "Industria"), 
class = "factor")), row.names = c(NA, -6L), 
class = "data.frame")

set.seed(1234)
dados$Receita <- sample(10:1000, 6, TRUE)
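
The call to sample above may return different numbers depending on the R version, because R 3.6.0 changed the default sampling algorithm. As a sketch, assuming the Receita values printed in the dplyr output are the intended ones, the data can also be built with those values typed in directly:

```r
# Receita values copied from the printed output in this answer,
# instead of being drawn with sample().
dados <- data.frame(
  Empresas = c("Empres1", "Empres2", "Empres3",
               "Empres21", "Empres22", "Empres23"),
  Setores  = rep(c("Comercio", "Industria"), each = 3),
  Receita  = c(293L, 857L, 927L, 110L, 632L, 914L),
  stringsAsFactors = TRUE  # factors, as in the structure() call
)

str(dados)
```

With this version of dados, the ave and dplyr code in this answer should reproduce the outputs shown.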