Detect outliers in grouped data

Question

Asked 3 years, 12 months ago

Viewed 76 times

0

I am working with dashboard data and need to identify potential outliers within each group. For example:

df <- data.frame(
    ID_group = c("1","1","2","1","3","2","2","3"),
    Case = c("A","B","C","D","E","F","G","H"),
    Var_X = c("1","2","1","2","3","5","6","10"),
    Var_Y = c("10,02","5","7","5,5","2,5","7","600,23","2,8"))

I would like to flag possible outliers in the variable Var_X within each ID_group, by creating a column that marks whether the case identified in Case is an outlier in Var_X for the ID_group it belongs to.

1

Which detection method will you use?

– Vinícius Félix

2021/08/23 at 12:45
I’ll use z-score

– user250908

2021/08/23 at 18:00

2 answers

1

Identify outliers

There are several procedures for identifying outliers; the choice of method and cut-off depends on your data and on the purpose of the analysis. The most commonly used are the interquartile range (IQR) and the z-score.

z-score

The values are standardized by subtracting the mean and dividing by the standard deviation. Values whose absolute z-score exceeds a chosen number of standard deviations (usually 3) are considered outliers:

# scale() returns a one-column matrix, so as.vector() gives back a plain logical vector
is.outlier <- function(x, sd = 3) as.vector(abs(scale(x)) > sd)
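As a quick sanity check of this rule on a small sample, here is a minimal sketch (the vector x and the name is.outlier.z are illustrative, not part of the original post). With the sample standard deviation, no |z| can exceed (n - 1)/sqrt(n), so a cut-off of 3 cannot be reached with fewer than 11 observations:

```r
# z-score rule; as.vector() drops the matrix returned by scale()
is.outlier.z <- function(x, sd = 3) as.vector(abs(scale(x)) > sd)

x <- c(1:4, 100)                 # illustrative sample, n = 5
z <- as.vector(scale(x))         # (x - mean(x)) / sd(x)
n <- length(x)

max_z   <- max(abs(z))           # largest |z| observed in the sample
bound_z <- (n - 1) / sqrt(n)     # upper bound on |z| for any sample of size n

print(round(c(max_z = max_z, bound = bound_z), 3))
print(any(is.outlier.z(x)))      # is anything beyond the cut-off of 3?
```

Even though 100 is far from the other values, its z-score stays below 3 here, which is why a lower cut-off or a different method is usually preferred for small groups.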

IQR

The interquartile range, i.e. the difference between the 3rd and 1st quartiles (75% and 25%), is calculated and multiplied by a coefficient (usually 1.5). Values below the 1st quartile minus this distance, or above the 3rd quartile plus this distance, are considered outliers:

is.outlier <- function(x, coef = 1.5, ...) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE, ...)
  iqr <- diff(q)
  x < q[1] - coef*iqr | x > q[2] + coef*iqr
}

Alternatively, you can use the function boxplot.stats (used internally by boxplot):

is.outlier <- function(x) x %in% boxplot.stats(x)$out
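Note that the two IQR variants are not guaranteed to agree exactly: quantile() interpolates with its default type 7, while boxplot.stats() builds its fences from the hinges returned by fivenum(). A minimal sketch on an illustrative vector x (not from the original post) shows the difference in the quartiles; the flagged points are usually the same, but borderline values can differ:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 20)              # illustrative data

q_type7 <- unname(quantile(x, c(.25, .75)))  # quartiles used by the quantile() version
hinges  <- fivenum(x)[c(2, 4)]               # hinges used by boxplot.stats()

print(q_type7)
print(hinges)
print(boxplot.stats(x)$out)                  # values outside the hinge-based fences
```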

Identify by group

I created a more illustrative example and will use the last function above (the z-score approach is only suitable when n is large).

df <- data.frame(
  grupo = rep(LETTERS[1:2], each = 5),
  Y = c(1:4, 100:104, 1000))

With base R

Using split and lapply. Make sure the data is sorted by group first, so that the results line up with the correct rows.

df <- df[order(df$grupo), ]
df$outlier <- unlist(lapply(split(df$Y, df$grupo), is.outlier))

df
#>    grupo    Y outlier
#> 1      A    1   FALSE
#> 2      A    2   FALSE
#> 3      A    3   FALSE
#> 4      A    4   FALSE
#> 5      A  100    TRUE
#> 6      B  101   FALSE
#> 7      B  102   FALSE
#> 8      B  103   FALSE
#> 9      B  104   FALSE
#> 10     B 1000    TRUE
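If sorting the data first is not convenient, one possible alternative is ave(), which writes each group's result back into the original row positions. The sketch below uses a deliberately interleaved version of the example data (an illustrative variation, not from the original post). Since ave() keeps the type of its first argument, the 0/1 result is converted back with as.logical():

```r
is.outlier <- function(x) x %in% boxplot.stats(x)$out

# Same values as the example above, but with the groups interleaved
df <- data.frame(
  grupo = rep(c("A", "B"), times = 5),
  Y = c(1, 101, 2, 102, 3, 103, 4, 104, 100, 1000))

# ave() applies FUN within each group and preserves the row order
df$outlier <- as.logical(ave(df$Y, df$grupo, FUN = is.outlier))

print(df)
```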

With dplyr

library(dplyr)

df <- df %>% group_by(grupo) %>% mutate(outlier = is.outlier(Y))

With data.table

library(data.table)

setDT(df)

df[, outlier := is.outlier(Y), by = grupo]

Thanks for the answer, very didactic. I was going to use the z-score but, as I have groups with 5 to 10 observations, I may need to use the IQR instead.

– user250908

2021/08/28 at 18:06

by Vinícius Félix • **139** points · Answer 1 · 2021-08-23T20:17:20+00:00

I took your example and first created a data frame, converting the variables to numeric.

library(tidyverse)

df <- data.frame(
  id_group = c("1","1","2","1","3","2","2","3"),
  Case = c("A","B","C","D","E","F","G","H"),
  var_x = c("1","2","1","2","3","5","6","10"),
  var_y = c("10.02","5","7","5.5","2.5","7","600.23","2.8")
  ) %>% 
  mutate(across(starts_with("var_"), as.numeric))
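Note that the data in the question stores Var_Y with decimal commas (for example "10,02"), which as.numeric() would turn into NA. A minimal sketch of one way to handle this, assuming every value has at most one comma and no thousands separator (to_numeric is a hypothetical helper name):

```r
# Hypothetical helper: swap the decimal comma for a point, then convert
to_numeric <- function(x) as.numeric(sub(",", ".", x, fixed = TRUE))

# Var_Y exactly as written in the question
var_y <- c("10,02", "5", "7", "5,5", "2,5", "7", "600,23", "2,8")

print(to_numeric(var_y))
```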

Then I grouped by the variable id_group and used mutate to create a new variable (out_var_x) that flags the outliers. Here I used an arbitrary condition, namely whether the value of var_x is greater than the 90th percentile of its group, but you can substitute any other criterion.

df %>% 
  group_by(id_group) %>% 
  mutate(out_var_x = var_x > quantile(var_x, .9))

# A tibble: 8 x 5
# Groups:   id_group [3]
  id_group Case  var_x var_y out_var_x
  <chr>    <chr> <dbl> <dbl> <lgl>    
1 1        A         1  10.0 FALSE    
2 1        B         2   5   FALSE    
3 2        C         1   7   FALSE    
4 1        D         2   5.5 FALSE    
5 3        E         3   2.5 FALSE    
6 2        F         5   7   FALSE    
7 2        G         6 600.  TRUE     
8 3        H        10   2.8 TRUE
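The same criterion can be applied to several variables in one step. The sketch below is one possible way to do it, assuming dplyr 1.0 or later for across(); the out_ prefix for the new columns is an arbitrary naming choice, and the Case column is left out for brevity:

```r
library(dplyr)

df <- data.frame(
  id_group = c("1", "1", "2", "1", "3", "2", "2", "3"),
  var_x = c(1, 2, 1, 2, 3, 5, 6, 10),
  var_y = c(10.02, 5, 7, 5.5, 2.5, 7, 600.23, 2.8)
)

# Flag, within each group, the values above that group's 90th percentile
flagged <- df %>%
  group_by(id_group) %>%
  mutate(across(starts_with("var_"),
                ~ .x > quantile(.x, .9),
                .names = "out_{.col}")) %>%
  ungroup()

print(flagged)
```

With groups of only two or three rows, as here, a quantile rule will almost always flag the largest value, so the result should be read as an illustration of the mechanics rather than a real outlier test.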