Identify outliers
There are several procedures for identifying outliers; the choice of method and cut-off criteria depends on your data and the purpose of the analysis. The most commonly used are the interquartile range (IQR) and the z-score.
z-score
The values are standardized by the mean and standard deviation, and those lying more than x standard deviations from the mean (usually 3) are considered outliers:
# scale() returns a one-column matrix, so convert the result back to a plain vector
is.outlier <- function(x, sd = 3) as.vector(abs(scale(x)) > sd)
IQR
The difference between the 1st and 3rd quartiles (25% and 75%) is calculated and multiplied by a coefficient (usually 1.5). Values below the 1st quartile minus this distance, or above the 3rd quartile plus it, are considered outliers:
is.outlier <- function(x, coef = 1.5, ...) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE, ...)
  iqr <- diff(q)
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}
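As a rough worked example of this rule, assuming the default quantile algorithm (type 7) and a made-up vector x, the limits can be computed step by step:

```r
x <- c(1, 2, 3, 4, 100)                  # made-up data for illustration
q <- unname(quantile(x, c(.25, .75)))    # 1st and 3rd quartiles: 2 and 4
iqr <- diff(q)                           # interquartile range: 2
limits <- c(q[1] - 1.5 * iqr,            # lower limit: -1
            q[2] + 1.5 * iqr)            # upper limit: 7
x < limits[1] | x > limits[2]
#> [1] FALSE FALSE FALSE FALSE  TRUE
```

Only 100 falls outside the limits, so it is the only value flagged.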
Alternatively, you can use the function boxplot.stats (used internally by boxplot):
is.outlier <- function(x) x %in% boxplot.stats(x)$out
Identify by group
I created a more illustrative example and will use the last function above (identification by z-score is only suitable for large n).
df <- data.frame(
  grupo = rep(LETTERS[1:2], each = 5),
  Y = c(1:4, 100:104, 1000))
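To see why the z-score is unsuitable for small groups, here is a minimal sketch applied to the values of group A. The names is.outlier.z and max.z are hypothetical, chosen only to avoid overwriting is.outlier. In a sample of size n, no value can lie more than (n - 1)/sqrt(n) sample standard deviations from the mean, so with n = 5 the cut-off of 3 can never be reached:

```r
# z-score version under a hypothetical name, so is.outlier is left untouched
is.outlier.z <- function(x, sd = 3) as.vector(abs(scale(x)) > sd)

y <- c(1:4, 100)             # same values as group A
max(abs(scale(y)))           # about 1.79, well below the cut-off of 3
any(is.outlier.z(y))         # FALSE: 100 is not flagged

# Largest z-score attainable with n observations (hypothetical helper)
max.z <- function(n) (n - 1) / sqrt(n)
max.z(5)                     # about 1.79
min(which(max.z(1:50) > 3))  # 11: smallest n at which a z-score can exceed 3
```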
With base R
Using split and lapply. Just make sure the data is sorted by group first, so that each result is attached to the correct row.
df <- df[order(df$grupo), ]
df$outlier <- unlist(lapply(split(df$Y, df$grupo), is.outlier))
df
#>    grupo    Y outlier
#> 1      A    1   FALSE
#> 2      A    2   FALSE
#> 3      A    3   FALSE
#> 4      A    4   FALSE
#> 5      A  100    TRUE
#> 6      B  101   FALSE
#> 7      B  102   FALSE
#> 8      B  103   FALSE
#> 9      B  104   FALSE
#> 10     B 1000    TRUE
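If sorting is inconvenient, one possible alternative in base R is ave, which returns the results in the original row order. This is only a sketch on a made-up, deliberately unsorted copy of the data (df2 is a hypothetical name). Because ave stores the logical result in a numeric vector, it is converted back with as.logical:

```r
is.outlier <- function(x) x %in% boxplot.stats(x)$out

# Same values as the example, but with the groups interleaved (not sorted)
df2 <- data.frame(
  grupo = rep(LETTERS[1:2], times = 5),
  Y = c(1, 101, 2, 102, 3, 103, 4, 104, 100, 1000))

# ave() applies the function per group and keeps the original row order
df2$outlier <- as.logical(ave(df2$Y, df2$grupo, FUN = is.outlier))
df2$Y[df2$outlier]
#> [1]  100 1000
```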
dplyr
library(dplyr)
# %<>% comes from magrittr and is not exported by dplyr, so assign explicitly
df <- df %>% group_by(grupo) %>% mutate(outlier = is.outlier(Y))
data.table
library(data.table)
setDT(df)
df[, outlier := is.outlier(Y), by = grupo]
Which detection method will you use?
– Vinícius Félix
I’ll use z-score
– user250908