How to extract the outlier values with ggplot + geom_boxplot?

I have the following plot:

library(tidyverse)

dataset<-as_tibble(matrix(rnorm(6*1000,1500,200),ncol=6))
cluster<-kmeans(dataset,centers=3)
dataset$kmeans<-as.factor(cluster[['cluster']])

dataset%>%
  gather(.,key='group',value='var',V1:V6)%>%
  ggplot(aes(group,var,fill=kmeans))+
  facet_grid(kmeans~.)+
  geom_boxplot(outlier.color='yellow',outlier.shape=21,
                 outlier.fill='black',outlier.size=1)+
  stat_summary(fun.y=mean, geom="line", aes(group=1))+
  stat_summary(fun.y=mean, geom="point")+
  theme_dark()

I have two questions:

  • Is there a function within geom_boxplot capable of returning the outliers as values?
  • Is there a way to return only the first outlier, by value, above and below the box? Suppose the outliers are: 1.4 and 5 below and 500, 501, 502 above. Then I would like to get only 5 and 500.

  • Why do you want to return only the first outlier above or below the box?

  • To filter the data based on these values.

  • Is there any reason this has to be done in ggplot2?

  • To stay within the tidyverse framework.

1 answer

First of all I’m going to recreate the data with a seed so that the results can be reproduced.

library(tidyverse)
set.seed(123)
dataset <- as_tibble(matrix(rnorm(6*1000,1500,200),ncol=6))
cluster <- kmeans(dataset,centers=3)
dataset$kmeans <- as.factor(cluster[['cluster']])

Then I store the plot in an object so that we can reuse it later.

p <- dataset %>%
  gather(key = 'group', value = 'var', V1:V6) %>%
  ggplot(aes(group, var, fill = kmeans)) +
  facet_grid(kmeans ~ .) +
  geom_boxplot(outlier.color = 'yellow', outlier.shape = 21,
               outlier.fill = 'black', outlier.size = 1) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  stat_summary(fun = mean, geom = "point") +
  theme_dark()

Getting the outliers in ggplot2

In ggplot2, the outliers are calculated only when the plot is drawn (that is, when print(p) is called).

The object responsible for this is StatBoxplot (source code), which is executed within the plot's "environment".

You can force these calculations to run with the function ggplot_build(). So we have

calculado <- ggplot_build(p)

The resulting object has an element called data, which is a list of three data.frames, one per layer of the plot. The first one holds the values used to draw the boxplots. In this data.frame there is a list-column called "outliers" with the outliers of each boxplot.

outliers <- calculado$data[[1]]$outliers
str(outliers)
List of 18
 $ : num 2148
 $ : num [1:4] 2059 2066 1191 2158
 $ : num [1:5] 992 1914 930 973 1819
 $ : num [1:3] 910 932 874
 $ : num [1:7] 2094 1163 2154 2085 2189 ...
 $ : num [1:3] 966 2072 2117
 $ : num(0) 
 $ : num [1:2] 961 890
 $ : num [1:5] 1036 998 1940 2047 988
 $ : num [1:2] 916 2079
 $ : num [1:6] 949 2000 922 1946 1909 ...
 $ : num [1:6] 1065 2038 2050 1031 1085 ...
 $ : num [1:2] 968 2015
 $ : num [1:3] 2137 2178 2063
 $ : num [1:4] 2104 1262 1259 2184
 $ : num [1:3] 1003 1975 1931
 $ : num 891
 $ : num [1:3] 862 848 1962 
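
The list above is unnamed, so it does not show which box each set of outliers belongs to. A minimal sketch of one way to recover that, assuming a small made-up data frame (toy is hypothetical, not the data from this answer) and using layer_data(), a ggplot2 helper that returns the same per-layer data.frame as ggplot_build(p)$data[[1]]:

```r
library(ggplot2)
library(tidyr)

# Hypothetical toy data: group "a" has one low and one high outlier,
# group "b" has only a high outlier
toy <- data.frame(
  group = rep(c("a", "b"), each = 7),
  var   = c(1, 10, 11, 12, 13, 14, 30,
            20, 21, 22, 23, 24, 25, 60)
)

p_toy <- ggplot(toy, aes(group, var)) + geom_boxplot()

# Computed data of the first (boxplot) layer
built <- layer_data(p_toy, 1)

# One row per outlier, keeping the panel and group it came from
outlier_table <- unnest(built[, c("PANEL", "group", "outliers")],
                        cols = outliers)
outlier_table
```

Here group is the integer position of the box on the x axis, and PANEL identifies the facet when the plot is faceted.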

Getting the first outlier on each side of the box

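# Returns the outliers closest to the box: the largest one below it
# and the smallest one above it (NA when a side has no outliers)
primeiros <- function(outliers, lower, upper) {
  abaixo <- outliers[outliers < lower]
  acima <- outliers[outliers > upper]

  primeiro_baixo <- if (length(abaixo) > 0) max(abaixo) else NA_real_
  primeiro_cima <- if (length(acima) > 0) min(acima) else NA_real_

  c(primeiro_baixo, primeiro_cima)
}

dados <- calculado$data[[1]]
pmap(list(dados$outliers, dados$lower, dados$upper), primeiros)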
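
If a table is more convenient than a list of pairs, the same idea can be written as new columns. This is only a sketch on made-up data (toy and the column names primeiro_baixo and primeiro_cima are hypothetical), again using layer_data() to get the computed boxplot data:

```r
library(ggplot2)
library(dplyr)
library(purrr)

# Hypothetical toy data: group "a" has outliers on both sides,
# group "b" only above the box
toy <- data.frame(
  group = rep(c("a", "b"), each = 7),
  var   = c(1, 10, 11, 12, 13, 14, 30,
            20, 21, 22, 23, 24, 25, 60)
)

dados_toy <- layer_data(ggplot(toy, aes(group, var)) + geom_boxplot(), 1)

resumo <- dados_toy %>%
  mutate(
    # Largest outlier below the lower hinge, NA if there is none
    primeiro_baixo = map2_dbl(outliers, lower, function(o, l) {
      abaixo <- o[o < l]
      if (length(abaixo) > 0) max(abaixo) else NA_real_
    }),
    # Smallest outlier above the upper hinge, NA if there is none
    primeiro_cima = map2_dbl(outliers, upper, function(o, u) {
      acima <- o[o > u]
      if (length(acima) > 0) min(acima) else NA_real_
    })
  ) %>%
  select(PANEL, group, primeiro_baixo, primeiro_cima)

resumo
```

Keeping the result as one row per box makes it straightforward to join back to the original data and filter on these limits.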

Using the graphics package

The graphics package ships with R (it is the package that provides the function plot, for example) and has a function that makes this task easier.

res <- boxplot(var ~ group, gather(dataset, key = 'group', value = 'var', V1:V6))
res$out
# [1] 2148.2080  971.3702  967.8154  938.0451  979.6601 2038.3428 2036.9718 ...
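
The vector res$out alone does not say which group each outlier came from. The object returned by boxplot() also has the elements group and names, which can be combined to label the values. A small sketch on made-up data (toy is hypothetical), with plot = FALSE so that nothing is drawn:

```r
# Hypothetical toy data: group "a" has outliers on both sides,
# group "b" only above the box
toy <- data.frame(
  group = rep(c("a", "b"), each = 7),
  var   = c(1, 10, 11, 12, 13, 14, 30,
            20, 21, 22, 23, 24, 25, 60)
)

# Compute the boxplot statistics without drawing the plot
res_toy <- boxplot(var ~ group, data = toy, plot = FALSE)

# res_toy$group holds, for each outlier, the index of its box in res_toy$names
outliers_por_grupo <- data.frame(
  group = res_toy$names[res_toy$group],
  var   = res_toy$out
)
outliers_por_grupo
```

Note that boxplot() computes the hinges with fivenum(), so for some samples its outliers can differ slightly from the ones ggplot2 reports.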

  • Hello, Tomas. I believe the part "retiramos dasd" is disconnected from the rest of the text.

  • Got it. Thank you.
