Error plotting with ggplot


library(tidyverse)
library(lubridate)

Consider a dataset with a year column and the number of books an author published in that year:

df<-tibble(
  ano = c(2007, 2008, 2017, 2018),
  n = c(1, 2, 2, 1)
)

df

# A tibble: 4 x 2
    ano     n
  <dbl> <dbl>
1  2007     1
2  2008     2
3  2017     2
4  2018     1

If I try to create a chart with:

df %>% 
  ggplot(aes(ano, n))+
  geom_line()

...I get the following:

[image: plot produced by the code above]

My interpretation of the resulting graph (please correct me if I'm wrong) is that 1 book was published in 2007, 2 books in 2008, 2 books in each year between 2008 and 2017, and 1 book again in 2018.

The problem is that there are actually no records between 2008 and 2017.

I assumed that storing the year as a plain number (that is, without declaring it as a date) had caused this misreading.

So I converted the year column (ano) into actual dates, hoping that would solve the problem. I used lubridate's ymd() function and saved the result in a new dataset called "df2":

df2<-df %>% 
  mutate(ano = ymd(paste0(ano, "01", "01")))

df2

# A tibble: 4 x 2
  ano            n
  <date>     <dbl>
1 2007-01-01     1
2 2008-01-01     2
3 2017-01-01     2
4 2018-01-01     1

And plotted the graph:

df2 %>% 
  ggplot(aes(ano, n))+
  geom_line()+
  theme_bw()

However, the result was exactly the same chart as before:

[image: plot produced by the code above, identical to the first one]

I expected the correct graph to be one line (rising) from x=2007, y=1 to x=2008, y=2, and then a separate line (falling) from x=2017, y=2 to x=2018, y=1. In other words, I think there should be no straight line linking 2008 to 2017 at y=2.

Where is my mistake?

  • 1

    Good evening! If there is no record connecting those values over time, this behavior should not happen... One way to plot it is to convert the year to a factor, but you still will not get the drop you expect. Cheers!

  • 1

    geom_line connects the points in order along the x-axis. This is common behavior in R/Python plotting libraries: the program has no way of knowing the frequency of the observations or whether connecting them is meaningful. Ideally, complete the x-axis data and give the missing years a value of 0 or NA.
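
The NA option mentioned in the comment above could look like the minimal sketch below. It assumes tidyr is installed and uses tidyr::full_seq() to build the year sequence. The names df_na and p_na are made up for illustration. Because complete() is called without fill, n stays NA for the missing years, and geom_line() breaks the line at NAs that fall in the middle of a series, which should give the two separate segments described in the question.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  ano = c(2007, 2008, 2017, 2018),
  n = c(1, 2, 2, 1)
)

# Add one row per missing year; with no `fill` argument, n is NA for those rows
df_na <- complete(df, ano = full_seq(ano, 1))

# NAs in the middle of the series break the line instead of being bridged
p_na <- ggplot(df_na, aes(ano, n)) +
  geom_line()

p_na
```

Filling with NA rather than 0 keeps "no record" distinct from "zero books published", which seems closer to what the data means here.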

2 answers

5


Here is an example that creates the full sequence of years and sets n to zero for the years with no record:

library(ggplot2)
library(dplyr)

df <-data.frame(
  ano = c(2007, 2008, 2017, 2018),
  n = c(1, 2, 2, 1)
)

anos <- data.frame(ano = 2007:2018)

left_join(anos, df, by = "ano") %>%
  replace(is.na(.), 0) %>%
  ggplot(aes(factor(ano), factor(n), group = 1)) +
  geom_line()

[image: plot produced by the code above]

2

One solution is to use tidyr::complete() to add the missing years with n set to 0.

df %>% 
  complete(ano = 2007:2018, fill = list(n = 0)) %>%
  ggplot(aes(ano, n))+
  geom_line()

[image: plot produced by the code above]
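
If adding rows is not desired, another possible approach (not taken from the answers above) is to mark each run of consecutive years and pass that marker to the group aesthetic, so geom_line() draws one line per run. This is only a sketch: the column name grupo and the rule "a gap of more than one year starts a new run" are assumptions made for illustration.

```r
library(ggplot2)

df <- data.frame(
  ano = c(2007, 2008, 2017, 2018),
  n = c(1, 2, 2, 1)
)

# Hypothetical helper column: the counter increases each time the gap
# to the previous year is larger than 1, so consecutive years share an id
df$grupo <- cumsum(c(TRUE, diff(df$ano) > 1))

# One line per run of consecutive years; no segment spans the 2008-2017 gap
p_grupo <- ggplot(df, aes(ano, n, group = grupo)) +
  geom_line()

p_grupo
```

This assumes the rows are already sorted by ano; otherwise sort them before computing the differences.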
