How to count sequence numbers in R?

0

I have a spreadsheet with the amount of rain for each day of the month. Whenever there are at least 5 consecutive days with a cell value equal to 0, a counter is incremented by 1. The next sequence of 5 or more consecutive zero days increments the counter by 1 again, and so on. If there are 3 such sequences, the final count is 3. How can I implement this in R?

Below is an image of how the data are laid out in the spreadsheet, with values for the 31 days of a month.

  • Welcome to Stack Overflow! Unfortunately, as written, this question cannot be reproduced by anyone trying to answer it. Please take a look at this link and see how to ask a reproducible question in R, so that people who wish to help you can do so in the best possible way.

2 answers

1

My friend, please read the recommendations in the comment, as they are very important to ensure the quality of questions and answers. Since you are new here on SO, I will cut you some slack and answer the question, which, by the way, is far from trivial, especially for those who are just starting with R.

REPRODUCIBLE EXAMPLE

The first thing you should have done in your question is to provide a reproducible example, so I will create a series of numbers that might well be from a count.

data <- sample(x = 1:10, size = 100000, replace = TRUE)

What this code does is draw a sample, with replacement, of 100,000 numbers between 1 and 10. Obviously there will be some repetitions, and I want to know how many times the same number appears 5 times in a row.

FINDING ANY REPETITION OF 5 NUMBERS IN A ROW

For that I will use the function rollapply from the zoo package. This function creates a "window" that slides along the sequence of numbers. Inside this window you can apply any function, such as the mean. Here I will use the mean:

> library(zoo)
> data[1:10]
[1]  8  2  8  6 10 10  4  6  7 10
> rollapply(data = data[1:10], 5, mean)
[1] 6.8 7.2 7.6 7.2 7.4 7.4

Note that the function returned only 6 numbers. This happens because a window of 5 can only be placed in 6 positions (10 - 5 + 1) when the dataset has only 10 numbers. See this figure:

enter image description here

I outlined the window the way rollapply moves it. For each of the 6 window positions, it calculated the mean. Now, how can this help you find sequences of 5 consecutive days with the same number?
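To make the sliding window concrete, here is a minimal base R sketch that builds the windows by hand instead of calling rollapply. The vector `x` and the names `width`, `n_windows` and `windows` are made up for illustration; only the idea of a window of 5 moving one position at a time comes from the text above.

```r
# Hypothetical 10-day series (illustrative values, not the asker's data)
x <- c(0, 0, 3, 1, 0, 0, 0, 0, 0, 2)
width <- 5

# A window of `width` values fits in length(x) - width + 1 positions
n_windows <- length(x) - width + 1

# Extract each window explicitly: x[1:5], x[2:6], ..., x[6:10]
windows <- lapply(seq_len(n_windows), function(i) x[i:(i + width - 1)])

# Apply a summary function to every window, one result per position
window_means <- sapply(windows, mean)
print(n_windows)
print(window_means)
```

Swapping `mean` for any other summary function changes what is computed per window, but not the number of windows.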

Just calculate, in each window, something that gives zero when all the numbers are the same! The function that does this is the standard deviation: if all numbers are equal, the standard deviation is 0. So just use rollapply along with the function sd, and that does the trick! For example:

> dados = c(5,5,5,5,5,1,3,4,5,5,5,5,5)
> rollapply(dados, 5, sd)
[1] 0.0000000 1.7888544 1.7888544 1.6733201 1.6733201
[6] 1.6733201 0.8944272 0.4472136 0.0000000

There are two sequences of five 5s, and therefore two standard deviations equal to 0 appeared. Now just count how many times zero appears.

> sum(rollapply(dados, 5, sd) == 0)
[1] 2

Doing the same on the data I generated, which is a much longer sequence that we couldn't inspect by eye, we get:

> sum(rollapply(data, 5, sd) == 0)
[1] 6
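One thing to keep in mind: this approach counts windows, not runs, so a run longer than 5 contributes more than one zero-sd window. If each run of 5 or more should count only once, as the question asks, one possible alternative is base R's rle(). The sketch below is an illustration under that assumption; the vector `x` and the name `min_len` are made up, and embed() is used only to build the windows without needing zoo.

```r
# Illustrative vector: one run of seven 5s and one run of five 2s
x <- c(5, 5, 5, 5, 5, 5, 5, 1, 3, 2, 2, 2, 2, 2, 4)
min_len <- 5

# Window-based count: every zero-sd window of length 5 is counted
window_sd <- apply(embed(x, min_len), 1, sd)
n_windows <- sum(window_sd == 0)

# Run-based count: rle() collapses each run into a single entry
runs <- rle(x)
n_runs <- sum(runs$lengths >= min_len)

print(n_windows)  # the run of seven alone contributes three windows
print(n_runs)
```

Here the window count is 4 while the run count is 2, so the two approaches only agree when no run is longer than the window.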

SPECIFICALLY FINDING A SEQUENCE OF ZEROS

If your series repeats several different numbers in a row and you are only interested in the zeros, not in repetitions of other numbers, you can transform the data by adding noise to the numbers that are not zero.

data2 <- ifelse(data == 0, data, jitter(data))
sum(rollapply(data2, 5, sd) == 0)

Now only the sequences made up entirely of zeros are counted, since other values that previously repeated no longer repeat once the noise is added.
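For the original question (runs of zero-rain days), the noise step can also be avoided by working on the logical vector `rain == 0`. The sketch below is one possible approach, not the method used above; the function name `count_dry_spells` and the sample `rain` vector are hypothetical, and it assumes there are no missing values.

```r
# Count runs of at least `min_len` consecutive zeros (hypothetical helper)
count_dry_spells <- function(rain, min_len = 5) {
  runs <- rle(rain == 0)  # runs of TRUE (dry days) and FALSE (wet days)
  sum(runs$values & runs$lengths >= min_len)
}

# Made-up 31-day month containing dry runs of 5, 3, 6, 2 and 4 days
rain <- c(0, 0, 0, 0, 0, 2.1, 0.4, 0, 0, 0, 7, 1.2, 0.3,
          0, 0, 0, 0, 0, 0, 5, 3.3, 0, 0, 1, 0.8, 2, 0, 0, 0, 0, 6)
print(count_dry_spells(rain))
```

Only the runs of 5 and 6 days reach the threshold, so the call returns 2.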

A WORD OF CAUTION

Note that in the previous example I generated 100,000 numbers by a random draw from 1 to 10 and, still, I found six sequences of repeated numbers! The warning is: when you find this type of pattern in a series, especially a long one such as a rainfall series spanning more than 30 years, ask yourself the following question: is the pattern I found so unlikely that I can call it significant? It is easy to assign meaning to random noise in the data. There are many academic papers out there with this problem.
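The size of this effect can be estimated with a back-of-the-envelope calculation. For independent draws from 10 equally likely values, a given window of 5 is constant with probability 10 * (1/10)^5 = 1/10,000, so about 10 constant windows are expected in 100,000 draws by chance alone. A sketch of that arithmetic, assuming independent and uniform draws (the variable names are illustrative):

```r
n <- 100000  # length of the simulated series
k <- 5       # window width
m <- 10      # number of equally likely values (1 to 10)

# Probability that one window of k draws holds a single repeated value
p_window <- m * (1 / m)^k

# Expected number of constant windows among the n - k + 1 windows
expected_windows <- (n - k + 1) * p_window
print(p_window)
print(expected_windows)
```

A count of this order is therefore what pure noise would produce, which is the point of the warning.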

-1

I will try to show you a way to solve this problem without relying on specialized functions. Here I only use the dplyr package, which is part of the tidyverse. Assuming the data are in the same format as df below:

library(tidyverse)

set.seed(12)
df <- tibble(dados = c(1,0,0,0,0,0, sample(x = 0:1, size = 1000, replace = TRUE)))

df
#> # A tibble: 1,006 x 1
#>    dados
#>    <dbl>
#>  1     1
#>  2     0
#>  3     0
#>  4     0
#>  5     0
#>  6     0
#>  7     0
#>  8     1
#>  9     1
#> 10     0
#> # … with 996 more rows

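df %>%
    mutate(is_zero = dados == 0) %>%
    # a streak starts at every zero that is not preceded by another zero,
    # so an isolated zero gets its own id instead of joining an earlier streak
    mutate(is_start_of_streak = is_zero & !lag(is_zero, default = FALSE)) %>%
    # number the streaks; non-zero rows get id 0 and are dropped next
    mutate(streak_id = cumsum(is_start_of_streak) * is_zero) %>%
    filter(streak_id != 0) %>%
    group_by(streak_id) %>%
    summarise(size = n()) %>%
    mutate(size_g_5 = size >= 5) %>%
    summarise(total_de_streaks_maior_que_5 = sum(size_g_5))
# The result is a 1 x 1 tibble whose column total_de_streaks_maior_que_5
# holds the number of zero streaks of length 5 or more.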

Created on 2019-04-18 by the reprex package (v0.2.1)
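The streak-labelling idea in the pipeline above can be traced on a vector small enough to count by hand. The base R sketch below is only an illustration: the vector `x` and the variable names are made up, and it marks a streak start at every zero that is not preceded by another zero, so an isolated zero gets its own streak instead of being merged into an earlier one.

```r
# Made-up series with zero streaks of length 5, 1 and 6
x <- c(1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0)

is_zero <- x == 0

# is_zero of the previous element, with FALSE standing in before the first one
prev_is_zero <- c(FALSE, head(is_zero, -1))

# A streak starts at a zero that does not follow another zero
is_start <- is_zero & !prev_is_zero

# cumsum() gives each streak its own id; multiplying by is_zero
# resets the id to 0 on non-zero days
streak_id <- cumsum(is_start) * is_zero

# Size of each streak, then how many reach 5 or more
streak_sizes <- table(streak_id[streak_id != 0])
n_long_streaks <- sum(streak_sizes >= 5)
print(streak_sizes)
print(n_long_streaks)
```

The three streaks have sizes 5, 1 and 6, so two of them reach the threshold of 5.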
