How to create numerical samples based on multiple conditions on multiple vectors?

Given the following data frame:

df <- tibble::tribble(
  ~pass_id, ~km_ini, ~km_fin,
        1L,    0.89,    2.39,
        2L,    1.53,    3.03,
        3L,    21.9,    23.4,
        4L,    23.4,    24.9,
        5L,      24,    25.5,
        6L,    25.9,    27.4,
        7L,    36.7,    38.2,
        8L,    41.4,    42.9,
        9L,    42.1,    43.6,
       10L,    45.5,      47
  )

Created on 2020-02-17 by the reprex package (v0.3.0)

I need a sample of 50 numbers that satisfy the following criteria against the data frame as a whole, not just against individual rows:

  1. >= 0.750
  2. <= 99.450
  3. < km_ini - 0.750, or
  4. > km_fin + 0.750 (one of 3 or 4 must hold for every row)

So far I have only managed the easy part, the first two criteria (which I could have enforced directly in the draw itself with runif, so no credit there). After that I tried enframe followed by filter, without success.

P.S.: I don’t necessarily need the result as a data frame; a vector is fine.

library(tidyverse, verbose = FALSE)

set.seed(42)
sort(runif(100000, 0, 99.450)) %>% 
  enframe(., "ID", "km") %>% 
  filter(km >= .750 & km <= 99.450 - .750)
#> # A tibble: 98,467 x 2
#>       ID    km
#>    <int> <dbl>
#>  1   763 0.750
#>  2   764 0.751
#>  3   765 0.751
#>  4   766 0.753
#>  5   767 0.753
#>  6   768 0.754
#>  7   769 0.754
#>  8   770 0.755
#>  9   771 0.755
#> 10   772 0.757
#> # … with 98,457 more rows

Created on 2020-02-17 by the reprex package (v0.3.0)
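
One possible way to extend the filter above to criteria 3 and 4 is to keep only the candidates that fall outside every buffered interval of df. The sketch below is an assumption about the intended logic rather than the asker's code: it uses base R only, reads criteria 3 and 4 as "one of them must hold for every row", takes the upper bound 99.450 from the criteria list, and is_allowed() is a hypothetical helper name.

```r
# Pass limits copied from the question (base R data frame instead of a tibble).
df <- data.frame(
  pass_id = 1:10,
  km_ini  = c(0.89, 1.53, 21.9, 23.4, 24.0, 25.9, 36.7, 41.4, 42.1, 45.5),
  km_fin  = c(2.39, 3.03, 23.4, 24.9, 25.5, 27.4, 38.2, 42.9, 43.6, 47.0)
)

buffer <- 0.750   # 750 m expressed in km
lower  <- 0.750   # criterion 1
upper  <- 99.450  # criterion 2 (assumed from the criteria list)

# Hypothetical helper: TRUE when a value lies outside EVERY buffered
# interval [km_ini - buffer, km_fin + buffer] in the data frame.
is_allowed <- function(x, ini, fin, buffer) {
  vapply(x, function(v) all(v < ini - buffer | v > fin + buffer), logical(1))
}

set.seed(42)
candidates <- runif(10000, lower, upper)   # criteria 1 and 2 hold by construction
valid      <- candidates[is_allowed(candidates, df$km_ini, df$km_fin, buffer)]
km_sample  <- sort(sample(valid, 50))      # final sample of 50 numbers
```

Drawing more candidates than needed and discarding the invalid ones is simple but wasteful. It is workable here only because the allowed stretches cover most of the range.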

EDIT: Trying to visually represent the problem

The final result must be a set of numbers that is checked against the entire data set, not against each row separately. As an example, the following diagram represents the first two rows:

(Figure: schematic representation)

In the diagram:

  • The black line indicates that I cannot have values smaller than 0.750.
  • The blue line indicates where I can’t have records because of the stretch covered by km_ini and km_fin of row 1 (arrows), plus a buffer of 750 m on either side (between the arrows and the dots).
  • The red line indicates where I can’t have records because of the stretch covered by km_ini and km_fin of row 2 (arrows), plus a buffer of 750 m on either side (between the arrows and the dots).

So, right from the start, within the first 4,000 m the random sample could only contain values from 3,030 + 750 = 3,780 m onward (that is, km greater than 3.78).

The question, then, is how to do this programmatically, so that every row of the data frame is evaluated and none of the generated numbers falls inside any of the excluded ranges described above.
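
One way to approach this programmatically is to widen every [km_ini, km_fin] by the 750 m buffer, merge the widened intervals that overlap, and read the allowed stretches off the gaps between them. The following is a minimal base R sketch under that reading of the problem; the names blocked, merged and allowed are illustrative, and the bounds 0.750 and 99.450 are taken from the criteria list.

```r
# Pass limits copied from the question.
df <- data.frame(
  km_ini = c(0.89, 1.53, 21.9, 23.4, 24.0, 25.9, 36.7, 41.4, 42.1, 45.5),
  km_fin = c(2.39, 3.03, 23.4, 24.9, 25.5, 27.4, 38.2, 42.9, 43.6, 47.0)
)
buffer <- 0.750
lower  <- 0.750
upper  <- 99.450

# Forbidden interval of each row (stretch plus buffer), ordered by start.
blocked <- data.frame(start = df$km_ini - buffer, end = df$km_fin + buffer)
blocked <- blocked[order(blocked$start), ]

# A row opens a new group only if it starts after all previous intervals ended;
# otherwise it overlaps (or touches) the running forbidden zone.
prev_end <- c(-Inf, head(cummax(blocked$end), -1))
group    <- cumsum(blocked$start > prev_end)

# One merged forbidden zone per group.
merged <- data.frame(
  start = as.vector(tapply(blocked$start, group, min)),
  end   = as.vector(tapply(blocked$end,   group, max))
)

# Allowed stretches are the gaps between merged zones, clipped to [lower, upper].
allowed <- data.frame(
  from = pmax(c(lower, merged$end), lower),
  to   = pmin(c(merged$start, upper), upper)
)
allowed <- allowed[allowed$from < allowed$to, ]
```

For the data in the question, the first allowed stretch starts at about 3.78 km, in line with the 3,030 + 750 m worked out above.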

1 answer

I’m not sure I understand the question, but if you want n = 50 random numbers as described after the latest edit to the question, perhaps the following code solves the problem.

  1. Compute the minimum and maximum values described above.
  2. Create a table with n rows, randomly choosing which rows to replicate.
  3. Generate random numbers km within the computed limits.
  4. Discard the auxiliary columns.

If you only need the generated numbers, you can end the pipeline with one more %>% into select(km).

library(tidyverse, verbose = FALSE)

set.seed(42)

n <- 50

df %>%
  rowwise() %>%
  mutate(Min = pmax(km_ini - 0.750, 0.750),
         Max = pmin(km_fin + 0.750, 99.450)) %>%
  ungroup() %>%
  left_join(tibble(pass_id = sample(.$pass_id, n, TRUE)), by = "pass_id") %>%
  mutate(km = if_else(Min >= 0.750, runif(n(), max(Max), 99.450), runif(n(), 0.750, min(Min)))) %>%
  select(-Min, -Max)
## A tibble: 50 x 4
#   pass_id km_ini km_fin    km
#     <int>  <dbl>  <dbl> <dbl>
# 1       1   0.89   2.39  77.0
# 2       1   0.89   2.39  91.7
# 3       1   0.89   2.39  57.5
# 4       1   0.89   2.39  61.8
# 5       1   0.89   2.39  90.6
# 6       2   1.53   3.03  83.6
# 7       2   1.53   3.03  60.2
# 8       2   1.53   3.03  50.0
# 9       2   1.53   3.03  55.0
#10       2   1.53   3.03  58.9
## … with 40 more rows
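
An alternative that checks every value against the whole data frame is to sample directly from the gaps left between the buffered intervals. This is a sketch of a different technique from the code above, not a correction of it. It assumes base R, the bounds 0.750 and 99.450 from the question, and the illustrative names gaps, pick and km_sample.

```r
# Pass limits copied from the question.
df <- data.frame(
  km_ini = c(0.89, 1.53, 21.9, 23.4, 24.0, 25.9, 36.7, 41.4, 42.1, 45.5),
  km_fin = c(2.39, 3.03, 23.4, 24.9, 25.5, 27.4, 38.2, 42.9, 43.6, 47.0)
)
buffer <- 0.750
lower  <- 0.750
upper  <- 99.450

# Sort the buffered intervals by start. cummax() carries the furthest end seen
# so far, so overlapping intervals behave as a single forbidden zone.
ord   <- order(df$km_ini)
start <- df$km_ini[ord] - buffer
end   <- cummax(df$km_fin[ord] + buffer)

# Candidate gaps run from each running end to the next start,
# clipped to [lower, upper]; empty or negative gaps are dropped.
from <- pmax(c(lower, end), lower)
to   <- pmin(c(start, upper), upper)
gaps <- data.frame(from = from, to = to)[from < to, ]

set.seed(42)
n <- 50
# Pick a gap with probability proportional to its length,
# then draw uniformly inside the chosen gap.
pick      <- sample(nrow(gaps), n, replace = TRUE, prob = gaps$to - gaps$from)
km_sample <- runif(n, min = gaps$from[pick], max = gaps$to[pick])
```

Weighting the gaps by length keeps the draw uniform over the whole allowed region, and no candidate has to be discarded.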

  • I really appreciate the help; maybe I didn’t ask the question in the clearest way (how can I improve it?). The result in the km column doesn’t work for me because it applies the criteria from some rows of the original df, but not from the whole df. Take, for example, row 8 of the result in your example. It gives a value of 2.54. However, 2.54 is not 0.750 less than the next km_ini, which is 1.53, so this value doesn’t work for me. Even the value in row 7 (0.413) doesn’t work, because it is not 0.750 less than 0.89, nor is it greater than 0.750 (item 1 of the criteria list).

  • @rdornas I understood that, for each row, the random value should lie between km_ini - 0.750 and km_fin + 0.750, which is why runif has those limits.

  • @rdornas Indeed, it is not greater than 0.750 (condition 1). So in row 8 it should be between max(km_ini - 0.750, 0.750) and min(km_fin + 0.750, 3.14), shouldn’t it? Those range limits are 0.750 and 3.14.

  • I think your approach is on the right track. The point is that each number has to be checked against the df as a whole. Think of stretches of a highway, where none of the random numbers may fall within any interval between km_ini and km_fin (nor within 750 m before km_ini or 750 m after km_fin) anywhere in the data frame. Under these conditions some stretches overlap, which makes the problem harder to solve. I really appreciate the help and the discussion!

  • @rdornas I think this is it. See now.

  • It’s amazing how something so simple can be so complex. Unfortunately we haven’t reached the answer yet. For example, in row 1, 0.829 does not satisfy the condition km_ini - 0.750. In fact, in the short excerpt of the result you posted, there is no result that is km_ini < km_ini - 750 or km_fin > km_fin + 750. In the end, note that I cannot have any number less than 2.39 (the first km_fin): in row 1, for example, 0.89 - 0.750 = 0.14, but condition 1 says the value has to be at least 0.750, and I can’t have numbers between km_ini and km_fin either. I apologize for the trouble!

  • @rdornas Row 1: km always has to be > 0.75, and km_ini - 0.75 = 0.14, so the lower limit is 0.75 and km_ini = 0.89. At the upper end, km must be < 99.45, and km_fin + 0.75 = 3.14, so the upper limit is 3.14 and km_fin = 2.39. So the random number km is either in [0.75, 0.89] or in [2.39, 3.14]. That’s how I understand the problem. Is that right? (Apparently not.) When you say that "there is no result that is km_ini < km_ini - 750 or km_fin > km_fin + 750", of course there isn’t: cancelling km_ini on both sides gives 0 < -0.75, and likewise for km_fin.

  • Rui, I’ve edited the question again and added an illustration to show the problem more visually. Maybe it’s a little clearer now. If you have any questions, don’t hesitate to ask.
