Sampling without repetition in R, taking two variables into account

I have two datasets. One contains the rows I want to sample from, and the other contains the sample size required for each date. The first, which is the data I need to sample from, is called "bons" ("good"); an example is shown below:

CNPJ    data
333333  201601
333333  201612
111111  201612
111111  201610
111111  201607
111111  201611
22222   201605
22222   201606
22222   201610
22222   201509
99999   201605
99999   201612
99999   201611
99999   201601

The second dataset, shown below, is called "tamamostra". It holds only the sample size I need for each date, and the sample must be made up of CNPJs that do not repeat:

data    201509  201510  201512  201601  201602  201603  201604  201605  201606  201607  201610  201611  201612  Total
ruins   1       1       1       6       4       3       2       4       3       5       5       4       6       45
bons    3       3       3       14      10      7       5       10      7       12      12      10      14      105
Total   4       4       4       20      14      10      7       14      10      17      17      14      20      155

I need to draw a sample of the "bons" size for each date without repeating any CNPJ. That is, for 201509 I need a sample of size 3 with 3 different CNPJs, and these CNPJs cannot appear again for the other dates; for 201601 I need a sample of size 14 with CNPJs that were not drawn for an earlier date, and so on. In the end I should have a total sample of size 105 made up of unique CNPJs. Note that some CNPJs have no rows for some dates.

I tried to build this sample with a for loop and sample(). However, since nothing in my code prevents a CNPJ from being drawn more than once, some CNPJs came out repeated:

for(i in 2:14){
bons1[i]<-subset(bons,data==tamamostra[1,i])[sample(nrow(subset(bons,data==tamamostra[1,i])), tamamostra[3,i]), ]
}

How can I do this in R? I believe the dplyr package should offer some solution.

  • Such a sample may not exist. For example, if the 4 CNPJs of 201510 are the same as those of 201512, there is no way to choose 3 from one date and 3 from the other without having something in common.

  • Your question is unclear. You need a sample of size n with n Cnpjs, for the dates m, ok, but what is the relationship between the two datasets?

  • I edited the question. The dataset is large, so the sample is possible.

  • @T.Veiga, the relationship between dataset 1 and dataset 2 is still not clear. Take your example: for date = 201509 you need a sample of 3, i.e. 3 CNPJs, and this sample would be taken from dataset 1 (bons), correct? Would each CNPJ be drawn at random, or is it tied to the date column? Would the output be merged or kept separate? The possible output for this example could be: "111111", "22222", "333333" or 11111122222333333?

1 answer

Since your example data is not large enough to sample without repetition, I am generating a different, simpler dataset just for demonstration:

dados <- data.frame(
  CNPJ = rep(1:20, each = 3),
  data = 2015:2017
)

tam <- data.frame(
  data = 2015:2017,
  bons = 1:3
)

The sample size table needs to be in "long" format, with one row per date. For your data, you can convert it as follows:

tamamostra <- read.table(text = c('
  data    201509  201510  201512  201601  201602  201603  201604  201605  201606  201607  201610  201611  201612  Total
  ruins   1          1       1       6       4       3       2       4       3       5       5       4       6       45
  bons    3          3       3       14      10      7       5       10      7       12      12      10      14      105
  Total   4          4       4       20      14      10      7       14     10    17         17      14      20      155')
)
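# read.table() is called without header = TRUE on purpose: the row of dates
# is read as data, so it becomes the 'data' column after transposing.
# Drop the label column and the Total column, then transpose to long format.
tam <- as.data.frame(t(tamamostra[, -c(1, ncol(tamamostra))]))
names(tam) <- as.character(tamamostra[[1]])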
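Before sampling, it may be worth checking whether each date has at least as many distinct CNPJs as the size requested for it. This is only a necessary condition (as the first comment on the question points out, overlapping CNPJs between dates can still make the sample impossible). A minimal sketch using the example rows from the question; the names `disponivel` and `checagem` are made up for this illustration:

```r
# Example rows from the question
bons <- data.frame(
  CNPJ = c(333333, 333333, 111111, 111111, 111111, 111111, 22222,
           22222, 22222, 22222, 99999, 99999, 99999, 99999),
  data = c(201601, 201612, 201612, 201610, 201607, 201611, 201605,
           201606, 201610, 201509, 201605, 201612, 201611, 201601)
)

# Sample sizes in long format (the 'bons' row of tamamostra)
tam <- data.frame(
  data = c(201509, 201510, 201512, 201601, 201602, 201603, 201604,
           201605, 201606, 201607, 201610, 201611, 201612),
  bons = c(3, 3, 3, 14, 10, 7, 5, 10, 7, 12, 12, 10, 14)
)

# Distinct CNPJs available on each date (dates with no rows count as 0)
pares <- unique(bons[, c("CNPJ", "data")])
disponivel <- table(factor(pares$data, levels = tam$data))

# Side-by-side comparison: requested size versus available CNPJs
checagem <- data.frame(tam, n_cnpj = as.integer(disponivel))
checagem$ok <- checagem$n_cnpj >= checagem$bons
checagem

# The total requested also cannot exceed the number of distinct CNPJs
length(unique(bons$CNPJ)) >= sum(tam$bons)
```

On the question's small example every date falls short, which is why the demonstration uses a different dataset.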

Using a loop with subsetting

The idea here is to sample CNPJs one date at a time and, after each draw, remove the drawn CNPJs from the data so they cannot be drawn again:

# data.frame to receive the samples
amostra <- data.frame(
  CNPJ = NA,
  data = rep(tam$data, tam$bons)
)

# copy of the data, to preserve the original
dados.temp <- dados

for (d in tam$data) {
  # CNPJs still available on this date
  pool <- unique(dados.temp[dados.temp$data == d, 'CNPJ'])
  # index with sample.int(): sample(x, n) draws from 1:x when x is a single number
  samp.cnpj <- pool[sample.int(length(pool), tam[tam$data == d, 'bons'])]
  amostra[amostra$data == d, 'CNPJ'] <- samp.cnpj
  # remove the drawn CNPJs so they cannot be drawn for another date
  dados.temp <- dados.temp[!dados.temp$CNPJ %in% samp.cnpj, ]
}
rm(dados.temp, pool, samp.cnpj)

> amostra
  CNPJ data
1    6 2015
2   18 2016
3    8 2016
4    7 2017
5   15 2017
6   19 2017
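A quick way to confirm that a result like this meets both requirements (no repeated CNPJ, and the requested size for each date) is sketched below. The helper `verifica_amostra` is a hypothetical name introduced for this illustration, and the sample is the printed output above typed in by hand:

```r
# Hypothetical helper: checks uniqueness of CNPJs and per-date sample sizes
verifica_amostra <- function(amostra, tam) {
  tamanhos <- table(factor(amostra$data, levels = tam$data))
  list(
    sem_repeticao = anyDuplicated(amostra$CNPJ) == 0,
    tamanhos_ok   = all(as.integer(tamanhos) == tam$bons)
  )
}

# The sample shown above, and the demonstration size table
amostra <- data.frame(
  CNPJ = c(6, 18, 8, 7, 15, 19),
  data = c(2015, 2016, 2016, 2017, 2017, 2017)
)
tam <- data.frame(data = 2015:2017, bons = 1:3)

verifica_amostra(amostra, tam)
```

Both elements of the returned list should be TRUE for a valid sample.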

Drawing a date for each CNPJ first

Here the idea is to first draw a single date for each CNPJ (so there can be no repetition) and then sample the CNPJs within each date, using the data.table package. This solution is potentially faster for a very large dataset, but because each CNPJ is tied to one date before sampling, some dates may be left with fewer CNPJs than the required sample size.

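library(data.table)
setDT(dados)
# 1) draw one date per CNPJ; indexing with sample.int() avoids sample()
#    treating a single date as the range 1:date
# 2) join the sample sizes by date
# 3) draw 'bons' CNPJs within each date
amostra <- dados[, .(data = data[sample.int(.N, 1)]), by = CNPJ
  ][tam, on = 'data'
  ][, .(CNPJ = CNPJ[sample.int(.N, bons[1])]), by = data]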

> amostra
   data CNPJ
1: 2015    9
2: 2016    1
3: 2016   16
4: 2017    2
5: 2017    8
6: 2017    7

(Thanks to @juan-antonio-Roldán-Díaz for the suggestion of this idea)
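The caveat mentioned for this approach (a date ending up with too few CNPJs) can be checked after the first step and before the final draw. The following is a base R sketch on the demonstration data, not part of the data.table solution itself; `linhas`, `sorteio` and `checagem` are names made up for this illustration:

```r
set.seed(1)  # only so the illustration is reproducible
dados <- data.frame(CNPJ = rep(1:20, each = 3), data = 2015:2017)
tam   <- data.frame(data = 2015:2017, bons = 1:3)

# Step 1: keep one randomly chosen row (that is, one date) per CNPJ
linhas <- vapply(
  split(seq_len(nrow(dados)), dados$CNPJ),
  function(i) i[sample.int(length(i), 1)],
  integer(1)
)
sorteio <- dados[linhas, ]

# Step 2: CNPJs that landed on each date versus the size requested for it
n_por_data <- table(factor(sorteio$data, levels = tam$data))
checagem <- data.frame(tam, disponivel = as.integer(n_por_data))
checagem$falta <- pmax(checagem$bons - checagem$disponivel, 0)
checagem
```

If any value in `falta` is greater than zero, one option is to repeat the first step; another is to fall back on the loop approach, which only removes CNPJs after they are actually drawn.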

  • @T.Veiga I don’t use dplyr, but if you want to implement the second approach with it, this answer on the English Stack Overflow may help you: https://stackoverflow.com/a/41669338/9817508
