I have two datasets: one holds the rows I want to sample from, and the other holds the sample size required for each date. The first, the fact table I need to sample from, is called "bons" ("good") and is illustrated below:
CNPJ data
333333 201601
333333 201612
111111 201612
111111 201610
111111 201607
111111 201611
22222 201605
22222 201606
22222 201610
22222 201509
99999 201605
99999 201612
99999 201611
99999 201601
The second dataset, called "tamamostra", is shown below. It holds only the sample size I need for each date, and the sample must be drawn from CNPJs that do not repeat:
data 201509 201510 201512 201601 201602 201603 201604 201605 201606 201607 201610 201611 201612 Total
ruins 1 1 1 6 4 3 2 4 3 5 5 4 6 45
bons 3 3 3 14 10 7 5 10 7 12 12 10 14 105
Total 4 4 4 20 14 10 7 14 10 17 17 14 20 155
I need to draw, for each date, a sample of the size given in the "bons" row, without repeating any CNPJ. That is, for 201509 I need a sample of size 3 with 3 different CNPJs, and those CNPJs cannot appear again for the other dates; for 201601 I need a sample of size 14 with CNPJs that were not drawn for the previous dates; and so on, ending with a total sample of size 105 made up of unique CNPJs. Note that some CNPJs have no rows for some dates.
I tried using a for loop with sample to draw this sample; however, because I did not specify that a CNPJ could not be repeated, some CNPJs came out repeated:
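bons1 <- list()  # one sampled data frame per date column
for (i in 2:14) {
  linhas <- subset(bons, data == tamamostra[1, i])  # rows of bons for this date
  # nothing here stops a CNPJ from being drawn again for another date
  bons1[[i]] <- linhas[sample(nrow(linhas), tamamostra[3, i]), ]
}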
How can I do this in R? I believe the dplyr package should offer some solution.
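One possible approach, sketched here in base R rather than dplyr, is to walk through the dates one at a time and remove every CNPJ already drawn from the pool for the next date. The function name `amostra_sem_repeticao` and the named vector `tamanhos` (dates as names, sample sizes as values) are assumptions made for this illustration and are not part of the original data. The toy sizes below are also hypothetical; the real ones would come from the "bons" row of tamamostra.

```r
# Hypothetical helper: for each date, draw the requested number of distinct
# CNPJs from `bons`, skipping any CNPJ already drawn for an earlier date.
amostra_sem_repeticao <- function(bons, tamanhos) {
  usados <- c()
  partes <- list()
  for (d in names(tamanhos)) {
    n <- tamanhos[[d]]
    # Unused CNPJs that have a row for this date
    candidatos <- unique(bons$CNPJ[as.character(bons$data) == d &
                                   !(bons$CNPJ %in% usados)])
    if (length(candidatos) < n) {
      stop("Not enough unused CNPJs for date ", d)
    }
    # Sample positions, which avoids sample()'s special case for a single number
    escolhidos <- candidatos[sample.int(length(candidatos), n)]
    usados <- c(usados, escolhidos)
    partes[[d]] <- data.frame(CNPJ = escolhidos, data = rep(d, n),
                              stringsAsFactors = FALSE)
  }
  resultado <- do.call(rbind, partes)
  rownames(resultado) <- NULL
  resultado
}

# Example data copied from the question
bons <- data.frame(
  CNPJ = c(333333, 333333, 111111, 111111, 111111, 111111,
           22222, 22222, 22222, 22222, 99999, 99999, 99999, 99999),
  data = c(201601, 201612, 201612, 201610, 201607, 201611,
           201605, 201606, 201610, 201509, 201605, 201612, 201611, 201601)
)

# Toy sizes (assumed for the example only)
tamanhos <- c("201509" = 1, "201601" = 1, "201607" = 1)

set.seed(1)
amostra <- amostra_sem_repeticao(bons, tamanhos)
print(amostra)
```

Because this is a greedy pass, it can stop with an error even when a valid sample exists, since an early date may use up CNPJs that a later date needed. Rerunning with a different seed or visiting the dates with the fewest candidates first may help, but neither is guaranteed to succeed.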
Such a sample may not exist. For example, if the 4 CNPJs of 201510 are the same as those of 201512, there is no way to choose 3 from one and 3 from the other without having some in common.
– Ailton Andrade de Oliveira
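The scenario in the comment above can be checked with a quick count. This is a minimal sketch on made-up data; `exemplo` and `pedidos` are hypothetical names. It tests only a necessary condition, namely that the dates together have at least as many distinct CNPJs as the total requested. Passing it does not guarantee that a valid sample exists.

```r
# Toy data mirroring the comment: the same 4 CNPJs appear in both dates
exemplo <- data.frame(
  CNPJ = rep(c(1, 2, 3, 4), times = 2),
  data = rep(c(201510, 201512), each = 4)
)

# Requested sample sizes per date (hypothetical)
pedidos <- c("201510" = 3, "201512" = 3)

# Distinct CNPJs available across the requested dates
disponiveis <- length(unique(
  exemplo$CNPJ[as.character(exemplo$data) %in% names(pedidos)]
))

# Necessary condition: enough distinct CNPJs to cover every request
viavel <- disponiveis >= sum(pedidos)
print(c(disponiveis = disponiveis, pedidos = sum(pedidos)))
print(viavel)
```

Here only 4 distinct CNPJs are available for 6 requested draws, so no sample without repetition can exist.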
Your question is unclear. You need a sample of size n with n CNPJs for each of the m dates, fine, but what is the relationship between the two datasets?
– Thiago Fernandes
I edited the question. The database is large, so drawing such a sample is possible.
– T. Veiga
@T.Veiga, the relationship between base 1 and base 2 is still not clear. Take your example: for date = 201509 you need a sample of 3, i.e. 3 CNPJs, and this sample would be taken from base 1 ("bons"), correct? Would each CNPJ be chosen at random or tied to the date column? Would the output be merged or kept separate? Could the output for this example be "111111", "22222", "333333" or 11111122222333333?
– Thiago Fernandes