Create new database from random values with loop or other method

I have three dataframes with different numbers of rows, and I would like to create a new dataframe with 100 rows of random values drawn from them according to three criteria:

  • A - Columns a and b hold 100 random values taken from dataframe 1

  • B - The first 50 rows of columns c and d hold 50 random pairs of c1 and d1, where each pair comes from the same row of dataframe 2

  • C - The next 50 rows (51-100) of columns c and d hold 50 random pairs of c2 and d2, where each pair comes from the same row of dataframe 3

I tried with a loop, but it doesn't work. How can I fix it, or do this in a better way?

Here are the data and the script, and the expected result:

a <- c(4,6,7,3,2,5,6,9,6,5,8,6,7,8,9,7,6)
b <- c(40,60,70,30,20,NA,60,90,60,50,75,34,42,32,NA,45,29)

c1 <- c(1,2,3,4,5,6,7,8,9,10)
d1 <- c(10,9,8,7,6,5,4,3,2,1)

c2 <- c(11,12,13,14,15,16,17,18,19,20)
d2 <- c(20,19,18,17,16,15,14,13,12,11)

df1 <- data.frame(a,b)
df2 <- data.frame(c1,d1)
df3 <- data.frame(c2,d2)

#newdf (with 100 rows)

n <- 100
newdf <- data.frame(n=rep(1:n))
newdf$a <- NA 
newdf$b <- NA 
newdf$c <- NA
newdf$d<- NA

for (i in 1:50){
  newdf$a[i] <- sample(df1$a, 1, replace=T) # random value
  newdf$b[i] <- sample(df1$b, 1, replace=T) # random value 
  newdf$c[i] <- sample[df2$c1,1, replace=T] # one criterion
  newdf$d[i] <- sample[df2$d1,1, replace=T] # one criterion
}

for (i in 51:100){
  newdf$a[i] <- sample(df1$a, 1, replace=T) # random value
  newdf$b[i] <- sample(df1$b, 1, replace=T) # random value 
  newdf$c[i] <- sample[df3$c2,1, replace=T] # two criterion
  newdf$d[i] <- sample[df3$d2,1, replace=T] #two criterion
}

#Result 

a      b     c    d 
7     60     1    10 # row 1
6     50     3    8
2     90     5    6  # row 50
.
.
.
2     90     11    20  # row 51
.
.
.

1 answer

I believe the best way to solve this problem is without a loop. I solved it by drawing the row indices at random, all at once, and saving them in vectors called index_a, index_b, index_cd_50 and index_cd_100. These vectors store the 100 row positions drawn from df1 for columns a and b (drawn separately for each column), the 50 row positions drawn from df2, and the 50 row positions drawn from df3.

The rows drawn from df2 go before position 50 and the rows drawn from df3 go after it when I assemble newdf. Using one index vector for both columns of a dataframe is what keeps each pair of values on the same row. Try running the code line by line to see what each step does.

a <- c(4,6,7,3,2,5,6,9,6,5,8,6,7,8,9,7,6)
b <- c(40,60,70,30,20,NA,60,90,60,50,75,34,42,32,NA,45,29)

c1 <- c(1,2,3,4,5,6,7,8,9,10)
d1 <- c(10,9,8,7,6,5,4,3,2,1)

c2 <- c(11,12,13,14,15,16,17,18,19,20)
d2 <- c(20,19,18,17,16,15,14,13,12,11)

df1 <- data.frame(a,b)
df2 <- data.frame(c1,d1)
df3 <- data.frame(c2,d2)

index_a <- sample(1:nrow(df1), 100, replace=TRUE)
index_b <- sample(1:nrow(df1), 100, replace=TRUE)

index_cd_50  <- sample(1:nrow(df2), 50, replace=TRUE)

index_cd_100 <- sample(1:nrow(df3), 50, replace=TRUE)

# Rows 1-50 take paired (c1, d1) values from df2; rows 51-100 take paired (c2, d2) values from df3
newdf <- data.frame(a=df1$a[index_a],
                    b=df1$b[index_b],
                    c=c(df2$c1[index_cd_50], df3$c2[index_cd_100]),
                    d=c(df2$d1[index_cd_50], df3$d2[index_cd_100]))
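
One way to check that the pairing criterion holds is sketched below. It relies on a property of the example data only: c1 + d1 is always 11 in df2 and c2 + d2 is always 31 in df3, so the row sums show whether each c and d pair came from a single row. The names idx2, idx3 and paired, and the fixed seed, are my own choices for this illustration and are not part of the question.

```r
set.seed(42)  # assumption: a fixed seed, used only so the run is repeatable

# Same values as df2 and df3 in the question
df2 <- data.frame(c1 = 1:10,  d1 = 10:1)
df3 <- data.frame(c2 = 11:20, d2 = 20:11)

# Draw row indices once per source dataframe, then reuse them for both columns
idx2 <- sample(nrow(df2), 50, replace = TRUE)
idx3 <- sample(nrow(df3), 50, replace = TRUE)

paired <- data.frame(c = c(df2$c1[idx2], df3$c2[idx3]),
                     d = c(df2$d1[idx2], df3$d2[idx3]))

# In the example data c1 + d1 == 11 and c2 + d2 == 31 on every row,
# so any other sum would mean c and d were taken from different rows
stopifnot(all(paired$c[1:50]   + paired$d[1:50]   == 11))
stopifnot(all(paired$c[51:100] + paired$d[51:100] == 31))
```

If c and d were sampled independently, as in the loop from the question, these checks would almost certainly fail, because the two values would rarely land on the same source row.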
