Question about loop operations

I have this df:

df_1 <- data.frame(
  x = replicate(
    n = 6, expr = runif(n = 30, min = 20, max = 100), simplify = TRUE
  ), 
  y = as.factor(sample(x = 1:3, size = 30, replace = TRUE))
)

I would like to know why the first two loops below work and the third and fourth do not. I chose the function pairwise.t.test arbitrarily, just to illustrate.

To me, they should all be equivalent:

1)

vars <- names(df_1)[c(1:6)]

for (i in vars) {
  print(
    pairwise.t.test(x = df_1[, i], df_1$y, p.adj = 'bonferroni')
  )
}

Works.

2)

for (i in names(df_1)[c(1:6)]) {
  print(
    pairwise.t.test(x = df_1[, i], df_1$y, p.adj = 'bonferroni')
  )
}

Also works.

The following loop, which to me is equivalent to the previous ones, does not run:

3)

for (i in df_1) {
  print(
    pairwise.t.test(x = names(i)[c(1:6)], i$y, p.adj = 'bonferroni')
  )
}

4)

Finally, we know that:

df_1[1]

amounts to

df_1[, 1]

and

df_1[,1]

All of them return the first column of df_1. But if I remove the comma from df_1[, i], or remove the space between the comma and i ([,i]), in loops 1) and 2), the loop fails or "works wrong", respectively:

comma-free

for (i in vars) {
  print(
    pairwise.t.test(x = df_1[i], df_1$y, p.adj = 'bonferroni')
  )
}

Error in tapply(x, g, mean, na.rm = TRUE) : arguments must have same length

without the space

for (i in vars) {
  print(
    pairwise.t.test(x = df_1[,i], df_1$y, p.adj = 'bonferroni')
  )
}

# Pairwise comparisons using t tests with pooled SD 

# data:  df_1[, i] and df_1$y 

#   1 2
# 2 1 -
# 3 1 1

# P value adjustment method: bonferroni
  • Why do 3) and 4) not work?

1 answer

3


I will not go into loops 1 and 2: they work, and they are the same thing, the only difference being that one uses the variable vars and the other computes the names on the fly (in the loop header).

I will compare loops 3 and 4 with loop 1. I will cut the data from six columns to two to keep the answer short. I will also replace the "complex calculation" in the loop body with print(i) to see what is happening inside the loop.

Data used

df_1 <- data.frame(
  x = replicate(
    n = 2, expr = runif(n = 30, min = 20, max = 100), simplify = TRUE
  ), 
  y = as.factor(sample(x = 1:3, size = 30, replace = TRUE))
)
vars <- names(df_1)[c(1:2)]

Loop 1

for (i in vars) print(i)
#> [1] "x.1"
#> [1] "x.2"

In loop 1, the variable name is passed to the function `[`(), which accepts character strings and extracts the column with that name.
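As a minimal sketch of that point, the snippet below checks that indexing by name returns the same vector as indexing by position. The seed and the helper names by_name, by_pos and by_dbl are assumptions added for illustration; they are not part of the original question.

```r
# Illustrative data with the same shape as the answer's df_1 (seed is an assumption)
set.seed(1)
df_1 <- data.frame(
  x = replicate(n = 2, expr = runif(n = 30, min = 20, max = 100)),
  y = as.factor(sample(x = 1:3, size = 30, replace = TRUE))
)

i <- "x.1"                # what the loop variable holds in loop 1
by_name <- df_1[, i]      # `[` with a character index selects the column by name
by_pos  <- df_1[, 1]      # the same column, selected by position
by_dbl  <- df_1[[i]]      # `[[` also extracts the column as a plain vector

identical(by_name, by_pos)
identical(by_name, by_dbl)
```

Both comparisons should come out TRUE, which is why pairwise.t.test receives a numeric vector of length 30 in loops 1 and 2.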

Loop 3

In loop 3, by contrast, i holds the data of each column of the data.frame, not its name.

for (i in df_1) print(i)
#> [1] 30.26077 82.59508 71.04249 99.67011 36.02907 20.69992 31.05353 60.14356 40.53777 32.07807 23.52082 60.28327
#> [13] 49.96783 96.53946 94.50641 72.01676 94.42794 20.56521 20.45774 72.51956 36.98077 33.22457 45.25833 59.28694
#> [25] 98.41030 36.39350 69.02367 51.82203 68.45499 96.95839
#> [1] 48.96433 89.56191 63.90551 73.54613 99.35293 74.60017 25.81102 21.54059 98.78785 78.26632 45.56900 75.22186
#> [13] 44.45236 80.81733 87.45434 23.85018 62.25944 26.33234 63.73642 64.93282 79.85623 29.66782 33.67150 67.01610
#> [25] 74.98264 38.05653 29.91142 63.60954 26.37593 24.21256
#> [1] 1 2 3 2 3 3 2 3 3 3 3 2 2 2 3 1 1 1 1 1 3 1 3 2 3 2 2 3 1 2
Levels: 1 2 3

Using i that way is semantically wrong, even though it is syntactically valid, because i no longer has any names. Let's see:

for (i in df_1) print(names(i))
#> NULL
#> NULL
#> NULL

That is, the code in loop 3 passes NULL as x to pairwise.t.test (and i$y is not valid either, because i is an atomic vector rather than a data.frame), so the call fails with an error and the loop does not run.
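If the goal is to iterate over the columns themselves rather than their names, one possible rewrite is sketched below. It passes each column directly as x and takes the grouping factor from df_1$y. The seed and the names col and results are assumptions for illustration only.

```r
# Illustrative data with the same shape as the answer's df_1 (seed is an assumption)
set.seed(1)
df_1 <- data.frame(
  x = replicate(n = 2, expr = runif(n = 30, min = 20, max = 100)),
  y = as.factor(sample(x = 1:3, size = 30, replace = TRUE))
)
vars <- names(df_1)[1:2]

# Iterate over the column values: each `col` is a plain numeric vector,
# so it is passed as x directly and the groups come from df_1$y.
for (col in df_1[vars]) {
  print(pairwise.t.test(x = col, g = df_1$y, p.adjust.method = "bonferroni"))
}

# Same idea, but keeping the results in a list named after the columns
results <- lapply(df_1[vars], function(col) {
  pairwise.t.test(x = col, g = df_1$y, p.adjust.method = "bonferroni")
})
names(results)
```

Restricting the loop to df_1[vars] also keeps the factor column y from being tested against itself.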

Loop 4

Finally, loop 4 comes down to the difference between subsetting a data.frame with [ with or without the comma (the space makes no difference to the R interpreter). Let's see what is printed in each case:

for (i in vars) print(head(df_1[i]))
#>        x.1
#> 1 30.26077
#> 2 82.59508
#> 3 71.04249
#> 4 99.67011
#> 5 36.02907
#> 6 20.69992
#>        x.2
#> 1 48.96433
#> 2 89.56191
#> 3 63.90551
#> 4 73.54613
#> 5 99.35293
#> 6 74.60017
for (i in vars) print(head(df_1[, i]))
#> [1] 30.26077 82.59508 71.04249 99.67011 36.02907 20.69992
#> [1] 48.96433 89.56191 63.90551 73.54613 99.35293 74.60017

While df_1[i] keeps its output in the form of a data.frame, df_1[, i] returns a plain vector that is no longer a data.frame.

This explains the tapply error message: the length of a data.frame is its number of columns, not its number of rows (which is what you get when a vector is passed). A length of 1 (the single column of df_1[i]) is compared with n (the number of rows in df_1$y), and tapply throws the error in question.
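The length mismatch can be checked directly. This is a small sketch under the same assumptions as the data above (the seed is added only for reproducibility); it also shows drop = FALSE and `[[` as the two ways of choosing explicitly which form you get.

```r
# Illustrative data with the same shape as the answer's df_1 (seed is an assumption)
set.seed(1)
df_1 <- data.frame(
  x = replicate(n = 2, expr = runif(n = 30, min = 20, max = 100)),
  y = as.factor(sample(x = 1:3, size = 30, replace = TRUE))
)
i <- "x.1"

length(df_1[i])      # 1: a one-column data.frame, length = number of columns
length(df_1[, i])    # 30: a numeric vector, length = number of rows
length(df_1$y)       # 30: the grouping factor

# drop = FALSE keeps the data.frame form even with the comma
class(df_1[, i, drop = FALSE])   # "data.frame"

# `[[` always extracts the column itself, like df_1[, i]
identical(df_1[[i]], df_1[, i])
```

Only the vector forms have the same length as df_1$y, which is the condition tapply checks.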

  • Tomás, thanks. Just one more question: what would this loop look like for a list? Example list: lista <- split(df_1, df_1$y)

  • I'm not sure I understood correctly. Would it be worth asking a new question that explains it in more detail?
