Calculate average distance travelled with single tapply

2

I am working with the hflights data set from the R package hflights, which records a number of flights in the US. I need to calculate the average distance travelled (Distance) for each day of the week (variable DayOfWeek), separately for flights with a departure delay of more than 60 minutes and for flights with a delay of less than 60 minutes (variable DepDelay). It has to be done with a single tapply.

I tried something like this, but it’s wrong:

y=(c(which(sapply(dados,is.numeric))))
y
apply(as.matrix(y),1,function(x){tapply(dados[,x],list(dados$DayofWeek,dados$DepDelay>60),mean)})

3 answers

3

In my view, the most elegant way to do this is with the packages dplyr and tidyr. Code written with their functions is much easier to read. It's worth learning!

library(dplyr)
library(tidyr)

hflights %>% 
  filter(!is.na(DepDelay)) %>% # drop flights with no recorded departure delay
  mutate(DepDelay2 = ifelse(DepDelay>60, ">60", "<=60")) %>% # flag delays greater than 60
  group_by(DayOfWeek, DepDelay2) %>% # compute by group
  summarise(media = mean(Distance)) %>% # aggregate with the mean
  spread(DepDelay2, media) # put each delay group in its own column

# Source: local data frame [7 x 3]
# 
#   DayOfWeek     <=60      >60
# 1         1 783.1453 796.0078
# 2         2 778.1595 796.8847
# 3         3 779.2828 816.0626
# 4         4 782.5144 816.7166
# 5         5 785.8960 790.0505
# 6         6 823.0040 821.7922
# 7         7 797.2524 803.0468

1

Two other ways of doing it:

Using tapply, like Nishimura, but with cut. The nice thing about cut is that it extends easily to more than two conditions: just add more breaks and labels:

tapply(X = dados$Distance, 
       INDEX = list(dados$DayOfWeek,
                    cut(dados$DepDelay, 
                        breaks =c(-Inf, 60, Inf), 
                        labels = c("menor60", "maior60"))),
       FUN = mean)
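To illustrate how this extends beyond two groups, here is a minimal sketch with three delay bands. The toy data frame stands in for hflights (its values are made up), and the 120-minute break and the labels are assumptions chosen only for the example.

```r
# Toy data standing in for hflights; the values are invented for illustration
dados <- data.frame(
  DayOfWeek = c(1, 1, 1, 2, 2, 2),
  DepDelay  = c(10, 60, 200, 30, 90, NA),
  Distance  = c(100, 200, 300, 400, 500, 600)
)

# Three delay bands instead of two: one extra break and one extra label.
# By default the intervals are closed on the right, so 60 falls in "upto60".
bands <- cut(dados$DepDelay,
             breaks = c(-Inf, 60, 120, Inf),
             labels = c("upto60", "60to120", "over120"))

# Still a single tapply; rows with NA in DepDelay are dropped,
# and day/band combinations with no flights come back as NA
res <- tapply(X = dados$Distance,
              INDEX = list(dados$DayOfWeek, bands),
              FUN = mean)
res
```

The result is a matrix with one row per day of the week and one column per band.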

Using dplyr without cut:

dados %>% 
  group_by(DayOfWeek) %>% 
  summarise(menor60 = mean(Distance[DepDelay <= 60], na.rm = TRUE),
            maior60 = mean(Distance[DepDelay > 60], na.rm = TRUE))

The dplyr solution with cut would be equivalent to Daniel's answer, with the ifelse replaced by cut.
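As a rough sketch of what that combination might look like, assuming dplyr (1.0 or later, for the .groups argument) and tidyr are installed. The toy data frame and its values are invented so the example runs without hflights:

```r
library(dplyr)
library(tidyr)

# Toy data standing in for hflights; the values are invented for illustration
dados <- data.frame(
  DayOfWeek = c(1, 1, 1, 2, 2),
  DepDelay  = c(10, 60, 200, 30, 90),
  Distance  = c(100, 200, 300, 400, 500)
)

res <- dados %>%
  filter(!is.na(DepDelay)) %>%                       # drop flights with no recorded delay
  mutate(DepDelay2 = cut(DepDelay,
                         breaks = c(-Inf, 60, Inf),
                         labels = c("menor60", "maior60"))) %>%
  group_by(DayOfWeek, DepDelay2) %>%
  summarise(media = mean(Distance), .groups = "drop") %>%
  spread(DepDelay2, media)                           # one column per delay group
res
```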

1

I don't know how to do this for both groups (DepDelay > 60 and DepDelay < 60) using a single tapply, but I would do it separately for each group:

tapply(X = hflights[which(hflights$DepDelay > 60),"Distance"], INDEX = hflights[which(hflights$DepDelay > 60),"DayOfWeek"], FUN = mean)
tapply(X = hflights[which(hflights$DepDelay < 60),"Distance"], INDEX = hflights[which(hflights$DepDelay < 60),"DayOfWeek"], FUN = mean)

Note that there are also 232 flights with DepDelay == 60, which neither of these two calls covers. Keep that in mind if you want to consider all (non-cancelled) flights in the data set.

EDITED
Here is a (not very elegant) way to do this with a single tapply:

tapply(X = hflights$Distance, INDEX = list(hflights$DayOfWeek, ifelse(test = hflights$DepDelay > 60, yes = "> 60", no = "<= 60")), FUN = mean)

The only problem is that this way you need to include the 232 cases with DepDelay == 60 in one of the two groups (in my code I put them in the second group, DepDelay <= 60).
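If you would rather have the boundary value land in the upper group, one option is cut with right = FALSE, which makes the intervals closed on the left. This is a minimal sketch on an invented three-row data frame, not on hflights itself; the labels are assumptions chosen for the example.

```r
# Toy data: three flights on the same day, with delays around the 60 boundary
dados <- data.frame(
  DayOfWeek = c(1, 1, 1),
  DepDelay  = c(59, 60, 61),
  Distance  = c(100, 200, 300)
)

# Default (right = TRUE): intervals are (-Inf, 60] and (60, Inf], so 60 goes to "<=60"
delay_le60 <- cut(dados$DepDelay, breaks = c(-Inf, 60, Inf),
                  labels = c("<=60", ">60"))

# right = FALSE: intervals are [-Inf, 60) and [60, Inf), so 60 goes to ">=60"
delay_ge60 <- cut(dados$DepDelay, breaks = c(-Inf, 60, Inf),
                  labels = c("<60", ">=60"), right = FALSE)

res_le60 <- tapply(dados$Distance, list(dados$DayOfWeek, delay_le60), mean)
res_ge60 <- tapply(dados$Distance, list(dados$DayOfWeek, delay_ge60), mean)
res_le60
res_ge60
```

Either way it is still a single tapply per table; only the grouping factor changes.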
