I need to estimate an ordered logistic regression on a database that is split into 417 dataframes totalling 33.7 GB. Merging all subsets into a single dataframe would make the operation slow, so I thought of doing the estimation in pieces, as follows (the example below reproduces the logic of my problem):
library(tidyverse)
library(MASS)
# Creating dataframe 1
party <- factor(rep(c("Rep","Dem"), c(407, 428)),
levels=c("Rep","Dem"))
rpi <- c(30, 46, 148, 84, 99) # cell counts
dpi <- c(80, 81, 171, 41, 55) # cell counts
ideology <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology, rpi),
rep(ideology, dpi)), levels = ideology)
data1 <- data.frame(party,pol.ideology)
# Creating dataframe 2
party <- factor(rep(c("Rep","Dem"), c(410, 430)),
levels=c("Rep","Dem"))
rpi2 <- c(26, 50, 140, 95, 99) # cell counts
dpi2 <- c(75, 86, 141, 61, 67) # cell counts
ideology2 <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology2, rpi2),
rep(ideology2, dpi2)), levels = ideology2)
data2 <- data.frame(party,pol.ideology)
nrow(data1)
nrow(data2)
## Joining the dataframes "manually"
dat <- bind_rows(data1,data2)
table(dat)
nrow(dat)
# fit proportional odds model
pom <- polr(pol.ideology ~ party, data=dat)
summary(pom)
Hence I have the following output:
Call:
polr(formula = pol.ideology ~ party, data = dat)
Coefficients:
Value Std. Error t value
partyDem -0.8911 0.09016 -9.884
Intercepts:
Value Std. Error t value
Very Liberal|Slightly Liberal -2.4621 0.0929 -26.4893
Slightly Liberal|Moderate -1.4215 0.0755 -18.8239
Moderate|Slightly Conservative 0.1641 0.0659 2.4905
Slightly Conservative|Very Conservative 1.0570 0.0728 14.5272
Residual Deviance: 5042.654
AIC: 5052.654
Since there are 417 files, I thought about creating a loop so I don’t have to join the dataframes manually:
## LOOP
data = ls(pattern="data")
for(i in 1:length(ls(pattern="data"))){
pom <- polr(pol.ideology ~ party, data=i)
}
summary(pom)
Using the loop I have the following output:
Re-fitting to get Hessian
Call:
polr(formula = pol.ideology ~ party, data = i)
Coefficients:
Value Std. Error t value
partyDem -0.8115 0.1262 -6.433
Intercepts:
Value Std. Error t value
Very Liberal|Slightly Liberal -2.4608 0.1315 -18.7135
Slightly Liberal|Moderate -1.3726 0.1049 -13.0858
Moderate|Slightly Conservative 0.0947 0.0923 1.0267
Slightly Conservative|Very Conservative 1.0455 0.1020 10.2527
Residual Deviance: 2559.949
AIC: 2569.949
I note that the coefficients and the t values differ between the first model (estimated by manually joining the dataframes) and the second model (estimated using the loop over the dataframes). They should not, because the data are the same. In summary, I want to estimate a single ordered logistic regression using both data.frames. My question is: the loop I built should consider all dataframes, but at the moment it only considers the last one. What am I doing wrong in building the loop? What would be the solution?
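A minimal sketch of one possible fix, assuming all pieces already sit in the workspace under names such as data1, data2 and so on. In the loop shown in the question, `i` is an integer index rather than a data frame, so `polr()` most likely falls back to the `party` and `pol.ideology` vectors left in the workspace (the ones created last), and `pom` is overwritten on every pass. The sketch instead collects the data frames by name with `mget()`, stacks them with base R `rbind()`, and fits the model once. The helper `make_piece()` and the names `piece_names`, `pieces`, `dat_all` and `pom_all` are hypothetical and only serve the illustration.

```r
library(MASS)

ideology <- c("Very Liberal", "Slightly Liberal", "Moderate",
              "Slightly Conservative", "Very Conservative")

# Hypothetical helper: rebuilds one piece from its cell counts
make_piece <- function(rep_counts, dem_counts) {
  data.frame(
    party = factor(rep(c("Rep", "Dem"), c(sum(rep_counts), sum(dem_counts))),
                   levels = c("Rep", "Dem")),
    pol.ideology = factor(c(rep(ideology, rep_counts),
                            rep(ideology, dem_counts)),
                          levels = ideology)
  )
}

# Same cell counts as in the question
data1 <- make_piece(c(30, 46, 148, 84, 99), c(80, 81, 171, 41, 55))
data2 <- make_piece(c(26, 50, 140, 95, 99), c(75, 86, 141, 61, 67))

# Anchored pattern: matches data1, data2, ... but not objects
# that merely contain "data" somewhere in their name
piece_names <- ls(pattern = "^data[0-9]+$")

# mget() returns the objects themselves, not just their names
pieces <- mget(piece_names)

# Stack all pieces into one data frame and fit the model once
dat_all <- do.call(rbind, pieces)
pom_all <- polr(pol.ideology ~ party, data = dat_all, Hess = TRUE)
summary(pom_all)
```

Because the model is fitted once on the stacked rows, the estimates should match those of the manually joined model. This still needs all rows in memory at the same time, which may or may not be feasible for the full 33.7 GB.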
I’m not sure I understood correctly. As far as I can tell, you are estimating the model once for each table, so the value of pom will only refer to the last table you passed in the for loop, not to all of them together.
– Jorge Mendes
Why load the package tidyverse? The package dplyr alone is enough. – Rui Barradas
Jorge Mendes: My idea is for the model to be estimated using all the tables. That’s why I used ls(pattern="data"), to pick up all objects with "data" in their names (in this case data1 and data2). How could I use all the tables? From what I’ve seen, only the last one is actually being used. – Alexsandro
Rui Barradas: True, dplyr is enough. I used tidyverse because I copied it from other code. – Alexsandro
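If stacking all 417 pieces is too heavy, one alternative that may be worth trying is sketched below. It assumes that every variable in the model is categorical, as in the example, and that all pieces share the same factor levels. Each piece is reduced to a small table of cell counts, the counts are added up piece by piece, and the model is fitted on the aggregated table through the `weights` argument of `polr()`. This is not the approach from the question; the names `make_piece()`, `pieces`, `total` and `pom_w` are hypothetical.

```r
library(MASS)

ideology <- c("Very Liberal", "Slightly Liberal", "Moderate",
              "Slightly Conservative", "Very Conservative")

# Hypothetical helper: rebuilds one piece from its cell counts
make_piece <- function(rep_counts, dem_counts) {
  data.frame(
    party = factor(rep(c("Rep", "Dem"), c(sum(rep_counts), sum(dem_counts))),
                   levels = c("Rep", "Dem")),
    pol.ideology = factor(c(rep(ideology, rep_counts),
                            rep(ideology, dem_counts)),
                          levels = ideology)
  )
}

# Stand-in for the 417 pieces; in practice each one could be read
# from disk inside the loop and discarded afterwards
pieces <- list(
  make_piece(c(30, 46, 148, 84, 99), c(80, 81, 171, 41, 55)),
  make_piece(c(26, 50, 140, 95, 99), c(75, 86, 141, 61, 67))
)

# Running total of cell counts (one row per party x ideology cell)
total <- NULL
for (d in pieces) {
  counts <- as.data.frame(table(d), responseName = "n")
  if (is.null(total)) {
    total <- counts
  } else {
    # Assumes identical levels, hence identical cell order, in every piece
    total$n <- total$n + counts$n
  }
}

# Fit once on the aggregated table, weighting each cell by its count
pom_w <- polr(pol.ideology ~ party, data = total, weights = n, Hess = TRUE)
summary(pom_w)
```

Only the 10-row count table is kept between iterations, so memory use does not grow with the number of pieces. With continuous predictors this aggregation would no longer apply.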