Loop with dataframes in ordered logistic regression


I need to estimate an ordered logistic regression on a database that is split across 417 dataframes totalling 33.7 GB. Merging all the subsets into a single dataframe would be slow, so I thought of doing the estimation in pieces, as follows (the example below reproduces the same logic as my real problem):

library(tidyverse)
library(MASS)

# Creating dataframe 1
party <- factor(rep(c("Rep","Dem"), c(407, 428)), 
                levels=c("Rep","Dem"))  
rpi <- c(30, 46, 148, 84, 99) # cell counts
dpi <- c(80, 81, 171, 41, 55) # cell counts
ideology <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology, rpi), 
                         rep(ideology, dpi)), levels = ideology)
data1 <- data.frame(party,pol.ideology)

# Creating dataframe 2
party <- factor(rep(c("Rep","Dem"), c(410, 430)), 
                levels=c("Rep","Dem"))  
rpi2 <- c(26, 50, 140, 95, 99) # cell counts
dpi2 <- c(75, 86, 141, 61, 67) # cell counts
ideology2 <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology2, rpi2), 
                         rep(ideology2, dpi2)), levels = ideology2)
data2 <- data.frame(party,pol.ideology)

nrow(data1)
nrow(data2)

## Joining the dataframes "manually"
dat <- bind_rows(data1,data2)

table(dat)
nrow(dat)

# fit proportional odds model

pom <- polr(pol.ideology ~ party, data=dat)
summary(pom)

This gives me the following output:

Call:
polr(formula = pol.ideology ~ party, data = dat)

Coefficients:
           Value Std. Error t value
partyDem -0.8911    0.09016  -9.884

Intercepts:
                                        Value    Std. Error t value 
Very Liberal|Slightly Liberal            -2.4621   0.0929   -26.4893
Slightly Liberal|Moderate                -1.4215   0.0755   -18.8239
Moderate|Slightly Conservative            0.1641   0.0659     2.4905
Slightly Conservative|Very Conservative   1.0570   0.0728    14.5272

Residual Deviance: 5042.654 
AIC: 5052.654

Since there are 417 files, I thought about creating a loop so I don't have to join the dataframes manually:

## LOOP
data = ls(pattern="data")
for(i in 1:length(ls(pattern="data"))){
  pom <- polr(pol.ideology ~ party, data=i)  
}
summary(pom)

Using the loop, I get the following output:

Re-fitting to get Hessian

Call:
polr(formula = pol.ideology ~ party, data = i)

Coefficients:
           Value Std. Error t value
partyDem -0.8115     0.1262  -6.433

Intercepts:
                                        Value    Std. Error t value 
Very Liberal|Slightly Liberal            -2.4608   0.1315   -18.7135
Slightly Liberal|Moderate                -1.3726   0.1049   -13.0858
Moderate|Slightly Conservative            0.0947   0.0923     1.0267
Slightly Conservative|Very Conservative   1.0455   0.1020    10.2527

Residual Deviance: 2559.949 
AIC: 2569.949 

I notice that the coefficient and t values differ between the first model (estimated by manually joining the dataframes) and the second model (estimated with the loop over the dataframes), which should not happen, since it is the same data. In short, I want to estimate a single ordered logistic regression using both dataframes. My question is: the loop I built should consider all the dataframes, but at the moment it only considers the last one. What am I doing wrong in building the loop? What would be the solution?

  • I’m not sure I’ve understood correctly, but from what I can tell you are estimating the model once for each table in the loop, so the value of pom will only refer to the last table passed through the for, not to all of them together.

  • Why load the whole tidyverse? The dplyr package alone would be enough.

  • Jorge Mendes: My idea is for the model to be estimated using all the tables. That’s why I used ls(pattern="data"), to pick up every object whose name starts with "data" (in this case data1 and data2). How could I use all the tables? From what I’ve seen, only the last dataframe (data2) is actually being picked up.

  • Rui Barradas: True, dplyr does the job. I used the tidyverse because I adapted this from another piece of code.

1 answer



There are two problems with your loop:

  1. It runs a regression for each dataframe separately, not for the data of the two together.
  2. On every iteration the object pom is overwritten, so at the end it only holds the result of the regression on data2.

If you want a regression on all the data, you need to put it together first. The best option is to apply a read* function to a vector of file names and combine the results in a single call:

# Saving your example data to files:
write.csv(data1, 'dados1.csv')
write.csv(data2, 'dados2.csv')

lista_de_arquivos <- list.files(pattern = ".csv$")

complete_data <- do.call(rbind, lapply(lista_de_arquivos, read.csv))

Or, if you’re using tidyverse:

complete_data <- lista_de_arquivos %>% map_df(~read.csv(.))
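
If the pieces are already loaded in memory as objects named data1, data2, and so on, as in your example, a similar idea works without going through files. A minimal sketch, assuming every piece follows that dataN naming pattern:

# Collect every object whose name matches data1, data2, ... into a list
# (mget() returns the objects themselves, not just their names) and stack them:
pieces <- mget(ls(pattern = "^data[0-9]+$"))
complete_data <- dplyr::bind_rows(pieces)

nrow(complete_data)  # should equal nrow(data1) + nrow(data2)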

With the data combined, run a single regression:

pom <- polr(pol.ideology ~ party, complete_data)
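
One caveat worth adding here (an observation about the CSV round trip, not something from the original example): after write.csv()/read.csv(), pol.ideology comes back as plain character text, or as a factor with alphabetically ordered levels, while polr() expects a factor whose levels run in the substantive order from "Very Liberal" to "Very Conservative". A short sketch of restoring that order before fitting:

# Re-impose the intended category order after reading from CSV;
# 'ideology' is the vector of level names defined when the data were created.
complete_data$pol.ideology <- factor(complete_data$pol.ideology, levels = ideology)
complete_data$party <- factor(complete_data$party, levels = c("Rep", "Dem"))

pom <- polr(pol.ideology ~ party, data = complete_data, Hess = TRUE)
summary(pom)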

The real problem in your case is the large volume of data. Unfortunately, it is not possible to run the regression in pieces and then combine the results into a single regression for all the data. You can speed up the reading by using faster functions (such as read_csv from the tidyverse or fread from data.table), loading only the columns you need and encoding text strings as integers, but that may not be enough. At that point you are in Big Data territory, which is outside the scope of an answer here.
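
As an illustration of those reading optimizations, a sketch only, assuming the 417 files are CSVs containing just the two columns the model needs (the path below is a placeholder):

library(data.table)

files <- list.files("path/to/csvs", pattern = "\\.csv$", full.names = TRUE)

# Read only the two relevant columns from each file and stack everything;
# rbindlist() is much faster than growing a data frame with repeated rbind().
complete_data <- rbindlist(lapply(files, fread,
                                  select = c("party", "pol.ideology")))

# Convert the text columns to factors with the desired level order before polr().
complete_data[, party := factor(party, levels = c("Rep", "Dem"))]
complete_data[, pol.ideology := factor(pol.ideology, levels = ideology)]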

  • Thank you for your contribution! I will try joining the dataframes and optimizing the reading. I have been experimenting with read.csv.ffdf() and noticed it handles the file size much better, but I will compare it with read_csv from the tidyverse and fread from data.table. If that is not enough, I will try storing the dataframes on a PostgreSQL server and connecting to it from R via ODBC.
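
For the database route mentioned in that comment, a minimal sketch using DBI with the odbc package; the DSN, table and column names below are placeholders, not details from this thread:

library(DBI)

# Connect to PostgreSQL through an ODBC data source (placeholder DSN name).
con <- dbConnect(odbc::odbc(), dsn = "my_postgres_dsn")

# Pull only the two columns the model needs from a hypothetical table.
complete_data <- dbGetQuery(con, "SELECT party, pol_ideology FROM ideology_survey")
dbDisconnect(con)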
