Parallel processing (multi-threading) in R with serialized writes to SQLite


Hi everyone. I am trying to write code that parses HTML files in parallel using R and then records the extracted data in an SQLite database in a serialized way.

For the parallel processing I am using the furrr package (similar to purrr, with a focus on functional programming, but with the advantage that it allows parallel, multi-process execution of the script). To connect to SQLite and send the queries, I am using RSQLite and DBI respectively.

In the code below, execution works perfectly if I use plan(sequential), establishing serialized execution (in my real case, this would handle the HTML files and record to SQLite one step at a time). But when I use plan(multiprocess), the script completes only about 25% of the inserts and returns the error: Error in result_bind(res@ptr, params) : database is locked. Apparently, in my view, this indicates that some process (worker) failed after finding SQLite locked by another process (worker) that was writing.

In the furrr documentation, the "plan" can be configured as follows: "You set a plan for how the code should run. The easiest is multiprocess. On Mac this picks plan(multicore) and on Windows this picks plan(multisession)." (link)

THE CHALLENGE: My idea is that the HTML files should be processed in parallel and the writes to SQLite performed serially, since SQLite does not allow simultaneous writes. I believe this would yield performance gains, especially when parsing more complex HTML files that take a long time to process. The sketch below illustrates the shape I am aiming for.
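To make the idea concrete, something like this (a sketch only; arquivos_html and parse_html below are stubs standing in for my real file list and parser):

library(furrr)

plan(multisession)

# Stub objects standing in for the real file list and parser:
arquivos_html <- c("pagina1.html", "pagina2.html")
parse_html <- function(arquivo) data.frame(coluna1 = arquivo,
                                           coluna2 = arquivo,
                                           coluna3 = arquivo,
                                           coluna4 = arquivo)

# Parse in parallel: each worker extracts data from one HTML file.
resultados <- furrr::future_map(arquivos_html, parse_html)

# Desired: record `resultados` in SQLite serially, without the workers
# colliding on the database lock (this is the part I cannot get right).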

Some possible paths to consider:

1) Is there some configuration of the parameters of RSQLite::SQLite() that would solve the problem? (A sketch of what I mean appears after this list.)

2) Is it necessary to use an error-handling function such as purrr::safely around DBI::dbExecute, so that it keeps retrying the query in SQLite until it succeeds? Or would purrr::safely have to wrap the SQLite connection inside drive_sqlite()?

3) Is there any function in the DBI package that could help solve this problem?

4) Some kind of SQLite cache that would store the queries received from R and run them serially? (link)
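To illustrate path 1, this is roughly the kind of configuration I have in mind (a sketch only; I have not confirmed it works, and the 5000 ms busy_timeout value is illustrative):

# Variant of drive_sqlite() where a blocked write waits up to 5 seconds
# for the lock instead of failing immediately.
drive_sqlite_timeout <- function(){
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
  DBI::dbExecute(con, "PRAGMA busy_timeout = 5000")
  con
}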

I have made several attempts and tried different approaches, but without success so far. These are some of the articles I consulted (link1, link2, link3, link4).

library(furrr)
library(RSQLite)
library(DBI)

# Helper that opens a new connection to the SQLite database file.
drive_sqlite <- function(){ DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db") }

con <- drive_sqlite()

DBI::dbExecute(con, "CREATE TABLE tabela_teste (coluna1  TEXT NOT NULL,
                                                coluna2  TEXT NOT NULL,
                                                coluna3  TEXT NOT NULL,
                                                coluna4  TEXT NOT NULL);")

DBI::dbDisconnect(con)

tabela_dados <- tibble::tibble(coluna1 = c(1:3000),
                               coluna2 = c(1:3000),
                               coluna3 = c(1:3000),
                               coluna4 = c(1:3000))

funcao <- function(coluna1, coluna2, coluna3, coluna4) {

  # Each worker opens its own connection, inserts one row and disconnects.
  # (Keeping the connection in a variable guarantees that the same
  # connection used for the insert is the one that gets closed.)
  con <- drive_sqlite()

  DBI::dbExecute(con, "INSERT INTO tabela_teste VALUES(:coluna1,
                                                       :coluna2,
                                                       :coluna3,
                                                       :coluna4);",
                 params = list(coluna1 = as.character(coluna1),
                               coluna2 = as.character(coluna2),
                               coluna3 = as.character(coluna3),
                               coluna4 = as.character(coluna4)))

  DBI::dbDisconnect(con)

}

#plan(sequential)

plan(multiprocess)

#plan(multisession)

furrr::future_pmap(tabela_dados, funcao, .progress = TRUE)

1 answer


My dear @Georgesantiago, your problem is unsolvable, unfortunately. This is due to the design of SQLite: it is not a database management system that allows multiple processes to write at the same time. Note that your error

Error in result_bind(res@ptr, params) : database is locked

says exactly that, namely: database is locked. Just to confirm, see this question from the SQLite FAQ:

(5) Can multiple applications or multiple instances of the same application access a single database file at the same time?

Multiple processes can have the same database open at the same time. Multiple processes can be doing a SELECT at the same time. But only one process can be making changes to the database at any moment in time.

In other words: only one process at a time can change the database.
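You can see this even within a single R session using two connections (a sketch; BEGIN IMMEDIATE asks for the write lock up front):

library(DBI)
library(RSQLite)

con1 <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
con2 <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")

DBI::dbExecute(con1, "BEGIN IMMEDIATE")    # con1 takes the write lock
# DBI::dbExecute(con2, "BEGIN IMMEDIATE")  # would fail: database is locked
DBI::dbExecute(con1, "COMMIT")

DBI::dbDisconnect(con1)
DBI::dbDisconnect(con2)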

WHAT IS THE SOLUTION THEN?

I have not tested your script, but I am sure that if you use a more robust database management system, one designed for multi-process production use, you will easily be able to do what you want without these errors. You could use MySQL, MariaDB, PostgreSQL, etc. My favorite is PostgreSQL, which I suggest you use.
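For example, with the RPostgres package the same kind of insert can run from several processes at once, because PostgreSQL serializes concurrent writes internally (a sketch; the connection parameters are placeholders for a real server):

library(DBI)
library(RPostgres)

# Connection parameters below are placeholders for a real PostgreSQL server.
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = "bd_teste",
                      host = "localhost",
                      user = "usuario",
                      password = "senha")

# RPostgres uses numbered placeholders ($1, $2, ...) instead of :nome.
DBI::dbExecute(con, "INSERT INTO tabela_teste VALUES($1, $2, $3, $4)",
               params = list("a", "b", "c", "d"))

DBI::dbDisconnect(con)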

  • Hello, Flavio. Thank you for answering. However, I came up with a reasonable solution: I added a while loop and used purrr::safely, and with that I was able to do what I wanted. When a parallel worker tries to write to SQLite while another worker is writing, the while loop together with purrr::safely keeps retrying until SQLite is released. It was a workaround so that errors would not interrupt the workers' execution. (A sketch of this arrangement appears after these comments.)

  • Nice, but note that you actually still have the same problem. Ultimately, what you did was dodge the collisions by making the script repeat the write attempt with safely. I would suggest you accept the answer, as this is a common point of confusion when using SQLite. Note that the problem is not your script, but the fact that you are trying to use a feature that does not exist in SQLite and that is widely available in other, similar systems.

  • Another point I wanted to highlight is that the safely patch greatly reduces write performance. Although the processing is parallel, the writing becomes such a bottleneck that the parallelism barely helps: the workers cannot finish, because each one has to wait for another to finish writing. And the solution is actually quite simple: just use something like Postgres, which does have this capability ;-)
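For reference, the while loop + purrr::safely arrangement described in the first comment could look roughly like this (a sketch reconstructed from the description, not the exact code that was used):

# Keeps retrying the insert until SQLite releases the write lock.
inserir_com_retry <- function(con, query, params) {
  executar <- purrr::safely(DBI::dbExecute)
  resultado <- executar(con, query, params = params)
  while (!is.null(resultado$error)) {
    Sys.sleep(0.1)  # brief pause before trying again
    resultado <- executar(con, query, params = params)
  }
  resultado$result
}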
