Hey there, guys. I am trying to write code that parses HTML files in parallel using the R language and then records the extracted data in an SQLite database in a serialized way.
For the parallel processing I am using the furrr package (similar to purrr, with a focus on functional programming, but with the advantage of allowing parallel (multi-thread) execution of the script). To connect to SQLite and send the queries, I am using the RSQLite and DBI packages, respectively.
In the code below, execution works perfectly if I set plan(sequential), establishing serialized execution (in my real case, this would handle the HTML files and record to SQLite serially). But when I set plan(multiprocess), the script completes only about 25% of the recordings and returns an error:

Error in result_bind(res@ptr, params) : database is locked

Apparently, in my view, this indicates that some process (thread) failed after finding SQLite locked by another running process (thread).
In the furrr documentation, the "plan" can be configured as follows:

# You set a "plan" for how the code should run. The easiest is 'multiprocess'
# On Mac this picks plan(multicore) and on Windows this picks plan(multisession)

(link)
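For example, the backend can also be chosen explicitly instead of relying on multiprocess to pick one. A one-line illustration (the worker count is an arbitrary value chosen for the example):

library(future)
plan(multisession, workers = 4)  # run up to 4 background R sessions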
THE CHALLENGE: My idea is to process the HTML files in parallel while serializing the recordings to SQLite, since SQLite does not allow simultaneous writes. I believe this will yield performance gains, especially when parsing more complex HTML data that takes a long time to complete.
Some possible paths to consider:

1) Is there some configuration of the parameters of RSQLite::SQLite() that would solve the problem? (See the sketch right after this list.)

2) Should some error-handling function such as purrr::safely be used around DBI::dbExecute, so that it keeps retrying the query in SQLite until it succeeds? Or should purrr::safely be used around the SQLite connection in the drive_sqlite() function?

3) Is there any function in the DBI package that helps solve this problem?

4) Some kind of SQLite cache that stores the queries received from R and runs them serialized? (link)
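Regarding path 1, this is the kind of configuration I have in mind, though I have not confirmed it solves the problem (PRAGMA busy_timeout makes SQLite wait for a lock instead of failing immediately; the 5000 ms value is arbitrary):

drive_sqlite <- function(){
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
  # Wait up to 5 seconds for a lock before raising "database is locked"
  DBI::dbGetQuery(con, "PRAGMA busy_timeout = 5000;")
  con
}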
I've made several attempts and gone down different paths, but without success so far. These are some of the articles I've consulted on the internet: (link1, Link2, link3, Link4).
library(furrr)
library(RSQLite)
library(DBI)

# Opens a connection to the SQLite database
drive_sqlite <- function(){
  DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
}

# Creates the test table
con <- drive_sqlite()
DBI::dbExecute(con, "CREATE TABLE tabela_teste (coluna1 TEXT NOT NULL,
                                                coluna2 TEXT NOT NULL,
                                                coluna3 TEXT NOT NULL,
                                                coluna4 TEXT NOT NULL);")
DBI::dbDisconnect(con)

# Dummy data standing in for the parsed HTML
tabela_dados <- tibble::tibble(coluna1 = c(1:3000),
                               coluna2 = c(1:3000),
                               coluna3 = c(1:3000),
                               coluna4 = c(1:3000))

# Inserts one row per call; each call opens and closes its own connection
funcao <- function(coluna1, coluna2, coluna3, coluna4) {
  con <- drive_sqlite()
  DBI::dbExecute(con, "INSERT INTO tabela_teste VALUES(:coluna1, :coluna2, :coluna3, :coluna4);",
                 params = list(coluna1 = as.character(coluna1),
                               coluna2 = as.character(coluna2),
                               coluna3 = as.character(coluna3),
                               coluna4 = as.character(coluna4)))
  DBI::dbDisconnect(con)
}

#plan(sequential)
plan(multiprocess)
#plan(multisession)

furrr::future_pmap(tabela_dados, funcao, .progress = TRUE)
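One pattern I have been considering that would match the challenge above is to keep only the heavy work inside future_pmap and leave all the writing to the main session. A sketch under that assumption (parse_linha is a hypothetical stand-in for the real HTML parser):

# Heavy work (the HTML parsing) runs in parallel and only RETURNS data
parse_linha <- function(coluna1, coluna2, coluna3, coluna4) {
  data.frame(coluna1 = as.character(coluna1),
             coluna2 = as.character(coluna2),
             coluna3 = as.character(coluna3),
             coluna4 = as.character(coluna4),
             stringsAsFactors = FALSE)
}

resultados <- furrr::future_pmap(tabela_dados, parse_linha, .progress = TRUE)

# Writing happens once, serially, from the main session
con <- drive_sqlite()
DBI::dbWriteTable(con, "tabela_teste", do.call(rbind, resultados), append = TRUE)
DBI::dbDisconnect(con)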
Hello, Flavio. Thank you for answering. However, I came up with a reasonable solution: I put a while loop together with purrr::safely, and with that I was able to do what I wanted. When a parallel node tries to write to SQLite while another node is writing, the while loop together with purrr::safely keeps retrying until SQLite is released. It was an arrangement I made so as not to interrupt the nodes' execution with errors.
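In simplified form, the arrangement is something along these lines (a sketch reconstructed from the description; insere_seguro is just an illustrative name):

insere_seguro <- purrr::safely(DBI::dbExecute)

funcao <- function(coluna1, coluna2, coluna3, coluna4) {
  con <- drive_sqlite()
  repeat {  # while-style loop: retry until the write succeeds
    res <- insere_seguro(con, "INSERT INTO tabela_teste VALUES(:coluna1, :coluna2, :coluna3, :coluna4);",
                         params = list(coluna1 = as.character(coluna1),
                                       coluna2 = as.character(coluna2),
                                       coluna3 = as.character(coluna3),
                                       coluna4 = as.character(coluna4)))
    if (is.null(res$error)) break  # no error: the insert went through
    Sys.sleep(0.1)                 # back off briefly while the database is locked
  }
  DBI::dbDisconnect(con)
}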
– George Santiago
Nice, but note that you still have the same underlying problem. Ultimately, what you did was avoid the "bumps" by making the script repeat the write attempt with safely. I would suggest you accept the answer, as this is a common point of confusion when using SQLite. Note that the problem is not your script, but the fact that you are trying to use a feature that does not exist in SQLite and that is widely available in other, similar programs.
– Flavio Barros
Another point I wanted to make is that the safely patch greatly reduces write performance. Although the processing is parallel, the writing becomes such a bottleneck that the parallelism barely helps: the processes cannot finish, because each one has to wait for the others to finish writing. And the solution is actually quite simple: just use something like Postgres, which has this feature ;-)
– Flavio Barros
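For illustration, switching the connection function to something like Postgres would look roughly like this (a sketch assuming a local server and the RPostgres package; dbname and credentials are placeholders):

drive_postgres <- function(){
  DBI::dbConnect(RPostgres::Postgres(),
                 dbname   = "bd_teste",
                 host     = "localhost",
                 user     = "usuario",   # placeholder credentials
                 password = "senha")
}

# Postgres supports concurrent writers, so the same parallel funcao()
# can insert from several workers without "database is locked" errors.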