How to iterate over a huge number of records with Scala SORM


I want to iterate over a lot of records from a table with SORM, but I want to do it in a memory-efficient way.

Currently I use this code:

Db.query[Items].whereEqual("title", someTitle).fetch.foreach { webCount =>
  // do something
}

The problem is that this code first loads all the records before proceeding to each item in the loop. Is there any way to stream records?


2 answers

3

The methods Querier.fetch and Querier.fetchIds return streams, but that by itself doesn't prevent you from running out of memory if you need to work with all the returned objects at the same time.

The Query object (built by the Querier) has limit and offset properties, which allow pagination in the traditional database sense:

val itemCount = Db.query[Items].whereEqual("title", someTitle).count()
// Paginated queries
Db.query[Items].whereEqual("title", someTitle).limit(10).offset(40).fetch // ...
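
For illustration, here is a minimal sketch of how the limit/offset approach could be used to walk through all matching records one page at a time. It reuses Db, Items and someTitle from the question; the page size and the stop-on-the-first-empty-page loop are assumptions of mine, not something SORM provides out of the box:

val pageSize = 100 // assumed chunk size, tune to taste

// Ask for page 0, 1, 2, ... until an empty page comes back.
// Only one page of entities is materialized at a time.
Iterator
  .from(0)
  .map { page =>
    Db.query[Items]
      .whereEqual("title", someTitle)
      .limit(pageSize)
      .offset(page * pageSize)
      .fetch
  }
  .takeWhile(_.nonEmpty)
  .foreach { pageItems =>
    pageItems.foreach { item =>
      // do something with each item
    }
  }

In practice you would also want a stable ordering on the query, otherwise rows may shift between pages while the table is being written to.
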
  • I don't need to work with all the objects at once. Is the stream's foreach 'smart', or do I have to use another type of iterator?

  • Daniel, you'd have to dive into the SORM source code to see how it builds this Stream (there are several ways to implement lazy behavior). If there is some kind of cursor returning entries from the database in blocks, and you are not triggering anything that loads everything at once, I believe you have no problem (even so, when dealing with a traditional database, sooner or later we need paging in the most traditional sense of the word).

2

According to a comment in the documentation, the fetching of the entities is done in two stages:

  1. The query, with all its filters and orderings, is executed in the database, but retrieving only the id (primary key) of each record.

  2. When you actually reach an item in the Stream returned by the query, the second phase occurs, in which the remaining fields are read.

So, since the API doesn't seem to provide any other way to iterate over the results, this is already the most efficient way available and is unlikely to blow up the memory.

However, the comment is from 2 years ago and I don’t really know if it still applies to the latest versions.


Original comment:

Fetching entities in SORM always goes in two phases: first, all your filters - no matter how intricate - orderings and so on get applied to a single multitable query which fetches just the ids of the matching entities; in the second phase SORM emits multiple queries to actually populate the resulting entities, depending on the complexity of their structure. Since all the selects of the second phase are by primary keys, they are very cheap. But this area will definitely become a battlefield for all kinds of optimizations in the future. Contribution is much appreciated.

There actually was a querier implementation which was doing everything in a single phase: both querying and object population were being done in a single query in version 0.1.0, but then it turned out that, due to specifics of how joining tables works, it could fetch a million rows for certain multicollection entities, literally. So, downshifting to a simpler strategy turned out to be inevitable.

The "Stream" thing is there intentionally. It delays the second-phase fetching queries and object population until you actually reach them in the returned Stream. Although this might be subject to changes in the future.
