String assembler (or How to improve the performance of massive replacements)

Asked

Viewed 110 times

8

TL;DR: I know there is a class called Stringbuilder, both in .NET how much in Java, that allows performing operations on a text without generating a new string for each method call. It would be very convenient to have something similar in Javascript, but I can not find anything similar.

My goal: let’s assume that I have some texts of considerable size on which I wish to perform certain substitution operations. By sizeable size, let’s imagine strings with millions of characters.

I know that Javascript may not seem to be an appropriate technology for this. But I wish to make the replacements in real time, which excludes techniques like leaving a job to operate on a server. I also wish to replace multiple snippets, based on input user’s.

The problem: Javascript replacements turn out to be expensive in memory. Someone corrects me if I’m wrong - but if I have a string that occupies a megabyte, when using the method replace of the object String, I’ll have an occupation of two megabytes: one from the original string, which will not cease to exist until the garbage collector claims it, and another from the new string. When executing a new replacement, there will be three megabytes, and so on. In the thousandth amendment, we are already occupying in the vicinity of a Gigabyte.

I am thinking of approximate numbers, and considering that all substitutions are global (I use regular expressions with the modifier g, i and..: /foo/g).

Doubt: there is something that plays the role of a StringBuilder in Javascript? If not, there is some way in which it could be implemented?

2 answers

6


I’ve been researching the subject and found this project StringBuilder. Which seems to be optimized for what you want.

Normally a code snippet similar to this would be used:

var myString = "";
for (var i = 0; i < 1000; i++) {
    myString += i.toString();
}

The problem with this is that a lot of memory is used. After the third iteration the following values will be in memory:

""
"1"
"12"
"123"

Example using StringBuilder:

var myStringBuilder = StringBuilder();
for (var i = 0; i < 1000; i++) {
    myStringBuilder.append(i.toString());
}

After the third iteration, only these values will be in memory:

"1"
"2"
"3"
  • 1

    Someone made a comment (infezlimente apagado) with a similar project in Code Project. Very interesting for concatenations - and I believe that, with some modifications, you can improve the performance for substitutions as well, I believe that you can facilitate the work of GC in some ways :) +1, and I would like to give another positive vote just because the project is on Github.

  • 1

    still, since then the language has not changed much with respect to strings.

  • 2

    +1 Just note that - although memory usage is actually lower - performance in time is usually significantly worse (Edit: in some environments more, others less, so I observed).

4

In Javascript, strings are immutable. This means that - whenever you try to do a string substitution - a new string will be created with the result. This makes it particularly problematic to efficiently implement something like the String.replace, in particular whether regular expressions are involved.

In theory, this can be improved by using arrays instead of strings. For example, to concatenate several strings, accumulate them in an array and at the end join them using the method join:

var arr = [];
for (var i = 0; i < 1000; i++) {
    arr.push(i.toString());
}
var myString = arr.join('');

This solution (which by the way is the same strategy used by StringBuilder as suggested in DBX8 response) Unfortunately it is not ideal, since Javascript does not have a "character" data type. Instead, when referencing an index in the string what we have back is just another string:

var str = "abc";
console.log(typeof str[1]); // string

For this reason, representing a string as a list of "characters" (or even code points, if we use integers as array elements) is hardly more efficient in memory [or time]. Only when the "pieces" are reasonably large is it expected to gain performance using this method (which may or may not be true in a regex replace, depending on the case).

What remains then is to use a lower-level structure, such as the Uint16Array. When creating an array of this type (ArrayBuffer; exists in various formats) only one copy is kept in memory. One can modify it at will that no additional copy is made, although one has to be careful not to exceed the space limit or things like that:

function stringParaBuffer(str) {
    var ret = new Uint16Array(str.length);
    for ( var i = 0 ; i < str.length ; i++ )
        ret[i] = str.charCodeAt(i);
    return ret;
}
function bufferParaString(arr) { return String.fromCharCode.apply(null, arr); }

var arr = stringParaBuffer("abc");  // Cria o buffer
arr[1] = "d".charCodeAt(0);         // Nenhuma cópia é feita
console.log(bufferParaString(arr)); // imprime: "adc"

arr[3] = "e".charCodeAt(0); // Falha silenciosamente, o array continua o mesmo

This solution, although "raw", can be used as a basis for building something more sophisticated (such as a StringBuilder which keeps record of the size, resizes as needed, etc.) which allows a regex replace without excessive use of memory. But so far, as far as I know, there is no library ready making use of it...

  • I loved the idea! I believe that running regular expressions in this structure should be very complex, but not impossible. I want to try something like this now :)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.