What makes Join() so superior compared to other concatenation techniques?

Asked

Viewed 5,134 times

11

2 answers

16


Problem

There is a problem that one of the creators of this site (SO) calls Shlemiel the Painter’s Algorithm.

A painter paints the stretch of a highway. He starts very well, with high productivity. But every day he produces less, until after a few iterations his work becomes unviable.

This is because it keeps the paint can in a fixed place, so it paints a portion of the strip and has to return to the starting point to wet the brush. Each day he is further away from the can and takes longer on the path than in painting.

Shlemiel the painter's algorithm

Imagine this pattern growing dozens, hundreds or thousands of times. Quickly makes it impossible.

This is what usually happens with concatenations of data collections, especially string. As it grows and does not fit into the existing space for the previous version, a new allocation needs to be made to support the full size of the new version. And it has to dislocate the previous one that turned to garbage. And it fragments the memory. All this is time cost. In some languages the situation is worse since a change that does not even exceed the current size limit already causes a relocation (for a good cause). This is the case of Python.

Solution

The correct way is to figure out the final size, or at least an approximation of it and allocate everything you need, you can put the texts in that allocated area. Obviously it needs to be done in a structure that accepts the text be changed, which is not the case of the type string that is immutable, that is, any change generates a new object.

The join() does exactly what I described. He discovers the total size needed - taking the sizes of all strings that will be concatenated - allocates all the space needed and throws the texts into this space that is not yet a string. In the end he turns it into string. Here the cost is equivalent to the total size of the text which is much shorter than walking all over again in each concatenation.

Note that for few concatenations, typically up to 4, concatenation may perform even better than join(). Of course, in such a small volume no matter which is faster.

Alternatives

Of course the join() is not the only way to do it. You can do it manually if you need something a little more complex than the join() does not answer. Maybe using a bytearray or a standard list that are mutable (help, but not great because it may need new allocations, although minimized, does not need in each change, depends on the skill of the programmer).

To python page also shows how to use %s to obtain similar result. Formatting occurs in a function that manipulates everything in another data structure and only at the end that is generated the string final.

Some people like to use the guy StringIO who takes care of it.

I answered this in more detail for other languages like Java. And also C#.

8

Article: Efficient String Concatenation

Method 1 (concatenation)

def method1():
  out_str = ''
  for num in xrange(loop_count):
    out_str += `num`
  return out_str

Method 4 (Join)

def method4():
  str_list = []
  for num in xrange(loop_count):
    str_list.append(`num`)
  return ''.join(str_list)

Method 4 (Join) is significantly faster than concatenation.

This is because strings are immutable, meaning they can’t even be changed. To "change" one, it is necessary to create a new representation (a concatenation of the two) and then destroy the old strings. Join is faster because Python is able to optimize this process.

The text Python: string concatenation VS list Join is also very interesting and will in the source code of the Cpython implementation discover the answer:

When using the Join method, Python allocates memory to the final string only once; but if you concatenate several strings successively, Python has to allocate new memory for each concatenation. Guess what’s faster? ;)

That is, this is identical in terms of performance:

final_str = 'abc' + 'def'
final_str = ''.join('abc', 'def')  # não há diferença de desempenho

If you concatenate more than two strings, Join will be faster:

final_str = 'abc' + 'def' + 'ghi'  # aqui é realizado duas operações sucessivas
final_str = ''.join('abc', 'def', 'ghi')  # aqui é realizado uma só

Browser other questions tagged

You are not signed in. Login or sign up in order to post.