Why does 2*i*i tend to be faster than 2*(i*i) when i is integer?

The two expressions, 2*i*i and 2*(i*i), are equivalent and must produce the same result; what changes is only the order in which the multiplications are performed. Yet they are apparently treated differently by the interpreter.

In the first expression, given the absence of parentheses and considering that both operations have the same precedence, the interpreter executes from left to right: first 2*i is computed, and the result is then multiplied by i.

In the second expression, the parentheses change the order of execution: the multiplication i*i takes precedence over the multiplication by 2, so the interpreter first computes i*i and then multiplies the result by 2.
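
The evaluation order described above can be made visible with a small tracing class (an illustrative sketch; `Traced` is a hypothetical helper, not part of the question):

```python
log = []

class Traced:
    """Wraps a value and logs every multiplication it takes part in
    (hypothetical helper, for illustration only)."""
    def __init__(self, value, label=None):
        self.value = value
        self.label = label if label is not None else str(value)

    def __mul__(self, other):
        if not isinstance(other, Traced):
            other = Traced(other)
        log.append(f"{self.label} * {other.label}")
        return Traced(self.value * other.value, f"({self.label}*{other.label})")

    def __rmul__(self, other):
        # Reached for `2 * traced`, after int.__mul__ returns NotImplemented.
        return Traced(other) * self

i = Traced(10, "i")

log.clear()
r1 = 2 * i * i
print(log)   # ['2 * i', '(2*i) * i'] -- leftmost multiplication happens first

log.clear()
r2 = 2 * (i * i)
print(log)   # ['i * i', '2 * (i*i)'] -- parenthesized multiplication happens first
```

Whatever the order, both results are of course 200.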

Mathematically, the order of the factors does not change the product, so the results should be identical; but using the standard-library module timeit it is possible to observe a difference between the execution times of these expressions:

>>> print('2*i*i:', timeit("for i in range(1000): 2*i*i"))
2*i*i: 276.91411599100684

>>> print('2*(i*i):', timeit("for i in range(1000): 2*(i*i)"))
2*(i*i): 279.21798045100877

The tests were run on Repl.it.

Note: it is important to stress that timeit will, by default, run the specified code snippet 1,000,000 times, and that the exact execution time may vary due to processor load and differences between machines.
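
For reference, a reproducible variant of the measurement (a sketch that assumes a smaller, explicit iteration count and a best-of-five repetition are enough to reduce noise):

```python
from timeit import repeat

# Explicit iteration count instead of timeit's default of 1,000,000;
# repeat() returns one total time per run, and min() is the least noisy estimate.
t1 = min(repeat("for i in range(1000): 2*i*i", number=1000, repeat=5))
t2 = min(repeat("for i in range(1000): 2*(i*i)", number=1000, repeat=5))

print(f"2*i*i:   {t1:.4f}s")
print(f"2*(i*i): {t2:.4f}s")
```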

Why is there this time difference? Does the presence of parentheses in the expression change the algorithm that is executed?

To stress: the analysis should focus on the official Python implementation, CPython, in version 3+. Comparisons with other versions or implementations are welcome as a way to enrich the answer.

  • I think it may be perception. Maybe you have some idea of what causes the variation in the interpreter, but in a quick test here 2*(i*i) was faster. This is the micro-optimization thing: I know they may eventually be necessary, but a 100 ms variation sometimes seems like a lot, while in a real program it often is not. What I mean is that sometimes it may in fact be the interpreter, and other times it may be the operating system being busy with other tasks. I know your question is purely technical, and I found it very good [...]

  • [...] but I also believe this may be just an impression of yours. Of course, you may understand something about the language's interpreter that confirms this small variation; I will follow the question as well.

  • @Guilhermenascimento The idea of the question is not about optimization, but about whether the algorithms are the same. The Python interpreter actually changes the algorithm used for the multiplication under certain conditions, and it is the difference between these algorithms that generates this time difference, even if it is almost imperceptible.

  • As I said before, here my test results were totally "random", which perhaps suggests that operating-system and hardware usage affect the comparison at some level. But I understand that you may already have some idea of what causes this specifically, so I will keep looking forward to it :)

  • Hi Anderson, it might be worth posting the exact version of Python you used. The difference is fairly small, and I suspect either random variation or a small optimization of the final instructions issued to the CPU. Maybe the multiplication by two is being replaced by something like i<<1 in the first case, or perhaps the interpreter issues slightly more complicated instructions because of the parentheses. This is the kind of situation where PyPy and Cython should produce significantly different results as well.

  • @Anthonyaccioly Yes, that is why I added the CPython tag. Any analysis should be done on the official implementation of the language; comparisons with others are welcome as a way to enrich the answer. As for the version, we can stick to 3+. I will add the tag.

  • Hi Anderson, what I meant by a specific version would be 3.X.X on platform Y. I say this because, although the Stack Overflow answer gives good reasons for the first form to be faster (different multiplication algorithms depending on the size of the number, the small-integer cache, etc.), the truth is that locally (Python 3.7.2rc1 on openSUSE 15, 64-bit Intel) I am seeing much more random results. It may be due to the natural shortcomings of microbenchmarks, it may be that something changed in the interpreter between versions, or something may change with the platform and CPU type.

  • @Anthonyaccioly That was "almost" exactly what I said: the influence of the operating system on different hardware, plus the use of other resources, can affect how the processes compete. When I mentioned micro-optimization, I did not mean that the intention was actually to optimize; it is that in many cases micro-optimizations deceive us, when in fact the reasons behind certain "micro-differences" are external factors ;)

  • @Anthonyaccioly (cc @Guilhermenascimento) I will see if I can rewrite the question, trying to drop those parameters and make it clear that the question is exactly about the interpreter switching algorithms under certain conditions. Thanks for the feedback.

  • The question is very interesting. I was curious to know what factors are affecting the results on my machine and "breaking" the expected difference. Digging in, the question pointed out by @Anderson links to this other one: Why is 2 * (i * i) faster than 2 * i * i in Java?. In that case I was able to confirm the difference mentioned, on Java 11. The "tricks" that each compiler/interpreter uses, and the effects of everything Guilherme mentioned, are fascinating.

1 answer


"You already know, but it doesn't hurt to repeat": if you think you need optimizations at this level in a chunk of Python code, you are writing that chunk in the wrong language (and the same goes for any other very high-level language, such as Ruby, PHP, and even Java).

That said, there are some things that can answer your question, and allow some speculation, even without arriving at a definitive answer.

First: when you want to understand differences of this kind in Python, it is worth looking at the bytecode generated for each snippet of code. By comparing the bytecode it is easier to speculate about what the differences may be. In Python, the dis function of the module of the same name lets you see the bytecode:

In [54]: def a(): 
    ...:     return 2 * i * i 
    ...:                                                                                                 

In [55]: def b(): 
    ...:     return 2 * (i * i) 
    ...:                                                                                                 

In [56]: import dis                                                                                      

In [57]: dis.dis(a)                                                                                      
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 BINARY_MULTIPLY
              6 LOAD_GLOBAL              0 (i)
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

In [58]: dis.dis(b)                                                                                      
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 LOAD_GLOBAL              0 (i)
              6 BINARY_MULTIPLY
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

As you can see, there is in fact a small difference: in the first case, the bytecode alternates between loading the values to be multiplied onto the Python virtual machine's stack and the multiplication calls themselves. In the second case, it pushes all the values onto the stack first and then calls the multiplication operator twice in a row. It may be that optimizations at the CPU level can speed up this repeated call to the function invoked for the BINARY_MULTIPLY opcode (for example, the second consecutive call may get more hits in the CPU's branch predictor).
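
This can be verified programmatically: the two expressions compile to the same multiset of opcodes, differing only in their order. A sketch using dis.get_instructions, which also accepts a source string (the exact opcode names vary between CPython versions; in 3.11+ BINARY_MULTIPLY became BINARY_OP):

```python
import dis

ops_a = [ins.opname for ins in dis.get_instructions("2*i*i")]
ops_b = [ins.opname for ins in dis.get_instructions("2*(i*i)")]

print(ops_a)
print(ops_b)

# Same instructions overall, different interleaving of loads and multiplies.
assert sorted(ops_a) == sorted(ops_b)
assert ops_a != ops_b
```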

Anyway, even if it is not exactly that, I would still bet my chips that what happens is some CPU-level optimization that is triggered in the second case.

This is exactly the kind of thing you will rarely worry about even when coding in C, let alone in Python. And in that case, before saying "wow, let me use inline assembler", it would be better to look for high-level solutions that use the computer's resources more appropriately, such as code that uses the GPU or the CPU's vector processing units; that would gain orders of magnitude more than micro-optimizing a single operation. (And in Python's case, the "first" remedy will always be to use NumPy.)

To settle the matter, if it turns out not to be anything in the Python code itself, one would also have to check the C code that is called for BINARY_MULTIPLY, which certainly has optimizations for when both operands are Python integers, but probably nothing beyond that (for example, no extra optimization for when one of the operands is 2; at some point the runtime has to call the code behind int.__mul__). I am talking here about optimizations that sit between the VM finding the opcode and the call to int.__mul__.
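
One of these integer-level details can be observed from Python itself: CPython keeps a cache of small integers (-5 to 256), so small products come back as the very same object, while large products are freshly allocated (an illustrative sketch of a CPython implementation detail; do not rely on `is` with numbers in real code):

```python
i = 10

# Both orderings dispatch through int.__mul__ and produce the same value:
assert 2 * i * i == 2 * (i * i) == 200

# 200 falls inside CPython's small-integer cache, so both expressions
# return the same object:
print((2 * i * i) is (2 * (i * i)))           # True on CPython

# A large product is allocated anew on each evaluation:
big = 10 ** 30
print((2 * big * big) is (2 * (big * big)))   # False on CPython
```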

By the way, the difference really is so insignificant that other changes flip the result; see what happens when I measure the times of the functions a and b above:

In [59]: i = 10                                                                                          

In [60]: %timeit a()                                                                                     
119 ns ± 1.71 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [61]: %timeit b()                                                                                     
123 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

That is, on this machine, the form 2 * i * i was the faster one!

  • no, in the end neither of the two is faster; it depends on the numbers you are using and on the CPU. The difference is so small that the ordering is essentially random and should NEVER, NEVER be used to justify "I'll do it this way because it's faster, I saw it on Stack Overflow". If you need faster operations, do them explicitly in native code (with an extension in C or Cython), or use NumPy.
