Why is the "in" method for checking if an element belongs to a collection so slow?

Asked

Viewed 124 times

6

I was having a lot of difficulty executing code efficiently and found that the problem was on a line that used the operator in. Then I created a new function that searches for a match in the list of elements in binary form. I did an analysis of the performance of my function and the operator in and I noticed, to my surprise, that my function was much faster. Here’s my code:

import bisect
import matplotlib.pyplot as plt
import time
import random
import numpy as np
import seaborn as sns

sns.set_style("whitegrid")

def belongs2(elements, value):
    index = bisect.bisect_left(elements, value)
    if index < len(elements) and elements[index] == value:
        return True
    else:
        return False
    
def belongs2time(elements, value):
    start_time=time.time()
    index = bisect.bisect_left(elements, value)
    if index < len(elements) and elements[index] == value:
        True
    else:
        False
    return time.time() - start_time

def in_time(elements, value):
    start_time=time.time()
    value in elements
    return time.time() - start_time


# ### Timing belongs


number=random.choice(range(1000))
lengths=np.arange(100,100000,100)
time_belongs=[]
for lenght_list in lengths:
    numbers=[random.choice(range(1000)) for i in range(lenght_list)]
    numbers.sort()
    time_belongs.append(belongs2time(numbers, number))


# ### Timing in

number=random.choice(range(1000))
lengths=np.arange(100,100000,100)
time_in=[]
for lenght_list in lengths:
    numbers=[random.choice(range(1000)) for i in range(lenght_list)]
    numbers.sort()
    time_in.append(in_time(numbers, number))

fig, ax= plt.subplots()

ax.plot(lengths, time_belongs, label='belongs')
ax.plot(lengths, time_in, label='in')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.legend(loc='upper center', bbox_to_anchor=(0.5,-0.15),
          ncol=2, fontsize=14)

plt.yscale('log')

plt.show()

inserir a descrição da imagem aqui

I’m not used to doing this kind of "performance analysis" (actually, it’s the first time I’ve done it), so I may have made a mistake here and you’re coming to the wrong conclusion. But if you’re right, why the operator in is so slow? I expected a more efficient implementation of a method so ubiquitous to python. Is there any other method built-in more efficient than the in for this type of operation?

1 answer

5


First of all, if you want to test run time, a quick and a little more reliable way is to use the module timeit.

That said, the search using the module bisect is faster because she makes a binary search, which is an algorithm known to have O(logN) complexity. That is, in a list with billions of elements, it needs just over 30 iterations at most. The only prerequisite for it to function properly is that the elements are orderly.

Already the operator in, when applied to lists, do a linear search (and therefore is complex O(N)): it starts from the beginning of the list and goes testing the elements one by one, until finding (or until reaching the end of the list, if the element does not exist). Although slower than binary search, it is the most guaranteed if the list is not ordered.

Anyway, it’s not that the in be slow and stop. It may be slower than other algorithms, such as binary search, but if using it is not applicable, it may still be a viable alternative.


Your test rewritten with timeit would look like this:

import bisect
import random

def belongs(elements, value):
    index = bisect.bisect_left(elements, value)
    return index < len(elements) and elements[index] == value

def in_time(elements, value):
    return value in elements

valores = range(10000)
number = random.choice(valores)
print(f'procurando por {number}')
numeros = list(valores)

from timeit import timeit

# executa 100 mil vezes cada teste
params = { 'number' : 100000, 'globals': globals() }
# imprime os tempos em segundos
print(timeit('belongs(numeros, number)', **params))
print(timeit('in_time(numeros, number)', **params))

I created a list of all the values of range, so I already guarantee that it is ordered (but I think whatever for testing purposes, we could use your list as well).

Turning, we can see that the in is only faster if the number to be searched for is right at the top of the list (if it is one of the first, the in finds in fewer iterations, since the binary search always part of the middle of the list).


Note: if using one set instead of a list, the search gets faster:

valores = range(10000)
set_numeros = set(valores)
... restante do código igual ao exemplo acima

# adicionar mais um caso
print(timeit('in_time(set_numeros, number)', **params))

The detail, of course, is that a set does not guarantee the order of the elements and does not allow repeated values. If this is not a problem, searching it is faster - as per the page already linked previously, the operation in in a set is O(1).

  • Very interesting. Thank you for the detailed answer. I knew the concept of binary search, but how do you know that the in is linear? I did the test just because I didn’t know how to in did the search. I had a way of knowing that it was ante-hand linear?

  • Sorry, now that I saw that there was this information on the page you linked

  • 1

    @Lucas Another alternative is see the source code :-)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.