How do iterators created from lists retrieve "random" values in Python?


I’m studying iterators and generators in Python, and I was intrigued by a question that I can’t find the answer to anywhere.

When we create a Python generator through a generator expression like (x**2 for x in range(50)), we have a mathematical expression, a rule that the program can use to calculate the next value when required. The same happens if we create the generator with a generator function, such as:

def quadrados(n, max):
    while n < max:
        yield n**2

Anyway, in both cases the computer has a recipe, a step-by-step procedure that it follows to generate the next item. I believe the same happens when creating an iterator with a class.

The question is: how can an iterator created from a list of random values, for example, "produce" the next values?
If we define an iterator as it = iter([95, 27, 2348, 32, 27]), how can it retrieve the items without them being saved in memory? If it simply accesses this list and gets the value, then that value is saved in memory, no?

  • I don’t know the internals of Python at all, but both iter and generators have to store the values (their state) in memory. In the case of iter, you passed a list (since it is delimited by [ ]), and it probably creates an object that keeps a reference to this list. I can’t imagine any other way for it to be able to iterate over the values, other than keeping them internally as part of the created object.

  • It’s the only way I can imagine, too. But in that case the values would be saved in memory, which seems a little contradictory to me, because if I understood the concept well, iterators do not store all the data in memory, and so are lighter and faster than lists, for example.

  • But when you do iter([1, 2, 3]), the list [1, 2, 3] has already been created, so there is nothing left to save there, because the memory has already been "spent"... There is nothing contradictory about it.

  • I poked around in the source code of CPython and posted an answer giving more details...

  • TL;DR: when you pass a list to iter, the list iterator keeps a reference to it. The list is then only destroyed when the iterator also is (when it goes out of scope, or with a del). There are no surprises there; the same rule as always applies: an object exists while there are references to it.

2 answers


First, some terminology and concepts.

According to the documentation, an iterator is any object that implements the methods __next__ and __iter__ (the latter must return the object itself, while the former returns the next item of the iterator, or raises StopIteration when there are no more elements).

A generator function, in turn, is a function that returns a generator iterator (which the documentation defines as "an object created by a generator function"). A generator function is created using yield, which is already explained in detail here and here.

And there is also the generator expression, which is an expression that returns a generator (like the one in the question: (x**2 for x in range(50))).

Every generator iterator (whether created by a generator function or a generator expression) is also an iterator. But not every iterator is a generator.
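To make the distinction concrete, a quick sketch (the variable names are mine):

# a generator is also an iterator: it implements __iter__ and __next__
g = (x ** 2 for x in range(2))
print(iter(g) is g)              # True: __iter__ returns the generator itself
print(next(g))                   # 0
print(next(g))                   # 1
# one more next(g) would raise StopIteration

# a list is iterable, but it is not itself an iterator
lst = [1, 2, 3]
print(hasattr(lst, '__next__'))  # False: you need iter(lst) first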


A generator is lazy in the sense that it only returns the next value when you call the method __next__ (either explicitly through the builtin next, or implicitly, when you traverse it with a for, for example). Internally it stores its state (basically, "where it stopped" and the value of each variable at that point; read the link already indicated for more details), so that on the next call to next it can continue execution from where it left off.
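A minimal illustration of that saved state (the helper contador below is just a hypothetical example):

def contador():
    n = 0
    while True:
        n += 1
        yield n  # execution pauses here; n is preserved until the next call

c = contador()
print(next(c))  # 1
print(next(c))  # 2: resumed right after the yield, with n's value kept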

In general a generator does not keep all the values in memory at once, generating them on demand. But nothing prevents you from doing this:

# iterates over the elements of the iterable one or more times
def ciclo(iteravel, repetir=1):
    # stores the values of the iterable in a list
    valores = list(iteravel)
    for _ in range(repetir):
        for n in valores:
            yield n

# prints the values of the generator expression twice
for n in ciclo((x ** 2 for x in range(3)), 2):
    print(n)

That is, the generator function ciclo creates a generator that iterates several times over the given iterable. The values are stored in a list because iteravel could itself be another generator (which can only be iterated once), so by storing the values in a list I guarantee that I can iterate several times without problems. In the example above, if I did the second for directly on iteravel, the values would only be returned once, since the generator expression creates a generator, and that can only be iterated once.
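You can see this single-pass behavior directly:

g = (x ** 2 for x in range(3))
print(list(g))  # [0, 1, 4]
print(list(g))  # []: the generator was already exhausted by the first list()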

In other words, I created a generator that keeps all the values in memory at once (just out of curiosity, that’s essentially what the function itertools.cycle does).
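For comparison, a short example with itertools.cycle (islice is used only to cut off the infinite repetition):

from itertools import cycle, islice

# cycle also stores the iterable's values internally so it can repeat them;
# it repeats forever, hence islice to take only the first 7 items
print(list(islice(cycle([1, 2, 3]), 7)))  # [1, 2, 3, 1, 2, 3, 1]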

Of course that’s a corner case; in general, generators do not store all the values in memory (unless I force this, as I did above), computing them only when necessary. An iterator that is not a generator, on the other hand, will not necessarily be lazy.

When you do iter([1, 2, 3]), it returns an iterator that is not a generator, as can be verified with the module inspect:

import inspect

# iter creates an iterator that is not a generator
x = iter([1, 30, 99, 42])
print(inspect.isgenerator(x)) # False

# a generator expression creates a generator
x = (x * 2 for x in range(10))
print(inspect.isgenerator(x)) # True

In addition, the list [1, 2, 3] has already been created, and internally the iterator keeps a reference to it. What the iterator does is advance to the next element when the method __next__ is called, but since the list has already been created, you do not get the "memory saving". An iterator only saves memory if its values are computed on demand, as in the generator example above. For example:

# iterator that generates the values on demand
class Squares(object):
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __iter__(self):
        return self

    # the next element is only computed when next is called
    def __next__(self):
        if self.start >= self.stop:
            raise StopIteration
        current = self.start * self.start
        self.start += 1
        return current

iterator = Squares(3, 10)
print(next(iterator)) # 9
print(next(iterator)) # 16

If I use inspect.isgenerator on the iterator above, the result will be False.
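For example, reusing the Squares class above (collections.abc.Iterator recognizes any object that implements the two methods):

import inspect
from collections.abc import Iterator

it = Squares(3, 10)
print(inspect.isgenerator(it))   # False: not created by a generator function
print(isinstance(it, Iterator))  # True: it implements __iter__ and __next__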

But if you create an iterator based on the values of a list, and this list has already been created, there is no saving at all. So your question, "how can it retrieve the items without them being saved in memory?", starts from a wrong premise: the values are in memory, and that is precisely how the iterator can retrieve them.
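A small sketch to make that visible (MyList is just a helper, because built-in lists do not support weak references; the timing assumes CPython, where objects are freed as soon as the last reference goes away):

import weakref

class MyList(list):  # built-in list does not support weak references
    pass

data = MyList([95, 27, 2348, 32, 27])
alive = weakref.ref(data)
it = iter(data)

del data                # removes our variable, but the iterator still holds the list
print(alive() is None)  # False: the list is still alive in memory
del it                  # drops the last reference
print(alive() is None)  # True: only now is the list destroyed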


Going a little deeper, I took a look at the source code of CPython (which is the reference implementation of the language, and most likely the one you are using). I consulted the code today (11/Jun/2021):

The builtin iter is defined here:

static PyObject *
builtin_iter(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
{
    PyObject *v;

    if (!_PyArg_CheckPositional("iter", nargs, 1, 2))
        return NULL;
    v = args[0];
    if (nargs == 1)
        return PyObject_GetIter(v);
    if (!PyCallable_Check(v)) {
        PyErr_SetString(PyExc_TypeError,
                        "iter(v, w): v must be callable");
        return NULL;
    }
    PyObject *sentinel = args[1];
    return PyCallIter_New(v, sentinel);
}

In this case, when we pass only one argument (for example, a list), it falls into if (nargs == 1) and calls PyObject_GetIter, which in turn is defined here:

PyObject *
PyObject_GetIter(PyObject *o)
{
    PyTypeObject *t = Py_TYPE(o);
    getiterfunc f;

    f = t->tp_iter;
    if (f == NULL) {
        if (PySequence_Check(o))
            return PySeqIter_New(o);
        return type_error("'%.200s' object is not iterable", o);
    }
    else {
        PyObject *res = (*f)(o);
        if (res != NULL && !PyIter_Check(res)) {
            PyErr_Format(PyExc_TypeError,
                         "iter() returned non-iterator "
                         "of type '%.100s'",
                         Py_TYPE(res)->tp_name);
            Py_DECREF(res);
            res = NULL;
        }
        return res;
    }
}
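Before continuing, a rough, simplified pure-Python approximation of this control flow (the names get_iter and SeqIter are mine; the real C code handles more edge cases):

class SeqIter:
    """Fallback for objects that only define __getitem__ (like PySeqIter_New)."""
    def __init__(self, seq):
        self._seq = seq
        self._index = 0
    def __iter__(self):
        return self
    def __next__(self):
        try:
            item = self._seq[self._index]
        except IndexError:
            raise StopIteration
        self._index += 1
        return item

def get_iter(obj):
    tp = type(obj)
    f = getattr(tp, '__iter__', None)   # roughly tp_iter
    if f is None:
        if hasattr(tp, '__getitem__'):  # roughly PySequence_Check
            return SeqIter(obj)
        raise TypeError(f"'{tp.__name__}' object is not iterable")
    res = f(obj)                        # calls the type's __iter__
    if not hasattr(type(res), '__next__'):
        raise TypeError(f"iter() returned non-iterator of type '{type(res).__name__}'")
    return res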

In this case, o is the list, whose Py_TYPE(o) is PyList_Type, which in turn is defined here:

PyTypeObject PyList_Type = {
    // a bunch of lines...
    list_iter,                                  /* tp_iter */
    // a bunch more lines...
Note that the field tp_iter holds the function list_iter. Since this function is not null, PyObject_GetIter falls into the else and calls list_iter, passing the list as argument. And this function (defined in the same file that defines PyList_Type) does the following:

static PyObject *
list_iter(PyObject *seq)
{
    listiterobject *it;

    if (!PyList_Check(seq)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    it = PyObject_GC_New(listiterobject, &PyListIter_Type);
    if (it == NULL)
        return NULL;
    it->it_index = 0;
    Py_INCREF(seq);
    it->it_seq = (PyListObject *)seq;
    _PyObject_GC_TRACK(it);
    return (PyObject *)it;
}

And it is on the line it->it_seq = (PyListObject *)seq; that we see the list (seq) being assigned to the field it_seq of the iterator. That is, the iterator keeps a reference to the list. The definition of listiterobject (also in the same file that defines list_iter):

typedef struct {
    PyObject_HEAD
    Py_ssize_t it_index;
    PyListObject *it_seq; /* Set to NULL when iterator is exhausted */
} listiterobject;

And the implementation of __next__ (also in the same file that defines list_iter):

static PyObject *
listiter_next(listiterobject *it)
{
    PyListObject *seq;
    PyObject *item;

    assert(it != NULL);
    seq = it->it_seq;
    if (seq == NULL)
        return NULL;
    assert(PyList_Check(seq));

    if (it->it_index < PyList_GET_SIZE(seq)) {
        item = PyList_GET_ITEM(seq, it->it_index);
        ++it->it_index;
        Py_INCREF(item);
        return item;
    }

    it->it_seq = NULL;
    Py_DECREF(seq);
    return NULL;
}

This shows the iterator going through the elements of the list one by one, keeping track of the current index.
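A rough pure-Python translation of that C logic (ListIter is just an illustrative name):

class ListIter:
    def __init__(self, seq):
        self._seq = seq    # like it->it_seq: keeps a reference to the list
        self._index = 0    # like it->it_index
    def __iter__(self):
        return self
    def __next__(self):
        if self._seq is None:            # already exhausted
            raise StopIteration
        if self._index < len(self._seq):
            item = self._seq[self._index]
            self._index += 1
            return item
        self._seq = None                 # like Py_DECREF(seq): drop the reference
        raise StopIteration

for n in ListIter([95, 27, 2348, 32, 27]):
    print(n)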


That is, the iterator keeps a reference to the list, and that is how it manages to return its elements. Since the list has already been created, there is no lazy (on-demand) generation, as there is with generators (or with the Squares iterator from the example above), because the values were already created when you created the list.


You seem to be getting a little confused when talking about generators and iterators.

You are right to say that generators generate their values dynamically; that is, if we have the generator function (I edited it to avoid an infinite loop):

def quadrados(n, max):
    while n < max:
        yield n**2
        n += 1

And you iterate over that generator function this way:

for quadrado in quadrados(2, 10):
    print(quadrado)

At each iteration of the for loop, Python enters the body of the function quadrados, runs up to the line with yield and returns the value to be used in your for loop.

The important thing to understand here is that, between the yield and the end of the block inside the for loop, the function quadrados is effectively paused: the generator saves memory space because, instead of having to store all the resulting values, it simply stores the current state of the function. The function quadrados sits there, "frozen", waiting for the moment when it can continue executing, which happens when the block inside the for loop ends and you move on to the next iteration.
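To make this pausing visible, here is a sketch with prints added by me around the yield:

def quadrados(n, max):
    while n < max:
        print(f"  computing {n}**2...")
        yield n ** 2   # "freezes" here until the next iteration of the loop
        print(f"  ...resumed after yielding {n}**2")
        n += 1

for quadrado in quadrados(2, 5):
    print(quadrado)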

On the other hand, a generator expression such as (x**2 for x in range(50)) is just a simpler way of writing a "pseudo generator function" on one line. If we iterate over this expression, Python handles the values in the same way: it generates a value, "freezes" the generator state, works with the value, "defrosts" the generator and generates the next value, and so on.

When you gave your example with it = iter([95, 27, 2348, 32, 27]), note that here we are not talking about generators, but rather about iterators, which are simply objects over which we can iterate with a for loop. In your example, the list [95, 27, 2348, 32, 27] is indeed created and kept in memory, because it is referenced by the iterator itself, even though you did not explicitly reference it with a variable such as my_list = [95, 27, 2348, 32, 27]. This example of yours does not involve generators at any point, and therefore the memory space is immediately occupied by the list at the moment it is declared.

Okay, but what if we now combine things, using a generator expression as in it = iter(x**2 for x in range(50))? In that case, both the original values of your range(50) and the values resulting from the expression x**2 for x in range(50) live inside lazy objects. So Python just needs to keep "freezing" and "defrosting" the generators to iterate over them, without having to allocate all the elements in memory.
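As a side note, calling iter on a generator simply returns the generator itself:

g = (x ** 2 for x in range(50))
it = iter(g)
print(it is g)  # True: a generator is already an iterator, so iter() returns it as-is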

  • Technically there is no such rule that generators cannot have an internal state that uses memory. That is just wishful thinking on the O.P.'s part, encouraged by the short and simple examples of generators that are so popular in the documentation.
