How does Python handle and represent an array internally?

Asked

Viewed 566 times

9

In Python any type of array is of class type list, see:

array = ['Gato', 'Jake', 'Finn']
print(type(array))

Exit:

<class 'list'>

That is, every array is an object of list. However, there is one question that I have always had in my mind, which would be the way that Python treats an array internally, and what type of data structure the class list use to store array values.

Therefore, I would like to have my doubts clarified.

Doubts

  1. How an array is represented internally in Python?
  2. What data structure is used internally to represent and enable manipulation of the array, list-chain, hash table or a simple array with dynamic allocation?

3 answers

11


First, array is different from list.

In many cases, usually more information, the nomenclature error will not be an aggravating and will not harm the communication, but since we will be analyzing the internal functioning of language I think it is interesting to stick to the correct terms. Given the questions, it became clear that the doubt is about the functioning of lists, not of arrays.

The guy list is native to Python and has been implemented in C under the PyListObject.

typedef struct {
    PyVarObject ob_base;
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

Where do we have:

  • ob_base, an object that represents structures of variable size, implements the field ob_size;
  • ob_item, a pointer to pointer of PyObject;
  • allocated, an object Py_ssize_t representing the allocated size;

As you can see - including commented in the code - the items that make up the list in Python will be managed from ob_item. Because it is a pointer, it has the dynamism of being relocated according to the need of list growth, in which each position will point to another pointer PyObject that represents the list item. It is worth remembering here that all the structures that Python implements inherit from PyObject, then there will be no restrictions of the type that can be stored in the list.

Whenever the size of a list needs to be changed, the function list_resize is invoked; it is responsible for implementing the reallocation algorithm of a list.

The source code for this function is:

/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize.  If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SIZE(self) = newsize;
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
     */
    new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

    /* check for integer overflow */
    if (new_allocated > PY_SIZE_MAX - newsize) {
        PyErr_NoMemory();
        return -1;
    } else {
        new_allocated += newsize;
    }

    if (newsize == 0)
        new_allocated = 0;
    items = self->ob_item;
    if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
        PyMem_RESIZE(items, PyObject *, new_allocated);
    else
        items = NULL;
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SIZE(self) = newsize;
    self->allocated = new_allocated;
    return 0;
}

Just as Maniero commented in his reply, basically Python allocates a new memory space already considering the new size of the list and considering spaces to spare and then copy the content to that memory space. Note that the space added to the size of the list will be proportional to the current size and this is done precisely to seek a linear amortization in the reallocation time. That is, it is easier to relocate a larger space once and insert multiple elements than to relocate each time you insert. Also note that all relocation is done on the object items and not directly on self->ob_item:

items = self->ob_item;

if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
    PyMem_RESIZE(items, PyObject *, new_allocated);
else
    items = NULL;

if (items == NULL) {
    PyErr_NoMemory();
    return -1;
}

self->ob_item = items;

PyMem_RESIZE is a macro defined by:

#define PyMem_RESIZE(p, type, n) \
  ( (p) = ((size_t)(n) > PY_SSIZE_T_MAX / sizeof(type)) ? NULL :        \
        (type *) PyMem_REALLOC((p), (n) * sizeof(type)) )

Which basically does the memory relocation on the specified pointer.

8

First of all understand that Python is a dynamic typing language and needs data structures to deal with this need. You do not access the data directly as in static languages. If you want to know more about this typing issue read How PHP arrays work internally?.

Just like in PHP, Python has a struct with a union (both languages are written in C) with the die and its type, and the dice can be a pointer.

The list has a structure with a header, which would be more or less normal if written in C even in something more sophisticated and an object with a sequence of data. These data will be the structures I mentioned in the previous paragraph. In C a array is just a sequence of data accessed by a pointer. So the Python list is no different than the array of the C and is as efficient as if we only look at the list itself. Inefficiency would only come to access data that has an internal control. It can be said that Python needs this header, but even in C to use the array conveniently and correctly in real code you need something like this.

This header basically has the size of the list, the pointer to the list object itself and an object that is the beginning of the list (I still need to understand it). You can switch from version to version.

The clear difference between a list and a array is that the second is usually defined as a fixed structure and cannot change size and the list can. Internally it only has two ways to change the size of a collection, or makes a linked list (or something like that) that allows you to add new elements simply and on demand, but access to the data is tragic or makes it like a array over capacity (allocates more than you need) and when you blow capacity you create another array and copies everything to this new and discards the old. This is normal in virtually all languages and is often called array dynamic. The name has nothing to do with the allocation that happens to be made dynamically with malloc() (even if indirectly).

Note that this holds true for Cpython which is the standard implementation, other implementations may do very differently, but as they should work in a compatible way and it can’t be that different. Source. For example Jython can use his own ArratList of the JVM that has the same semantics.

Already one dict is quite different because there is used a form of hashtable.

1

How an array is represented internally in Python?

Array is represented as a list in python. there is the library of array example:

from array import *

x = array('i', [10,20,30,40,50])
print(type(x))

but array does not support string only:

'b Signedchar
'B unsigned char u Py_unicode 'h' Signed short
'H#unsigned short
... See more in Array Tipos

there is an array library with more resources called Numpy, use example:

import numpy as np

a = np.array(['Gato', 'Jake', 'Finn'])
print(type(a))

What data structure is used internally to represent and enable manipulation of the array, list-chain, hash table or a simple array with dynamic allocation?

I can’t tell for sure by the lack of information in the python documentation.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.