Read and write binary files using STRUCT

I have an array valores = [16, -25, 34, 2, 199, 45, 67, 90] and need to save its values in a binary file using struct. Then I need to read the binary file back with struct.

To write, I'm using the following code:

def escreveBin(valores):
    with open("colecao.bin", 'wb') as arq:
        arq.write(struct.pack('=i', valores))

But I’m getting the following error:

struct.error: required argument is not an integer

Also, after writing it, I will need to sort the file without bringing its contents into main memory, since I have to take into account that the file can be very large.

Any idea how to do such a thing?

1 answer

1

The struct module works well, but it can be a bit tedious at times. In particular, it expects a format character for each value it will write and has no notion of arrays. That is, the format string has to account for every number in your list, for example with one "i" per value.

Fortunately, this can be built directly with an f-string:

...
valores = [16, -25, 34, 2, 199, 45, 67, 90] 
v = struct.pack(f"={'i' * len(valores)}", *valores) 
arq.write(v)

Note also the * before valores: this Python syntax passes each item of the sequence to the call as a separate argument. In other words, pack(fmt, *valores) behaves like pack(fmt, valores[0], valores[1], valores[2], ...).

Calling struct.pack directly inside the parentheses of arq.write, as you did, works too, of course, but it is worth keeping the code as clear as possible, both when writing it and when reading it later. Variables are "free".

That answers how to write this data, but you will run into another problem when it is time to read it back.
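Putting the pieces above together, a complete write function might look like the sketch below. The name escreve_bin, the caminho parameter and the temporary directory are assumptions made for illustration. It also uses a repeat count in the format string ("8i"), which struct accepts as a shorthand for "iiiiiiii".

```python
import os
import struct
import tempfile


def escreve_bin(valores, caminho):
    """Write a list of ints as consecutive 4-byte signed integers."""
    # "=" selects native byte order with standard sizes (4 bytes per "i").
    # A repeat count such as "8i" is equivalent to writing "i" eight times.
    formato = f"={len(valores)}i"
    # The * spreads the list into separate arguments, one per format item.
    dados = struct.pack(formato, *valores)
    with open(caminho, "wb") as arq:
        arq.write(dados)
    return len(dados)


valores = [16, -25, 34, 2, 199, 45, 67, 90]
# Hypothetical location: a fresh temporary directory, to avoid clobbering files.
caminho = os.path.join(tempfile.mkdtemp(), "colecao.bin")
bytes_escritos = escreve_bin(valores, caminho)
print(bytes_escritos)  # 32: eight values, 4 bytes each
```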

Reading this data

So: struct generates binary data, exactly the values of the numbers you put in and not a bit more. This has the following implication at reading time: there is no information about how many integers there were in the original list. Python lists are high-level data structures that carry some associated metadata, including their length. When the data is written to a file, this length information does not go along with it, so at reading time you need some way of knowing it.

For example, strings in C solve this problem by writing a 0 value at the end of the string. Another way is to first write a number holding the count of elements that follow: you read this number first, and then you know how many elements to read next. Serialization formats such as JSON or XML have markers for the beginning and end of blocks of "sibling data".

If all you need are integers, another way to know how many there are in the file is to look at the file size. Since they are 4-byte (32-bit) integers, which is what "i" means in struct when the format starts with "=", just divide the data size by 4. Or decode them one by one until the end of the data stream. To stay closer to the writing code, we can do it the first way:
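The length-prefix idea can be sketched as follows. The function names write_with_count and read_with_count are hypothetical, and an in-memory io.BytesIO stands in for a real file opened in binary mode. The count is stored as a 4-byte unsigned integer ("I"), which is an assumption of this sketch rather than anything the question requires.

```python
import io
import struct


def write_with_count(stream, values):
    """Write the number of values first, then the values themselves."""
    stream.write(struct.pack("=I", len(values)))
    stream.write(struct.pack(f"={len(values)}i", *values))


def read_with_count(stream):
    """Read the count, then exactly that many 4-byte integers."""
    (count,) = struct.unpack("=I", stream.read(4))
    return list(struct.unpack(f"={count}i", stream.read(4 * count)))


buffer = io.BytesIO()
write_with_count(buffer, [16, -25, 34])
buffer.seek(0)
restored = read_with_count(buffer)
print(restored)
```

With this layout, other data can follow the list in the same file, because the reader knows exactly where the list ends.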

import struct

int_size = 4  # size in bytes of one "=i" integer
with open("colecao.bin", "rb") as file_:
    raw_data = file_.read()
    data = struct.unpack(f"={'i' * (len(raw_data) // int_size)}", raw_data)

Or using pathlib, since you won’t be dealing with the open file:

from pathlib import Path

raw_data = Path("colecao.bin").read_bytes()
data = struct.unpack(f"={'i' * (len(raw_data) // int_size)}", raw_data)
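The other option mentioned above, decoding the integers one by one until the end of the stream, could be sketched like this. The generator name iter_ints is hypothetical, and io.BytesIO again stands in for a file opened with "rb". Only one 4-byte chunk is unpacked at a time, so the whole file never has to be loaded at once.

```python
import io
import struct

INT = struct.Struct("=i")  # precompiled format: one 4-byte signed integer


def iter_ints(stream):
    """Yield integers from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(INT.size)
        if not chunk:
            return  # clean end of data
        if len(chunk) < INT.size:
            raise ValueError("stream ended in the middle of an integer")
        yield INT.unpack(chunk)[0]


stream = io.BytesIO(struct.pack("=4i", 16, -25, 34, 2))
decoded = list(iter_ints(stream))
print(decoded)
```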

Sorting the contents without bringing them into main memory

That would be a very specific problem; I believe it is an exercise and not a real application. In real applications, something like 98% of files fit in the computer's main memory. The Pandas data-processing library, for example, not only works with all the data in memory, but most of its operations create a new copy of the data instead of modifying it in place.

So, in this case, the most appropriate approach is to create a set of functions that can read and write a binary integer at a given position of the file, and to implement the sorting logic separately on top of them.

The functions would look like this:

import struct

open_file = open("colecao.bin", "rb+")  # note the file is opened for binary reading and writing
...
int_size = 4


def read_int(open_file, position):
    # Jump to the byte offset of the requested element and decode it
    open_file.seek(position * int_size)
    return struct.unpack("=i", open_file.read(int_size))[0]


def write_int(open_file, position, value):
    # Jump to the byte offset and overwrite the 4 bytes stored there
    open_file.seek(position * int_size)
    open_file.write(struct.pack("=i", value))

Super-Python to the rescue!

Now, the interesting thing about doing an exercise like this in Python, compared with other languages, is that it is quite simple to create a class that lets you access the data in the file on disk exactly as if it were a list, using brackets to retrieve each element. For this, you just need to implement the functionality of the functions above in the __getitem__ and __setitem__ methods of a class. If the class inherits from collections.abc.MutableSequence and you also implement __enter__ and __exit__ (in addition to the methods MutableSequence requires: __len__ for real, while __delitem__ and insert should raise NotImplementedError in this case), you can access your data like this:

with SequenciaEmArquivo("colecao.bin") as dados:
    # Copy the number at position 0 of the file into position 2:
    dados[2] = dados[0]

    # or the "Pythonic" swap:

    dados[2], dados[0] = dados[0], dados[2]
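A minimal sketch of such a class is shown below. Only the class name SequenciaEmArquivo comes from the example above; the attribute names, the _offset helper, the restriction to integer indexes (no slices) and the temporary file used in the demo are all assumptions made for illustration.

```python
import os
import struct
import tempfile
from collections.abc import MutableSequence


class SequenciaEmArquivo(MutableSequence):
    """Expose a file of 4-byte integers through list-like indexing."""

    _INT = struct.Struct("=i")

    def __init__(self, caminho):
        self.caminho = caminho
        self.arquivo = None

    def __enter__(self):
        # "rb+" opens an existing file for binary reading and writing
        self.arquivo = open(self.caminho, "rb+")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.arquivo.close()
        return False  # do not swallow exceptions

    def __len__(self):
        # Number of integers = file size in bytes / bytes per integer
        self.arquivo.seek(0, os.SEEK_END)
        return self.arquivo.tell() // self._INT.size

    def _offset(self, index):
        # Hypothetical helper: validate the index and convert it to a byte offset
        if not isinstance(index, int):
            raise TypeError("only integer indexes are supported in this sketch")
        tamanho = len(self)
        if index < 0:
            index += tamanho
        if not 0 <= index < tamanho:
            raise IndexError("index out of range")
        return index * self._INT.size

    def __getitem__(self, index):
        self.arquivo.seek(self._offset(index))
        return self._INT.unpack(self.arquivo.read(self._INT.size))[0]

    def __setitem__(self, index, value):
        self.arquivo.seek(self._offset(index))
        self.arquivo.write(self._INT.pack(value))

    def __delitem__(self, index):
        raise NotImplementedError("deleting from the file is not supported")

    def insert(self, index, value):
        raise NotImplementedError("inserting into the file is not supported")


# Demo on a temporary file so that no existing file is touched
valores = [16, -25, 34, 2, 199, 45, 67, 90]
caminho = os.path.join(tempfile.mkdtemp(), "colecao.bin")
with open(caminho, "wb") as arq:
    arq.write(struct.pack(f"={len(valores)}i", *valores))

with SequenciaEmArquivo(caminho) as dados:
    dados[2], dados[0] = dados[0], dados[2]
    resultado = list(dados)
print(resultado)  # [34, -25, 16, 2, 199, 45, 67, 90]
```

Raising IndexError for out-of-range positions matters here: the iteration that MutableSequence provides relies on it to know where the data ends.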

Although a class implemented this way has several of the features of a Python list, it is a "Sequence", not a list; in particular, it will not have a sort method. That does not stop you from taking advantage of the class: since sorting is exactly your problem, you can implement your algorithm in a sort method of that same class.

Keep in mind that both this kind of access and the direct one with the functions above can work quite well on desktop-class computers, where the operating system has enough memory to buffer files in memory. On an older machine (of the '486' generation) or a smaller class of device (small boards such as the Raspberry Pi), the operating system may have restrictions, and if the file is on a mechanical disk (an HDD rather than an SSD), the performance of this type of access may be unworkable: literally hundreds of thousands of times slower than working with the data in memory, since every integer read may require mechanically moving the drive's read head to the position of the data.
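As one possible choice of algorithm (an assumption of this sketch, not something the question prescribes), heapsort works in place using only len() and index reads and writes, so it fits any object with that interface, including a file-backed sequence like the one described above. The function names are hypothetical, and the demo runs on a plain list only to keep the example self-contained.

```python
def sift_down(seq, start, end):
    """Restore the max-heap property in seq[start:end], rooted at start."""
    root = start
    while True:
        child = 2 * root + 1
        if child >= end:
            return
        # Pick the larger of the two children
        if child + 1 < end and seq[child] < seq[child + 1]:
            child += 1
        if seq[root] >= seq[child]:
            return
        seq[root], seq[child] = seq[child], seq[root]
        root = child


def heapsort_in_place(seq):
    """Sort seq in ascending order with O(1) extra memory."""
    n = len(seq)
    # Build a max-heap from the bottom up
    for start in range(n // 2 - 1, -1, -1):
        sift_down(seq, start, n)
    # Repeatedly move the current maximum to the end of the unsorted part
    for end in range(n - 1, 0, -1):
        seq[0], seq[end] = seq[end], seq[0]
        sift_down(seq, 0, end)


dados = [16, -25, 34, 2, 199, 45, 67, 90]
heapsort_in_place(dados)
print(dados)  # [-25, 2, 16, 34, 45, 67, 90, 199]
```

Heapsort jumps around the data a lot, so on a real file every comparison may cost a seek. For genuinely large files, an external merge sort that works on sorted chunks is the more common approach.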

Now, what do you really want to do?

Of course, understanding how to write exact bytes to files and read them back is very important, and every programmer should at some point know how to do it, at least conceptually. The practical usefulness of doing so, however, may be more limited.

If you simply want to store a list in a file and recover its values later, the recommendation is to use Python's pickle module, which recursively transforms a whole Python object into a sequence of bytes and writes that sequence to a file. That way you are not restricted to 32-bit integers and other primitive types, nor do you have to create your own protocol just for your application (that is, reinvent the wheel) to know how many values to read.

On the other hand, if your application needs to store several kinds of values in a structured way and recover them while using disk space efficiently, that is what databases are for; you can use sqlite, for example.
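A minimal sketch of the pickle route, assuming a hypothetical file name colecao.pickle inside a temporary directory:

```python
import os
import pickle
import tempfile

valores = [16, -25, 34, 2, 199, 45, 67, 90]
# Hypothetical path, placed in a temporary directory for the demo
caminho = os.path.join(tempfile.mkdtemp(), "colecao.pickle")

# pickle stores the whole list, including its length and element types
with open(caminho, "wb") as arq:
    pickle.dump(valores, arq)

with open(caminho, "rb") as arq:
    recuperados = pickle.load(arq)

print(recuperados == valores)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.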
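With the standard library's sqlite3 module, storing the values and reading them back sorted could look like the sketch below. The table and column names are made up for illustration, and an in-memory database is used so the example leaves no file behind; a file path passed to connect would persist the data on disk.

```python
import sqlite3

valores = [16, -25, 34, 2, 199, 45, 67, 90]

conn = sqlite3.connect(":memory:")  # use a file path here to keep the data on disk
conn.execute("CREATE TABLE valores (valor INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO valores (valor) VALUES (?)",
    [(v,) for v in valores],
)
conn.commit()

# The database does the sorting; rows are fetched one at a time from the cursor
ordenados = [row[0] for row in conn.execute("SELECT valor FROM valores ORDER BY valor")]
conn.close()
print(ordenados)  # [-25, 2, 16, 34, 45, 67, 90, 199]
```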
