Reading list of integers from a binary file

1

I need to read 10 million integers stored in a binary file and put them in a list. The program does not fail, but the values it reads look strange:

main:

nome_arqivo="randomnumbers.bin"
lista=[]
try:
    arquivo=open(nome_arqivo,"rb")
except IOError:
    print("Erro na abertura do arquivo")


#for x in arquivo:
    #lista.append(arquivo.read(4))

for x in range(10000000):
   lista.append(arquivo.read(4))

for item in lista:
    print(item)

The end of the output looks like this:

b'\x00\xca\x82\x05'

b'\x00Y\x88\x08'

b'\x00\xb1\xe4\x12'

b'\x00\xb6j\x0e'

b''

Process finished with exit code 0
  • Can you post a snippet of the file that contains the numbers? The link you shared did not open here. And why do you say that strange numbers came out?

2 answers

3


Generate a 128-byte file of random data for testing, which represents 32 four-byte integers:

$ head -c 128 < /dev/urandom > randomnumbers.bin

Generated file:

$ xxd randomnumbers.bin 
0000000: 300c 54ea 4023 8592 267c 0dc9 f961 0a6d  0.T.@#..&|...a.m
0000010: d0d6 cef3 950e 39ac 8422 5671 c1a2 2546  ......9.."Vq..%F
0000020: ea5a b0e5 cb00 9fb5 40e5 cb7b 849e fb36  .Z......@..{...6
0000030: d64e 77f8 0351 866c 4f2c 824b c98b 82a5  .Nw..Q.lO,.K....
0000040: 7421 e0d1 626a 2cdd 090e 69a4 0894 01bf  t!..bj,...i.....
0000050: 37a0 0405 cdbc 57f2 fa4f 1e78 89c1 f2b5  7.....W..O.x....
0000060: c8eb 2c63 4c13 2e47 d59b 234d b951 41df  ..,cL..G..#M.QA.
0000070: 1d65 52a7 51c9 240e 2426 4f55 a6c9 2cfb  .eR.Q.$.$&OU..,.

Solution #1: Using the struct module:

import struct

lista=[]

with open( "randomnumbers.bin", "rb") as arq:
    for num in iter( lambda: arq.read(4), b'' ):
        lista.append( struct.unpack( 'i', num )[0] )

print(lista)

Output:

[-363590608, -1836768448, -921863130, 1829396985, -204548400, -1405546859, 1901470340, 1176871617, -441427222, -1247870773, 2076960064, 922459780, -126398762, 1820741891, 1266822223, -1518171191, -773840524, -584291742, -1536618999, -1090415608, 84189239, -229131059, 2015252474, -1242381943, 1663888328, 1194201932, 1294179285, -549367367, -1487772387, 237291857, 1431250468, -80950874]
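As a variation on Solution #1, the whole file can probably be decoded in a single pass with struct.iter_unpack (available since Python 3.4). The format '<i' pins the byte order to little-endian, whereas plain 'i' uses the machine's native order. This is only a minimal sketch: the function name read_int32_struct is hypothetical, and the little-endian layout is an assumption that happens to match the hexdump and output above.

```python
import struct

def read_int32_struct(path):
    """Return every 4-byte little-endian signed integer in the file as a list."""
    with open(path, "rb") as f:
        data = f.read()
    # iter_unpack requires the buffer length to be a multiple of 4,
    # so any incomplete trailing bytes are dropped first.
    usable = len(data) - (len(data) % 4)
    # Each item yielded by iter_unpack is a 1-tuple; unpack it into a plain int.
    return [value for (value,) in struct.iter_unpack("<i", data[:usable])]
```

Calling read_int32_struct("randomnumbers.bin") should give the same list as the loop above on a little-endian machine, while reading the file with one call instead of one call per integer.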

Solution #2: Using NumPy arrays:

import numpy as np

with open( "randomnumbers.bin", "rb") as arq:
    lista = np.fromfile( arq, dtype=np.int32 ).tolist()

print(lista)

Output:

[-363590608, -1836768448, -921863130, 1829396985, -204548400, -1405546859, 1901470340, 1176871617, -441427222, -1247870773, 2076960064, 922459780, -126398762, 1820741891, 1266822223, -1518171191, -773840524, -584291742, -1536618999, -1090415608, 84189239, -229131059, 2015252474, -1242381943, 1663888328, 1194201932, 1294179285, -549367367, -1487772387, 237291857, 1431250468, -80950874]
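Where NumPy is not installed, the standard library's array module offers a similar way to load the data without a Python-level loop. The following is a sketch under stated assumptions, not a drop-in replacement: the function name read_int32_array is hypothetical, the file is assumed to be little-endian, and typecode 'i' is checked to be 4 bytes because its size depends on the platform's C int.

```python
import sys
from array import array

def read_int32_array(path):
    """Load little-endian 32-bit signed integers without a Python-level loop."""
    values = array("i")
    # Typecode "i" is a C int: 4 bytes on common platforms, but verify it.
    if values.itemsize != 4:
        raise RuntimeError("typecode 'i' is not 4 bytes on this platform")
    with open(path, "rb") as f:
        data = f.read()
    # frombytes needs a whole number of items, so ignore trailing bytes.
    values.frombytes(data[: len(data) - len(data) % 4])
    if sys.byteorder == "big":
        values.byteswap()  # assumption: the file is little-endian
    return values.tolist()
```

This was not part of the measurements in this answer, so its speed relative to the other solutions would have to be measured rather than assumed.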

Solution #3: Using the int.from_bytes() method (Python 3 only):

lista=[]

with open( "randomnumbers.bin", "rb") as arq:
    for num in iter( lambda: arq.read(4), b'' ):
        lista.append(int.from_bytes(num, byteorder='little', signed=True))

print(lista)

Output:

[-363590608, -1836768448, -921863130, 1829396985, -204548400, -1405546859, 1901470340, 1176871617, -441427222, -1247870773, 2076960064, 922459780, -126398762, 1820741891, 1266822223, -1518171191, -773840524, -584291742, -1536618999, -1090415608, 84189239, -229131059, 2015252474, -1242381943, 1663888328, 1194201932, 1294179285, -549367367, -1487772387, 237291857, 1431250468, -80950874]
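To see why the byteorder and signed arguments matter, it may help to decode the same four bytes in different ways. The sketch below uses the first four bytes of the hexdump above (30 0c 54 ea); the variable names are only illustrative.

```python
raw = bytes.fromhex("300c54ea")  # first four bytes of the hexdump above

# Same bytes, three interpretations.
as_signed_le = int.from_bytes(raw, byteorder="little", signed=True)
as_unsigned_le = int.from_bytes(raw, byteorder="little", signed=False)
as_signed_be = int.from_bytes(raw, byteorder="big", signed=True)

print(as_signed_le)    # -363590608, the first value in the output above
print(as_unsigned_le)  # 3931376688
print(as_signed_be)    # 806114538
```

Only the little-endian, signed interpretation reproduces the values printed by the three solutions, so both arguments need to match how the file was written.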

Comparative Analysis (Python 3):

The time utility can be used to compare the efficiency of the three solutions presented when processing large files.

Generate a 128 MB file of random data, which represents 33,554,432 four-byte integers:

$ head -c 128M < /dev/urandom > randomnumbers.bin

teste.py:

import sys
import struct
import numpy as np

def solucao_struct():
    lista=[]
    with open( "randomnumbers.bin", "rb") as arq:
        for num in iter( lambda: arq.read(4), b'' ):
            lista.append( struct.unpack( 'i', num )[0] )

def solucao_numpy():
    with open( "randomnumbers.bin", "rb") as arq:
        lista = np.fromfile( arq, dtype=np.int32 ).tolist()

def solucao_from_bytes():
    lista=[]
    with open( "randomnumbers.bin", "rb") as arq:
        for num in iter( lambda: arq.read(4), b'' ):
            lista.append(int.from_bytes(num, byteorder='little', signed=True))

if( sys.argv[1] == "--np" ):
    solucao_numpy()
elif( sys.argv[1] == "--struct" ):
    solucao_struct()
elif( sys.argv[1] == "--frombytes" ):
    solucao_from_bytes()

Measuring Solution Performance with NumPy Arrays:

$ time python3 teste.py --np

real    0m3.766s
user    0m2.384s
sys     0m1.362s

Integers Per Second (NumPy):

(128MB / 4Bytes) / 3.766s = 8909833.2

Measuring Solution Performance with Struct:

$ time python3 teste.py --struct

real    0m38.200s
user    0m36.700s
sys     0m1.411s

Integers Per Second (Struct):

(128MB / 4Bytes) / 38.2s = 878388.2

Measuring Solution Performance with int.from_bytes():

$ time python3 teste.py --frombytes

real    2m2.691s
user    2m1.057s
sys     0m1.375s

Integers Per Second (int.from_bytes()):

(128MB / 4Bytes) / 122.691s = 273487.3
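The shell's time command also counts interpreter startup and module imports. To time only the read itself, the measurement can be taken inside Python with time.perf_counter. This is a minimal sketch: the helper name integers_per_second and its parameters are hypothetical, and reader is assumed to be any zero-argument function such as solucao_numpy from teste.py above.

```python
import os
import time

def integers_per_second(reader, path="randomnumbers.bin", item_size=4):
    """Time one call of reader() and return (elapsed_seconds, integers_per_second)."""
    # The number of integers is derived from the file size, as in the formulas above.
    n_integers = os.path.getsize(path) // item_size
    start = time.perf_counter()
    reader()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    rate = n_integers / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate
```

For example, integers_per_second(solucao_struct) would return the elapsed time and the rate for the struct solution. A single run is noisy, so repeating the call a few times and comparing the results is advisable.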
  • 1

    Between your solutions and the one I found, which do you think is the most efficient?

  • 1

    @Rafael Efficiency is better measured than guessed at. Something that seems more efficient (for example, because it is lower-level) can turn out worse than a higher-level alternative.

  • @Rafael: I made an edit demonstrating how to measure the performance of the solutions. The solution using NumPy arrays turned out to be roughly 10x faster than the solution with struct.unpack() (3.766 s versus 38.2 s in the measurements above).

  • @Pabloalmeida: See the edit.

  • Man, thank you so much for introducing me to the NumPy library. My implementation was taking 10 seconds to read the 10 million ints; now it reads everything in 0.2 seconds. I also think NumPy will give me other great solutions for the work I'm doing. Thank you!

0

Everyone, I found the error. I'm used to C/C++, where there is no need to convert the bytes into integers when reading a binary file. In Python it is necessary. The corrected code is below:

import time

nome_arqivo="randomnumbers.bin"
lista=[]
try:
    arquivo=open(nome_arqivo,"rb")
except IOError:
    print("Erro na abertura do arquivo")

for x in range(10000000):
    lista.append(int.from_bytes(arquivo.read(4), byteorder='little'))

for item in lista:
    time.sleep(4)
    print(item)
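One caveat with the loop above: if the file holds fewer than 10 million integers, arquivo.read(4) returns b'' at end of file (as in the output shown in the question), and int.from_bytes(b'', byteorder='little') evaluates to 0, so the list silently fills with zeros. A minimal sketch that stops at end of file instead is shown here. The function name read_int32_until_eof is hypothetical, and signed=True is an assumption that the file stores signed integers (the code above decodes them as unsigned).

```python
def read_int32_until_eof(path, signed=True):
    """Read 4-byte little-endian integers until the file is exhausted."""
    values = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4)
            if len(chunk) < 4:
                break  # b'' at end of file, or an incomplete trailing chunk
            values.append(int.from_bytes(chunk, byteorder="little", signed=signed))
    return values
```

The with statement also closes the file and avoids the situation where open() fails but the loop still runs on an undefined arquivo.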
  • 1

    Keep in mind that int.from_bytes() is a method of int introduced in Python 3 and is not available in Python 2.

  • I believe every solution that uses a for loop will always be slow compared with NumPy's fromfile() function.
