Since the most efficient approach in this case is substantially different from the accepted answer, I will write a bit.
It has already been pointed out in the comments that this is far from being an appropriate structure for a data set of this size - even more so one that needs to be modified.
Since the lines have a fixed size, this is possible.
But you have to open the file in the special mode "rb+" - if you try to open it in "w" mode, the system wipes the whole file - and then take some care to write each line back exactly in the place where it was.
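A minimal sketch of the idea (the file name, record length and field positions here are illustrative, assuming every record occupies exactly the same number of bytes, newline included):

TAMANHO_REGISTRO = 20  # illustrative: bytes per record, including the "\n"

with open("exemplo.txt", "rb+") as arquivo:
    indice = 0                                   # record we want to rewrite
    arquivo.seek(indice * TAMANHO_REGISTRO)      # jump straight to its position
    linha = arquivo.read(TAMANHO_REGISTRO).decode("ascii")
    nova = linha[:14] + "42" + linha[16:]        # change a fixed-width slice (positions are illustrative)
    arquivo.seek(indice * TAMANHO_REGISTRO)      # go back to the same spot
    arquivo.write(nova.encode("ascii"))          # overwrite exactly that record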
Accessing the records in a structured way:
First, however, let's see how to access the data inside each row, so that the system stays maintainable without you having to count on your fingers, at every modification, in which column each field goes, and then bury that in an if.
Now, trying to do this with indexes into each line hard-coded in several "if" statements along the way only makes things harder - this is a typical case where you can use Python's ability to customize attribute access in a class to build something quite nice: a class in which you read and modify each column by field name, while internally it keeps the data as a single string that can be written back to the original file or printed.
The answer to How to extract information from a 'cnab' file using Python? covers how to create such a class - the "Campo" and "Base" classes, the way they are written there, create a descriptor object (the Campo class) that customizes access to the data through attributes, while the "Base" class has the rest of the machinery needed to expose the fields:
class Campo:
    def __init__(self, inicio, final):
        self.inicio = inicio
        self.final = final

    def __set_name__(self, owner, nome):
        self.nome = nome

    def __get__(self, instance, owner):
        if not instance:
            return self
        return instance.dados_brutos[self.inicio: self.final]


class Base:
    def __init__(self, dados):
        self.dados_brutos = dados

    def __repr__(self):
        campos = []
        for name, obj in self.__class__.__dict__.items():
            if isinstance(obj, Campo):
                campos.append((name, getattr(self, name)))
        return "\n".join(f"{campo}:{conteudo}" for campo, conteudo in campos)
With this snippet of less than 25 lines, it is now possible to represent the lines of your file with a specific class:
class Frutas(Base):
    codigo = Campo(0, 6)
    nome = Campo(7, 13)
    valor = Campo(14, 16)
    uf = Campo(17, 19)
Check it out - I pasted exactly the code above, plus the excerpt you gave as an example, into an interactive Python session to see how it works:
...: exemplo = """\
...: 123456 BANANA 00 SP
...: 123457 MACA   01 RJ
...: 123458 PERA   02 MG"""
...:
...: class Frutas(Base):
...:     codigo = Campo(0, 6)
...:     nome = Campo(7, 13)
...:     valor = Campo(14, 16)
...:     uf = Campo(17, 19)
...:
In [34]: x = [Frutas(linha) for linha in exemplo.split("\n")]
In [35]: x[0].nome
Out[35]: 'BANANA'
In [36]: x[0].valor = "23"
In [37]: x[0]
Out[37]:
codigo:123456
nome:BANANA
valor:23
uf:SP
In [38]: x[0].dados_brutos
Out[38]: '123456 BANANA 00 SP'
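One detail to note in the session above: assigning x[0].valor = "23" only stores "23" on the instance, while dados_brutos still shows "00", because Campo as defined there only implements reading (__get__). For the write-back example further below (which changes BANANA's value to "42" and persists it to the file) to actually alter dados_brutos, Campo also needs a __set__ method that splices the new value back into the raw string. A minimal sketch of such a method, to be added to Campo (it is not in the snippet above):

    def __set__(self, instance, valor):
        # splice the new value into the raw string, keeping the fixed width
        tamanho = self.final - self.inicio
        if len(valor) > tamanho:
            raise ValueError(f"valor muito longo para o campo {self.nome!r}")
        valor = valor.ljust(tamanho)
        dados = instance.dados_brutos
        instance.dados_brutos = dados[:self.inicio] + valor + dados[self.final:]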
Manipulating file lines as records, with random access
Now, once a line has been read, we have a way to work with it. The most interesting part is the idea of having a class that represents a file like this once it is open, and then using that class to read or write one record at a time, without the code that uses it having to worry about where or how the data is stored.
(That way, if tomorrow you move the data somewhere else, be it an SQL database, NoSQL, etc., the code that uses the data does not even need to know about it.)
from pathlib import Path


class MapeadorTxt:
    def __init__(self, caminho_arquivo, classe_dados, comprimento_linha, codificacao="ASCII"):
        # comprimento_linha: total size of each record in bytes, including the trailing "\n"
        self.caminho = Path(caminho_arquivo)
        self.classe = classe_dados
        self.comp = comprimento_linha
        self.codificacao = codificacao

    def __enter__(self):
        self.arquivo = open(self.caminho, "rb+")
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        self.arquivo.close()
        self.arquivo = None

    def __getitem__(self, index):
        if index >= len(self):
            raise IndexError
        self.arquivo.seek(index * self.comp)
        return self.classe(self.arquivo.read(self.comp).decode(self.codificacao))

    def __setitem__(self, index, valor):
        self.arquivo.seek(index * self.comp)
        v = valor.dados_brutos.rstrip("\n").ljust(self.comp - 1) + "\n"
        self.arquivo.write(v.encode(self.codificacao))

    def __len__(self):
        return self.caminho.stat().st_size // self.comp

    def __repr__(self):
        return f"Mapeador do arquivo {self.caminho} para a classe {self.classe.__name__}, com {len(self)} registros"
And with this class you can do what you propose in the question, using it inside a "with" block. We use the enumerate function in the for loop to get, besides each record, its index as well (it is in the variable i); then, if you want to access the record on the previous line, just use m[i - 1], for example.
Running it here to change the value of the BANANA record to "42" (each record, counting the newline, is 20 bytes long):
In [57]: m = MapeadorTxt("exemplo.txt", Frutas, 20)
In [58]: with m:
...:     for i, fruta in enumerate(m):
...:         if fruta.nome.strip() == "BANANA":
...:             fruta.valor = "42"
...:             m[i] = fruta
...:
In [59]: cat exemplo.txt
123456 BANANA 42 SP
123457 MACA   01 RJ
123458 PERA   02 MG
The code above is not perfect - in particular, Campo should handle text encoding and decoding, and keep the internal values in bytes, or in a bytearray.
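Just to illustrate that direction, a rough sketch of a bytes-aware field, assuming Base would then keep dados_brutos as a bytearray and the mapper would read and write raw bytes without decoding (the class name and the codificacao parameter are illustrative):

class CampoBytes:
    # Like Campo, but the underlying record is a bytearray;
    # text is decoded on read and encoded (and width-checked) on write.
    def __init__(self, inicio, final, codificacao="ascii"):
        self.inicio = inicio
        self.final = final
        self.codificacao = codificacao

    def __set_name__(self, owner, nome):
        self.nome = nome

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.dados_brutos[self.inicio:self.final].decode(self.codificacao)

    def __set__(self, instance, valor):
        tamanho = self.final - self.inicio
        bruto = valor.encode(self.codificacao)
        if len(bruto) > tamanho:
            raise ValueError(f"valor muito longo para o campo {self.nome!r}")
        instance.dados_brutos[self.inicio:self.final] = bruto.ljust(tamanho)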
And, of course, this does not solve the access-time problem - you still have 500 million records which, for a comparison like this one, have to be read one by one. Throw the data into an SQL database and you can index and query it, etc.
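Just to sketch that direction (not part of the original answer): loading the fixed-width file into SQLite with the standard library, with illustrative table and column names, turns the full scan into an indexed query.

import sqlite3

# Sketch: load the fixed-width file into SQLite once, then query/update by name.
# Table and column names are illustrative; the slices match the Frutas fields.
con = sqlite3.connect("frutas.db")
con.execute("CREATE TABLE IF NOT EXISTS frutas (codigo TEXT, nome TEXT, valor TEXT, uf TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS idx_nome ON frutas (nome)")

with open("exemplo.txt", encoding="ascii") as arquivo:
    registros = (
        (linha[0:6], linha[7:13].strip(), linha[14:16], linha[17:19])
        for linha in arquivo
    )
    con.executemany("INSERT INTO frutas VALUES (?, ?, ?, ?)", registros)
con.commit()

# A lookup or an update is now a single indexed query instead of a full scan:
con.execute("UPDATE frutas SET valor = ? WHERE nome = ?", ("42", "BANANA"))
con.commit()
con.close()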
Put the arquivo.write(linha) outside the if. I believe you are not dealing adequately with the opening and closing of your file. Study a little more.
– anonimo
You open the file in read mode at the beginning and then modify the file variable by opening it again in write mode at each for iteration?
– Guilherme Brügger
If I put the write outside the if and there is more than one BANANA, it deletes one and keeps the other; let's say I change PERA to BANANA as well... in the end only PERA and MACA are left in the file....
– LuizGTVSilva
@Guilhermebrügger basically that: at each line I open and modify the file... it is impossible to keep it in memory with readlines because it is a 6 GB file.
– LuizGTVSilva
You should not read a line, change its value and then try to rewrite that line in the same file. That does not work. Open the file, read and change each line, and save it to a new file. That way the original is preserved if the program dies halfway through.
– Guilherme Brügger
Yes, I will implement this in the future, even to ensure that the file does not get corrupted. But first I need to modify the lines I need without deleting the rest of the content... Or do you think the solution might be to copy line by line to a new file?
– LuizGTVSilva
You do not have to delete the rest of the lines. If a line does not have the value you are looking for, you simply write it unmodified to the new file. If it does, you modify it and write the new line. Either that, or put it all in memory. Still, conceptually there is no difference: the 6 GB string in memory would be just like the new file, only on a more efficient storage medium.
– Guilherme Brügger
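(As a side note, a minimal sketch of the approach suggested in these comments - streaming the original file into a new one and changing only the matching lines; the file names are illustrative:)

# Sketch of the "write to a new file" approach discussed above.
# Lines are streamed one by one, so memory usage stays small even for 6 GB.
with open("exemplo.txt", encoding="ascii") as origem, \
     open("exemplo_novo.txt", "w", encoding="ascii") as destino:
    for linha in origem:
        if linha[7:13].strip() == "BANANA":          # "nome" field, as in Frutas
            linha = linha[:14] + "42" + linha[16:]   # change only the "valor" column
        destino.write(linha)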
Accepted answer and side discussions aside, you do know that a text file is a bad option for maintaining a data set that you need to change, even more so at this size, right? Since the lines have a fixed size in bytes, this is possible - otherwise it would not even be - but that does not mean it is desirable. Depending on your task, it may pay off a lot (really a lot) to put your data into an SQLite database - and if you have some legacy system that needs that specific format, generate the final output file when the time comes.
– jsbueno
In Python it is almost trivial to create a class that represents a line of your file - if you really want to go for it, a little more code could create indexes for some fields, and you could have something with fast random access, even with this file format.
– jsbueno
Check out my answer here, for a data structure using a text file with fixed-size lines: https://answall.com/questions/399778/como-extrair-as-informa%C3%A7%C3%b5es-de-um-file-cnab-using-python/400033#400033
– jsbueno
@jsbueno this is a legacy system at the company... I agree with everything you said!
– LuizGTVSilva