Modifying a file in Python without losing the current content


This question is a continuation of this one.

I have a very large TXT file (about 6GB) in which every line is 1,300 characters long, and I am manipulating these lines. As an example, I will use the three lines below (the site is eating the blanks, but the lines have the same layout):

123456 BANANA 00 SP
123457 MACA   01 RJ
123458 PERA   02 MG

What I need is this: if the line contains the word "BANANA" at 8:14, the "00" at 16:17 must be changed to "22".

For that, I’m implementing something like:

arquivo = open('testando.txt', 'r') # open the file
for linha in arquivo:
    codigo = str(linha[8:14])
    
    if codigo == 'BANANA':
       print("ACHOU, CODIGO: " + codigo)
       encontrado = True
    else:
       print("NÃO ACHOU, CÓDIGO: " + codigo)
       encontrado = False

    if encontrado:
        new_line = linha[:16] + "22" + linha[17:] # Python strings are immutable, so I am just rebuilding the line
        print(new_line)
        linha = linha.replace(linha, new_line) # replace the old line with the new one, carrying the new information
        arquivo = open('testando.txt','w')
        arquivo.write(linha) # add my new line to the file
        
    print(linha)
        
arquivo.close()

The problem I am running into is that it saves only the line I changed and deletes the rest of the file. I imagined that, since I was going through the file line by line, it would change only the line at that index, but it is replacing the complete file.

PS: I have tried calling write after the if, and I have also tried writelines.

Does anyone know a way to keep the file intact and change only the desired part?

  • Put the arquivo.write(linha) outside the if. I believe you are not handling the opening and closing of your file properly. Study that a little more.

  • You open the file in read mode at the beginning and then overwrite the arquivo variable by opening it again in write mode on each iteration of the for?

  • If I put the write outside the if and there is more than one BANANA, it deletes one and keeps the other. Suppose I change PERA to BANANA as well... in the end only PERA and MACA are left in the file...

  • @Guilhermebrügger basically that, on each line I open and change the file... It is impossible to keep it in memory with readlines because it is a 6GB file.

  • You should not read a line, change its value and then try to rewrite that line in the same file. That does not work. Open the file, read and change each line, and save to a new file. That way the original is preserved if the program dies in the middle.

  • Yes, I will implement that in the future, also to make sure the file does not get corrupted. But first I need to modify the lines I need without deleting the rest of the content... Or do you think the solution might be to copy line by line to a new file?

  • You do not have to delete the rest of the lines. If the line does not have the value you are looking for, you simply write it unmodified to the new file. If it does, you modify it and write the new line. Either that or load everything into memory. Conceptually there is no difference: the 6GB string in memory would play the role of the new file, just on a faster storage medium.

  • 1

    Accepted answer and discussions aside, you do know that a text file is a bad option for keeping a data set that you need to change, and worse still at this size, right? The fact that the lines have a fixed size in bytes makes this possible (otherwise it would not even be), but that does not mean it is desirable. Depending on your task, it may be well worth putting your data in a SQLite database and, if you have a legacy system that needs that specific format, generating the final output file when the time comes.

  • 1

    In Python it is almost trivial to create a class that represents a line in your file. If you are really up for it, a little more code could create indexes for some fields, and you could have fast random access, even with this file format.

  • Check out my answer here, for a file that uses a text data structure with fixed-size lines: https://answall.com/questions/399778/como-extrair-as-informa%C3%A7%C3%b5es-de-um-file-cnab-using-python/400033#400033

  • @jsbueno this is a legacy system of the company... I agree with everything you said!


2 answers

4


First, calling str in str(linha[8:14]) is redundant, since linha[8:14] already returns a string. In addition, in the example file you posted, this slice corresponds to "ANANA " (but let's assume the indexes are right for the original file; it is a small adjustment that does not affect the rest).
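To check the offsets against the sample lines, here is a quick sketch. The sample line is copied from the question; the real 1,300-character layout is not shown there, so the offsets below apply only to the sample.

```python
# Sample line from the question; the real file uses a different, wider layout.
linha = "123456 BANANA 00 SP"

print(repr(linha[8:14]))   # slice used in the question
print(repr(linha[7:13]))   # slice that covers the fruit name in this sample
print(repr(linha[14:16]))  # slice that covers the two-digit value in this sample

# Replacing a 2-character field must use a 2-character-wide slice,
# otherwise the line changes length.
nova = linha[:14] + "22" + linha[16:]
print(nova)
```

Whatever the real offsets are, the rebuilt line should have the same length as the original one.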

Another point is that you are reading from the file at the same time as you write to it. If any error happens along the way, the file will be corrupted. Ideally, you first write everything to a temporary file and only at the end, if all goes well, move the temporary file over the original.

And as I already said in your previous question, the variable encontrado is unnecessary. If you want to do something when the string is "BANANA", do everything inside the first if:

import shutil, tempfile

# read from the file and write to a separate temporary file
with open('testando.txt', 'r') as arquivo, \
     tempfile.NamedTemporaryFile('w', delete=False) as out:
    for linha in arquivo:
        codigo = linha[8:14]
        
        if codigo == 'BANANA':
           print("ACHOU, CODIGO: " + codigo)
           # rebuild the line; the slice bounds come from the question and must span
           # exactly the two characters being replaced, or the line changes length
           linha = linha[:16] + "22" + linha[17:]
        else:
           print("NÃO ACHOU, CÓDIGO: " + codigo)

        out.write(linha) # write to the temporary file

# move the temporary file over the original
shutil.move(out.name, 'testando.txt')

Note that the write stays outside the if, since it must always be done. The only thing that changes is that linha is modified if you enter the if. If the code is not "BANANA", the line is written unmodified.

I used with to open the files because it ensures that they are closed at the end (even in case of error, which does not happen when you call close() directly, unless the call is inside a finally block).
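To illustrate the point about with and finally, here is a small sketch that runs both forms against a scratch file. The file name and the simulated error are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
caminho = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Manual version: close() has to live in a finally block to run on errors.
arquivo = open(caminho, "w")
try:
    arquivo.write("123456 BANANA 00 SP\n")
finally:
    arquivo.close()

# Equivalent with-block: the file is closed when the block exits,
# whether it ends normally or with an exception.
try:
    with open(caminho, "r") as arquivo2:
        primeira = arquivo2.readline()
        raise ValueError("simulated failure in the middle of processing")
except ValueError:
    pass

print(arquivo.closed, arquivo2.closed)
```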

I also used the tempfile module to create the temporary file and shutil.move to rename the file at the end.
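One detail that may matter for a 6GB file: by default NamedTemporaryFile creates its file in the system temporary directory, which can sit on a different filesystem, in which case shutil.move has to copy the data rather than rename it. Below is a minimal sketch of a variant that creates the temporary file next to the original (through the dir argument) and swaps it in with os.replace. The function name substituir_valor and the offsets are assumptions based on the sample lines, not part of the original answer.

```python
import os
import tempfile


def substituir_valor(caminho, nome="BANANA", novo="22"):
    """Rewrite `caminho`, setting the 2-char value on lines whose name matches."""
    pasta = os.path.dirname(os.path.abspath(caminho))
    with open(caminho, "r") as entrada, tempfile.NamedTemporaryFile(
        "w", dir=pasta, delete=False
    ) as saida:
        for linha in entrada:
            # Offsets follow the 19-character sample lines, not the real layout.
            if linha[7:13] == nome:
                linha = linha[:14] + novo + linha[16:]
            saida.write(linha)
    # Same directory means same filesystem, so this is a rename, not a copy.
    os.replace(saida.name, caminho)


# Usage on a small scratch file
pasta_demo = tempfile.mkdtemp()
caminho_demo = os.path.join(pasta_demo, "testando.txt")
with open(caminho_demo, "w") as f:
    f.write("123456 BANANA 00 SP\n123457 MACA   01 RJ\n123458 PERA   02 MG\n")

substituir_valor(caminho_demo)
with open(caminho_demo) as f:
    resultado = f.read()
print(resultado, end="")
```

The sketch does not clean up the temporary file if an error happens in the middle; a real script would probably want a try/except around the loop for that.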

Note also that the replace you were doing is not necessary: you can assign the rebuilt string to the variable linha itself.

  • 1

    That’s it. The difference is that that answer addressed everything that is wrong in the code and my comments only addressed the main question.

  • Thank you very much! Actually the indexes are different here, but they make sense in my test file. I just have one more problem, now straying a little from my example... Suppose BANANA is on line 2 and "00" is on line 1, that is, I first check the lower line and then need to "go up" to the other line to make the change... What is the best way to do this, since the Python for has no index?

  • @Luizgtvsilva This seems like a case for asking another question (but search the site first, there must be some example). In principle you have to keep the previous line as well, but as I said, search first because there should already be something like this, and if there is not, ask another question. I am not saying this out of spite, it is just to keep the site organized: one question per specific problem :-) Do not forget to accept this answer (if it solved your problem, of course)

  • All right, thanks! I'll search, and if anything comes up I'll ask.

  • 1

    Not to nitpick, but writing everything to another file and renaming it at the end is fine for up to 10 or 20MB. With 6GB, an operation that could take sub-milliseconds can take hours. As the lines have a fixed size, you can treat each line as a 1300-byte record and work "in place"; the operating system helps with that.

  • 1

    Of course that is not the best option either. The best would be to write a class and some 4 or 5 helper functions to get a "mini ORM" and throw all of this into SQLite, or just use SQLAlchemy.

  • 1

    @jsbueno I fully agree, and I make a mea culpa: I was going to edit the answer by talking about it, but in the end I ran out of time. I only managed to go back now to edit it, but as I know I won’t be able to suggest anything better than your answer, I think I’ll leave it at that :-) Anyway, thank you for the comments

  • 1

    And if anyone else lands here, it is important to keep in mind that if the lines are not all the same size (this file structure is a legacy of how systems worked on mainframes), then yes, the only way to do it is this one: create another file and change only the lines that matter.
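For the follow-up raised in the comments above (the match is on one line but the change goes on the line before it), one way to keep streaming is to hold each line back until the next one has been inspected. This is a minimal sketch, assuming the sample layout; the function name and the offsets are illustrative and do not come from the thread.

```python
def corrigir_linha_anterior(linhas, nome="BANANA", novo="22"):
    """Yield lines, changing the value of the line *before* each matching line."""
    anterior = None
    for linha in linhas:
        if anterior is not None:
            if linha[7:13] == nome:
                # Offsets follow the 19-character sample lines (assumption).
                anterior = anterior[:14] + novo + anterior[16:]
            yield anterior
        anterior = linha
    if anterior is not None:
        yield anterior  # the last line has no successor, so it is never changed


exemplo = [
    "123457 MACA   01 RJ\n",
    "123456 BANANA 00 SP\n",
    "123458 PERA   02 MG\n",
]
saida = list(corrigir_linha_anterior(exemplo))
print("".join(saida), end="")
```

Because it accepts any iterable of lines, the generator can be fed an open file directly, and its output can be written to the temporary file one line at a time.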


2

As the most efficient answer in this case is substantially different from the accepted answer, I will write a bit about it. It has already been pointed out in the comments that this is far from the appropriate structure for a mass of data of this size, even more so one that needs to be changed.

Because the lines have a fixed size, this is possible. But you must open the file in the special mode "rb+" (if you try to open it in "w" mode, the system really does erase the entire file), and then take some care to write each line back to the place where it belongs.
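As a bare-bones illustration of the "rb+" idea, the sketch below overwrites the two value bytes in place on a scratch file. The record length of 20 (19 characters plus a newline) and the offsets are assumptions taken from the sample lines; the real file would use its own record length.

```python
import os
import tempfile

COMP = 20  # 19 characters plus "\n" in the sample (assumption)

# Scratch file standing in for the real one
caminho = os.path.join(tempfile.mkdtemp(), "exemplo.txt")
with open(caminho, "wb") as f:
    f.write(b"123456 BANANA 00 SP\n123457 MACA   01 RJ\n123458 PERA   02 MG\n")

with open(caminho, "rb+") as f:
    indice = 0
    while True:
        f.seek(indice * COMP)       # jump to the start of record number `indice`
        registro = f.read(COMP)
        if len(registro) < COMP:    # end of file (or a truncated last record)
            break
        if registro[7:13] == b"BANANA":
            # Seek to the two value bytes of this record and overwrite them.
            f.seek(indice * COMP + 14)
            f.write(b"22")
        indice += 1

with open(caminho, "rb") as f:
    conteudo = f.read()
print(conteudo.decode("ascii"), end="")
```

Only the two bytes of the matching record are written; the file size and all other records stay as they were.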

Accessing the records in a structured way:

First, however, let's see how to access the data within each line in a way that keeps the system maintainable, without having to count columns on your fingers for every modification to find where each field goes and then put that in an if.

Doing this with hard-coded indexes into each line, embedded in multiple if statements along the way, makes things harder. This is a typical case where you can use Python's ability to customize attribute access in a class to create something really nice: a class in which you access and modify each column by its field name, while internally it keeps the data in a single string that can be written back to the original file or printed.

The answer to "How to extract information from a 'cnab' file using python?" covers how to create a class like this. The Campo and Base classes, as written there, work as follows: Campo is a "descriptor", an object that customizes access to the data through attributes, and Base holds the rest of the machinery that allows access to the fields:


class Campo:
    def __init__(self, inicio, final):
        self.inicio = inicio
        self.final = final

    def __set_name__(self, owner, nome):
        self.nome = nome

    def __get__(self, instance, owner):
        if not instance:
            return self
        return instance.dados_brutos[self.inicio: self.final]

class Base:
    def __init__(self, dados):
        self.dados_brutos = dados

    def __repr__(self):
        campos = []
        for name, obj in self.__class__.__dict__.items():
            if isinstance(obj, Campo):
                campos.append((name, getattr(self, name)))
        return "\n".join(f"{campo}:{conteudo}" for campo, conteudo in campos)


With this section of fewer than 25 lines, it is now possible to represent the lines of your file as a specific class:

class Frutas(Base):
    codigo = Campo(0, 6)
    nome = Campo(7, 13)
    valor = Campo(14, 16)
    uf = Campo(17, 19)
    

Check it out: I pasted exactly the code above, plus the excerpt you gave as an example, into an interactive Python session. See how it works:

    ...: exemplo = """\ 
    ...: 123456 BANANA 00 SP 
    ...: 123457 MACA   01 RJ 
    ...: 123458 PERA   02 MG""" 
    ...:      
    ...: class Frutas(Base): 
    ...:     codigo = Campo(0, 6) 
    ...:     nome = Campo(7, 13) 
    ...:     valor = Campo(14, 16) 
    ...:     uf = Campo(17, 19) 
    ...:                                                                                                                                                                                                                       

In [34]: x = [Frutas(linha) for linha in exemplo.split("\n")]                                                                                                                                                                  

In [35]: x[0].nome                                                                                                                                                                                                             
Out[35]: 'BANANA'

In [36]: x[0].valor = "23"                                                                                                                                                                                                     

In [37]: x[0]                                                                                                                                                                                                                  
Out[37]: 
codigo:123456
nome:BANANA
valor:23
uf:SP

In [38]: x[0].dados_brutos                                                                                                                                                                                                     
Out[38]: '123456 BANANA 00 SP'
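Note that Out[38] still shows "00". As listed above, Campo defines only __get__, so the assignment in In [36] is stored as an ordinary instance attribute that shadows the field, and dados_brutos is not touched. For the change to reach the underlying string, the descriptor needs a __set__ method. Below is a minimal sketch of one possible version; the names CampoEditavel and Frutas2 and the pad-or-truncate policy are assumptions, not part of the original answer.

```python
class CampoEditavel:
    """Fixed-width field descriptor that reads and writes a slice of dados_brutos."""

    def __init__(self, inicio, final):
        self.inicio = inicio
        self.final = final

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.dados_brutos[self.inicio:self.final]

    def __set__(self, instance, valor):
        largura = self.final - self.inicio
        # Pad or truncate so the record keeps its length (assumed policy).
        valor = str(valor).ljust(largura)[:largura]
        bruto = instance.dados_brutos
        instance.dados_brutos = bruto[:self.inicio] + valor + bruto[self.final:]


class Frutas2:
    nome = CampoEditavel(7, 13)
    valor = CampoEditavel(14, 16)

    def __init__(self, dados):
        self.dados_brutos = dados


f = Frutas2("123456 BANANA 00 SP")
f.valor = "23"
print(f.valor)
print(f.dados_brutos)
```

With a __set__ in place the descriptor takes priority over instance attributes, so reading the field and reading dados_brutos always agree.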

Manipulating file lines as records, with random access

Now, once a line is read, we have a way to work with it. The most interesting approach may be to have a class that represents an open file like this one, and then use that class to read or write one line at a time, without the code that does so needing to worry about where or how the data is saved. (Hence, if tomorrow you move the data storage somewhere else, be it an SQL database, NoSQL, etc., the code that uses the data does not even need to know about it.)


from pathlib import Path

class MapeadorTxt:
    def __init__(self, caminho_arquivo, classe_dados, comprimento_linha, codificacao="ASCII"):
        self.caminho = Path(caminho_arquivo)
        self.classe = classe_dados
        self.comp = comprimento_linha
        self.codificacao = codificacao
        
    def __enter__(self):
        self.arquivo = open(self.caminho, "rb+")
        return self  # lets "with MapeadorTxt(...) as m:" bind the mapper
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        self.arquivo.close()
        self.arquivo = None
    
    def __getitem__(self, index):
        if index >= len(self):
            raise IndexError
        self.arquivo.seek(index * self.comp)
        return self.classe(self.arquivo.read(self.comp).decode(self.codificacao))
    
    def __setitem__(self, index, valor):
       self.arquivo.seek(index * self.comp)
       v = valor.dados_brutos.rstrip("\n").ljust(self.comp - 1) + "\n"
       self.arquivo.write(v.encode(self.codificacao))
       
    def __len__(self):
        return self.caminho.stat().st_size // self.comp
        
    def __repr__(self):
        return f"Mapeador do arquivo {self.caminho} para a classe {self.classe.__name__}, com {len(self)} registros"
    

And with this class you can do what you propose in the question, using it in a with block. We use the enumerate function in the for to get the index in addition to the record (it is in the variable i), so if you want to access the record on the previous line, just use m[i - 1], for example.

Here it is working, changing the banana value to "42":

In [57]: m = MapeadorTxt("exemplo.txt", Frutas, 20)  # 19 characters plus the trailing "\n"

In [58]: with m: 
    ...:     for i, fruta in enumerate(m): 
    ...:         if fruta.nome.strip() == "BANANA": 
    ...:             fruta.valor = "42" 
    ...:             m[i] = fruta 
    ...:                                                                                                                                                      

In [59]: cat exemplo.txt                                                                                                                                      
123456 BANANA 42 SP
123457 MACA   01 RJ
123458 PERA   02 MG

The code above is not perfect. In particular, the field class should handle text encoding and decoding, and keep the internal values in bytes or in a bytearray.
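One possible shape for that improvement is sketched below: the record is kept as a bytearray and the field encodes on write and decodes on read. The names CampoBytes and RegistroFruta, the ASCII default and the pad-or-truncate policy are assumptions made for the illustration.

```python
class CampoBytes:
    """Field over a bytearray record; encodes on write, decodes on read."""

    def __init__(self, inicio, final, codificacao="ascii"):
        self.inicio = inicio
        self.final = final
        self.codificacao = codificacao

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.dados_brutos[self.inicio:self.final].decode(self.codificacao)

    def __set__(self, instance, valor):
        largura = self.final - self.inicio
        # Widths are counted in bytes; pad with spaces or truncate to fit.
        dados = str(valor).encode(self.codificacao).ljust(largura)[:largura]
        # Slice assignment of equal length edits the bytearray in place.
        instance.dados_brutos[self.inicio:self.final] = dados


class RegistroFruta:
    nome = CampoBytes(7, 13)
    valor = CampoBytes(14, 16)

    def __init__(self, dados):
        self.dados_brutos = bytearray(dados)


r = RegistroFruta(b"123456 BANANA 00 SP\n")
r.valor = "42"
print(r.nome, r.valor)
print(bytes(r.dados_brutos))
```

Keeping the record in bytes means it can be written back to a file opened in "rb+" mode without a separate encode step. With a multi-byte encoding, truncating by byte count could split a character, so this sketch is only safe for single-byte encodings.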

And of course, this does not solve the problem of access time: you still have millions of records (a 6GB file at 1,300 characters per line is roughly 4.6 million lines), which have to be read one by one for a comparison like this. Put the data in an SQL database and you can search by dates, etc...
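As a rough idea of what the SQL route could look like, here is a sketch using the sqlite3 module from the standard library on the three sample lines. The table name, column names and offsets are assumptions based on the sample, and an in-memory database stands in for a real database file.

```python
import sqlite3

linhas = [
    "123456 BANANA 00 SP",
    "123457 MACA   01 RJ",
    "123458 PERA   02 MG",
]

con = sqlite3.connect(":memory:")  # a file path would be used for real data
con.execute("CREATE TABLE frutas (codigo TEXT, nome TEXT, valor TEXT, uf TEXT)")
con.executemany(
    "INSERT INTO frutas VALUES (?, ?, ?, ?)",
    # Offsets follow the sample lines (assumption about the layout).
    ((l[0:6], l[7:13].strip(), l[14:16], l[17:19]) for l in linhas),
)
con.execute("CREATE INDEX idx_nome ON frutas (nome)")

# The update from the question becomes a single statement that can use the index.
con.execute("UPDATE frutas SET valor = ? WHERE nome = ?", ("22", "BANANA"))
resultado = con.execute(
    "SELECT codigo, nome, valor, uf FROM frutas ORDER BY codigo"
).fetchall()
print(resultado)
con.close()
```

If a legacy system still needs the fixed-width format, the rows can be formatted back into lines when the final file has to be generated.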

  • I learned a lot from your answer, thank you very much! But I will not apply this solution, since it is for a one-off task at the company that I will never use again! Still, your explanation has several concepts that were very useful to me! Thanks :)

  • Yes, the second part of the answer shows how to access the file randomly. If the data is valuable, and this whole code has fewer than 100 lines, I would recommend it even for a one-off case. What is the chance of you getting a field boundary wrong in an if with 1,300-byte records? Zero? And what would it cost if, in some value column, all the values came out divided by 10?
