How to delete an entry from a Python file without having to read the entire file?

I have a file with the following entries:

Ana
Joao
Pedro
José
....

And I need to delete the line with the name Pedro. It would be easy for me to read the whole file, store it in a list, delete Pedro, and rewrite the file:

nomes = open('nomes.txt', 'r').readlines()
nomes.remove('Pedro\n')                    # drop the unwanted line
open('nomes.txt', 'w').writelines(nomes)   # writelines, since nomes is a list

But this file is huge, and speed is essential in this task. Is there any way to read through the file and, when I find the entry I want, just delete that line and continue reading?:

nomes = open('nomes.txt', 'r')
for linha in nomes:
    if linha == 'Pedro\n':
        deleta(linha)  # hypothetical in-place delete
  • The problem with that solution is that I have to read the whole file and then write it all again. In my case I have to delete Pedro without having to read José.

  • Do you just need to delete the line?

  • I want to delete the line without having to load the whole file, and then continue reading the file normally.

  • Delete and continue reading? That doesn't make sense. If you need to save the file without the line, you will have to read the whole thing; but if it is only for reading, just ignore the line. Remember that to read the file you do not need readlines, which stores all the content in a list: the file object returned by open can itself be iterated line by line, without weighing on memory.

  • It's because this script will be run several times, and for each entry there is a query to Mongo. When the script hits an error I need to pick up where I left off, but it takes a long time to get back to that point, because every entry triggers a database query. So I had the idea of deleting each line after it was processed; then I can resume without redoing the database queries. My current solution is to save the last read position with seek and tell (a sketch of that idea follows these comments).

  • You should use try...except so the script does NOT stop on error. All the rest of your doubt is irrelevant, since you are asking about a supposed problem "Y" (deleting lines from a text file) when you actually have a problem "X": having to re-run thousands of database queries every time a record is not found.

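As an aside, here is a minimal sketch of the seek/tell checkpoint idea mentioned in the last comment. processa_linha and the checkpoint file name are illustrative placeholders, not anything from the question:

import os

def processa_com_checkpoint(arquivo_texto, arquivo_posicao='posicao.txt'):
    # Resume from the last saved byte offset, if any
    posicao = 0
    if os.path.exists(arquivo_posicao):
        with open(arquivo_posicao) as f:
            posicao = int(f.read() or 0)
    with open(arquivo_texto) as entrada:
        entrada.seek(posicao)
        while True:
            # readline instead of "for linha in entrada": tell() is
            # disabled while a text file is being iterated with next()
            linha = entrada.readline()
            if not linha:
                break
            processa_linha(linha)  # hypothetical: the Mongo query goes here
            with open(arquivo_posicao, 'w') as f:
                f.write(str(entrada.tell()))  # persist progress after each line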

1 answer

Yes - you need to read the whole file, change what you want in memory, and save it again.

This is the recommended practice.

The main reason is that this is an unstructured text file: each line has a different length in bytes, and ordinary files, as provided by the operating system, do not let you change the size of a small piece in the middle of the file. They only let you overwrite a few bytes, and the replacement has to be exactly the same size.

So, technically, you could make your program write a space or "*" over each letter you want to delete in the original file, but the performance would not be the best possible, and the minimum number of bytes you can read or write to a file is generally 4096 anyway.

That is: you would end up with a complex, error-prone program that will lose your data if it is interrupted during execution (by a system shutdown or other failure), and even if on the Python side you changed only 10 or 15 bytes, the I/O to disk would still be 4096 bytes.
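For illustration only, a minimal sketch of that in-place overwrite; apaga_no_lugar is a hypothetical helper that blanks out one line while keeping the file length unchanged (again: not recommended, for the reasons above):

def apaga_no_lugar(arquivo_texto, alvo):
    # Open in binary update mode so we can seek back and overwrite in place
    with open(arquivo_texto, 'r+b') as f:
        posicao = f.tell()
        while True:
            linha = f.readline()
            if not linha:
                break
            if linha.rstrip(b'\n') == alvo.encode('utf-8'):
                f.seek(posicao)
                # Overwrite with spaces, keeping the newline, so the
                # total size in bytes does not change
                f.write(b' ' * (len(linha) - 1) + b'\n')
                break
            posicao = f.tell()

Any code that later reads the file then has to skip the blanked-out lines, which is exactly the kind of extra complexity the paragraph above warns about.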

You say the file is "too big", but unless it has far more than 100,000 names of that kind (about 1 MB, at roughly 10 bytes per line) and you are doing several operations of that type per minute, the impact would be imperceptible in practice.

On the other hand, it is true that a text file with thousands of names in sequence is a very inefficient data structure: long before reaching that point you should be using a proper mechanism to store the data efficiently, especially if the data is critical (and even more so if performance matters).

The Python language ships with the sqlite database ready for use, and for access from a single process its efficiency is comparable to big-name databases like PostgreSQL and Oracle. Managing data such as a list of names, and other data associated with it, in sqlite can give you a performance gain of 1,000 to 50,000 times compared to keeping the data in a plain .txt file.
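For example, a minimal sketch of the same data in sqlite (the database file and table names are illustrative): deleting "Pedro" becomes a single indexed statement, with no rewriting of the whole file:

import sqlite3

con = sqlite3.connect('nomes.db')
con.execute('CREATE TABLE IF NOT EXISTS nomes (nome TEXT PRIMARY KEY)')
con.executemany('INSERT OR IGNORE INTO nomes VALUES (?)',
                [('Ana',), ('Joao',), ('Pedro',), ('José',)])
con.commit()

con.execute('DELETE FROM nomes WHERE nome = ?', ('Pedro',))  # no file rewrite
con.commit()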

With the tips you gave in the comments, and "guessing" the code you have there, it is possible to do the following:

import os

def processa(arquivo_texto):
    nomes_pra_remover = set()
    with open(arquivo_texto) as arquivo:
        for linha in arquivo:
            try:
                funcao_que_consulta_o_mongo(linha)
            except Exception as error:
                # use some logging mechanism here - even a plain print will do
                nomes_pra_remover.add(linha)
    limpar_arquivo_texto(arquivo_texto, nomes_pra_remover)

def limpar_arquivo_texto(arquivo_texto, nomes_pra_remover):
    nome_novo = arquivo_texto + "_novo"
    with open(arquivo_texto) as entrada, open(nome_novo, "wt") as saida:
        for linha in entrada:
            if linha not in nomes_pra_remover:
                saida.write(linha)
    os.remove(arquivo_texto)
    os.rename(nome_novo, arquivo_texto)

This solution removes all the names that don't work for you, and it does so by reading and writing the entire file only once, not once per name. On a normal PC, even with a file in the 10 MB range (~1,000,000 names), the task should run in under 1 second. The shuffle with the file names in the dedicated function ensures that even if execution is interrupted you do not lose your data: at every moment you have the original file, except at the instant it is deleted and the new file is renamed to the original name.
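Assuming funcao_que_consulta_o_mongo wraps the real Mongo lookup, the whole thing runs with a single call:

processa('nomes.txt')  # rewrites nomes.txt without the lines that failed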

  • I understood what you did, but even so I ended up going back to my initial solution of marking the position in the file; I think it is more practical.
