How to implement journaling in Python?


I need to perform a series of operations involving binary files (I can't use a database), and I need to ensure they complete successfully even if there is a failure in the middle of an operation. For this, I see no alternative but to implement a journaling system manually. I started writing some code, but I'm not sure it is correct (i.e. whether there is no case where a failure in the middle of a task can leave things in an inconsistent state, including a failure while actually writing the journal).

Is there anything ready-made in that direction? And, failing that, is there any problem with my attempt below?

import json
import os

def nova_tarefa(args):
    refazer_journal() # Redoes whatever was left unfinished last time (if anything)
    with open('args.json', 'w') as f:
        json.dump(args, f) # Prepares the new tasks
    with open('journal.txt', 'w') as f:
        f.write('comecar\n') # Records the new tasks in the journal
    refazer_journal() # Performs the new tasks

def refazer_journal():
    try:
        with open('journal.txt', 'r') as f:
            passos = [x.strip() for x in f.readlines() if x.strip()]
    except:
        if not os.path.exists('journal.txt'):
            passos = []
        else:
            raise
    if not passos: # If there is nothing unfinished, stop here
        return
    with open('args.json', 'r') as f:
        args = json.load(f)

    # Performs the tasks that have not yet been marked as completed
    for indice, tarefa in enumerate(args['tarefas']):
        if len(passos) <= indice + 1:
            realizar_tarefa_indepotente(tarefa)
            with open('journal.txt', 'a') as f:
                f.write('\ntarefa concluida\n')

    with open('journal.txt', 'w') as f:
        pass # Everything succeeded: empty the journal

(Note: the actual text written to the journal - comecar, tarefa concluida - is not important; if even a single character is written on a line, the step is already considered successful.)

Test code (it has always worked for me, including when I edit journal.txt by hand to inject a series of errors):

def realizar_tarefa_indepotente(tarefa):
    if 'falhar' in tarefa: # the 'falhar' key marks a task that should fail
        raise Exception('Tarefa falhou!')
    print('Realizando tarefa: ' + tarefa['nome'])

tarefa_ok = {'tarefas':[{'nome':'foo'}, {'nome':'bar'}, {'nome':'baz'}]}
falha_3 = {'tarefas':[{'nome':'foo'}, {'nome':'bar'}, {'nome':'baz','falhar':True}]}
falha_2 = {'tarefas':[{'nome':'foo'}, {'nome':'bar','falhar':True}, {'nome':'baz'}]}
falha_1 = {'tarefas':[{'nome':'foo','falhar':True}, {'nome':'bar'}, {'nome':'baz'}]}
falha_1_3 = {'tarefas':[{'nome':'foo','falhar':True}, {'nome':'bar'}, {'nome':'baz','falhar':True}]}
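A hypothetical test run (my own illustration of how the scenarios above behave):

nova_tarefa(tarefa_ok) # prints foo, bar and baz; the journal ends up empty
nova_tarefa(falha_2)   # prints foo, then raises on bar; the journal keeps
                       # foo's confirmation, so a rerun resumes from bar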
  • Note: it is my understanding that code review is currently on topic here. If anyone disagrees, please speak up on Meta.

  • I didn't want to post this as an answer, since it's just a link, but I think it's worth documenting. Does that solution satisfy you?

  • @Felipeavelar Most of the packages I found when searching for "python journal" referred not to journaling [in the file system sense] but to implementing some kind of "diary"... I believe that link is about the same thing, since it is a plugin for Plone (a CMS).

4 answers

7


I wrote an article that describes how to do this: http://epx.com.br/artigos/arqtrans.php and someone even made a Python implementation based on the article: https://pypi.python.org/pypi/acidfile/1.2.0

The basic technique is to write at least two files: first one, then the other, in synchronous mode, as described in the other answers.

The "spice" of the scheme is to add timestamp and a sum (hash) in the two files. This way we can know which is the most recent file - by timestamp - can check if the file is intact (by hashing).

The file system must implement journaling of at least the metadata (usually the case; journaling the data as well costs performance). This guarantees that at least the directory and the file remain readable. If the data is corrupted, you detect it through the hash.
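For illustration only, a minimal sketch of the technique (the file names are illustrative; the acidfile package above is a complete implementation):

import hashlib
import json
import os
import time

COPIAS = ['dados.0.json', 'dados.1.json']  # illustrative names

def gravar(dados):
    # Writes the payload to each copy in turn, in synchronous mode, with
    # a timestamp and a hash so readers can pick the newest intact copy
    corpo = json.dumps(dados)
    registro = {'timestamp': time.time(),
                'hash': hashlib.sha256(corpo.encode('utf-8')).hexdigest(),
                'dados': corpo}
    for nome in COPIAS:
        with open(nome, 'w') as f:
            json.dump(registro, f)
            f.flush()
            os.fsync(f.fileno())  # wait until this copy reaches the disk

def ler():
    # Returns the newest copy whose hash still matches its payload
    intactos = []
    for nome in COPIAS:
        try:
            with open(nome, 'r') as f:
                registro = json.load(f)
            corpo = registro['dados']
            ok = hashlib.sha256(corpo.encode('utf-8')).hexdigest() == registro['hash']
        except (IOError, ValueError, KeyError):
            continue  # missing, unparseable or truncated copy: skip it
        if ok:
            intactos.append(registro)
    if not intactos:
        raise IOError('nenhuma copia intacta')  # no intact copy survived
    mais_recente = max(intactos, key=lambda r: r['timestamp'])
    return json.loads(mais_recente['dados'])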

  • Exactly what I was looking for! Better than reinventing the wheel... I will test the implementation you mentioned, but I have already looked at the source and everything seems fine to me.

  • I ended up using your solution: I now save args.json as an ACIDFile, making sure to only update it after confirming that the previous task completed successfully (see @utluiz's answer). It greatly simplified the code.

4

  • An easily visible problem in your implementation is that the way you check whether journal.txt exists after the read fails is subject to race conditions; see How to check if a file exists using Python.

  • In addition, the ideal is not to catch "naked" (bare) exceptions, but to specify the exact exception type you expect, e.g. IOError, OSError, etc. Likewise, when raising an exception, it is good to use one with a specific name related to the error type. (See the sketch after this list.)

  • As an alternative for iterating over the tasks, rather than iterating over the result of enumerate(), you can simply iterate over the list:

    passos = [{'nome':'foo','falhar':True}, {'nome':'bar'}, {'nome':'baz'}]
    for tarefa in passos:
        realizar_tarefa_indepotente(tarefa) 
    
  • The way you check the 'falhar' flag works, but it is not very explicit. I would recommend switching to if 'falhar' in tarefa.keys(), if you are simply checking whether the key exists, or to if tarefa.get('falhar') if you want to make sure that the value of tarefa['falhar'] is truthy.
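For illustration, a rough sketch of how the exception and 'falhar' suggestions might look applied to the question's code (ler_journal is just an illustrative name; errno is one way to tell "file does not exist" apart from other I/O errors):

import errno

def ler_journal():
    # EAFP: attempt the read and catch only the specific exception,
    # instead of checking os.path.exists() after the fact
    try:
        with open('journal.txt', 'r') as f:
            return [x.strip() for x in f.readlines() if x.strip()]
    except IOError as e:
        if e.errno == errno.ENOENT:
            return [] # no journal yet: nothing unfinished
        raise # any other error (e.g. access denied) is still fatal

def realizar_tarefa_indepotente(tarefa):
    if tarefa.get('falhar'): # explicit: only True if the value is truthy
        raise Exception('Tarefa falhou!')
    print('Realizando tarefa: ' + tarefa['nome'])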

  • A question about item #1: doesn't the try already do exactly that when it attempts to read the file? Or should there be another nested try block?

  • No need; you can simply remove the if block, since if an exception occurs inside the try the file probably does not exist. Following the next bullet, you could catch different exception types and treat each case separately (e.g. file does not exist, access denied, etc.).

  • Okay, I get it. Thank you!

  • About item #3, I think the idea is not to re-perform the steps that have already been executed. For example, if there are 3 steps recorded in journal.txt, the tasks should be carried out from the 4th onwards. In that case, unless the index is used for something else in the real code, maybe something like for tarefa in args['tarefas'][len(passos):]:.

  • Got it. My personal preference would be to iterate over those items anyway and then skip them. Another alternative is to filter the list (e.g.: args['tarefas'] = [x for x in args['tarefas'] if x.get('falhar') is False]), but then you lose the index information and the count of items already done, and it may take a while if the number of items is large.

  • Or even: since your journal.txt effectively only keeps the number of tasks that were performed successfully, you could simply record that number, incrementing it on each success and using it instead of len(passos) to skip the tasks that already succeeded.

  • [on the last comment] That is not possible, because there is the risk of corrupting journal.txt if a failure occurs while rewriting it. The way I did it, either a new line is appended to the file (confirming the task) or it is not (requiring the task to be redone later). [on #1 and #2] Good suggestions, I will apply them! [on #3] @utluiz's suggestion is good. I need to skip tasks already done because, although each individual task is idempotent in the correct system state, it is not idempotent if the system has already changed state (i.e. if the next task has already started).


3

Guaranteed writing of files

My first thought when it comes to guaranteeing file processing was about buffered data.

See what the documentation for file.flush() says:

Note: flush() does not necessarily write the file's data to disk. Use flush() followed by os.fsync() to ensure this behavior. (free translation)

I suppose the tasks can involve a much larger volume of data than journal.txt. If a power failure occurs at the end of the refazer_journal() method, the journal may already have been written to disk while the task data is still in the buffer.

I understand that the journal data is written afterwards, but I do not believe there is any guarantee that the buffers will be flushed to disk in order by the operating system and the hardware.
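For example, the snippet in the question that confirms a task in the journal could become something like this (my adaptation):

import os

with open('journal.txt', 'a') as f:
    f.write('\ntarefa concluida\n')
    f.flush()             # flushes Python's buffer to the operating system
    os.fsync(f.fileno())  # asks the OS to flush its cache to the disk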

  • Good observation! I will do that and, to be safe, also verify the newly written files to confirm that everything is ok before confirming the task. (And, if I understood correctly, it is important that os.fsync also be used after writing the journal, to prevent the next task from starting before the previous task's completion record is durable, right?)

  • @mgibsonbr Yes, I believe the synchronization should be done after every write.

  • I read somewhere (I think it was in Linux System Programming) that sync should be called 2 or 3 times, as is the custom of UNIX admins on the command line. I don't remember the exact reason; it is something like: the first sync only starts the physical write, and the second one effectively blocks until the first has finished (if it has not finished yet).

  • @epx Wow! Now this is getting too low-level for me... haha

2

Create journaling based on filesystem files. I suggest using an epoch-based naming scheme... Writing and deleting items in a single shot is always safer than manipulating them thousands of times. The journal itself can become corrupted, since atomicity is hard to guarantee in Python for long-running processes.
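A minimal sketch of what I mean, assuming a journal/ directory and one write-once file per pending item (names are illustrative):

import os
import time

def anotar_item(descricao):
    # one write-once file per pending item, named by its epoch timestamp
    nome = os.path.join('journal', '%d.item' % int(time.time() * 1000000))
    with open(nome, 'w') as f:
        f.write(descricao)
        f.flush()
        os.fsync(f.fileno())
    return nome

def concluir_item(nome):
    os.remove(nome) # deleting the file marks the item as done

def itens_pendentes():
    # whatever files remain belong to items that never completed
    return sorted(os.listdir('journal'))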
