How do I save my work in Python so I don't have to run everything again every time I open the notebook?

I'm working on a data analysis project in Jupyter with about 5 different data frames (imported from CSVs) that are related to each other through one key or another. From these data frames and their relationships, inside Jupyter, I check the consistency of the data, perform analyses, create new data frames and so on.

My problem is: every time I close Jupyter and resume work later, only the outputs and code of each notebook are saved, but to get the generated objects back I always need to run everything again from the beginning.

I did some research and saw ways to serialize the objects one by one (with pickle, Feather or some other format), and then load them back one by one as well, but that seems extraordinarily laborious, almost as much work as running everything again...

I wonder if there's some other way to save the whole session, something like a workspace image, like .RData in R, so that when I open Jupyter I have easy access to everything that was being developed.

Any information, or sharing of how you save your own work, will help me a lot. Thank you.

1 answer

Yes - the way to do this is in fact to use serialization, such as pickle, but it's not as much work as it seems. (Also, Jupyter has a command to run all cells in sequence - are you using it? That alone may be enough in many cases.)

The fact is that working in Jupyter sits somewhere between traditional programming and a spreadsheet - when analyzing data, many cells are used as drafts, where you change parameters and expressions and won't want to run them again - and there are other cells that serve as a base, where you read your data sources, assemble the dataframes, etc., which are probably what you do want to run every time. So you can see why simply "run all cells" might not be practical - the cells used for drafts and experimentation would be executed as well.

One way out, before you even need to serialize anything, is to bring in elements closer to "programming", namely the use of functions.

Notebook cells, which can be re-executed at any time, play some of the roles that functions normally play - so we tend to leave all the code "loose" in the cells - and that is what leads to this need to re-execute everything.

However, if you put all the code that creates the dataframes you need into functions, and tie it all together in a single initialization function, you will only need to call that one function - no selecting cells to run, etc.

So, more concretely, let's say you have:

cell 1:

import pandas as pd, numpy as np
# other imports

df1 = pd.read_csv(...)
# other steps to structure df1

cell 2:


df2 = pd.read_csv(...)
# steps to structure df2

And so on. You can wrap the contents of these cells in functions, ending up with something like:

cell 1:

# isolate all the imports in a single cell
import pandas as pd, numpy as np

cell 2:

def cria_df1():
    global df1
    df1 = pd.read_csv(...)
    # remaining steps to structure df1

... and so on, up to cell n:

def inicializa():
    cria_df1()
    cria_df2()
    cria_df3()
    cria_df4()
    cria_df5()

Okay - now when you resume work, all you have to do is run the imports and the function-definition cells, and then, in the very cell where you will start working, call

inicializa()

(The call can also be placed in the same cell where inicializa is defined, of course - then you only need to run that one cell.)
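For instance, the first working cell of a new session could look like this (just a sketch - df1.head() is only an illustration of picking the analysis back up):

inicializa()   # rebuilds df1 ... df5 from the CSVs
df1.head()     # and you are back to exploring right away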


Now, if the problem is not just the number of steps and having to run them all by clicking through the cells, but rather that there is so much processing involved that the initialization takes more than a few seconds, then it can be worth serializing the dataframes and loading the already-processed data back.

For this, Python's pickle module is enough. I suggest structuring the saving and loading of the serialized data in functions as well, for the same reasons as above.

cell m:

import pickle

def salva_tudo():
    # protocol=-1 uses the highest pickle protocol available
    with open("dados_mastigados.pickle", "wb") as f:
        pickle.dump((df1, df2, df3, df4, df5), f, protocol=-1)

def carrega_tudo():
    global df1, df2, df3, df4, df5
    with open("dados_mastigados.pickle", "rb") as f:
        df1, df2, df3, df4, df5 = pickle.load(f)

That's it - just call carrega_tudo() and the data in the dataframes is restored in memory, at the point it was when salva_tudo() was called in a previous session.
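In practice, the flow across sessions looks something like this (a sketch - it assumes the notebook and dados_mastigados.pickle live in the same folder, which is where open() with a relative path will look):

# end of a work session:
salva_tudo()      # writes dados_mastigados.pickle next to the notebook

# start of a new session, after running the imports and function-definition cells:
carrega_tudo()    # df1 ... df5 are back in memory, already processed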


Modularizing part of the code into functions is the key point - but in addition, the Jupyter notebook itself has several features, independent of Python, that can improve the workflow as well - including running entire Python programs kept in files, marking cells not to be executed on "run all", etc.

This article covers a lot of these features:

https://towardsdatascience.com/how-to-effortlessly-optimize-jupyter-notebooks-e864162a06ee
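For example, one of those features is the %run magic: if you move the initialization functions into a plain .py file, a single cell can execute that file and make everything it defines available in the notebook (the file name prepara_dados.py below is just a hypothetical example):

%run prepara_dados.py   # runs the script and copies what it defines into the notebook's namespace
inicializa()            # the functions defined in the file are now available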

  • Thank you very much, you helped a lot! In fact, I'm new to Python, and although I'm at the beginning of the analysis and the processes don't take that long to re-run yet, I was worried about when there will be many transformations and procedures that take longer to process. As for the disorganized code I mentioned, since many cells are really only used for testing/drafts, do you have any suggestion for organizing them better? Or would you recommend some other software?

  • For the kind of data exploration you do with dataframes, the notebook really is the best tool there is. The best path is to gain familiarity with functions, as I mentioned in the answer - but there are several other Jupyter features as well, including macros and directives to keep some cells from running automatically when you run everything - I've added a link to a good article about this in the answer.
