Yes - the way to do it is to actually use serialization, such as pickle, and it is not that much work.
(Actually, Jupyter has a shortcut to run all cells in sequence - are you using it? It shouldn't be too much trouble.)
The thing is, working with Jupyter is a middle ground between traditional programming and a spreadsheet. When analyzing data, many of the cells are ones you use as drafts, changing parameters and expressions, and you will not want to run them again. Other cells serve as the base, where you read your data sources, assemble the dataframes, etc., and those are probably the ones you want to run every time. So it is understandable that a plain "run all cells" might not be practical, since the cells used for drafts and experimentation would be processed as well.
One way out, before you need to resort to serializing things, is to bring in elements that are closer to "programming", by using functions.
Notebook cells, which can be re-executed at any time, play some of the roles that functions normally have - so we tend to leave all the code "loose" in the cells, and that is what leads to the need to re-execute everything.
However, if you put all the code that creates the dataframes you need into functions, and tie them all together in a single initialization function, you only need to run that one function - without selecting cells to run, etc.
So, more concretely, let's say that you have:
cell 1:
import pandas as pd, numpy as np
# other imports
df1 = pd.read_csv(...)
# other steps to structure df1
cell 2:
df2 = pd.read_csv(...)
# steps to structure df2
And so on.
You can move the contents of these cells into functions, ending up with something like:
cell 1:
# isolate all the imports in a single cell
cell 2:
def cria_df1():
    global df1
    df1 = pd.read_csv(...)
    # remaining steps
...
cell n:
def inicializa():
    cria_df1()
    cria_df2()
    cria_df3()
    cria_df4()
    cria_df5()
Now, when you start a session, all you have to do is run the imports and, in the very cell where you start working, call
inicializa()
(The call can of course also sit in the same cell where inicializa is defined - then you only need to run that one cell.)
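As a runnable sketch of the pattern above: the two in-memory CSV strings and the "soma" column are hypothetical stand-ins for the real files and the real structuring steps, and only two dataframes are built to keep it short.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the real data files
CSV_1 = "a,b\n1,2\n3,4\n"
CSV_2 = "x,y\n10,20\n"


def cria_df1():
    global df1
    df1 = pd.read_csv(io.StringIO(CSV_1))
    # example of a "structuring" step: add a derived column
    df1["soma"] = df1["a"] + df1["b"]


def cria_df2():
    global df2
    df2 = pd.read_csv(io.StringIO(CSV_2))


def inicializa():
    # single entry point: rebuilds every base dataframe
    cria_df1()
    cria_df2()


inicializa()
print(df1.shape, df2.shape)
```

After the single call to inicializa(), df1 and df2 exist as notebook-level names, so the draft cells can use them without any of the setup cells having been run one by one.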
Now, if the problem is not just the number of steps and having to run everything by clicking through the cells, but rather that there is so much processing involved that initialization takes more than a few seconds, it can be worth serializing the dataframes and loading the already processed data back.
For this, Python's pickle module is enough. I suggest wrapping the saving and loading of the serialized data in functions, for the same reasons as above.
cell m:
import pickle

def salva_tudo():
    # protocol=-1 selects the highest pickle protocol available
    with open("dados_mastigados.pickle", "wb") as f:
        pickle.dump((df1, df2, df3, df4, df5), f, protocol=-1)

def carrega_tudo():
    global df1, df2, df3, df4, df5
    with open("dados_mastigados.pickle", "rb") as f:
        df1, df2, df3, df4, df5 = pickle.load(f)
That's it - just call carrega_tudo() and the data in the dataframes is restored in memory to the state it was in when salva_tudo() was called in a previous session.
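The two approaches can also be combined. The sketch below is one possible arrangement, not part of the original recipe: a hypothetical helper, prepara_dados(), loads the pickle when it exists and otherwise runs the slow initialization and saves the result. The temporary cache path and the tiny in-memory CSV data are assumptions made only so the example runs on its own.

```python
import io
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical cache location; a temp directory keeps the sketch self-contained
CACHE = os.path.join(tempfile.mkdtemp(), "dados_mastigados.pickle")


def inicializa():
    # stand-in for the slow step that reads and structures the data
    global df1, df2
    df1 = pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"))
    df2 = pd.read_csv(io.StringIO("x,y\n10,20\n"))


def salva_tudo():
    with open(CACHE, "wb") as f:
        pickle.dump((df1, df2), f, protocol=pickle.HIGHEST_PROTOCOL)


def carrega_tudo():
    global df1, df2
    with open(CACHE, "rb") as f:
        df1, df2 = pickle.load(f)


def prepara_dados():
    # hypothetical helper: use the cache if present, otherwise rebuild and save
    if os.path.exists(CACHE):
        carrega_tudo()
        return "cache"
    inicializa()
    salva_tudo()
    return "rebuilt"


primeira = prepara_dados()  # no cache file yet, so the data is rebuilt
segunda = prepara_dados()   # cache file now exists, so it is loaded
```

With this arrangement, deleting the pickle file is enough to force a full rebuild the next time prepara_dados() runs, which is needed whenever the source data or the structuring steps change.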
Modularizing part of the code with functions is essential - but beyond that, Jupyter Notebook itself has several features, independent of Python, that also help improve the workflow, including running entire Python programs stored in files, marking cells to be skipped on "run all", and so on.
This article covers a lot of these features:
https://towardsdatascience.com/how-to-effortlessly-optimize-jupyter-notebooks-e864162a06ee
Thank you very much, that helped a lot! In fact, I'm new to Python, and although I'm at the beginning of the analysis and the processes don't take long to rerun yet, I was worried about the point where there will be many transformations and procedures that take longer to process. Sorry about the disorganized code I mentioned; many cells really are used only for testing/drafts. Do you have any suggestions for organizing them better? Or would you recommend some other software?
– Paulo Pasquale
For the kind of data exploration that is done with dataframes, the notebook is the best thing there is. It's best to get comfortable with functions, as I mentioned in the answer - but Jupyter has several other features available to you, including macros and directives to skip some cells automatically when you run everything - I'll add a link to a good article about it to the answer.
– jsbueno