I load a CSV file with more than 3 million rows (about 770 MB) into pandas and need to convert a column that is stored as a string. Below is the column 'lnBins', which came in as a string when read from the CSV (what is the best format for saving this data to CSV?), along with the columns lnBin1 to lnBin5 produced by the reshapeBin function further down.
    tempFrame[['lnBins', 'lnBin1', 'lnBin2', 'lnBin3', 'lnBin4', 'lnBin5']].tail(2)

                                                        lnBins           lnBin1           lnBin2           lnBin3           lnBin4           lnBin5
    2445169  (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...  (0, 1, 1, 0, 0)  (0, 1, 0, 1, 1)  (1, 1, 0, 0, 0)  (1, 1, 1, 1, 1)  (0, 1, 1, 0, 1)
    2445170  (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...  (0, 1, 1, 0, 0)  (0, 1, 0, 1, 1)  (1, 1, 0, 0, 0)  (1, 1, 1, 1, 1)  (0, 1, 0, 1, 1)
As you can see in the reshapeBin function, I need to chain several operations:
eval(), np.array(), .reshape(5, 5), [num], .tolist(), tuple()
I use eval() to convert each table cell from a string to a tuple, convert that to an array and reshape it, pick one row of the array with [num], convert it to a list, and finally convert it to a tuple so it can be stored in the table and saved to CSV again.
The function works, but I think it can be improved to process faster:
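    import numpy as np

    def reshapeBin(x, num):
        # Parse the string into a tuple, reshape it to 5x5, and return row `num` as a tuple.
        return tuple(np.array(eval(x)).reshape(5, 5)[num].tolist())

    for n in range(0, 5):
        tempFrame['lnBin' + str(n + 1)] = tempFrame['lnBins'].apply(reshapeBin, num=n)
        print('finished', n)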
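One possible way to speed this up, sketched below, is to parse the whole column in a single pass instead of calling eval() five times per row. This assumes every value in 'lnBins' is a single digit 0 or 1 and that every row holds exactly 25 values; `split_bins` and the small `demo` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def split_bins(series, rows=5, cols=5):
    """Parse '(0, 1, ...)' strings once; return an (n, rows, cols) uint8 array."""
    # Keep only the digit characters: '(0, 1, 1)' becomes '011'.
    digits = series.str.replace(r"[^01]", "", regex=True)
    # One long ASCII byte string for the whole column, decoded in one call.
    raw = "".join(digits).encode("ascii")
    bits = np.frombuffer(raw, dtype=np.uint8) - ord("0")
    return bits.reshape(len(series), rows, cols)

# Hypothetical two-row frame built from the rows shown in the question.
demo = pd.DataFrame({"lnBins": [
    "(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1)",
    "(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1)",
]})

cube = split_bins(demo["lnBins"])
for n in range(5):
    # Row n of every 5x5 matrix, stored as tuples like the original columns.
    rows_n = list(map(tuple, cube[:, n, :].tolist()))
    demo[f"lnBin{n + 1}"] = pd.Series(rows_n, index=demo.index)
```

If the tuples are only needed for writing the CSV, keeping `cube` as a NumPy array for the intermediate modifications would likely avoid most of the conversion cost.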
The way I'm saving from pandas to CSV is probably not the best option, at least as far as the data format goes: tuples in the table become strings in the CSV, and have to be converted back on every load.
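As one alternative storage format, the bits could be written to the CSV as a compact string of '0' and '1' characters rather than as the text of a tuple. The sketch below assumes all rows have the same number of bits; `bits_to_text` and `text_to_bits` are hypothetical helpers, and the column must be read back with dtype str, otherwise pandas would parse it as a number and drop leading zeros.

```python
import io
import numpy as np
import pandas as pd

def bits_to_text(bits):
    """Turn an (n, k) array of 0/1 values into n strings like '01100...'."""
    chars = (np.asarray(bits, dtype=np.uint8) + ord("0")).astype(np.uint8)
    return [row.tobytes().decode("ascii") for row in chars]

def text_to_bits(series):
    """Inverse of bits_to_text: return an (n, k) uint8 array."""
    width = len(series.iloc[0])  # assumes every row has the same width
    raw = "".join(series).encode("ascii")
    bits = np.frombuffer(raw, dtype=np.uint8) - ord("0")
    return bits.reshape(len(series), width)

# Round trip through an in-memory CSV.
original = np.array([[0, 1, 1, 0, 0] * 5, [1, 0, 0, 1, 1] * 5], dtype=np.uint8)
frame = pd.DataFrame({"lnBins": bits_to_text(original)})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str keeps the leading zeros intact.
loaded = pd.read_csv(buffer, dtype={"lnBins": str})
restored = text_to_bits(loaded["lnBins"])
```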
I don't understand: do you have the option to change how the data is stored in the CSV? Or do you only want tips for decoding it as it is?
– jsbueno
Are they always 5x5 bit arrays? If so, and you want to store them more efficiently, each one should fit in 4 bytes (25 bits fit in a 32-bit integer).
– jsbueno
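The 4-byte idea from the comment above could look roughly like this: each 25-bit row is packed into one unsigned 32-bit integer, with the first bit as the most significant. This is only a sketch; `pack_bits` and `unpack_bits` are hypothetical names, and the bit order is an arbitrary choice.

```python
import numpy as np

def pack_bits(bits):
    """Pack an (n, k) array of 0/1 values (k <= 32) into n uint32 integers."""
    bits = np.asarray(bits, dtype=np.uint32)
    # Shift amount for each position; the first bit gets the largest shift.
    shifts = np.arange(bits.shape[1], dtype=np.uint32)[::-1]
    weights = np.uint32(1) << shifts
    return (bits * weights).sum(axis=1, dtype=np.uint32)

def unpack_bits(packed, width=25):
    """Inverse of pack_bits: return an (n, width) uint8 array of 0/1 values."""
    packed = np.asarray(packed, dtype=np.uint32)
    shifts = np.arange(width, dtype=np.uint32)[::-1]
    return ((packed[:, None] >> shifts) & 1).astype(np.uint8)

# One row from the question, flattened to 25 bits.
row = np.array([[0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
                 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]], dtype=np.uint8)
packed = pack_bits(row)
matrix = unpack_bits(packed).reshape(-1, 5, 5)
```

Stored this way, the column is a plain integer in the CSV, so no string parsing is needed on load.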
@jsbueno Yes, I can change the data format: I load the data into memory, apply the modifications, and generate an updated CSV. When saved to CSV, the binary row is stored as a tuple, but any later load and modification needs to convert it from string to tuple and then to array before the modification can be applied. That is the problem: performance drops dramatically with 3 million records.
– marcos rene mews
@jsbueno The analysis I showed here uses 5x5 matrices, but I have another case where I need to form 4 matrices of 10.
– marcos rene mews