Import CSV into a pandas DataFrame without converting strings to tuples


I load a CSV file with more than 3 million rows (about 770 MB) using pandas, and I need to convert a column that arrives in string format. Below is the column 'lnBins', which came back as a string when read from the CSV (what is the best format for saving this data in CSV?), together with the columns lnBin1 to lnBin5 produced by the function 'reshapeBin' below.

tempFrame[['lnBins', 'lnBin1', 'lnBin2', 'lnBin3', 'lnBin4', 'lnBin5']].tail(2)

2445169 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 1, 0, 1)
2445170 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 0, 1, 1)

As you can see, inside the reshapeBin function I need to chain several operations:

eval()
np.array()
.reshape(5,5)
[num]
.tolist()
tuple()

I use eval() to convert the table row from string to tuple, then convert it to an array and reshape it, pick out row [num] of the array, convert that row to a list, and finally to a tuple to store in the table, so that the table can be saved back to CSV.

Here is the function; I suspect the processing can still be made faster:

def reshapeBin(x, num):
    return tuple(np.array(eval(x)).reshape(5, 5)[num].tolist())

for n in range(5):
    tempFrame['lnBin' + str(n + 1)] = tempFrame['lnBins'].apply(reshapeBin, num=n)
    print('finished', n)

Probably the way I am saving from pandas to CSV is not the best option, at least regarding the data format: tuples in the table and strings in the CSV, and vice versa.

  • I don’t understand: do you have the option to change how the data is stored in the CSV? Or do you just want tips on decoding it as it is?

  • Are they always 5x5 bit arrays? If they are and you want to store them more efficiently, you should be able to fit each one in 4 bytes;

  • jsbueno: yes, I have the option to modify the data by loading it into memory, applying the modifications, and generating an updated CSV. When saved to CSV, the binary row is written as a tuple, but any later load and modification requires converting from string to tuple and then to array before applying the change. That is the problem: performance drops dramatically with 3 million records.

  • jsbueno: in the analysis I showed they are 5x5 matrices, but I have another situation where I need to form 4 matrices of 10.
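
The 4-byte idea from the comment above can be illustrated with NumPy's packbits/unpackbits (a hedged sketch of my own, not code from the original post): a 5x5 matrix of 0s and 1s is 25 bits, which packs into ceil(25 / 8) = 4 bytes.

```python
import numpy as np

# One 5x5 bit matrix (25 bits) fits in ceil(25 / 8) = 4 bytes.
matrix = np.array([[0, 1, 1, 0, 0],
                   [0, 1, 0, 1, 1],
                   [1, 1, 0, 0, 0],
                   [1, 1, 1, 1, 1],
                   [0, 1, 1, 0, 1]], dtype=np.uint8)

packed = np.packbits(matrix.ravel())            # 25 bits -> 4 bytes
restored = np.unpackbits(packed)[:25].reshape(5, 5)

assert packed.nbytes == 4
assert (restored == matrix).all()
```

The same trick scales to the "4 matrices of 10" case: pack the flattened bits and keep only the first N on the way back.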

1 answer


I couldn’t quite understand your problem - at least one whole row of the matrix would have helped - nor whether you want to optimize the CSV encoding or just the final reading.

In this kind of question it helps to have access to at least one valid sample of the data you want to transform - you only posted two lines of text output (summarized by pandas) - and anyone wanting to help has to guess what the input looked like: not exactly the kind of thing that helps.

Still, while writing this answer, one glaring thing stands out: you run the whole decoding process over the 25 numbers of your matrix 5 times for each row, by calling the function "reshapeBin" 5 times.

eval is also far from ideal. (If people outside your team can write to your data, any Python program can be embedded in it and will run on your server - and besides, it is not the most performant thing in the world.)
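
To make the safety point concrete (my illustration, not part of the original answer): ast.literal_eval only parses Python literals, so a tuple string decodes fine, while anything that is actual code raises ValueError instead of being executed.

```python
from ast import literal_eval

# literal_eval accepts only Python literals (tuples, lists, numbers, ...):
row = literal_eval("(0, 1, 1, 0, 0)")
assert row == (0, 1, 1, 0, 0)

# Actual code is rejected instead of executed:
try:
    literal_eval("__import__('os').system('echo pwned')")
    raised = False
except ValueError:
    raised = True
assert raised
```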

Anyway, considering just the few lines you posted, you can avoid decoding each row 5 times by doing something like this:

...
from ast import literal_eval

tempFrame["data"] = tempFrame["lnBins"].apply(lambda x: np.array(literal_eval(x), dtype="u1"))

for i in range(5):
    tempFrame[f"lnBin{i+1}"] = tempFrame["data"].apply(lambda x: x[i * 5: i * 5 + 5])

del tempFrame["data"]

Basically, instead of calling the expensive eval 5 times for each row, I create a temporary column with the string already decoded to numbers (and, to save memory, instead of a tuple we put an np.array there). Then the idea is still to use the 5 applys in a row (note the f-string used to build the column name), but each time taking only a slice of the sequence of 25 numbers (no reshape needed). I also use ast.literal_eval, which is less problematic than eval.
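
If even one literal_eval call per row turns out to be too slow for 3 million rows, a further idea (my own sketch, not from the answer above, and assuming the column really is just comma-separated 0s and 1s in parentheses) is to skip literal_eval entirely and parse the whole column with plain string operations in one pass:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'lnBins' column (the real one has 25 numbers per row).
lnBins = pd.Series(["(0, 1, 1, 0)", "(1, 0, 0, 1)"])

# Strip the parentheses and parse every number in a single pass,
# avoiding a literal_eval call per row.
flat = np.fromiter(
    (int(tok) for row in lnBins.str.strip("()") for tok in row.split(",")),
    dtype="u1",
)
mat = flat.reshape(len(lnBins), -1)   # one matrix per CSV row
```

From `mat` the five sub-row columns are just slices, e.g. `mat[:, 0:5]`, with no per-row Python function at all.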

  • jsbueno: I did not know the literal_eval function; I will analyze it and check the performance. What is your suggestion for recording a binary sequence - as a dataframe column or directly in an array - in order to save it to disk and load it back more quickly?
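
On the follow-up question about persisting the data faster than CSV: one common approach (my suggestion, sketched here with an in-memory buffer standing in for a file on disk) is to keep the bits in a NumPy array, pack them with packbits, and save with np.save; loading back is then a straight binary read with no string parsing at all.

```python
import io
import numpy as np

# Rows of 25 bits each, as in the question (fewer rows here for illustration).
bits = np.random.randint(0, 2, size=(1000, 25), dtype=np.uint8)

packed = np.packbits(bits, axis=1)    # (1000, 25) bits -> (1000, 4) bytes
buf = io.BytesIO()                    # stands in for a file on disk
np.save(buf, packed)

buf.seek(0)
restored = np.unpackbits(np.load(buf), axis=1)[:, :25]
assert (restored == bits).all()
```

With a real path (`np.save("bins.npy", packed)` / `np.load("bins.npy")`) the 3-million-row table shrinks to about 12 MB of payload and round-trips without any string-to-tuple conversion.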
