Python pandas very slow

Question


Asked 6 years, 9 months ago

Viewed 428 times

2

Can anyone help me? I am reading a file, making some changes, and then saving it to another folder, but that takes 2 hours. The file has 15 million lines. Is there a different, more efficient way to do this?

import pandas as pd

# READ THE FILE FROM THE STAGING FOLDER
arq5 = pd.read_csv(r'C:\Users\Usuário\staging\arquivo5.txt', delimiter='\t', encoding='cp1252', engine='python')

# MODIFY THE FILE: DROP THE UNWANTED COLUMNS
columns = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
arq5.drop(columns, inplace=True, axis=1)

# SAVE FILE 5 AS CSV IN THE ALPHA FOLDER
arq5.to_csv(r'C:\Users\Usuário\alpha\arquivo5.txt', index=False)

It really is a very large file, and I don't know of another way to do it. Could your hardware be what is limiting the processing?

– EmanuelF

2018/10/17 at 19:45

1 answer


by nosklo • **5,801** points · Answer 1 · 2018-10-17T20:42:43+00:00

3

pandas loads the entire file into memory, which can be slow for very large files.

Try not to load the entire file. The code below does the same as yours, but without pandas and without loading the whole file into memory: it reads the source file line by line, modifies each line, and writes it straight to the destination:

import csv

colunas_remover = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 
    'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
nome_arquivo = r'C:\Users\Usuário\staging\arquivo5.txt'
destino = r'C:\Users\Usuário\alpha\arquivo5.txt'

# READ THE FILE WHILE WRITING THE RESULT TO THE OTHER FOLDER
with open(nome_arquivo, encoding='cp1252', newline='') as f:
    cf = csv.DictReader(f, delimiter='\t')
    with open(destino, 'w', encoding='cp1252', newline='') as fw:
        # keep every column from the header that is not marked for removal
        colunas_manter = [c for c in cf.fieldnames if c not in colunas_remover]
        cw = csv.DictWriter(fw, colunas_manter, delimiter='\t',
            extrasaction='ignore')  # ignore keys that are not in colunas_manter
        cw.writeheader()
        cw.writerows(cf)
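
For readers who would rather stay in pandas, the same "do not hold the whole file in memory" idea can be sketched with chunked reading. This is only a sketch under assumptions, not a measured comparison. The function name `drop_columns_chunked`, the chunk size, and the sample columns `NOME` and `VALOR` are made up for illustration. It drops `engine='python'` from the question's call so that pandas uses its default C parser, and it reads everything as text (`dtype=str`) so values are not reinterpreted between chunks. Like the question's code, it writes the output with the `to_csv` defaults (comma-separated, UTF-8).

```python
import os
import tempfile

import pandas as pd

def drop_columns_chunked(src, dst, remove, chunksize=1_000_000):
    """Stream src to dst in chunks, skipping the columns named in remove."""
    reader = pd.read_csv(
        src,
        delimiter='\t',
        encoding='cp1252',
        dtype=str,                                # treat every kept column as text
        usecols=lambda name: name not in remove,  # unwanted columns are not loaded
        chunksize=chunksize,                      # iterate over DataFrames of this many rows
    )
    for i, chunk in enumerate(reader):
        # The first chunk creates the file and writes the header; later chunks append.
        chunk.to_csv(dst, mode='w' if i == 0 else 'a', header=(i == 0), index=False)

# Tiny made-up sample, small chunk size so that more than one chunk is written.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'in.txt')
dst = os.path.join(tmpdir, 'out.txt')
with open(src, 'w', encoding='cp1252', newline='') as f:
    f.write('PERIODO\tNOME\tPX\tVALOR\n'
            '201801\tAna\t1\t10\n'
            '201802\tJoão\t2\t20\n'
            '201803\tBia\t3\t30\n')

drop_columns_chunked(src, dst, {'PERIODO', 'PX'}, chunksize=2)
with open(dst, encoding='utf-8') as f:
    result = f.read()
print(result)
```

Whether this beats the `csv` version above depends on the data and the machine, so it is worth timing both on a slice of the real file.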

Thanks for the help! The script worked perfectly. The problem now is that the column names are not being written, and the output alternates between a row with a record and an empty row.

– Vitor Tenório

2018/10/18 at 17:38
@Vitortenório I added the missing writeheader() call to the answer so that the column names are written. As for the alternating rows, the handling of line breaks needs small adjustments depending on the Python version. Which Python are you using?

– nosklo

2018/10/18 at 17:41
One more thing, @Vitortenório, out of curiosity: did it get faster? How long does the whole run take now?

– nosklo

2018/10/18 at 17:42
The column names came out right, perfect! I'm using Python 3.6. It is much faster; it takes around 5 minutes.

– Vitor Tenório

2018/10/18 at 17:55
That's great, a considerable reduction, @Vitortenório: from 2 hours to 5 minutes is roughly 24 times faster, more than an order of magnitude. I added newline='' to both open() calls, which should solve the double-line problem on Python 3.6. Let me know if anything else comes up!

– nosklo

2018/10/18 at 18:04
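
To illustrate why newline='' matters here: the csv module ends each row with '\r\n' by default, and a text-mode file on Windows additionally turns every '\n' into '\r\n', so each row ends in '\r\r\n' and many tools show an empty line after every record. The sketch below reproduces that on any platform by using an in-memory stream whose newline setting mimics Windows text mode. The helper name `write_rows` and the sample rows are made up for illustration.

```python
import csv
import io

def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='cp1252', newline=newline)
    csv.writer(text).writerows([['a', 'b'], ['c', 'd']])
    text.flush()
    data = raw.getvalue()
    text.detach()  # release the underlying buffer without closing it
    return data

# newline='\r\n' translates every '\n' written to '\r\n', like Windows text mode.
translated = write_rows('\r\n')
# newline='' disables translation, so the csv module's own '\r\n' is written as-is.
untouched = write_rows('')

print(translated)  # b'a,b\r\r\nc,d\r\r\n'
print(untouched)   # b'a,b\r\nc,d\r\n'
```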
Works perfectly :) One question: I need to do a group by on a column. If it's not too much trouble, could you show me how?

– Vitor Tenório

2018/10/18 at 18:57
I was thinking about doing it with pandas, but that would take more processing time and we'd be back at square one, haha.

– Vitor Tenório

2018/10/18 at 18:58
@Vitortenório if the data is sorted, you can use itertools.groupby()... I suggest opening a new question with example input and output.

– nosklo

2018/10/18 at 19:06
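
A minimal sketch of the itertools.groupby() suggestion above, assuming the rows are already sorted by the grouping column. groupby() only merges consecutive rows with the same key, so unsorted input would produce the same key more than once. The column names `MEDICO` and `VALOR` and the sample data are invented for illustration, since the real file layout is not shown in the thread.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Made-up sample, already sorted by the grouping column (MEDICO).
sample = io.StringIO(
    'MEDICO\tVALOR\n'
    'Ana\t10\n'
    'Ana\t5\n'
    'Bia\t7\n'
)

reader = csv.DictReader(sample, delimiter='\t')
totals = {}
# Each group is consumed as it streams by, so only one row is held at a time.
for key, rows in groupby(reader, key=itemgetter('MEDICO')):
    totals[key] = sum(int(row['VALOR']) for row in rows)

print(totals)  # {'Ana': 15, 'Bia': 7}
```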
Cool!! Thanks so much for the help.

– Vitor Tenório

2018/10/19 at 16:02
@Vitortenório Don't forget to include in the new question an example of what your source file looks like and an example of how you would like the output file to look. This information is essential to help you with the groupby.

– nosklo

2018/10/19 at 18:00
