Python pandas very slow


Can anyone help me? I read a file, make some changes, and then save it to another folder, but this takes 2 hours. The file has 15 million lines. Is there a different, more efficient way to do this?

import pandas as pd

# READ THE FILE FROM THE STAGING FOLDER
arq5 = pd.read_csv(r'C:\Users\Usuário\staging\arquivo5.txt', delimiter='\t', encoding='cp1252', engine='python')

# MAKE CHANGES TO THE FILE
columns = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
arq5.drop(columns, inplace=True, axis=1)

# SAVE FILE 5 AS CSV IN THE ALPHA FOLDER
arq5.to_csv(r'C:\Users\Usuário\alpha\arquivo5.txt', index=False)
  • It really is a very large file, and I don't know of another way to do it. Could your hardware be limiting the processing?

1 answer



pandas loads the entire file into memory, which can be slow for very large files.

Try not to load the entire file. The code below does the same as yours, but without pandas and without loading the whole file into memory: it reads the source file line by line, drops the unwanted columns, and writes each row straight to the destination:

import csv

colunas_remover = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 
    'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
nome_arquivo = r'C:\Users\Usuário\staging\arquivo5.txt'
destino = r'C:\Users\Usuário\alpha\arquivo5.txt'

# READ THE FILE WHILE WRITING THE RESULT TO ANOTHER FOLDER
with open(nome_arquivo, encoding='cp1252', newline='') as f:
    cf = csv.DictReader(f, delimiter='\t')
    with open(destino, 'w', encoding='cp1252', newline='') as fw:
        colunas_manter = [c for c in cf.fieldnames if c not in colunas_remover]
        cw = csv.DictWriter(fw, colunas_manter, delimiter='\t',
            extrasaction='ignore') # ignore fields that are not in colunas_manter
        cw.writeheader()
        cw.writerows(cf)
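
The same streaming idea can also be sketched with csv.reader and column positions instead of DictReader, which avoids building a dictionary for every row. This is only a sketch under stated assumptions: the helper name copy_without_columns and its parameters are hypothetical, and it assumes every row has at least as many fields as the header.

```python
import csv


def copy_without_columns(src, dst, drop, delimiter="\t", encoding="cp1252"):
    """Stream src to dst row by row, omitting the columns named in drop.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the list of column names that were kept.
    """
    drop = set(drop)
    # newline='' leaves line-ending handling to the csv module on both files,
    # which avoids the blank row after every record on Windows.
    with open(src, encoding=encoding, newline="") as f_in, \
            open(dst, "w", encoding=encoding, newline="") as f_out:
        reader = csv.reader(f_in, delimiter=delimiter)
        writer = csv.writer(f_out, delimiter=delimiter)

        header = next(reader, None)
        if header is None:
            return []  # empty input file: nothing to copy

        # Positions of the columns to keep, in their original order.
        keep = [i for i, name in enumerate(header) if name not in drop]
        writer.writerow([header[i] for i in keep])

        for row in reader:
            # Assumes each row has at least as many fields as the header.
            writer.writerow([row[i] for i in keep])

        return [header[i] for i in keep]
```

It would be called with the same paths and column list as above, for example copy_without_columns(nome_arquivo, destino, colunas_remover).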
  • Thanks for the help! The script worked perfectly. The problem now is that the column names are not being written, and the rows alternate between a record and a blank line.

  • @Vitortenório I added the missing writeheader() call to the answer so that the column names are written. As for the alternating rows, line-break handling needs small adjustments depending on the Python version. Which Python are you using?

  • One more thing, @Vitortenório, out of curiosity: did it get faster? How long does everything take to run now?

  • The column names came out right, perfect! I'm using Python 3.6. It's much faster; it takes around 5 minutes.

  • That's great, a considerable reduction, eh @Vitortenório: from 2 hours to 5 minutes is roughly 24 times faster, more than an order of magnitude. I added newline='' to both open() calls, which should solve the double-line problem on Python 3.6. Let me know if anything comes up!

  • All good :) One question: I need to do a group by on a column. If it's not too much trouble, could you show me how?

  • I was thinking of doing it with pandas, but that would take more processing time and we'd be back at the bottom of the pit, haha

  • @Vitortenório if the data is sorted, you can use itertools.groupby()... I suggest opening a new question with example input and output.

  • Cool!! Thanks so much for the help.

  • @Vitortenório Don't forget to include in the new question an example of what your source file looks like and an example of how you would like the output file to look. This information is essential for helping you with groupby.
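
For reference, here is a minimal sketch of the itertools.groupby() suggestion from the comments. The sample data, the column names CIDADE and VALOR, and the helper total_by_key are all made up for illustration, since the thread does not say which column or aggregation is needed. It assumes the rows are already sorted by the grouping column, because groupby() only merges consecutive rows with equal keys.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def total_by_key(rows, key, value):
    """Yield (key value, summed value) for each run of consecutive equal keys.

    Hypothetical helper: rows is an iterable of dicts (e.g. a csv.DictReader)
    that must already be sorted by the key column.
    """
    for key_value, group in groupby(rows, key=itemgetter(key)):
        yield key_value, sum(int(row[value]) for row in group)


# Made-up sample, already sorted by the grouping column CIDADE.
sample = io.StringIO("CIDADE\tVALOR\nRecife\t10\nRecife\t5\nSalvador\t7\n")
reader = csv.DictReader(sample, delimiter="\t")

result = list(total_by_key(reader, "CIDADE", "VALOR"))
print(result)  # [('Recife', 15), ('Salvador', 7)]
```

Because the rows are consumed one group at a time, this keeps the same low memory profile as the streaming copy. If the file is not sorted by the grouping column, equal keys that are not adjacent would show up as separate groups.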
