Concatenate large python csv files

I am trying to concatenate files that add up to about 650 MB, and it is returning a memory error:

> MemoryError: Unable to allocate 70.3 MiB for an array with shape (3, 6143249) and data type object

The line I'm using to concatenate:

li  # list of the files
df = pd.concat(li, axis=0, ignore_index=True)

Is there another way to concatenate large files? Thank you.

  • Is this a task that will be repeated several times? If not, use the shell: cp a.txt b.txt c.txt tudo_junto.txt, or on Windows use: copy a.txt b.txt c.txt tudo_junto.txt

  • @Paulomarques Actually, on Unix/Linux you have to use cat (not cp): cat a.txt b.txt c.txt > tudo_junto.txt - and on Windows it is copy a.txt+b.txt+c.txt tudo_junto.txt (the + between the files was missing)

  • @Saul The solution in your case is to use chunks. It seems to me that one problem with your files is that they have many columns, so the chunking should be applied to the columns; see this question: https://stackoverflow.com/questions/37727671/pandas-column-wise-chunking (a rough sketch of the idea follows after these comments)

  • @hkotsubo, thank you for the correction.
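
A rough sketch of the column-wise chunking mentioned in the comment above, assuming a hypothetical file name ('data.csv') and a group size of 3 columns; the point is only to show that you can read a few columns at a time instead of loading the whole file at once:

import pandas as pd

all_columns = list(pd.read_csv('data.csv', nrows=0).columns)   # read only the header row

for i in range(0, len(all_columns), 3):                        # 3 columns at a time (hypothetical)
    cols = all_columns[i:i + 3]
    partial = pd.read_csv('data.csv', usecols=cols)
    # ... process this subset of columns ...
    del partial                                                # free it before the next group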

1 answer

I'm going to list some things that I know you can try (you can even combine several of them), and keep in mind there are many others. First of all, make sure you have the latest version of pandas.

Importing CSV - defining dtype: The first thing to keep in mind is that whenever you load a file into a pandas DataFrame, it will try to guess the data type of each column of your CSV, and this is bad because while it does this it needs to keep all the file's data as objects (strings) in memory. If you tell pandas what type each column should have, it doesn't need to do that and you save a lot of memory. To set the data types of your CSV, pass dtype={'column': type} as a parameter of read_csv(). Just keep in mind that if you have corrupted data (a string in a column that should contain only numbers), read_csv() will raise an error.
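
For example, a minimal sketch of this; the column names, types, and file names below are hypothetical, so replace them with the ones from your own CSVs:

import pandas as pd

dtypes = {'id': 'int32', 'value': 'float32', 'category': 'category'}   # hypothetical columns/types
files = ['file1.csv', 'file2.csv', 'file3.csv']                        # hypothetical file names

li = [pd.read_csv(f, dtype=dtypes) for f in files]
df = pd.concat(li, axis=0, ignore_index=True)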

If you're going to concatenate multiple files, do it gradually: you neither want nor need several DataFrames loaded in memory at the same time; prefer to concatenate them two by two, freeing the ones already concatenated. Example of how to do this:

df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

Saving DataFrames to a single CSV is better than concatenating: if memory is a problem, use your hard drive. Open each file one at a time and append it to a single CSV using mode='a'. See an example:

df1.to_csv('file.csv', index=False)
df2.to_csv('file.csv', mode='a', header=False, index=False)
df3.to_csv('file.csv', mode='a', header=False, index=False)

Don't use pandas if you don't need it: if you just want to concatenate files and don't need to touch the data, don't waste time importing the files as DataFrames and then concatenating them. Use a shell command or do it in plain Python:

def concat(file1, file2):
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')
concat('file.csv', 'file3.csv')

Do not use CSV: this is perhaps the boldest suggestion of all, but it is often the one that saves you. CSV files are slow to manipulate because the data is stored as text. An HDF5 file is much more efficient because the data is stored in binary form.
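
A minimal sketch of that idea, assuming the hypothetical file names below and that the PyTables package (tables) is installed; each CSV is read one at a time and appended to the same HDF5 dataset, so only one DataFrame is in memory at once:

import pandas as pd

for csv_file in ['file1.csv', 'file2.csv', 'file3.csv']:   # hypothetical file names
    df = pd.read_csv(csv_file)
    # format='table' with append=True adds each file to the same dataset in the HDF5 file
    df.to_hdf('combined.h5', key='data', mode='a', format='table', append=True)
    del df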
