I’m going to list some things that I know you can try to do or even combine several of them and know that there are many others. First of all ensure that you have the latest version of pandas.
Importing CSV - Defining dtype: The first thing you have to keep in mind is that whenever you upload a file to a DataFrame
of pandas he will try to guess the data type of his CSV and this is bad because while he does this he needs to keep all the file data as objects (strings) in memory. If you tell pandas what kind of data each column should have, it doesn’t need to do that and you save a lot of memory. To set the data type of your CSV use dtype={'coluna':tipo}
as a parameter of read_csv()
. Just keep in mind that if you have corrupted data (a string in a column that should contain only numbers) the read_csv()
make a mistake.
If you’re going to concatenate multiple files, slowly: you neither want nor need several DataFrames
simultaneously loaded, prefer to concatenate two by two freeing those already concatenated in memory. Example of how to do this:
df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3
Saving Dataframes to a single CSV is better than concatenating: if memory is a problem, use your hard drive. Open each file at once and save to a csv using `mode='a'. See an example:
df1.to_csv('file.csv', index=False)
df2.to_csv('file.csv', mode='a', columns=False, index=False)
df3.to_csv('file.csv', mode='a', columns=False, index=False)
Don’t use pandas if you don’t need: if you want to concatenate files and don’t need to touch them, don’t waste time importing the files as DataFrames
then concatenate. Use a bash command or do it in python:
def concat(file1, file2):
with open(file2, 'r') as filename2:
data = file2.read()
with open(file1, 'a') as filename1:
file.write(data)
concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')
concat('file.csv', 'file3.csv')
Do not use CSV: This is perhaps the boldest question of all, but it is often the saving solution. CSV files are slow to manipulate because they convert data into text mode. An HDF5 solution is much more efficient because it converts data into binary mode.
is a task that will be repeated several times? if not, use the shell
cp a.txt b.txt c.txt tudo_junto.txt
or using windows, use: "copy a.txt b.txt c.txt tudo_together.txt"– Paulo Marques
@Paulomarques Actually, Unix/Linux has to use
cat
(and notcp
):cat a.txt b.txt c.txt > tudo_junto.txt
- and in Windows iscopy a.txt+b.txt+c.txt tudo_junto.txt
(lacked the+
among the archives)– hkotsubo
@Saul the solution to your case is to use Chunks. It seems to me that a problem in your files is that it has many columns, so the Chunks should be applied to the columns, see this question: https://stackoverflow.com/questions/37727671/pandas-column-wise-chunking
– Lucas
@hkotsubo, thank you for the correction.
– Paulo Marques