The idea of chunksize is that it lets you process the data in blocks with an ordinary for loop, instead of loading the whole file into memory at once.
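Under the hood, passing chunksize to read_csv returns an iterator of ordinary DataFrames rather than one giant frame. A minimal sketch (the file name here is just a placeholder):

import pandas as pd

# With chunksize set, read_csv returns a TextFileReader (an iterator),
# not a DataFrame; nothing is loaded until you iterate over it.
reader = pd.read_csv('big_file.csv', sep=';', chunksize=10000)
for chunk in reader:
    # Each chunk is a regular DataFrame with up to 10000 rows
    print(type(chunk), len(chunk))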
My tip is to define your goals before reading the data in chunks, since chunking 'distorts' the statistics that describe() reports: each chunk only sees a slice of the data, not the whole dataset.
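If you do need a dataset-wide statistic, compute it from running aggregates across the chunks instead of calling describe() on each piece. A sketch for a global mean (NU_NOTA_MT is just an example score column from my exploration; swap in whatever column you need):

import pandas as pd

total = 0.0
count = 0
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    scores = chunk['NU_NOTA_MT'].dropna()  # ignore missing scores
    total += scores.sum()
    count += len(scores)

# True global mean, which a per-chunk describe() would not give you
print(total / count)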
Example
Find the total number of women and men who took the ENEM.
import pandas as pd

# Count TP_SEXO within each chunk and collect the partial counts
array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    array_df.append(chunk['TP_SEXO'].value_counts())

# ignore_index=False keeps the F/M labels so they can be grouped
df = pd.concat(array_df, ignore_index=False)
print(df.groupby(level=0).sum())

Output:
F    145400
M     97904
Name: TP_SEXO, dtype: int64
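As a side note, since this count only touches one column, you can ask read_csv to parse just that column with the standard usecols parameter, which makes every chunk much lighter. The same count, sketched with that option:

import pandas as pd

array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', usecols=['TP_SEXO'], chunksize=10000):
    array_df.append(chunk['TP_SEXO'].value_counts())

print(pd.concat(array_df).groupby(level=0).sum())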
Another thing you can do (I've done it myself, since I'm also exploring the ENEM microdata): if you're only interested in the data for a single state, use the chunks to select those rows. The resulting DataFrame is much smaller, and describe() will likely run smoothly on it.
In my case I was interested in the data for Rio Grande do Sul, so I separated it as follows:
# Collect only the rows whose exam was taken in Rio Grande do Sul
array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    temp_df = chunk.loc[chunk['SG_UF_PROVA'] == 'RS']
    array_df.append(temp_df)

df = pd.concat(array_df, ignore_index=True)
Done: that cut the data from 3 GB down to 159 MB.
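Once you have the filtered DataFrame, it's worth writing it back to disk so you only pay for the full 3 GB pass once (the output file name below is just my choice):

# Persist the filtered subset; later sessions can read this small file directly
df.to_csv('ENEM_2018_RS.csv', sep=';', index=False)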
Of course, this last tip only works if you have a focus group for some kind of exploratory analysis; if what you want is at the national level, you'll need well-defined parameters, as I demonstrated in the first example.