The idea of chunksize is that it lets you process the data in blocks with an ordinary for loop, instead of loading the whole file into memory at once.
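Under the hood, passing chunksize to read_csv returns an iterator of ordinary DataFrames rather than one giant frame. A minimal sketch (the file name here is just a placeholder):

import pandas as pd

# With chunksize set, read_csv returns a TextFileReader (an iterator),
# not a DataFrame; nothing is loaded until you iterate over it.
reader = pd.read_csv('big_file.csv', sep=';', chunksize=10000)
for chunk in reader:
    # Each chunk is a regular DataFrame with up to 10000 rows
    print(type(chunk), len(chunk))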
My tip is to define your goals before reading the data in chunks, since chunking 'distorts' the statistics that describe() reports: each chunk only sees a slice of the data, not the whole dataset.
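If you do need a dataset-wide statistic, compute it from running aggregates across the chunks instead of calling describe() on each piece. A sketch for a global mean (NU_NOTA_MT is just an example score column from my exploration; swap in whatever column you need):

import pandas as pd

total = 0.0
count = 0
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    scores = chunk['NU_NOTA_MT'].dropna()  # ignore missing scores
    total += scores.sum()
    count += len(scores)

# True global mean, which a per-chunk describe() would not give you
print(total / count)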
Example
Find the total number of women and men who took the ENEM.
import pandas as pd

# Count TP_SEXO within each chunk and collect the partial counts
array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    array_df.append(chunk['TP_SEXO'].value_counts())

# ignore_index=False keeps the F/M labels so they can be grouped
df = pd.concat(array_df, ignore_index=False)
print(df.groupby(level=0).sum())

Output:
F    145400
M     97904
Name: TP_SEXO, dtype: int64
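As a side note, since this count only touches one column, you can ask read_csv to parse just that column with the standard usecols parameter, which makes every chunk much lighter. The same count, sketched with that option:

import pandas as pd

array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', usecols=['TP_SEXO'], chunksize=10000):
    array_df.append(chunk['TP_SEXO'].value_counts())

print(pd.concat(array_df).groupby(level=0).sum())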
Another thing you can do (I've done it myself, since I'm also exploring the ENEM microdata): if you're only interested in the data for a single state, use the chunks to select those rows. The resulting DataFrame is much smaller, and describe() will likely run smoothly on it.
In my case I was interested in the data for Rio Grande do Sul, so I separated it as follows:
# Collect only the rows whose exam was taken in Rio Grande do Sul
array_df = []
for chunk in pd.read_csv('MICRODADOS_ENEM_2018.csv', encoding='Latin1',
                         sep=';', chunksize=10000):
    temp_df = chunk.loc[chunk['SG_UF_PROVA'] == 'RS']
    array_df.append(temp_df)

df = pd.concat(array_df, ignore_index=True)
Done: that cut the data from 3 GB down to 159 MB.
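Once you have the filtered DataFrame, it's worth writing it back to disk so you only pay for the full 3 GB pass once (the output file name below is just my choice):

# Persist the filtered subset; later sessions can read this small file directly
df.to_csv('ENEM_2018_RS.csv', sep=';', index=False)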
Of course, this last tip only works if you have a focus group for some kind of exploratory analysis; if what you want is at the national level, you'll need well-defined parameters, as I demonstrated in the first example.