"Creating a CSV in memory" doesn’t make much sense - A Pandas Dataframe is a data table in memory, but already with much more facilities than a CSV. A CSV is a convenient file type to port data back and forth on disk - but it has a minimum of direct convenience.
By your question, you want to generate a . zip file within which are Csvs with your data. It is possible yes, in Python, to serialize the data as if they were a CSV in memory, and add each to a zip arqivo - but if the CSV files were created as temporary files on the disk, it would be the same job - that is: we can create "in memory" - but that does not matter to your problem.
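For illustration, a minimal sketch of what "a CSV in memory" amounts to - the two-column DataFrame here is made up, and to_csv with no path argument simply returns the CSV text as a string:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical data

    # with no path argument, to_csv returns the CSV text as a plain string -
    # that string is all "a CSV in memory" really is:
    csv_text = df.to_csv(index=False)
    print(csv_text)  # a,b\n1,2\n3,4\n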
The specific error you are getting happens because Pandas' to_csv call does not return "the file in memory": it returns nothing (None) - so when you try to add "arq" to your "gzip" file, you get the error you posted.
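A minimal demonstration of that return value, with a toy DataFrame:

    from io import StringIO
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2]})  # hypothetical data
    f = StringIO()

    result = df.to_csv(f)  # writes the CSV text into the buffer...
    print(result)          # ...and returns None
    print(f.getvalue())    # the actual CSV text lives inside the StringIO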
The data in the CSV will be in the StringIO object itself - the one in your variable "f". There are some logic errors in your code: you create a single StringIO object and never erase its contents - so even if everything else were working, you would be repeating all the data from the first files in the last ones as well.
That’s easy to fix.
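Sketched in isolation, there are two ways: create a fresh StringIO for each chunk (what the example further down does), or empty the one you already have:

    from io import StringIO

    f = StringIO()
    f.write("first chunk\n")

    # rewind and truncate to reuse the same buffer without repeating data:
    f.seek(0)
    f.truncate(0)
    f.write("second chunk\n")
    print(f.getvalue())  # only 'second chunk\n' - the old contents are gone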
Another problem is that "gzip" files have no internal structure the way a ZIP file does - they are a single compressed data stream and are decompressed as a single file. So much so that it is common to see files distributed on the internet as "gzip" ending in ".tar.gz" - indicating that inside the gzip there is a "tar" file, which does carry information about the files contained within it.
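If you did want to stay with gzip, that ".tar.gz" combination is the usual route - a sketch with hypothetical file names and contents, using the standard library's tarfile module:

    import io
    import tarfile

    # hypothetical chunks; the tar inside the gzip is what provides the
    # per-file structure that gzip alone lacks:
    parts = {"file_0000.csv": b"a,b\n1,2\n", "file_1000.csv": b"a,b\n3,4\n"}

    with tarfile.open("arquivos.tar.gz", "w:gz") as tf:
        for name, data in parts.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # tar needs the size up front
            tf.addfile(info, io.BytesIO(data))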
If you want a ZIP file that can be listed by any program that works with this type of file, and inside it see the CSVs for lines 0 to 999, 1000 to 1999, and so on, you have to use the standard library's zipfile module, not gzip.
the object "Stringio" does not have a filename - it should not be possible to pass it directly to gzip.write - we have to see the documentation of this call see a gzip file object accepts direct data, and metadata as filename - otherwise it will be necessary to create a temporary disk file, as mentioned above.
Well, in practice, let’s try something like this:
    import pandas as pd
    from io import StringIO
    import zipfile

    csvfile = pd.read_csv('file.csv')

    with zipfile.ZipFile('arquivos_csv_agrupados.zip', 'w',
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as zf:
        # range already lets us step through the rows 1000 at a time
        for starting_line in range(0, len(csvfile), 1000):
            data = csvfile[starting_line: starting_line + 1000]
            f = StringIO()  # a fresh buffer per chunk, so no data is repeated
            data.to_csv(f, index=False)
            # add the chunk to the ZIP, building the inner file name and
            # retrieving the text written to 'f':
            zf.writestr(f"file_{starting_line:04d}_{starting_line + 1000:04d}.csv",
                        f.getvalue())
The biggest difference is really the use of zipfile instead of gzip - but what was actually wrong was that (1) the partial CSV content generated in the StringIO is recovered with the .getvalue() method, and not from the value returned by to_csv; and (2) with zipfile we get to choose the names of the files that go inside the zip, using the .writestr() method shown above.
You were trying to create the zipped file in memory with "BytesIO" - that is not really needed here: the file is generated directly on disk with the example name "arquivos_csv_agrupados.zip". But if you really do want to create the zip only in memory, using BytesIO could work.
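A minimal sketch of that in-memory variant, with a hypothetical entry name - zipfile.ZipFile accepts any file-like object:

    import io
    import zipfile

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")  # hypothetical entry

    zip_bytes = buffer.getvalue()  # the whole .zip as bytes, never touching disk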
The other fix is that you don't need to advance "i" one by one and use an if to detect multiples of 1000 lines - the range function does that on its own through its step argument.
what is "break in 1000 lines"? The file has thousands of lines and you want to save from 1000 in 1000?
– Lucas
Exactly, Lucas.
– marloswn
Use Chunks: https://answall.com/questions/392787/manipulando-dataset-de-3-gb-com-pandas-usando-chunks
– Lucas