How to save a CSV in memory using Python?

Asked

Viewed 151 times

-1

Hello! I need to read a big CSV, break it into 1000-line CSVs, store them in memory and then generate a zip with these smaller files.

This is the code so far:

import pandas as pd
from io import BytesIO, StringIO
import gzip

csvfile = pd.read_csv('file.csv')
buffer = BytesIO()
f = StringIO()

with gzip.open(buffer, 'wb') as zf:   
    for i in range(len(csvfile)):
        if i % 1000 == 0:
            data = csvfile[i:i+1000]
            arq = data.to_csv(f, index=False)            
            zf.write(arq)

And this is the return of error:

TypeError: memoryview: a bytes-like object is required, not 'NoneType'

Please, could someone help me?

  • What do you mean by "break into 1000-line CSVs"? The file has thousands of lines and you want to save it in chunks of 1000?

  • Exactly, Lucas.

  • 1

    Use Chunks: https://answall.com/questions/392787/manipulando-dataset-de-3-gb-com-pandas-usando-chunks
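
For reference, the chunks approach suggested in that comment would look roughly like this - a sketch only, assuming the same file.csv and 1000-line pieces; pd.read_csv with chunksize yields DataFrames lazily instead of loading the whole file at once:

import pandas as pd

# Sketch: read 'file.csv' lazily, 1000 rows at a time,
# instead of loading the whole file into memory at once.
for i, chunk in enumerate(pd.read_csv('file.csv', chunksize=1000)):
    # each 'chunk' is a regular DataFrame with up to 1000 rows
    chunk.to_csv(f'part_{i:04d}.csv', index=False)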

1 answer

2

"Creating a CSV in memory" doesn't make much sense on its own - a pandas DataFrame is already a data table in memory, with far more conveniences than a CSV. A CSV is a convenient file format for moving data back and forth on disk, but it offers very little convenience for direct manipulation.

From your question, what you want is to generate a .zip file containing CSVs with your data. Yes, it is possible in Python to serialize the data as if it were a CSV in memory and add each one to a zip archive - but if the CSV files were created as temporary files on disk it would be just as much work. In other words: we can create them "in memory", but that detail doesn't really matter for your problem.

The specific error you are getting happens because the pandas to_csv call does not return "the file in memory". It returns nothing (None), so when you try to add "arq" to your gzip file you get the error you posted.

The CSV data ends up in the StringIO object itself, the one in your variable "f". There are also some logic errors in your code: you create a single StringIO object and never clear its contents, so even if everything else were working, you would be repeating all the data from the first files in the last ones as well.

That’s easy to fix.
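
A minimal sketch of those two behaviours, using a tiny made-up DataFrame just for illustration:

from io import StringIO
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})     # made-up example data
f = StringIO()
result = df.to_csv(f, index=False)   # to_csv writes the CSV text into 'f'...
print(result)                        # ...and returns None
print(f.getvalue())                  # the CSV text lives in the buffer itself
df.to_csv(f, index=False)            # writing again without a new StringIO just appends,
print(f.getvalue())                  # so the header and rows now appear twice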

Another problem is that "gzip" files do not have an internal structure like a ZIP file: they are a single compressed data stream and are decompressed as a single file. That is why it is common to see files distributed on the internet as "gzip" ending in ".tar.gz", indicating that inside the gzip there is a "tar" file, which does have information about the files contained in it.

If you want a ZIP file that can be listed by any program that works with this type of file, and inside it see the CSVs for lines 0 to 999, 1000 to 1999, and so on, you have to use the standard library zipfile module, not gzip.

the object "Stringio" does not have a filename - it should not be possible to pass it directly to gzip.write - we have to see the documentation of this call see a gzip file object accepts direct data, and metadata as filename - otherwise it will be necessary to create a temporary disk file, as mentioned above.

Well, in practice, let’s try something like this:


import pandas as pd
from io import StringIO
import zipfile

csvfile = pd.read_csv('file.csv')

with zipfile.ZipFile('arquivos_csv_agrupados.zip',
                     'w', compression=zipfile.ZIP_DEFLATED,
                     compresslevel=9) as zf:
    for starting_line in range(0, len(csvfile), 1000):  # range already steps 1000 lines at a time
        data = csvfile[starting_line: starting_line + 1000]
        f = StringIO()                                   # a fresh buffer for each partial CSV
        data.to_csv(f, index=False)
        # Add the data to the ZIP file, building the member name and retrieving the text written to 'f':
        zf.writestr(f"file_{starting_line:04d}_{starting_line + 1000:04d}.csv", f.getvalue())

The biggest difference is really the use of zipfile instead of gzip - but what was actually wrong is that the partial CSV content generated inside the StringIO has to be recovered with the .getvalue() method, not from the value returned by to_csv.

With zipfile we also get to choose the names of the files that go inside the zip, via the .writestr method used above.

You were trying to create the zipped file in memory with BytesIO - that doesn't make much sense here, since the file is generated directly on disk with the example name "arquivos_csv_agrupados.zip" - but if you really want to create the zip only in memory, using BytesIO could work.

The other fix is that you don't need to advance "i" one by one and use an if to detect the multiples of 1000 - the range function can step by 1000 on its own.
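
If the in-memory version is really what you need, a rough sketch of that variant, under the same assumptions as the example above (the zip is assembled inside a BytesIO buffer instead of a file on disk), could look like this:

import pandas as pd
from io import BytesIO, StringIO
import zipfile

csvfile = pd.read_csv('file.csv')

memory_zip = BytesIO()   # the zip archive is built inside this buffer instead of on disk
with zipfile.ZipFile(memory_zip, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for starting_line in range(0, len(csvfile), 1000):
        data = csvfile[starting_line: starting_line + 1000]
        f = StringIO()
        data.to_csv(f, index=False)
        zf.writestr(f"file_{starting_line:04d}_{starting_line + 1000:04d}.csv", f.getvalue())

zip_bytes = memory_zip.getvalue()   # raw bytes of the finished zip, ready to send elsewhere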

  • It worked perfectly! I really appreciate it! However, I need to keep the zip in memory, not on disk, because I will transmit it to storage later. How could I use BytesIO in this case?

  • In memory you don't exactly "save" it - just do almost what you were doing: create the BytesIO object in a variable before the with, and pass it as the first parameter to ZipFile - that should work. Then you use the .getvalue() of that object to get the zip bytes, if the goal is, for example, to send the zip content to an API or put it in an HTTP message body.

  • If I understood correctly, I would do something like this: memory_zip = BytesIO(), then with zipfile.ZipFile(memory_zip, ...), then zip_to_bucket = memory_zip.getvalue() and upload()? I tried it here, and it reported "File b'PK\x03\x04\x14\x00\x00\x0...<a lot of characters>' uploaded to file.zip". However, when I look at Storage (I'm trying to send it to Google Cloud), the file does not appear.

  • Then we'd need to see how you are calling the upload - it's not in the code of this question. Maybe open another one. From the message, everything seems ok - the question is where the upload is being made and why it does not appear where you expect.

  • One more thing: if this upload can accept an open file, you may not need .getvalue() - in that case you call the seek(0) method to rewind the BytesIO object's position - but that would just be cosmetic; from the message you posted, the upload is already working.

  • As far as I know, the upload only accepts closed files. Here it is:

    def upload_to_storage(bucket_name, source_file_name, destination_blob_name):
        """Uploads a file to the bucket."""
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(destination_blob_name)
        blob.upload_from_string(source_file_name)
        print("File {} uploaded to {}.".format(source_file_name, destination_blob_name))

    This method is called after the files mentioned earlier are created.
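
For completeness, a hedged sketch of wiring that upload to the in-memory zip from the earlier sketch; the bucket and object names here are placeholders, memory_zip is the BytesIO buffer built above, and upload_from_string accepts raw bytes:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('my-bucket')          # placeholder bucket name
blob = bucket.blob('arquivos_csv_agrupados.zip')     # destination object name in the bucket
blob.upload_from_string(memory_zip.getvalue(),       # raw zip bytes from the BytesIO buffer
                        content_type='application/zip')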

