Compression and Reading of CSV file with large-scale rows x columns via Pandas

I am stuck looking for a precise and intuitive way to read a file of about 70,000 KB formed by the concatenation of several files of varying sizes. I started from several files in '.txt' format; for each of them I ran a script I wrote to remove the zero values, and for each tab ('\t') found I separated the dataset values with a comma. After converting them all to CSV, I concatenated everything into a single file with pandas:
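That conversion step (dropping zero values and turning tabs into commas) can be sketched roughly like this; the function name and the sample line are made up for illustration, not taken from my actual script:

```python
# Rough sketch of the .txt -> .csv conversion described above:
# split each line on tabs, drop zero values, join the rest with commas.
def convert_line(line):
    fields = line.rstrip("\n").split("\t")
    kept = [f for f in fields if f not in ("0", "0.0")]
    return ",".join(kept)

print(convert_line("0.177\t0\t0.059"))  # -> 0.177,0.059
```

Note that dropping zeros leaves rows with different lengths, which is why the concatenated file ends up ragged.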

import os
import glob as gb
import pandas as pd

inn = "C:\\Documents\\experimento"
out = "C:\\Documents\\experimento\\full_dataset.csv"

os.chdir(inn)
FullCsv = gb.glob('*.csv')
dfList = list()
for simpleCsv in FullCsv:
    print(simpleCsv)
    df = pd.read_csv(simpleCsv, header=None)
    dfList.append(df)
concatDf = pd.concat(dfList, axis=0)
concatDf.to_csv(out, index=None)
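Since the title mentions compression: pandas can also write and read the concatenated CSV compressed, inferring the codec from the file extension. A small self-contained sketch (the file name and data are hypothetical):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")  # hypothetical name
df.to_csv(path, index=False)  # gzip compression inferred from the .gz suffix
back = pd.read_csv(path)      # read_csv decompresses transparently
print(back.shape)  # -> (2, 2)
```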

Soon after, I tried to read this newly created dataset, first without pandas (the commented-out excerpt) and then with it:

import csv
import pandas as pd

with open("C:\\Documents\\experimento\\full_dataset.csv", 'r') as foutput:
    # first attempt, without pandas (commented out):
    '''reader = csv.reader(foutput)
    listaNova = list()
    for r in reader:
        listaNova.append(r)
    print(listaNova)
    '''
    # pandas attempt (read_csv opens the path itself, so the `with` block
    # above is not strictly needed for this part)
    reader = pd.read_csv("C:\\Documents\\experimento\\full_dataset.csv", chunksize=100000)
    for read in reader:
        print(read)

But then I got:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

And the pandas version gave this result:

             0         1         2         3         4         5         6  \
0      0.17730  0.016505  0.058989 -0.314010  0.079795  0.293890  0.035616   
1     -0.68875 -0.340940 -0.647040  0.108130  0.404710 -0.161510 -0.329860   
2      1.27170  0.913990  1.389600  0.834080  0.347450  0.705510  0.547070   
3     -0.53242 -0.566420 -0.558360 -0.813050 -0.365800 -0.352100  0.106440   
4      0.17730  0.016505  0.058989 -0.314010  0.079795  0.293890  0.035616   
..         ...       ...       ...       ...       ...       ...       ...   
238  117.00000 -0.532420 -0.566420 -0.558360 -0.813050 -0.365800 -0.352100   
239  118.00000  0.177300  0.016505  0.058989 -0.314010  0.079795  0.293890   
240  119.00000 -0.688750 -0.340940 -0.647040  0.108130  0.404710 -0.161510   
241  120.00000  1.271700  0.913990  1.389600  0.834080  0.347450  0.705510   
242  121.00000 -0.532420 -0.566420 -0.558360 -0.813050 -0.365800 -0.352100   

            7        8         9  ...  46611  46612  46613  46614  46615  \
0    0.390770  0.35301  0.425470  ...    NaN    NaN    NaN    NaN    NaN   
1    0.125460 -0.13454 -0.061552  ...    NaN    NaN    NaN    NaN    NaN   
2    0.357910  0.85464  0.346880  ...    NaN    NaN    NaN    NaN    NaN   
3   -0.545210 -0.64630 -0.519490  ...    NaN    NaN    NaN    NaN    NaN   
4    0.390770  0.35301  0.425470  ...    NaN    NaN    NaN    NaN    NaN   
..        ...      ...       ...  ...    ...    ...    ...    ...    ...   
238  0.106440 -0.54521 -0.646300  ...    NaN    NaN    NaN    NaN    NaN   
239  0.035616  0.39077  0.353010  ...    NaN    NaN    NaN    NaN    NaN   
240 -0.329860  0.12546 -0.134540  ...    NaN    NaN    NaN    NaN    NaN   
241  0.547070  0.35791  0.854640  ...    NaN    NaN    NaN    NaN    NaN   
242  0.106440 -0.54521 -0.646300  ...    NaN    NaN    NaN    NaN    NaN   

     46616  46617  46618  46619  46620  
0      NaN    NaN    NaN    NaN    NaN  
1      NaN    NaN    NaN    NaN    NaN  
2      NaN    NaN    NaN    NaN    NaN  
3      NaN    NaN    NaN    NaN    NaN  
4      NaN    NaN    NaN    NaN    NaN  
..     ...    ...    ...    ...    ...  
238    NaN    NaN    NaN    NaN    NaN  
239    NaN    NaN    NaN    NaN    NaN  
240    NaN    NaN    NaN    NaN    NaN  
241    NaN    NaN    NaN    NaN    NaN  
242    NaN    NaN    NaN    NaN    NaN  

[243 rows x 46621 columns]

I wonder if there is any way to visualize the entire dataset without it being abridged. Also, in your opinion, what is the best method for concatenating and reading the dataset; would the version without pandas be better? My intention is to work with this dataset by comparing it against the pre-converted values, reading it as rows x columns and standardizing the number of columns across all rows. Note: I am a beginner in data science.

  • Do you really need all of this?

1 answer

If you have a .csv file and want to view its contents, the best option is to open it in a spreadsheet program, such as LibreOffice Calc or Excel.

If the number of lines exceeds what your spreadsheet program can handle comfortably, you can open it in a text editor such as Notepad++ or Sublime Text. You can even point your browser directly at the .csv file - the raw data will appear in the browser.

There is a reason why Pandas summarizes dataframes beyond a certain size: there is no use in seeing 1000+ lines of raw data, other than to get a "feel" for the types of data and the ranges over which they are distributed. And for that feel, either you use the abridged version, or you filter the rows to find interesting values (sort by a specific column and look at the first values, etc.).
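For example, sorting by a column and inspecting just the first rows (a generic sketch with made-up data, not the dataset from the question):

```python
import pandas as pd

df = pd.DataFrame({"a": [3.0, 1.0, 2.0], "b": [0.1, 0.2, 0.3]})
top = df.sort_values("a").head(2)  # two smallest values of column "a"
print(top["a"].tolist())  # -> [1.0, 2.0]
```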

Looking at thousands of raw data lines is good for... nothing.

Still, this is configurable in Pandas - just set the option:

pd.options.display.max_rows = <desired number of rows>
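The same option can also be set with `pd.set_option`; passing `None` removes the row limit entirely:

```python
import pandas as pd

pd.set_option("display.max_rows", None)   # disable row truncation in the repr
print(pd.get_option("display.max_rows"))  # -> None
```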

I talked more about it here: I cannot list the unique values of the dataframe

And in that case, you just need to change this

for read in reader:
    print(read)

to

print(reader)
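One caveat: this only prints the data if `chunksize` is dropped from the `read_csv` call. With `chunksize` set, `read_csv` returns a `TextFileReader` iterator rather than a `DataFrame`, so `print(reader)` would just show the object. A sketch with a small inline CSV (made-up values):

```python
import io

import pandas as pd

csv_text = "0.1,0.2\n0.3,0.4\n"

chunked = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=1)
print(type(chunked).__name__)  # -> TextFileReader (an iterator of DataFrames)

df = pd.read_csv(io.StringIO(csv_text), header=None)  # no chunksize
print(df.shape)  # -> (2, 2)
```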
