How to process data without the kernel dying?




I want to process my data in a single notebook. However, every time I start, my computer almost freezes and the kernel appears to die. It seems to be caused by a memory problem. In particular, it happens when I run step 3 of the following function:

Shape of train: (130318, 4)
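>>>import numpy as np
>>>from textblob import TextBlob

>>>def process_data(train):

    # Split each context into a list of raw sentence strings.
    print("step 1")
    train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])

    # get_target is defined elsewhere.
    print("step 2")
    train["target"] = train.apply(get_target, axis = 1)

    # Look up each sentence's embedding in dict_emb (a zero vector if missing).
    print("step 3")
    train['sent_emb'] = train['sentences'].apply(
        lambda x: [dict_emb[item][0]
                   if item in dict_emb
                   else np.zeros(4096) for item in x])

    return train

>>>train = process_data(train)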

Could it be a memory problem? Are there hosted solutions I could use instead? For now, I'll try Google Colaboratory...
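A rough back-of-the-envelope estimate makes the memory hypothesis plausible. The sketch below assumes the default float64 dtype of np.zeros and a hypothetical average of 5 sentences per row (the real average is not given in the question); estimate_bytes is an illustrative helper, not part of any library. It is a worst case: entries found in dict_emb are probably only referenced rather than copied, so the real cost depends mostly on how many sentences are missing from the dictionary and get a fresh np.zeros(4096).

```python
import numpy as np

def estimate_bytes(n_rows, sentences_per_row, dim=4096, dtype=np.float64):
    """Worst-case bytes if every sentence holds its own embedding vector."""
    return n_rows * sentences_per_row * dim * np.dtype(dtype).itemsize

n_rows = 130318          # row count from the (130318, 4) shape in the question
sentences_per_row = 5    # ASSUMPTION: hypothetical average, not from the question

worst_case = estimate_bytes(n_rows, sentences_per_row)
as_float32 = estimate_bytes(n_rows, sentences_per_row, dtype=np.float32)

print(f"float64: {worst_case / 2**30:.1f} GiB")
print(f"float32: {as_float32 / 2**30:.1f} GiB")
```

Under these assumptions the worst case comes to roughly 20 GiB as float64, which is more RAM than most laptops have. Halving the dtype to float32 halves the figure but does not remove the problem.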

Maybe I could turn this into a loop that processes the rows in batches? My attempt:

for i in range(0,len(train.shape[0]-200,200)):
    train['sent_emb'] = train['sentences'].iloc[i,i+200].apply(
        lambda x: [dict_emb[item][0] 
        if item in dict_emb 
        else np.zeros(4096) for item in x])      

But it raises an error:

step 1
step 2
step 3

ValueError                                Traceback (most recent call last)
<ipython-input-26-d3e879a8c753> in <module>()
----> 1 train = process_data(train)

<ipython-input-25-7063894d5c9a> in process_data(train)
     10     #train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in\
     11     #                                                       dict_emb else np.zeros(4096) for item in x])
---> 12     train['quest_emb'] =[]
     13     for i in range(0,len(train.shape[0]-200,200)):
     14         print(i)

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3121     def _setitem_slice(self, key, value):

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ in _set_item(self, key, value)
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ in _sanitize_column(self, key, value, broadcast)
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ in _sanitize_index(data, index, copy)
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index
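
The ValueError in this traceback comes from the train['quest_emb'] = [] line rather than from the embedding lookup: an empty list has 0 values while the DataFrame has one row per index entry. The minimal sketch below reproduces that on a tiny stand-in DataFrame (df and its contents are made up for illustration) and also shows how the batch bounds and the iloc slice could be written, since range() takes its bounds directly and iloc slices with a colon.

```python
import pandas as pd

# Tiny stand-in for train; the real data is not shown in the question.
df = pd.DataFrame({"sentences": [["a"], ["b"], ["c"]]})

# Assigning an empty list as a column fails: 0 values for 3 rows.
try:
    df["quest_emb"] = []
except ValueError as err:
    message = str(err)
    print(message)

# A placeholder column needs one value per row, for example None.
df["quest_emb"] = [None] * len(df)

# Batches of rows: range(start, stop, step) and iloc[start:stop].
batch_size = 2
batches = [df["sentences"].iloc[i:i + batch_size]
           for i in range(0, len(df), batch_size)]
print([len(b) for b in batches])
```

Note that assigning each batch result to train['sent_emb'] inside the loop would overwrite the column on every iteration, so the results need to be collected or saved per batch instead.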
  • Have you tried running the line on a subset of the DataFrame? E.g.: train['sent_emb'] = train['sentences'].iloc[:200].apply( ...

  • @Pedrovonhertwig Maybe I misunderstood something, but I replaced my code with your .iloc[] version and it worked! I don't understand why!

  • @Pedrovonhertwig So if I want to do this for all the examples, I write train['sent_emb'] = train['sentences'].iloc[:130318].apply( ...?

  • iloc selects rows by position. The point of using iloc[:200] was to apply the operation to only the first 200 rows, to check whether memory is really the problem. It seems strange that it works with .iloc[:130318] and not without iloc. Bizarre!

  • @Pedrovonhertwig No, you're right, it doesn't work when iloc covers the full dataset. This is a memory problem. Maybe I could split the data into several pieces?
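
Splitting the work into pieces, as suggested in the last comment, could look like the sketch below. It is only one possible approach under stated assumptions: dict_emb maps a sentence to an array whose first row is the embedding (as in the question), missing sentences may all share one zero vector instead of each getting a new np.zeros(4096), and each chunk is consumed or saved before the next one is built. The names EMB_DIM, ZERO_VEC, embed_sentences and iter_embedded_chunks are hypothetical.

```python
import numpy as np
import pandas as pd

EMB_DIM = 4096
# One shared vector for every missing sentence (ASSUMPTION: it is never modified in place).
ZERO_VEC = np.zeros(EMB_DIM, dtype=np.float32)

def embed_sentences(sentences, dict_emb):
    """Return one embedding per sentence, reusing ZERO_VEC for misses."""
    return [dict_emb[s][0] if s in dict_emb else ZERO_VEC for s in sentences]

def iter_embedded_chunks(train, dict_emb, chunk_size=200):
    """Yield (start_row, Series of embedding lists) one chunk at a time."""
    for start in range(0, len(train), chunk_size):
        chunk = train["sentences"].iloc[start:start + chunk_size]
        yield start, chunk.apply(lambda x: embed_sentences(x, dict_emb))

# Toy data standing in for the real train and dict_emb.
dict_emb = {"a": np.ones((1, EMB_DIM), dtype=np.float32)}
train = pd.DataFrame({"sentences": [["a", "b"], ["b"], ["a"], [], ["a", "a"]]})

for start, emb in iter_embedded_chunks(train, dict_emb, chunk_size=2):
    # Use or save the chunk here, then let it go out of scope.
    print(start, [len(vectors) for vectors in emb])
```

Sharing ZERO_VEC avoids allocating a new 4096-element array for every missing sentence, which is likely where most of the extra memory goes. The trade-off is that the shared vector must be treated as read-only.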

No answers
