I want to process the data in a single unsupervised.py notebook. However, every time I run it, my computer almost freezes and the kernel seems to die, apparently because of a memory error. In particular, this happens when I run **step 3** of the following function:
>>> train.shape
(130318, 4)
>>> len(dict_emb)
179862
>>> def process_data(train):
        print("step 1")
        # Split each context into raw sentences with TextBlob
        train['sentences'] = train['context'].apply(
            lambda x: [item.raw for item in TextBlob(x).sentences])
        print("step 2")
        train["target"] = train.apply(get_target, axis=1)
        print("step 3")
        # Look up each sentence's embedding, defaulting to a 4096-dim zero vector
        train['sent_emb'] = train['sentences'].apply(
            lambda x: [dict_emb[item][0] if item in dict_emb
                       else np.zeros(4096) for item in x])
        return train
>>> train = process_data(train)
Maybe it’s a memory problem? Are there any solutions online? For now, I’ll try Google Colaboratory ...
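A rough back-of-the-envelope estimate (my numbers, assuming an average of about 5 sentences per context, which is just a guess) suggests the new column alone would be huge:

rows = 130318   # train.shape[0]
sents = 5       # assumed average sentences per context (a guess)
dim = 4096      # embedding size used in the lambda
print(rows * sents * dim * 8 / 1024**3)  # float64 = 8 bytes → ≈ 20 GiB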
Maybe turn this into a loop that processes the rows in batches? My attempt:
for i in range(0,len(train.shape[0]-200,200)):
    print(i)
    train['sent_emb'] = train['sentences'].iloc[i,i+200].apply(
        lambda x: [dict_emb[item][0]
                   if item in dict_emb
                   else np.zeros(4096) for item in x])
But it gives me errors:
step 1
step 2
step 3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-d3e879a8c753> in <module>()
----> 1 train = process_data(train)
<ipython-input-25-7063894d5c9a> in process_data(train)
10 #train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in\
11 # dict_emb else np.zeros(4096) for item in x])
---> 12 train['quest_emb'] =[]
13 for i in range(0,len(train.shape[0]-200,200)):
14 print(i)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3117 else:
3118 # set column
-> 3119 self._set_item(key, value)
3120
3121 def _setitem_slice(self, key, value):
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3192
3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
3195 NDFrame._set_item(self, key, value)
3196
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3389
3390 # turn me into an ndarray
-> 3391 value = _sanitize_index(value, self.index, copy=False)
3392 if not isinstance(value, (np.ndarray, Index)):
3393 if isinstance(value, list) and len(value) > 0:
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
3999
4000 if len(data) != len(index):
-> 4001 raise ValueError('Length of values does not match length of ' 'index')
4002
4003 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
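The traceback seems to point at the `train['quest_emb'] =[]` line inside my function: pandas refuses to assign a list whose length doesn’t match the 130318-row index. A minimal repro of the same error:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
df["b"] = []  # ValueError: Length of values does not match length of index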
Have you tried running the line on a DataFrame subset? For example:
train['sent_emb'] = train['sentences'].iloc[:200].apply( ...
– Pedro von Hertwig Batista
@Pedrovonhertwig Maybe I misunderstood something, but I replaced my code with yours using `.iloc[]` and it worked! I don’t understand! – Revolucion for Monica
@Pedrovonhertwig So if I want to do this on all the rows, should I write `train['sent_emb'] = train['sentences'].iloc[:130318].apply( ...`? – Revolucion for Monica
`iloc` selects by index. The point of using `.iloc[:200]` was to apply the operation only to the first 200 rows, to check whether the problem really is memory; it seems strange that it works with `.iloc[:130318]` but not without `iloc`. Bizarre! – Pedro von Hertwig Batista
@Pedrovonhertwig No, you’re right, it doesn’t work with the full-size `iloc`. It is a memory problem. Maybe I could split it into several chunks? – Revolucion for Monica
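A minimal sketch of that chunked approach (my own code, not from the thread, assuming `train` and `dict_emb` as above; it also stores the vectors as float32 to halve the memory):

import numpy as np
import pandas as pd

def embed_sentences(sentences, dim=4096):
    # Look up each sentence's embedding, falling back to a zero vector;
    # float32 takes half the space of the default float64.
    return [np.asarray(dict_emb[s][0], dtype=np.float32)
            if s in dict_emb else np.zeros(dim, dtype=np.float32)
            for s in sentences]

parts = []
for start in range(0, len(train), 200):
    chunk = train['sentences'].iloc[start:start + 200]
    parts.append(chunk.apply(embed_sentences))
train['sent_emb'] = pd.concat(parts)

Note that chunking alone does not shrink the final column (still roughly 10 GiB at float32 under the estimate above); to really cut peak memory, each chunk would have to be written to disk (e.g. with np.save) instead of being kept in the DataFrame.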