Size of tuple lists in a df


I have the following df:

n_words                       Words                        .
   220     [('trabalho', 17), ('monitor', 17), ('via', 16... 
  3114     [('atend', 863), ('ortopedico', 863), ('proced... 
     5     [('anomalos', 2), ('feixes', 1), ('eletrofisio... 
     3     [('hr', 1), ('sistema', 1), ('fenotipagem'...

I need the number of distinct words, that is, the length of each list of tuples.

I tried:

df['palvras_dif'] = ""
i = 0
for row in df['Words']:
    df['palvras_dif'][i] = len(df['Words'][i])
    i+=1
df

But it doesn't count correctly. Can someone help me?

  • Are you using pandas?

  • Yes, I am!

  • And what does the number in each tuple represent? Should it be taken into account as well, or just the word?

  • It is the frequency with which the word appeared in another df. For example, on line 3 I had the list ['anomalies', 'electrophysiotherapy', 'bundles', 'anomalies', 'electrophysiotherapy'], and I built the list of tuples with each word and its frequency. I need to know how many words are different, so I wanted the size of the list of tuples...

  • But should it be taken into account or not? For example, if there are ('trabalho', 2) and ('trabalho', 14), should they be treated as the same word or as separate occurrences?

  • In the example you gave, I would not have the same word twice, precisely because the number is the word's frequency.

  • Then wouldn't it be enough to sum the values in n_words?

  • No, because n_words holds the total number of words, including the repeated ones. I need the number of distinct words. Take the example on line 3: ['anomalies', 'electrophysiotherapy', 'bundles', 'anomalies', 'electrophysiotherapy']. I have n_words = 5 and I need the number of different words, which would be 3.


1 answer



As discussed in:

You can use Python's set type, which by definition has no repeated elements.

Use the expression p[0] for palavras in df['Words'] for p in palavras to iterate over every word in the DataFrame. Then build a set from those words and check its size:

num_palavras = len(set(p[0] for palavras in df['Words'] for p in palavras))

For example:

import pandas as pd

df = pd.DataFrame(data={
    'Words': [
        [('a', 1), ('b', 2)],
        [('c', 1), ('d', 2)],
    ]
})

num_palavras = len(set(p[0] for palavras in df['Words'] for p in palavras))

print(num_palavras)  # 4

See it working on Repl.it
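As a variation on the example above, here is a small sketch of what happens when the same word shows up in more than one row. The data is made up for illustration; the point is only that the set counts a repeated word once, while the total number of tuples counts it once per row.

```python
import pandas as pd

# Illustrative data (not from the question): 'b' appears in both rows
df = pd.DataFrame(data={
    'Words': [
        [('a', 1), ('b', 2)],
        [('b', 3), ('c', 1)],
    ]
})

# Set comprehension: keep only the word from each (word, frequency) tuple
distinct_words = {word for palavras in df['Words'] for word, _ in palavras}
num_distinct = len(distinct_words)

# Total number of tuples across all rows, repeated words included
num_tuples = sum(len(palavras) for palavras in df['Words'])

print(num_distinct)  # 3
print(num_tuples)    # 4
```

When the two numbers match, no word is repeated across rows.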

However, as mentioned in the comments, the words do not repeat across rows, so it is enough to count the tuples present in the DataFrame:

num_palavras = sum(len(palavras) for palavras in df['Words'])
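
Since the question asks for the size of each list, a per-row count may also be useful. The following is a minimal sketch, not part of the original answer. It assumes each cell of Words is either a real Python list or, for example after loading from a CSV file, the string representation of one. The helper to_list is a hypothetical name introduced here, and the sample rows are illustrative (the first one follows the example given in the comments).

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'n_words': [5, 2],
    'Words': [
        # A real list of (word, frequency) tuples
        [('anomalies', 2), ('electrophysiotherapy', 2), ('bundles', 1)],
        # Assumption: a cell stored as text, as can happen after reading a CSV
        "[('hr', 1), ('sistema', 1)]",
    ],
})

def to_list(value):
    # Hypothetical helper: parse the cell if it is a string, otherwise return it unchanged
    return ast.literal_eval(value) if isinstance(value, str) else value

# One count per row, without an explicit loop or index-based assignment
df['palavras_dif'] = df['Words'].map(to_list).map(len)

print(df['palavras_dif'].tolist())  # [3, 2]
```

If the cells are strings, len applied directly returns the number of characters rather than the number of tuples, which is one possible reason for a count that looks wrong.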

  • I got it!! Thank you so much, Anderson!!
