Problems with FreqDist and ConditionalFreqDist from NLTK

I tokenized and tagged a pandas column with NLTK and then exported the column as a list:

list1 = esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).tolist()

list1 has the following structure (there are over a thousand sublists; this is just one example):

[[('minha', 'PROADJ'), ('infância', 'N'), ('na', 'ADV'), ('Bahia', 'NPROP'), ('era', 'V'), ('boa', 'ADJ'), ('mas', 'KC'), ('era', 'V'), ('sofrida', 'PCP'), ('também', 'PDEN'), ('né', 'IN'), ('doutor', 'N'), ('Oswaldo', 'NPROP')], [('porque', 'KS'), ('eu', 'PROPESS'), ('tinha', 'V')]]

Now I'm trying to use nltk.ConditionalFreqDist and nltk.FreqDist on list1 in a for loop, but I want the results based on the whole text, not on each sublist of tuples. I tried:

fd = []
cd = []
for tuple in list1:
    fd.append(nltk.FreqDist(tuple))
    cd.append(nltk.ConditionalFreqDist(tuple))

And this is what I get for cd (ConditionalFreqDist). I show only one because fd (FreqDist) has the same problem:

[ConditionalFreqDist(nltk.probability.FreqDist,
                     {'Bahia': FreqDist({'NPROP': 1}),
                      'Oswaldo': FreqDist({'NPROP': 1}),
                      'boa': FreqDist({'ADJ': 1}),
                      'doutor': FreqDist({'N': 1}),
                      'era': FreqDist({'V': 2}),
                      'infância': FreqDist({'N': 1}),
                      'mas': FreqDist({'KC': 1}),
                      'minha': FreqDist({'PROADJ': 1}),
                      'na': FreqDist({'ADV': 1}),
                      'né': FreqDist({'IN': 1}),
                      'sofrida': FreqDist({'PCP': 1}),
                      'também': FreqDist({'PDEN': 1})}),
 ConditionalFreqDist(nltk.probability.FreqDist,
                     {'a': FreqDist({'ART': 1}),
                      'avô': FreqDist({'N': 1}),
                      'em': FreqDist({'PREP': 1}),
                      'era': FreqDist({'V': 1}),
                      'eu': FreqDist({'PROPESS': 3})})

My question: is it possible to use FreqDist and ConditionalFreqDist on list1 so that they count over the whole text, not per sublist of tuples? How would I do that? In pandas I tried with a lambda:

esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).apply(lambda x: nltk.ConditionalFreqDist(x))

but the result is per row and not for the whole column (which would represent the whole text, in this case). In other words, I couldn't do it with either list1 or the DataFrame. Thank you!

1 answer


To use FreqDist over the whole text, you have to iterate over every word in the data set and accumulate the counts in a single FreqDist.

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist

>>> sent = 'Eu quero analisar isso.'

>>> fdist = FreqDist()
>>> for word in word_tokenize(sent):
...    fdist[word.lower()] += 1

So you would have to iterate over your DataFrame row by row, and then over the tuples stored in your column within each row.

Something like below:

for key, value in df.iterrows():
    for tupla in value["sua_coluna_aqui"]:  # "sua_coluna_aqui" stands for your column name
        fdist[tupla[0]] += 1  # tupla[0] is the word, tupla[1] is the tag
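As a possible alternative to the nested loop above, and only a sketch: when the tagged sentences are already in a plain Python list (like the question's list1), they can be flattened with itertools.chain.from_iterable and handed to FreqDist in one go, since FreqDist accepts any iterable of samples. The sample data and the names minha_lista and pares are illustrative assumptions, not part of the original post.

```python
from itertools import chain

from nltk.probability import FreqDist

# Hypothetical sample in the same shape as list1:
# a list of sentences, each a list of (word, tag) tuples.
minha_lista = [
    [('o', 'TAG1'), ('exemplo', 'TAG2')],
    [('um', 'TAG3'), ('exemplo', 'TAG2')],
    [('sem', 'TAG4'), ('exemplo', 'TAG2')],
]

# Flatten the sublists into one stream of (word, tag) pairs.
pares = chain.from_iterable(minha_lista)

# Count only the word of each pair, over the whole text.
fdist = FreqDist(palavra for palavra, _tag in pares)

print(fdist.most_common(1))  # [('exemplo', 3)]
```

This gives one FreqDist for the whole text instead of one per sublist, which is what the loop in the question produced.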

UPDATE - 2020/12/3 - following the comment

According to your post, the data is:

minha_lista = [ [('o', 'TAG1'), ('exemplo', 'TAG2')], [('um', 'TAG3'), ('exemplo', 'TAG2')], [('sem', 'TAG4'), ('exemplo', 'TAG2')] ]

We can then put it into a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"sentence": minha_lista})

At this point the DataFrame is:

>>> df
                         sentence
0    [(o, TAG1), (exemplo, TAG2)]
1   [(um, TAG3), (exemplo, TAG2)]
2  [(sem, TAG4), (exemplo, TAG2)]

Then just compute the frequencies:

from nltk.probability import FreqDist

fdist = FreqDist()

for key, value in df.iterrows():
    for tupla in value["sentence"]:
        fdist[tupla[0]] += 1
        print(tupla[0])  # Print the word just to confirm nothing is going wrong. :)

The output will be:

o
exemplo
um
exemplo
sem
exemplo

Checking on the frequency:

>>> fdist
FreqDist({'exemplo': 3, 'o': 1, 'um': 1, 'sem': 1})
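The same idea should carry over to ConditionalFreqDist, which the question also asks about. This is a hedged sketch rather than part of the original answer: ConditionalFreqDist takes an iterable of (condition, sample) pairs, so flattening the sublists first gives one distribution for the whole text. The data is the example from the question; the names pares, cfd and cfd_por_tag are assumptions for illustration.

```python
from itertools import chain

from nltk.probability import ConditionalFreqDist

# Example data from the question: a list of tagged sentences.
list1 = [
    [('minha', 'PROADJ'), ('infância', 'N'), ('na', 'ADV'), ('Bahia', 'NPROP'),
     ('era', 'V'), ('boa', 'ADJ'), ('mas', 'KC'), ('era', 'V'),
     ('sofrida', 'PCP'), ('também', 'PDEN'), ('né', 'IN'), ('doutor', 'N'),
     ('Oswaldo', 'NPROP')],
    [('porque', 'KS'), ('eu', 'PROPESS'), ('tinha', 'V')],
]

# One flat list of (word, tag) pairs for the whole text.
pares = list(chain.from_iterable(list1))

# Condition on the word: for each word, how often each tag occurs.
cfd = ConditionalFreqDist(pares)
print(cfd['era']['V'])  # 2

# Condition on the tag instead: for each tag, how often each word occurs.
cfd_por_tag = ConditionalFreqDist((tag, palavra) for palavra, tag in pares)
print(cfd_por_tag['V'].most_common())  # [('era', 2), ('tinha', 1)]
```

Swapping the order of the pair decides what counts as the condition, so the same flat list can answer both "which tags does this word get?" and "which words carry this tag?".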


Hope this helps.
  • The first snippet, with the example 'Eu quero analisar isso.', does not work with the structure I have, which is not a string but tuples inside lists. In the second, with pandas, fdist counts characters, not words. Thanks for the help anyway.

  • @Juniorcosta - I updated the post before your comment.

  • Thank you! It worked for FreqDist. I'll see if I can open another question just for ConditionalFreqDist, because for that one I haven't been able to do it over the whole structure (as you just showed me for FreqDist), only for the individual sublists of tuples that make up the data. Thanks for the help.

  • @Juniorcosta, glad it worked. If you find it appropriate, mark the answer as accepted. :)

  • I've marked it now. I had forgotten. I'm new here. Thank you very much!
