Problems with FreqDist and ConditionalFreqDist from NLTK

Question


Asked 4 years, 7 months ago


I tokenized and tagged a pandas column with NLTK and then exported the column as a list:

list1 = esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).tolist()

list1 has the following structure (there are over a thousand sentences; this is just one example):

[[('minha', 'PROADJ'), ('infância', 'N'), ('na', 'ADV'), ('Bahia', 'NPROP'), ('era', 'V'), ('boa', 'ADJ'), ('mas', 'KC'), ('era', 'V'), ('sofrida', 'PCP'), ('também', 'PDEN'), ('né', 'IN'), ('doutor', 'N'), ('Oswaldo', 'NPROP')], [('porque', 'KS'), ('eu', 'PROPESS'), ('tinha', 'V')]]

Now I'm trying to use nltk.ConditionalFreqDist and nltk.FreqDist on list1 in a for loop, but I want the results for the whole text, not per list of tuples. I tried:

fd = []
cd = []
for tuple in list1:
    fd.append(nltk.FreqDist(tuple))
    cd.append(nltk.ConditionalFreqDist(tuple))

This gives me one ConditionalFreqDist (in cd) and one FreqDist (in fd) per sentence. I show only cd because fd has the same problem:

[ConditionalFreqDist(nltk.probability.FreqDist,
                     {'Bahia': FreqDist({'NPROP': 1}),
                      'Oswaldo': FreqDist({'NPROP': 1}),
                      'boa': FreqDist({'ADJ': 1}),
                      'doutor': FreqDist({'N': 1}),
                      'era': FreqDist({'V': 2}),
                      'infância': FreqDist({'N': 1}),
                      'mas': FreqDist({'KC': 1}),
                      'minha': FreqDist({'PROADJ': 1}),
                      'na': FreqDist({'ADV': 1}),
                      'né': FreqDist({'IN': 1}),
                      'sofrida': FreqDist({'PCP': 1}),
                      'também': FreqDist({'PDEN': 1})}),
 ConditionalFreqDist(nltk.probability.FreqDist,
                     {'a': FreqDist({'ART': 1}),
                      'avô': FreqDist({'N': 1}),
                      'em': FreqDist({'PREP': 1}),
                      'era': FreqDist({'V': 1}),
                      'eu': FreqDist({'PROPESS': 3})})

My question: is it possible to use FreqDist and ConditionalFreqDist on list1 so that they count over the whole text rather than per list of tuples? How would I do that? In pandas I tried a chain of lambdas,

esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).apply(lambda x: nltk.ConditionalFreqDist(x))

but the result is per row and not for the whole column (which would represent the whole text in this case). In short, I couldn't do it with either list1 or the DataFrame. Thank you!

1 answer


by Paulo Marques • **3,739** points · Answer 1 · 2020-12-03T06:18:43+00:00

To get a single FreqDist for the whole text, you have to iterate through all the words in the set and accumulate them in the same FreqDist object.

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist

>>> sent = 'Eu quero analisar isso.'

>>> fdist = FreqDist()
>>> for word in word_tokenize(sent):
...    fdist[word.lower()] += 1

Therefore, you would have to iterate over your DataFrame row by row and, within each row, over the tuples stored in your column.

Something like this:

for key, value in df.iterrows():
    for tupla in value["sua_coluna_aqui"]:  # placeholder: use your column name here
        fdist[tupla[0]] += 1  # tupla[0] is the word
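If the data is already available as the plain list list1 from the question, the same whole-text count can probably be done without pandas. The following is only a minimal sketch, assuming list1 is a list of sentences and each sentence is a list of (word, tag) tuples; the sample data is a shortened version of the example in the question, and the names all_pairs, word_fd and tag_fd are made up for illustration.

```python
from itertools import chain

from nltk.probability import FreqDist

# Shortened sample in the same shape as list1 from the question:
# one inner list of (word, tag) tuples per sentence.
list1 = [
    [("minha", "PROADJ"), ("infância", "N"), ("era", "V"),
     ("boa", "ADJ"), ("mas", "KC"), ("era", "V")],
    [("porque", "KS"), ("eu", "PROPESS"), ("tinha", "V")],
]

# Flatten all sentences into a single list of (word, tag) pairs.
all_pairs = list(chain.from_iterable(list1))

# One FreqDist over the words of the whole text.
# lower() merges case variants such as "Era" and "era".
word_fd = FreqDist(word.lower() for word, tag in all_pairs)

# One FreqDist over the tags of the whole text.
tag_fd = FreqDist(tag for word, tag in all_pairs)

print(word_fd.most_common(3))
print(tag_fd["V"])
```

FreqDist accepts any iterable of samples, so passing a generator avoids the explicit += 1 loop; the loop version above gives the same counts.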

UPDATE - 2020/12/3 - in response to the comment

According to your post, the data is:

minha_lista = [ [('o', 'TAG1'), ('exemplo', 'TAG2')], [('um', 'TAG3'), ('exemplo', 'TAG2')], [('sem', 'TAG4'), ('exemplo', 'TAG2')] ]

We can then put it in a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"sentence": minha_lista})

At this point the DataFrame is:

>>> df
                         sentence
0    [(o, TAG1), (exemplo, TAG2)]
1   [(um, TAG3), (exemplo, TAG2)]
2  [(sem, TAG4), (exemplo, TAG2)]

Then just calculate the frequencies:

from nltk.probability import FreqDist

fdist = FreqDist()

for key, value in df.iterrows():
    for tupla in value["sentence"]:
        fdist[tupla[0]] += 1
        print(tupla[0])  # Print the word just to confirm that nothing is going wrong. :)

The output will be:

o
exemplo
um
exemplo
sem
exemplo

Checking on the frequency:

>>> fdist
FreqDist({'exemplo': 3, 'o': 1, 'um': 1, 'sem': 1})
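The question also asks about ConditionalFreqDist, which the loop above does not cover. ConditionalFreqDist takes an iterable of (condition, sample) pairs, so one way to get a single distribution for the whole text is to feed it every (word, tag) pair from every sentence at once. This is only a sketch under the same assumption about the shape of list1; the third sentence is invented here to show counts adding up across sentences, and the names cfd and tag_cfd are illustrative.

```python
from nltk.probability import ConditionalFreqDist

# Same shape as list1 in the question; the last sentence is hypothetical
# and only exists to show that counts accumulate across sentences.
list1 = [
    [("minha", "PROADJ"), ("infância", "N"), ("era", "V"),
     ("boa", "ADJ"), ("mas", "KC"), ("era", "V")],
    [("porque", "KS"), ("eu", "PROPESS"), ("tinha", "V")],
    [("eu", "PROPESS"), ("era", "V")],
]

# Condition = word, sample = tag, counted over the whole text.
cfd = ConditionalFreqDist(
    (word.lower(), tag) for sentence in list1 for word, tag in sentence
)

# Swapping the pair gives, for each tag, the words that carry it.
tag_cfd = ConditionalFreqDist(
    (tag, word.lower()) for sentence in list1 for word, tag in sentence
)

print(cfd["era"].most_common())
print(tag_cfd["V"].most_common())
```

Building one ConditionalFreqDist from a flattened generator, instead of one per sentence, is what makes the counts refer to the whole column rather than to each row.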


I hope this helps.