I tokenized and tagged a Pandas column with nltk and then exported my column as a list.
list1 = esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).tolist()
This list1 has the following structure (there are over a thousand of them, this is just one example):
[[('minha', 'PROADJ'), ('infância', 'N'), ('na', 'ADV'), ('Bahia', 'NPROP'), ('era', 'V'), ('boa', 'ADJ'), ('mas', 'KC'), ('era', 'V'), ('sofrida', 'PCP'), ('também', 'PDEN'), ('né', 'IN'), ('doutor', 'N'), ('Oswaldo', 'NPROP')], [('porque', 'KS'), ('eu', 'PROPESS'), ('tinha', 'V')]],
Now I'm trying to use nltk.ConditionalFreqDist on list1 in a for loop, but I want the results for the whole text, not per list of tuples. I tried:
fd = []
cd = []
for tuple in list1:
    fd.append(nltk.FreqDist(tuple))
    cd.append(nltk.ConditionalFreqDist(tuple))
This gives me cd (ConditionalFreqDist) and fd (FreqDist) with one entry per sublist (I show only cd, because the problem is the same for both):
{'Bahia': FreqDist({'NPROP': 1}),
'Oswaldo': FreqDist({'NPROP': 1}),
'boa': FreqDist({'ADJ': 1}),
'doutor': FreqDist({'N': 1}),
'era': FreqDist({'V': 2}),
'infância': FreqDist({'N': 1}),
'mas': FreqDist({'KC': 1}),
'minha': FreqDist({'PROADJ': 1}),
'na': FreqDist({'ADV': 1}),
'né': FreqDist({'IN': 1}),
'sofrida': FreqDist({'PCP': 1}),
'também': FreqDist({'PDEN': 1})}),
{'a': FreqDist({'ART': 1}),
'avô': FreqDist({'N': 1}),
'em': FreqDist({'PREP': 1}),
'era': FreqDist({'V': 1}),
'eu': FreqDist({'PROPESS': 3})})
My question: is it possible to use FreqDist and ConditionalFreqDist on list1 so that they count over the whole text, not per list of tuples? How would I do that? In pandas I tried a chain of lambdas, esquizo['Enunciados_limpos'].apply(lambda x: nltk.word_tokenize(x)).apply(lambda x: modelo_treinado.tag(x)).apply(lambda x: nltk.ConditionalFreqDist(x)), but the result is per row and not for the whole column (which would represent the whole text, in this case). In short, I couldn't do it with either list1 or the DataFrame. Thank you!
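One possible way to count over the whole text is to flatten list1 into a single sequence of (word, tag) pairs before building the distributions. The following is only a minimal sketch under that assumption, using the sample data from the question; the names all_pairs, fd_words and cfd are made up for illustration.

```python
from itertools import chain

import nltk

# Two tagged sentences in the same shape as list1 (sample data from the question).
list1 = [
    [('minha', 'PROADJ'), ('infância', 'N'), ('na', 'ADV'), ('Bahia', 'NPROP'),
     ('era', 'V'), ('boa', 'ADJ'), ('mas', 'KC'), ('era', 'V'),
     ('sofrida', 'PCP'), ('também', 'PDEN'), ('né', 'IN'), ('doutor', 'N'),
     ('Oswaldo', 'NPROP')],
    [('porque', 'KS'), ('eu', 'PROPESS'), ('tinha', 'V')],
]

# Flatten the list of sentences into one list of (word, tag) pairs.
# all_pairs is a hypothetical name, not something from the original code.
all_pairs = list(chain.from_iterable(list1))

# Word frequencies over the whole text (one FreqDist, not one per sentence).
fd_words = nltk.FreqDist(word for word, tag in all_pairs)

# Tag frequencies conditioned on each word, again over the whole text.
cfd = nltk.ConditionalFreqDist(all_pairs)

print(fd_words.most_common(3))
print(dict(cfd['era']))
```

Starting from the DataFrame instead, the same flattening could be applied to the list produced by .tolist(), since that is exactly how list1 was built.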
The first snippet, with the "I want to analyze this" example, does not work with the structure I have, which is not a string but tuples inside lists. In the second one, from pandas, fdist counts characters rather than words. Thanks for the help, anyway.
– Junior Costa
@Juniorcosta - I updated the post before your comment.
– Paulo Marques
Thank you! It worked for FreqDist. I'll see if I can open another question just for ConditionalFreqDist, because I still haven't managed to apply it to the whole structure (as you just showed me for FreqDist) rather than only to the sublists of tuples that make up the data. Thanks for the help.
– Junior Costa
@Juniorcosta, glad it worked. If you find it pertinent, mark the answer as accepted. :)
– Paulo Marques
I've marked it now. I had forgotten. I'm new here. Thank you very much!
– Junior Costa