Grammatical classification in NLTK

I am using the Natural Language Toolkit (NLTK) library to process some texts. It provides RSLPStemmer(), which strips most of each word and leaves only the stem.
However, perfect homonyms and paronyms are reduced to the same stem by RSLPStemmer(), so when I pass the result to FreqDist they are counted as occurrences of a single word.

Is there a way to handle this with RSLPStemmer() without losing the original word, so that FreqDist does not merge occurrences that should be counted separately?

  • Have you already looked at the difference between the Porter and Lancaster stemmers?

  • No. I started reading some articles and looked at some examples of the stemmer I mentioned, but I did not find any study material on the ones you mention. I would like to understand them better; could you explain the difference? Thank you for your attention.

1 answer

See the example below:

import nltk
from nltk.stem.snowball import SnowballStemmer


def stemming_bag(words, stemmer):
    # Stem every word in the list passed in (not the global WORDS).
    return [stemmer.stem(w) for w in words]


def imprime(w, result):
    return list(zip(w, result))


WORDS = ["correr", "corria", "correndo", "correio", "corredor", "corredora", "corredeira", "correia"]

porter = stemming_bag(WORDS, nltk.PorterStemmer())
print(imprime(WORDS, porter))


lancaster = stemming_bag(WORDS, nltk.LancasterStemmer())
print(imprime(WORDS, lancaster))

# RSLPStemmer needs the "rslp" resource; run nltk.download("rslp") once if it is missing.
rslp = stemming_bag(WORDS, nltk.stem.RSLPStemmer())
print(imprime(WORDS, rslp))

snowball_pt = stemming_bag(WORDS, SnowballStemmer("portuguese"))
print(imprime(WORDS, snowball_pt))

The result would be (in order: Porter, Lancaster, RSLP, Snowball):

[('correr', 'correr'), ('corria', 'corria'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'corredor'), ('corredora', 'corredora'), ('corredeira', 'corredeira'), ('correia', 'correia')]
[('correr', 'cor'), ('corria', 'corr'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'cor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corr'), ('corredora', 'corr'), ('corredeira', 'corred'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corredor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corr')]

Personally, I would use Snowball in this case.

However, if you need the frequency per word (e.g. corredor counted separately from corredora), the counting has to be done before stemming.
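A minimal sketch of that idea: count the original words first, then aggregate by stem while remembering which words fed each stem. To keep it runnable without extra downloads, it uses `collections.Counter` (NLTK's `FreqDist` is a subclass of it) and a lookup table built from the Snowball stems listed above as a stand-in for a real stemmer. The names `count_words_and_stems` and `lookup_stem` are hypothetical, not part of NLTK.

```python
from collections import Counter, defaultdict

# Stand-in for a real stemmer: the Snowball stems shown in the output above.
# In real use, pass something like SnowballStemmer("portuguese").stem instead.
SNOWBALL_STEMS = {
    "correr": "corr", "corria": "corr", "correndo": "corr",
    "correio": "correi", "corredor": "corredor", "corredora": "corredor",
    "corredeira": "corredeir", "correia": "corr",
}


def lookup_stem(word):
    # Hypothetical helper: fall back to the word itself if it is not in the table.
    return SNOWBALL_STEMS.get(word, word)


def count_words_and_stems(tokens, stem):
    """Return per-word counts, per-stem counts, and a stem -> words mapping."""
    word_freq = Counter(tokens)          # counted BEFORE stemming
    stem_freq = Counter()
    stem_to_words = defaultdict(set)
    for word, n in word_freq.items():
        s = stem(word)
        stem_freq[s] += n                # aggregate the word's count under its stem
        stem_to_words[s].add(word)       # keep the original words for inspection
    return word_freq, stem_freq, stem_to_words


tokens = ["corredor", "corredora", "corredor", "correia", "correr"]
word_freq, stem_freq, stem_to_words = count_words_and_stems(tokens, lookup_stem)

print(word_freq["corredor"], word_freq["corredora"])  # separate per-word counts
print(stem_freq["corredor"])                          # both words merged under one stem
print(sorted(stem_to_words["corr"]))                  # which words collapsed into "corr"
```

Keeping `stem_to_words` around makes it easy to spot stems where unrelated words were merged, such as correia and correr here.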

See also lemmatization, and take a look at the NLPyPort library.

  • For example, words that are written the same but would be analysed differently in the text. The word "sede" can be a noun meaning thirst (wanting to drink beer, water, etc.) and also a noun for a location (the headquarters of a company or a farm). For the use you mentioned I think this would be better, but using RSLPStemmer() would spoil the morphological meaning.

  • I hope those are isolated cases. For them, you can (i) create POS tags and/or (ii) create IOB tags. With IOB you can attach morphological meaning to the tokens. IOB is used for named entity recognition (NER). The system has to be trained, and assembling the training file is a lot of work. With NER you can, for example, search for all company names or personal names.

  • Sensational, thank you so much for showing this way.
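To illustrate the IOB idea from the comments above, here is a small sketch that groups IOB-tagged tokens into entities. It assumes the tokens have already been labelled (by hand or by a trained tagger); the sentence, the `PER`/`ORG` labels, the company name and the function name `extract_entities` are all made up for illustration.

```python
def extract_entities(tagged_tokens):
    """Group IOB-tagged (word, tag) pairs into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity begins; close the previous one, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # "O" tag (or a stray I- tag, which is ignored): close any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-labelled example; a real pipeline would get these tags from a trained model.
tagged = [
    ("Maria", "B-PER"), ("Silva", "I-PER"), ("trabalha", "O"), ("na", "O"),
    ("sede", "O"), ("da", "O"), ("Acme", "B-ORG"),
]
entities = extract_entities(tagged)
print(entities)

# Searching for all company names then becomes a simple filter.
companies = [text for text, kind in entities if kind == "ORG"]
print(companies)
```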
