Classification of grammar in nltk

Question

Asked 4 years, 3 months ago

Viewed 65 times

I am using the Natural Language Toolkit (NLTK) to process some texts. The library has RSLPStemmer(), which strips most of each word and leaves only the stem.
With perfect homonyms and paronyms, however, RSLPStemmer() reduces distinct words to the same stem, and when I pass the result to FreqDist they are counted as occurrences of a single word.

Is there a way to handle this with RSLPStemmer() without losing the original word, so that FreqDist does not count repetitions that should not be counted together?

Have you already looked at the difference between the Porter and Lancaster stemmers?

– Paulo Marques

2021/03/30 at 04:39
No. I started reading some articles and looked at examples of the ones I mentioned, but I did not find any study material on the ones you wrote about. I would like to understand them better. Could you explain the difference? Thank you for your attention.

– stack.cardoso

2021/03/30 at 05:08

1 answer

by Paulo Marques • **3,739** points · Answer 1 · 2021-03-30T17:42:14+00:00

See the example below:

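import nltk
from nltk.stem.snowball import SnowballStemmer

# RSLPStemmer needs the "rslp" resource; this is a no-op if it is already installed.
nltk.download("rslp", quiet=True)


def stemming_bag(words, stemmer):
    # Stem every word in the list that was passed in (not the global WORDS).
    return [stemmer.stem(w) for w in words]


def imprime(w, result):
    # Pair each original word with its stem.
    return list(zip(w, result))


WORDS = ["correr", "corria", "correndo", "correio", "corredor", "corredora", "corredeira", "correia"]

# Porter and Lancaster are English stemmers, shown here only for comparison.
porter = stemming_bag(WORDS, nltk.PorterStemmer())
print(imprime(WORDS, porter))

lancaster = stemming_bag(WORDS, nltk.LancasterStemmer())
print(imprime(WORDS, lancaster))

# RSLP and Snowball ("portuguese") are designed for Portuguese.
rslp = stemming_bag(WORDS, nltk.stem.RSLPStemmer())
print(imprime(WORDS, rslp))

snowball_pt = stemming_bag(WORDS, SnowballStemmer("portuguese"))
print(imprime(WORDS, snowball_pt))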

The result would be (Porter, Lancaster, RSLP, and Snowball, in that order):

[('correr', 'correr'), ('corria', 'corria'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'corredor'), ('corredora', 'corredora'), ('corredeira', 'corredeira'), ('correia', 'correia')]
[('correr', 'cor'), ('corria', 'corr'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'cor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corr'), ('corredora', 'corr'), ('corredeira', 'corred'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corredor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corr')]

Personally, I would use Snowball in this case.

However, if you need the frequency per word (e.g., corredor counted separately from corredora), the frequencies have to be computed before stemming.
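
A minimal sketch of that idea, assuming you want both views at once: count the original words first, then fold those counts into per-stem totals while remembering which surface forms share each stem. The helper name `count_words_and_stems` and the lookup table `SNOWBALL_STEMS` are hypothetical and only for illustration; the table just copies stems from the Snowball output above so the example runs without NLTK. With NLTK installed, you could pass `SnowballStemmer("portuguese").stem` or `RSLPStemmer().stem` as the `stem` argument, and `nltk.FreqDist` (a `Counter` subclass) could stand in for `Counter`.

```python
from collections import Counter, defaultdict

# Stems copied from the Snowball output shown in the answer (illustrative only).
SNOWBALL_STEMS = {
    "correr": "corr",
    "corria": "corr",
    "correndo": "corr",
    "corredor": "corredor",
    "corredora": "corredor",
}


def count_words_and_stems(tokens, stem):
    """Count original words first, then aggregate those counts per stem."""
    word_freq = Counter(tokens)       # frequency of each original word
    stem_freq = Counter()             # frequency of each stem
    forms_by_stem = defaultdict(set)  # which original words share a stem
    for word, n in word_freq.items():
        s = stem(word)
        stem_freq[s] += n
        forms_by_stem[s].add(word)
    return word_freq, stem_freq, forms_by_stem


tokens = ["corredor", "corredora", "corredor", "correr", "corria"]
word_freq, stem_freq, forms_by_stem = count_words_and_stems(
    tokens, SNOWBALL_STEMS.get
)

print(word_freq["corredor"], word_freq["corredora"])  # 2 1
print(stem_freq["corredor"])                          # 3
print(sorted(forms_by_stem["corr"]))                  # ['correr', 'corria']
```

Keeping `word_freq` alongside `stem_freq` means the original word is never lost: the stem serves only as a grouping key, and `forms_by_stem` shows when two different words have collapsed into the same stem.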

See also lemmatization, and take a look at the NLPyPort library.