See the example below:
import nltk
from nltk.stem.snowball import SnowballStemmer

# RSLPStemmer needs its data package: run nltk.download("rslp") once if it is missing.

def stemming_bag(words, stemmer):
    # Stem every word in the list with the given stemmer.
    return [stemmer.stem(w) for w in words]

def imprime(w, result):
    # Pair each original word with its stem.
    return list(zip(w, result))

WORDS = ["correr", "corria", "correndo", "correio", "corredor", "corredora", "corredeira", "correia"]

# Porter and Lancaster target English; RSLP and Snowball("portuguese") target Portuguese.
porter = stemming_bag(WORDS, nltk.PorterStemmer())
print(imprime(WORDS, porter))
lancaster = stemming_bag(WORDS, nltk.LancasterStemmer())
print(imprime(WORDS, lancaster))
rslp = stemming_bag(WORDS, nltk.stem.RSLPStemmer())
print(imprime(WORDS, rslp))
snowball_pt = stemming_bag(WORDS, SnowballStemmer("portuguese"))
print(imprime(WORDS, snowball_pt))
The output, one line per stemmer in the order Porter, Lancaster, RSLP, Snowball, would be:
[('correr', 'correr'), ('corria', 'corria'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'corredor'), ('corredora', 'corredora'), ('corredeira', 'corredeira'), ('correia', 'correia')]
[('correr', 'cor'), ('corria', 'corr'), ('correndo', 'correndo'), ('correio', 'correio'), ('corredor', 'cor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corr'), ('corredora', 'corr'), ('corredeira', 'corred'), ('correia', 'corre')]
[('correr', 'corr'), ('corria', 'corr'), ('correndo', 'corr'), ('correio', 'correi'), ('corredor', 'corredor'), ('corredora', 'corredor'), ('corredeira', 'corredeir'), ('correia', 'corr')]
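One way to read the output above is to group the words by the stem they received. The sketch below does that for the RSLP and Snowball lines only. The pairs are copied from that output, and `group_by_stem` is a helper name made up for this illustration, not part of NLTK.

```python
from collections import defaultdict

def group_by_stem(pairs):
    """Map each stem to the list of original words that produced it."""
    groups = defaultdict(list)
    for word, stem in pairs:
        groups[stem].append(word)
    return dict(groups)

# (word, stem) pairs copied from the RSLP and Snowball lines of the output above.
rslp_pairs = [
    ("correr", "corr"), ("corria", "corr"), ("correndo", "corr"),
    ("correio", "correi"), ("corredor", "corr"), ("corredora", "corr"),
    ("corredeira", "corred"), ("correia", "corre"),
]
snowball_pairs = [
    ("correr", "corr"), ("corria", "corr"), ("correndo", "corr"),
    ("correio", "correi"), ("corredor", "corredor"), ("corredora", "corredor"),
    ("corredeira", "corredeir"), ("correia", "corr"),
]

for name, pairs in [("RSLP", rslp_pairs), ("Snowball", snowball_pairs)]:
    print(name)
    for stem, words in group_by_stem(pairs).items():
        print(f"  {stem}: {words}")
```

With these pairs, RSLP puts corredor and corredora in the same group as correr, while Snowball keeps them together under their own stem, corredor.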
Personally, I would use Snowball in this case.
However, if you need the frequency of each individual word (e.g., corredor counted separately from corredora), you have to count before stemming, because after stemming both words collapse into the same stem.
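A minimal sketch of counting before stemming, assuming the tokens are already in a list. The function name `frequencies_before_stemming` is hypothetical, and `stem` is any callable that maps a word to its stem, for example `SnowballStemmer("portuguese").stem` from the code above.

```python
from collections import Counter, defaultdict

def frequencies_before_stemming(tokens, stem):
    """Count surface forms first, then group those counts by stem.

    `stem` is any callable taking a word and returning its stem.
    """
    surface_counts = Counter(tokens)       # per-word frequency, taken before stemming
    by_stem = defaultdict(dict)
    for word, count in surface_counts.items():
        by_stem[stem(word)][word] = count  # keep the original forms under each stem
    return surface_counts, dict(by_stem)
```

The first return value keeps corredor and corredora as separate entries. The second shows which original words were merged under each stem, so the per-stem total can still be obtained by summing the inner counts.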
See also lemmatization, and take a look at the NLPyPort library.
Have you already looked at the difference between the Porter and Lancaster stemmers?
– Paulo Marques
No. I started reading some articles and looking at examples of the ones I mentioned, but I did not find any study references for the ones you wrote about. I would like to understand them better, though. Could you explain that difference? Thanks in advance for your attention.
– stack.cardoso