How to tokenize Portuguese words using NLTK?

I’m having serious difficulties understanding this mechanism.

In English it would just be:

import nltk
tag_word = nltk.word_tokenize(text)

where text is the English text that I would like to tokenize, and that works very well. For Portuguese, however, I still cannot find any example. I am leaving out the earlier steps with stop_words and sent_tokenize here, just to make it clear that my question is specifically about tokenization.
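
For reference, a complete runnable version of the English case looks like this (the sample sentence is my own):

import nltk

nltk.download('punkt')  # pre-trained Punkt tokenizer models; only needed once

text = "NLTK makes tokenization straightforward."
tag_word = nltk.word_tokenize(text)  # defaults to language='english'
print(tag_word)  # ['NLTK', 'makes', 'tokenization', 'straightforward', '.']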

  • Have you read this article or seen this repository?

  • Hello @Andersoncarloswoss, yes, I have read it, but I still can't understand the flow. I was able to get the stop words with nltk.corpus.stopwords.words('portuguese'), but I still could not tag the words; the examples I found on the internet were not very didactic (a sketch of the stop-word step follows below).
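
A minimal sketch of that stop-word step in Portuguese, assuming a sample sentence of my own (note the lowercase 'portuguese' corpus name):

import nltk
from nltk import tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop-word lists, including Portuguese

frase = "Eu gosto de estudar processamento de linguagem natural."
palavras = tokenize.word_tokenize(frase, language='portuguese')
stops = set(stopwords.words('portuguese'))
sem_stops = [p for p in palavras if p.lower() not in stops]
print(sem_stops)  # e.g. ['gosto', 'estudar', 'processamento', 'linguagem', 'natural', '.']

For the tagging part, note that nltk.pos_tag ships with an English model; for Portuguese one would typically train a tagger on a tagged corpus such as mac_morpho or floresta, both available through NLTK.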

1 answer

import nltk
from nltk import tokenize

nltk.download('punkt')  # Punkt tokenizer models; the download is only needed once
palavras_tokenize = tokenize.word_tokenize(text, language='portuguese')
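
The language parameter selects the pre-trained Punkt model for Portuguese that ships with NLTK. For example, with a sample sentence of my own:

>>> tokenize.word_tokenize("Olá, tudo bem com você?", language='portuguese')
['Olá', ',', 'tudo', 'bem', 'com', 'você', '?']

tokenize.sent_tokenize accepts the same language parameter if you also need sentence splitting.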
