How to tokenize Portuguese words using NLTK?

I’m having serious difficulties understanding this mechanism.

In English it would just be:

import nltk
tag_word = nltk.word_tokenize(text)

where text is the English text that I would like to tokenize, and that works very well. For Portuguese, however, I still cannot find any example. I am leaving out the earlier steps with stop_words and sent_tokenize here, just to make it clear that my question is specifically about tokenization.
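
For reference, a complete runnable version of the English case looks like this (the sample sentence is my own):

import nltk

nltk.download('punkt')  # pre-trained Punkt tokenizer models; only needed once

text = "NLTK makes tokenization straightforward."
tag_word = nltk.word_tokenize(text)  # defaults to language='english'
print(tag_word)  # ['NLTK', 'makes', 'tokenization', 'straightforward', '.']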

  • Have you read this article or seen this repository?

  • Hello @Andersoncarloswoss, yes, I have read it, but I still can't understand the flow. I was able to get the stop words with nltk.corpus.stopwords.words('portuguese'), but I still could not tag the words; the examples I found on the internet were not very didactic (a sketch of the stop-word step follows below).
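
A minimal sketch of that stop-word step in Portuguese, assuming a sample sentence of my own (note the lowercase 'portuguese' corpus name):

import nltk
from nltk import tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop-word lists, including Portuguese

frase = "Eu gosto de estudar processamento de linguagem natural."
palavras = tokenize.word_tokenize(frase, language='portuguese')
stops = set(stopwords.words('portuguese'))
sem_stops = [p for p in palavras if p.lower() not in stops]
print(sem_stops)  # e.g. ['gosto', 'estudar', 'processamento', 'linguagem', 'natural', '.']

For the tagging part, note that nltk.pos_tag ships with an English model; for Portuguese one would typically train a tagger on a tagged corpus such as mac_morpho or floresta, both available through NLTK.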

1 answer

import nltk
from nltk import tokenize

nltk.download('punkt')  # Punkt tokenizer models; the download is only needed once
palavras_tokenize = tokenize.word_tokenize(text, language='portuguese')
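
The language parameter selects the pre-trained Punkt model for Portuguese that ships with NLTK. For example, with a sample sentence of my own:

>>> tokenize.word_tokenize("Olá, tudo bem com você?", language='portuguese')
['Olá', ',', 'tudo', 'bem', 'com', 'você', '?']

tokenize.sent_tokenize accepts the same language parameter if you also need sentence splitting.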
