word_tokenize creating tokens with only one character instead of words

When building the list below from the tokens of multiple .txt files, the for loop tokenizes individual characters instead of the words themselves:

import glob
import nltk
      
l = []

for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    texts = open(file, 'r', encoding='utf-8').read()
    for text in texts:
        tokenize = nltk.word_tokenize(text, language='portuguese')
        l.extend(tokenize)

print(l)

output:

['N', 'ã', 'o', 'f', 'o', 'i', 'o', 'u', 'n', 'ã', 'o', 'é', ...]

I’ve tried append instead, but that creates a list of character lists. What am I doing wrong?

  • @jfaccioni, thanks for the tip on formatting the text as a whole.

1 answer

read() reads the entire contents of the file and returns them as a single string. That is, texts is a string.

By doing for text in texts you are traversing the characters of texts one by one. That is, on each iteration of the for, the variable text contains a single character. That’s why word_tokenize ends up generating tokens with only one character.
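To see this in isolation, here is a minimal sketch (the string is hypothetical, just to illustrate how iterating over a string works):

texts = 'Não foi'  # pretend this came from read()

for text in texts:
    print(repr(text))  # prints 'N', 'ã', 'o', ' ', 'f', 'o', 'i' -- one character per iteration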

If the idea is to tokenize all the text, don’t use that inner for; pass the entire string at once:

l = []

for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    texts = open(file, 'r', encoding='utf-8').read()
    tokenize = nltk.word_tokenize(texts, language='portuguese')
    l.extend(tokenize)

print(l) 
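As an aside, extend is the right call here: word_tokenize returns a list, and extend adds its elements one by one, while append would add the whole list as a single element (which is why you were getting a list of character lists). A minimal sketch:

tokens = []
tokens.extend(['Não', 'foi'])  # tokens is now ['Não', 'foi']
tokens.append(['Não', 'foi'])  # tokens is now ['Não', 'foi', ['Não', 'foi']]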

If the files are very large, I don’t recommend reading everything at once with read() (because then you would be loading all of the content into memory). One option is to read each file line by line, and in that case you do need a loop:

tokens = []

for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    with open(file, 'r', encoding='utf-8') as arq:  # open the file
        for linha in arq:  # for each line of the file
            tokenize = nltk.word_tokenize(linha, language='portuguese')
            tokens.extend(tokenize)

print(tokens)

Notice that this loop is different from yours: in yours, the for iterates over each character of the string; in mine, I iterate over each line and pass the entire line to word_tokenize.

I also used with to open the file, because it ensures the file will be closed at the end (even if an error occurs during reading). And I renamed the list from l to something a little better. Although it sounds silly, better names help when programming (especially a name like l, which, depending on the font used, can easily be confused with 1, I or |).
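For completeness, the first version could use with as well, if you still want to read each file at once; a minimal sketch combining the two snippets above:

import glob
import nltk

tokens = []

for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    with open(file, 'r', encoding='utf-8') as arq:  # with guarantees the file is closed
        tokens.extend(nltk.word_tokenize(arq.read(), language='portuguese'))

print(tokens)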

  • hkotsubo, thank you very much for taking the time to explain this to me. Everything is clear now. Thank you again!
