read() reads the entire contents of the file and returns it as a single string. That is, texts is a string.
When you do for text in texts, you are traversing the characters of texts one by one. That is, on each iteration of the for, the variable text holds a single character. That's why word_tokenize ends up generating tokens with only one character.
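A minimal sketch of the problem (assuming NLTK's tokenizer data is installed; the string here is just an illustrative example):

import nltk

texts = "olá mundo"  # pretend this is the string returned by read()
for text in texts:   # iterates over *characters*, not words
    # each iteration tokenizes a single character,
    # printing ['o'], ['l'], ['á'], ... (and [] for the space)
    print(nltk.word_tokenize(text, language='portuguese'))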
If the idea is to tokenize all of the text, don't use this for; pass the entire string at once:
import glob
import nltk

l = []
for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    texts = open(file, 'r', encoding='utf-8').read()  # read the whole file as one string
    tokenize = nltk.word_tokenize(texts, language='portuguese')
    l.extend(tokenize)
print(l)
If the files are too large, I don't recommend reading everything at once with read() (because then you would be loading all of the content into memory). One option is to read each file line by line, and for that you do need a loop:
import glob
import nltk

tokens = []
for file in glob.glob('C:\\Users\\User\\Desktop\\lima\\*.txt'):
    with open(file, 'r', encoding='utf-8') as arq:  # open the file
        for linha in arq:  # for each line of the file
            tokenize = nltk.word_tokenize(linha, language='portuguese')
            tokens.extend(tokenize)
print(tokens)
Notice it's different from your loop: in yours, the for iterates over each character of the string. In mine, I'm iterating over each line and passing the entire line to word_tokenize.
I also used with to open the file, because this ensures that it will be closed at the end (even if an error occurs during reading). And I changed the name of the list l to something a little better: although it sounds silly, better names help when programming (especially when that name is l, which, depending on the font used, can easily be confused with 1, I or |).
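For reference, the with block above is roughly equivalent to this try/finally sketch, which is why the file gets closed even if word_tokenize raises an exception:

arq = open(file, 'r', encoding='utf-8')
try:
    for linha in arq:
        tokens.extend(nltk.word_tokenize(linha, language='portuguese'))
finally:
    arq.close()  # runs even if an error occurs inside the try block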