A first idea is to split not only on spaces, but on any character that is not part of a word:
from collections import Counter
import re
r = re.compile(r'\W+')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.split(linha):
            if word:  # split() returns empty strings at the edges of a line; skip them
                c.update([word])
print(c)
The shortcut \W means "anything that is not a letter, digit, or the character _". Since the text contains "words" such as _she_, that is counted as a different word from she. Numbers (such as 1) are also treated as "words" and are counted as well.
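To make the behavior of \W concrete, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not a line from the book):

```python
import re

# Made-up sample line, only for illustration
sample = "_she_ said she had 1 idea"

# Split on runs of non-word characters and drop any empty pieces
words = [w for w in re.split(r'\W+', sample) if w]
print(words)
# _she_ and she come out as two distinct items, and 1 is kept as a "word"
```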
As I find the words, I update the Counter using the update method (if the key does not exist, it is created with the value 1; if it already exists, 1 is added to its value, so in the end we have the total count of each word).
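A short sketch of how update behaves, using made-up words. It also shows why the word is wrapped in a list: a bare string would be iterated character by character.

```python
from collections import Counter

c = Counter()
for word in ["she", "he", "she"]:  # made-up words
    c.update([word])  # a one-item list adds 1 to that item's count

# A missing key reads as 0 without being created
print(c["she"], c["he"], c["it"])

letters = Counter()
letters.update("she")  # a bare string is counted letter by letter
print(letters)
```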
Another detail is that read() loads the entire contents of the file into memory at once. Depending on the size of the file, this can be a problem. The code above, by contrast, reads one line at a time (I am assuming that no word starts on one line and ends on the next, although in that case using read and split would not treat it as a single word either).
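As a rough check that counting line by line gives the same totals as counting the whole text at once, here is a sketch in which io.StringIO stands in for the real file and the text is made up:

```python
import io
import re
from collections import Counter

r = re.compile(r'\W+')
text = "first line here\nsecond line here\n"  # made-up stand-in for the file contents

# Whole text at once, as f.read() would return it
all_at_once = Counter(w for w in r.split(text) if w)

# One line at a time, as iterating over the file object does
line_by_line = Counter()
for line in io.StringIO(text):
    line_by_line.update(w for w in r.split(line) if w)

print(all_at_once == line_by_line)
```

Only the line-by-line version avoids holding the whole file in memory.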
If you don’t want to include the _ as part of a word, simply change the regex to:
r = re.compile(r'[\W_]+')
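A minimal sketch of the effect on a made-up sentence: with _ treated as a separator, _she_ and she fall into the same count.

```python
import re
from collections import Counter

r = re.compile(r'[\W_]+')
sample = "_she_ said she"  # made-up sample

# The leading _ produces an empty first piece, so empty strings are filtered out
counts = Counter(w for w in r.split(sample) if w)
print(counts)
```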
The problem is that there are also hyphenated words, such as over-scrupulous. The code above treats this as two different words ("over" and "scrupulous"). If you want it to count as a single word, a small change is needed:
from collections import Counter
import re
r = re.compile(r'\b\w+(?:-\w+)*\b')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.findall(linha):
            c.update([word])
print(c)
Now I use \w+ (one or more word characters), followed by a group containing a hyphen and \w+ (and this whole group can repeat zero or more times). That way I also get words with one or more hyphens.
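A quick sketch of this regex on a made-up sentence, to show that hyphenated words stay whole while a stray -- is ignored:

```python
import re

r = re.compile(r'\b\w+(?:-\w+)*\b')
sample = "an over-scrupulous mother-in-law -- really"  # made-up sample

# The group is non-capturing, so findall returns the full matches
words = r.findall(sample)
print(words)
```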
If you don’t want to include the _ as part of a word, use:
r = re.compile(r'\b[^\W_]+(?:-[^\W_]+)*\b')
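One caveat worth checking with this last regex: \b still treats _ as a word character, so there is no boundary between _ and s in _she_, and such a word seems to be skipped entirely instead of being counted as she. The sketch below shows this on a made-up sentence and tries a variant that replaces \b with lookarounds; that variant is an assumption added for illustration, not something tested against the whole book.

```python
import re

sample = "_she_ was over-scrupulous"  # made-up sample

# Regex as given above, with \b at both ends
with_b = re.compile(r'\b[^\W_]+(?:-[^\W_]+)*\b')
found_with_b = with_b.findall(sample)
print(found_with_b)

# Hypothetical variant: require that the match is not preceded or followed
# by a letter or digit, so that a neighboring _ no longer blocks it
with_lookarounds = re.compile(r'(?<![^\W_])[^\W_]+(?:-[^\W_]+)*(?![^\W_])')
found_with_lookarounds = with_lookarounds.findall(sample)
print(found_with_lookarounds)
```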
It is worth remembering that string.punctuation only covers the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. If the text contains any other character that is not a letter, digit, or _, it will not be removed.
An example is the character “ (present in the text), which is not the same thing as " (they are different quotation marks: the first is the "LEFT DOUBLE QUOTATION MARK" and the second is the "QUOTATION MARK"; if you use punctuation, only the second is removed).
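A small sketch of the difference between the two quotes. Removing punctuation with str.translate is an assumption about how string.punctuation is being used, and the sample text is made up:

```python
import string
import unicodedata

left_quote = "\u201c"  # the curly opening quote
plain_quote = '"'

print(unicodedata.name(left_quote))
print(unicodedata.name(plain_quote))
print(left_quote in string.punctuation, plain_quote in string.punctuation)

# Assumed removal approach: delete every character listed in string.punctuation
text = "\u201cthe,\u201d she said"  # made-up sample with curly quotes
stripped = text.translate(str.maketrans("", "", string.punctuation))
print(stripped)  # the comma is gone, but the curly quotes are still there
```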
I don’t understand what’s so strange about "the,". Is it the comma? If so, just remove all punctuation before processing the text. – Augusto Vasques
@Augusto Vasques: that’s what I’m trying to do! Remove punctuation, accents... but it didn’t work. – Laurinda Souza