Note: if you are working with natural language processing, you are better off following the suggestion in another answer and using a dedicated library. Still, here is an alternative that does it "by hand":
Just create a set with the words to be ignored, and only add to the Counter the words that are not in this set:
from collections import Counter

# words to be ignored
ignorar = {'de', 'para', 'com', 'a', 'o', 'os', 'as'}
ocorrencias = Counter()
with open('arquivo.txt') as f:
    for linha in f:  # iterating over a text file yields one line at a time
        # count only the words that are not in the ignore set
        ocorrencias.update(palavra for palavra in linha.split() if palavra not in ignorar)
print(ocorrencias)
In this case, I iterate over the file with a for loop, because files are iterable objects and, for text files, the iteration goes line by line.
For each line in the file, I call split to break it into words, but only add those that are not in the set.
It could also be done with a list (ignorar = [ 'de', 'para', ... ] - note the square brackets instead of curly braces), but a set is better optimized for lookups than a list (see more details here and here). Of course, for a handful of words the difference will be insignificant, but if you are dealing with a lot of data, it can make a difference.
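To get a feel for the lookup difference, here is a rough sketch using timeit from the standard library. The 10,000 generated words and the looked-up word 'ausente' are made up purely for illustration, and the absolute timings will vary from machine to machine:

```python
from timeit import timeit

# Hypothetical ignore list, large enough for the difference to show up.
palavras = [f'palavra{i}' for i in range(10_000)]
ignorar_lista = list(palavras)
ignorar_set = set(palavras)

# Worst case for the list: the word is not there, so every element
# has to be compared before the search gives up.
alvo = 'ausente'

tempo_lista = timeit(lambda: alvo in ignorar_lista, number=1_000)
tempo_set = timeit(lambda: alvo in ignorar_set, number=1_000)

print(f'list: {tempo_lista:.4f}s')
print(f'set:  {tempo_set:.4f}s')
```

Both containers give the same answer to `alvo in ...`; only the time taken to reach it differs.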
I chose to read line by line instead of calling f.read() once, because f.read() loads the entire contents of the file into memory, which can be a problem if the file is too large. But nothing stops you from doing it all at once:
with open('arquivo.txt') as f:
    # read the whole file at once, split it into words and filter them
    ocorrencias = Counter(palavra for palavra in f.read().split() if palavra not in ignorar)
That is, I read the whole file, split it into words and apply the same logic as above.
It is worth remembering that, depending on the sentence, a simple split may not break it into words correctly. For example, if the sentence is "Oi, tudo bem com meu bem?", split will produce ['Oi,', 'tudo', 'bem', 'com', 'meu', 'bem?'], and therefore bem and bem? will be counted as different words.
In this case, a lot depends on your definition of a word: only letters (but what about compound words, which contain hyphens)? Should upper and lower case be distinguished ("Hi" and "hi" counted together or separately)? And so on.
On that topic, you can take a look here, here and here, or use nltk itself, which has functionality for breaking a sentence into words. Once you have the words, apply the same logic as above: add to the Counter only those that are not in the set.
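As an illustration of one possible definition of a word, here is a minimal sketch using only the standard re module. The definition itself (letters, optionally joined by hyphens, compared case-insensitively), the PALAVRA pattern and the contar_palavras helper are assumptions made for this example, not the only reasonable choice:

```python
import re
from collections import Counter

ignorar = {'de', 'para', 'com', 'a', 'o', 'os', 'as'}

# Assumed definition of a word: one or more letters, optionally joined by
# hyphens (so "guarda-chuva" stays whole). [^\W\d_] matches word characters
# other than digits and underscore, which covers accented letters too.
PALAVRA = re.compile(r'[^\W\d_]+(?:-[^\W\d_]+)*')

def contar_palavras(texto):
    # Lowercase first so that "Oi" and "oi" are counted together.
    palavras = PALAVRA.findall(texto.lower())
    return Counter(p for p in palavras if p not in ignorar)

ocorrencias = contar_palavras('Oi, tudo bem com meu bem?')
print(ocorrencias)
```

With this definition the punctuation is dropped, so both occurrences of bem in the example sentence are counted as the same word, and com is still ignored because it is in the set.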
Yes, create a list of prepositions and articles that you would like to ignore and filter the words by ignoring them before running the counter.
– Woss