Remove prepositions and articles from text in Python

I loaded a text file into Python and extracted all of its words, and now I need the frequency with which each word appears in the text. The problem is that the highest frequencies are prepositions (de, para, com) and articles (a, o, os, as).

Is there a way to eliminate this kind of word and keep only the words that are actually relevant to the text?

Here is the code that gave me the frequency of every word:

from collections import Counter
with open('arquivo.txt') as f:
    ocorrencias = Counter(f.read().split())
print(ocorrencias)
  • Yes, create a list of prepositions and articles that you would like to ignore, and filter them out of the words before running the counter.

2 answers

What you are looking for is called "stopwords", a type of filtering traditionally used in natural language processing. Here is an example using the nltk package:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

sw = stopwords.words('portuguese')

text = '''
Processamento de língua natural (PLN) é uma subárea da ciência da computação, inteligência artificial e da linguística que estuda os problemas da geração e compreensão automática de línguas humanas naturais. Sistemas de geração de língua natural convertem informação de bancos de dados de computadores em linguagem compreensível ao ser humano e sistemas de compreensão de língua natural convertem ocorrências de linguagem humana em representações mais formais, mais facilmente manipuláveis por programas de computador. Alguns desafios do PLN são compreensão de língua natural, fazer com que computadores extraiam sentido de linguagem humana ou natural e geração de língua natural. 
'''

new_text = ' '.join([k for k in text.split(" ") if k not in sw])

print(new_text)

Returns:

Processamento língua natural (PLN) subárea ciência computação, inteligência artificial linguística estuda problemas geração compreensão automática línguas humanas naturais. Sistemas geração língua natural convertem informação bancos dados computadores linguagem compreensível ser humano sistemas compreensão língua natural convertem ocorrências linguagem humana representações formais, facilmente manipuláveis programas computador. Alguns desafios PLN compreensão língua natural, fazer computadores extraiam sentido linguagem humana natural geração língua natural.

You could do the filtering by manually listing the prepositions, as suggested in the comments, but the chance of forgetting some words is higher.
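To use this with a file, as in the question, instead of a hard-coded string, a sketch along these lines should work (it assumes arquivo.txt fits in memory; lower() is used so that capitalized occurrences such as "De" are also ignored):

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter

sw = set(stopwords.words('portuguese'))  # a set makes the membership test faster

with open('arquivo.txt') as f:
    palavras = f.read().split()

# count only the words that are not stopwords
ocorrencias = Counter(p for p in palavras if p.lower() not in sw)
print(ocorrencias)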

  • Good afternoon, I tested it and that is what I want. How can I adapt this code to the code in my question, so that instead of the hard-coded "text" it reads my txt file?

  • The English term "stop word" means "empty word": a word to be removed from the text. "In computing, stop words are words that are filtered out before or after processing a text..." https://en.m.wikipedia.org/wiki/Stop_word

Note: if you are working with natural language processing, you had better follow what was indicated in the other answer and use a library dedicated to that. Anyway, here is an alternative for doing it "by hand":


Just create a set with the words to be ignored, and only add to the Counter the words that are not in this set:

from collections import Counter

ignorar = {'de', 'para', 'com', 'a', 'o', 'os', 'as'}  # words to be ignored
ocorrencias = Counter()
with open('arquivo.txt') as f:
    for linha in f:  # iterating a text file yields one line at a time
        ocorrencias.update(palavra for palavra in linha.split() if palavra not in ignorar)
print(ocorrencias)

Here I iterate through the file with a for loop, because file objects are iterable, and in the case of text files the iteration is done line by line.

For every line in the file, I use split to break it into words, but I only add to the counter those that are not in the set.

It could also be done with a list (ignorar = ['de', 'para', ...] - note the brackets in place of the braces), but a set is more optimized for membership tests than a list (see more details here and here). Of course, for a few words the difference will be insignificant, but if you are dealing with a lot of data it might make a difference.
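If you want to see this difference yourself, the standard library's timeit module can compare the two membership tests (a quick sketch; absolute numbers vary by machine, but the gap grows with the size of the collection):

import timeit

ignorar_lista = ['de', 'para', 'com', 'a', 'o', 'os', 'as']
ignorar_set = set(ignorar_lista)

# a word that is absent forces the list to scan all elements; the set uses hashing
print(timeit.timeit("'linguagem' in ignorar_lista", globals=globals()))
print(timeit.timeit("'linguagem' in ignorar_set", globals=globals()))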

I chose to read line by line instead of using a single f.read(), because f.read() loads the entire contents of the file into memory, which can be a problem if the file is too large. But nothing stops you from doing it all at once:

with open('arquivo.txt') as f:
    ocorrencias = Counter(palavra for palavra in f.read().split() if palavra not in ignorar)

That is, I read the whole file, split it into words and apply the same logic as above.
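In either version, once the Counter is built, its most_common method lists the highest frequencies first, which also helps to verify that the ignored words are really gone:

print(ocorrencias.most_common(10))  # the 10 most frequent words and their counts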


It is worth remembering that, depending on the sentence, a simple split may not break it into words correctly. For example, if the phrase is "Oi, tudo bem com meu bem?", the split will produce ['Oi,', 'tudo', 'bem', 'com', 'meu', 'bem?'], and therefore bem and bem? will be counted as different words.

In this case, it will depend a lot on your definition of a word: letters only (but what about compound words, which have hyphens)? Differences between upper and lower case (should "Oi" and "oi" be counted together or separately)? Etc...
About that, you can take a look here, here and here, or use nltk itself, which has functionality for breaking a sentence into words. Once you have the words, just apply the same logic as above: add to the Counter only those that are not in the set.
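As an illustration of that last suggestion, here is a sketch using nltk's word_tokenize (it assumes the "punkt" tokenizer data has been downloaded; punctuation becomes separate tokens, and lowercasing makes "Oi" and "oi" count together):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

frase = 'Oi, tudo bem com meu bem?'
tokens = word_tokenize(frase, language='portuguese')
# tokens: ['Oi', ',', 'tudo', 'bem', 'com', 'meu', 'bem', '?']
palavras = [t.lower() for t in tokens if t.isalpha()]  # keep only alphabetic tokens
print(palavras)  # ['oi', 'tudo', 'bem', 'com', 'meu', 'bem']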

  • very good, I will test this method
