Replace word list in text

Viewed 840 times

1

I always have a hard time with replace and sub. I know how they work, but I never get them right. I have a list of words and I am trying to remove those words from a text:

Text:

Brazil, officially the Federative Republic of Brazil, is the largest country in South America and in the Latin America region, being the fifth largest in the world by territorial area (equivalent to 47% of the South American territory) and sixth by population (with more than 200 million inhabitants). It is the only country in the Americas where Portuguese is predominantly spoken and the largest Portuguese-speaking country on the planet, as well as being one of the most multicultural and ethnically diverse nations, due to strong immigration from many parts of the world. Its current constitution, formulated in 1988, defines Brazil as a presidential republic formed by the union of the Federal District, the 26 states and the 5 570 municipalities.

List:

é
o
da
e
do
em
na
se
de

Script:

import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)

    for palavras in fichier:
        palavras = palavras.lower()

        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            stopwords_regex = re.compile('|'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)

            print(replace_stopwords)

The problem is that my script starts to replace vowels within words:

output:

brasil, ficialmnt rpública frativa brasil é mair país américa sul rgiã américa lati, n quint mair mun m ára trritrial (quivalnt a 47% trritóri sul-amrican) xt m ppulaçã (cm mais 200 milhõs habitants). é únic país américa n fala majritariamnt a língua prtugusa mair país lusófn planta, além r uma s çõs mais multiculturais tnicamnt divrsas, m crrência frt imigraçã riun varias lcais mun. its current cnstituiçã, frmula m 1988, fin brasil cm a prsincialista fraternity, frma Pla uniã distrit fral, s 26 Stas s 5 570 municippis.

EDITED

Problem solved thanks to the help of Isac and Rickadt

Script:

import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)

    for palavras in fichier:
        palavras = palavras.lower()

        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            # the fix is here: for each word in merged_stopwords to match only as a whole word, a word boundary (\b) is needed
            stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)

            print(replace_stopwords)
  • But is the goal to remove the prepositions and keep only the "normal" words? I couldn't figure out exactly what you want this code to do.

  • Exactly, remove the prepositions and keep the "normal" words. But the script is removing letters from inside normal words and I don't know why...

  • This can be done with a single regex over the whole text, using a word boundary as in \bPalavra\b, which makes it very simple. Here is a simplified example of what I mean

  • Take a look here: https://answall.com/questions/310696/substr-palavras-entre-dois-arquivos/310812#310812

  • Thank you, but unfortunately that gives an error, as noted here: https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python

  • How is the csv structured? Is each word on its own line? If so, why is it a csv?

  • Thank you Isac, I was able to solve it with your word boundary suggestion and Rickadt's :)

  • 2

    Good. Take the opportunity to post your solution with a proper explanation, which settles your question and may help other people with the same problem.

3 answers

3

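import re

# read the text and the stopword list
txt   = open('texto').read()
lista = open('lista').read()
# every run of word characters in the list is a stopword
sw    = re.findall(r'\w+', lista)
# x is the match object; x[0] is the matched word (Python 3.6+)
print(re.sub(r'\w+', lambda x: '' if x[0].lower() in sw else x[0], txt))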

This is a Python 3 variant. It works as follows:

  • re.findall(r'\w+', lista) extracts the stopwords.
  • re.sub(r'\w+', ... , txt) takes each word of the text and replaces it with the result of
  • lambda x: '' if x[0].lower() in sw else x[0], that is to say
    • with '' if the word belongs to sw
    • with the word itself if it does not
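The same idea can be tried without any files. This is only a minimal sketch: the sample sentence and the stopword string are made up for illustration, the helper name drop_stopword is hypothetical, and a set is used instead of a list so the membership test stays cheap.

```python
import re

# Hypothetical sample data standing in for the 'texto' and 'lista' files.
txt = "O Brasil é o maior país da América do Sul"
lista = "é\no\nda\ne\ndo\nem\nna\nse\nde\n"

# Every run of word characters in the list is a stopword; a set gives fast lookups.
sw = set(re.findall(r'\w+', lista))

def drop_stopword(match):
    """Return '' for a stopword and the matched word itself otherwise."""
    word = match[0]
    return '' if word.lower() in sw else word

result = re.sub(r'\w+', drop_stopword, txt)
print(result)
```

The spaces around the removed words are left in place, so the result still contains runs of spaces.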

3


For the sake of clarity, and because the solution you posted was not quite what I had suggested, here is my suggestion.

The suggestion was to apply a single regex to the whole text that replaces only whole words, using \b, the regex syntax for a word boundary. This means that neither the lines of the text nor the words to be excluded need to be iterated over.

Assuming you read the text and words to be removed with:

import csv
import itertools
import re

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read()
    csvstop = csv.reader(stop)
    # flatten the csv rows into a single list of words
    stopwords = list(itertools.chain(*csvstop))

Applying the regex takes just two more lines:

    # the flag is given when the pattern is compiled
    regex = re.compile(r'\b' + r'\b|\b'.join(stopwords) + r'\b', re.IGNORECASE)
    replacedtext = regex.sub('', fichier)

The regex is built by wrapping each word as \bPalavra\b and joining them with |. The re.IGNORECASE flag makes it match both uppercase and lowercase, which avoids any call to lower(). It is passed to re.compile because the fourth positional argument of re.sub is count, so a flag passed there would only limit the number of replacements. Inspecting the regex built for the given stopwords gives the following:

\bé\b|\bo\b|\bda\b|\be\b|\bdo\b|\bem\b|\bna\b|\bse\b|\bde\b

The | makes the regex match any one of the words, and the \b ensures that it only matches isolated words and never part of another word.
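To see the difference the word boundary makes, the two patterns can be compared on a short sample. This is a sketch under assumptions: the sample sentence is made up, re.escape is added in case a stopword ever contains a regex metacharacter, and the non-capturing group form is used as a compact equivalent of repeating \b around every word.

```python
import re

# Hypothetical data: the stopwords from the question and a short sample sentence.
stopwords = ['é', 'o', 'da', 'e', 'do', 'em', 'na', 'se', 'de']
text = 'O Brasil é o maior país da América do Sul'

# Escape each word so it is always treated literally inside the pattern.
alternatives = '|'.join(map(re.escape, stopwords))

# Without boundaries the alternatives match anywhere, even inside words.
no_boundary = re.compile(alternatives, re.IGNORECASE)
# With \b around a non-capturing group only whole words can match.
whole_words = re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

damaged = no_boundary.sub('', text)
cleaned = whole_words.sub('', text)

print(damaged)  # letters are missing inside "maior" and "América"
print(cleaned)  # only the standalone stopwords are gone
```

The first pattern reproduces the problem from the question, because a single-letter stopword such as o matches every o in the text.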

It is also worth remembering that if you remove an entire word in the middle of a sentence you can have two spaces in a row. Depending on what you do with the text you may not want these spaces. You can easily remove them with another regex:

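replacedtext = re.sub(r'\s{2,}', ' ', replacedtext)

This replaces any run of 2 or more whitespace characters with a single space. Note that \s also matches line breaks, so consecutive newlines are collapsed as well.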
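If the text has several lines and the line breaks should survive, one option is to collapse only spaces and tabs instead of every whitespace character. A small sketch with a made-up sample string:

```python
import re

# Hypothetical text after stopword removal: doubled spaces and a blank line.
replacedtext = 'Brasil   maior país\n\nSul  América'

# \s also matches '\n', so the blank line between the paragraphs is lost.
all_whitespace = re.sub(r'\s{2,}', ' ', replacedtext)

# [ \t] restricts the match to spaces and tabs, so line breaks are kept.
spaces_only = re.sub(r'[ \t]{2,}', ' ', replacedtext)

print(repr(all_whitespace))
print(repr(spaces_only))
```

Which of the two is preferable depends on whether the line structure matters for what is done with the text afterwards.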

  • 2

    cool! (+1). Possibly r'\b(?:' + r'|'.join(sw) + r')\b'

  • @Jjoao That really is an interesting alternative, and I would say an even better one, because the regex is simpler and the non-capturing group probably ensures the same level of efficiency.

  • Another way, if stopwords is a list: "|".join("\\b{0}\\b".format(x) for x in stopwords), which produces the same result: \bé\b|\bo\b|\bda\b|\be\b|\bdo\b|\bem\b|\bna\b|\bse\b|\bde\b

2

I believe the simplest approach is to break each line into words with the split method and check whether each word is a stopword.

import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    lines = file.read().split('\n')
    csvstop = csv.reader(stop)
    stopwords = list(itertools.chain(*csvstop))

    for line in lines:
        palavras = line.lower().split()
        # keep only the words that are not stopwords
        palavras = [palavra for palavra in palavras if palavra not in stopwords]

        print(" ".join(palavras))
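One thing to watch with split is that punctuation stays attached to the words, so a token such as "e," would not be found in the list. A possible variation, sketched here with made-up in-memory data instead of the files and a hypothetical helper named remove_stopwords, strips the surrounding punctuation before the comparison. Unlike the code above, it keeps the original case of the words that remain.

```python
import string

# Hypothetical data standing in for texto.txt and lista.csv.
stopwords = {'é', 'o', 'da', 'e', 'do', 'em', 'na', 'se', 'de'}
lines = ['O Brasil é o maior país da América do Sul.',
         'O maior país lusófono, e, de longe, o mais populoso.']

def remove_stopwords(line):
    """Keep the words whose punctuation-stripped, lowercased form is not a stopword."""
    kept = []
    for palavra in line.split():
        # compare a cleaned copy, but keep the word as it was written
        bare = palavra.strip(string.punctuation).lower()
        if bare not in stopwords:
            kept.append(palavra)
    return ' '.join(kept)

cleaned_lines = [remove_stopwords(line) for line in lines]
for cleaned in cleaned_lines:
    print(cleaned)
```

Because the words are joined again with a single space, this approach does not leave doubled spaces behind.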
