I always have a hard time with replace and sub. I know how they work, but I never get them right. I have a list of words and I'm trying to remove those words from a text:
Text:
Brazil, officially the Federative Republic of Brazil, is the largest country in South America and the region of Latin America, being the fifth largest in the world in territorial area (equivalent to 47% of the South American territory) and sixth in population (with more than 200 million inhabitants). It is the only country in America where the Portuguese language is mainly spoken and the largest Portuguese-speaking country on the planet, as well as being one of the most multicultural and ethnically diverse nations, due to the strong immigration from various locations in the world. Its current constitution, formulated in 1988, defines Brazil as a presidential republic formed by the union of the Federal District, of the 26 states and of the 5 570 municipalities.
List:
is
the
of
and
of
in
in
if
of
Script:
import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)
    for palavras in fichier:
        palavras = palavras.lower()
        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            stopwords_regex = re.compile('|'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)
            print(replace_stopwords)
The problem is that my script also strips letters (mostly vowels) out of the middle of other words:
output:
brasil, ficialmnt rpública frativa brasil é mair país américa sul rgiã américa lati, n quint mair mun m ára trritrial (quivalnt a 47% trritóri sul-amrican) xt m ppulaçã (cm mais 200 milhõs habitants). é únic país américa n fala majritariamnt a língua prtugusa mair país lusófn planta, além r uma s çõs mais multiculturais tnicamnt divrsas, m crrência frt imigraçã riun varias lcais mun. its current cnstituiçã, frmula m 1988, fin brasil cm a prsincialista fraternity, frma Pla uniã distrit fral, s 26 Stas s 5 570 municippis.
EDITED
Problem found and fixed thanks to the help of Isac and Rickadt.
Script:
import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    # csv.reader yields one list per row; flatten every row into a single list of words
    stopwords = csv.reader(stop)
    merged_stopwords = list(itertools.chain(*stopwords))
    # the solution is here: for each word in merged_stopwords to be matched
    # only as a whole word, it has to be wrapped in word boundaries (\b)
    stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, merged_stopwords)))
    # the pattern is built once, then applied to every line of the text
    for palavras in fichier:
        palavras = palavras.lower()
        replace_stopwords = stopwords_regex.sub('', palavras)
        print(replace_stopwords)
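The same idea can be sketched as two small functions that work on in-memory data, which makes it easier to try without the two files. The helper names `build_stopword_regex` and `remove_stopwords` are made up for this illustration, the inline list and sentence stand in for lista.csv and texto.txt, and `\b(?:a|b)\b` is just an equivalent spelling of `\ba\b|\bb\b`. Collapsing the leftover spaces is an extra step that the script above does not do.

```python
import re

def build_stopword_regex(stopwords):
    """Compile one pattern that matches any stopword as a whole word (hypothetical helper)."""
    # Deduplicate while keeping order; the list in the question repeats entries.
    unique = list(dict.fromkeys(w.strip().lower() for w in stopwords if w.strip()))
    return re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, unique)))

def remove_stopwords(text, pattern):
    """Lowercase the text, drop the stopwords, collapse leftover spaces (hypothetical helper)."""
    stripped = pattern.sub("", text.lower())
    return re.sub(r"\s{2,}", " ", stripped).strip()

# In-memory stand-ins for lista.csv and texto.txt (assumed content).
stopwords = ["is", "the", "of", "and", "of", "in", "in", "if", "of"]
text = "Brazil is the largest country in South America and the region of Latin America"

pattern = build_stopword_regex(stopwords)
cleaned = remove_stopwords(text, pattern)
print(cleaned)
```

Because of the boundaries, "Latin" survives intact even though it contains "in".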
But is the goal to remove the prepositions and keep only the "normal" words? I couldn't figure out exactly what you want to do with this code.
– Isac
Exactly, remove the prepositions and keep the "normal" words. But the script is removing letters from inside normal words and I don't know why...
– marin
This can be done with a regex over the whole text, using a word boundary, as in \bPalavra\b, which makes it super simple. I have a simplified example here of what I'm trying to say.
– Isac
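To illustrate the word boundary suggestion, here is a minimal sketch that applies the same alternation twice, once without `\b` and once with it. The sentence and the four-word list are invented for the demonstration and are not the contents of the files in the question.

```python
import re

# Hypothetical sample data, chosen only to make the difference visible.
text = "the region of latin america is in the south"
stopwords = ["is", "of", "in", "the"]

alternation = "|".join(map(re.escape, stopwords))

# No boundaries: "in" also matches inside "latin".
naive = re.compile(alternation)
# With boundaries: each stopword has to match as a whole word.
bounded = re.compile(r"\b(?:%s)\b" % alternation)

naive_result = naive.sub("", text)
bounded_result = bounded.sub("", text)

print(naive_result.split())
print(bounded_result.split())
```

Without the boundaries "latin" is cut down to "lat"; with them it is left alone, because there is no word boundary between the "t" and the "i".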
Take a look here: https://answall.com/questions/310696/substr-palavras-entre-dois-arquivos/310812#310812
– Carlos H Marques
Thank you, but unfortunately that gives an error, as was noted here: https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python
– marin
How is the csv structured? Does each word have its own line? If so, why is it a csv?
– Isac
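If the file really does hold one word per line, a plain read is enough and there is no need to flatten rows with itertools.chain, since csv.reader returns every row as a list. The sketch below assumes that layout. `read_stopwords` is a hypothetical helper, and `io.StringIO` stands in for `open('lista.csv')` so the snippet runs without the file.

```python
import io

def read_stopwords(lines):
    """Return the stripped, non-empty lines of an iterable of text lines (hypothetical helper)."""
    return [line.strip() for line in lines if line.strip()]

# io.StringIO stands in for open('lista.csv'); the content is assumed
# to be one word per line, like the list shown in the question.
fake_file = io.StringIO("is\nthe\nof\nand\n\nin\n")
stopwords = read_stopwords(fake_file)
print(stopwords)
```

With a real file the call would be `read_stopwords(stop)` inside the existing `with` block, since an open text file is also an iterable of lines.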
Thank you Isac, I was able to solve it with your word boundary suggestion and Rickadt's :)
– marin
Good. Take the opportunity to write up your solution, properly explained. That wraps up your question and could help other people who have the same problem.
– Isac