How to use .replace() to remove strings in Python

5

I'm doing a Coursera exercise in Python. It asks me to build a Python function that takes a string (word) and removes the punctuation symbols. It gives me a list of strings ["'", '"', ",", ".", "!", ":", ";", '#', '@'] and suggests using the string method .replace() to solve the problem.

The problem is that the signature of .replace() is replace(old, new, count), where the first two parameters are required. How do I get around this? Do I put " " in the _old_ argument, as an empty space?

4 answers

5

The documentation of the replace method says:

Return a copy of the string with all occurrences of substring old replaced by new

That is, if you do string.replace('alguma coisa', 'outra coisa'), all occurrences of "alguma coisa" are replaced by "outra coisa". And if, instead of "outra coisa", you pass the empty string ('' or "", note that there is nothing between the quotes), "alguma coisa" is replaced by "nothing", which in practice is the same as removing it.

So if you must use only replace and the list of characters to be removed, there is no way around it: you have to remove them one by one. Like this:

simbolos_pontuacao = ["'", '"', ",", ".", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'
# for each symbol, remove it from the string
for c in simbolos_pontuacao:
    texto = texto.replace(c, '')

print(texto) # Lorem ipsum dolor sit Amet bla bl etc whiskas sache

It's not very efficient, because every call to replace creates a new string (which is why I need to assign the return value back to the variable texto).
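
Since the exercise asks for a function, the loop above could be wrapped as in the sketch below. The names remove_punctuation and PUNCTUATION are my own choices for illustration, not something given by the exercise. The last two lines also show that replace leaves the original string untouched:

```python
# Hypothetical names: the exercise does not say what the function should be called
PUNCTUATION = ["'", '"', ",", ".", "!", ":", ";", "#", "@"]

def remove_punctuation(word, symbols=PUNCTUATION):
    """Return a copy of word without any of the given symbols."""
    for symbol in symbols:
        # str.replace returns a new string, so the result must be reassigned
        word = word.replace(symbol, "")
    return word

original = "Hello, World!"
cleaned = remove_punctuation(original)
print(cleaned)   # Hello World
print(original)  # Hello, World!  (the original string is unchanged)
```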

For those who like a more functional flavor, you can also use reduce:

from functools import reduce
texto = reduce(lambda t, s: t.replace(s, ''), simbolos_pontuacao, texto)

Which, deep down, does the same thing as the loop, only with more overhead...

Anyway, if you use replace, there is not much room to do anything different. But there are other ways to do it.


Another way is to go through each character of the string and check whether it is in the list. Then I join all the characters that are not in it to build another string:

texto = ''.join(c for c in texto if c not in simbolos_pontuacao)

This should not be very efficient either, because for each character of the string I have to go through the whole list simbolos_pontuacao to check whether the character is one of those to be removed (since the operator in does a linear search, the list is traversed many times).


Another alternative is to use regex. Since all symbols in the list have only one character, I can join them in a single character class: ['",.!:;#@]. That way, the substitution is done in one pass:

import re

r = re.compile(f'[{"".join(simbolos_pontuacao)}]')
texto = r.sub('', texto)

If any element of simbolos_pontuacao had more than one character, I could no longer use a character class. For example, if the list contained ab, then [ab] would not work, because it means "the letter a or the letter b". In that case, I have to use alternation:

# included an element with more than one character
simbolos_pontuacao = ['...', "'", '"', ",", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'

import re

r = re.compile(f'({"|".join(map(re.escape, simbolos_pontuacao))})')
texto = r.sub('', texto)

print(texto) # Lorem ipsum dolor sit Amet bla bl etc whiskas sache

In this case the regex becomes (\.\.\.|'|"|,|!|:|;|\#|@). Note that the dot is written as \., since it has a special meaning in regex ("any character, except line breaks"). For it to match only the character ., I have to escape it with \, and that is what re.escape does.

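In the first case I didn't need to escape anything, because most metacharacters "lose their powers" inside brackets. With this particular list nothing has to be escaped; in general, only a few characters still need it there: the closing bracket ], the backslash, ^ at the start, and - between two characters.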
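
As a small sketch of that last point, suppose the list contained characters that are still special inside brackets. The list below is hypothetical, made up just for this illustration. Applying re.escape to each element before joining keeps the character class valid:

```python
import re

# Hypothetical list: every element here is still special inside [...]
symbols = ["]", "\\", "^", "-"]

# Escape each element, then join them into a single character class
char_class = re.compile("[" + "".join(map(re.escape, symbols)) + "]")

print(char_class.pattern)                  # [\]\\\^\-]
print(char_class.sub("", r"a]b\c^d-e"))    # abcde
```

Escaping everything is slightly more than strictly necessary, but it is harmless and avoids having to remember which positions are special.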

4

That is basically the idea, but a proper solution is a little more complicated than that. If you replace the symbol with a space, you can end up with two spaces in a row, which would be wrong; but in some cases there is no space after the symbol, and then a space is better, because gluing two words together is much worse. The correct approach is a slightly more sophisticated algorithm that understands how to make each replacement. If this were the only pattern it would be simpler, but there may be others.

Since the exercise seems to ask only for something simple, replacing a symbol with a space as you suggested is appropriate, but the space must go in the parameter new, which is what ends up in the result. Something like this:

texto = texto.replace('@', ' ')

See it working on ideone and on repl.it. I also put it on GitHub for future reference.
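
One possible way to deal with the doubled spaces mentioned above is to replace every symbol with a space and then normalize the whitespace. This is only a sketch under my own assumptions: the function name replace_with_spaces is made up, and it also trims leading and trailing whitespace and turns tabs or line breaks into single spaces, which may or may not be what the exercise wants:

```python
symbols = ["'", '"', ",", ".", "!", ":", ";", "#", "@"]

def replace_with_spaces(text, symbols):
    """Replace each symbol with a space, then collapse runs of whitespace."""
    for symbol in symbols:
        text = text.replace(symbol, " ")
    # split() with no arguments splits on any run of whitespace
    # and discards empty pieces, so join() puts back exactly one space
    return " ".join(text.split())

print(replace_with_spaces("one,two; three!!", symbols))  # one two three
```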

If you want to take a risk, and it is worth it as an experiment, replace the symbol with nothing instead of a space. "Nothing" is two quotes in a row with no space between them, something like this:

texto = texto.replace('!', '')

You would have to write several of these calls, one for each symbol to be removed. This is extremely inefficient, but for an exercise it is not a problem.

Someone will probably suggest using a regex instead of replace() to make it more sophisticated, but that seems advanced for what you are learning now, and as they say:

When you try to solve a problem with regex, you end up with two problems

2

TL;DR

Using str.translate is fast, but it only works if every string in the list is exactly 1 character long.

chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'

resultado = texto.translate(str.maketrans('', '', "".join(chars)))
# 'Lorem ipsum dolor sit Amet bla bl etc whiskas sache'

Explanation

I will leave here another approach not yet covered in the other answers: the method str.translate.

The method str.translate receives a "translation table", which must be an object that allows indexed access, such as a dictionary or a list. The index is an integer representing the character's Unicode code point (see ord()), and the item found there must be the substitute character, or None to leave the character out of the result. Here is an example:

tabela = {
    ord("a"): "A",
    ord("b"): "*",
    ord("c"): None,
}
texto = "a-b-c-d"

print(texto.translate(tabela))    # 'A-*--d'

I am building this table "by hand" for demonstration purposes, but usually the static method str.maketrans is used to create it more easily. The same table above could be created like this:

tabela = str.maketrans("ab", "A*", "c")
texto = "a-b-c-d"

print(texto.translate(tabela))  # 'A-*--d'

To explain str.maketrans in more detail, it can take 1, 2 or 3 arguments:

  • 1 argument: the argument must be a dictionary (or some other Mapping) whose keys are integers (Unicode code points) or strings of length 1. Example:

    tabela = str.maketrans({
        97: 'A',  # ord('a') == 97
        'b': 66,  # ord('B') == 66
        'c': 'C',
        'd': None,
    })
    
    print(tabela)
    # {
    #     97: 'A', 
    #     98: 66, 
    #     99: 'C',
    #     100: None,
    # }
    
  • 2 arguments: both must be strings of the same length. The resulting table maps each character of the first string to the character at the same index in the second. Example:

    tabela = str.maketrans('abcd', 'ABCD')
    
    print(tabela)
    # {
    #     97: 65, 
    #     98: 66, 
    #     99: 67,
    #     100: 68,
    # }
    
  • 3 arguments: follows the same pattern as 2 arguments, but the 3rd argument is a string containing the characters that must be mapped to None. Example:

    tabela = str.maketrans('abc', 'ABC', 'd')
    
    print(tabela)
    # {
    #     97: 65, 
    #     98: 66, 
    #     99: 67,
    #     100: None,
    # }
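
As a quick sketch of two details of the rules above (the variable names here are just for illustration): with 2 arguments, strings of different lengths are rejected with a ValueError, and with the 1-argument dictionary form a value may be a string longer than one character:

```python
# With 2 arguments, the strings must have the same length
try:
    str.maketrans("abc", "AB")
    lengths_ok = True
except ValueError:
    lengths_ok = False
print(lengths_ok)  # False

# With the dictionary form, a key can map to a longer string
tabela = str.maketrans({"a": "<A>"})
print("banana".translate(tabela))  # b<A>n<A>n<A>
```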
    

Then just pass the table created by str.maketrans to str.translate. Here is how it would look in a function:

def remove_chars(string, chars):
    return string.translate(str.maketrans('', '', "".join(chars)))
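
A short usage sketch of this function, including what happens with the limitation mentioned in the TL;DR. The input strings are made up for the example. A multi-character element does not raise an error: because of the "".join, it simply contributes its individual characters to the removal set:

```python
def remove_chars(string, chars):
    return string.translate(str.maketrans('', '', "".join(chars)))

print(remove_chars("Hello, World!", [",", "!"]))  # Hello World

# "..." is not treated as a unit: it only contributes ".",
# so every single dot is removed, including the one in 3.14
print(remove_chars("Wait... 3.14", ["..."]))  # Wait 314
```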

Performance

I created a Repl.it with 4 different approaches to the problem:

  1. Using str.translate:

    def remove_chars_translate(string, chars):
        return string.translate(str.maketrans('', '', "".join(chars)))
    
  2. Using generator expressions, as mentioned in another answer:

     def remove_chars_gen_comprehension(string, chars):
         return "".join(c for c in string if c not in chars)
    
  3. Same as before, but using set to search for characters:

     def remove_chars_gen_comprehension_with_set(string, chars):
         chars = set(chars)
         return "".join(c for c in string if c not in chars)
    
  4. Using regex (also extracted from the other answer):

     import re  # needed by remove_chars_regex

     def remove_chars_regex(string, chars):
         r = re.compile(f'({"|".join(map(re.escape, chars))})')
         # use the parameter, not the global variable texto
         return r.sub('', string)
    

The results on my machine (with 1_000_000 iterations) were:

                  remove_chars_translate:  1.6842 s
          remove_chars_gen_comprehension: 11.0966 s
 remove_chars_gen_comprehension_with_set:  4.2422 s
                      remove_chars_regex:  5.9396 s

Note: I decreased the number of timeit iterations in the Repl.it so that the results don't take too long (some runs were taking more than 30 s).
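
For anyone who wants to reproduce this locally, a comparison along these lines can be sketched with timeit. This is not the exact script from the Repl.it: the iteration count is an arbitrary choice of mine, only three of the approaches are included, and the timings will vary from machine to machine:

```python
import re
import timeit

chars = ["'", '"', ",", ".", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'

def remove_chars_translate(string, chars):
    return string.translate(str.maketrans('', '', "".join(chars)))

def remove_chars_gen_comprehension_with_set(string, chars):
    chars = set(chars)
    return "".join(c for c in string if c not in chars)

def remove_chars_regex(string, chars):
    r = re.compile(f'({"|".join(map(re.escape, chars))})')
    return r.sub('', string)

candidates = [
    remove_chars_translate,
    remove_chars_gen_comprehension_with_set,
    remove_chars_regex,
]

expected = remove_chars_translate(texto, chars)
for func in candidates:
    # make sure every approach produces the same result before timing it
    assert func(texto, chars) == expected
    # number=10_000 is an arbitrary value, chosen only to keep the run short
    seconds = timeit.timeit(lambda: func(texto, chars), number=10_000)
    print(f"{func.__name__:>40}: {seconds:.4f} s")
```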

  • 1

    Very good. I confess I had always been too lazy to understand how maketrans works, and I finally understood it :-) I was also surprised that the regex was not the slowest of all (apparently, checking whether each character is in the list turns out to be worse than I imagined, and the cost of creating a set ends up paying off)

  • 1

    I wrote the answer thinking that translate would win because of its C implementation and because it apparently does everything in a single loop... But the gain from using set also caught me by surprise. I didn't expect that much

-1

replace goes through your string looking for occurrences of the value passed in the parameter old, and whenever it finds one, it replaces it with the value passed in the second parameter.

The parameter count is the maximum number of occurrences you want replaced, for when the string has more than one occurrence of the searched value.

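texto = 'Hello World'

# replace returns a new string, so print (or assign) the result
print(texto.replace('llo', 'troquei'))  # Hetroquei World

# count=2 replaces only the first two occurrences
print(texto.replace('l', 'L', 2))  # HeLLo World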

  • I already knew this, but how do I remove a string using this method, without putting anything in its place?

  • To remove part of the string, just replace it with the empty string. For example, 'Hello Word'.replace('llo','') returns He Word
