Python regex: remove everything before the first letter in a string

Another regex question that none of the answers I found covers. I have a dataframe in which some strings erroneously start with characters that are not letters, for example:

t = ['. Subordinam-se ao regime', '1º O acesso à informação']

The result I want is

'Subordinam-se ao regime', 'O acesso à informação'

I’m trying re.sub(r'[[:alpha:]]+', '', t) without success. What am I doing wrong?

1 answer

In your example, t is a list, so passing it directly to re.sub does not work. You have to pass one string at a time.

Therefore, one option is to use:

import re

t = ['. Subordinam-se ao regime',  '1º O acesso à informação']
for s in t:
    print(re.sub('^[^a-zA-Z]+', '', s))

The regex uses the anchor ^, which indicates the start of the string, followed by [^a-zA-Z]+ (one or more occurrences of any character that is not a letter from a to z or A to Z).

That is, we remove everything at the beginning of the string that is not a letter. The output will be:

Subordinam-se ao regime
O acesso à informação
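
If the cleaned strings are needed as a new list rather than printed, the same substitution can be sketched as a list comprehension. The names LEADING_NON_LETTERS and cleaned are only illustrative. As far as I know, Python's re module also does not support POSIX classes such as [[:alpha:]], which is a second reason the attempt in the question does not behave as expected.

```python
import re

# Hypothetical name: the pattern is compiled once and reused for every string.
LEADING_NON_LETTERS = re.compile(r'^[^a-zA-Z]+')

t = ['. Subordinam-se ao regime', '1º O acesso à informação']

# Apply the substitution to each string and keep the results in a new list.
cleaned = [LEADING_NON_LETTERS.sub('', s) for s in t]
print(cleaned)
```

For a pandas column, the same pattern can presumably be passed to Series.str.replace with regex=True, although that is not tested here.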

If a string may start with an accented letter, you can switch to:

import re
from unicodedata import normalize

t = ['. Subordinam-se ao regime',  '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(re.sub('^[^a-zA-Z]+', '', normalize('NFD', s)))

I use normalize to convert the string to NFD (Unicode's canonical decomposition form). In short, letters like É are broken down into two characters: the letter E without the accent, and the accent itself. Since the unaccented base letter comes first and falls within a-zA-Z, strings that start with accented letters are handled as well.


The downside is that normalizing changes the original content: the string you get back is in NFD form, not in the form it originally had.
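
One way to soften this, assuming the input was in composed (NFC) form to begin with, is to recompose the result after the substitution. The function name strip_leading_non_letters is made up for this sketch; if the original text was not in NFC, the output will still differ from it.

```python
import re
from unicodedata import normalize

def strip_leading_non_letters(s):
    # Decompose so that an accented letter starts with its plain base letter.
    decomposed = normalize('NFD', s)
    stripped = re.sub(r'^[^a-zA-Z]+', '', decomposed)
    # Recompose: this assumes the original text was in NFC form.
    return normalize('NFC', stripped)

for s in ['1º O acesso à informação', ' - É blabla etc']:
    print(strip_leading_non_letters(s))
```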

Another alternative, of course, is to include all the accented characters in the regex, something like:

re.sub('^[^a-zA-ZáéíóúÁÉÍÓÚ]+', '', s)

I only included the letters with an acute accent; to cover the others, just add them inside the brackets.
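
Instead of listing each accented letter by hand, a shorter variant is to use character ranges. The sketch below assumes that the Latin-1 letters (U+00C0 to U+00FF, skipping × and ÷) are enough for your data and that the text is in composed form. The name LEADING_NON_LETTERS is only illustrative.

```python
import re

# Assumption: Latin-1 letters are sufficient.
# The ranges skip U+00D7 (multiplication sign) and U+00F7 (division sign).
LEADING_NON_LETTERS = re.compile(
    r'^[^a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+'
)

t = ['. Subordinam-se ao regime', '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(LEADING_NON_LETTERS.sub('', s))
```

The character º (U+00BA) falls outside these ranges, so it is still removed from the start of the string.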


And it can also be done without regex:

from string import ascii_lowercase

# returns the string starting from its first letter
def apos_primeira_letra(s):
    # put all the valid letters here
    letras_validas = ascii_lowercase + 'çáéíóúãõâô'
    for i, c in enumerate(s):
        if c.lower() in letras_validas:
            return s[i:]  # return from this letter onward
    return s  # no letter found: return the string itself

t = ['. Subordinam-se ao regime',  '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(apos_primeira_letra(s))

The idea is to keep all the valid letters in letras_validas (I only included the lowercase ones to keep it simple, which is why lower() is applied to each character being checked).

Initially I had thought of using isalpha() to check whether a character is a letter, but this method returns True for the character º, so I thought it best to fix the set of letters that I consider valid.

Thus, the function apos_primeira_letra goes through the string and, if it finds a letter, returns everything from there on. If it finds none, it returns the string unmodified.
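
A possible middle ground between isalpha() and a hand-written list is to check each character's Unicode general category. This is only a sketch: after_first_cased_letter is a made-up name, and it assumes that only uppercase ('Lu') and lowercase ('Ll') letters should count. In the Unicode data shipped with current Python versions, º should be in category 'Lo', so it is skipped, but so would letters from scripts that have no case.

```python
from unicodedata import category

def after_first_cased_letter(s):
    for i, c in enumerate(s):
        # 'Lu' and 'Ll' are uppercase and lowercase letters.
        if category(c) in ('Lu', 'Ll'):
            return s[i:]  # return from this letter onward
    return s  # no cased letter found: return the string unchanged

print('º'.isalpha())  # shows why isalpha() alone was not enough
t = ['. Subordinam-se ao regime', '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(after_first_cased_letter(s))
```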

  • That's the second regex you've given me today, @hkotsubo! Thank you very much, it worked!

  • It is worth mentioning here that the expression will not handle accented characters correctly. For example, a phrase like "É claro que funciona" ("Of course it works") would become "claro que funciona". This may have side effects, so be careful.

  • @Woss Calm down, I’m updating the answer :-)

  • I’m calm, come on xD

  • @Woss Done, now it's right :-)

  • There are also cases where the characters to be removed are part of the content, like "-5°C was the temperature at that moment". I know that you, Hugo, are aware of this; I'm commenting just so Jessica takes care when using the regex.
