Regex Python Remove everything before the first letter in a string


Another regex question that none of the answers I found covers. I have a dataframe in which some strings erroneously start with characters that are not letters, for example:

t = ['. Subordinam-se ao regime', '1º O acesso à informação']

The result I want is

'Subordinam-se ao regime', 'O acesso à informação'

I’m trying re.sub(r'[[:alpha:]]+', '', t) without success. What am I doing wrong?

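In your example, t is a list, so it’s no use passing it directly to re.sub: you have to pass one string at a time. Also, Python’s re module does not support POSIX classes such as [[:alpha:]], and a pattern for letters would target the letters themselves rather than what comes before them.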

Therefore, one option is to use:

import re

t = ['. Subordinam-se ao regime',  '1º O acesso à informação']
for s in t:
    print(re.sub('^[^a-zA-Z]+', '', s))

The regex uses the anchor ^, which indicates the start of the string, followed by [^a-zA-Z]+ (one or more occurrences of any character that is not a letter from a to z or A to Z).

That is, we replace any run of non-letters at the beginning of the string with an empty string. The output will be:

Subordinam-se ao regime
O acesso à informação
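
Since the question mentions a dataframe, it may be more useful to build a new list than to print each result. A minimal sketch, assuming a hypothetical helper named strip_leading_non_letters and a precompiled pattern (neither appears in the original code):

```python
import re

# Hypothetical helper (not part of the original answer): same regex as above,
# compiled once so it can be reused across many strings.
LEADING_NON_LETTERS = re.compile(r'^[^a-zA-Z]+')

def strip_leading_non_letters(strings):
    # Returns a new list; the input list is left untouched.
    return [LEADING_NON_LETTERS.sub('', s) for s in strings]

t = ['. Subordinam-se ao regime', '1º O acesso à informação']
cleaned = strip_leading_non_letters(t)
print(cleaned)  # ['Subordinam-se ao regime', 'O acesso à informação']
```

The same function can be applied to any iterable of strings, such as the values of a dataframe column.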

If you have accented letters, you can switch to:

import re
from unicodedata import normalize

t = ['. Subordinam-se ao regime',  '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(re.sub('^[^a-zA-Z]+', '', normalize('NFD', s)))

I use normalize to convert the string to NFD, one of the Unicode normalization forms. Basically, letters like É are broken down into two code points: the letter E without the accent, and the accent itself as a combining character. Thus the pattern also treats accented characters as letters, since each one now starts with a plain ASCII letter.

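The problem is that normalizing changes the original content: the returned string is in decomposed (NFD) form, so it is no longer identical to the input even though it looks the same on screen.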
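
One possible way to soften this, offered as an assumption rather than as part of the original answer, is to recompose the result with NFC after removing the prefix. The function name remove_prefix_nfc is hypothetical, and the result only matches the original text if the input was already in NFC form:

```python
import re
from unicodedata import normalize

def remove_prefix_nfc(s):
    # Hypothetical helper: decompose, strip the leading non-letters,
    # then recompose so that 'E' + combining accent becomes 'É' again.
    decomposed = normalize('NFD', s)
    stripped = re.sub('^[^a-zA-Z]+', '', decomposed)
    return normalize('NFC', stripped)

t = ['. Subordinam-se ao regime', '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(remove_prefix_nfc(s))
```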

Of course, another alternative is to include all the accented characters in the regex, something like:

re.sub('^[^a-zA-ZáéíóúÁÉÍÓÚ]+', '', s)

I only included the letters with an acute accent, but you can simply add the rest inside the brackets.
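
Instead of listing each accented letter, character ranges can cover them. The sketch below is an assumption, not part of the original answer: it expects the strings to be in composed (NFC) form and only handles Latin-1 accented letters. The ranges À-Ö, Ø-ö and ø-ÿ skip the × and ÷ signs, and º (U+00BA) falls outside all of them:

```python
import re

# Assumed pattern: ASCII letters plus the Latin-1 accented letters.
# À-Ö is U+00C0..U+00D6, Ø-ö is U+00D8..U+00F6, ø-ÿ is U+00F8..U+00FF.
pattern = re.compile('^[^a-zA-ZÀ-ÖØ-öø-ÿ]+')

t = ['. Subordinam-se ao regime', '1º O acesso à informação', ' - É blabla etc']
results = [pattern.sub('', s) for s in t]
for r in results:
    print(r)
```

Unlike the normalize approach, this leaves the remaining characters of the string exactly as they were.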

And it can also be done without regex:

from string import ascii_lowercase

# returns the string starting from its first letter
def apos_primeira_letra(s):
    # put all the valid letters here
    letras_validas = ascii_lowercase + 'çáéíóúãõâô'
    for i, c in enumerate(s):
        if c.lower() in letras_validas:
            return s[i:] # return from this letter onward
    return s # no letter found, so return the string itself

t = ['. Subordinam-se ao regime',  '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(apos_primeira_letra(s))

The idea is to keep all the valid letters in letras_validas (I listed only the lowercase ones to keep it simple, which is why each character goes through lower() before being checked).

Initially I thought of using isalpha() to check whether a character is a letter, but this method returns True for the character º, so I thought it best to spell out the letters I consider valid.

Thus, the function apos_primeira_letra goes through the string and, as soon as it finds a letter, returns everything from that point on. If it finds none, it returns the string unmodified.
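
If maintaining the list of letters by hand is a burden, a middle ground between isalpha() and a fixed list might be to check the Unicode general category of each character. This is only a sketch under assumptions: the function name after_first_cased_letter is hypothetical, and it relies on º being classified as 'Lo' (other letter) rather than as an uppercase or lowercase letter, which holds for the Unicode data shipped with current Python 3 versions. It also means letters from scripts without case are skipped:

```python
from unicodedata import category

def after_first_cased_letter(s):
    # Hypothetical variant: accept only uppercase ('Lu') and lowercase ('Ll')
    # letters, which leaves out characters such as º (category 'Lo').
    for i, c in enumerate(s):
        if category(c) in ('Lu', 'Ll'):
            return s[i:]
    return s  # no cased letter found: return the string unchanged

t = ['. Subordinam-se ao regime', '1º O acesso à informação', ' - É blabla etc']
for s in t:
    print(after_first_cased_letter(s))
```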

