Extract content from "Nome:" field until end of line

Given a string in the following format:

'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'

I need to extract the full content of the "Nome" field. To do that, I wrote this regex:

Nome: ([a-zA-Z]+)

The problem is that this regex produces the output:

"Nome: Foo"

However, I only need "Foo". I also noticed that it does not work for compound names (more than one word). Without writing a full regular expression for names, I need the field to be captured up to '\n', that is, up to the end of the line.

How can I make my expression more robust for these cases?

  • I am using Python3.x and testing through https://regex101.com/

  • This regex is correct, but you have to use what is in group 1 rather than the whole match. What Python code are you using?

2 answers

4


It depends on what your string looks like.

If "Nome: " is followed only by the name and then a line break, the simplest approach is:

import re

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
result = re.findall(r'Nome: (.+)', s)
print(result)

I’m taking advantage of three facts:

  • By default, the dot (.) matches any character except line breaks
  • The quantifier + (one or more occurrences) is greedy and tries to grab as many characters as possible. That’s why .+ takes everything up to the next line break
  • The parentheses form a capture group, and findall returns a list with the capture groups when they are present

The result is a list with the name:

['Foo']

You can get the name with result[0] if you want. This regex also works for names with spaces and multiple surnames.
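As a quick check of that claim, here is a minimal sketch that runs the same pattern against a few names with spaces and several surnames. The sample names and the e-mail address are made up for illustration.

```python
import re

# Hypothetical inputs; only the "Nome: " line matters here
samples = [
    'Nome: Foo\nE-mail: foo@example.com\n',
    'Nome: Fulano de Tal\nE-mail: foo@example.com\n',
    'Nome: Maria da Silva Souza\nE-mail: foo@example.com\n',
]

# .+ stops at the line break, so each capture is the rest of the "Nome: " line
names = [re.findall(r'Nome: (.+)', s)[0] for s in samples]
print(names)
```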

Another alternative is to use search:

import re

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
match = re.search(r'Nome: (.+)', s)
if match:
    print(match.group(1))

With this I get what was captured by using the group method (the name is inside the first pair of parentheses in the regex, so it is group 1, hence match.group(1)). The difference from findall is that now I get the name directly as a string (instead of a list). The result is:

Foo
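One more practical difference shows up when the text has more than one "Nome: " line: findall returns every occurrence, while search stops at the first one. A small sketch, using a made-up text with two such lines:

```python
import re

# Hypothetical text with two "Nome: " lines
text = 'Nome: Foo\nE-mail: foo@example.com\nNome: Bar Baz\n'

all_names = re.findall(r'Nome: (.+)', text)   # every occurrence
first = re.search(r'Nome: (.+)', text)        # only the first occurrence
first_name = first.group(1) if first else None

print(all_names)
print(first_name)
```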

Anyway, your regex was also correct. The only detail is that you were not reading the capture group, but rather the entire match.
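To make that concrete, this sketch applies the regex from the question and prints both the entire match and group 1. The e-mail address in the string is a placeholder.

```python
import re

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: foo@example.com\n'
match = re.search(r'Nome: ([a-zA-Z]+)', s)  # the sample is known to match

whole = match.group(0)   # the entire match, including the "Nome: " prefix
name = match.group(1)    # only what the parentheses captured
print(repr(whole), repr(name))
```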

If you want, you can also use a lookbehind to check that "Nome: " comes right before the name:

import re

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
match = re.search(r'(?<=Nome: ).+', s)
if match:
    print(match.group())

The difference is that the lookbehind (the part between (?<= and )) only checks whether something exists before the current position, but it is not part of the match. So I no longer need the capture group, and the result of the above code is the string Foo (see on regex101.com).

But honestly, I find it unnecessary in this case. The lookbehind makes the regex a little more complicated and less efficient. Debugging the regex with lookbehind on regex101.com and comparing it with the regex without lookbehind shows the difference (see how many steps each takes). The lookbehind makes the regex slower because the engine has to keep looking back to check whether "Nome: " precedes the current position. Of course, for small strings and only a few runs the difference will be insignificant (maybe milliseconds or even less), but it is worth keeping in mind: is it worth using a slightly more complicated and slower regex just to avoid the capture group?
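If you would rather measure this in Python than count steps on regex101.com, a rough sketch with timeit could look like the one below. The number of runs is an arbitrary choice, the e-mail address is a placeholder, and the timings depend on the machine and Python version, so treat the printed numbers as indicative only.

```python
import re
import timeit

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: foo@example.com\n'

with_group = re.compile(r'Nome: (.+)')
with_lookbehind = re.compile(r'(?<=Nome: ).+')

# Both approaches should extract the same text
name_group = with_group.search(s).group(1)
name_lookbehind = with_lookbehind.search(s).group()

# Time many searches; absolute numbers vary from one environment to another
runs = 20_000
t_group = timeit.timeit(lambda: with_group.search(s), number=runs)
t_lookbehind = timeit.timeit(lambda: with_lookbehind.search(s), number=runs)
print(f'capture group: {t_group:.4f}s  lookbehind: {t_lookbehind:.4f}s')
```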

Anyway, in the examples below I will use findall without the lookbehind (that is, with the capture group), but the same regex also works with search. Use whichever you think is best.


Validate the name

The problem is that the dot will accept anything (including special characters such as @!#$%^&, among others, which are not necessarily part of a name).
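A short illustration of that problem, using a made-up input whose "Nome: " line clearly does not hold a name:

```python
import re

# Hypothetical input: the "Nome: " line contains symbols and digits
s = 'Dados do cliente\nNome: @!#$%^& 123\nE-mail: foo@example.com\n'
result = re.findall(r'Nome: (.+)', s)
print(result)
```

The pattern happily returns the whole line content, since the dot does no validation at all.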

If you want, you can make the regex stricter and accept only a sequence of letters, a space, more letters, another space, and so on. It would look like this:

import re

s = 'a1in1iionia\n\nDados do cliente\nNome: Fulano de Tal\nE-mail: [email protected]\n'
result = re.findall(r'Nome: ([a-zA-Z]+(?: [a-zA-Z]+)*)', s)
print(result)

Output:

['Fulano de Tal']

Now the regex is ([a-zA-Z]+(?: [a-zA-Z]+)*):

  • the first part is [a-zA-Z]+: one or more letters (uppercase or lower case)
  • then we have (?:, which creates a non-capturing group (so it is not returned by findall, because I am interested only in the outermost group, which contains the whole name)
    • within this group we have a space (there is a space between the : and the [), followed by several letters
    • this whole group (space plus letters) can repeat itself zero or more times (indicated by *). This ensures that we can have zero or more surnames
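The effect of the non-capturing group on findall can be seen by comparing it with a version where the inner group is a regular capture group. This is only a sketch for comparison; the second pattern is not part of the original answer.

```python
import re

s = 'Dados do cliente\nNome: Fulano de Tal\nE-mail: foo@example.com\n'

# Inner group written as (?: ... ): findall returns only the outer group
non_capturing = re.findall(r'Nome: ([a-zA-Z]+(?: [a-zA-Z]+)*)', s)

# Inner group written as ( ... ): findall returns a tuple with both groups,
# and a repeated group keeps only its last repetition
capturing = re.findall(r'Nome: ([a-zA-Z]+( [a-zA-Z]+)*)', s)

print(non_capturing)
print(capturing)
```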

Improvements

Of course, there is still room for improvement. If you want each name to always start with a capital letter, you can use [A-Z][a-z]+. The regex also does not handle names with an apostrophe (e.g., "D'Aquino") or a hyphen, nor accented characters.

For accents, some may suggest \w instead of [a-zA-Z], but the problem is that this shortcut also accepts digits and the character _, so I wouldn’t use it if I wanted more precision. An alternative is to use something like:

re.findall(r'Nome: ([a-záéíóúâêôçãõ]+(?: [a-záéíóúâêôçãõ]+)*)', s, flags=re.IGNORECASE)

With this, the listed accented letters are part of the name, and the IGNORECASE option makes the match case-insensitive (so I don’t need to put ÁÉÍÓÚ... in the regex).
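The sketch below contrasts the two options on made-up inputs: one with digits and an underscore, one with accented letters in mixed case. It assumes the accented characters in the source are in their precomposed (NFC) form.

```python
import re

# Hypothetical inputs, one with digits/underscore and one with accented letters
s_noise = 'Nome: Foo_123 Bar9\n'
s_accent = 'Nome: JOÃO Ávila\n'

# \w also matches digits and "_", so this "name" is accepted as-is
loose = re.findall(r'Nome: (\w+(?: \w+)*)', s_noise)

# The explicit class stops at the first character outside it ("_" here),
# while still accepting accented letters in either case
strict = r'Nome: ([a-záéíóúâêôçãõ]+(?: [a-záéíóúâêôçãõ]+)*)'
strict_noise = re.findall(strict, s_noise, flags=re.IGNORECASE)
strict_accent = re.findall(strict, s_accent, flags=re.IGNORECASE)

print(loose, strict_noise, strict_accent)
```

Note that the stricter pattern does not reject the noisy line outright; it simply captures the leading letters and stops, so further validation would be needed if partial matches are not acceptable.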


Another option is to use the regex module, an excellent extension of the native re module. You can install it with pip install regex.

This module supports Unicode properties, so I can use \p{L} for any character that is a letter.

I also use the unicodedata module to normalize the string, so that accented characters stored in a different Unicode form do not break the regex (Unicode normalization is beyond the scope of this answer).

import regex
import unicodedata

s = 'a1in1iionia\n\nDados do cliente\nNome: Fulâno D\'aquino Ávila Souza-e-Silva\nE-mail: [email protected]\n'
# regex for a single name or surname
nome = r'\p{L}+(?:[-\']\p{L}+)*'
# build the full regex (name, space, surname, space, surname...)
r = regex.compile(r'Nome: ({0}(?: {0})*)'.format(nome))
result = r.findall(unicodedata.normalize('NFC', s))
print(result)

I also included the apostrophe or hyphen check in the name: (?:[-\']\p{L}+)* is a hyphen or apostrophe, followed by several letters (and the asterisk makes this whole group repeat zero or more times).

The output is:

["Fulâno D'aquino Ávila Souza-e-Silva"]

Keep in mind that \p{L} includes, in addition to accented characters, letters from other scripts such as Japanese, Arabic, etc.
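To show why the normalization step matters, here is a small sketch that uses only the standard library (re with an explicit character class instead of the regex module, which is a simplification for illustration). The same visible name can be stored with "â" as one code point (NFC) or as "a" plus a combining circumflex (NFD); the combining mark is not a letter, so a letters-only pattern fails on the NFD form.

```python
import re
import unicodedata

composed = 'Ful\u00e2no'                               # "â" as one code point (NFC)
decomposed = unicodedata.normalize('NFD', composed)   # "a" + combining circumflex

# A class that lists the precomposed "â"
pattern = re.compile('[a-z\u00e2]+', re.IGNORECASE)

match_nfd = pattern.fullmatch(decomposed)
match_nfc = pattern.fullmatch(unicodedata.normalize('NFC', decomposed))
print(len(composed), len(decomposed), match_nfd is not None, match_nfc is not None)
```

Normalizing to NFC before matching, as the answer does, makes both forms behave the same way.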

2

You can use the re.MULTILINE flag, which makes it simpler to match text line by line:

>>> import re    
>>> texto = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
>>> m = re.search('Nome: (.*)$', texto, re.MULTILINE)
>>> print(m.group(1))
Foo

With re.MULTILINE, the $ sign in the regular expression means "end of the line", which makes it easy to capture all the text up to the end of the line.
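For comparison, this sketch runs the same pattern with and without the flag. The e-mail address in the string is a placeholder.

```python
import re

texto = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: foo@example.com\n'

# Without the flag, $ only matches at the very end of the string
# (or just before a final line break), so this pattern finds nothing
without_flag = re.search(r'Nome: (.*)$', texto)

# With the flag, $ also matches right before every line break
with_flag = re.search(r'Nome: (.*)$', texto, re.MULTILINE)

print(without_flag)
print(with_flag.group(1))
```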
