It depends on what your string looks like.
If "Nome: " is followed only by the name and then a line break, the simplest approach is:
import re
s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
result = re.findall(r'Nome: (.+)', s)
print(result)
I'm taking advantage of three facts:
- By default, the dot (.) matches any character except line breaks
- The quantifier + (one or more occurrences) is greedy and tries to grab as many characters as possible. That's why .+ takes everything up to the next line break
- The parentheses form a capture group, and findall returns a list with the capture groups when they are present
The result is a list with the name:
['Foo']
You can get the name with result[0], if you want. This regex also works for names with spaces and multiple surnames.
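To check that last point, here is a minimal sketch with a made-up input (the name, the example.com address and the variable names are only illustrative, they are not from the question):

```python
import re

# Hypothetical input: a name with several surnames, followed by another line
s = 'Dados do cliente\nNome: Fulano de Tal da Silva\nE-mail: fulano@example.com\n'

# The dot does not match '\n', so the greedy .+ stops at the end of the line
names = re.findall(r'Nome: (.+)', s)
print(names)  # ['Fulano de Tal da Silva']
```

The e-mail line is left out of the result because the match cannot cross the line break.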
Another alternative is to use search:
import re
s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
match = re.search(r'Nome: (.+)', s)
if match:
    print(match.group(1))
With this I get what was captured by the capture group, using the group method (since the name is inside the first pair of parentheses in the regex, it is group 1, so I call match.group(1)). The difference from findall is that now I already have the name as a string (instead of a list). The result is:
Foo
Anyway, your regex was also correct. The only detail is that you were not taking the capture group, but rather the whole match.
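The difference between the whole match and the capture group can be seen in a short sketch (the shortened input string and the variable names are just for illustration):

```python
import re

s = 'Dados do cliente\nNome: Foo\n'
match = re.search(r'Nome: (.+)', s)
if match:
    whole = match.group()   # the entire match, same as match.group(0)
    name = match.group(1)   # only what the first capture group matched
    print(repr(whole))  # 'Nome: Foo'
    print(repr(name))   # 'Foo'
```

Calling group() with no arguments is what returns the "Nome: " prefix together with the name.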
If you want, you can also use a lookbehind to check whether "Nome: " comes right before the name:
import re
s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\nE-mail: [email protected]\n'
match = re.search(r'(?<=Nome: ).+', s)
if match:
    print(match.group())
The difference is that the lookbehind (the part between (?<= and )) only checks whether something exists before the current position, but it is not part of the match. So I no longer need the capture group, and the result of the code above is the string Foo (see on regex101.com).
But honestly, I find it unnecessary in this case. The lookbehind makes the regex a little more complicated and less efficient. Compare a debug run of the regex with lookbehind against one of the regex without it (see how many steps each takes). The lookbehind makes the regex slower because, at each position, it has to look back to check whether "Nome: " comes before the current position. Of course, for small strings and a few runs the difference will be insignificant (maybe milliseconds or even less), but it's important to keep this in mind: is it worth using a slightly more complicated and slower regex just so you don't have to use the capture group?
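If you want to measure this on your own machine, a rough sketch with timeit could look like the one below. The number of repetitions is arbitrary, and the timings depend on the machine, the Python version and the input, so treat them only as an indication:

```python
import re
import timeit

s = 'a1in1iionia\n\nDados do cliente\nNome: Foo\n'
with_group = re.compile(r'Nome: (.+)')
with_lookbehind = re.compile(r'(?<=Nome: ).+')

# Both approaches extract the same name
name_a = with_group.search(s).group(1)
name_b = with_lookbehind.search(s).group()
print(name_a == name_b)

# Rough timing; the absolute values are not meaningful, only the comparison
t_group = timeit.timeit(lambda: with_group.search(s), number=20_000)
t_look = timeit.timeit(lambda: with_lookbehind.search(s), number=20_000)
print(f'capture group: {t_group:.4f}s, lookbehind: {t_look:.4f}s')
```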
Anyway, in the examples below I will use findall without the lookbehind (that is, I will use the capture group), but the same regex also works with search. Use whichever you think is best.
Validate the name
The problem is that the dot accepts anything (including special characters such as @!#$%^&, among others, which are not necessarily part of a name).
If you want, you can restrict the regex to accept only a sequence of letters, space, letters, space, and so on. It would look like this:
import re
s = 'a1in1iionia\n\nDados do cliente\nNome: Fulano de Tal\nE-mail: [email protected]\n'
result = re.findall(r'Nome: ([a-zA-Z]+(?: [a-zA-Z]+)*)', s)
print(result)
Output:
['Fulano de Tal']
Now the regex is ([a-zA-Z]+(?: [a-zA-Z]+)*):
- the first part is [a-zA-Z]+: one or more letters (uppercase or lowercase)
- then we have (?:, which creates a non-capturing group (so it is not returned by findall, because I am interested only in the outermost group, which contains the whole name)
- within this group we have a space (there is a space between the : and the [), followed by one or more letters
- this whole group (space plus letters) can repeat zero or more times (indicated by *). This ensures that we can have zero or more surnames
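To see why the inner group is non-capturing, here is a small sketch comparing the two variants (the shortened input and the variable names are only for illustration):

```python
import re

s = 'Nome: Fulano de Tal\n'

# Inner group is non-capturing: findall returns only the outer group
non_capturing = re.findall(r'Nome: ([a-zA-Z]+(?: [a-zA-Z]+)*)', s)
print(non_capturing)  # ['Fulano de Tal']

# Inner group is capturing: findall returns one tuple per match,
# and the repeated inner group keeps only its last repetition
capturing = re.findall(r'Nome: ([a-zA-Z]+( [a-zA-Z]+)*)', s)
print(capturing)  # [('Fulano de Tal', ' Tal')]
```

With the capturing version you would have to pick the first element of each tuple, which is extra work for no benefit here.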
Improvements
Of course, there is still room for improvement. If you want the name to always start with a capital letter, you can use [A-Z][a-z]+. The regex also does not handle names with an apostrophe (e.g., "D'Aquino") or a hyphen, nor accented characters.
For accents, some may suggest \w instead of [a-zA-Z], but the problem is that this shortcut also accepts digits and the character _, so I wouldn't use it if I wanted more precision. An alternative is to use something like:
re.findall(r'Nome: ([a-záéíóúâêôçãõ]+(?: [a-záéíóúâêôçãõ]+)*)', s, flags=re.IGNORECASE)
With this, the accented letters listed in the character class can be part of the name, and the IGNORECASE option makes both uppercase and lowercase letters match (so I don't need to put ÁÉÍÓÚ... in the regex).
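As a quick check of both points, the sketch below uses made-up names ("Álvaro Conceição" and "Jo_4o" are only illustrative) to show the effect of IGNORECASE on the explicit character class, and how \w lets digits and underscores through:

```python
import re

letters = '[a-záéíóúâêôçãõ]+'
pattern = rf'Nome: ({letters}(?: {letters})*)'

s = 'Nome: Álvaro Conceição\n'
# Without the flag, the uppercase 'Á' is not in the class, so nothing matches
without_flag = re.findall(pattern, s)
# With the flag, uppercase and lowercase versions of the listed letters match
with_flag = re.findall(pattern, s, flags=re.IGNORECASE)
print(without_flag)  # []
print(with_flag)     # ['Álvaro Conceição']

# \w is looser: it also accepts digits and the underscore
loose = re.findall(r'Nome: (\w+)', 'Nome: Jo_4o\n')
print(loose)  # ['Jo_4o']
```

Note that IGNORECASE applies to the whole pattern, so "nome: " in lowercase would also be accepted as the prefix.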
Another option is to use the regex module, an excellent extension of the native re module. You can install it with pip install regex.
This module supports Unicode properties, so I can use \p{L} for any character that is a letter.
I also use the unicodedata module to normalize the string, ensuring that it will not break the regex (Unicode normalization is beyond the scope of this answer).
import regex
import unicodedata
s = 'a1in1iionia\n\nDados do cliente\nNome: Fulâno D\'Aquino Ávila Souza-e-Silva\nE-mail: [email protected]\n'
# regex for a single name or surname
nome = r'\p{L}+(?:[-\']\p{L}+)*'
# build the regex (name, space, surname, space, surname...)
r = regex.compile(r'Nome: ({0}(?: {0})*)'.format(nome))
result = r.findall(unicodedata.normalize('NFC', s))
print(result)
I also included the apostrophe or hyphen check in the name: (?:[-\']\p{L}+)* is a hyphen or apostrophe followed by one or more letters (and the asterisk makes this whole group repeat zero or more times).
The output is:
["Fulâno D'Aquino Ávila Souza-e-Silva"]
Keep in mind that \p{L} also includes, in addition to accented characters, letters from other languages such as Japanese, Arabic, etc.
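As for why normalization matters, here is a minimal sketch that uses only the standard library (re instead of regex, and a reduced character class chosen just for this illustration). The same letter "â" can be stored as one code point or as "a" plus a combining accent, and a combining accent is not a letter, so the decomposed form splits the name:

```python
import re
import unicodedata

composed = 'Ful\u00e2no'     # 'â' as a single code point (NFC form)
decomposed = 'Fula\u0302no'  # 'a' followed by a combining circumflex (NFD form)

# Reduced character class for the illustration: ASCII letters plus 'â' (U+00E2)
letters = re.compile('[a-z\u00e2]+', re.IGNORECASE)

found_composed = letters.findall(composed)
found_decomposed = letters.findall(decomposed)  # split at the combining mark
print(len(found_composed), len(found_decomposed))  # 1 2

# After NFC normalization both strings are identical, and the name is whole again
normalized = unicodedata.normalize('NFC', decomposed)
found_normalized = letters.findall(normalized)
print(normalized == composed)  # True
```

The two strings look the same when printed, which is why this kind of problem is easy to miss without normalizing first.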
I am using Python 3.x and testing through https://regex101.com/
– Kfcaio
This regex is right, but you have to use what is in group 1 and not the complete match. Which Python code are you using?
– Isac