Capturing pattern groups with regex


How do I capture group-separated information with regex? I have a string with the following format:

/+1-541-754-3010 156 Alphand_St. <J Steeve>

 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010

Each string contains 3 different pieces of information:

First string:

  • telephone (1-541-754-3010)
  • name (J Steeve)
  • address (156 Alphand St.)

Second string:

  • telephone (1-541-914-3010)
  • name (E Kustur)
  • address (133 Green Rd. NY-56423)

Is it possible to capture the 3 separate pieces of information in groups at once? So far I’ve only managed to capture the phone using the following pattern:

(\+(.*?)\s+?)

I tried to add more groups just by adding another pair of parentheses, but it doesn’t work. For example:

(\+(.*?)\s+?)(\<(.*?)\>)

The pattern above ends up capturing the phone together with the address.

  • 1

    Are you using any specific language? Are the fields always in this format (first and last name between <>; phone as +1-123-123-1234, etc.)? In the second string, the address is in 2 pieces (the name sits between the street and the ZIP code, as I understand it), is that right?

  • 1

    One idea would be to capture the phone and the name with regex, then do a replace that removes this information; what is left is the address.

  • @Sam I tried to make a more general solution, but in the end it was getting too complicated. I think your idea is the simplest solution, I even mentioned it in my answer :-)

  • @hkotsubo the strings are the same, the name will always be in <name> and the phone always starts with +

  • the language I’m using is Go

1 answer

1


The problem with using . is that it matches any character, so it is too broad and may end up picking up parts of the string that you don’t want.

It is best to say exactly what you want. In the case of the phone, if it is always in the given format, use:

(\+\d-\d{3}-\d{3}-\d{4})

That is, the + sign, followed by a digit (\d), a dash, 3 digits (\d{3}), a dash, three digits, a dash, and four digits. The number after the + is the international dialing code (IDD), and several countries have codes with more than one digit. So it may be worth changing the beginning to \+\d{1,3}: the + sign followed by one, two or three digits ({1,3} means "between one and three occurrences").

However, if the IDD is Brazil’s, for example, the phones do not have this format (123-123-1234). But I’m already speculating, because you only gave examples with the IDD equal to 1, so let’s keep it that way.

This way the phone will not be confused with any other part of the string.
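
Since you said you are using Go, here is a minimal sketch of the phone expression applied to your two sample lines. It assumes the \+\d{1,3} variant discussed above; the names phoneRe and findPhones are mine, just for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// phoneRe assumes the format from the question: "+", a country code of
// 1 to 3 digits, then groups of 3, 3 and 4 digits separated by dashes.
var phoneRe = regexp.MustCompile(`\+\d{1,3}-\d{3}-\d{3}-\d{4}`)

// findPhones returns every phone-shaped substring found in s.
func findPhones(s string) []string {
	return phoneRe.FindAllString(s, -1)
}

func main() {
	lines := []string{
		"/+1-541-754-3010 156 Alphand_St. <J Steeve>",
		" 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010",
	}
	for _, line := range lines {
		fmt.Println(findPhones(line))
	}
}
```

Because the expression demands the leading + and the exact digit grouping, pieces such as 156 or NY-56423 are not picked up.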


The same goes for the other fields. Take the name, for example. If it is always between < and >, we can use <(.*?)> like you did. But the dot (.) means "any character", so if the string has <@!#$>, the regex accepts it and considers @!#$ to be the name. And since * means "zero or more occurrences", the string can even contain <>, and the name will be empty.

If the name is always "Uppercase letter, space, letters", you can use <([A-Z] [A-Z][a-z]+)>, for example.

The square brackets ([]) indicate a character class: they say that you want any one of the characters inside them. For example, [abc] means "the letter a or the letter b or the letter c" (only one of them, any one will do). It is an expression that corresponds to a single character.

Inside the brackets it is also possible to use shortcuts such as A-Z, which means "letters from A to Z" (that is, any capital letter). So [A-Z] [A-Z][a-z]+ means:

  • a capital letter ([A-Z]) followed by a space
  • a capital letter ([A-Z]) followed by one or more lower case letters ([a-z]+)

However, this doesn’t consider accented characters (á, ñ, õ, etc.).
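
To make the difference concrete, here is a small Go sketch comparing the loose <(.*?)> with the stricter expression. The helper extractName and the test strings other than <J Steeve> are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Accepts anything between < and >, including nothing at all.
	looseNameRe = regexp.MustCompile(`<(.*?)>`)
	// Accepts only "uppercase letter, space, uppercase letter, lowercase letters".
	strictNameRe = regexp.MustCompile(`<([A-Z] [A-Z][a-z]+)>`)
)

// extractName returns the text captured by the first group of re,
// and false when re does not match s at all.
func extractName(re *regexp.Regexp, s string) (string, bool) {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, s := range []string{"<J Steeve>", "<@!#$>", "<>", "<J Álvaro>"} {
		loose, okLoose := extractName(looseNameRe, s)
		strict, okStrict := extractName(strictNameRe, s)
		fmt.Printf("%q: loose=%q (%v), strict=%q (%v)\n",
			s, loose, okLoose, strict, okStrict)
	}
}
```

The loose version reports a match for all four inputs, while the strict one should accept only <J Steeve>; the accented name is rejected because Á is outside A-Z.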

If you want to be more comprehensive, you can use the Unicode categories (if the language/engine you are using supports this feature).

You can use the category Ll, which covers any lower case letter (including Greek, Cyrillic and many other characters, the list is long), and the category Lu for capital letters (whose list is also long).
Then the regex would be <(\p{Lu} \p{Lu}\p{Ll}+)>.
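
Go's regexp package accepts \p{Lu} and \p{Ll}, so a sketch of this version could look like the following. The function name matchName and the sample names with accents and Cyrillic letters are my own, not from your data.

```go
package main

import (
	"fmt"
	"regexp"
)

// unicodeNameRe uses Unicode general categories: \p{Lu} is any uppercase
// letter and \p{Ll} any lowercase letter, regardless of script.
var unicodeNameRe = regexp.MustCompile(`<(\p{Lu} \p{Lu}\p{Ll}+)>`)

// matchName returns the captured name and whether s contained one.
func matchName(s string) (string, bool) {
	m := unicodeNameRe.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, s := range []string{"<J Steeve>", "<J Álvaro>", "<Ж Иванов>", "<j steeve>"} {
		name, ok := matchName(s)
		fmt.Printf("%q -> %q (%v)\n", s, name, ok)
	}
}
```

The first three inputs should match, and the all-lowercase one should not, since \p{Lu} still requires capital letters in the same positions.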

If you don’t want to pick up so many characters and limit yourself to the Latin alphabet, you can include accented characters in brackets, for example [A-ZÁÂÃÉÊÍÎÓÔÕÚÛÇ] for capital letters (include all you need in brackets).

Anyway, choose the one that fits best in your use cases. If the entries are well controlled and there is no chance of having strings like $ @123 in place of name, even .*? is acceptable. The more accurate the regex, the more complex it becomes, but the simpler, the greater the chance of false positives.

Note: I’m not sure if in Go, the characters < and > need to be escaped to \< and \>.


Another problem is the order of the information, which, judging by the examples given, seems to vary:

  • in the first line we have telephone, address and name
  • in the second line we have part of the address, name, other part of the address (ZIP code) and phone

An alternative would be to have a regex with alternation (using |), more or less like this:

(?:(telefone) (endereço) (nome))|(?:(parte_endereço) (nome) (parte2_endereço) (telefone))|(?:....)

Here, telefone (phone) would be the expression above (\+\d...), and likewise for nome (name) and endereço (address), which we did not cover in detail but which would have its own expression placed there.

Each alternative lists one order in which the information can appear. The problem is that this forces us to repeat the same expression several times. You would also have to check which group produced the match: the phone can be in the first or the seventh group, for example (the outermost parentheses do not count because I used ?:, which makes them non-capturing groups).
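
Here is a reduced Go sketch of that idea. To keep it short it handles only the phone and the name (no address expression, since we did not define one) in the two orders that appear in your examples; the names eitherOrderRe and phoneAndName are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Sub-expressions built once as strings, so each alternative can reuse them.
const (
	phone = `(\+\d{1,3}-\d{3}-\d{3}-\d{4})`
	name  = `<([A-Z] [A-Z][a-z]+)>`
)

// Groups 1 and 2 belong to the "phone ... name" order,
// groups 3 and 4 to the "name ... phone" order.
var eitherOrderRe = regexp.MustCompile(
	`(?:` + phone + `.*?` + name + `)|(?:` + name + `.*?` + phone + `)`)

// phoneAndName checks which alternative matched and returns its groups.
func phoneAndName(s string) (string, string, bool) {
	m := eitherOrderRe.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	if m[1] != "" { // first alternative: phone came before the name
		return m[1], m[2], true
	}
	return m[4], m[3], true // second alternative: name came first
}

func main() {
	for _, s := range []string{
		"/+1-541-754-3010 156 Alphand_St. <J Steeve>",
		" 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010",
	} {
		p, n, ok := phoneAndName(s)
		fmt.Println(p, n, ok)
	}
}
```

Building the pattern from string constants at least avoids typing the sub-expressions twice, although the compiled regex still contains them twice and the group bookkeeping is still manual.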

If the language/engine you are using supports regex subroutines (also called recursive patterns), you can use them to reuse the same expression at other points in the regex. The idea of a subroutine is to reuse the expression inside one pair of parentheses at another point of the regex. For example:

(telefone) (endereço) (nome)|((?2)) ((?3)) ((?1))

The expression (?1) means "the same expression that is in the first capture group". Since a capture group is defined by parentheses, (telefone) is the first group, and (?1) is just a shortcut so we don’t have to repeat the same expression (\+\d-etc...). Note that I wrapped it in parentheses: ((?1)). This turns it into another capture group, i.e., it is possible to capture the text if it is found at this point (and it will be the sixth capture group, since it is the sixth pair of capturing parentheses in the expression).

See here an example of how it would look.

Unfortunately, not all languages and libraries implement this feature. One that I know of is Python’s regex module (not the same as the re module; it adds several features, including subroutines). And from what I saw in the Go documentation, it does not support the (?1) syntax, so the way out is to repeat the same expression over and over.
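
A quick way to check this in Go is to ask regexp.Compile whether it accepts the pattern. This is only a sketch; supportsPattern is a helper I made up, and the two patterns are toy examples.

```go
package main

import (
	"fmt"
	"regexp"
)

// supportsPattern reports whether Go's regexp package can compile pattern.
func supportsPattern(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Subroutine call syntax, as accepted by engines that implement it.
	fmt.Println(supportsPattern(`(\d+)-((?1))`))
	// The same idea with the sub-expression repeated by hand.
	fmt.Println(supportsPattern(`(\d+)-(\d+)`))
}
```

The first call should report false, because the compile step returns an error for (?1), and the second should report true.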

Another alternative is to do as @Sam suggested: you can extract the name and phone (using the expressions already explained above), and then make a substitution by removing them from the string. What’s left is the address.
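
A minimal Go sketch of this approach is below. It assumes that the stray / and ; in your samples are just separators that can be trimmed away; the record type and parseRecord function are names I invented, and the address is returned as it appears in the string (underscores and commas are not normalized).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	phoneRe = regexp.MustCompile(`\+\d{1,3}-\d{3}-\d{3}-\d{4}`)
	nameRe  = regexp.MustCompile(`<([A-Z] [A-Z][a-z]+)>`)
)

// record holds the three pieces of information from one input string.
type record struct {
	Phone, Name, Address string
}

// parseRecord extracts phone and name, removes them from the string,
// and treats whatever is left as the address.
func parseRecord(s string) record {
	var r record
	r.Phone = phoneRe.FindString(s)
	if m := nameRe.FindStringSubmatch(s); m != nil {
		r.Name = m[1]
	}
	rest := phoneRe.ReplaceAllString(s, "")
	rest = nameRe.ReplaceAllString(rest, "")
	// Assumption: "/" and ";" are only separators, so drop them at the edges.
	rest = strings.Trim(rest, "/; \t")
	// Collapse the runs of spaces left behind by the removals.
	r.Address = strings.Join(strings.Fields(rest), " ")
	return r
}

func main() {
	for _, s := range []string{
		"/+1-541-754-3010 156 Alphand_St. <J Steeve>",
		" 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010",
	} {
		fmt.Printf("%+v\n", parseRecord(s))
	}
}
```

This handles the second string, where the address is split around the name, without any alternation: once the phone and name are gone, the two address pieces end up side by side.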
