<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;
Here is the setup: above is the text, and below is the regex that captures information from it. One thing to keep in mind is that the text the regex runs against is semi-structured and contains repetitions. For context, the regex is meant to capture addresses.
, (estabelecida|localizada|estabelecido|localizado) (na|no|em) ([ (Município|Estado)]([0-9A-Za-zçãàáâéêôôôôôõOLOURION Q() <>,º°ª-.;/' " E]+)-\s*[A-Z]{2})
I want to capture each address in the document and wrap each one in the tags <END> and </END>, covering only the part matched by
([0-9A-Za-zçãàáâéêôôôôôôãoª-ª Q() <>º°ª-. ;/' " E]+)-\s*[A-Z]{2})
Everything else is "normal text": it should not be captured, but it should not be discarded either. So, for the example given, the expected result is:
<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;
However, as you can see from the first text, my regex captures all the addresses at once, as a single match. I thought of using regex because that is how I have been capturing other things, but any approach that solves this is welcome.
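For illustration, here is a minimal Python sketch of one possible way to get one match per address. It is an assumption-laden sketch, not a tested fix for the original pattern: a likely cause of the single big match is the greedy `+` combined with a character class that admits `,`, `;` and `-`, so the sketch uses a lazy quantifier and excludes `;` from the address body. The names `ADDRESS` and `tag_addresses` are hypothetical, the simplified pattern is mine rather than the one above, and the sample assumes the excerpt begins with "localizada na", since the original text is cut off at that point.

```python
import re

# Simplified pattern (an assumption, not the original regex):
# group 1 = the lead-in words, which stay outside the tags;
# group 2 = the shortest run without ';' that ends in "-UF"
#           (a hyphen, optional whitespace, two uppercase letters).
ADDRESS = re.compile(
    r"((?:estabelecid[ao]|localizad[ao]) (?:na|no|em) )"
    r"([^;]+?-\s*[A-Z]{2})\b"
)


def tag_addresses(text: str) -> str:
    """Wrap every address in <END>...</END>, leaving all other text untouched."""
    return ADDRESS.sub(r"\1<END>\2</END>", text)


sample = (
    "localizada na Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, "
    "João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual "
    "nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, "
    "km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 "
    "e Inscrição Estadual nº 27.142.202-5;"
)

tagged = tag_addresses(sample)
print(tagged)
```

Because the quantifier is lazy, each match stops at the first "-UF" it can reach, and `re.sub` puts the rest of the text back unchanged. Addresses that contain a `;` or a hyphen followed by two capital letters before the state code would need a stricter pattern.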
Exactly which part of the text do you need to capture?
– MagicHat
@Magichat each of the text fragments that are between <END> and </END> (second code).
– JNMarcos
For example: <END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>
– JNMarcos