How to capture only the first part of a text that matches a regex?

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;

The situation is as follows: above is a sample text, and below is the regex I use to capture information from it. Keep in mind that the text the regex runs against is semi-structured and contains some repetition. For context, the regex captures addresses.

, (estabelecida|localizada|estabelecido|localizado) (na|no|em) ([ (Município|Estado)]([0-9A-Za-zçãàáâéêôôôôôõOLOURION Q() <>,º°ª-.;/' " E]+)-\s*[A-Z]{2})

I want to capture each address in the document and wrap it in the tags <END> and </END>. Only the part matched by the following group should be wrapped:

([0-9A-Za-zçãàáâéêôôôôôôãoª-ª Q() <>º°ª-. ;/' " E]+)-\s*[A-Z]{2})

That is, the rest is considered "normal text": it should not be captured, but it must not be discarded either. So, for the example given, the expected result is:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;

However, as the sample text shows, my regex currently captures all the addresses at once, as a single match. I thought of using regex because that is how I have been capturing other things, but any other way of solving this is welcome.

  • Exactly which part of the text do you need to capture?

  • @Magichat Each of the text fragments between <END> and </END> in the second code block.

  • <END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>

1 answer

I see that in your text, the different addresses are separated by semicolons. This makes the task very simple:

import java.util.Arrays;
import java.util.stream.Collectors;

public class Enderecos {

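    // Inserts "<END>" right after a start marker such as "localizada na ".
    // If no marker is present, the tag goes at the very beginning of the segment.
    private static String localizarInicio(String s) {
        String[] loc = {"estabelecida ", "estabelecido ", "localizada ", "localizado "};
        String[] cj = {"em ", "na ", "no "};
        for (String a : loc) {
            for (String b : cj) {
                if (s.contains(a + b)) return s.replace(a + b, a + b + "<END>");
            }
        }
        return "<END>" + s;
    }

    // Inserts "</END>" right before ", com CNPJ", or at the end of the segment if that text is absent.
    private static String localizarFim(String s) {
        String busca = ", com CNPJ";
        if (s.contains(busca)) return s.replace(busca, "</END>" + busca);
        return s + "</END>";
    }

    // Splits the text on ";", strips any existing tags, and re-tags each non-empty segment.
    // Note: the semicolons consumed by split(";") are not re-inserted when joining.
    public static String formatarListaEnderecos(String malformatado) {
        return Arrays
                .asList(malformatado.split(";"))
                .stream()
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                .filter(t -> !t.isEmpty())
                .map(Enderecos::localizarInicio)
                .map(Enderecos::localizarFim)
                .collect(Collectors.joining());
    }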

    public static void main(String[] args) {
        String texto = "<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;<END>";
        String formatado = formatarListaEnderecos(texto);
        System.out.println(formatado);
    }
}

The method that does what you want is formatarListaEnderecos(String). It works as follows:

  1. Splits the text at the semicolons, producing an array of addresses, which is then converted into a list and then into a Stream.

  2. Removes any "<END>" and "</END>" tags already present in each address, since they were not placed correctly to begin with (they are re-inserted later).

  3. Removes whitespace at the beginning and end of each address with trim().

  4. Discards "addresses" that have been reduced to empty strings.

  5. Finds where the "<END>" belongs and inserts it in each address.

  6. Finds where the "</END>" belongs and inserts it in each address.

  7. Joins everything into a single string and returns the result.

The position of the "<END>" is determined by the method localizarInicio(String). It looks for "(estabelecid|localizad)(o|a) (em|na|no) " and puts the <END> right after it. If it finds nothing, it puts the tag at the very beginning.
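
The pattern quoted above can also be applied literally with java.util.regex instead of nested loops. This is only a minimal sketch: the class InicioComRegex and the helper marcarInicio are hypothetical names that do not appear in the code above, and unlike String.replace this version tags only the first marker it finds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InicioComRegex {

    // Same marker described in the text: "estabelecido/a" or "localizado/a",
    // followed by "em", "na" or "no" and a space.
    private static final Pattern INICIO =
            Pattern.compile("(estabelecid|localizad)(o|a) (em|na|no) ");

    // Hypothetical helper: inserts <END> right after the first marker,
    // or at the very beginning when no marker is found.
    public static String marcarInicio(String s) {
        Matcher m = INICIO.matcher(s);
        if (m.find()) {
            return s.substring(0, m.end()) + "<END>" + s.substring(m.end());
        }
        return "<END>" + s;
    }

    public static void main(String[] args) {
        System.out.println(marcarInicio("(NR) II - Sergipe, localizada na Rodovia BR-101"));
        System.out.println(marcarInicio("Av. Dr. Walter Belian"));
    }
}
```

Whether this reads better than the two arrays is a matter of taste; the behavior differs only when a segment contains the same marker more than once.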

The position of the "</END>" is right before the text ", com CNPJ". If that text is not found, the tag goes at the end.

The method main(String[]) is there so you can test it. Running it produces the following output:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157(NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5(NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399
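
Note that split(";") consumes the semicolons, so they are missing from the output above. If they need to be kept, as in the expected result in the question, one option is to give Collectors.joining a delimiter and a suffix. The following is a minimal sketch under assumptions: the class ManterPontoEVirgula and the methods marcar and juntar are hypothetical, and marcar is only a simplified stand-in for localizarInicio and localizarFim.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class ManterPontoEVirgula {

    // Simplified stand-in for the real tagging step: wraps the whole segment.
    public static String marcar(String s) {
        return "<END>" + s + "</END>";
    }

    // Splits on ";", tags each non-empty segment, then puts "; " back between
    // segments and a final ";" at the end.
    public static String juntar(String texto) {
        return Arrays.stream(texto.split(";"))
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                .map(ManterPontoEVirgula::marcar)
                .collect(Collectors.joining("; ", "", ";"));
    }

    public static void main(String[] args) {
        System.out.println(juntar("a; b;c;"));
    }
}
```

One caveat of this choice: the suffix is always appended, so an input with no non-empty segments still yields a lone ";".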

As for regex, I think insisting on it here is an example of the XY problem. That is, I think you are asking about a tool that may not be the best one for solving this problem.
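
For comparison, if a regex is still preferred, the "everything captured at once" behavior usually comes from a greedy quantifier over a character class that also accepts commas, semicolons and hyphens. The sketch below is an assumption-laden alternative, not the code from this answer: the class EnderecosRegex and the method marcar are hypothetical, the address is matched reluctantly, and it is assumed that every address ends in "-UF" immediately followed by ", com CNPJ" and that the text has no line breaks.

```java
import java.util.regex.Pattern;

public class EnderecosRegex {

    // Group 1: start of the text, or a marker such as "localizada na ".
    // Group 2: the address, matched reluctantly (.+?) up to the first "-UF"
    //          that is followed by ", com CNPJ" (checked by a lookahead).
    private static final Pattern ENDERECO = Pattern.compile(
            "(^|(?:estabelecid|localizad)[oa] (?:em|na|no) )(.+?-\\s*[A-Z]{2})(?=, com CNPJ)");

    // Wraps each matched address in <END>...</END> and leaves the rest untouched.
    public static String marcar(String texto) {
        return ENDERECO.matcher(texto).replaceAll("$1<END>$2</END>");
    }

    public static void main(String[] args) {
        String texto = "Rua A, Centro, Cidade-PB, com CNPJ 1; "
                + "(NR) II - Sergipe, localizada na Rodovia BR-101, Vila-SE, com CNPJ 2;";
        System.out.println(marcar(texto));
    }
}
```

Because nothing is split, the semicolons and the surrounding "normal text" stay in place; the trade-off is that the pattern silently skips any address that does not fit these assumptions.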

  • Victor, I think I understand the XY problem and your solution, but unfortunately it does not satisfy me. Not because it avoids regex, which, as you warned, may not be the best solution and might not even work, but because I did not make clear what the address is: it is only group 3. I have reworded the question to make this clearer and fixed the example text, which had the closing tag in the wrong place: it belongs after "Camaçari-BA", not at the end of the whole text. Thank you.

  • @Jnmarcos I have updated the answer.
