How to capture only the first part of a text that fits in regex?

Question

Asked 8 years, 6 months ago

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;

The situation is as follows: above is a sample text, and below is the regex I use to extract information from it. Keep in mind that the source text is semi-structured and contains some repetition. For context, the regex below is meant to capture addresses.

, (estabelecida|localizada|estabelecido|localizado) (na|no|em) ([ (Município|Estado)]([0-9A-Za-zçãàáâéêôôôôôõOLOURION Q() <>,º°ª-.;/' " E]+)-\s*[A-Z]{2})

I want to capture each address in the document and wrap it in the tags <END> and </END>. Only the part matched by the following piece of the regex should be wrapped:

([0-9A-Za-zçãàáâéêôôôôôôãoª-ª Q() <>º°ª-. ;/' " E]+)-\s*[A-Z]{2})

Everything else is "normal text": it should not be captured, but it must not be discarded either. For the example given, the expected result is:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;

However, as the first text shows, my regex currently captures all the addresses at once, as a single match. I reached for regex because that is how I have been capturing other things, but any other approach that solves this is welcome.

Exactly which part of the text do you need to capture?

– MagicHat

2017/01/22 at 03:33
@MagicHat Each of the text fragments between <END> and </END> in the second code block.

– JNMarcos

2017/01/22 at 22:54
<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>

– JNMarcos

2017/01/22 at 22:54

1 answer

by Victor Stafusa • **63,338** points · Answer 1 · 2017-01-22T02:34:49+00:00

I see that in your text, the different addresses are separated by semicolons. This makes the task very simple:

import java.util.Arrays;
import java.util.stream.Collectors;

public class Enderecos {

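    // Inserts "<END>" right after the first "estabelecid[oa]/localizad[oa] em/na/no "
    // phrase found (in the loop order below).
    private static String localizarInicio(String s) {
        String[] loc = {"estabelecida ", "estabelecido ", "localizada ", "localizado "};
        String[] cj = {"em ", "na ", "no "};
        for (String a : loc) {
            for (String b : cj) {
                if (s.contains(a + b)) return s.replace(a + b, a + b + "<END>");
            }
        }
        // No such phrase: the address starts at the beginning of the segment.
        return "<END>" + s;
    }

    // Inserts "</END>" right before ", com CNPJ", or at the end if that text is absent.
    private static String localizarFim(String s) {
        String busca = ", com CNPJ";
        if (s.contains(busca)) return s.replace(busca, "</END>" + busca);
        return s + "</END>";
    }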

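    public static String formatarListaEnderecos(String malformatado) {
        return Arrays
                .asList(malformatado.split(";"))   // one segment per address
                .stream()
                // drop any pre-existing tags and surrounding whitespace
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                // skip segments that end up empty
                .filter(t -> !t.isEmpty())
                // insert the opening and closing tags
                .map(Enderecos::localizarInicio)
                .map(Enderecos::localizarFim)
                // concatenate the segments (no separator is re-inserted)
                .collect(Collectors.joining());
    }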

    public static void main(String[] args) {
        String texto = "<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;<END>";
        String formatado = formatarListaEnderecos(texto);
        System.out.println(formatado);
    }
}

The method that does what you want is formatarListaEnderecos(String). It works as follows:

Splits the text on semicolons, producing an array of addresses that is converted into a list and then into a Stream.
Removes any "<END>" and "</END>" tags already present in each address, since they are not placed correctly to begin with (they are re-inserted later).
Trims leading and trailing whitespace from each address with trim().
Discards "addresses" that are reduced to empty strings.
Finds where "<END>" belongs in each address and inserts it.
Finds where "</END>" belongs in each address and inserts it.
Joins everything into a single string and returns the result.

The position of "<END>" is determined by the method localizarInicio(String). It looks for "(estabelecid|localizad)(o|a) (em|na|no) " and puts the <END> right after it. If it finds nothing, it puts the tag at the very beginning.
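
The same lookup could also be written with java.util.regex instead of nested loops. This is only a minimal sketch: the class name InicioComRegex is made up for illustration, and unlike the loop version it tags only the first occurrence of the phrase in the segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InicioComRegex {
    // The phrase described above: (estabelecid|localizad)(o|a) (em|na|no) + space.
    private static final Pattern INICIO =
            Pattern.compile("(estabelecid|localizad)(o|a) (em|na|no) ");

    static String localizarInicio(String s) {
        Matcher m = INICIO.matcher(s);
        if (m.find()) {
            // Insert the tag right after the end of the matched phrase.
            return s.substring(0, m.end()) + "<END>" + s.substring(m.end());
        }
        // No phrase found: the tag goes at the very beginning.
        return "<END>" + s;
    }

    public static void main(String[] args) {
        System.out.println(localizarInicio("II - Sergipe, localizada na Rodovia BR-101"));
        // prints: II - Sergipe, localizada na <END>Rodovia BR-101
        System.out.println(localizarInicio("Av. Dr. Walter Belian"));
        // prints: <END>Av. Dr. Walter Belian
    }
}
```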

The position of "</END>" is just before the text ", com CNPJ". If that text is not found, the tag goes at the end.

The method main(String[]) is there so you can test this method. Running it produces the following output:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157(NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5(NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399
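
Note that the semicolons are gone from this output: split(";") consumes them and Collectors.joining() puts nothing back. If they need to be kept, one option is the three-argument Collectors.joining(delimiter, prefix, suffix). The sketch below isolates just that joining step; the class JuntarComSeparador and the "a; b;c;" input are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class JuntarComSeparador {
    // Splits on ';', trims each piece, and re-joins with "; " plus a final ";".
    static String juntar(String texto) {
        return Arrays.stream(texto.split(";"))
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                .collect(Collectors.joining("; ", "", ";"));
    }

    public static void main(String[] args) {
        System.out.println(juntar("a; b;c;"));
        // prints: a; b; c;
    }
}
```

In formatarListaEnderecos, the corresponding change would presumably be to replace Collectors.joining() with Collectors.joining("; ", "", ";").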

As for regex, I think using it here is an example of the XY problem. That is, I think you are reaching for a tool that may not be the best one for this problem.
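
If a single regex is still preferred, the all-at-once capture described in the question is what a greedy quantifier typically produces: it runs on to the last "-UF" in the text. A reluctant quantifier (+?) stops at the first one instead. The sketch below is an assumption-laden illustration, not a drop-in fix: the class name PrimeiroTrecho, the simplified pattern, and the shortened sample text are all made up, and it only tags addresses that come after one of the "localizada na"-style phrases.

```java
import java.util.regex.Pattern;

class PrimeiroTrecho {
    // Group 1: the introductory phrase. Group 2: the address, matched reluctantly
    // so it ends at the first "-UF" that is followed by a comma.
    private static final Pattern ENDERECO = Pattern.compile(
            "((?:estabelecid|localizad)[oa] (?:em|na|no) )(.+?-\\s*[A-Z]{2})(?=,)");

    static String marcar(String texto) {
        // Keeps all surrounding text and wraps only group 2 in the tags.
        return ENDERECO.matcher(texto).replaceAll("$1<END>$2</END>");
    }

    public static void main(String[] args) {
        String texto = "I - Sergipe, localizada na Rodovia BR-101, km 133, Aracaju-SE, com CNPJ 1; "
                + "II - Pernambuco, localizada na Rua A, 10, Recife-PE, com CNPJ 2;";
        System.out.println(marcar(texto));
    }
}
```

With a greedy .+ in place of .+?, the same pattern would produce one match running from "Rodovia" to "Recife-PE".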