Regex to get different snippets of a file

I need to write a regex that extracts a person's address.

The text containing the address can arrive in 2 formats. In the first one the address comes alone, and I capture it with this regex:

"(?!/\d{1,5}\)\s/)+?\d{1,5}\)\s(.+?)\nNúmero"

So far so good. My problem is the second case, where the txt brings 2 addresses mixed together.

Case 1:

…(685) R. Bolonha
Passa Vinte
Palhoça - SC CEP: 00000-000
Número …

Case 2:

..(783) R. Papa Paulo VI  Endereço Entrega
Ponte do Imaruim (4398) R. SANTA TEREZA 
Palhoça - SC CEP: 000000-000 Balneário
Florianópolis - SC CEP: 000000-000
  CADASTRO DO CLIENTE
RES COM PÚB IND TOTAL Número..

In these cases I have to take only "R. Papa Paulo VI Ponte do Imaruim Palhoça - SC CEP: 000000", which is the first address.

I need to get the address with a single regex in both cases. Is there a way?

The data is fictitious, and the ellipses mean that the text continues.

  • I don't understand. In both cases, do you only want the street (without the number), city, state and zip code? Can the string "Endereço Entrega" appear there? Are there other variations in the format? The more variations and possibilities, the more complex the regex. And why does the second zip code have 6 digits? In fact, the second example has two zip codes in two different cities; isn't the second one supposed to be taken? If you can, please edit the question and make it a little clearer...

  • so, that is my problem: in the first case it returns only 1 address, in the second case it returns 2, and I have to take only the first address

  • It's still not clear to me. What are these numbers in parentheses? In the second case, why is the result "R. Papa Paulo VI Ponte do Imaruim Palhoça", if the part "Ponte do Imaruim" seems to belong to the other address (Santa Tereza)? If R. Santa Tereza has to be ignored, why wasn't "Ponte do Imaruim", which is on the same line, ignored as well? By the way, is the data always on separate lines? Without knowing the exact criteria (where an address begins and ends, which data to consider, etc.) it is difficult to give an accurate answer, even more so with regex, where accuracy makes all the difference

  • the numbers between () are some kind of id; this data comes from a PDF that is converted to txt and delivered to me, and it always comes in format 1 or format 2

  • it always starts with the id of the first address ( \d{1,5}\) ), it just doesn't always end the same way

  • I'm trying to capture the lines separately, but I can only get the street; the neighborhood and the city I can't. I got the street like this: "(?!/\d{1,5}\)\s/)+?\d{1,5}\)\s(.+?)(\n|Endereço)"

  • Well, I posted a more simplified solution that I think will work (at least for the examples in the question). But I think a single regex that does everything is too complicated, so I used 3 separate regexes...

  • Henrique, if you have found a solution, just add an answer below (it is better because the site stays more organized, with the question separate from the answers). I was even curious to see your code, since when I tested your regex I did not get the same result... :-)

  • I posted an answer; your answer helped me a lot, thanks for the help :)


2 answers

3

It would be interesting to have more details about the exact file format and a few more examples, but anyway, I’ll leave a more general solution, and you adapt according to what you need.


For this solution, I created a file based on the information in the question and comments:

(685) R. Bolonha
Passa Vinte
Palhoça - SC CEP: 00000-000
Número
(783) R. Papa Paulo VI  Endereço Entrega
Ponte do Imaruim (4398) R. SANTA TEREZA 
Palhoça - SC CEP: 00000-000 Balneário
Florianópolis - SC CEP: 00000-000
  CADASTRO DO CLIENTE
RES COM PÚB IND TOTAL Número

Some premises I considered:

  • the first line contains a number in parentheses, followed by the name of the street/avenue, etc.
    • optionally it can have the text "Endereço Entrega" at the end of the line
    • as I understand it, you don't have the address number, just the street name
  • the lines containing the CEP are always in the format "city - UF CEP: 00000-000"
    • I assumed that the CEP with 6 digits before the hyphen is a typo, so I always search for the format "5 digits, hyphen, 3 digits"
  • when there is a second address, it appears on the same line as the neighborhood of the first address - i.e., I assumed that "Ponte do Imaruim" is the neighborhood of the first address (R. Papa Paulo VI) and R. SANTA TEREZA is the second address
    • if there are several CEPs in a row, the first one that appears is the one that counts, and any other that comes right after it is ignored
  • if the line starts with the ID (number in parentheses), then this is the first address, and the neighborhood and CEP that follow refer to it
  • there are no cases where the street and the CEP are on the same line, or where there is information other than what is shown in the examples

Note that there are many premises and guesses on my part, but this is what can be done with the information provided. Anyway, if the file doesn't really have a well-defined format, there's not much to do but start from some point and adjust as new cases arise.

That said, I don't think it's worth using a single giant regex that does everything. I think it's best to read the file line by line and use a different regex for each case, because with so many variations, each regex on its own is already (or can become) complicated.


A first attempt (based solely on the test file I generated above) would be:

import java.io.File;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regexRua = Pattern.compile("^\\(\\d{1,5}\\) (.+?)(?:\\s*Endereço Entrega)?$");
Pattern regexBairro = Pattern.compile("^([^(]+)");
Pattern regexCep = Pattern.compile("^(.+) - ([A-Z]{2}) CEP: (\\d{5}-\\d{3})$");
int status = 0; // 0=street, 1=neighborhood, 2=city/UF/CEP
try (Scanner sc = new Scanner(new File("/caminho/do/arquivo.txt"))) {
    while (sc.hasNextLine()) {
        String texto = sc.nextLine();
        switch (status) {
            case 0:
                Matcher matcherRua = regexRua.matcher(texto);
                if (matcherRua.find()) {
                    String rua = matcherRua.group(1).trim();
                    System.out.printf("Endereço: %s\n", rua);
                    status = 1; // street was read, move on to the neighborhood
                }
                break;
            case 1:
                Matcher matcherBairro = regexBairro.matcher(texto);
                if (matcherBairro.find()) {
                    // trim() removes the space left before a "(" on the same line
                    String bairro = matcherBairro.group(1).trim();
                    System.out.printf("Bairro: %s\n", bairro);
                    status = 2; // neighborhood was read, move on to city/UF/CEP
                }
                break;
            case 2:
                Matcher matcherCep = regexCep.matcher(texto);
                if (matcherCep.find()) {
                    String cidade = matcherCep.group(1);
                    String uf = matcherCep.group(2);
                    String cep = matcherCep.group(3);
                    System.out.printf("Cidade: %s, UF: %s, CEP: %s\n", cidade, uf, cep);
                    status = 0; // city/UF/CEP was read, go back to looking for the next address
                }
                break;
        }
    }
}

First I use a try-with-resources to open the file and read it line by line (I left out the catch block to keep the focus on the algorithm, but in your final code, add one and handle the errors).

I use a status variable to know which piece of information I'm currently reading. The logic is: I try to read the address (the line that starts with the ID - a number in parentheses), and if that succeeds, I try to read the neighborhood. If I can read the neighborhood, I try to read the city/UF/CEP (and if that succeeds, I go back to trying to read the next address).

If the information I'm currently trying to read is not found, the code moves on to the next line, and so on, until the end of the file. The output is:

Endereço: R. Bolonha
Bairro: Passa Vinte
Cidade: Palhoça, UF: SC, CEP: 00000-000
Endereço: R. Papa Paulo VI
Bairro: Ponte do Imaruim
Cidade: Florianópolis, UF: SC, CEP: 00000-000

About the regexes

Now let's see some details of each regex.

Address

For the address I used:

Pattern regexRua = Pattern.compile("^\\(\\d{1,5}\\) (.+?)(?:\\s*Endereço Entrega)?$");

The markers ^ and $ are respectively the beginning and end of the string. Then I check the ID (1 to 5 digits in parentheses, following the same regex you used in the question). Note that the parentheses must be escaped with \ (which inside a String must be written as \\), since they are characters with special meaning in regex (we will see this below). For the regex to understand that we want the literal characters ( and ), we need to escape them with \.

Then we have a space, followed by .+? (one or more characters). This part is inside parentheses to form a capture group, so that I can retrieve it later using the method group.

Then we have \\s* (zero or more whitespace characters), followed by "Endereço Entrega". The ? makes this part optional. I also use (?: so that this pair of parentheses does not become a capture group (I don't need to retrieve this part later, so the regex doesn't need to create unnecessary groups).

The expression .+? is a simplification, as it takes everything that is on the line (after the ID and before "Endereço Entrega", if present). One detail is that if you use only .+, it takes the whole line (including "Endereço Entrega"), because the quantifier is greedy and tries to grab as many characters as possible. To avoid this behavior, the lazy syntax .+? is used.

With this, the regex will have the address in group 1 (as it is the first pair of capturing parentheses), and that is what we get with matcherRua.group(1). I also use the method trim() to remove any spaces from the beginning and end.
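
To see the greedy versus lazy difference in practice, here is a minimal sketch that applies both variants to the second sample line. The class name StreetRegexDemo, the GREEDY pattern and the extract helper are illustrative names only, not part of the solution above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreetRegexDemo {
    // Lazy version, same expression as regexRua above
    static final Pattern LAZY =
        Pattern.compile("^\\(\\d{1,5}\\) (.+?)(?:\\s*Endereço Entrega)?$");
    // Greedy version, shown only for comparison
    static final Pattern GREEDY =
        Pattern.compile("^\\(\\d{1,5}\\) (.+)(?:\\s*Endereço Entrega)?$");

    // Returns group 1 (the street) trimmed, or null when the line does not match
    static String extract(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String line = "(783) R. Papa Paulo VI  Endereço Entrega";
        System.out.println(extract(LAZY, line));   // R. Papa Paulo VI
        System.out.println(extract(GREEDY, line)); // R. Papa Paulo VI  Endereço Entrega
    }
}
```

With the greedy variant the optional group simply matches nothing, so the marker text ends up inside group 1.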

Neighborhood

For the neighborhood, we have:

Pattern regexBairro = Pattern.compile("^([^(]+)");

This one is a little simpler, and I kept it that simple because I based it on the format shown in the question.

It has the beginning of the string (^) and then uses [^(]+: one or more characters other than (. This means that in the first case ("Passa Vinte") it takes the whole line, and in the second case ("Ponte do Imaruim (4398) R. SANTA TEREZA") it takes everything up to the ( - that is, only the part "Ponte do Imaruim " (including the space at the end, so I also use trim() here).

An important point is that, thanks to the status variable, I guarantee that I will only read the neighborhood if I was previously able to read the address. This means that the file must have the neighborhood right after the address (otherwise, this regex could also pick up lines such as "CADASTRO DO CLIENTE", for example). In fact, even the CEP line could be mistaken for a neighborhood, since [^(] is "any character other than (", so the CEP line would also fit this case.

This is one more reason not to use a single giant regex, as it would be much more complex to tell a text that is a neighborhood apart from generic text. It also relies on the premise that the file format will always be correct, with the neighborhood just below the address.
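
A small sketch of this behavior, using lines from the test file. NeighborhoodRegexDemo and extract are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NeighborhoodRegexDemo {
    // Same expression as regexBairro above
    static final Pattern REGEX_BAIRRO = Pattern.compile("^([^(]+)");

    // Returns everything before the first "(", trimmed,
    // or null when the line is empty or starts with "("
    static String extract(String line) {
        Matcher m = REGEX_BAIRRO.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Passa Vinte"));
        // Passa Vinte
        System.out.println(extract("Ponte do Imaruim (4398) R. SANTA TEREZA "));
        // Ponte do Imaruim
        System.out.println(extract("Palhoça - SC CEP: 00000-000"));
        // Palhoça - SC CEP: 00000-000  (a CEP line also "looks like" a neighborhood)
        System.out.println(extract("(685) R. Bolonha"));
        // null
    }
}
```

The third call shows why the regex alone is not enough: it is the reading order enforced by the status variable that keeps a CEP line from being taken as a neighborhood.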

City/UF/CEP

For the city, UF and ZIP, we have:

Pattern regexCep = Pattern.compile("^(.+) - ([A-Z]{2}) CEP: (\\d{5}-\\d{3})$");

It also uses ^ and $ to mark the beginning and end of the string. Then we have 3 capture groups (3 pairs of parentheses):

  • the first, for the city, uses the ultra-simplified expression .+ (one or more characters)
  • the second, for the UF, uses [A-Z]{2} (two capital letters)
  • the third, for the CEP, uses \\d{5}-\\d{3} (5 digits, hyphen, 3 digits)

The UF and CEP parts are well defined, while for the city I was quite lazy, because I am relying on the format of the file: if it is guaranteed that this line will always have the format above, the regex will capture everything correctly, so I don't have to worry about the city name containing meaningless things like !@#$%*-x (because I'm assuming you won't have cases like this).

By the way, the same goes for the neighborhood and address. If you know that the file will always bring valid names, you can use the simpler regexes above. But if you want to do more complex validations (like "the city has to have X letters", "the address has to start with R. or Av.", etc.), then the expressions get bigger and more complex.

But I think that with the solution above you already have a place to start.
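
One consequence of the $ anchor is worth trying out: a line such as "Palhoça - SC CEP: 00000-000 Balneário" does not match, because there is text after the CEP. The sketch below compares the regex above with a relaxed variant (lazy city, no $). The RELAXED pattern and the names CepRegexDemo and extract are assumptions made for this illustration, not part of the solution.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CepRegexDemo {
    // Regex from the answer: the whole line must be "city - UF CEP: 00000-000"
    static final Pattern STRICT =
        Pattern.compile("^(.+) - ([A-Z]{2}) CEP: (\\d{5}-\\d{3})$");
    // Hypothetical relaxed variant: lazy city and no "$", so trailing text is tolerated
    static final Pattern RELAXED =
        Pattern.compile("^(.+?) - ([A-Z]{2}) CEP: (\\d{5}-\\d{3})");

    // Returns {city, UF, CEP}, or null when the line does not match
    static String[] extract(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "Palhoça - SC CEP: 00000-000 Balneário";
        System.out.println(Arrays.toString(extract(STRICT, line)));  // null
        System.out.println(Arrays.toString(extract(RELAXED, line))); // [Palhoça, SC, 00000-000]
    }
}
```

If the relaxed variant were used in the line-by-line loop, the second address would presumably be reported with Palhoça instead of Florianópolis, which seems closer to what the question asks for. Whether trailing text should be accepted depends on the real file format.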

1


SOLUTION

I managed to write a single regex that captures everything, with each part in a different group

(?!/\\d{1,}\\)\\s/)+?\\d{1,}\\)\\s(.+?)(\\n|Endereço(.+?)\\n)(.+?)(\\n|\\((.+?)\\n)(.+?)CEP:\\s(.+?)\\s

street = group 1

neighborhood = group 4

city = group 7

cep = group 8

this is the method I use to run the regex

public static final String executeRegexp(final String text, final String er, final int group) {
    final Matcher matcher = Pattern.compile(er, Pattern.DOTALL).matcher(text);
    final boolean match = matcher.find();
    if (!match) {
        log.info("no match " + er);
        return "";
    }
    return matcher.group(group);
}
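
As a usage sketch, the method can be called once per group on the text of case 2. This is a self-contained copy for experimenting: the class name SingleRegexDemo is hypothetical, and System.err stands in for the log object, which is not shown in the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleRegexDemo {
    static final String REGEX =
        "(?!/\\d{1,}\\)\\s/)+?\\d{1,}\\)\\s(.+?)(\\n|Endereço(.+?)\\n)(.+?)"
        + "(\\n|\\((.+?)\\n)(.+?)CEP:\\s(.+?)\\s";

    // Same logic as the method above; System.err replaces the logger (assumption)
    static String executeRegexp(String text, String er, int group) {
        Matcher matcher = Pattern.compile(er, Pattern.DOTALL).matcher(text);
        if (!matcher.find()) {
            System.err.println("no match " + er);
            return "";
        }
        return matcher.group(group);
    }

    public static void main(String[] args) {
        String text = "(783) R. Papa Paulo VI  Endereço Entrega\n"
            + "Ponte do Imaruim (4398) R. SANTA TEREZA \n"
            + "Palhoça - SC CEP: 00000-000 Balneário\n"
            + "Florianópolis - SC CEP: 00000-000\n";
        // The captured groups keep surrounding spaces, hence the trim()
        System.out.println(executeRegexp(text, REGEX, 1).trim()); // R. Papa Paulo VI
        System.out.println(executeRegexp(text, REGEX, 4).trim()); // Ponte do Imaruim
        System.out.println(executeRegexp(text, REGEX, 7).trim()); // Palhoça - SC
        System.out.println(executeRegexp(text, REGEX, 8));        // 00000-000
    }
}
```

Note that group 7 carries the UF together with the city, and that the pattern is recompiled on every call; compiling it once and reading all groups from a single Matcher would avoid the repeated work.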
  • I did some tests and I think the part (?!/\\d{1,}\\)\\s/)+? is not necessary. It is a negative lookahead, which checks that something does not exist ahead. Inside it there is a slash, followed by digits, ), a whitespace character and another slash. Since there are no slashes in the string, that inner pattern never matches, so the negative lookahead always succeeds. It is therefore redundant, and it also slows the regex down, because the engine has to check it every time when it could go straight to the rest of the expression.
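
A quick way to check this claim is to run the expression with and without the lookahead prefix and compare the matches. This is only a sketch on the text of case 1; LookaheadDemo, PREFIX, REST and fullMatch are names made up for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // The negative lookahead discussed in the comment above
    static final String PREFIX = "(?!/\\d{1,}\\)\\s/)+?";
    // The rest of the expression from the answer
    static final String REST =
        "\\d{1,}\\)\\s(.+?)(\\n|Endereço(.+?)\\n)(.+?)(\\n|\\((.+?)\\n)(.+?)CEP:\\s(.+?)\\s";

    // Returns the full matched text, or null when there is no match
    static String fullMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "(685) R. Bolonha\nPassa Vinte\nPalhoça - SC CEP: 00000-000\nNúmero";
        String with = fullMatch(PREFIX + REST, text);
        String without = fullMatch(REST, text);
        System.out.println(with.equals(without)); // true
    }
}
```

This only shows that the two versions agree on this sample; an input that actually contains a slash followed by digits, ) and whitespace could behave differently.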
