Regular expression to remove everything except the company name (razão social)

Viewed 724 times

0

I’m trying to write a regular expression that removes everything that is not part of the company name (razão social) in a string, but I’m having a hard time keeping the symbols that appear in the middle of the name.

Input:

201700000000111 01/02/2017 11.111.111/0001-74 ADAMA BRASIL S/A ATIVA 0,00 160,00 160,00 0,00 0,00 0,00 0,00 0,00
201700000000122 01/02/2017 22.222.222/0002-75 AGRITEX COMERCIAL AGRÍCOLA LTDA (QUERÊNCIA) ATIVA 2,79 170,00 170,00 0,00 0,00 0,00 4,74 0,00
201700000000133 07/02/2017 33.333.333/0001-76 CREMONESE WANDSCHEER & CIA LTDA - ME ATIVA 0,00 50,00 50,00 0,00 0,00 0,00 0,00 0,00
201700000000144 23/02/2017 44.444.444/0001-77 G3 SEMENTES LTDA ATIVA 0,00 230,00 230,00 0,00 0,00 0,00 0,00 0,00

Expected output:

ADAMA BRASIL S/A ATIVA
AGRITEX COMERCIAL AGRÍCOLA LTDA (QUERÊNCIA) ATIVA
CREMONESE WANDSCHEER & CIA LTDA - ME ATIVA

So far I have written the one below, but it does not give me what I need. I’m using it in Java, but answers in other languages are welcome.

s.replaceAll("[^A-zÀ-ú\\s]", "").trim();
  • Does the text always start at this fixed position? Or at the 4th token? That would already make the job easier.

  • You can invert the rule of your regular expression: instead of removing what you do not want, search for only what you do want, for example: \b[A-zÀ-ú\s\\\/&\-\(|)]{2,}\b. See this example: http://rubular.com/r/4LdX3PR6s1

  • I edited the answer, thus arriving at this expression: \b(\d{2}\.\d{3}\.\d{3}\/\d{4}\-\d{2})\b([A-zÀ-ú-1-9\s\\\/&\-\(|)]{5,}.*[a-zA-Z])\b

  • A regex seems entirely unnecessary to me. Apparently it would be enough to split on spaces and discard the first 3 items on the left and the last 8 on the right.
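
The suggestion in the last comment (split on spaces, drop 3 items on the left and 8 on the right) could look roughly like the sketch below. It assumes every line has exactly 3 leading fields (id, date, CNPJ) and 8 trailing numeric fields; the class and method names (`SplitExtractor`, `extractName`) are made up for illustration.

```java
import java.util.Arrays;

class SplitExtractor {

    // Hypothetical helper. Assumption: 3 leading tokens (id, date, CNPJ)
    // and 8 trailing numeric tokens, all separated by whitespace.
    static String extractName(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length <= 11) {
            throw new IllegalArgumentException("Unexpected line layout: " + line);
        }
        // Keep only the tokens between the leading 3 and the trailing 8.
        return String.join(" ", Arrays.copyOfRange(tokens, 3, tokens.length - 8));
    }

    public static void main(String[] args) {
        String line = "201700000000133 07/02/2017 33.333.333/0001-76 "
                + "CREMONESE WANDSCHEER & CIA LTDA - ME ATIVA "
                + "0,00 50,00 50,00 0,00 0,00 0,00 0,00 0,00";
        System.out.println(extractName(line));
    }
}
```

For the sample line this prints CREMONESE WANDSCHEER & CIA LTDA - ME ATIVA. The trade-off is that it depends entirely on the field counts staying fixed, so it throws if a line is too short rather than guessing.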

2 answers

1

Good afternoon. I believe you can match the entire set of words with the expression:

I ran a test; see below:

Rubular

Once the regular expression is defined, you can escape it for Java with freeformatter.

With this expression I can get the expected output like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {

        // Four sample lines, joined with the platform line separator.
        String input = "201700000000111 01/02/2017 11.111.111/0001-74 ADAMA BRASIL S/A ATIVA 0,00 160,00 160,00 0,00 0,00 0,00 0,00 0,00"
                + System.lineSeparator()
                + "201700000000122 01/02/2017 22.222.222/0002-75 AGRITEX COMERCIAL AGRÍCOLA LTDA (QUERÊNCIA) ATIVA 2,79 170,00 170,00 0,00 0,00 0,00 4,74 0,00"
                + System.lineSeparator()
                + "201700000000133 07/02/2017 33.333.333/0001-76 CREMONESE WANDSCHEER & CIA LTDA - ME ATIVA 0,00 50,00 50,00 0,00 0,00 0,00 0,00 0,00"
                + System.lineSeparator()
                + "201700000000204 23/02/2017 23.972.199/0001-15 G3 SEMENTES LTDA ATIVA 0,00 230,00 230,00 0,00 0,00 0,00 0,00 0,00";

        // Group 1 is the CNPJ; group 2 is the name that follows it.
        String regex = "\\b(\\d{2}\\.\\d{3}\\.\\d{3}\\/\\d{4}\\-\\d{2})\\b([A-zÀ-ú-1-9\\s\\\\\\/&\\-\\(|)]{5,}.*[a-zA-Z])\\b";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);

        while (matcher.find()) {
            String cnpj = matcher.group(1).trim(); // available if needed, not printed here
            String nome = matcher.group(2).trim();
            System.out.println(nome);
        }
    }
}

Now explaining my regular expression:

\b(\d{2}\.\d{3}\.\d{3}\/\d{4}\-\d{2})\b([A-zÀ-ú-1-9\s\\\/&\-\(|)]{5,}.*[a-zA-Z])\b

The \b at each end is a word boundary: it matches no character itself, it only asserts that the match starts and ends at a transition between a word character and a non-word character (or at the start or end of the input). The name is matched by the character set between [], which must occur 5 or more times in sequence, followed by .*[a-zA-Z], which extends the match to the last letter on the line. You can keep adding characters inside [] as needed. Another important point is the use of groups: everything in parentheses is a group, and I used two. The first group is the CNPJ pattern and the second group is the pattern for the name.

With group(1) you get the CNPJ; with group(2) you get the name.
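
To make the role of the two groups more concrete, here is a minimal sketch that collects both into a map from CNPJ to name. The pattern is the one from this answer, unchanged; the class and method names (`CnpjNameExtractor`, `extract`) are hypothetical, and it assumes each CNPJ appears only once in the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CnpjNameExtractor {

    // The expression from the answer, escaped for a Java string literal.
    private static final Pattern LINE = Pattern.compile(
            "\\b(\\d{2}\\.\\d{3}\\.\\d{3}\\/\\d{4}\\-\\d{2})\\b"
            + "([A-zÀ-ú-1-9\\s\\\\\\/&\\-\\(|)]{5,}.*[a-zA-Z])\\b");

    // Hypothetical helper: maps each CNPJ (group 1) to its name (group 2).
    // Assumption: a CNPJ is not repeated; a repeat would overwrite the entry.
    static Map<String, String> extract(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = LINE.matcher(input);
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "201700000000111 01/02/2017 11.111.111/0001-74 ADAMA BRASIL S/A ATIVA 0,00 160,00 160,00 0,00 0,00 0,00 0,00 0,00"
                + System.lineSeparator()
                + "201700000000204 23/02/2017 23.972.199/0001-15 G3 SEMENTES LTDA ATIVA 0,00 230,00 230,00 0,00 0,00 0,00 0,00 0,00";
        extract(input).forEach((cnpj, nome) -> System.out.println(cnpj + " -> " + nome));
    }
}
```

A LinkedHashMap is used only so the entries come back in the order the lines were read.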

See how it works on ideone

I hope this helps. Cheers!

  • Hi, good afternoon... It worked, but I ran into a problem when the name has a number in the middle. Is there a way to handle that too? I tried to change your expression, but without success. Example: 201700000000204 23/02/2017 23.972.199/0001-15 G3 SEMENTES LTDA ATIVA 0,00 230,00 0,00 0,00 0,00 0,00 0,00 0,00

  • I have another solution: you can work with groups. Basically, match the CNPJ followed by the expression for the name; group 1 then gives you the CNPJ and group 2 gives you the full name.

  • I managed to arrive at this expression: \b(\d{2}\.\d{3}\.\d{3}\/\d{4}\-\d{2})\b([A-zÀ-ú-1-9\s\\\/&\-\(|)]{5,}.*[a-zA-Z])\b. I edited the answer; see if it works.

  • Thank you, you’ve been a great help.

0


If the line always follows this pattern, just check the boundaries.

  • Left : Preceded by a CNPJ, end of CNPJ \d{4}-\d{2}
  • Right : Followed by a monetary value : \d+,\d{2}

Solution

  • Pattern : .*\d{4}-\d{2} (.*?) \d+,\d{2}.*
  • Replace : $1

See it running on regex101
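
In Java, the pattern and replacement above might be applied with String.replaceAll, as in the sketch below. The class and method names (`BoundaryExtractor`, `extractName`) are only illustrative, and the sketch assumes the name itself never contains something that looks like a monetary value (digits, a comma, two digits).

```java
class BoundaryExtractor {

    // Pattern from the answer. The greedy leading .* runs up to the end of
    // the CNPJ (dddd-dd), and the lazy group stops at the first amount.
    private static final String PATTERN = ".*\\d{4}-\\d{2} (.*?) \\d+,\\d{2}.*";

    // Hypothetical helper: replaces each matching line with its group 1.
    // A line that does not match the pattern is returned unchanged.
    static String extractName(String text) {
        return text.replaceAll(PATTERN, "$1");
    }

    public static void main(String[] args) {
        String line = "201700000000204 23/02/2017 23.972.199/0001-15 "
                + "G3 SEMENTES LTDA ATIVA 0,00 230,00 230,00 0,00 0,00 0,00 0,00 0,00";
        System.out.println(extractName(line));
    }
}
```

For the sample line this prints G3 SEMENTES LTDA ATIVA, so a digit inside the name is not a problem here. Since . does not match line terminators by default in Java, the same call on a multi-line string handles each line separately.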

  • Thanks for the help, it worked well.
