How to capture string between square brackets

3

I need to capture strings between brackets within a string. I found a solution that doesn’t solve my problem completely: \\[(.*?)\\]

I use it like this:

Matcher mat = Pattern.compile("\\[(.*?)\\]").matcher(stringlToVerify);

if(mat.find()) {
   // Do what I want
}

That way, if I run the regex against: 'ol[a' + 'm]undo'

It will match: [a' + 'm]

But in this case nothing should be captured, because the brackets sit inside two separate quoted strings that are being concatenated, so the match makes no sense.
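
The behaviour described above can be reproduced with a small sketch. The class name `LazyBracketDemo` and the helper `firstMatch` are hypothetical names introduced only for this illustration; the pattern and the input are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketDemo {
    // Pattern from the question: lazily captures anything between "[" and the next "]".
    static final Pattern NAIVE = Pattern.compile("\\[(.*?)\\]");

    // Hypothetical helper: returns the first full match, or null when nothing matches.
    static String firstMatch(String input) {
        Matcher mat = NAIVE.matcher(input);
        return mat.find() ? mat.group() : null;
    }

    public static void main(String[] args) {
        // The brackets sit inside two different quoted strings, yet the pattern still matches.
        System.out.println(firstMatch("'ol[a' + 'm]undo'")); // prints [a' + 'm]
    }
}
```

The pattern has no notion of quotes, so it cannot tell a bracket inside a string literal from one outside it.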

Example of what I need:

  Input                     Capture

  1 + [aa]                  [aa]
  [bb] + 2                  [bb]
  'a' + [cc]                [cc]
  ['ola' + 'mundo']         ['ola' + 'mundo']
  '[a' + 'b]'               (nothing)
  '[' + ']'                 (nothing)
  []                        []   (or nothing, that also works)
  'Ola [world] legal'       (nothing)
  Oi ['[aa]'] ola           '[aa]'

For the last case, it is no problem if this cannot be done simply: I already wrote a method that removes all strings between single quotes.
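
The idea of removing quoted strings first might look roughly like the sketch below. The asker's actual method is not shown in the question, so `stripQuoted` and `captureAfterStripping` are hypothetical names and both regexes are assumptions (in particular, no escaped quotes inside a string).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesDemo {
    // Assumption: a quoted string is '...' with no escaped quotes inside.
    static String stripQuoted(String input) {
        return input.replaceAll("'[^']*'", "");
    }

    // Collects every [...] that is left once the quoted strings are removed.
    static List<String> captureAfterStripping(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = Pattern.compile("\\[[^\\]]*\\]").matcher(stripQuoted(input));
        while (mat.find()) {
            found.add(mat.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(captureAfterStripping("'a' + [cc]"));      // prints [[cc]]
        System.out.println(captureAfterStripping("'[a' + 'b]'"));     // prints []
        System.out.println(captureAfterStripping("Oi ['[aa]'] ola")); // prints [[]]
    }
}
```

The last call shows the limitation of this approach: the outer brackets are found, but the quoted content between them has already been removed, so only `[]` is captured.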

  • 1

    Hello Eduardo, would a valid input be a string that must be both in quotes and in brackets? For example: 'hello [dear] world', where the word "dear" would be captured? If possible, try to detail your requirements further. Very complex regular expressions can hinder rather than help.

  • @Eduardo And what should be captured for this text: ['aa']?

  • @Renatocolace, if the string 'hello [dear] world' were tested, the return should be empty. I updated the list of inputs and outputs in my question by adding this scenario. Thank you.

  • @Mariano, if the input is ['aa'], the return should be 'aa' or ['aa'], either is fine. The important thing is to capture what is in brackets.

  • @Eduardoh.M.Garcia Can there be quotes with escapes? ('a \'b\' [c] d' [e])

  • The latter case is inconsistent. It should be ['[aa]']


2 answers

2


Regular expression:

\G[^\[']*(?:'[^']*'[^\['*]*)*(\[[^]']*(?:'[^']*'[^]']*)*\])


Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String regex = "\\G"                    // Start of the text or end of the previous match
                           + "[^\\[']*"               // Text without "[" or single quotes
                           + "(?:'[^']*'[^\\['*]*)*"  // Optional: quoted text + text without "[" or "'"
                           + "(\\["                   // Group 1: opening bracket
                           +     "[^]']*"             //        + text without "]" or "'"
                           +     "(?:'[^']*'[^]']*)*" //        + quoted text + text without "]" or "'"
                           + "\\])";                  //        + closing bracket
        final Pattern pat = Pattern.compile(regex);
        Matcher mat;

        final String[] entrada = {
            "1 + [aa]",
            "[bb] + 2",
            "'a' + [cc]",
            "['ola' + 'mundo']",
            "'[a' + 'b]'",
            "'[' + ']'",
            "[]",
            "'Ola [world] legal'",
            "Oi ['[aa]'] ola"
        };

        // Loop over each input string
        for (String stringlToVerify : entrada) {
            mat = pat.matcher(stringlToVerify);
            System.out.println("\nInput: " + stringlToVerify);

            if (mat.find()) {
                do { // Loop over each bracketed text that matched
                    System.out.println("Capture: " + mat.group(1));
                } while (mat.find());
            } else {
                System.out.println("No brackets outside quotes");
            }
        }
    }
}

Output:

Input: 1 + [aa]
Capture: [aa]

Input: [bb] + 2
Capture: [bb]

Input: 'a' + [cc]
Capture: [cc]

Input: ['ola' + 'mundo']
Capture: ['ola' + 'mundo']

Input: '[a' + 'b]'
No brackets outside quotes

Input: '[' + ']'
No brackets outside quotes

Input: []
Capture: []

Input: 'Ola [world] legal'
No brackets outside quotes

Input: Oi ['[aa]'] ola
Capture: ['[aa]']

You can test here: http://ideone.com/6ZSzSz


Description:

\G[^\[']*(?:'[^']*'[^\['*]*)*(\[[^]']*(?:'[^']*'[^]']*)*\])
  • \G - Anchor (a zero-width assertion) that matches at the beginning of the string or at the end of the previous match.

    This is the most important construct in this regex. It ensures that every match attempt begins only where the engine stopped at the previous match. Thus, a match cannot begin in the middle of the text, which avoids a capture in, for example:

    '....   [a' + 'b]  .....'
            ^       ^
            |- Here-|
    
  • [^\[']* - Character class that matches any characters that are not an opening square bracket or a single quote.

  • (?:'[^']*'[^\['*]*)* - This is a group which is repeated 0 or more times, matching:

    • '[^']*' - Quoted text
    • [^\['*]* - followed by more characters that are not an opening bracket or a single quote (as written, this class also excludes the asterisk).

    This construction uses a technique known as "unrolling the loop".


    So far, we can match all the characters in the string before the brackets.


  • (\[[^]']*(?:'[^']*'[^]']*)*\]) - Capturing group that lets you reference the matched text (using Matcher#group(int)), made up of:

    • \[ - opening bracket

    • [^]']* - more characters that are not a closing bracket or a single quote

    • (?:'[^']*'[^]']*)* - optionally, quoted text inside the brackets, followed by more characters that are not a closing bracket or a single quote (also unrolling the loop)

    • \] - closing bracket.
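
To see why the \G anchor described above matters, the sketch below compiles the same regex twice, once with and once without \G, and runs both against one of the inputs from the question. The class name `AnchorDemo` and the helper `firstCapture` are hypothetical names used only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Body of the regex from the answer, without the leading \G.
    static final String BODY =
            "[^\\[']*(?:'[^']*'[^\\['*]*)*(\\[[^]']*(?:'[^']*'[^]']*)*\\])";

    static final Pattern WITH_G = Pattern.compile("\\G" + BODY);
    static final Pattern WITHOUT_G = Pattern.compile(BODY);

    // Hypothetical helper: returns group 1 of the first match, or null when there is none.
    static String firstCapture(Pattern pat, String input) {
        Matcher mat = pat.matcher(input);
        return mat.find() ? mat.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "'[a' + 'b]'";
        // Without \G the engine retries from the "[" inside the first quoted string.
        System.out.println(firstCapture(WITHOUT_G, input)); // prints [a' + 'b]
        // With \G the only allowed starting point is index 0, so nothing is captured.
        System.out.println(firstCapture(WITH_G, input));    // prints null
    }
}
```

Without the anchor, a failed attempt at index 0 is simply retried one character later, which lands inside the quoted string and produces the unwanted capture.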

  • 1

    Thank you very much! It solved my problem completely. Your description of the code helped me understand regex better. Cheers.

  • I tested your solution in a few more scenarios, and there were failures. Failed scenario: Hi ['[aa'] ola returns ['[aa]

  • @Eduardo - I edited the answer to your new requirements.

  • Perfect! Thank you very much!

  • Can you tell me what it would look like if we added double quotes in the middle of it? I mean, make regex work the same way it works now, but apply the same rules to double quotes.
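
One possible way to extend the regex to double quotes, as asked in the last comment, is sketched below. This is not from the answer's author: the `QUOTED` fragment, the class name `DoubleQuoteDemo` and the helper `captures` are assumptions made for illustration, and escaped quotes inside strings are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleQuoteDemo {
    // Assumption: a quoted run is '...' or "...", with no escaped quotes inside.
    static final String QUOTED = "(?:'[^']*'|\"[^\"]*\")";

    static final Pattern PAT = Pattern.compile(
            "\\G"
          + "[^\\['\"]*"                        // text without "[" or either quote
          + "(?:" + QUOTED + "[^\\['\"]*)*"     // optional: quoted text + more plain text
          + "(\\["                              // group 1: opening bracket
          +     "[^\\]'\"]*"                    //        + text without "]" or either quote
          +     "(?:" + QUOTED + "[^\\]'\"]*)*" //        + quoted text + more plain text
          + "\\])");                            //        + closing bracket

    // Hypothetical helper: collects group 1 of every successive match.
    static List<String> captures(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = PAT.matcher(input);
        while (mat.find()) {
            found.add(mat.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(captures("\"[a]\" + [b]"));     // prints [[b]]
        System.out.println(captures("['x' + \"y]\"]"));    // prints [['x' + "y]"]]
        System.out.println(captures("\"it's [a]\""));      // prints []
    }
}
```

The structure is unchanged from the single-quote version: every place that excluded `'` now also excludes `"`, and every place that consumed a single-quoted run now accepts either kind of quoted run.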

1

The regex below will capture letters, numbers or "_" that are in square brackets. If you need a more restrictive version, just replace \w+ with [a-z]+, for example.

\[(\w+)\]

I made an example you can check out: http://www.regexpal.com/?fam=96641
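
A quick way to check how this pattern behaves on inputs from the question is sketched below. The class name `WordBracketDemo` and the helper `firstCapture` are hypothetical names introduced for this illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBracketDemo {
    // Pattern from this answer: word characters between square brackets.
    static final Pattern PAT = Pattern.compile("\\[(\\w+)\\]");

    // Hypothetical helper: returns group 1 of the first match, or null when there is none.
    static String firstCapture(String input) {
        Matcher mat = PAT.matcher(input);
        return mat.find() ? mat.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstCapture("1 + [aa]")); // prints aa
        System.out.println(firstCapture("['aa']"));   // prints null (a quote is not a word character)
        System.out.println(firstCapture("'[aa]'"));   // prints aa, although the brackets are inside quotes
    }
}
```

Because the pattern ignores quotes entirely, it captures brackets inside a quoted string and rejects brackets that contain a quoted string.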

  • thanks for the answer, but it did not solve my problem completely. I will show: ['aa'] should return ['aa'] but '[aa]' should not return anything.

  • Complementing, ['a'+'] must return ['a'+'] or 'a'+’s'
