Regex to capture spaces, except within quotation marks

Question


Asked 6 years, 2 months ago


I would like a regular expression (PHP) to remove double spaces that are outside of quotes.

update tabela set¬
   nome='alberto da   silva',¬
   telefone='1234'¬
)

I want to capture double spaces, line breaks, and tabs that are outside of quotes.

I read up on it, but only got as far as /\s{2,}/.

But how do I make it not match inside the quotes?

Fabio, did the answer below solve the problem, or was something missing? If it solved it, you can accept the answer, see here how and why to do it. It is not mandatory, but it is good practice on the site, as it tells future visitors that the answer solved the problem. If it did not, just edit the question explaining what was missing.

– hkotsubo

2020/05/05 at 11:22

1 answer


by hkotsubo • **55,826** points · Answer 1 · 2020-04-22T20:18:45+00:00

First of all, it is worth remembering that \s matches spaces, tabs, and line breaks, among other characters. So if you have a space followed by a line break, \s{2,} will match these two characters as a single sequence.
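As a small illustration of that point, the sketch below applies \s{2,} to a hypothetical sample string (not from the question) in which a single space is followed by a line feed, and prints the character codes of what was matched:

```php
<?php
// Hypothetical sample: one space followed by a line feed.
// They are different characters, but both are matched by \s.
$sample = "set \nnome";

preg_match('/\s{2,}/', $sample, $m);

// Turn the matched text into a list of character codes.
$codes = array_map('ord', str_split($m[0]));
echo implode(',', $codes), PHP_EOL; // 32,10
```

The space (32) and the line feed (10) come back together as one match.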

(The question originally asked to capture these spaces, and was later edited to ask for their removal. Below is a solution for both cases.)

One way to solve this is to first try to match something that is in quotes and, if that succeeds, discard the match. Otherwise, if there is nothing in quotes, match the sequence of two or more \s:

// I put 2 spaces after "set"
$str = "update tabela set  
   nome='alberto da   silva',
   telefone='1234'
)";
if (preg_match_all('/\'[^\']*\'(*SKIP)(*F)|[^\'\s]*(\s{2,})[^\s\']*/', $str, $matches, PREG_SET_ORDER, 0)) {
    foreach ($matches as $m) {
        if (count($m) > 1) { // capture group is filled (the match is not inside quotes)
            var_dump($m[1]);
        }
    }
}

The regex uses alternation (the character |, which means or), with 2 options.

The first is \'[^\']*\'(*SKIP)(*F):

\'[^\']*\': begins and ends with a quotation mark (which, being inside a string delimited by ', must be written as \'), and between the quotation marks there is [^\']* (zero or more characters that are not ')
(*SKIP)(*F): these are "control verbs" that... control the regex engine. In this case, they cause the regex to "give up" on the match it found (i.e., it discards the text that is in quotes) - there is a more detailed explanation about these "verbs" here. Matching the quoted text and then discarding it seemed simpler than doing several checks before and after the spaces to detect whether they are part of a text that is in quotes.
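To see the discard behavior in isolation, here is a minimal sketch with a hypothetical sample string (not from the question) that has double spaces both inside and outside the quotes. It uses a simplified second alternative (just \s{2,}) and PREG_OFFSET_CAPTURE to show where each match starts:

```php
<?php
// Hypothetical sample: double spaces at offsets 1, 7 (inside quotes) and 13.
$sample = "a  b 'c  d' e  f";

// The first alternative consumes a quoted chunk and then forces a failure,
// so the engine resumes the search after the closing quote.
preg_match_all('/\'[^\']*\'(*SKIP)(*F)|\s{2,}/', $sample, $found, PREG_OFFSET_CAPTURE);

foreach ($found[0] as [$text, $offset]) {
    echo $offset, ': ', strlen($text), ' whitespace character(s)', PHP_EOL;
}
// 1: 2 whitespace character(s)
// 13: 2 whitespace character(s)
```

The double space at offset 7 is never reported, because it sits inside the quoted text that the first alternative discarded.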

The second option is [^\'\s]*(\s{2,})[^\s\']*:

[^\'\s]*: zero or more characters that are neither \s nor '
(\s{2,}): two or more \s, in parentheses to form a capture group
[^\s\']*: again, zero or more characters that are neither \s nor '

Since the regex can only match either the text in quotes or the sequence of \s, there are cases where the capture group will be filled (when it matches the \s) and cases where it won't be (when it can't find 2 or more \s, or finds text between quotation marks).

So, inside the foreach I check whether the capture group is present (the array $m will have more than one element). The output is:

string(6) "  
   "
string(4) "
   "

That is, 2 occurrences: the first corresponds to the spaces after "set", plus the line break, plus the spaces before "nome". The second is the line break after "silva'," plus the spaces before "telefone".

Note: if the string has a Windows line break, which consists of 2 characters (\r\n), then these too will be captured by \s{2,}.
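A quick way to check this, assuming a hypothetical input in which a Windows line break is the only whitespace between two words:

```php
<?php
// Hypothetical input: "\r\n" with no other whitespace around it.
$sample = "set\r\nnome";

preg_match_all('/\'[^\']*\'(*SKIP)(*F)|[^\'\s]*(\s{2,})[^\s\']*/', $sample, $matches, PREG_SET_ORDER);

// bin2hex makes the invisible characters visible: 0d is \r, 0a is \n.
echo bin2hex($matches[0][1]), PHP_EOL; // 0d0a
```

So a single Windows line break is already enough to satisfy \s{2,}, even when there are no spaces next to it.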

See the regex working here.

To make it a little easier to "see" the characters, you can print the ASCII code of each character:

if (preg_match_all('/\'[^\']*\'(*SKIP)(*F)|[^\'\s]*(\s{2,})[^\s\']*/', $str, $matches, PREG_SET_ORDER, 0)) {
    foreach ($matches as $m) {
        if (count($m) > 1) { // capture group is filled (the match is not inside quotes)
            for($i = 0; $i < strlen($m[1]); $i++) {
                echo ord($m[1][$i]), ",";
            }
            echo PHP_EOL;
        }
    }
}

Output:

32,32,10,32,32,32,
10,32,32,32,

Here, 32 is the space and 10 is the line feed (\n).

In the comments you said you want to use preg_replace, so you don't actually want to capture these spaces (as the question said before it was edited), but rather replace them.

Well, eliminating the spaces entirely does not seem to be the most appropriate option, because the string contains a query, and with no spaces left it would be invalid. So, assuming you want to replace the spaces/line breaks/etc. with just one space, you could adapt the above regex:

echo preg_replace('/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)\s{2,}([^\s\']*)/', '$1 $2', $str);

It's very similar, only now I put the capture groups around the [^\'\s]* before and after the \s{2,}. And I replace the match with '$1 $2' (the contents of the first pair of parentheses, followed by a single space, followed by the contents of the second pair of parentheses). That way, the sequence of two or more \s is replaced with a single space. The output is:

update tabela set nome='alberto da   silva', telefone='1234'
)
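If this replacement is needed in more than one place, it could be wrapped in a small function. The sketch below is only an illustration: the name collapseOutsideQuotes and the sample query are hypothetical, and it assumes the text never contains backslash-escaped quotes (such as \') inside the quoted values, since the regex does not handle them.

```php
<?php
/**
 * Hypothetical helper (the name is not from the answer): replaces every run
 * of two or more whitespace characters outside single quotes with one space.
 */
function collapseOutsideQuotes(string $sql): string
{
    $pattern = '/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)\s{2,}([^\s\']*)/';

    // preg_replace returns null on error; fall back to the original text.
    return preg_replace($pattern, '$1 $2', $sql) ?? $sql;
}

$query = "select  *\n   from t where nome='a   b'";
echo collapseOutsideQuotes($query), PHP_EOL;
// select * from t where nome='a   b'
```

The three spaces inside 'a   b' are untouched, while the runs of whitespace outside the quotes become single spaces.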

If you want to delete only spaces (but not line breaks), switch to:

echo preg_replace('/\'[^\']*\'(*SKIP)(*F)|([^\' ]*) {2,}([^ \']*)/', '$1$2', $str);

The output becomes:

update tabela set
nome='alberto da   silva',
telefone='1234'
)

But if the idea is to remove two or more spaces, or a TAB, or a line break, one option is:

echo preg_replace('/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)(?: {2,}|[\t\n\r])+([^\'\s]*)/', '$1 $2', $str);

The excerpt (?: {2,}|[\t\n\r]) is an alternation with two options:

 {2,}: two or more spaces (note the space before the {)
[\t\n\r]: a TAB or a line break (including the \r, to pick up Windows line breaks)

This whole excerpt can occur one or more times (because it is followed by a +, which indicates repetition). All of it is replaced with a single space (because I put a space between $1 and $2).

All of this is inside a non-capturing group (marked by (?:), so I don't create extra groups and can keep using $1 $2 in the second parameter (if I didn't use ?:, another group would be created and I would have to change the parameter to $1 $3, since the regex would then have 3 groups).
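The difference between the two kinds of group can be checked side by side. This is a minimal sketch using a shortened, hypothetical version of the query; both calls should produce the same text, only the group numbering changes:

```php
<?php
// Shortened, hypothetical sample with spaces and a line break outside the quotes.
$str = "set  \n   nome='a   b'";

// Non-capturing group: the text after the whitespace is group 2.
$nonCapturing = '/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)(?: {2,}|[\t\n\r])+([^\'\s]*)/';
$first = preg_replace($nonCapturing, '$1 $2', $str);

// Capturing group: the whitespace alternation becomes group 2,
// so the text after it moves to group 3.
$capturing = '/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)( {2,}|[\t\n\r])+([^\'\s]*)/';
$second = preg_replace($capturing, '$1 $3', $str);

echo $first, PHP_EOL;  // set nome='a   b'
echo $second, PHP_EOL; // set nome='a   b'
```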

Output:

update tabela set nome='alberto da   silva', telefone='1234' )

Finally, if the idea is to remove exactly two spaces (not "two or more"), just change {2,} to {2}.
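As a rough check of that variation, the sketch below takes the last regex, changes {2,} to {2}, and runs it on two hypothetical inputs: one with exactly two spaces outside the quotes and one with three.

```php
<?php
// Same regex as the last example, with {2,} changed to {2}.
$pattern = '/\'[^\']*\'(*SKIP)(*F)|([^\'\s]*)(?: {2}|[\t\n\r])+([^\'\s]*)/';

// Exactly two spaces outside the quotes: replaced with one space.
$twoSpaces = preg_replace($pattern, '$1 $2', "a  b 'c  d'");
echo $twoSpaces, PHP_EOL; // a b 'c  d'

// Three spaces: the first two are replaced with one, the third is left alone.
$threeSpaces = preg_replace($pattern, '$1 $2', "a   b");
echo strlen($threeSpaces), PHP_EOL; // 4 ("a", two spaces, "b")
```

Note that {2} also matches the first two spaces of a longer run, so in this sketch three spaces end up as two. If longer runs must be left completely untouched, the regex would need an extra check around the spaces, which is outside the scope of this example.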