Find quotes with blanks with Regex


Viewed 3,662 times

8

I need to find errors where the text inside quotation marks has blank spaces at the beginning or at the end.

Examples of errors:

  1. The news was given by " Jornal do Brasil".
  2. Paris is considered the "City of Light ".

Note that in the first case, inside the quotes, it starts with a blank space, and in the second case, it ends with a blank space.

I want to remove these unnecessary blanks by using regular expressions to point out the error.

I used two Regex for that:

" .*?"
".*? "

The first flags quotes that start with a blank space; the second flags quotes that end with one.

Turns out, there’s a problem with these expressions.

Example:

  1. I like the colors "blue" and "black".

Note that there is no error in the sentence. The two words "blue" and "black" do not start or end with white space but, using the regular expressions above, the search finds a false positive in the " and ".
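The false positive can be reproduced with a short JavaScript sketch. The variable names are only illustrative, and the `g` flag is added here so that every match is listed:

```javascript
// The two patterns from the question, with the global flag added.
const leading = /" .*?"/g;
const trailing = /".*? "/g;

const sentence = 'I like the colors "blue" and "black".';

// match() returns null when nothing is found, so fall back to an empty array.
const leadingMatches = sentence.match(leading) || [];
const trailingMatches = sentence.match(trailing) || [];

console.log(leadingMatches);  // [ '" and "' ]
console.log(trailingMatches); // [ '"blue" and "' ]
```

Both patterns pair the closing quote of "blue" with the opening quote of "black", because nothing in them tracks which quote opens and which one closes.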

I tried several approaches, but my knowledge of regular expressions is still very limited and I could not fix this.

Which regex should I use in this case?

Thank you very much!

  • @Kyllopardiun In my opinion, the example in the question is correct, not least because no false positive would occur if the "and" were inside the quotes.

  • @Kyllopardiun No. The sentence is exactly like that, with the "and" outside the quotes.

  • @mgibsonbr Exactly.

3 answers

4

The best I can suggest is a regex that matches the string as a whole, because the problem here is that a local analysis can produce different results from a global one.

My attempt at a solution would be:

^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$

Example on Rubular. Explanation:

  • ^ - string start
  • [^"]* - followed by zero or more characters that are not quotation marks (text outside quotation marks)
  • (?:...)* - followed by zero or more of:
    • " - quote
    • (?:...|...)? - optionally, either:
      • [^"\s] - a single character that is neither a quotation mark nor whitespace; or:
      • [^"\s] - a character that is neither a quotation mark nor whitespace, followed by
      • [^"]* - zero or more characters that are not quotation marks, followed by
      • [^"\s] - a character that is neither a quotation mark nor whitespace
    • " - closing quote
    • [^"]* - zero or more characters that are not quotation marks (text outside quotation marks)
  • $ - end of string

In plain language, it matches a stretch outside quotation marks, then a stretch inside, then outside, then inside, and so on. A quoted stretch can be of three kinds: a) empty - ""; b) a single non-whitespace character - "a"; c) a non-whitespace character at the start and at the end, with anything in between - "a...b".

It should be noted that all this regex says is whether the string is valid or invalid: it cannot show you what character the error is in.
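As a rough illustration, the pattern can be used as a yes/no validator in JavaScript. The names `validQuotes`, `samples` and `results` are made up for this sketch; the sample sentences come from the question:

```javascript
// The validation pattern from above: true means every quoted stretch is well formed.
const validQuotes = /^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$/;

const samples = [
  'The news was given by " Jornal do Brasil".', // leading blank inside the quotes
  'Paris is considered the "City of Light ".',  // trailing blank inside the quotes
  'I like the colors "blue" and "black".',      // no error
];

const results = samples.map((s) => validQuotes.test(s));

samples.forEach((s, i) => console.log(results[i] ? 'valid:  ' : 'invalid:', s));
// results is [false, false, true]
```

Because the pattern is anchored with ^ and $, it has to be tested against one sentence or line at a time.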

Update: if what you want is a regex that matches strings containing an error, and tells you where the mistake is, this is the best I could do:

^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*("(?:\s[^"]*|[^"]*\s)")[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$

Example in jsFiddle. This "monstrosity" boils down to:

^ regex_original ("(?:\s[^"]*|[^"]*\s)") regex_original $

That is: "Match something that is correct, followed by something that is incorrect, followed by something that is correct". It will detect one and only one error like this - if the string has two or more errors, or if it has a quote that opens but does not close, etc., the regex will not be able to catch it.
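A minimal sketch of how the capture group could be read in JavaScript; `oneError` and `findBadQuote` are hypothetical names introduced only for this example:

```javascript
// The error-locating pattern from above; capture group 1 holds the faulty quote.
const oneError = /^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*("(?:\s[^"]*|[^"]*\s)")[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$/;

// Returns the faulty quoted stretch, or null when the pattern does not match.
function findBadQuote(text) {
  const m = oneError.exec(text);
  return m ? m[1] : null;
}

console.log(findBadQuote('Paris is considered the "City of Light ".')); // "City of Light "
console.log(findBadQuote('I like the colors "blue" and "black".'));     // null
console.log(findBadQuote('" a" and "b "'));                             // null (two errors)
```

The last call shows the limitation described above: a string with two faulty quotes is not matched at all, so a null result does not by itself mean the string is clean.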

I believe that with a little more effort we can improve this a little, but we are getting to the point where regex is no longer the most suitable tool for the work...

  • I tested it here and it works at first, but with this code it highlights what is correct. I want it to do the exact opposite and point out what is wrong. Example: http://www.regexr.com/39d0n At this link, try to correct what is wrong and you will understand what I mean...

  • 1

    @Dinho In your example the $ at the end is missing, hence the partial matches. As for pointing out the wrong part, I don't even know if it's possible, but I'll give it some thought, and if I come up with something I'll post it here. (The reason I think it may be impossible is that this problem is much like "parenthesis balancing" - something theoretically impossible to do with regex alone.)

  • 2

    @Dinho I updated the answer. It's not perfect; in fact I'm 99% sure that if you look you will find some false positive or negative. However, we are reaching the limit of what can be done "with sanity" using regex, so I could try to improve it further, but that would be in the spirit of a "programming challenge" and not "something I would use in production"...

  • Wow! hehehe! I tested it through your jsFiddle link and it all worked. I just don't understand why it didn't work on Regexr and the other regular expression editors I tested... Take this other example: http://www.regexr.com/39d16 It shows the second sentence (which has no error at all) being matched together with the first, which contains the error. If you correct the error in the first sentence, the highlight goes away. Why would that be? Thank you very much, friend! You have already helped a lot! :D

  • 1

    @Dinho It failed for me on that site too. It's strange, because when I activate the "multiline" flag, what you described happens (which is normal, because it is considering the entire text, not line by line), but when I deactivate it everything stops working!!! I don't understand why...

3

Here is a suggestion, which works in the text I tried:

var texto = '" Jornal do Brasil". Paris é considerada a "Cidade Luz " Gosto das cores "azul" e "preto".';

var textoLimpo = texto.replace(/"([^"]*)"/g, function (match, r) {
    return '"' + r.replace(/^\s+|\s+$/g, '') + '"';
});
console.log(textoLimpo); // "Jornal do Brasil". Paris é considerada a "Cidade Luz" Gosto das cores "azul" e "preto".

Demo: http://jsfiddle.net/zg7otqv9/1/

Basically, I divide the process into two parts: first isolating the pieces that start and end with " (quotes), and then cleaning them one by one with r.replace(/^\s+|\s+$/g, '').

The first part, /"([^"]*)"/g, picks up anything between two quotes: [^"]* matches any run of characters that are not quotation marks, and the regex is closed with ".

The second part uses the string start and end anchors (^ and $ respectively), with the alternation operator | between them.
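If the goal is only to point out the errors without correcting them, the same two-step idea can be adapted. The following is a sketch, not part of the original suggestion: `findPaddedQuotes` is a hypothetical helper, and it assumes the quotes are balanced and pair up from left to right:

```javascript
// Lists every quoted stretch whose content starts or ends with whitespace.
function findPaddedQuotes(text) {
  const errors = [];
  // Step 1: isolate each quoted stretch with the same pattern as above.
  for (const m of text.matchAll(/"([^"]*)"/g)) {
    // Step 2: instead of trimming, only test for leading or trailing whitespace.
    if (/^\s|\s$/.test(m[1])) {
      errors.push({ index: m.index, quote: m[0] });
    }
  }
  return errors;
}

const texto = '" Jornal do Brasil". Paris é considerada a "Cidade Luz " Gosto das cores "azul" e "preto".';
const erros = findPaddedQuotes(texto);
console.log(erros.map((e) => e.quote)); // [ '" Jornal do Brasil"', '"Cidade Luz "' ]
```

Each entry also carries the index at which the faulty quote starts, which can be used to highlight it.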

  • 2

    +1, just one suggestion: use \s+ instead of a single space, not only because it catches other types of white space (such as tabs) but also because it handles the case where there is more than one space inside the quotation marks. Ex: Paris é considerada a "Cidade Luz "

  • 1

    @mgibsonbr Good tip. Fixed. I would give you another +1 if I could vote twice :)

  • @Sergio I liked the alternative, but I can't use it like this. The program I am using does not allow any programming; it accepts only regex. The idea is to use regex only to point out the errors, without making the correction.

  • 1

    @Dinho, in that case I suggest you use/accept mgibsonbr's answer.

1

This expression matches anything inside quotation marks (including the quotation marks themselves). Ex: "This is not done"

"(.*?)(\w+)\b"
