How to use the negation operator in regular expressions for a specific string?

While studying the negation operator in regular expressions ([^]), I understood that it is possible to negate a single character (e.g., [^x]: anything other than "x") or a range of characters (e.g., [^A-Z]: anything other than a capital letter).

However, I am facing a situation where I need to exclude a specific sequence of characters, and I am not sure how to do it. Below is an example that reproduces the problem, along with what I have tried so far.

Let's say I have a list of book data, each entry containing author and title. Here is an example of such a list:

books = ["LEAL, Victor Nunes. Coronelismo, enxada e voto", 
"FLORY, Thomas. Judge and Jury in Imperial Brazil, 1808–1871: Social Control and Political Stability in the New State",
"Prado Jr., Caio. Formação do Brasil contemporâneo", "GRASS, Günter. Dog Years" , 
"ASSANGE, J.; APPELBAUM, J.; MULLER-MAGUHN, A.; ZIMMERMANN, J. Cypherpunks: Freedom and the Future of the Internet",
"BÖLL, Heinrich; VENNEWITZ, Leila. The Train Was On Time"]

From that list, I would like to extract only the names of the authors. What I have tried so far is the following:

import re

authors = []

for book in books:
    author = re.findall(r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+[^\.]+", book)[0]
    authors.append(author)

for author in authors:
    print(author)

Which results in:

LEAL, Victor Nunes
FLORY, Thomas
Prado Jr
GRASS, Günter
ASSANGE, J
BÖLL, Heinrich; VENNEWITZ, Leila

Note that the code worked for the first two authors, but failed in the third and fifth cases. The failure occurs because my regex says "take what starts with a capital letter, followed by any run of characters other than a period".

Clearly, the regex does not work in the third and fifth cases because of the . in the authors' names. My idea then was to change the regex to something like "take what starts with a capital letter, followed by any run of characters, except a period followed by a space". So I used the same code as above, but this time the regex was:

r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+[^\.\s]+"

which resulted in:

LEAL,
FLORY,
Prado
GRASS,
ASSANGE,
BÖLL,

Obviously, that is not what I expected. How do I exclude the sequence \.\s?

  • I used Python for the example, but the answer can use any language with similar regex support.

1 answer

[^\.\s] is a negated character class, which means "a single character that is neither a period nor whitespace". What is between [^ and ] is a list of characters, and there is no order defined between them, so [^\s\.] is exactly the same.

If you want to check for a sequence of more than one character, the way is to use a negative lookahead:
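To make that point concrete, here is a small sketch (the sample string is just the start of the first entry from the question) showing that both orderings give the same result, and that the class rejects every period and every whitespace character individually rather than the two-character sequence:

```python
import re

text = "LEAL, Victor Nunes. Coronelismo"

# The order of items inside the brackets is irrelevant.
a = re.findall(r"[^\.\s]+", text)
b = re.findall(r"[^\s\.]+", text)

print(a == b)  # True
print(a)       # ['LEAL,', 'Victor', 'Nunes', 'Coronelismo']
```

Each run stops at the first period or whitespace character, which is why the attempt in the question returned only "LEAL,".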

books = ["LEAL, Victor Nunes. Coronelismo, enxada e voto", 
"FLORY, Thomas. Judge and Jury in Imperial Brazil, 1808–1871: Social Control and Political Stability in the New State",
"Prado Jr., Caio. Formação do Brasil contemporâneo", "GRASS, Günter. Dog Years" , 
"ASSANGE, J.; APPELBAUM, J.; MULLER-MAGUHN, A.; ZIMMERMANN, J. Cypherpunks: Freedom and the Future of the Internet",
"BÖLL, Heinrich; VENNEWITZ, Leila. The Train Was On Time"]

import re

authors = []
r = re.compile(r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+(?:(?!\.\s).)+")
for book in books:
    author = r.findall(book)[0]
    authors.append(author)

for author in authors:
    print(author)

In this case, (?!\.\s) checks that the sequence \.\s does not come next. Then I put ., which matches any character (except line breaks), and that whole sequence (a character, as long as a period followed by whitespace does not start at that position) repeats one or more times.

I grouped that with (?: to form a non-capturing group, because with plain parentheses a capturing group would be formed, and in that case findall returns only the groups.

I also used compile so that the expression is compiled only once, since the documentation says this is more efficient when you need to use the same regex several times.
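If it helps to see the idea without regex syntax, the loop below is a rough hand-written equivalent of (?:(?!\.\s).)+ applied from the start of a string. The helper name take_until_period_space is made up for this illustration, and str.isspace() is used as a stand-in for \s:

```python
import re

def take_until_period_space(text):
    # Hypothetical helper: walks the string one character at a time.
    i = 0
    while i < len(text):
        # Plays the role of the lookahead: stop if a period followed
        # by whitespace starts at this position.
        if text[i] == "." and i + 1 < len(text) and text[i + 1].isspace():
            break
        # Plays the role of the dot: it does not match a line break.
        if text[i] == "\n":
            break
        i += 1
    return text[:i]

book = "Prado Jr., Caio. Formação do Brasil contemporâneo"
manual = take_until_period_space(book)
by_regex = re.match(r"(?:(?!\.\s).)+", book).group(0)

print(manual)              # Prado Jr., Caio
print(manual == by_regex)  # True
```

The period in "Jr.," is consumed because it is followed by a comma, not by whitespace.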
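A quick sketch of the difference, using a simplified [A-Z] prefix (an assumption made only to keep the pattern short) on one entry from the question:

```python
import re

book = "GRASS, Günter. Dog Years"

# Plain parentheses: findall returns the group, and a repeated
# group only keeps its last iteration (the final character matched).
capturing = re.findall(r"^[A-Z]+((?!\.\s).)+", book)

# (?: ... ) does not capture, so findall returns the whole match.
non_capturing = re.findall(r"^[A-Z]+(?:(?!\.\s).)+", book)

print(capturing)      # ['r']
print(non_capturing)  # ['GRASS, Günter']
```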


In this particular case, you could also split on "period followed by whitespace":

authors = []  # start again from an empty list
r = re.compile(r"\.\s")
for book in books:
    author = r.split(book)[0]
    authors.append(author)

Or even without regex:

author = book.split('. ')[0]
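As a sanity check, this sketch runs the three approaches side by side on the two entries from the question that contain periods inside the author names. The variable names are only for this illustration:

```python
import re

books = [
    "Prado Jr., Caio. Formação do Brasil contemporâneo",
    "ASSANGE, J.; APPELBAUM, J.; MULLER-MAGUHN, A.; ZIMMERMANN, J. "
    "Cypherpunks: Freedom and the Future of the Internet",
]

lookahead = re.compile(r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+(?:(?!\.\s).)+")
splitter = re.compile(r"\.\s")

results = []
for book in books:
    by_lookahead = lookahead.findall(book)[0]
    by_re_split = splitter.split(book)[0]
    by_str_split = book.split(". ")[0]
    # All three approaches should agree on this data.
    assert by_lookahead == by_re_split == by_str_split
    results.append(by_str_split)

for author in results:
    print(author)
```

One difference to keep in mind: the lookahead version only matches when the string starts with one of the listed capital letters (or a hyphen), while both split versions accept any string.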

Keep in mind that in regex the shorthand \s matches not only spaces but also line breaks and several other whitespace characters. If you want to consider only the space (and not the other characters that \s matches), just switch to r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+(?:(?!\. ).)+" (notice the space between the \. and the closing parenthesis). In the case of split, it would be r"\. " (with a space before the closing quote).
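To illustrate the difference, the sketch below uses a made-up variation of the first entry in which the period is followed by a tab instead of a space:

```python
import re

# Hypothetical input: a tab (not a space) after the period.
text = "LEAL, Victor Nunes.\tCoronelismo, enxada e voto"

with_s = re.split(r"\.\s", text)[0]     # \s also matches the tab
with_space = re.split(r"\. ", text)[0]  # only a literal space counts

print(with_s)              # LEAL, Victor Nunes
print(with_space == text)  # True: nothing was split
```

Which one you want depends on how clean the data is: \s is more forgiving, while the literal space is stricter.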

  • Blimey. You are the wizard of regex. Thank you very much!
