While studying the negation operator in regular expressions ([^]), I learned that it can negate a single character (e.g., [^x]: anything other than "x") or a range of characters (e.g., [^A-Z]: anything other than a capital letter).
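To make those two cases concrete, here is a minimal sketch using Python's re module. The sample strings and the variable names not_x and not_upper are made up purely for illustration.

```python
import re

# [^x] matches any single character other than "x".
not_x = re.findall(r"[^x]", "axbx")
print(not_x)  # ['a', 'b']

# [^A-Z]+ matches runs of characters that are not capital letters A-Z.
not_upper = re.findall(r"[^A-Z]+", "ABcdEF gh")
print(not_upper)  # ['cd', ' gh']
```

In both cases the negation applies to one character position at a time.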
However, I am facing a specific situation where I need to exclude a particular sequence of characters, and I am not sure how to do it. Below is an example that reproduces the problem, along with what I have tried so far.
Let’s say I have a list of book data, such as author and title. Here is an example of such a list:
books = [
    "LEAL, Victor Nunes. Coronelismo, enxada e voto",
    "FLORY, Thomas. Judge and Jury in Imperial Brazil, 1808–1871: Social Control and Political Stability in the New State",
    "Prado Jr., Caio. Formação do Brasil contemporâneo",
    "GRASS, Günter. Dog Years",
    "ASSANGE, J.; APPELBAUM, J.; MULLER-MAGUHN, A.; ZIMMERMANN, J. Cypherpunks: Freedom and the Future of the Internet",
    "BÖLL, Heinrich; VENNEWITZ, Leila. The Train Was On Time",
]
From that list, I would like to extract only the names of the authors. What I have tried so far is the following:
import re
authors = []
for book in books:
    author = re.findall(r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+[^\.]+", book)[0]
    authors.append(author)
for author in authors:
    print(author)
Which results in:
LEAL, Victor Nunes
FLORY, Thomas
Prado Jr
GRASS, Günter
ASSANGE, J
BÖLL, Heinrich; VENNEWITZ, Leila
Note that the code worked for the first two authors, but failed in the third and fifth cases. This happens because my regex says "take what starts with a capital letter, followed by any string of characters other than a period".
It is clear that the regex fails in the third and fifth cases because of the . in the authors' names. My idea then was to change the regex to something like "take what starts with a capital letter, followed by any string of characters except a period followed by a space". So I used the same code as above, but this time with the regex:
r"^[A-Z-ÁÀÂÄÃÉÈÍÏÓÔÕÖÚÜÇÑ]+[^\.\s]+"
which resulted in:
LEAL,
FLORY,
Prado
GRASS,
ASSANGE,
BÖLL,
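As far as I can tell, this output follows from how a negated class behaves: [^\.\s] describes a set of excluded characters, not a sequence, so matching stops at the first period or the first whitespace, whichever comes first. A small sketch of that behaviour, where the names sample, stops_early and same_set are only illustrative:

```python
import re

sample = "Prado Jr., Caio. Formação do Brasil contemporâneo"

# [^\.\s]+ consumes characters until it meets a period OR a whitespace.
stops_early = re.match(r"[^\.\s]+", sample).group()

# Order inside the brackets is irrelevant: [^\s\.] is the same set.
same_set = re.match(r"[^\s\.]+", sample).group()

print(stops_early)  # Prado
print(same_set)     # Prado
```

Since the class only ever looks at one character, it has no way of expressing "a period, but only when a space comes right after it".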
Obviously, this is not what I expected. How do I exclude the sequence \.\s?
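One direction that seems to fit, offered as an unverified sketch rather than a definitive answer, is a negative lookahead: (?!\.\s) succeeds only where the next two characters are not a period followed by whitespace. Repeating it before every consumed character gives the pattern below. The shortened list sample_books and the names pattern and found are assumptions for the example, and the approach assumes that the first period followed by a space always marks the end of the author block, which would not hold for initials written like "J. R.".

```python
import re

sample_books = [
    "LEAL, Victor Nunes. Coronelismo, enxada e voto",
    "Prado Jr., Caio. Formação do Brasil contemporâneo",
    "ASSANGE, J.; APPELBAUM, J.; MULLER-MAGUHN, A.; ZIMMERMANN, J. Cypherpunks: Freedom and the Future of the Internet",
]

# Consume one character at a time, but only while the text ahead
# is NOT a period immediately followed by whitespace.
pattern = re.compile(r"^(?:(?!\.\s).)+")

found = [pattern.match(book).group() for book in sample_books]
for author in found:
    print(author)
```

With this pattern a period followed by a comma or semicolon (as in "Jr.," or "J.;") is kept, and the match ends just before the first period that is followed by a space.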
I used Python for the example, but I understand that the answer can use any language with a similar regex syntax.
– Lucas