re.error: bad escape \c at position 0

Question

Asked 5 years, 8 months ago

Viewed 222 times

0

I’m trying to search one array for the items of another, and report which values have a match and which do not.

import re

array = ['brasil','argentina','chile','canada']
array2 = ['brasil.sao_paulo','chile','argentina']

for x,y in zip(array,array2):
  if re.search('\\{}\\b'.format(x), y, re.IGNORECASE):
    print("Match: {}".format(x))
  else:
    print("Not match: {}".format(y))

Output:

Not match: brasil.sao_paulo
Not match: chile
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    if re.search('\\{}\\b'.format(x), y, re.IGNORECASE):
  File "/usr/local/lib/python3.7/re.py", line 183, in search
re.error: bad escape \c at position 0

Desired output:

MATCH: brasil
MATCH: argentina
MATCH: chile
NOT MATCH:canada

Luis, apparently you’re trying to search array2 using the items of array as search terms. Is that really what you want, given that you’re using zip? It’s hard to tell exactly what you want or what your code is trying to do. I strongly advise you to edit your question and explain your intentions and doubts better.

– fernandosavio

2019/12/18 at 20:12
I would like the search to work regardless of the order of the arrays

– Luis Henrique

2019/12/18 at 20:14
And return what they have in common

– Luis Henrique

2019/12/18 at 20:14
for example: Array 1 -> Line1 == Array2 -> All Lines

– Luis Henrique

2019/12/18 at 20:15

3 answers

2

Complementing the answer from @fernandosavio (which already explains why using zip is wrong in this case), here is an explanation of why your regex failed.

You are using '\\{}\\b'.format(x) to build the regex. Inside a string literal, \\ is interpreted as the single character \ (since \ introduces escape sequences, such as \n for a line break, the backslash itself has to be written as \\).

So when x has the value brasil, the result is the string \brasil\b. The regex engine reads the start of that string as the shortcut \b, which denotes a word boundary (a position with an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). That is, the regex only looks for the word "rasil".

When x has the value argentina, the result is the string \argentina\b. The escape sequence \a corresponds to the BELL character (a control character that was used to make the terminal emit a sound, although not every terminal does so nowadays). In any case, this regex does not search for the word "argentina", but for a \a followed by the word "rgentina".

And finally, when x has the value canada, the result is the string \canada\b. But \c is an invalid escape sequence, and the re module documentation says that using an invalid escape sequence raises an error. The same happens with chile, which is the term that actually triggers the error in your run: zip stops after three pairs, so canada is never reached.
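To see these cases side by side, here is a minimal sketch that builds each pattern the same way the question does and reports whether it compiles. The helper name check_pattern is only illustrative, and the behaviour assumes Python 3.7 (as in the question's traceback) or later.

```python
import re

def check_pattern(term):
    """Build the pattern as in the question and report how re treats it."""
    pattern = '\\{}\\b'.format(term)   # e.g. the string \canada\b
    try:
        re.compile(pattern)
    except re.error as exc:
        return 'error: {}'.format(exc)
    return 'compiles'

for term in ['brasil', 'argentina', 'chile', 'canada']:
    print(term, '->', check_pattern(term))

# \brasil\b means "boundary + rasil + boundary":
# it matches the word "rasil", but not "brasil".
print(re.search('\\brasil\\b', 'brasil'))               # None
print(re.search('\\brasil\\b', 'rasil') is not None)    # True
```

Only the terms starting with c fail to compile; the other two compile but search for something other than what was intended.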

You probably meant to have \b at the beginning as well, so just do:

for termo in termos:
    r = re.compile(r'\b{}\b'.format(termo))
    for palavra in palavras:
        if r.search(palavra):
            print(f'Encontrado {termo!r} em {palavra!r}')
            break
    else:
        print(f'{termo!r} não encontrado.')

Note the r before the opening quote (r'\b{}\b'). It marks a raw string literal, within which the character \ does not need to be written as \\, which makes the regex a little more readable. Now the expressions will be \bbrasil\b, \bargentina\b and \bcanada\b, that is, all of them are valid regexes.
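A quick way to convince yourself of the difference between the two kinds of literal (just a small sketch, not part of the solution itself):

```python
# In a normal string literal, '\b' is the backspace character (one character);
# in a raw string literal, r'\b' is a backslash followed by the letter b.
print(len('\b'), len(r'\b'))        # 1 2

# The raw literal is just a more readable spelling of the doubled backslashes.
print(r'\b{}\b' == '\\b{}\\b')      # True

pattern = r'\b{}\b'.format('canada')
print(pattern)                      # \bcanada\b
```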

I also compile the expression once before searching the list of words, so the same regex is reused in the inner for (granted, there is a regex cache, but I still see no need to rebuild the string several times inside the loop).

This solution is almost the same as @fernandosavio's, with a slight difference. For example, if we have something like:

termos = ['brasil']
palavras = ['brasileiro', 'brasil']

Since his solution uses termo.casefold() in palavra.casefold(), it finds brasileiro first, because the term brasil is indeed contained in that string.

The regex \bbrasil\b, on the other hand, only finds the second word ("brasil"), because it looks for brasil with a \b before and after it (and since \b is a word boundary, the regex only matches when brasil appears as a whole word).
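A small sketch of that difference, using the list above plus 'brasil.sao_paulo' from the question (the variable names substring_hits and regex_hits are only illustrative):

```python
import re

termo = 'brasil'
palavras = ['brasileiro', 'brasil', 'brasil.sao_paulo']

regex = re.compile(r'\b{}\b'.format(termo))

# Substring test: any string that contains the term counts.
substring_hits = [p for p in palavras if termo in p]
# Word-boundary regex: the term must appear as a whole word.
regex_hits = [p for p in palavras if regex.search(p)]

print(substring_hits)  # ['brasileiro', 'brasil', 'brasil.sao_paulo']
print(regex_hits)      # ['brasil', 'brasil.sao_paulo']
```

Note that 'brasil.sao_paulo' still matches the regex, because the dot right after brasil is not a word character and therefore forms a boundary.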

Note also that his solution uses casefold() to make the search case insensitive. For the regex to behave the same way, you can use the I flag:

r = re.compile(r'\b{}\b'.format(termo), re.I)

Or apply casefold() itself to the strings (both the search terms and the words being searched):

for termo in termos:
    r = re.compile(r'\b{}\b'.format(termo.casefold()))
    for palavra in palavras:
        if r.search(palavra.casefold()):
            print(f'Encontrado {termo!r} em {palavra!r}')
            break
    else:
        print(f'{termo!r} não encontrado.')

The two approaches differ in some cases, such as the character ß, whose uppercase version is SS. A case-insensitive search should therefore find ß as well as SS or ss. The re.I flag alone is not enough for that case; it only works with casefold(). It’s up to you: depending on the strings you search, it might make no difference (either way, it’s good to know that both options exist).
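To illustrate, here is a minimal sketch with the word "straße" (my own example, it does not come from the question):

```python
import re

termo = 'straße'
palavra = 'STRASSE'

# re.I compares character by character, so a single ß never matches the two letters SS.
print(re.search(termo, palavra, re.I))           # None

# casefold() expands ß to ss, so both sides become 'strasse'.
print(termo.casefold(), palavra.casefold())      # strasse strasse
print(termo.casefold() in palavra.casefold())    # True
```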

Finally, adjusting the messages to what you wanted:

import re

termos = ['brasil', 'argentina', 'chile', 'canada']
palavras = ['brasil.sao_paulo', 'chile', 'argentina']

for termo in termos:
    r = re.compile(r'\b{}\b'.format(termo), re.I)
    for palavra in palavras:
        if r.search(palavra):
            print(f'MATCH: {termo}')
            break
    else:
        print(f'NOT MATCH: {termo}')

Output:

MATCH: brasil
MATCH: argentina
MATCH: chile
NOT MATCH: canada

1

Thank you very much for the explanation, I think answers like this make the community grow, congratulations !!!

– Luis Henrique

2019/12/19 at 12:54


by fernandosavio • **9,013** points · Answer 1 · 2019-12-18T20:57:41+00:00

termos = ['brasil', 'argentina', 'chile', 'canada']
palavras = ['brasil.sao_paulo', 'chile', 'argentina']

for termo in termos:
    for palavra in palavras:
        if termo.casefold() in palavra.casefold():
            print(f'Encontrado {termo!r} em {palavra!r}')
            break
    else:
        print(f'{termo!r} não encontrado.')

Result:

Encontrado 'brasil' em 'brasil.sao_paulo'
Encontrado 'argentina' em 'argentina'
Encontrado 'chile' em 'chile'
'canada' não encontrado.

To check whether each term in the first list is present in some string of the second list, you don’t need zip, because you don’t have to go through the two lists side by side.

A step by step would be:

Iterate over the search terms;
For each search term, iterate over the list being searched;
Check whether the search term is contained in the string being searched.

Since you are only testing whether termo is a substring of palavra, there is no need for regular expressions (and your regex is not correct anyway). You can convert the strings to a special "lowercase" form made for case-insensitive comparison, which takes some Unicode exceptions into account, using the method str.casefold().

Although it looks strange, the else in the code above is correct, because I am using a for...else (which I explain in detail in another answer), so the else runs only if the break does not occur.
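If the for...else still feels odd, this small sketch isolates the mechanism. The function name first_match is hypothetical, used only for this illustration:

```python
def first_match(termo, palavras):
    """Return the first string containing termo, or None if there is none."""
    for palavra in palavras:
        if termo.casefold() in palavra.casefold():
            found = palavra
            break              # leaving via break skips the else below
    else:
        found = None           # runs only when the loop ended without a break
    return found

palavras = ['brasil.sao_paulo', 'chile', 'argentina']
print(first_match('Chile', palavras))    # chile
print(first_match('canada', palavras))   # None
```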

Constructive tip: choose better names for your functions, classes and variables, so it’s much easier to understand your code.

by Lucas • **3,858** points · Answer 2 · 2019-12-18T20:38:32+00:00

The problem was the escape character \. Without it your code runs normally, but does not do what you want. With the zip function it is not possible to test against all entries of array2, only against the one at the same index as the reference string. That is because zip pairs the items up position by position. See:

array = ['brasil','argentina','chile','canada']
array2 = ['brasil.sao_paulo','chile','argentina']

for x, y in zip(array, array2):
    print(x,y)

Output:

brasil brasil.sao_paulo
argentina chile
chile argentina

A more correct way to build your code, therefore, would be:

import re

array = ['brasil','argentina','chile','canada']
array2 = ['brasil.sao_paulo','chile','argentina']

for x in array:
    if True in [bool(re.search(r'{}'.format(x), y, re.IGNORECASE)) for y in array2]:
        print("Match: {}".format(x))
    elif False in [bool(re.search(r'{}'.format(x), y, re.IGNORECASE)) for y in array2]:
        print("Not match: {}".format(x))

Output:

Match: brasil
Match: argentina
Match: chile
Not match: canada
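As a possible variation on the code above (a sketch, not a replacement for it), any() avoids building the list of booleans twice and stops at the first hit. The results list is only there to make the outcome easy to inspect:

```python
import re

array = ['brasil', 'argentina', 'chile', 'canada']
array2 = ['brasil.sao_paulo', 'chile', 'argentina']

results = []
for x in array:
    # any() stops at the first string that matches, scanning array2 only once
    found = any(re.search(x, y, re.IGNORECASE) for y in array2)
    results.append((x, found))
    print("{}: {}".format("Match" if found else "Not match", x))
```

If the terms could contain regex metacharacters, wrapping x in re.escape() would keep them literal.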