How to compare two identical strings that are encoded differently?

Viewed 432 times

2

I want to compare two strings that represent the same text but are encoded differently.

G%C3%A9rard Depardieu and Gérard Depardieu

I need to make several comparisons between two lists, and I ran into this problem. List A is full of names in URL-encoded form (at least I think that is what it is), while list B holds the names as plain text, with accents and all kinds of special characters. I am not sure how to convert accented characters to URL-style encoding so that I can make the comparisons.

name1 = 'G%C3%A9rard Depardieu'
name2 = ''
arq = open('gerard.txt', 'r', encoding='utf-8')
for a in arq:
    name2 = a.replace('\n', '')
print(name1 == name2)  # False

Also, printing the name with print(name2) gives the following error:

Out[11]: ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-11-e1187e69ab52> in <module>()
----> 1 name2

~/.virtualenv/crawler/lib/python3.5/site-packages/IPython/core/displayhook.py in __call__(self, result)
    259             self.fill_exec_result(result)
    260             if format_dict:
--> 261                 self.write_format_data(format_dict, md_dict)
    262                 self.log_output(format_dict)
    263             self.finish_displayhook()

~/.virtualenv/crawler/lib/python3.5/site-packages/IPython/core/displayhook.py in write_format_data(self, format_dict, md_dict)
    188                 result_repr = '\n' + result_repr
    189 
--> 190         print(result_repr)
    191 
    192     def update_user_ns(self, result):

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 2: ordinal not in range(128)

But my goal is not to print the names but to make comparisons.

The file gerard.txt has only one line: Gérard Depardieu

2 answers

3


The strings are not actually "equal": if one of them has all its non-ASCII characters encoded as "%XX", it has to be decoded before the comparison.

The standard library has the function urllib.parse.unquote, which turns characters encoded this way back into text. There is one more important thing to note: since two bytes are used to represent a single accented character (and the first one having the value 0xC3 is another hint), the original string was encoded in UTF-8 before being URL-quoted. UTF-8 is the default encoding that Python's unquote assumes, so simply calling the function directly solves the problem:

In [202]: import urllib.parse

In [203]: name1 = 'G%C3%A9rard Depardieu'

In [204]: print(urllib.parse.unquote(name1))
Gérard Depardieu
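To make the "two bytes for one accented character" point concrete, here is a minimal sketch that splits the decoding into its two steps using urllib.parse.unquote_to_bytes from the standard library. The variable names are only illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

name1 = 'G%C3%A9rard Depardieu'

# Step 1: undo only the %XX escapes; the result is still raw bytes.
raw = unquote_to_bytes(name1)

# Step 2: interpret those bytes as UTF-8 text.
decoded = raw.decode('utf-8')

# unquote() performs both steps at once, with UTF-8 as its default encoding.
same_as_unquote = decoded == unquote(name1)

# The decoded text now compares equal to the name stored in the file.
matches_file_name = decoded == 'G\u00e9rard Depardieu'
```

Here raw contains the two bytes 0xC3 0xA9 where the accented letter should be, and decoding them as UTF-8 yields the single character é.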

If your data happens to contain accented characters encoded as a single byte, the original string was probably encoded in latin1. In that case, just pass the encoding in the "encoding" parameter of unquote:

In [206]: urllib.parse.unquote("G%e9rard", encoding="latin1")
Out[206]: 'Gérard'
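If the data might mix both encodings, one possible approach is to try UTF-8 strictly and fall back to latin1 only when that fails. The sketch below is a heuristic, not a guarantee: unquote_with_fallback is a hypothetical helper name, and a few latin1 byte sequences are also valid UTF-8 and would be decoded as UTF-8.

```python
from urllib.parse import unquote


def unquote_with_fallback(value):
    """Hypothetical helper: try UTF-8 first, then fall back to latin1."""
    try:
        # errors='strict' makes invalid UTF-8 raise an exception instead of
        # silently becoming the U+FFFD replacement character.
        return unquote(value, encoding='utf-8', errors='strict')
    except UnicodeDecodeError:
        return unquote(value, encoding='latin1')


# With the defaults (UTF-8, errors='replace') the latin1 byte is lost.
silently_wrong = unquote('G%e9rard')

from_latin1 = unquote_with_fallback('G%e9rard')
from_utf8 = unquote_with_fallback('G%C3%A9rard')
```

Both from_latin1 and from_utf8 end up as the same text, while silently_wrong contains a replacement character and would never match the name in the other list.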

How to do it:

The best thing to do is to unescape these strings as soon as you read them into your program. The question does not say whether they come from a file, arrive in a web request, etc., but if they are stored in a file, you could do this:

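from urllib.parse import unquote

# Open with an explicit encoding so the result does not depend on the system locale.
with open("meu_arquivo.txt", encoding="utf-8") as file:
    # Undo the %XX escapes for the whole file in one go.
    text = unquote(file.read())

lines = text.splitlines()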
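Once the names are decoded, comparing the two lists from the question becomes an ordinary string comparison. A minimal sketch, assuming both lists are already in memory; the sample names and the variables list_a, list_b, in_both and only_in_a are made up for illustration.

```python
from urllib.parse import unquote

# Made-up sample data standing in for the two lists from the question.
list_a = ['G%C3%A9rard Depardieu', 'Catherine Deneuve', 'Jean%20Reno']
list_b = ['Gérard Depardieu', 'Jean Reno', 'Audrey Tautou']

# Decode list A once, up front, so every later comparison is plain text.
decoded_a = [unquote(name) for name in list_a]

# A set gives fast membership tests when the lists are long.
known_names = set(list_b)

in_both = [name for name in decoded_a if name in known_names]
only_in_a = [name for name in decoded_a if name not in known_names]
```

Names without any %XX escape pass through unquote unchanged, so it is safe to apply it to every entry of list A.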

1

You can use urllib.parse: decode the URL-encoded string first, and then the comparison with the string read from the file returns True.

import urllib.parse

name1 = urllib.parse.unquote('G%C3%A9rard Depardieu')
name2 = ''
# The with block closes the file automatically.
with open('gerard.txt', 'r', encoding='utf-8') as arq:
    for a in arq:
        print(a)
        name2 = a.replace('\n', '')
print(name1 == name2)  # True
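The question also asks how to go in the opposite direction, that is, how to URL-encode the accented name. A small sketch with urllib.parse.quote follows; note that the space in the question's data is left unescaped, so it is assumed here that a space has to be added to the safe parameter to reproduce that exact form.

```python
from urllib.parse import quote

name2 = 'Gérard Depardieu'

# By default quote() encodes the text as UTF-8 and also escapes the space.
default_quoted = quote(name2)

# Adding a space to `safe` (whose default is '/') leaves spaces untouched.
space_kept = quote(name2, safe='/ ')

matches_question_data = space_kept == 'G%C3%A9rard Depardieu'
```

Comparing in this direction is more fragile, because the encoded form can vary (for example %c3%a9 versus %C3%A9, or an escaped versus a literal space), which is one reason decoding with unquote is usually the simpler choice.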
