Python string prefix


Asked 7 years, 3 months ago

Viewed 260 times

-1

I wrote a script that downloads an attachment from an email. The attachment is an XML file, and I want to save it in a database. But when I read the XML body, it comes with the prefix 'b', which causes an error when saving the XML to the database.

The string that goes to SQL ends up like this:

INSERT  INTO NFes (xml) VALUES (b'<?xml version...')

These are the errors:

"Operand type conflict: image is incompatible with xml (206)" "Unable to prepare one or more instructions. (8180)"

I tried changing the encoding with str(xml, "utf-8"), for example, which does remove the prefix. But then the ODBC SQL Server Driver raises a different error: "XML Parsing: Line 1, character 38, cannot toggle encoding (9402) (Sqlparamdata)"

Isn't it complaining about that comma right at the beginning of the XML?

– Giovanni Nunes

2018/04/09 at 23:22
I agree with @Giovanni, I think that comma is what generates this error...

– aa_sp

2018/04/10 at 12:54
I'm sorry, the comma at the beginning of the XML was a typo. I added to the question the error message I get when trying to save the XML with the b in front.

– Gustavo Primo

2018/04/10 at 14:03
Include the code you use to generate this string and the type of the xml column in your database.

– Leandro Angelo

2018/04/11 at 00:31

2 answers

-1

My XML has an "encoding" attribute, which is why the error message said it was not possible to toggle the encoding.

So I used a replace to remove the 'encoding="utf-8"'. And to remove the 'b' prefix I did what I had tried before: str(xml, 'utf-8'). After these changes the XML was saved to the database normally!
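
The two steps described above can be sketched as follows. This is a minimal sketch, assuming the attachment arrives as UTF-8 bytes; the helper name `prepare_xml` and the regular expression are illustrative and are not the code from the original script. The pattern removes only the `encoding` pseudo-attribute from the XML declaration and leaves the rest of the document untouched.

```python
import re

# Matches the encoding pseudo-attribute inside the XML declaration only,
# e.g. ` encoding="utf-8"` in `<?xml version="1.0" encoding="utf-8"?>`.
_ENCODING_DECL = re.compile(
    r'(<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2',
    re.IGNORECASE,
)


def prepare_xml(raw: bytes) -> str:
    """Decode UTF-8 bytes and drop the encoding declaration (hypothetical helper)."""
    text = str(raw, "utf-8")  # same result as raw.decode("utf-8")
    return _ENCODING_DECL.sub(r"\1", text, count=1)


# Made-up sample standing in for the downloaded attachment.
xml_bytes = b'<?xml version="1.0" encoding="utf-8"?><nfe>ok</nfe>'
xml_text = prepare_xml(xml_bytes)
print(xml_text)
```

The result is an ordinary `str` with no `b` prefix and no encoding declaration, which is the form the answer above reports saving successfully.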


by jsbueno • **30,668** points · Answer 1 · 2018-04-10T17:33:33+00:00

The "b prefix" indicates that the object you have at hand is not a text string but a sequence of bytes. In Python 3 the two things are fundamentally different, which is why you always need to know how the text is encoded in bytes before you can turn those bytes into characters. Nowadays it is increasingly common for text to be in the "utf-8" encoding, but some legacy systems and Windows use the "latin-1" encoding, which lets every character of the Portuguese language fit in a single byte.

Python objects of type "bytes" have a "decode" method: just call it and the result will be the text string (which Python displays without the 'b' prefix). Besides the "decode" method, the call str(xml, 'utf-8') performs the same transformation, which is why the error message changed. Since it is not a Python error saying that there is an invalid utf-8 sequence, chances are that your XML really is in utf-8 and it is only ODBC that complains about an invalid character. utf-8 supports every Unicode character; other encodings, such as latin-1, do not. If the text contains characters from languages such as Greek, Russian, or Hebrew, or even punctuation marks that are not defined in latin-1, an error will occur, which may well be the case here.
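
A short sketch of the points in the paragraph above, using a made-up sample value: `decode` and `str(..., 'utf-8')` give the same text, calling `str` without an encoding is what produces the literal `b'...'` text, and bytes that are not valid utf-8 make Python itself raise an error.

```python
# Made-up sample: UTF-8 bytes for a tiny XML document containing "João".
xml_bytes = b'<?xml version="1.0"?><nome>Jo\xc3\xa3o</nome>'

# Both calls turn bytes into text and give the same result.
via_method = xml_bytes.decode("utf-8")
via_str = str(xml_bytes, "utf-8")
print(via_method == via_str)

# Without an encoding argument, str() returns the repr of the bytes object,
# which is how the b'...' ends up inside an SQL string built by concatenation.
as_repr = str(xml_bytes)
print(as_repr[:2])

# Bytes that are not valid UTF-8 (here, latin-1 bytes for "ão") fail in Python,
# before the data ever reaches the driver.
try:
    b"\xe3o".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(type(exc).__name__)
```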

The remedy would be to force an encoding with escaping when passing the data to the driver. But here is another problem: the function does not accept bytes (the already encoded text). Result: you will have to mangle the Python text, replacing every character outside "latin1" with "?", turn it back into text, and then make your call. Then, if there is no other error in the XML, it should work.

I'd recommend asking whoever designed the database you're feeding to make it accept Unicode.

To understand more about these processes, stop everything you're doing right now and read http://local.joelonsoftware.com/wiki/O_M%C3%ADnimo_Absoluto_Que_Todos_os_Programadores_de_Software_Precisam,_Absolutamente,_Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres_(Sem_Desculpas!)

To fix your problem and remove problematic characters from the text:

An error equivalent to the following is what is now happening inside the ODBC code when you send text containing, for example, Cyrillic characters:

In [119]: a = "texto inválido: Ут пауло интерессет темпорибус пер"

In [120]: a.encode("latin-1")
UnicodeEncodeError                        Traceback (most recent call last)

So you must: decode your data using utf-8, encode it to latin-1 while replacing the unknown characters with "?", and decode it back to text. The result is data that can be sent to your database:

In [122]: dados
Out[122]: b'texto inv\xc3\xa1lido: \xd0\xa3\xd1\x82'

In [123]: dados_str = dados.decode("utf-8").encode("latin1", errors="replace").decode("latin1")

In [124]: dados_str
Out[124]: 'texto inválido: ??'

(The "dados" variable in this example is equivalent to what you have at the beginning: a bytes object representing text encoded in utf-8, with characters that are invalid in latin-1.) If you keep getting the same "cannot toggle encoding" error, try filtering out all non-ASCII characters: use "ascii" instead of "latin-1" in the code above.
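
The same round trip can be wrapped in a small function so that the target encoding is a parameter. This is only a sketch: the name `to_legacy_text` is illustrative and the input is assumed to be valid utf-8.

```python
def to_legacy_text(data: bytes, target: str = "latin-1") -> str:
    """Return text in which every character outside `target` becomes '?'.

    Hypothetical helper; assumes `data` holds valid UTF-8.
    """
    text = data.decode("utf-8")
    # errors="replace" substitutes "?" for each character the target
    # encoding cannot represent; decoding brings the result back to str.
    return text.encode(target, errors="replace").decode(target)


dados = b'texto inv\xc3\xa1lido: \xd0\xa3\xd1\x82'

latin = to_legacy_text(dados)             # keeps "á", drops the Cyrillic letters
ascii_only = to_legacy_text(dados, "ascii")  # also replaces "á"
print(latin)
print(ascii_only)
```

Note that the replacement is lossy: the original characters cannot be recovered from the "?" placeholders afterwards.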