XML returned by web service with encoding error

Viewed 1,354 times

2

Hello, I have XML returned by a web service with encoding errors. The XML declares UTF-8, but accented characters are displayed incorrectly and I can't work out what the correct encoding should be. I have no information about how the data is stored in the original database or anything of that kind; I only receive the XML as shown. Here is an example of the XML I receive:

<?xml version="1.0" encoding="UTF-8"?>
<test>
<title>Muitos cientistas intrØpidos se aventura no coraĿªo dos dois vulcıes mais explosivos do planeta</title>
</test>

How can I detect or convert the encoding and fix the accent errors? Is there any way to convert these characters after the web service has generated the XML, or can this only be corrected in the web service itself?

Note: I have tried functions such as iconv, utf8_decode, and mb_convert_encoding.

Thanks in advance.

  • Have you tried a solution using iconv()? Here is an example: http://stackoverflow.com/a/7980354/1001109

  • Yes, just read the description of the question.

  • It would help the OP if the downvotes were explained, so he can improve the question...

  • @rafaelphp please post the code that fetches this XML...

  • The right fix in this case is to correct the XML at the source.

  • I do not have access to the web service, as said in the question.

  • 1

    Could something from the XML be getting mixed with another script and conflicting at output time? Are you sure the problem is in the web service? Could you post the code you are using?

  • I have already tried several ways (cURL, file_get_contents, etc.); even when I open the URL in the browser, the same XML comes back.

  • Have you tried utf8_encode? Did you check the file's encoding (not likely to be the cause)? Have you already run mb_detect_encoding() to find out which encoding is arriving?

  • Just to understand: you are using an external service that returns the broken string inside otherwise valid XML?

  • Ideally you would provide the endpoint of this XML so we can take a look.

4 answers

6

In this case, one solution is to build a map of the garbled characters being returned and replace them. Example:

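<?php
// Map each garbled character to the accented character it should have been.
$arr = array("Ø" => "é", "Ŀ" => "ç", "ª" => "ã", "ı" => "õ");
$word = "Muitos cientistas intrØpidos se aventura no coraĿªo dos dois vulcıes mais explosivos do planeta";
// strtr() replaces every key found in $word with its mapped value.
echo strtr($word, $arr);
?>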

See working on Ideone
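
The same mapping idea can be applied to the whole XML document instead of a single string. Below is a minimal sketch in Python (used only for illustration; the helper name `fix_xml_text` is hypothetical, and the four pairs are just the ones visible in the question's sample). It parses the XML, repairs each text node, and serializes the tree again so the result stays well-formed:

```python
import xml.etree.ElementTree as ET

# Garbled character -> intended character (only the pairs seen in the sample).
FIXES = str.maketrans({"Ø": "é", "Ŀ": "ç", "ª": "ã", "ı": "õ"})


def fix_xml_text(xml_bytes: bytes) -> str:
    """Parse the XML, repair text nodes with the map, return XML as a string."""
    root = ET.fromstring(xml_bytes)
    for element in root.iter():
        if element.text:
            element.text = element.text.translate(FIXES)
        if element.tail:
            element.tail = element.tail.translate(FIXES)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<test>\n"
    "<title>Muitos cientistas intrØpidos se aventura no coraĿªo "
    "dos dois vulcıes mais explosivos do planeta</title>\n"
    "</test>"
).encode("utf-8")

print(fix_xml_text(sample))
```

Attribute values are not touched in this sketch, and any character that legitimately appears in the data (for example a real "ª") would be replaced as well, so the map should be checked against real responses before relying on it.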

  • I think it should be "Ø" => "é".

  • 1

    It’s a very unusual word to use, but I personally knew it from some RPG narrations: https://www.dicio.com.br/intrepido/

  • I gave -1 because this does not solve the problem; it suggests something that will cause many future problems in the routine that processes the data.

  • @jlHertel on the contrary: if he does not have access to the web service, the only option is to process the data he receives... My answer is one possible solution... In fact, among the answers presented, it is the only solution... Everyone explains why not to do it, but no one has offered a real solution.

  • Ok, it may be a solution, but if any new character ever appears it will require routine maintenance. So, this becomes unsustainable.

  • At what point does the question mention later maintenance? Are you saying that when a problem like this appears, you do not look for a solution? That you wait until you have access to the web service? I focus on how to solve the problem, not on pointing out where it is; the OP had already pointed that out... Your downvote is unfair.

  • Thank you @Magichat. From what I saw in the comments, you are right: you proposed something that solves the problem (you read the question) and gave your opinion. I believe this is the only solution I have, even if it means more maintenance in the near future.

  • 1

    The hardest part (and it is not that hard) is building the map: collecting all the characters the web service sends incorrectly and identifying the correct character for each is fairly easy. However, if you need to deliver valid XML, you will have to generate a new valid XML document after the changes, but that is not a big deal either...

  • Given the current situation, it is the best alternative. And, in fact, there are not so many characters as to make building a mapping impossible. It will be tedious, but it will solve a problem that, according to the author, cannot be solved at the source.

4

You are already receiving the data incorrectly, so it will be virtually impossible to detect the encoding the information is arriving in.

My recommendation is that you check the routine that generates the XML and fix it there.

Some important details: the XML standard declares the character encoding in the document header, via <?xml version="1.0" encoding="UTF-8"?>. The routine that reads this data must respect this declaration and use it to decode the data. If the actual encoding differs from the one declared in the file, the XML is invalid.
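
A short Python sketch (illustration only, built from the sample in the question) of why detection does not help here: the bytes received are valid UTF-8, so a parser that honours the declaration accepts them and simply returns the wrong characters.

```python
import xml.etree.ElementTree as ET

received = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<test><title>intrØpidos</title></test>"
).encode("utf-8")

# Decoding succeeds: nothing at the byte level reveals the earlier mistake.
received.decode("utf-8")

# The parser honours the declared encoding and returns the garbled text as-is.
title = ET.fromstring(received).findtext("title")
print(title)
```

In other words, the damage appears to have happened before the XML was serialized, which is why no error is raised on the receiving side.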

  • I believe you confirmed what I was thinking: this kind of problem cannot be fixed after the XML is generated, and the solution is to change the web service, which I have no access to. If there is no other solution, that settles it. Thank you.

2

When I had this problem, the file that generated my XML was saved with ANSI encoding. I just changed the file format to UTF-8 and it worked.

  • 1

    I do not have access to the web service that returns the XML; the data already arrives this way. The question is how to handle the encoding after the XML has been generated.

  • Excuse me, I understood that you were the provider of the XML content. In that case, substitution is the only option, as suggested by @magichat: echo strtr($xml->saveXML(), array("Ø" => "é", "Ŀ" => "ç", "ª" => "ã", "ı" => "õ")); But if your XML legitimately contains these special characters, as in <title>In 1962, Armstrong became an astronaut and four years later participated for the first time in a space mission</title>, the "ª" of an ordinal such as "1ª" is not incorrect but would be replaced as well, so try mapping combinations of possible syllables instead, such as "Ŀª" => "çã".

  • Thanks for the suggestion.

1


A few typical mistakes lead to this kind of situation. In this case, Latin-1 text was converted to UTF-8 while being treated as something else (here, ISO6937).

Solution with iconv:

iconv -f utf8 -t ISO6937 x.xml | iconv -f latin1 -t utf8
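
To illustrate what that pipeline undoes, here is a rough Python sketch. Python's standard library does not seem to include an ISO 6937 codec, so the sketch uses a hypothetical partial table, `GARBLED_TO_BYTE`, built only from the four pairs visible in the question (for example, é is byte 0xE9 in Latin-1, and the sample shows Ø in its place). Step one turns each character back into the original byte; step two decodes those bytes as Latin-1.

```python
# Garbled character -> original byte. Partial and hypothetical: only the
# four pairs that can be read off the question's sample are listed.
GARBLED_TO_BYTE = {"Ø": 0xE9, "Ŀ": 0xE7, "ª": 0xE3, "ı": 0xF5}


def undo_misdecoding(text: str) -> str:
    """Rebuild the original bytes, then decode them as Latin-1."""
    raw = bytearray()
    for ch in text:
        if ch in GARBLED_TO_BYTE:
            raw.append(GARBLED_TO_BYTE[ch])
        elif ord(ch) < 0x80:
            raw.append(ord(ch))  # ASCII is assumed to be unaffected
        else:
            raise ValueError(f"no mapping known for {ch!r}")
    return raw.decode("latin-1")


print(undo_misdecoding("intrØpidos, coraĿªo, vulcıes"))
```

With a complete table this would be equivalent in spirit to the two iconv calls; with the partial table it only covers the characters seen so far and raises an error for anything else.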

Explanation: how do we arrive at this miraculous "ISO6937"? Reverse the process with every known encoding and see which ones give the right result!

1: Which encodings does iconv know? -- iconv -l

2: Reverse the process for each of them (creating one file _<encoding> per attempt):

# Try every encoding iconv knows; -c drops characters it cannot convert.
for a in $(iconv -l | cut -d/ -f1)
do
   iconv -c -f utf8 -t "$a" x.xml | iconv -f latin1 -t utf8 > "_$a"
done

3: Look for the outputs that produced the desired result (and choose one):

$ grep -l 'intrépido.*coração.*vulcões' _* 
_ANSI_X3.110
_ANSI_X3.110-1983
_CSA_T500
_CSISO103T618BIT
_CSISO90
_CSISO99NAPLPS
_ISO6937
_ISO_6937
_ISO_6937:1992
.... 

Some of these names are aliases of the same charset.
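
The same trial-and-error search can be sketched in Python. Since an ISO 6937 codec is assumed to be unavailable there, this example garbles a word on purpose with a different codec (`mac_roman`, chosen arbitrarily) and then recovers the codec name by trying a short, hypothetical candidate list. It illustrates the method rather than reproducing the result above.

```python
def find_wrong_codecs(garbled, expected, candidates):
    """Return the candidate codecs that turn `garbled` back into `expected`."""
    matches = []
    for codec in candidates:
        try:
            # Undo the wrong decoding, then read the bytes as Latin-1.
            repaired = garbled.encode(codec).decode("latin-1")
        except UnicodeEncodeError:
            continue  # this codec cannot represent the garbled text
        if repaired == expected:
            matches.append(codec)
    return matches


# Simulate the bug: Latin-1 bytes wrongly read as mac_roman.
garbled = "coração".encode("latin-1").decode("mac_roman")
candidates = ["cp437", "cp850", "cp1252", "mac_roman", "iso8859_15"]
print(find_wrong_codecs(garbled, "coração", candidates))
```

As with the grep step, a known-good word is needed to recognise a successful attempt, and more than one codec name may match when several are aliases of the same charset.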
