Handling encoding in XML

I’m trying to load XML from a URL with the function simplexml_load_file('url'), but all I get is false when I inspect the result with var_dump.

$xml = simplexml_load_file('http://www.cinemark.com.br/programacao.xml');

var_dump($xml);

Return:

bool(false)

What other approach can I use to process this XML besides this function, and what would be the "correct" way to use this function?

After enabling PHP error display, I get the following warning:

Warning: simplexml_load_file(): http://www.cinemark.com.br/programacao.xml:1: parser error : Start tag expected, '<' not found

Warning: simplexml_load_file(): ...

I believe it’s an encoding problem. I already tried calling header() with charset=utf-8, but the error is still the one shown above.

Calling mb_detect_encoding returns: "UTF-8"
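
The parser warnings quoted above can also be collected in code rather than read from the error display. The following is a minimal sketch, not the asker's code: `parse_xml_with_errors` is a hypothetical helper name, and the two input strings are made-up samples so the block runs without touching the network.

```php
<?php
// Hypothetical helper (not from the original post): parse an XML string and
// return both the parse result and the libxml error messages, if any.
function parse_xml_with_errors(string $data): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($data);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Leave the global libxml state as we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['xml' => $xml, 'errors' => $messages];
}

// Sample inputs (assumptions for illustration only).
$good = parse_xml_with_errors('<?xml version="1.0" encoding="ISO-8859-1"?><root><item>ok</item></root>');
$bad  = parse_xml_with_errors('this payload does not start with a tag');

var_dump($good['xml'] instanceof SimpleXMLElement); // parsed successfully
print_r($bad['errors']);                            // messages explaining the failure
```

With the messages in an array, a script can log them or decide what to try next instead of only seeing bool(false).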

  • I set the header to iso-8859-1, fetched the content with file_get_contents, and then passed the result to simplexml_load_string. The error is the same, only with a different character: " �"

  • Could you write that up as an answer?

  • I recommended that only as a test; for definitive use, another solution is probably better. Perhaps trying DOMDocument or XPath would work better.

  • I tried utf8_decode and utf8_encode on the result of file_get_contents, but without success; the characters remain "scrambled".

  • Have you tried changing the first line of the XML to: <?xml version="1.0" encoding="UTF-8" ?>

  • @jlHertel actually, he would have to change the first line and also convert the encoding (with utf8_encode). I deleted my earlier comments to avoid cluttering the thread, but I had already verified that the original encoding is ISO-8859-1 (not only from the declaration at the top; I also checked the returned bytes to make sure nothing conflicted). In fact, I had no problem reading this XML in other languages, so the problem is more specific to the author’s own setup.

  • @Bacco, but opening the URL in the browser, even with the ISO-8859-1 declaration, shows invalid characters; that is, the browser can read it but detects that the encoding is incorrect. What I’m trying to say is that the function you used in the other language probably ignores this character problem, while the PHP function returns false when it encounters it. Can you confirm whether the characters appear correctly in the other language?

  • What you see on the browser screen is the encoding that the browser assumes. Since this response has no HTTP header stating the encoding (not to be confused with the XML declaration), your browser tries to display it as UTF-8 (after all, there is no HTTP header saying it is ISO). An XML parser importing the file, however, should rely on the declaration rather than the HTTP header. Reading the original file and displaying it as ISO-8859-1, the characters are perfect. The problem is not the encoding but the way the diagnosis is being made.

  • @jlHertel see the page displayed as ANSI: https://i.stack.Imgur.com/2m1hX.png - Displaying this XML in a desktop application also works normally. The best encoding test is to download the data and view it in hexadecimal, for example in an editor like HxD, so the result does not depend on display errors. Looking at the binary data leaves no room for a wrong diagnosis, whereas on-screen output is confusing because the error may be in the display only. (PS: I also tested all of this when the question was asked, with no problem.)

  • @Bacco, in that case, removing the line that declares the charset should make the XML readable. If there is still an error, the charset line could be set to ANSI itself.

  • @jlHertel I get the impression that the loader he is using (simplexml) only understands UTF-8, so the conversion plus your declaration swap should solve it. Out of curiosity, here is a hex dump of the top of the page, including all headers: https://i.stack.Imgur.com/Qowu2.png - You can see the encoding more clearly there. (This was downloaded directly via HTTP over a raw socket and saved to the hard drive just for convenience, without any filter, i.e. no client-side conversion.)

  • @jlHertel I believe "changing" the first line would not help, because the data already arrives with scrambled characters when I receive the response. I accept suggestions of other classes/functions as well; let’s not get stuck on that one function.
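
The hexadecimal check suggested in the comments can be done from PHP itself, without a separate hex editor. This is a rough sketch only: `describe_leading_bytes` is a hypothetical helper, the sample payloads are invented so the block runs offline, and the gzip test is an assumption about what a compressed response would look like (gzip streams begin with the bytes 1f 8b).

```php
<?php
// Hypothetical helper (not from the thread): show the first bytes of a payload
// in hex and make a rough guess at what kind of data it is.
function describe_leading_bytes(string $data, int $count = 16): array
{
    $head = substr($data, 0, $count);

    if (strncmp($data, "\x1f\x8b", 2) === 0) {
        $kind = 'gzip';      // gzip magic number: the body is compressed
    } elseif (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $kind = 'utf8-bom';  // UTF-8 byte order mark before the document
    } elseif ($data !== '' && $data[0] === '<') {
        $kind = 'markup';    // starts with '<', as an XML parser expects
    } else {
        $kind = 'unknown';
    }

    // "3c3f78..." becomes "3c 3f 78 ..." for easier reading.
    $hex = implode(' ', str_split(bin2hex($head), 2));

    return ['hex' => $hex, 'kind' => $kind];
}

// Sample payloads (assumptions for illustration only).
$plain = describe_leading_bytes('<?xml version="1.0"?>', 5);
$gzip  = describe_leading_bytes("\x1f\x8b\x08\x00rest-of-stream", 4);

print_r($plain); // hex: "3c 3f 78 6d 6c", kind: "markup"
print_r($gzip);  // hex: "1f 8b 08 00", kind: "gzip"
```

In practice the string passed in would be the raw result of file_get_contents or curl_exec, inspected before any parser or browser has a chance to reinterpret it.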

1 answer

It seems to me that the server always forces data compression, sending the data uncompressed only when the client indicates that it does not support compression.

You can tell the server that the client does not support data compression and receive the plain XML. The following code does this:

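<?php

header('Content-Type: text/html; charset=ISO-8859-1');

$curl = curl_init('http://www.cinemark.com.br/programacao.xml');

// Return the response body as a string instead of printing it.
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

/* This is the option that solves the problem: it sends "Accept-Encoding: identity" */
curl_setopt($curl, CURLOPT_ENCODING, 'identity');

$conteudo = curl_exec($curl);

// curl_exec returns false on failure; stop before handing that to the parser.
if ($conteudo === false) {
    die('cURL error: ' . curl_error($curl));
}

var_dump($conteudo);

$xml = simplexml_load_string($conteudo);

var_dump($xml);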
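
If the server were to send gzip data regardless of the request, another option would be to decompress the body before parsing it. The sketch below assumes the compression is gzip and that PHP's zlib extension is available; `maybe_gunzip` is a hypothetical helper, and the sample document is compressed locally so the block runs without the remote URL.

```php
<?php
// Hypothetical helper (not from the answer): return the payload uncompressed,
// decoding it first when it starts with the gzip magic number (1f 8b).
function maybe_gunzip(string $data): string
{
    if (strncmp($data, "\x1f\x8b", 2) !== 0) {
        return $data; // not gzip, hand it back untouched
    }

    $decoded = gzdecode($data);
    if ($decoded === false) {
        throw new RuntimeException('Payload looks like gzip but could not be decoded');
    }

    return $decoded;
}

// Sample data (an assumption for illustration): a small document compressed
// locally to stand in for a compressed HTTP response body.
$original   = '<?xml version="1.0" encoding="ISO-8859-1"?><root><item>ok</item></root>';
$compressed = gzencode($original);

$xml = simplexml_load_string(maybe_gunzip($compressed));
var_dump((string) $xml->item);
```

With cURL specifically, setting CURLOPT_ENCODING to an empty string is another route worth trying: cURL then advertises every encoding it supports and decodes the response itself.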

  • That solves it! Thanks. Which class/function would you recommend for handling XML? I am using the simplexml_ functions, but it feels like I’m writing a lot of code, and I suspect there is something that makes this simpler!
