Error reading html page with Html Agility Pack

3

I’m reading an HTML page using Html Agility Pack. When I run the code on my laptop, it works perfectly. The problem appears only when I run it on Windows Phone 7.1.

Accented characters (such as ç) come back encoded. The strange thing is that the same code downloads two pages, both containing accented words, yet only one of them returns text that differs from what the page displays.

Code to load the file

    CustomEncoding enc = new CustomEncoding();
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.OptionDefaultStreamEncoding = enc; //CustomEncoding.Unicode;

Code to execute the download

    HtmlWeb web = new HtmlWeb();
    //CustomEncoding is "iso-8859-2"
    web.LoadCompleted += web_LoadCompleted;
    web.LoadAsync(_filme.Detalhes, enc);// GetEncoding("iso-8859-1"));

I use the InnerHtml property to retrieve the text.

    void web_LoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        HtmlDocument document = new HtmlDocument();
        document.OptionDefaultStreamEncoding = CustomEncoding.Unicode;// System.Text.Encoding.UTF8;
        //document.LoadHtml(e.Document);
        HtmlNode html = GetNodeByName(e.Document.DocumentNode, "html");
        HtmlNode body = GetNodeByName(html, "body");
        HtmlNode allOut = GetNodeById(body, "all-out");
        HtmlNode allIn = GetNodeById(allOut, "all-in");
        HtmlNode content = GetNodeById(allIn, "content");
        HtmlNode lojas = GetNodeById(content, "lojas");
        HtmlNode leftSideMovie = GetNodeById(lojas, "left-side-movie");
        HtmlNode infoLoja = GetNodeById(leftSideMovie, "info-loja");

        HtmlNode censuraNode = GetNodeById(infoLoja, "censura-3d-leg-dub");
        HtmlNode sinopseNode = GetNodeById(infoLoja, "sinopse");
        HtmlNode marmota = GetNodeByNameAndClass(sinopseNode, "div", "margin_20b");
        HtmlNode preSinopseNode = marmota;

        // walk down through the children until the last nested "marmota" div is found
        while (marmota != null)
        {
            preSinopseNode = marmota;
            marmota = GetNodeByNameAndClass(marmota, "div", "margin_20b");
        }

        string sinopse;
        try
        {
            // TODO: remove the try and refactor to store each method call in a variable
            // try the span first
            _filme.Descricao = GetNodeByName(GetNodeByName(preSinopseNode, "p"), "span").InnerHtml;
        }
        catch (Exception ex)
        {
            _filme.Descricao = GetNodeByName(preSinopseNode, "p").InnerHtml;
        }
    }

Page used

I will add just one of the methods, as they are all very similar.

    private HtmlNode GetNodeByName(HtmlNode root, string node)
    {
        foreach (HtmlNode link in root.ChildNodes)
            if (link.Name.Equals(node))
                return link;
        return null;
    }

One of the links that discusses CustomEncoding

2 answers

1

I was building a simple HTML and CSV reader and ran into the same problem. I solved it by replacing UTF8 with iso-8859-1, so I suggest you change your encoding as follows.

Code to load the file

    HtmlDocument document = new HtmlDocument();
    document.OptionDefaultStreamEncoding = Encoding.GetEncoding("iso-8859-1");

Code to run the download

    HtmlWeb web = new HtmlWeb();
    web.LoadCompleted += web_LoadCompleted;
    web.LoadAsync(_filme.Detalhes, Encoding.GetEncoding("iso-8859-1"));

You can find more information about this encoding at http://en.wikipedia.org/wiki/ISO/IEC_8859-1

  • It didn’t work, Paulo. You did it on Windows Phone 7.1?

  • No, it was a console application.

  • That’s the thing: my code worked perfectly in a console application with UTF8. The problem occurs only on Windows Phone.

  • Doesn’t Windows Phone use a different encoding?

  • I found similar questions, in English only, but I couldn’t solve the problem with them. They said I would have to use a program to generate an encoding class. But the strange thing is that this happens with one page and not the other.

  • https://htmlagilitypack.codeplex.com/discussions/394082

  • I’ve never worked with custom encoding. But it seems to be the way.

  • Could you publish an example of the project you’re working on?

  • I’ve tried iso-8859-1, iso-8859-2, Encoding.GetEncoding, UTF8, and Unicode, and I still don’t understand what’s going on. Yes, I can. I’ll prepare it here.

  • Updated, Paulo.


0


Once again my theory proves correct.
When you can’t find a solution, it’s probably right in front of your nose.

This was the solution: decode the HTML entities in the retrieved text, without having to configure the encoding.

    System.Web.HttpUtility.HtmlDecode
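
A minimal sketch of why this helps, assuming the "encoded" characters were HTML entities (for example `&ccedil;`) written in the page source itself, which would also explain why only one of the two pages was affected. InnerHtml returns markup as written, so entities stay encoded until they are decoded explicitly. The sketch uses `System.Net.WebUtility.HtmlDecode` as a stand-in so it can run as a plain console program; the class name, the `DecodeInnerHtml` helper and the sample string are hypothetical, and the namespace that hosts `HtmlDecode` may differ on your platform.

```csharp
using System;
using System.Net;

public static class EntityDecodingExample
{
    // Hypothetical helper: converts entity-encoded markup text into display text.
    // WebUtility.HtmlDecode is a stand-in for the HttpUtility.HtmlDecode call above.
    public static string DecodeInnerHtml(string innerHtml)
    {
        return WebUtility.HtmlDecode(innerHtml);
    }

    public static void Main()
    {
        // Made-up sample mixing named and numeric entities, as a page might contain.
        string raw = "A&ccedil;&atilde;o e emo&#231;&#227;o";

        // Both entity forms are decoded to the accented characters.
        Console.WriteLine(DecodeInnerHtml(raw));
    }
}
```

In the question's handler, the same idea would mean passing the InnerHtml value through the decoder before assigning it to `_filme.Descricao`.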
