Reading special characters (such as accents and cedillas) in the HTML body


I am trying to read a news site, but special characters (such as accents and cedillas) come out wrong.

For example, the HTML source (and the news site itself) shows one text, while my code returns another:

Example 1: "Brazil prohibits people from entering the border with Venezuela". But my code returns: "Brazil proh-be entry of people on the border with Venezuela"

Example 2: "Without tourists and boats, Venice’s water becomes clearer and clearer". But my code returns: "Without tourists and boats, Venice’s water becomes clearer and unlisted"

I saw that one solution would be to use an ADO Stream object, but I couldn’t implement it. Can someone help?

Public Function getHTTP(ByVal Url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", Url, False: .Send
        getHTTP = StrConv(.ResponseBody, vbUnicode)
    End With
End Function

=====================================================================

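Sub analisar()

    Dim Url As String, Html As String, titulo As String
    Dim inicio_titulo As Long, fim_titulo As Long
    Dim c As Range

    Url = "https://g1.globo.com/"
    Html = getHTTP(Url)
    inicio_titulo = 1

    For Each c In Range("A1:A20")

        ' Find the next "title":" marker; stop when there are no more
        inicio_titulo = InStr(inicio_titulo, Html, """title"":""")
        If inicio_titulo = 0 Then Exit For
        inicio_titulo = inicio_titulo + 9   ' skip past the 9-character marker

        ' The title ends where the ","url":" marker begins
        fim_titulo = InStr(inicio_titulo, Html, """,""url"":""")
        If fim_titulo = 0 Then Exit For

        titulo = Mid(Html, inicio_titulo, fim_titulo - inicio_titulo)
        c.Value = titulo

    Next

End Sub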
  • If you set .Charset = "utf-8" on the object created by CreateObject, before the .Open, does it work?

  • No. It raised an invalid property error (error 438).

  • You are using StrConv to convert to Unicode, which causes problems even with Latin accented characters. See this article

  • Also, the way you are doing it, you may run into the maximum number of characters; see how to extract the HTML to a .txt file in this answer

1 answer



First, special thanks to danieltakeshi.

  • I didn’t know about the conversion to hex and then to UTF-8; I was able to reproduce it by following the article.
  • The character-limit overflow that danieltakeshi mentioned may indeed occur. I will write the HTML to a .txt file as suggested.
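For reference, the ADO Stream approach mentioned in the question can be sketched roughly as below. This is only a sketch under assumptions: the function name getHTTPUtf8 is hypothetical, the page is assumed to be served as UTF-8, and the numeric constants stand in for adTypeBinary and adTypeText because the objects are late-bound. Note that Charset is a property of ADODB.Stream rather than of MSXML2.XMLHTTP, which would be consistent with the error 438 reported in the comments.

```vba
' Hypothetical helper: decodes the raw response bytes as UTF-8 via ADODB.Stream.
Public Function getHTTPUtf8(ByVal Url As String) As String
    Dim body() As Byte

    ' Download the raw bytes of the page
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", Url, False
        .Send
        body = .ResponseBody
    End With

    ' Write the bytes into a binary stream, then reread them as UTF-8 text
    With CreateObject("ADODB.Stream")
        .Type = 1               ' adTypeBinary
        .Open
        .Write body
        .Position = 0           ' Type can only be changed at position 0
        .Type = 2               ' adTypeText
        .Charset = "utf-8"      ' assumption: the site is encoded in UTF-8
        getHTTPUtf8 = .ReadText
        .Close
    End With
End Function
```

It would be called the same way as the original function, for example Html = getHTTPUtf8(Url).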

For the specific problem presented, I have achieved an even simpler solution:

Replace:

getHTTP = StrConv(.ResponseBody, vbUnicode)

With:

getHTTP = .ResponseText

I didn’t know about this property. With it there is no need for the StrConv conversion, which interprets the response bytes using the system’s ANSI code page and therefore loses the UTF-8 information.
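Put together, the function from the question would look roughly like this. It is a minimal sketch, not a general solution: as far as I know, ResponseText assumes UTF-8 unless the response carries a Unicode byte-order mark, so a page served in another encoding (for example ISO-8859-1) may still come out wrong.

```vba
Public Function getHTTP(ByVal Url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", Url, False
        .Send
        ' ResponseText is already decoded into a VBA string,
        ' so no StrConv call is needed
        getHTTP = .ResponseText
    End With
End Function
```

The analisar routine does not need to change, since it only calls getHTTP(Url).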
