How to recognize and change the encoding of Latin characters in R?

13

Is there an efficient way to detect the encoding of text downloaded from the internet? I scraped a site (see the code below) and can't find the correct encoding.

The META tag in the page source specifies "iso-8859-1" (latin1), but when I specify that encoding it still doesn't work...

    library(XML); library(httr)
    url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"
    site_gt = content(GET(url))
    resumos_gt = xpathSApply(site_gt,'//div[@style="display:none;"]', xmlValue)
    resumos_gt[1]

The result looks something like: "Estudos Legislativo no Brasil tÃªm concentrado suas pesquisas nos Ã¢mbito federal e estadual". How do I turn tÃªm into também and Ã¢mbito into âmbito?

I tried everything that came to mind, and nothing worked:

    iconv(resumos_gt[1], from="UTF-8", to = "latin1")
    iconv(resumos_gt[1], from="UTF-8", to = "latin2")
    iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15")
    iconv(resumos_gt[1], from="UTF-8", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="UTF-8", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15//TRANSLIT")


    iconv(resumos_gt[1], from="latin1", to = "UTF-8")
    iconv(resumos_gt[1], from="latin2", to = "UTF-8")
    iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8")
    iconv(resumos_gt[1], from="latin1", to = "UTF-8//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "UTF-8//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8//TRANSLIT")

    ####

    iconv(resumos_gt[1], from="latin1", to = "ASCII")
    iconv(resumos_gt[1], from="latin2", to = "ASCII")
    iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
    iconv(resumos_gt[1], from="latin1", to = "ASCII//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "ASCII//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII//TRANSLIT")

    iconv(resumos_gt[1], from="ASCII", to = "latin1")
    iconv(resumos_gt[1], from="ASCII", to = "latin2")
    iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15")
    iconv(resumos_gt[1], from="ASCII", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="ASCII", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15//TRANSLIT")

    ####

    iconv(resumos_gt[1], from="UTF-8", to = "ASCII")
    iconv(resumos_gt[1], from="UTF-8", to = "ASCII//TRANSLIT")

    iconv(resumos_gt[1], from="ASCII", to = "UTF-8")
    iconv(resumos_gt[1], from="ASCII", to = "UTF-8//TRANSLIT")


    ####

    iconv(resumos_gt[1], from="latin1", to = "latin2")
    iconv(resumos_gt[1], from="latin1", to = "iso-8859-15")
    iconv(resumos_gt[1], from="latin1", to = "latin2//TRANSLIT")
    iconv(resumos_gt[1], from="latin1", to = "iso-8859-15//TRANSLIT")

    iconv(resumos_gt[1], from="latin2", to = "latin1")
    iconv(resumos_gt[1], from="latin2", to = "iso-8859-15")
    iconv(resumos_gt[1], from="latin2", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="latin2", to = "iso-8859-15//TRANSLIT")

    iconv(resumos_gt[1], from="iso-8859-15", to = "latin1")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin2")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin1//TRANSLIT")
    iconv(resumos_gt[1], from="iso-8859-15", to = "latin2//TRANSLIT")

I am using R 3.2.5 on Windows 7 (and yes, I have to stay on this operating system; apparently this problem does not occur on Linux, or is easier to solve there).

  • Just one detail: tÃªm is têm, not também. If it were também, you would see the letters t a m b m displayed normally.

  • Your example didn't run here: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "c('xml_document', 'xml_node')".

  • Indeed, @Molx, tÃªm is têm... hehe. But I couldn't reproduce the error you got with xpathSApply. In principle, the content function from the httr package returns an object that the xpathSApply function from the XML package can read.

2 answers

10


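    library(XML); library(httr)

    url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"

    # Download the page and keep the body as text instead of letting httr parse it
    site_gt = GET(url)
    site_gt = content(site_gt, as = "text")

    # Parse the HTML, stating the encoding explicitly
    site_gt <- htmlParse(site_gt, encoding = "UTF-8")

    # Extract the text of the hidden <div> elements
    resumos_gt = xpathSApply(site_gt, '//div[@style="display:none;"]', xmlValue)
    resumos_gt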

The solution was to first read the page content as text and then parse it with htmlParse, specifying UTF-8 as the encoding.
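To illustrate why the encoding passed to the parser matters, here is a minimal sketch of the failure mode, assuming the bytes are really UTF-8 but get interpreted as latin1. The sample word and the variable names are illustrative and do not come from the scraped page; the code only uses base R.

```r
# Illustrative word, written with an escape so the script itself is encoding-safe
original <- enc2utf8("t\u00eam")            # "têm", stored as UTF-8
print(charToRaw(original))                  # 74 c3 aa 6d

# Simulate the mistake: take the UTF-8 bytes and declare them to be latin1
mislabeled <- rawToChar(charToRaw(original))
Encoding(mislabeled) <- "latin1"
mojibake <- enc2utf8(mislabeled)            # each UTF-8 byte becomes its own character

# Undo it: write the characters back out as latin1 bytes, then relabel them as UTF-8
repaired <- iconv(mojibake, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

print(mojibake)
print(identical(repaired, original))
```

Repairing after the fact like this only works when every character survived the round trip, so declaring the right encoding at parse time, as in the answer above, is usually the safer route.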

  • Thank you, @Denisson Silva. It worked here. One observation: your code doesn't define the url object.

  • 1

    No, @Rogeriojb. The url definition is already included.

0

Sometimes a text contains more than one language and so appears to have "more than one encoding", as in the example below. You can check this with the function:

stringi::stri_enc_detect(as.character(msg$body))

Encoding     Language   Confidence
UTF-8                   1.00
ISO-8859-1   pt         0.49
ISO-8859-2   ro         0.28
UTF-16BE                0.10
UTF-16LE                0.10
ISO-8859-9   tr         0.10
Shift_JIS    ja         0.10
GB18030      zh         0.10
EUC-JP       ja         0.10
EUC-KR       ko         0.10
Big5         zh         0.10

In my case I decided to assume that the primary encoding was 'UTF-8' and that the secondary one was also 'UTF-8', which was certainly true for my data.

iconv(as.character(msg$body), from = "UTF-8", to = "UTF-8")
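As a rough sketch of how the detection step could feed the conversion, the snippet below runs stri_enc_detect on raw bytes and then converts with stri_encode. It assumes the stringi package is installed; the sample sentence and the variable names are illustrative stand-ins for downloaded text, and the detector's ranking is only a heuristic, so a validity check is used before trusting a UTF-8 guess.

```r
library(stringi)

# Illustrative input: a Portuguese sentence held as raw UTF-8 bytes,
# standing in for text downloaded from a page
sentence <- enc2utf8("Estudos legislativos no Brasil t\u00eam concentrado suas pesquisas nos \u00e2mbitos federal e estadual")
bytes <- charToRaw(sentence)

# stri_enc_detect() returns a list with one data frame of guesses per input
guesses <- stri_enc_detect(bytes)[[1]]
print(head(guesses, 3))

# Prefer UTF-8 when the bytes are valid UTF-8;
# otherwise fall back to the top-ranked candidate
source_enc <- if (stri_enc_isutf8(bytes)) "UTF-8" else guesses$Encoding[1]
text <- stri_encode(bytes, from = source_enc, to = "UTF-8")
print(text)
```

Checking validity first is a design choice for this sketch: valid UTF-8 is rarely produced by accident in accented text, whereas a confidence score on a short string can be misleading.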
