Is there an efficient way to detect the encoding of text downloaded from the internet? I scraped a site (see the code below) and cannot find the correct encoding.
The META tag in the page source specifies "iso-8859-1" (latin1), but when I specify that encoding it still doesn't work.
library(XML); library(httr)
url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"
site_gt = content(GET(url))
resumos_gt = xpathSApply(site_gt,'//div[@style="display:none;"]', xmlValue)
resumos_gt[1]
In the result, I get something like: "Estudos Legislativo no Brasil têm concentrado suas pesquisas nos âmbito federal e estadual". How do I make the mis-encoded têm turn into também, and the mis-encoded âmbito turn into âmbito?
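For illustration, here is a minimal base-R sketch of how this kind of garbling typically arises and how it can be undone. It assumes the common case in which UTF-8 bytes were decoded as latin1; that is an assumption, not something confirmed for this site, and the variable names are made up for the example.

```r
# Build the correct word from a code point escape so the example does not
# depend on the encoding of this source file.
correct <- "t\u00eam"                     # "têm", stored as UTF-8
bytes   <- charToRaw(enc2utf8(correct))   # UTF-8 bytes: 74 c3 aa 6d

# Simulate the problem: treat the UTF-8 bytes as if they were latin1.
# Each of the two bytes of "ê" becomes its own character.
garbled <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

# Undo it: convert back to the original bytes, then relabel them as UTF-8.
# (Assumption: the garbling really was UTF-8 read as latin1.)
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

# TRUE when the round trip restored the original bytes
identical(charToRaw(repaired), charToRaw(correct))
```

Note that the conversion alone is not enough in this sketch: without the `Encoding(repaired) <- "UTF-8"` step, R keeps treating the recovered bytes as latin1.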
I tried everything that came to mind, and nothing worked:
iconv(resumos_gt[1], from="UTF-8", to = "latin1")
iconv(resumos_gt[1], from="UTF-8", to = "latin2")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15")
iconv(resumos_gt[1], from="UTF-8", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "UTF-8")
iconv(resumos_gt[1], from="latin2", to = "UTF-8")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8")
iconv(resumos_gt[1], from="latin1", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "ASCII")
iconv(resumos_gt[1], from="latin2", to = "ASCII")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
iconv(resumos_gt[1], from="latin1", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "latin1")
iconv(resumos_gt[1], from="ASCII", to = "latin2")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15")
iconv(resumos_gt[1], from="ASCII", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15//TRANSLIT")
####
iconv(resumos_gt[1], from="UTF-8", to = "ASCII")
iconv(resumos_gt[1], from="UTF-8", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "latin2")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin1", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "latin1")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin2", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2//TRANSLIT")
I am using R 3.2.5 on Windows 7 (and yes, I have to stay on this operating system). Apparently this problem does not occur on Linux, or is easier to solve there.
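On the general question of recognizing an encoding, one rough heuristic is to inspect the raw bytes before any decoding happens. The sketch below is only that, a heuristic: `guess_encoding_simple` is a hypothetical helper, the "latin1" fallback is an assumption (any single-byte encoding would pass the same check), and `validUTF8()` requires R 3.3.0 or later, which is newer than the R 3.2.5 mentioned above.

```r
# Heuristic guess for a raw byte vector (hypothetical helper, base R only).
# Note: rawToChar() fails on embedded nul bytes, so this is for text bodies.
guess_encoding_simple <- function(bytes) {
  stopifnot(is.raw(bytes))
  # Only 7-bit bytes: plain ASCII, valid in every encoding discussed here
  if (all(as.integer(bytes) < 128L)) return("ASCII")
  # Non-ASCII bytes that form valid UTF-8 sequences are very likely UTF-8
  if (validUTF8(rawToChar(bytes))) return("UTF-8")
  # Assumed fallback: some single-byte encoding such as latin1
  "latin1"
}

utf8_bytes   <- charToRaw(enc2utf8("t\u00eam"))  # "têm" as UTF-8: 74 c3 aa 6d
latin1_bytes <- as.raw(c(0x74, 0xea, 0x6d))      # "têm" as latin1: 74 ea 6d

guess_encoding_simple(utf8_bytes)
guess_encoding_simple(latin1_bytes)
guess_encoding_simple(charToRaw("tem"))
```

With httr, the undecoded body can be obtained with `content(resp, as = "raw")`, which gives a raw vector suitable for a check like this.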
Just one detail: the garbled têm is têm, not também. If it were também, you would see the letters t, a, m, b, m displayed normally. – Molx
Your example didn't run here:
Error in UseMethod("xpathApply") : 
 no applicable method for 'xpathApply' applied to an object of class "c('xml_document', 'xml_node')"
– Molx
In fact, @Molx... têm is tem... hehe. But I couldn't reproduce the error you ran into with xpathSApply. In principle, the content function of the httr package returns an object that can be read by the xpathSApply function of the XML package. – RogerioJB
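A possible explanation for the discrepancy, offered as an assumption since neither comment gives package versions: newer httr releases parse HTML with xml2, so content() returns an xml_document, which xpathSApply from the XML package cannot handle. The sketch below sidesteps that by working from the page text and parsing it with XML::htmlParse. The helper name `extract_hidden_divs` and the sample HTML are made up for the example, and the "ISO-8859-1" value in the commented-out part is taken from the META tag mentioned in the question; it may not match what the server actually sends.

```r
library(XML)

# Hypothetical helper: parse HTML text that is already UTF-8 and pull the
# text of the hidden divs, using the same XPath as the question.
extract_hidden_divs <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
  xpathSApply(doc, '//div[@style="display:none;"]', xmlValue)
}

# With httr (not run here, it needs network access):
# resp      <- httr::GET(url)
# html_text <- httr::content(resp, as = "text", encoding = "ISO-8859-1")
# resumos_gt <- extract_hidden_divs(html_text)

# Offline check with a small made-up page
sample_html <- '<html><body><div style="display:none;">t\u00eam</div></body></html>'
extract_hidden_divs(sample_html)
```

Asking content() for text with an explicit encoding makes httr convert the body to UTF-8, so the parser only ever sees one known encoding.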