Is there an efficient way to detect the encoding of text downloaded from the internet? I scraped a site (see code below) and can't find the correct encoding.
The META tag in the page source declares "iso-8859-1" (latin1), but when I specify that encoding it still doesn't work...
library(XML); library(httr)
url = "http://www.encontroabcp2014.cienciapolitica.org.br/site/anaiscomplementares?AREA=8"
site_gt = content(GET(url))
resumos_gt = xpathSApply(site_gt,'//div[@style="display:none;"]', xmlValue)
resumos_gt[1]
In this result, I get something like: "Estudos Legislativo no Brasil tÃªm concentrado suas pesquisas nos Ã¢mbito federal e estadual". How do I turn tÃªm into também and Ã¢mbito into âmbito?
I tried everything that came to mind, and nothing worked:
iconv(resumos_gt[1], from="UTF-8", to = "latin1")
iconv(resumos_gt[1], from="UTF-8", to = "latin2")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15")
iconv(resumos_gt[1], from="UTF-8", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="UTF-8", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "UTF-8")
iconv(resumos_gt[1], from="latin2", to = "UTF-8")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8")
iconv(resumos_gt[1], from="latin1", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "UTF-8//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "ASCII")
iconv(resumos_gt[1], from="latin2", to = "ASCII")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
iconv(resumos_gt[1], from="latin1", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "ASCII")
iconv(resumos_gt[1], from="ASCII", to = "latin1")
iconv(resumos_gt[1], from="ASCII", to = "latin2")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15")
iconv(resumos_gt[1], from="ASCII", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "iso-8859-15//TRANSLIT")
####
iconv(resumos_gt[1], from="UTF-8", to = "ASCII")
iconv(resumos_gt[1], from="UTF-8", to = "ASCII//TRANSLIT")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8")
iconv(resumos_gt[1], from="ASCII", to = "UTF-8//TRANSLIT")
####
iconv(resumos_gt[1], from="latin1", to = "latin2")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin1", to = "latin2//TRANSLIT")
iconv(resumos_gt[1], from="latin1", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "latin1")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15")
iconv(resumos_gt[1], from="latin2", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="latin2", to = "iso-8859-15//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin1//TRANSLIT")
iconv(resumos_gt[1], from="iso-8859-15", to = "latin2//TRANSLIT")
I am using R 3.2.5 on Windows 7 (and yes, I have to stay on this operating system). Apparently this problem does not occur on Linux, or is easier to solve there.
Just one detail: tÃªm is têm, not também. If it were também, you would see the letters t a m b m as normal. – Molx
Your example didn't run here:
Error in UseMethod("xpathApply") :
 no applicable method for 'xpathApply' applied to an object of class "c('xml_document', 'xml_node')"
– Molx
Indeed, @Molx... tÃªm is têm... hehe. But I couldn't reproduce the error you got with xpathSApply. In principle, the content function from the httr package returns an object that the xpathSApply function from the XML package can read. – RogerioJB
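The point made in the comments (tÃªm is the UTF-8 byte sequence for têm read as Latin-1) can be illustrated with a small base-R sketch. It assumes the garbling came from exactly one UTF-8-as-Latin-1 misreading. The helper name `fix_mojibake` and the sample string are hypothetical and not part of the original post. Note that `iconv()` marks a result converted to "latin1" as Latin-1, which may be why a plain `iconv(..., from = "UTF-8", to = "latin1")` still displays two characters until the encoding mark is changed.

```r
# Hypothetical helper (not from the original post): undo text whose UTF-8
# bytes were decoded as Latin-1, e.g. "tÃªm" instead of "têm".
fix_mojibake <- function(x) {
  # Step 1: map each character back to the single byte it came from.
  # Characters outside Latin-1 make iconv() return NA for that element.
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Step 2: re-declare the same bytes as UTF-8 without converting them.
  Encoding(bytes) <- "UTF-8"
  bytes
}

# Sample fragment written with \u escapes so the script itself stays ASCII:
# \u00c3\u00aa is "Ãª" and \u00c3\u00a2 is "Ã¢".
garbled <- "t\u00c3\u00aam concentrado suas pesquisas nos \u00c3\u00a2mbito"
fixed <- fix_mojibake(garbled)
print(fixed)
```

If the helper returns NA, the text was probably not garbled in this particular way (or was decoded with a different single-byte code page), and the original download step is the better place to set the encoding.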