I'm collecting Twitter data with the twitteR package
for R, but the tweets arrive with encoding problems. Does anyone know how to get around this?
library(twitteR)
library(stringr)
library(ROAuth)
library(RCurl)
library(plyr)  # provides laply(), used below
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
setwd("XXXXXXXXX")
download.file(url="http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
cred <- OAuthFactory$new(consumerKey='XXXXXXXXXXXXXX',
consumerSecret='XXXXXXXXXXXX',
requestURL='https://api.twitter.com/oauth/request_token',
accessURL='https://api.twitter.com/oauth/access_token',
authURL='http://api.twitter.com/oauth/authorize')
cred$handshake(cainfo="cacert.pem")
registerTwitterOAuth(cred)
tweets = searchTwitter("#Copa2014", n=200, cainfo="cacert.pem")
Tweets.text = laply(tweets,function(t)t$getText())
The data comes back like this, with broken accents and cedillas:
head(Tweets.text)
[1] "Não fui sorteado dessa vez, mas dia 12/03 começa uma nova fase de vendas... #copa2014"
[2] "RT @obsate: @RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Pia…"
[3] "@RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Piada!"
[4] "Nem pra saude! \"@mordomoeugenio: Bilhão de reais pra ensino público não tem né #copa2014 #JN\""
[5] "RT @soldadonofront: \"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\""
[6] "\"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\""
I’m using:
RStudio 0.98.501
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
PS: The problem apparently occurs only on Windows 7. Following Luis Cipriani's instructions and running the code on Linux, there were no encoding problems. The question remains open: how can the problem be avoided on Windows?
This is a good and quite useful question. Anybody?
– Carlos Cinelli
It looks like the text arrives in UTF-8 and, by mistake, some library or program encodes it a second time. Unfortunately, this is a common error in Python :(
– epx
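To make the comment above concrete, here is a minimal sketch in base R of the two situations that can produce output like "NÃ£o". It is not taken from the thread: the variable names and the helper repair_double_encoding() are hypothetical, and which case applies to the tweets is an assumption that would need checking on the Windows machine. In case 1 the bytes are valid UTF-8 but R has not been told so; in case 2 the text really was converted a second time.

```r
# The word "Não" as UTF-8 bytes; built from raw values so the example does
# not depend on how this script file itself is saved.
utf8_bytes <- as.raw(c(0x4e, 0xc3, 0xa3, 0x6f))

# Case 1: correct UTF-8 bytes with no encoding mark. On Windows such a
# string may be displayed using the native code page and look garbled.
unmarked <- rawToChar(utf8_bytes)
relabelled <- unmarked
Encoding(relabelled) <- "UTF-8"   # only declares the encoding; bytes unchanged

# Case 2: the bytes were wrongly declared Latin-1 and then converted to
# UTF-8, so the string now genuinely contains the characters "Ã" and "£".
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)      # bytes: 4e c3 83 c2 a3 6f

# Hypothetical helper: undo one round of double encoding by converting back
# to Latin-1 bytes and declaring those bytes as UTF-8.
repair_double_encoding <- function(x) {
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(bytes) <- "UTF-8"
  bytes
}
repaired <- repair_double_encoding(garbled)
```

If the tweets fall under case 1, something like Encoding(Tweets.text) <- "UTF-8" may be all that is needed. For case 2, characters outside Latin-1 (such as the "…" in the second tweet) may come back as NA from iconv(), so a Windows-1252 target might be a better fit there; that is untested here.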