How to avoid encoding problems when fetching data from Twitter?

Viewed 6,503 times

4

I’m fetching Twitter data with the twitteR package for R, but the tweets arrive with encoding problems. Does anyone know how to get around this?

library(twitteR)
library(stringr)
library(ROAuth)
library(RCurl)
library(plyr)  # provides laply(), used below

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))


setwd("XXXXXXXXX")

download.file(url="http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")

cred <- OAuthFactory$new(consumerKey='XXXXXXXXXXXXXX',
                         consumerSecret='XXXXXXXXXXXX',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='http://api.twitter.com/oauth/authorize')

cred$handshake(cainfo="cacert.pem")

registerTwitterOAuth(cred)

tweets = searchTwitter("#Copa2014", n=200, cainfo="cacert.pem")

Tweets.text = laply(tweets,function(t)t$getText())

The data comes back like this, with problems in accented characters and cedillas:

head(Tweets.text)
[1] "Não fui sorteado dessa vez, mas dia 12/03 começa uma nova fase de vendas... #copa2014"                                                       
[2] "RT @obsate: @RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Pia…"
[3] "@RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Piada!"            
[4] "Nem pra saude! \"@mordomoeugenio: Bilhão de reais pra ensino público não tem né #copa2014 #JN\""                                           
[5] "RT @soldadonofront: \"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\"" 
[6] "\"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\""   

I’m using:

RStudio 0.98.501
R version 3.0.2 (2013-09-25)
Platform: x86_64-W64-mingw32/x64 (64-bit)

PS: The problem apparently occurs on Windows 7. Following Luis Cipriani’s instructions, I ran the code on Linux and there were no encoding problems. The question remains: how do I avoid the problem on Windows?

  • 1

    This question is good and quite useful. Nobody?

  • 2

    It looks like the text arrives in UTF-8 and, by mistake, some library or program encodes it a second time. Unfortunately, this is a common error in Python too :(

3 answers

4


I don’t know of any way to fix character encoding problems permanently. Several factors make it hard to identify the encoding correctly. On a given web page, the encoding declared to the browser in the META tag (inside the HEAD section) may not be the one actually used; there are also your computer’s locale settings, the default encoding configured in R, and so on.
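To see which of these layers is in play on a given machine, it can help to inspect the session locale and the encoding R has marked on a string. This is a minimal sketch using only base R; the sample string `x` is made up for illustration, and `validUTF8()` assumes R 3.3.0 or later (newer than the version in the question).

```r
# Made-up sample string, written with \u escapes so the script itself
# does not depend on the encoding of the source file.
x <- "N\u00e3o come\u00e7a"

# Locale of the R session; on Windows this is often a code-page locale
# rather than a UTF-8 one.
print(Sys.getlocale("LC_CTYPE"))

# Reports whether the session's native encoding is UTF-8 or Latin-1.
print(l10n_info())

# Encoding mark attached to the string: "UTF-8", "latin1" or "unknown".
print(Encoding(x))

# Raw bytes of the string; in UTF-8 the letter a-tilde is c3 a3.
print(charToRaw(x))

# TRUE if the bytes form valid UTF-8 (assumes R >= 3.3.0).
print(validUTF8(x))
```

Comparing this output between the Windows and Linux machines is one way to narrow down where the difference comes from.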

The general tip is this: Portuguese text is usually encoded as "latin1" or "latin2", so you can try a few conversions between encodings.

See an example using your data:

    tweets <- c("Não fui sorteado dessa vez, mas dia 12/03 começa uma nova fase de vendas... #copa2014",
    "RT @obsate: @RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Pia…",
    "@RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Piada!",
    "Nem pra saude! \"@mordomoeugenio: Bilhão de reais pra ensino público não tem né #copa2014 #JN\"",
    "RT @soldadonofront: \"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\"",
    "\"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\"")

Now I execute the following:

    iconv(tweets, from="UTF-8", to="latin1//TRANSLIT")

And I get:

    [1] "Não fui sorteado dessa vez, mas dia 12/03 começa uma nova fase de vendas... #copa2014"                                                        
    [2] "RT @obsate: @RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Pia." 
    [3] "@RodP13 @gugakuerten A #Copa2014 virou a Geni; todo mundo bate nela. Agora a copa tem de resolver todos os problemas do BR. Piada!"           
    [4] "Nem pra saude! \"@mordomoeugenio: Bilhão de reais pra ensino público não tem né #copa2014 #JN\""                                              
    [5] "RT @soldadonofront: \"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\""
    [6] "\"@fsouzajrJuca: Quanto mais eu leio sobre esses grupos que protestam contra a Copa, mais eu simpatizo com a #Copa2014.\""       

It worked.

Other options for testing could be:

    iconv(tweets, from="UTF-8", to="latin2//TRANSLIT")
    iconv(tweets, from="UTF-8", to="latin1")
    iconv(tweets, from="UTF-8", to="latin2")
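When trying candidate encodings like this, keep in mind that `iconv()` returns `NA` for any string it cannot convert (unless `sub` or a working `//TRANSLIT` is used), so it is worth counting failures before trusting a result. A small sketch, where the helper name `try_encodings` and the sample input are hypothetical:

```r
# Hypothetical helper: for each target encoding, count how many strings
# iconv() fails to convert (it returns NA on failure).
try_encodings <- function(x, from = "UTF-8", to = c("latin1", "latin2")) {
  sapply(to, function(enc) sum(is.na(iconv(x, from = from, to = enc))))
}

# Made-up sample: the ellipsis (U+2026) exists in neither latin1 nor latin2,
# and a-tilde (U+00E3) exists in latin1 but not in latin2.
sample_tweets <- c("N\u00e3o fui sorteado", "Pia\u2026", "plain ascii")

failures <- try_encodings(sample_tweets)
print(failures)
```

The target encoding with the fewest failures is usually the better candidate, though the exact behaviour of `iconv()` depends on the platform's implementation.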

Did this help?

1

Hello, I ran your code in the following configuration:

RStudio: 0.98.501
R version 3.0.1 (2013-05-16)
platform: Darwin Kernel Version 13.1.0: Thu Jan 16 19:40:37 PST 2014; root:xnu-2422.90.20~2/RELEASE_X86_64 x86_64 (Mac)
Library versions:
    twitteR 1.1.7 (from CRAN)
    stringr 0.6.2
    ROAuth 0.9.3
    RCurl 1.95-4.1
    rjson 0.2.13

And the result was:

> tweets
[[1]]
[1] "sigaCopa2014: 11h10. Arena da Baixada terá evento-teste em março - Via Portal 2014 http://t.co/GmhnF87ji2 #copa2014"

[[2]]
[1] "_copadomundo_: #futebol #copa2014 Eto'o ironiza boatos sobre sua idade usando seus gols no Chelsea como argumento: Gazeta Ata... http://t.co/4YkV7g1Wgv"

[[3]]
[1] "_copadomundo_: #futebol #copa2014 Apesar de empate, Renato Gaúcho aprova quarteto ofensivo do Fluminense: Gazeta Flu encarou ... http://t.co/Co0MmJCEC1"

Note that the accents are fine, so this does not look like a problem on Twitter’s side. Could you post your library versions?

  • My library versions are the same as yours. The only difference I see is the operating system (I am using Windows). Could that be the cause? I will try on Ubuntu when I have time.

  • Yes, it can be. The bad thing about encoding problems is that they can happen in several layers between the operating system and your application.

  • It worked on Ubuntu 12.04.2. I will edit the question mentioning that in Ubuntu it worked.

0

I have dealt with this exact situation and found that the cause is double encoding. The fix is to decode the text twice and then convert it back to the desired encoding.

PROBLEM: DOUBLE ENCODING

campo1_txt = iconv(Tweets.text, to="latin1", from="utf-8")

FIX IT WITH DOUBLE-DECODING

campo1_txt = iconv(campo1_txt, to="latin1", from="utf-8")

RECODE TO GET THE RIGHT INFORMATION

campo1_txt = iconv(campo1_txt, to="UTF-8", from="latin1")

From this point on you can proceed normally, since the text will be correctly encoded.
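The three steps can be checked on a small, artificially garbled string. This is only a sketch under assumptions: the sample text and the variable names are made up, and the garbling is simulated by reading UTF-8 bytes as latin1. On Windows the extra layer may be Windows-1252 rather than latin1, in which case characters such as "…" may not round-trip this way.

```r
# Made-up sample text in proper UTF-8.
original <- "N\u00e3o come\u00e7a"

# Simulate double encoding: iconv() works on the raw bytes, so the UTF-8
# bytes are read as latin1 and encoded to UTF-8 a second time.
garbled <- iconv(original, from = "latin1", to = "UTF-8")

# Step 1: undo the extra layer; the bytes are now single-encoded UTF-8.
step1 <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: decode again; the text is now genuine latin1.
step2 <- iconv(step1, from = "UTF-8", to = "latin1")

# Step 3: convert back to UTF-8, now correctly encoded and marked.
step3 <- iconv(step2, from = "latin1", to = "UTF-8")

# The repaired string has the same bytes as the original.
print(identical(charToRaw(step3), charToRaw(original)))
```

If any step returns `NA`, the text was probably not double-encoded in the way assumed here, and the conversion should not be applied blindly.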
