Error "invalid input '...' in 'utf8towcs'" with read.csv

11

I have a database in a .csv file that gathers posts from both Facebook and Twitter. To read the database into R, I used the following code:

 bancodedados <- read.csv("nomedobanco.csv", sep=";", encoding="UTF-8")

The code loads the database almost to the end, but then an error interrupts the read:

invalid input 'RT @jmlara02: @Lizcorreaa Comrade define multicentric comrade. ẓà @90Javier @Nicolasmaduro' in 'utf8towcs'

Searching the Internet, I saw that the problem is fairly common. It is caused by characters that are not valid in the encoding specified in my code (UTF-8); in this case the offending character shows up as "?".

Some proposed solutions I found on the Internet:

  • Manually remove the characters from the original database. I dismissed this option because the database is very large and the computer does not have much RAM.
  • Use the tryCatch() function, R's error handling mechanism, to ignore this error and proceed with the read. This looked like the most promising option, but the function is rather unfriendly to use. I also tried the "debug" package from CRAN and did not find it much better than the default.
  • Load the file into a VCorpus through the CRAN "tm" package. I actually managed to load the data this way; however, it did not come in data frame format, i.e. it was just the raw .csv text.

So the question that remains is:

Is solution 2 really the best? If so, how do I implement tryCatch() together with read.csv() to ignore the error and finish reading the database?
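For reference, this is a minimal sketch of what wrapping read.csv() in tryCatch() might look like. The helper name ler_com_seguranca and the file name are hypothetical, chosen only for illustration. Note that tryCatch() does not resume an interrupted read: when the error is raised, the call is abandoned and the handler's value is returned instead, so this pattern reports the failure but does not by itself finish loading the file.

```r
# Hypothetical helper (not part of any package): returns the data frame,
# or NULL if read.csv() raises an error.
ler_com_seguranca <- function(caminho) {
  tryCatch(
    read.csv(caminho, sep = ";", encoding = "UTF-8"),
    error = function(e) {
      # conditionMessage() extracts the text of the error that was caught
      message("Reading failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Illustrative call with a file name that is assumed not to exist,
# so the error handler is the branch that runs.
resultado <- ler_com_seguranca("arquivo_que_nao_existe.csv")
```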

If someone has a guide to error handling in Portuguese, that would also help.

Some links about the problem:

https://stackoverflow.com/questions/26143270/read-umlaut-from-csv-file-in-rattle

http://minimalr.com/2013/01/06/tolower-error-catching-unmappable-characters/

https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs

  • This whole encoding thing is always complicated. Without seeing the data it is difficult to give a definite answer, but something along the following lines would probably work: (i) read the csv using readLines(), which produces a large character vector in R; (ii) convert the text to the correct encoding using iconv(); (iii) convert it to a data.frame using read.table(text = objeto_que_voce_criou_com_readLines). That should work.

  • Have you tried saving the file as .txt and loading it into a data frame? Remember to set the UTF-8 encoding. Please provide a small sample of the file so that we can reproduce the situation.

  • Eric, did the suggestion above work? Did you manage to solve the problem?
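The three steps from the first comment (readLines(), iconv(), read.table()) could be sketched as follows. This is only an illustration under stated assumptions: the temporary file, its two columns and the single invalid byte stand in for the real database, which was never posted. The sub = "" argument of iconv() drops bytes that cannot be converted, and the quote and comment.char settings are a guess at what tweet text needs, since the defaults of read.table() treat "#" and "'" specially.

```r
# Build a small ';'-separated file containing one byte that is not valid
# UTF-8 (0xE9 on its own). This is made-up data standing in for the real file.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(charToRaw("id;texto\n1;ol"), con)
writeBin(as.raw(0xE9), con)
writeBin(charToRaw(" mundo\n2;tudo bem\n"), con)
close(con)

# (i) read the file as plain lines, without parsing it as a table
linhas <- readLines(tmp, encoding = "UTF-8", warn = FALSE)

# (ii) re-encode UTF-8 to UTF-8; sub = "" removes bytes that are not valid
linhas_limpas <- iconv(linhas, from = "UTF-8", to = "UTF-8", sub = "")

# (iii) parse the cleaned text as a ';'-separated table
# comment.char = "" keeps text after '#', quote = "\"" ignores apostrophes
bancodedados <- read.table(text = linhas_limpas, sep = ";", header = TRUE,
                           quote = "\"", comment.char = "",
                           stringsAsFactors = FALSE)
print(bancodedados)
```

Because readLines() holds the whole file in memory as text before parsing, this approach may be heavy for a very large database on a machine with little RAM.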

2 answers

0

Hello! I use file.choose() so that I can select the file interactively.

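# dplyr is not needed by read.csv2(); install and load it only if you use it later
install.packages("dplyr")
library(dplyr)

# read.csv2() uses ";" as the separator; file.choose() opens a file picker
dados <- read.csv2(file.choose(), header = TRUE)
print(dados)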

read.csv2() already uses ";" as the field separator, and the header argument (TRUE or FALSE, abbreviated T or F) indicates whether the first row of the file contains the column names.
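As a small illustration of those defaults, the sketch below parses an inline string instead of a file. The column names and values are invented for the example. Besides sep = ";", read.csv2() also assumes a comma as the decimal mark.

```r
# Made-up data passed through the text argument instead of a file
exemplo <- read.csv2(text = "nome;valor\nAna;1,5\nBeto;2,25", header = TRUE)

# 'valor' is parsed as numeric because "," is the default decimal mark
str(exemplo)
```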

I hope it helped.

-1

Allow me to 'update' the topic with a possible solution.

Try the following:

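install.packages('stringr')
library(stringr)
# conteudo_do_tweet: character vector holding the tweet text
txt.tmp <- str_replace_all(conteudo_do_tweet, "[^[:graph:]]", " ")

The call above replaces every character in the tweet that is not a visible, printable character (anything outside the [:graph:] class) with a space.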
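To see what that pattern does without installing anything, here is a small sketch that swaps in base R's gsub() for str_replace_all(). The sample string is invented, and the two functions use different regex engines, so results on non-ASCII text may differ between them.

```r
# Invented sample containing a tab, a newline and a control character
tweet <- "RT @user:\tola\nmundo\001"

# Replace everything outside [:graph:] (whitespace, control characters) by a space
limpo <- gsub("[^[:graph:]]", " ", tweet)
print(limpo)
```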

  • 1

    Jhonatas, your answer doesn't solve the problem directly, because the OP couldn't load the data into R in the first place. Maybe it is a solution in conjunction with Carlos' suggestion in the comments, but it can't be tested without a reproducible data sample.

  • Hmm, true. I see what you mean. Perhaps I could try this solution in the application that generates the CSV. Anyway, I think Eric has either given up or already found the solution to the problem, hehe.
