Error "invalid input '...' in 'utf8towcs'" with "read.csv"




I have a database in .csv format that gathers posts from both Facebook and Twitter. To read the database into R, the code I have used is

 bancodedados <- read.csv("nomedobanco.csv", sep=";", encoding="UTF-8")

The code loads the database almost to the end, but then an error interrupts the reading:

invalid input 'RT @jmlara02: @Lizcorreaa Comrade define multicentric comrade. ẓà @90Javier @Nicolasmaduro' in 'utf8towcs'

Searching the Internet, I saw that the problem is fairly common. It is caused by characters that are not recognized in the encoding given in my code (UTF-8); in this case the character shows up as "?".

Some proposals for a solution seen on the Internet:

  • Manually remove the characters from the original file. I dismissed this option because the database is very large and the computer does not have much RAM.
  • Use the tryCatch() function, R's error handling, to ignore this error and proceed with the reading. I think this is the best option, but the code is rather unfriendly to use. I also tried the "debug" package from CRAN and did not find it much better than the default.
  • Load the data through the CRAN "tm" package into a VCorpus. I actually managed to load the database this way; however, it did not come in data frame format, just the raw .csv text.

So the question that remains is:

Is solution 2 really the best? If so, how can I use tryCatch() together with read.csv() to ignore the error and finish reading the database?
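For reference, a minimal sketch of what wrapping read.csv() in tryCatch() looks like. Note that tryCatch() does not let read.csv() resume from the point where it stopped: it catches the error and runs the handler instead, so on its own it cannot "finish" the reading. The helper name read_csv_or_null and the temporary file are hypothetical, used only for illustration.

```r
# Hypothetical helper: returns the data frame, or NULL if read.csv() fails.
read_csv_or_null <- function(path, ...) {
  tryCatch(
    read.csv(path, ...),
    error = function(e) {
      # Report the problem instead of aborting the whole script.
      message("read.csv() failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Small temporary file standing in for the real database (assumption).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;texto", "1;primeiro", "2;segundo"), tmp)

ok <- read_csv_or_null(tmp, sep = ";", encoding = "UTF-8")

# A path that does not exist triggers the error handler.
falhou <- read_csv_or_null(file.path(tempdir(), "does_not_exist.csv"), sep = ";")
```

Here ok is a data frame and falhou is NULL, so the caller can decide what to do next, for example trying another encoding.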

If someone has a guide to error handling in Portuguese, that would also help.

Some links about the problem:

  • This whole encoding thing is always complicated. Without seeing the data it is difficult to give a right answer, but probably something in the following lines would work: (i) read the csv using readLines(). This will generate a large text object in R. (ii) convert the text to the correct encoding using iconv(). (iii) convert to data.frame using read.table(text = objeto_que_voce_criou_com_readLines). That should work.

  • Have you tried saving it as .txt and loading it into a data frame? Remember to set the UTF-8 encoding. Please provide a small example of the file so that we can reproduce the situation.

  • Eric, did the suggestion above work? Did you manage to solve the problem?
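The three steps suggested in the first comment (readLines(), then iconv(), then read.table(text = ...)) can be sketched roughly as follows. This is only an illustration under assumptions: the temporary file and its contents are made up, and from = "UTF-8" with sub = "" simply drops bytes that are not valid UTF-8. If the file is really in another encoding such as latin1, from = "latin1" would be the better choice.

```r
# Made-up file standing in for "nomedobanco.csv": the byte 0xE1 is "á" in
# latin1 but is not valid UTF-8 when followed by a space.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(
  c(charToRaw("id;texto\n1;ol"), as.raw(0xE1), charToRaw(" mundo\n2;tweet #tag\n")),
  con
)
close(con)

# (i) Read the file as plain lines of text.
linhas <- readLines(path, warn = FALSE)

# (ii) Re-encode; sub = "" removes the bytes that cannot be converted.
linhas_limpas <- iconv(linhas, from = "UTF-8", to = "UTF-8", sub = "")

# (iii) Parse the cleaned lines. comment.char = "" keeps "#" hashtags,
# which read.table() would otherwise treat as the start of a comment.
bancodedados <- read.table(
  text = linhas_limpas, sep = ";", header = TRUE,
  quote = "\"", comment.char = "", stringsAsFactors = FALSE
)
```

Dropping bytes loses the accented characters, so this is a last resort when the real encoding of the file cannot be determined.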

2 answers


Hello! I use file.choose() so that I can select the file myself.


dados <- read.csv2(file.choose(), header = TRUE)

read.csv2() already uses ";" as the field separator, and header = TRUE or FALSE sets whether the first row of the file is treated as the header.

I hope this helps.


Allow me to 'update' the topic with a possible solution.

Try the following:

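library(stringr)  # provides str_replace_all()
txt.tmp <- str_replace_all(conteudo_do_tweet, "[^[:graph:]]", " ")

The call above replaces every character in the tweet that is not a graphic (visible, printable) character with a space.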
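As a small illustration of what that pattern does, here is a sketch that uses base R's gsub() in place of stringr, so that it runs without extra packages. That substitution is mine, not the answer's, and the tweet text is made up.

```r
# Made-up tweet text containing a tab and a newline.
conteudo_do_tweet <- "RT @user:\tdefine\nmulticentric"

# Replace every character outside [:graph:] (control characters and
# whitespace included) with a single space.
txt.tmp <- gsub("[^[:graph:]]", " ", conteudo_do_tweet)
```

Ordinary spaces are also outside [:graph:], so they are replaced by spaces and the text stays readable.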

  • Jhonatas, your answer doesn't solve the problem directly, because the OP couldn't load the data into R in the first place. Maybe it is a solution in conjunction with Carlos' suggestion in the comments, but it can't be tested without a reproducible example of the data.

  • Hmmm, true. I understand what you mean. Perhaps I could try this solution in the application that generates the CSV. Anyway, I think Eric has either given up or already found the solution to the problem.. hehe
