I have a .csv database that gathers posts from both Facebook and Twitter. To read the database into R, I used the following code:
bancodedados <- read.csv("nomedobanco.csv", sep=";", encoding="UTF-8")
The code loads the database almost to the end, but an error interrupts the reading:
invalid input 'RT @jmlara02: @Lizcorreaa Comrade define multicentric comrade. ẓà @90Javier @Nicolasmaduro' in 'utf8towcs'
Searching the Internet, I saw that this problem is fairly common. It is caused by characters not recognized under the encoding declared in my code (UTF-8), which in this case is "?".
Some solutions proposed on the Internet:
- Manually remove the characters from the original file. I dismissed this option because the database is very large and the computer's RAM is not.
- Use the tryCatch() function (R's error handling) to ignore the error and proceed with the reading. This seemed like the best option, but the code is rather unfriendly to use. I also tried the "debug" package from CRAN and did not find it much better than the default.
- Load the data into a VCorpus with the "tm" package from CRAN. I actually managed to load the database this way, but it did not come in data frame format, i.e., it was just the raw .csv content.
So the question that remains is:
Is solution 2 really the best? If so, how do I implement tryCatch() together with read.csv() to ignore the error and finish reading the database?
If someone knows of an error-handling manual in Portuguese, that would also help.
Some links about the problem:
https://stackoverflow.com/questions/26143270/read-umlaut-from-csv-file-in-rattle
http://minimalr.com/2013/01/06/tolower-error-catching-unmappable-characters/
https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs
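On the tryCatch() part of the question: a minimal sketch of how it could wrap read.csv(). Note that tryCatch() cannot resume an interrupted read; it can only catch the error and retry with different settings. The file name, contents, and the latin1 fallback below are all assumptions for illustration:

```r
# Sketch only: tryCatch() cannot skip the bad line and keep reading; it
# catches the error and lets us retry with different settings.
# A tiny throwaway file stands in for the real nomedobanco.csv.
arquivo <- tempfile(fileext = ".csv")
writeLines(c("id;texto", "1;ok", "2;tambem ok"), arquivo)

bancodedados <- tryCatch(
  read.csv(arquivo, sep = ";", encoding = "UTF-8"),
  error = function(e) {
    # Fallback guess: reread assuming the file is actually latin1-encoded
    read.csv(arquivo, sep = ";", fileEncoding = "latin1")
  }
)
nrow(bancodedados)  # 2
```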
This whole encoding business is always complicated. Without seeing the data it is hard to give a definitive answer, but something along the following lines will probably work: (i) read the csv using readLines(). This will generate a large text object in R. (ii) Convert the text to the correct encoding using iconv(). (iii) Convert to a data.frame using read.table(text = objeto_que_voce_criou_com_readLines). That should work. – Carlos Cinelli
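A minimal sketch of that readLines/iconv/read.table pipeline. The real file is not available, so a small stand-in containing one invalid byte is fabricated to make the example self-contained; the semicolon separator matches the question:

```r
# Fabricate a small stand-in file with one byte that is not valid UTF-8,
# which is the kind of input utf8towcs chokes on.
arquivo <- tempfile(fileext = ".csv")
writeLines(c("id;texto", "1;ok", paste0("2;ruim", rawToChar(as.raw(0xFF)))),
           arquivo, useBytes = TRUE)

# (i) read the raw lines
linhas <- readLines(arquivo, warn = FALSE)
# (ii) iconv() with sub = "byte" replaces invalid bytes with a "<ff>"
#      marker instead of erroring out
linhas <- iconv(linhas, from = "UTF-8", to = "UTF-8", sub = "byte")
# (iii) parse the cleaned text into a data.frame
bancodedados <- read.table(text = linhas, sep = ";", header = TRUE,
                           stringsAsFactors = FALSE)
```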
Have you tried saving to .txt and loading it into a data frame? Remember to set the UTF-8 encoding. Provide a small example of the file so we can reproduce the situation.
– Artur_Indio
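A sketch of that suggestion, assuming the file really is UTF-8 (file name and contents are made up). fileEncoding re-encodes the file on the way in, which is usually more reliable than the encoding argument, which only marks the resulting strings:

```r
# Save as plain text, then read with an explicit fileEncoding.
arquivo <- tempfile(fileext = ".txt")
writeLines(c("id;texto", "1;ok"), arquivo)

bancodedados <- read.csv(arquivo, sep = ";", fileEncoding = "UTF-8",
                         stringsAsFactors = FALSE)
```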
Eric, did the above suggestion work? Did you manage to solve the problem?
– Carlos Cinelli