I have a .csv database that gathers posts from both Facebook and Twitter. To read the database into R, I used the following code:
bancodedados <- read.csv("nomedobanco.csv", sep=";", encoding="UTF-8")
The code loads the database almost to the end, but an error interrupts the reading:
invalid input 'RT @jmlara02: @Lizcorreaa Comrade define multicentric comrade. ẓà @90Javier @Nicolasmaduro' in 'utf8towcs'
Searching the Internet, I saw that this problem is fairly common. It is caused by characters not recognized under the encoding declared in my code (UTF-8), which in this case is "?".
Some solutions proposed on the Internet:
- Manually remove the characters from the original file. I dismissed this option because the database is very large and the computer's RAM is not.
- Use the tryCatch() function (R's error handling) to ignore the error and proceed with the reading. This seemed like the best option, but the code is rather unfriendly to use. I also tried the "debug" package from CRAN and did not find it much better than the default.
- Load the data into a VCorpus with the "tm" package from CRAN. I actually managed to load the database this way, but it did not come in data frame format, i.e., it was just the raw .csv content.
So the question that remains is:
Is solution 2 really the best? If so, how do I implement tryCatch() together with read.csv() to ignore the error and finish reading the database?
If someone knows of an error-handling manual in Portuguese, that would also help.
Some links about the problem:
https://stackoverflow.com/questions/26143270/read-umlaut-from-csv-file-in-rattle
http://minimalr.com/2013/01/06/tolower-error-catching-unmappable-characters/
https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs
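On the tryCatch() part of the question: a minimal sketch of how it could wrap read.csv(). Note that tryCatch() cannot resume an interrupted read; it can only catch the error and retry with different settings. The file name, contents, and the latin1 fallback below are all assumptions for illustration:

```r
# Sketch only: tryCatch() cannot skip the bad line and keep reading; it
# catches the error and lets us retry with different settings.
# A tiny throwaway file stands in for the real nomedobanco.csv.
arquivo <- tempfile(fileext = ".csv")
writeLines(c("id;texto", "1;ok", "2;tambem ok"), arquivo)

bancodedados <- tryCatch(
  read.csv(arquivo, sep = ";", encoding = "UTF-8"),
  error = function(e) {
    # Fallback guess: reread assuming the file is actually latin1-encoded
    read.csv(arquivo, sep = ";", fileEncoding = "latin1")
  }
)
nrow(bancodedados)  # 2
```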
This whole encoding business is always complicated. Without seeing the data it is hard to give a definitive answer, but something along the following lines will probably work: (i) read the csv using readLines(). This will generate a large text object in R. (ii) Convert the text to the correct encoding using iconv(). (iii) Convert to a data.frame using read.table(text = objeto_que_voce_criou_com_readLines). That should work. – Carlos Cinelli
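A minimal sketch of that readLines/iconv/read.table pipeline. The real file is not available, so a small stand-in containing one invalid byte is fabricated to make the example self-contained; the semicolon separator matches the question:

```r
# Fabricate a small stand-in file with one byte that is not valid UTF-8,
# which is the kind of input utf8towcs chokes on.
arquivo <- tempfile(fileext = ".csv")
writeLines(c("id;texto", "1;ok", paste0("2;ruim", rawToChar(as.raw(0xFF)))),
           arquivo, useBytes = TRUE)

# (i) read the raw lines
linhas <- readLines(arquivo, warn = FALSE)
# (ii) iconv() with sub = "byte" replaces invalid bytes with a "<ff>"
#      marker instead of erroring out
linhas <- iconv(linhas, from = "UTF-8", to = "UTF-8", sub = "byte")
# (iii) parse the cleaned text into a data.frame
bancodedados <- read.table(text = linhas, sep = ";", header = TRUE,
                           stringsAsFactors = FALSE)
```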
Have you tried saving to .txt and loading it into a data frame? Remember to set the UTF-8 encoding. Provide a small example of the file so we can reproduce the situation.
– Artur_Indio
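A sketch of that suggestion, assuming the file really is UTF-8 (file name and contents are made up). fileEncoding re-encodes the file on the way in, which is usually more reliable than the encoding argument, which only marks the resulting strings:

```r
# Save as plain text, then read with an explicit fileEncoding.
arquivo <- tempfile(fileext = ".txt")
writeLines(c("id;texto", "1;ok"), arquivo)

bancodedados <- read.csv(arquivo, sep = ";", fileEncoding = "UTF-8",
                         stringsAsFactors = FALSE)
```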
Eric, did the above suggestion work? Did you manage to solve the problem?
– Carlos Cinelli