Removing lines from a problematic database in R

A few days ago I asked a question on the site (see link). The answer I received from prof. @Marcus Nunes shows that fewer than 10% of the lines have more than 15 ; separators. Given that, I want to remove all lines with more than 15 ; column separators. My first attempt at this cleanup, which should also generate a new dataset, is below, but it did not work properly. Could you suggest how I should proceed?

library("tidyverse")
library("stringr")

teste <- readLines("2019_Viagem.csv")
count <- str_count(teste, ';')
teste <- teste[count==15]
write.csv2(teste,"plan2019.csv",row.names = FALSE)
Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 plan2019.csv"
Diaria2019 <- data.table::fread(Diaria2019_Via, dec = ",")

1 answer

I believe the following R code, with some preparatory awk, does what the question asks.

First I'll redirect the output of the question's iconv command to a new file, 2019_Viagem_UTF8.csv.

iconv -f ISO-8859-1 -t UTF-8 2019_Viagem.csv > 2019_Viagem_UTF8.csv

This is the file I will process.
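
For anyone who prefers to stay inside R, the same re-encoding can be sketched with base R's iconv() function. This is only an illustration: the bytes below are a made-up sample (the word "Diária" encoded in ISO-8859-1), not data taken from the file.

```r
# Hypothetical sample: "Diária" as ISO-8859-1 bytes (0xe1 is "á" in that encoding)
latin1 <- rawToChar(as.raw(c(0x44, 0x69, 0xe1, 0x72, 0x69, 0x61)))

# Convert to UTF-8, mirroring `iconv -f ISO-8859-1 -t UTF-8` on the command line
utf8 <- iconv(latin1, from = "ISO-8859-1", to = "UTF-8")

# In UTF-8 the accented letter takes two bytes (c3 a1) instead of one (e1)
charToRaw(utf8)
```

For a whole file, the same call could be applied to the result of readLines(), assuming the file fits in memory.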

Now I will slightly change the command line from @Marcus Nunes's answer so that it produces a text file with the number of columns in each line rather than the number of ";". In practice it is the same thing: if I counted separators, the R code later on would compare with 15 instead of 16, which is the value I use here.
The new command line is as follows:

cat 2019_Viagem.csv | awk -F";" '{print NF}' > colunas.txt

This creates the file colunas.txt, which contains only the number of columns in each row, one number per line.
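
If awk is not available, the same count can be approximated directly in R. The sketch below uses a small invented vector of lines (not the real file) and counts the ";" separators with gregexpr(); adding 1 gives the number of columns, which matches awk's NF for non-empty lines. For an empty line awk reports 0 fields, while this sketch would report 1.

```r
# Hypothetical lines standing in for the contents of the CSV file
linhas <- c("a;b;c", "a;b;c;d", "a;;c", "a;b;")

# Number of ";" in each line: gregexpr() finds every match,
# regmatches() extracts them, lengths() counts them per line
n_sep <- lengths(regmatches(linhas, gregexpr(";", linhas, fixed = TRUE)))

# Columns = separators + 1 (empty fields, including a trailing one, are counted)
n_campos <- n_sep + 1L
n_campos
```

With the real data, linhas would come from readLines() and the filter would be linhas[n_campos == 16].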

Finally, the R code.

  1. Read the number of columns in each row of the file 2019_Viagem_UTF8.csv.
  2. Read that file.
  3. Keep only the lines we want, those with 16 columns.
  4. Create a data.frame, using read.csv2 with the argument text = txt.
  5. Save it to disk as a CSV file.

Here it is:

colunas <- scan(file = "colunas.txt")
txt <- readLines("2019_Viagem_UTF8.csv")
txt <- txt[colunas == 16]
limpo <- read.csv2(text = txt)
rm(txt)
dim(limpo)
#[1] 125695     16

write.csv2(limpo, file = "plan2019.csv", row.names = FALSE)

Note: The question has plan2018.csv and not 2019. I believe this is a mistake, which I corrected.
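
One plausible reason the code in the question misbehaved, offered as a guess rather than a diagnosis: write.csv2() applied to a character vector treats it as a one-column table, so every line is written as a single quoted field under a header "x". writeLines() writes the lines unchanged instead. A minimal sketch with invented lines and temporary files:

```r
# Hypothetical filtered lines (header plus two rows), not the real data
linhas <- c("id;nome;valor", "1;Ana;10,5", "3;Caio;3,0")

arq_csv2  <- tempfile(fileext = ".csv")
arq_lines <- tempfile(fileext = ".csv")

# write.csv2() coerces the vector to a one-column data.frame named "x"
write.csv2(linhas, arq_csv2, row.names = FALSE)
# writeLines() writes each element as-is, one per line
writeLines(linhas, arq_lines)

readLines(arq_csv2)   # header "x", then each original line wrapped in quotes
readLines(arq_lines)  # the original lines, untouched

# The writeLines() output parses as a normal ";"-separated file
limpo <- read.csv2(arq_lines)
dim(limpo)
```

In the answer above this problem does not arise, because the lines are parsed with read.csv2(text = txt) before anything is written to disk.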

  • Very good @Rui Barradas, thank you!
