Read error with the fread function of the data.table package

After downloading the 2019 database of daily allowance and ticket values from the Transparency Portal (http://www.portaltransparencia.gov.br/download-de-dados/viagens), I used the code below to read the file of daily allowances (diárias) received in 2019. However, I get errors with the fread function, and with read_delim the "Daily Value" column is not read correctly. Here are the codes and their errors:

rm(list=ls())
library("tidyverse")
library("readr")
library("data.table")
library("stringr")
library("lubridate")
#unzip("2019_20190630_Viagens.zip")
options(datatable.fread.input.cmd.message = FALSE)
# Convert from ISO-8859-1 to UTF-8 on the fly and feed the result to fread as a shell command
Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 2019_Viagem.csv"
Diaria2019 <- data.table::fread(Diaria2019_Via, dec = ",")


Warning messages:
1: In data.table::fread(Diaria2019_Via, dec = ",") :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(Diaria2019_Via, dec = ",") :
  Stopped early on line 7378. Expected 16 fields but found 18. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"0000000000015769552";"Realizada";"53000";"Ministério do Desenvolvimento Regional";"53000";"Ministério do Desenvolvimento Regional - Unidades com vínculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"São Paulo/SP";"Representar a Secretaria Nacional de Proteção e Defesa Civil - SEDEC, no Seminário "Proteção e Defesa Civil Aplicada", onde Ministrará palestra sobre "Apoio Federal na Resposta a Desastres"; participará reunião com pessoal do Hospital Albert Eins>>

The message above suggests using quote="" and fill=TRUE; however, neither works. The code below misreads the Daily Value column:

Diaria_2019 <- read_delim("2019_Viagem.csv", 
                          ";", escape_double = FALSE, locale = locale(decimal_mark = ".",encoding = "ISO-8859-1"), 
                          trim_ws = TRUE)

Can anyone suggest another function that reads the 2019_Viagem.csv database correctly?

2 answers

The file is malformed at the source. As far as I know, it will be impossible to read it as-is within R. I found this out by running the following command in the terminal:

cat 2019_Viagem.csv | awk -F";" '{print NF-1}' | head -7379 | tail -3

Breaking it down, this is what the command does:

  1. cat 2019_Viagem.csv: prints the contents of the file 2019_Viagem.csv to the terminal

  2. awk -F";" '{print NF-1}': awk splits each line of 2019_Viagem.csv on the ; character and prints how many separators it found. Since ; is the column separator, a line with 16 columns should contain exactly 15 ; characters

  3. head -7379 and tail -3: work like their R namesakes, where head keeps the first n lines of the input and tail the last m; combined, they keep only lines 7377 to 7379

The | character chains one command into the next, much like the %>% operator from the magrittr or dplyr packages. The result I got was:

15
17
15

This means that rows 7377 and 7379 of 2019_Viagem.csv have 16 columns, while row 7378 has 18. Note that this result matches the warning you got in your attempt:

2: In data.table::fread(Diaria2019_Via, dec = ",") :
  Stopped early on line 7378. Expected 16 fields but found 18.

Since R does not know how to handle this mismatch in the number of columns across rows, you end up with problems reading the data. For the record, these are the three lines whose ; counts I displayed above:

"0000000000015769550";"Realizada";"30000";"Minist�rio da Justi�a e Seguran�a P�blica";"30108";"Departamento de Pol�cia Federal";"***.000.00*-**";"Informa��es protegidas por sigilo";"Informa��es protegidas por sigilo";"20190117";"20190117";"Informa��es protegidas por sigilo";"Informa��o protegida por sigilo nos termos da legisla��o vigente";",00";"3792,34";",00"
"0000000000015769552";"Realizada";"53000";"Minist�rio do Desenvolvimento Regional";"53000";"Minist�rio do Desenvolvimento Regional - Unidades com v�nculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"S�o Paulo/SP";"Representar a Secretaria Nacional de Prote��o e Defesa Civil - SEDEC, no Semin�rio "Prote��o e Defesa Civil Aplicada", onde Ministrar� palestra sobre "Apoio Federal na Resposta a Desastres"; participar� reuni�o com pessoal do Hospital Albert Einstein para prepara��o de Simulado de Preven��o de desastres; participar� de reuni�o com Diretor do Centro Nacional de Monitoramento e Alertas de Desastres Naturais - CEMADEN, que ser�o realizados no dia 16 de janeiro de 2019.";"475,70";"2201,04";",00"
"0000000000015769553";"Realizada";"30000";"Minist�rio da Justi�a e Seguran�a P�blica";"30108";"Departamento de Pol�cia Federal";"***.000.00*-**";"Informa��es protegidas por sigilo";"Informa��es protegidas por sigilo";"20190101";"20190103";"Informa��es protegidas por sigilo";"Informa��o protegida por sigilo nos termos da legisla��o vigente";"725,96";",00";",00"

I see two courses of action to solve your problem:

The first would be to correct the problematic lines by hand. And yes, there are many lines with problems. Below is a table with the number of occurrences of ; across the entire data set, obtained with the command cat 2019_Viagem.csv | awk -F";" '{print NF-1}' | sort | uniq -c:

266878 15
  2929 16
  1282 17
   500 18
   521 19
   253 20
    80 21
    54 22
    27 23
    31 24
     6 25
     6 26
     2 27
     7 28
     3 29
     4 33
     2 37

Note that 266878 rows have the expected 15 column separators, 2929 have 16, and so on, all the way up to 2 rows with 37 ;. Personally, I don't see how to write an algorithm capable of handling every exception that can arise in a data set as problematic as this one.
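
If you do go down the manual-correction route, the sketch below (again assuming the file is in the working directory) reproduces this frequency table from within R and lists the line numbers that need fixing; it does not repair anything by itself.

linhas <- readLines("2019_Viagem.csv", encoding = "latin1")
n_sep <- stringr::str_count(linhas, ";")

# Frequency of ";" per line, analogous to sort | uniq -c
table(n_sep)

# Line numbers that do not have the expected 15 separators
which(n_sep != 15)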

The other solution I see is to contact the data provider directly and point out these inconsistencies. Who knows, maybe someone in the federal government will help you?

  • Thanks, CGU, and congratulations on the work and on the prompt correction of the files!
