Remove data frame row with string divided into multiple columns

1

I have several files with data stored as fixed-width text. Every file has a header line with information about the data entity and a last line in the format FINALIZADOR##########, where each # is a digit (for example, FINALIZADOR0000000037).

I am using R's read.fwf("path/to/file", widths = c(4,2,80), col.names = c("exercicio", "codigo", "nome"), strip.white = TRUE, skip = 1). With skip = 1 I can skip the header, but I have not found a way to skip the FINALIZADOR line.

Note that widths and col.names are specific to each file (there are 27 files in total).

I’m looking for a way to:

Either skip the line starting with "FINALIZADOR" before or during the import; or

Delete the row that contains FINALIZADOR even when it is split across several columns. For example, in the file ORGAO.TXT, whose first field has 4 characters, the second 2 and the third 80, FINALIZADOR0000000037 would be split into FINA, LI and ZADOR0000000037 across the three columns.

3 answers

2


If the data frame is called dados and the goal is to always remove its last row, do

dados.limpos <- head(dados, -1)

to create the data frame dados.limpos, identical to dados but without its last row.
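
Because head(dados, -1) drops the last row unconditionally, it may be worth checking first that the last row really is the finalizer. Below is a minimal sketch of that check, not a definitive implementation. The toy data frame is made up for illustration and stands in for what read.fwf might return for ORGAO.TXT; the variable ultima is a hypothetical name.

```r
# Toy data standing in for the result of read.fwf() on ORGAO.TXT (assumed values).
dados <- data.frame(
  exercicio = c("2019", "2019", "FINA"),
  codigo    = c("01", "02", "LI"),
  nome      = c("Orgao A", "Orgao B", "ZADOR0000000037"),
  stringsAsFactors = FALSE
)

# Glue the cells of the last row back together, so the marker is found
# no matter how the column widths split it.
ultima <- paste(vapply(dados[nrow(dados), ], as.character, character(1)),
                collapse = "")

# Drop the last row only when it really is the finalizer.
if (startsWith(ultima, "FINALIZADOR")) {
  dados.limpos <- head(dados, -1)
} else {
  dados.limpos <- dados
}
```

Note that a finalizer row read together with the data can force otherwise numeric columns to be read as text, so the column types may need converting afterwards.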

1

The following function reads the file with readLines as plain text, preprocesses it by removing the header (first line) and the FINALIZADOR line, and wraps the remaining text in a textConnection, tc. read.fwf then reads the table from tc.

The function returns a list that keeps the file header and the last line alongside the data, with the names

  1. Cabecalho
  2. Finalizador
  3. Dados

If you only want the data table, this final value is easy to modify.

lerFich <- function(con, widths, col.names){
  txt <- readLines(con = con)           # read the whole file as plain text
  Header <- txt[1]                      # the first line is the header
  n <- grep("^FINALIZADOR", txt)        # line(s) starting with the finalizer marker
  Last <- txt[n]
  txt <- txt[-c(1, n)]                  # keep only the data lines
  tc <- textConnection(txt)
  dados <- read.fwf(file = tc, widths = widths, col.names = col.names, strip.white = TRUE)
  close(tc)
  list(Cabecalho = Header, Finalizador = Last, Dados = dados)
}

res <- lerFich("ORGAO.txt", widths = c(4, 10, 10), col.names = c("exercicio", "codigo", "nome"))

res
#$Cabecalho
#[1] "col1      col2      col3"
#
#$Finalizador
#[1] "FINALIZADOR0000000037"
#
#$Dados
#  exercicio     codigo       nome
#1      1234 abcdefghij 1234567890
#2      1234 abcdefghij 1234567890
#3      1234 abcdefghij 1234567890
#4      1234 abcdefghij 1234567890
#5      1234 abcdefghij 1234567890
#
#Warning message:
#In readLines(con = con) : incomplete final line found on 'ORGAO.txt'

The table will then be

res$Dados

File

The test file has the following contents.

col1      col2      col3
1234abcdefghij1234567890
1234abcdefghij1234567890
1234abcdefghij1234567890
1234abcdefghij1234567890
1234abcdefghij1234567890
FINALIZADOR0000000037
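
Since the question mentions 27 files, each with its own widths and col.names, the same idea can be applied in a loop. The following is only a sketch under stated assumptions: ler_dados is a hypothetical variant of lerFich that returns just the data frame, and the file UNIDADE.TXT, its widths, its column names and all file contents are invented for the example.

```r
# Hypothetical helper: same idea as lerFich, but returns only the data frame.
ler_dados <- function(arquivo, widths, col.names) {
  txt <- readLines(arquivo, warn = FALSE)
  txt <- txt[-1]                                  # drop the header line
  txt <- txt[!startsWith(txt, "FINALIZADOR")]     # drop the finalizer line
  tc <- textConnection(txt)
  on.exit(close(tc))
  read.fwf(tc, widths = widths, col.names = col.names, strip.white = TRUE)
}

# Two small example files written to a temporary directory (made-up contents).
dir_tmp <- tempfile("fwf_")
dir.create(dir_tmp)
arq1 <- file.path(dir_tmp, "ORGAO.TXT")
arq2 <- file.path(dir_tmp, "UNIDADE.TXT")
writeLines(c("HEADER", "201901Orgao A", "201902Orgao B",
             "FINALIZADOR0000000002"), arq1)
writeLines(c("HEADER", "2019001Unidade X", "FINALIZADOR0000000001"), arq2)

# One specification per file: path, widths and column names.
specs <- list(
  list(arquivo = arq1, widths = c(4, 2, 80),
       col.names = c("exercicio", "codigo", "nome")),
  list(arquivo = arq2, widths = c(4, 3, 20),
       col.names = c("exercicio", "unidade", "descricao"))
)

# Read every file with its own specification and name the results by file.
tabelas <- lapply(specs, function(s) ler_dados(s$arquivo, s$widths, s$col.names))
names(tabelas) <- basename(vapply(specs, `[[`, character(1), "arquivo"))
```

Each table is then available by file name, for example tabelas[["ORGAO.TXT"]].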

1

You can use the nrows option to read only up to the next-to-last line. The data are then loaded in the appropriate format and the file does not need to be preprocessed. You only have to determine the number of lines first:

# Example file
writeLines(
"Cabecalho
ABCD11Paulo
EFGH21Henrique
FINALIZADOR000000002",
"exemplo.txt")

nlinhas <- length(readLines("exemplo.txt")) # see the note on large files below

dados <- read.fwf("exemplo.txt", widths = c(4,2,80),
                  col.names = c("exercicio", "codigo", "nome"),
                  skip = 1, nrows = nlinhas-2)

> dados
  exercicio codigo     nome
1      ABCD     11    Paulo
2      EFGH     21 Henrique

> str(dados)
'data.frame':   2 obs. of  3 variables:
$ exercicio: Factor w/ 2 levels "ABCD","EFGH": 1 2
$ codigo   : int  11 21
$ nome     : Factor w/ 2 levels "Henrique","Paulo": 2 1

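If your files are very large, using readLines just to count the lines will be inefficient. On a Unix-like system you can call wc -l instead. Keep in mind that wc -l counts newline characters, so a file whose last line has no trailing newline is reported as one line shorter. The pattern below also tolerates the leading spaces that some versions of wc print before the count:

nlinhas <- as.integer(sub("^\\s*(\\d+).*", "\\1", system("wc -l exemplo.txt", intern = TRUE)))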
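
Where wc is not available, one possible alternative is to count the lines from R in fixed-size blocks, so that the whole file never has to sit in memory at once. This is only a sketch: contar_linhas and its bloco argument are hypothetical names, and the example file is written to a temporary path.

```r
# Hypothetical helper: count lines by reading the file in blocks of `bloco` lines.
contar_linhas <- function(arquivo, bloco = 10000L) {
  con <- file(arquivo, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    n <- length(readLines(con, n = bloco, warn = FALSE))
    if (n == 0L) break          # nothing left to read
    total <- total + n
  }
  total
}

# Same example file as above, written to a temporary location.
arq <- tempfile(fileext = ".txt")
writeLines(c("Cabecalho", "ABCD11Paulo", "EFGH21Henrique",
             "FINALIZADOR000000002"), arq)

nlinhas <- contar_linhas(arq)

# Skip the header and stop before the finalizer line.
dados <- read.fwf(arq, widths = c(4, 2, 80),
                  col.names = c("exercicio", "codigo", "nome"),
                  skip = 1, nrows = nlinhas - 2)
```

Unlike wc -l, this count includes a final line that has no trailing newline.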
