Use of the sub function in R - string with special characters

Viewed 1,220 times

3

I am working with a data set that has the following values:

dados$Col_Nova <- dados$Col_Velha

Col_Velha               Col_Nova
Médico                 Médico
Médica Intensivista    Médico
Técnica em Enfermagem  Técnica em Enfermagem
Enfermeira              Enfermeiro

To change the names I used the sub function with the following arguments:

dados$Col_Nova <- sub(pattern = "[A-z]nfermeir[A-z].*", "Enfermeiro", dados3$Col_Nova)
dados$Col_Nova <- sub(pattern = "[A-z].{3}ic[A-z].*", "Médico", dados$Col_Nova)

However, when I try to do the same for the nursing technician entries, it does not work. This is the code:

dados$Col_Nova<- sub(pattern = "[A-z].{2}cnic[A-z]\\s.*", "Técnico de enfermagem", dados$Col_Nova)

What’s going on and why? Thank you!

  • Instead, I would try fixing the encoding used when reading the data. The old column is at least consistent: it looks like UTF-8 data being treated as windows-1252.

  • How would I do that?

  • How did you load this data?

  • dados <- read.csv(arquivos[indice_arquivos], header=T), where arquivos is a vector with the files in the directory and indice_arquivos selects the file I want to import.

3 answers

5


You can change the encoding of the whole column at once:

dados$Col_Nova <- iconv(dados$Col_Velha, to = "latin1//TRANSLIT", from = "UTF-8")
  • The solution proved efficient and practical for the problem at hand. Regards, Arduin.
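
To see what that iconv call does, here is a minimal sketch on a single value. The string is built from raw bytes, which is my assumption about what one cell of the column holds (UTF-8 bytes that were never re-encoded), and I left out //TRANSLIT because support for that suffix depends on the platform.

```r
# "Medico" with an e-acute, as raw UTF-8 bytes. Built from bytes so the
# example does not depend on the encoding of this script (assumed stand-in
# for one cell of the column).
cell <- rawToChar(as.raw(c(0x4d, 0xc3, 0xa9, 0x64, 0x69, 0x63, 0x6f)))

# In UTF-8 the accented letter takes two bytes (0xc3 0xa9), which
# windows-1252 displays as two separate characters.
bytes_before <- nchar(cell, type = "bytes")

# Interpret the bytes as UTF-8 and convert them to latin1.
fixed <- iconv(cell, from = "UTF-8", to = "latin1")

# In latin1 the accented letter is the single byte 0xe9.
bytes_after <- nchar(fixed, type = "bytes")
print(c(before = bytes_before, after = bytes_after))
print(charToRaw(fixed))
```

After the conversion each accented letter is one character again, so a pattern no longer has to guess how many junk characters sit between the plain letters.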

1

Another alternative, which I consider more elegant, is to handle the file encoding when the file is loaded, instead of fixing the errors caused by a misconfigured load afterwards.

You can set the encoding when reading the file, as follows:

csvFile <- file("arquivo.csv", encoding="UTF-8")
data <- read.csv(csvFile)

Based on what you posted in the comments, the adjustment would probably look like this:

dados <- read.csv(file(arquivos[indice_arquivos], encoding="UTF-8"), header=T)

Since you did not post the full code, I cannot guarantee it. If this is not exactly right, it should be close.
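
A self-contained sketch of the same idea, declaring the encoding at read time. Note the swap: it uses read.csv's own encoding argument, which only marks the strings as UTF-8, rather than re-encoding them the way file(..., encoding = ) does. The temporary file, its two rows, and the simplified pattern are made up for illustration and are not the asker's data.

```r
# Hypothetical stand-in for the real CSV: a header and two rows,
# written to a temporary file as raw UTF-8 bytes.
tmp <- tempfile(fileext = ".csv")
linhas <- c("Col_Velha",
            "M\u00e9dica Intensivista",
            "T\u00e9cnica em Enfermagem")
writeLines(enc2utf8(linhas), tmp, useBytes = TRUE)

# Declare the encoding while reading, so the accents arrive intact.
dados <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# With clean text, the pattern can name the accented letter directly
# instead of counting how many garbled characters replaced it.
dados$Col_Nova <- sub("^T\u00e9cnic[ao]\\s.*", "T\u00e9cnico de enfermagem",
                      dados$Col_Velha)
print(dados)

unlink(tmp)
```

Only the second row is rewritten; the first is left untouched because it does not match the pattern.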

1

You need to add the backslash:

dados$Col_Nova <- sub(pattern = "[A-z].{2}cnic[A-z]\\s.*", "Técnico de enfermagem", dados$Col_Nova)
  • True, but this error still appears: "Error in `$<-.data.frame`(`*tmp*`, Col_nova, value = character(0)) : replacement has 0 rows, data has 186". I think the pattern is not correct.

  • It ran smoothly here.
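
On the error in the comment above: one way to get that message, independent of the pattern, is to read from a column that does not exist, for example because of different capitalization in its name. The sketch below reproduces this on a made-up two-row data frame. Whether that is what happened in the asker's session cannot be determined from the thread.

```r
# Made-up data frame; only the column name matters for this example.
dados <- data.frame(Col_Nova = c("Enfermeira", "Medica"),
                    stringsAsFactors = FALSE)

# `$` returns NULL for a column that does not exist (note the lowercase n).
missing_col <- dados$Col_nova
print(is.null(missing_col))

# sub() on NULL returns a zero-length character vector.
res <- sub("a$", "o", missing_col)
print(length(res))

# Assigning a zero-length vector as a data frame column raises the error.
msg <- tryCatch({
  dados$Col_nova <- res
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

So when this message shows up, it is worth checking names(dados) against the column name used in the call before changing the regular expression.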
