Selection/Cleaning of information in a column


1

I have a database with thousands of rows, but in one of the columns the data is like this:

XLOCAL
Estirão do Equador, Rio Javari (04°27'S;71°30'W)
Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)
Ipixuna do Pará, Rodovia Belém-Brasília km 92/93 (02°26'S;47°30'W)
Aurora do Pará, Rodovia Belém-Brasília km 86 (02°04'S;47°33'W)

I would like help keeping only the coordinates, removing all the text, the parentheses, and the semicolon. The result would look like this:

 XLOCAL
04°27'S 71°30'W
02°15'N 55°59'W

I tried using strings and gsub, without success. Here is an example of what I tried:

df <- c("sdasdad (04°27'S;71°30'W)", "zxczxczcxz (01°40'N;51°23'W)")
grep("^([[:punct:]])", df, value=TRUE)
pattern <- "[[:alpha:]]"
gsub(pattern, "", df, fixed = FALSE)

Result:

[1] " (04°27';71°30')" " (01°40';51°23')" # Note that it also removed "N", "S" and "W" from the coordinates.

The database belongs to a museum. The data are not available online, and I have to organize them before they can be published. Please help; otherwise there are thousands of rows to clean by hand. Thank you very much in advance.

  • 1

    Please add more information: which database is it? What code have you already tried? What error appears?

2 answers

2


I think the regex in the question is more complicated than it needs to be. Try this instead.
First, keep only what is between ( and ). Because these are special characters in a regex, they must be escaped as \\( and \\). That is what the sub call does.
Then replace the semicolon ; with a space. I used gsub for this, but since there is only one ; in each string, sub works too.

gsub(";", " ", sub("^.*\\((.*)\\)", "\\1", XLOCAL))
#[1] "04°27'S 71°30'W" "02°15'N 55°59'W" "02°26'S 47°30'W"
#[4] "02°04'S 47°33'W"

This is equivalent to (does exactly the same thing as) the following, split into two statements to be more readable.

tmp <- sub("^.*\\((.*)\\)", "\\1", XLOCAL)
XLOCAL <- gsub(";", " ", tmp)
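
One caveat when running this over thousands of rows: sub returns a string unchanged when the pattern does not match, so a row without parentheses would keep its full text. Below is a minimal sketch of one way to mark such rows as NA instead. The second element of x is a hypothetical row, not taken from the question's data, and the names has_coords and coords are only illustrative.

```r
# First element comes from the question; the second is a hypothetical
# row with no coordinates, added only to show the behaviour.
x <- c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)",
       "Locality without coordinates")

# TRUE where the string contains "(...;...)".
has_coords <- grepl("\\(.*;.*\\)", x)

# Apply the same sub/gsub steps, but return NA where nothing matched.
coords <- ifelse(has_coords,
                 gsub(";", " ", sub("^.*\\((.*)\\)", "\\1", x)),
                 NA_character_)

print(coords)
```

Rows that come back as NA can then be listed with x[is.na(coords)] and checked by hand.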

Data in dput format.

XLOCAL <-
c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)", 
"Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)", 
"Ipixuna do Pará, Rodovia Belém-Brasília km 92/93 (02°26'S;47°30'W)", 
"Aurora do Pará, Rodovia Belém-Brasília km 86 (02°04'S;47°33'W)")

This statement creates a vector. If you want a data frame, run the statement above and then do

dados <- data.frame(XLOCAL)

Then, wherever the code uses just XLOCAL, use dados$XLOCAL instead.
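
As a sketch of how the two steps look when the values live in a data frame column, assuming the column is named XLOCAL as in the question (only two of the sample rows are used here to keep it short):

```r
# Two sample rows copied from the question's data.
dados <- data.frame(
  XLOCAL = c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)",
             "Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the text between the parentheses.
tmp <- sub("^.*\\((.*)\\)", "\\1", dados$XLOCAL)

# Step 2: replace the semicolon with a space and write the result back.
dados$XLOCAL <- gsub(";", " ", tmp)

print(dados$XLOCAL)
```

Writing the result to a new column instead of overwriting XLOCAL would keep the original text available for checking.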

1

Try it like this:

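library("stringr")

# Read the file chosen interactively; the first line holds the column names
data <- read.delim(file.choose(), header = TRUE)

# Keep the last 16 characters: the coordinates plus the closing parenthesis
new_string <- str_sub(data$XLOCAL, start = -16)

# Drop the closing parenthesis by keeping characters 1 to 15
str_sub(new_string, start = 1, end = 15)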
#[1] "04°27'S;71°30'W" "02°15'N;55°59'W" "02°26'S;47°30'W" "02°04'S;47°33'W"
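
Slicing by position only works if every value ends with a coordinate pair of exactly this shape. Here is a minimal base R sketch of a check that could be run first. The pattern assumes two-digit degrees and minutes, which matches the sample rows but may not hold for the whole database, and the second element of x is a hypothetical row added to show a failure.

```r
# First element comes from the question; the second is a hypothetical
# row with a one-digit degree value, which breaks the fixed width.
x <- c("Aurora do Pará, Rodovia Belém-Brasília km 86 (02°04'S;47°33'W)",
       "Some locality (2°04'S;47°33'W)")

# Assumed shape: (DD°MM'H;DD°MM'H) at the very end of the string.
pattern <- "\\([0-9]{2}°[0-9]{2}'[NS];[0-9]{2}°[0-9]{2}'[EW]\\)$"

# TRUE where position-based slicing is safe under that assumption.
fixed_width <- grepl(pattern, x)

print(fixed_width)
# Rows that need a different treatment:
print(x[!fixed_width])
```

Note that the result above still contains the semicolon; replacing it with a space takes one more sub or gsub call.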
