3
library(tidyverse)
I’m trying to standardize the "numero_telefonico" column of this dataset: https://raw.githubusercontent.com/rivaquiroga/latinr-taller-regex/master/datos/telefonos.csv
Download the dataset and write to an object telephones
telefonos<- read_csv("https://raw.githubusercontent.com/rivaquiroga/latinr-taller-regex/master/datos/telefonos.csv")
telefonos
> telefonos
# A tibble: 19 x 3
nombre ciudad numero_telefonico
<chr> <chr> <chr>
1 Pérez, María Valparaíso +56 981497134
2 Juan González Olmué 928374591
3 Hernández, Lautaro quilpué 56 964379583
4 Lobos, Antonia La Serena 987654321
5 Martínez, José La Serena +56-983726153
6 Manuel Garrido valparaiso 914230795
7 Soto, Loreto valparaíso 925479238
8 Pérez, Pedro valparaíso 985740293
9 González, Camila Valparaiso 56 973629584
10 Jiménez, Luis Olmué 925403948
11 Sanhueza, Pedro quilpué 56 981408973
12 Aníbal García quilpue 982733747
13 Gómez, Lirayén quilpué 914257483
14 Cayuqueo, Marina Valparaíso 973629853
15 Jaime Espinoza Valparaíso 56981359284
16 Hernández, Francisca Valparaíso 56993418209
17 López, Carla Serena 56912338475
18 Soto, Martina Serena 925653948
19 González, Olivia Valparaíso no tengo
In the end, I achieved my goal using str_sub()
:
telefonos %>%
mutate(numero = str_sub(numero_telefonico, start = -9, end = -1),
cod = 56)
# A tibble: 19 x 5
nombre ciudad numero_telefonico numero cod
<chr> <chr> <chr> <chr> <dbl>
1 Pérez, María Valparaíso +56 981497134 981497134 56
2 Juan González Olmué 928374591 928374591 56
3 Hernández, Lautaro quilpué 56 964379583 964379583 56
4 Lobos, Antonia La Serena 987654321 987654321 56
5 Martínez, José La Serena +56-983726153 983726153 56
6 Manuel Garrido valparaiso 914230795 914230795 56
7 Soto, Loreto valparaíso 925479238 925479238 56
8 Pérez, Pedro valparaíso 985740293 985740293 56
9 González, Camila Valparaiso 56 973629584 973629584 56
10 Jiménez, Luis Olmué 925403948 925403948 56
11 Sanhueza, Pedro quilpué 56 981408973 981408973 56
12 Aníbal García quilpue 982733747 982733747 56
13 Gómez, Lirayén quilpué 914257483 914257483 56
14 Cayuqueo, Marina Valparaíso 973629853 973629853 56
15 Jaime Espinoza Valparaíso 56981359284 981359284 56
16 Hernández, Francisca Valparaíso 56993418209 993418209 56
17 López, Carla Serena 56912338475 912338475 56
18 Soto, Martina Serena 925653948 925653948 56
19 González, Olivia Valparaíso no tengo no tengo 56
However, I wanted to do this operation using REGEX.
I tried the following: I created another column called "numero_telefonico_2", using str_replace_all()
to eliminate signals ("+" and '-') and whitespaces.
I recorded in the object "phones_2"
telefonos_2<-telefonos %>%
mutate(numero_telefonico_2 = str_replace_all(numero_telefonico, "[\\+, \\-]", ""))
telefonos_2
# A tibble: 19 x 4
nombre ciudad numero_telefonico numero_telefonico_2
<chr> <chr> <chr> <chr>
1 Pérez, María Valparaíso +56 981497134 56981497134
2 Juan González Olmué 928374591 928374591
3 Hernández, Lautaro quilpué 56 964379583 56964379583
4 Lobos, Antonia La Serena 987654321 987654321
5 Martínez, José La Serena +56-983726153 56983726153
6 Manuel Garrido valparaiso 914230795 914230795
7 Soto, Loreto valparaíso 925479238 925479238
8 Pérez, Pedro valparaíso 985740293 985740293
9 González, Camila Valparaiso 56 973629584 56973629584
10 Jiménez, Luis Olmué 925403948 925403948
11 Sanhueza, Pedro quilpué 56 981408973 56981408973
12 Aníbal García quilpue 982733747 982733747
13 Gómez, Lirayén quilpué 914257483 914257483
14 Cayuqueo, Marina Valparaíso 973629853 973629853
15 Jaime Espinoza Valparaíso 56981359284 56981359284
16 Hernández, Francisca Valparaíso 56993418209 56993418209
17 López, Carla Serena 56912338475 56912338475
18 Soto, Martina Serena 925653948 925653948
19 González, Olivia Valparaíso no tengo notengo
From here, I thought I’d use the separate()
to divide everything that comes before the first "9" (which refers to the country code), however I’m not sure which argument I should pass to the parameter sep =
If I put "9," it’s a mess
telefonos_2 %>%
separate(numero_telefonico_2,
into = c("codigo","numero"),
sep = "9")
# A tibble: 19 x 5
nombre ciudad numero_telefonico codigo numero
<chr> <chr> <chr> <chr> <chr>
1 Pérez, María Valparaíso +56 981497134 "56" "814"
2 Juan González Olmué 928374591 "" "283745"
3 Hernández, Lautaro quilpué 56 964379583 "56" "6437"
4 Lobos, Antonia La Serena 987654321 "" "87654321"
5 Martínez, José La Serena +56-983726153 "56" "83726153"
6 Manuel Garrido valparaiso 914230795 "" "142307"
7 Soto, Loreto valparaíso 925479238 "" "2547"
8 Pérez, Pedro valparaíso 985740293 "" "857402"
9 González, Camila Valparaiso 56 973629584 "56" "7362"
10 Jiménez, Luis Olmué 925403948 "" "25403"
11 Sanhueza, Pedro quilpué 56 981408973 "56" "81408"
12 Aníbal García quilpue 982733747 "" "82733747"
13 Gómez, Lirayén quilpué 914257483 "" "14257483"
14 Cayuqueo, Marina Valparaíso 973629853 "" "7362"
15 Jaime Espinoza Valparaíso 56981359284 "56" "8135"
16 Hernández, Francisca Valparaíso 56993418209 "56" ""
17 López, Carla Serena 56912338475 "56" "12338475"
18 Soto, Martina Serena 925653948 "" "25653"
19 González, Olivia Valparaíso no tengo "notengo" NA
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 13 rows [1, 2, 3, 6, 7, 8, 9, 10, 11, 14, 15, 16, 18].
2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [19].
I tried too str_split()
but I’m also not hitting the separation delimiter (the first "9")
split_numero_telefonico_2<-str_split(telefonos_2$numero_telefonico_2, "[9]", simplify = T)
split_numero_telefonico_2
[,1] [,2] [,3] [,4]
[1,] "56" "814" "7134" ""
[2,] "" "283745" "1" ""
[3,] "56" "6437" "583" ""
[4,] "" "87654321" "" ""
[5,] "56" "83726153" "" ""
[6,] "" "142307" "5" ""
[7,] "" "2547" "238" ""
[8,] "" "857402" "3" ""
[9,] "56" "7362" "584" ""
[10,] "" "25403" "48" ""
[11,] "56" "81408" "73" ""
[12,] "" "82733747" "" ""
[13,] "" "14257483" "" ""
[14,] "" "7362" "853" ""
[15,] "56" "8135" "284" ""
[16,] "56" "" "341820" ""
[17,] "56" "12338475" "" ""
[18,] "" "25653" "48" ""
[19,] "notengo" "" "" ""
And a second question: What is the best way to work with the "no tengo" record? In this example, it only appears once, so it’s easy to check. But I kept asking myself: what if the dataset was giant, with the risk of other records typed in various ways ("in Teng", "notengo", etc)?
Hello Marcus, thanks for the return. It worked! However, I wondered if it would be possible to "translate" the REGEX? As far as I can make out, it’s as follows:
.*
[Everything that is in accordance with the parentheses] What is in parentheses means: ;?=
[which is followed by...] Now I’m lost..{9}$
would be the last 9 digits from behind (since the "$" anchors at the end of the string)str_extract()
and the function returns what is not the 9-digit set.– itamar