Standardize phone number with REGEX on R

Asked

Viewed 42 times

3

library(tidyverse)

I’m trying to standardize the "numero_telefonico" column of this dataset: https://raw.githubusercontent.com/rivaquiroga/latinr-taller-regex/master/datos/telefonos.csv

Download the dataset and write to an object telephones

telefonos<- read_csv("https://raw.githubusercontent.com/rivaquiroga/latinr-taller-regex/master/datos/telefonos.csv")

telefonos

> telefonos
# A tibble: 19 x 3
   nombre               ciudad     numero_telefonico
   <chr>                <chr>      <chr>            
 1 Pérez, María         Valparaíso +56 981497134    
 2 Juan González        Olmué      928374591        
 3 Hernández, Lautaro   quilpué    56 964379583     
 4 Lobos, Antonia       La Serena  987654321        
 5 Martínez, José       La Serena  +56-983726153    
 6 Manuel Garrido       valparaiso 914230795        
 7 Soto, Loreto         valparaíso 925479238        
 8 Pérez, Pedro         valparaíso 985740293        
 9 González, Camila     Valparaiso 56 973629584     
10 Jiménez, Luis        Olmué      925403948        
11 Sanhueza, Pedro      quilpué    56 981408973     
12 Aníbal García        quilpue    982733747        
13 Gómez, Lirayén       quilpué    914257483        
14 Cayuqueo, Marina     Valparaíso 973629853        
15 Jaime Espinoza       Valparaíso 56981359284      
16 Hernández, Francisca Valparaíso 56993418209      
17 López, Carla         Serena     56912338475      
18 Soto, Martina        Serena     925653948        
19 González, Olivia     Valparaíso no tengo 

In the end, I achieved my goal using str_sub():

telefonos %>% 
  mutate(numero = str_sub(numero_telefonico, start = -9, end = -1),
         cod = 56)

# A tibble: 19 x 5
   nombre               ciudad     numero_telefonico numero      cod
   <chr>                <chr>      <chr>             <chr>     <dbl>
 1 Pérez, María         Valparaíso +56 981497134     981497134    56
 2 Juan González        Olmué      928374591         928374591    56
 3 Hernández, Lautaro   quilpué    56 964379583      964379583    56
 4 Lobos, Antonia       La Serena  987654321         987654321    56
 5 Martínez, José       La Serena  +56-983726153     983726153    56
 6 Manuel Garrido       valparaiso 914230795         914230795    56
 7 Soto, Loreto         valparaíso 925479238         925479238    56
 8 Pérez, Pedro         valparaíso 985740293         985740293    56
 9 González, Camila     Valparaiso 56 973629584      973629584    56
10 Jiménez, Luis        Olmué      925403948         925403948    56
11 Sanhueza, Pedro      quilpué    56 981408973      981408973    56
12 Aníbal García        quilpue    982733747         982733747    56
13 Gómez, Lirayén       quilpué    914257483         914257483    56
14 Cayuqueo, Marina     Valparaíso 973629853         973629853    56
15 Jaime Espinoza       Valparaíso 56981359284       981359284    56
16 Hernández, Francisca Valparaíso 56993418209       993418209    56
17 López, Carla         Serena     56912338475       912338475    56
18 Soto, Martina        Serena     925653948         925653948    56
19 González, Olivia     Valparaíso no tengo          no tengo     56

However, I wanted to do this operation using REGEX.

I tried the following: I created another column called "numero_telefonico_2", using str_replace_all() to eliminate signals ("+" and '-') and whitespaces. I recorded in the object "phones_2"

telefonos_2<-telefonos %>% 
  mutate(numero_telefonico_2 = str_replace_all(numero_telefonico, "[\\+, \\-]", ""))

telefonos_2

# A tibble: 19 x 4
   nombre               ciudad     numero_telefonico numero_telefonico_2
   <chr>                <chr>      <chr>             <chr>              
 1 Pérez, María         Valparaíso +56 981497134     56981497134        
 2 Juan González        Olmué      928374591         928374591          
 3 Hernández, Lautaro   quilpué    56 964379583      56964379583        
 4 Lobos, Antonia       La Serena  987654321         987654321          
 5 Martínez, José       La Serena  +56-983726153     56983726153        
 6 Manuel Garrido       valparaiso 914230795         914230795          
 7 Soto, Loreto         valparaíso 925479238         925479238          
 8 Pérez, Pedro         valparaíso 985740293         985740293          
 9 González, Camila     Valparaiso 56 973629584      56973629584        
10 Jiménez, Luis        Olmué      925403948         925403948          
11 Sanhueza, Pedro      quilpué    56 981408973      56981408973        
12 Aníbal García        quilpue    982733747         982733747          
13 Gómez, Lirayén       quilpué    914257483         914257483          
14 Cayuqueo, Marina     Valparaíso 973629853         973629853          
15 Jaime Espinoza       Valparaíso 56981359284       56981359284        
16 Hernández, Francisca Valparaíso 56993418209       56993418209        
17 López, Carla         Serena     56912338475       56912338475        
18 Soto, Martina        Serena     925653948         925653948          
19 González, Olivia     Valparaíso no tengo          notengo 

From here, I thought I’d use the separate() to divide everything that comes before the first "9" (which refers to the country code), however I’m not sure which argument I should pass to the parameter sep = If I put "9," it’s a mess

telefonos_2 %>% 
    separate(numero_telefonico_2,
             into = c("codigo","numero"),
             sep = "9")

# A tibble: 19 x 5
   nombre               ciudad     numero_telefonico codigo    numero    
   <chr>                <chr>      <chr>             <chr>     <chr>     
 1 Pérez, María         Valparaíso +56 981497134     "56"      "814"     
 2 Juan González        Olmué      928374591         ""        "283745"  
 3 Hernández, Lautaro   quilpué    56 964379583      "56"      "6437"    
 4 Lobos, Antonia       La Serena  987654321         ""        "87654321"
 5 Martínez, José       La Serena  +56-983726153     "56"      "83726153"
 6 Manuel Garrido       valparaiso 914230795         ""        "142307"  
 7 Soto, Loreto         valparaíso 925479238         ""        "2547"    
 8 Pérez, Pedro         valparaíso 985740293         ""        "857402"  
 9 González, Camila     Valparaiso 56 973629584      "56"      "7362"    
10 Jiménez, Luis        Olmué      925403948         ""        "25403"   
11 Sanhueza, Pedro      quilpué    56 981408973      "56"      "81408"   
12 Aníbal García        quilpue    982733747         ""        "82733747"
13 Gómez, Lirayén       quilpué    914257483         ""        "14257483"
14 Cayuqueo, Marina     Valparaíso 973629853         ""        "7362"    
15 Jaime Espinoza       Valparaíso 56981359284       "56"      "8135"    
16 Hernández, Francisca Valparaíso 56993418209       "56"      ""        
17 López, Carla         Serena     56912338475       "56"      "12338475"
18 Soto, Martina        Serena     925653948         ""        "25653"   
19 González, Olivia     Valparaíso no tengo          "notengo"  NA       
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 13 rows [1, 2, 3, 6, 7, 8, 9, 10, 11, 14, 15, 16, 18]. 
2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [19].

I tried too str_split() but I’m also not hitting the separation delimiter (the first "9")

split_numero_telefonico_2<-str_split(telefonos_2$numero_telefonico_2, "[9]", simplify = T)

split_numero_telefonico_2

      [,1]      [,2]       [,3]     [,4]
 [1,] "56"      "814"      "7134"   ""  
 [2,] ""        "283745"   "1"      ""  
 [3,] "56"      "6437"     "583"    ""  
 [4,] ""        "87654321" ""       ""  
 [5,] "56"      "83726153" ""       ""  
 [6,] ""        "142307"   "5"      ""  
 [7,] ""        "2547"     "238"    ""  
 [8,] ""        "857402"   "3"      ""  
 [9,] "56"      "7362"     "584"    ""  
[10,] ""        "25403"    "48"     ""  
[11,] "56"      "81408"    "73"     ""  
[12,] ""        "82733747" ""       ""  
[13,] ""        "14257483" ""       ""  
[14,] ""        "7362"     "853"    ""  
[15,] "56"      "8135"     "284"    ""  
[16,] "56"      ""         "341820" ""  
[17,] "56"      "12338475" ""       ""  
[18,] ""        "25653"    "48"     ""  
[19,] "notengo" ""         ""       ""  

And a second question: What is the best way to work with the "no tengo" record? In this example, it only appears once, so it’s easy to check. But I kept asking myself: what if the dataset was giant, with the risk of other records typed in various ways ("in Teng", "notengo", etc)?

1 answer

3


To solve this problem with a regular expression, I would use the expression .*(?=.{9}$), that will extract the last 9 characters from each field within numero_telefonico. As there is an occurrence equal to no tengo, the column numero_telefonico_2 resulting will be of the type chr.

I’ll take that the R converts characters of the type "324" the appropriate number (that is, in 324 in that case) and convert the entire column numero_telefonico_2 in numerical. Thus, all the R find that has no direct relation to a number (i.e., any strings like "no teng", "notengo" etc) shall be coded as NA. So a simple na.omit would remove lines from the object telefonos who do not have telephone numbers, if necessary.

I did this in two steps below for didactic reasons. It’s a little more verbose, but I think it’s clearer this way.

telefonos %>%
  mutate(numero_telefonico_2 = sub(".*(?=.{9}$)", "", numero_telefonico, perl = T)) %>%
  mutate(numero_telefonico_2 = as.numeric(numero_telefonico_2))
#> Warning: Problem with `mutate()` input `numero_telefonico_2`.
#> ℹ NAs introduced by coercion
#> ℹ Input `numero_telefonico_2` is `as.numeric(numero_telefonico_2)`.
#> Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
#> # A tibble: 19 x 4
#>    nombre               ciudad     numero_telefonico numero_telefonico_2
#>    <chr>                <chr>      <chr>                           <dbl>
#>  1 Pérez, María         Valparaíso +56 981497134               981497134
#>  2 Juan González        Olmué      928374591                   928374591
#>  3 Hernández, Lautaro   quilpué    56 964379583                964379583
#>  4 Lobos, Antonia       La Serena  987654321                   987654321
#>  5 Martínez, José       La Serena  +56-983726153               983726153
#>  6 Manuel Garrido       valparaiso 914230795                   914230795
#>  7 Soto, Loreto         valparaíso 925479238                   925479238
#>  8 Pérez, Pedro         valparaíso 985740293                   985740293
#>  9 González, Camila     Valparaiso 56 973629584                973629584
#> 10 Jiménez, Luis        Olmué      925403948                   925403948
#> 11 Sanhueza, Pedro      quilpué    56 981408973                981408973
#> 12 Aníbal García        quilpue    982733747                   982733747
#> 13 Gómez, Lirayén       quilpué    914257483                   914257483
#> 14 Cayuqueo, Marina     Valparaíso 973629853                   973629853
#> 15 Jaime Espinoza       Valparaíso 56981359284                 981359284
#> 16 Hernández, Francisca Valparaíso 56993418209                 993418209
#> 17 López, Carla         Serena     56912338475                 912338475
#> 18 Soto, Martina        Serena     925653948                   925653948
#> 19 González, Olivia     Valparaíso no tengo                           NA

Created on 2021-01-07 by the reprex package (v0.3.0)

  • Hello Marcus, thanks for the return. It worked! However, I wondered if it would be possible to "translate" the REGEX? As far as I can make out, it’s as follows: .* [Everything that is in accordance with the parentheses] What is in parentheses means: ;?= [which is followed by...] Now I’m lost. .{9}$ would be the last 9 digits from behind (since the "$" anchors at the end of the string) str_extract() and the function returns what is not the 9-digit set.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.