How to find a String from Regular Expressions

Asked

Viewed 960 times

17

I have a field in the SQL Server 2014 database that stores the client address. The problem is that everything is stored in a single field varchar.

Here are some real examples:

Antonio José Nerchis Street, 81
Street: Guaira/JD Pacera, 1201
321 Street, N 168
R ALTAIR GAGLIARDI, 132
243 Rio Grande do Sul Street
Rua Duque de Caxias
R CAPELLEN, 58
ROD ACCESS PLINIO ARLINDO DE NEZ, 4303
R Anibal Gazaniga, 107
R D PEDRO I, 231
Rua Dep. Ivan Ferreira do Amaral, 440
R SANTA CATARINA, 711
Rua Santa Catarina, 411
Rua Augusto de Jesus, 77
Rua Marcondes Sobrinho, 40
Antonio Tonelli Street 755

I need to move this to the 3 columns:

  • Type of street (street, avenue, village...)
  • The patio itself (name of the street, avenue, village...)
  • Number of the patio (number and other complements that may have been placed near the patio)

Can anyone tell me if you can do this via regular expression? If so, how? I understand practically nothing of regular expressions.

  • you can put a sample of your data, all are separated by , comma ?

  • It may be possible depending on whether your data follows some kind of pattern. Provide a few more data examples to identify some pattern.

  • @Guilhermelautert Standard...ta hard to find, but follow a few more examples... I will put in question more examples.

  • 1

    @GOKUSSJ4 added real examples in the question.

  • If inputs are opened to the user it can be difficult to define an expression for all cases, as the user can fill in as desired.

  • 1

    Not being able to notice anything outside that the number is always at the end, yet as @gmsantos commented it would become very problematic to try to deal with all cases. Not to mention you can’t tell if R is street or highway. It would be more useful to change the system and inform the customer so that they update the data.

  • I agree with @Guilhermelautert. This would be the solution with better reliability

  • @Guilhermelautert understood your position, in fact, I was in the same situation when I tried to solve this problem... Unfortunately this option to update the data is unfeasible, since it is an internal registration and that the responsible for it would have to enter in many registers (more or less 1900 records...). If we create the following pattern, it always starts with "Street" and ends with the number, as would the regular expression?

Show 3 more comments

3 answers

9


ATTENTION: Before performing the steps described in this reply, make a backup of your table in case something goes wrong.

I don’t think regular expression solves your problem, at least not easily.

Now, imagine you have these addresses in the column endereco of a table called tabela.

Then, you would first add the required fields:

ALTER TABLE tabela ADD COLUMN logradouro varchar(255) DEFAULT NULL;
ALTER TABLE tabela ADD COLUMN numero_logradouro varchar(6) DEFAULT NULL;
ALTER TABLE tabela ADD COLUMN tipo_logradouro varchar(5) DEFAULT NULL;

The size of the field logradouro must be the same as in the field endereco.

Hence, you just have to popular these fields. Let’s start by separating the type of the backyard from the rest:

UPDATE tabela SET
    tipo_logradouro = 'Rua',
    logradouro = SUBSTRING(endereco, 3, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 2)) = 'R ';

UPDATE tabela SET
    tipo_logradouro = 'Rua',
    logradouro = SUBSTRING(endereco, 4, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 3)) = 'R. ';

UPDATE tabela SET
    tipo_logradouro = 'Rua',
    logradouro = SUBSTRING(endereco, 5, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 4)) = 'RUA ';

UPDATE tabela SET
    tipo_logradouro = 'Avenida',
    logradouro = SUBSTRING(endereco, 4, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 3)) = 'AV ';

UPDATE tabela SET
    tipo_logradouro = 'Avenida',
    logradouro = SUBSTRING(endereco, 5, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 4)) = 'AV. ';

UPDATE tabela SET
    tipo_logradouro = 'Avenida',
    logradouro = SUBSTRING(endereco, 9, 9999)
WHERE UPPER(SUBSTRING(endereco, 1, 8)) = 'AVENIDA ';

And then, you make the same steps to square, village, road, highway, mall, etc.

Note that 9999 is just a large enough value to fit the entire address. If the address is varchar(123), then trading 9999 for 123 will be enough.

Now it’s time to separate the number. First we normalize:

UPDATE tabela SET logradouro = REPLACE(logradouro, 'nº ', ', ');
UPDATE tabela SET logradouro = REPLACE(logradouro, 'nº', ', ');
UPDATE tabela SET logradouro = REPLACE(logradouro, 'Nº ', ', ');
UPDATE tabela SET logradouro = REPLACE(logradouro, 'Nº', ', ');

Then we separate the number after the last comma. To find the last comma, we use the expression LEN(logradouro) - CHARINDEX(REVERSE(logradouro), ','):

UPDATE tabela SET
    numero_logradouro = TRIM(SUBSTRING(logradouro, LEN(logradouro) - CHARINDEX(REVERSE(logradouro), ',') + 1, 9999)),
    logradouro = TRIM(SUBSTRING(logradouro, 1, LEN(logradouro) - CHARINDEX(REVERSE(logradouro), ',') - 1));
WHERE CHARINDEX(logradouro, ',') >= 1;

Finally, see which records are left or went wrong for some reason. Hope they are few:

SELECT * FROM tabela
WHERE endereco IS NOT NULL
AND (logradouro IS NULL
    OR numero_logradouro IS NULL
    OR tipo_logradouro IS NULL
);

When you’re sure everything’s all right and all the records are correct:

ALTER TABLE tabela DROP COLUMN endereco;

Problems: What is here is not yet able to handle the complement. And if the complement is after the number, it may go wrong. If these cases are few, you can handle them manually. If there are many, edit the question by putting some examples and I try to adapt here in the answer.

5

The problem with an open field for user input is that even though it looks obvie/default, you can’t trust its content. The ideal would be to update the system and create address tables:

CREATE TABLE tb_pais(
    id SERIAL,
    pais VARCHAR(100),
    PRIMARY KEY (id)
);

CREATE TABLE tb_uf(
    id SERIAL,
    uf VARCHAR(100),
    sigla CHAR(2),
    id_pais INTEGER REFERENCES tb_pais(id),
    PRIMARY KEY (id)
);

CREATE TABLE tb_cidade(
    id SERIAL,
    cidade VARCHAR(100),
    id_uf INTEGER REFERENCES tb_uf(id),
    PRIMARY KEY (id)
);

CREATE TABLE tb_bairro(
    id SERIAL,
    bairro VARCHAR(100),
    id_cidade INTEGER REFERENCES tb_cidade(id),
    PRIMARY KEY (id)
);

CREATE TABLE tb_logradouro(
    id SERIAL,
    logradouro VARCHAR(100),
    id_bairro INTEGER REFERENCES tb_bairro(id),
    PRIMARY KEY (id)
);

CREATE TABLE tb_cep(
    id SERIAL,
    cep VARCHAR(100),
    id_logradouro INTEGER REFERENCES tb_logradouro(id),
    PRIMARY KEY (id)
);

CREATE TABLE tb_endereco(
    id_cliente INTEGER REFERENCES tb_cliente(id),
    id_pais INTEGER REFERENCES tb_pais(id),
    id_uf INTEGER REFERENCES tb_uf(id),
    id_cidade INTEGER REFERENCES tb_cidade(id),
    id_bairro INTEGER REFERENCES tb_bairro(id),
    id_logradouro INTEGER REFERENCES tb_logradouro(id),
    id_cep INTEGER REFERENCES tb_cep(id),
    numero INTEGER
);

With this you eliminate duplicate names, and can also develop search systems.

Done this you can go for updating the content in the database, which would consist of capturing the current content and insert in its proper place.

Let’s say as commented your current field follow the standard Rua endereço numero, could create a REGEX [^ ]+ (.+?) (\d+?). So group 1 would be the street and group 2 the number, the initial part up to the first space being discarded, Rua Almeida, 123, would be Almeida, to the backyard and 123 to the address number.

After this step completed, a management would be created to update the data from the backyard, and in case of duplicate removal and replacement of duplicate id.

  • I thought of a whole system of maintenance of this data, in PHP, if you want more details let me know, but yes it will be a lot of work.

4

For your data sample you can do, change the dynamic table to your actual table, create the fields in your table and try to make the change.

You can use a Begin tran and in case it’s all right commit or rollback

   declare @Enderecos table
(
  Enderecos varchar(max),
  TipodeLogradouro varchar(200),
  Logradouro varchar(200),
  Numero varchar(200)
)

insert into @Enderecos(Enderecos) values 
('Rua Antonio José Nerchis, 81'),
('Rua: Guaira/JD Pacera, 1201'),
('Rua 321, N 168'),
('R ALTAIR GAGLIARDI, 132'),
('Rua Rio Grande do Sul, 243'),
('Rua Duque de Caxias'),
('R CAPELLEN, 58'),
('ROD ACESSO PLINIO ARLINDO DE NEZ, 4303'),
('R Anibal Gazaniga, 107'),
('R D PEDRO I, 231'),
('Rua Dep. Ivan Ferreira do Amaral, 440'),
('R SANTA CATARINA, 711'),
('Rua Santa Catarina, 411'),
('Rua Augusto de Jesus, 77'),
('Rua Marcondes Sobrinho, 40'),
('rua Antonio Tonelli nº 755 casa 2')

update @Enderecos
set TipodeLogradouro = REPLACE(REPLACE(REPLACE(LEFT(Enderecos, charindex(' ', Enderecos) - 1),',' , ''),':' , ''),';' , ''),
Numero = 
case when  Enderecos  like '%º%' then substring(Enderecos, charindex('º', Enderecos) - 1, 100 )
        else 
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace
        (Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace
        (reverse(left(reverse(Enderecos), charindex(' ', reverse(Enderecos)) -1)        
        ) COLLATE sql_latin1_general_cp1251_ci_as, 'z', ''), 'x', ''), 'w', ''), 'y', ''), 'v', ''), 'u', ''), 't', ''), 's', ''), 'r', ''), 'q', ''), 'p', '')
        , 'o', ''), 'n', ''), 'm', ''), 'l', ''), 'k', ''), 'j', ''), 'i', ''), 'h', '')
        , 'g', ''), 'f', ''), 'e', ''), 'd', ''), 'c', ''), 'b', ''), 'a', ''),',' , ''),':' , ''),';' , ''),' ',''),'º',''),'.',''),'/',''),'\','')
      end
from @Enderecos 

update @Enderecos
set Numero = REPLACE(Numero,'N?','Nº')
update @Enderecos

set Logradouro = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Enderecos,TipodeLogradouro,''),Numero,'')
,':' , ''),';' , ''),',',''),'Nº',''),'.',''),'\','')


select *  from @Enderecos

changed to have a given ('rua Antonio Tonelli nº 755 casa 2') complement.

  • Your code worked, but I must confess I got lost in the amount of Replaces. kkk

Browser other questions tagged

You are not signed in. Login or sign up in order to post.