How to create a regular expression

I have the following div with some information:

<div class="endereco-item">
	<h2 class="azulclaro identify">Casa</h2>
	<div class="entrelinha_0"></div>
	<div class="font_15"></div>
	<div class="font_15"></div>
	<div class="font_15">R: Antonio Pires dos Santos, 647   praça central</div>
	<div class="font_15">Parque santo antonio - Sao Paulo - SP</div>
	<div class="font_15">CEP: 55555-555</div>
	<div class="font_15">Fone: (11)943-056-295 (55)555-555-555</div>
	<div id="ctl00_Body_rptEnderecos_ctl00_dvRadio" class="font_15 custom-checkbox">
	<input type="radio" id="radio0" name="radioSelect" checked onclick="setPrincipal(0)" />

What regular expression would capture the city (São Paulo) and the state (SP)?

  • I believe you are trying to solve this problem with a tool that is not suited to it. Why did you think of regular expressions?

  • I tried other approaches without success. I need to put this text into variables to build an array.

  • The recommended approach for web scraping is to use a DOM parser and then, if necessary, a regex for fine adjustments. But try this expression: [\s\w]+?-\s*[A-Z]{2}(?=<\/div>) and see the demo on Regex101.
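The approach suggested in the last comment (DOM parser first, regex only for the final adjustment) could look roughly like the sketch below. It assumes PHP's bundled DOM extension and that the city line is the first font_15 div whose text has the shape "neighborhood - city - ST"; the markup is the question's sample, shortened and with the closing tags added.

```php
<?php
// Sample markup from the question, shortened to the relevant lines.
$html = '<div class="endereco-item">
    <h2 class="azulclaro identify">Casa</h2>
    <div class="font_15">R: Antonio Pires dos Santos, 647   praça central</div>
    <div class="font_15">Parque santo antonio - Sao Paulo - SP</div>
    <div class="font_15">CEP: 55555-555</div>
</div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
// The XML prefix is a common trick so that loadHTML() reads the string as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$cidade = null;
$estado = null;

// Check the text of every font_15 div and keep the first one shaped
// like "neighborhood - city - ST" (assumption about the address format).
foreach ($xpath->query('//div[contains(@class, "font_15")]') as $div) {
    $texto = trim($div->textContent);
    if (preg_match('/^.+ - (.+) - ([A-Z]{2})$/', $texto, $m)) {
        $cidade = $m[1];
        $estado = $m[2];
        break;
    }
}

echo $cidade, ' / ', $estado, "\n";
```

Because the regex here only sees the text of one element, it no longer needs to know anything about tags.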

2 answers

If the format is always the one shown in the question, "neighborhood - city - state", with the state represented by two capital letters, it is fairly simple:

>.* - (.*) - ([A-Z]{2})<

(See the expression on regex101: https://regex101.com/r/EPTpOM/1)

That is: a tag-closing >, followed by any string (the neighborhood), followed by the separator " - ", followed by any string, which we store in a capture group (the city), then another separator, another capture group for a pair of uppercase letters (the state) and, finally, a tag-opening <.

In PHP you can pass an array to the function preg_match(). The city and the state are then returned in elements 1 and 2 of the array, respectively (element 0 holds the full match):

<?php
$html='<div class="endereco-item">
    <h2 class="azulclaro identify">Casa</h2>
    <div class="entrelinha_0"></div>
    <div class="font_15"></div>
    <div class="font_15"></div>
    <div class="font_15">R: Antonio Pires dos Santos, 647   praça central</div>
    <div class="font_15">Parque santo antonio - Sao Paulo - SP</div>
    <div class="font_15">CEP: 55555-555</div>
    <div class="font_15">Fone: (11)943-056-295 (55)555-555-555</div>
    <div id="ctl00_Body_rptEnderecos_ctl00_dvRadio" class="font_15 custom-checkbox">
    <input type="radio" id="radio0" name="radioSelect" checked onclick="setPrincipal(0)" />';

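$cidade_estado = array();
// Capture group 1 is the city, capture group 2 is the two-letter state.
$regex = '/>.* - (.*) - ([A-Z]{2})</';
// preg_match() fills $cidade_estado: [0] full match, [1] city, [2] state.
preg_match($regex, $html, $cidade_estado);

print_r($cidade_estado);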

(A runnable version of this PHP code is on repl.it: https://repl.it/NvhF/0)

  • It may be purism on my part, but I hate using regular expressions to extract information from a piece of text governed by a context-free grammar.

  • @Jeffersonquesado I agree: he could explode the string on \n, take the fifth line, strip the HTML, and then have a "purer" context in which to apply a regular expression (which would not even be necessary at that point)... Still (you can burn me at the stake for heresy), I use regular expressions quite a lot in web scraping, so I personally identified with the problem presented, haha...

  • You heretic! Burn at the stake! Hahaha! Jokes aside, I abuse the poor, innocent regular expressions for quick-and-dirty jobs. I often end up chaining seds and greps. Once in a while I might use them to identify content in source code or HTML, but not at the level of, say, putting it into production. My last adventure of this kind was automating a text replacement in Java code to fetch a singleton for classes whose names ended in async, instead of instantiating those objects again.

  • I feel obliged to use a comment to link this somewhat old XKCD blog post... 3 regexes in Perl and 2 greps, one of them variable and recursive inside an xargs. We always find bigger nonsense out there, haha...

  • Guys! How come I had never been to the blog? I had only read the strips and What If... Hahahaha! Very good!
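The line-splitting idea mentioned in these comments (explode on \n, strip the HTML, then work on plain text) can be sketched as follows. Rather than relying on a fixed line number, this version assumes the city line is the first one that splits into exactly three parts on " - "; that selection rule is an assumption, not something stated in the thread.

```php
<?php
// Sample markup from the question, shortened.
$html = '<div class="endereco-item">
    <h2 class="azulclaro identify">Casa</h2>
    <div class="entrelinha_0"></div>
    <div class="font_15"></div>
    <div class="font_15">R: Antonio Pires dos Santos, 647   praça central</div>
    <div class="font_15">Parque santo antonio - Sao Paulo - SP</div>
    <div class="font_15">CEP: 55555-555</div>';

// Split on line breaks, drop the tags, trim, and discard lines left empty.
$linhas = array_map('trim', array_map('strip_tags', explode("\n", $html)));
$linhas = array_values(array_filter($linhas, 'strlen'));

$cidade = null;
$estado = null;

// Assumption: the address line is the first one with exactly
// three parts, "neighborhood - city - state".
foreach ($linhas as $linha) {
    $partes = explode(' - ', $linha);
    if (count($partes) === 3) {
        [, $cidade, $estado] = $partes;
        break;
    }
}

echo $cidade, ' / ', $estado, "\n";
```

No regular expression is needed in this variant, which is the point made in the comment.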

It’s a bit complicated to get the city and state out of this HTML.

I ran a test here and got it working with the following regular expression:

/(?![^<>]*>)-\s?(?P<cidade>[a-zA-Z].*?)\s?-\s?(?P<estado>[a-zA-Z]{2})/

This way it finds the city and state even when the spacing and the letter case vary. I scrambled the code for testing, and it still managed to pick up several different forms.
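As a minimal sketch of how the named groups could be read in PHP: preg_match() exposes them as string keys in the match array. The expression below is the one from this answer with the first character class written as [a-zA-Z]; the sample markup is a shortened copy of the question's.

```php
<?php
$html = '<div class="font_15">R: Antonio Pires dos Santos, 647   praça central</div>
<div class="font_15">Parque santo antonio - Sao Paulo - SP</div>
<div class="font_15">CEP: 55555-555</div>';

// (?![^<>]*>) rejects hyphens that sit inside a tag, such as those in attribute values.
$regex = '/(?![^<>]*>)-\s?(?P<cidade>[a-zA-Z].*?)\s?-\s?(?P<estado>[a-zA-Z]{2})/';

$cidade = null;
$estado = null;

if (preg_match($regex, $html, $m)) {
    // Named groups are available under their names (and also by number).
    $cidade = $m['cidade'];
    $estado = $m['estado'];
}

echo $cidade, ' / ', $estado, "\n";
```

Note that preg_match() stops at the first match, so this relies on the city line being the first text containing " - ... - " in the markup.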

Here is the test with the messy code, where I put cities and states in various parts of the markup:

(image: enter image description here)

But if you can change the HTML, I suggest putting an ID on each div. That would make it easier to write a regular expression that looks directly for the right div.
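To illustrate the ID suggestion, here is a sketch in which the div carries a hypothetical id, "cidade-estado". That id is invented for this example and does not exist in the original page; the rest of the markup is shortened from the question.

```php
<?php
// Hypothetical markup: the id "cidade-estado" is an assumption for illustration only.
$html = '<div class="font_15">R: Antonio Pires dos Santos, 647</div>
<div id="cidade-estado" class="font_15">Parque santo antonio - Sao Paulo - SP</div>
<div class="font_15">CEP: 55555-555</div>';

// Anchor on the id first, then split the div's text on the " - " separators.
// [^<]* keeps every part inside the same element.
$regex = '/<div id="cidade-estado"[^>]*>[^<]* - ([^<]*) - ([A-Z]{2})<\/div>/';

$cidade = null;
$estado = null;

if (preg_match($regex, $html, $m)) {
    $cidade = $m[1];
    $estado = $m[2];
}

echo $cidade, ' / ', $estado, "\n";
```

With the id in place, the expression no longer depends on where the div sits in the page, only on the id being the first attribute as written here.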

In any case, I hope the expression I created works for you.
