Filter search block using regex

I'm trying to use a regex to extract the option values, but I can't restrict the match to just one of the two selects.

When I use the expression <option value="(.+?)" it returns the values from every select, when in fact I only want the ones under "fromPort".

I also tried the following, but it finds no matches: (?<=select name="fromPort" class="form-inline">)\s*.*(?=select)

    <select name="fromPort" class="form-inline">
        <option value="Paris">Paris</option>
        <option value="Philadelphia">Philadelphia</option>
        <option value="Boston">Boston</option>
        <option value="Portland">Portland</option>
        <option value="San Diego">San Diego</option>
        <option value="Mexico City">Mexico City</option>
        <option value="São Paolo">São Paolo</option>
    </select>
    <p>
    <h2>Choose your destination city:</h2>
    <select name="toPort" class="form-inline">
        <option value="Buenos Aires">Buenos Aires</option>
        <option value="Rome">Rome</option>
        <option value="London">London</option>
        <option value="Berlin">Berlin</option>
        <option value="New York">New York</option>
        <option value="Dublin">Dublin</option>
        <option value="Cairo">Cairo</option>
    </select>
  • 1

    Each language has its own variant of regular expression syntax, so whenever the subject is regex it is important to say which language you are working in. Parsing HTML with regex is not recommended: in this snippet, for example, if the author of the HTML changed the order of the attributes, you would have to rewrite your regex. There are many HTML and XML parsing tools available, and depending on the language you are using, an HTML parser may already be built into its standard library or framework.

  • Actually, it's for academic purposes. I'm not using any language; I'm working directly on regex101. I'd like to know how this could be done with regex. I know it's possible, but I can't work it out.

  • 1

    In the technical and academic world, parsing HTML with regex is considered bad practice. Because HTML is classified as a type 2 (context-free) language in the Chomsky hierarchy, it has to be parsed by a DPDA state machine with a state stack that builds an AST, and regex cannot analyze semantic variations. See the text "Analyzing Html in the Cthulhu way"; if you have difficulty with English, translate it to Portuguese by right-clicking and selecting translate.

  • Thanks for the tip, @Augustovasques. I will read the article.

  • 3

    Hello Willian, you don't need regex for this. If it is a string and you are using JavaScript, you can use DOMParser; if it is a string on the back end with PHP, you can use DOMDocument::loadHTML; if it is Java, you can use the jsoup library... If you say which language you will use (and whether it is back-end or front-end), I can suggest a better example because, as @Augustovasques said, regex may have problems with minimal unexpected "variants".

  • 2

    Using an HTML/XML parser is usually the best option, as they said above. For example, the regex in the answer below is naive and fails if there are two option elements on the same line, or one whose closing tag is on another line, or one that is commented out, or if the select has other attributes, or if name and class are in a different order, etc. Any minimal variation will require a change to the regex that is not always trivial, and the tendency is for it to become so complicated that it is no longer worth it. Further reading: here and here

  • 1

    And just to cite a few more examples of why it is not a good idea to use regex to manipulate HTML: https://answall.com/a/440262/112052 | https://answall.com/a/509938/112052
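
The parser-based approach recommended in the comments above can be sketched as follows. The question names no language, so Python and its standard-library `html.parser` module are an assumption here, and the `OptionCollector` class and `collect_option_values` helper are hypothetical names made up for illustration. This is only a minimal sketch, not a definitive implementation:

```python
from html.parser import HTMLParser

# Shortened copy of the HTML from the question (assumed input).
SAMPLE_HTML = """
<select name="fromPort" class="form-inline">
    <option value="Paris">Paris</option>
    <option value="Boston">Boston</option>
</select>
<select name="toPort" class="form-inline">
    <option value="Rome">Rome</option>
    <option value="London">London</option>
</select>
"""


class OptionCollector(HTMLParser):
    """Collects the value of every <option> inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.inside_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "select":
            # Attribute order does not matter: we look the name up by key.
            self.inside_target = attributes.get("name") == self.select_name
        elif tag == "option" and self.inside_target and "value" in attributes:
            self.values.append(attributes["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside_target = False


def collect_option_values(html, select_name):
    """Hypothetical helper: returns the option values of the named select."""
    collector = OptionCollector(select_name)
    collector.feed(html)
    collector.close()
    return collector.values


print(collect_option_values(SAMPLE_HTML, "fromPort"))
```

Because the attributes are read as key/value pairs, swapping `name` and `class` or adding extra attributes to the select does not require any change to the code.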


1 answer

1


Try this regex:

<select name="fromPort" class="form\-inline">\n(?:.*?<option value="(.*?)">.*?<\/option>\n){1,}</select>

Example here.

The regex above will return only the last option value in the list.
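
As a rough illustration of why only the last value comes back, here is a minimal sketch in Python (an assumption, since the question was tested on regex101 without naming a language). The `last_captured_value` helper is a hypothetical name, and the sample input is deliberately written without indentation, because the pattern expects `</select>` right after a line break:

```python
import re

# Assumed input: the select from the question, shortened and unindented.
UNINDENTED_HTML = (
    '<select name="fromPort" class="form-inline">\n'
    '<option value="Paris">Paris</option>\n'
    '<option value="Boston">Boston</option>\n'
    '</select>'
)

# Same structure as the regex above, with + in place of {1,}.
REPEATED_GROUP = re.compile(
    r'<select name="fromPort" class="form-inline">\n'
    r'(?:.*?<option value="(.*?)">.*?</option>\n)+'
    r'</select>'
)


def last_captured_value(html):
    """Hypothetical helper: returns group 1 of the single overall match."""
    match = REPEATED_GROUP.search(html)
    return match.group(1) if match else None


# The capture group is overwritten on each repetition, so only the
# value captured by the final repetition is kept.
print(last_captured_value(UNINDENTED_HTML))  # prints "Boston"
```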

Warning

You will not be able to extract all of these values with a single regex; you need at least two: one to extract the whole body of the select and another to extract the value of each option. The reason is that a single regex finds one match for the whole body, but you need to capture several items.

A single regex would only work if you knew the number of options in advance and built the regex with that many option groups, that is, manually wrote out one capture group per option.

If you open the link above and duplicate the part that matches the option, (?:.*?<option value="(.*?)">.*?<\/option>\n), it will return two values, and so on.
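
The two-regex approach described above might look like the following minimal sketch. Python is an assumption (the question names no language), the `two_step_option_values` helper is a hypothetical name, and the patterns still assume that `name` is the first attribute of the select and `value` the first attribute of each option, so they remain fragile against other markup variations:

```python
import re

# Copy of part of the HTML from the question (assumed input).
QUESTION_HTML = """
<select name="fromPort" class="form-inline">
    <option value="Paris">Paris</option>
    <option value="San Diego">San Diego</option>
    <option value="São Paolo">São Paolo</option>
</select>
<p>
<h2>Choose your destination city:</h2>
<select name="toPort" class="form-inline">
    <option value="Buenos Aires">Buenos Aires</option>
    <option value="Rome">Rome</option>
</select>
"""

# Step 2 pattern: one match per option, capturing its value.
OPTION_VALUE = re.compile(r'<option\s+value="([^"]*)"')


def two_step_option_values(html, select_name):
    """Hypothetical helper: extract the select body, then each option value."""
    # Step 1: capture everything between the opening tag of the named
    # select and the first </select> that follows (lazy, spanning lines).
    select_body = re.compile(
        r'<select\s+name="' + re.escape(select_name) + r'"[^>]*>(.*?)</select>',
        re.DOTALL,
    )
    body_match = select_body.search(html)
    if body_match is None:
        return []
    # Step 2: findall returns every capture, not just the last one.
    return OPTION_VALUE.findall(body_match.group(1))


print(two_step_option_values(QUESTION_HTML, "fromPort"))
```

Running the second pattern only on the text captured by the first is what keeps the toPort options out of the result.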

  • 1

    In my example I ended up using the toPort select, haha.

  • Excellent, that was basically the idea I was looking for!

  • 1

    Instead of {1,}, just use +. In any case, using regex to manipulate HTML is almost never a good idea (see my comment on the question, in addition to the links indicated). And just as a curiosity, in C# it is possible to capture multiple items when a capture group repeats (of course the question did not mention any language, but anyway, it is not impossible :-) It is just not supported in most languages/engines).
