Regular expression for dynamic URL

Viewed 623 times

0

I have an HTML file containing URLs that follow this pattern: https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010 The general form is protocol://domain/dynamic-string-000-0000-000

I want to get all the links that follow this pattern, so I wrote the following regex: (https\:\/\/?)www\.olympikus\.com\.br\/(.*)\-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}

Unfortunately, the pattern starts matching at the first protocol://domain/ portion and only ends at the last possible match of -000-0000-000, returning one long raw string with everything in between because of the greedy (.*). I cannot work out how to handle the dynamic part of the URL.
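The behaviour described above can be reproduced with a small sketch. The two links on a single line are made-up sample data (the product codes are hypothetical), and the pattern is the one from the question with the unnecessary escapes removed:

```javascript
// Hypothetical sample: two product links on the same line of HTML.
const sameLineHtml =
  '<a href="https://www.olympikus.com.br/tenis-a-D22-1131-010">A</a> ' +
  '<a href="https://www.olympikus.com.br/tenis-b-E33-2242-020">B</a>';

// The pattern from the question, with the greedy (.*) in the middle.
const greedy =
  /(https:\/\/?)www\.olympikus\.com\.br\/(.*)-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}/g;

const greedyMatches = sameLineHtml.match(greedy) || [];

// A single match comes back: it starts at the first link and ends at the
// last product code on the line, swallowing everything in between.
console.log(greedyMatches.length);
console.log(greedyMatches[0]);
```

Because `.` does not cross line terminators, the problem only shows up when several links share a line, which is common in minified HTML.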

How can I write this regex so that it returns all the links?

I am currently using egrep in the terminal, but JavaScript examples are welcome because I intend to write a crawler in that language with Node.js.

  • Yes. Give it to me anyway.

2 answers

2

Regex

This would be the regex: ((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*) The demo on Regex101 shows how it works more didactically.

Code

Example generated by Regex101

Returns group 2

const regex = /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/gm;
const str = `http://dominio.do/strig-dinâmica-000-0000-000
https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010
ftp://dominio.br/strig-dinâmica-000-0000-000
dominio.c/strig-dinâmica-000-0000-000`;
const subst = `$2`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

SOen example

Code from a now-deleted user that can be seen here

Returns the entire string

var regex = /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/g;

var input = `http://dominio.do/strig-dinâmica-000-0000-000
https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010
ftp://dominio.br/strig-dinâmica-000-0000-000
dominio.c/strig-dinâmica-000-0000-000`;

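let match;
// With the g flag, each exec() call resumes at lastIndex and returns the next match
while ((match = regex.exec(input)) !== null) {
    document.write(match[0] + "<br/>");
}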

Debug

The Debuggex diagram can be seen at the link and helps in understanding the regex, together with the demo on Regex101.

Explanation:

((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)

  • 1st Capture Group - ((?:https|http|ftp)?:\/\/)?
    • Quantifier ? - Matches zero or one time, as many times as possible, giving back as needed (Greedy)
    • Non-capturing group - (?:https|http|ftp)?
      • Quantifier ? - Matches zero or one time, as many times as possible, giving back as needed (Greedy)
      • Alternatives - the options separated by |, which acts as a boolean OR.
        • 1st Alternative - https matches the characters https literally
        • 2nd Alternative - http matches the characters http literally
        • 3rd Alternative - ftp matches the characters ftp literally
    • : matches the character : literally
    • \/\/ matches the characters // literally
  • 2nd Capture Group - ([^\/,\s]+\.[^\/,\s]+?)
    • [^\/,\s]+ - Matches a character not present in the set
      • Quantifier + - Matches between one and unlimited times, as many times as possible, giving back as needed (Greedy)
      • \/ matches the character / literally
      • , matches the character , literally
      • \s matches any whitespace character (equal to [\r\n\t\f\v ])
    • \. matches the character . literally
    • [^\/,\s]+? - Matches a character not present in the set
      • Quantifier +? - Matches between one and unlimited times, as few times as possible, expanding as needed (Lazy)
      • \/ matches the character / literally
      • , matches the character , literally
      • \s matches any whitespace character (equal to [\r\n\t\f\v ])
  • Positive Lookahead (?=\/|,|\s|$|\?|#) - Asserts that one of the alternatives follows, without consuming it
    • Alternatives - the options separated by |, which acts as a boolean OR.
      • 1st alternative \/ matches the character / literally
      • 2nd alternative , matches the character , literally
      • 3rd alternative \s matches any whitespace character (equal to [\r\n\t\f\v ])
      • 4th alternative $ asserts position at the end of a line
      • 5th alternative \? matches the character ? literally
      • 6th alternative # matches the character # literally
  • 3rd Capture Group - (.*)
    • . matches any character (except for line terminators)
    • Quantifier * - Matches between zero and unlimited times, as many times as possible, giving back as needed (Greedy)

The second group is the one that "matters" here: it holds the domain portion of the link. If you want the entire string instead, use group 0 (the full match).
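To see what each group holds, here is a minimal sketch using `String.prototype.matchAll` on two of the sample lines. The property names `full`, `scheme`, `host` and `rest` are labels chosen for this illustration only; they are not part of the regex.

```javascript
const groupRegex =
  /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/gm;

const sample = `https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010
ftp://dominio.br/strig-dinâmica-000-0000-000`;

// matchAll requires the g flag and yields one match object per link.
const parts = [...sample.matchAll(groupRegex)].map((m) => ({
  full: m[0],   // group 0: the entire match
  scheme: m[1], // group 1: protocol plus "://"
  host: m[2],   // group 2: the domain
  rest: m[3],   // group 3: everything after the domain, up to end of line
}));

console.log(parts);
```

Note that group 3 is still the greedy (.*), so it runs to the end of the line regardless of what follows the link.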

0

Assuming the variable part consists only of lowercase letters, digits and hyphens, replace the (.*) with [a-z0-9\-]+; that should solve it.
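A minimal sketch of that substitution in JavaScript, assuming the slug really contains only lowercase letters, digits and hyphens. The HTML snippet is invented sample data (the second and third product codes and the "institucional" page are hypothetical):

```javascript
// Hypothetical page fragment: three product links (two on one line)
// and one link that does not end in a product code.
const pageHtml = `
<a href="https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010">Flower</a>
<a href="https://www.olympikus.com.br/tenis-a-D22-1131-010">A</a> <a href="https://www.olympikus.com.br/tenis-b-E33-2242-020">B</a>
<a href="https://www.olympikus.com.br/institucional">About</a>`;

// The restricted class cannot cross quotes, spaces or slashes, so each
// match stays inside a single URL.
const productLink =
  /https:\/\/www\.olympikus\.com\.br\/[a-z0-9-]+-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}/g;

const links = pageHtml.match(productLink) || [];
console.log(links);
```

In the terminal, the same pattern (without the JavaScript delimiters and slash escapes) can be passed to `grep -Eo`, where `-o` prints only the matched text instead of the whole line.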
