"Permissive" regular expression to detect extensions and allowed hosts

I have a list of links:

https://www.exemplo.com/
https://www.exemplo.com/home/
https://www.exemplo.com/logo.png
https://intranet.exemplo.com/
https://admin.exemplo.com/login
https://www.exemplo.com/sobre/
https://www.exemplo.com/shell.php.log
https://www.exemplo.com/background.jpg

and I want to identify the links that start with

https://www.exemplo.com/

and do not end with jpg or png.

In this case the following URLs would be blocked:

    https://www.exemplo.com/logo.png
    https://www.exemplo.com/background.jpg
    https://intranet.exemplo.com/
    https://admin.exemplo.com/login

  • In Python or in JavaScript?

  • In JavaScript, but it can also be in Python, @Miguel

  • I'm going to do both

  • It is worth mentioning that this is a simple string comparison; you don't need regex for something this simple.

  • Does it have to be with regex? @Bacco is right, in this case it is not necessary. We can do it more simply

  • @Miguel if you want, post a regex version to answer the question, but give an example with substring too; I think it adds value to the answer.

  • @Bacco if I knew I wouldn't be here asking. I'm trying an online regex tester, but it doesn't go through the whole list

  • @Nikobellic there are several online testers. In almost all of them you have to enable multiline mode somewhere to test lists.

2 answers

2

Another simple way to achieve the same result is to use JavaScript's native Array.prototype.filter() method, for example:

function filtrarUrls(lista) {
    var base = 'https://www.exemplo.com/';

    return lista
        // keep only URLs that start with the allowed base
        .filter((url) => url.startsWith(base))
        // drop URLs that end in .jpg or .png (case-insensitive)
        .filter((url) => !/\.(jpg|png)$/i.test(url));
}

And to use the function:

var urls = ['https://url1.com', 'https://url2.com', ...];

var urlsFiltradas = filtrarUrls(urls); // returns an array with only the filtered URLs.

1


We have then:

urls = ["https://www.exemplo.com/", "https://www.exemplo.com/home/", "https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/sobre/", "https://www.exemplo.com/shell.php.log", "https://www.exemplo.com/background.jpg"]

Let's collect the ones that end with png/jpg or that don't have "www".

With regex in Python:

import re

# compile once, outside the loop; raw strings keep the backslashes intact
img = re.compile(r'^.*\.(jpg|JPG|png)$')
www = re.compile(r'(.*?)//www\.(.*?)')

bloqueados = []
for url in urls:
    # blocked: image extension, or no "//www." in the URL
    if img.match(url) or not www.match(url):
        bloqueados.append(url)
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']

Or, as a list comprehension:

import re
bloqueados = [url for url in urls if re.match(r'^.*\.(jpg|JPG|png)$', url) or re.match(r'(.*?)//www\.(.*?)', url) is None]
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']
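
Since the question asks for one "permissive" expression, the two checks can also be folded into a single pattern that matches the allowed links instead of the blocked ones. This is only a sketch: the names PERMITIDA and permitidas are made up for illustration, and it assumes that "allowed" means exactly "starts with https://www.exemplo.com/ and does not end in .jpg or .png", ignoring case.

```python
import re

# Assumption: allowed = starts with https://www.exemplo.com/ and does not end in .jpg/.png.
# The negative lookahead (?!...) rejects any URL whose remainder ends in a blocked extension.
PERMITIDA = re.compile(r'^https://www\.exemplo\.com/(?!.*\.(?:jpg|png)$)', re.IGNORECASE)

urls = [
    "https://www.exemplo.com/",
    "https://www.exemplo.com/home/",
    "https://www.exemplo.com/logo.png",
    "https://intranet.exemplo.com/",
    "https://admin.exemplo.com/login",
    "https://www.exemplo.com/sobre/",
    "https://www.exemplo.com/shell.php.log",
    "https://www.exemplo.com/background.jpg",
]

permitidas = [url for url in urls if PERMITIDA.match(url)]
bloqueados = [url for url in urls if not PERMITIDA.match(url)]

print(permitidas)
print(bloqueados)
```

With the list from the question, permitidas should hold the four www.exemplo.com links that are not images, and bloqueados the same four URLs as in the versions above.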

For this simple case, though, I wouldn't use regex. I would do:

bloqueados = [url for url in urls if url.endswith(('.png', '.jpg')) or 'https://www.' not in url]
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']
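
One caveat with a plain substring test is that 'https://www.' can appear anywhere in the URL (in a query string, for example), and an extension check on the whole string misses cases like logo.png?v=2. If that matters, a possible sketch is to parse the URL first with the standard library's urllib.parse. The names HOST_PERMITIDO, EXTENSOES_BLOQUEADAS and bloqueada are hypothetical, and the rule "only https on www.exemplo.com is allowed" is an assumption drawn from the question's examples.

```python
import posixpath
from urllib.parse import urlsplit

HOST_PERMITIDO = "www.exemplo.com"        # assumption: the only allowed host
EXTENSOES_BLOQUEADAS = {".jpg", ".png"}   # assumption: extensions named in the question

def bloqueada(url):
    """Return True if the URL should be blocked under the assumptions above."""
    partes = urlsplit(url)
    # Look at the path only, so query strings and fragments do not hide the extension.
    ext = posixpath.splitext(partes.path)[1].lower()
    return (
        partes.scheme != "https"
        or partes.hostname != HOST_PERMITIDO
        or ext in EXTENSOES_BLOQUEADAS
    )

urls = [
    "https://www.exemplo.com/",
    "https://www.exemplo.com/home/",
    "https://www.exemplo.com/logo.png",
    "https://intranet.exemplo.com/",
    "https://admin.exemplo.com/login",
    "https://www.exemplo.com/sobre/",
    "https://www.exemplo.com/shell.php.log",
    "https://www.exemplo.com/background.jpg",
]

bloqueados = [url for url in urls if bloqueada(url)]
print(bloqueados)
```

For the list in the question this should give the same four blocked URLs; the difference only shows up on inputs such as https://www.exemplo.com/logo.PNG?v=2, which this version also blocks.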

With regex in JavaScript:

var bloqueados = [];
for (var i = 0; i < urls.length; i++) {
    // blocked: ends in .jpg/.png, or has no "//www." in the URL
    if (/\.(jpg|png)$/.test(urls[i]) || !/\/\/www\./.test(urls[i])) {
        bloqueados.push(urls[i]);
    }
}
console.log(bloqueados); // ["https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/background.jpg"]

Without regex in JavaScript:

var bloqueados = [];
for (var i = 0; i < urls.length; i++) {
    // the text after the last dot is taken as the extension
    var exts = urls[i].split('.');
    var ext = exts[exts.length - 1];
    if (ext === 'png' || ext === 'jpg' || urls[i].indexOf('//www.') < 0) {
        bloqueados.push(urls[i]);
    }
}
console.log(bloqueados); // ["https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/background.jpg"]

  • The last expression is very well thought out, @Miguel, thanks

  • I'm still working on the JavaScript solution, if you want it

  • No, I don't want to wear out your skills. I asked for both because the syntax is similar, but your answer has already settled it here, thanks

  • I had already started :P, no problem. There is already a JavaScript solution above too, @Nikobellic. Thanks, Niko

  • @Nikobellic I also found a better solution, the last Python one. I think you will like it more