Your input string is JSON, so it is better to use the right tools to manipulate this data: parse it with the json module and then manipulate the URL with urllib.parse:
# -*- coding: utf-8 -*-
import re
import json
import urllib.parse
dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
# convert the string into Python objects
jsondata = json.loads(dados)
# regex to check whether the URL already has a protocol
r = re.compile(r"^(https?|ftp)://")
# replace only the "Url" field
for d in jsondata:
    url = d['Url']
    # if there is no protocol, add any one, just so the parsing is done correctly
    if not r.match(url):
        url = "http://" + url
    d['Url'] = urllib.parse.urlparse(url).netloc
# convert back to a JSON string
dados = json.dumps(jsondata, ensure_ascii=False)
print(dados)
The output is:
[{"Id": 12345, "Date": "2018-11-03T00:00:00", "Quality": "Goodão", "Name": "X", "Description": null, "Url": "x.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12346, "Date": "2018-11-03T00:00:00", "Quality": "Good", "Name": "YYy", "Description": "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "Url": "www.y.com.br", "ParseUrl": "y beautiful", "Status": "Ativa", "Surveys": 0, "KeySearch": "y like", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12347, "Date": "2018-11-03T00:00:00", "Quçality": "Pending", "Name": "z Z", "Description": "Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur", "Url": "www.z.com.br", "ParseUrl": null, "Status": "Ativa", "Surveys": 112, "KeySearch": "z plant", "QualityId": 4, "Type": "Agro"}, {"Id": 12335, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12332, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}]
Note that the keys may come out in a different order than in the input: JSON is defined as an unordered set of name/value pairs, so the order is not guaranteed (it depends on the Python version; recent versions preserve insertion order in dicts, but the JSON specification does not require it).
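If you want json.loads to guarantee the original key order regardless of the Python version, you can pass an object_pairs_hook so each object is built from its key/value pairs in document order. A minimal sketch:

```python
import json
from collections import OrderedDict

# object_pairs_hook receives the (key, value) pairs in the order they
# appear in the document, so an OrderedDict keeps that order explicitly
data = json.loads('{"b": 1, "a": 2}', object_pairs_hook=OrderedDict)
print(list(data.keys()))  # ['b', 'a']
```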
Other details:
I used a regex (via the re module) to check whether the URL already has the protocol (the http:// at the beginning, for example). I used ^(https?|ftp)://, which means:
^: start of the string
https?: the text "http" or "https" (the s? indicates that the letter "s" is optional)
ftp: the text "ftp"
|: means "or", so (https?|ftp) means that this chunk can be http, https, or ftp
Next we have the literal characters ://. Add more protocols as needed, all separated by |. For example, ^(https?|ftp|telnet|mailto):// checks for http/https, ftp, telnet, or mailto.
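To see the pattern in action, a quick check against a few of the URLs from the sample data:

```python
import re

r = re.compile(r"^(https?|ftp)://")

# URLs that start with a recognized protocol match; bare domains do not
print(bool(r.match("https://www.y.com.br/sdfsfs")))  # True
print(bool(r.match("ftp://files.example.com")))      # True
print(bool(r.match("x.com.br/qweqwe")))              # False
```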
If the URL does not match this pattern (i.e., it has no protocol), I add one just so the parsing is done correctly (otherwise urlparse returns an empty netloc). Since you will not use the protocol for anything, any one will do.
As for the special characters: you can force the script to use a specific encoding, as I did in the first line (# -*- coding: utf-8 -*-). And in the dumps method, pass the ensure_ascii parameter as False - the default is True, which makes special characters be escaped instead of displayed directly.
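The difference is easy to see with one of the values from the sample data:

```python
import json

d = {"Quality": "Goodão"}
# default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes
print(json.dumps(d))                      # {"Quality": "Good\u00e3o"}
# ensure_ascii=False keeps the characters as-is
print(json.dumps(d, ensure_ascii=False))  # {"Quality": "Goodão"}
```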
If you want to keep the same order
But if by chance you need to keep exactly the same order of the keys, then the way is to use regex. One way is to look for the "Url":"..." section and capture the URL that is there; then urllib.parse extracts the part you need. I also use the same regex from the previous example to check whether the URL has the protocol, and add "http://" if it doesn't.
# -*- coding: utf-8 -*-
import urllib.parse
import re
dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
# regex to check whether the URL already has a protocol
r = re.compile(r"^(https?|ftp)://")
dados = re.sub(r'(?<="Url":")[^"]+(?=")', lambda m: urllib.parse.urlparse(m.group(0) if r.match(m.group(0)) else "http://" + m.group(0)).netloc, dados)
print(dados)
The output is:
[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"www.y.com.br","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"www.z.com.br","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]
It is worth remembering that in this case we are not using the most suitable tools: the regex does not care whether the string is really well-formed JSON; it just looks for the indicated snippet and makes the substitution.
A brief explanation of the regex: (?<="Url":") and (?=") are, respectively, a lookbehind and a lookahead. They are used to check what comes before and after a certain stretch. That is, I want a snippet of the string that has "Url":" before it and " after it. The parentheses and the characters ?, < and = are part of the regex syntax that defines this behavior (in addition, they are not part of the match, which contains only what is between them - in this case, the URL).
Between the lookbehind and the lookahead we have [^"]+, which means one or more occurrences (+) of anything other than a quotation mark ([^"]). The brackets with ^ indicate that I do not want the characters inside them, and since the only one there is ", it matches any character that is not a quote. In other words, the regex means: "Url":" followed by one or more non-quote characters, followed by ".
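Isolated on a small string, the lookaround behavior looks like this: the match contains only the URL, and a substitution touches nothing around it.

```python
import re

s = '{"Url":"x.com.br/qweqwe","Name":"X"}'
# the lookbehind/lookahead anchor the match without consuming
# the surrounding text, so only the URL itself is captured/replaced
m = re.search(r'(?<="Url":")[^"]+(?=")', s)
print(m.group(0))  # x.com.br/qweqwe
print(re.sub(r'(?<="Url":")[^"]+(?=")', 'REPLACED', s))
# {"Url":"REPLACED","Name":"X"}
```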
Then I use a lambda to make the substitution. The parameter passed to the lambda is the regex match (i.e., the portion that was captured), and I use urllib.parse to get the part of the URL that I want. I chose urllib.parse to manipulate the URL because it is easier (besides being dedicated precisely to handling URLs): as you can see in @nosklo's answer, a regex that picks out valid URL parts is too complicated to be worth it. So I ended up using a simpler expression (capture everything between the quotes, then pass it to urlparse, which can more easily check whether it is actually a URL).
Although it is possible with regex, wouldn't it be better to use the functions made specifically for URL parsing? https://docs.python.org/3/library/urllib.parse.html
– hkotsubo
hkotsubo, you need to use a replace because you need to preserve the rest of the data in the variable, just as you told nosklo
– DaniloAlbergardi
Could you [Edit] the question and add examples of the values that should be preserved? Anyway, I think we could do a split on the data, treat the URLs one by one, and then join them again with join. But to be sure, I would need to see a few examples... – hkotsubo
hkotsubo, I added more examples; the example JSON has only 3 examples, but in reality there are a great many records. Thank you
– DaniloAlbergardi