Your input string is JSON, so it is better to use the right tools to manipulate this data: parse it with the json module and then manipulate the URL with urllib.parse:
# -*- coding: utf-8 -*-
import re
import json
import urllib.parse
dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
# convert the string into Python objects
jsondata = json.loads(dados)
# regex to check whether the URL already has a protocol
r = re.compile(r"^(https?|ftp)://")
# replace only the "Url" field
for d in jsondata:
    url = d['Url']
    # if there is no protocol, add any one, just so the parsing is done correctly
    if not r.match(url):
        url = "http://" + url
    d['Url'] = urllib.parse.urlparse(url).netloc
# convert back to a JSON string
dados = json.dumps(jsondata, ensure_ascii=False)
print(dados)
The output is:
[{"Id": 12345, "Date": "2018-11-03T00:00:00", "Quality": "Goodão", "Name": "X", "Description": null, "Url": "x.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12346, "Date": "2018-11-03T00:00:00", "Quality": "Good", "Name": "YYy", "Description": "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "Url": "www.y.com.br", "ParseUrl": "y beautiful", "Status": "Ativa", "Surveys": 0, "KeySearch": "y like", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12347, "Date": "2018-11-03T00:00:00", "Quçality": "Pending", "Name": "z Z", "Description": "Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur", "Url": "www.z.com.br", "ParseUrl": null, "Status": "Ativa", "Surveys": 112, "KeySearch": "z plant", "QualityId": 4, "Type": "Agro"}, {"Id": 12335, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12332, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}]
Note that the keys may come out in a different order than in the input: JSON is defined as an unordered set of name/value pairs, so the order is not guaranteed (it depends on the Python version; recent versions preserve insertion order in dicts, but the JSON specification does not require it).
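If you want json.loads to guarantee the original key order regardless of the Python version, you can pass an object_pairs_hook so each object is built from its key/value pairs in document order. A minimal sketch:

```python
import json
from collections import OrderedDict

# object_pairs_hook receives the (key, value) pairs in the order they
# appear in the document, so an OrderedDict keeps that order explicitly
data = json.loads('{"b": 1, "a": 2}', object_pairs_hook=OrderedDict)
print(list(data.keys()))  # ['b', 'a']
```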
Other details:
I used a regex (via the re module) to check whether the URL already has the protocol (the http:// at the beginning, for example). I used ^(https?|ftp)://, which means:
^: start of the string
https?: the text "http" or "https" (the s? indicates that the letter "s" is optional)
ftp: the text "ftp"
|: means "or", so (https?|ftp) means that this chunk can be http, https, or ftp
Next we have the literal characters ://. Add more protocols as needed, all separated by |. For example, ^(https?|ftp|telnet|mailto):// checks for http/https, ftp, telnet, or mailto.
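To see the pattern in action, a quick check against a few of the URLs from the sample data:

```python
import re

r = re.compile(r"^(https?|ftp)://")

# URLs that start with a recognized protocol match; bare domains do not
print(bool(r.match("https://www.y.com.br/sdfsfs")))  # True
print(bool(r.match("ftp://files.example.com")))      # True
print(bool(r.match("x.com.br/qweqwe")))              # False
```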
If the URL does not match this pattern (i.e., it has no protocol), I add one just so the parsing is done correctly (otherwise urlparse returns an empty netloc). Since you will not use the protocol for anything, any one will do.
As for the special characters: you can force the script to use a specific encoding, as I did in the first line (# -*- coding: utf-8 -*-). And in the dumps method, pass the ensure_ascii parameter as False - the default is True, which makes special characters be escaped instead of displayed directly.
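The difference is easy to see with one of the values from the sample data:

```python
import json

d = {"Quality": "Goodão"}
# default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes
print(json.dumps(d))                      # {"Quality": "Good\u00e3o"}
# ensure_ascii=False keeps the characters as-is
print(json.dumps(d, ensure_ascii=False))  # {"Quality": "Goodão"}
```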
If you want to keep the same order
But if by chance you need to keep exactly the same order of the keys, then the way is to use regex. One way is to look for the "Url":"..." section and capture the URL that is there; then urllib.parse extracts the part you need. I also use the same regex from the previous example to check whether the URL has the protocol, and add "http://" if it doesn't.
# -*- coding: utf-8 -*-
import urllib.parse
import re
dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
# regex to check whether the URL already has a protocol
r = re.compile(r"^(https?|ftp)://")
dados = re.sub(r'(?<="Url":")[^"]+(?=")', lambda m: urllib.parse.urlparse(m.group(0) if r.match(m.group(0)) else "http://" + m.group(0)).netloc, dados)
print(dados)
The output is:
[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"www.y.com.br","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"www.z.com.br","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]
It is worth remembering that in this case we are not using the most suitable tools: the regex does not care whether the string is really well-formed JSON; it just looks for the indicated snippet and makes the substitution.
A brief explanation of the regex: (?<="Url":") and (?=") are, respectively, a lookbehind and a lookahead. They are used to check what comes before and after a certain stretch. That is, I want a snippet of the string that has "Url":" before it and " after it. The parentheses and the characters ?, < and = are part of the regex syntax that defines this behavior (in addition, they are not part of the match, which contains only what is between them - in this case, the URL).
Between the lookbehind and the lookahead we have [^"]+, which means one or more occurrences (+) of anything other than a quotation mark ([^"]). The brackets with ^ indicate that I do not want the characters inside them, and since the only one there is ", it matches any character that is not a quote. In other words, the regex means: "Url":" followed by one or more non-quote characters, followed by ".
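Isolated on a small string, the lookaround behavior looks like this: the match contains only the URL, and a substitution touches nothing around it.

```python
import re

s = '{"Url":"x.com.br/qweqwe","Name":"X"}'
# the lookbehind/lookahead anchor the match without consuming
# the surrounding text, so only the URL itself is captured/replaced
m = re.search(r'(?<="Url":")[^"]+(?=")', s)
print(m.group(0))  # x.com.br/qweqwe
print(re.sub(r'(?<="Url":")[^"]+(?=")', 'REPLACED', s))
# {"Url":"REPLACED","Name":"X"}
```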
Then I use a lambda to make the substitution. The parameter passed to the lambda is the regex match (i.e., the portion that was captured), and I use urllib.parse to get the part of the URL that I want. I chose urllib.parse to manipulate the URL because it is easier (besides being dedicated precisely to handling URLs): as you can see in @nosklo's answer, a regex that picks out valid URL parts is too complicated to be worth it. So I ended up using a simpler expression (capture everything between the quotes, then pass it to urlparse, which can more easily check whether it is actually a URL).
Although it is possible with regex, wouldn't it be better to use the functions made specifically for URL parsing? https://docs.python.org/3/library/urllib.parse.html
– hkotsubo
hkotsubo, you need to use a replace because you need to preserve the rest of the data in the variable, just as you told nosklo
– DaniloAlbergardi
Could you [Edit] the question and add examples of the values that should be preserved? Anyway, I think we could do a split on the data, treat the URLs one by one, and then join them again with join. But to be sure, I would need to see a few examples... – hkotsubo
hkotsubo, I added more examples; the example JSON has only 3 examples, but in reality there are a great many records. Thank you
– DaniloAlbergardi