What is the difference between the output of utils::URLencode() in R and urllib.parse.quote() in Python?

Question


Asked 3 years, 11 months ago


I would like to understand the difference between utils::URLencode() in R and urllib.parse.quote() in Python. For example:

In R:

tster <- '{"yearStart":"2020",\n"yearEnd":"2020",\n"typeForm":1}'
utils::URLencode(tster)

Output:

'%7B%22yearStart%22:%222020%22,%0A%22yearEnd%22:%222020%22,%0A%22typeForm%22:1%7D'

Python:

import urllib.parse

tster = '{"yearStart":"2020",\n"yearEnd":"2020",\n"typeForm":1}'

result_py = urllib.parse.quote(tster, encoding = 'utf-8')
result_py

Output:

'%7B%22yearStart%22%3A%222020%22%2C%0A%22yearEnd%22%3A%222020%22%2C%0A%22typeForm%22%3A1%7D'

In this case the difference can be resolved with the following, which makes the Python output equal to R's:

result_py.replace('%3A',':').replace('%2C',',')

Output:

'%7B%22yearStart%22:%222020%22,%0A%22yearEnd%22:%222020%22,%0A%22typeForm%22:1%7D'

But for larger strings this approach is ugly and laborious. How can I get Python output equal to R's?

I left both snippets for comparison at these links:

In R: https://colab.research.google.com/drive/1oj-GCCUX4MZB_jW942DBMny-kN1sXdvy?usp=sharing

Python: https://colab.research.google.com/drive/1jpo9GrcTrFNidIdQih3PkLMR88t690kZ?usp=sharing

1 answer


by Rfroes87 • **465** points · Answer 1 · 2021-08-29T04:05:38+00:00

The characters , and : in your example are reserved characters which, according to R's own documentation, are preserved by default. Highlighting the relevant section:

In addition, ! $ & ' ( ) * + , ; = : / ? @ # [ ] are reserved characters, and should be encoded unless used in their reserved sense, which is scheme specific. The default in URLencode is to leave them alone, which is appropriate for file:// URLs, but probably not for http:// ones.

That is, these characters are classified as reserved by Internet STD 66 (formerly RFC 3986), and the default behavior of utils::URLencode is to leave them unencoded.

If you want to reverse this behavior, simply override the default value of the reserved parameter like this:

utils::URLencode(tster, reserved = TRUE)

With this, the result should match the Python 3 output shown in the question. As an additional detail, in Python you can use the safe parameter of urllib.parse.quote to specify which characters (in addition to the unreserved alphanumeric characters, -, _, . and ~) should not be encoded, as described in this part of the help:

quote(string, safe='/', encoding=None, errors=None)

(...)

The quote function %-escapes all characters that are neither in the unreserved chars ("always safe") nor the additional chars set via the safe arg.

(...)
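
As a quick illustration of the safe parameter described in the excerpt above, here is a minimal sketch. The sample string is made up for illustration and is not taken from the question:

```python
from urllib.parse import quote

sample = "a/b:c,d"  # hypothetical sample string, for illustration only

# Default: safe='/', so '/' is kept while ':' and ',' are escaped.
default_safe = quote(sample)

# Empty safe set: every reserved character is escaped, including '/'.
nothing_safe = quote(sample, safe="")

# Custom safe set: '/', ':' and ',' are all left untouched.
custom_safe = quote(sample, safe="/:,")

print(default_safe)
print(nothing_safe)
print(custom_safe)
```

Letters, digits and the characters -, _, . and ~ are never escaped, whatever value safe has.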

By adding the characters from your example, in addition to the default /, you can reproduce R's result:

>>> urllib.parse.quote(tster, encoding = 'utf-8', safe='/,:')
'%7B%22yearStart%22:%222020%22,%0A%22yearEnd%22:%222020%22,%0A%22typeForm%22:1%7D'
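
To avoid adding safe characters case by case for larger strings, one option is to pass the full set of RFC 3986 reserved characters as safe. The sketch below is only an approximation of R's behavior, not an exact port of utils::URLencode: the names r_like_urlencode and R_RESERVED are hypothetical, and R's repeated argument is not modelled.

```python
from urllib.parse import quote

# Reserved characters per RFC 3986; R leaves these alone when reserved = FALSE.
R_RESERVED = "!$&'()*+,;=:/?@#[]"


def r_like_urlencode(text: str, reserved: bool = False) -> str:
    """Approximate utils::URLencode(text, reserved = ...) (hypothetical helper)."""
    # With reserved=True only the unreserved characters survive, so safe is empty.
    safe = "" if reserved else R_RESERVED
    return quote(text, safe=safe, encoding="utf-8")


tster = '{"yearStart":"2020",\n"yearEnd":"2020",\n"typeForm":1}'

# Mirrors utils::URLencode(tster): ':' and ',' are left as they are.
print(r_like_urlencode(tster))

# Mirrors utils::URLencode(tster, reserved = TRUE): ':' and ',' are escaped.
print(r_like_urlencode(tster, reserved=True))
```

For the string in the question, the first call reproduces the R output and the second reproduces the default Python output. Other inputs may differ in edge cases, so it is worth comparing against R directly.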