Encoding problem: A\u015bvagho\u1e63a (Python 3)

I’m working with Wikipedia, and I’m having some encoding problems. When I open a link like this in my browser, everything works fine:

https://en.wikipedia.org/wiki/A%C5%9Bvagho%E1%B9%A3a

That leads to the article with the following name:

https://en.wikipedia.org/wiki/Aśvaghoṣa

It is the same article, only with the URL displayed differently. In other words, the title is being percent-encoded there.

But I’m mining Wikipedia topviews:

https://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&date=yesterday&excludes=

For this article, the API returned the same title in the following form:

https://tools.wmflabs.org/pageviews/api.php?project=en.wikipedia.org&start=2018-09-13&end=2018-09-13&pages=Aśvaghoṣa

"A\u015bvagho\u1e63a": {
  "assessment": "Stub",
  "num_users": 1,
  "assessment_img": "f/f5/Symbol_stub_class.svg",
  "num_edits": 1
},

But when I try to build the following URL:

https://en.wikipedia.org/wiki/A\u015bvagho\u1e63a

Of course it doesn’t work.

What I want to know is how I can encode it this way (in Python 3):

A\u015bvagho\u1e63a -> A%C5%9Bvagho%E1%B9%A3a

1 answer

Some modules that can perform this encoding:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Escaping a title for HTML versus percent-encoding it for a URL."""
import html
from urllib.parse import quote, quote_plus

string = 'A\u015bvagho\u1e63a'

# cgi.escape() should only be used with Python 2. It was deprecated in
# Python 3.2 and removed in Python 3.8; html.escape() replaces it.
# html.escape() only escapes HTML characters such as &, < and >,
# so it leaves this title unchanged and does not percent-encode it.
print('HTML escape:', html.escape(string))
# quote() and quote_plus() encode the text as UTF-8 and percent-encode it.
# They differ only in how spaces are handled (%20 versus +).
print('Quote:', quote(string))
print('Quote plus:', quote_plus(string))

URL = 'https://en.wikipedia.org/wiki/'

print('HTML escape:', URL + html.escape(string))
print('Quote:', URL + quote(string))
print('Quote plus:', URL + quote_plus(string))

I believe there are other ways; which one fits best will depend on how you are fetching and saving the data.
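If the titles come straight from the API's JSON response, something like the sketch below may be enough. It assumes the response is parsed with `json.loads()`, which already turns the `\uXXXX` escapes into real characters, so only the percent-encoding step remains. The `payload` string is a shortened copy of the response in the question, and `article_url` is a hypothetical helper name, not part of any Wikipedia API.

```python
import json
from urllib.parse import quote, unquote

# Shortened sample of the API response shown in the question.
payload = '{"A\\u015bvagho\\u1e63a": {"assessment": "Stub", "num_users": 1}}'

BASE_URL = 'https://en.wikipedia.org/wiki/'


def article_url(title):
    """Build an article URL from a decoded title (hypothetical helper)."""
    # Wikipedia URLs use underscores in place of spaces in titles.
    # quote() encodes the text as UTF-8 and percent-encodes the bytes.
    return BASE_URL + quote(title.replace(' ', '_'))


# json.loads() decodes the \uXXXX escapes; the keys are the article titles.
titles = list(json.loads(payload))

for title in titles:
    url = article_url(title)
    print(url)
    # unquote() reverses the encoding, which is what the browser displays.
    print(unquote(url))
```

`quote()` is used here rather than `quote_plus()` because the title goes into the path of the URL, where a `+` would not be read as a space.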
