Regular expression to get the number between two slashes

Question

Asked 5 years, 9 months ago

I have code that takes a URL, splits it, and returns a list.

url = 'https://www.site.com.br/categoria-produto/category/page/67/?gclid=Cjkdksjkcm35522'

last_page = url
if last_page.split("page")[1]:
   t = last_page.split("page")[1]
   print(last_page)

Depending on the URL, though, the value can be /3/ or /23/.

I have no way of knowing in advance how many digits will appear between the two slashes: it may be one, two, or even three.

The only approach I could think of was a regular expression, but I'm not sure how to write it.

One detail: if I slice by position, e.g. print(last_page[1:4]), and there is only one digit between the two slashes, the slice also picks up the slash.
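To make the slicing problem concrete, here is a minimal sketch. The three shortened URLs are hypothetical and only vary the number of digits after /page/; the slice is applied to the part of the string that follows "page".

```python
# Hypothetical shortened URLs, used only to illustrate the slicing problem.
urls = [
    'https://www.site.com.br/category/page/3/?gclid=abc',
    'https://www.site.com.br/category/page/67/?gclid=abc',
    'https://www.site.com.br/category/page/123/?gclid=abc',
]

# Everything after the first "page", e.g. '/67/?gclid=abc'
tails = [u.split("page")[1] for u in urls]

# A fixed slice always takes three characters, whatever the number's length.
slices = [t[1:4] for t in tails]
print(slices)  # ['3/?', '67/', '123']
```

Only the three-digit case comes out clean, which is why a fixed position is not enough here.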

2 answers

It depends on what the URL looks like. If the URL contains only one occurrence of /page/number/, one option is to use search (from the re module):

import re

url = 'https://www.site.com.br/categoria-produto/category/page/67/?gclid=Cjkdksjkcm35522'
m = re.search(r'/page/(\d+)', url)
if m:
    print(m[1]) # 67

That is, the regex looks for /page/ and then checks whether it is followed by one or more digits (the shortcut \d corresponds to a digit from 0 to 9, and the quantifier + means "one or more occurrences").

The \d+ part is in parentheses to form a capture group. That way I can get just the number with m[1] (it is the first pair of parentheses, so it is the first capture group, hence index 1).
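As a small illustration of the difference between the whole match and the capture group, here is a sketch using the same URL. The variable names are only for illustration, and the int conversion is an optional extra step, not part of the original answer.

```python
import re

url = 'https://www.site.com.br/categoria-produto/category/page/67/?gclid=Cjkdksjkcm35522'
m = re.search(r'/page/(\d+)', url)

if m:
    whole_match = m[0]            # everything the regex matched: '/page/67'
    page_text = m[1]              # only the first capture group: '67'
    page_number = int(page_text)  # convert to int if you need arithmetic
    print(whole_match, page_text, page_number + 1)
```

Note that the captured value is a string, so it needs converting before you can, for example, compute the next page.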

If the number is not necessarily followed by a slash (i.e., the URL can end with page/67 and have nothing after it), the regex still works: \d+ consumes digits until it finds a character that is not a digit, or the end of the string. But if you want the number only when a slash follows it, just change the regex to r'/page/(\d+)/'. I don't know what the URLs you will check look like, but with regex it is important to state exactly what you want and what you don't want, because depending on the case this can make a difference.
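The difference between the two patterns can be sketched as follows. The shortened URLs and the helper page_of are hypothetical, introduced only for this comparison.

```python
import re

# Hypothetical URLs: one with a slash after the number, one ending at the number.
with_slash = 'https://www.site.com.br/category/page/67/?gclid=abc'
without_slash = 'https://www.site.com.br/category/page/67'

loose = re.compile(r'/page/(\d+)')    # a slash after the number is optional
strict = re.compile(r'/page/(\d+)/')  # a slash after the number is required

def page_of(pattern, url):
    """Return the captured page number as a string, or None if no match."""
    m = pattern.search(url)
    return m[1] if m else None

print(page_of(loose, with_slash))      # 67
print(page_of(loose, without_slash))   # 67
print(page_of(strict, with_slash))     # 67
print(page_of(strict, without_slash))  # None
```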

The quantifier + means at least one digit, with no upper limit. If you want to limit the number of digits, you can use other quantifiers:

\d{1,10}: not less than 1, not more than 10 digits
\d{2,}: at least 2 digits, with no upper limit
\d{2}: exactly 2 digits

Adapt the values according to what you need.
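A quick sketch of how these quantifiers behave, assuming two hypothetical paths and a small helper find_page. One thing worth noticing: a limit such as \d{2} only rejects longer numbers when the pattern also says what must come after the digits.

```python
import re

def find_page(pattern, url):
    """Return the first capture group of the first match, or None."""
    m = re.search(pattern, url)
    return m[1] if m else None

# Hypothetical paths with a one-digit and a three-digit page number.
url_1 = '/category/page/7/'
url_3 = '/category/page/123/'

print(find_page(r'/page/(\d{1,10})/', url_1))  # 7
print(find_page(r'/page/(\d{2,})/', url_1))    # None (needs at least 2 digits)
print(find_page(r'/page/(\d{2,})/', url_3))    # 123
print(find_page(r'/page/(\d{2})/', url_3))     # None (exactly 2, then a slash)
print(find_page(r'/page/(\d{2})', url_3))      # 12 (no slash required, partial match)
```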

Note: the shortcut \d corresponds to any character of the Unicode category "Number, Decimal Digit". This includes not only digits from 0 to 9, but also several other characters representing digits, such as ٢ (ARABIC-INDIC DIGIT TWO), among others.

If such characters do not occur in your URLs, it is fine to use \d. But if you want to be more specific and accept only the digits from 0 to 9, you can use the re.ASCII flag, or use the character class [0-9] instead of \d:

m = re.search(r'/page/(\d+)/', url, re.ASCII)

# or

m = re.search(r'/page/([0-9]+)/', url)
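The effect of the flag can be checked with a sketch like this one. The URL is hypothetical: its page number is written with ARABIC-INDIC digits, given as escape sequences to avoid encoding problems.

```python
import re

# Hypothetical URL whose page number uses ARABIC-INDIC digits (U+0662, U+0663).
url = 'https://www.site.com.br/category/page/\u0662\u0663/'

unicode_match = re.search(r'/page/(\d+)/', url)          # \d accepts any Unicode decimal digit
ascii_match = re.search(r'/page/(\d+)/', url, re.ASCII)  # \d restricted to 0-9
class_match = re.search(r'/page/([0-9]+)/', url)         # explicit 0-9 class

print(unicode_match is not None)  # True
print(int(unicode_match[1]))      # 23 (int() also understands these digits)
print(ascii_match, class_match)   # None None
```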

Finally, the code above only finds the first occurrence of /page/number. If there is more than one occurrence and you want them all, just use findall:

for m in re.findall(r'/page/(\d+)/', url):
    print(m)

A feature of findall is that when capture groups are present, only the groups are returned. That is, the regex above already gives you only the numbers that appear right after /page/.
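To see this behaviour side by side, here is a sketch with a hypothetical URL that contains /page/number/ twice.

```python
import re

# Hypothetical URL with two /page/number/ occurrences.
url = 'https://www.site.com.br/page/3/category/page/23/'

with_group = re.findall(r'/page/(\d+)/', url)   # only the captured digits
without_group = re.findall(r'/page/\d+/', url)  # the whole matched text

print(with_group)     # ['3', '23']
print(without_group)  # ['/page/3/', '/page/23/']

# findall returns strings; convert them if you need integers.
pages = [int(n) for n in with_group]
print(pages)          # [3, 23]
```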

Thank you very much.

– JB_

2019/09/20 at 19:34

by Bianca Lodoli • 11 points · Answer 1 · 2019-09-20T18:59:18+00:00

The code below does what you need:

#!/usr/bin/env python3
import re

url = 'https://www.site.com.br/categoria-produto/category/page/67/?gclid=Cjkdksjkcm35522'

re_numbers = r'/(\d+)(?=/)'

numbers = re.findall(re_numbers, url)

print(numbers)

It returns the list of numbers that appear between two slashes in the URL. You can then work with this list and take only the first item, if that is what you need, or only the last.
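One thing to keep in mind is that this pattern matches any all-digit path segment, not only the one after /page/. The sketch below uses a hypothetical URL with extra numeric segments to show that, and to show why the lookahead (?=/) matters: it checks for the closing slash without consuming it, so the same slash can open the next match.

```python
import re

# Hypothetical URL with several numeric segments in a row.
url = 'https://www.site.com.br/2019/09/page/67/?gclid=abc'

# Lookahead: the closing slash is checked but not consumed.
numbers = re.findall(r'/(\d+)(?=/)', url)
print(numbers)      # ['2019', '09', '67']

# Without the lookahead the closing slash is consumed, so '09' is skipped.
consumed = re.findall(r'/(\d+)/', url)
print(consumed)     # ['2019', '67']

# Picking one item from the list, guarding against an empty result.
first = numbers[0] if numbers else None
last = numbers[-1] if numbers else None
print(first, last)  # 2019 67
```

If only the page number is wanted, taking the last item works for this URL, but anchoring the regex on /page/ is less dependent on the URL layout.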