It depends on what the URL looks like. If the URL has only one occurrence of /page/number/, one option is to use search (from the re module):
import re
url = 'https://www.site.com.br/categoria-produto/category/page/67/?gclid=Cjkdksjkcm35522'
m = re.search(r'/page/(\d+)', url)
if m:
    print(m[1]) # 67
That is, the regex looks for /page/ and then checks whether it is followed by one or more digits (the shortcut \d matches a digit from 0 to 9, and the quantifier + means "one or more occurrences").
The \d+ part is in parentheses to form a capture group. That way I can get just the number with m[1] (since it is the first pair of parentheses, it is the first capture group, hence index 1).
If the number is not necessarily followed by a slash (i.e., the URL can end with page/67 and have nothing after it), the regex still works (\d+ consumes digits until it finds a character that is not a digit, or the end of the string). But if you want to get the number only when a slash follows it, just change the regex to r'/page/(\d+)/' (I do not know what the URLs you will check look like, but with regex it is important to state exactly what you want and what you do not want, because depending on the case it can make a difference).
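To make the difference between the two patterns concrete, here is a small sketch. The URLs and the helper page_number are hypothetical, made up only for this illustration:

```python
import re

# Hypothetical URLs, for illustration only
with_slash = 'https://www.site.com.br/category/page/67/?gclid=abc'
without_slash = 'https://www.site.com.br/category/page/67'

loose = re.compile(r'/page/(\d+)')    # a slash after the number is optional
strict = re.compile(r'/page/(\d+)/')  # a slash after the number is required

def page_number(pattern, url):
    """Return the captured digits as a string, or None if there is no match."""
    m = pattern.search(url)
    return m[1] if m else None

print(page_number(loose, with_slash))      # 67
print(page_number(loose, without_slash))   # 67
print(page_number(strict, with_slash))     # 67
print(page_number(strict, without_slash))  # None
```

Only the last call fails to match, because the strict pattern insists on the trailing slash.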
The quantifier + means that there must be at least 1 digit, with no upper limit. But if you want to limit the count, you can use other options:
\d{1,10}: at least 1 and at most 10 digits
\d{2,}: at least 2 digits, with no upper limit
\d{2}: exactly 2 digits
Adapt the values according to what you need.
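A quick sketch of how these quantifiers behave, using a hypothetical URL. Note the last case: without something after the group (such as the slash), a bounded quantifier can match just the beginning of a longer number:

```python
import re

# Hypothetical URL, for illustration only
url = 'https://www.site.com.br/category/page/67/'

up_to_ten = re.search(r'/page/(\d{1,10})/', url)
at_least_two = re.search(r'/page/(\d{2,})/', url)
exactly_three = re.search(r'/page/(\d{3})/', url)

print(up_to_ten[1])     # 67
print(at_least_two[1])  # 67
print(exactly_three)    # None (there are only 2 digits before the slash)

# With no slash after the group, \d{2} takes only the first 2 digits
partial = re.search(r'/page/(\d{2})', 'https://www.site.com.br/page/6789/')
print(partial[1])       # 67
```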
Note: the shortcut \d matches any character in the Unicode category "Number, Decimal Digit". This includes not only the digits 0 to 9, but also several other characters that represent digits, such as ٢ (ARABIC-INDIC DIGIT TWO), among others.
If such characters do not occur in your URLs, it is fine to use \d. But if you want to be more specific and consider only the digits 0 to 9, you can use the re.ASCII flag, or use the character class [0-9] instead of \d:
m = re.search(r'/page/(\d+)/', url, re.ASCII)
# or
m = re.search(r'/page/([0-9]+)/', url)
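To see the difference in practice, here is a sketch with a hypothetical URL that contains ARABIC-INDIC DIGIT TWO (written as the escape \u0662) in place of an ASCII digit:

```python
import re

# Hypothetical URL containing ARABIC-INDIC DIGIT TWO (U+0662)
url = 'https://www.site.com.br/page/\u0662/'

unicode_match = re.search(r'/page/(\d+)/', url)          # default: Unicode digits
ascii_match = re.search(r'/page/(\d+)/', url, re.ASCII)  # only 0-9
class_match = re.search(r'/page/([0-9]+)/', url)         # only 0-9

print(unicode_match is not None)  # True
print(ascii_match)                # None
print(class_match)                # None
```

Only the default \d accepts that character; the other two variants reject the URL.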
Finally, the code above only looks for the first occurrence of /page/number. If there is more than one occurrence and you want all of them, just use findall:
for m in re.findall(r'/page/(\d+)/', url):
    print(m)
A feature of findall is that when the regex has capture groups, only the groups are returned. That is, the regex above already gives you only the numbers that appear right after /page/.
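As a sketch of this behavior, compare the same regex with and without the capture group. The URL is hypothetical, with two occurrences of /page/number/:

```python
import re

# Hypothetical URL with two occurrences, for illustration only
url = 'https://www.site.com.br/page/3/something/page/67/'

with_group = re.findall(r'/page/(\d+)/', url)
without_group = re.findall(r'/page/\d+/', url)

print(with_group)     # ['3', '67']
print(without_group)  # ['/page/3/', '/page/67/']

# findall returns strings; convert them if you need integers
numbers = [int(n) for n in with_group]
print(numbers)        # [3, 67]
```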
Thanks for the reply. I marked @hkotsubo's answer as the best because it also works when there is no trailing slash.
– JB_