I am a beginner in web scraping. I want to learn how to build a database from data on used (semi-new) car listings on a few websites. One of the sites is this:
url = https://www.seminovosunidas.com.br/veiculos/page:1?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-%22
I can extract the data I need from a single page without problems. To iterate over the pages I use url.format, passing an index that increments the page number.
The complete code:
import requests as req
from bs4 import BeautifulSoup as bs

def get_unidas():
    url = "https://www.seminovosunidas.com.br/veiculos/page:{}?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-"
    indice_pagina = 1
    dados = {}
    while True:
        #headers = {'User-Agent': random.choice(user_agent_list)}
        r = req.get(url.format(indice_pagina))
        if r.status_code != req.codes.ok:
            raise Exception("Página inexistente")
        soup = bs(r.text, "lxml")
        carros = soup.find_all(class_="vehicleDescription")
        valores = soup.find_all(class_="valor")
        for carro, valor in zip(carros, valores):
            texto = list(carro.stripped_strings)
            dados["Empresa"] = "Unidas"
            dados["Modelo"] = texto[2]
            dados["Preco"] = valor.text.replace(".", "").replace(",", ".")
            dados["Kilometragem"] = texto[4].split(",")[1][5:]
            dados["Ano"] = texto[3][-5:-1]
            #print(dados)
            #print("#######################################")
        indice_pagina += 1

get_unidas()
The problem is that I don't know how to make the while loop end. When I request a page with a nonexistent index, index 200 for example, the site just serves page 1 again. Normally a nonexistent page would have different HTML, so I could tell it apart from a page that exists. Even checking status_code doesn't help: a request for a nonexistent page returns 200, the code that indicates an existing page.
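Since the status code can't be trusted here, one way to end the loop is to fingerprint the listings on each page and stop as soon as a page repeats content that was already scraped, which is exactly what happens when an out-of-range index falls back to page 1. Below is a minimal sketch along those lines; fingerprinting by listing text and the empty-page check are assumptions about the site's behavior, not something the page guarantees, and the accumulation into a list is just one possible way to collect the results.

import requests as req
from bs4 import BeautifulSoup as bs

def get_unidas():
    url = "https://www.seminovosunidas.com.br/veiculos/page:{}?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-"
    indice_pagina = 1
    assinaturas_vistas = set()   # fingerprints of pages already scraped
    todos_dados = []             # one dict per car, across all pages

    while True:
        r = req.get(url.format(indice_pagina))
        if r.status_code != req.codes.ok:
            break

        soup = bs(r.text, "lxml")
        carros = soup.find_all(class_="vehicleDescription")
        valores = soup.find_all(class_="valor")

        # Fingerprint the page by its listing texts. If the site answered an
        # out-of-range index by serving a page we have already seen (page 1),
        # or by serving no listings at all, stop paginating.
        assinatura = tuple(c.get_text(strip=True) for c in carros)
        if not carros or assinatura in assinaturas_vistas:
            break
        assinaturas_vistas.add(assinatura)

        for carro, valor in zip(carros, valores):
            texto = list(carro.stripped_strings)
            todos_dados.append({
                "Empresa": "Unidas",
                "Modelo": texto[2],
                "Preco": valor.text.replace(".", "").replace(",", "."),
                "Kilometragem": texto[4].split(",")[1][5:],
                "Ano": texto[3][-5:-1],
            })

        indice_pagina += 1

    return todos_dados

If the fallback to page 1 is an actual HTTP redirect, comparing r.url against the requested URL would also work, but comparing page content is safer because it works whether or not a redirect is involved.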
Thanks a lot, man. It worked great
– Rafael Ribeiro