Site with hidden HTML


I need to extract sales data for used cars from several websites.

One of the sites belongs to the company Locamerica. However, the content I need to extract does not appear in the HTML of its page.

I need to extract the data for each car shown on the page, but the cars do not appear in the HTML. Not even the links to the individual car pages appear.

I downloaded the source code and opened it, and the same site appears but without any cars.

I am programming in Python, and I use Requests to get the page HTML and Beautiful Soup to extract the data I need.

The code

import requests as req
from bs4 import BeautifulSoup as bs

url = "https://seminovos.locamerica.com.br/seu-carro?combustivel=&cor=&q=&cambio=&combustiveis=&cores=&acessorios=&estado=0&loja=0&marca=0&modelo=0&anode=&anoate=&per_page={}&precode=0&precoate=0"
indice_pagina = 1

r = req.get(url.format(indice_pagina))
print(r.text)
  • How do they not appear? I opened the site and inspected the code, and it is all there: picture, price, and so on. They even use Bootstrap.

  • I’m not a web developer; I work more on the data analysis side, so I’m really not seeing these details. Does it have something to do with Bootstrap?

  • Go to their site and press Ctrl+U to open the page source, then press Ctrl+F and search for "price", for example, and you will see it. I can tell they use Bootstrap because the markup is full of that framework’s classes.

  • I searched for "price" and got 43 matches, but none of them was what I wanted. I downloaded the source code, ran it, and the page opened without the cars.

1 answer

This happens because the page does not initially contain the information about the cars. It is loaded empty, and then JavaScript loads the data dynamically and inserts it into the page.

One way to get around this is to use a WebDriver tool such as Selenium. Basically, you run a browser that is controlled by your Python program.

When possible, it is best to avoid this, however: running an entire browser, which loads all the images, scripts, and advertisements, is considerably slower than making plain requests.

What you can do instead is open your browser’s developer tools, go to the Network tab, and watch the requests the browser makes while loading the page. Sometimes the interesting content is loaded by a simple call to an API on the site. In that case, you can make the request to that API yourself.

I did that and I saw some things that seemed interesting:

[Screenshot: requests listed in the browser’s Network tab]

The other JSON requests are not interesting; they look like filter options and dealerships. This one seemed a little strange: it did not return the car information directly, but its odd-looking format suggested that it could be Base64.

I copied the veiculos field and pasted it into a decoder site to confirm my suspicion, and indeed the decoded message is HTML:

[Screenshot: the decoded veiculos field, showing HTML]
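The same check can be done in Python instead of on a decoder site. The sketch below is only an illustration: the helper name try_decode_base64 is hypothetical, the sample string is made up rather than taken from the real response, and UTF-8 is assumed as the text encoding.

```python
import base64


def try_decode_base64(value):
    """Return the decoded text if `value` is valid Base64 holding UTF-8 text, else None."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        raw = base64.b64decode(value, validate=True)
        # UTF-8 is an assumption about the site's encoding
        return raw.decode("utf-8")
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both subclasses of ValueError
        return None


# Made-up sample standing in for the real 'veiculos' field
sample = base64.b64encode(b'<div class="car">Example car</div>').decode("ascii")

decoded = try_decode_base64(sample)
# Crude hint that the decoded text is HTML: it starts with a tag
is_html = decoded is not None and decoded.lstrip().startswith("<")
not_base64 = try_decode_base64("not base64!")

print(decoded)
print(is_html)
print(not_base64)
```

If the helper returns text that starts with a tag, the field very likely carries HTML, which is what the decoder site showed here.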

As a proof of concept, here is how to get this HTML with Python:

import requests
import base64

url = 'https://seminovos.locamerica.com.br/veiculos.json?marca=&precode=&precoate=&ano_de=0&cambio=&acessorios=&current_url=https://seminovos.locamerica.com.br/seu-carro?marca=&cambio=&combustivel=&cor=&acessorios=&anode=0&precode=&precoate='

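# Request the same JSON that the page fetches in the background
r = requests.get(url)
# The 'veiculos' field holds the car listing as a Base64 string
info = r.json()['veiculos']
# b64decode returns bytes containing the HTML
info_decoded = base64.b64decode(info)

print(info_decoded)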
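From there, the decoded HTML can be parsed like any other page. Below is a minimal offline sketch of that step, using only the standard library so it runs without a network connection. The payload, the tag names and the class names are invented for illustration, since the real markup is not shown here, and UTF-8 is assumed as the encoding.

```python
import base64
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(payload):
    """Decode the Base64 'veiculos' field of a parsed JSON payload and return its text chunks."""
    # UTF-8 is an assumption about the site's encoding
    html = base64.b64decode(payload["veiculos"]).decode("utf-8")
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up payload imitating the shape of the real response
fake_html = '<div class="car"><h2>Example Model</h2><span class="price">R$ 10.000</span></div>'
fake_payload = {"veiculos": base64.b64encode(fake_html.encode("utf-8")).decode("ascii")}

chunks = extract_text(fake_payload)
print(chunks)
```

With the real response, you would pass r.json() instead of fake_payload. Since the question already uses Beautiful Soup, the decoded string could just as well be handed to bs(html, "html.parser") and queried with whatever tags and classes the actual markup uses.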
  • In Chrome, when I click to inspect elements, the HTML appears normally. I also asked some acquaintances to send me the HTML of the site as it appeared on their machines. I inspected that HTML, and the car data also appeared normally.

  • @Rafaelribeiro It appears because, by the time you click to inspect or view the source, the page has already used JavaScript to load the additional data. The initial request does not return the HTML of the cars; that HTML comes in the other request that I explained.
