Site with hidden HTML

Question

Asked 7 years, 3 months ago

Viewed 724 times

I need to extract sales data for used (semi-new) cars from some websites.

One of the sites belongs to the company Locamerica. However, the content I need to extract does not appear in the page's HTML.

I need to extract the data for each car listed on the page, but the cars do not appear in the HTML. Not even links to the individual car pages appear.

I downloaded the page source and opened it, and the same site appears but without any cars. (Link to the HTML that I get.)

I am programming in Python, and I use Requests to get the page HTML and Beautiful Soup to extract the data I need.

The code

import requests as req
from bs4 import BeautifulSoup as bs

url = "https://seminovos.locamerica.com.br/seu-carro?combustivel=&cor=&q=&cambio=&combustiveis=&cores=&acessorios=&estado=0&loja=0&marca=0&modelo=0&anode=&anoate=&per_page={}&precode=0&precoate=0"
indice_pagina = 1

r = req.get(url.format(indice_pagina))
print(r.text)

What do you mean they don't appear? I went to the site and inspected the code, and it's all there: picture, price, and so on. They even use Bootstrap...

– hugocsl

2018/05/19 at 21:48
I’m not a web developer. I deal more with the data analysis side, so I’m really not seeing these details. Does it have something to do with Bootstrap?

– Rafael Ribeiro

2018/05/19 at 21:57
Go to their site and press Ctrl+U to open the page source, then press Ctrl+F and search for "price", for example, and you will see that it’s there... I could tell they use Bootstrap because the markup is full of that framework’s classes.

– hugocsl

2018/05/19 at 22:04
I searched for "price" and got 43 matches, but none of them was what I wanted. I downloaded the source code and opened it, and the page showed up without the cars.

– Rafael Ribeiro

2018/05/19 at 22:09

1 answer

by Pedro von Hertwig Batista • **3,434** points · Answer 1 · 2018-05-20T00:35:11+00:00

This happens because the page does not initially contain the information about the cars. It is loaded empty, and then JavaScript fetches the data dynamically and inserts it into the page.

One way to get around this is to use a WebDriver tool such as Selenium. Basically, you run a browser that is controlled by your Python program.
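A minimal sketch of that approach, assuming Selenium 4 with Chrome and a matching driver are installed. The function name `fetch_rendered_html`, the headless flag, and the CSS selector you pass in are illustrative assumptions, not something taken from the site:

```python
def fetch_rendered_html(url, wait_selector, timeout=15):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run.

    `wait_selector` is a CSS selector (hypothetical: pick one that matches a
    car entry on the real page) that must be present before the HTML is read.
    """
    # Imported here so the sketch can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the dynamically inserted element exists (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can be handed to Beautiful Soup in the same way as `r.text` from Requests.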

When possible, however, it is best to avoid this approach: running an entire browser, which loads all the images, scripts, and advertisements, makes the process considerably slower than making simple requests.

What you can do is open your browser’s developer tools, go to the Network tab, and watch the requests your browser makes while loading the page. Sometimes the interesting content is loaded by a simple call to an API on the website. In that case, you can send your request directly to that API.
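Once a candidate request shows up in the Network tab, a small helper can make it easier to see which fields of its JSON response deserve a closer look. This is only an illustration: `summarize_json` is a hypothetical name and the sample payload is made up.

```python
def summarize_json(payload, preview=40):
    """Return one line per top-level key: its type, size and a short preview."""
    lines = []
    for key, value in payload.items():
        text = value if isinstance(value, str) else repr(value)
        size = len(value) if hasattr(value, "__len__") else 1
        lines.append(
            f"{key}: {type(value).__name__}, size {size}, "
            f"starts with {text[:preview]!r}"
        )
    return lines


# Made-up payload standing in for a response copied from the Network tab.
sample = {"total": 2, "lojas": ["A", "B"], "veiculos": "PGRpdj5DYXJybzwvZGl2Pg=="}
for line in summarize_json(sample):
    print(line)

# With a real endpoint (requires the requests package):
#   import requests
#   print("\n".join(summarize_json(requests.get(url).json())))
```

A long string field with an unusual alphabet stands out quickly in a summary like this.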

I watched the requests for this page and saw a few that seemed interesting.

The other JSON requests are not interesting; they look like filter options and dealerships. One request, however (the one to veiculos.json), seemed a little strange: it did not return the car information directly, but its odd format looked like it could be Base64.
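A suspicion like this can also be checked in code rather than by eye. The rough heuristic below assumes the standard Base64 alphabet with `=` padding: such a string has a length that is a multiple of four and survives strict decoding. `looks_like_base64` is a hypothetical helper name.

```python
import base64
import binascii


def looks_like_base64(text):
    """Heuristic: True if `text` is non-empty and decodes as strict standard Base64."""
    stripped = "".join(text.split())  # tolerate line breaks inside the value
    if not stripped or len(stripped) % 4 != 0:
        return False
    try:
        base64.b64decode(stripped, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


print(looks_like_base64("PGRpdj5DYXJybzwvZGl2Pg=="))  # Base64 of a small HTML snippet
print(looks_like_base64("<div>Carro</div>"))          # plain HTML, not Base64
```

Short plain words made only of letters and digits can pass this check too, so it is a hint rather than proof; decoding the value and looking at the result remains the real test.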

I copied the veiculos field and pasted it into a decoder site to confirm my suspicion, and indeed the message decodes to HTML.

As a proof of concept to get this HTML with Python:

import requests
import base64

# JSON endpoint seen in the browser's Network tab
url = 'https://seminovos.locamerica.com.br/veiculos.json?marca=&precode=&precoate=&ano_de=0&cambio=&acessorios=&current_url=https://seminovos.locamerica.com.br/seu-carro?marca=&cambio=&combustivel=&cor=&acessorios=&anode=0&precode=&precoate='

r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error
info = r.json()['veiculos']  # Base64-encoded string
info_decoded = base64.b64decode(info)  # bytes containing the HTML

print(info_decoded)
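From there, the decoded bytes can be turned into text and parsed like any other HTML. The sketch below decodes a `veiculos`-style field and collects link targets. It assumes the payload is UTF-8 and that each car is reachable through an `<a href>` element, neither of which is confirmed by the response above. It uses the standard library's `html.parser` to stay dependency-free; Beautiful Soup, as in the question, would work just as well. The names `LinkCollector` and `extract_links` are hypothetical, and the sample payload is made up.

```python
import base64
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(payload):
    """Decode the Base64 'veiculos' field of `payload` and return its links."""
    html = base64.b64decode(payload["veiculos"]).decode("utf-8")  # UTF-8 is assumed
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up stand-in for r.json(); the real markup may look different.
sample_html = '<div><a href="/carro/1">Carro 1</a><a href="/carro/2">Carro 2</a></div>'
sample = {"veiculos": base64.b64encode(sample_html.encode("utf-8")).decode("ascii")}
print(extract_links(sample))
```

With the real response, `extract_links(r.json())` would be the equivalent call; which tags and classes actually hold the price, model, and year has to be read off the decoded HTML itself.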