Render a specific part of a page

Viewed 240 times

5

I am using the following code to render a web page:

import dryscrape

# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://www.google.com')

# load images, since the Doodle must appear in the rendered screenshot
sess.set_attribute('auto_load_images', True)

# visit
sess.visit("/")

sess.render("google.png")

But I would like to render only part of the page. For example, on the Google site, I would like to render only the Doodle (<div id=dood class=cta>).

I tried replacing the last line with:

sess.at_css('.cta').render("google.png")

But this is not allowed. Does anyone know a way to do this?

2 answers

1

I would try @drgarcia1986's solution first. But if you insist on using dryscrape (perhaps because many Google Doodles are animated and use Flash/HTML5?), one option is to edit the HTML of the main page so that only the Doodle remains. If you can find a way to make dryscrape open an HTML file that you generated, you could try something like this with BeautifulSoup:

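from bs4 import BeautifulSoup

# codigo_html_do_google holds the Google home page HTML as a string
soup = BeautifulSoup(codigo_html_do_google, 'html.parser')
# keep only the markup of the element with class "cta" (the Doodle)
codigo_html_simplificado = str(soup.find(class_='cta'))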

(You may need to take care not to destroy other page elements, such as scripts, but this is the general idea.)
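One way to get a generated HTML file that a browser-like tool could open is to write the simplified markup to a temporary file and build a `file://` URL for it. The sketch below uses only the standard library; `write_simplified_page` is a hypothetical helper name, and whether dryscrape's `visit` accepts a `file://` URL is an assumption I have not verified.

```python
import tempfile
from pathlib import Path


def write_simplified_page(fragment, directory=None):
    """Wrap an HTML fragment in a minimal page, save it, return (path, file URL)."""
    page = (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        '<body>{}</body></html>'
    ).format(fragment)
    # mkstemp returns an open file descriptor plus the file name
    fd, name = tempfile.mkstemp(suffix='.html', dir=directory)
    with open(fd, 'w', encoding='utf-8') as f:
        f.write(page)
    path = Path(name).resolve()
    # as_uri() needs an absolute path, hence resolve() above
    return path, path.as_uri()
```

Note that relative image URLs inside the fragment would no longer resolve against google.com once the page is loaded from disk, so they may need to be made absolute first.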

1

If dryscrape is not a requirement, you can combine HTTP requests to Google with regex-based text processing.

The idea is to read the Google page, find the Doodle's address with a regex, assemble the final URL, download the file, and save it to disk:

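import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# fetch the home page and decode it so the regex can work on text
response = urlopen('http://www.google.com/').read().decode('utf-8', errors='replace')

# the Doodle is set as a CSS background on the element with id="hplogo"
m = re.search(
    r'background:url\(([^)]+)\).+id="hplogo"',
    response
)
if m is None:
    raise SystemExit('Doodle not found; the page layout may have changed')

final_url = 'http://www.google.com{}'.format(m.group(1))
print('Downloading {}'.format(final_url))

image = urlopen(final_url).read()
with open('google.png', 'wb') as f:
    f.write(image)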

You may need to check the image's Content-Type before saving it to disk, since the Doodle is not necessarily a PNG.
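A minimal sketch of that check, assuming you have already read the Content-Type header value from the response. The helper names and the table of types are hypothetical and only cover a few common image formats.

```python
# Hypothetical lookup table: common image MIME types and their extensions
KNOWN_TYPES = {
    'image/png': '.png',
    'image/gif': '.gif',
    'image/jpeg': '.jpg',
    'image/webp': '.webp',
}


def extension_for(content_type, default='.bin'):
    """Return a file extension for a Content-Type header value."""
    # drop parameters such as "; charset=..." and normalise case
    mime = content_type.split(';', 1)[0].strip().lower()
    return KNOWN_TYPES.get(mime, default)


def doodle_filename(content_type, stem='google'):
    """Build the output file name from the header value."""
    return stem + extension_for(content_type)
```

On Python 3, the header value should be available from the object returned by `urlopen`, for example via `resp.headers.get('Content-Type')`; the result of `doodle_filename` can then replace the hard-coded 'google.png'.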

This carries a risk: if Google changes the page layout, the regex will most likely stop matching and you will have to rewrite it.
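To make that failure easy to spot, the lookup can be wrapped in a small function that returns `None` when nothing matches, and tried against saved markup. In this sketch the function name and both HTML snippets are made up for illustration; they are not Google's real markup.

```python
import re

# same pattern as in the script above
DOODLE_PATTERN = re.compile(r'background:url\(([^)]+)\).+id="hplogo"')


def find_doodle_path(html):
    """Return the Doodle's URL path, or None if the pattern does not match."""
    m = DOODLE_PATTERN.search(html)
    return m.group(1) if m else None


# hypothetical markup the pattern was written for
old_layout = '<div style="background:url(/logos/sample.png) no-repeat" id="hplogo"></div>'
# hypothetical markup after a layout change
new_layout = '<img id="logo" src="/images/sample.png">'
```

Calling `find_doodle_path(new_layout)` gives `None`, which the script can check for instead of failing on `m.group(1)`.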
