Crawler that detects changes on a page and saves screenshots

Viewed 304 times

-1

  • Folks, I edited his question to make it read less like a 'personal matter'. It is (was) poorly worded, but it is not an outright bad question. Do you think it still needs improving?

  • 1

    The question is bad but salvageable. It would help if the asker described what he needs more specifically.

3 answers

4

PhantomJS is a very good fit for this. It is not Python, but it is fairly simple for tasks like this one and only requires you to know JavaScript. One of its main advantages is that it interprets JavaScript and runs a full browser engine, so it can do things that crawlers without those capabilities cannot.

It uses the WebKit browser engine (the engine Google Chrome was originally built on) and has a dedicated function for taking screenshots. With it, you would have PhantomJS open the page. If the content is loaded via AJAX, attach an event handler that notices when something has changed. If the page does not use AJAX, you would have to load the page at regular intervals, compare it with the previous version, and pick out the differences.

Here is an example of how to open a page and take a screenshot of it:

File github.js

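// Create a page object and load the URL.
var page = require('webpage').create();
page.open('http://github.com/', function (status) {
  // status is 'success' or 'fail'; only render if the page loaded.
  if (status === 'success') {
    page.render('github.png'); // save a screenshot of the page
  }
  phantom.exit();
});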

Then run the file from the command line with:

phantomjs github.js
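
For pages that do not use AJAX, the "load at intervals and compare with the previous version" idea described above could be sketched in Python roughly as follows. This is only a minimal illustration using the standard library: `fingerprint`, `changed_lines`, and `check_for_change` are hypothetical names, and `fetch` is assumed to be any callable you supply that returns the page source as a string (for example, a wrapper around your browser or HTTP client).

```python
import difflib
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page source."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def changed_lines(old_html, new_html):
    """Return the lines removed ('- ') or added ('+ ') between two snapshots."""
    diff = difflib.ndiff(old_html.splitlines(), new_html.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def check_for_change(fetch, url, previous_html):
    """Fetch the page once and compare it with the previous snapshot.

    `fetch` is a caller-supplied function (an assumption of this sketch).
    Returns (current_html, changes); `changes` is empty when nothing differs
    or when there is no previous snapshot yet.
    """
    current_html = fetch(url)
    if previous_html is None:
        return current_html, []
    if fingerprint(current_html) == fingerprint(previous_html):
        return current_html, []
    return current_html, changed_lines(previous_html, current_html)
```

A crawler loop would call `check_for_change` on a timer, keep the returned source as the new "previous" snapshot, and take a screenshot whenever the list of changes is not empty.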

  • I have some services at work that use PhantomJS, and it serves our purposes well. You write the procedures in JavaScript. It is not entirely trivial, but you can do a lot of cool things!

  • For simple things, PhantomJS is trivial. Some things do get complicated, because of bugs and the conceptual difference between the page context and the PhantomJS context. A screenshot, in this case, is very easy to do and takes only a few lines.

  • ghost.py (http://jeanphix.me/Ghost.py/) is a Python-based fork of PhantomJS; it basically uses Qt's WebKit module.

  • @tovmeod Post that as an answer. If it uses WebKit, it is as good a solution as PhantomJS. Maybe other people can also suggest equivalents in other languages.

2

For the crawler, you can use Scrapy.

1

I've used Ghost.py. It is a fork of PhantomJS from when they decided to stop supporting Python and, as the name suggests, it is a library for people using Python.

Internally it uses Qt's WebKit module. It may not be the fastest thing in the world, but it runs JavaScript, opens iframes, downloads images, and behaves like a browser (or at least tries to), unlike solutions such as mechanize or requests + BeautifulSoup.

It depends on PyQt or PySide. I had a headache installing PySide, but in the end it works well.

I only ran into one bug, which actually comes from Qt's WebKit module: from time to time it killed my whole process. I worked around it with Python's multiprocessing module, so that if the child process died for any reason, it did not stop my entire program.
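
The workaround above could look something like the sketch below. It is not the code actually used, only an illustration of isolating a crash-prone call in a child process. `run_isolated`, `flaky_capture`, and `crawl` are hypothetical names, the crash is simulated with `os._exit`, and the "fork" start method is an assumption that holds on Unix-like systems only.

```python
import multiprocessing
import os


def run_isolated(task, args=(), timeout=30.0):
    """Run task(*args) in a child process; return True if it exited cleanly."""
    # Assumption: "fork" is only available on Unix-like systems.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=task, args=args)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # The child hung past the timeout: stop it and report failure.
        proc.terminate()
        proc.join()
        return False
    # A crash in the child shows up as a non-zero exit code here,
    # while the parent process keeps running.
    return proc.exitcode == 0


def flaky_capture(url):
    """Hypothetical stand-in for a browser call that may kill its process."""
    if "crash" in url:
        os._exit(1)  # simulate a hard crash inside the browser module


def crawl(urls):
    """Visit each URL in its own process and record which ones succeeded."""
    return {url: run_isolated(flaky_capture, (url,)) for url in urls}
```

The main program can then retry or skip the URLs whose value is False instead of dying along with the browser.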

Homepage: http://jeanphix.me/Ghost.py/

Source code: https://github.com/jeanphix/Ghost.py
