Developing a Webcrawler in Python

Is there a simple open-source web crawler project, developed in Python, that I could study?

I have been studying and researching this subject for some time, but I cannot find anything ready-made. My goal is to study so I can create an open-source crawler with the following features:

  • Download the HTML from a specific link
  • Fetch the contents of specific tags, for example <p> and <h1>
  • Save the obtained contents to a MySQL database

So I would like some basis for how to develop this in Python in a simple way. If you have an idea of how to do it (in code), please help me out!

Note: my Python knowledge is currently basic.
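
A minimal sketch of those three steps, assuming the third-party packages requests, beautifulsoup4, and mysql-connector-python; the URL, the credentials, and the contents table are hypothetical placeholders:

    import mysql.connector          # assumes mysql-connector-python
    import requests                 # assumes requests
    from bs4 import BeautifulSoup   # assumes beautifulsoup4

    # 1. Download the HTML from a specific link
    response = requests.get("http://example.com", timeout=10)
    response.raise_for_status()

    # 2. Fetch the contents of specific tags, e.g. <p> and <h1>
    soup = BeautifulSoup(response.text, "html.parser")
    texts = [tag.get_text(strip=True) for tag in soup.find_all(["p", "h1"])]

    # 3. Save the obtained contents to a MySQL database
    # (database, table and column names here are hypothetical)
    conn = mysql.connector.connect(user="user", password="pass",
                                   host="localhost", database="crawler")
    cursor = conn.cursor()
    for text in texts:
        cursor.execute("INSERT INTO contents (text) VALUES (%s)", (text,))
    conn.commit()
    conn.close()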

  • Pycrawler to "crawl", Beautiful Soup to parse the HTML. By the way, what exactly is the question? If it is whether or not such a tool exists, I'm afraid that is out of scope. Otherwise, please specify exactly what you want to know (preferably without making the question too broad and/or opinion-based).

  • I edited the question; my goal is to learn about open-source web crawlers and whether anyone can help me figure out how to develop one simply ... the goal would be to extract only specific texts from a link

  • @Gabrielmasson to create a web crawler from scratch, there is a tutorial here: http://pythonprogramming.net/scraping-parsing-rss-feed/

2 answers



There are several, in my personal experience:

  • scrapy - for web scraping
  • mechanize - for web crawling
  • Selenium WebDriver - for browser automation (when mechanize cannot handle the site, e.g. AJAX, code obfuscation)

Installing the modules from the command line is very simple:
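
For example, with pip (these are the package names as published on PyPI):

    pip install scrapy
    pip install mechanize
    pip install selenium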

  • These projects are all very cool and mature, and essential in large projects. If the OP only wants the end result, crawling and saving what was found, Scrapy is the most appropriate tool, unless they want a smaller crawler project.
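
As a sketch of that suggestion, a minimal Scrapy spider that crawls one page and yields the text of <h1> and <p> tags; the spider name and URL are hypothetical:

    import scrapy

    class TextSpider(scrapy.Spider):
        # Hypothetical spider: fetches one start URL and yields tag texts
        name = "text_spider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            # CSS selectors pull the text of every <h1> and <p> tag
            for text in response.css("h1::text, p::text").getall():
                yield {"text": text.strip()}

It can be run without a full project via scrapy runspider spider.py -o output.json, which saves the yielded items to a file.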


I recommend studying how the request is made: headers, user agent, and also how the data transport happens.

During development, always debug as much as you can and try to anticipate everything: timeouts, maximum requests, redirects. If something goes wrong, your script has to know it, and log it.
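
A sketch of that defensive style, assuming the requests package; the URL and the limits are hypothetical:

    import logging

    import requests

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("crawler")

    try:
        # Fail fast instead of hanging forever; follow redirects explicitly
        response = requests.get("http://example.com", timeout=5,
                                allow_redirects=True)
        response.raise_for_status()
        log.info("fetched %s (%d bytes)", response.url, len(response.content))
    except requests.Timeout:
        log.error("request timed out")
    except requests.RequestException as exc:
        # Any other failure: connection error, HTTP error, too many redirects
        log.error("request failed: %s", exc)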

I recommend studying concurrent.futures.ThreadPoolExecutor for asynchronous requests (see the sketch below), along with threading.Thread to create database maintenance services and trend statistics, such as how long a site takes to be modified, thereby automatically adjusting the request interval according to the likelihood that the site has changed.
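
A sketch of the concurrent part with concurrent.futures.ThreadPoolExecutor, assuming the requests package; the URL list and worker count are hypothetical:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    urls = ["http://example.com/a", "http://example.com/b"]  # hypothetical

    def fetch(url):
        # Each worker thread downloads one page with a timeout
        return url, requests.get(url, timeout=5).status_code

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            url, status = future.result()
            print(url, status)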

I recommend http://lxml.de/ along with XPath and regex to extract the data.
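
A sketch of that extraction step, assuming the lxml package; the HTML snippet and the e-mail regex are hypothetical examples:

    import re

    from lxml import html

    page = "<html><body><h1>Title</h1><p>Contact: foo@bar.com</p></body></html>"
    tree = html.fromstring(page)

    # XPath pulls the text of every <h1> and <p> node
    texts = tree.xpath("//h1/text() | //p/text()")

    # A regex pass then extracts finer-grained data, e.g. e-mail addresses
    emails = [m for t in texts for m in re.findall(r"[\w.+-]+@[\w.-]+", t)]
    print(texts, emails)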

I hope this helps.
