Developing a Webcrawler in Python

Question

Asked 9 years, 9 months ago

Viewed 1,119 times

1

Is there a simple open source web crawler project, written in Python, that I could study?

I have been studying and researching the subject for some time, but I cannot find anything ready-made. My goal is to learn enough to create an open source project with the following features:

Download the HTML from a specific link
Fetch the contents of specific tags, for example <p> and <h1>
Save the extracted contents in a MySQL database

So I would like a starting point for developing this in Python in a simple way. If you have an idea of how to do it (in code), please help!

Note: my Python knowledge is currently basic.

1

Pycrawler to "crawl", Beautiful Soup to parse HTML. By the way, what exactly is the question? If it is whether or not a tool exists, I'm afraid that is out of scope. Otherwise, please specify exactly what you want to know (preferably without the question becoming too broad and/or opinion-based).

– mgibsonbr

2015/11/03 at 08:17
I edited the question. My goal is to learn about open source web crawlers and, if anyone can help, how I can develop one simply. The goal would be to get only the text from specific links.

– Gabriel Masson

2015/11/03 at 13:44
1

@Gabrielmasson to create a web crawler from scratch, there is a tutorial here: http://pythonprogramming.net/scraping-parsing-rss-feed/

– Rui Lima

2015/11/03 at 18:16

2 answers

3

There are several. In my personal experience:

Scrapy - for web scraping
mechanize - for web crawling
Selenium WebDriver - for browser automation (when mechanize cannot handle the site, e.g. AJAX, code obfuscation)

Installing the modules from the command line is very simple:

pip install scrapy (documentation)
pip install mechanize (tutorial)
pip install selenium (documentation)

These projects are all very good and mature, and essential in large projects. If the OP only wants the result (crawling and saving the results), Scrapy is the most appropriate tool. For study purposes, though, the OP may want a smaller crawler project.

– jsbueno

2015/11/03 at 18:01

by Joelson Fenix • 1 point · Answer 1 · 2015-11-07T00:04:13+00:00

I recommend studying how a request is made (headers, user agent) and understanding how the data is transported.
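As a small illustration of the request side, the sketch below builds a request with explicit headers using urllib.request from the standard library, and prints it without sending anything. The URL and the User-Agent string are placeholder assumptions, not values from this thread.

```python
import urllib.request

# Hypothetical values, used only for illustration.
URL = "http://example.com/"
HEADERS = {
    "User-Agent": "study-crawler/0.1",
    "Accept": "text/html",
}


def build_request(url, headers):
    """Return a GET request object carrying the given headers (not yet sent)."""
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_request(URL, HEADERS)

# Inspect the method, target and headers before sending anything.
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Passing this object to urllib.request.urlopen would actually send it; printing it first is a cheap way to see exactly which headers the crawler presents to a site.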

During development, always debug as much as possible and try to anticipate everything: timeouts, maximum requests, redirects. If something goes wrong, your script has to know about it and log it.
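One possible shape for such a defensive fetch, using only the standard library, is sketched below. The function name, the retry count, the delay and the injectable opener parameter are all assumptions made for this example, not something prescribed in the thread.

```python
import logging
import time
import urllib.error
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")


def fetch(url, timeout=10, max_attempts=3, delay=1.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if every attempt fails.

    `opener` is injectable so the function can be exercised without a network.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # urlopen follows HTTP redirects itself; geturl() shows the final URL.
            with opener(url, timeout=timeout) as response:
                final_url = response.geturl()
                if final_url != url:
                    log.info("redirected: %s -> %s", url, final_url)
                return response.read()
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status (404, 500, ...).
            log.warning("attempt %d: HTTP %s for %s", attempt, exc.code, url)
        except OSError as exc:
            # URLError, socket timeouts and refused connections are OSError subclasses.
            log.warning("attempt %d: %r for %s", attempt, exc, url)
        if attempt < max_attempts:
            time.sleep(delay)
    log.error("giving up on %s after %d attempts", url, max_attempts)
    return None
```

Returning None instead of raising keeps the calling loop simple; a larger crawler might prefer to raise a custom exception so that failures cannot be ignored by accident.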

I recommend studying concurrent.futures.ThreadPoolExecutor for asynchronous requests, along with threading.Thread to create database maintenance services and to track trends, such as how long it took for a site to be modified, so that the request interval can be adjusted automatically according to the likelihood of the site having changed.
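A minimal sketch of both ideas follows: a thread pool that runs many fetches concurrently, and a simple rule for adapting the revisit interval. The helper names, the halve/double rule and its bounds are illustrative assumptions, and fake_fetch stands in for a real download function so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, max_workers=4):
    """Run fetch_one(url) for each URL in a thread pool.

    Returns a dict mapping url -> result, or url -> the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the failure next to the URL instead of aborting the batch.
                results[url] = exc
    return results


def next_interval(current, changed, minimum=60, maximum=86400):
    """Halve the wait (seconds) when the page changed, double it when it did not."""
    proposed = current / 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))


def fake_fetch(url):
    """Stand-in for a real download, so the sketch needs no network."""
    if url.endswith("/broken"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


pages = fetch_all(["http://a.example/", "http://b.example/broken"], fake_fetch)
for url, outcome in sorted(pages.items()):
    print(url, "->", outcome)
```

Threads suit this job because the work is dominated by waiting on the network; the interval rule is only one of many possible heuristics for revisiting pages that change often.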

I also recommend http://lxml.de/ along with XPath and regex to extract the data.

I hope this helps.
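To show the extraction step without extra dependencies, the sketch below swaps in html.parser from the standard library in place of lxml; with lxml the same selection would normally be written as an XPath expression such as //p | //h1. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside the given tags, e.g. p and h1.

    Limitation: relies on closing tags, so an unclosed <p> is not captured.
    """

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = {tag.lower() for tag in wanted_tags}
        self.found = []      # (tag, text) pairs, in document order
        self._tag = None     # wanted tag currently open, if any
        self._parts = []     # text fragments collected for it

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names.
        if self._tag is None and tag in self.wanted:
            self._tag = tag
            self._parts = []

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.found.append((tag, text))
            self._tag = None


def extract(html, wanted_tags=("p", "h1")):
    """Return a list of (tag, text) pairs for the wanted tags."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.found


sample = "<html><body><H1>Title</H1><p>Hello <b>web</b>\n crawler</p><div>skip</div></body></html>"
print(extract(sample))
```

Text inside nested inline tags such as <b> is kept as part of the enclosing paragraph, while content outside the wanted tags is ignored.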