Is there a simple open-source web crawler project, written in Python, that I can study?
I have been studying and researching the subject for some time, but I cannot find anything ready-made. My goal is to learn enough to create an open-source project with the following features:
- Downloads the HTML from a specific link
- Extracts the contents of specific tags, for example <p> and <h1>
- Saves the extracted contents in a MySQL database
So I would like a starting point for developing this in Python in a simple way. If you have an idea of how to do it (in code), please help me out!
Note: my Python knowledge is currently basic.
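To make the three features above concrete, here is a rough sketch of how they might fit together using only the Python standard library. It is a starting point under stated assumptions, not a finished crawler: the function names, the `TagTextExtractor` class, and the `page_content` table are all made up for illustration; `html.parser` stands in for a friendlier parser such as Beautiful Soup; and `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


def download_html(url):
    """Fetch a page and return its HTML as text (falls back to UTF-8)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TagTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.results = []      # list of (tag, text) pairs
        self._current = None   # wanted tag currently being read, if any
        self._buffer = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives here as "h1".
        if self._current is None and tag in self.wanted_tags:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if text:
                self.results.append((tag, text))
            self._current = None


def extract_tags(html, wanted_tags=("p", "h1")):
    """Return (tag, text) pairs for every wanted tag, in document order."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.results


def save_results(connection, url, results):
    """Store the extracted text. SQLite is used here as a stand-in for MySQL."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS page_content (url TEXT, tag TEXT, content TEXT)"
    )
    connection.executemany(
        "INSERT INTO page_content (url, tag, content) VALUES (?, ?, ?)",
        [(url, tag, text) for tag, text in results],
    )
    connection.commit()


# Example usage (needs network access, so it is left commented out):
# connection = sqlite3.connect("crawl.db")
# page_url = "https://example.com/"
# save_results(connection, page_url, extract_tags(download_html(page_url)))
```

To move this to MySQL, the connection would come from a MySQL driver instead of `sqlite3`, and the `?` placeholders would most likely become `%s`, which is the style the common MySQL drivers for Python use. The download and extraction parts would not need to change.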
Pycrawler to "crawl", Beautiful Soup to parse the HTML. By the way, what exactly is the question? If you are asking whether such a tool exists, I'm afraid that is out of scope here. Otherwise, please specify exactly what you want to know (preferably without making the question too broad or opinion-based).
– mgibsonbr
I edited the question. My goal is to learn about open-source web crawlers and to find out whether anyone can help me develop one in a simple way. The aim is to extract only specific text from a given link.
– Gabriel Masson
@Gabrielmasson to build a web crawler from scratch, there is a tutorial here: http://pythonprogramming.net/scraping-parsing-rss-feed/
– Rui Lima