Building a Python web crawler. I need help adding threads

I am trying to develop a web crawler for study purposes. It is very simple and I would like to improve it. How can I use threads to speed up and improve the process? The program could fetch multiple links in parallel. How do I apply the concept of threads to the crawler?

import requests
import re

to_crawl = ['http://www.g1.globo.com']  # seed url to crawl (the starting point)
crawled = set()  # the set of urls that have already been crawled
# if the url is already in crawled, move on to the next one!

# it is a good idea to send headers to pretend to be a browser
header = {"user-agent":"Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0",
          "accept": "*/*",
          "accept-language": "en-US,en;q=0.5",
          "accept-encoding": "gzip, deflate",
          "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
          }
while True:  # run forever...

    url = to_crawl[0]
    try:  # handle e.g. invalid urls
        req = requests.get(url, headers=header)
    except:  # remove the url
        to_crawl.remove(url)
        crawled.add(url)
        continue  # move on to the next link


    # print(req.text)  # this is the page!
    html = req.text



    links = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])' ,html)
    print ("Crawling", url)

    # after the request, remove the url from to_crawl and add it to the crawled set:
    to_crawl.remove(url)
    crawled.add(url)


    # now put the links into to_crawl if they are not already in crawled:
    for link in links:
        if link not in crawled and link not in to_crawl:  # if it is not in either collection
            to_crawl.append(link)



    for link in links:
        print(link)

2 answers


Here is a simple example (Python 3.x); the approach is slightly different from yours:

import re
from threading import Thread
import requests

def get_links(url):
    req = get_req(url)
    if(req is not None):
        html = req.text
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)
        return urls
    return None

def get_req(url):
    try:
        req = s.get(url)
        return req
    except Exception:
        print('[-] Error fetching page: ', url)
        return None

def inject_links(data, url):
    urls = get_links(url)
    if(urls is not None):
        for url in urls:
            if(url not in data and len(data) < urls_max):
                data.add(url)  # add it to the crawled urls
                print('[+] Total: {} [+] putting: {} '.format(len(data), url))
                return inject_links(data, url)
    return

def producer(url, threadNum):
    while len(data) < urls_max:
        inject_links(data, url)
    # print('\n', data)  # comment this out once you have seen what it does, this print is very heavy
    print('[+] Terminated - killing thread {} -> Total urls stored: {}'.format(threadNum, len(data)))
    # here you could, for example, write the results to a file

data = set()
urls_max = 100
threads = 10
start_urls = ['/', 'http://www.w3schools.com/default.asp', 'http://spectrum.ieee.org/']

s = requests.Session()
for i in range(len(start_urls)):
    for j in range(threads):
        t = Thread(target=producer, args=(start_urls[i], '{}.{}'.format(i+1, j+1)))
        t.start()

set() is used to make inserting and looking up urls fast and to avoid duplicated urls. Uncomment the print inside the producer() method to view the stored urls.
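
Just to illustrate that point, here is a quick timing sketch (the urls below are made up for the test, they are not part of the answer) comparing a membership test on a list and on a set:

import timeit

urls_list = ['http://example.com/page/{}'.format(i) for i in range(100000)]
urls_set = set(urls_list)

# searching for an item near the end of the list forces a full scan,
# while the set lookup is a single hash probe
print(timeit.timeit(lambda: 'http://example.com/page/99999' in urls_list, number=100))
print(timeit.timeit(lambda: 'http://example.com/page/99999' in urls_set, number=100))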

In this case we start from three links with 10 threads on each, and each thread 'dies' once we have 100 links. The condition if(url not in data and len(data) < urls_max) is the core of the program's termination: the url is added only if it is not already in our set() and the total number of urls in the set is still below urls_max.
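
One detail to keep in mind is that all threads check and update the shared data set at the same time. In CPython the individual data.add(url) calls are effectively atomic thanks to the GIL, but the check-then-add sequence is not. A minimal sketch of how that step could be guarded with a lock (the add_url helper and data_lock are my own names, not part of the answer):

from threading import Lock

data_lock = Lock()  # hypothetical lock guarding the shared 'data' set

def add_url(data, url, urls_max):
    # perform the "not in data and len(data) < urls_max" check and the add atomically
    with data_lock:
        if url not in data and len(data) < urls_max:
            data.add(url)
            return True  # caller should keep following this url's links
        return False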

  • Thank you. I will study the code!

  • I created another question on a similar subject for another project, because I am having difficulty with the use of threads.


I put together a fairly basic example of how you could keep using threads, in the most modern way of working with Python threads: through the concurrent.futures module.

NOTE: THE EXAMPLE WAS WRITTEN USING PYTHON 3

import re
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


HEADERS = {
    'user-agent':
        'Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.5',
    'accept-encoding': 'gzip, deflate',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
}

MAX_WORKERS = 4


def featch_url(url):
    try:
        res = requests.get(url, headers=HEADERS)
    except:  # on any error fetching the page, return it with an empty body
        return url, ''
    return url, res.text


def process_urls(urls):
    result = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(featch_url, url) for url in urls]
        for future in as_completed(futures):
            url, html = future.result()
            result[url] = html
    return result


if __name__ == '__main__':
    urls = ['http://www.pudim.com.br/']
    crawled = set()
    while urls:
        to_process = {url for url in urls if url not in crawled}
        print('start process urls: ', to_process)
        process_result = process_urls(to_process)
        urls = []
        for url, page in process_result.items():
            crawled.add(url)
            urls += re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', page)

    print('Crawled pages: ', crawled)

The highlight of the example is the process_urls function, which is responsible for creating the ThreadPoolExecutor and "firing" the threads. Obviously the example should be adapted to your needs, because as it stands it only follows all the links it finds ahead of it and, at the end, adds the pages that have already been processed to the crawled set.
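
As a side note, the same fan-out could also be written with executor.map, which returns results in the order the urls were submitted instead of completion order. A possible variation (not the author's code, reusing featch_url and MAX_WORKERS from the example above):

from concurrent.futures import ThreadPoolExecutor

def process_urls_with_map(urls):
    # featch_url returns (url, html) tuples, so the results can be fed straight into a dict
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        return dict(executor.map(featch_url, urls))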

Some remarks

  • For MAX_WORKERS (the maximum number of threads open at a time) I used a completely arbitrary number, but a common rule of thumb is twice the number of CPUs on the machine (which can be obtained via os.cpu_count() * 2; see the sketch after this list).
  • If you intend to do some processing on each url (and not just collect the links), you can do it inside the for loop over as_completed, because then the pages are processed as soon as they are fetched rather than only after all of them have been read (this will give you better performance); this is also shown in the sketch after this list.
  • Before working with threads, try to understand their side effects, i.e., read up on race conditions, locks, etc.
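
Combining the first two remarks, a rough sketch of how the pool could derive its worker count from the CPU count and extract links inside the as_completed loop (extract_links and process_urls_and_extract are hypothetical helpers; featch_url is the function from the example above):

import os
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = (os.cpu_count() or 1) * 2  # twice the number of CPUs, as suggested above


def extract_links(html):
    # hypothetical helper using the same regex as the rest of this question
    return re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)


def process_urls_and_extract(urls):
    result = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(featch_url, url) for url in urls]
        for future in as_completed(futures):
            url, html = future.result()
            # each page is processed as soon as its request finishes,
            # instead of waiting for all of them to be read
            result[url] = extract_links(html)
    return result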

Although I wrote the example with threads (for study/learning purposes), if you need a more robust crawler I recommend taking a look at Scrapy and avoiding reinventing the wheel (unless the goal is to learn how wheels work).
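
For comparison, a minimal Scrapy spider doing roughly the same crawl fits in a few lines (the spider name is arbitrary and the seed is the one from the question); saved as, say, study_spider.py, it can be run with scrapy runspider study_spider.py:

import scrapy


class StudySpider(scrapy.Spider):
    name = 'study'  # arbitrary name for this sketch
    start_urls = ['http://www.g1.globo.com']  # same seed used in the question

    def parse(self, response):
        # follow every link found on the page and parse it with this same method
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)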

  • I’ll study the code, test it and get back to you. Thank you.

  • Thank you @drgarcia1986. The program is for studies only!

  • Cool, then I think it will help :) If you have any questions or problems, just ask.

  • Thank you. I’ll get back to you soon.

  • I created another question on a similar subject for another project, because I am having difficulty with the use of threads.
