Thread control to avoid lock

I’m building a crawler whose goal is to download all the content of a given website. It already has some "usable" features. What’s killing me is that I run it multithreaded, but at some point the threads stop working and I don’t know how to prevent it.

I ran some tests and found that the threads are still alive. They are still there, but they seem to be stuck in a locked state.

It may take 5 seconds or 5 hours, but one thing is certain: the crawler eventually locks up. I’d like to trust it enough to let it run 24 hours a day.

So here are my questions:

Is there a limit to the number of threads I can use?

How do I prevent my threads from locking up?

import sqlite3
import time
import urllib2
from threading import Thread

class Fetcher(Thread):

    wait_time = 7
    dispatcher = None
    work = None

    def __init__(self, dispatcher, *args, **kwargs):
        # Pop our own option first: Thread.__init__ rejects unknown keyword arguments.
        self.wait_time = kwargs.pop('wait_time', 7)
        Thread.__init__(self, *args, **kwargs)
        self.dispatcher = dispatcher
        self.start()

    def request_work(self):
        self.work = None
        if self.dispatcher.has_work():
            self.work = self.dispatcher.get_work()

    def do(self):
        if self.work is not None:
            self.fetch_url()

    def fetch_url(self):
        request = urllib2.Request(self.work.url)

        try:
            response = urllib2.urlopen(request)
            html = buffer(response.read())
            page = Page(self.work, html)
            page.save()
        except urllib2.URLError:
            self.dispatcher.fill_pool([self.work,])
        except sqlite3.OperationalError:
            self.dispatcher.fill_pool([self.work,])
        except:
            self.dispatcher.fill_pool([self.work,])

    def run(self):
        while True:
            self.request_work()
            if self.work:
                self.do()
                time.sleep(self.wait_time)

Dispatcher:

class Dispatcher:
    def __init__(self, *args, **kwargs):
        self.pool = []

    def has_work(self):
        return len(self.pool) > 0

    def get_work(self):
        return self.pool.pop(0)

    def fill_pool(self, workload):
        self.pool = self.pool + workload

Running Example:

dispatcher = Dispatcher()
dispatcher.fill_pool(['url1', 'url2', 'url3'])
fetcher1 = Fetcher(dispatcher)
fetcher2 = Fetcher(dispatcher)
fetcher3 = Fetcher(dispatcher)
fetcher4 = Fetcher(dispatcher)

I added this example at the request of user Brumazzi, but it will not run on its own. As stated earlier, the crawler I’m building depends on all of its components working together without the slightest problem. The Page class is part of the project and represents an object in the database.

  • The code is there, but nothing says how to use it. What kind of parameter should be passed to the class? Where is the URL specified? Without documentation or comments it is difficult to debug.

  • Brumazzi, unfortunately posting all of the project’s code here would be unfeasible, especially since not everything uses threads, and threads are the point. What I can do is improve the project description to make debugging easier, but I can only do that when I get home. The question is really about thread control in general; my code is just an example, so you don’t necessarily need to use it.

  • Focusing only on the Fetcher class: what type of parameter should be passed when creating an instance? Since Python does not declare parameter types, it is harder to understand the inputs and outputs of the code. If I can run the code and see the error, it is easier to give an answer.

  • Brumazzi, the project is completely interconnected. The Dispatcher is basically a queue of URLs to be fetched, and it feeds the Fetcher. The Fetcher (whose code I posted) goes out to the web, fetches the content of each URL, and feeds the database. The UrlFinder is responsible for parsing the web page stored in the database, finding more URLs, and feeding the Dispatcher.

  • I can post the Dispatcher code, which is the parameter for Fetcher, but it will have to be fed by hand. And no error pops up; the thread just freezes. :(

  • A hint: don’t write return len(self.pool) is not 0; write return len(self.pool) > 0 or return len(self.pool) != 0. You should use is only to compare the identity of two objects, not to compare their values.

  • Thank you @drgarcia1986, I will change this in my code as soon as I get home. Thanks for pointing out that slip. =)
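To illustrate the hint about is in the comments above, here is a minimal sketch of the difference between identity and equality. The names a, b and c, and the free function has_work, are made up for the illustration and are not part of the crawler. Note that len(self.pool) is not 0 can appear to work in CPython because small integers are cached, but that is an implementation detail and should not be relied on.

```python
def has_work(pool):
    # Compare values with > (or !=), never with `is`.
    return len(pool) > 0


a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate object
c = a           # another name for the same object

same_value = (a == b)    # True: the contents are equal
same_object = (a is b)   # False: two distinct list objects
alias = (a is c)         # True: both names refer to one object
```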

1 answer

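You are using a plain list (dispatcher.pool), which is not thread-safe, and sharing it among several workers (Fetcher). This may well be the source of your problem. In particular, checking has_work() and then calling get_work() is not an atomic operation, so another thread can empty the pool between the two calls. Try switching from a simple list to a thread-safe queue (Queue).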
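A minimal sketch of that suggestion follows, assuming Python 3, where the module is named queue (in Python 2, which the question uses, it is Queue). QueueDispatcher, worker and run_demo are hypothetical names invented for this illustration, and the worker only records each item instead of downloading it, so this is not the original crawler. The point it shows is that Queue.get() checks for work and removes it in a single thread-safe step, and that a timeout lets an idle thread wake up instead of waiting forever.

```python
import queue      # named Queue in Python 2
import threading


class QueueDispatcher:
    """Hypothetical replacement for Dispatcher, backed by a thread-safe queue."""

    def __init__(self):
        self.pool = queue.Queue()

    def get_work(self, timeout=1.0):
        # Checking for work and removing it happen in one step.
        # Returns None if nothing arrived within `timeout` seconds.
        try:
            return self.pool.get(timeout=timeout)
        except queue.Empty:
            return None

    def fill_pool(self, workload):
        for item in workload:
            self.pool.put(item)


def worker(dispatcher, results, stop):
    # Stand-in for Fetcher.run(): records the item instead of fetching it.
    while not stop.is_set():
        work = dispatcher.get_work(timeout=0.1)
        if work is None:
            continue
        results.append(work)          # a single append is thread-safe in CPython
        dispatcher.pool.task_done()   # tell the queue this item is finished


def run_demo(urls, n_workers=4):
    dispatcher = QueueDispatcher()
    dispatcher.fill_pool(urls)
    results, stop = [], threading.Event()
    threads = [threading.Thread(target=worker, args=(dispatcher, results, stop))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    dispatcher.pool.join()   # blocks until every queued item was marked done
    stop.set()               # ask the idle workers to exit
    for t in threads:
        t.join()
    return sorted(results)
```

Calling run_demo(['url1', 'url2', 'url3']) processes each item exactly once and then shuts the workers down cleanly. Separately, and only as an assumption about the original crawler, a blocking call made without a timeout (for example urlopen without its timeout argument) can also leave a thread alive but stuck.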

  • I’m going to do that. Do you happen to know whether there is a limit on the number of threads I can use? Can I use 3, or 100, or any number? Or is there a practical limit?

  • I believe the limit is whatever the OS imposes. However, I recommend working with at most two threads per processor core, since blocking code will be held up by the GIL (Python releases the GIL only for I/O tasks).
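The rule of thumb from the last comment (at most two threads per core) could be sketched as follows, assuming Python 3 and its standard concurrent.futures module. The helper names suggested_thread_count and fetch_all are hypothetical, and fetch stands for whatever function downloads a single URL; none of them come from the original crawler.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def suggested_thread_count(per_core=2):
    # os.cpu_count() may return None when the count cannot be determined.
    cores = os.cpu_count() or 1
    return per_core * cores


def fetch_all(urls, fetch, max_workers=None):
    """Run fetch(url) for every URL using a bounded pool of threads."""
    if max_workers is None:
        max_workers = suggested_thread_count()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the input.
        return list(pool.map(fetch, urls))
```

For example, fetch_all(['a', 'b', 'c'], str.upper, max_workers=2) runs the stand-in function str.upper on two threads. The pool size caps how many threads exist at once, so the count stays fixed no matter how many URLs are queued.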
