Scrapy queue and MySQL store

I’ve grouped 2 questions because I think they’re related.

I made a test script that saves the scraped links in the database together with their data.

Is this a bad practice? (High priority)

Do I need to do anything else to avoid importing duplicates? My pipeline has a simple check that queries for link=%s; would it be better to use md5(link)? Would the query be faster?

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to accomplish this with a list of links to be processed stored in MySQL. (Low priority)

I need to add new items to my start_urls list dynamically. Should I create a Request with callback parse_category? Is there any way I can append to self.queue or self.start_url to add new URLs to be processed? (High priority)

  • Luiz, welcome to Sopt. Check out the Help and take the Tour to better understand how to use the site's resources.

1 answer

Is this a bad practice?

No, provided that the responsibility for handling the MySQL connection is also being managed following good Python practices.
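
A minimal sketch of such a pipeline, assuming the MySQLdb driver and an illustrative table links(url_hash, link) with a unique index on url_hash; the connection parameters and table name are hypothetical:

import hashlib
import MySQLdb

class MysqlStorePipeline:
    def open_spider(self, spider):
        # Open the connection once per crawl, not once per item.
        self.conn = MySQLdb.connect(host='localhost', user='user',
                                    passwd='secret', db='scrapy_db')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # md5(link) gives a fixed-size, indexable key, usually faster
        # to look up than comparing long VARCHAR URLs directly.
        url_hash = hashlib.md5(item['link'].encode('utf-8')).hexdigest()
        self.cursor.execute(
            "INSERT IGNORE INTO links (url_hash, link) VALUES (%s, %s)",
            (url_hash, item['link']))
        self.conn.commit()
        return item

With the unique index in place, INSERT IGNORE silently skips duplicates, so the link=%s check moves into the database itself.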

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to accomplish this with a list of links to be processed stored in MySQL.

First you select the records from MySQL. Then you can use a loop to issue the requests with their respective callbacks:

from scrapy import Request

for registro in registros:
    # Use the loop variable (not self.registro) and yield the
    # Request so Scrapy actually schedules it.
    yield Request(url=registro['url'], callback=self.meu_callback)
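
Put together, a hedged sketch of how this loop might live in a spider's start_requests(), reusing the illustrative links table from above; the processado column used to resume an interrupted crawl is an assumption of this sketch, not a built-in Scrapy mechanism:

import MySQLdb
from scrapy import Request, Spider

class MeuSpider(Spider):
    name = 'meu_spider'

    def start_requests(self):
        conn = MySQLdb.connect(host='localhost', user='user',
                               passwd='secret', db='scrapy_db')
        cursor = conn.cursor(MySQLdb.cursors.DictCursor)
        # Fetch only links not yet processed, so a stopped crawl
        # resumes from where it left off.
        cursor.execute("SELECT link FROM links WHERE processado = 0")
        registros = cursor.fetchall()
        conn.close()
        for registro in registros:
            yield Request(url=registro['link'], callback=self.meu_callback)

    def meu_callback(self, response):
        # Parse the page here and mark the link as processed in MySQL.
        pass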

Should I create a Request with callback parse_category?

It’s the right thing to do.

Is there any way I can append to self.queue or self.start_url to add new URLs to be processed?

You will load self.queue from your database, and each self.start_url entry will come from the URL column of each record retrieved.
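
As a sketch of the dynamic case: rather than appending to self.start_url after the crawl has started, the usual Scrapy pattern is to yield new Request objects from a callback; the scheduler queues them, and its built-in duplicate filter drops URLs it has already seen. The selector and callback names below are illustrative:

from scrapy import Request, Spider

class CategoriaSpider(Spider):
    name = 'categorias'
    start_urls = ['http://example.com/']  # illustrative starting point

    def parse(self, response):
        # Every Request yielded here joins the scheduler's queue;
        # the default dupefilter skips URLs already requested.
        for href in response.css('a.categoria::attr(href)').getall():
            yield Request(response.urljoin(href), callback=self.parse_category)

    def parse_category(self, response):
        # Extract this category's items here.
        pass

Appending to start_urls after startup has no effect, because Scrapy reads that list only once, when the crawl begins.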

  • Thanks. On the last question: I tried something like self.start_url.append(url) and it caused an error. Should I do the append in the database and then query it? Is there some way to do this directly, in loops while collecting items, for example? Something I can run anywhere in the code. In the future I will fetch from the database, yes; this is just to speed things up for now. Thanks.

  • First you query the database and fetch the data. Then you pass this data to the loop.

  • But in this case I meant a spider without a database, something more practical.

  • I don’t understand. What would be more practical than a database?

  • I mean, if I create a spider for testing, sometimes I want to run it regardless. But for that I needed to do start_url.append(url). I believe this should be handled by logic spread across parse_start, parse_cat, and parse_item. My doubt is whether there is any function that modifies start_url or adds to the queue. I saw something with Schedule.py, but I'm trying to avoid duplicate requests. Maybe that's where it is; I'm not sure.

  • You'd better put the code in your question, because I can't understand anything.

  • No problem, this question in the comment was unrelated to MySQL. But thanks for the answers. Thanks a lot.

  • While we're at it, see here how the site works. Thank you!
