Scrapy queue and MySQL store

I’ve grouped 2 questions because I think they’re related.

I made a test script that saves the scraped links in the database together with their data.

Is this a bad practice? (High priority)

Do I need to do anything else to avoid importing duplicates? My pipeline has a simple check that queries for link=%s; would it be better to use md5(link)? Would the query be faster?

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to accomplish this with a list of links to be processed stored in MySQL. (Low priority)

I need to add new items to my start_urls list dynamically. Should I create a Request with callback parse_category? Is there any way I can append to self.queue or self.start_url to add new URLs to be processed? (High priority)

  • Luiz, welcome to Sopt. Check out the Help and take the Tour to better understand how to use the site's resources.

1 answer

Is this a bad practice?

No, provided that the responsibility for handling the MySQL connection is also being managed following good Python practices.
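
A minimal sketch of such a pipeline, assuming the MySQLdb driver and an illustrative table links(url_hash, link) with a unique index on url_hash; the connection parameters and table name are hypothetical:

import hashlib
import MySQLdb

class MysqlStorePipeline:
    def open_spider(self, spider):
        # Open the connection once per crawl, not once per item.
        self.conn = MySQLdb.connect(host='localhost', user='user',
                                    passwd='secret', db='scrapy_db')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # md5(link) gives a fixed-size, indexable key, usually faster
        # to look up than comparing long VARCHAR URLs directly.
        url_hash = hashlib.md5(item['link'].encode('utf-8')).hexdigest()
        self.cursor.execute(
            "INSERT IGNORE INTO links (url_hash, link) VALUES (%s, %s)",
            (url_hash, item['link']))
        self.conn.commit()
        return item

With the unique index in place, INSERT IGNORE silently skips duplicates, so the link=%s check moves into the database itself.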

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to accomplish this with a list of links to be processed stored in MySQL.

First you select the records from MySQL. Then you can use a loop to issue the requests with their respective callbacks:

from scrapy import Request

for registro in registros:
    # Use the loop variable (not self.registro) and yield the
    # Request so Scrapy actually schedules it.
    yield Request(url=registro['url'], callback=self.meu_callback)
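
Put together, a hedged sketch of how this loop might live in a spider's start_requests(), reusing the illustrative links table from above; the processado column used to resume an interrupted crawl is an assumption of this sketch, not a built-in Scrapy mechanism:

import MySQLdb
from scrapy import Request, Spider

class MeuSpider(Spider):
    name = 'meu_spider'

    def start_requests(self):
        conn = MySQLdb.connect(host='localhost', user='user',
                               passwd='secret', db='scrapy_db')
        cursor = conn.cursor(MySQLdb.cursors.DictCursor)
        # Fetch only links not yet processed, so a stopped crawl
        # resumes from where it left off.
        cursor.execute("SELECT link FROM links WHERE processado = 0")
        registros = cursor.fetchall()
        conn.close()
        for registro in registros:
            yield Request(url=registro['link'], callback=self.meu_callback)

    def meu_callback(self, response):
        # Parse the page here and mark the link as processed in MySQL.
        pass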

Should I create a Request with callback parse_category?

It’s the right thing to do.

Is there any way I can append to self.queue or self.start_url to add new URLs to be processed?

You will load self.queue from your database, and each self.start_url entry will come from the URL column of each record retrieved.
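
As a sketch of the dynamic case: rather than appending to self.start_url after the crawl has started, the usual Scrapy pattern is to yield new Request objects from a callback; the scheduler queues them, and its built-in duplicate filter drops URLs it has already seen. The selector and callback names below are illustrative:

from scrapy import Request, Spider

class CategoriaSpider(Spider):
    name = 'categorias'
    start_urls = ['http://example.com/']  # illustrative starting point

    def parse(self, response):
        # Every Request yielded here joins the scheduler's queue;
        # the default dupefilter skips URLs already requested.
        for href in response.css('a.categoria::attr(href)').getall():
            yield Request(response.urljoin(href), callback=self.parse_category)

    def parse_category(self, response):
        # Extract this category's items here.
        pass

Appending to start_urls after startup has no effect, because Scrapy reads that list only once, when the crawl begins.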

  • Thanks. On the last question: I tried something like self.start_url.append(url) and it caused an error. Should I do the append in the database and then query it? Is there some way to do this directly, in loops while collecting items, for example? Something I can run anywhere in the code. In the future I will fetch from the database, yes; this is just to speed things up for now. Thanks.

  • First you query the database and fetch the data. Then you pass this data to the loop.

  • But in this case I meant a spider without a database, something more practical.

  • I don’t understand. What would be more practical than a database?

  • I mean, if I create a spider for testing, sometimes I want to run it regardless. But for that I needed to do start_url.append(url). I believe this should be handled by logic spread across parse_start, parse_cat, and parse_item. My doubt is whether there is any function that modifies start_url or adds to the queue. I saw something with Schedule.py, but I'm trying to avoid duplicate requests. Maybe that's where it is; I'm not sure.

  • You'd better put the code in your question, because I can't understand anything.

  • No problem, this question in the comment was unrelated to MySQL. But thanks for the answers. Thanks a lot.

  • While we're at it, see here how the site works. Thank you!
