Pass URL list to Scrapy function

I have a Python API that takes two arguments (a URL and a user-defined word) and returns, as JSON, how many times the specified word appears in the page at that URL.

However, I would like to pass a list of URLs. I would also like to make the requests with asyncio. Any suggestions?

Here is the code:

from flask import Flask
from flask_restful import Resource, Api, reqparse, abort
import requests

app = Flask(__name__)
api = Api(app)

parser = reqparse.RequestParser()
parser.add_argument('url')
parser.add_argument('word')
parser.add_argument('ignorecase')
	
# Function that sends a GET to the URL and returns how many times the word appears in the content
def count_words_in(url, word, ignore_case):
	try:
		r = requests.get(url)
		data = str(r.text)
		if (str(ignore_case).lower() == 'true'):
			return data.lower().count(word.lower())
		else:
			return data.count(word)
	except Exception as e:
		raise e
		
# Function that prepends 'http://' to the URL when it is missing and returns the valid URL
def validate_url(url):
	if not(url.startswith('http')):
		url = 'http://' + url
	return url
	

class UrlCrawlerAPI(Resource):
	def get(self):
		try:
			args = parser.parse_args()
			valid_url = validate_url(args['url'])
			return { valid_url : { args['word'] : count_words_in(valid_url, args['word'], args['ignorecase']) }}
		except AttributeError:
			return { 'message' : 'Please provide URL and WORD arguments' }
		except Exception as e:
			return { 'message' : 'Unhandled Exception: ' + str(e) }

		
api.add_resource(UrlCrawlerAPI, "/")

if __name__ == '__main__':
	app.run(debug=True)

1 answer

You asked two questions in one:

I would like to pass a URL list.

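Very little needs to change: accept a list of URLs and loop over it. With reqparse, declaring the argument with action='append' makes parse_args() return a list when the parameter is repeated in the request.

Maybe rename your parameter from url to urls, just to be consistent?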

args = parser.parse_args()
valid_urls = [validate_url(url) for url in args['urls']]
for valid_url in valid_urls:
    ...
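
To make the loop concrete, here is a minimal sketch that builds the same response shape as the question for a list of URLs. The function name `count_word_in_urls` and the `fetch` parameter are hypothetical, introduced only so the example can be run with a stand-in fetcher instead of a real network call; the default fetcher uses `requests.get` as in the question.

```python
def validate_url(url):
    """Prefix 'http://' when the URL has no scheme, as in the question."""
    if not url.startswith('http'):
        url = 'http://' + url
    return url


def fetch_text(url):
    """Default fetcher: a blocking GET with requests, as in the question."""
    import requests  # imported lazily so a stand-in fetcher needs no network
    return requests.get(url, timeout=10).text


def count_word_in_urls(urls, word, ignore_case=False, fetch=fetch_text):
    """Return {url: {word: count}} for every URL in the list (hypothetical helper)."""
    result = {}
    for url in (validate_url(u) for u in urls):
        text = fetch(url)
        if ignore_case:
            count = text.lower().count(word.lower())
        else:
            count = text.count(word)
        result[url] = {word: count}
    return result
```

Inside the resource's get method, the returned dictionary could be sent back directly, for example `return count_word_in_urls(args['urls'], args['word'])`, assuming `urls` was declared so that it parses to a list.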

I would also like to request with asyncio. Any suggestions?

You are using Flask, a synchronous framework based on the WSGI standard, which does not fit well with asyncio. Flask's methods do not yield control to an event loop, as asyncio requires, and Flask uses threads to serve multiple requests at the same time.

So you will have some difficulty integrating asyncio into Flask, and you will not gain much, since part of your I/O is not asynchronous. If you prefer to go this way, I suggest looking at the flask-aiohttp project, which provides this "glue", but I do not recommend it unless your project really needs to reuse code already written for both Flask and asyncio.

If you are starting out and want to use asynchronous programming, I suggest replacing Flask with a web framework that is itself asynchronous. There are several; one that has been successful in the Python community is Sanic, which is designed to be similar to Flask, so there will not be much difference.
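
Whichever asynchronous framework you pick, the counting itself can be sketched with plain asyncio. The example below is only an illustration under stated assumptions: the names `count_word`, `count_word_in_urls`, `fake_fetch` and `PAGES` are hypothetical, and `fake_fetch` stands in for a real asynchronous HTTP client (aiohttp, for example) so that the sketch runs without a network.

```python
import asyncio

# Hypothetical in-memory "pages" used instead of real HTTP responses.
PAGES = {
    'http://a.example': 'spam Spam eggs',
    'http://b.example': 'no match here',
}


async def fake_fetch(url):
    """Stand-in for an async HTTP GET; yields to the event loop once."""
    await asyncio.sleep(0)
    return PAGES[url]


async def count_word(fetch, url, word, ignore_case=False):
    """Fetch one page and count occurrences of the word in it."""
    text = await fetch(url)
    if ignore_case:
        return url, text.lower().count(word.lower())
    return url, text.count(word)


async def count_word_in_urls(fetch, urls, word, ignore_case=False):
    """Run all fetches concurrently; gather returns results in input order."""
    pairs = await asyncio.gather(
        *(count_word(fetch, url, word, ignore_case) for url in urls)
    )
    return {url: {word: count} for url, count in pairs}


if __name__ == '__main__':
    print(asyncio.run(count_word_in_urls(fake_fetch, list(PAGES), 'spam', True)))
```

The fetch coroutine is passed in as a parameter so that the blocking `requests.get` call is not used inside the event loop, where it would stall every other task while it waits.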

  • Thank you very much! I will try Sanic.
