How can I speed up my Python program?


I wrote a web scraping program, but the requests are very slow. I modified it so that it gets through the work faster when I run it in several windows at once, but that quickly becomes a mess.

Is there any way to do this in a single window?

Example of my code (it is only an example and was written carelessly; I would like to know how to improve the speed of the requests):

import random
import requests

numeros = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]

while True:
    cpf = "".join(random.choice(numeros) for i in range(11))
    r = requests.get(f"https://api.cpfcnpj.com.br/5ae973d7a997af13f0aaf2bf60e65803/9/{cpf}").json()
    if "CPF inválido!" in r:
        a = open("cpfinvalidos.txt", "a+")
        a.write(f"{cpf}\n")
        a.close()
    else:
        a = open("cpfvalidos.txt", "a+")
        a.write(f"{cpf}\n")
        a.close()
  • 2

    Add examples of how your code works so it is easier to understand what you are doing.

  • Example: it is a program that checks CPFs. It has a list of several CPFs and shows which of them are valid.

  • 4

    Without knowing the real performance bottleneck, there is no way to suggest improvements, and without seeing the code it is impossible to find it. We cannot even tell whether the program is slow because of Python's limitations or the developer's.

  • now I believe my question is complete, at a glance :)

  • 2

    I don't know whether the API you are calling is just an example, but if it is not, the fastest way to improve your code is to validate the CPF/CNPJ in your own code without calling the API. @Andersoncarloswoss himself has an answer here on the site with a CPF validator.

  • The request is not the only thing weighing this code down. What is killing it is opening and closing the same file on every iteration. Also, if you run multiple processes with this code, there is a good chance of a deadlock.


1 answer

1

Several things in your code directly hurt the application's performance, and they are not Python's fault; they come from how the code is structured.

First, you define an infinite loop, which by definition has no end, so your code will never stop running. Worse, the loop is infinite over something that is not. The number of possible CPFs, i.e., 11-digit numerical sequences, is finite, so there is no reason to keep querying random values forever.

To get around this, you can define a generator that produces every possible CPF and iterate over it. Python has a native function for this: itertools.product, called with repeat=11. (itertools.combinations_with_replacement would not work here, because it only yields sequences in sorted order and therefore skips most digit strings.) Moreover, you do not need to test every possibility: the last two digits of a CPF are check digits, so instead of discovering that a CPF is invalid by querying an external API, which is extremely expensive for the application, you can implement the check-digit validation yourself. Another approach is to generate only the 9-digit bases, compute the check digits for each one, and thereby guarantee that every sequence is valid.
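As a rough sketch of the check-digit idea, assuming the commonly documented CPF rule (weights counting down to 2, sum modulo 11, and a digit of 0 when the remainder is below 2), something like the following could replace the API call for the validity test. The function names are made up for this illustration, and rejecting sequences made of a single repeated digit is a convention usually applied on top of the checksum, not something the checksum itself enforces.

```python
from itertools import islice, product


def check_digit(digits):
    """Compute one CPF check digit from a list of 9 or 10 leading digits."""
    # Weights count down to 2: 10..2 for the first digit, 11..2 for the second.
    weights = range(len(digits) + 1, 1, -1)
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def complete_cpf(base):
    """Append the two check digits to a 9-digit base string."""
    digits = [int(ch) for ch in base]
    digits.append(check_digit(digits))  # first check digit
    digits.append(check_digit(digits))  # second one also covers the first
    return "".join(map(str, digits))


def is_valid_cpf(cpf):
    """Return True when an 11-digit string has consistent check digits."""
    if len(cpf) != 11 or not (cpf.isascii() and cpf.isdigit()):
        return False
    if cpf == cpf[0] * 11:
        # Assumption: repeated-digit sequences are rejected by convention,
        # even though they satisfy the checksum.
        return False
    return complete_cpf(cpf[:9]) == cpf


def all_checksum_valid_cpfs():
    """Lazily yield every 9-digit base completed with its check digits.

    Repeated-digit sequences are not filtered out here.
    """
    for base in product("0123456789", repeat=9):
        yield complete_cpf("".join(base))


if __name__ == "__main__":
    # Look at the first few generated values without building the whole list.
    for cpf in islice(all_checksum_valid_cpfs(), 3):
        print(cpf, is_valid_cpf(cpf))
```

Generating from the 9-digit bases means iterating over 10^9 candidates instead of 10^11, and no network request is needed to know whether a sequence is well formed. Whether a well-formed CPF actually belongs to someone is a different question that only an external source can answer.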

Second, you open and close the files for every CPF you analyze. Opening and closing files on disk is also a costly operation for your application. If you always work with the same two files, why not open them once, outside the loop?

with open('cpfs_validos.txt', 'a+') as validos, open('cpfs_invalidos.txt', 'a+') as invalidos:
    for cpf in cpfs:
        ...
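To make the loop body concrete, here is a minimal sketch of the same idea. It assumes the classification has already been done and arrives as pairs of (cpf, is_valid); the function name write_results and its parameters are hypothetical and only serve the example.

```python
def write_results(results,
                  valid_path="cpfs_validos.txt",
                  invalid_path="cpfs_invalidos.txt"):
    """Append each CPF to the valid or invalid file, opening each file once.

    `results` is any iterable of (cpf, is_valid) pairs.
    """
    # Both files are opened a single time, before the loop starts, and are
    # closed automatically when the with block ends.
    with open(valid_path, "a") as validos, open(invalid_path, "a") as invalidos:
        for cpf, is_valid in results:
            target = validos if is_valid else invalidos
            target.write(f"{cpf}\n")
```

Because `results` can be a generator, the files stay open while items are produced one by one, so nothing has to be collected in memory first.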

Third, and most importantly, you rely exclusively on an external API. Not only is it outside your control, but HTTP requests are also quite expensive. In your case it is even worse because you make one request at a time, synchronously: until the application receives the response from the API, the program is stuck with nothing to do. To improve this, you can use asynchronous code with the asyncio package, which lets you have several requests in flight at the same time. Even with that in place, however, you remain bound by the limitations of the API itself. Many APIs do not prioritize response time, and some, mostly the free ones, limit the number of requests per unit of time. Handling all of that is up to you.

Doing this will certainly reduce your application's running time considerably, though I cannot guarantee it will reach the level you expect. If it does not, it may be worth considering a switch to a faster API or even to another programming language.
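One way this could look is sketched below, using only the standard library. Since requests.get is a blocking call, the sketch runs each call in a worker thread through asyncio.to_thread (Python 3.9 or later) and caps how many run at once with a semaphore. The names check_all, fetch and max_concurrent are assumptions made for this example: fetch stands for whatever blocking function performs one lookup, for instance a small wrapper around requests.get(...).json().

```python
import asyncio


async def check_all(cpfs, fetch, max_concurrent=10):
    """Call the blocking `fetch(cpf)` for every CPF, several at a time.

    Returns a list of (cpf, result) pairs in the same order as `cpfs`.
    """
    # The semaphore caps how many lookups are in flight at once, which
    # helps to stay within whatever rate limit the API imposes.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def check_one(cpf):
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop
            # can start other lookups while this one waits on the network.
            result = await asyncio.to_thread(fetch, cpf)
            return cpf, result

    return await asyncio.gather(*(check_one(cpf) for cpf in cpfs))


if __name__ == "__main__":
    # Stand-in for a real request, only to show how the pieces fit together.
    def fake_fetch(cpf):
        return cpf.endswith("25")

    print(asyncio.run(check_all(["52998224725", "12345678900"], fake_fetch)))
```

A single process can then keep many requests pending at the same time, which is roughly what running the script in several windows achieves, without the windows. An async HTTP client would be an alternative to threads, but that is a separate design choice.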

  • My program is already optimized; this was just an example. My problem is that the requests are very slow, so I have to open several windows to go faster.

  • 4

    @x8ss If your program is already optimized and implements everything I described, then your question does not reflect the real problem and should be closed as not reproducible or unclear, because you have omitted basic information needed to identify the problem. If it does not implement everything I mentioned, then at the very least it is not optimized.
