How to monitor a URL for changes?

I have a system that needs to compare various values across several sites. These values are read from an XML file provided by each site.

The problem is that reading a URL with cURL in the usual way works fine for a single site; in my case, however, there are numerous websites.

After I get the information, I need to compare it and that’s the problem.

It gets slower every time a new site is added. I am currently doing this with cron jobs + cURL in PHP.

1 answer

You cannot monitor a URL without actually querying it.

Unless the website notifies you that something has changed, you will only find out by querying it yourself.
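
If the server in question sends caching headers such as Last-Modified, you can at least ask cheaply whether anything changed before downloading the whole XML. A minimal sketch of such a conditional request, assuming the remote server actually honours If-Modified-Since (many do not):

function hasChangedSince(string $url, string $lastModified): bool
{
    // $lastModified must be an HTTP date, e.g. "Sat, 29 Oct 1994 19:43:31 GMT".
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ["If-Modified-Since: $lastModified"],
    ]);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // 304 Not Modified means nothing changed; any other status, assume it did.
    return $status !== 304;
}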

I'll outline an idea of how to design this.

First, let’s consider that you have separate resources:

  1. Controller of websites that will be monitored;
  2. Crawler;
  3. Comparator.

Think of it this way: the Controller's job is to know which sites need to be queried, when, and who the work should be handed to.

The Controller runs from crontab, but it does not do the crawling itself; it hands that responsibility to the Crawler. This way you can have multiple queries running at the same time.

The Comparator is independent and can be triggered however you prefer; it does not interfere with anything else.

I suggested separating the resources because then nothing is tightly coupled or overly dependent. You can even move the pieces to separate servers if the project grows.

A starting point:

Consider this to be the Website Controller:

$sites = ['site1', 'site2', 'site3'];

foreach ($sites as $site) {
    // Here you hand the site to be queried over to the crawler.
    // You could do this in a method inside this same file, but that would not allow parallel work.
    // For this to be effective, create a PHP script that does the crawling and call it here without waiting for it to return. E.g.:

    shell_exec('php crawleador.php ' . escapeshellarg($site) . ' > /dev/null 2>&1 &');

    // This way the foreach finishes very quickly and the crawlers that were launched do their work on their own.
}

Then you work on your Crawler:

$url = $argv[1]; // the site passed by the controller as a command-line argument
// Here you implement the logic of your parser and store that information somewhere (MySQL?).
// This file runs because the previous one launched it. Since there are multiple links, you will have several sessions running independently.
// end
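
As a rough sketch of what crawleador.php could look like, assuming the controller passes the site as a command-line argument (as in the snippet above) and the XML is parsed with SimpleXML:

// crawleador.php -- minimal sketch of the crawler step.
if (!isset($argv[1])) {
    exit("usage: php crawleador.php <url>\n");
}
$site = $argv[1];

$ch = curl_init($site);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

if ($xml === false) {
    exit(1); // fetch failed; a real crawler would log and retry here
}

// Parse the XML and store the values you need somewhere (MySQL, for example).
$data = simplexml_load_string($xml);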

Another alternative would be to implement some multithreading/parallelism feature in your PHP; it will probably perform even better.
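
One way to do that without spawning one process per site is PHP's curl_multi API, which drives all the requests in parallel inside a single script. A sketch (the site list is illustrative):

$sites = ['https://site1/feed.xml', 'https://site2/feed.xml', 'https://site3/feed.xml'];

$mh = curl_multi_init();
$handles = [];

foreach ($sites as $site) {
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$site] = $ch;
}

// Drive all transfers until every one has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for network activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $site => $ch) {
    $body = curl_multi_getcontent($ch);
    // ... hand $body over to your parser / comparator here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);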

If all you need is to detect whether there was a change, one idea that can simplify your parser is to use a hash. For example:

If you take the md5sum of a file twice, the result is the same.

If the file changes, the md5 will be different.

I suggest you do your comparisons with this hash; storing the hash and comparing it in the query itself can even keep your database leaner and make the analysis process faster.
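
A minimal sketch of that comparison, assuming a MySQL table called snapshots with site and hash columns (table and column names are just illustrative, and site is assumed to be a unique key):

$site = $argv[1];
$xml  = file_get_contents($site);   // or the cURL fetch from the crawler above
$hash = md5($xml);

$pdo  = new PDO('mysql:host=localhost;dbname=monitor', 'user', 'pass');
$stmt = $pdo->prepare('SELECT hash FROM snapshots WHERE site = ?');
$stmt->execute([$site]);
$previous = $stmt->fetchColumn();

if ($previous !== $hash) {
    // Something changed: store the new hash and trigger the Comparator.
    $pdo->prepare('REPLACE INTO snapshots (site, hash) VALUES (?, ?)')
        ->execute([$site, $hash]);
}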

  • Got it. I’m going to apply this model. Thanks for your help.
