In which programming language does a crawler/scraper scan the DOM faster?

I developed a script that uses PHP's DOMDocument class to crawl a third-party website.

The script's speed does not meet my goal. I would like to know which programming language would give me a faster DOM scan for the same purpose.

1 answer

Programming languages do not have speed as a built-in property; some have features that make fast implementations easier. Libraries can be fast or slow, and you are not obliged to use the default one. If the standard library does not meet your performance requirements, which is rare, very rare, then look for another library.

What gives you speed is choosing the right data structure and the right algorithm. The difference between the right choice and the wrong one can be the difference between finishing in under a second and running for centuries. There are real cases in that proportion, and they are not few.
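As a small illustration of that point (my own example, not from the question): the same lookup done against a list and against a set gives identical answers, but the cost per lookup is O(n) in one case and O(1) in the other, and the gap grows with the data size.

```python
import time

n = 100_000
items = list(range(n))
lookups = list(range(0, n, 7))

# Wrong structure: membership testing on a list is O(n) per lookup.
start = time.perf_counter()
hits_list = sum(1 for x in lookups if x in items)
list_time = time.perf_counter() - start

# Right structure: membership testing on a set is O(1) per lookup.
item_set = set(items)
start = time.perf_counter()
hits_set = sum(1 for x in lookups if x in item_set)
set_time = time.perf_counter() - start

print(hits_list == hits_set)  # same result either way
print(set_time < list_time)   # the set version is dramatically faster
```

No change of language was needed for that speedup; only the data structure changed.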

Choosing a faster language can make something that takes 1 minute take less than 1 second, but not much more than that, and only in a few cases does it make even that much difference. And that is comparing languages with blatant differences, for example one of the worst Ruby implementations against very well-written Assembly.

Assembly is the language that allows the best possible performance. But in practice today it is so difficult to write correct and fast Assembly code that code written in C will almost always be faster. In some cases C++, Rust, or Fortran may be better. But in Delphi, Java, and C#, just to name a few, most tasks will run with minimal difference compared to those languages, and even in the bad cases the difference is something like 3 seconds where C would take less than 1 (and almost always the difference is far smaller, almost negligible).

If you want to stay with a scripting language, then JavaScript (or perhaps TypeScript) and Lua, especially the LuaJIT implementation, should be the best options.

PHP does not have such bad performance, especially in newer versions.

But if you don't master the language, programming in general, and the concepts described above, the result won't be good.

Most applications don't need as much performance as people think, and the ones that do usually require hard, complex engineering work. So if a big performance gain is possible just by changing something, it is because the original was very wrong (but working, which makes people think it was right).

If you do it right, the bottleneck is likely to be fetching the information over the network, even in "slow languages".
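To put that in perspective, here is a hypothetical sketch (standard library only, with a synthetic page standing in for a real site) showing that the DOM scan itself typically finishes in milliseconds, far below the hundreds of milliseconds a network round trip can cost:

```python
import time
from html.parser import HTMLParser

# Collect every href from <a> tags while streaming through the markup.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Synthetic page with 10,000 links, standing in for a real site.
html = "<html><body>" + "".join(
    f'<a href="/page/{i}">item {i}</a>' for i in range(10_000)
) + "</body></html>"

collector = LinkCollector()
start = time.perf_counter()
collector.feed(html)
parse_time = time.perf_counter() - start

print(len(collector.links))  # 10000 links found
print(parse_time < 5)        # the scan itself is cheap
```

If the parse is this cheap even with a plain interpreted parser, switching languages will not rescue a crawler whose time is spent waiting on HTTP responses.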

It is possible to find comparisons between languages. But note that such a comparison is called a "game"; it is not a scientific method. If you use it to make important decisions, you may get burned.

  •

    I personally recommend Python for its simplicity and satisfactory results. A few days ago I wrote a script in PHP, I don't know if you can call it a web crawler, it was more of a page fetcher, and the speed left something to be desired. However, I had already written a Python bot to fetch images from Imgur and I was impressed with the language's speed. But basically what counts is the implementation of the algorithm and the speed/frequency of your requests.

  •

    @lazyFox Although the standard PHP implementation is often much faster than the standard Python implementation in most tasks. I only trust comparisons where I have seen the source code, the data used, and other details that let me judge how comparable they are.

  • Thanks for the answer; I understood that the secret is to "polish" the algorithm.

  • Now in your opinion, for this purpose and for performance, which would be better to use: PHP, Python, Node.js, or C#? @Maniero

  •

    C#, but I'm biased :) Even so, C#.

  •

    In short: performance is a feature of the implementation, not of the model (though some models favour more performant implementations, of course).

  • @Gabriel Exactly.

  • @Charlesfay As for your last comment: I have been involved with data-mining projects where I had to create several crawlers to collect data from the internet, so I had the opportunity to test various environments and libraries. The fastest language I found was Go (and Go's memory usage is impressively low). However, my favorite language for this was Python, because the performance is sufficient, the code is very succinct and flexible, and there is a very large variety of tools for web scraping.
