Which language performs best for a multithreaded web crawler using parallelism?



2

I am about to start a project in which one of the phases will fetch certain information from the websites of other companies.

Bearing in mind that the web crawler will go through x sites and visit several pages on each site at least once a day, it might be worthwhile to make the crawler multithreaded, using parallelism to speed up the process.

On this point, I don't know whether Python can offer what Java does.

Question:

Given the tools each language has, which does experience show to be the better-performing option, Java or Python? On this point, does Python lose out to Java?

  • 3

    Even in its current form, the question is still off-topic, as it is opinion-based. But even though I can't provide an answer, I'll try to leave you with a more open mind about this. Neither language is necessarily better than the other; what I believe can improve performance is the library you use to download the pages, the capacity of your server, and the structure of your algorithm. However, a question focused only on threads could be answered if it were something like "How do threads work in Python and Java?"; from the answers you could then test...

  • 1

    ...which one has better "performance" on your server. I believe the only real difference between the two languages will be how threads work. Other than that, all the answers I found on SOen are very old, and I did not find a good way to weigh them. I have little knowledge in this area; I believe someone will soon provide a clearer answer or a comment that addresses the subject better.

  • Take a look at the talk by Thiago Avelino on how he greatly improved his web crawler's performance by switching from Python to Go.

  • 1

    For me this question is opinion-based. The best possible answer is: it depends. It depends on the implementation, the network, the amount of information, and so on. I could argue that the language makes no difference, because even without thread support it is possible to parallelize using separate processes. Also, the bottleneck will usually be network transfer, not processing. In any case, any performance loss or gain will have more to do with the architecture of the solution than with the language itself.

  • One way to implement this is to create a queue of items to be processed in any database. Then you start several processes that take items off the queue, download the content, and process it. In fact, you can write the processes in different languages and run them all at the same time. Going further, you can split the work into two phases: in the first, the file is downloaded and saved to a temporary location; in the second, the content is processed. You could then download using Java and parse using Python, or vice versa. Summary: focus on architecture.

  • Yes, I understand that the architecture, more than the language, is what will optimize the crawler. But given the APIs and libraries each language offers, I thought that could influence the choice.
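
The queue-and-workers architecture suggested in the comments above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for the database-backed queue, threads stand in for the separate processes, and `download_all`, `parse_all`, and `fetch_with_urllib` are hypothetical names, not part of any library.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_with_urllib(url, timeout=10):
    """One possible fetch function: download the raw bytes of a page."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch, workers=8):
    """Phase 1: workers drain a shared queue and download each URL.

    Returns {url: body}, where body is the exception if the download failed.
    """
    pending = queue.Queue()  # assumption: in-memory stand-in for a database queue
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to download
            try:
                body = fetch(url)
            except Exception as exc:  # keep one bad site from stopping the worker
                body = exc
            with lock:
                results[url] = body

    # Leaving the "with" block waits for every worker to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
    return results


def parse_all(pages, parse):
    """Phase 2: process the downloaded content, skipping failed downloads."""
    return {
        url: parse(body)
        for url, body in pages.items()
        if not isinstance(body, Exception)
    }
```

Something like `parse_all(download_all(urls, fetch_with_urllib), len)` would then run both phases. Because the workers spend most of their time waiting on the network, threads are usually enough for the download phase; the parse phase is independent and could just as well run in another process or another language.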


1 answer

0

It depends on which tool you are going to use and on whether you need to emulate a browser.

I have written many crawlers and web scrapers, mainly in Python. Although Python has libraries such as lxml that are very fast, the biggest problems when processing third-party sites are poorly formed HTML and content generated by JavaScript.
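
To illustrate the point about poorly formed HTML, here is a minimal sketch that pulls links out of markup with unclosed tags. It uses the standard library's `html.parser` only so that it runs without extra installs; it is a stand-in for lxml, not the faster option mentioned above. `LinkCollector`, `extract_links`, and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, even if the markup is never closed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page with unclosed <p>, <a>, <div>, and <html> tags.
broken = '<html><body><p>Unclosed<a href="/pricing">Pricing<div><a href="/contact">Contact</body>'
print(extract_links(broken))  # ['/pricing', '/contact']
```

A tolerant parser handles the first problem, but not the second: anything the page builds with JavaScript is simply absent from the downloaded markup.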

In the end it is worth using Selenium, which can drive a real browser (Firefox, Chrome, etc.).

Selenium does not perform very well, precisely because browsers take up a lot of memory and consume a lot of CPU; Selenium itself only sends commands to the browser you chose.

But it is worth it, because the JavaScript runs exactly as it was tested and the site behaves as if a person were using it. If the information you need is visible to the user, you can access it with Selenium.

Selenium has bindings for several languages, including Python and Java. You can write your code in either of the two; in the end, the best language is the one you know best or are most productive in.
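
As a rough sketch of what this looks like with the Python bindings, assuming Selenium 4 or later and Firefox are installed (`visible_texts` and `open_headless_firefox` are names invented for this example):

```python
def visible_texts(driver, url, css_selector):
    """Load the page in the browser and return the text of the matching
    elements that are actually displayed to the user."""
    driver.get(url)
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    elements = driver.find_elements("css selector", css_selector)
    return [element.text for element in elements if element.is_displayed()]


def open_headless_firefox():
    """Start a real Firefox without a window (assumes selenium and Firefox exist)."""
    from selenium import webdriver  # imported here so the helper above has no hard dependency

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


# Example usage (commented out because it launches a browser):
# driver = open_headless_firefox()
# try:
#     print(visible_texts(driver, "https://example.com", "h1"))
# finally:
#     driver.quit()
```

Each browser instance is heavy, so in a crawler it probably makes sense to reuse one driver for many pages and to keep the number of parallel browsers small.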
