Most voted "web-crawler" questions
A web crawler (also known as a web spider) is a computer program that navigates the World Wide Web in a methodical, automated manner. Other terms for web crawlers are ants, automatic indexers, bots, spiders, web robots, or, especially in the FOAF community, web scutters.
66 questions
-
17 votes, 2 answers, 618 views
How can we prevent indexing by search engines?
The other day I searched for my domain on Google and it returned both my website and my system. I would like my system to be hidden from Google and any other search engine. Is that possible? And how to get the indexing already…
-
10 votes, 3 answers, 387 views
How does semantics/indexing work with AngularJS?
I have always wondered: AngularJS is a framework that is widely used, but I have a question about how it works for crawlers (Googlebot, for example). Do they even run the JavaScript and interpret the…
-
10 votes, 1 answer, 1417 views
Protecting web pages from automated access
How can I protect my web pages so that they are not accessed in an automated way? By search engine bots like Googlebot (I think the basic form was the meta tag with noindex and nofollow). By…
-
8 votes, 3 answers, 5579 views
Simples Nacional opt-in lookup (by CNPJ)
I'm trying to implement a Simples Nacional lookup; it works similarly to the CNPJ lookup on the Receita Federal site. Details I've understood so far: after loading the page, it runs an Ajax call (filing…
-
7 votes, 1 answer, 226 views
Need for server-side rendering of JavaScript content - AngularJS
Knowing that since this year Google's crawler executes JavaScript, and considering the indexing of content that is displayed using AngularJS, is there still a need for a version of the same content…
-
6 votes, 2 answers, 282 views
Conflict between simple_html_dom and non-object-oriented functions
I'm developing an app that has to access a list of websites stored in a database and load all their links. It's a test application, but I've run into a difficulty. The routine is this one: function…
-
5 votes, 1 answer, 166 views
Does carousel content harm SEO? Is hidden carousel content indexed?
I have a question about carousels and whether their content is indexed by search crawlers. First, I believe that most carousels are not very friendly from an accessibility point of view. This…
-
4 votes, 2 answers, 126 views
How to calculate an optimal value for Scrapyd's CONCURRENT_REQUESTS variable?
One of the default settings in Scrapyd is the number of concurrent processes (it is 16). CONCURRENT_REQUESTS = 16 What would be the best methodology to calculate an optimal value for this variable? The…
-
3 votes, 1 answer, 152 views
Multiple pipelines to handle different files in Scrapy
How should pipelines.py be handled when we have different Spiders? Example: I have a Spider that gets posts from a particular blog and another that saves JPEG banner images found on each…
-
3 votes, 1 answer, 126 views
How to manage the operation and failures of running Spiders?
I'm developing a module to get information about the Spiders that run on the company's system. Below is the model where we keep the start of operations and the job. I would like to validate if…
-
3 votes, 2 answers, 793 views
What HTTP methods can a crawler not track?
A conceptual question (or not): of the HTTP methods, which ones cannot be "tracked" (or interpreted) by a crawler? POST, GET, PUT, PATCH, DELETE. Can someone with knowledge of the subject answer?…
-
3 votes, 1 answer, 309 views
Information contained in two pages with Scrapy
I'm not a Python programmer, but I'm trying to work with Scrapy. The example above is what I need; it runs in a Chrome extension. To explain, I need the post and all available…
-
3 votes, 0 answers, 136 views
Multithreaded crawler problem using jsoup
Hello, I'm developing a multithreaded crawler in which each job (thread) handles X sites, analyzing certain content with the jsoup library. The sites are all accessible. The problem is that the final results…
-
3 votes, 1 answer, 221 views
Scrapy cannot select a form using XPath
Hello, I am using Scrapy to build a crawler that collects public exam questions and the like from the site gabarite.com.br. I can get the description of the question and the correct alternative, but I…
-
3 votes, 1 answer, 27 views
How to tell search engines when an HTML document was updated?
According to MDN, using the <time></time> tag with the datetime attribute, <time datetime="yyyy-mm-dd hh:mm:ss"></time>, allows search engines to know the document's creation…
-
2 votes, 1 answer, 138 views
Implementing queues to manage concurrency between Spiders in Scrapyd
Is there any way for Scrapyd to create Spider queues so that, when I send many Spiders (with different functions), I can prioritize/limit the concurrency between them? Today, all the Spiders I send…
-
2 votes, 2 answers, 113 views
How to protect my Scrapyd server from unauthorized calls?
Let’s say I have the following configuration in scrapy.cfg in Scrapyd. [deploy] url = http://example.com/api/scrapyd/ username = user password = secret project = projectX In the Scrapyd…
-
2 votes, 1 answer, 4068 views
Creating a PHP crawler
I am a layman on the subject and would like to know where I can find more information about creating a crawler to download data and images from some websites. I searched a lot but so far I found…
-
2 votes, 1 answer, 75 views
Problems with the restrict_xpaths parameter in a crawler
I have no Python experience, but I decided to try to do something with Scrapy for testing. So I’m trying to collect the existing articles on a particular page, namely a DIV element with an ID…
-
2 votes, 1 answer, 568 views
Web crawler (spider) with Ajax in JSF using Node.js or the jsoup API in Java
I have the task of creating an interface optimized for touch monitors, taking data from a website (http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml). This site gives a listing of bus…
-
2 votes, 1 answer, 77 views
How to extract text from a selected BeautifulSoup element?
I'm making a simple crawler to get some news from the financial market. The code below is working properly, but I would like to extract only the headline and strip the HTML/CSS code. import…
-
2 votes, 2 answers, 279 views
Web crawler searching for specific text on the page
Well, I'm making a web crawler to fetch the value of a currency. I wrote the following code in Python: #coding: utf-8 from urllib2 import urlopen conteudo =…
-
2 votes, 1 answer, 126 views
IndexOutOfBoundsException on get(0) in a jsoup crawler
I would like to get the names of the companies that appear in a search like "Farmacias em Santo Andre" on Google Maps. Error: Exception in thread "main" java.lang.IndexOutOfBoundsException: Index 0…
-
1 vote, 0 answers, 74 views
Read all posts and comments from a Facebook user?
Dear friends, I need some help. I need to build a crawler that reads all posts with friends and their comments using PHP. I can do a test with my posts via dosdk-php. I read from the first page…
-
1 vote, 1 answer, 425 views
PHP crawlers for external websites: PHPCrawl API
Good evening everyone. I'm new to the subject and I'm trying to build a search engine for external sites (an indexer) with PHP. I found an API that provides a crawler, but it seems to only search for…
-
1 vote, 1 answer, 127 views
Crawler for when http_status_code is different from 200
I'm making a mini crawler in PHP using a library called "PHPCrawl" for the crawling and the "simple_html_dom_parser" library to parse the HTML. The question is: simple_html_dom cannot…
-
1 vote, 2 answers, 1119 views
Developing a web crawler in Python
Is there any simple open source web crawler project, developed in Python, for study? I have been studying and researching the subject for some time, but I cannot find anything ready-made. My…
-
1 vote, 1 answer, 287 views
How to scrape a page that uses JavaScript with Python?
I need to scrape a page, but the page entry has a button (apparently JavaScript) that gives access to all the content of the page itself. Using traditional libs (urllib2,…
Tags: javascript, python, web-crawler, web-scraping, scraping
Asked 7 years, 9 months ago by Wellington Araujo Nogueira (41)
-
1 vote, 1 answer, 788 views
How to collect text when there is no HTML class to reference - Python crawler
I have the following situation: I want to collect the text "Texto para crawler" shown below. How do I navigate to it without a class or id? <td>Texto para crawler</td>…
-
1 vote, 1 answer, 1615 views
Can I run a Python program automatically somewhere other than my computer?
Good morning guys, I made a Python script that searches various websites for occurrences of a few words and stores them in a SQL database. I would like to track the occurrence of these words over time,…
-
1 vote, 0 answers, 521 views
Automate website browsing for software testing activities
I'm on a project developing web systems accessed through the browser. We are constantly modifying how the processes work, and at each specific period we perform a test on the…
-
1 vote, 1 answer, 450 views
In which programming language does a crawler/scraper scan the DOM faster?
I developed a script in which I use PHP's DOMDocument class to crawl a third-party website. The speed of the script does not meet the expected goal; I would like to know in which…
-
1 vote, 0 answers, 65 views
Simulating clicks on an interactive online chart
I need to select the information on the left side of the chart according to my need, but I cannot reference the information. I didn't include any code because I tried it in various ways (according to…
-
1 vote, 1 answer, 114 views
Web crawler with Django's view.py
I am making a simple web crawler using Django 2.0. I want to capture only the "title" class of the news and then render it ("return render") to a simple HTML page; my view.py is below. I am currently using…
-
1 vote, 1 answer, 450 views
Concurrent requests using Axios without losing the session
I am developing a crawler using Axios. How can I make multiple requests or callbacks without losing the session and without having to manipulate cookies? Note the example below. When logging in, I…
-
1 vote, 0 answers, 59 views
Problem with crawler
I'm trying to make a simple crawler that reads the temperature, just for study. I'm using simple_html_dom to read the page, but the file_get_html function, called on the link above, presents some…
-
1 vote, 1 answer, 31 views
Indexing pages that may redirect
I have a site that contains some pages, but some of these pages are not being indexed by Google. However, the pages that Google does not index cannot be accessed if a certain option is not…
-
1 vote, 1 answer, 57 views
index(find) + len(find) ValueError: substring not found - Python crawler
Guys, I need some help with Python code that searches results from the internet. Python 3.6. The first one, for Bitcoin, worked; the second one raises an error. from urllib import request url =…
-
1 vote, 1 answer, 343 views
How do I generate an Excel file with the data obtained from a web crawler?
I'm making a web crawler that should extract the name and price of iPhones that appear in a search on the Amazon site and generate an xlsx file with this data. However, I am unable to generate the xlsx…
-
1 vote, 0 answers, 32 views
Scrapy - Search for items in form
I am a complete beginner in the subject; can you help me? I am testing Spiders to search for bids, but I can't return the items through the form. I have the example code below: import scrapy from scrapy.http…
-
0 votes, 1 answer, 2139 views
How to make a web crawler access pages that need authentication?
I need to develop a web crawler that would access a page (which requires a login, and I have the credentials), and the "robot" would find all the links on the page and list them somewhere,…
Tags: web-crawler
Asked 10 years, 9 months ago by MDomingues (93)
-
0 votes, 2 answers, 115 views
Simple HTML DOM PHP 404 error
I have the following code: <?php include './simple_html_dom.php'; // This link exists $teste = new simple_html_dom("http://www.btolinux.com.br/"); echo $teste->original_size."<br>";…
-
0 votes, 1 answer, 495 views
Request on a page with Guzzle
I'm having trouble making a POST request to a website using the Guzzle component. The target site is: http://ciagri.iea.sp.gov.br/nia1/subjetiva.aspx?cod_sis=1&idioma=1 It even reaches the site…
-
0 votes, 0 answers, 251 views
Parse a page
I am trying to get information from a page through its URL. I am developing in Symfony and using simple_html_dom or Crawler. But I'm nowhere near doing what I need to do. The page I'm accessing…
-
0 votes, 1 answer, 74 views
Do search engine crawlers/bots/web spiders copy and access the href of a link, or "click" on <a></a> to be redirected?
I have this question because I want to develop a portal in Ajax whose pages can also be accessed via URL. My question is: if the <a></a> elements have return false on click, the…
-
0 votes, 0 answers, 87 views
Crawler for WooCommerce
Good afternoon, friends. I'm developing a PHP crawler that will scrape some URLs that I provide. I'm trying to get it to pull the values from a dynamic URL, but I'm not succeeding. Could…
-
0 votes, 1 answer, 430 views
How do I integrate my Django project with Scrapy?
I'm looking to develop a simple project using Django in which I will create a web page that captures data from other pages. The problem is that I cannot integrate Scrapy with Django.…
-
0 votes, 1 answer, 168 views
Problem with DOMDocument and OpenSSL
I'm trying to get information from a website using DOMDocument, but it raises an error. DOMDocument::loadHTMLFile(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL…
-
0 votes, 1 answer, 47 views
Select does not update table data after selecting an option
I am trying to select the field with this query; the value of the select changes, but the table values are not reloaded and all rows are shown. Clicking with the mouse, however, works.…
-
0 votes, 1 answer, 299 views
Upload a file with pure JavaScript (crawler): setting a file input
I am crawling a web page and at some point I need to upload a file; this crawler is for testing a system. What I need to do is use pure JavaScript to attach a document to the following…