Most voted "web-crawler" questions

A web crawler (also known as a web spider) is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for web crawlers are ants, automatic indexers, bots, spiders, web robots, or, especially in the FOAF community, web scutters.


66 questions


17
votes

2
answers

618
views

How can we not allow indexing by search engines?

The other day I searched for my domain on Google and it returned both my website and my system. I would like my system to be hidden from Google and any other search engine. Is that possible? And how to get indexing already…

google quest web-crawler
asked 10 years, 5 months ago Marconi 17,287
10
votes

3
answers

387
views

How do semantics/indexing work with AngularJS?

I’ve always wondered: AngularJS is a framework that is used constantly, but I have a question about how it works for crawlers (Googlebot, for example). They even run the JavaScript and interpret the…

angularjs web-application web-crawler semantics
asked 10 years, 1 month ago Hiago Souza 5,837
10
votes

1
answer

1417
views

Protecting web pages from automated access

How can I protect my web pages so that they are not accessed in an automated way? By search engine bots like Googlebot (I think the basic way is the meta tag with noindex and nofollow). By…

web-crawler scraping
asked 10 years, 2 months ago Ricardo 14,521
8
votes

3
answers

5579
views

Simples Nacional opt-in lookup (by CNPJ)

I’m trying to implement a Simples Nacional lookup; it works similarly to the Receita Federal’s CNPJ lookup. Details I’ve understood so far: after the page loads, it runs an Ajax call (filing…

php curl web-crawler captcha
asked 10 years, 2 months ago Rafael Withoeft 2,287
7
votes

1
answer

226
views

Need for server-side rendering of JavaScript content - AngularJS

Given that, as of this year, Google’s crawler executes JavaScript, and considering the indexing of content that is displayed using AngularJS, is there still a need for a version of the same content…

javascript html angularjs seo web-crawler
asked 11 years ago gpupo 2,181
6
votes

2
answers

282
views

Conflict between simple_html_dom and non-object-oriented functions

I’m developing an app that has to access a list of websites stored in a database and load all their links. It’s a test application, but I’ve run into a difficulty. The routine is this one: function…

php mysql dom web-crawler
asked 11 years, 3 months ago pdonatilio 93
5
votes

1
answer

166
views

Does carousel content harm SEO? Is hidden carousel content indexed?

I have a question about carousels and whether their content is indexed by search crawlers. First, I believe that most carousels are not very friendly from an accessibility point of view. This…

html css browser seo web-crawler
asked 6 years, 7 months ago hugocsl 65,517
4
votes

2
answers

126
views

How to calculate an optimal value for Scrapyd’s CONCURRENT_REQUESTS variable?

One of the default settings in Scrapyd is the number of concurrent processes (which is 16). CONCURRENT_REQUESTS = 16 What would be the best methodology to calculate an optimal value for this variable? The…

python web-application web-crawler scrapy
asked 10 years, 7 months ago Arthur Alvim 647
3
votes

1
answer

152
views

Multiple Pipelines to Handle Different Files in Scrapy

How should pipelines.py be handled when we have different Spiders? Example: I have a Spider that gets posts from a particular blog and another that saves images from JPEG banners found on each…

python web-application web-crawler scrapy
asked 10 years, 7 months ago Arthur Alvim 647
3
votes

1
answer

126
views

How to manage the operation and failure in the execution of Spiders?

I’m developing a module to get information about the Spiders that run on the company’s system. Below is the model where we keep the beginning of operations and the job. I would like to validate if…

python web-application web-crawler scrapy
asked 10 years, 6 months ago Arthur Alvim 647
3
votes

2
answers

793
views

What HTTP methods can a Crawler not track?

A conceptual question (or not): of the HTTP methods, which of them cannot be "tracked" - or interpreted - by a crawler? POST, GET, PUT, PATCH, DELETE. Can someone with knowledge on the subject answer?…

http web-crawler
asked 9 years, 4 months ago Marllon Nasser 3,845
3
votes

1
answer

309
views

Information contained in two Scrapy pages

I’m not a Python programmer, but I’m trying to work with the Scrapy application. The example above is what I need; it runs in a Chrome extension. To explain, I need the post and all available…

python web-crawler scrapy
asked 9 years, 2 months ago Luiz Brz Developer 163
3
votes

0
answers

136
views

Multithreading Crawler problem using jsoup

Hello, I’m developing a multithreaded crawler in which each job (thread) handles X sites and analyzes certain content with the jsoup library. The sites are all accessible. The problem is that the final results…

java multithreading web-crawler http-status jsoup
asked 8 years, 8 months ago user2989745 399
3
votes

1
answer

221
views

Scrapy cannot select a form using xpath

Hello, I am using Scrapy to build a crawler that collects public exam (concurso) questions and so on from the site gabarite.com.br. I can get the question’s description and the correct alternative, but I…

python web-crawler scrapy
asked 8 years, 1 month ago joao paulo santos almeida 486
3
votes

1
answer

27
views

How to tell search engines that an HTML document was updated?

According to MDN, using the <time></time> tag with the datetime attribute, <time datetime="yyy-mm-dd hh:mm:ss"></time>, allows search engines to know the document’s creation…

html5 seo web-crawler
asked 6 years, 10 months ago ayelsew 818
2
votes

1
answer

138
views

Implementing queues to manage concurrency between Spiders in Scrapyd

Is there any way for Scrapyd to create Spider queues so that, when I send many Spiders (with different functions), I can prioritize/limit the concurrency between them? Today, all the Spiders I send…

python web-application web-crawler scrapy
asked 10 years, 7 months ago Arthur Alvim 647
2
votes

2
answers

113
views

How to protect my Scrapyd server from unauthorized calls?

Let’s say I have the following configuration in scrapy.cfg in Scrapyd. [deploy] url = http://example.com/api/scrapyd/ username = user password = secret project = projectX In the Scrapyd…

python web-application web-crawler scrapy
asked 10 years, 7 months ago Arthur Alvim 647
2
votes

1
answer

4068
views

Creating a PHP crawler

I am a layman on the subject and would like to know where I can find more information about creating a crawler to download data and images from some websites. I have searched a lot, but so far I have found…

php image web-crawler
asked 9 years, 3 months ago Vitor 21
2
votes

1
answer

75
views

Problems with restrict_xpaths parameter in a Crawler

I have no Python experience, but I decided to try to do something with Scrapy for testing. So I’m trying to collect the existing articles on a particular page, namely a DIV element with an ID…

python web-crawler xpath scrapy
asked 9 years, 5 months ago w00t 41
2
votes

1
answer

568
views

Web Crawler (Spider) with ajax in JSF using Node.js or Jsoup api in java

I have the task of creating an interface optimized for touch monitor, taking data from a website (http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml). This site gives a listing of bus…

java node.js http http-request web-crawler
asked 9 years, 2 months ago Eric Silva 478
2
votes

1
answer

77
views

How to extract text from a selected BeautifulSoup element?

I’m making a simple crawler to get some news from the financial market. The code below is working properly, but I would like to extract only the headline and then strip the HTML/CSS code. import…

python web-scraping web-crawler beautifulsoup
asked 4 years, 8 months ago Edu Barros 59
2
votes

2
answers

279
views

Web Crawler searching for specific text on the page

Well, I’m making a web crawler to fetch the value of a coin. I wrote the following code in Python: #coding: utf-8 from urllib2 import urlopen conteudo =…

python web-crawler crawling
asked 7 years, 9 months ago user89389
2
votes

1
answer

126
views

IndexOutOfBoundsException on get(0) in a jsoup crawler

I would like to get the names of the companies that appear in a search like "Farmacias em Santo Andre" on Google Maps. Error: Exception in thread "main" java.lang.IndexOutOfBoundsException: Index 0…

java html web-crawler jsoup
asked 5 years, 11 months ago Eugenio Maria 23
1
votes

0
answers

74
views

Read all posts and comments from a facebook user?

Dear friends, I need some help. I need to build a crawler that reads all posts with friends and their comments using PHP. I can do a test with my posts via dosdk-php. I read from the first page…

facebook get web-crawler
asked 9 years, 8 months ago Paulo-karica 11
1
votes

1
answer

425
views

PHP Crawlers for external websites API Phpcrawl

Good evening, everyone. I’m new to the subject and I’m trying to build a search engine for external sites (an indexer) with PHP. I found an API that provides a crawler, but it seems to only search for…

php api quest web-crawler
asked 9 years, 8 months ago João Pacheco 13
1
votes

1
answer

127
views

Crawler for when http_status_code is different from 200

I’m making a mini crawler in PHP, using a library called "Phpcrawl" for the crawling and the "simple_html_dom_parser" library to parse the HTML. The question is: simple_html_dom cannot…

php web-crawler
asked 10 years, 5 months ago Ricardo 14,521
1
votes

2
answers

1119
views

Developing a Webcrawler in Python

Is there any simple open source web crawler project, developed in Python, that I could study? I have been studying and researching the subject for some time, but I cannot find anything ready-made. My…

python web-service web-crawler
asked 9 years, 9 months ago Gabriel Masson 178
1
votes

1
answer

287
views

How to scrape a page that uses JavaScript with Python?

I need to scrape a page, but the page’s entry point has a button (apparently JavaScript) that gives access to all of the page’s content. Using traditional libs (urllib2,…

javascript python web-crawler web-scraping scraping
asked 8 years, 5 months ago Wellington Araujo Nogueira 41
1
votes

1
answer

788
views

How to collect text when there is no HTML reference class - Crawler Python

I have the following situation: I want to collect the "Texto para crawler" text shown below. How do I navigate to it without a class or id? <td>Texto para crawler</td>…

python-3.x web-crawler web-scraping scraping
asked 8 years, 3 months ago DaniloAlbergardi 347
1
votes

1
answer

1615
views

Can I run a Python program automatically off my computer?

Good morning guys, I made a Python script that searches various websites for occurrences of a few words and stores them in a SQL database. I would like to track the occurrence of these words over time,…

sql python server web-crawler remote
asked 8 years, 2 months ago André 13
1
votes

0
answers

521
views

Automate website browsing for software testing activities

I’m on a web systems development project, accessed by the browser. We are constantly making modifications in the operation of the processes and at each specific period we perform a test on the…

java c# web-crawler web-scraping html-agility-pack
asked 9 years ago DanOver 1,334
1
votes

1
answer

450
views

In which programming language does a crawler/scraper scan the DOM faster?

I developed a script in which I use PHP’s DOMDocument class to crawl a third-party website. The script’s speed does not meet the expected goal, and I would like to know in which…

characteristic-language dom web-crawler
asked 7 years, 8 months ago Charles Fay 1,197
1
votes

0
answers

65
views

Make clicks for interactive online chart

I need to select the information on the left side of the graph according to my need, but I cannot reference the information. I didn’t put any code because I tried it in various ways (according to…

vba excel-vba web-crawler crawling
asked 7 years, 3 months ago Leandro Lazari 153
1
votes

1
answer

114
views

Web Crawler with Django’s view.py

I am making a simple web crawler using Django 2.0. I want to capture only the "title" class of the news items and then render it ("return render") to a simple HTML page. My view.py is below. I am currently using…

django web-scraping web-crawler scrapy scraping
asked 7 years, 2 months ago Bruno Lima 49
1
votes

1
answer

450
views

Concurrent requests using Axios without losing the session

I am developing a crawler using Axios. How can I make multiple requests or callbacks without losing the session and without having to manipulate cookies? Note the example below. When logging in, I…

javascript node.js axios request web-crawler
asked 6 years, 4 months ago ed-Info 63
1
votes

0
answers

59
views

Problem with Crawler

I’m trying to make a simple crawler that reads the temperature, just for study. I’m using simple_html_dom to read the page, but when I call the file_get_html function on the link above, it presents some…

php web-crawler
asked 7 years ago Leandro 459
1
votes

1
answer

31
views

Indexing pages that may redirect

I have a site that contains several pages, but some of these pages are not being indexed by Google. However, the pages that Google does not index cannot be accessed if a certain option is not…

seo web-crawler google-analytics
asked 6 years, 9 months ago MSLacerda 601
1
votes

1
answer

57
views

index(find) + len(find) ValueError: substring not found - Python crawler

Guys, I need some help with Python code that searches for results on the internet. Python 3.6. The first one, for Bitcoin, worked; the second one raises an error. from urllib import request url =…

python web-crawler
asked 6 years, 7 months ago Charles Roberto 11
1
votes

1
answer

343
views

How do I generate an excel file with the data obtained from a webcrawler?

I’m making a web crawler that should extract the name and price of the iPhones that appear in a search on the Amazon site and generate an xlsx file with this data. However, I am unable to generate the xlsx…

javascript node.js web-crawler
asked 5 years, 11 months ago Marcos Davi Spindola 105
1
votes

0
answers

32
views

Scrapy - Search for items in form

I am a complete beginner in the subject; can you help me? I am testing Spiders to search for bids, but I can’t return the items through the form. I have the example code below: import scrapy from scrapy.http…

python web-crawler scrapy
asked 4 years, 6 months ago Daniel Custodio 11
0
votes

1
answer

2139
views

How to make a web crawler access pages that need authentication?

I need to develop a web crawler that would access a page (which requires a login, and I have the credentials), and the "robot" would find all the links on the page and list them somewhere,…

web-crawler
asked 11 years, 4 months ago MDomingues 93
0
votes

2
answers

115
views

simple dom php 404 error

I have the following code: <?php include './simple_html_dom.php'; //This link exists $teste = new simple_html_dom("http://www.btolinux.com.br/"); echo $teste->original_size."<br>";…

php web-crawler
asked 11 years, 2 months ago pdonatilio 93
0
votes

1
answer

495
views

Request on a page with Guzzle

I’m having trouble making a POST request to a website using the Guzzle component. The target site is: http://ciagri.iea.sp.gov.br/nia1/subjetiva.aspx?cod_sis=1&idioma=1 It even reaches the site…

php web-crawler guzzle
asked 10 years, 2 months ago Rodolfo Oliveira 917
0
votes

0
answers

251
views

Parse a page

I am trying to get information from a page through its URL. I am developing in Symfony and using simple_html_dom or Crawler, but I’m nowhere near doing what I need to do. The page I’m accessing…

php symfony-2 parser web-crawler
asked 9 years, 6 months ago Marcius Leandro 462
0
votes

1
answer

74
views

Do the crawlers/bots/web-Spiders of search engines copy and access the href of a link, or "click" on <a></a> to be redirected?

I ask because I want to develop a portal in Ajax whose pages can also be accessed via URL. My question is: if the <a> </a> has a return false on click, the…

google quest web-crawler
asked 8 years, 5 months ago Seu Madruga 2,481
0
votes

0
answers

87
views

Crawler for Woocommerce

Good afternoon, friends. I’m developing a PHP crawler that will scrape some URLs that I provide. I’m trying to get it to pull the values from a dynamic URL, but I’m not succeeding. Could…

php curl web-crawler web-scraping
asked 9 years, 5 months ago jeann sebold 91
0
votes

1
answer

430
views

How do I integrate my Django project with Scrapy?

I’m looking to develop a simple project using Django in which I will create a web page that captures data from other pages. The problem is that I cannot integrate Scrapy with Django.…

python web-application django web-crawler scrapy
asked 8 years, 3 months ago Lucas Souto 41
0
votes

1
answer

168
views

Problem with DOMDocument and OpenSSL

I’m trying to get information from a website using DOMDocument, but it’s throwing an error. DOMDocument::loadHTMLFile(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL…

php dom web-crawler crawling
asked 7 years, 8 months ago Xiro Nakamura 563
0
votes

1
answer

47
views

Select does not update table data after selecting an option

I am trying to select the field with this query. The value of the select changes, but the table values are not reloaded and everything is still shown. Clicking with the mouse, however, works.…

jquery html-select web-scraping web-crawler
asked 6 years ago Gustavobezerra 3
0
votes

1
answer

299
views

Upload a file with pure JavaScript (crawler): file input

I am crawling a web page and there comes a time when I need to upload a file; this crawler is for testing a system. What I need to do is, with pure JavaScript, input a document into the following…

javascript html input javascript-events web-crawler
asked 6 years ago Felipe Tolentino 11