I’m learning how to build a crawler with Scrapy + XPath.
However, when I run the command
scrapy shell https://br.udacity.com/courses/all/
the system returns this, as if everything were normal:
2021-01-22 15:40:58 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: courses)
2021-01-22 15:40:58 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020), cryptography 3.3.1, Platform Linux-5.4.0-64-generic-x86_64-with-glibc2.29
2021-01-22 15:40:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-01-22 15:40:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'courses',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'courses.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['courses.spiders']}
2021-01-22 15:40:58 [scrapy.extensions.telnet] INFO: Telnet Password: a0c1cc56ede9b7e3
2021-01-22 15:40:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2021-01-22 15:40:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-01-22 15:40:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-01-22 15:40:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-01-22 15:40:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-22 15:40:58 [scrapy.core.engine] INFO: Spider opened
2021-01-22 15:40:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.udacity.com/robots.txt> from <GET https://br.udacity.com/robots.txt>
2021-01-22 15:40:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.udacity.com/robots.txt> (referer: None)
2021-01-22 15:40:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.udacity.com/courses/all/> from <GET https://br.udacity.com/courses/all/>
2021-01-22 15:40:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.udacity.com/robots.txt> (referer: None)
2021-01-22 15:40:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.udacity.com/courses/all/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f5d451e7070>
[s]   item       {}
[s]   request    <GET https://br.udacity.com/courses/all/>
[s]   response   <200 https://www.udacity.com/courses/all/>
[s]   settings   <scrapy.settings.Settings object at 0x7f5d45263d30>
[s]   spider     <DefaultSpider 'default' at 0x7f5d445485e0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
So I check response to confirm I’m actually on the site, and it looks normal; here is the excerpt:
>>> response
<200 https://www.udacity.com/courses/all/>
Then I run:
>>> div = response.xpath('//*[@id="ud860"]')
When I evaluate div to see the HTML of that element, it returns an empty list (I’ll attach a screenshot for a better view).
How can I fix this? From what I read, it should return the HTML with the data from the div, but I don’t know whether I’m using the tool wrong or a module is missing from my Python install. :(
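For reference, an XPath that matches nothing yields an empty list rather than an error, which is easy to mistake for a broken setup. This behavior can be reproduced offline with the standard library alone (a sketch: the HTML snippet is made up, and the id ud860 from the question is just a placeholder here; Scrapy's own selectors behave the same way on a missing element):

```python
import xml.etree.ElementTree as ET

# Minimal well-formed document that does NOT contain id="ud860"
html = '<html><body><div id="other">Hello</div></body></html>'
root = ET.fromstring(html)

# ElementTree supports a small XPath subset, including [@attr='value']
missing = root.findall(".//*[@id='ud860']")
present = root.findall(".//*[@id='other']")

print(missing)           # [] -> no match, no error raised
print(present[0].text)   # Hello
```

So an empty result usually means the expression simply matched nothing in the HTML the crawler received, not that a package is missing.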
The first thing you need to make sure of is that your XPath actually matches some element. I opened the site you mentioned, searched for that XPath there, and it does not exist.
– Lucas Miranda
So @Lucasmiranda, I switched to another XPath and it still returns the same thing. I’ll put the code here:
>>> response
<200 https://www.udacity.com/courses/all/>
>>> div = response.xpath('/html/body/div[1]/div/div/div[3]/div[2]/div/div[2]/main/div[2]/ul/li[1]/a/article/div[3]')
>>> div
[]
One thing I noticed: in other people’s terminals the input and output appear on separate lines, and mine doesn’t. Is something wrong with my packages, or is it just in my head?
– Cesar Martins
But did you test this XPath and confirm that it really returns the div you want? I opened the site here again, tried that XPath, and once more found nothing. If you want to get the divs with the courses, you should try, for example, //*[@class='Catalog-component__card']
– Lucas Miranda
@Lucasmiranda With the XPath you sent me, the contents appeared inside the list, so I must be copying the XPath the wrong way. I open the site, inspect the element, take the first class, and use “Copy XPath”; is that how it’s done?
– Cesar Martins
The way you did it should work; honestly, I don’t know why yours is generating that (for me, the same process generated something very different). Ideally, you should understand how XPath works so you can fine-tune the search for your element and not depend on this automatic generation.
– Lucas Miranda
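One subtlety behind the class-based XPath suggested above: in XPath, @class='Catalog-component__card' compares the whole class attribute for exact equality, so an element with an extra class will not match. A standard-library sketch (the HTML and the second class name "extra" are invented for illustration):

```python
import xml.etree.ElementTree as ET

html = (
    '<ul>'
    '<li><article class="Catalog-component__card">Course A</article></li>'
    '<li><article class="Catalog-component__card extra">Course B</article></li>'
    '</ul>'
)
root = ET.fromstring(html)

# Exact attribute comparison: only the first card matches
exact = root.findall(".//*[@class='Catalog-component__card']")
print([e.text for e in exact])  # ['Course A']

# To also match elements carrying additional classes, test token
# membership in Python (full XPath 1.0 would use contains() instead)
loose = [e for e in root.iter()
         if 'Catalog-component__card' in e.get('class', '').split()]
print([e.text for e in loose])  # ['Course A', 'Course B']
```

This is one reason a browser-generated “Copy XPath” can look right but still return nothing: it is tied to the exact markup the browser rendered, which may differ from the HTML the crawler downloaded.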