Data Crawling in Python

Good afternoon, everyone.

I decided to start my studies with web crawling in Python. I built the following script using the Selenium library:

# Importing Selenium to perform the crawling
from selenium import webdriver
import time
import csv

# Specifying where the webdriver file is
chrome_path = r"Desktop\Crawler\chromedriver.exe"

# Creating the driver with the webdriver's location
driver = webdriver.Chrome(chrome_path)

# Using the driver to navigate to a site
driver.get("Link")
time.sleep(5)

# Looking for a given element on the page

# Selecting the state (.click() returns None, so there is no point assigning it)
driver.find_element_by_css_selector(
    "body > div.layout > main > div > div.col-md-12.ng-scope > div > form:nth-child(4) > div:nth-child(2) > div:nth-child(1) > div > select > option:nth-child(18)").click()
time.sleep(7)

# After selecting the state, select the municipality
driver.find_element_by_css_selector(
    "body > div.layout > main > div > div.col-md-12.ng-scope > div > form.form-inline.ng-valid.ng-dirty.ng-valid-parse > div:nth-child(2) > div:nth-child(2) > div > select > option:nth-child(113)").click()
time.sleep(5)

# Clicking the search button
driver.find_element_by_css_selector(
    "body > div.layout > main > div > div.col-md-12.ng-scope > div > form.form-inline.ng-pristine.ng-valid > div > button").click()
time.sleep(5)

# Extracting the data into variables (each one holds a WebElement; .text gives its value)
siglaEstado = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[1]")

nmMunicipio = driver.find_element_by_css_selector(
    "body > div.layout > main > div > div.col-md-12.ng-scope > div > div:nth-child(9) > table > tbody > tr:nth-child(1) > td:nth-child(2)")

cnes = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[3]")

nmFantasia = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[4]")

natureza = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[5]")

gestao = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[6]")

sus = driver.find_element_by_xpath(
    "/html/body/div[2]/main/div/div[2]/div/div[3]/table/tbody/tr[1]/td[7]")

I would like to know how to export the data in these variables to a .CSV file generated by the Python script itself. I would also appreciate tips on how to improve my code.

  • You can simply write the data out as CSV. I did something similar with Selenium, https://github.com/bulfaitelo/Tesouro-Direto-Scraper, see if it helps you.

  • You can create the CSV file and, once you finish scraping the page, open the file and save the variables' data to it; there is no limit to how you can manipulate a CSV file in Python. A sketch of that idea follows below.
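
A minimal sketch of that idea, applied to the variables from the question. It assumes the script above has already run; the filename saida.csv and the header names are hypothetical:

import csv

# Each variable in the question holds a WebElement; .text extracts its visible value
linha = [siglaEstado.text, nmMunicipio.text, cnes.text,
         nmFantasia.text, natureza.text, gestao.text, sus.text]

# 'w' is text mode, which the csv module requires in Python 3;
# newline='' avoids blank rows on Windows
with open("saida.csv", "w", newline="", encoding="utf-8") as arquivo:
    escritor = csv.writer(arquivo)
    escritor.writerow(["estado", "municipio", "cnes", "nome_fantasia",
                       "natureza", "gestao", "sus"])  # illustrative header row
    escritor.writerow(linha)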

1 answer

You can use Python's standard csv library. Below is an example of how to take values from a list and write them to a CSV file:

import csv

# 'w' is text mode, which the csv module requires in Python 3; newline='' avoids blank rows
with open(meuArquivo, 'w', newline='') as arquivo:
    teste = csv.writer(arquivo, quoting=csv.QUOTE_ALL)
    teste.writerow(lista)

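Here meuArquivo would be the output filename and lista the values for one row, for example (the filename below is hypothetical):

meuArquivo = 'dados.csv'  # hypothetical output filename
lista = [siglaEstado.text, nmMunicipio.text, cnes.text]  # values from the question's WebElements
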
Regarding Selenium, I find it very cool, but I recommend you also study the requests library.

  • Why requests instead of Selenium?

  • I think Selenium is a good option, but it will depend on the scope of your project, because it is a little slower. That is why I recommend requests: often the information you want is in a JSON response returned by the site, so you do not need to render the whole page to extract it. A sketch of that approach follows below.
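
A minimal sketch of the requests approach described in the comment above. The URL and JSON field names are hypothetical; on a real site you would find the actual endpoint in the browser's network tab:

import csv
import requests

# Hypothetical JSON endpoint; inspect the site's network traffic to find the real one
resposta = requests.get("https://exemplo.gov.br/api/estabelecimentos")
resposta.raise_for_status()  # fail fast on HTTP errors
dados = resposta.json()  # parse the JSON body into Python objects

# Assumes the endpoint returns a list of objects with these (hypothetical) fields
with open("saida.csv", "w", newline="", encoding="utf-8") as arquivo:
    escritor = csv.writer(arquivo)
    for item in dados:
        escritor.writerow([item.get("cnes"), item.get("nomeFantasia")])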
