Python Selenium capturing only 1 link

Question

Asked 5 years, 2 months ago

Good afternoon, everyone. I'm just starting to learn Selenium and have already been handed a project that has me half lost.

I need to capture the summary behind every link on https://bgpview.io/reports/countries/BR, which lists 8,079 ASNs.

I managed to do it for only 1 ASN, using:

from selenium import webdriver
driver=webdriver.Firefox()
site=driver.get('https://bgpview.io/reports/countries/BR')
ans=driver.find_element_by_xpath('/html/body/div/div/div/div/div/div/div/div[3]/div/table/tbody/tr[1]/td[1]/a').click()
extracao=driver.find_element_by_xpath('//*[@id="content-info"]')
print (extracao.text)
voltar=driver.find_element_by_partial_link_text('Countries Report').click()

I’d like to know how to get the remaining 8,000.

I haven’t moved this into an IDE or written any classes yet, because I first want to understand each step in the terminal.

Thank you all.

1 answer

by Juan Caio • **155** points · Answer 1 · 2020-08-06T00:37:57+00:00

If you look at the ASN links you want to capture, they all follow the same pattern: "https://bgpview.io/asn/" + the ASN number. So I would start from that principle: capture all the ASN numbers first, then request each link. I will put example code here, but using only BeautifulSoup.

Step 1

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import re

Next I use urllib.request to access the link and parse the page

Step 2

source = urllib.request.urlopen('https://bgpview.io/reports/countries/BR').read()
soup = BeautifulSoup(source,'lxml')

Now I will access the table and save it in a DataFrame

Step 3

table = soup.find('table', attrs={"id":"country-report"})
table_rows = table.find_all('tr')
# Column names come from the header cells
colunas = [th.text for th in table.find_all('th')]
l = []
for tr in table_rows:
    td = tr.find_all('td')
    if not td:  # the header row has no <td> cells
        continue
    row = [cell.text for cell in td]
    l.append(row)
df = pd.DataFrame(l, columns=colunas)
df = df.dropna()

Now I will save the ASN column in a list and apply a regex so that only the numbers remain, since those are what we need to put in the links

Step 4

asn = df['ASN'].tolist()
asn_number = []
for a in asn:
    num = re.sub("[A-Za-z]", "", a)
    asn_number.append(num)
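
The idea behind Step 4 and the link pattern can be sketched on its own, without touching the site. This is only an illustration: it assumes the cells look like "AS12345", the sample values are made up, and the helper names asn_digits and asn_url are hypothetical.

```python
import re

BASE_URL = "https://bgpview.io/asn/"  # link pattern described in the answer


def asn_digits(cell_text):
    """Return only the digits of a cell such as 'AS12345' (made-up sample)."""
    return re.sub(r"\D", "", cell_text)


def asn_url(cell_text):
    """Build the per-ASN page URL from the raw cell text."""
    return BASE_URL + asn_digits(cell_text)


# Made-up cell values for illustration, not real table data
sample_cells = ["AS12345", " AS678\n"]
urls = [asn_url(cell) for cell in sample_cells]
print(urls)
```

Using \D here instead of [A-Za-z] also drops stray spaces and newlines, which would otherwise end up inside the URL.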

Now let’s loop over the numbers, request each link, and save the extraction result in a DataFrame:

Step 5

from http.client import IncompleteRead
l = list()
for asn in asn_number:
    row = []
    try:
        source = urllib.request.urlopen('https://bgpview.io/asn/'+str(asn)).read()
    except IncompleteRead:
        continue  # skip pages whose download was cut short
    soup = BeautifulSoup(source,'lxml')
    ext=soup.find('div', attrs={"id":"content-info"})
    if ext is None:  # skip pages without the summary block
        continue

    # The <h4> headings are used as the field names
    col = ext.find_all('h4')
    colunas = [tr.text.replace(":", "") for tr in col]

    # The <span> and <em> texts are used as the values
    span = ext.find_all('span')
    em = ext.find_all('em')

    for tr in span:
        row.append(tr.text.replace("\n", ""))
    for tr in em:
        row.append(tr.text.replace("\n", ""))
    dic = dict(zip(colunas, row))
    l.append(dic)
# Build the columns from the dict keys instead of reusing the last page's colunas
df = pd.DataFrame(l)
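
One detail of Step 5 worth keeping in mind: dict(zip(colunas, row)) stops at the shorter of the two sequences, so a page with more headings than values loses fields without any error. A small sketch with made-up headings and values shows the effect; pair_fields and pair_fields_strict are hypothetical helper names, not part of the code above.

```python
def pair_fields(names, values):
    """Pair heading names with values the same way dict(zip(...)) does."""
    return dict(zip(names, values))


def pair_fields_strict(names, values):
    """Variant that reports a length mismatch instead of hiding it."""
    if len(names) != len(values):
        raise ValueError(f"{len(names)} headings but {len(values)} values")
    return dict(zip(names, values))


# Made-up headings and values, for illustration only
names = ["Name", "Country", "Website"]
values = ["Example Net", "BR"]  # one value is missing

record = pair_fields(names, values)
print(record)  # the "Website" heading is silently dropped
```

If the pages turn out to be uneven, a check like pair_fields_strict makes it easier to spot which ASN pages need special handling.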

Keep in mind that you can follow this same line of reasoning with Selenium as well, but it is much heavier.
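
For reference, here is a rough sketch of how the same reasoning might look in Selenium. It is not tested against the site and rests on assumptions: the table id "country-report" and the "content-info" block are taken from the code above, the Selenium 4 locator style (By) is used instead of the find_element_by_* calls in the question, and the names asn_links_only and scrape_with_selenium are made up for this example.

```python
def asn_links_only(hrefs):
    """Keep unique links that point at an ASN page, preserving their order."""
    return list(dict.fromkeys(h for h in hrefs if h and "/asn/" in h))


def scrape_with_selenium(report_url, limit=None):
    """Visit each ASN page in a browser and return a list of summary texts."""
    # Imported here so asn_links_only can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(report_url)
        # find_elements (plural) returns every match, not just the first one
        anchors = driver.find_elements(By.CSS_SELECTOR, "table#country-report a")
        # Read the hrefs before navigating away, while the elements are still valid
        links = asn_links_only(a.get_attribute("href") for a in anchors)
        summaries = []
        for link in links[:limit]:  # limit=None means all links
            driver.get(link)
            summaries.append(driver.find_element(By.ID, "content-info").text)
        return summaries
    finally:
        driver.quit()  # always close the browser
```

Calling scrape_with_selenium('https://bgpview.io/reports/countries/BR', limit=5) would be a way to try it on a few pages before running all of them. Collecting the links first and then opening each one directly avoids clicking a row and going back to the report for every ASN.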