Web Crawler searching for specific text on the page


2

I'm writing a web crawler to fetch the value of a currency.

I wrote the following code in Python:

#coding: utf-8

from urllib2 import urlopen

conteudo = urlopen('http://dolarhoje.com/bitcoin').read()

procurar1 = '<span class="symbol">'
posicao1 = int(conteudo.index(procurar1) + len(procurar1))
moeda1 = conteudo[posicao1 : posicao1 + 3]

procurar2 = '<span class="symbol">'
posicao2 = int(conteudo.index(procurar2) + len(procurar2))
moeda2 = conteudo[posicao2 : posicao2 + 3]

procurar3 = '<input type="text" id="nacional" value="'
posicao3 = int(conteudo.index(procurar3) + len(procurar3))
valor = conteudo[posicao3 : posicao3 + 8]

print(moeda1 + ' 1,00 ' + 'vale ' + moeda2 + ' ' + valor)
print ('\n')

I know that with procurar1 = '<span class="symbol">', the call conteudo.index(procurar1) returns the first occurrence. However, I want to get the second occurrence.

The code currently prints: ฿ 1,00 vale ฿ 25086,77

Expected output: ฿ 1,00 vale R$ 25086,77

That is, I want both the symbol of the first currency and the symbol of the second. Because the page uses the same markup for both, the second symbol has to come from the second occurrence.

How can I do this?

  • From what I've seen, the actual value is inside a <span class="cotMoeda nacional">. Why not look for that first? Otherwise, I think a regular expression would make this easier: https://tableless.com.br/o-basico-sobre-regular/

  • Or better yet, take a look at this post about using a proper parser: https://answall.com/a/245947/57474
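As a rough illustration of the parser route mentioned in the comments, here is a minimal sketch using the standard library's html.parser. It is only one possible approach and not necessarily the one described in the linked post. The class name SymbolCollector and the sample markup are made up for the example; the live page may be structured differently.

```python
from html.parser import HTMLParser


class SymbolCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <span class="symbol">."""

    def __init__(self):
        super().__init__()
        self.in_symbol = False
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and ('class', 'symbol') in attrs:
            self.in_symbol = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_symbol = False

    def handle_data(self, data):
        if self.in_symbol:
            self.symbols.append(data.strip())


# Assumed sample standing in for the downloaded page
sample_html = '<span class="symbol">฿</span> <span class="symbol">R$</span>'

collector = SymbolCollector()
collector.feed(sample_html)
collector.close()
print(collector.symbols)
```

Every occurrence ends up in collector.symbols, so the second symbol is simply collector.symbols[1].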

2 answers

2

You can do this more easily with the MechanicalSoup library (https://github.com/MechanicalSoup/MechanicalSoup).

To use it, just install it in your environment: pip install MechanicalSoup

Getting the value you want is then quite simple:

import mechanicalsoup


browser = mechanicalsoup.StatefulBrowser()
browser.open("http://dolarhoje.com/bitcoin")

page = browser.get_current_page()

symbols = page.select(".symbol")
inputs = page.find_all("input")

moeda1 = { 'symbol': symbols[0].text, 'value': inputs[0].attrs['value'] }
moeda2 = { 'symbol': symbols[1].text, 'value': inputs[1].attrs['value'] }

print(moeda1)
print(moeda2)
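
If you would rather stay with the plain string search from the question, str.index accepts an optional start position, so the second occurrence can be found by searching again from just past the first one. Below is a minimal sketch for Python 3, assuming a hard-coded sample instead of the real download. The helper name text_after and the sample string are made up for illustration.

```python
# Assumed sample standing in for the downloaded page; the real markup may differ.
sample = '<span class="symbol">฿</span> ... <span class="symbol">R$</span>'


def text_after(content, marker, occurrence):
    """Return the text between the given occurrence (1-based) of marker and the next '<'."""
    position = -1
    for _ in range(occurrence):
        # Resume the search just past the previous hit; raises ValueError if not found.
        position = content.index(marker, position + 1)
    start = position + len(marker)
    end = content.index('<', start)
    return content[start:end]


marker = '<span class="symbol">'
moeda1 = text_after(sample, marker, 1)
moeda2 = text_after(sample, marker, 2)
print(moeda1, moeda2)
```

Reading up to the next '<' instead of taking a fixed number of characters avoids cutting a symbol short or pulling in part of the closing tag.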

1


Although some time has passed, I would like to close this question with a solution. The task is simple with two standard libraries.

Libraries

  • urllib.request: fetches the HTML page;
  • re: provides regular expressions.

Explanation of Code

To capture the content I used the following excerpt:

html_content = urllib.request.urlopen('http://dolarhoje.com/bitcoin').read().decode('utf-8')

This assigns the page's HTML to the variable html_content for later use. read() returns the response body as bytes, and decode('utf-8') converts those bytes to text so that accented and special characters come out correctly.
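
To see why the decode step matters, here is a small sketch that uses a hard-coded byte string in place of a real download (the sample content is an assumption). The symbol ฿ is a single character but takes three bytes in UTF-8, so the byte length and the text length differ.

```python
# Bytes, as read() would return them (assumed sample content)
raw = '<span class="symbol">฿</span>'.encode('utf-8')
print(type(raw).__name__, len(raw))

# Decoding turns the bytes into text; ฿ becomes one character again
text = raw.decode('utf-8')
print(type(text).__name__, len(text))
```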

Inspecting the page, we can see a pattern: each currency symbol appears inside an HTML tag, for example:

<span class="symbol">฿</span>
<span class="symbol">R$</span>

Thus, we can build a regular expression that captures every occurrence of the symbols, as in the following line:

symbols = re.findall(r'\"symbol\">([\S]+)</', html_content)

re is the library imported at the beginning of the script;

findall() is a function that returns all matches as a list;

r'...' is the raw string that holds the regular expression;

Thus, the regex \"symbol\">([\S]+)</ captures the symbols and returns the list ['฿', 'R$'],

while value=\"([.\d,]+)\" captures the values and returns ['1,00', '13583,42'].

html_content is the variable in which we look for occurrences, or matches, as they are called in regex terminology.

Finally, we use print('{0} {1} vale {2} {3}'.format(symbols[0], values[0], symbols[1], values[1])) to display the values through format(), obtaining the desired result:

฿ 1,00 vale R$ 13583,42
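
The two expressions can be tried without hitting the site. Below is a minimal sketch, assuming a hard-coded snippet shaped like the tags shown above. The input tags and their values are illustrative, not copied from the live page.

```python
import re

# Assumed sample standing in for the downloaded page
sample_html = '''
<span class="symbol">฿</span> <input type="text" value="1,00">
<span class="symbol">R$</span> <input type="text" value="13583,42">
'''

# One list entry per match, in page order
symbols = re.findall(r'"symbol">([\S]+)</', sample_html)
values = re.findall(r'value="([.\d,]+)"', sample_html)

result = '{0} {1} vale {2} {3}'.format(symbols[0], values[0], symbols[1], values[1])
print(result)
```

Because findall() returns every match, the second symbol is just symbols[1], with no need to track positions by hand. Note that [\S]+ is greedy and relies on whitespace separating the tags; on markup without whitespace between elements, a narrower class such as [^<]+ might be safer.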

Full script

The complete script follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# start with the imports
import urllib.request
import re

# fetch the HTML content and decode the bytes to text
html_content = urllib.request.urlopen('http://dolarhoje.com/bitcoin').read().decode('utf-8')

# extract the currency symbols
symbols = re.findall(r'\"symbol\">([\S]+)</', html_content)

# extract the currency values
values = re.findall(r'value=\"([.\d,]+)\"', html_content)

# print a result similar to: '฿ 1,00 vale R$ 13583,42'
print('{0} {1} vale {2} {3}'.format(symbols[0], values[0], symbols[1], values[1]))

I hope this solution helps others who have a similar question.
