How to find a value between two tags in an HTML text, other than with XPath?

Question

Asked 6 years, 4 months ago

I’m trying to extract, with Python, the value between two HTML tags: an opening tag and its matching closing tag.

I was doing it this way to extract values from a store catalog, but now I need to extract a value from a specific product, that is, from a product page. I’d like to do something close to Delphi’s PosEx.

The idea is to download the page’s HTML, search the text for an initial string and a final string, and return the value between the two.

from urllib.request import urlopen

url = "https://www.panvel.com/panvel/main.do"
pagina = urlopen(url)
texto = pagina.read().decode('utf8')
texto = texto.replace("\t", "")

# Split the page into lines so each one can be checked on its own
lista = texto.split("\n")

htmlInicio = '<span class="box-produto__detalhes-nome">'
htmlFim = '</span>'

for linha in lista:
    linha = linha.strip()  # drop leading spaces so startswith() can match
    if linha.startswith(htmlInicio):
        #print(linha)
        # The value starts right after the opening tag...
        nEncontrado1 = len(htmlInicio)
        # ...and ends at the first closing tag that follows it
        nEncontrado2 = linha.find(htmlFim, nEncontrado1)
        if nEncontrado2 != -1:
            nomeProduto = linha[nEncontrado1:nEncontrado2]
            #print(nomeProduto)

And why is the solution so specific that XPath, which solves the problem simply, cannot be used?

– Woss

2019/04/10 at 14:03
You have to help whoever’s helping you! It is frustrating to answer a question as well as possible, with great care, only to learn that an artificial restriction rules out the solution. Please edit the question and describe every artificial restriction, stating the reason for it and how far it extends.

– nosklo

2019/04/10 at 14:13

1 answer

by nosklo • **5,801** points · Answer 1 · 2019-04-10T13:50:41+00:00

The content of the page you want to extract is structured with a markup language, HTML. Use this to your advantage: use an HTML parser.
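To show what a parser buys you even without XPath, here is a minimal sketch that uses only the standard library's html.parser. The class name SpanTextCollector and the sample markup are made up for illustration; only the class attribute value comes from the question.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class list contains target_class.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching span
        self.current = []   # text pieces of the span being read
        self.results = []   # one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested spans too, so the right closing tag ends the match
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


# Made-up sample standing in for the downloaded page
sample = (
    '<div><span class="box-produto__detalhes-nome"> Shampoo Seco </span>'
    '<span class="preco">9,90</span></div>'
)
collector = SpanTextCollector("box-produto__detalhes-nome")
collector.feed(sample)
print(collector.results)
```

The parser takes care of whitespace, line breaks and attribute handling, which is exactly what a line-by-line string search gets wrong.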

I recommend the excellent lxml.html, because it works with XPath!

from urllib.request import urlopen
import lxml.html

url = "https://www.panvel.com/panvel/main.do"
pagina = urlopen(url)
texto = pagina.read().decode('utf8')

doc = lxml.html.fromstring(texto)

spans = doc.xpath("//span[@class='box-produto__detalhes-nome']")
for span in spans:
    print(span.text_content())

The result:

Kit Lenços Umedecidos Huggies Classic C/48 Unidades  Le(...)
Lenço Umedecido Huggies One & Done C/48 Unidades
Shampoo Seco Panvel Hair Therapy 150ml
Lencos Umedecidos Huggies Turma Monica Primeiros 100 Di(...)
Lenços Umedecidos Huggies Classic C/48 Unidades
...
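If XPath really is off the table, the PosEx-style search described in the question can be approximated with str.find, which accepts a start offset much like PosEx does. This is only a sketch under assumptions: the helper names text_between and all_between are hypothetical, the sample markup is made up from two of the product names above, and a plain string search will break on nested spans or on any change in how the attribute is written.

```python
def text_between(text, start, end, offset=0):
    """Return (value, next_offset) for the first start...end pair found
    at or after offset, or (None, -1) when there is no such pair."""
    begin = text.find(start, offset)
    if begin == -1:
        return None, -1
    begin += len(start)              # the value starts after the opening marker
    finish = text.find(end, begin)   # first closing marker after that point
    if finish == -1:
        return None, -1
    return text[begin:finish], finish + len(end)


def all_between(text, start, end):
    """Collect every value enclosed by start...end, scanning left to right."""
    values = []
    offset = 0
    while True:
        value, offset = text_between(text, start, end, offset)
        if value is None:
            break
        values.append(value.strip())
    return values


# Made-up sample standing in for the downloaded page
sample_html = (
    '<li><span class="box-produto__detalhes-nome">\n'
    '\tShampoo Seco Panvel Hair Therapy 150ml\n'
    '</span></li>\n'
    '<li><span class="box-produto__detalhes-nome">'
    'Lenço Umedecido Huggies One & Done C/48 Unidades</span></li>'
)

names = all_between(
    sample_html,
    '<span class="box-produto__detalhes-nome">',
    '</span>',
)
for name in names:
    print(name)
```

Searching the whole text instead of one line at a time means the value is still found when the opening tag, the text and the closing tag sit on different lines.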