How can I extract the value between two tags in HTML text without using XPath?

I’m trying to extract the value between two HTML tags with Python. I need the text between a matching pair of tags.

I was doing it this way to extract values from a store catalog, but now I need to extract a value from a specific product, that is, from a product page. I’d like something close to Delphi’s PosEx.

The idea is to download the page’s HTML, search the text for an initial string and a final string, and return the value between the two.

from urllib.request import urlopen

url = "https://www.panvel.com/panvel/main.do"
pagina = urlopen(url)
texto = pagina.read().decode('utf8')
texto = texto.replace("\t", "")

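# Split into lines and strip surrounding whitespace so startswith() can match.
# (The original second assignment overwrote the list with a single string.)
lista = [linha.strip() for linha in texto.split("\n")]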

htmlInicio = '<span class="box-produto__detalhes-nome">'
htmlFim = '</span>'
contador = 0

while contador < len(lista):
    if lista[contador].startswith(htmlInicio):
        #print(lista[contador])
        nEncontrado1 = len(htmlInicio)+(lista[contador].index(htmlInicio))
        nEncontrado2 = lista[contador].index(htmlFim)
        nomeProduto = lista[contador][nEncontrado1:nEncontrado2]

        #print(nomeProduto)
    contador+=1
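
A minimal sketch of the "initial string / final string" search described above, assuming Python's `str.find` (with its start offset) is an acceptable stand-in for Delphi's PosEx. The helper names `between` and `all_between` and the sample HTML are hypothetical and only there for illustration; the sketch would break on nested spans or if the attributes of the opening tag change.

```python
def between(text, start, end, offset=0):
    """Return the text between the first `start` found at or after `offset`
    and the next `end`, or None if either marker is missing."""
    i = text.find(start, offset)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]


def all_between(text, start, end):
    """Yield every value enclosed by `start` ... `end`, scanning left to right."""
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            return
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            return
        yield text[i:j].strip()
        pos = j + len(end)  # continue searching after the closing marker


# Hypothetical sample standing in for the downloaded page text.
sample = (
    '<div><span class="box-produto__detalhes-nome"> Shampoo Seco </span>'
    '<span class="box-produto__detalhes-nome">Lenço Umedecido</span></div>'
)
html_inicio = '<span class="box-produto__detalhes-nome">'
html_fim = '</span>'

print(between(sample, html_inicio, html_fim))
print(list(all_between(sample, html_inicio, html_fim)))
```

Because the search runs over the whole text rather than line by line, it does not depend on how the page happens to be broken into lines.
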
  • 1

    And why is the requirement so specific that XPath, which solves the problem simply, cannot be used?

  • You have to help those who are helping you! It is frustrating to answer a question as well as possible, with great care, only to learn that an artificial restriction rules out the solution. Please edit the question and describe every artificial restriction, stating the reason for it and how far it extends.

1 answer

The content of the page you want to extract is structured with a markup language, HTML. Use this to your advantage: use an HTML parser.

I recommend the excellent lxml.html, because it works with XPath!

from urllib.request import urlopen
import lxml.html

url = "https://www.panvel.com/panvel/main.do"
pagina = urlopen(url)
texto = pagina.read().decode('utf8')

doc = lxml.html.fromstring(texto)

spans = doc.xpath("//span[@class='box-produto__detalhes-nome']")
for span in spans:
    print(span.text_content())

The result:

Kit Lenços Umedecidos Huggies Classic C/48 Unidades  Le(...)
Lenço Umedecido Huggies One & Done C/48 Unidades
Shampoo Seco Panvel Hair Therapy 150ml
Lencos Umedecidos Huggies Turma Monica Primeiros 100 Di(...)
Lenços Umedecidos Huggies Classic C/48 Unidades
...
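
If XPath really is off the table, the same "use a parser" idea can be sketched with the standard library's `html.parser` module. This is only an illustration under stated assumptions: the class `SpanTextCollector` and the sample HTML are hypothetical, and unlike the XPath expression above (which compares the whole `class` attribute), this version matches when the class name is one of the tokens in the attribute.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching <span>
        self.chunks = []    # text pieces of the span currently being read
        self.results = []   # finished span texts

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:                      # nested <span> inside a match
            self.depth += 1
        elif self.class_name in classes:    # a new matching <span> opens
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:             # the matching <span> just closed
                self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample standing in for the downloaded page text.
sample = (
    '<div><span class="box-produto__detalhes-nome">Shampoo <b>Seco</b></span>'
    '<span class="outro">ignored</span>'
    '<span class="box-produto__detalhes-nome destaque"> Lenço Umedecido </span></div>'
)
collector = SpanTextCollector("box-produto__detalhes-nome")
collector.feed(sample)
collector.close()
print(collector.results)
```

With the real page, `texto` from the code above would be passed to `feed()` in place of `sample`.
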
  • 1

    Thanks, but this is for a specific purpose, and for the moment I can't use XPath, however practical and better it is, in my opinion! Thanks

  • @Joiner then you should state this artificial restriction in your question, preferably with the reason!
