Displaying the tags of a web page with indentation proportional to the depth of the element in the document tree structure

Question

Asked 5 years, 6 months ago

Problem: Develop the MyHTMLParser class as a subclass of HTMLParser which, when fed an HTML file, prints the names of the start and end tags in the order they appear in the document, with an indentation proportional to the depth of the element in the document tree. Ignore HTML elements that do not require an end tag, such as p and br.

The HTML file used: https://easyupload.io/d45c52

The expected output is:

html start
    head start
        title start
        title end
    head end
    body start
        h1 start
        h1 end
        h2 start
        h2 end
        ul start
            li start
...
        a end
    body end
html end

What I did:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):  # print the tag name (no indentation yet)
        print(tag, "start")

    def handle_endtag(self, tag):
        print(tag, "end")

with open("w3c.html", "r") as infile:
    content = infile.read()
myparser = MyHTMLParser()
myparser.feed(content)

My output was:

html start
head start
title start
title end
head end
body start
h1 start
h1 end
p start
br start
p end
h2 start
h2 end
...
a start
a end
body end
html end

How can I fix the code so that the output is indented?

1 answer

by Ed S • **2,057** points · Answer 1 · 2020-01-23T12:59:39+00:00

The methods handle_starttag() and handle_endtag() need to be overridden. Each should print the name of the element corresponding to the tag, indented appropriately.

The indentation is an integer value that is incremented at each start-tag token and decremented at each end-tag token. (I ignored the p and br elements.) The indentation value should be stored as an instance variable of the parser object and initialized in the constructor.

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    'HTML document parser that prints indented tags'

    def __init__(self):
        'initialize the parser and the starting indentation'
        HTMLParser.__init__(self)
        self.indent = 0            # initial indentation value

    def handle_starttag(self, tag, attrs):
        '''print the start tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            print('{}{} start'.format(self.indent*' ', tag))
            self.indent += 4

    def handle_endtag(self, tag):
        '''print the end tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            self.indent -= 4
            print('{}{} end'.format(self.indent*' ', tag))
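
To try the approach without the original w3c.html file, here is a minimal, self-contained sketch. It is a variant of the class above, not a replacement for it: the names IndentedTagCollector, SKIPPED_TAGS and SAMPLE_HTML are made up for this example, the sample document is invented, and the lines are collected in a list (rather than printed directly) only so they are easy to inspect.

```python
from html.parser import HTMLParser

# Tags left out of the output; 'p' and 'br' come from the exercise statement.
SKIPPED_TAGS = {'p', 'br'}


class IndentedTagCollector(HTMLParser):
    """Hypothetical variant that stores the indented lines in a list."""

    def __init__(self, step=4):
        super().__init__()
        self.step = step      # spaces added per nesting level
        self.indent = 0       # current indentation, in spaces
        self.lines = []       # collected output lines

    def handle_starttag(self, tag, attrs):
        if tag not in SKIPPED_TAGS:
            self.lines.append('{}{} start'.format(self.indent * ' ', tag))
            self.indent += self.step

    def handle_endtag(self, tag):
        if tag not in SKIPPED_TAGS:
            # Clamp at zero so a stray end tag cannot produce a negative indent.
            self.indent = max(self.indent - self.step, 0)
            self.lines.append('{}{} end'.format(self.indent * ' ', tag))


# Invented sample document standing in for w3c.html.
SAMPLE_HTML = """<html>
<head><title>Demo</title></head>
<body>
<h1>Heading</h1>
<p>Some text<br>more text</p>
<ul><li><a href="#">link</a></li></ul>
</body>
</html>"""

collector = IndentedTagCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print('\n'.join(collector.lines))
```

Note that any other element written without an end tag (for example img or meta) would still increase the indentation and never decrease it, so such tags would probably need to be added to the skipped set as well.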