Extracting the title of an HTML page with sed or awk

I have several HTML pages and would like to extract the title of each one using sed or awk.

When I try it with sed, it returns the whole content of the page, not just the title.

On the pages, the title appears like this:

<h1 style="color:#0F1A7F;margin:0px 10px 10px;border-bottom:1px solid #FF0000;font-family:'Trebuchet MS', 'geneva', 'sans-serif', 'arial';"> Titulo da pagina </h1>

I used this command to try to get only the text 'Titulo da pagina':

k=$(sed 's/<[^>]*>//') pagina.html

It returns all of the page's contents.

1 answer

sed and awk (and regular expressions in general) are not the most suitable tools for this task. Regex was not designed to handle HTML: it may appear to work in many cases, but it breaks easily.

Although it is possible, and even easy in the simplest cases, using an HTML parser is far more reliable, and the rest of this answer shows why.

For example, with sed, a naive solution would be:

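sed -n -e 's/^.*<title>//' -e 's/<\/title>.*//p' pagina.html

That is, first I remove everything from the beginning of the line up to and including the <title> tag, and then everything from </title> onward. The -n option and the p flag make sed print only the line on which that second substitution happened; without them, every other line of the page would be printed unchanged. What is left is the title of the page, provided the whole element is on a single line.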

But since in your case you want the contents of an h1, it would be:

sed -n -e 's/^.*<h1[^>]*>//' -e 's/<\/h1>.*//p' pagina.html
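To tie this back to the question, here is a minimal sketch that stores the result in a variable. It assumes the whole h1 element sits on a single line, as in the sample from the question; the file name sample.html and its contents are made up for the demonstration. Note that the file name goes inside the $( ); in the command from the question it sits outside, so sed is left reading standard input.

```shell
# Hypothetical sample page; only the h1 line matters here.
cat > sample.html <<'EOF'
<html>
<body>
<h1 style="color:#0F1A7F;margin:0px 10px 10px;"> Titulo da pagina </h1>
<p>Some other content</p>
</body>
</html>
EOF

# -n suppresses the default output; the p flag prints only the line
# on which the closing tag was found and removed.
k=$(sed -n -e 's/^.*<h1[^>]*>//' -e 's/<\/h1>.*//p' sample.html)

# Trim the blanks that were inside the tag, around the text.
k=$(printf '%s\n' "$k" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')

echo "$k"
```

The second sed call is optional; it is only there because the text inside the h1 is padded with spaces.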

But this solution is too simplistic. For example, if the HTML contains something like this:

    <title>título da página</title>
<!--     <title>title comentado</title> -->

The title command above will also pick up the commented-out tag, and changing the regex to detect comments is not simple.
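The problem is easy to reproduce. The sketch below uses a made-up file, commented.html, and a variant of the title command that prints only the matching lines (-n plus the p flag); both the real title and the commented-out one come back.

```shell
# Hypothetical sample with one real title and one inside a comment.
cat > commented.html <<'EOF'
<head>
    <title>titulo da pagina</title>
<!--     <title>title comentado</title> -->
</head>
EOF

# sed works line by line and knows nothing about HTML comments,
# so both lines containing <title> are kept.
titles=$(sed -n -e 's/^.*<title>//' -e 's/<\/title>.*//p' commented.html)

echo "$titles"

# Count the extracted lines: 2 here, although the page has one title.
count=$(printf '%s\n' "$titles" | grep -c .)
echo "$count"
```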

It is for this and other reasons that regex is not the best tool for parsing HTML.


One option, and a better one in my opinion, is xmllint, which is part of libxml2 (search for how to install it on your system; it is not difficult).

With it, obtaining the title (or any other tag) is much simpler. For example:

xmllint --xpath '//title/text()' --html pagina.html

The command above returns only the text of the title tag. If you want the tag itself, just remove the /text(). Besides, since xmllint is a parser, it correctly detects and ignores a tag that is inside a comment, as in the example above.

For the h1, it could be something like:
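As a quick check of the comment handling, here is a minimal sketch run against a made-up file, commented.html, that contains the commented-out title. It assumes xmllint is installed; the 2>/dev/null is my addition to hide any warnings the HTML parser may print.

```shell
# Hypothetical sample with one real title and one inside a comment.
cat > commented.html <<'EOF'
<html>
<head>
    <title>titulo da pagina</title>
<!--     <title>title comentado</title> -->
</head>
<body><p>content</p></body>
</html>
EOF

# --html selects the lenient HTML parser; the comment becomes a
# comment node, so //title matches only the real element.
title=$(xmllint --html --xpath '//title/text()' commented.html 2>/dev/null)

echo "$title"
```

The sample deliberately uses plain ASCII text; pages with accented characters may need a charset declaration for the output to come out right.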

xmllint --xpath '//h1[1]/text()' --html pagina.html

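The [1] is there to avoid getting several h1 elements when the page has more than one. Strictly speaking, //h1[1] selects every h1 that is the first h1 child of its parent; to get only the first h1 of the whole document, write (//h1)[1].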
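Since the question mentions several pages, the same call can be put in a loop. This is only a sketch: the pages directory and its two files are invented for the demonstration, and it assumes xmllint is installed. It uses (//h1)[1] to take the first h1 in document order, and the XPath function normalize-space() to trim the blanks around the text.

```shell
# Hypothetical directory with two small pages for the demonstration.
mkdir -p pages
printf '<html><body><h1 style="margin:0px"> Titulo da pagina </h1><h1>Outro</h1></body></html>\n' > pages/a.html
printf '<html><body><div><h1>Segunda pagina</h1></div></body></html>\n' > pages/b.html

for f in pages/*.html; do
    # k holds the trimmed text of the first h1 of the current page.
    k=$(xmllint --html --xpath 'normalize-space((//h1)[1])' "$f" 2>/dev/null)
    echo "$f: $k"
done
```

Because the XPath expression evaluates to a string, no /text() step is needed.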
