sed and awk (with regular expressions, or regex) are not the most suitable tools for this task. Generally speaking, regex was not designed to handle HTML: it may appear to work in many cases, but it is not the right tool for the job.
Although it is possible, and even "easy" in the simplest cases, using an HTML parser is much more reliable, and throughout this answer we will see why.
For example, with sed, a naive solution would be:
sed -e 's/^.*<title>//' -e 's/<\/title>.*//' pagina.html
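That is, first I remove everything from the beginning of the line up to the <title> tag, and then everything from </title> onward. What is left on that line is the title of the page. Note that sed works line by line and prints every line it reads, so any line without a <title> is printed unchanged.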
But since in your case you want the contents of an h1, it would be:
sed -e 's/^.*<h1[^>]*>//' -e 's/<\/h1>.*//' pagina.html
But this solution is too simplistic. For example, if the HTML contains something like this:
<title>título da página</title>
<!-- <title>title comentado</title> -->
The first command above will also pick up the commented-out tag. And changing the regex to detect comments is not so simple.
It is for this and other reasons that regex is not the best tool for parsing HTML (see more details here, here and here).
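To make the problem concrete, here is a minimal sketch that runs the naive command on those two lines. The file name sample.html, the temporary directory and the variable naive_titles are assumptions for illustration only, and the text is in plain ASCII to keep the example locale-independent.

```shell
# Hypothetical sample file reproducing the two lines above.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '<title>page title</title>' \
  '<!-- <title>commented title</title> -->' > "$tmpdir/sample.html"

# The naive command from above, unchanged apart from the file name.
naive_titles=$(sed -e 's/^.*<title>//' -e 's/<\/title>.*//' "$tmpdir/sample.html")

# One line of output per <title> found, including the commented-out one.
printf '%s\n' "$naive_titles"

rm -rf "$tmpdir"
```

Both titles come out, because sed has no notion of what an HTML comment is.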
A better option, in my opinion, is to use xmllint, which is part of libxml2 (search the Internet for how to install it on your system; it is not difficult).
With it, obtaining the title (or any other tag) is much simpler. For example:
xmllint --xpath '//title/text()' --html pagina.html
The above command returns only the text of the title tag. If you want the whole tag, just remove the /text(). Besides, since it is a parser, it correctly detects and ignores tags that are inside comments, as in the example above.
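As a quick check of the comment handling, the sketch below runs xmllint on a small page that has a real title and a commented-out one. It assumes xmllint is installed; the file sample.html and the variable parsed_title are made up for the example.

```shell
# Hypothetical sample page with a real <title> and a commented-out one.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.html" <<'EOF'
<html>
<head>
<title>page title</title>
<!-- <title>commented title</title> -->
</head>
<body><p>content</p></body>
</html>
EOF

# 2>/dev/null hides any parser warnings; the XPath result goes to stdout.
parsed_title=$(xmllint --xpath '//title/text()' --html "$tmpdir/sample.html" 2>/dev/null)

# Only the real title should be printed; the comment is not an element.
printf '%s\n' "$parsed_title"

rm -rf "$tmpdir"
```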
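In the case of h1, it could be something like:
xmllint --xpath '(//h1)[1]/text()' --html pagina.html
The (//h1)[1] selects only the first h1 of the page, if there is more than one. The parentheses matter: without them, //h1[1] matches every h1 that is the first h1 inside its own parent element, which can be more than one per page.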
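One detail worth testing on your own pages is how the [1] predicate behaves when the h1 tags are in different parent elements. The sketch below is only an illustration under assumptions: xmllint is installed, and the sample page and the variable names are invented. It compares //h1[1] with (//h1)[1].

```shell
# Hypothetical page with two h1 tags, each inside a different div.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.html" <<'EOF'
<html>
<body>
<div><h1>First</h1></div>
<div><h1>Second</h1></div>
</body>
</html>
EOF

# //h1[1] matches each h1 that is the first h1 child of its own parent,
# so count() reports how many nodes that expression selects on this page.
per_parent=$(xmllint --xpath 'count(//h1[1])' --html "$tmpdir/sample.html" 2>/dev/null)
printf 'nodes matched by //h1[1]: %s\n' "$per_parent"

# (//h1)[1] first collects every h1, then keeps the first in document order.
first_h1=$(xmllint --xpath '(//h1)[1]/text()' --html "$tmpdir/sample.html" 2>/dev/null)
printf 'first h1 of the page: %s\n' "$first_h1"

rm -rf "$tmpdir"
```

If every h1 on the page shares the same parent, the two expressions select the same node; the parenthesized form is simply the safer choice when you do not control the markup.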