How to filter an HTML tag and its contents with regular expressions in Shell Bash?

1

Based on the text below, how can I keep only the text of the first span on each line, selected by the text of the last span?

<span class="CVA68e qXLe6d">Colcha Casal e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Colcha Solteiro e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Roupão de banho ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>
<span class="CVA68e qXLe6d">Caminho de mesa ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; mesa</span>  </span>
<span class="CVA68e qXLe6d">Cortina para quarto ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Travesseiro de pena com ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Fronha de Solteiro em ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Lençol 70% algodão e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Pano de prato pintado a ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; mesa</span>  </span>
<span class="CVA68e qXLe6d">Coberto dupla face colo... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Toalha de rosto felpudo ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>

Keep in mind that the text above has several lines. What matters here is getting the titles from the first span, filtered by the category that follows &#8250; (cama/mesa/banho) in the third and last span.

What I tried: sed together with grep, in their simplest form:

sed 's/\"/\n/g' /tmp/default.htm | grep "TorraTudo"
Meaning of \" and \n:
\" - matches the double quotes,
\n - breaks the line at each double quote.
  • This gives me a list, as below:
>Colcha Casal e ... - TorraTudo</span>  <span class=
>Colcha Solteiro e ... - TorraTudo</span>  <span class=
>Roupão de banho ... - TorraTudo</span>  <span class=
>Caminho de mesa ... - TorraTudo</span>  <span class=
>Cortina para quarto ... - TorraTudo</span>  <span class=
>Os Simpsons em Português - YouTube</span>  <span class=
>Travesseiro de pena com ... - TorraTudo</span>  <span class=
>Fronha de Solteiro em ... - TorraTudo</span>  <span class=
>Lençol 70% algodão e ... - TorraTudo</span>  <span class=
>Pano de prato pintado a ... - TorraTudo</span>  <span class=
>Coberto dupla face colo... - TorraTudo</span>  <span class=
>Toalha de rosto felpudo ... - TorraTudo</span>  <span class=

But note that this makes no distinction between cama/mesa/banho (bed/table/bath).

I even tried something like:

sed 's/\"/\n/g' /tmp/default.htm | grep "TorraTudo\(^.*$\) &#8250\; cama"

sed 's/\"/\n/g' /tmp/default.htm | grep "TorraTudo\(^.*$\) &#8250\; mesa"

sed 's/\"/\n/g' /tmp/default.htm | grep "TorraTudo\(^.*$\) &#8250\; banho"

After several useless attempts, including the ones shown here, I decided to ask those with more experience in this subject (regular expressions).

What I need is to separate the titles by category: cama, mesa or banho.

  • Is there a line break between the spans, or is that just the way you pasted it here?

  • @Kiritonito There is no line break between the spans. This is the original form of what I have. It is a real example; you can save it on your PC and try to filter it, because this text reflects my difficulty exactly.

4 answers

5

Do not use regex to manipulate HTML

In general, regex is not meant for working with HTML (it may even "work" in many cases, but it is not the most suitable tool for the task).

The regex in another answer may have "worked", but it has a number of problems that I will cover in detail (there is a good example here and here). First, let’s look at a solution without regex, and then come back to regex and its problems.


Prefer a dedicated parser

Use the right tool for each situation: if you want to manipulate HTML, use an HTML parser.

On Linux there are several options. One of them is libxml2 (easily installed with sudo apt install libxml2 or sudo apt install libxml2-utils, or downloaded directly from the official website).

This gives you the xmllint command. See how simple it is to get the text of all span tags:

xmllint --html --xpath "//span/text()" /tmp/default.htm 

The output will be:

Colcha Casal e ... - TorraTudo
  
www.torratudo.com › cama
  
Colcha Solteiro e ... - TorraTudo
  
www.torratudo.com › cama
  
Roupão de banho ... - TorraTudo
  
www.torratudo.com › banho

etc...

So, now that the hardest part is done (extracting the text from the HTML tags), we just need a few commands to manipulate the text. First I remove the blank lines:

xmllint --html --xpath "//span/text()" /tmp/default.htm | grep "\S"

So we’ll have:

Colcha Casal e ... - TorraTudo
www.torratudo.com › cama
Colcha Solteiro e ... - TorraTudo
www.torratudo.com › cama
Roupão de banho ... - TorraTudo
www.torratudo.com › banho
etc...

Then, if I only want the entries for "cama" (bed), I run grep with the -B option to also pick up the previous line:

xmllint --html --xpath "//span/text()" /tmp/default.htm | grep "\S" | grep " cama$" -B 1 --no-group-separator

This picks up the lines that end with "cama" (cama$), and the option -B 1 makes grep also return the line immediately before each match. By default grep also prints a separator between groups of matches (a line containing --), so I use --no-group-separator to suppress it. The result is:
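To see what -B 1 and --no-group-separator do in isolation, here is a small sketch on three title/category pairs taken from the question. The variable names are made up for the illustration, and both options assume GNU grep:

```shell
#!/bin/sh
# Three title/category pairs, as they look after the blank lines are removed.
pairs='Colcha Casal e ... - TorraTudo
www.torratudo.com › cama
Roupão de banho ... - TorraTudo
www.torratudo.com › banho
Cortina para quarto ... - TorraTudo
www.torratudo.com › cama'

# -B 1 prints each matching line plus the line before it. GNU grep puts
# a "--" line between groups that are not adjacent in the input.
with_separator=$(printf '%s\n' "$pairs" | grep -B 1 ' cama$')

# --no-group-separator (GNU grep) drops those "--" lines.
without_separator=$(printf '%s\n' "$pairs" | grep -B 1 --no-group-separator ' cama$')

printf '%s\n\n%s\n' "$with_separator" "$without_separator"
```

The first result has a -- line between the two "cama" pairs; the second contains only the four title and category lines.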

Colcha Casal e ... - TorraTudo
www.torratudo.com › cama
Colcha Solteiro e ... - TorraTudo
www.torratudo.com › cama
Cortina para quarto ... - TorraTudo
www.torratudo.com › cama
Travesseiro de pena com ... - TorraTudo
www.torratudo.com › cama
etc...

Now I just pick up the odd lines (the first, third, fifth, etc). You can do this with sed or awk:

xmllint --html --xpath "//span/text()" /tmp/default.htm | grep "\S" | grep " cama$" -B 1 --no-group-separator | awk 'NR%2'

or

xmllint --html --xpath "//span/text()" /tmp/default.htm | grep "\S" | grep " cama$" -B 1 --no-group-separator | sed -n '1~2 p'

In both cases the output will be:

Colcha Casal e ... - TorraTudo
Colcha Solteiro e ... - TorraTudo
Cortina para quarto ... - TorraTudo
Travesseiro de pena com ... - TorraTudo
Fronha de Solteiro em ... - TorraTudo
Lençol 70% algodão e ... - TorraTudo
Coberto dupla face colo... - TorraTudo

If you want "mesa" (table) or "banho" (bath), just replace cama in the grep pattern with the one you want.
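The last two stages (grep -B 1 and the odd-line filter) can also be folded into a single awk step. A minimal sketch, assuming the input is already the strictly alternating title/category lines produced above with the blank lines removed; titles_for is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# Sample of the text left after the blank lines are removed:
# each title line is followed by its category line.
extracted='Colcha Casal e ... - TorraTudo
www.torratudo.com › cama
Roupão de banho ... - TorraTudo
www.torratudo.com › banho
Caminho de mesa ... - TorraTudo
www.torratudo.com › mesa
Cortina para quarto ... - TorraTudo
www.torratudo.com › cama'

# titles_for CATEGORY: read title/category pairs on stdin and print
# the titles whose category line ends with CATEGORY.
titles_for() {
    awk -v category="$1" '
        NR % 2 { title = $0; next }        # odd line: remember the title
        $NF == category { print title }    # even line: last field is the category
    '
}

cama_titles=$(printf '%s\n' "$extracted" | titles_for cama)
printf '%s\n' "$cama_titles"
```

In the real pipeline the function would read from the xmllint and grep "\S" stages instead of the sample variable, for example ... | grep "\S" | titles_for mesa.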


With regex it is worse

Although it looks like a "simple" solution that "works", regex is not ideal.

Just to give an example:

grep "cama<\/span>" /tmp/default.htm | grep -oP '(?<=<span class="CVA68e qXLe6d">)[^<]+(?=<\/span>)'

First I get the lines whose span ends with "cama", and then I take the contents of the first span on the same line. For this I use the -P option, which enables Perl-compatible regex with more advanced features such as lookarounds. Here they check that something exists before and after the match (the opening and closing of the tag) without being part of the match. The result is just the text of the tag.

One detail is that to capture the tag text I used [^<]+ (one or more characters other than <). That is a little more efficient than the .*? in the other answer (see more details here and here). But it is still not ideal, as it assumes that there are no other tags within the span (or commented-out tags, or a CDATA block, etc.).
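Where grep has no -P, roughly the same extraction can be sketched with sed -E and a capture group in place of the lookarounds. This is only an illustration on three lines from the question (the html variable exists just for the example), and it is exactly as fragile as any other regex applied to HTML:

```shell
#!/bin/sh
# Three lines copied from the question, kept in a variable for the example.
html='<span class="CVA68e qXLe6d">Colcha Casal e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Roupão de banho ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>
<span class="CVA68e qXLe6d">Cortina para quarto ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>'

# -n suppresses the default output and the p flag prints only the lines
# where the substitution happened. ([^<]+) captures the text of the first
# span, and the rest of the pattern requires an inner span ending in "cama".
# Using | as the delimiter of s avoids having to escape the / in </span>.
cama_titles=$(printf '%s\n' "$html" |
    sed -nE 's|^<span class="CVA68e qXLe6d">([^<]+)</span>.*&#8250; cama</span>.*$|\1|p')

printf '%s\n' "$cama_titles"
```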

But this solution is very naive (besides working only for this specific case). The HTML only has to change a little and the regex stops working. If you change the classes of the span, or if the spans are no longer on the same line, or if one of them is commented out, or if another attribute appears (say each span gets an id), the regex fails. Then you have to change it to cover these cases, and it gets more and more complicated, to the point where it is no longer worth it.

With xmllint, on the other hand, I don’t need to change anything. For all the situations mentioned (spans on different lines, with other attributes or classes, a commented-out span being correctly ignored, etc.), the options above keep working. And if the structure of the HTML changes, xmllint still needs fewer changes than the regex.

Remember that HTML is a specific format whose structures live in a larger context (tags have "parents", "children" and "siblings", and to analyze one it is often necessary to analyze the whole). Regex, in turn, operates line by line (by default; this can be changed, but that only makes things more complicated), without taking into account the context and semantics of the text. That is why it is so hard to do something more robust with it. It is apparently easy to do something that "works" for the simpler cases, which gives the illusion that it is the appropriate tool for the problem. It is not.

"Ah, but you also used regex in the first example"

Yes, but only to pick up the lines that end with "cama", and, more importantly, that was after I had extracted the text from the tags. That is, I did not use regex to manipulate HTML, but to manipulate the text that resulted from the HTML processing (which was done with the appropriate tool, in this case xmllint). Regex by itself is not "bad"; what is bad is using it when you don’t need it, or when there are better tools to solve the problem.


Another problem with these solutions is that you need to read the file three times: once to get the lines for "cama", once for "mesa" and once for "banho".

The ideal would be to use a programming language, read the file only once and, for each tag, store the data in some structure (such as a hash map/table/dictionary). I know it is tempting to just use the command line and solve everything "in one line", but often the attempt to write a one-liner ends up complicating things.

That is because xmllint (like any other command-line tool) has its limitations. For example, if the span has other tags inside it, xmllint prints each piece of content on its own line, and the grep would have to be adapted to show more than one previous line; since that number varies, it would be much more complicated. With a programming language and the appropriate libraries this is much easier to solve, because many already have ready-made mechanisms for getting this information.
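As a rough illustration of the "read once, store in a structure" idea, an awk associative array can stand in for the hash map. This sketch still assumes the strictly alternating title/category lines from the xmllint step, so it does not remove the limitation just described; group_by_category and the fixed print order are assumptions made for the example:

```shell
#!/bin/sh
# Title/category pairs as produced earlier (sample data from the question).
pairs='Colcha Casal e ... - TorraTudo
www.torratudo.com › cama
Roupão de banho ... - TorraTudo
www.torratudo.com › banho
Caminho de mesa ... - TorraTudo
www.torratudo.com › mesa
Cortina para quarto ... - TorraTudo
www.torratudo.com › cama'

# group_by_category: one pass over the input, collecting the titles of
# each category in an associative array keyed by the category name.
group_by_category() {
    awk '
        NR % 2 { title = $0; next }                       # odd line: title
        { titles[$NF] = titles[$NF] "  " title "\n" }     # even line: category
        END {
            # Print in a fixed order; "for (k in titles)" has no defined order.
            n = split("cama mesa banho", order, " ")
            for (i = 1; i <= n; i++)
                if (order[i] in titles)
                    printf "%s:\n%s", order[i], titles[order[i]]
        }
    '
}

grouped=$(printf '%s\n' "$pairs" | group_by_category)
printf '%s\n' "$grouped"
```

The input is read a single time and all three categories come out together, each followed by its indented titles.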

In short, even if it is possible to do it one way, prefer to use the right tools for each problem. Anyway, the options are there.

  • 1

    Excellent answer. It’s a pity that I can’t get across to other people what a kludge it is to parse XML/HTML with regex. Unfortunately we will still see people doomed to bang their heads against code that suddenly stops working "for no reason".

1

Using regular expressions via Perl

$ perl -nE '/<span.*?>(.*?)- TorraTudo.*8250; (.*?)<.span>/ 
            and say $1,$2' file.html

Colcha Casal e ... cama
Colcha Solteiro e ... cama
Roupão de banho ... banho
Caminho de mesa ... mesa
...

Using an XML/HTML parser (xidel)

xidel ex1.html -e '//span/text()'| perl -0pe 's/TorraTudo\s*.*?›//g'

(produces the same result)

1

You can use this:

">(.*?) - TorraTudo</span>.*?#\d+;(?:\s?)banho

I don’t have Linux; I tested it in Notepad++.

Test and explanation is here.

--

As I said, I don’t have Linux; I’m on Windows. My "teste.txt" file is exactly your example: I selected your text (the spans), pressed Ctrl+C and Ctrl+V, and saved it.

File teste.txt:

<span class="CVA68e qXLe6d">Colcha Casal e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Colcha Solteiro e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Roupão de banho ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>
<span class="CVA68e qXLe6d">Caminho de mesa ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; mesa</span>  </span>
<span class="CVA68e qXLe6d">Cortina para quarto ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Travesseiro de pena com ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Fronha de Solteiro em ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Lençol 70% algodão e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Pano de prato pintado a ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; mesa</span>  </span>
<span class="CVA68e qXLe6d">Coberto dupla face colo... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Toalha de rosto felpudo ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>

I ran this command in PowerShell:

Get-Content -Encoding UTF8 "./teste.txt" | Select-String -Pattern '">(.*?) - TorraTudo</span>.*?#\d+;(?:\s?)cama' | % { "Titulo $($_.matches.groups[1])" }

Result:

Colcha Casal e ...
Colcha Solteiro e ...
Cortina para quarto ...
Travesseiro de pena com ...
Fronha de Solteiro em ...
Lençol 70% algodão e ...
Coberto dupla face colo...

The title will be in group 1; group 0 is the whole match.

  • I decided to change the answer to put the information.

  • I did it! And I’ll share it with you in another answer. Explaining the details (the mistakes, the hits and the discoveries from my many attempts) would take several comments in a row, and I don’t want to clutter this page with a comment split into parts. So I will answer my own question to cover the issue as a whole. Your logic is correct!

1


This answer complements the answer by @Kiritonito.

It only serves to clarify some details, since he wrote his answer on Windows while I use a Linux distro.

So I will report the [few] points that I noticed and corrected so that his regular expression would work on my operating system (Unix-like).

To keep sed from reporting a syntax error, it was necessary to escape the slash of the closing span tag with a backslash:

<\/span>

I remembered that I have both the GNU Coreutils package and BusyBox installed.

  • What does that mean? One has more built-in features than the other.

And, as expected, there was a difference, which you will see now. Check it out:

Running with the sed from the Coreutils package:

/usr/local/bin/sed '/">(.*?) - TorraTudo<\/span>.*?#\d+;(?:\s?)/g' /tmp/default.htm  | grep banho | cut -d\> -f2 | cut -d\< -f1

Running with the sed from the BusyBox package:

/bin/busybox sed '/">(.*?) - TorraTudo<\/span>.*?#\d+;(?:\s?)banho/g' /tmp/default.htm  | cut -d\> -f2 | cut -d\< -f1

Remarks:

GNU sed - 4.2.2

Putting the selector word (cama, mesa or banho) at the end of the regular expression does not work in GNU sed [Coreutils].

sed '/">(.*?) - TorraTudo<\/span>.*?#\d+;(?:\s?)/g' /tmp/default.htm | grep banho

That made me think of filtering those words (cama, mesa and banho) through grep instead, since it makes no difference whether they appear in the sed expression.

To achieve this, grep has to be used together with the partial regular expression in sed.


BusyBox sed - v1.22.1

Nothing needs to be changed or added; as usual, just escape the slash of the closing span tag. And everything works as desired.

sed '/">(.*?) - TorraTudo<\/span>.*?#\d+;(?:\s?)banho/g' /tmp/default.htm

I finished with the cut command at the end to make the result clean.

cut -d\> -f2 | cut -d\< -f1
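For comparison, here is a sketch of the same grep-and-cut idea without the sed step. In sed's default (basic) regex syntax the characters (, ) and ? are literal, so a pattern written in Perl style may not match anything, and the filtering may in practice be done by grep and cut alone. The sample lines come from the question; titles_in is a hypothetical helper name, and anchoring the search to the text after &#8250; is my own assumption about how to avoid matching a category word that happens to appear in a title:

```shell
#!/bin/sh
# Three lines copied from the question, kept in a variable for the example.
html='<span class="CVA68e qXLe6d">Colcha Casal e ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; cama</span>  </span>
<span class="CVA68e qXLe6d">Roupão de banho ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>
<span class="CVA68e qXLe6d">Toalha de rosto felpudo ... - TorraTudo</span>  <span class="qXLe6d dXDvrc">  <span class="fYyStc">www.torratudo.com &#8250; banho</span>  </span>'

# titles_in CATEGORY: keep the lines whose last span ends with CATEGORY
# (fixed-string search, no regex), then cut out the text between the
# first ">" and the following "<", which is the title.
titles_in() {
    grep -F "&#8250; $1</span>" | cut -d'>' -f2 | cut -d'<' -f1
}

banho_titles=$(printf '%s\n' "$html" | titles_in banho)
printf '%s\n' "$banho_titles"
```

Like every approach in this answer, it depends on each entry sitting on one line with exactly this layout.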


Conclusion

In the answer given by @Kiritonito, one of the challenges was to find out which command to use with the regular expression (sed, awk or grep).

Another was to fix the small error by escaping the slash: <\/span>.

And then to test several sed implementations and find out which of them needed the syntax adjusted, depending on the sed and its version.

This may sound silly, but I wasted a lot of time by not remembering that I have two different packages of command-line tools that play the same role. However, one is more modern than the other.

We should also pay attention to versions of programs released at different times, which can cause frustration and lead us to think that the code contains errors when often it does not. That is frequently the price paid when mixing older and more modern versions of tools: incompatibility.

  • This also worked very well and was more practical than I imagined: sed 's/\*>/\n/g' /tmp/default.htm | grep "mesa" | cut -d\> -f2 | cut -d\< -f1 . All it took was the * wildcard, which also matches "nothing", since "any amount" includes "no amount".
