Regular expression to extract values from HTML

Viewed 998 times

5

I have an HTML document from which I need to extract the values of a set of <li> elements.

This is the relevant HTML:

<ul id="minhas-tags">
 <li><em>Tagged: </em></li>
  <li><a href="/tags/tag1">tag1</a>, </li>
  <li><a href="/tags/tag2">tag2</a>, </li>
  <li><a href="/tags/tag3">tag3</a>, </li>
  <li><a href="/tags/tag4">tag4</a>, </li>

I want to get the contents of each <li>, that is tag1, tag2, etc.

After much reading here, I arrived at this regular expression:

tags/[a-zA-Z]+">[a-zA-Z]+<+

This can isolate the HTML I want from everything else, but I don’t know how to transform this expression so that it finds the values and returns only the content of <li>.

For example, this expression returns /tags/tag1">tag1<, and I want only tag1.

How would I do that? And would you explain to me how the suggested expression would work as a solution, please?

Update

Sorry, I forgot to mention the language. I'm using C#, and my routine looks like this:

public string retorna_Tags_HTML(string html)
{
    Regex ER = new Regex(@"tags?([\w]+)<\/a>", RegexOptions.None);
    Match m = ER.Match(html);
}

  • What is the language? It may be possible to use a parser. Otherwise, try this regex: /tags?([\w]+)<\/a>/g.

  • The language is C#. The link you sent also returns the </a>. I added more information to the question.

2 answers

4


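You can use the expression tags?\w+(?=<\/a>). It matches "tag" or "tags" followed by one or more word characters (a-z, A-Z, 0-9 and the underscore _), but only when that text is immediately followed by </a>. The positive lookahead (?=...) checks for </a> without including it in the match, so only the link text is returned. Note that this relies on the tag names starting with "tag", as they do in the sample.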

using System;
using System.Linq;
using System.Text.RegularExpressions;
// ...

string html =
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
        <li><a href=""/tags/tag1"">tag1</a>, </li>
        <li><a href=""/tags/tag2"">tag2</a>, </li>
        <li><a href=""/tags/tag3"">tag3</a>, </li>
        <li><a href=""/tags/tag4"">tag4</a>, </li>";

// Each match is the link text only; the lookahead keeps </a> out of the match.
Match[] tags = Regex.Matches(html, @"tags?\w+(?=</a>)")
                    .Cast<Match>()
                    .ToArray();

foreach (var tag in tags) {
    Console.WriteLine(tag.Value);
}
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4

See the demonstration.
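The lookahead expression above only matches names that begin with "tag". If the real tag names are arbitrary, a capture group anchored on the href is one alternative. The following is a minimal sketch, not part of the original answer: the class name TagExtractor, the method name ExtractTags and the extra "c-sharp" sample item are hypothetical, and the pattern assumes that every link has a double-quoted href starting with /tags/ and plain text inside the <a>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // href="/tags/<anything but a quote>">  then capture the link text up to </a>.
    // Assumption: markup shaped like the sample (double-quoted href, no nested tags in the link).
    private static readonly Regex TagLink =
        new Regex(@"href=""/tags/[^""]+"">([^<]+)</a>");

    public static List<string> ExtractTags(string html)
    {
        return TagLink.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Groups[1].Value) // group 1 = text between > and </a>
                      .ToList();
    }

    public static void Main()
    {
        // Hypothetical input: the second item shows a name that does not start with "tag".
        string html = @"<ul id=""minhas-tags"">
  <li><em>Tagged: </em></li>
  <li><a href=""/tags/tag1"">tag1</a>, </li>
  <li><a href=""/tags/c-sharp"">c-sharp</a>, </li>";

        Console.WriteLine(string.Join(", ", ExtractTags(html)));
        // tag1, c-sharp
    }
}
```

Here m.Value would still be the whole matched text, including the href; m.Groups[1].Value is what returns only the content of the link.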

Another way is to use a parser, such as the HTML Agility Pack, to extract this information. For example:

string html = 
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
         <li><a href=""/tags/tag1"">tag1</a>, </li>
         <li><a href=""/tags/tag2"">tag2</a>, </li>
         <li><a href=""/tags/tag3"">tag3</a>, </li>
         <li><a href=""/tags/tag4"">tag4</a>, </li>";

var documento = new HtmlAgilityPack.HtmlDocument();
documento.LoadHtml(html);

foreach (var tag in documento.DocumentNode.SelectNodes("//a")) {
      Console.WriteLine(tag.InnerText);
}
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4

Note: the project must reference the HTML Agility Pack.
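The //a XPath above selects every link in the document. On a full page it may help to restrict the query to the tag list and to guard against an empty result, since SelectNodes returns null when nothing matches. A sketch under those assumptions (the class name TagListReader and method name ReadTags are hypothetical, and the HtmlAgilityPack package must be referenced):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagListReader
{
    public static List<string> ReadTags(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Only links that sit directly inside <li> items of the list with id "minhas-tags".
        var links = document.DocumentNode.SelectNodes("//ul[@id='minhas-tags']/li/a");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null)
        {
            return new List<string>();
        }

        return links.Select(a => a.InnerText.Trim()).ToList();
    }
}
```

Calling ReadTags(html) with the sample markup is then a drop-in replacement for the foreach loop, and links elsewhere on the page are ignored.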

0

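// Read the markup of the tag list.
var html = document.querySelector("#minhas-tags").innerHTML;
var conteudo = [];
// replace() is used here only to iterate over the matches: the callback receives
// the whole match first, then the text captured by the parentheses.
html.replace(/tag[0-9]*">([a-zA-Z0-9]*)<\/a>/gi, function(match, texto) {
  conteudo.push(texto);
});
alert(conteudo);
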
<ul id="minhas-tags">
  <li><em>Tagged: </em>
  </li>
  <li><a href="/tags/tag1">tag1</a>,</li>
  <li><a href="/tags/tag2">tag2</a>,</li>
  <li><a href="/tags/tag3">tag3</a>,</li>
  <li><a href="/tags/tag4">tag4</a>,</li>
</ul>

  • How would I do this using C#? I added the classes I use to the question. Thank you.

  • I don't know =/, but see if it works to replace "tags?([\w]+)<\/a>" with "tag[0-9]*">([a-zA-Z0-9]*)<\/a>".

  • Please add a textual explanation to your answer to help the asker and other members who arrive at this question.
