Regular expression, taking values from HTML

Question

Asked 10 years, 1 month ago

I have an HTML document from which I need to extract the values of a set of <li> elements.

This is the HTML part:

<ul id="minhas-tags">
 <li><em>Tagged: </em></li>
  <li><a href="/tags/tag1">tag1</a>, </li>
  <li><a href="/tags/tag2">tag2</a>, </li>
  <li><a href="/tags/tag3">tag3</a>, </li>
  <li><a href="/tags/tag4">tag4</a>, </li>

I want to get the contents of each <li>, that is, tag1, tag2, etc.

After a lot of reading here, I arrived at this regular expression:

tags/[a-zA-Z]+">[a-zA-Z]+<+

This isolates the HTML I want from everything else, but I don't know how to change the expression so that it returns only the content of each <li>.

For example, this expression returns /tags/tag1">tag1<, but I want only tag1.

How would I do that? And could you please explain how the suggested expression works?

Update

Sorry, I didn't mention the language. I'm using C#, and my routine looks like this:

public string retorna_Tags_HTML(string html)
{
    Regex ER = new Regex(@"tags?([\w]+)<\/a>", RegexOptions.None);
    Match m = ER.Match(html);
}

What language are you using? It may be possible to use a parser. Otherwise, try this regex: /tags?([\w]+)<\/a>/g.

– stderr

2015/05/29 at 01:06

The language is C#. The link you sent also returns the </a>. I added more information to the question.

– Ricardo

2015/05/29 at 01:23

2 answers

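You can use the expression tags?\w+(?=<\/a>), which matches a run of word characters (a-z, A-Z, 0-9 and the underscore _) that comes immediately before </a>. The positive lookahead (?=...) checks that </a> follows without including it in the match. Note that the pattern starts with the literal tag (followed by an optional s), so it only finds link texts that begin with "tag", as in the sample.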

using System;
using System.Linq;
using System.Text.RegularExpressions;
....

string html = 
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
        <li><a href=""/tags/tag1"">tag1</a>, </li>
        <li><a href=""/tags/tag2"">tag2</a>, </li>
        <li><a href=""/tags/tag3"">tag3</a>, </li>
        <li><a href=""/tags/tag4"">tag4</a>, </li>";

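// Match each run of word characters starting with "tag" that is followed by </a>.
Match[] tags = Regex.Matches(html, @"tags?\w+(?=</a>)")
                    .Cast<Match>()
                    .ToArray();

foreach (var tag in tags) {
    Console.WriteLine(tag.Value);
}
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4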

See the demo.
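The pattern above only finds link texts that begin with tag, and Match.Value always returns the entire match. A minimal sketch of an alternative, assuming every tag link has the form <a href="/tags/...">text</a> as in the question: put parentheses around the part you want and read it from Groups[1] instead of Value. The class name TagExtractor and the method name ExtractTags are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Hypothetical helper: returns the link text of every <a href="/tags/...">.
    // Assumes the markup has exactly the shape shown in the question.
    public static List<string> ExtractTags(string html)
    {
        var tags = new List<string>();

        // [^"]+   : the tag name inside the href, up to the closing quote
        // ([^<]+) : capture group 1, the link text up to the next '<'
        string pattern = @"<a href=""/tags/[^""]+"">([^<]+)</a>";

        foreach (Match m in Regex.Matches(html, pattern))
        {
            // m.Value would be the whole <a ...>...</a> element;
            // Groups[1] holds only the parenthesized part.
            tags.Add(m.Groups[1].Value);
        }
        return tags;
    }

    public static void Main()
    {
        string html = @"<li><a href=""/tags/tag1"">tag1</a>, </li>
                        <li><a href=""/tags/csharp"">csharp</a>, </li>";

        Console.WriteLine(string.Join(", ", ExtractTags(html)));
        // tag1, csharp
    }
}
```

Group 0 is always the full match and groups are numbered from 1 in the order of their opening parentheses, so a single pair of parentheses is enough here. If the markup varies (extra attributes, single quotes, nested elements), a parser is the safer choice.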

Another way is to use a parser, such as the Html Agility Pack, to extract this information. For example:

string html = 
    @"<ul id=""minhas-tags"">
       <li><em>Tagged: </em></li>
         <li><a href=""/tags/tag1"">tag1</a>, </li>
         <li><a href=""/tags/tag2"">tag2</a>, </li>
         <li><a href=""/tags/tag3"">tag3</a>, </li>
         <li><a href=""/tags/tag4"">tag4</a>, </li>";

var documento = new HtmlAgilityPack.HtmlDocument();
documento.LoadHtml(html);

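// "//a" selects every <a> element in the document.
// Note: SelectNodes returns null when nothing matches.
foreach (var tag in documento.DocumentNode.SelectNodes("//a")) {
    Console.WriteLine(tag.InnerText);
}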
Console.ReadLine();
// tag1
// tag2
// tag3
// tag4

Note: The project must reference the Html Agility Pack (for example, through the HtmlAgilityPack NuGet package).

by Matheus Cristian • **1,045** points · Answer 1 · 2015-05-29T01:07:26+00:00

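var html = document.querySelector("#minhas-tags").innerHTML;
var conteudo = [];
// replace() is used here only to visit each match; its return value is discarded.
// The callback receives the full match first, then the text captured by ( ... ).
html.replace(/tag[0-9]*">([a-zA-Z0-9]*)<\/a>/gi, function(match, texto) {
  conteudo.push(texto);
});
alert(conteudo);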

<ul id="minhas-tags">
  <li><em>Tagged: </em>
  </li>
  <li><a href="/tags/tag1">tag1</a>,</li>
  <li><a href="/tags/tag2">tag2</a>,</li>
  <li><a href="/tags/tag3">tag3</a>,</li>
  <li><a href="/tags/tag4">tag4</a>,</li>
</ul>