Remove HTML tags

In terms of efficiency and performance, which of these snippets is the better option for removing HTML tags from a string?

Option 1:

string ss = "<b><i>The tag is about to be removed</i></b>";
Regex regex = new Regex("\\<[^\\>]*\\>");
Response.Write(String.Format("<b>Before:</b>{0}", ss)); // HTML text
Response.Write("<br/>");
ss = regex.Replace(ss, String.Empty);
Response.Write(String.Format("<b>After:</b>{0}", ss)); // plain text as output

Source

Option 2:

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

Source

  • Best option with respect to which aspects?

  • I prefer the first option. The code is leaner, and I trust its regex more than the non-greedy match :P

  • @Gypsy Rhyrrisonmendez Effectiveness and Performance...

  • Well, then state that clearly in the question; otherwise it becomes opinion-based.

  • Okay, I am editing it.

  • Stepping a little outside the question: if you are after performance, have you considered doing this without regular expressions?

  • @Leonardobosquett I could not think of a good solution without regular expressions...

  • https://dotnetfiddle.net/UZzVJj - It does not behave exactly like your regular expression, which requires a "match" between "<" and ">", but this code uses 0.016s of CPU according to Fiddle. It may help you.

  • Very good, @Leonardobosquett. If you can, post it as an answer; it is a good solution too...

2 answers

I made a Fiddle for the first case. Times were:

Compile:    0.062s
Execute:    0s
Memory :    8kb
CPU    :    0.047s

I made a Fiddle for the second case. For the method HtmlRemoval.StripTagsRegex(), times were:

Compile:    0.109s
Execute:    0s
Memory :    16kb
CPU    :    0.094s

For the method HtmlRemoval.StripTagsRegexCompiled(), times were:

Compile:    0.063s
Execute:    0.031s
Memory :    16kb
CPU    :    0.109s

For the method HtmlRemoval.StripTagsCharArray(), times were:

Compile:    1.969s
Execute:    0.016s
Memory :    16kb
CPU    :    0.703s

Conclusion

All are equally effective.

In these measurements the first is the fastest, but it is not as well organized as the second.

The tests I ran do not cover very large strings. For small strings, the test serves well. For larger strings, it would be worth establishing other criteria and other tests.
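One possible way to run such a test on larger strings is sketched below with System.Diagnostics.Stopwatch. This is only an illustration, not the code behind the Fiddle numbers above: the class StripBenchmark, the helpers BuildInput and Measure, and the input size and iteration count are all assumptions chosen for the example. The regex is the Option 1 pattern written without the redundant escapes, and the char-array loop follows Option 2.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripBenchmark
{
    // Option 1 pattern without the redundant escapes, compiled once and reused.
    static readonly Regex TagRegex = new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripRegex(string source)
    {
        return TagRegex.Replace(source, string.Empty);
    }

    // Same logic as StripTagsCharArray in Option 2.
    public static string StripCharArray(string source)
    {
        char[] buffer = new char[source.Length];
        int count = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) buffer[count++] = c;
        }
        return new string(buffer, 0, count);
    }

    // Hypothetical helper: builds a long HTML-like string to act as a large input.
    public static string BuildInput(int repetitions)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < repetitions; i++)
        {
            sb.Append("<p><b>item ").Append(i).Append("</b> text</p>");
        }
        return sb.ToString();
    }

    // Hypothetical helper: times repeated calls after one warm-up call,
    // so that JIT compilation is not counted in the measurement.
    public static TimeSpan Measure(Func<string, string> strip, string input, int iterations)
    {
        strip(input);
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string input = BuildInput(10000); // assumed size, adjust as needed
        Console.WriteLine("Regex:      {0} ms", Measure(StripRegex, input, 100).TotalMilliseconds);
        Console.WriteLine("Char array: {0} ms", Measure(StripCharArray, input, 100).TotalMilliseconds);
    }
}
```

The absolute numbers will depend on the machine and runtime, so only the relative difference between the two methods on the same input is meaningful.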

  • I can only vote in 4 hours, haha... Thank you very much for your reply. It also opened my eyes to Fiddle; I had not realized it could provide this information. But why is the first one not organized?

  • Because in the second I only have to call a different function and the approach changes completely, since everything is encapsulated and ready to use. In the first, I have to define the regular expression by hand, instantiate it, run it, and put the result in another string.

With performance in mind, the tags can also be removed without regular expressions, which greatly improves performance. Here is a simple initial version of the code.

https://dotnetfiddle.net/UZzVJj

Test results:

 Compile:   0.189s 
 Execute:   0s 
 Memory:    0b 
 CPU:       0.016s

It does not follow exactly the same rule as the regular expression \<[^\>]*\>, since the regular expression removes text only when both delimiters, < and >, are present.
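The difference shows up on input with an unmatched bracket. The sketch below is an assumption-laden illustration: the Fiddle's code is not reproduced in this thread, so the char-array loop from Option 2 stands in for the non-regex approach, and the class name UnmatchedBracketDemo is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnmatchedBracketDemo
{
    // Regex from Option 1, written without the redundant escapes.
    public static string StripWithRegex(string source)
    {
        return Regex.Replace(source, "<[^>]*>", string.Empty);
    }

    // Stand-in for the non-regex approach: same logic as StripTagsCharArray.
    public static string StripWithCharArray(string source)
    {
        char[] buffer = new char[source.Length];
        int count = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) buffer[count++] = c;
        }
        return new string(buffer, 0, count);
    }

    public static void Main()
    {
        string wellFormed = "<b>bold</b>";
        string unmatched = "1 < 2";

        // Both approaches agree when every "<" has a closing ">".
        Console.WriteLine(StripWithRegex(wellFormed));     // prints "bold"
        Console.WriteLine(StripWithCharArray(wellFormed)); // prints "bold"

        // With a lone "<" the regex finds no match and leaves the text alone,
        // while the loop discards everything after the "<".
        Console.WriteLine("[" + StripWithRegex(unmatched) + "]");     // prints "[1 < 2]"
        Console.WriteLine("[" + StripWithCharArray(unmatched) + "]"); // prints "[1 ]"
    }
}
```

Which behavior is preferable depends on the input: for text that may contain literal comparison signs, the regex is the more conservative of the two.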
