Find a snippet inside an HTML string

3

I receive a string that contains HTML.

It contains a table, and one of its cells looks like this:

<td width="24%" valign="top" border="1" style=" 
        BORDER-RIGHT: windowtext 0.5pt solid; 
        BORDER-TOP: windowtext 0.5pt solid; 
        BORDER-LEFT: windowtext 0.5pt solid; 
        BORDER-BOTTOM: windowtext 0.5pt solid; 
        PADDING-LEFT: 3.5pt; 
        ">

        <font face="Arial" style="font-size: 6pt">
        NÚMERO DE INSCRIÇÃO
        </font>
        <br>

        <font face="Arial" style="font-size: 8pt">

        <b>00.000.000</b><br>

        <b>MATRIZ</b>
        </font>
        <br> 
    </td>

What is the best way to capture just this code: '00.000.000'?

PS: It is the CNPJ data table from the Receita Federal site.

  • 1

    I don't know what the best way is, but people commonly use an external library such as HtmlAgilityPack to parse the HTML and hand back its parts reliably; after that it is easy to search the elements. Reinventing the wheel can produce some result, but it takes work and will hardly be reliable, let alone future-proof. I am not saying these libraries are fail-safe, but they are an improvement. Without one it will be complicated, laborious and unreliable.

2 answers

5


I don't know what the best way is, but people commonly use an external library such as HtmlAgilityPack to parse the HTML and hand back its parts reliably; after that it is easy to search the elements. It seems to be the most widely used library for this kind of task among .NET programmers.

I have seen several other options, but I don't like any of them. I am not a big fan of this one either, but it is better than nothing.

Reinventing the wheel can produce some result, but it takes work and will hardly be reliable, let alone future-proof. Without a library it will be complicated, laborious and unreliable. I am not saying these libraries are fail-safe, but they are already an improvement.

In any case, this HTML is very hard to interpret. If it is yours, it would be better to modernize it; HTML is no longer written this way. If you have no control over it, understand that the markup can change, and any algorithm you write can become invalid and produce unexpected results. Even with a good parsing library, working without a pattern or a way to identify the element unambiguously is very risky.
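As an illustration of the library approach, here is a minimal sketch using HtmlAgilityPack (an external NuGet package). The class and method names are hypothetical, and the XPath assumes the markup keeps the shape shown in the question: the "NÚMERO DE INSCRIÇÃO" label in one `<font>` and the value in the first `<b>` of the next sibling `<font>`. If the page changes, or encodes the accented label as HTML entities, the lookup will simply fail.

```csharp
using System;
using HtmlAgilityPack; // external dependency: NuGet package "HtmlAgilityPack"

public static class InscricaoReader
{
    // Hypothetical helper: tries to read the text of the first <b> that
    // follows the "NÚMERO DE INSCRIÇÃO" label. Returns false when the
    // label or the value cannot be found.
    public static bool TryReadNumeroInscricao(string html, out string value)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: label and value live in two sibling <font> elements,
        // as in the snippet from the question.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//font[contains(., 'NÚMERO DE INSCRIÇÃO')]" +
            "/following-sibling::font[1]/b[1]");

        if (node == null)
        {
            value = string.Empty;
            return false;
        }

        // Decode entities such as &nbsp; and trim surrounding whitespace.
        value = HtmlEntity.DeEntitize(node.InnerText).Trim();
        return true;
    }
}
```

Anchoring on the label text rather than on the position of the cell is a deliberate choice: it is the only part of this markup that looks like a stable identifier, although nothing guarantees it stays that way.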

2

Edited to include a disclaimer: obviously, at some point in your process, the Receita Federal page is being scraped. As Maniero said in his answer and comment, this is not very reliable. My (incomplete) solution below searches for a CPF or CNPJ in any text, which may or may not contain HTML. It is only because of this that I answered in the form below. In general, anyone who parses HTML either doesn't know what they are doing or is desperate. #Readyma

If all you want is to extract a CNPJ, a regular expression can work. Note that the expression works only because you are not processing the HTML itself, just extracting a number from text.

The expression you are looking for is something like:

[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+-[0-9]+

And to those who understand regex: yes, I know my expression is somewhat lazy. I will upvote anyone who posts an answer with a more precise expression.

Explanation:

  • Each [0-9] block means "one numeric character here";
  • The + means that the item to its left must occur at least once but may occur several times. A more correct and efficient way to capture a CPF or CNPJ would be to repeat the numeric block, like [0-9][0-9][0-9]. I leave that up to you;
  • The backslash escapes characters that have special meanings so that their literal values are used (in this case, . and the slash).
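The stricter variant hinted at in the list could look like the sketch below. It assumes the usual formatted CNPJ layout 00.000.000/0000-00 (blocks of 2, 3, 3, 4 and 2 digits) and uses `{n}` quantifiers instead of repeating `[0-9]` by hand. The class name, method name and the group name `raiz` are hypothetical; the named group captures the first three blocks, which is the '00.000.000' part asked about in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CnpjFinder
{
    // Assumed layout: 00.000.000/0000-00.
    // [0-9] is used instead of \d because \d also matches non-ASCII digits in .NET.
    // The lookbehind and lookahead reject candidates glued to other digits.
    // The slash has no special meaning in .NET regexes, so it is left unescaped.
    private static readonly Regex Pattern = new Regex(
        @"(?<![0-9])(?<raiz>[0-9]{2}\.[0-9]{3}\.[0-9]{3})/[0-9]{4}-[0-9]{2}(?![0-9])");

    // Returns true and fills both outputs when a formatted CNPJ is found;
    // otherwise returns false and sets both outputs to empty strings.
    public static bool TryFind(string text, out string cnpj, out string raiz)
    {
        Match m = Pattern.Match(text);
        cnpj = m.Success ? m.Value : string.Empty;
        raiz = m.Success ? m.Groups["raiz"].Value : string.Empty;
        return m.Success;
    }
}
```

This only checks the shape of the number. It does not validate the check digits, and it will not match a value that is missing the part after the slash.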

Note that since the expression contains backslashes, you must escape them as well when putting it in a string, or place an @ in front of the string to make it a verbatim literal. You can use code similar to the following:

using System.Text.RegularExpressions;

string input = ""; // this should contain your input text
Regex foo = new Regex(@"[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+-[0-9]+");
Match m = foo.Match(input);

if (m.Success) {
    string resultado = m.Value; // I assume a single CNPJ per input.
}
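If the single-CNPJ assumption does not hold for your input, a possible variation is to collect every match with `Regex.Matches`. This is only a sketch: the class and method names are hypothetical, and it reuses the same lenient pattern, so it inherits the same limitations.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CnpjScanner
{
    // Returns every distinct CNPJ-looking value found in the input,
    // in order of first appearance.
    public static List<string> FindAll(string input)
    {
        return Regex.Matches(input, @"[0-9]+\.[0-9]+\.[0-9]+/[0-9]+-[0-9]+")
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)
                    .Distinct()
                    .ToList();
    }
}
```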

Good luck!

  • In that case it should work, until one day it doesn't anymore :) Anyway, there is no good solution.

  • Well, the idea is to fetch data directly from the Receita Federal using the customer's CNPJ. Whether it is good or will work forever I don't know, but while it works it is an extra and a differentiator for the customer. I followed @bigown's solution and it worked correctly; by the way, the library is very good and simple to use. Thank you.

  • 1

    @Rod it may have worked, but don't take it for granted that it will always work. But you've got the spirit. Trust, but stay suspicious :)

  • @bigown, yes, exactly: whenever we depend on something external we need a plan B, hehe. Thanks for sharing, cheers.