Regex to capture fixed strings in HTML and JS code

I'm writing automated tests for a legacy MVC project, and one of the tests is required to capture all fixed (hard-coded) strings in the HTML and JS code. The company is internationalizing the project's content, moving its fixed strings into resource files.

I wrote this regex:

([\n]|^)(?<Value>(?!.*?\/\/|.*?@\*|.*?@.*?@|.*?\/\*|.*?<!--|.*?\\\*)([^\n]*?)[áâãàéêèíîìóôõòúûù].*)

It partially solves my problem: it matches lines of code that contain accented characters, as long as they are not in comments (//, /*, @*, @, <!--).

Since no HTML or JS keywords or functions use accents, I can assume that these matches are fixed strings.

With this I was able to identify some pages with fixed strings that should be moved into resource files, but the regex does not cover all cases.

I would like a regex that:

  • Captures fixed strings in HTML and JS code even when they contain no accented characters.
  • Ignores strings inside comments.

Is there some syntactic particularity in either of these languages that could help me delimit where the regex should capture in order to identify these strings?

  • Could you explain it better? What do you mean by strings? Only those outside the tags? What forms can the fixed strings take? What should not be considered? What is your regex failing to catch? Could you add an example of a page that is causing trouble?

  • This is some information that can help you get a faster response.

  • @Randrade I will edit the post to try to explain better. What specifically did you not understand, or what was vague? By strings I mean any group of characters that does not adapt when the language changes, such as "run" or 'yes'. What should not be considered is words inside tags, in HTML cases like: <do not consider> consider <nc>. My regex does not capture comments (as intended), but it also misses fixed strings in the code that have no accented characters. As for a page that is in trouble: it is a big project, and there may be hundreds of pages not yet considered.

  • So, this complicates things a bit. To build a regex you need to know at least which pattern to consider and which not. If you say that everything in quotes should be considered, that is one thing. If you say everything outside tags, that is another possibility. Now, are there other possible cases? Trying to do something generic like this without knowing the possibilities can be complicated.

  • That is the challenge. I wonder whether there is some particularity I have missed in either of these languages that would form a pattern defining the beginning and end of the regex capture. Maybe consider double quotes inside content surrounded by an opening and a closing tag.

  • It's complicated; worse, it has to be generic, since the test will scan more than 1000 files that have been changed by dozens of different programmers.

  • I think using regex for this is not a good idea. There will always be a case that you cannot cover. I suggest trying a proper HTML parser. See this answer: http://stackoverflow.com/a/1732454/460775

  • @Embarrassing but how would I include this in automated testing? And what about the JS cases?
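
For the JS cases raised in the last comment, one possibility is a small hand-written scanner instead of a regex. The sketch below is in Python and uses a hypothetical helper name, `js_string_literals`. It walks the source character by character, skips `//` and `/* */` comments, and collects the contents of single- and double-quoted literals. It is only an illustration under simplifying assumptions: template literals and regex literals are not handled, so a real JavaScript parser would be more reliable on production code.

```python
def js_string_literals(source):
    """Return the contents of '...' and "..." literals found outside comments.

    Hypothetical helper for illustration; it ignores template literals and
    regex literals, which a full JavaScript tokenizer would have to handle.
    """
    literals = []
    i, n = 0, len(source)
    while i < n:
        two = source[i:i + 2]
        ch = source[i]
        if two == "//":
            # Line comment: skip to the end of the line.
            end = source.find("\n", i)
            i = n if end == -1 else end + 1
        elif two == "/*":
            # Block comment: skip past the closing */ (or to the end of input).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch in "\"'":
            # String literal: read until the matching quote or end of line.
            j = i + 1
            chars = []
            while j < n and source[j] != ch and source[j] != "\n":
                if source[j] == "\\" and j + 1 < n:
                    j += 1  # drop the backslash, keep the character after it
                chars.append(source[j])
                j += 1
            literals.append("".join(chars))
            i = j + 1
        else:
            i += 1
    return literals


SAMPLE_JS = '''// "ignored" in a line comment
var a = "run"; /* 'also ignored' */
var b = 'yes';
'''

print(js_string_literals(SAMPLE_JS))
```

In an automated test, the returned list could be compared against an allow-list of strings that are known not to need translation.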


1 answer


You cannot use regex on HTML.

Repeat after me. You cannot use regex on HTML.

Write it on the board 100 times:

for (int i = 0; i < 100; i++) {
    print('You cannot use regex on HTML.');
}

If you did manage to use regex on HTML, it was only on a snippet, or in a very specific case, because in general it is not possible to use regex on HTML.
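
To make that concrete, here is a minimal Python sketch. The pattern is a simplified stand-in for the regex in the question (same idea: skip any line that contains a comment marker, then require an accented character), not the original expression, and the sample markup is made up for the example.

```python
import re

# Simplified stand-in for the question's regex: reject lines containing a
# comment marker, then require at least one accented character on the line.
LINE_WITH_ACCENT = re.compile(
    r"^(?!.*?(?://|/\*|<!--|@\*)).*[áâãàéêèíîìóôõòúûù].*$",
    re.MULTILINE,
)

SAMPLE_HTML = """<!--
  Título antigo da página
-->
<p>Descrição do produto</p>
<a href="http://example.com">Configurações</a>
"""

matches = [m.group(0).strip() for m in LINE_WITH_ACCENT.finditer(SAMPLE_HTML)]

for line in matches:
    print(line)
```

The second line of the sample is reported even though it sits inside a multi-line comment, because a line-by-line pattern cannot know that the comment opened on the previous line. The last line is skipped even though "Configurações" is visible text, because the `//` in the URL looks like a comment marker.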

Don't take my word for it. The best answer of all time on the Stack Overflow matrix was to a similar question. See for yourself.

You cannot use regex on HTML.

However, as the answer there at the root says, you can use an XML parser.
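
As a rough illustration of the parser route, here is a minimal sketch using Python's standard `html.parser` module, which is an HTML parser swapped in for the XML parser mentioned above. The names `VisibleTextCollector` and `extract_fixed_strings` are hypothetical. The sketch collects text nodes and ignores comments and the bodies of `<script>` and `<style>` elements. It assumes plain HTML: Razor constructs such as `@* ... *@` are not HTML comments and would show up as ordinary text, and attribute values such as `title` are not collected.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text nodes, skipping comments and <script>/<style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)

    # handle_comment is not overridden, so HTML comments are simply dropped.


def extract_fixed_strings(html):
    """Return the non-empty text nodes of an HTML document, in order."""
    collector = VisibleTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


SAMPLE_PAGE = """<!-- Old title -->
<p>Run</p>
<script>var label = "yes"; // handled separately</script>
<button title="Save">OK</button>
"""

print(extract_fixed_strings(SAMPLE_PAGE))  # ['Run', 'OK']
```

Note that this finds strings with or without accented characters, since it relies on the document structure rather than on the characters used.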

  • Note that it is not possible to use regex on HTML.

  • First, thank you for the reply; besides being clarifying it was also very funny, haha. But I think it should be possible somehow: the regex I posted, for example, works and has confirmed many occurrences in HTML code. Side note: you could have written Stack Overflow Root instead of Stack Overflow matrix, hahaha

  • @Renan instead of making fun of me you could at least have offered an alternative solution, hahaha

  • Mea culpa: you can use an XML parser, as the answer on the root SO says.

  • @Renan edit your answer and I will mark it as correct; just edit it and add the part about the XML parser you just mentioned. Dude, I don't mean to be disrespectful or anything, but sometimes newer users may feel intimidated by a mocking attitude toward a completely valid question like this one. I'm fine with it, but I'm saying this so you can rethink your approach; you may end up offending someone unintentionally...

  • @Paz I had already edited it. Now, regarding the tone of the answer: most of my colleagues and I feel intimidated when an answer comes across as very serious; when it is clean and dry it feels like talking to hooded clerics. The people on the root SO, which is where I was raised, have the same attitude, so I apologize for this bad habit of mine XD

  • @Renan really, I think I'm even being a bit too "Nutella" (soft) about this. Thanks and cheers, m8
