How to get the contents of an HTML element from a string using regex?

Asked 5 years, 5 months ago

I am building an application that reads the content of a web page and then gets the text of a <textarea> element in that content. To accomplish this, I decided to use regex.

The problem is that I don’t know much about regex and can’t work out the JavaScript logic to get the element’s text. Below are an example string and the code I tried:

content = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';

result = content.match(/<textarea id="text">\w*/)[0];
console.log(result);

I understand that you decided to use regex, but wouldn’t it be better to use the getElementById() method, since your string is an HTML page?

– LipESprY

2020/03/08 at 18:09
And how would I use this method with a string?

– JeanExtreme002

2020/03/08 at 18:11
I will prepare an answer with the 2 methods.

– LipESprY

2020/03/08 at 18:13
One more thing: I will use this in an application running on Node. Still, please include the getElementById approach in the answer if possible, because I am curious about it.

– JeanExtreme002

2020/03/08 at 18:13
All right. I’ll just use JavaScript.

– LipESprY

2020/03/08 at 18:14
Something like <textarea[^>]*>([^<]*) is probably simpler (if you really have to resort to regex, of course).

– Bacco

2020/03/08 at 18:29
I think it’s important to add the node tag to the question, since Node has no native API for DOM manipulation, which changes the possible answers quite a bit.

– Costamilam

2020/03/08 at 18:49
DOMParser or any other such API is much more reliable, since a small change in the HTML is enough to break the regex. For example, if the textarea is inside a comment (<!-- -->), the regex cannot tell (and a regex that recognizes HTML comments is very complicated), whereas DOMParser correctly ignores it. That is just one example; there are several other cases that make regex unworkable, depending on the situation. In any case, it is always worth reading here and here

– hkotsubo

2020/03/08 at 19:12
@Costamilam the problem is that it would invalidate some of the answers. I don’t even know what is best in this case.

– Bacco

2020/03/08 at 19:12
@Bacco it would have been better if the question had been created with the tag, but either way I still find it useful to keep the other answers, since there is no reason to create a new question for someone who wants the same thing in the browser.

– Costamilam

2020/03/08 at 19:17
Hey! Have you heard of Cheerio? You can load your HTML into it and run queries to extract the data you want. The documentation is very simple and you will probably get a very good result.

– Jorge Linhares

2020/03/08 at 19:44
Just to add (and as a bit of self-promotion), I have just answered a question about exactly this (using a parser versus regex), with some examples that better illustrate why the regex solution is worse...

– hkotsubo

2020/03/17 at 13:45

3 answers

You can use the jsdom library to get access to a DOM manipulation API, though I can’t say whether it is native or just emulates the native one:

const jsdom = require("jsdom");

const dom = new jsdom.JSDOM("<!DOCTYPE html><textarea>Hello world</textarea>");

console.log(dom.window.document.querySelector("textarea").value);

Thank you, that’s exactly what I needed.

– JeanExtreme002

2020/03/08 at 18:44
Forget what I said, now I understand that the question is about Node.

– Guilherme Nascimento

2020/03/08 at 20:43
@GuilhermeNascimento the DOMParser API doesn’t exist in Node, does it? You can write your own answer with the best solution you can think of. I’d rather use the lib along with fetch if it makes things easier and there’s nothing to stop it.

– Costamilam

2020/03/08 at 20:46
@Costamilam that’s what I said: the author’s question is ambiguous, and I had to read the comments to see that it was about Node. In that case, the way out really does seem to be using packages, as answered.

– Guilherme Nascimento

2020/03/08 at 20:48

by Sam • **79,597** points · Answer 1 · 2020-03-08T18:29:18+00:00

Use DOMParser like this:

var content = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';
var result = new DOMParser().parseFromString(content, "text/html");
document.write(result.querySelector("#text").textContent);

// or result.getElementById("text").textContent

It converts the string into a document object with its nodes. Then just get the element with the desired id using .querySelector or .getElementById.
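
As a minimal sketch, the same steps can be wrapped in a small helper that also copes with a missing element. The function name `getTextById` and the sample `page` string are hypothetical, and the `typeof DOMParser` guard is an assumption added because, as noted in the comments to the question, Node does not provide this API natively.

```javascript
// Hypothetical helper: returns the text of the element with the given id,
// or null when the element (or DOMParser itself) is not available.
function getTextById(html, id) {
  // DOMParser is a browser API; outside a browser it may be undefined.
  if (typeof DOMParser === "undefined") {
    return null;
  }
  const doc = new DOMParser().parseFromString(html, "text/html");
  const element = doc.getElementById(id);
  return element ? element.textContent : null;
}

// Made-up sample markup for illustration.
const page = '<p>intro</p><textarea id="text">Hello World!</textarea>';
console.log(getTextById(page, "text"));    // the textarea text in a browser, null where DOMParser is missing
console.log(getTextById(page, "missing")); // null
```

Returning null instead of throwing keeps the caller free to fall back to another approach, such as a library, when running outside the browser.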

by LipESprY • **4,525** points · Answer 2 · 2020-03-08T18:33:36+00:00

Sticking with the idea of parsing with a regular expression, it could look like this:

var str = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';
var matches = str.match(/<textarea[\s\S]*?id="text"[\s\S]*?>([\s\S]+?)<\/textarea>/);
console.log(matches[1]);

In short, I look for the opening textarea tag that has id="text", followed by a capture group, which will be index 1 of the match, followed by the closing textarea tag.

The character class [\s\S] is a "trick" that matches absolutely anything. This trick was explained very well by @hkotsubo. It keeps the search from breaking if the content has line breaks and the like.
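
To illustrate the point about line breaks, here is a minimal sketch comparing [\s\S] with `.`, which does not match a line break unless the `s` flag is set. The sample string and variable names are made up for illustration; the snippet runs in Node or in a browser console.

```javascript
// Hypothetical sample: the textarea content spans two lines.
const sample = '<textarea id="text">Hello\nWorld!</textarea>';

// `.` does not match "\n" by default, so this pattern finds nothing.
const withDot = sample.match(/<textarea[^>]*id="text"[^>]*>(.+?)<\/textarea>/);

// [\s\S] matches any character, line breaks included.
const withClass = sample.match(/<textarea[^>]*id="text"[^>]*>([\s\S]+?)<\/textarea>/);

console.log(withDot);      // null
console.log(withClass[1]); // both lines of the textarea content
```

In engines that support the `s` (dotAll) flag, adding it to the first pattern would have the same effect as using [\s\S].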

I won’t go into every detail of the regular expression, since there is a better solution, which I present next:

var str = (
    '<textarea id="um_id_qualquer">Lorem, ipsum dolor sit amet consectetur adipisicing elit.</textarea>'
    +'<textarea id="2_id_qualquer">Dignissimos ducimus quas illo a expedita pariatur maxime magni,</textarea>'
    +'<textarea id="tres_id_qualquer">amet sint laborum eveniet, quam,</textarea>'
    +'<textarea id="text">Hello World! 34% #_2@.;/°? </textarea>'
    +'<textarea id="4_id_qualquer">recusandae enim iste delectus quidem! Iusto, at amet!</textarea>'
);

var parser = new DOMParser().parseFromString(str, "text/html");

console.log(parser.getElementById('text').innerHTML);

Since str is valid HTML, just create a DOMParser and use the DOM’s own methods.
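
One concrete case where the regex and the parser disagree, raised in the comments to the question, is a textarea that sits inside an HTML comment. The sketch below only demonstrates the regex side, so it can run in Node without extra packages; the sample markup is made up for illustration.

```javascript
// Hypothetical markup: the first textarea is commented out, the second is real.
const html =
  '<!-- <textarea id="text">old draft</textarea> -->' +
  '<textarea id="text">Hello World!</textarea>';

// Same matching logic as the pattern in the regex approach above.
const found = html.match(/<textarea[\s\S]*?id="text"[\s\S]*?>([\s\S]+?)<\/textarea>/);

// The regex has no notion of comments, so it returns the commented-out text.
console.log(found[1]); // "old draft"
```

A DOM parser treats the commented-out markup as a comment node rather than as an element, so a lookup by id reaches only the second textarea.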