How to get the contents of an HTML element from a string using regex?

1

I am creating an application that reads the content of a web page and then extracts the text of a <textarea> element from that content. To accomplish this, I decided to use regex.

The problem is that I don't know much about regex and I can't work out the JavaScript logic to get the element's text. Below is an example string and the code I tried:

content = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';

result = content.match(/<textarea id="text">\w*/)[0];
console.log(result);
  • I understand that you decided to use regex. But wouldn't it be better to use the getElementById() method, since your string is an HTML page?

  • And how would I use this method with a string ?

  • I will prepare an answer with both methods.

  • One more thing: I will use this in an application running on Node. Still, please include the getElementById method in the answer if possible, because I am curious about it.

  • All right. I’ll just use javascript.

  • 2

    Probably easier: <textarea[^>]*>([^<]*) (if you are desperate enough to actually need regex, of course)

  • I think it's important to add the node tag to the question, since Node does not have a native DOM manipulation API, which changes the possible answers quite a bit

  • 2

    DOMParser or any other API is much more reliable, since the HTML only has to change a little to break the regex. For example, if the textarea is inside a comment (<!-- <textarea> etc -->), the regex does not detect that (and a regex that recognizes HTML comments is very complicated), whereas DOMParser ignores it correctly (this is just one example, there are several other cases that make regex unfeasible depending on the situation) - finally, it is always worth reading here and here

  • @Costamilam the problem is that it would invalidate some of the answers. I don't even know what is best in this case.

  • @Bacco it would have been better if the question had been created with the tag, but even so, I still find it useful to keep the other answers, since there is no reason to create a new question for someone who wants the same thing in the browser

  • Hey! Have you heard of Cheerio? You can load your HTML into it and run queries to extract the data you want. The documentation is very simple and you will probably get a very good result.

  • 1

    Just to complement (and do a little self-promotion), I just wrote an answer about exactly that (using a parser versus regex), including some examples that illustrate why the regex solution is worse...
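
The pitfall raised in the comments above (a textarea inside an HTML comment) can be reproduced with a short sketch. The input string is hypothetical and the pattern is the one suggested earlier in this thread; this only illustrates the failure mode and is not a recommended approach.

```javascript
// Hypothetical input: the first textarea is commented out, the second is live.
const html =
  '<!-- <textarea id="text">old draft</textarea> -->' +
  '<textarea id="text">Hello World!</textarea>';

// Pattern suggested in the comments above.
const pattern = /<textarea[^>]*>([^<]*)/;

const match = html.match(pattern);
const captured = match ? match[1] : null;

// The regex has no notion of HTML comments, so it captures the
// commented-out content instead of the live element.
console.log(captured);
```

A DOM parser would skip the commented-out markup and return the content of the live element instead.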


3 answers

2

Use DOMParser in this way:

var content = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';
var result = new DOMParser().parseFromString(content, "text/html");
document.write(result.querySelector("#text").textContent);

// or result.getElementById("text").textContent

It converts the string into a document object with its nodes. Then just get the element with the desired id using .querySelector or .getElementById.

  • 4

    It is worth mentioning that if he is going to fetch the content with ajax, DOMParser is not even needed (the ajax request can already return a Document)

  • Thanks Sam! I am developing the application on Node, but I also wanted to know how to use querySelector with a string. I'm sure I'll use DOMParser a lot in the future :)

  • @Jeanextreme002 Cool. But I don't think you need to load an external library to do something so simple.

1


You can use the jsdom library to get access to a DOM manipulation API, but I can't tell whether it is a native implementation or just emulates one:

const jsdom = require("jsdom");

const dom = new jsdom.JSDOM("<!DOCTYPE html><textarea>Hello world</textarea>");

console.log(dom.window.document.querySelector("textarea").value);
  • Thank you, that’s exactly what I needed.

  • Forget what I said, now I understand that the question is about Node.

  • @Guilhermebirth the DOMParser API does not exist in Node, does it? You can post your own answer with the best solution you can think of. I'd rather use the lib along with fetch if it makes things easier and nothing prevents it

  • @Costamilam that is what I said: the author's question is ambiguous, and I had to read the comments to see that it was about Node, in which case the way out really seems to be using packages, as answered.

1

Following the idea of parsing with a regular expression, it could be done like this:

var str = '4354543/sfd f^^ <textarea id="text">Hello World! 34% #_2@.;/°? </textarea> fr fdgdf //fdg3';
var matches = str.match(/<textarea[\s\S]*?id="text"[\s\S]*?>([\s\S]+?)<\/textarea>/);
console.log(matches[1]);

In short, I look for the opening textarea tag that has id="text", followed by a capture group, which will be index 1 of the match, followed by the closing textarea tag.

The character class [\s\S] is a "trick" that matches absolutely any character, line breaks included. The trick is explained very well by @hkotsubo. This way the search will not break if the content contains line breaks and so on.
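
As a minimal sketch of why the character class matters, the snippet below compares "." with [\s\S] on content that spans two lines. The input string and variable names are hypothetical, chosen only for illustration.

```javascript
// Hypothetical input where the textarea content spans two lines.
const html = '<textarea id="text">first line\nsecond line</textarea>';

// Without the "s" (dotAll) flag, "." does not match line terminators,
// so this pattern cannot reach the closing tag and the match fails.
const withDot = html.match(/<textarea id="text">(.+?)<\/textarea>/);

// [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
const withClass = html.match(/<textarea id="text">([\s\S]+?)<\/textarea>/);

console.log(withDot);      // null
console.log(withClass[1]); // both lines, including the line break
```

In engines that support ES2018, adding the s flag to the first pattern should give the same result as the character class.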

I am not going to go through everything used in the regular expression, since there is a better solution, which I present next:

var str = (
    '<textarea id="um_id_qualquer">Lorem, ipsum dolor sit amet consectetur adipisicing elit.</textarea>'
    +'<textarea id="2_id_qualquer">Dignissimos ducimus quas illo a expedita pariatur maxime magni,</textarea>'
    +'<textarea id="tres_id_qualquer">amet sint laborum eveniet, quam,</textarea>'
    +'<textarea id="text">Hello World! 34% #_2@.;/°? </textarea>'
    +'<textarea id="4_id_qualquer">recusandae enim iste delectus quidem! Iusto, at amet!</textarea>'
);

var parser = new DOMParser().parseFromString(str, "text/html");

console.log(parser.getElementById('text').innerHTML);

Since str is valid HTML, just create a DOMParser and use the DOM's own methods.

  • 1

    Thanks Lip for helping me.

  • 3

    ([\s\S]+) will cause problems if there are 2 textareas. Either use +? so that it is not greedy, or do as I did in the question comment: [^<]*, "anything but <"

  • @Bacco funny thing is, I put the non-greedy quantifier on all the classes and forgot that one. Thanks for the heads-up! haha PS: I prefer to anchor on the whole closing tag. What if the textarea content contains a plain "less than" (<)...
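
The trade-off discussed in these comments can be sketched with a hypothetical input that has two textareas and a literal "<" in the first one. The patterns are simplified variants of the ones in this thread, not the exact expressions posted.

```javascript
// Hypothetical input: two textareas, the first one contains a literal "<".
const html =
  '<textarea id="text">1 < 2</textarea>' +
  '<textarea id="other">second</textarea>';

// Greedy: runs on to the LAST closing tag and swallows the second textarea.
const greedy = html.match(/<textarea[^>]*id="text"[^>]*>([\s\S]+)<\/textarea>/)[1];

// Lazy (+?): stops at the FIRST closing tag.
const lazy = html.match(/<textarea[^>]*id="text"[^>]*>([\s\S]+?)<\/textarea>/)[1];

// Negated class: stops at the first "<", even when it is part of the content.
const negated = html.match(/<textarea[^>]*id="text"[^>]*>([^<]*)/)[1];

console.log(greedy);  // '1 < 2</textarea><textarea id="other">second'
console.log(lazy);    // '1 < 2'
console.log(negated); // '1 '
```

With this input, only the lazy quantifier anchored on the closing tag returns the full content of the first textarea.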
