get original HTML entities with javascript

Question

Asked 10 years, 9 months ago

Viewed 133 times

13

I need to recover all the original HTML entities of a paragraph, mainly the accented characters. The methods I know only recover some entities, as in the example below, where ">" is correctly returned as an entity but "ç" is not.

It is important that the code can tell accented characters that were written as entities apart from those that were not (as in "çã" in the example, where "ç" comes from an entity and "ã" does not), because the content comes from an external source and may not follow a consistent pattern.

alert(document.querySelector('p').innerHTML);

<p>situa&ccedil;ão &gt; ativo</p>

Note: as @mgibsonbr's accepted answer explains, this is not possible. The solution adopted was to use DOMDocument::saveHTML, which interprets entities in the same way as the browser, so that the data is identical on the server and on the client.

1

That's a great question! I don't know at what point during HTML parsing the entities are resolved, nor whether the original text is preserved somewhere or discarded. Checking the children of p, I see there is only one Text node, whose data (from CharacterData) is the string with all its entities resolved (including the >). So it seems the information you want no longer exists once the page is loaded; you would need to get that content from the external source (either on the server side, or maybe via Ajax if applicable) and process it before the browser interprets your HTML.

– mgibsonbr

2015/08/20 at 20:04

2 answers

9

The original HTML entities are not preserved when the document's markup is parsed by the browser, so they are not available for you to query via JavaScript or by any other means. According to the specification, during the tokenization step (reading the "raw" text and producing "pieces", or tokens, for further analysis), HTML entities (called character references there) are replaced by the corresponding character (or, for a few named references, two characters) when consumed:

8.2.4.69 Tokenizing character references

...

The behavior depends on the identity of the next character (the one immediately after U+0026 AMPERSAND), as follows:

...

"#" (U+0023)

Consume the U+0023 NUMBER SIGN.

...

Consume all characters matching the character range listed above (ASCII hexadecimal digits or ASCII digits).

...

Otherwise, if the next character is a U+003B SEMICOLON, consume it as well. If it is not, that is a parse error.

...

Otherwise, return a character token for the Unicode character whose code point is that number.

Anything else

Consume as many characters as possible, provided the characters consumed match one of the identifiers in the first column of the named character references table (in a case-sensitive manner).

...

Return one or two character tokens for the character(s) corresponding to the character reference name (as given by the second column of the named character references table).

(free translation, emphasis mine)
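To make the quoted steps more concrete, here is a minimal sketch of the numeric branch only. The function name decodeNumericReferences is hypothetical, and the sketch deliberately ignores the special cases the specification defines (for example, the remapping of certain code points, and references written without a trailing semicolon). It is only meant to illustrate that different spellings collapse into the same character, after which nothing records which spelling was used.

```javascript
// Hypothetical helper: resolves decimal and hexadecimal character references.
// Simplified on purpose; it does not implement the specification's special cases.
function decodeNumericReferences(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (whole, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Numbers beyond the Unicode range are left untouched in this sketch.
    if (codePoint > 0x10FFFF) return whole;
    return String.fromCodePoint(codePoint);
  });
}

const fromDecimal = decodeNumericReferences("situa&#231;ão");
const fromHex = decodeNumericReferences("situa&#xE7;ão");

// Both spellings produce the same string, so the original form cannot be told apart afterwards.
console.log(fromDecimal === fromHex);
```

Once the reference has been turned into a character token, the result is indistinguishable from a "ç" typed directly in the markup, which is why the DOM cannot answer the question.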

That is, once the HTML document has been parsed and the resulting DOM has been built, the information about which HTML entities were used is gone: they have been replaced by their corresponding characters. innerHTML (and outerHTML) returns the text with > escaped because it re-serializes the content for you, regardless of how it was originally written in the markup:

alert(document.querySelector('p').innerHTML);

<p>situa&ccedil;ão > ativo</p>
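A rough sketch of that re-serialization step may help explain the result. It assumes, following the HTML serialization rules for text nodes, that only "&", the no-break space, "<" and ">" are escaped; the function name escapeTextForSerialization is hypothetical, and this is an approximation rather than the browser's actual code.

```javascript
// Hypothetical helper approximating how a text node is escaped when serialized.
// Assumption: only &, U+00A0, < and > are rewritten; every other character is kept as-is.
function escapeTextForSerialization(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later replacements are not double-escaped
    .replace(/\u00A0/g, "&nbsp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The text node holds resolved characters: "ç" (U+00E7), "ã" (U+00E3) and a literal ">".
const serialized = escapeTextForSerialization("situa\u00E7\u00E3o > ativo");
console.log(serialized);
```

Under these assumptions, the ">" comes back as "&gt;" whether or not it was an entity in the source, while "ç" stays a literal character even though it was written as &ccedil;.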

Therefore, if you really need this information, you will have to obtain it before the document text reaches the browser as HTML: on the server side, for example, or, if you are fetching the text via Ajax, by analyzing the text before creating elements from it. Unfortunately, this means analyzing the HTML text yourself, which is far from trivial. Perhaps some parsing library can both handle the entities (preferably recognizing all those defined in the specification) and preserve their original text, but I do not know of one offhand (nor do I know which environment you are working in).

Great answer, thank you very much! I am using PHP on the server. Since I need the string lengths to be the same on the server and on the client, maybe the solution is to use html_entity_decode() and .textContent / .innerText.
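As a starting point for that kind of analysis, a minimal sketch follows. It assumes the raw markup is already available as a string (for instance from the server or from an Ajax response) and uses a hypothetical helper, findCharacterReferences, to list whatever looks like a character reference. It is only a regular-expression scan, not an HTML parser: it does not check names against the specification's table, it misses legacy references written without a semicolon, and it does not skip contexts such as script elements or comments.

```javascript
// Hypothetical helper: lists substrings of a raw HTML string that look like
// character references, together with their positions in the string.
function findCharacterReferences(rawHtml) {
  // Hexadecimal (&#x...;), decimal (&#...;) or named (&name;) forms, semicolon required.
  const pattern = /&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  const found = [];
  for (const match of rawHtml.matchAll(pattern)) {
    found.push({ reference: match[0], index: match.index });
  }
  return found;
}

const raw = "<p>situa&ccedil;ão &gt; ativo</p>";
const refs = findCharacterReferences(raw);

// Lists &ccedil; and &gt;, but not the literal "ã".
console.log(refs.map((r) => r.reference));
```

Because the scan runs on the text before it is handed to the browser, it can still distinguish "ç" written as an entity from "ã" written as a literal character.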

– Pedro Sanção

2015/08/21 at 11:54


by Tobias Mesquita • **22,900** points · Answer 1 · 2015-08-20T23:30:31+00:00

Just to complement: since the content is obtained externally, you can request it through an Ajax request and read the raw response text, in which the original entities are still intact.

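// Elements used to display the results
var external = document.getElementById("external");
var innerHTML = document.getElementById("innerHTML");
var responseText = document.getElementById("responseText");

// Simulate the external source with an in-memory HTML document
var blob = new Blob(["<p>situa&ccedil;ão &gt; ativo</p>"], { type: "text/html" });
var url = URL.createObjectURL(blob);

var xmlHttp = new XMLHttpRequest();
xmlHttp.onreadystatechange = function ()
{
    if (xmlHttp.readyState === 4 && xmlHttp.status === 200)
    {
        // Let the browser parse the markup; the entities are resolved here
        external.innerHTML = xmlHttp.responseText;
        // Re-serialized markup: only some characters are escaped again
        innerHTML.value = external.innerHTML;
        // Raw response text: the original entities are still present
        responseText.value = xmlHttp.responseText;
    }
};

xmlHttp.open("GET", url, true);
xmlHttp.send();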

div {
    margin-bottom: 5px;
}

label {
    display: inline-block;
    width: 100px;
    text-align: right;
}

input {
    width: 400px;
}

<div id="external">
    
</div>
<div>
    <label for="innerHTML">innerHTML:</label>
    <input id="innerHTML" type="text" readonly />
</div>
<div>
    <label for="responseText">responseText:</label>
    <input id="responseText" type="text" readonly />
</div>