When I compare two strings with the "greater than" and "less than" operators, what am I comparing?

Question

Asked 5 years, 5 months ago

var a = "a";
var b = "b";

if (a < b) // true
  console.log(a + " is less than " + b);
else if (a > b)
  console.log(a + " is greater than " + b);
else
  console.log(a + " and " + b + " are equal.");

The example above produces a boolean value (true / false) when the greater-than (>) and less-than (<) operators compare the values of 'a' and 'b', which are strings.

But what is actually compared to arrive at true or false? Is it the string length, the size in bytes, or something else?

I'm going to edit your code and replace print() with console.log(), because it gets in the way.

– Augusto Vasques

2020/03/18 at 16:27
Yes! I had actually taken this code from the MDN website (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Comparing_strings) and forgot to change it.

– marquinho

2020/03/18 at 16:31

3 answers

According to the documentation:

Strings are compared based on standard lexicographical ordering, using Unicode values.

That is, the comparison is lexicographic and takes into account the Unicode code points of the string. To better understand what a code point is, read here.

Very briefly, each character (not only letters, but also digits, spaces, punctuation marks, emojis, etc.) has an associated numerical value, called a code point. When two strings are compared, the numeric values (code points) corresponding to each character are what is compared.

In this case, the letter a corresponds to code point U+0061 (61 in hexadecimal, or 97 in decimal), and the letter b to code point U+0062 (62 in hexadecimal, 98 in decimal). That's why the string 'a' is considered "smaller" than the string 'b'.
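
To check these values yourself, here is a minimal sketch using the standard codePointAt method. The helper name describeChar is only illustrative and is not part of the language:

```javascript
// Hypothetical helper: shows a character's code point in decimal
// and in the usual U+XXXX notation.
function describeChar(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16).toUpperCase().padStart(4, '0');
  return ch + ' = ' + cp + ' (U+' + hex + ')';
}

console.log(describeChar('a')); // a = 97 (U+0061)
console.log(describeChar('b')); // b = 98 (U+0062)
console.log('a'.codePointAt(0) < 'b'.codePointAt(0)); // true
```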

And that has nothing to do with string size:

console.log('abacate' < 'bola'); // true
console.log('abacate' < 'abra'); // true

According to the algorithm described in the language specification (item 3, when both operands are strings), the first character of each string (the value of its code point) is compared. If they are equal, the second characters are compared, and so on, until a differing pair is found.

In the first case ('abacate' < 'bola'), the first characters of the strings are a and b. Since a is less than b (the code point of a is less than the code point of b), the string 'abacate' is "smaller" than the string 'bola'.

In the second case ('abacate' < 'abra'), the first and second characters of the strings are equal (both start with 'ab'), but at the third character we have a and r. Since a is "less" than r (the code point of a is U+0061 and that of r is U+0072), the string 'abacate' is smaller than the string 'abra'.

String length is only relevant when one string is a prefix of the other, in which case the shorter string is considered smaller:

console.log('aba' < 'abacate'); // true
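
To see which position decides each of the comparisons above, a small sketch can help. The function firstDifference is a hypothetical helper written for this explanation; it indexes the strings by UTF-16 code unit, which is enough for these ASCII examples:

```javascript
// Hypothetical helper: finds the first position where two strings differ.
// Returns null when one string is a prefix of the other (or they are equal),
// which is the only situation where length decides the result.
function firstDifference(x, y) {
  const shortest = Math.min(x.length, y.length);
  for (let i = 0; i < shortest; i++) {
    if (x[i] !== y[i]) {
      return { index: i, left: x[i], right: y[i] };
    }
  }
  return null;
}

console.log(firstDifference('abacate', 'bola')); // { index: 0, left: 'a', right: 'b' }
console.log(firstDifference('abacate', 'abra')); // { index: 2, left: 'a', right: 'r' }
console.log(firstDifference('aba', 'abacate'));  // null
```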

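Remember that this is not restricted to letters, because every character has a code point. So we can have things like:

console.log('💩' > '丵124'); // true

Because the emoji "💩" also has a code point (U+1F4A9), whose value is greater than the code point of the character 丵 (U+4E35). Strictly speaking, JavaScript compares UTF-16 code units, so what is compared here is the emoji's first code unit (0xD83D) against 0x4E35; the result is the same.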
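
One detail worth checking for characters outside the Basic Multilingual Plane: the < and > operators work on UTF-16 code units, not on whole code points. The sketch below uses escape sequences so the characters are unambiguous. The fullwidth tilde is just a character I picked to show a case where the two orderings disagree:

```javascript
const emoji = '\u{1F4A9}'; // the emoji with code point U+1F4A9
const cjk = '\u4E35';      // the character 丵

console.log(emoji.codePointAt(0).toString(16)); // 1f4a9
console.log(emoji.charCodeAt(0).toString(16));  // d83d (first UTF-16 code unit)
console.log(emoji > cjk + '124');               // true: 0xd83d > 0x4e35

// A case where code unit order and code point order disagree:
const tilde = '\uFF5E';    // fullwidth tilde, code point U+FF5E
console.log(tilde.codePointAt(0) < emoji.codePointAt(0)); // true  (0xff5e < 0x1f4a9)
console.log(tilde < emoji);                               // false (0xff5e > 0xd83d)
```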

And remember a classic "trap", which is to compare strings that contain digits:

console.log('10000' > '2'); // false

Since we are comparing strings, the code points of the characters are what count: the character 1 has code point U+0031, while the character 2 has code point U+0032, and therefore the string '10000' is considered smaller than the string '2'.

If you want to compare numerical values, you must transform the strings into numbers, for example using parseInt:

console.log(parseInt('10000') > parseInt('2')); // true, because now they are numbers, not strings
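
The same trap shows up when sorting, since Array.prototype.sort compares elements as strings by default. A small sketch, using Number instead of parseInt for the conversion (either works for these values):

```javascript
const values = ['10000', '2', '33'];

// Default sort compares the elements as strings, character by character.
const asStrings = [...values].sort();
console.log(asStrings); // [ '10000', '2', '33' ]

// Converting to numbers inside the comparator sorts by numeric value.
const asNumbers = [...values].sort((x, y) => Number(x) - Number(y));
console.log(asNumbers); // [ '2', '33', '10000' ]
```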

Remember that Unicode also hides its own "traps":

console.log('á' < 'á'); // true

This happens because the first á is in NFD and the second is in NFC. To better understand what this means, I suggest you read here, here and here. In short, Unicode defines two different ways of representing the letter a with an acute accent:

composed form (NFC), as a single code point: the character á itself
decomposed form (NFD), as two code points: the character a (without accent) and the accent itself (code point U+0301)

Both appear the same way on screen (á); only by "digging into the bits" of the strings can we see how many code points each one has:

// show the code points of the string
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)).join(' '); }

// string in NFD, has 2 code points
console.log(codepoints('á')); // 61 301
// string in NFC, has 1 code point
console.log(codepoints('á')); // e1

So the first string above actually has two code points, the first being the letter a, which we have already seen is code point U+0061. The first code point of the second string is the character á, whose value is U+00E1, and so the first string is considered "smaller".

This can be solved by normalizing both to the same form (something like 'á'.normalize('NFC'), for example), but what exactly to do will depend on each case.
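
As a rough sketch of that idea, the snippet below writes both forms with escape sequences (so they cannot be confused visually) and compares them after normalizing to NFC. The name compareNormalized is a hypothetical helper, and choosing NFC rather than NFD is an assumption for this example:

```javascript
const decomposed = 'a\u0301'; // NFD: letter a + combining acute accent
const composed = '\u00E1';    // NFC: the single code point U+00E1

console.log(decomposed === composed); // false
console.log(decomposed < composed);   // true

// Hypothetical helper: compares after normalizing both sides to NFC.
function compareNormalized(x, y) {
  const nx = x.normalize('NFC');
  const ny = y.normalize('NFC');
  if (nx < ny) return -1;
  if (nx > ny) return 1;
  return 0;
}

console.log(compareNormalized(decomposed, composed)); // 0
```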

There is also the localeCompare method, which compares strings according to a specific locale (that is, according to the rules of a given language, because this varies a lot: accented characters can come before or after unaccented ones, some languages have a different alphabetical order, etc.). But I believe that is a little outside the scope of the question (anyway, you can see more details here).

Great answer! I understood everything you said; simple and straight to the point. Thank you very much.
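
Just to give an idea of the difference, here is a minimal sketch. The word list and the 'pt' locale tag are assumptions chosen for illustration, and the exact order produced by localeCompare depends on the locale data available in the runtime, so no expected output is shown for the last line:

```javascript
const words = ['zebra', '\u00E1gua', 'abacate']; // '\u00E1gua' is "água" in NFC

// Default sort uses the comparison discussed above:
// 'á' (U+00E1) is greater than 'z' (U+007A), so "água" ends up last.
const byCodeUnit = [...words].sort();
console.log(byCodeUnit); // [ 'abacate', 'zebra', 'água' ]

// localeCompare returns a negative number, zero, or a positive number.
console.log('a'.localeCompare('b', 'pt') < 0); // true

// Sorting with the rules of the given locale.
const byLocale = [...words].sort((x, y) => x.localeCompare(y, 'pt'));
console.log(byLocale);
```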

– marquinho

2020/03/18 at 21:37

by Maniero • **444,682** points · Answer 1 · 2020-03-18T16:36:07+00:00

Either way, you are comparing the text itself. The characters of the two strings are compared pairwise, in the order they appear, to determine whether the strings are equal or one is greater or lesser.

This is done with an internal loop, so it can be a little slow. It may not be, though, when the difference can be determined early. It may be that only one character needs to be compared: if it is greater or smaller than the character in the same position of the other string, the result is already known and there is no need to continue. If it is equal, the next character must be examined, and so on until one of the strings runs out of characters. If the strings are the same, the whole text has to be checked, which can be a little slow, but there is no other way.

Of course there are optimizations, such as checking whether the object is null, whether the length is 0, or even whether the lengths differ. At least for equality and inequality tests, strings of different lengths are already known to be different. The length alone does not tell you which one is greater or smaller, if that is what you need.
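
To make the length shortcut concrete, here is a sketch of an equality check written by hand. It is only an illustration: sameText is a hypothetical function, and how a real engine implements this internally is not something this example claims to show.

```javascript
// Hypothetical illustration of the length shortcut for equality tests.
function sameText(x, y) {
  if (x.length !== y.length) return false; // different lengths: cannot be equal
  for (let i = 0; i < x.length; i++) {
    // stop at the first position where the code units differ
    if (x.charCodeAt(i) !== y.charCodeAt(i)) return false;
  }
  return true; // every position matched
}

console.log(sameText('aba', 'abacate')); // false, decided by length alone
console.log(sameText('abra', 'abro'));   // false, decided at the last character
console.log(sameText('bola', 'bola'));   // true
```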

In fact the comparison is a little more complicated than that under certain collations, but that's the general idea. It is precisely the collation that determines the exact rule for comparing text.

Roughly, each character is checked against the character in the same position of the other text to see whether it is equal, greater or smaller, according to an established table, usually an alphabetic one. You can see more about these tables in What are the main differences between Unicode, UTF, ASCII, ANSI?.

Remember that characters are graphic representations listed in these tables, and what is actually compared are numbers, so it is a simple numerical comparison.

Note that, depending on the collation, there may be very specific rules; take this character-by-character comparison as a way to make the idea easier to understand. If you want the details you would have to research them. It is much more complicated and does not matter to most people.

by Rafael Tavares • **4,528** points · Answer 2 · 2020-03-18T16:33:21+00:00

According to the Mozilla documentation:

Strings are compared based on standard lexicographic ordering, using Unicode values.

So strings are compared character by character. You can find more information at javascript.info/comparison, which is in English.

Quote translated freely:

To check whether a string is greater than another, JavaScript uses the so-called "dictionary" or "lexicographical" order.

The algorithm for comparing two strings is simple:

1. Compare the first character of both strings.

2. If the first character of the first string is greater (or less) than that of the other string, then the first string is greater (or less) than the second.

3. Otherwise, if both first characters are the same, compare the second characters in the same way.

4. Repeat until you reach the end of either string.

5. If both strings end at the same length, they are equal. Otherwise, the longer string is greater.
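
The steps above can be sketched as a function. This is only an illustration of the quoted algorithm, not the engine's implementation. compareLexicographic is a hypothetical name, and it compares UTF-16 code units through charCodeAt:

```javascript
// Hypothetical function following the steps quoted above.
// Returns -1 if x is smaller, 1 if x is greater, 0 if they are equal.
function compareLexicographic(x, y) {
  const shortest = Math.min(x.length, y.length);
  for (let i = 0; i < shortest; i++) {
    const cx = x.charCodeAt(i);
    const cy = y.charCodeAt(i);
    if (cx < cy) return -1; // the first differing character decides
    if (cx > cy) return 1;
  }
  // Reached the end of one string: equal lengths mean equal strings,
  // otherwise the longer string is the greater one.
  if (x.length === y.length) return 0;
  return x.length < y.length ? -1 : 1;
}

console.log(compareLexicographic('abacate', 'bola')); // -1
console.log(compareLexicographic('abra', 'abacate')); // 1
console.log(compareLexicographic('aba', 'abacate'));  // -1
console.log(compareLexicographic('10000', '2'));      // -1
console.log(compareLexicographic('bola', 'bola'));    // 0
```

For any pair of strings, the sign of the result should agree with what the built-in < and > operators report.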