Just complementing, the solution proposed with StringBuilder
works very well, but there are cases that are not covered, if we take into account all the possibilities of Unicode.
An example is when characters are accented. For example, í
(the letter i
with acute accent), according to Unicode, can be represented in two ways:
- like the code point U+00ED (LATIN SMALL LETTER I WITH ACUTE) (
í
)
- as a combination of two code points:
Code points are numeric values that Unicode defines for each character. To better understand, I recommend this article (in English) and this question (in Portuguese).
The first option is called Normalization Form C (NFC) and the second, Normalization Form D (NFD) - both described here, and there’s also an explanation in this question. In the NFD form, the character that corresponds to the acute accent (COMBINING ACUTE ACCENT) is applied to the previous character (in this case, the letter i
- LATIN SMALL LETTER I), and when you print the String
, both "come together", forming the í
.
It is possible to have Strings
in both forms, and when printing them, both will be shown as í
. But internally, the code points will be different. The same goes for other accented characters, such as ô
, ã
and even characters that are not necessarily accentuated, such as the ç
, which in NFD form results in the letter c
followed by character cedilla (COMBINING CEDILLA).
Knowing this is important because the method StringBuilder::reverse
reverses the order of the code points, which means that the result may not be as expected, depending on how the String
is. To show what happens when the String
is in NFD, I will use the class java.text.Normalizer
, to create a String
standardised in NFD, and then I will use the StringBuilder
to invert it:
// normalizar em NFD
String s = Normalizer.normalize("ímãs", Normalizer.Form.NFD);
System.out.println(s); // ímãs
s = new StringBuilder(s).reverse().toString();
System.out.println(s); // s̃aḿi
The method Normalizer::normalize
returns to String
standard in NFD. This means that í
is stored internally as two code points: the letter i
and the acute accent. The same goes for the ã
, which in NFD form corresponds to two code points: the letter a
and the tilde.
Note that when printing, these code points are shown as one thing (the characters í
, ã
). But by reversing the String
, the tilde was on top of the s
and the seat was on top of the m
. This happens because the StringBuilder
reversed all code points. In other words:
- Code points of
String
original (ímãs
) standard in NFD:
StringBuilder::reverse
reverses the codepoints:
After the reversal made by StringBuilder::reverse
, the acute accent was after the letter m
, so it is applied in this letter, and the i
is without accent. The same goes for the tilde, which after the inversion, was after the s
and so it is applied in this letter (and not in a
). If your console doesn’t show the output correctly, follow an image of how it looked here:
The StringBuilder
will only work if the String
is standardised in NFC, in which case the accented characters will be í
(LATIN SMALL LETTER I WITH ACUTE) and the ã
(LATIN SMALL LETTER A WITH TILDE). As they are represented by only a single code point, the accent of these characters does not "change places" after the call to StringBuilder::reverse
.
The problem is that you can receive the String
in any form, but if you print it, it will always be shown as ímãs
. In the above example I forced String
to be in NFD and so you may find that this will not happen in "normal conditions", but it may happen to a user to copy (Ctrl+C) a text from somewhere and paste (Ctrl+V) in its application, and if the original text is in NFD, the accents of the String
reversed will be changed. There is no way to control this (the original text may be in any other program the user is using, and the user does not even know if it is in NFD or NFC, because the screen always shows the accent correctly) - see an example at this link (copy and paste the words mañana
(NFC) and mañana
(NFD), and see the difference).
Then a solution would be to use the method Normalizer::isNormalized
to check whether the String
is standardised in NFC, and if not, normalise before inverting it:
// inverte a String mantendo os acentos
if (! Normalizer.isNormalized(s, Normalizer.Form.NFC)) { // se não está em NFC, normaliza
s = Normalizer.normalize(s, Normalizer.Form.NFC);
}
s = new StringBuilder(s).reverse().toString(); // sãmí
With this, accents are preserved in the correct characters and the result is sãmí
(if you want, you can call normalize
direct, because if the string is already in NFC, it will not be changed). But this still does not solve all the cases.
Emojis
Emojis add a new layer of complexity to this story. And as their use is becoming more common, it’s important to know how to work with them.
Usually emojis correspond to a code point, but that’s not always the case. The emojis of families, for example, are combinations of other emojis. For example, a family with father, mother and 2 daughters is actually a combination of the emoji of a man, one female and two emojis of girl. To join them, the character is used ZERO WIDTH JOINER (also called only ZWJ - and these sequences of emojis separated by ZWJ are called Emoji ZWJ Sequences).
That is, the emoji of "family with father, mother and 2 daughters" is actually a sequence of seven code points:
This sequence of codepoints can be displayed in different ways. If the system/program used recognizes this sequence, a single family image is shown:
But if this sequence is not supported, the emojis are shown next to each other:
Regardless of how they are shown, because it is a sequence that can correspond to a single symbol, we cannot change the order of the code points, because then it is no longer interpreted as the Emoji ZWJ Sequence of "family with father, mother and 2 daughters". And if we use StringBuilder::reverse
(even if we normalize to NFC, as suggested above), the order of these code points will be changed to "GIRL, ZWJ, GIRL, ZWJ, WOMAN, ZWJ, MAN" and we will no longer have the "family emoji".
There are still other sequences that are not separated by ZWJ, such as country flag emojis. Instead of creating a code point for each flag of each country existing in the world, region indicator symbols were created (Regional Indicator Symbols), which are characters that correspond to letters from A to Z (not letters themselves, characters correspondent to these letters, but which are another block of Unicode).
The codes follow the definition of ISO 3166, in which each country has a 2-letter code. Australia, for example, has the code "AU", so to make the emoji of the flag of Australia, just take the characters corresponding to the letters "A" and "U" in the Regional Indicator Symbols, which in this case are:
With this, systems/programs can display these 2 code points as the Australian flag:
Or, if this is not supported, a representation of the characters themselves is shown:
If your console doesn’t even support the representation of these characters, it might just show ?
or blank squares.
Regardless of how it is shown, the two Regional Indicator Symbols correspond to a single "character" (a single "drawing on the screen"), and should be treated as if they were one thing.
But if we use StringBuilder::reverse
, the code points are reversed and the result is the Regional Indicator Symbols corresponding to "UA". And as "AU" is the code of Ukraine, the resulting emoji is the flag of Ukraine:
// Bandeira da Austrália
int[] cp = { 0x1f1e6, 0x1f1fa };
String australia = new String(cp, 0, cp.length);
System.out.println(australia); // Bandeira da Austrália
String s = new StringBuilder( // normalização não muda nada neste caso
Normalizer.normalize(australia, Normalizer.Form.NFC))
// inverte os codepoints, passando a ser "UA" em vez de "AU"
.reverse().toString();
System.out.println(s); // Bandeira da Ucrânia
If your console is not compatible with emojis, follow an image of what the output of this code would look like:
Or, in case the flag emojis are not supported, it may be that the output is:
In order for this not to happen, the two Regional Indicator Symbols ("A" and "U") should be treated as if they were one thing, but with StringBuilder
that is not possible.
Technically, the sequence of two Regional Indicator Symbols represents the country/region (not necessarily the flag itself), but many systems often display the respective flags when these characters are printed.
Finally, in addition to the cases already mentioned, there are many others foreseen in Unicode, in which two or more code points are interpreted as a single "character". The name given to these sequences is Grapheme Cluster.
Like StringBuilder::reverse
simply inverts the code points, without considering whether they are part of a Grapheme Cluster, the solution to correctly reverse a String
is more complicated than simply calling this method.
A solution option is to read the unicode document describing the Grapheme Clusters and implement it. I leave it as exercise for the reader
Another option is to see if someone smarter has ever done it. An example is this project at Github, that I found very interesting, because you recognize Emoji ZWJ Sequences and emojis of flags as a single Grapheme Cluster. With this it is possible to capture these clusters and invert the String
correctly.
An example of code with this API:
import com.github.bhamiltoncx.UnicodeGraphemeParsing;
import com.github.bhamiltoncx.UnicodeGraphemeParsing.Result;
// criar uma String com vários Grapheme Clusters
int[] cp = {
// emoji de família
0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467,
// bandeira da Austrália
0x1f1e6, 0x1f1fa,
// outros emojis
0x1f4a9, 0x1f4b0 };
String s = new String(cp, 0, cp.length)
// juntar com caracteres acentuados na forma NFD
+ Normalizer.normalize("áñabc", Normalizer.Form.NFD);
System.out.println(s); // Mostrar a String original (antes de inverter)
// obter os Grapheme Clusters
List<Result> graphemeClusters = UnicodeGraphemeParsing.parse(s);
Collections.reverse(graphemeClusters); // inverter a lista de Grapheme Clusters
// construir a String invertida
StringBuilder reverse = new StringBuilder();
for (UnicodeGraphemeParsing.Result grapheme : graphemeClusters) {
// obter o trecho da String que corresponde ao Grapheme Cluster
String str = s.substring(grapheme.stringOffset, grapheme.stringOffset + grapheme.stringLength);
reverse.append(str);
}
String invertida = reverse.toString();
System.out.println(invertida);
The method UnicodeGraphemeParsing::parse
gets the list of grapheme clusters of a String
. Then I use it Collections.reverse
to invert this list, and I walk it, keeping the respective parts of String
in a StringBuilder
, to finally have the String
reversed. The output of the code is:
Remember that if your console does not support ZWJ Sequences or flags, the family can be shown as 4 emojis (man, woman, 2 girls) and the flag can be shown as the Regional Indicator Symbols themselves, as previously mentioned. But the important thing is that these sequences are treated as if they were one thing, not occurring the problems already mentioned (change the flag and undo the ZWJ Sequence).
For comparison purposes, see the result with StringBuilder::reverse
, normalizing to NFC and NFD:
System.out.println("NFC:" + new StringBuilder(Normalizer.normalize(s, Normalizer.Form.NFC)).reverse().toString());
System.out.println("NFD:" + new StringBuilder(Normalizer.normalize(s, Normalizer.Form.NFD)).reverse().toString());
The exit is:
Note that converting to NFC only corrects accents, but none of the ways prevents the problem of reversing the flag and ZWJ Sequence code points.
Regex
If you do not want to use an external library, starting from Java 9 it is possible to use \X
to check a Unicode Extended grapheme cluster - which is explained in detail here.
However, in the tests I did, he doesn’t recognize a Emoji ZWJ Sequence, because using \X
, each emoji of the sequence is considered a Grapheme Cluster separate. The flag emojis, however, are recognized correctly. Using as example the same String
of the previous example:
int[] cp = {
// emoji de família
0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467,
// bandeira da Austrália
0x1f1e6, 0x1f1fa,
// outros emojis
0x1f4a9, 0x1f4b0 };
String s = new String(cp, 0, cp.length)
// juntar com caracteres acentuados na forma NFD
+ Normalizer.normalize("áñabc", Normalizer.Form.NFD);
Pattern p = Pattern.compile("\\X");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
In this example I am looking for all Grapheme Clusters with \X
(remembering that inside Strings
the backslash is written as \\
). The result, however, shows each emoji of ZWJ Sequence separately (the flag, however, is correct):
In the Java 13 improved a little, but still not perfect: the \X
recognizes the emojis of man and woman as a Grapheme Cluster, and the two girl emojis as other (instead of considering all one thing only).
One solution would be to try to modify the regex to get a ZWJ Sequence or one \X
:
Pattern p = Pattern.compile("\\p{InMiscellaneous_Symbols_and_Pictographs}(?:\\u200d\\p{InMiscellaneous_Symbols_and_Pictographs})*|\\X");
\p{InMiscellaneous_Symbols_and_Pictographs}
serves to catch any code point that is on block Miscellaneous Symbols and Pictographs (click on the link and see that the list is great). Next I see if there are one or more occurrences of ZWJ (\u200d
) followed by another block symbol Miscellaneous Symbols and Pictographs. Using this regex, the Grapheme Cluster family is recognized as a single "character".
Then it would be enough to save the results of m.group()
on a list, invert it and build the String
inverted:
List<String> grafemas = new ArrayList<>();
while(m.find()) {
grafemas.add(m.group());
}
Collections.reverse(grafemas);
String invertida = String.join("", grafemas);
With that, the String
reversed is the same as in the previous example with UnicodeGraphemeParsing::parse
. But there’s a catch.
The block Miscellaneous Symbols and Pictographs does not contain all emojis. There are several that are in other blocks, such as the heart, who’s on the block Dingbats, and many others scattered over other blocks, such as the Block Emoticons, which contains the Smiles and other symbols (plus there are blocks can have other characters that are not emojis). Finally, try to make a regex that contemplates all possible emojis it’s not that simple. Another option is to limit yourself to ZWJ Sequences described in Unicode, making a regex that contemplates only those sequences (even so it is a considerable job).
Java <= 8
In Java 8 there is no support for \X
, but I found an alternative in this reply by Soen. Unfortunately, it’s nothing simple:
String extendedGraphemeCluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";
I did a test with this regex and unfortunately she does not recognize the flag emojis as if it were a single Grapheme Cluster (each Regional Indicator Symbol is considered a separate cluster) - in addition to also breaking the ZWJ Sequence into several emojis, in the same way that \X
. That is, in addition to trying to understand this regex, you will still have to modify it to contemplate these cases. An attempt:
String extendedGraphemeCluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";
String zwjSequence = "\\p{InMiscellaneous_Symbols_and_Pictographs}(?:\\u200d\\p{InMiscellaneous_Symbols_and_Pictographs})*";
String flag = "[\\x{1f1e6}-\\x{1f1ff}]{2}";
// regex com ZWJ Sequence, ou sequência de bandeira, ou grapheme cluster
Pattern p = Pattern.compile(zwjSequence + "|" + flag + "|" + extendedGraphemeCluster);
With that, the ZWJ Sequences and the flag emojis are recognized as a single Grapheme Cluster (with all the aforementioned holds on the block Miscellaneous Symbols and Pictographs not having all emojis), but unfortunately it doesn’t work for Korean characters: for example, the character 각
in the NFD form consists of 3 codepoints: ᄀ
(HANGUL CHOSEONG KIYEOK), ᅡ
(HANGUL JUNGSEONG A) and ᆨ
(HANGUL JONGSEONG KIYEOK ). But the above regex does not recognize these 3 code points as a single Grapheme Cluster. Already the previous solutions (UnicodeGraphemeParsing::parse
and \X
from Java 9) recognize this Grapheme Cluster correctly.
In that case, I think I’d better go to a library, like the one I indicated above - did not test for all possible cases, such as characters of different languages (tested only with Japanese and Korean characters and worked), but everything depends on their use cases.
That doesn’t mean that the solution with StringBuilder::reverse
is wrong. If you are only dealing with ASCII characters (or texts in Portuguese/English/any language that uses the Latin alphabet), just remember to normalize the String
for NFC that will not have problems (maybe there is a case or other, will know, but for most cases that only use ASCII should work).
I just wanted to show that there are several other cases that, should they arise, will require a more comprehensive (and complicated) solution. And I recognize that my answer is not complete (in the sense of contemplating all the possibilities foreseen by Unicode), because I am still beginning to deepen myself in the subject (Unicode is much more complicated than I thought) and surely there are many more details that I still do not know (I only know that I know nothing...)
It worked thanks, our so simple I was complicating very vlw
– Sarah
Stringbuilder is a class?
– Sarah
@Sarah yes, it’s a class for manipulating strings, as well as stringbuffer.
– user28595
Thanks, it worked out
– Sarah
@Sarah I don’t know how
JOptionPane
handles accented characters or emojis, but anyway put an answer below complementing the subject– hkotsubo
@hkotsubo that runs away completely to what was asked.
– user28595
@Articunol I think not, the question is "how to invert a String", and a String can have accents, emojis, etc. Unless Joptionpane does not support these characters.
– hkotsubo
@exact hkotsubo, the question did not quote anything that you spoke, so much that answer was accepted. If there was a need to treat accents or emojis, which had never been mentioned by the author, the answer would surely have been edited or the author mentioned.
– user28595
@Articunol I understand your point of view, but in what way, if you search in google "how to invert java string", this question is one of the first results, and so wanted to leave a more "general" solution for future visitors. Please don’t misunderstand me, I didn’t mean that your answer is wrong (I even voted for it), I just wanted to record that, depending on the case, Stringbuilder may not solve.
– hkotsubo
@hkotsubo I understand, there is no problem in your answer, which even turned out great, by the way. I only commented because you made a point of quoting the author here in the comments, which may lead to a misinterpretation of her that my answer is wrong. I just wanted to make it clear that at no time was mentioned the situation you addressed, and the focus of the site is objectivity, if we stop to address all the possible situations, the answers will get tiresome and no room for others, as you yourself did here, respond with complements.
– user28595
@Articunol Yeah, now I realize that I may have given the wrong impression to the author. A thousand apologies for the misunderstanding, was not my intention
– hkotsubo