To detect a zalgo text, first we need to understand how it is done.
Unicode Combining Characters
In the Unicode there is the definition of Combining characters. Basically, in a way well summarized, are characters that can be combined with others, to form different characters.
An existing example in English are the accents (acute and circumflex), the tilde and the cedilla. So if you combine the letter a
tiny with the acute accent (COMBINING ACUTE ACCENT), the result is the character á
(letter "a" accentuated).
There are currently more than 2000 characters with this feature that can be combined with others, listed in the categories Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing).
The fact is that there are many languages in which it is perfectly valid to have more than one Combining applied to the same letter, which is why Unicode allows this. In the case of the text that is in the question, for example, the first characters are:
character |
code point |
Category |
Name |
T |
U+0054 |
Lu |
LATIN CAPITAL LETTER T |
̃ |
U+0303 |
Mn |
COMBINING TILDE |
͟ |
U+035F |
Mn |
COMBINING DOUBLE MACRON BELOW |
͏ |
U+034F |
Mn |
COMBINING GRAPHEME JOINER |
̧ |
U+0327 |
Mn |
COMBINING CEDILLA |
̟ |
U+031F |
Mn |
COMBINING PLUS SIGN BELOW |
͓ |
U+0353 |
Mn |
COMBINING X BELOW |
̯ |
U+032F |
Mn |
COMBINING SHORT INVERTED BELOW |
̘ |
U+0318 |
Mn |
COMBINING LEFT TACK BELOW |
͓ |
U+0353 |
Mn |
COMBINING X BELOW |
͙ |
U+0359 |
Mn |
COMBINING ASTERISK BELOW |
͔ |
U+0354 |
Mn |
COMBINING LEFT ARROWHEAD BELOW |
To understand what a code point is, read here.
That is, the text of the question begins with the letter "T" followed by 11 Combining characters. Only this sequence of code points produces this:
T
Follow an image if it doesn’t render correctly:
The rest of the zalgo text follows the same pattern: one letter followed by several Combining characters.
Another feature that makes the zalgo text stay that way so peculiar is the algorithm of Stacking (defined in same document already cited), that defines what can happen when more than one Combining is applied to the same letter.
Basically, every new Combining may end up being rendered above or below existing ones (each has a rule that says in which position it should go, but the exact rendering also depends on the source used). See below what happens when we add Combining characters the letter T (for each line, a new Combining is added to existing):
T <-- letter "T" without any Combining
T <-- added COMBINING TILDE
T <-- added COMBINING DOUBLE MACRON BELOW
T <-- added COMBINING GRAPHEME JOINER
T <-- added COMBINING CEDILLA
T <-- added COMBINING PLUS SIGN BELOW
T <-- added COMBINING X BELOW
T <-- added COMBINING INVERTED BRIEF BELOW
T <-- added COMBINING LEFT TACK BELOW
T <-- added COMBINING X BELOW
T <-- added COMBINING ASTERISK BELOW
T <-- added COMBINING LEFT ARROWHEAD BELOW
That is, as Unicode allows several Combining characters applied to the same character, and in the rendering these are "stacked" cumulatively (can even "invade" the lines from above or below), the result is this look so characteristic of the zalgo text.
An attempt to detect zalgo text
Thus, a criterion for detecting that the text is zalgo could be check if there are "many" Combining characters followed. Something like:
$texto = // string possivelmente contendo zalgo text
if (preg_match('/\p{M}{2,}/u', $texto)) {
echo "zalgo\n";
} else {
echo "not zalgo\n";
}
The idea is to use regex with Unicode properties. In the case, \p{M}
picks any character that is in the "Mark" categories (the three already cited: Mn (Mark, Nonspacing), Me (Mark, Enclosing) and Mc (Mark, Spacing)).
And the quantifier {2,}
indicates "two or more occurrences", that is, if you have two or more occurrences Combining characters, I already consider it to be zalgo. This may be enough if you want to accept only Portuguese texts, since in this language the characters have at most one Combining.
But remember that there are languages in which this limit can be higher. Unicode defines the concept of Stream-safe Text Format, that somehow "sets a limit" of 30 Combining characters followed. But in practice, 30 is a lot, as the largest known sequence is the Tibetan character HAKṢHMALAWARAYAḌ, which is a letter followed by 8 Combining characters. It’s this one, and I admit that for those of you who don’t know it, you might as well be mistaken for zalgo text:
ཧ
Image below if you couldn’t see straight:
So unless you’re going to accept texts in Tibetan, use \p{M}{8,}
would be a valid alternative. Depending on the chosen amount, you may end up deleting valid texts in other languages (many use 2, 3 or even more Combining characters), then you will have to adjust this value according to the strings you receive, to avoid false positives.
Of course, one can also argue - and this is a very subjective discussion - that with only 2 or 3 Combining characters the text is not "zalgo enough" (see the "T" examples above, with only 2 or 3 Combining characters added), as visually it would not look so... "zalguified". That is, it is very difficult to define a precise criterion that meets all cases.
Another way to do it would be to use the extension php-intl
(don’t forget to install it). With it we can iterate through the code points of the string and check for a sequence of Combining characters larger than certain size:
// verifica se o caractere é um combining character
function isCombining($char) {
$t = IntlChar::charType($char);
return $t === IntlChar::CHAR_CATEGORY_NON_SPACING_MARK ||
$t === IntlChar::CHAR_CATEGORY_ENCLOSING_MARK ||
$t === IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK;
}
function isZalgo($text) {
$max_allowed = 2; // considera que mais de 2 combining characters já é zalgo
$i = IntlBreakIterator::createCodePointInstance();
$i->setText($text);
$seq = 0; // conta o tamanho da sequência de combining characters
foreach($i->getPartsIterator() as $c) {
if (isCombining($c)) {
$seq++;
} else {
if ($seq > $max_allowed) {
return TRUE;
}
$seq = 0;
}
}
return $seq > $max_allowed;
}
$text = // texto que você quer verificar
if (isZalgo($text)) {
// zalgo detected!
}
To another answer uses the regex [^\x20-\x7E]
, which basically excludes everything that is not ASCII. That is, even if the string has only one accented character (such as 'á'
), she is considered zalgo text, which does not seem to me to be a good result, as many valid texts would be considered invalid (and also did not understand why test the same regex twice). Anyway, it ends up being too restrictive, leaving out completely valid strings (see here an example).
In the end, it is worth what has already been said: the greater the amount of Combining characters followed that are accepted, less valid cases are left out. In compensation, increases the possibilities of zalgo text, because the more characters followed are accepted, the more combinations - including invalid ones ("invalid" will depend on the context, the language used, etc.) - are possible.
And in this case, another alternative would be to have whitelists of valid sequences of Combining characters in the languages your application will accept. For example, in Portuguese, it would be only vowels followed by one of the accents, or the letter c
followed by the cedilla (COMBINING CEDILLA). Something like:
// textos em português: testa se depois do "c" tem a cedilha, ou se tem alguma consoante
// com acento, ou se tem vogal com algo diferente do acento agudo, circunflexo e til (ou a letra "a" com crase)
// (testado em alguns casos mais básicos, precisa de mais testes para ter certeza que é eficaz)
if (preg_match('/c[^\P{M}\x{327}]|[^aeiouc]\p{M}|[eiou][^\P{M}\x{301}-\x{303}]|a[^\P{M}\x{300}-\x{303}]/iu', $texto)) {
echo "invalido\n";
}
That is, instead of detecting a zalgo text, i try to check if it is a valid text (whose definition varies according to the context). And then it doesn’t matter if the invalid text is zalgo or whatever it is. It’s more work, but it has the tradeoff to be more assertive. In the end, there is no magic solution that works in 100% of cases.
References:
Hello, try reading this here.
– Edilson