Edit: I've changed almost all of the class's behavior.
I liked the challenge this brings, so I made a class that dynamically generates a regex rule with every combination between the Blacklist items and the Cheats table, and finally filters out all Unicode characters.
The reason to prefer regex over a "degree of similarity" metric shows up when comparing "batata" with "barata": to handle that with similarity you would need both a Blacklist and a Whitelist, or a lower similarity threshold, and the complexity could end up much higher.
With a dynamic regex, adding batata to the Blacklist generates this rule:
/(b.?|3.?|8.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a|@|4|á|à|â|ã|ª|^)/gim
Matches: b\u200Bát@74, bbaattaattaa, b@t@t@, 347474, b a t a t a, b$a$t$a$t$a, etc.
No match: barata, baratata, etc.
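For reference, the generated rule can be tried directly in a console. The pattern below is copied verbatim from the rule above; note that the unescaped `^` inherited from the cheats table behaves as a line anchor inside the groups, not as a literal caret.

```javascript
// The rule generated for "batata", copied verbatim from above.
const rule = /(b.?|3.?|8.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a|@|4|á|à|â|ã|ª|^)/gim;

// String.prototype.match with a /g/ regex returns all occurrences (or null),
// so repeated calls are not affected by the regex's lastIndex state.
console.log('b\u200Bát@74'.match(rule)); // one match: the whole string
console.log('barata'.match(rule));       // null: "barata" is not caught
```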
About Unicode:
After several attempts, all failures, I created a rule, also using regex, to filter every Unicode character that does not respect the shape of a "common" character (ß, the zero-width space, ɐʇɐʇɐ).
- The function .normalize('NFKC') can transcode some Unicode characters, but unfortunately a lot of things still go wrong.
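A quick illustration of where .normalize('NFKC') helps and where it does not; these specific characters are my own examples, not part of the class:

```javascript
// NFKC folds "compatibility" look-alikes into their plain forms...
console.log('Ⓓ'.normalize('NFKC'));               // "D"   (circled letters fold)
console.log('ｂａｔ'.normalize('NFKC'));           // "bat" (fullwidth forms fold)
// ...but many tricks survive it untouched:
console.log('ɐʇɐʇɐq'.normalize('NFKC'));           // "ɐʇɐʇɐq" (turned letters stay)
console.log('b\u200Bat'.normalize('NFKC').length); // 4 (zero-width space stays)
```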
How to use:
const blacklist = new Blacklist(_your_blacklist_);
blacklist.validate('_string_to_be_validated_');
How it works:
- When instantiating the class, just pass your Blacklist (an array) to the constructor; with that, all the regex rules are generated and stored in memory.
- When calling .validate(), just pass the string to be validated. If there is a match, the function returns an object with the properties matchs and unicodes:
{
matchs: [
'batata',
'bbaattaatta',
'b@t@t@',
'347474',
'b a t a t a',
'b$a$t$a$t$a'
],
unicodes: [ '', 'ɐʇɐʇɐ', 'ß' ]
}
Or, if there is no occurrence (neither matchs nor unicodes): true
It is possible to add anything from isolated words to complete expressions to the Blacklist. For example, you may allow batata and purê separately, but if someone writes purê de batata, you want to block it.
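A reduced sketch of that idea, as a simplification of my own (without the cheat substitutions): strip accents and spaces from the blacklisted phrase, then allow one filler character between letters:

```javascript
// Strip combining accents (after NFD decomposition) and lowercase the rest.
const clean = s => s.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();

// Build a per-character pattern allowing one filler char between letters.
const phrase = clean('purê de batata').replace(/\s/g, ''); // "puredebatata"
const pattern = new RegExp(phrase.split('').join('.?'), 'i');

console.log(pattern.test(clean('Purê de batata'))); // true  -> block
console.log(pattern.test(clean('batata')));         // false -> allowed alone
```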
The idea of returning an object with all the invalid values separated into "common" matches and "unicodes" is that whoever uses it can apply a specific treatment to each situation, be it a replace(x, y) or alerting the user with all the invalid words and characters.
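For example, one possible treatment of the returned object (the shape is the one documented above; the result value is hand-written here for illustration):

```javascript
// Hypothetical result object, in the shape documented above.
const result = { matchs: ['b@t@t@'], unicodes: ['\u200B'] };

if (result === true) {
  console.log('text is clean');
} else {
  // separate treatment per category, as suggested:
  if (result.matchs.length > 0)
    console.log('blocked words:', result.matchs.join(', '));
  if (result.unicodes.length > 0)
    console.log('invalid characters found:', result.unicodes.length);
}
```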
The class is ready to be used, but it is well commented to help anyone who wants to adapt it. I also made the code much simpler than before and, at the end of the script, I put some examples into practice:
class Blacklist {
  constructor(blacklist) {
    const cheats = {
      "a": [ "@", "4", "á", "à", "â", "ã", "ª", "^" ],
      "b": [ "3", "8" ],
      "c": [ "ç", "\\(", "\\[", "\\{", "<" ],
      "d": [ ],
      "e": [ "3", "é", "ê", "&" ],
      "f": [ ],
      "g": [ "6" ],
      "h": [ ],
      "i": [ "1", "!", "í", "l" ],
      "j": [ ],
      "k": [ ],
      "l": [ "1", "!", "í", "i" ],
      "m": [ "ññ", "nn" ],
      "n": [ "ñ" ],
      "o": [ "0", "ó", "ô", "õ", "º", "\\(\\)" ],
      "p": [ ],
      "q": [ ],
      "r": [ ],
      "s": [ "5", "\\$", "z" ],
      "t": [ "7" ],
      "u": [ "ú", "v", "w" ],
      "v": [ "u", "ú" ],
      "w": [ ],
      "x": [ ],
      "y": [ "i", "l" ],
      "z": [ "2", "s" ]
    };
    /* -- CODE -- */
    try {
      // validate the blacklist that was passed in
      if (!Array.isArray(blacklist)) throw('The blacklist must be an array');
      // captures everything that is not a "common" character
      const getUnicode = /(((?!\s)(?=([^a-z0-9~!@#$%^&*()_+`'"=:{}[%?|;,./<>ªºáãàâéêíóõôºúüçñ\]\\\-\s]))).?(?=[^\x2A\x30-\x39\x41-\x5A\x61-\x7A])*)+/gim;
      // stores the regexes generated from the blacklist contents
      const blacklistRegex = [];
      if (blacklist.length > 0) {
        // normalize the blacklist items (strip accents, spaces and case)
        blacklist.forEach((item, id) => blacklist[id] = item.normalize('NFD').replace(/[\s\u0300-\u036f]/gim, '').toLowerCase());
        // walk through every blacklist item and create a regex for each one
        blacklist.forEach(item => {
          // "explode" each character into an array
          const item_split = item.split('');
          // prepare the loop variables
          let count = 0;
          let item_regex = '';
          // for each character, build the alternatives from the "cheats" entry for that character
          item_split.forEach(character => {
            count++;
            const currentCheat = cheats[character];
            if (currentCheat) {
              const item_length = item_split.length;
              const cheat_length = currentCheat.length;
              const regex_character = currentCheat.toString().split(',').join('.?|');
              const regex_last_character = currentCheat.toString().split(',').join('|');
              const cheatMid = cheat_length > 0 ? `|${regex_character}.?` : '';
              const cheatLast = cheat_length > 0 ? `|${regex_last_character}` : '';
              item_regex += count < item_length ? `(${character}.?${cheatMid})` : `(${character}${cheatLast})`;
            }
          });
          // store in memory the final regex produced by the loop
          blacklistRegex.push(RegExp(item_regex, 'gim'));
        });
      }
      this.validate = text => {
        try {
          // validate the parameters
          if (typeof text !== 'string' || text.trim().length === 0) throw('The content to validate must be a non-empty string');
          if (blacklist.length === 0) return true;
          // normalize digits (such as "\u0061" back to "a")
          text = text.normalize('NFC');
          // normalize every whitespace character to a plain space
          text = text.replace(/\s/gim, ' ');
          // stores the matches against the blacklist items, if any
          const invalids = [];
          // stores the Unicode characters, if any
          const unicodes = text.match(getUnicode) || [];
          // check every generated regex for occurrences
          blacklistRegex.forEach(regex => {
            // collect the occurrences, if there are any
            const words = text.match(regex);
            // push instead of Object.assign, which would overwrite
            // earlier results stored at the same indices
            if (words?.length > 0) invalids.push(...words);
          });
          return invalids.length > 0 || unicodes.length > 0 ? { matchs: invalids, unicodes: unicodes } : true;
        }
        catch(error) {
          console.error(error);
        }
      };
    }
    catch(error) {
      console.error(error);
    }
  }
}
/***********\
| TESTING |
\***********/
const myBlacklist = [ 'batata' ];
const blacklist = new Blacklist(myBlacklist);
const invalid = blacklist.validate( 'b\u200Bát@74 | bátátá | b$at$a$t$a | 347474 | | ɐʇɐʇɐq | b a t a t a | b-a-t-a-t-a | ß @ 7 @ 7 @ | bbaattaattaa' );
const valid = blacklist.validate( 'Nothing to be found around here' );
console.log('INVALID: \n', invalid);
console.log('VALID: \n', valid);
I hope it helps and, indirectly, thanks for the idea.
It probably has a kind of dictionary or a series of common rules based on blocks; it doesn't even need to be something sophisticated or mathematically elaborate. A replace on \u200B would already solve it, likewise treating @ the same as A, and so on. I believe there is no specific algorithm; there must be a number of "ideas" and suggestions, but it is hard to say any of them works as a rule. – Guilherme Nascimento
Removing (or blocking) the \u200B is fine, but blocking (or replacing) special characters such as @ is not ideal, because they may be required in legitimate titles. – Luiz Felipe
The removal I mention is for testing, not for saving: replace on the fly and analyze the word similarity. – Guilherme Nascimento
Oh, I see, I misread it. It really is a bit less bad than I had thought regarding the storage issue. :) – Luiz Felipe
In PHP, for example, there are already two similarity-test implementations, https://www.php.net/levenshtein and https://www.php.net/manual/function.similar-text.php, but even with those algorithms some words have to be treated before being evaluated. – Guilherme Nascimento
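For comparison, PHP's levenshtein() has a straightforward JavaScript equivalent. This is the standard dynamic-programming formulation, written here purely for illustration:

```javascript
// Levenshtein edit distance between two strings, using a single rolling row.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]; // value of dp[i-1][j-1] from the previous row
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                               // deletion
        dp[i - 1] + 1,                           // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

console.log(levenshtein('batata', 'b@t@t@')); // 3
console.log(levenshtein('batata', 'barata')); // 1
```

Note that, as the answer argues, a pure distance metric cannot tell "barata" (distance 1, legitimate) apart from a lightly disguised "b@tata" (also distance 1), which is why the words need treatment before being scored.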
If the application accepts a ZWS, that already looks like a bug, for starters. The character set should, in theory, be restricted to the nature of each field. In Eastern languages it is a little more complicated, but in Western ones it makes no sense to even accept common special characters (a person's name in Brazil is a-z, spaces and a few accented letters only, for example, and not even numerals; companies can have "&", numerals, "-" and not much more than that). The hole is at a much lower level than the question is covering. Libraries like ICU have solutions for this, as does iconv. – Bacco
I just don’t put it as an answer because it has a lot of holes:
let batata = 'Datilografia';
let fakeBatata = 'Ⓓati' + '\u200B' + 'lografia';
b = batata.replace(/\p{C}/ugi,'');
fB = fakeBatata.replace(/\p{C}/ugi,'');
console.log(b, fB); 
console.log(b.normalize('NFKC') == fB.normalize('NFKC'));
– Augusto Vasques
In fact, contrary to what some preach, using ISO-8859-1 or Win-1252 (or a subset of these, which are almost identical) for Western software brings practically nothing but advantages (you have to know the actual use case, of course; there is no silver bullet or universal solution). By the way, DBs like MySQL/MariaDB even allow you to specify a different charset per column, which is very versatile for those who know what they are doing (unfortunately a minority when it comes to character encoding; for many it is practically a taboo subject, for lack of being given due importance in learning sources). – Bacco
I would use a scoring algorithm that grades the similarity of the words: https://itsallbinary.com/similar-strings-or-same-sounding-strings-algorithms-comparison-apache-implementation-in-java/#Fuzzy. I don't know whether it is simple to implement, but it is certainly better than keeping a giant blocklist or a huge regex to prevent it. – Luiz Felipe Borges
@Luizfelipe, take a look at https://codepen.io/AugustoVasques/pen/WNowbqK?editors=1112; it is not infallible, but it helps. – Augusto Vasques
Interestingly, https://meta.stackexchange.com/q/359283/401803
– hkotsubo