How would an algorithm work to prevent attempts to bypass word blocking (strings)?


Let’s say I develop an application that allows the creation of arbitrary records (regardless of subject). However, for some reason, I have decided to block the use of the word batata in the title of a record.

However, instead of writing batata, the user can write bat@t@. Assuming the block is implemented naively, say through a simple equality comparison, no error will occur; that is, the block will have been bypassed. The user will have created a record that conveys the idea of the word I blocked, but not the exact string that I blocked.

Another option would be to insert the ZERO WIDTH SPACE character (U+200B) between any characters of the word, which also defeats a naive comparison.

const batata = 'batata';
const fakeBatata = 'bat' + '\u200B' + 'ata';

console.log(batata, fakeBatata); // They look the same.
console.log(batata === fakeBatata); // But they are different.

Regarding the ZERO WIDTH SPACE character (U+200B), I think the solution is relatively simple, although not ideal for all scenarios: simply block any use of this kind of character (through a blacklist, for example). However, that could not block a @, for example, since it may be legitimately useful in some titles.
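A minimal sketch of that idea (assuming JavaScript; the sanitize helper below is illustrative, not a complete solution): strip the invisible "format" characters (Unicode category Cf, which includes U+200B) and apply compatibility normalization before comparing.

const blocked = 'batata';

// Hypothetical helper: remove format characters and fold compatibility
// variants before the equality check.
function sanitize(title) {
  return title
    .replace(/\p{Cf}/gu, '') // strips U+200B and other invisible format characters
    .normalize('NFKC')
    .toLowerCase();
}

const fake = 'bat' + '\u200B' + 'ata';
console.log(sanitize(fake) === blocked); // true: the ZWS trick no longer works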

I really can’t think of a good way to solve this problem. I also don’t know the "formal" name of the problem, or of the hypothetical algorithm that mitigates it.

Is it even possible to solve it efficiently? If so, how would the algorithm work?


I am not asking for a finished algorithm, but rather for guidance. That said, based on the description above, an implementation (simplified, if this turns out to be a very complex problem) in a language like JavaScript or Python would be helpful as well. :)

  • There is probably a kind of dictionary, or a set of common rules for these blocks; it doesn’t even need to be something sophisticated or mathematically elaborate. A replace on \u200B would already solve that case, as would treating @ as a, and so on. I believe there is no single specific algorithm; there must be a number of "ideas" and suggestions, but it is hard to say any of them serves as a general rule.

  • Removing (or blocking) the \u200B is fine, but blocking (or replacing) special characters such as @ is not ideal, because they may be required in legitimate titles.

  • The removal I mention is for the test only, not for what gets saved: replace on the fly, then analyze the word similarity.

  • Oh, right, I misread. Indeed, the saving issue is a bit less bad than I had thought. :)

  • 4

    In PHP, for example, you already have two implementations of similarity tests, https://www.php.net/levenshtein and https://www.php.net/manual/function.similar-text.php, but even with such algorithms some words have to be treated before being evaluated.

  • 5

    If the application accepts a ZWS, that already seems like a bug to me, for starters. The character set should, in theory, be restricted to the nature of the field. In Eastern languages it is a little more complicated, but in Western ones it makes no sense to even accept common special characters (a given name in Brazil is a-z, spaces, and a few accented letters only, for example; not even numerals are allowed. Companies can have & and numerals, "-", and not much more than that). The hole is much smaller than the question makes it look. Libraries like ICU have solutions for this, as does iconv.
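    As a sketch of the whitelist approach from this comment (the character set here is an assumption for a hypothetical Brazilian given-name field, not a rule taken from ICU):

// Accept only characters that make sense for the field, instead of
// blacklisting tricks after the fact.
const NAME_OK = /^[a-záàâãéêíóôõúüç ]+$/i;

console.log(NAME_OK.test('João da Silva')); // true
console.log(NAME_OK.test('b@t@t@'));        // false: rejected before any word check
console.log(NAME_OK.test('bat\u200Bata'));  // false: ZWS is simply not allowed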

  • 1

    I just don’t post it as an answer because it has a lot of holes:

let batata = 'Datilografia';
let fakeBatata = 'Ⓓati' + '\u200B' + 'lografia';
b = batata.replace(/\p{C}/ugi, '');
fB = fakeBatata.replace(/\p{C}/ugi, '');
console.log(b, fB);
console.log(b.normalize('NFKC') == fB.normalize('NFKC'));

  • 4

    In fact, contrary to what acolytes preach, using ISO-8859-1 or Win-1252 (or a subset of these, which are almost identical) for Western software brings practically only advantages (you have to know the actual use case, of course; there is no silver bullet, no universal solution). By the way, DBs like MySQL/MariaDB even allow you to specify a different charset for each column, which is very versatile for those who know what they are doing (which, when it comes to character encoding, is unfortunately a minority; for many the subject is practically taboo, for lack of due attention in learning materials).

  • I would use a scoring algorithm that grades the similarity of the words. https://itsallbinary.com/similar-strings-or-same-sounding-strings-algorithms-comparison-apache-implementation-in-java/#Fuzzy I don’t know whether it is simple to implement, but it is certainly better than a giant blocklist or a huge regex to prevent it.

  • 1

    @Luizfelipe take a look at https://codepen.io/AugustoVasques/pen/WNowbqK?editors=1112 , it is not infallible but it helps.

  • 4

    Interestingly, https://meta.stackexchange.com/q/359283/401803


4 answers

1

This is not a definitive solution, because I believe that when it comes to Unicode no such solution exists.

It is possible to restrict, to some degree, the user’s possibilities of entering character combinations that circumvent word restrictions. But it cannot be prevented altogether; the human mind is fertile when it comes to circumventing limits.

In that reasoning I imagined the following:

  • A list of restricted words and their mirrored versions.
  • Search the Unicode character database for homographs of the ASCII characters [A-Z], including @.
  • In the input provided by the user, map the homograph characters to their corresponding ASCII characters.
  • Run the mapped, normalized input through approximate string matching to find possible matches in the restricted word list.

The algorithm implementing this thought exercise was written in Python, based on the Unicode technical reports.


import unicodedata
import string
import difflib

# This function returns a list of characters that are homographs of letter.
def homografos(letter):
    d = []                                                    # Initialize the list that will receive the homographs of letter.
    # Hack to support the @ character: in the Unicode DB its variants are described as COMMERCIAL.
    if letter == "@":
        letter = "COMMERCIAL"
    # For every code point in the Unicode DB...
    for i in range(0x110000):
        l = chr(i)                                            # ...get the character corresponding to code point i.
        # ...if letter is contained in the textual description of that code point...
        if letter in unicodedata.name(l, "None").split(" "):
            d.append(l)                                       # ...add the character for code point i to the list of homographs of letter.
    return d                                                  # Return the list of homographs of letter.

# This function transliterates between homograph characters.
def transhomografos(letters):
    d = {}                                            # Initialize the transliteration dictionary. The format is the one used by static str.maketrans(x[, y[, z]]).
    print("Building transliteration table...")        # To ease the anxiety of the wait.
    # For each character in letters...
    for c in letters:
        # ...for each homograph of c...
        for h in homografos(c):
            d[ord(h)] = ord('A' if c == "@" else c)   # ...populate the transliteration dictionary. Hack to translate @ to A.
        print(c)                                      # ...to ease the anxiety of the wait.
    return d                                          # Return the transliteration dictionary.

t = transhomografos(string.ascii_uppercase + '@')     # Build in t the transliteration dictionary for [A..Z], including @.

# Build the list of restricted words, including their mirrored versions.
restritas = []
for p in ['batata', 'capim', 'feno']:
    restritas.append(p.upper())
    restritas.append(p.upper()[::-1])

while True:
    # Normalize the user input.
    p = unicodedata.normalize('NFKC', input("Type a word: "))
    # If the input is empty, quit the program.
    if not len(p):
        break
    p = p.translate(t)                                # Translate the user input to its ASCII equivalent.
    m = difflib.get_close_matches(p, restritas)       # Approximate comparison between the treated input and the restricted word list.
    # If there is any similarity, print a message.
    if len(m):
        print('Possible restricted word found', m)

Test the example on Repl.it.

More information:

What is the difference between code page, Plane, code point, octet and other terms about characters?

What can be considered a character?

  • 1

    Wow, Augusto, I didn’t see your answer. I only noticed it now, with the notification that the bounty ended... :( As it is the most realistic of the answers, I probably would have given it the bounty. Too bad it went to waste.

0

Edit: I have changed almost all of the class behavior.

I liked the challenge this brings, so I wrote a class that dynamically generates a regex rule with all the combined possibilities between the blacklist items and the cheats, and that finally filters out all Unicode characters.

The reason I prefer a regex over a "degree of similarity" is what happens when comparing batata with barata... To solve that, it would be necessary to maintain both a blacklist and a whitelist, or to lower the threshold percentage, and the complexity could end up much higher.

With a dynamic regex, adding batata to the blacklist, the generated rule is:

/(b.?|3.?|8.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a.?|@.?|4.?|á.?|à.?|â.?|ã.?|ª.?|^.?)(t.?|7.?)(a|@|4|á|à|â|ã|ª|^)/gim

Matches: b\u200Bát@74, bbaattaattaa, b@t@t@, 347474, b a t a t a, b$a$t$a$t$a, etc.

No match: barata, baratata, etc.

About Unicode:

After several attempts, all failures, I created a rule, also using regex, to filter out every Unicode character that does not respect the pattern of a "common" character (ß, , ɐʇɐʇɐ).

  • The .normalize('NFKC') function can transcode some Unicode characters but, unfortunately, a lot still goes wrong (see the snippet below).
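A quick illustration of where NFKC helps and where it does nothing (the characters are examples, not an exhaustive list):

console.log('ⓑⓐⓣⓐⓣⓐ'.normalize('NFKC')); // "batata": circled letters have compatibility decompositions
console.log('ᵇᵃᵗᵃᵗᵃ'.normalize('NFKC'));   // "batata": so do superscript letters
console.log('ɐʇɐʇɐq'.normalize('NFKC'));   // "ɐʇɐʇɐq": turned letters have none, nothing changes
console.log('ß'.normalize('NFKC'));        // "ß": unchanged (only case folding maps it to "ss")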

How to use:

const blacklist = new Blacklist(_your_blacklist_);
blacklist.validate('_string_to_be_validated_');

How it works:

  • When creating the class, just pass your blacklist (an array) to the constructor; with that, all the regex rules are generated and stored in memory.
  • When calling the .validate() function, just pass the string to be validated. In case of a match, the function returns an object with the properties matchs and unicodes:
 {
  matchs: [
    'batata',
    'bbaattaatta',
    'b@t@t@',
    '347474',
    'b a t a t a',
    'b$a$t$a$t$a'
  ],
  unicodes: [ '', 'ɐʇɐʇɐ', 'ß' ]
}

Or, when there is no occurrence (neither matchs nor unicodes): true

It is possible to add anything from isolated words to complete expressions to the blacklist. For example, you allow batata and purê separately, but if someone writes purê de batata, you want to block it.

The idea of returning an object with all the invalid values separated into "common" matches and "unicodes" is that whoever uses it can apply a specific treatment for each situation, be it a replace(x, y), or alerting the user by returning all the invalid words and characters.
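For instance, a hypothetical post-processing of that return value (the title variable and the '[ edited ]' marker are assumptions, echoing the treatment discussed in the comments below):

const result = blacklist.validate(title);

if (result !== true) {
  let clean = title;
  // replace "common" matches with a marker and drop the Unicode leftovers
  for (const word of result.matchs) clean = clean.replace(word, '[ edited ]');
  for (const u of result.unicodes) clean = clean.replace(u, '');
  console.log(clean);
}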

The class is ready to be used, and it is well commented to help anyone who wants to adapt it. I also made the code much simpler than before and, at the end of the script, I put some examples into practice:

class Blacklist {

   constructor(blacklist) {

      const cheats = {

         "a": [ "@", "4", "á", "à", "â", "ã", "ª", "^" ],
         "b": [ "3", "8" ],
         "c": [ "ç", "\\(", "\\[", "\\{", "<" ],
         "d": [ ],
         "e": [ "3", "é", "ê", "&" ],
         "f": [ ],
         "g": [ "6" ],
         "h": [ ],
         "i": [ "1", "!", "í", "l" ],
         "j": [ ],
         "k": [ ],
         "l": [ "1", "!", "í", "i" ],
         "m": [ "ññ", "nn" ],
         "n": [ "ñ" ],
         "o": [ "0", "ó", "ô", "õ", "º", "\\(\\)" ],
         "p": [ ],
         "q": [ ],
         "r": [ ],
         "s": [ "5", "\\$", "z" ],
         "t": [ "7" ],
         "u": [ "ú", "v", "w" ],
         "v": [ "u", "ú" ],
         "w": [ ],
         "x": [ ],
         "y": [ "i", "l" ],
         "z": [ "2", "s" ]
      };

      /* -- CODE -- */
      try {

         // validate the provided blacklist
         if (!Array.isArray(blacklist)) throw('The blacklist must be an array');

         // captures everything that is Unicode
         const getUnicode = /(((?!\s)(?=([^a-z0-9~!@#$%^&*()_+`'"=:{}[%?|;,./<>ªºáãàâéêíóõôºúüçñ\]\\\-\s]))).?(?=[^\x2A\x30-\x39\x41-\x5A\x61-\x7A])*)+/gim;
         // stores the regexes generated from the blacklist contents
         const blacklistRegex = [];

         if (blacklist.length > 0) {

            // normalize the blacklist items
            blacklist.map((item, id) => blacklist[id] = item.normalize('NFD').replace(/[\s\u0300-\u036f]/gim, '').toLowerCase());

            // iterate over every blacklist item and create a regex for each one
            blacklist.forEach(item => {

               // "explode" each character into an array
               const item_split = item.split('');
               
               // prepare variables for the loop
               let count = 0;
               let item_regex = '';
         
               // iterate over each character and build a regex with every "cheats" combination for that character
               item_split.forEach(character => {
         
                  count++;
                  const currentCheat = cheats[character];
                  
                  if (currentCheat) {

                     const item_length = item_split.length;
                     const cheat_length = currentCheat.length;

                     const regex_character = currentCheat.toString().split(',').join('.?|');
                     const regex_last_character = currentCheat.toString().split(',').join('|');

                     const cheatMid = cheat_length > 0 ? `|${regex_character}.?` : '';
                     const cheatLast = cheat_length > 0 ? `|${regex_last_character}` : '';

                     item_regex += count < item_length ? `(${character}.?${cheatMid})` : `(${character}${cheatLast})`;
                  }
               });
         
               // store in memory the final regex generated by the loop
               blacklistRegex.push(RegExp(item_regex, 'gim'));
            });
         }

         this.validate = text => {

            try {

               // validate the parameters
               if (typeof text !== 'string' || text?.trim().length === 0) throw('The content to be validated must be a string');
               if (blacklist.length === 0) return true;
               
               // normalize code points (such as "\u0061" back to "a")
               text = text.normalize('NFC');
               // replace every whitespace character with a plain space
               text = text.replace(/\s/gim, ' ');
               
               // stores, when they occur, the matches against the blacklist items
               const invalids = [];
               // stores, when they occur, the Unicode characters found
               const unicodes = text.match(getUnicode) || [];
            
               // check the occurrences of each generated regex
               blacklistRegex.forEach(regex => {
            
                  // create an array with the occurrences, if any
                  const words = text.match(regex);
            
                  if (words?.length > 0) Object.assign(invalids, words);
               });
               
               return invalids?.length > 0 || unicodes?.length > 0 ? { matchs: invalids, unicodes: unicodes } : true;
            }
            catch(error) {

               console.error(error);
            }
         };
      }
      catch(error) {

         console.error(error);
      }
   }
}


/***********\
|  TESTING  |
\***********/

const myBlacklist = [ 'batata' ];
const blacklist = new Blacklist(myBlacklist);

const invalid = blacklist.validate( 'b\u200Bát@74 | bátátá | b$at$a$t$a | 347474 |  | ɐʇɐʇɐq | b a t a t a | b-a-t-a-t-a | ß @ 7 @ 7 @ | bbaattaattaa' );
const valid = blacklist.validate( 'Nothing to be found here' );

console.log('FALSE: \n', invalid);
console.log('TRUE: \n', valid);

I hope it helps and, indirectly, thanks for the idea.

  • 1

    No mangled-word detector code works with Unicode. If I write `` they all fail. If you want to try, I wrote this test that does the fuzzy comparison, which is what works best, but it has many holes.

  • 2

    Cool, but there are 65535 ways to write ɐʇɐʇɐq with Unicode.

  • @Augustovasques, thanks for the "report". I edited the code so that it also understands the ``. In the cheats object it is possible to add more Unicode variants, but those "65535 forms" really became a problem. I added this as the main caveat in the answer while I think of a way to resolve it.

  • 2

    A sample: b̶̨̪͍̥̗̬̈́̌̓̐̾̄̐̃̂̀̉̚͠ą̴̘̮͕͙̰̬͕͉̘͖̠̇̓̍͗͂͑̈̆̊̽̆̈̋̆͒̋͆̕̕̚͠͝ţ̵̡̦̜̳̝̹͎̮̬͔̰̻̤̩͔̥̯̠̯̩̻̩̃̉̉͌͜a̵̡̡̛͖͇̪̤͂̈́͛͗̂̎̈́̒̈́̑̔̈̀̄͛͂̄̃̔͐͌̈́̕̕͝͝ṫ̵̲̠̻͚̫̣͇̜͖͔̰̥̦̾͐̓͆̇̒͌̓̓́͐̃̇̐͗̉̓̃̔̋̕̕͜͝ͅa̴̩̟͇̲͉̙͈̙͊̒̿͑̇͆̽̓͋̈̊̃̍̃̂̐̉̿͌̑̔͐̕͘͠ b̴̧͉͖̥̰̤̟̱͕̬͙͓̬̘̀͊̿̉̌̕ä̸̡͖̠̮̪̻̝̞̥̤́̀̍́͋͘͘t̷̢͖̜͇̜̰̬͕͌̒̽͋́̈́͛̚̚ä̸̮̼̫͍̯́̃̃̎͊͗t̸̲̤͚̘͗͑̉͘ả̸̡͇̺̟͇̭͕͛̔͑̀͗̅̇͘͠ b̵̭̲̲̋̇́̓͒̾̚̕ͅą̴̨̗̹͕̻̜̱͐̄̓̂̑̎̑͜t̴͕̲̤̗̩͈̼̖̻͌͐̉͗̾͋̾̓̆͆a̷̧̲̯̦̲̜͛̔̾͂̋̍t̷̢̩͛͂̊̂͌a̸̭̝͙̦̗̼͑̀̔̓̋̐͘ b̴̯̳͂a̵̢̝̝͑͊̆͊̐ț̴̞͐̊͌̌̃͜å̵̦̒t̸͉̟̮̂͆a̸̬̔́̐̎̅ b̶̪̃̀̈a̷͉̻͚̎t̴̝͇̱͛̔â̶̩̲t̸͔͊̔ã̶̝̩ b̶̻͍̍̚a̴̝̖̋̑ẗ̴͚͕å̵̳̈t̷̜̱́a̷̺̬̚ 乃ᗩαţᗩ

  • 2

    batata ʙᴀᴛᴀᴛᴀ ɐʇɐʇɐq ɒƚɒƚɒd bₐₜₐₜₐ ᵇᵃᵗᵃᵗᵃ ⓑⓐⓣⓐⓣⓐ ๒คՇคՇค Ⴆαƚαƚα ɮǟȶǟȶǟ ᏰᏗᏖᏗᏖᏗ ცąɬąɬą ๖คtคtค BΛƬΛƬΛ вαтαтα ßå†å†å ฿₳₮₳₮₳ 乃卂ㄒ卂ㄒ卂 乃ムイムイム【b】【a】【t】【a】【t】【a】『b』『a』『t』『a』『t』『a』≋b≋a≋t≋a≋t≋a≋ ░b░a░t░a░t░a░ [̲̅b][̲̅a][̲̅t][̲̅a][̲̅t][̲̅a] b҉a҉t҉a҉t҉a҉ ҍąէąէą ᗷᗩTᗩTᗩ ᗷᗩᖶᗩᖶᗩ b̶a̶t̶a̶t̶a̶ b̴a̴t̴a̴t̴a̴ b̷a̷t̷a̷t̷a̷ b̾a̾t̾a̾t̾a̾ b͎a͎t͎a͎t͎a͎ b͓̽a͓̽t͓̽a͓̽t͓̽a͓̽

  • @Augustovasques, I applied a completely different treatment with regard to Unicode and edited almost the entire answer. Trying to think outside the box: it seems impossible to match every possible Unicode variation, so instead I made the code "snip out" the characters it cannot handle, for a post-match treatment rather than during the match, as it was before. If you can, take a look.

  • Still on the Unicode idea, the way it is now one can choose, for example, to replace all "normal" occurrences with something like " [ edited ] ", and to replace the Unicode leftovers with "" (empty). I searched for this for a long time (well before this question) and nothing answered it, but I had never stopped to try. I don’t intend to touch that code around here anymore, but the way it is now I feel more confident adapting it to my actual projects. The idea was to help, and I ended up helping myself.

  • 1

    I just saw it. What I am trying to tell you is that the problem is unsolvable. The purpose of that code is to create something like a family-safety bot for chats (Facebook, Twitter, Instagram, Discord, Twitch, WhatsApp, ...), filtering profanity in conversations and users with obscene nicknames, while allowing common words to be used even when mangled.


-1

How would an algorithm work to prevent attempts to bypass word blocking (strings)?

It could work in three ways:

1st - Using a list of forbidden words as a reference against what was submitted. This comparison results in a percentage of similarity.

If the degree of similarity is greater than or equal to 50%, the word is considered to contain characters that mask its real meaning (a sketch of this strategy follows the three items).

2nd - Identify the characters and letters mixed into a word, replace them with letters, and try to form the word.

This approach is more expensive, as it requires more processing to generate the combinations, form the word to be blocked, and use another list of forbidden words as a reference for comparison.

3rd - Using machine learning. For this, it would be necessary to train a model and make it able to identify the similarities between words, obviously using a classification system to determine whether the word should be blocked.
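A minimal sketch of the first strategy (assuming JavaScript; the Levenshtein-based ratio and the 50% cutoff are illustrative choices, not a standard):

// Classic dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Similarity as a fraction of the longer word.
function similarity(a, b) {
  const max = Math.max(a.length, b.length) || 1;
  return 1 - levenshtein(a, b) / max;
}

const forbidden = ['batata'];
console.log(forbidden.some(w => similarity('b@t@t@', w) >= 0.5)); // true: 3 edits over 6 characters
console.log(forbidden.some(w => similarity('coelho', w) >= 0.5)); // false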

Basically, the strategies I mentioned are just simple searches for matches that somewhat mimic the Aho-Corasick algorithm.

Some languages provide some functions that deal with similarity and search for occurrences in a simple way.

I believe that creating an algorithm for this is not something so complicated to implement.

  • 1

    No mangled-word detector code works with Unicode. If I write `` they all fail. If you want to try, I wrote this test that does the fuzzy comparison, which is what works best, but it has many holes.

-2

I would start by implementing the Levenshtein distance to see if it solves the problem (if I understood correctly what you intend to do).

In Python, for example, NLTK has a lot that can help you do this, even ready-made. But you can find Levenshtein distance implementations in any language. It’s simple.

With a few lines it is possible to calculate the distance between two words. Obviously, you need to have the dictionary of the words you intend to block; otherwise, how would your software "guess" them?

So, just consider the array palavras below as a dictionary of the words that you want to block.

import nltk

proibido = "b@t@t@"
palavras = ['batata', 'bota', 'borracha', 'b@t@t@', 'outro']

for p in palavras:
    dist = nltk.edit_distance(proibido, p)
    print(p, dist)

Result of distance calculation:

batata 3
bota 4
borracha 7
b@t@t@ 0
outro 5
  • Please, whoever downvoted, explain the reason in the comments, because it is very clear to me that this solution works, since I have already had to do this; unless I have misunderstood the question. Same thing for the newcomer Luiz Felipe. It is frustrating to receive a downvote and not know the reason, so as to improve.

  • 1

    No mangled-word detector code works with Unicode. If I write `` they all fail. If you want to try, I wrote this test that does the fuzzy comparison, which is what works best, but it has many holes. PS: I didn’t vote, because I find this problem unsolvable.
