How to make "case-insensitive" comparisons in Sqlite?

Asked

Viewed 2,673 times

38

Since there’s an obsession with accent questions, here goes mine :)

Sqlite allows direct comparisons or through like with any encoding/charset as long as it’s done byte to byte. It only allows comparison by disregarding the box sensitivity (case insensitive) with ASCII characters.

I have some situations I need to use Sqlite with case insensitive and consequently Accent insensitive also for Latin characters (ISO 8859-1 or Latin1). No need and no use encoding UTF-8 or UTF-16, because, and not only, the proper treatment of these encodings in the Sqlite.

The default response of the Sqlite developers, which is english Centric, is that if you need extra treatment to ASCII, you should arrange it. They claim that you will already have it available in your application. Which is not true in my case.

I often use Sqlite in languages that do not have the proper treatment to normalize the character box taking into account Latin characters, at least not in a way that can be called by Sqlite.

If I used UTF8 I wouldn’t use the ICU for its weight in memory and processing. It is cannon to kill bird and in some cases would bring me undesirable complications.

It would need to be in C to communicate directly with the Sqlite API without overhead and to be portable to any language used in the application and also to the operating system, which is another requirement.

The main requirements:

  • normalize the marry treating main characters (excluding accents) of ISO-8859-1
  • lightness (memory and processing)
  • simplicity (which allows easy deployment and maintenance)
  • portability (language and operating system)
  • avoid duplication of data in tables (create standard column)
  • no external or license dependency allowing me to compile together with my/Sqlite application.

It can be a simple library or a function to subscribe to the Sqlite API. Preferably it also works with the LIKE and FTS, UPPER and LOWER, etc..

I’ve researched and it doesn’t solve my problem:

Reference: API Sqlite

  • Have you considered creating an additional standard column?

  • 4

    @utluiz Already! It’s over overhead yet. In some specific situation may even be an (inelegant) solution, but in most cases, it would be absurd to do so.

  • Create an additional column is a overhead which may be valid in some cases where search efficiency is crucial and comparison is not solved only with encoding. An application is to allow the search of words with small variations of writing with the algorithm [Soundex] (http://en.wikipedia.org/wiki/Soundex).

  • 1

    Ah, and whatever the solution, for it to be efficient, it is not enough to apply a comparison clause in the query, it must act on the bank’s indices to not cause a Full Table Scan.

  • @utluiz I agree with his last statement and that the additional column can valid in some cases.

  • I don’t know if it answers the question but on Android if you put collate localized in the search query. And in the DB create you put db.setLocale(Locale.getDefault()); works perfectly in terms of case insensitive and Accent insensitive.

  • I don’t know the details on Android but probably the Android Sqlite is customized with extra functions. But what I wanted is a solution for other platforms. Bacco has already given the solution that works. Thank you.

Show 2 more comments

3 answers

24


Sqlite has case insensitive only to ASCII natively, because one of the goals of the database is precisely to be Lite :) and there is the understanding that any application that relies on specific language comparison will already have a function for this, since the database is just one of the points where it will be used.

To have an order in the form you want, using your own criteria, just implement a new collation, using the function family sqlite3_create_collation( ), which allows you to point out a function of your own in C, which makes the comparison between strings in the way you find most convenient.

After having this collation implemented, just include it in the desired query:

SELECT nome, endereco FROM cadastro ORDER BY nome COLLATE CollationExemploPtBr

Details of the function to be implemented range from the need of each, the important thing is to know that its function will receive the two strings to be compared, and must return 0 for strings considered equal (action and action, in English, should return 0), some positive for string1 greater than string2, and one negative for string1 smaller than string2.

One of the advantages of having your own collation is that you can not only compare characters by accentuation, but also make & and E have the same "weight" when ordering a string, or anything else that is convenient for the desired result.

Since you call the create_collation function before you start using the database functions themselves, you can use your collation (or several different ones at the same time, as you wish) normally as index. This is especially important to maintain DB performance without losing the freedom to sort as you wish. These functions work as efficiently or even better than native Sqlite functions, depending only on the quality of the code implemented.

Remember that the collation is only a part of the process, because when locating a string with LIKE, for example, you should also implement a compatible function, and you can use the function family create_function() that uses the same principle, pointing out a function of your code that processes the result the way you want.

Here’s a simplified example, adapted from a function I use in some apps:

// Atencao: esta implementacao esta muito simplificada,
// e foi postada como mero exemplo. Por ter sido rapidamente
// adaptada e simplificada de um codigo de uso interno para
// ser postada no SO, pode conter erros de tipagem e alguma
// condicao nao tratada (como utf mal formado) ou bugs.
//
// Esta implementacao somente esta considerando os caracteres
// acentuados entre u+0000 e u+00ff, para fins de exemplo, e
// considerando as equivalencias basicas em pt_BR
//
// -- Bacco

sqlite3 *db;
if (SQLITE_OK==sqlite3_open( "databasename.db", &db)
{

   sqlite3_create_collation( db,
                             "CollationExemploPtBr",
                              SQLITE_UTF8,  /* Nota [1] */
                              &example_table_ptbr, /* Nota [2] */
                              &example_collation ); 

// [1] A pergunta original menciona 8859-1. Notar que este codigo
// trata de UTF, mas basta ajustar os "ifs" pra ignorar UTF e
// usar a tabela com 256 caracteres "puros".
//
// [2] Notar que estou usando um cargo pra nao referenciar o
// example_table_ptbr direto no example_collation.
// O SQLite repassa esse pointer ao chamar a funcao indicada. 
// Este e um otimo jeito de usar a mesma funcao com tabelas diferentes.
// Se sua funcao ja tiver a tabela embutida,
// basta usar null neste parametro.

// ... seu codigo aqui ...

}

static const char example_table_ptbr[] = {
    /* u+0000 .. U+007F */
   0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,  0x08,0x09,0x0A,0x0B,0x0C,0x0D,0x0E,0x0F,
   0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,  0x18,0x19,0x1A,0x1B,0x1C,0x1D,0x1E,0x1F,
   0x20,0x21,0x22,0x23,0x24,0x25,'E' ,0x27,  0x28,0x29,0x2A,0x2B,0x2C,0x2D,0x2E,0x2F,
   0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,  0x38,0x39,0x3A,0x3B,0x3C,0x3D,0x3E,0x3F,
   0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,  0x48,0x49,0x4A,0x4B,0x4C,0x4D,0x4E,0x4F,
   0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57,  0x58,0x59,0x5A,0x5B,0x5C,0x5D,0x5E,0x5F,
   0x60,'A' ,'B' ,'C' ,'D' ,'E' ,'F' ,'G' ,  'H' ,'I' ,'J' ,'K' ,'L' ,'M' ,'N' ,'O' ,
   'P' ,'Q' ,'R' ,'S' ,'T' ,'U' ,'V' ,'W' ,  'X' ,'Y' ,'Z' ,0x7B,0x7C,0x7D,0x7E,0x7F,
   /* u+0080 .. U+00FF */
   0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,  0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
   0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,  0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
   0x20,'!' ,'C' ,'L' ,0x20,'Y' ,0x20,0x20,  0x20,'C' ,'A' ,'<' ,0x20,0x20,'R' ,0x20,
   'O' ,0x20,'2' ,'3' ,0x20,'U' ,0x20,0x20,  0x20,'1' ,'O' ,'>' ,0x20,0x20,0x20,'?' ,
   'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' ,  'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
   'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'X' ,  'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'S' ,
   'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' ,  'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
   'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'/' ,  'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'Y' };

int example_collation(void* pArg,int n1,const void* s1,int n2,const void* s2)
{
   int v = 0;
   unsigned char b1;
   unsigned char b2;
   unsigned char *t = (unsigned char *)pArg;
   unsigned char *p1 = (unsigned char *)s1;
   unsigned char *p2 = (unsigned char *)s2;
   unsigned char *c1 = p1 + n1;
   unsigned char *c2 = p2 + n2;

   while( ( p1 < c1 || p2 < c2 ) && v == 0) {
      b1 = *(p1++);
      b2 = *(p2++);

      if ( p1 > c1 )
         b1 = 0x20;

      else if ( b1 < 0x80 )
         b1 = t[ b1 ];

      else if( ( b1 & 0xFE ) == 0xC2 )
         b1 = t[ 0x80 | ( ( b1 << 6 ) & 0x40 ) | ( *(p1++) & 0x3F ) ];

      if ( p2 > c2 )
         b2 = 0x20;

      else if ( b2 < 0x80 )
         b2 = t[ b2 ];

      else if( ( b2 & 0xFE ) == 0xC2 )
         b2 = t[ 0x80 | ( ( b2 << 6 ) & 0x40 ) | ( *(p2++) & 0x3F ) ];

      v = b1 - b2 ;
   }
   return v;
}
  • +1, your solution seems to be the most professional of all I have ever researched. Could you let me know if it would be possible to use it with a flutter app. I’m having trouble creating a basic collation from the official sqlite doc (sqlite3_create_collation). The flutter apps run behind Java on android and Swift/Obc on iOS, I’m not sure if the context to do this would be exposed to the flutter that runs on Dart, but I believe that because it is about SQL commands it should work.

  • 1

    @Cassioseffrin unfortunately I do not know details of the low-level implementation of flutter to help. Alternatively, you can do this on the Sqlite layer by putting your collation on the lib instead of putting on the application side, but only by evaluating the specific same scenario.

4

To switch between comparisons case insensitive and case sensitive there is a key to configuration:

PRAGMA case_sensitive_like=OFF;

About removing accents, I found that reference with a Function that looks promising:

SELECT spellfix1_translit('água'); -- agua 
SELECT * FROM table WHERE spellfix1_translit(column) LIKE 'a%'; 

To implementation in question, part of a virtual table to find "next" words, uses a mega table of translations to get the ASCII counterpart of accented characters (ie makes the substitution of Roman characters in the nail). For more details see the data structure translit[] and the function transliterate which I omitted for reasons of space and simplicity (the "copyright" header message is quite interesting hehehe).

/*
**    spellfix1_translit(X)
**
** Convert a string that contains non-ASCII Roman characters into 
** pure ASCII.
*/
static void transliterateSqlFunc(
  sqlite3_context *context,
  int argc,
  sqlite3_value **argv
){
  const unsigned char *zIn = sqlite3_value_text(argv[0]);
  int nIn = sqlite3_value_bytes(argv[0]);
  unsigned char *zOut = transliterate(zIn, nIn);
  if( zOut==0 ){
    sqlite3_result_error_nomem(context);
  }else{
    sqlite3_result_text(context, (char*)zOut, -1, sqlite3_free);
  }
}

As for the implementation of the function, if you want something more "robust", since ICU is out of the question, there are other options like unac. Just use the above function as template and replace the call to transliterate by the normalization function you prefer.

  • PRAGMA only works for ASCII. I’ll take a closer look at the rest, then give back. It seems to be at least a good start. I just couldn’t tell yet it solves everything, because it is not enough to remove the seats for the comparison to work well. I’ve seen solutions even simple to normalize the string, but that generated unnecessary processing in bulk. I don’t know if that’s the case. I found the table very large p/ my need, but this gives p/ customize. I will analyze the unac also.

  • If it is to customize (I don’t think today this table is so great in terms of memory consumption, even for Devices embarked), see that it is already possible to treat the case directly in the table also (I on the other hand would leave the case sensitive off to like and compose functions UPPER and LOWER). Regarding overhead and processing, will always be worth the trade-off processing space vs, no extra columns or some kind of auxiliary data structure you will have to perform normalization every query, and this will be expensive (it may be more expensive or less expensive, but it will be expensive).

  • Of course, the problem is to be expensive without being necessary. And I’ve seen several solutions like this. Depending on how to use UPPER and LOWER, this is just the case.

2

Although the question has already been answered and @Maniero has made it clear

avoid duplication of data in tables

In some situations it is not possible to create a collate sqlite3_create_collation. And, using data duplication can meet other requirements like:

lightness and simplicity

And in case you’re also interested in accent insensitive you can have the alternative with field duplication with the use of Slug.

CREATE TABLE 'posts' (
    'id' INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
    'name' TEXT, 
    'name_slug' TEXT);

When inserting, you will need to enter the information twice. The first way raw and the second, Slug. In the example would be name and name_slug. Slugify in C++

INSERT INTO items ( name, name_slug )
    VALUES ( "O nome do POST é comprido", "o-nome-do-post-e-comprido" );

And when making a comparison or like, use the same search term slugify procedure.

# Post é
SELECT * FROM items WHERE status = 1 AND name_slug like "%post-e%"

Browser other questions tagged

You are not signed in. Login or sign up in order to post.