Sorting accented text in Elasticsearch

Viewed 691 times

2

I'm trying to sort results in Elasticsearch, but some fields contain accented characters, such as city names. I tried sorting both on a not_analyzed sub-field and on a sub-field analyzed with a Portuguese analyzer (ptbr), using these settings and this mapping:

{
   "settings": {
      "analysis": {
         "analyzer": {
            "folding": {
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            },
            "analyzer_ptbr": {
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "stemmer_plural_portugues",
                  "asciifolding"
               ]
            }
         },
         "filter": {
            "stemmer_plural_portugues": {
               "type": "stemmer",
               "name": "minimal_portuguese"
            }
         }
      }
   },
   "mappings": {
      "post": {
         "properties": {
            "title": {
               "type": "multi_field",
               "fields": {
                  "title": {
                     "type": "string",
                     "analyzer": "standard"
                  },
                  "folded": {
                     "type": "string",
                     "analyzer": "folding"
                  },
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "ptbr": {
                     "type": "string",
                     "analyzer": "analyzer_ptbr"
                  }
               }
            }
         }
      }
   }
}

When I sort with:

{
   "query": {
      "match_all": {}
   },
   "sort": [
      {
         "title.ptbr": {
            "order": "asc"
         }
      }
   ]
}

the result is:

Abacate
Versão de Acentuação
ângelo
Banana
Dois Vizinhos

Switching to the raw (not_analyzed) field, in descending order:

{
   "query": {
      "match_all": {}
   },
   "sort": [
      {
         "title.raw": {
            "order": "desc"
         }
      }
   ]
}

returns:

ângelo
Versão de Acentuação
Dois Vizinhos
Banana
Abacate

In other words, when the accents are folded away I can no longer sort by the first word of the phrase, and when I keep the field as not_analyzed the accented characters come first in descending order. Has anyone already run into this problem?

Thank you

1 answer

3


I had this problem and solved it by using a filter to strip all accents and spaces from the text, and storing the result as an extra sub-field of the multi_field. Your phrases then become:

ângelo => angelo
Versão de Acentuação => versaodeacentuacao
Dois Vizinhos => doisvizinhos
Banana => banana
Abacate => abacate
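
To check these keys without touching the index, the filter chain can be approximated in a few lines of Python. This is only a rough sketch: `sort_key` is a hypothetical helper name, and the accent stripping uses Unicode decomposition, which covers the Portuguese characters here but is not a full equivalent of the asciifolding filter.

```python
import unicodedata


def sort_key(title: str) -> str:
    """Approximate the sort key: one token, lowercased, spaces removed, accents stripped."""
    # The keyword tokenizer keeps the whole value as a single token,
    # so the string is processed as one unit.
    lowered = title.lower()
    # Same effect as a pattern_replace filter that replaces " " with "".
    no_spaces = lowered.replace(" ", "")
    # Assumption: NFKD decomposition plus dropping combining marks stands in
    # for asciifolding; the real filter also maps characters such as "ß".
    decomposed = unicodedata.normalize("NFKD", no_spaces)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["ângelo", "Versão de Acentuação", "Dois Vizinhos", "Banana", "Abacate"]
for title in titles:
    print(f"{title} => {sort_key(title)}")
```

Because each title collapses to a single token, there is exactly one value per document to sort on, and it starts with the first letter of the phrase.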

You can then sort on this version of the field. Here are the settings and the mapping:

{
   "settings": {
      "analysis": {
         "analyzer": {
            "without_space": {
               "filter": [
                  "lowercase",
                  "whitespace_remove",
                  "asciifolding"
               ],
               "type": "custom",
               "tokenizer": "keyword"
            }
         },
         "filter": {
            "whitespace_remove": {
               "type": "pattern_replace",
               "pattern": " ",
               "replacement": ""
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "title": {
               "type": "multi_field",
               "fields": {
                  "title": {
                     "type": "string",
                     "analyzer": "standard"
                  },
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "sorting": {
                     "type": "string",
                     "analyzer": "without_space"
                  }
               }
            }
         }
      }
   }
}

And in the query:

{
    "query": {
        "match_all": {

        }
    },
    "sort": [
       {
          "title.sorting": {
             "order": "desc"
          }
       }
    ]
}

This was the simplest solution I could find to handle the accents while still sorting by the first letter of the phrase.
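
To see the difference between the two descending sorts, here is a small Python sketch that compares plain code point ordering (roughly what a not_analyzed field gives you) with ordering by the folded key. The helper name `folded_key` and the sample titles are assumptions for illustration, and the accent stripping only approximates asciifolding.

```python
import unicodedata


def folded_key(title: str) -> str:
    """Lowercase, drop spaces, and strip accents (rough stand-in for the analyzer)."""
    text = title.lower().replace(" ", "")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["Abacate", "Banana", "Dois Vizinhos", "Versão de Acentuação", "ângelo"]

# Raw values compare by code point, so "â" (U+00E2) sorts after every ASCII letter
# and lands first in descending order.
raw_desc = sorted(titles, reverse=True)

# Folded keys compare as plain ASCII, so "ângelo" falls between "Banana" and "Abacate".
folded_desc = sorted(titles, key=folded_key, reverse=True)

print(raw_desc)
print(folded_desc)
```

The sort key is used only for ordering; the original title is still what gets returned, just as `title.sorting` only drives the sort while the document source stays untouched.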

I hope it helps.
