Saturday, December 13, 2014

Randomizing with Elasticsearch: a practical example

This post explains how to shuffle the results returned by Elasticsearch. The use case is avoiding result pages where users see items of only one (or a few) types. For instance, on an e-commerce site you may want to return products from different brands even though a few brands dominate your dataset (i.e., one brand accounts for a large share of your data).

1. To test this, I created a small dataset with 15 products from 3 brands (5 products per brand).

PUT /test

PUT /test/products/1
{
    "name" : "product 1",
    "brand" : "brand1"
}

PUT /test/products/2
{
    "name" : "product 2",
    "brand" : "brand1"
}

PUT /test/products/3
{
    "name" : "product 3",
    "brand" : "brand1"
}

PUT /test/products/4
{
    "name" : "product 4",
    "brand" : "brand1"
}

PUT /test/products/5
{
    "name" : "product 5",
    "brand" : "brand1"
}

PUT /test/products/6
{
    "name" : "product 6",
    "brand" : "brand2"
}

PUT /test/products/7
{
    "name" : "product 7",
    "brand" : "brand2"
}

PUT /test/products/8
{
    "name" : "product 8",
    "brand" : "brand2"
}

PUT /test/products/9
{
    "name" : "product 9",
    "brand" : "brand2"
}

PUT /test/products/10
{
    "name" : "product 10",
    "brand" : "brand2"
}


PUT /test/products/11
{
    "name" : "product 11",
    "brand" : "brand3"
}

PUT /test/products/12
{
    "name" : "product 12",
    "brand" : "brand3"
}

PUT /test/products/13
{
    "name" : "product 13",
    "brand" : "brand3"
}

PUT /test/products/14
{
    "name" : "product 14",
    "brand" : "brand3"
}

PUT /test/products/15
{
    "name" : "product 15",
    "brand" : "brand3"
}
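
The fifteen PUT requests above could also be sent in a single call through the bulk API. The sketch below only builds the newline-delimited payload; the helper name build_bulk_payload is made up for illustration, and the index and type names are the ones used above (test and products).

```python
import json

def build_bulk_payload(n_products=15, per_brand=5, index="test", doc_type="products"):
    """Build a newline-delimited _bulk body for the sample dataset.

    build_bulk_payload is a hypothetical helper, not part of any Elasticsearch client.
    """
    lines = []
    for i in range(1, n_products + 1):
        # Products 1-5 -> brand1, 6-10 -> brand2, 11-15 -> brand3.
        brand = "brand%d" % ((i - 1) // per_brand + 1)
        # Action line: tells the bulk API where the next line should be indexed.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": str(i)}}))
        # Source line: the document itself.
        lines.append(json.dumps({"name": "product %d" % i, "brand": brand}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

payload = build_bulk_payload()
```

The resulting string would be the body of a POST /_bulk request.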

2. I ran three queries: (A) one without any sorting, (B) one with a sort script that uses the Java hashCode function, and (C) one that uses Elasticsearch's random_score function.

POST /test/products/_search
{
   "from": 0,
   "size": 3,
   "query": {
      "match": {
         "name": "product"
      }
   }
}

POST /test/products/_search
{
   "from": 0,
   "size": 3,
   "query": {
      "match": {
         "name": "product"
      }
   },
   "sort": {
      "_script": {
         "script": "(doc['_id'].value + seed).hashCode()",
         "type": "number",
         "params": {
            "seed": "doc['name'].value"
         },
         "order": "asc"
      }
   }
}

POST /test/products/_search
{
   "from": 0,
   "size": 3,
   "query": {
      "function_score": {
         "query": {
            "match": {
               "name": "product"
            }
         },
         "functions": [
            {
               "random_score": {
                  "seed": "1"
               }
            }
         ],
         "score_mode": "sum"
      }
   }
}
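
To get a feel for the sort key that the script in query B produces, here is a rough Python emulation of Java's String.hashCode() (the 31-based polynomial hash, wrapped to a signed 32-bit integer). It assumes that doc['_id'].value yields the plain id string, which I have not verified for every Elasticsearch version. Note that params are passed to the script as plain values, so "doc['name'].value" in that request is a constant string and not a field lookup. The function names are illustrative only.

```python
def java_string_hashcode(s):
    """Emulate Java's String.hashCode() for strings of BMP characters."""
    h = 0
    for ch in s:
        # Keep only 32 bits, mimicking int overflow in Java.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - 0x100000000 if h >= 0x80000000 else h

def sort_key(doc_id, seed):
    # Mirrors the script: (doc['_id'].value + seed).hashCode()
    return java_string_hashcode(doc_id + seed)

seed = "doc['name'].value"  # the literal string sent in "params"
ids = [str(i) for i in range(1, 16)]

# "order": "asc" -> smallest hash first.
ordered = sorted(ids, key=lambda doc_id: sort_key(doc_id, seed))
```

Because the key depends only on the id and the seed, the same seed always gives the same order, and a different seed string will generally give a different one.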

3. The results were:

A. brand 2, brand 2, brand 1
B. brand 1, brand 3, brand 2
C. brand 3, brand 3, brand 1
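
One simple way to compare the three result lists is to count how many distinct brands appear in each top 3. The snippet below does that for the results reported above; the variable names are just for illustration.

```python
# Top-3 brands reported above for each query.
results = {
    "A": ["brand 2", "brand 2", "brand 1"],
    "B": ["brand 1", "brand 3", "brand 2"],
    "C": ["brand 3", "brand 3", "brand 1"],
}

# Number of distinct brands per query: higher means a more varied first page.
distinct_brands = {query: len(set(brands)) for query, brands in results.items()}

for query in sorted(distinct_brands):
    print(query, distinct_brands[query])
# prints:
# A 2
# B 3
# C 2
```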

4. My conclusion is that the Java hashCode approach is the best one here: it was the only query that returned all three brands in the top 3. The random_score function is interesting if you want to keep the results consistent for the same user (you can use the user id as the seed of this function).
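
As a sketch of the per-user idea, the request body for random_score could be built with the user id as the seed. The helper name random_score_query and the sample user id 42 are hypothetical; the body uses the function_score layout with query, functions and score_mode as siblings.

```python
import json

def random_score_query(user_id, text="product", size=3):
    """Build a function_score search body seeded with the user id.

    Hypothetical helper: the same user_id always yields the same body,
    and therefore the same seed.
    """
    return {
        "from": 0,
        "size": size,
        "query": {
            "function_score": {
                "query": {"match": {"name": text}},
                "functions": [{"random_score": {"seed": str(user_id)}}],
                "score_mode": "sum",
            }
        },
    }

# JSON body that could be sent to POST /test/products/_search
body = json.dumps(random_score_query(user_id=42), indent=3)
```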

Best regards,

Luiz
