Elasticsearch & a bit of maths

Matteo
October 28, 2016

Elasticsearch is a powerful tool with a great Symfony integration. This talk will explain how to deal with Elasticsearch and FOSElasticaBundle when the sorting becomes hard. A bit of maths and some statistics tricks will help you to sort things out, with some examples about geolocation and users’ reviews—and you’ll understand why sorting things by average rating is not always a good idea.

[http://2016.symfonyday.it]

Transcript

  1. Elasticsearch & a bit of maths Symfony Day 2016 (Roma)

  2. Matteo Dora github.com/mattbit

  3. WARNING contains maths!

  4. None
  5. Yes • Elasticsearch & Symfony • Score query • Geolocation • Rating
  6. No • Full text, TF/IDF • Vector space model • Analyzers • Schrödinger equation, cat, etc.
  7. Relevance sorting?
     • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €
     • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €
     • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €
     • Mirum, La Monacesca, 2013, 96⁄100, 21 €
     • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €
  8. By price… NAIVE SORTING

  9. • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €
     • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €
     • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €
     • Mirum, La Monacesca, 2013, 96⁄100, 21 €
     • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €

  10. • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €
      • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €
      • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €
      • Mirum, La Monacesca, 2013, 96⁄100, 21 €
      • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €
  11. Not exactly my first choice…

  12. According to the critics… NAIVE SORTING

  13. • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €
      • Mirum, La Monacesca, 2013, 96⁄100, 21 €
      • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €
      • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €
      • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €

  14. • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €
      • Mirum, La Monacesca, 2013, 96⁄100, 21 €
      • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €
      • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €
      • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €
  15. I'd have to think about it…

  16. Relevance • Grape variety • Vintage • Alcohol • Acidity • Sweetness • Astringency • Ageing • Reviews • DOC, DOCG • …
  17. ???

  18. “Friends, Romans, countrymen, lend me your ears!” (Shakespeare, Julius Caesar)
  19. Elasticsearch: $ brew install elasticsearch

  20. Elasticsearch: $ apt-get install elasticsearch

  21. FOSElasticaBundle: $ composer require friendsofsymfony/elastica-bundle

  22. app/AppKernel.php

          …
          new FOS\ElasticaBundle\FOSElasticaBundle(),
          …

  23. app/config/elasticsearch.yml

          # include in config.yml!
          fos_elastica:
              clients:
                  default: { host: localhost, port: 9200 }
              indexes:
                  wines:
                      types:
                          …
  24. app/config/elasticsearch.yml

          wine:
              mappings:
                  name: { type: string }
                  producer:
                      type: object
                      properties:
                          name: { type: string }
                          location: { type: geo_point }
                  price: { type: float }
              persistence: …
  25. app/config/elasticsearch.yml

          wine:
              mappings: …
              persistence:
                  driver: orm
                  model: AppBundle\Entity\Wine

  26. Populate! $ bin/console fos:elastica:populate

  27. http://localhost:9200/wines/wine/1

          {
            "_index": "wines",
            "_type": "wine",
            "_id": "1",
            "_version": 1,
            "found": true,
            "_source": {
              "name": "Roero Arneis",
              "producer": "Pescaja",
              "price": 12.00,
              "rating": 92
            }
          }
  28. Done!

  29. None
  30. ???

  31. “What’s past is prologue.” (Shakespeare, The Tempest)

  32. Score function: we assign a score to each feature, then combine them into a single total score.
  33. Score function • score from 0 to 1: $s_i \in [0, 1]$ • multiply them: $\prod_i s_i$ • $s_i = f_i(x)$
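The multiplication rule on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `total_score` and the sample values are assumptions, not taken from the talk.

```python
import math


def total_score(feature_scores):
    """Combine per-feature scores (each in [0, 1]) by multiplying them."""
    for s in feature_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score outside [0, 1]: {s}")
    return math.prod(feature_scores)


# Hypothetical price and rating scores for two wines.
print(total_score([0.9, 0.5]))  # one mediocre feature pulls the total down
print(total_score([1.0, 0.0]))  # a zero on any feature vetoes the item
```

Multiplying rather than averaging means an item has to do reasonably well on every feature to rank high.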
  34. Exempli gratia

  35. [Plot: score (0 to 1) as a function of price (0 to 150 €)]
  36. [Plot: score (0 to 1) as a function of rating (0 to 100)]
  37. None
  38. $f(x) = e^{-\frac{x^2}{2\sigma^2}}$

  39. None
  40. None
  41. Elasticsearch!

  42. function_score query

  43. Decay functions • Linear (linear) • Exponential (exp) • Gaussian (gauss)
  44. ⚫︎ linear ⚫︎ exponential ⚫︎ gaussian
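The three decay shapes can be compared with a small Python sketch. The formulas follow the ones given in the Elasticsearch `function_score` documentation, with the default `decay` of 0.5. The function names and the sample distances are assumptions made for illustration.

```python
import math


def decay_distance(value, origin, offset=0.0):
    """Distance from the origin that counts for decay (zero inside the offset)."""
    return max(0.0, abs(value - origin) - offset)


def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max(0.0, (s - decay_distance(value, origin, offset)) / s)


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * decay_distance(value, origin, offset))


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-decay_distance(value, origin, offset) ** 2 / (2.0 * sigma_sq))


# Compare the three curves at a few distances from origin 0 with scale 20.
for d in (0, 10, 20, 40):
    print(d,
          round(linear_decay(d, 0, 20), 3),
          round(exp_decay(d, 0, 20), 3),
          round(gauss_decay(d, 0, 20), 3))
```

All three curves return 1 at the origin and `decay` at a distance of `scale` beyond the offset. They differ in how quickly they fall in between and further out.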

  45. function_score query

          {
            "query": {
              "function_score": {
                "functions": [ … ]
              }
            }
          }
  46. function_score query

          [
            { "gauss": { "price": { "origin": 0, "offset": 10, "scale": 20 } } },
            { "gauss": { "rating": { "origin": 100, …
  47. And Symfony?

  48. WineController::listAction

          // use Elastica\Query\FunctionScore;
          $query = new FunctionScore();
          $query->addDecayFunction(
              FunctionScore::DECAY_GAUSS,
              "price", // field name
              0,       // origin
              20,      // scale
              10       // offset
          );
  49. WineController::listAction

          $query->addDecayFunction(
              FunctionScore::DECAY_GAUSS,
              "rating", // field name
              100,      // origin
              10,       // scale
              10        // offset
          );
  50. WineController::listAction

          $finder = $this->get('fos_elastica.finder.wines.wine');
          $wines = $finder->find($query);

  51. and…

  52. Relevance sorting!
      • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €, score 0.993
      • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €, score 0.895
      • Mirum, La Monacesca, 2013, 96⁄100, 21 €, score 0.810
      • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €, score 0.169
      • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €, score 0.000
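The scores on this slide can be approximately reproduced offline. The Python sketch below assumes the Gaussian decay formula from the Elasticsearch documentation with the default `decay` of 0.5, the price and rating parameters from the `WineController::listAction` slides, and the default behaviour of multiplying the function scores together. The helper names are made up for the example.

```python
import math


def gauss_decay(value, origin, scale, offset, decay=0.5):
    """Gaussian decay as documented for Elasticsearch function_score."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


def wine_score(price, rating):
    """Price decays away from 0 €, rating decays away from 100; multiply both."""
    price_score = gauss_decay(price, origin=0, scale=20, offset=10)
    rating_score = gauss_decay(rating, origin=100, scale=10, offset=10)
    return price_score * rating_score


# (name, price in €, rating out of 100), taken from the slides.
wines = [
    ("Roero Arneis", 12, 92),
    ("Mandrarossa Fiano", 5, 86),
    ("Mirum", 21, 96),
    ("Tavernello Chardonnay", 3, 74),
    ("Terlaner I Gran Cuvée", 180, 97),
]

ranked = sorted(
    ((name, wine_score(price, rating)) for name, price, rating in wines),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Under these assumptions the computed values agree with the slide to within about 0.001, and the ranking comes out in the same order.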
  53. WineController::listAction

          // use Elastica\Query\FunctionScore;
          // use Elastica\Query\Term;
          $query = new FunctionScore();
          $query->addDecayFunction(…);
          $query->addWeightFunction(
              2, // weight value
              new Term(["grape" => "verdicchio"])
          );
  54. • Mirum, La Monacesca, 2013, 96⁄100, 21 €, score 1.621
      • Roero Arneis, Pescaja, 2015, 92⁄100, 12 €, score 0.993
      • Mandrarossa Fiano, Settesoli, 2015, 86⁄100, 5 €, score 0.895
      • Tavernello Chardonnay, Caviro, –, 74⁄100, 3 €, score 0.169
      • Terlaner I Gran Cuvée, Cantina Terlan, 2013, 97⁄100, 180 €, score 0.000
  55. Geolocation INTERMEZZO

  56. Winebar.php

          public function getLocation()
          {
              return "{$this->lat}, {$this->lon}";
          }

  57. elasticsearch.yml

          winebar:
              mappings:
                  location: { type: geo_point }
                  […]

  58. WinebarController::listAction

          // use Elastica\Query\FunctionScore;
          $query = new FunctionScore();
          $query->addDecayFunction(
              FunctionScore::DECAY_GAUSS,
              "location",             // field name
              "41.849872, 12.574170", // origin
              "5km",                  // scale
              "1km"                   // offset
          );
  59. None
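To get a feel for what the geolocation decay from `WinebarController::listAction` does, here is a rough Python sketch. It assumes great-circle (haversine) distance on a spherical Earth and the documented Gaussian decay with the default `decay` of 0.5. Elasticsearch's own distance computation may differ slightly, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_gauss_score(lat, lon, origin=(41.849872, 12.574170),
                    scale_km=5.0, offset_km=1.0, decay=0.5):
    """Gaussian decay on the distance from the origin, flat within the offset."""
    distance = max(0.0, haversine_km(lat, lon, *origin) - offset_km)
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


# A wine bar at the origin, and one roughly 6 km further north.
six_km_in_degrees = math.degrees(6.0 / EARTH_RADIUS_KM)
print(geo_gauss_score(41.849872, 12.574170))
print(geo_gauss_score(41.849872 + six_km_in_degrees, 12.574170))
```

With a 1 km offset and a 5 km scale, every wine bar within 1 km gets the full score, and the score halves at about 6 km from the origin.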
  60. “And now for something completely different.” (Monty Python)

  61. User reviews • ★★★☆☆, ★★★★★ • (★★★☆☆ + ★★★★★) ÷ 2 = ★★★★☆ • Right? [rhetorical question]
  62. Example

  63. • Roero Arneis 2015, Marchesi di Barolo: ★★★★★ ★★★★☆ ★★★★★ ★★★★★ ★★★★★ ★★★★☆ ★★★★★ ★★★★★ (average 4.75)
      • Pescaja Roero Arneis 2015: ★★★★★ (average 5.00)
  64. Ideas?

  65. ⋆ ×

  66. • Roero Arneis 2015, Cantina Canci: ★★★★★ (5 × 1 = 5)
      • Pescaja Roero Arneis 2015: ★ˑˑˑˑ ★ˑˑˑˑ ★ˑˑˑˑ ★ˑˑˑˑ ★ˑˑˑˑ ★ˑˑˑˑ (1 × 6 = 6)
  67. Any other ideas?

  68. Bayesian inference: we estimate the average rating based on the available data.
  69. Credible interval [figure: a star scale from 0 to 5 with the interval from 3.7 to 4.2 marked]
  70. 3 1 30 10

  71. [Plot: axis from 0 to 5]

  72. [Plot: axis from 0 to 5]

  73. [Plot: axis from 0 to 5]

  74. How do you do it?

  75. Ranking items with star ratings evanmiller.org/ranking-items-with-star-ratings.html by Evan Miller

  76. WARNING take cover! ∑xi

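  77. $$S(n_1, \dots, n_K) = \underbrace{\sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K}}_{\text{mean}} \pm \underbrace{z_{\alpha/2} \sqrt{\left( \sum_{k=1}^{K} s_k^2 \frac{n_k + 1}{N + K} - \underbrace{\left( \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} \right)^2}_{\text{mean}^2} \right) \Big/ (N + K + 1)}}_{\text{credible interval}}$$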
  78. Stats.php

          $votes = [
              // vote value => vote count
              1 => 0,
              2 => 1,
              3 => 0,
              4 => 2,
              5 => 6,
          ];
  79. Stats.php

          $N = array_sum($votes); // total number of votes
          $K = count($votes);     // number of stars
          $z = 1.65;              // 90% credibility
          $M = 0;
          $A = 0;
  80. Stats.php

          foreach ($votes as $value => $count) {
              $M += $value*($count + 1)/($N + $K);
              $A += ($value**2)*($count + 1)/($N + $K);
          }
          $intervalWidth = 2*$z*sqrt(
              ($A - $M**2)/($N + $K + 1)
          );
          $lowerBound = $M - $intervalWidth/2;
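For readers who want to experiment outside PHP, the `Stats.php` computation can be ported to Python roughly as follows. This is a sketch that mirrors the slide code: the function name `rating_lower_bound` and the second, single-vote example are assumptions added for comparison.

```python
import math


def rating_lower_bound(votes, z=1.65):
    """Lower bound of the credible interval for the average rating.

    `votes` maps each star value to its vote count; include stars with
    zero votes so that K (the number of star levels) is correct.
    """
    n_total = sum(votes.values())  # N: total number of votes
    k = len(votes)                 # K: number of star levels
    mean = sum(v * (c + 1) / (n_total + k) for v, c in votes.items())
    mean_sq = sum(v ** 2 * (c + 1) / (n_total + k) for v, c in votes.items())
    half_width = z * math.sqrt((mean_sq - mean ** 2) / (n_total + k + 1))
    return mean - half_width


# Vote counts from the Stats.php slide, plus a hypothetical single 5-star vote.
many_votes = {1: 0, 2: 1, 3: 0, 4: 2, 5: 6}
single_vote = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}

print(rating_lower_bound(many_votes))
print(rating_lower_bound(single_vote))
```

Sorting by this lower bound instead of the plain average means an item with one perfect review no longer outranks an item with many very good ones.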
  81. Bonus

  82. “By medicine life may be prolong’d, yet death will seize the doctor too.” (Shakespeare, Cymbeline)
  83. None
  84. Fabio Giannese, Andrés Vasquez, Massimo Chiarillo, Eugenio Canciello, Oreste di Modugno. Thank you.