Elasticsearch & a bit of maths

Matteo
October 28, 2016

Elasticsearch is a powerful tool with a great Symfony integration. This talk will explain how to deal with Elasticsearch and FOSElasticaBundle when the sorting becomes hard. A bit of maths and some statistics tricks will help you to sort things out, with some examples about geolocation and users’ reviews—and you’ll understand why sorting things by average rating is not always a good idea.

[http://2016.symfonyday.it]

Transcript

  1. Elasticsearch
    & a bit of maths
    Symfony Day 2016 (Roma)

  2. Matteo Dora
    github.com/mattbit

  3. WARNING
    contains maths!

  5. Yes
    • Elasticsearch & Symfony
    • Score query
    • Geolocation
    • Rating

  6. No
    • Full text, TF/IDF
    • Vector space model
    • Analyzers
    • Schrödinger equation, cat, etc.

  7. Relevance sorting?
    Wine | Producer | Vintage | Rating | Price
    Roero Arneis | Pescaja | 2015 | 92/100 | 12 €
    Mandrarossa Fiano | Settesoli | 2015 | 86/100 | 5 €
    Tavernello Chardonnay | Caviro | – | 74/100 | 3 €
    Mirum | La Monacesca | 2013 | 96/100 | 21 €
    Terlaner I Gran Cuvée | Cantina Terlan | 2013 | 97/100 | 180 €

  8. By price…
    NAIVE SORTING

  9. Wine | Producer | Vintage | Rating | Price
    Tavernello Chardonnay | Caviro | – | 74/100 | 3 €
    Mandrarossa Fiano | Settesoli | 2015 | 86/100 | 5 €
    Roero Arneis | Pescaja | 2015 | 92/100 | 12 €
    Mirum | La Monacesca | 2013 | 96/100 | 21 €
    Terlaner I Gran Cuvée | Cantina Terlan | 2013 | 97/100 | 180 €

  11. Not exactly my first choice…

  12. According to the critics…
    NAIVE SORTING

  13. Wine | Producer | Vintage | Rating | Price
    Terlaner I Gran Cuvée | Cantina Terlan | 2013 | 97/100 | 180 €
    Mirum | La Monacesca | 2013 | 96/100 | 21 €
    Roero Arneis | Pescaja | 2015 | 92/100 | 12 €
    Mandrarossa Fiano | Settesoli | 2015 | 86/100 | 5 €
    Tavernello Chardonnay | Caviro | – | 74/100 | 3 €

  15. I'd have to think about it…

  16. Relevance
    • Grape variety
    • Vintage
    • Alcohol
    • Acidity
    • Sweetness
    • Astringency
    • Ageing
    • Reviews
    • DOC, DOCG
    • …

  17. ???

  18. Friends, Romans, countrymen,

    lend me your ears!

    — Shakespeare, Julius Caesar

  19. $ brew install elasticsearch
    Elasticsearch

  20. $ apt-get install elasticsearch
    Elasticsearch

  21. $ composer require \
        friendsofsymfony/elastica-bundle
    FOSElasticaBundle

  22. app/AppKernel.php

    new FOS\ElasticaBundle\FOSElasticaBundle(),

  23. app/config/elasticsearch.yml
    # include in config.yml!
    fos_elastica:
        clients:
            default: { host: localhost, port: 9200 }
        indexes:
            wines:
                types: …

  24. app/config/elasticsearch.yml
    wine:
        mappings:
            name: { type: string }
            producer:
                type: object
                properties:
                    name: { type: string }
                    location: { type: geo_point }
            price: { type: float }
        persistence: …

  25. app/config/elasticsearch.yml
    wine:
        mappings: …
        persistence:
            driver: orm
            model: AppBundle\Entity\Wine

  26. $ bin/console fos:elastica:populate
    Populate!

  27. http://localhost:9200/wines/wine/1
    {
      "_index": "wines",
      "_type": "wine",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "name": "Roero Arneis",
        "producer": "Pescaja",
        "price": 12.00,
        "rating": 92
      },

  28. Done!

  30. ???

  31. What’s past is prologue.

    — Shakespeare, The Tempest

  32. Score function
    We assign a score to each feature, then
    combine them into a single total score.

  33. Score function
    • each score ranges from 0 to 1: s_i ∈ [0, 1]
    • we multiply them: ∏ s_i
    • s_i = f_i(x)
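
The combination rule above can be written out as a single product. The notation is an assumption for illustration, since the symbols on the slide did not survive extraction: $s_i$ is the score of the i-th feature, $f_i$ is the function that maps the item $x$ (through its price, rating, and so on) to that score, and $n$ is the number of features.

```latex
S(x) = \prod_{i=1}^{n} s_i = \prod_{i=1}^{n} f_i(x),
\qquad s_i \in [0, 1] \;\Rightarrow\; S(x) \in [0, 1]
```

Because the scores are multiplied, a single feature scoring close to 0 pulls the total close to 0, however good the other features are.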

  34. Exempli gratia

  35. [Plot: score from 0 to 1 against price, 0 to 150 €]

  36. [Plot: score from 0 to 1 against rating, 0 to 100]

  38. $$ f(x) = e^{-\frac{x^2}{2\sigma^2}} $$
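
One way to pick the width $\sigma$ of the gaussian, sketched here as a worked derivation: decide that the score should drop to a chosen value $p$ at a chosen distance $d$ from the ideal value, then solve for $\sigma$. The symbols $d$ and $p$ and the numbers in the example (a score of 0.5 at 20 €) are illustrative assumptions, not values stated on this slide.

```latex
e^{-\frac{d^2}{2\sigma^2}} = p
\;\Longrightarrow\;
\sigma^2 = -\frac{d^2}{2 \ln p}

% Example: d = 20, p = 0.5
\sigma = \frac{20}{\sqrt{2 \ln 2}} \approx 16.99
```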

  41. Elasticsearch!

  42. function_score query

  43. Decay functions
    • Linear (linear)
    • Exponential (exp)
    • Gaussian (gauss)
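
For reference, the three decay shapes listed above can be written as follows. This follows the definitions in the Elasticsearch function_score documentation as best I recall them; $x$ is the field value, and `decay` is the score reached at distance `scale` beyond the offset (0.5 unless configured otherwise).

```latex
\begin{align*}
d &= \max\bigl(0,\; |x - \text{origin}| - \text{offset}\bigr) \\
\text{gauss:}\quad & \exp\!\left(-\frac{d^2}{2\sigma^2}\right),
  & \sigma^2 &= -\frac{\text{scale}^2}{2 \ln(\text{decay})} \\
\text{exp:}\quad & \exp(\lambda d),
  & \lambda &= \frac{\ln(\text{decay})}{\text{scale}} \\
\text{linear:}\quad & \max\!\left(\frac{s - d}{s},\; 0\right),
  & s &= \frac{\text{scale}}{1 - \text{decay}}
\end{align*}
```

In all three cases the score is 1 while the value stays within `offset` of the origin and equals `decay` at distance `scale` beyond it.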

  44. [Plot comparing the linear, exponential and gaussian decay curves]

  45. function_score query
    {
      "query": {
        "function_score": {
          "functions": [ … ]
        }
      }
    }

  46. function_score query
    [
      {
        "gauss": {
          "price": {
            "origin": 0,
            "offset": 10,
            "scale": 20
          }
        }
      },
      {
        "gauss": {
          "rating": {
            "origin": 100,

  47. And Symfony?

  48. WineController::listAction
    // use Elastica\Query\FunctionScore;
    $query = new FunctionScore();
    $query->addDecayFunction(
        FunctionScore::DECAY_GAUSS,
        "price", // field name
        0,       // origin
        20,      // scale
        10       // offset
    );

  49. WineController::listAction
    $query->addDecayFunction(
        FunctionScore::DECAY_GAUSS,
        "rating", // field name
        100,      // origin
        10,       // scale
        10        // offset
    );

  50. WineController::listAction
    $finder = $this->get(
        'fos_elastica.finder.wines.wine'
    );
    $wines = $finder->find($query);

  51. and…

  52. Relevance sorting!
    Wine | Producer | Vintage | Rating | Price | Score
    Roero Arneis | Pescaja | 2015 | 92/100 | 12 € | 0.993
    Mandrarossa Fiano | Settesoli | 2015 | 86/100 | 5 € | 0.895
    Mirum | La Monacesca | 2013 | 96/100 | 21 € | 0.810
    Tavernello Chardonnay | Caviro | – | 74/100 | 3 € | 0.169
    Terlaner I Gran Cuvée | Cantina Terlan | 2013 | 97/100 | 180 € | 0.000
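
To see where these scores could come from, here is a minimal standalone sketch that recomputes them without Elasticsearch. It assumes the gaussian decay formula from the Elasticsearch documentation with a `decay` of 0.5, and that the price and rating scores are multiplied; the function names `gaussDecay` and `wineScore` are made up for this example.

```php
<?php
// Gaussian decay with origin, scale and offset.
// Assumption: $decay = 0.5, i.e. the score is 0.5 at distance
// $scale beyond the offset.
function gaussDecay(float $value, float $origin, float $scale, float $offset, float $decay = 0.5): float
{
    // Values within $offset of the origin keep the full score of 1.
    $distance = max(0.0, abs($value - $origin) - $offset);
    // Sigma is chosen so that the score equals $decay at distance $scale.
    $sigmaSquared = -($scale ** 2) / (2 * log($decay));

    return exp(-($distance ** 2) / (2 * $sigmaSquared));
}

// Total score: product of the price and rating scores, with the
// parameters used in the controller (price: origin 0, scale 20,
// offset 10; rating: origin 100, scale 10, offset 10).
function wineScore(float $price, float $rating): float
{
    return gaussDecay($price, 0, 20, 10) * gaussDecay($rating, 100, 10, 10);
}

// Wine name => [price in €, rating out of 100], from the list above.
$wines = [
    'Roero Arneis'          => [12, 92],
    'Mandrarossa Fiano'     => [5, 86],
    'Mirum'                 => [21, 96],
    'Tavernello Chardonnay' => [3, 74],
    'Terlaner I Gran Cuvée' => [180, 97],
];

foreach ($wines as $name => [$price, $rating]) {
    printf("%-24s %.3f\n", $name, wineScore($price, $rating));
}
```

Under these assumptions the computed values land within about 0.001 of the scores listed above, which suggests the listed scores are simply the product of the two decay functions.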

  53. WineController::listAction
    // use Elastica\Query\FunctionScore;
    // use Elastica\Query\Term;
    $query = new FunctionScore();
    $query->addDecayFunction(…);
    $query->addWeightFunction(
        2, // weight value
        new Term(["grape" => "verdicchio"])
    );

  54. Wine | Producer | Vintage | Rating | Price | Score
    Mirum | La Monacesca | 2013 | 96/100 | 21 € | 1.621
    Roero Arneis | Pescaja | 2015 | 92/100 | 12 € | 0.993
    Mandrarossa Fiano | Settesoli | 2015 | 86/100 | 5 € | 0.895
    Tavernello Chardonnay | Caviro | – | 74/100 | 3 € | 0.169
    Terlaner I Gran Cuvée | Cantina Terlan | 2013 | 97/100 | 180 € | 0.000
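
The jump of Mirum to the top can be checked by hand. The sketch below assumes that the function scores are multiplied together (the default `score_mode` of function_score, as far as I know) and that Mirum is the only wine in the list matching the `grape: verdicchio` filter; the symbols $S_{\text{decay}}$ and $w$ are introduced here for illustration.

```latex
S_{\text{Mirum}} = w \cdot S_{\text{decay}} \approx 2 \times 0.811 \approx 1.62
```

Wines that do not match the filter keep their previous score, so only the matching wine moves.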

  55. Geolocation
    INTERMEZZO

  56. Winebar.php
    public function getLocation()
    {
        return "{$this->lat}, {$this->lon}";
    }

  57. elasticsearch.yml
    winebar:
        mappings:
            location: { type: geo_point }
            […]

  58. WinebarController::listAction
    // use Elastica\Query\FunctionScore;
    $query = new FunctionScore();
    $query->addDecayFunction(
        FunctionScore::DECAY_GAUSS,
        "location",             // field name
        "41.849872, 12.574170", // origin
        "5km",                  // scale
        "1km"                   // offset
    );
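
To get a feel for what this decay does to a winebar's score, here is a minimal sketch outside Elasticsearch. It approximates the distance with the haversine formula on a spherical Earth (an assumption; Elasticsearch's own distance computation may differ slightly) and assumes a `decay` of 0.5. The function names and the second coordinate pair are hypothetical.

```php
<?php
// Great-circle distance in km using the haversine formula.
// Assumption: spherical Earth with a radius of 6371 km.
function haversineKm(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $earthRadiusKm = 6371.0;
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a = sin($dLat / 2) ** 2
        + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLon / 2) ** 2;

    return 2 * $earthRadiusKm * asin(min(1.0, sqrt($a)));
}

// Gaussian decay on a distance: score 1 within the offset,
// $decay at $scaleKm beyond the offset.
function gaussDecayKm(float $distanceKm, float $scaleKm, float $offsetKm, float $decay = 0.5): float
{
    $d = max(0.0, $distanceKm - $offsetKm);
    $sigmaSquared = -($scaleKm ** 2) / (2 * log($decay));

    return exp(-($d ** 2) / (2 * $sigmaSquared));
}

$origin  = [41.849872, 12.574170]; // origin used in the query above
$winebar = [41.902782, 12.496366]; // hypothetical winebar location

$distance = haversineKm($origin[0], $origin[1], $winebar[0], $winebar[1]);
printf("distance: %.1f km, score: %.3f\n", $distance, gaussDecayKm($distance, 5, 1));
```

With a scale of 5 km and an offset of 1 km, a winebar less than 1 km away keeps the full score, and one 6 km away scores 0.5.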

  60. And now for something
    completely different.

    — Monty Python

  61. User reviews
    • ★★★☆☆, ★★★★★
    • (★★★☆☆+★★★★★)÷2 = ★★★★☆
    • Right? [rhetorical question]

  62. Example

  63. Roero Arneis 2015, Marchesi di Barolo: 8 reviews (6 × ★★★★★, 2 × ★★★★☆), average 4.75
    Roero Arneis 2015, Pescaja: 1 review (★★★★★), average 5.00

  64. Ideas?

  65. stars × number of reviews

  66. Roero Arneis 2015, Cantina Canci: 1 review (★★★★★), 5 × 1 = 5
    Roero Arneis 2015, Pescaja: 6 reviews (6 × ★ˑˑˑˑ), 1 × 6 = 6
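
The two naive rankings discussed so far, the plain average and the total number of stars, can be sketched in a few lines. The review lists are the ones from the slides; the function and variable names are made up for this example.

```php
<?php
// Each wine is represented by its list of star ratings (1 to 5).

// Naive ranking 1: plain average of the ratings.
function averageRating(array $stars): float
{
    return array_sum($stars) / count($stars);
}

// Naive ranking 2: total number of stars (rating × number of reviews).
function totalStars(array $stars): int
{
    return array_sum($stars);
}

$eightGoodReviews = [5, 4, 5, 5, 5, 4, 5, 5];
$singleFiveStar   = [5];
$sixOneStar       = [1, 1, 1, 1, 1, 1];

// The average puts a single 5-star review above eight very good ones.
printf("average: %.2f vs %.2f\n", averageRating($eightGoodReviews), averageRating($singleFiveStar));

// The total puts six 1-star reviews above a single 5-star review.
printf("total:   %d vs %d\n", totalStars($singleFiveStar), totalStars($sixOneStar));
```

Neither ranking accounts for how much evidence there is behind a rating, which is the gap the rest of the talk sets out to close.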

  67. Other ideas?

  68. Bayesian inference
    We estimate the average rating
    from the data we have.

  69. Credible interval
    [Diagram: a 0 to 5 star scale with an interval marked from 3.7 to 4.2]

  70. 3

    1
    30

    10

  71. [Plot over the 0 to 5 star scale]

  72. [Plot over the 0 to 5 star scale]

  73. [Plot over the 0 to 5 star scale]

  74. How do we do it?

  75. Ranking items with star ratings
    evanmiller.org/ranking-items-with-star-ratings.html
    by Evan Miller

  76. WARNING
    take cover!
    ∑xi

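  77. $$
    S(n_1, \dots, n_K) = \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K}
      \pm z_{\alpha/2} \sqrt{ \left( \sum_{k=1}^{K} s_k^2 \frac{n_k + 1}{N + K} - \left( \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} \right)^2 \right) \Big/ (N + K + 1) }
    $$
    The first sum is the mean, the ± term is the credible interval, and the squared sum inside the root is mean².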

  78. Stats.php
    $votes = [ // vote value => vote count
        1 => 0,
        2 => 1,
        3 => 0,
        4 => 2,
        5 => 6,
    ];

  79. Stats.php
    $N = array_sum($votes); // total number of votes
    $K = count($votes);     // number of star levels
    $z = 1.65;              // 90% credibility
    $M = 0; $A = 0;

  80. Stats.php
    foreach ($votes as $value => $count) {
        $M += $value * ($count + 1) / ($N + $K);
        $A += ($value ** 2) * ($count + 1) / ($N + $K);
    }
    $intervalWidth = 2 * $z * sqrt(
        ($A - $M ** 2) / ($N + $K + 1)
    );
    $lowerBound = $M - $intervalWidth / 2;
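
As a usage sketch, the same computation can be wrapped in a function and applied to the two wines from the reviews example (eight very good reviews against a single 5-star review). The function name `ratingLowerBound` and the variable names are assumptions for this example; the arithmetic is the one from Stats.php above.

```php
<?php
/**
 * Lower bound of the credible interval for the average rating.
 * $votes maps each star value to the number of votes it received.
 */
function ratingLowerBound(array $votes, float $z = 1.65): float
{
    $N = array_sum($votes); // total number of votes
    $K = count($votes);     // number of star levels
    $M = 0.0;               // estimated mean rating
    $A = 0.0;               // estimated mean of the squared ratings

    foreach ($votes as $value => $count) {
        $M += $value * ($count + 1) / ($N + $K);
        $A += ($value ** 2) * ($count + 1) / ($N + $K);
    }

    // Mean minus half the width of the credible interval.
    return $M - $z * sqrt(($A - $M ** 2) / ($N + $K + 1));
}

// Star value => number of votes.
$eightReviews = [1 => 0, 2 => 0, 3 => 0, 4 => 2, 5 => 6]; // average 4.75
$oneReview    = [1 => 0, 2 => 0, 3 => 0, 4 => 0, 5 => 1]; // average 5.00

printf("eight reviews: %.2f\n", ratingLowerBound($eightReviews));
printf("one review:    %.2f\n", ratingLowerBound($oneReview));
```

With these inputs the wine with eight reviews gets a lower bound of roughly 3.5 and the wine with one review roughly 2.4, so sorting by the lower bound ranks the well-reviewed wine first even though its plain average is lower.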

  81. Bonus

  82. By medicine life may be
    prolong’d, yet death
    will seize the doctor too.

    — Shakespeare, Cymbeline

  84. Fabio Giannese, Andrés Vasquez,
    Massimo Chiarillo, Eugenio Canciello,
    Oreste di Modugno
    Thank you.