How Orange is moving its French web search engine to Elastic products

Elastic Co
November 05, 2015

Orange already runs several Elasticsearch clusters in production for its smaller search engines. Satisfied with their performance and flexibility, the team has begun migrating its French web search engine, built entirely on proprietary technology, to Elasticsearch, with the goal of going to production by the end of 2015.
This presentation shows how Orange optimized its search engine by combining Elasticsearch and Kibana. Not every document found on the web is allowed to appear in search engine results pages, so filtering is essential. With Elasticsearch and Kibana, Orange can filter documents from 1.2 billion URLs more efficiently and interactively, while visualizing the results in Kibana charts that render within a few seconds.

Jean-Pierre Paris | Elastic{ON} Tour | Paris, France

Transcript

  1. 1

  2. By Zied Benzarti, Jean-Pierre Paris
     November 5th 2015
     How Orange is moving its French web search engine to Elastic products
  3. context
     •  French search engine
        -  started in 1996 with echo interactive
     •  lots of components and data sources
        -  thematic engines
        -  French web engine
     •  lots of supporting tools
        -  compare SERPs
        -  corpus analysis
        -  …
  4. quality indicators
     •  example
        -  UNID 0252 ^.+[0-9]{4,}$
           o http://www.deezer.com/album/11030984
     •  how to debug hundreds of regexps and labels?
        -  SPAM 0600 ^(?!www)[0-9a-zA-Z]+www\.
        -  SPAM 1833 ^.+(?:casino-online|viagra-(?:online|orders))
        -  LANG 1836 ^.+/(?:lang-en|pages/eng)/
     •  other indicators to collect more information
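One way to debug a set of such rules is to run them against sample strings and list which labels fire. The following is a minimal sketch, not Orange's tooling: the `RULES` table and the `labels_for` helper are hypothetical names, and the slides do not say whether each pattern is matched against the host, the path, or the whole URL, so the sketch simply applies every pattern to the string it is given.

```python
import re

# Patterns copied from the slide. Pairing them with their labels in a
# (name, compiled regex) table is an assumption made for illustration.
RULES = [
    ("UNID 0252", re.compile(r"^.+[0-9]{4,}$")),
    ("SPAM 0600", re.compile(r"^(?!www)[0-9a-zA-Z]+www\.")),
    ("SPAM 1833", re.compile(r"^.+(?:casino-online|viagra-(?:online|orders))")),
    ("LANG 1836", re.compile(r"^.+/(?:lang-en|pages/eng)/")),
]


def labels_for(text):
    """Return the names of all rules whose pattern matches the given string."""
    return [name for name, pattern in RULES if pattern.search(text)]


if __name__ == "__main__":
    samples = [
        "http://www.deezer.com/album/11030984",  # example URL from the slide
        "abcwww.example.com",                    # hypothetical host
        "http://example.com/lang-en/index.html", # hypothetical URL
    ]
    for sample in samples:
        print(sample, labels_for(sample))
```

Printing the fired labels next to each sample makes it easier to spot a rule that matches too much or too little before the labels are injected into an index.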
  5. how to analyze indicators?
     [diagram] stages: Crawl, Indexing, Search
     [diagram] steps: Web FR, split url (host), compute indicator details, compute labels, inject
     [diagram] then: discover, visualize, analyze
  6. search engines @ lemoteur.orange.fr
     •  thematic engines
        -  under Elasticsearch
        -  started H2 2013, live in Jan 2014
        -  still growing!
     •  French web search
        -  built on proprietary technology
        -  migration to Elasticsearch well under way
     volume: 1.2 bn urls
     performance: 30 req/s, < 200 ms
  7. architecture
     crawl, indexation, search
     1 block: 200m docs
     [diagram: one block made of client nodes and data nodes]
  8. architecture multi sites
     6 blocks: 1.2bn docs
     [diagram: six boxes labeled M; six blocks, each made of http nodes and data nodes]
  9. looking for search performance
     •  JSON requests
        -  find the least costly request
        -  preserve the quality (relevance)
     •  filter cache size
        -  80%, 60%, 40%, 30%, 20%, 10%
     •  machines
        -  10, 20
     •  nodes
        -  1, 2 per machine
     •  os.processors
        -  not set, 4, 8, 16
     •  parsing JSON
        -  loader then dispatcher
        -  change for a faster lib
     •  shards
        -  80, 40, 20
     •  replica
        -  0, 1, 2
     •  RAM
        -  8GB, 48GB, 8GB
     •  aggregation
        -  one, two requests
  10. controlling filter cache (will change in 2.0)

      "bool": {
        "_cache": true,
        "_cache_key": ":static",
        "must": [
          { "term": { "_cache": false, "query": "" } },
          ...

      "bool": {
        "_cache": true,
        "_cache_key": ":fth_1998,9001",
        "must_not": [
          { "term": { "_cache": true, "themes": 1998 } },
          …
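The second fragment suggests a cache key derived from the excluded theme ids. The sketch below shows how such a filter body could be generated, assuming an Elasticsearch 1.x-style `bool` filter; the function name `theme_exclusion_filter` is hypothetical, and the `":fth_" + ids` key format is inferred from the single key visible on the slide rather than confirmed by it. As the slide notes, this mechanism changes in 2.0.

```python
import json


def theme_exclusion_filter(theme_ids):
    """Build a 1.x-style bool filter that excludes documents with the given themes.

    The cache key layout (":fth_" followed by comma-separated ids) is an
    assumption inferred from the slide, not a documented convention.
    """
    # Sort the ids so the same set of themes always yields the same cache key.
    ids = sorted(theme_ids)
    return {
        "bool": {
            "_cache": True,
            "_cache_key": ":fth_" + ",".join(str(t) for t in ids),
            "must_not": [
                {"term": {"_cache": True, "themes": t}} for t in ids
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(theme_exclusion_filter([9001, 1998]), indent=2))
```

Sorting the ids is a design choice of this sketch: two requests that exclude the same themes in a different order then share one cache entry.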
  11. looking for search performance
      •  JSON requests
         -  find the least costly request
         -  preserve the quality (relevance)
      •  filter cache size
         -  80%, 60%, 40%, 30%, 20%, 10%
      •  machines
         -  10, 20
      •  nodes
         -  1, 2 per machine
      •  os.processors
         -  not set, 4, 8, 16
      •  parsing JSON
         -  loader then dispatcher
         -  RapidJSON
      •  shards
         -  80, 40, 20
      •  replica
         -  0, 1, 2
      •  RAM
         -  8GB, 48GB, 8GB
      •  aggregation
         -  one, two requests
  12. best practices
      •  your mileage may vary
         -  test your own architecture
      •  take time to collect bench results
         -  ease comparison afterwards
      •  use monitoring tools
         -  understand hardware/software behavior (at least try to!)
      •  move around best settings
         -  reduce combinatorial explosion
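To see why the combinatorial explosion matters, the sketch below counts benchmark runs for the parameter values listed on the "looking for search performance" slides. It is illustrative only: the `GRID` dictionary and both helper functions are hypothetical, RAM and JSON parsing are left out for brevity, and reading "move around best settings" as varying one parameter at a time around a baseline is an interpretation, not something the slides state.

```python
from math import prod

# Values copied from the slides; the key names are assumptions for illustration.
GRID = {
    "filter_cache_size": ["80%", "60%", "40%", "30%", "20%", "10%"],
    "machines": [10, 20],
    "nodes_per_machine": [1, 2],
    "os.processors": [None, 4, 8, 16],  # None stands for "not set"
    "shards": [80, 40, 20],
    "replicas": [0, 1, 2],
    "aggregation_requests": [1, 2],
}


def full_grid_runs(grid):
    """Number of benchmark runs needed to test every combination."""
    return prod(len(values) for values in grid.values())


def one_at_a_time_runs(grid):
    """Runs needed for one baseline plus each alternative value tried alone."""
    return 1 + sum(len(values) - 1 for values in grid.values())


if __name__ == "__main__":
    print("full grid:", full_grid_runs(GRID))
    print("one at a time:", one_at_a_time_runs(GRID))
```

With these seven parameters, the full grid needs 1728 runs while the one-at-a-time approach needs 16, at the cost of missing interactions between settings.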
  13. indexing figures
      •  1.2 bn web documents
      •  4.2 TB index
      •  2h45m
      •  indexing
      •  index.refresh_interval
      •  http vs. data nodes
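These figures can be turned into rough per-document numbers. The calculation below is a back-of-the-envelope sketch under two assumptions that the slide does not confirm: that 2h45m is the wall-clock time to index all 1.2 bn documents, and that TB means 10^12 bytes.

```python
# Figures from the slide.
DOCS = 1.2e9                       # web documents
INDEX_BYTES = 4.2e12               # 4.2 TB, assuming decimal terabytes
DURATION_S = 2 * 3600 + 45 * 60    # 2h45m in seconds, assumed to be total indexing time

# Derived estimates (averages over the whole run and the whole index).
docs_per_second = DOCS / DURATION_S
bytes_per_doc = INDEX_BYTES / DOCS

if __name__ == "__main__":
    print(f"indexing rate: about {docs_per_second:,.0f} docs/s")
    print(f"index size:    about {bytes_per_doc:,.0f} bytes per document")
```

Under those assumptions the averages come out to roughly 121,000 documents per second and about 3.5 KB of index per document.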
  14. perspectives
      •  Kibana on every ES index!
         -  and data comes to life
         -  hope to try new 4.2 features really soon
      •  share some of our efforts with the community
         -  CCLA signed, more commits expected in the coming months!
      •  connect ES to Hadoop/Spark
         -  compute graph scores based on ES indices
         -  explore url graph (backlinks…)