Navigating Through the World’s Encyclopedia

Elastic Co
March 18, 2015

Over the last year and a half, Wikimedia has switched from a home-grown search solution based on Lucene 2 to a highly redundant Elasticsearch cluster powering billions of user prefix and full-text searches. This talk will outline the brief history of search at Wikimedia, why we went with Elasticsearch, how we've scaled it and what we've built, and finally touch on the integration and features we've provided to users of our platform.

Presented by Nik Everett & Chad Horohoe, Wikimedia

Transcript

  1. Navigating Through the World’s Encyclopedia. And Much, Much More. Chad Horohoe and Nik Everett
  2. CC-BY-ND 4.0. 888 wikis: Wikipedia, Wiktionary, Wikidata, Wikimedia Commons, Wikisource, Wikivoyage, Wikinews, and Wikiversity
  3. 265 or so languages: English, German, French, Dutch, Italian, Spanish, Russian, Swedish, Polish, Waray-Waray, Vietnamese, Cebuano, Japanese, Portuguese, Arabic, Chinese, Ukrainian, Catalan, Norwegian (Bokmål), Finnish, Czech, Hungarian, Turkish, Romanian, Swahili, Korean, Kazakh, Danish, Esperanto, Serbian, Indonesian, Lithuanian, Volapük, Slovak, Hebrew, Persian, Bulgarian, Slovenian, Basque, Lombard, Estonian, Croatian, Newar / Nepal Bhasa, Telugu, Norwegian (Nynorsk),
  4. But We’re not Linguists
    • And we don’t understand all 889 communities we serve
    • We rely on active participation, advice, bug reporting, patience, and help from the communities that we serve
    • And some communities and some languages don’t get as good support as others, unfortunately
  5. Real Time
    • Direct edits in index usually in less than 1 minute
    • Edits through templates can take longer
    • But you can force them through by doing a null edit (a noop edit) if you are impatient
  6. 870 million full-text searches, 3.1 billion prefix searches, and 282 million updates in December 2014
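
For a rough sense of scale, the monthly totals on this slide can be converted into average per-second rates. This is a back-of-the-envelope sketch, assuming the load is spread evenly over the 31 days of December (real traffic is not flat, so peaks would be higher). The variable names are illustrative and do not come from the talk.

```python
# Average per-second rates implied by the December 2014 totals on the slide.
# Assumption: traffic is spread evenly across the month, which it is not in practice.
SECONDS_IN_DECEMBER = 31 * 24 * 60 * 60  # 2,678,400 seconds

monthly_totals = {
    "full_text_searches": 870_000_000,
    "prefix_searches": 3_100_000_000,
    "updates": 282_000_000,
}

average_per_second = {
    name: total / SECONDS_IN_DECEMBER for name, total in monthly_totals.items()
}

for name, rate in average_per_second.items():
    print(f"{name}: ~{rate:.0f} per second")
```

Under that even-spread assumption this works out to roughly 325 full-text searches, 1157 prefix searches, and 105 updates per second on average.
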
  7. 200 million docs and counting, across >1800 indices (~2 per wiki). That’s ~6500 shards ranging from a few MB to 50 GB each. Run at 2x replicas.
  8. 31 servers, 20% load. Nice servers.
  9. We Contributed!
    • Three plugins
      ◦ Swift, Experimental Highlighter, Wikimedia Extra (trigram accelerated regex search)
    • Directly
      ◦ Source transform, detect_noops updates, phrase suggester improvements, highlighting improvements, multipass rescore, some regex work
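
The "detect_noops updates" item refers to letting Elasticsearch skip a partial update when the submitted fields would not change the stored document (the update API parameter is spelled `detect_noop`). The sketch below only illustrates the idea locally; it is not Wikimedia's or Elasticsearch's implementation. The function names and the sample document are hypothetical, and the merge here is shallow, whereas a real partial update merges nested objects recursively.

```python
import json


def build_update_body(partial_doc):
    """Build a partial-update request body that asks the server to detect noops."""
    return json.dumps({"doc": partial_doc, "detect_noop": True}, sort_keys=True)


def apply_partial_update(stored, partial_doc):
    """Local illustration only: shallow-merge partial_doc into stored.

    Returns (merged_document, changed). changed is False when the merge leaves
    the document identical, which is the case a noop check allows skipping.
    """
    merged = {**stored, **partial_doc}
    return merged, merged != stored


# Hypothetical stored document and two candidate updates.
stored = {"title": "Maple syrup", "text": "Maple syrup is a syrup."}
_, unchanged_result = apply_partial_update(stored, {"title": "Maple syrup"})
_, changed_result = apply_partial_update(stored, {"text": "A new revision."})

print(build_update_body({"title": "Maple syrup"}))
print(unchanged_result, changed_result)
```

The first update repeats the existing title, so nothing would need to be rewritten; the second changes the text, so it is a real update. Skipping the first kind matters for a wiki, where many page saves do not alter the indexed fields.
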
  10. Search as a tool for editors
    • insource:
    • insource://
    • morelike:
    • incategory:
    • hastemplate:
    • linksto:
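
As a hedged illustration of how these keywords might be used, the sketch below builds MediaWiki search API URLs (`action=query&list=search`) with one query per keyword. The helper name `search_url` and every keyword argument (page titles, category, template, regex) are made up for the example; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def search_url(query, limit=10):
    """Return a MediaWiki search API URL for the given search string."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# One illustrative query per keyword from the slide; the arguments are hypothetical.
example_queries = [
    'insource:"pot pie"',             # match against the wikitext source
    "insource:/chick[ae]n/",          # regex match against the wikitext source
    "morelike:Maple syrup",           # pages similar to a given page
    "incategory:Syrup",               # pages in a category
    'hastemplate:"Infobox food"',     # pages that use a template
    'linksto:"Maple syrup"',          # pages that link to a page
]

for query in example_queries:
    print(search_url(query))
```

`urlencode` takes care of escaping the colons, slashes, brackets, and quotes, which is easy to get wrong when such queries are pasted into a URL by hand.
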
  11. The Good
    • Shard balancing
    • Expressive syntax
    • Plugins
    • Running Elasticsearch behind Linux Virtual Server
    • Analyze endpoint
    • Aliases aliases aliases
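
The slide does not say how aliases are used. One common pattern, assumed here purely for illustration, is to point an alias at a freshly built index and remove it from the old one in a single `_aliases` request, so searches against the alias never see a half-built index. The alias and index names below are hypothetical, and the sketch only builds the request body.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases request body that moves alias from old_index to new_index.

    Both actions travel in one request, so the swap is applied atomically.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: a stable alias and two timestamped index builds.
body = alias_swap_actions(
    alias="examplewiki_content",
    old_index="examplewiki_content_1",
    new_index="examplewiki_content_2",
)
print(json.dumps(body, indent=2))
```

Because clients only ever query the alias name, the index behind it can be rebuilt with new settings or mappings without changing any client configuration.
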
  12. The Bad
    • Rolling restarts
    • query_string
    • Shard balancing
  13. Next Steps!
    • Instrumentation for user perceived relevance and performance
    • Experimentation and try to improve relevance. Two pronged attack:
      ◦ Improve underlying search stuff
      ◦ Give communities more tools to curate results
    • Get much more involved with our mobile apps and web teams
    • Multilingual search
      ◦ Multilingual wikis like commons and wikidata
      ◦ Interwiki search (enrich Wikipedia with commons, wikidata, etc)
    • Nicer search for Wikidata
  14. Demo Links!
    • https://en.wikipedia.org/w/index.php?search=chicken+pot+pie&title=Special%3ASearch&go=Go&fulltext=1&cirrusDumpQuery=yes
    • https://en.wikipedia.org/wiki/Maple_syrup?action=cirrusdump
    • http://en.wikipedia.org/w/api.php?action=cirrus-config-dump&srbackend=CirrusSearch&format=json
    • http://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&srbackend=CirrusSearch&format=json
    • http://en.wikipedia.org/w/api.php?action=cirrus-settings-dump&srbackend=CirrusSearch&format=json
  15. Questions?

  16. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA