Slide 1

Slide 1 text

Navigating Through the World’s Encyclopedia
And Much, Much More
Chad Horohoe and Nik Everett

Slide 2

Slide 2 text

CC-BY-ND 4.0
888 wikis
Wikipedia, Wiktionary, Wikidata, Wikimedia Commons, Wikisource, Wikivoyage, Wikinews, and Wikiversity

Slide 3

Slide 3 text

265 or so languages
English, German, French, Dutch, Italian, Spanish, Russian, Swedish, Polish, Waray-Waray, Vietnamese, Cebuano, Japanese, Portuguese, Arabic, Chinese, Ukrainian, Catalan, Norwegian (Bokmål), Finnish, Czech, Hungarian, Turkish, Romanian, Swahili, Korean, Kazakh, Danish, Esperanto, Serbian, Indonesian, Lithuanian, Volapük, Slovak, Hebrew, Persian, Bulgarian, Slovenian, Basque, Lombard, Estonian, Croatian, Newar / Nepal Bhasa, Telugu, Norwegian (Nynorsk),

Slide 4

Slide 4 text

But We’re not Linguists
● And we don’t understand all 889 communities we serve
● We rely on active participation, advice, bug reporting, patience, and help from the communities that we serve
● And some communities and some languages don’t get as good support as others, unfortunately

Slide 5

Slide 5 text

Real Time
● Direct edits usually reach the index in less than 1 minute
● Edits through templates can take longer
● But if you are impatient, you can force them through by doing a null edit (a noop edit)

Slide 6

Slide 6 text

870 million full text searches in December 2014
Plus 3.1 billion prefix searches and 282 million updates in the same month

Slide 7

Slide 7 text

200 million docs and counting
Across >1800 indices (~2 per wiki)
That’s ~6500 shards ranging from a few MB to 50 GB each
Run at 2x replicas
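
The figures above invite a quick back-of-the-envelope check. The sketch below is only that: it assumes the ~6500 shards are primaries, that "2x replicas" means two replica copies per primary, and that shards spread evenly over the 31 servers mentioned on the next slide. The slides confirm none of these readings.

```python
# Rough capacity arithmetic from the slide figures.
# Every interpretation below is an assumption, not a statement from the talk.
DOCS = 200_000_000
INDICES = 1800          # the slide says ">1800", so this is a lower bound
PRIMARY_SHARDS = 6500   # assumed to count primary shards only
REPLICAS = 2            # assumed: two replica copies per primary
SERVERS = 31            # assumed: every server holds data


def summarize():
    """Return derived per-index, per-shard and per-server averages."""
    total_copies = PRIMARY_SHARDS * (1 + REPLICAS)
    return {
        "shards_per_index": PRIMARY_SHARDS / INDICES,
        "docs_per_primary": DOCS / PRIMARY_SHARDS,
        "total_shard_copies": total_copies,
        "copies_per_server": total_copies / SERVERS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages hide a wide spread, since the slide gives shard sizes from a few MB to 50 GB.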

Slide 8

Slide 8 text

31 servers
20% load. Nice servers.

Slide 9

Slide 9 text

We Contributed!
● Three plugins
  ○ Swift, Experimental Highlighter, Wikimedia Extra (trigram accelerated regex search)
● Directly
  ○ Source transform, detect_noops updates, phrase suggester improvements, highlighting improvements, multipass rescore, some regex work
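
To illustrate the general idea behind "trigram accelerated regex search", here is a minimal sketch. It is not the Wikimedia Extra plugin's implementation. It assumes the caller supplies a literal substring (`required_literal`, a hypothetical parameter) that every match must contain, whereas a real system would derive the required trigrams from the regex itself. The trigram index narrows the candidate set, and the regex then runs only on those candidates.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def regex_search(docs, index, pattern, required_literal):
    """Run pattern only on docs containing every trigram of required_literal.

    required_literal is a caller-supplied substring that any match must
    contain (an assumption of this sketch). If it is shorter than three
    characters there is nothing to filter on, so every doc is scanned.
    """
    grams = trigrams(required_literal)
    if not grams:
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))


if __name__ == "__main__":
    docs = {1: "maple syrup is sweet", 2: "chicken pot pie", 3: "maple sugar candy"}
    index = build_index(docs)
    print(regex_search(docs, index, r"maple s[a-z]+p", "maple"))
```

The trigram filter can only rule documents out. The regex still decides what actually matches, so results are the same as a full scan as long as the required literal is chosen correctly.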

Slide 10

Slide 10 text

Search as a tool for editors
● insource:
● insource://
● morelike:
● incategory:
● hastemplate:
● linksto:
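
These keywords go straight into the ordinary search box, so a query using them can also be expressed as a URL. The sketch below builds one, reusing the `search`, `title`, `fulltext` and `cirrusDumpQuery` parameters that appear in the demo links later in this deck. The helper name `search_url` and the example keyword arguments are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode

# Base URL as used in the deck's demo links.
BASE = "https://en.wikipedia.org/w/index.php"


def search_url(query, dump_query=False):
    """Build a full text search URL for the given query string.

    With dump_query=True the cirrusDumpQuery flag from the demo links
    is added to the URL.
    """
    params = {"search": query, "title": "Special:Search", "fulltext": "1"}
    if dump_query:
        params["cirrusDumpQuery"] = "yes"
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Example queries are illustrative; the argument values are assumptions.
    for q in ("insource:/colou?r/", "morelike:Maple syrup", "linksto:Maple_syrup"):
        print(search_url(q))
```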

Slide 11

Slide 11 text

The Good
● Shard balancing
● Expressive syntax
● Plugins
● Running Elasticsearch behind Linux Virtual Server
● Analyze endpoint
● Aliases aliases aliases
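
The slide does not say how aliases are used here, but one common pattern is repointing an alias from an old index to a freshly built one in a single request to Elasticsearch's `_aliases` endpoint. The sketch below only builds the request body. The alias and index names are hypothetical, and sending the request is left out.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build an _aliases request body that moves alias from old to new.

    Both actions travel in one request, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names, chosen only for illustration.
    body = alias_swap_body("wiki_content", "wiki_content_v1", "wiki_content_v2")
    print(json.dumps(body, indent=2))
```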

Slide 12

Slide 12 text

The Bad
● Rolling restarts
● query_string
● Shard balancing

Slide 13

Slide 13 text

Next Steps!
● Instrumentation for user-perceived relevance and performance
● Experiment and try to improve relevance. Two-pronged attack:
  ○ Improve underlying search stuff
  ○ Give communities more tools to curate results
● Get much more involved with our mobile apps and web teams
● Multilingual search
  ○ Multilingual wikis like Commons and Wikidata
  ○ Interwiki search (enrich Wikipedia with Commons, Wikidata, etc.)
● Nicer search for Wikidata

Slide 14

Slide 14 text

Demo Links!
● https://en.wikipedia.org/w/index.php?search=chicken+pot+pie&title=Special%3ASearch&go=Go&fulltext=1&cirrusDumpQuery=yes
● https://en.wikipedia.org/wiki/Maple_syrup?action=cirrusdump
● http://en.wikipedia.org/w/api.php?action=cirrus-config-dump&srbackend=CirrusSearch&format=json
● http://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&srbackend=CirrusSearch&format=json
● http://en.wikipedia.org/w/api.php?action=cirrus-settings-dump&srbackend=CirrusSearch&format=json

Slide 15

Slide 15 text

Questions?

Slide 16

Slide 16 text

This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.