Cool Bonsai Cool - An introduction to ElasticSearch

YAPC::EU 2011

Clinton Gormley

August 16, 2011

Transcript

  1. “Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011

  2. Why do I need a search engine?

  6. Search is how we find stuff

  9. How does a search engine work?

  11. Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic

    Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  12. Magic == inverted index + relevance scoring

  13. Take some text:

    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
  15. Tokenise it:

    acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic
  17. Find unique tokens/terms:

    8 acme ball config ext file info m magic meta mime mro object pager pony template test xs
  18. Map terms to documents:

    Terms: acme file magic mime template xs
    Documents: Acme::Magic8Ball Acme::Magic::Pony File::Magic File::MimeInfo::Magic MagicTemplate Template::Magic Template::Magic::Pager XS::Object::Magic XS::MagicExt File::MMagic::XS
  20. Search for: “file xs”

    Terms: acme file magic mime template xs
    Documents: Acme::Magic8Ball Acme::Magic::Pony File::Magic File::MimeInfo::Magic MagicTemplate Template::Magic Template::Magic::Pager XS::Object::Magic XS::MagicExt File::MMagic::XS
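
The steps on these slides (take some text, tokenise it, find the unique terms, map terms to documents, search) can be sketched in a few lines of plain Perl. This is only an illustration of the inverted-index idea, not how Lucene or elasticsearch implement it: the `tokenise` and `search` helpers are hypothetical, and the tokenising rules are assumptions chosen so that the module names split the way the slides show.

```perl
use strict;
use warnings;

# Document titles taken from the slides.
my @docs = qw(
    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic
    File::MimeInfo::Magic File::MMagic::XS MagicTemplate
    Meta::File::MMagic MRO::Magic Template::Magic
    Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic
);

# Hypothetical tokeniser: split on "::", then on case and letter/digit
# boundaries, and lowercase every piece.
sub tokenise {
    my ($text) = @_;
    my @tokens;
    for my $part ( split /::/, $text ) {
        $part =~ s/(?<=[a-z])(?=[A-Z])/ /g;         # MimeInfo   -> Mime Info
        $part =~ s/(?<=[A-Z])(?=[A-Z][a-z])/ /g;    # MMagic     -> M Magic
        $part =~ s/(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])/ /g;    # Magic8Ball -> Magic 8 Ball
        push @tokens, map { lc $_ } split ' ', $part;
    }
    return @tokens;
}

# The inverted index: term => { document => 1 }.
my %index;
for my $doc (@docs) {
    $index{$_}{$doc} = 1 for tokenise($doc);
}

# Return every document containing at least one query term,
# documents matching more terms first, ties broken alphabetically.
sub search {
    my ($query) = @_;
    my %seen;
    my @terms = grep { !$seen{$_}++ } tokenise($query);
    my %matches;    # document => number of query terms it contains
    for my $term (@terms) {
        $matches{$_}++ for keys %{ $index{$term} || {} };
    }
    return sort { $matches{$b} <=> $matches{$a} || $a cmp $b } keys %matches;
}

print join( ' ', sort keys %index ), "\n";    # the unique terms
print "$_\n" for search('file xs');
```

The first line printed should be the same sorted term list as on the "Find unique tokens/terms" slide. The search lists File::MMagic::XS first because it is the only title containing both "file" and "xs".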
  21. But, not just about finding

  23. Sort by RELEVANCE

  24. Relevance: How many matching terms does this document contain?

  25. Relevance: How often does each term appear in this document, as a % of its length?

  26. Relevance: How frequently does each term appear in all your documents?

  29. Relevance: Can be customised, by document or field, at index or search time

  30. Simple as: Can be customised, by document or field, at index or search time
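
The three relevance questions above (how many query terms match, how often a term appears relative to the document's length, and how common the term is across all documents) can be combined into one score. The sketch below uses a simplified TF-IDF style formula as an assumption; it is not the scoring formula Lucene actually uses, and the mini-corpus and the `score` helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical mini-corpus: document id => text.
my %docs = (
    1 => 'elasticsearch is cool',
    2 => 'perl is cool and perl is fun',
    3 => 'search engines rank documents',
);

# Per-document term counts and lengths, plus the number of
# documents each term appears in (document frequency).
my ( %tf, %len, %df );
for my $id ( keys %docs ) {
    my @tokens = split ' ', lc $docs{$id};
    $len{$id} = @tokens;
    $tf{$id}{$_}++ for @tokens;
    $df{$_}++ for keys %{ $tf{$id} };
}

# Simplified score: for each matching term, (share of the document
# taken up by the term) x log(total documents / documents with the term).
sub score {
    my ( $id, @terms ) = @_;
    my $total = keys %docs;
    my $score = 0;
    for my $term (@terms) {
        my $count = $tf{$id}{$term} or next;    # term not in this document
        $score += ( $count / $len{$id} ) * log( $total / $df{$term} );
    }
    return $score;
}

my @query = qw(perl cool);
for my $id ( sort { score( $b, @query ) <=> score( $a, @query ) } keys %docs ) {
    printf "doc %d scores %.3f\n", $id, score( $id, @query );
}
```

Document 2 ranks first here: it matches both terms, and "perl" is rare across the corpus, so it counts for more than the more common "cool".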
  31. FAST!

  32. POWERFUL!

  33. MAGIC!

  37. www.elasticsearch.org

  43. elasticsearch is:

    • an Open Source (Apache 2)
    • distributed
    • RESTful
    • search engine
    • built on top of Lucene
  44. Installing elasticsearch: Latest version at: http://www.elasticsearch.org/download/

    wget https://github.com/.../elasticsearch-0.17.6.tar.gz
    tar -xzf elasticsearch-0.17.6.tar.gz
    cd elasticsearch-0.17.6/
    ./bin/elasticsearch
  45. Installing ElasticSearch.pm: Latest version at: https://metacpan.org/module/ElasticSearch

    cpanm ElasticSearch
    perl -de 0
    > use ElasticSearch;
    > $e = ElasticSearch->new( trace_calls => 1 )
    > $e->cluster_health
  53. Some terminology

    Relational DB      elasticsearch
    database       ⇒   index
    table          ⇒   type
    row            ⇒   document
    column         ⇒   field
    schema         ⇒   mapping
    index          ⇒   everything is indexed
    SQL            ⇒   query DSL
  55. Clustering: auto-discovery

  56. Clustering: single master, auto-elected

  57. Clustering: immediate failover, master re-election

  60. Clustering: index == 1 or more primary shards + 0 or more replica shards

  64. Clustering:

    more primary shards
      ⇒ faster indexing
      ⇒ more scale
    more replicas
      ⇒ faster searching
      ⇒ more failover
  65. Clustering: Big subject...

    http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html
    http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf

  67. Document oriented: No ORM required

  68. Document oriented: JSON in ⇔ JSON out

  70. Schema free: dynamic (or strict) mapping

  71. Unknown field?

  73. elasticsearch guesses the type and indexes it

  84. Put data in:

    $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,    # optional; ES always returns the ID
        data  => {
            tweet => "ElasticSearch is cool",
            sent  => "2011-08-16 15:15:00",
            user  => { name => "Clinton", user_id => 123 },
            tags  => [ "search", "perl" ],
        },
    );
  85. Realtime GET

  86. Retrieve your doc immediately

  87. Persistent

  88. No commit required

  92. Get data out:

    $e->get( index => 'twitter', type => 'tweet', id => 1 );

    {
        _index   => 'twitter',
        _type    => 'tweet',
        _id      => 1,
        _version => 1,
        _source  => {
            tweet => "ElasticSearch is cool",
            sent  => "2011-08-16 15:15:00",
            user  => { name => "Clinton", user_id => 123 },
            tags  => [ 'search', 'perl' ],
        },
    }
  96. bulk-indexing, multi-get

    avoids HTTP latency
    10x as fast!

  97. Versioning

  98. Versioning: “Optimistic concurrency control”

  99. Versioning: “Put if absent”

  100. Versioning: Optional

  101. Versioning: Can use external version numbers
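
To make the versioning slides concrete, here is a minimal in-memory sketch of optimistic concurrency control and "put if absent". It does not talk to elasticsearch and does not use the ElasticSearch.pm API: the `%store` hash, the `put_doc` helper and the convention that an expected version of 0 means "only create" are all assumptions made for illustration.

```perl
use strict;
use warnings;

my %store;    # id => { version => N, source => {...} }

# Write a document and return its new version.
# If $expected is given, the write only succeeds when it equals the
# version currently stored; 0 means "the document must not exist yet".
# If $expected is omitted, no check is made (versioning is optional).
sub put_doc {
    my ( $id, $source, $expected ) = @_;
    my $current = $store{$id} ? $store{$id}{version} : 0;
    if ( defined $expected && $expected != $current ) {
        die "version conflict: expected $expected, found $current\n";
    }
    $store{$id} = { version => $current + 1, source => $source };
    return $current + 1;
}

# Put if absent, then update from the version we just saw.
my $v1 = put_doc( 1, { tweet => 'ElasticSearch is cool' }, 0 );
my $v2 = put_doc( 1, { tweet => 'Still cool' }, $v1 );

# A writer still holding the old version is rejected, not silently applied.
my $ok = eval { put_doc( 1, { tweet => 'Stale write' }, $v1 ); 1 };
print $ok ? "stale write accepted\n" : "stale write rejected: $@";
```

The point of the pattern is that nobody takes a lock: each writer says which version it last saw, and a writer that lost the race finds out and can re-read and retry.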

  102. So far, all we have is a NoSQL document store which is fast, reliable, scalable & easy to use

  105. Simple search

    $e->search(
        index => 'twitter',
        type  => 'tweet',
    );
  106. Simple search

    $e->search(
        index => [ 'twitter', 'facebook' ],
        type  => [ 'tweet', 'post' ],
    );
  107. Simple search

    $e->search(
        # all indices
        # all types
    );
  109. Simple search

    $e->search(
        index => 'twitter',
        type  => 'tweet',
        query => { text => { _all => 'clinton' } },
    );
  111. Simple search

    $e->search(
        index  => 'twitter',
        type   => 'tweet',
        queryb => 'clinton',    # ElasticSearch::SearchBuilder, like SQL::Abstract
    );
  116. Search results

    {
        took => 1,                # milliseconds
        hits => {
            total     => 1,       # total results
            max_score => 1,
            hits      => [
                {
                    _score  => 1,
                    _index  => 'twitter',
                    _type   => 'tweet',
                    _id     => 1,
                    _source => {
                        tweet => "ElasticSearch is cool",
                        sent  => "2011-08-16 15:15:00",
                        user  => { name => "Clinton", user_id => 123 },
                        tags  => [ 'search', 'perl' ],
                    },
                },
            ],
        },
        ... other information ...
    }
  117. JSON doc included in results

  118. No need to fetch from DB

  119. Docs visible to search in near-real time (< 1 second)

  120. refresh_index() to force

  121. What can you do with search?

  122. standard text search

  123. ...with highlighting

  125. stemming

    arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, german2, greek, hindi, hungarian, indonesian, italian, kp, light_finish, light_french, light_german, light_hungarian, light_italian, light_portuguese, light_russian, light_spanish, light_swedish, lovins, minimal_english, minimal_french, minimal_german, minimal_portuguese, norwegian, persian, porter, porter2, portuguese, possessive_english, romanian, russian, spanish, swedish, thai, turkish
  126. ngrams & edge-ngrams

  127. auto-complete
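
As a rough picture of how edge-ngrams support auto-complete: each token is indexed under every one of its prefixes, so whatever the user has typed so far becomes a plain term lookup. The sketch below is plain Perl with made-up titles; `edge_ngrams` and `complete` are hypothetical helpers, and the minimum and maximum gram lengths are arbitrary choices.

```perl
use strict;
use warnings;

# Edge n-grams: every prefix of a token from $min up to $max characters.
sub edge_ngrams {
    my ( $token, $min, $max ) = @_;
    my $limit = length($token) < $max ? length($token) : $max;
    return map { substr( $token, 0, $_ ) } $min .. $limit;
}

# Hypothetical titles to complete against.
my @titles = ( 'elasticsearch', 'elastic band', 'perl' );

# Index every title under each prefix of each of its words.
my %index;    # prefix => { title => 1 }
for my $title (@titles) {
    for my $token ( split ' ', lc $title ) {
        $index{$_}{$title} = 1 for edge_ngrams( $token, 1, 10 );
    }
}

# Completion is a single hash lookup on what has been typed so far.
# Input longer than the maximum gram length (10 here) finds nothing.
sub complete {
    my ($typed) = @_;
    return sort keys %{ $index{ lc $typed } || {} };
}

print join( ', ', complete('ela') ), "\n";
print join( ', ', complete('pe') ),  "\n";
```

The trade-off is a larger index in exchange for cheap lookups at search time, which is what you want when a query fires on every keystroke.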

  128. camelCase

  131. term facets, date histograms
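
A term facet is essentially a count of how many matching documents carry each value of a field. The sketch below computes one in plain Perl over a few made-up documents shaped like the tweet example; the `term_facet` helper is hypothetical, and in elasticsearch the counting happens on the server as part of the search request.

```perl
use strict;
use warnings;

# Hypothetical search hits, shaped like the tweet documents.
my @tweets = (
    { tags => [ 'search', 'perl' ] },
    { tags => [ 'search', 'lucene' ] },
    { tags => ['perl'] },
    { tags => ['search'] },
);

# Term facet: for one field, count the documents containing each value.
sub term_facet {
    my ( $field, @docs ) = @_;
    my %count;
    for my $doc (@docs) {
        my %seen;    # count each value once per document
        $count{$_}++ for grep { !$seen{$_}++ } @{ $doc->{$field} || [] };
    }
    return \%count;
}

my $facet = term_facet( 'tags', @tweets );
for my $term ( sort { $facet->{$b} <=> $facet->{$a} || $a cmp $b } keys %$facet ) {
    print "$term: $facet->{$term}\n";
}
```

A date histogram follows the same pattern, with the key being the date rounded down to an interval (for example the day or month) instead of a tag.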

  132. ranges

  133. geo bounding box

  134. geo distance

  135. geo distance ranges

  136. geo polygons

  139. “Terms of endearment”: The ElasticSearch query language explained. Thurs. 14:35, Auditorija 301