Slide 1

Slide 1 text

“Cool, Bonsai, Cool”: An introduction to ElasticSearch. Clinton Gormley, YAPC::EU 2011

Slide 2

Slide 2 text

Why do I need a search engine?

Slide 6

Slide 6 text

Search is how we find stuff

Slide 9

Slide 9 text

How does a search engine work?

Slide 11

Slide 11 text

Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic

Slide 12

Slide 12 text

Magic == inverted index + relevance scoring

Slide 13

Slide 13 text

Take some text: Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic

Slide 14

Slide 14 text

Tokenise it: Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic MRO::Magic Template::Magic Template::Magic::Pager Test::Magic XS::MagicExt XS::Object::Magic

Slide 15

Slide 15 text

Tokenise it: acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic

Slide 16

Slide 16 text

Find unique tokens/terms: acme magic 8 ball acme magic pony config magic file magic file mime info magic file m magic xs magic template meta file m magic mro magic template magic template magic pager test magic xs magic ext xs object magic

Slide 17

Slide 17 text

Find unique tokens/terms: 8 acme ball config ext file info m magic meta mime mro object pager pony template test xs
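
The tokenising and de-duplicating steps on the slides above can be sketched in a few lines of Perl. The splitting rules below (on "::", on camelCase boundaries, and between letters and digits) are an assumption chosen to reproduce the tokens shown on the slides; the `tokenise` helper is hypothetical, and a real analyzer is configurable.

```perl
use strict;
use warnings;

# Hypothetical tokeniser: the rules are guesses that reproduce the slide output.
sub tokenise {
    my ($text) = @_;
    $text =~ s/([a-z])([A-Z])/$1 $2/g;          # camelCase: MimeInfo -> Mime Info
    $text =~ s/([A-Z]+)([A-Z][a-z])/$1 $2/g;    # acronym + word: MMagic -> M Magic
    $text =~ s/([A-Za-z])(\d)/$1 $2/g;          # letter/digit: Magic8 -> Magic 8
    $text =~ s/(\d)([A-Za-z])/$1 $2/g;          # digit/letter: 8Ball -> 8 Ball
    return map { lc $_ } grep { length } split /[^A-Za-z0-9]+/, $text;
}

my @modules = qw(
    Acme::Magic8Ball Acme::Magic::Pony Config::Magic File::Magic
    File::MimeInfo::Magic File::MMagic::XS MagicTemplate Meta::File::MMagic
    MRO::Magic Template::Magic Template::Magic::Pager Test::Magic
    XS::MagicExt XS::Object::Magic
);

# Unique terms: use a hash as a set, then sort the keys.
my %seen;
$seen{$_}++ for map { tokenise($_) } @modules;
my @terms = sort keys %seen;

print join(' ', tokenise('File::MMagic::XS')), "\n";   # file m magic xs
print scalar(@terms), " unique terms: @terms\n";       # 18 unique terms: 8 acme ball ...
```

The hash-as-set idiom is enough here because only the distinct terms matter at this stage; which document each term came from is tracked in the next step.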

Slide 18

Slide 18 text

Map terms to documents
Terms: acme, file, magic, mime, template, xs
Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS

Slide 19

Slide 19 text

Search for: “file xs”
Terms: acme, file, magic, mime, template, xs
Documents: Acme::Magic8Ball, Acme::Magic::Pony, File::Magic, File::MimeInfo::Magic, MagicTemplate, Template::Magic, Template::Magic::Pager, XS::Object::Magic, XS::MagicExt, File::MMagic::XS
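
The term-to-document mapping and the "file xs" lookup above amount to an inverted index. Here is a minimal sketch, using hand-tokenised copies of four of the module names from the slides. The `%index` layout and the `search` helper are illustrative assumptions, not how Lucene stores its index.

```perl
use strict;
use warnings;

# Documents already tokenised, as on the "Tokenise it" slide.
my %docs = (
    'File::Magic'      => [qw(file magic)],
    'File::MMagic::XS' => [qw(file m magic xs)],
    'XS::MagicExt'     => [qw(xs magic ext)],
    'Template::Magic'  => [qw(template magic)],
);

# Inverted index: term => { document name => 1 }.
my %index;
for my $doc (keys %docs) {
    $index{$_}{$doc} = 1 for @{ $docs{$doc} };
}

# Return [document, number of matching query terms] pairs, best match first
# (ties are broken by name so the order is stable).
sub search {
    my ($query) = @_;
    my %query_terms = map { $_ => 1 } split ' ', lc $query;
    my %matches;
    for my $term (keys %query_terms) {
        $matches{$_}++ for keys %{ $index{$term} || {} };
    }
    return map { [ $_, $matches{$_} ] }
           sort { $matches{$b} <=> $matches{$a} || $a cmp $b } keys %matches;
}

printf "%-18s %d\n", @$_ for search('file xs');
```

Only the two postings lists for "file" and "xs" are read, so the cost of a lookup depends on the query terms rather than on the total number of documents.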

Slide 21

Slide 21 text

But, not just about finding

Slide 23

Slide 23 text

Sort by RELEVANCE

Slide 24

Slide 24 text

Relevance: How many matching terms does this document contain?

Slide 25

Slide 25 text

Relevance: How often does each term appear in this document, as a % of its length?

Slide 26

Slide 26 text

Relevance: How frequently does each term appear in all your documents?
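
The three questions on the relevance slides (how many query terms match, how much of the document each term takes up, how common each term is across all documents) correspond to a coordination factor, term frequency and inverse document frequency. A deliberately simplified sketch follows. The formula, the sample documents and the `score` helper are assumptions for illustration and are not Lucene's actual scoring.

```perl
use strict;
use warnings;

# Three tiny pre-tokenised documents (made-up data for illustration).
my %docs = (
    a => [qw(perl search engine)],
    b => [qw(perl perl perl module)],
    c => [qw(search search index engine tuning guide)],
);

# Document frequency: in how many documents does each term occur?
my %df;
for my $tokens (values %docs) {
    my %uniq = map { $_ => 1 } @$tokens;
    $df{$_}++ for keys %uniq;
}

sub score {
    my ($doc, @query) = @_;
    return 0 unless @query;
    my $tokens = $docs{$doc};
    my $n_docs = keys %docs;
    my ($sum, $matched) = (0, 0);
    for my $term (@query) {
        my $count = grep { $_ eq $term } @$tokens;
        next unless $count;
        $matched++;
        my $tf  = $count / @$tokens;                # share of the document
        my $idf = 1 + log($n_docs / $df{$term});    # rarer terms weigh more
        $sum += $tf * $idf;
    }
    return $sum * $matched / @query;                # reward matching more terms
}

printf "%s %.3f\n", $_, score($_, qw(perl search)) for sort keys %docs;
```

With this data, document a ranks first for "perl search" because it is the only one containing both terms, even though b mentions "perl" more often.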

Slide 27

Slide 27 text

Relevance: Can be customised

Slide 28

Slide 28 text

Relevance: Can be customised By document or field

Slide 29

Slide 29 text

Relevance: Can be customised By document or field At index or search time

Slide 30

Slide 30 text

Simple as: Can be customised By document or field At index or search time

Slide 31

Slide 31 text

FAST!

Slide 32

Slide 32 text

POWERFUL!

Slide 33

Slide 33 text

MAGIC!

Slide 37

Slide 37 text

www.elasticsearch.org

Slide 38

Slide 38 text

elasticsearch is:
● an Open Source (Apache 2)
● distributed
● RESTful
● search engine
● built on top of Lucene

Slide 44

Slide 44 text

Installing elasticsearch:
Latest version at: http://www.elasticsearch.org/download/
    wget https://github.com/.../elasticsearch-0.17.6.tar.gz
    tar -xzf elasticsearch-0.17.6.tar.gz
    cd elasticsearch-0.17.6/
    ./bin/elasticsearch

Slide 45

Slide 45 text

Installing ElasticSearch.pm:
Latest version at: https://metacpan.org/module/ElasticSearch
    cpanm ElasticSearch
    perl -de 0
    > use ElasticSearch;
    > $e = ElasticSearch->new( trace_calls => 1)
    > $e->cluster_health

Slide 46

Slide 46 text

Some terminology
    Relational DB      elasticsearch
    database       ⇒   index
    table          ⇒   type
    row            ⇒   document
    column         ⇒   field
    schema         ⇒   mapping
    index          ⇒   everything is indexed
    SQL            ⇒   query DSL

Slide 54

Slide 54 text

Clustering

Slide 55

Slide 55 text

Clustering auto-discovery

Slide 56

Slide 56 text

Clustering single master auto-elected

Slide 57

Slide 57 text

Clustering immediate failover master re-election

Slide 58

Slide 58 text

Clustering index ==

Slide 59

Slide 59 text

Clustering index == 1 or more primary shards

Slide 60

Slide 60 text

Clustering index == 1 or more primary shards + 0 or more replica shards

Slide 61

Slide 61 text

Clustering
more primary shards ⇒ faster indexing ⇒ more scale
more replicas ⇒ faster searching ⇒ more failover
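
One way to picture how documents end up spread over the primary shards is to hash the document ID and take the result modulo the shard count. The sketch below is an assumption for illustration only: the MD5-based hash and the `shard_for` helper are hypothetical and are not elasticsearch's actual routing code.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical routing: map a document ID onto one of $primaries shards.
sub shard_for {
    my ($id, $primaries) = @_;
    my $hash = hex substr(md5_hex($id), 0, 8);   # first 32 bits of the digest
    return $hash % $primaries;
}

my $primaries = 5;
my %per_shard;
$per_shard{ shard_for($_, $primaries) }++ for 1 .. 1000;

printf "shard %d: %d docs\n", $_, $per_shard{$_} || 0 for 0 .. $primaries - 1;
```

Because the same ID always hashes to the same shard, indexing work is divided between shards, which is the intuition behind "more primary shards ⇒ faster indexing".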

Slide 65

Slide 65 text

Clustering
Big subject...
http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf

Slide 66

Slide 66 text

Document oriented:

Slide 67

Slide 67 text

Document oriented: No ORM required

Slide 68

Slide 68 text

Document oriented: JSON in ⇔ JSON out
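
No ORM is needed because a nested Perl hash already has the shape of a JSON document. Here is a small sketch of the round trip using the core JSON::PP module. That module choice is an assumption (any JSON encoder would do), and the document is the example tweet used later in the talk.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->canonical->pretty;

# A Perl data structure maps directly onto a JSON document...
my $doc = {
    tweet => 'ElasticSearch is cool',
    user  => { name => 'Clinton', user_id => 123 },
    tags  => [ 'search', 'perl' ],
};
my $text = $json->encode($doc);
print $text;

# ...and the JSON that comes back decodes to the same structure.
my $copy = $json->decode($text);
print "user: $copy->{user}{name}, tags: @{ $copy->{tags} }\n";
```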

Slide 69

Slide 69 text

Schema free Dynamic mapping

Slide 70

Slide 70 text

Schema free Dynamic (or strict) mapping

Slide 71

Slide 71 text

Unknown field?

Slide 72

Slide 72 text

elasticsearch guesses the type

Slide 73

Slide 73 text

elasticsearch guesses the type and indexes it

Slide 74

Slide 74 text

Put data in: $e->index( );

Slide 75

Slide 75 text

Put data in: $e->index( index => 'twitter', );

Slide 76

Slide 76 text

Put data in: $e->index( index => 'twitter', type => 'tweet', );

Slide 77

Slide 77 text

Put data in: $e->index( index => 'twitter', type => 'tweet', id => 1, );

Slide 78

Slide 78 text

Put data in:
    $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,   # optional
    );

Slide 79

Slide 79 text

Put data in:
    $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,   # ES always returns the ID
    );

Slide 80

Slide 80 text

Put data in: $e->index( index => 'twitter', type => 'tweet', id => 1, data => { } );

Slide 81

Slide 81 text

Put data in:
    $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,
        data  => {
            tweet => "ElasticSearch is cool",
            sent  => "2011-08-16 15:15:00",
            user  => { name => "Clinton", user_id => 123 },
            tags  => ["search", "perl"],
        }
    );

Slide 85

Slide 85 text

Realtime GET

Slide 86

Slide 86 text

Retrieve your doc immediately

Slide 87

Slide 87 text

Persistent

Slide 88

Slide 88 text

No commit required

Slide 89

Slide 89 text

Get data out: $e->get( index => 'twitter', type => 'tweet', id => 1);

Slide 90

Slide 90 text

Get data out: $e->get( index => 'twitter', type => 'tweet', id => 1); { _index => 'twitter', _type => 'tweet', _id => 1, }

Slide 91

Slide 91 text

Get data out: $e->get( index => 'twitter', type => 'tweet', id => 1); { _index => 'twitter', _type => 'tweet', _id => 1, _version => 1, }

Slide 92

Slide 92 text

Get data out:
    $e->get( index => 'twitter', type => 'tweet', id => 1);

    {
        _index   => 'twitter',
        _type    => 'tweet',
        _id      => 1,
        _version => 1,
        _source  => {
            tweet => "ElasticSearch is cool",
            sent  => "2011-08-16 15:15:00",
            user  => { name => "Clinton", user_id => 123 },
            tags  => ['search','perl'],
        }
    }

Slide 93

Slide 93 text

bulk-indexing

Slide 94

Slide 94 text

bulk-indexing multi-get

Slide 95

Slide 95 text

bulk-indexing multi-get avoids http latency

Slide 96

Slide 96 text

bulk-indexing multi-get avoids http latency 10x as fast!
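
Bulk indexing saves round trips by packing many operations into one request. The sketch below builds the kind of newline-delimited body that the REST `_bulk` endpoint accepts: one action line followed by one document line. The `bulk_body` helper and the second tweet are made up for illustration, and ElasticSearch.pm's own bulk methods are not shown here.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->canonical;

# Build one newline-delimited request body for several documents.
sub bulk_body {
    my ($index, $type, @docs) = @_;
    my $body = '';
    for my $doc (@docs) {
        my $action = { index => { _index => $index, _type => $type, _id => $doc->{id} } };
        $body .= $json->encode($action)      . "\n";   # what to do
        $body .= $json->encode($doc->{data}) . "\n";   # the document itself
    }
    return $body;
}

print bulk_body(
    'twitter', 'tweet',
    { id => 1, data => { tweet => 'ElasticSearch is cool' } },
    { id => 2, data => { tweet => 'Bulk indexing is cooler' } },
);
```

Two documents cost one HTTP request instead of two; the saving grows with the batch size.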

Slide 97

Slide 97 text

Versioning

Slide 98

Slide 98 text

Versioning “Optimistic concurrency control”

Slide 99

Slide 99 text

Versioning “Put if absent”

Slide 100

Slide 100 text

Versioning Optional

Slide 101

Slide 101 text

Versioning Can use external version numbers
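
The idea behind optimistic concurrency control and "put if absent" can be shown with an in-memory store. This is a toy model under stated assumptions: the `%store` hash and the `put` helper with its `version` and `create` arguments are hypothetical and do not reproduce the ElasticSearch.pm API. External version numbers are not covered.

```perl
use strict;
use warnings;

my %store;   # id => { version => N, data => {...} }

# Write a document. With an expected version, the write only succeeds if the
# stored version still matches; with create => 1 it only succeeds if absent.
sub put {
    my (%args) = @_;
    my $current = $store{ $args{id} };
    die "conflict: document exists\n"
        if $args{create} && $current;
    die "conflict: version mismatch\n"
        if defined $args{version}
        && ( !$current || $current->{version} != $args{version} );
    my $next = $current ? $current->{version} + 1 : 1;
    $store{ $args{id} } = { version => $next, data => $args{data} };
    return $next;
}

my $v1 = put( id => 1, data => { tweet => 'first' },  create  => 1 );
my $v2 = put( id => 1, data => { tweet => 'second' }, version => $v1 );

# A second writer still holding version 1 is rejected instead of overwriting.
my $ok = eval { put( id => 1, data => { tweet => 'stale' }, version => $v1 ); 1 };
print "versions: $v1, $v2; stale write ", ( $ok ? "accepted\n" : "rejected: $@" );
```

Nothing is locked: writers proceed optimistically and only the one whose expected version is out of date has to re-read and retry. Omitting both arguments gives an unconditional write, which matches the "Optional" slide.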

Slide 102

Slide 102 text

So far, all we have is a NoSQL document store which is fast, reliable, scalable & easy to use

Slide 105

Slide 105 text

Simple search $e->search( index => 'twitter', type => 'tweet', );

Slide 106

Slide 106 text

Simple search $e->search( index => ['twitter','facebook'], type => ['tweet','post'], );

Slide 107

Slide 107 text

Simple search
    $e->search(
        # all indices
        # all types
    );

Slide 108

Slide 108 text

Simple search $e->search( index => 'twitter', type => 'tweet', query => { } );

Slide 109

Slide 109 text

Simple search $e->search( index => 'twitter', type => 'tweet', query => { text => { _all => 'clinton' } } );

Slide 110

Slide 110 text

Simple search $e->search( index => 'twitter', type => 'tweet', queryb => 'clinton' );

Slide 111

Slide 111 text

Simple search
    $e->search(
        index  => 'twitter',
        type   => 'tweet',
        queryb => 'clinton'
        # ElasticSearch::SearchBuilder,
        # like SQL::Abstract
    );

Slide 112

Slide 112 text

Search results
    {
        took => 1,                  # milliseconds
        hits => {
            total     => 1,         # total results
            max_score => 1,
            hits      => [{
                _score  => 1,
                _index  => 'twitter',
                _type   => 'tweet',
                _id     => 1,
                _source => {
                    tweet => "ElasticSearch is cool",
                    sent  => "2011-08-16 15:15:00",
                    user  => { name => "Clinton", user_id => 123 },
                    tags  => ['search','perl'],
                }
            }],
        },
        ... other information ...
    }

Slide 117

Slide 117 text

JSON doc included in results

Slide 118

Slide 118 text

No need to fetch from DB

Slide 119

Slide 119 text

Docs visible to search in near-real time (< 1 second)

Slide 120

Slide 120 text

refresh_index() to force

Slide 121

Slide 121 text

What can you do with search?

Slide 122

Slide 122 text

standard text search

Slide 123

Slide 123 text

...with highlighting

Slide 124

Slide 124 text

stemming

Slide 125

Slide 125 text

stemming: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, german2, greek, hindi, hungarian, indonesian, italian, kp, light_finish, light_french, light_german, light_hungarian, light_italian, light_portuguese, light_russian, light_spanish, light_swedish, lovins, minimal_english, minimal_french, minimal_german, minimal_portuguese, norwegian, persian, porter, porter2, portuguese, possessive_english, romanian, russian, spanish, swedish, thai, turkish

Slide 126

Slide 126 text

ngrams & edge-ngrams

Slide 127

Slide 127 text

auto-complete
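
Edge n-grams are the prefixes of a term, which is why they suit auto-complete: if every prefix is indexed, a partially typed word matches the full term with an ordinary lookup. Below is a minimal sketch in which the `edge_ngrams` helper, the length limits and the word list are assumptions for illustration.

```perl
use strict;
use warnings;

# Edge n-grams: prefixes of a term between $min and $max characters long.
sub edge_ngrams {
    my ($term, $min, $max) = @_;
    $max = length $term if $max > length $term;
    return map { substr $term, 0, $_ } $min .. $max;
}

# Index every prefix so a partially typed word finds the whole term.
my %prefix_index;
for my $term (qw(search searchable server perl)) {
    push @{ $prefix_index{$_} }, $term for edge_ngrams($term, 1, 5);
}

print "sea -> @{ $prefix_index{sea} }\n";   # sea -> search searchable
print "se  -> @{ $prefix_index{se} }\n";    # se  -> search searchable server
```

The trade-off is a larger index in exchange for cheap prefix matching at search time.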

Slide 128

Slide 128 text

camelCase

Slide 131

Slide 131 text

term facets, date histograms
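
A term facet is a count of how many matching documents carry each value of a field. Here is a minimal sketch over made-up documents; the `term_facet` helper is hypothetical and only mimics the idea, not the elasticsearch facet API.

```perl
use strict;
use warnings;

# Made-up documents; only the tags field matters here.
my @docs = (
    { tags => [qw(search perl)] },
    { tags => [qw(perl yapc)] },
    { tags => [qw(search perl lucene)] },
);

# Term facet: count how many documents carry each value of a field.
sub term_facet {
    my ($field, @hits) = @_;
    my %count;
    for my $doc (@hits) {
        my %uniq = map { $_ => 1 } @{ $doc->{$field} || [] };
        $count{$_}++ for keys %uniq;
    }
    return \%count;
}

my $facet = term_facet('tags', @docs);
printf "%-7s %d\n", $_, $facet->{$_}
    for sort { $facet->{$b} <=> $facet->{$a} || $a cmp $b } keys %$facet;
```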

Slide 132

Slide 132 text

ranges

Slide 133

Slide 133 text

geo bounding box

Slide 134

Slide 134 text

geo distance

Slide 135

Slide 135 text

geo distance ranges
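
Geo distance searches and distance ranges both rest on computing the distance between two points. The sketch below uses the haversine formula. The helper names, the range boundaries and the approximate city coordinates are assumptions for illustration, not elasticsearch's implementation.

```perl
use strict;
use warnings;

my $PI              = 4 * atan2(1, 1);
my $EARTH_RADIUS_KM = 6371;

# Great-circle distance between two (lat, lon) points, haversine formula.
sub distance_km {
    my ($lat1, $lon1, $lat2, $lon2) = map { $_ * $PI / 180 } @_;
    my $h = sin( ($lat2 - $lat1) / 2 ) ** 2
          + cos($lat1) * cos($lat2) * sin( ($lon2 - $lon1) / 2 ) ** 2;
    return 2 * $EARTH_RADIUS_KM * atan2( sqrt($h), sqrt(1 - $h) );
}

# Bucket a distance into ranges, in the spirit of a distance range facet.
sub distance_range {
    my ($km) = @_;
    return $km < 100  ? 'under 100 km'
         : $km < 1000 ? '100-1000 km'
         :              'over 1000 km';
}

# Approximate coordinates (lat, lon).
my @london = ( 51.5074, -0.1278 );
my %cities = (
    Paris => [ 48.8566,  2.3522 ],
    Riga  => [ 56.9496, 24.1052 ],
);

for my $city (sort keys %cities) {
    my $km = distance_km(@london, @{ $cities{$city} });
    printf "London to %-5s %6.0f km (%s)\n", $city, $km, distance_range($km);
}
```

A bounding box or polygon filter would replace the distance calculation with a containment test, but the bucketing step stays the same.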

Slide 136

Slide 136 text

geo polygons

Slide 139

Slide 139 text

“Terms of endearment”: The ElasticSearch query language explained. Thurs. 14:35 - Auditorija 301