Slide 1

elasticsearch
The Road to a Distributed, (Near) Real Time, Search Engine
Shay Banon - @kimchy

Slide 2

Lucene Basics - Directory
- A file system abstraction
- Mainly used to read and write “files”
- Used to read and write the different index files
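
Below is a minimal sketch, assuming the Lucene 3.x-era API the talk is based on, of what “a file system abstraction” means in practice: the same read/write calls work whether the Directory lives in memory or on disk (the file name and contents are just illustrative).

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;
    import org.apache.lucene.store.RAMDirectory;

    public class DirectoryExample {
        public static void main(String[] args) throws Exception {
            // Swap in FSDirectory.open(new File("/tmp/idx")) to put the same "files" on disk
            Directory dir = new RAMDirectory();

            // Write a "file" through the abstraction (this is how Lucene writes its index files)
            IndexOutput out = dir.createOutput("hello.txt");
            out.writeString("hello directory");
            out.close();

            // Read it back on demand
            IndexInput in = dir.openInput("hello.txt");
            System.out.println(in.readString());
            in.close();

            // List the "files" the Directory knows about
            for (String file : dir.listAll()) {
                System.out.println(file + " -> " + dir.fileLength(file) + " bytes");
            }
            dir.close();
        }
    }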

Slide 3

Lucene Basics - IndexWriter
- Used to add documents to / delete documents from the index
- Changes are stored in memory (possibly flushing to maintain memory limits)
- Requires a commit to make changes “persistent”, which is expensive
- Only a single IndexWriter can write to an index; it is expensive to create (reuse it at all costs!)
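
A minimal sketch of that lifecycle, again assuming the Lucene 3.x-era API (3.6 constants used here): one long-lived IndexWriter buffers adds and deletes in memory, and only an explicit commit makes them persistent.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class IndexWriterExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            config.setRAMBufferSizeMB(64.0); // flush to new segments when the buffer fills up
            IndexWriter writer = new IndexWriter(dir, config); // expensive: create once, reuse

            Document doc = new Document();
            doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", "ElasticSearch - A distributed search engine",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);                     // buffered in memory
            writer.deleteDocuments(new Term("id", "1")); // deletes are buffered too

            writer.commit(); // heavy: fsyncs index files and makes the changes persistent
            writer.close();
            dir.close();
        }
    }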

Slide 4

Lucene Basics - Index Segments
- An index is composed of internal segments
- Each segment is almost a self sufficient index by itself, immutable up to deletes
- A commit “officially” adds segments to the index, though internal flushing might create new segments as well
- Segments are merged continuously
- A lot of caching is done per segment (terms, fields)
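
A small illustrative sketch (Lucene 3.x-era API, field names are arbitrary) showing that each flush/commit can add a segment, and that a reader exposes one sub-reader per segment until merging reduces the count.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SegmentsExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
            for (int i = 0; i < 3; i++) {
                Document doc = new Document();
                doc.add(new Field("id", Integer.toString(i), Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
                writer.commit(); // each commit here flushes at least one new segment
            }
            IndexReader reader = IndexReader.open(dir);
            // One sub-reader per segment (3 here, unless a merge already kicked in)
            System.out.println("segments: " + reader.getSequentialSubReaders().length);
            reader.close();
            writer.close();
        }
    }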

Slide 5

Lucene Basics - (Near) Real Time
- IndexReader is the basis for searching
- IndexWriter#getReader allows getting a refreshed reader that sees the changes made through the IndexWriter
- Requires flushing (but not committing)
- Can’t call it on every operation, too expensive
- Segment based readers and search
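
A minimal sketch of the (near) real time pattern, assuming the Lucene 3.x-era IndexWriter#getReader (later versions replaced it with DirectoryReader.open(writer)): refresh a searcher periodically rather than on every indexing operation.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class NearRealTimeSearch {
        // Called periodically (e.g. every second), not after each document
        static IndexSearcher refresh(IndexWriter writer) throws Exception {
            IndexReader reader = writer.getReader(); // flushes buffered docs, no commit needed
            return new IndexSearcher(reader);
        }

        static int count(IndexSearcher searcher, String field, String text) throws Exception {
            TopDocs hits = searcher.search(new TermQuery(new Term(field, text)), 10);
            return hits.totalHits; // sees documents indexed since the last refresh
        }
    }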

Slide 6

Distributed Directory
- Implement a Directory that works on top of a distributed “system”
- Store file chunks, read them on demand
- Implemented for most (Java) data grids
  - Compass - GigaSpaces, Coherence, Terracotta
  - Infinispan
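
A conceptual sketch only, not a real Lucene Directory implementation (the class, chunk size and key format below are assumptions): a “file” is stored as fixed-size chunks in a key-value map that the data grid would distribute.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ChunkedFileStore {
        static final int CHUNK_SIZE = 16 * 1024;
        // In a data grid (GigaSpaces, Coherence, Terracotta, Infinispan) this map would be distributed
        private final Map<String, byte[]> chunks = new ConcurrentHashMap<String, byte[]>();

        void write(String fileName, byte[] data) {
            for (int i = 0, chunk = 0; i < data.length; i += CHUNK_SIZE, chunk++) {
                int len = Math.min(CHUNK_SIZE, data.length - i);
                byte[] copy = new byte[len];
                System.arraycopy(data, i, copy, 0, len);
                chunks.put(fileName + ":" + chunk, copy); // each put is a network call in practice
            }
        }

        byte[] readChunk(String fileName, long offset) {
            // One roundtrip per chunk read on demand -> this is why the approach is "chatty"
            return chunks.get(fileName + ":" + (offset / CHUNK_SIZE));
        }
    }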

Slide 7

Distributed Directory
(Diagram: IndexWriter and IndexReader working against a distributed DIR, with file chunks spread across several nodes)

Slide 8

Distributed Directory
- “Chatty” - many network roundtrips to fetch data
- Big indices still suffer from a non distributed IndexReader
- Lucene IndexReader can be quite “heavy”
- Single IndexWriter problem, can’t really scale writes

Slide 9

Partitioning
- Document Partitioning
  - Each shard has a subset of the documents
  - A shard is a fully functional “index”
- Term Partitioning
  - Shards have a subset of the terms for all docs

Slide 10

Partitioning - Term Based
- pro: a K term query is handled by at most K shards
- pro: O(K) disk seeks for a K term query
- con: high network traffic - data about each matching term needs to be collected in one place
- con: harder to have per doc information (facets / sorting / custom scoring)

Slide 11

Partitioning - Term Based
- Riak Search - utilizing its distributed key-value storage
- Lucandra (abandoned, replaced by Solandra)
  - Custom IndexReader and IndexWriter to work on top of Cassandra
  - Very, very “chatty” when doing a search
  - Does not work well with other Lucene constructs, like FieldCache (by doc info)

Slide 12

Partitioning - Document Based
- pro: each shard can process queries independently
- pro: easy to keep additional per-doc information (facets, sorting, custom scoring)
- pro: network traffic is small
- con: every query has to be processed by each shard
- con: O(K*N) disk seeks for a K term query on N shards

Slide 13

Distributed Lucene - Doc Partitioning
- Shard Lucene into several instances
- Index a document to one Lucene shard
- Distribute search across the Lucene shards
(Diagram: an index operation goes to a single Lucene instance, a search is spread across all of them)
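
A minimal sketch of document partitioning; the hash-based routing function is an illustration rather than something stated on the slide. A document is indexed into exactly one shard, while a search fans out to every shard.

    public class DocRouter {
        static int shardFor(String docId, int numberOfShards) {
            // Mask the sign bit instead of Math.abs (which overflows for Integer.MIN_VALUE)
            return (docId.hashCode() & 0x7fffffff) % numberOfShards;
        }

        public static void main(String[] args) {
            int numberOfShards = 2;
            // Index: the doc goes to exactly one shard
            System.out.println("doc 1 -> shard " + shardFor("1", numberOfShards));
            // Search: the query is sent to every shard and the results are gathered
            for (int shard = 0; shard < numberOfShards; shard++) {
                System.out.println("search hits shard " + shard);
            }
        }
    }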

Slide 14

Distributed Lucene - Replication
- Replicated Lucene shards
- High Availability
- Scale search by searching replicas
(Diagram: two replicated Lucene instances)

Slide 15

Pull Replication
- Master - slave configuration
- The slave pulls index files from the master (delta, only new segments)
(Diagram: segment files copied from the master Lucene index to the slave)
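
A conceptual sketch of the pull model, assuming the index is a plain directory of immutable segment files (class and method names are illustrative): the slave copies only the files it does not already have.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class SegmentPuller {
        static void pull(File masterIndexDir, File slaveIndexDir) throws IOException {
            File[] masterFiles = masterIndexDir.listFiles();
            if (masterFiles == null) {
                return; // nothing to pull
            }
            for (File file : masterFiles) {
                File target = new File(slaveIndexDir, file.getName());
                if (target.exists() && target.length() == file.length()) {
                    continue; // segment files are immutable, an existing file is already up to date
                }
                copy(file, target); // only the delta (new segment files) goes over the wire
            }
        }

        static void copy(File src, File dest) throws IOException {
            FileInputStream in = new FileInputStream(src);
            FileOutputStream out = new FileOutputStream(dest);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            in.close();
            out.close();
        }
    }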

Slide 16

Pull Replication - Downsides
- Requires a “commit” on the master to make changes available for replication to the slave
- Redundant data transfer as segments are merged (especially for stored fields)
- Friction between commit (heavy) and replication - slaves can get “way” behind the master (big new segments), which loses HA
- Does not work for real time search, slaves are too far behind

Slide 17

Push Replication
- The “master/primary” pushes to all the replicas
- Indexing is done on all replicas
(Diagram: the client’s doc is indexed on both Lucene replicas)

Slide 18

Push Replication - Downsides
- The document is indexed on all nodes (though less data is transferred over the wire)
- Requires delicate control over concurrent indexing operations
- Usually solved using versioning (see the sketch below)
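
A minimal sketch of version-based concurrency control, as an illustration rather than the exact Elasticsearch mechanism: the primary assigns a version per document and a replica applies only operations that carry a newer version.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class VersionedReplica {
        private final Map<String, Long> currentVersions = new ConcurrentHashMap<String, Long>();

        // Returns true if the operation was applied, false if it arrived out of order
        synchronized boolean apply(String docId, long version, String source) {
            Long current = currentVersions.get(docId);
            if (current != null && version <= current) {
                return false; // an older (or duplicate) operation, ignore it
            }
            currentVersions.put(docId, version);
            indexIntoLucene(docId, source);
            return true;
        }

        private void indexIntoLucene(String docId, String source) {
            // delete the old doc by id and add the new one (omitted in this sketch)
        }
    }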

Slide 19

Push Replication - Benefits
- Documents indexed are immediately available on all replicas
- Improves High Availability
- Allows for a (near) real time search architecture
- The architecture allows switching “roles”: if the primary dies, a slave can become the primary and still allow indexing

Slide 20

Push Replication - IndexWriter#commit
- IndexWriter#commit is heavy, but required in order to make sure data is actually persisted
- Can be solved by having a write ahead log that can be replayed in the event of a crash (see the sketch below)
- Can be more naturally supported in push replication
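
A minimal sketch of the write ahead log idea (all names below are illustrative): append and flush each operation before acknowledging it, and replay the log after a crash instead of relying on a heavy commit per operation.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class WriteAheadLog {
        private final File file;
        private final DataOutputStream out;

        WriteAheadLog(File file) throws IOException {
            this.file = file;
            this.out = new DataOutputStream(new FileOutputStream(file, true)); // append mode
        }

        void append(String operation) throws IOException {
            out.writeUTF(operation);
            out.flush(); // a real log would also sync; still far cheaper than IndexWriter#commit
        }

        // After a crash, re-apply everything that was never committed into Lucene
        List<String> replay() throws IOException {
            List<String> ops = new ArrayList<String>();
            DataInputStream in = new DataInputStream(new FileInputStream(file));
            try {
                while (true) {
                    ops.add(in.readUTF());
                }
            } catch (EOFException endOfLog) {
                // reached the end of the log
            } finally {
                in.close();
            }
            return ops;
        }
    }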

Slide 21

elasticsearch http://www.elasticsearch.org

Slide 22

index - shards and replicas
(Diagram: two empty nodes and a client)
curl -XPUT localhost:9200/test -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'

Slide 23

index - shards and replicas
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client)
curl -XPUT localhost:9200/test -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'

Slide 24

indexing - 1
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client)
curl -XPUT localhost:9200/test/type1/1 -d '{ "name" : { "first" : "Shay", "last" : "Banon" }, "title" : "ElasticSearch - A distributed search engine" }'
Automatic sharding, push replication

Slide 25

indexing - 2
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client)
curl -XPUT localhost:9200/test/type1/2 -d '{ "name" : { "first" : "Shay", "last" : "Banon" }, "title" : "ElasticSearch - A distributed search engine" }'
Automatic request “redirection”

Slide 26

search - 1
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client)
curl -XGET 'localhost:9200/test/_search?q=test'
Scatter / Gather search

Slide 27

search - 2
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client)
curl -XGET 'localhost:9200/test/_search?q=test'
Automatic balancing between replicas

Slide 28

search - 3
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 0 (replica) and Shard 1 (primary); Client; one shard fails)
curl -XGET 'localhost:9200/test/_search?q=test'
Automatic failover

Slide 29

adding a node
(Diagram: Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 1 (primary) and Shard 0 (replica))
“Hot” relocation of shards to the new node

Slide 30

adding a node
(Diagram: a new node joins; Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 1 (primary); new Node with Shard 0 (replica))
“Hot” relocation of shards to the new node

Slide 31

adding a node
(Diagram: Shard 0 (replica) being relocated; Node with Shard 0 (primary) and Shard 1 (replica); Node with Shard 1 (primary); new Node with Shard 0 (replica))
“Hot” relocation of shards to the new node

Slide 32

node failure
(Diagram: Node with Shard 1 (primary); Node with Shard 0 (replica); Node with Shard 0 (primary) and Shard 1 (replica))

Slide 33

node failure - 1
(Diagram: Node with Shard 1 (primary); Node with Shard 0 (primary))
Replicas can automatically become primaries

Slide 34

node failure - 2
(Diagram: Node with Shard 1 (primary) and Shard 0 (replica); Node with Shard 0 (primary) and Shard 1 (replica))
Shards are automatically assigned, and do “hot” recovery

Slide 35

dynamic replicas
(Diagram: Node with Shard 0 (primary); Node with Shard 0 (replica); Client)
curl -XPUT localhost:9200/test -d '{ "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } }'

Slide 36

dynamic replicas
(Diagram: a new node joins; Node with Shard 0 (primary); empty Node; Node with Shard 0 (replica); Client)

Slide 37

dynamic replicas
(Diagram: Node with Shard 0 (primary); Node with Shard 0 (replica); Node with Shard 0 (replica); Client)
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "number_of_replicas" : 2 } }'

Slide 38

multi tenancy - indices
(Diagram: three empty nodes and a client)
curl -XPUT localhost:9200/test1 -d '{ "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } }'

Slide 39

multi tenancy - indices
(Diagram: Node with test1 S0 (primary); empty Node; Node with test1 S0 (replica); Client)
curl -XPUT localhost:9200/test1 -d '{ "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } }'

Slide 40

multi tenancy - indices
(Diagram: Node with test1 S0 (primary); empty Node; Node with test1 S0 (replica); Client)
curl -XPUT localhost:9200/test2 -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'

Slide 41

multi tenancy - indices
(Diagram: test2 S0 (primary), test2 S0 (replica), test2 S1 (primary) and test2 S1 (replica) are allocated across the three nodes alongside test1)
curl -XPUT localhost:9200/test2 -d '{ "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'

Slide 42

multi tenancy - indices
- Search against a specific index: curl localhost:9200/test1/_search
- Search against several indices: curl localhost:9200/test1,test2/_search
- Search across all indices: curl localhost:9200/_search
- Can be simplified using aliases

Slide 43

transaction log
- An indexed / deleted doc is fully persistent
- No need for a Lucene IndexWriter#commit per operation
- Managed using a transaction log / WAL
- Full single node durability (kill dash 9)
- Utilized when doing hot relocation of shards
- Periodically “flushed” (calling IndexWriter#commit)

Slide 44

many more... (dist. related)
- Custom routing when indexing and searching
- Different “search execution types”: dfs, query_then_fetch, query_and_fetch
- Completely non blocking, event IO based communication (no blocking threads on sockets, no deadlocks, scales to a large number of shards/replicas)

Slide 45

Thanks
Shay Banon - twitter: @kimchy
elasticsearch - http://www.elasticsearch.org/
twitter: @elasticsearch
github: https://github.com/elasticsearch/elasticsearch