Hippo meetup: enterprise search with Solr and elasticsearch

Luca Cavanna
January 15, 2013


Presentation given at the Hippo meetup on January 15th 2013 in Amsterdam.


Transcript

  1. 15th January 2013 – Hippo meetup [email protected] - @lucacavanna Luca

    Cavanna Software developer & Search consultant at Trifork Amsterdam
  2. Trifork (aka Jteam/Dutchworks/Orange11) Focus areas: – Big data & Search

    – Mobile – Custom solutions – Knowledge (GOTO Amsterdam) • Hippo partner • Hippo related search projects: – uva.nl – working on rijksoverheid.nl
  3. Agenda • Search introduction – Lucene foundation – Why do

    we need Solr or elasticsearch? • Scaling with Solr • Elasticsearch distributed nature • Elasticsearch features
  4. Apache Lucene • High-performance, full-featured text search engine library written

    entirely in Java • It indexes documents as collections of fields • A field is a string-based key-value pair • What data structure does it use under the hood?
  5. Inverted index

    1 The old night keeper keeps the keep in the town
    2 In the big old house in the big old gown.
    3 The house in the town had the big old keep
    4 Where the old night keeper never did sleep.
    5 The night keeper keeps the keep in the night
    6 And keeps in the dark and sleeps in the light.

    term     freq   posting list
    and      1      6
    big      2      2 3
    dark     1      6
    did      1      4
    gown     1      2
    had      1      3
    house    2      2 3
    in       5      1 2 3 5 6
    keep     3      1 3 5
    keeper   3      1 4 5
    keeps    3      1 5 6
    light    1      6
    never    1      4
    night    3      1 4 5
    old      4      1 2 3 4
    sleep    1      4
    sleeps   1      6
    the      6      1 2 3 4 5 6
    town     2      1 3
    where    1      4
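
    A minimal, hypothetical Java sketch of the data structure above: a map from
    term to posting list (document ids). Real Lucene postings also store term
    frequencies, positions and offsets, which this toy version leaves out.

    import java.util.*;

    // Toy inverted index: term -> sorted set of document ids (the posting list)
    public class ToyInvertedIndex {

        private final Map<String, SortedSet<Integer>> postings =
            new TreeMap<String, SortedSet<Integer>>();

        // Naive text analysis: lowercase and split on non-letters
        public void add(int docId, String text) {
            for (String term : text.toLowerCase().split("[^a-z]+")) {
                if (term.length() == 0) continue;
                SortedSet<Integer> postingList = postings.get(term);
                if (postingList == null) {
                    postingList = new TreeSet<Integer>();
                    postings.put(term, postingList);
                }
                postingList.add(docId);
            }
        }

        public SortedSet<Integer> search(String term) {
            SortedSet<Integer> postingList = postings.get(term.toLowerCase());
            return postingList != null ? postingList : new TreeSet<Integer>();
        }

        public static void main(String[] args) {
            ToyInvertedIndex index = new ToyInvertedIndex();
            index.add(1, "The old night keeper keeps the keep in the town");
            index.add(2, "In the big old house in the big old gown.");
            System.out.println(index.search("old")); // [1, 2]
        }
    }
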
  6. Inverted index • Indexing – Text analysis • Tokenization, lowercasing

    and more • The inverted index can contain more data – Term offsets and more • The inverted index itself doesn't contain the text for displaying the search results
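
    A minimal sketch of what the analysis chain produces, using the Lucene 4.0 API
    the deck targets (the field name "description" and the sample text are made up).
    StandardAnalyzer tokenizes, lowercases and drops English stopwords such as "the".

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalysisExample {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream stream = analyzer.tokenStream("description",
                new StringReader("The old Night Keeper keeps the keep"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // prints: old, night, keeper, keeps, keep
                System.out.println(term.toString());
            }
            stream.end();
            stream.close();
        }
    }
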
  7. Indexing • Lucene writes indexes as segments • Segments are

    not modifiable: Write-Once • Each segment is a searchable mini index • Each segment contains – Inverted index – Stored fields – ...and more
  8. Indexing: the commit operation • Documents are searchable only after

    a commit! • Commit also gives durability • The most expensive operation in Lucene!!!
  9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0) •

    With the Lucene near-real time API you don't need a commit to make new documents searchable • Less expensive than commit • Doesn't guarantee durability though • Exposed as soft commit in Solr 4.0
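
    A minimal sketch of the near-real-time API in Lucene 4.0: the reader is opened
    from the IndexWriter instead of the Directory, so a freshly added document is
    searchable without a commit (a RAMDirectory and made-up field values are used
    here only to keep the example self-contained).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NearRealTimeExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));

            Document document = new Document();
            document.add(new TextField("title", "This is the title", Field.Store.YES));
            writer.addDocument(document);

            // Near-real-time reader: the document is visible although no commit happened yet
            DirectoryReader reader = DirectoryReader.open(writer, true);
            TopDocs topDocs = new IndexSearcher(reader).search(
                new TermQuery(new Term("title", "title")), 10);
            System.out.println("Hits before commit: " + topDocs.totalHits); // 1

            reader.close();
            writer.close(); // closing commits, which is what actually makes the data durable
        }
    }
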
  10. Lucene code example – indexing data

    // Open an IndexWriter on a filesystem directory, using the StandardAnalyzer
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
    Directory directory = FSDirectory.open(new File("data"));
    IndexWriter writer = new IndexWriter(directory, config);

    Document document = new Document();

    // id: indexed, stored and not tokenized (kept as a single term)
    FieldType idFieldType = new FieldType();
    idFieldType.setIndexed(true);
    idFieldType.setStored(true);
    idFieldType.setTokenized(false);
    document.add(new Field("id", "id-1", idFieldType));

    // title: indexed and stored, tokenized by the analyzer
    FieldType titleFieldType = new FieldType();
    titleFieldType.setIndexed(true);
    titleFieldType.setStored(true);
    document.add(new Field("title", "This is the title", titleFieldType));

    // description: indexed only, not stored
    FieldType descriptionFieldType = new FieldType();
    descriptionFieldType.setIndexed(true);
    document.add(new Field("description", "This is the description", descriptionFieldType));

    writer.addDocument(document);
    writer.close();
  11. Lucene code example – querying and showing results

    // Parse the user query against the "title" field with the same analyzer used at index time
    QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
        new StandardAnalyzer(Version.LUCENE_40));
    Query query = queryParser.parse(queryAsString);

    Directory directory = FSDirectory.open(new File("data"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    // Retrieve the top 10 hits and print the stored fields of each document
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc hit : topDocs.scoreDocs) {
        Document document = indexSearcher.doc(hit.doc);
        for (IndexableField field : document) {
            System.out.println(field.name() + ": " + field.stringValue());
        }
    }
  12. What's missing? • A common way to represent documents •

    Interface to send documents to (HTTP) • A way to represent queries • Interface to send queries to (HTTP) • Configuration • Caching • Distributed infrastructure • And more....
  13. Scaling – why? ‣ The more concurrent searches you run,

    the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging can be a CPU/IO-intensive operation ‣ Disk cache invalidation ‣ Failover
  14. Solr replication (pull approach) • Master-slave based solution • Single

    machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
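
    A hypothetical SolrJ sketch of how a client uses this setup (host names and
    field names are made up): all writes go to the master, queries go to a slave,
    and the slaves only see new documents after they have pulled the updated index.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MasterSlaveExample {
        public static void main(String[] args) throws Exception {
            // Indexing always targets the master
            HttpSolrServer master = new HttpSolrServer("http://solr-master:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "id-1");
            doc.addField("title", "This is the title");
            master.add(doc);
            master.commit();

            // Queries go to a slave (usually picked by a load balancer)
            HttpSolrServer slave = new HttpSolrServer("http://solr-slave-1:8983/solr");
            QueryResponse response = slave.query(new SolrQuery("title:title"));
            System.out.println("Found: " + response.getResults().getNumFound());
        }
    }
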
  15. SolrCloud • A set of new distributed capabilities in Solr

    • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election • Whichever server (shard) you send data to, the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up by adding new Solr nodes
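
    A minimal SolrJ sketch of talking to a SolrCloud cluster, assuming a ZooKeeper
    instance on localhost:2181 and a collection called "hippo" (both made up): the
    client reads the cluster state from ZooKeeper instead of being tied to a fixed node.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrCloudExample {
        public static void main(String[] args) throws Exception {
            // Connect to ZooKeeper rather than to a fixed Solr node
            CloudSolrServer server = new CloudSolrServer("localhost:2181");
            server.setDefaultCollection("hippo");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "id-1");
            doc.addField("title", "This is the title");
            server.add(doc);     // routed to the leader of the owning shard, then replicated
            server.commit();

            // The query is spread over the shards and the partial results are merged
            System.out.println(server.query(new SolrQuery("title:title")).getResults().getNumFound());
        }
    }
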
  16. elasticsearch • Distributed search engine built on top of Lucene

    • Apache 2 license • Written in Java • RESTful • Created and mainly developed by Shay Banon • A company behind it: elasticsearch.com • Regular releases – Latest release 0.20.2
  17. elasticsearch • Schemaless – Uses defaults and automatic type guessing

    – Custom mappings may be defined if needed • JSON oriented • Multi tenancy – Multiple indexes per node, multiple types per index • Designed to be distributed from the beginning • Almost everything is available as API (including configuration) • Wide range of administration APIs
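
    A hedged sketch of defining a custom mapping with the Java client of that era
    (index name, type and fields are assumptions; the REST equivalent is a PUT of
    the same JSON). Fields not listed still get mapped automatically the first time
    they are seen.

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class CustomMappingExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Create the index and register an explicit mapping for the "users" type
            client.admin().indices().prepareCreate("hippo")
                .addMapping("users",
                    "{ \"users\" : { \"properties\" : { " +
                    "  \"first_name\" : { \"type\" : \"string\" }, " +
                    "  \"created\"    : { \"type\" : \"date\" } } } }")
                .execute().actionGet();

            client.close();
        }
    }
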
  18. elasticsearch distributed terminology • Node: a running instance of elasticsearch

    which belongs to a cluster (usually one node per server) • Cluster: one or more nodes with the same cluster name • Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards. • Index: a logical namespace which points to one or more shards – Your code won't deal directly with a shard, only with an index – But an index is composed of multiple Lucene indexes (one per shard)
  19. elasticsearch distributed terminology • More shards: – improve indexing performance

    – increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well! • More replicas: – improve failover – improve querying performance
  20. Transaction Log • Indexed docs are fully persistent • No

    need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
  21. Index - Shards & Replicas

    [Diagram: a client and two empty elasticsearch nodes]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'
  22. Index - Shards & Replicas

    [Diagram: two nodes; one holds Shard 0 (primary) and Shard 1 (replica),
    the other holds Shard 0 (replica) and Shard 1 (primary)]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'
  23. Indexing - 1

    [Diagram: same two-node, two-shard layout; the client indexes a document]

    • Automatic sharding, push replication

    curl -XPUT localhost:9200/hippo/users/1 -d '
    { "name" : { "first" : "Luca", "last" : "Cavanna" } }'
  24. Indexing - 2

    [Diagram: same two-node, two-shard layout; the client indexes a second document]

    curl -XPUT localhost:9200/hippo/users/2 -d '
    { "name" : { "first" : "Jeroen", "last" : "Reijn" } }'
  25. Search - 1

    [Diagram: same two-node, two-shard layout; the client sends a search request]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Scatter / Gather search
  26. Search - 2

    [Diagram: same layout; the search request hits a different copy of each shard]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Automatic balancing between replicas
  27. Search - 3

    [Diagram: same layout; one shard copy has failed, the search is answered by the surviving copy]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Automatic failover
  28. Adding a node

    [Diagram: two nodes, each holding one primary and one replica shard]

    • “Hot” reallocation of shards to the new node
  29. Adding a node

    [Diagram: a third node joins the cluster]

    • “Hot” reallocation of shards to the new node
  30. Adding a node

    [Diagram: the Shard 0 replica is relocated to the new node]

    • “Hot” reallocation of shards to the new node
  31. Node failure

    [Diagram: three nodes; the node holding Shard 0 (primary) and Shard 1 (replica) fails]
  32. Node failure - 1

    [Diagram: two remaining nodes; the Shard 0 replica has been promoted to primary]

    • Replicas can automatically become primaries
  33. Node failure - 2

    [Diagram: new replicas of Shard 0 and Shard 1 are allocated on the remaining nodes]

    • Shards are automatically assigned and do “hot” recovery
  34. Dynamic Replicas

    [Diagram: three nodes; Shard 0 (primary) on one, Shard 0 (replica) on another, the third node empty]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } }'
  35. Dynamic Replicas

    [Diagram: a second Shard 0 replica is allocated on the previously empty node]

    curl -XPUT localhost:9200/hippo/_settings -d '
    { "index" : { "number_of_replicas" : 2 } }'
  36. Indexing (Push) - ElasticSearch

    • Documents added through push requests • Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported

    {
      "name": "Luca Cavanna",
      "location": {
        "city": "Amsterdam",
        "country": "The Netherlands"
      }
    }
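
    The same document pushed through the Java TransportClient, as a hedged sketch
    (host, index and type names are assumptions; the REST calls on the earlier
    slides are equivalent):

    import org.elasticsearch.action.index.IndexResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class PushIndexingExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Index a JSON document into index "hippo", type "users", id "1"
            IndexResponse response = client.prepareIndex("hippo", "users", "1")
                .setSource("{ \"name\": \"Luca Cavanna\", " +
                           "\"location\": { \"city\": \"Amsterdam\", \"country\": \"The Netherlands\" } }")
                .execute().actionGet();

            // Versioning comes for free: every update of the same id bumps the version
            System.out.println("Indexed version " + response.getVersion());

            client.close();
        }
    }
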
  37. Indexing (Pull) - ElasticSearch • Data flows from sources using

    ‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr
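
    A hedged sketch of how a river was registered at the time: a "_meta" document
    is indexed into the reserved "_river" index and elasticsearch starts the
    corresponding puller. The CouchDB settings below are illustrative and may not
    match the plugin's exact option names.

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class RiverRegistrationExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Registering a river is just indexing its _meta document
            client.prepareIndex("_river", "my_couchdb_river", "_meta")
                .setSource("{ \"type\" : \"couchdb\", " +
                           "\"couchdb\" : { \"host\" : \"localhost\", \"port\" : 5984, \"db\" : \"hippo\" } }")
                .execute().actionGet();

            client.close();
        }
    }
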
  38. Searching - ElasticSearch • Search request in Request Body •

    Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more
  39. Search Example - ElasticSearch

    $ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '{
      "query" : {
        "term" : { "first_name" : "luca" }
      }
    }'

    {
      "_shards": { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits": {
        "total" : 1,
        "hits" : [ {
          "_index" : "hippo",
          "_type" : "users",
          "_id" : "1",
          "_source" : { "first_name" : "Luca", "last_name" : "Cavanna" }
        } ]
      }
    }
  40. Thanks There would be a lot more to say: •

    Query DSL • Scripting module (pluggable implementation) • Percolator • Running it embedded Check them out yourself if you are interested! Questions?