Hippo meetup: enterprise search with Solr and elasticsearch

Luca Cavanna
January 15, 2013


Presentation given at the Hippo meetup on January 15th 2013 in Amsterdam.


Transcript

  1. 15th January 2013 – Hippo meetup [email protected] - @lucacavanna Luca

    Cavanna Software developer & Search consultant at Trifork Amsterdam
  2. Trifork (aka Jteam/Dutchworks/Orange11) Focus areas: – Big data & Search

    – Mobile – Custom solutions – Knowledge (GOTO Amsterdam) • Hippo partner • Hippo related search projects: – uva.nl – working on rijksoverheid.nl
  3. Agenda • Search introduction – Lucene foundation – Why do

    we need Solr or elasticsearch? • Scaling with Solr • Elasticsearch distributed nature • Elasticsearch features
  4. Apache Lucene • High-performance, full-featured text search engine library written

    entirely in Java • It indexes documents as collections of fields • A field is a string-based key-value pair • What data structure does it use under the hood?
  5. Inverted index

    1 The old night keeper keeps the keep in the town
    2 In the big old house in the big old gown.
    3 The house in the town had the big old keep
    4 Where the old night keeper never did sleep.
    5 The night keeper keeps the keep in the night
    6 And keeps in the dark and sleeps in the light.

    term     freq   posting list
    and      1      6
    big      2      2 3
    dark     1      6
    did      1      4
    gown     1      2
    had      1      3
    house    2      2 3
    in       5      1 2 3 5 6
    keep     3      1 3 5
    keeper   3      1 4 5
    keeps    3      1 5 6
    light    1      6
    never    1      4
    night    3      1 4 5
    old      4      1 2 3 4
    sleep    1      4
    sleeps   1      6
    the      6      1 2 3 4 5 6
    town     2      1 3
    where    1      4
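
    A minimal, hypothetical Java sketch of the data structure above: a map from
    term to posting list (document ids). Real Lucene postings also store term
    frequencies, positions and offsets, which this toy version leaves out.

    import java.util.*;

    // Toy inverted index: term -> sorted set of document ids (the posting list)
    public class ToyInvertedIndex {

        private final Map<String, SortedSet<Integer>> postings =
            new TreeMap<String, SortedSet<Integer>>();

        // Naive text analysis: lowercase and split on non-letters
        public void add(int docId, String text) {
            for (String term : text.toLowerCase().split("[^a-z]+")) {
                if (term.length() == 0) continue;
                SortedSet<Integer> postingList = postings.get(term);
                if (postingList == null) {
                    postingList = new TreeSet<Integer>();
                    postings.put(term, postingList);
                }
                postingList.add(docId);
            }
        }

        public SortedSet<Integer> search(String term) {
            SortedSet<Integer> postingList = postings.get(term.toLowerCase());
            return postingList != null ? postingList : new TreeSet<Integer>();
        }

        public static void main(String[] args) {
            ToyInvertedIndex index = new ToyInvertedIndex();
            index.add(1, "The old night keeper keeps the keep in the town");
            index.add(2, "In the big old house in the big old gown.");
            System.out.println(index.search("old")); // [1, 2]
        }
    }
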
  6. Inverted index • Indexing – Text analysis • Tokenization, lowercasing

    and more • The inverted index can contain more data – Term offsets and more • The inverted index itself doesn't contain the text for displaying the search results
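
    A minimal sketch of what the analysis chain produces, using the Lucene 4.0 API
    the deck targets (the field name "description" and the sample text are made up).
    StandardAnalyzer tokenizes, lowercases and drops English stopwords such as "the".

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalysisExample {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream stream = analyzer.tokenStream("description",
                new StringReader("The old Night Keeper keeps the keep"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // prints: old, night, keeper, keeps, keep
                System.out.println(term.toString());
            }
            stream.end();
            stream.close();
        }
    }
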
  7. Indexing • Lucene writes indexes as segments • Segments are

    not modifiable: Write-Once • Each segment is a searchable mini index • Each segment contains – Inverted index – Stored fields – ...and more
  8. Indexing: the commit operation • Documents are searchable only after

    a commit! • Commit also gives durability • The most expensive operation in Lucene!!!
  9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0) •

    With the Lucene near-real time API you don't need a commit to make new documents searchable • Less expensive than commit • Doesn't guarantee durability though • Exposed as soft commit in Solr 4.0
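
    A minimal sketch of the near-real-time API in Lucene 4.0: the reader is opened
    from the IndexWriter instead of the Directory, so a freshly added document is
    searchable without a commit (a RAMDirectory and made-up field values are used
    here only to keep the example self-contained).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NearRealTimeExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));

            Document document = new Document();
            document.add(new TextField("title", "This is the title", Field.Store.YES));
            writer.addDocument(document);

            // Near-real-time reader: the document is visible although no commit happened yet
            DirectoryReader reader = DirectoryReader.open(writer, true);
            TopDocs topDocs = new IndexSearcher(reader).search(
                new TermQuery(new Term("title", "title")), 10);
            System.out.println("Hits before commit: " + topDocs.totalHits); // 1

            reader.close();
            writer.close(); // closing commits, which is what actually makes the data durable
        }
    }
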
  10. Lucene code example – indexing data

    // Open an IndexWriter on a filesystem directory, using the StandardAnalyzer
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
    Directory directory = FSDirectory.open(new File("data"));
    IndexWriter writer = new IndexWriter(directory, config);

    Document document = new Document();

    // id: indexed, stored and not tokenized (kept as a single term)
    FieldType idFieldType = new FieldType();
    idFieldType.setIndexed(true);
    idFieldType.setStored(true);
    idFieldType.setTokenized(false);
    document.add(new Field("id", "id-1", idFieldType));

    // title: indexed and stored, tokenized by the analyzer
    FieldType titleFieldType = new FieldType();
    titleFieldType.setIndexed(true);
    titleFieldType.setStored(true);
    document.add(new Field("title", "This is the title", titleFieldType));

    // description: indexed only, not stored
    FieldType descriptionFieldType = new FieldType();
    descriptionFieldType.setIndexed(true);
    document.add(new Field("description", "This is the description", descriptionFieldType));

    writer.addDocument(document);
    writer.close();
  11. Lucene code example – querying and showing results

    // Parse the user query against the "title" field with the same analyzer used at index time
    QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
        new StandardAnalyzer(Version.LUCENE_40));
    Query query = queryParser.parse(queryAsString);

    Directory directory = FSDirectory.open(new File("data"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    // Retrieve the top 10 hits and print the stored fields of each document
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits);
    for (ScoreDoc hit : topDocs.scoreDocs) {
        Document document = indexSearcher.doc(hit.doc);
        for (IndexableField field : document) {
            System.out.println(field.name() + ": " + field.stringValue());
        }
    }
  12. What's missing? • A common way to represent documents •

    Interface to send documents to (HTTP) • A way to represent queries • Interface to send queries to (HTTP) • Configuration • Caching • Distributed infrastructure • And more....
  13. Scaling – why? ‣ The more concurrent searches you run,

    the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging can be a CPU/IO-intensive operation ‣ Disk cache invalidation ‣ Failover
  14. Solr replication (pull approach) • Master-slave based solution • Single

    machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
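
    A hypothetical SolrJ sketch of how a client uses this setup (host names and
    field names are made up): all writes go to the master, queries go to a slave,
    and the slaves only see new documents after they have pulled the updated index.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MasterSlaveExample {
        public static void main(String[] args) throws Exception {
            // Indexing always targets the master
            HttpSolrServer master = new HttpSolrServer("http://solr-master:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "id-1");
            doc.addField("title", "This is the title");
            master.add(doc);
            master.commit();

            // Queries go to a slave (usually picked by a load balancer)
            HttpSolrServer slave = new HttpSolrServer("http://solr-slave-1:8983/solr");
            QueryResponse response = slave.query(new SolrQuery("title:title"));
            System.out.println("Found: " + response.getResults().getNumFound());
        }
    }
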
  15. SolrCloud • A set of new distributed capabilities in Solr

    • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election • Whichever server (shard) you send data to, the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up by adding new Solr nodes
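
    A minimal SolrJ sketch of talking to a SolrCloud cluster, assuming a ZooKeeper
    instance on localhost:2181 and a collection called "hippo" (both made up): the
    client reads the cluster state from ZooKeeper instead of being tied to a fixed node.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrCloudExample {
        public static void main(String[] args) throws Exception {
            // Connect to ZooKeeper rather than to a fixed Solr node
            CloudSolrServer server = new CloudSolrServer("localhost:2181");
            server.setDefaultCollection("hippo");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "id-1");
            doc.addField("title", "This is the title");
            server.add(doc);     // routed to the leader of the owning shard, then replicated
            server.commit();

            // The query is spread over the shards and the partial results are merged
            System.out.println(server.query(new SolrQuery("title:title")).getResults().getNumFound());
        }
    }
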
  16. elasticsearch • Distributed search engine built on top of Lucene

    • Apache 2 license • Written in Java • RESTful • Created and mainly developed by Shay Banon • A company behind it: elasticsearch.com • Regular releases – Latest release 0.20.2
  17. elasticsearch • Schemaless – Uses defaults and automatic type guessing

    – Custom mappings may be defined if needed • JSON oriented • Multi tenancy – Multiple indexes per node, multiple types per index • Designed to be distributed from the beginning • Almost everything is available as API (including configuration) • Wide range of administration APIs
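
    A hedged sketch of defining a custom mapping with the Java client of that era
    (index name, type and fields are assumptions; the REST equivalent is a PUT of
    the same JSON). Fields not listed still get mapped automatically the first time
    they are seen.

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class CustomMappingExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Create the index and register an explicit mapping for the "users" type
            client.admin().indices().prepareCreate("hippo")
                .addMapping("users",
                    "{ \"users\" : { \"properties\" : { " +
                    "  \"first_name\" : { \"type\" : \"string\" }, " +
                    "  \"created\"    : { \"type\" : \"date\" } } } }")
                .execute().actionGet();

            client.close();
        }
    }
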
  18. elasticsearch distributed terminology • Node: a running instance of elasticsearch

    which belongs to a cluster (usually one node per server) • Cluster: one or more nodes with the same cluster name • Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards. • Index: a logical namespace which points to one or more shards – Your code won't deal directly with a shard, only with an index – But an index is composed of multiple Lucene indexes (one per shard)
  19. elasticsearch distributed terminology • More shards: – improve indexing performance

    – increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well! • More replicas: – improve failover – improve querying performance
  20. Transaction Log • Indexed docs are fully persistent • No

    need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
  21. Index - Shards & Replicas

    [Diagram: a client and two empty elasticsearch nodes]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'
  22. Index - Shards & Replicas

    [Diagram: two nodes; one holds Shard 0 (primary) and Shard 1 (replica),
    the other holds Shard 0 (replica) and Shard 1 (primary)]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } }'
  23. Indexing - 1

    [Diagram: same two-node, two-shard layout; the client indexes a document]

    • Automatic sharding, push replication

    curl -XPUT localhost:9200/hippo/users/1 -d '
    { "name" : { "first" : "Luca", "last" : "Cavanna" } }'
  24. Indexing - 2

    [Diagram: same two-node, two-shard layout; the client indexes a second document]

    curl -XPUT localhost:9200/hippo/users/2 -d '
    { "name" : { "first" : "Jeroen", "last" : "Reijn" } }'
  25. Search - 1

    [Diagram: same two-node, two-shard layout; the client sends a search request]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Scatter / Gather search
  26. Search - 2

    [Diagram: same layout; the search request hits a different copy of each shard]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Automatic balancing between replicas
  27. Search - 3

    [Diagram: same layout; one shard copy has failed, the search is answered by the surviving copy]

    curl -XGET localhost:9200/hippo/_search?q=luca

    • Automatic failover
  28. Adding a node

    [Diagram: two nodes, each holding one primary and one replica shard]

    • “Hot” reallocation of shards to the new node
  29. Adding a node

    [Diagram: a third node joins the cluster]

    • “Hot” reallocation of shards to the new node
  30. Adding a node

    [Diagram: the Shard 0 replica is relocated to the new node]

    • “Hot” reallocation of shards to the new node
  31. Node failure

    [Diagram: three nodes; the node holding Shard 0 (primary) and Shard 1 (replica) fails]
  32. Node failure - 1

    [Diagram: two remaining nodes; the Shard 0 replica has been promoted to primary]

    • Replicas can automatically become primaries
  33. Node failure - 2

    [Diagram: new replicas of Shard 0 and Shard 1 are allocated on the remaining nodes]

    • Shards are automatically assigned and do “hot” recovery
  34. Dynamic Replicas

    [Diagram: three nodes; Shard 0 (primary) on one, Shard 0 (replica) on another, the third node empty]

    curl -XPUT localhost:9200/hippo -d '
    { "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 } }'
  35. Dynamic Replicas

    [Diagram: a second Shard 0 replica is allocated on the previously empty node]

    curl -XPUT localhost:9200/hippo/_settings -d '
    { "index" : { "number_of_replicas" : 2 } }'
  36. Indexing (Push) - ElasticSearch

    • Documents added through push requests • Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported

    {
      "name": "Luca Cavanna",
      "location": {
        "city": "Amsterdam",
        "country": "The Netherlands"
      }
    }
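
    The same document pushed through the Java TransportClient, as a hedged sketch
    (host, index and type names are assumptions; the REST calls on the earlier
    slides are equivalent):

    import org.elasticsearch.action.index.IndexResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class PushIndexingExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Index a JSON document into index "hippo", type "users", id "1"
            IndexResponse response = client.prepareIndex("hippo", "users", "1")
                .setSource("{ \"name\": \"Luca Cavanna\", " +
                           "\"location\": { \"city\": \"Amsterdam\", \"country\": \"The Netherlands\" } }")
                .execute().actionGet();

            // Versioning comes for free: every update of the same id bumps the version
            System.out.println("Indexed version " + response.getVersion());

            client.close();
        }
    }
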
  37. Indexing (Pull) - ElasticSearch • Data flows from sources using

    ‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr
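
    A hedged sketch of how a river was registered at the time: a "_meta" document
    is indexed into the reserved "_river" index and elasticsearch starts the
    corresponding puller. The CouchDB settings below are illustrative and may not
    match the plugin's exact option names.

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class RiverRegistrationExample {
        public static void main(String[] args) {
            Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Registering a river is just indexing its _meta document
            client.prepareIndex("_river", "my_couchdb_river", "_meta")
                .setSource("{ \"type\" : \"couchdb\", " +
                           "\"couchdb\" : { \"host\" : \"localhost\", \"port\" : 5984, \"db\" : \"hippo\" } }")
                .execute().actionGet();

            client.close();
        }
    }
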
  38. Searching - ElasticSearch • Search request in Request Body •

    Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more
  39. Search Example - ElasticSearch

    $ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '{
      "query" : {
        "term" : { "first_name" : "luca" }
      }
    }'

    {
      "_shards": { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits": {
        "total" : 1,
        "hits" : [ {
          "_index" : "hippo",
          "_type" : "users",
          "_id" : "1",
          "_source" : { "first_name" : "Luca", "last_name" : "Cavanna" }
        } ]
      }
    }
  40. Thanks There would be a lot more to say: •

    Query DSL • Scripting module (pluggable implementation) • Percolator • Running it embedded Check them out yourself if you are interested! Questions?