entirely in Java • It indexes documents as collections of fields • A field is a string based key-value pair • What data structure does it use under the hood?
in the town 2 In the big old house in the big old gown. 3 The house in the town had the big old keep 4 Where the old night keeper never did sleep. 5 The night keeper keeps the keep in the night 6 And keeps in the dark and sleeps in the light. term freq Posting list and 1 6 big 2 2 3 dark 1 6 did 1 4 grown 1 2 had 1 3 house 2 2 3 in 5 1 2 3 5 6 keep 3 1 3 5 keeper 3 1 4 5 keeps 3 1 5 6 light 1 6 never 1 4 night 3 1 4 5 old 4 1 2 3 4 sleep 1 4 sleeps 1 6 the 6 1 2 3 4 5 6 town 2 1 3 where 1 4
and more • The inverted index can contain more data – Term offsets and more • The inverted index itself doesn't contain the text for displaying the search results
With the Lucene near-real time API you don't need a commit to make new documents searchable • Less expensive than commit • Doesn't guarantee durability though • Exposed as soft commit in Solr 4.0
Interface to send document to (HTTP) • A way to represent queries • Interface to send queries to (HTTP) • Configuration • Caching • Distributed infrastructure • And more....
the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging may be CPU/IO intensive operations ‣ Disk cache invalidation ‣ Fail over
machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
• uses Apache Zookeeper as a system of record for the cluster state, for central configuration, and for leader election • Whatever server (shard) you send data to: • the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up adding new Solr nodes
• Apache 2 license • Written in Java • RESTful • Created and mainly developed by Shay Banon • A company behind it: elasticsearch.com • Regular releases – Latest release 0.20.2
– Custom mappings may be defined if needed • JSON oriented • Multi tenancy – Multiple indexes per node, multiple types per index • Designed to be distributed from the beginning • Almost everything is available as API (including configuration) • Wide range of administration APIs
which belongs to a cluster (usually one node per server) • Cluster: one or more nodes with the same cluster name • Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards. • Index: a logical namespace which points to one or more shards – Your code won't deal directly with a shard, only with an index – But an index is composed of more lucene indexes (one per shard)
– increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well! • More replicas: – increase failover – improve querying performance
need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
• Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported { "name": "Luca Cavanna", "location": { "city": "Amsterdam", "country": "The Netherlands" } }
‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr River River River River
Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more