Engineer at Blibli. ▸ Part of the Research and Development Team at Blibli. ▸ Codes in Scala and Java, and sometimes in Ruby (this demo uses Ruby). ▸ https://www.linkedin.com/in/khannedy
always available, and to scale with your needs. Scale can come from buying bigger servers (vertical scale) or from buying more servers (horizontal scale). ▸ Real scalability comes from horizontal scale - the ability to add more nodes to the cluster and to spread load and reliability between them. ▸ Elasticsearch is distributed by nature; it knows how to manage multiple nodes to provide scale and high availability. This also means that your application doesn’t need to care about it.
running instance of Elasticsearch. ▸ A cluster consists of one or more nodes with the same cluster.name that work together to share their data and workload. ▸ As nodes are added to or removed from the cluster, the cluster reorganizes itself to spread the data evenly.
and replica shards are active. ▸ YELLOW : All primary shards are active, but not all replica shards are active. ▸ RED : Not all primary shards are active.
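The three statuses can be read as a simple rule over shard states. The following is a minimal sketch of that rule, not Elasticsearch's internal implementation; the Shard struct and the cluster_status method are hypothetical names introduced only for illustration.

```ruby
# Toy model of the health rule. Shard and cluster_status are hypothetical
# names, not part of any Elasticsearch client library.
Shard = Struct.new(:primary, :active)

def cluster_status(shards)
  primaries = shards.select(&:primary)
  replicas  = shards.reject(&:primary)

  # RED: at least one primary shard is not active.
  return :red unless primaries.all?(&:active)
  # YELLOW: all primaries are active, but at least one replica is not.
  return :yellow unless replicas.all?(&:active)
  # GREEN: every primary and every replica is active.
  :green
end

shards = [
  Shard.new(true, true),   # primary shard, active
  Shard.new(false, false)  # replica shard, not yet assigned
]
puts cluster_status(shards)
```

A single-node cluster with one replica configured typically looks like the example above: the replica has nowhere to go, so the status stays yellow until a second node joins.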
Elasticsearch will give you 5 primary shards and 1 replica per index (the default in versions before 7.0; later versions default to 1 primary shard). ▸ You can change the number of replicas at runtime without downtime. ▸ But you cannot change the number of primary shards of an existing index. To change it, you need to create a new index and reindex the data from the old index into the new one.
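As a rough illustration of the difference between the two settings, the sketch below only builds the JSON request bodies and sends nothing. The helper names are hypothetical. The bodies are meant for a create-index request (PUT /<index>) and an update-settings request (PUT /<index>/_settings); check them against the documentation for your Elasticsearch version.

```ruby
require 'json'

# Body for creating an index. The primary shard count is fixed from here on.
# (Hypothetical helper; it only builds the request body.)
def create_index_body(shards:, replicas:)
  { settings: { number_of_shards: shards, number_of_replicas: replicas } }
end

# Body for updating a live index. Only the replica count is changed here.
# (Hypothetical helper; it only builds the request body.)
def update_replicas_body(replicas)
  { index: { number_of_replicas: replicas } }
end

puts JSON.generate(create_index_body(shards: 5, replicas: 1))
puts JSON.generate(update_replicas_body(2))
```

Keeping the two bodies separate mirrors the rule on the slide: the shard count appears only at creation time, while the replica count can be sent again whenever needed.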
written to disk is immutable: it doesn’t change. Ever. This immutability has important benefits. ▸ There is no need for locking. If you never have to update the index, you never have to worry about multiple processes trying to make changes at the same time.
immutable, so the document cannot be removed, nor can it be updated to a newer version of the document. ▸ Every commit point includes a .del file that lists which documents have been deleted. ▸ When a document is deleted, it is actually only marked as deleted in the .del file. ▸ Document updates work in a similar way: when a document is updated, the old version of the document is marked as deleted, and the new version of the document is indexed in a new segment.
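The delete and update behaviour described above can be modelled in a few lines. This is a toy sketch under simplifying assumptions (one document per segment, everything in memory); ToyIndex and its methods are hypothetical and do not reflect how Lucene actually stores segments.

```ruby
require 'set'

# Toy model: segments are frozen once written; deletes only add tombstones.
class ToyIndex
  def initialize
    @segments = []  # immutable segments, oldest first
    @deleted  = []  # per-segment set of deleted ids (plays the role of .del)
  end

  # Indexing an existing id is an update: mark the old version as deleted,
  # then write the new version into a brand-new segment.
  def index(id, doc)
    delete(id)
    @segments << { id => doc }.freeze
    @deleted  << Set.new
    self
  end

  # Deleting never touches a segment; it only records the id as deleted.
  def delete(id)
    @segments.each_with_index do |segment, i|
      @deleted[i] << id if segment.key?(id)
    end
    self
  end

  # Reads skip any version that has been marked as deleted.
  def get(id)
    @segments.each_with_index do |segment, i|
      return segment[id] if segment.key?(id) && !@deleted[i].include?(id)
    end
    nil
  end

  def segment_count
    @segments.size
  end
end

idx = ToyIndex.new
idx.index(1, 'first version').index(1, 'second version')
puts idx.get(1)
puts idx.segment_count
```

Note that the old version still occupies a segment after the update; in this model nothing is ever physically removed, only filtered out at read time.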
complicated execution model because we don’t know which documents will match the query: they could be on any shard in the cluster. ▸ Finding all matching documents is only half the story. Results from multiple shards must be combined into a single sorted list before they are returned. ▸ For this reason, search is executed in a two-phase process called “query then fetch”.
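The two phases can be sketched with plain Ruby collections. This is a simplified illustration, assuming each shard is a hash of document id to a score and a body; the method names and data layout are hypothetical and leave out everything a real cluster does (networking, routing, relevance scoring).

```ruby
# Each "shard" is a hash: doc id => { score: Float, body: String }.
# All names here are hypothetical, for illustration only.

# Query phase on one shard: return only ids and scores for the local
# top (from + size) documents, not the documents themselves.
def query_phase(shard, from, size)
  shard.map { |id, doc| [id, doc[:score]] }
       .sort_by { |_id, score| -score }
       .first(from + size)
end

def query_then_fetch(shards, from:, size:)
  # Query phase: gather lightweight [shard_index, id, score] entries.
  candidates = shards.each_with_index.flat_map do |shard, i|
    query_phase(shard, from, size).map { |id, score| [i, id, score] }
  end

  # The coordinating node merges everything into one sorted list
  # and keeps only the requested page.
  page = candidates.sort_by { |_i, _id, score| -score }.slice(from, size) || []

  # Fetch phase: load full documents only for the hits on that page.
  page.map { |i, id, _score| shards[i][id][:body] }
end

shards = [
  { 'a' => { score: 3.0, body: 'doc A' }, 'b' => { score: 1.0, body: 'doc B' } },
  { 'c' => { score: 2.0, body: 'doc C' } }
]
p query_then_fetch(shards, from: 0, size: 2)
```

The point of splitting the work is that full document bodies travel over the network only for the hits that end up on the page.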
must build a priority queue of length from + size, all of which needs to be passed back to the coordinating node. The coordinating node then needs to sort through number_of_shards * (from + size) documents to find the size documents that belong on the requested page. ▸ With big enough from values, the sorting process can become very heavy indeed, using vast amounts of CPU, memory and bandwidth. ▸ For this reason, we strongly advise against deep paging. ▸ As an alternative, we can use the Scan & Scroll API for deep pagination.
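A quick calculation shows how fast the cost grows. Each shard returns from + size entries, so the coordinating node handles number_of_shards * (from + size) of them. The method name below is hypothetical, and the shard count of 5 is just an example value.

```ruby
# Number of entries the coordinating node must sort for one page.
# (Hypothetical helper for illustrating the arithmetic.)
def entries_sorted_by_coordinator(number_of_shards:, from:, size:)
  number_of_shards * (from + size)
end

# First page: 5 * (0 + 10) = 50 entries.
puts entries_sorted_by_coordinator(number_of_shards: 5, from: 0, size: 10)

# Page starting at result 10,000: 5 * (10_000 + 10) = 50,050 entries,
# of which all but 10 are thrown away.
puts entries_sorted_by_coordinator(number_of_shards: 5, from: 10_000, size: 10)
```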