Life of a Document in Elasticsearch

Boaz Leskes
November 25, 2015

This is a variant of an Elastic{ON} 15 talk with the same title. It was given at the Vrije Universiteit as a guest lecture.

Abstract:
Ever wondered about the lifecycle of a single document in Elasticsearch? What happens when you index it? How does Elasticsearch ensure a document is replicated and found across the whole cluster reliably? Does deleting a document really physically remove the document from disk? How do we get from a blob of text and keywords to near real-time search and analytics?

This talk will explain the when, where and why of your document's life inside of Elasticsearch. Alex and Boaz will take you through the journey of a document across a cluster, taking off with JSON and the curly braces, travelling through the network into the memory and all the analysis chains, heading further onto storage when writing into the transaction logs and Apache Lucene index, being read back by executing searches, all the way until the document is finally deleted.

Even though this talk will cover a lot of different aspects, it's a talk for those who may be less familiar with core search functionality. You do not need to be an Apache Lucene wizard to follow along and find this session useful.

Transcript

  1. Agenda: Life of a document in Elasticsearch. How, when and where is a document stored/processed in Elasticsearch? (CC-BY-ND 4.0)
  2. Wait… Elasticsearch?
    • Distributed, highly available data storage & retrieval
    • Search engine / aggregation engine / analytics
    • Suggestions / percolation / highlighting / geo
    • Document store
    • Horizontally distributed
    • High availability
    • Real time & near real time
    • Open Source
    • Apache License
    • Proprietary plugins available for security & messaging
  3. Exhibit A: A JSON document

    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  4. Exhibit A: A JSON document

    PUT /books/book/978-1449358549
    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  5. Indexing a document. The client sends PUT /books/book/978-1449358549 to Elasticsearch and receives 200 OK.
  6. Indexing a document. Client: PUT /books/book/978-1449358549, then 200 OK. What happens in between?
  7. Receiving: Where to put it? node 1 receives PUT /books/book/978-1449358549 { }
  8. Receiving: Where to put it? node 1 receives PUT /books/book/978-1449358549 { } and computes hash(978-1449358549) % number_of_shards
  9. Receiving: Where to put it? node 1 receives PUT /books/book/978-1449358549 { } and passes it on to node 2
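The routing formula on the slides above, hash(id) % number_of_shards, can be sketched in a few lines of Python. This is only an illustration: `pick_shard` is a hypothetical helper name, and `zlib.crc32` is a deterministic stand-in chosen for this sketch, not the hash function Elasticsearch itself uses.

```python
import zlib


def pick_shard(routing_key: str, number_of_shards: int) -> int:
    """Map a routing key (here: the document id) to a shard number."""
    # Assumption: crc32 stands in for the real hash function. Any hash that
    # is stable across nodes and restarts illustrates the idea equally well.
    key_hash = zlib.crc32(routing_key.encode("utf-8"))
    return key_hash % number_of_shards


# Every node computes the same shard for the same id, so whichever node
# receives the request knows where the document has to go.
shard = pick_shard("978-1449358549", 5)
print(shard)
```

Because number_of_shards is part of the formula, the same id would generally land on a different shard if the shard count changed, which is one reason the number of primary shards is fixed when an index is created.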
  10. Next step: Analysis. Each field is put into the inverted index

    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  11. Analysis: Each field is put into the inverted index

    Clinton Gormley ->

    Input             Tokenization         Analysis
    Clinton Gormley   [Clinton, Gormley]   [clinton, gormley]
    Zachary Tong      [Zachary, Tong]      [zachary, tong]
  12. Analysis: Resulting in a postings list

    Clinton Gormley ->

    Value     Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1
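The analysis steps above (tokenize, lowercase, collect terms into a postings list) can be reproduced with a small Python sketch. The names `analyze` and `build_postings` are hypothetical, and whitespace splitting plus lowercasing is a simplification of what a real analyzer does.

```python
from collections import defaultdict


def analyze(text):
    """Simplified analysis chain: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]


def build_postings(docs, field):
    """Build a term -> sorted list of document ids mapping for one field."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        values = doc[field]
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-element array
        for value in values:
            for term in analyze(value):
                postings[term].add(doc_id)
    # Terms are kept in sorted order, as in the postings list on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: {"authors": ["Clinton Gormley", "Zachary Tong"]}}
postings = build_postings(docs, "authors")
for term, doc_ids in postings.items():
    print(term, doc_ids)
```

With the single example document this yields the four terms clinton, gormley, tong and zachary, each pointing at document 1.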
  13. Lucene at work: Field by field

    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  14. Special case: _all field

    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  15. Special case: _source field

    {
      "name" : "Elasticsearch - The definitive guide",
      "pages" : 721,
      "isbn13" : "978-1449358549",
      "authors" : [ "Clinton Gormley" , "Zachary Tong" ],
      "publish_date" : "2015/01/31",
      "category" : "IT > Search Engines",
      "description" : "Whether you need full-text search or real-time analytics..."
    }
  16. Multifields (Sorting & Aggregations)

    PUT /books/book/_mapping
    {
      "book" : {
        "properties" : {
          "authors": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
              "raw": { "type": "string", "index": "not_analyzed" }
            }
          }
        }
      }
    }
  17. Multifields: Resulting in multiple postings lists

    Field: authors
    Value     Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1

    Field: authors.raw
    Value             Document ID
    Clinton Gormley   1
    Zachary Tong      1
  18. Completion suggester
    • Auto-complete, search-as-you-type
    • Data structure is stored on disk on indexing, then loaded into memory
  19. Term vectors
    • Used for term vector highlighter and more like this query
    • Single document inverted index
    • Makes it easy to find sorted unique terms as well as original positions and offsets
  20. Where are we now?
    • Document received via network
    • Document received on the correct node
    • Document analyzed, stored in translog & buffer
    • No reply to the client yet that everything is done
  21. Replication on indexing: Document level replication. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0
  22. Replication on indexing: Document level replication. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0
  23. Replication on indexing: Document level replication. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0
  24. Replication on indexing: Document level replication. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0
  25. Replication on indexing: Document level replication. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. Document got written 6 times! HTTP/1.1 200 OK
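A minimal sketch of document-level replication, assuming the primary copy indexes the document first and then forwards the same document to every replica before the client gets its reply. `ShardCopy` and `replicate_index` are hypothetical names, and which node holds the primary is an assumption made for the example, since the slides do not say.

```python
class ShardCopy:
    """One copy (primary or replica) of a shard, living on some node."""

    def __init__(self, node, shard_name):
        self.node = node
        self.shard_name = shard_name
        self.docs = {}

    def index(self, doc_id, source):
        # Each copy indexes the document itself instead of receiving
        # finished index files from the primary.
        self.docs[doc_id] = source
        return True


def replicate_index(primary, replicas, doc_id, source):
    """Index on the primary, then on all replicas; True means safe to ack."""
    if not primary.index(doc_id, source):
        return False
    acks = [replica.index(doc_id, source) for replica in replicas]
    return all(acks)


# Assumption for this example: the primary of shard a1 lives on node 2.
primary = ShardCopy("node 2", "a1")
replicas = [ShardCopy("node 1", "a1"), ShardCopy("node 3", "a1")]
acknowledged = replicate_index(
    primary, replicas, "978-1449358549", {"pages": 721}
)
print(acknowledged)
```

Only once every copy has confirmed the write does the coordinating node answer the client, which matches the point made earlier that no reply is sent until everything is done.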
  26. Searches: Collecting node. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. Query hits a node: GET /books/book/_search?q=elasticsearch
  27. Searches: Collecting node. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. QUERY phase executed on each shard
  28. Searches: Collecting node. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. QUERY results are sorted
  29. Searches: Collecting node. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. FETCH phase executed
  30. Searches: Collecting node. node 1: a0, a1; node 2: a1, a0; node 3: a1, a0. FETCH results are returned
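The two search phases above can be sketched as a scatter-gather in Python. Everything here is illustrative: the function names are hypothetical, the sample titles other than the book from the slides are made up, and the score is a plain term count rather than real relevance scoring.

```python
import heapq


def query_phase(shard_docs, term, size):
    """Per shard: return only (score, doc_id) pairs for the best hits."""
    scored = []
    for doc_id, source in shard_docs.items():
        score = source["name"].lower().split().count(term)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(size, scored)


def search(shards, term, size):
    """Collecting node: run QUERY on every shard, sort, then FETCH sources."""
    candidates = []
    for shard_name, shard_docs in shards.items():
        for score, doc_id in query_phase(shard_docs, term, size):
            candidates.append((score, doc_id, shard_name))
    # QUERY results from all shards are merged and sorted on the collecting node.
    top = heapq.nlargest(size, candidates)
    # FETCH phase: only the winning documents are loaded from their shards.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_name][doc_id]}
        for score, doc_id, shard_name in top
    ]


shards = {
    "a0": {"1": {"name": "Elasticsearch - The definitive guide"}},
    "a1": {
        "2": {"name": "Elasticsearch basics"},  # made-up sample document
        "3": {"name": "Lucene basics"},         # made-up sample document
    },
}
hits = search(shards, "elasticsearch", 10)
print([hit["_id"] for hit in hits])
```

Splitting the work this way means full documents travel over the network only for the final hits, not for every candidate on every shard.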
  31. Search types
    • Full text search: goes to disk, accesses segments, executes search
    • Aggregation: loads data into memory from segments, executes on in-memory data structure
  32. Memory: Fielddata

    GET /books/book/_search?search_type=count
    {
      "aggs" : {
        "author-top10" : {
          "terms" : { "field" : "authors" }
        }
      }
    }
  33. Memory: Fielddata. This does not work for aggs

    Clinton Gormley ->

    Term      Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1
  34. Memory: Fielddata. Uninverting the index from a raw field

    Document ID   Term
    1             Clinton Gormley, Zachary Tong

    Term              Document ID
    Clinton Gormley   1
    Zachary Tong      1
  35. Memory: Fielddata
    • Used for sorting & aggregations
    • All values of a field are loaded into memory
    • Per segment data structure
    • Can take a lot of memory
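Uninverting, as described above, flips a term -> document ids mapping into a document id -> terms mapping, which is the shape sorting and aggregations need. The following Python sketch uses hypothetical names (`uninvert`, `terms_aggregation`) and plain dictionaries in place of the real per-segment structures.

```python
from collections import Counter, defaultdict


def uninvert(postings):
    """Turn term -> [doc ids] into doc id -> [terms]."""
    fielddata = defaultdict(list)
    for term in sorted(postings):
        for doc_id in postings[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)


def terms_aggregation(fielddata, size=10):
    """Count how many documents carry each term; return the most common."""
    counts = Counter(term for terms in fielddata.values() for term in terms)
    return counts.most_common(size)


# Postings of the not_analyzed (raw) field: whole author names as terms.
raw_postings = {"Clinton Gormley": [1], "Zachary Tong": [1]}
fielddata = uninvert(raw_postings)
print(fielddata)
print(terms_aggregation(fielddata))
```

Running the same aggregation over the analyzed field would count clinton, gormley, tong and zachary separately, which is why the analyzed postings list does not work for a top-authors aggregation.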
  36. Doc values: On-disk vs. in-memory

    PUT /books/book/_mapping
    {
      "book" : {
        "properties" : {
          "authors": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
              "raw": { "type": "string", "index": "not_analyzed", "doc_values" : true }
            }
          }
        }
      }
    }
  37. Where are we now?
    • Original source is sent to collecting node during queries
    • Document lives in a segment as part of a shard/index
    • Document fields live in memory as part of fielddata
    • Segments are put into the filesystem cache by the OS
  38. Relocation: Let’s offload our fast node. node 1 (node.box_type: large): 0P STARTED, 1P STARTED; node 2 (node.box_type: small)

    PUT /books/
    { "settings" : { "index.routing.allocation.include.box_type" : "large" } }
  39. Relocation: Let’s offload our fast node. node 1 (node.box_type: large): 0P STARTED, 1P STARTED; node 2 (node.box_type: small)

    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "small" }
  40. Relocation: Let’s offload our fast node. node 1 (node.box_type: large): 0P STARTED, 1P RELOCATING; node 2 (node.box_type: small): 1P INITIALIZING

    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "small" }
  41. Relocation: Let’s offload our fast node. node 1 (node.box_type: large): 0P STARTED; node 2 (node.box_type: small): 1P STARTED

    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "small" }
  42. Relocation: Let’s offload our fast node. node 1 (node.box_type: large): 0P RELOCATING; node 2 (node.box_type: small): 1P STARTED, 1P INITIALIZING

    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "small" }
  43. Relocation: Let’s offload our fast node. node 1 (node.box_type: large); node 2 (node.box_type: small): 1P STARTED, 1P STARTED

    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "small" }
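The allocation filter used in the relocation slides can be read as a simple membership test: a node may hold the index if its attribute matches one of the values in the include setting. The sketch below is a hedged illustration with a hypothetical helper `eligible_nodes`; it assumes the filter is keyed on the same attribute name the nodes declare (box_type) and covers a single attribute only.

```python
def eligible_nodes(node_attributes, attribute, include_values):
    """Return the nodes whose attribute equals one of the include values."""
    # The include setting may list several comma-separated values.
    wanted = {value.strip() for value in include_values.split(",")}
    return sorted(
        name
        for name, attrs in node_attributes.items()
        if attrs.get(attribute) in wanted
    )


nodes = {
    "node 1": {"box_type": "large"},
    "node 2": {"box_type": "small"},
}
before = eligible_nodes(nodes, "box_type", "large")
after = eligible_nodes(nodes, "box_type", "small")
print(before, after)
```

Changing the setting from large to small changes the set of eligible nodes from node 1 to node 2, and the cluster reacts by relocating the shards one after another.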
  44. Relocation: What does RELOCATING mean? (node 1: 0P STARTED, 1P RELOCATING)
    • Point in time snapshot is copied (segments)
    • Write operations occur and create new segments
    • All write operations are stored in translog
    • This translog is sent to the other node and replayed
  45. Relocation
    • Process of copying whole shards across the cluster
    • Triggered by
      – adding/removing nodes from the cluster
      – exceeded disk thresholds
      – shard allocation filtering
  46. Where are we now?
    • Relocation moves shards across the cluster
    • Document is duplicated by being copied from one node to another, as part of a segment
    • Segments are kept open until operation is complete
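The RELOCATING steps listed above (copy a point-in-time snapshot of the segments, keep accepting writes, then replay the translog on the target) can be sketched as follows. `Shard` and `relocate` are hypothetical names, and creating one segment per write is a deliberate simplification.

```python
class Shard:
    """A toy shard: a list of immutable segments plus a translog."""

    def __init__(self):
        self.segments = []  # each segment is a tuple of (doc_id, source)
        self.translog = []  # every write operation, in arrival order

    def index(self, doc_id, source):
        self.translog.append((doc_id, source))
        # Simplification: every write immediately becomes its own segment.
        self.segments.append(((doc_id, source),))

    def documents(self):
        return {
            doc_id: source
            for segment in self.segments
            for doc_id, source in segment
        }


def relocate(source_shard, writes_during_copy):
    """Copy a shard to a new target while writes keep arriving."""
    target = Shard()
    # 1. Point-in-time snapshot of the existing segments.
    snapshot = list(source_shard.segments)
    translog_start = len(source_shard.translog)
    target.segments.extend(snapshot)
    # 2. Writes continue on the source during the copy.
    for doc_id, source in writes_during_copy:
        source_shard.index(doc_id, source)
    # 3. Replay the translog operations recorded since the snapshot.
    for doc_id, source in source_shard.translog[translog_start:]:
        target.index(doc_id, source)
    return target


old = Shard()
old.index("978-1449358549", {"pages": 721})
new = relocate(old, [("hypothetical-id", {"pages": 100})])
print(new.documents() == old.documents())
```

The translog replay is what lets the source stay writable during the copy: the snapshot alone would miss every document indexed after the copy started.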
  47. Data deletion. node 1: b0, b1; node 2: b1, b0; node 3: b0, b1. DELETE /books/
  48. Restore from a snapshot. node 1, node 2, node 3. POST /_snapshot/books/20150310/_restore
  49. Restore from a snapshot. node 1, node 2, node 3. POST /_snapshot/books/20150310/_restore. Repository: S3, FS, HDFS, Azure
  50. Restore from a snapshot. node 1, node 2, node 3. POST /_snapshot/books/20150310/_restore. Shards: b0, b1
  51. Restore from a snapshot. node 1, node 2, node 3. POST /_snapshot/books/20150310/_restore. Shards: b1, b0, b1, b0
  52. Creating a snapshot. node 1, node 2, node 3. PUT /_snapshot/books/20150310. Shards: b1, b0, b1, b0
  53. Data deletion. node 1: b0, b1; node 2: b1, b0; node 3: b0, b1. DELETE /books/book/978-1449358549
  54. Document deletion
    • Documents are immutable
    • Deletes are soft: document ids are added to a tombstone file to be marked for deletion
    • Actual deletion happens on the next merge of that segment
  55. Lucene merges
    • Segments are created all the time when indexing data
    • Merging creates one big segment from several smaller ones
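Soft deletes and merges, as described above, fit together like this: a delete only records a tombstone, and the document physically disappears when its segment is merged into a new one. The Python sketch below uses hypothetical names (`Segment`, `merge`) and an in-memory set in place of the tombstone file.

```python
class Segment:
    """A toy segment: documents are never changed after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.deleted = set()  # tombstones: ids marked for deletion

    def delete(self, doc_id):
        # Soft delete: the document stays in the segment, only marked.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def live_docs(self):
        return {
            doc_id: source
            for doc_id, source in self.docs.items()
            if doc_id not in self.deleted
        }


def merge(segments):
    """Create one big segment from several smaller ones, dropping deletes."""
    merged = {}
    for segment in segments:
        merged.update(segment.live_docs())
    return Segment(merged)


first = Segment({"978-1449358549": {"pages": 721}})
second = Segment({"hypothetical-id": {"pages": 100}})
first.delete("978-1449358549")
still_stored = "978-1449358549" in first.docs  # the data is still there
big = merge([first, second])
print(still_stored, sorted(big.docs))
```

This is why deleting a document does not free its space right away: the data is only dropped once a merge rewrites the segment that holds it.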
  56. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. CC-BY-ND 4.0