Life of a Document in Elasticsearch

Elastic Co
March 19, 2015

Ever wondered about the lifecycle of a single document in Elasticsearch? What happens when you index it? How does Elasticsearch ensure a document is replicated and found across the whole cluster reliably? Does deleting a document really physically remove the document from disk? How do we get from a blob of text and keywords to near real-time search and analytics?

This talk will explain the when, where and why of your document's life inside of Elasticsearch. Alex and Boaz will take you through the journey of a document across a cluster, taking off with JSON and the curly braces, travelling through the network into the memory and all the analysis chains, heading further onto storage when writing into the transaction logs and Apache Lucene index, being read back by executing searches, all the way until the document is finally deleted.

Even though this talk will cover a lot of different aspects, it's a talk for those who may be less familiar with core search functionality. You do not need to be an Apache Lucene wizard to follow along and find this session useful.

Transcript

  1. Life of a document in Elasticsearch

    Alexander Reelsen - @spinscale
    Boaz Leskes - @bleskes
  2. { } CC-BY-ND 4.0 Agenda

    Life of a document in Elasticsearch
  3. Agenda

    Life of a document in Elasticsearch: How, when and where is a document stored/processed in Elasticsearch?
  4. About us

    Boaz Leskes: Core developer & Marvel lead
    Alexander Reelsen: Core & Shield developer
  5. Exhibit A: A JSON document

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  6. Exhibit A: A JSON document

    PUT /books/book/978-1449358549
    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  7. Indexing a document

    [Diagram: the client sends PUT /books/book/978-1449358549 to Elasticsearch and receives 200 OK]
  8. Indexing a document

    [Diagram: the same exchange (PUT /books/book/978-1449358549, 200 OK), with a "?" between request and response]
  9. Receiving: Where to store it?

    [Diagram: client C, PUT /books/book/978-1449358549 { }, node 1]
  10. Receiving: Where to put it?

    [Diagram: client C, PUT /books/book/978-1449358549 { }, node 1]
    hash(978-1449358549) % number_of_shards
  11. Receiving: Where to put it?

    [Diagram: client C, PUT /books/book/978-1449358549 { }, node 1, node 2]
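The routing rule shown on slide 10, hash(id) % number_of_shards, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for whatever hash function Elasticsearch uses internally.

```python
import zlib


def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Map a document ID to a shard number (illustrative only).

    zlib.crc32 is a stand-in hash chosen because it is stable across
    processes; it is NOT claimed to be the hash Elasticsearch uses.
    """
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# The ID from the slides, routed in a hypothetical index with 2 shards.
shard = pick_shard("978-1449358549", 2)
print(f"document goes to shard {shard}")
```

Because the result depends only on the document ID and the shard count, any node that receives the request computes the same shard number and can forward the document to the node holding that shard.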
  12. Next step: Analysis

    Each field is put into the inverted index
    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  13. Analysis

    Each field is put into the inverted index
    Clinton Gormley ->
    Input             Tokenization         Analysis
    Clinton Gormley   [Clinton, Gormley]   [clinton, gormley]
    Zachary Tong      [Zachary, Tong]      [zachary, tong]
  14. Analysis

    Resulting in a postings list
    Clinton Gormley ->
    Value     Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1
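The tokenization, analysis and postings-list steps on slides 13 and 14 can be reproduced with a toy analyzer. This is a minimal sketch: `analyze` and `build_postings` are hypothetical names, and splitting on whitespace plus lowercasing only approximates what a real analyzer does.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: split on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]


def build_postings(docs):
    """Build a postings list (term -> sorted document IDs) for one field.

    `docs` maps a document ID to the list of values of that field.
    """
    postings = defaultdict(set)
    for doc_id, values in docs.items():
        for value in values:
            for term in analyze(value):
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# The "authors" field of document 1 from the slides.
postings = build_postings({1: ["Clinton Gormley", "Zachary Tong"]})
for term, doc_ids in postings.items():
    print(term, doc_ids)
```

The resulting terms (clinton, gormley, tong, zachary), each pointing at document 1, match the table on slide 14.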
  15. Lucene at work: Field by field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  16. Special case: _all field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  17. Special case: _source field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  18. Multifields

    PUT /books/book/_mapping
    { "book" : { "properties" : { "authors": { "type": "string", "analyzer": "standard", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } } } } }
    Sorting & Aggregations
  19. Multifields

    Resulting in multiple postings lists
    Field: authors
    Value     Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1
    Field: authors.raw
    Value             Document ID
    Clinton Gormley   1
    Zachary Tong      1
  20. Completion suggester

    • Auto-complete, search-as-you-type
    • Data structure is stored on disk on indexing, then loaded into memory
  21. Term vectors

    • Handy when highlighting hits in huge documents
    • An index of a single document
    • Makes it easy to find sorted unique terms as well as original positions and offsets
  22. Buffering documents

    [Diagram: buffer, buffer]
  23. Buffering documents

    [Diagram: buffer, buffer]
  24. Buffering documents

    [Diagram: buffer, segment]
  25. Elasticsearch refresh

    [Diagram: buffer, translog]
  26. Elasticsearch refresh

    [Diagram: buffer, translog, segment]
  27. Indexing a document

    [Diagram: buffer, translog]
  28. Elasticsearch refresh

    [Diagram: buffer, translog, segment]
  29. Elasticsearch flush

    [Diagram: buffer, translog, segment, fsync]
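The buffer, translog, refresh and flush slides above can be read as a small state machine. The following toy model is a sketch of that reading, not Elasticsearch code: `ToyShard` and its methods are hypothetical, everything lives in Python lists, and the fsync on flush is only represented by a comment.

```python
class ToyShard:
    """Toy model of one shard: in-memory buffer, translog, and segments."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.translog = []  # record of operations since the last flush
        self.segments = []  # searchable; each segment is an immutable tuple

    def index(self, doc):
        # A new document goes into both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Refresh turns the buffer into a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Flush: make segments durable (fsync in the real system),
        # after which the translog is no longer needed.
        self.refresh()
        self.translog = []

    def search(self, predicate):
        # Only documents in segments are visible to searches.
        return [doc for seg in self.segments for doc in seg if predicate(doc)]


toy = ToyShard()
toy.index({"isbn13": "978-1449358549", "pages": 721})
before_refresh = toy.search(lambda d: d["pages"] > 700)  # nothing visible yet
toy.refresh()
after_refresh = toy.search(lambda d: d["pages"] > 700)   # now visible
translog_size_before_flush = len(toy.translog)
toy.flush()
```

The gap between `index` and `refresh` is why search is described as near real-time rather than real-time: a document is safe in the translog before it becomes searchable.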
  30. Where are we now?

    • Document received via network
    • Document received on the correct node
    • Document analyzed, stored in translog & buffer
    • Client is still waiting for a response
  31. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  32. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  33. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  34. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  35. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  36. Replication on indexing

    Document level replication
    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    Document got written 6 times!
    HTTP/1.1 200 OK
  37. Searching

  38. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    query hits node
    GET /books/book/_search?q=elasticsearch
  39. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    QUERY phase executed on each shard
  40. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    QUERY results are sorted
  41. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    FETCH phase executed
  42. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
    FETCH results are returned
  43. Searches: Collecting node

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
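The two phases named on the collecting-node slides (QUERY on each shard, sort, then FETCH) can be sketched as follows. This is a simplified illustration under stated assumptions: `query_phase` and `search` are hypothetical functions, shards are plain dictionaries, and the score is a naive term count rather than real relevance scoring.

```python
def query_phase(shard_id, shard, term, size):
    """Run the query on one shard; return lightweight (score, shard, id) hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, shard_id, doc_id))
    return sorted(hits, reverse=True)[:size]


def search(shards, term, size=10):
    """Collecting node: gather per-shard hits, sort them, then fetch documents."""
    candidates = []
    for shard_id, shard in enumerate(shards):
        candidates.extend(query_phase(shard_id, shard, term, size))
    top = sorted(candidates, reverse=True)[:size]  # QUERY results are sorted
    # FETCH phase: only the winning documents are loaded from their shards.
    return [(doc_id, shards[shard_id][doc_id]) for _, shard_id, doc_id in top]


# Two hypothetical shards with made-up document texts.
shards = [
    {"a": "elasticsearch guide"},
    {"b": "elasticsearch elasticsearch search", "c": "lucene in action"},
]
best = search(shards, "elasticsearch", size=1)
all_hits = search(shards, "elasticsearch")
```

Splitting the work this way means each shard only ships IDs and scores in the first phase; full documents travel over the network only for the final top hits.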
  44. Search types

    • Full text search
      – Goes to disk, accesses segments, executes search
    • Aggregation
      – Loads data into memory from segments, executes on in-memory data structure
  45. Memory: Fielddata

    GET /books/book/_search?search_type=count
    { "aggs" : { "author-top10" : { "terms" : { "field" : "authors.raw" } } } }
  46. Memory: Fielddata

    This does not work for aggs
    Clinton Gormley ->
    Term      Document ID
    clinton   1
    gormley   1
    tong      1
    zachary   1
  47. Memory: Fielddata

    Uninverting the index from a raw field
    Document ID   Term
    1             Clinton Gormley, Zachary Tong
    Term              Document ID
    Clinton Gormley   1
    Zachary Tong      1
  48. Memory: Fielddata

    • Used for sorting & aggregations
    • All values of a field are loaded into the heap
    • Per segment data structure
    • Can take a lot of memory
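"Uninverting" on slide 47 means turning the term -> document mapping of the raw field into a document -> terms mapping, which is the shape an aggregation needs. A minimal sketch, assuming plain dictionaries in place of the real per-segment structures; `uninvert` and `terms_agg` are hypothetical names.

```python
from collections import Counter, defaultdict


def uninvert(postings):
    """Turn term -> [doc IDs] into doc ID -> [terms]."""
    doc_to_terms = defaultdict(list)
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            doc_to_terms[doc_id].append(term)
    return {doc_id: sorted(terms) for doc_id, terms in doc_to_terms.items()}


def terms_agg(doc_to_terms, matching_doc_ids, size=10):
    """Toy terms aggregation: count field values over the matching documents."""
    counts = Counter()
    for doc_id in matching_doc_ids:
        counts.update(doc_to_terms.get(doc_id, []))
    return counts.most_common(size)


# Postings list of the authors.raw field from the slides.
raw_postings = {"Clinton Gormley": [1], "Zachary Tong": [1]}
fielddata = uninvert(raw_postings)
top_authors = terms_agg(fielddata, matching_doc_ids=[1])
```

Running the same aggregation on the analyzed `authors` field would count fragments such as "clinton" and "tong" instead of whole author names, which is why the slides point the aggregation at `authors.raw`.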
  49. Doc values

    On-Disk vs. in-memory
    PUT /books/book/_mapping
    { "book" : { "properties" : { "authors": { "type": "string", "analyzer": "standard", "fields": { "raw": { "type": "string", "index": "not_analyzed", "doc_values" : true } } } } } }
  50. Where are we now?

    • Original source is sent to collecting node during queries
    • Document lives in a segment as part of a shard/index
    • Document fields live in memory as part of fielddata
    • Segments are put into the filesystem cache by the OS
  51. Scaling your cluster

  52. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd): 0P STARTED, 1P STARTED; node 2 (node.box_type: hdd)]
    PUT /books/
    { "settings" : { "index.routing.allocation.include.box_type" : "ssd" } }
  53. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd): 0P STARTED, 1P STARTED; node 2 (node.box_type: hdd)]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  54. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd): 0P STARTED, 1P RELOCATING; node 2 (node.box_type: hdd): 1P INITIALIZING]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  55. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd): 0P STARTED; node 2 (node.box_type: hdd): 1P STARTED]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  56. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd): 0P RELOCATING; node 2 (node.box_type: hdd): 1P STARTED, 1P INITIALIZING]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  57. Relocation

    Let’s offload our fast node
    [Diagram: node 1 (node.box_type: ssd); node 2 (node.box_type: hdd): 1P STARTED, 1P STARTED]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  58. Relocation

    What does RELOCATING mean?
    • Point in time snapshot is copied (segments)
    • Write operations occur and create new segments
    • All write operations are stored in translog
    • This translog is sent to the other node and replayed
    [Diagram: node 1: 0P STARTED, 1P RELOCATING]
  59. Relocation

    • Process of copying whole shards across the cluster
    • Triggered by
      – adding/removing nodes from the cluster
      – exceeded disk thresholds
      – shard allocation filtering
  60. Where are we now?

    • Relocation moves shards across the cluster
    • Document is duplicated by being copied from one node to another, as part of a segment
    • Segments are kept open until operation is complete
  61. Back up your data

  62. How to back up?

    [Diagram: node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0)]
  63. How to back up?

    [Diagram: node 1 (b0, b1), node 2 (b1, b0), node 3 (b1, b0); Repository: S3, FS, HDFS, Azure]
  64. How to back up?

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b0, b1); repository (S3, FS, HDFS, Azure) with b1, b0]
    PUT /_snapshot/books/20150310
  65. Restore from a snapshot

    [Diagram: client C; node 1, node 2, node 3; b0, b1]
    POST /_snapshot/books/20150310/_restore
  66. Restore from a snapshot

    [Diagram: client C; node 1, node 2, node 3; b1, b0, b1, b0]
    POST /_snapshot/books/20150310/_restore
  67. Deleting data

  68. Index deletion

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b0, b1)]
    DELETE /books/
  69. Index deletion

    [Diagram: client C; node 1, node 2, node 3]
    DELETE /books/
  70. Document deletion

    [Diagram: client C; node 1 (b0, b1), node 2 (b1, b0), node 3 (b0, b1)]
    DELETE /books/book/978-1449358549
  71. Document deletion

    • Segments are immutable
    • Deletes are soft: document ids are added to a tombstone file to be marked for deletion
    • Actual deletion happens on next merge of that segment
  72. Lucene merges

    [Diagram: segment]
    • Segments are created all the time when indexing data
    • Merging creates one big segment from several smaller ones
  73. Lucene merges

    [Diagram: segment, segment]
  74. Lucene merges

    [Diagram: segment, segment, segment]
  75. Lucene merges

    [Diagram: segment, segment, segment, segment]
  76. Lucene merges

    [Diagram: segment, segment, segment, segment]
    document exists twice!
  77. Lucene merges

    [Diagram: segment, segment, segment, segment]
  78. Lucene merges

    [Diagram: segment]
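Soft deletes and merges, as described on the document deletion and Lucene merges slides, fit together as sketched below. This is a toy model under stated assumptions: `Segment`, `delete` and `merge` are hypothetical, a Python set plays the role of the tombstone file, and the second document ID is made up for the example.

```python
class Segment:
    """Toy segment: documents are never changed after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc ID -> source
        self.deleted = set()    # tombstones: IDs marked as deleted

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}


def delete(segments, doc_id):
    """Soft delete: only record a tombstone, the segment data stays as is."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)


def merge(segments):
    """Merge several segments into one, dropping tombstoned documents."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return Segment(merged)


old = Segment({"978-1449358549": {"pages": 721}})
new = Segment({"other-book": {"pages": 100}})  # hypothetical second document
segments = [old, new]

delete(segments, "978-1449358549")
still_on_disk = "978-1449358549" in old.docs  # the delete did not touch the data
merged = merge(segments)                      # the merge physically drops it
```

Until the old segments are discarded, live documents exist both there and in the merged segment, which is the duplication slide 76 points out.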

  79. Don’t be afraid of data duplication!

  80. Don’t be afraid of data duplication!

    ... if you are aware of its lifecycle!
  81. Thanks for listening! Questions?

    @bleskes @spinscale
  82. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License.

    To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons PO Box 1866 Mountain View, CA 94042 USA
    CC-BY-ND 4.0