Life of a Document in Elasticsearch

Elastic Co
March 19, 2015

Ever wondered about the lifecycle of a single document in Elasticsearch? What happens when you index it? How does Elasticsearch ensure a document is replicated and found across the whole cluster reliably? Does deleting a document really physically remove the document from disk? How do we get from a blob of text and keywords to near real-time search and analytics?

This talk will explain the when, where and why of your document's life inside of Elasticsearch. Alex and Boaz will take you through the journey of a document across a cluster, taking off with JSON and the curly braces, travelling through the network into the memory and all the analysis chains, heading further onto storage when writing into the transaction logs and Apache Lucene index, being read back by executing searches, all the way until the document is finally deleted.

Even though this talk will cover a lot of different aspects, it's a talk for those who may be less familiar with core search functionality. You do not need to be an Apache Lucene wizard to follow along and find this session useful.

Transcript

  1. Agenda (CC-BY-ND 4.0)

    Life of a document in Elasticsearch. How, when and where is a document stored/processed in Elasticsearch?
  2. About us

    Boaz Leskes: Core developer & Marvel lead. Alexander Reelsen: Core & Shield developer.
  3. Exhibit A: A JSON document

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  4. Exhibit A: A JSON document

    PUT /books/book/978-1449358549
    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  5. Indexing a document

    [Diagram: Client and Elasticsearch; request PUT /books/book/978-1449358549, response 200 OK]
  6. Indexing a document

    [Diagram: Client; request PUT /books/book/978-1449358549, response 200 OK; Elasticsearch is shown as "?"]
  7. Receiving: Where to store it?

    [Diagram: client C sends PUT /books/book/978-1449358549 { } to node 1]
  8. Receiving: Where to put it?

    [Diagram: client C sends PUT /books/book/978-1449358549 { } to node 1]
    hash(978-1449358549) % number_of_shards
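The routing formula on the slide above can be sketched in a few lines of Python. This is only an illustration: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for whatever hash function Elasticsearch actually uses, which the slides do not name.

```python
import zlib


def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Map a document ID to a shard number: hash(id) % number_of_shards."""
    # zlib.crc32 is only a stand-in for illustration; it is not the hash
    # function Elasticsearch itself uses.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# The same ID and shard count always yield the same shard number.
shard = pick_shard("978-1449358549", number_of_shards=2)
print(shard)
```

Because the result depends on `number_of_shards`, changing the shard count would send the same ID to a different shard, which is why the number of primary shards is fixed when an index is created.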
  9. Receiving: Where to put it?

    [Diagram: client C sends PUT /books/book/978-1449358549 { } to node 1; the document is routed on to node 2]
  10. Next step: Analysis

    Each field is put into the inverted index.
    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  11. Analysis

    Each field is put into the inverted index.

    Input            Tokenization        Analysis
    Clinton Gormley  [Clinton, Gormley]  [clinton, gormley]
    Zachary Tong     [Zachary, Tong]     [zachary, tong]
  12. Analysis

    Resulting in a postings list:

    Value    Document ID
    clinton  1
    gormley  1
    tong     1
    zachary  1
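The analysis steps on the slides above (tokenize, lowercase, add to a postings list) can be sketched as follows. This is a simplified illustration and not the real analyzer: `analyze` and `index_field` are hypothetical names, and the regular expression is only a rough approximation of tokenization.

```python
import re
from collections import defaultdict


def analyze(text):
    """Tokenize on runs of word characters, then lowercase each token."""
    tokens = re.findall(r"\w+", text)
    return [token.lower() for token in tokens]


def index_field(postings, doc_id, values):
    """Add every analyzed term of every value to the postings list."""
    for value in values:
        for term in analyze(value):
            postings[term].add(doc_id)


postings = defaultdict(set)  # term -> set of document IDs
index_field(postings, 1, ["Clinton Gormley", "Zachary Tong"])

for term in sorted(postings):
    print(term, sorted(postings[term]))
# clinton [1]
# gormley [1]
# tong [1]
# zachary [1]
```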
  13. Lucene at work: Field by field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  14. Special case: _all field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  15. Special case: _source field

    { "name" : "Elasticsearch - The definitive guide", "pages" : 721, "isbn13" : "978-1449358549", "authors" : [ "Clinton Gormley" , "Zachary Tong" ], "publish_date" : "2015/01/31", "category" : "IT > Search Engines", "description" : "Whether you need full-text search or real-time analytics..." }
  16. Multifields

    PUT /books/book/_mapping
    { "book" : { "properties" : { "authors": { "type": "string", "analyzer": "standard", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } } } } }

    Slide annotation: Sorting & Aggregations
  17. Multifields

    Resulting in multiple postings lists:

    Field: authors
    Value    Document ID
    clinton  1
    gormley  1
    tong     1
    zachary  1

    Field: authors.raw
    Value            Document ID
    Clinton Gormley  1
    Zachary Tong     1
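To make the two postings lists above concrete, here is a minimal sketch that indexes the same values twice, once analyzed and once as raw terms. The helper names are hypothetical, and splitting on whitespace is only a stand-in for the `standard` analyzer.

```python
from collections import defaultdict


def analyzed(value):
    # Stand-in for the "standard" analyzer: split on whitespace, lowercase.
    return value.lower().split()


def not_analyzed(value):
    # The whole value is kept as a single, unmodified term.
    return [value]


# One analysis function per (sub-)field, mirroring the mapping on the slide.
FIELD_ANALYZERS = {"authors": analyzed, "authors.raw": not_analyzed}


def index_authors(doc_id, authors):
    """Build one postings list (term -> doc IDs) per field."""
    postings = {field: defaultdict(set) for field in FIELD_ANALYZERS}
    for field, analyzer in FIELD_ANALYZERS.items():
        for value in authors:
            for term in analyzer(value):
                postings[field][term].add(doc_id)
    return postings


postings = index_authors(1, ["Clinton Gormley", "Zachary Tong"])
print(sorted(postings["authors"]))      # ['clinton', 'gormley', 'tong', 'zachary']
print(sorted(postings["authors.raw"]))  # ['Clinton Gormley', 'Zachary Tong']
```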
  18. Completion suggester

    • Auto-complete, search-as-you-type
    • Data structure is stored on disk on indexing, then loaded into memory
  19. Term vectors

    • Handy when highlighting hits in huge documents
    • An index of a single document
    • Makes it easy to find sorted unique terms as well as their original positions and offsets
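As a rough illustration of what a term vector holds, the sketch below builds a per-document map of sorted unique terms to their positions and character offsets. The function name and the output layout are assumptions made for this example; they are not the format Lucene stores.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Return {term: [(position, start_offset, end_offset), ...]}, sorted by term."""
    vector = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        vector[term].append((position, match.start(), match.end()))
    return dict(sorted(vector.items()))


tv = term_vector("Search engines search text")
print(list(tv))      # ['engines', 'search', 'text']
print(tv["search"])  # [(0, 0, 6), (2, 15, 21)]
```

With offsets at hand, a highlighter can wrap the matching spans directly instead of re-analyzing the whole document.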
  20. Where are we now?

    • Document received via network
    • Document received on the correct node
    • Document analyzed, stored in translog & buffer
    • Client is still waiting for a response
  21. Replication on indexing: Document level replication

    [Diagram: client C and three nodes; node 1 holds b0, b1; node 2 holds b1, b0; node 3 holds b1, b0]
  22. Replication on indexing: Document level replication

    [Same diagram, next animation step]
  23. Replication on indexing: Document level replication

    [Same diagram, next animation step]
  24. Replication on indexing: Document level replication

    [Same diagram, next animation step]
  25. Replication on indexing: Document level replication

    [Same diagram, next animation step]
  26. Replication on indexing: Document level replication

    [Same diagram, final animation step]
    Document got written 6 times!
    HTTP/1.1 200 OK
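One plausible way to account for the six writes, sketched below, is that each of the three copies of the shard stores the document in both its translog and its indexing buffer. That accounting is an assumption based on the "translog & buffer" bullet earlier in the deck, not something the slides spell out; `ShardCopy`, `replicate` and the choice of which node holds the primary are hypothetical.

```python
class ShardCopy:
    """One copy (primary or replica) of a shard on some node."""

    def __init__(self, node):
        self.node = node
        self.translog = []  # append-only log of operations
        self.buffer = {}    # in-memory indexing buffer

    def index(self, doc_id, source):
        self.translog.append(("index", doc_id, source))
        self.buffer[doc_id] = source
        return 2  # this copy stored the document in two places


def replicate(primary, replicas, doc_id, source):
    """Index on the primary, then on every replica; return the write count."""
    writes = primary.index(doc_id, source)
    for replica in replicas:
        writes += replica.index(doc_id, source)
    # Only after all copies have the document does the client get its 200 OK.
    return writes


primary = ShardCopy("node 2")  # assumed location of the primary
replicas = [ShardCopy("node 1"), ShardCopy("node 3")]
writes = replicate(primary, replicas, "978-1449358549", {"pages": 721})
print(writes)  # 6
```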
  27. Searches: Collecting node

    [Diagram: client C and three nodes; node 1 holds b0, b1; node 2 holds b1, b0; node 3 holds b1, b0]
    Query hits node: GET /books/book/_search?q=elasticsearch
  28. Searches: Collecting node

    [Same diagram] QUERY phase executed on each shard
  29. Searches: Collecting node

    [Same diagram] QUERY results are sorted
  30. Searches: Collecting node

    [Same diagram] FETCH phase executed
  31. Searches: Collecting node

    [Same diagram] FETCH results are returned
  32. Searches: Collecting node

    [Same diagram, final animation step]
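The query-then-fetch sequence from the slides above can be sketched as a small scatter/gather routine. The shard contents, scores and function names here are invented for illustration; only the order of the phases follows the slides.

```python
import heapq

# Hypothetical shard contents: doc ID -> (score for this query, stored source).
SHARDS = {
    "b0": {"a": (1.2, {"name": "A"}), "b": (0.4, {"name": "B"})},
    "b1": {"c": (2.0, {"name": "C"}), "d": (0.9, {"name": "D"})},
}


def query_phase(shard_name, size):
    """Each shard returns only its top (score, shard, id) entries, no documents."""
    docs = SHARDS[shard_name]
    hits = [(score, shard_name, doc_id) for doc_id, (score, _) in docs.items()]
    return heapq.nlargest(size, hits)


def fetch_phase(shard_name, doc_id):
    """Load the stored source of one document from the shard that owns it."""
    return SHARDS[shard_name][doc_id][1]


def search(size):
    """Collecting node: gather per-shard hits, sort globally, fetch the winners."""
    candidates = []
    for shard_name in SHARDS:
        candidates.extend(query_phase(shard_name, size))
    top = heapq.nlargest(size, candidates)
    return [fetch_phase(shard, doc_id) for _, shard, doc_id in top]


results = search(size=2)
print(results)  # [{'name': 'C'}, {'name': 'A'}]
```

Splitting the work this way means full documents are only loaded for the final top hits, not for every candidate on every shard.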
  33. Search types

    • Full text search
      – Goes to disk, accesses segments, executes search
    • Aggregation
      – Loads data into memory from segments, executes on in-memory data structure
  34. Memory: Fielddata

    GET /books/book/_search?search_type=count
    { "aggs" : { "author-top10" : { "terms" : { "field" : "authors.raw" } } } }
  35. Memory: Fielddata

    This does not work for aggs:

    Term     Document ID
    clinton  1
    gormley  1
    tong     1
    zachary  1
  36. Memory: Fielddata

    Uninverting the index from a raw field:

    Document ID  Term
    1            Clinton Gormley, Zachary Tong

    Term             Document ID
    Clinton Gormley  1
    Zachary Tong     1
  37. Memory: Fielddata

    • Used for sorting & aggregations
    • All values of a field are loaded into the heap
    • Per segment data structure
    • Can take a lot of memory
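Uninverting, as described on the fielddata slides above, turns the term-to-documents index into a document-to-terms structure that an aggregation can walk. The sketch below shows the idea on a tiny in-memory example; document 2 and all function names are hypothetical additions for illustration.

```python
from collections import Counter, defaultdict

# Inverted index of a not_analyzed field: term -> document IDs.
# Document 2 is a made-up second book so that the counts differ.
inverted = {
    "Clinton Gormley": [1],
    "Zachary Tong": [1, 2],
}


def uninvert(inverted_index):
    """Build the fielddata view: document ID -> list of terms."""
    by_doc = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            by_doc[doc_id].append(term)
    return dict(by_doc)


def terms_aggregation(fielddata, matching_docs, size=10):
    """Count terms across the matching documents, most frequent first."""
    counts = Counter()
    for doc_id in matching_docs:
        counts.update(fielddata.get(doc_id, []))
    return counts.most_common(size)


fielddata = uninvert(inverted)
buckets = terms_aggregation(fielddata, matching_docs=[1, 2])
print(buckets)  # [('Zachary Tong', 2), ('Clinton Gormley', 1)]
```

Holding the document-to-terms view for every value of a field is what makes fielddata memory-hungry.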
  38. Doc values: On-disk vs. in-memory

    PUT /books/book/_mapping
    { "book" : { "properties" : { "authors": { "type": "string", "analyzer": "standard", "fields": { "raw": { "type": "string", "index": "not_analyzed", "doc_values" : true } } } } } }
  39. Where are we now?

    • Original source is sent to collecting node during queries
    • Document lives in a segment as part of a shard/index
    • Document fields live in memory as part of fielddata
    • Segments are put into the filesystem cache by the OS
  40. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds 0P STARTED and 1P STARTED; node 2 (node.box_type: hdd) holds no shards]
    PUT /books/
    { "settings" : { "index.routing.allocation.include.box_type" : "ssd" } }
  41. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds 0P STARTED and 1P STARTED; node 2 (node.box_type: hdd) holds no shards]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  42. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds 0P STARTED and 1P RELOCATING; node 2 (node.box_type: hdd) holds 1P INITIALIZING]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  43. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds 0P STARTED; node 2 (node.box_type: hdd) holds 1P STARTED]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  44. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds 0P RELOCATING; node 2 (node.box_type: hdd) holds 1P STARTED and 0P INITIALIZING]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
  45. Relocation: Let's offload our fast node

    [Diagram: node 1 (node.box_type: ssd) holds no shards; node 2 (node.box_type: hdd) holds 0P STARTED and 1P STARTED]
    PUT /books/_settings
    { "index.routing.allocation.include.box_type" : "hdd" }
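The effect of the `include.box_type` setting in the slides above can be illustrated with a small filter over node attributes. This is a toy model, assuming node 1 is the `ssd` box and node 2 the `hdd` box; `allowed_nodes` is a hypothetical helper, not part of any Elasticsearch client.

```python
# Node attributes as assumed from the diagram: node 1 is ssd, node 2 is hdd.
NODES = {
    "node 1": {"box_type": "ssd"},
    "node 2": {"box_type": "hdd"},
}


def allowed_nodes(nodes, attribute, include_value):
    """Return the nodes whose attribute matches one of the comma-separated values."""
    wanted = {value.strip() for value in include_value.split(",")}
    return sorted(
        name for name, attrs in nodes.items() if attrs.get(attribute) in wanted
    )


before = allowed_nodes(NODES, "box_type", "ssd")  # setting at index creation
after = allowed_nodes(NODES, "box_type", "hdd")   # setting after the update
print(before, after)  # ['node 1'] ['node 2']
```

Once the allowed set no longer contains the node a shard lives on, the shard has to relocate, which is what the animation frames show one shard at a time.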
  46. Relocation: What does RELOCATING mean?

    • Point in time snapshot is copied (segments)
    • Write operations occur and create new segments
    • All write operations are stored in translog
    • This translog is sent to the other node and replayed

    [Diagram: node 1 holds 0P STARTED and 1P RELOCATING]
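The four bullets above can be sketched as a toy relocation routine: copy a point-in-time snapshot of the segments, keep accepting writes on the source, then replay the translog entries written since the snapshot on the target. The `Shard` class and `relocate` function are hypothetical and far simpler than the real recovery process.

```python
class Shard:
    """Toy shard: a list of immutable segments plus a translog."""

    def __init__(self):
        self.segments = []  # each segment is a tuple of document IDs
        self.translog = []  # every write operation, in order

    def index(self, doc_id):
        self.translog.append(doc_id)
        self.segments.append((doc_id,))  # each write creates a new segment

    def doc_ids(self):
        return {doc_id for segment in self.segments for doc_id in segment}


def relocate(source, writes_during_copy):
    """Copy a snapshot of the segments, then replay later writes from the translog."""
    target = Shard()
    snapshot = list(source.segments)       # point-in-time snapshot
    translog_start = len(source.translog)  # remember where the snapshot ends
    for doc_id in writes_during_copy:      # writes keep arriving on the source
        source.index(doc_id)
    target.segments = snapshot             # segments are copied over
    for doc_id in source.translog[translog_start:]:
        target.index(doc_id)               # translog is replayed on the target
    return target


source = Shard()
source.index(1)
source.index(2)
target = relocate(source, writes_during_copy=[3])
print(sorted(target.doc_ids()))  # [1, 2, 3]
```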
  47. Relocation

    • Process of copying whole shards across the cluster
    • Triggered by
      – adding/removing nodes from the cluster
      – exceeded disk thresholds
      – shard allocation filtering
  48. Where are we now?

    • Relocation moves shards across the cluster
    • Document is duplicated by being copied from one node to another, as part of a segment
    • Segments are kept open until operation is complete
  49. How to back up?

    [Diagram: three nodes; node 1 holds b0, b1; node 2 holds b1, b0; node 3 holds b1, b0]
  50. How to back up?

    [Same diagram, plus a repository: S3, FS, HDFS, Azure]
  51. How to back up?

    [Same diagram; client C sends PUT /_snapshot/books/20150310; copies of b0 and b1 appear in the repository (S3, FS, HDFS, Azure)]
  52. Restore from a snapshot

    [Diagram: nodes 1 to 3 without shards; client C sends POST /_snapshot/books/20150310/_restore; the snapshot holds b0 and b1]
  53. Restore from a snapshot

    [Diagram: client C sends POST /_snapshot/books/20150310/_restore; b0 and b1 are shown both in the snapshot and on the nodes]
  54. Index deletion

    [Diagram: node 1 holds b0, b1; node 2 holds b1, b0; node 3 holds b0, b1; client C sends DELETE /books/]
  55. Index deletion

    [Diagram: nodes 1 to 3 hold no shards; client C sent DELETE /books/]
  56. Document deletion

    [Diagram: node 1 holds b0, b1; node 2 holds b1, b0; node 3 holds b0, b1; client C sends DELETE /books/book/978-1449358549]
  57. Document deletion

    • Segments are immutable
    • Deletes are soft: document IDs are added to a tombstone file to mark them for deletion
    • Actual deletion happens on next merge of that segment
  58. Lucene merges

    [Diagram label: segment]
    • Segments are created all the time when indexing data
    • Merging creates one big segment from several smaller ones
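Soft deletes and merges, as described on the two slides above, can be modelled in a few lines. This is a toy sketch with hypothetical names: a segment never changes its documents, a delete only records a tombstone, and a merge writes a new segment that leaves the tombstoned documents out.

```python
class Segment:
    """Toy immutable segment with a separate set of tombstones."""

    def __init__(self, docs):
        self.docs = dict(docs)  # never modified after creation
        self.deleted = set()    # tombstones: IDs marked for deletion

    def live_docs(self):
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


def delete(segments, doc_id):
    """Soft delete: mark the ID in every segment that contains it."""
    for segment in segments:
        if doc_id in segment.docs:
            segment.deleted.add(doc_id)


def merge(segments):
    """Create one big segment from several smaller ones, dropping deleted docs."""
    merged = {}
    for segment in segments:
        merged.update(segment.live_docs())
    return Segment(merged)


segments = [Segment({"978-1449358549": "book"}), Segment({"other-id": "book 2"})]
delete(segments, "978-1449358549")
print("978-1449358549" in segments[0].docs)  # True: still on disk
big = merge(segments)
print(sorted(big.docs))                      # ['other-id']
```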
  59. Don't be afraid of data duplication!

    ... if you are aware of its lifecycle!
  60. License

    This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License (CC-BY-ND 4.0). To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.