Document relations at OSSC

Introduces the various ways to model relational data in Elasticsearch. Discusses the nested field type and the parent-child feature.

Presented at OSS basis tech week.

Martijn van Groningen

November 06, 2013

Transcript

  1. Topics
    • Background
    • Parent / child support
    • Nested support
    • Future developments
    Wednesday, November 6, 13
  2. Introduction
    Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.
    • What is Elasticsearch?
      Document based search engine.
    • Apache Lucene, JSON based
    • Major characteristics
      Dynamic schema that adjusts to your data.
      Distributed & multi-tenant
      Easily scalable
      API centric
  3. Introduction
    • Key features
      Free text search
        • Find any emails containing: “Apache Lucene released”
      Structured search
        • Find any emails sent by: [email protected]
        • Find any emails sent between April 2013 and now.
      Statistics
        • Return the top ‘from’ senders
      And a combination of all of the above.
    • All features work in (near) realtime
  4. Introduction
    • Write APIs: index, delete, update, bulk, delete_by_query
    • Read APIs: search, get, suggest, mget, msearch
    • Admin APIs: health, cluster_state, mapping, settings and more
  5. Introduction
    Indexing a document (the URL path is index / type / id):
        curl -XPUT 'localhost:9200/emails/email/1' -d '{
          "title" : "Apache Lucene 4.5 released",
          "to" : ["[email protected]", "[email protected]"],
          "body" : " ... ",
          ...
        }'
    Index response:
        {
          "ok" : true,
          "_index" : "emails",
          "_type" : "email",
          "_id" : "1",
          "_version" : 1
        }
  6. Introduction
    Search request (the endpoint, followed by the query DSL body):
        curl -XGET 'localhost:9200/_search' -d '{
          "query" : {
            "match" : { "body" : "apache lucene released" }
          }
        }'
    Search response:
        {
          "total" : 1,
          "max_score" : 0.434322,
          "hits" : [
            {
              "_index" : "emails",
              "_type" : "email",
              "_id" : "1",
              "_score" : 0.434322,
              "_source" : {
                "title" : "Apache Lucene 4.5 released",
                ...
              }
            }
          ]
          ...
        }
  7. Introduction
    A more sophisticated search request:
        curl -XGET 'localhost:9200/_search' -d '{
          "query" : {
            "bool" : {
              "must" : [
                { "match" : { "body" : "apache lucene released" } },
                { "range" : { "sent_at" : { "gte" : "2013-04-01" } } }
              ]
            }
          }
        }'
    The query DSL supports many types of queries and filters.
  8. Background
    • Elasticsearch is a document based system. Documents are defined as JSON.
    • An Elasticsearch document is always converted to a Lucene document. A Lucene document is just a set of key-value pairs.
    • Both the nested and parent-child support are just tools for document design.
  9. Background
    • Let's design a book document.
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "published_year" : 2013,
          "num_pages" : 289,
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"]
        }
  10. Background
    • Let's design a book document.
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "published_year" : 2013,
          "num_pages" : 289,
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"]
        }
    • But how do we add related data to it, such as chapter or page data?
  11. Background
    • Let's add a book with chapter data. Each chapter is a separate document that also carries the book data.
    Document 1:
        {
          "book_title" : "Elasticsearch",
          "book_summary" : "The definitive guide for Elasticsearch ...",
          "book_num_pages" : 289,
          "chapter_title" : "Introduction",
          "chapter_text" : "Short introduction about Elasticsearch’s features ...",
          "chapter_number_of_pages" : 12
        }
    Document 2:
        {
          "book_title" : "Elasticsearch",
          "book_summary" : "The definitive guide for Elasticsearch ...",
          "book_num_pages" : 289,
          "chapter_title" : "Data in, Data out",
          "chapter_text" : "How to manage your data with Elasticsearch ...",
          "chapter_num_pages" : 39
        }
  12. Background
    • Let's add a book with chapter data. Each chapter is an inner object.
        {
          "title" : "Elasticsearch",
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"],
          "published_year" : 2013,
          "summary" : "The definitive guide for Elasticsearch ...",
          "chapters" : [
            {
              "title" : "Introduction",
              "summary" : "Short introduction about Elasticsearch’s features ...",
              "number_of_pages" : 12
            },
            {
              "title" : "Data in, Data out",
              "summary" : "How to manage your data with Elasticsearch ...",
              "number_of_pages" : 39
            },
            ...
          ]
        }
  13. Background
    • Let's add a book with chapter data. The book and its chapters are all separate documents, combined with a query time join.
    Document 1:
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "num_pages" : 289
        }
    Document 2:
        {
          "title" : "Introduction",
          "text" : "Short introduction about Elasticsearch’s features ...",
          "number_of_pages" : 12
        }
    Document 3:
        {
          "chapter_title" : "Data in, Data out",
          "chapter_text" : "How to manage your data with Elasticsearch ...",
          "chapter_num_pages" : 39
        }
  14. Background
    • Document granularity (DG) is the unit of your data.
    • DG is different per application. The right DG depends on how your data is used.
    • But how does DG affect the scalability of the data?
  15. Background
    • We need more capacity.
    • But how to divide the relational data?
  16. Background
    • When dealing with relations, you pay the price either at write time or at read time.
    • Alternatively, document relations can balance the costs between read and write time. For example: one join to reduce duplicated data.
    • Supporting “many-to-many” joins in a distributed system is difficult. The result is either unbalanced partitions or a very expensive join.
  17. Parent child
    • Parent / child is a query time join between different document types in the same index.
    • Parent and child documents are stored as separate documents in the same index.
    • A child document can point to only one parent.
    • A parent document can be referred to by multiple child documents.
    • A parent document can also be a child document of a different parent.
  18. Parent child
    • A parent document and its child documents are routed to the same shard. The parent id is used as the routing value.
    • Combined with an in-memory data structure of parent ids, this makes the parent-child join fast.
    • Use the warmer API to pre-load the id cache.
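The routing rule on this slide can be illustrated with a small Python sketch. It is not Elasticsearch's implementation: `NUM_SHARDS`, `shard_for`, `route` and the CRC32 hash are all assumptions chosen for illustration. The only point it makes is that hashing the parent id for both the parent and its children lands them on the same shard.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5  # assumed shard count, for illustration only


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a routing value (stand-in hash, not Elasticsearch's own)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id: str, parent_id: Optional[str] = None) -> int:
    """A parent routes on its own id; a child routes on its parent's id."""
    return shard_for(parent_id if parent_id is not None else doc_id)


# Product "p2345" and three hypothetical offers that point to it.
parent_shard = route("p2345")
child_shards = {route(offer_id, parent_id="p2345") for offer_id in ("12", "13", "14")}
print(parent_shard, child_shards)
```

Because every offer uses the product id as its routing value, the join never has to cross shard boundaries.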
  19. Parent child - Indexing
    • The parent document doesn’t need to exist at time of indexing.
    An offer document is a child of a product document:
        curl -XPUT 'localhost:9200/products' -d '{
          "mappings" : {
            "offer" : {
              "_parent" : { "type" : "product" }
            }
          }
        }'
    Then, when indexing, specify which product the offer points to:
        curl -XPUT 'localhost:9200/products/offer/12?parent=p2345' -d '{
          "valid_from" : "2013-05-01",
          "valid_to" : "2013-10-01",
          "price" : 26.87
        }'
  20. Parent child - Querying
    • The has_child query returns parent documents based on matches in their child documents. The optional “score_mode” defines how the scores of the child hits are mapped to the parent document.
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_child" : {
              "type" : "offer",
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              }
            }
          }
        }'
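To make the role of “score_mode” concrete, here is a rough in-memory sketch in Python. The child hits, the helper `join_to_parents` and the set of mode names (`none`, `max`, `sum`, `avg`) are assumptions for illustration. The sketch only shows the idea of reducing several child scores to one parent score, not how Elasticsearch executes the join.

```python
from collections import defaultdict

# Hypothetical child hits as (parent id, child score) pairs.
child_hits = [("p1", 0.5), ("p1", 1.5), ("p2", 2.0)]


def join_to_parents(hits, score_mode="none"):
    """Group child hits by parent and reduce their scores to one score per parent."""
    grouped = defaultdict(list)
    for parent_id, score in hits:
        grouped[parent_id].append(score)

    reducers = {
        "none": lambda scores: 1.0,  # assumed constant score when child scores are ignored
        "max": max,
        "sum": sum,
        "avg": lambda scores: sum(scores) / len(scores),
    }
    reduce_scores = reducers[score_mode]
    return {parent_id: reduce_scores(scores) for parent_id, scores in grouped.items()}


print(join_to_parents(child_hits, "avg"))
print(join_to_parents(child_hits, "max"))
```

Each parent appears once in the result no matter how many of its children matched, which is why only one side of the relation comes back.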
  21. Parent child - Querying
    • The has_parent query returns child documents based on matches in parent documents.
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_parent" : {
              "type" : "product",
              "query" : {
                "match" : { "category" : "electronics" }
              }
            }
          }
        }'
  22. Nested objects
    • In many cases domain models have the same write / update lifecycle.
      • Books & chapters.
      • Movies & actors.
    • De-normalizing results in the fastest queries.
      • Compared to using parent/child queries.
    • Nested objects allow smart de-normalization.
  23. Nested objects
        {
          "title" : "Elasticsearch",
          "author" : "Clinton Gormley",
          "categories" : ["programming", "information retrieval"],
          "published_year" : 2013,
          "summary" : "The definitive guide for Elasticsearch ...",
          "chapters" : [
            {
              "title" : "Introduction",
              "summary" : "Short introduction about Elasticsearch’s features ...",
              "number_of_pages" : 12
            },
            {
              "title" : "Data in, Data out",
              "summary" : "How to manage your data with Elasticsearch ...",
              "number_of_pages" : 39
            },
            ...
          ]
        }
    • JSON allows complex nesting of objects.
    • But how does this get indexed?
  24. Nested objects
    Original JSON document:
        {
          "title" : "Elasticsearch",
          ...
          "chapters" : [
            {"title" : "Introduction", "summary" : "Short ...", "number_of_pages" : 12},
            {"title" : "Data in, ...", "summary" : "How to ...", "number_of_pages" : 39},
            ...
          ]
        }
    Lucene document structure:
        {
          "title" : "Elasticsearch",
          ...
          "chapters.title" : ["Data in, Data out", "Introduction"],
          "chapters.summary" : ["How to ...", "Short ..."],
          "chapters.number_of_pages" : [12, 39]
        }
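The flattening shown on this slide can be approximated with a few lines of Python. The helper `flatten` is a hypothetical stand-in, not Elasticsearch code, and it keeps values in input order rather than the order shown on the slide. One consequence that the slide does not spell out: once values are merged into multi-valued fields, the pairing between a chapter title and its own page count is lost.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted, multi-valued fields."""
    fields = {}
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Inner object: merge its fields under "path." into the same field map.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, []).extend(sub_values)
            else:
                fields.setdefault(path, []).append(item)
    return fields


book = {
    "title": "Elasticsearch",
    "chapters": [
        {"title": "Introduction", "number_of_pages": 12},
        {"title": "Data in, Data out", "number_of_pages": 39},
    ],
}
flat = flatten(book)
print(flat)

# "Introduction" has 12 pages, yet the flattened fields cannot tell:
false_match = ("Introduction" in flat["chapters.title"]
               and 39 in flat["chapters.number_of_pages"])
print(false_match)
```

This loss of association between fields of the same inner object is what the nested type addresses.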
  25. Nested objects - Mapping
    • The nested type triggers Lucene’s block indexing.
    • Multiple levels of inner objects are possible.
    In the mapping below, “book” is the document type and ‘nested’ is the field type:
        curl -XPUT 'localhost:9200/books' -d '{
          "mappings" : {
            "book" : {
              "properties" : {
                "chapters" : { "type" : "nested" }
              }
            }
          }
        }'
  26. Nested objects - Block indexing
    Lucene documents structure:
        {"chapters.title" : "Intro...", "chapters.summary" : "...", "chapters.number_of_pages" : 12},
        {"chapters.title" : "Data...", "chapters.summary" : "...", "chapters.number_of_pages" : 39},
        ...
        { "title" : "Elasticsearch", ... }
    • The inner objects are inlined as separate Lucene documents right before the root document.
    • The root document and its nested documents always remain in the same block.
  27. Nested objects - Nested query
    • The nested query returns the complete “book” as hit (the root document).
        curl -XGET 'localhost:9200/books/book/_search' -d '{
          "query" : {
            "nested" : {
              "path" : "chapters",
              "score_mode" : "avg",
              "query" : {
                "match" : {
                  "chapters.summary" : { "query" : "indexing data" }
                }
              }
            }
          }
        }'
    “path” specifies the nested level, “score_mode” sets the score mode, and the inner “query” is the chapter level query.
  28. Nested objects
    [Diagram: a root documents bitset over a row of Lucene documents. Marked entries (X) are the nested Lucene documents that match the inner query; a set bit represents a root document.]
    • Aggregate the nested scores and push them to the root document.
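A simplified Python sketch of the two ideas on the block indexing and bitset slides follows. `to_block` and `block_join` are hypothetical helpers, the second book and all scores are invented for illustration, and the linear scan is a simplification of what Lucene does. It assumes only what the slides state: nested documents sit right before their root document, and a bitset marks which positions are root documents.

```python
def to_block(root_doc, nested_path):
    """Return the nested docs first and the root doc last, as one contiguous block."""
    root = {k: v for k, v in root_doc.items() if k != nested_path}
    block = [{f"{nested_path}.{k}": v for k, v in inner.items()}
             for inner in root_doc.get(nested_path, [])]
    block.append(root)
    return block


def block_join(root_bits, nested_scores):
    """Average the scores of matching nested docs and push them to the next root doc.

    root_bits: one bool per Lucene doc in index order, True marks a root document.
    nested_scores: {doc position: score} for nested docs that match the inner query.
    """
    hits, pending = {}, []
    for position, is_root in enumerate(root_bits):
        if not is_root:
            if position in nested_scores:
                pending.append(nested_scores[position])
        else:
            if pending:
                hits[position] = sum(pending) / len(pending)
            pending = []  # a new block starts after every root document
    return hits


books = [
    {"title": "Elasticsearch",
     "chapters": [{"title": "Introduction"}, {"title": "Data in, Data out"}]},
    {"title": "Another book", "chapters": [{"title": "Only chapter"}]},  # invented
]
segment = [doc for book in books for doc in to_block(book, "chapters")]
root_bits = ["title" in doc for doc in segment]  # nested docs only have "chapters.*" keys
hits = block_join(root_bits, {0: 0.25, 1: 0.75, 3: 1.0})
print(root_bits, hits)
```

Because a block is contiguous, finding the root of a matching nested document is just a matter of looking up the next set bit.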
  29. Parent child - Querying
    • Both has_parent and has_child can only return one side of the relation.
    • A workaround based on the multi search API can return both sides, at the cost of one extra subsequent request.
    • Future development will introduce the concept of inner_hits. This can include the related parent and top child hits.
  30. Parent child - Querying
    • Search request with has_child query:
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_child" : {
              "type" : "offer",
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              }
            }
          }
        }'
  31. Parent child - Querying
    • Search response:
        {
          ...
          "hits" : [
            { "_index" : "index1", "_type" : "product", "_id" : "1", ... },
            { "_index" : "index1", "_type" : "product", "_id" : "2", ... },
            ...
          ]
        }
  32. Parent child - Querying
    • For each hit, create a search request that returns the top child docs. A sample request for the first hit:
        {
          "query" : {
            "filtered" : {
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              },
              "filter" : {
                "term" : { "_parent" : "product#1" }
              }
            }
          }
        }
    The “query” part is the has_child’s inner query. The “filter” part only includes child docs that are related to the first parent hit.
  33. Parent child - Querying
    • Wrap the requests into a single msearch request. Each search request has its own header line and body line. By including routing in the header, the execution of the search request will be optimized.
    Requests file (each item is a header line followed by a body line):
        {"routing" : "1"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        {"routing" : "2"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        {"routing" : "3"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        ...
    Sending the file:
        curl -XGET 'localhost:9200/products/offer/_msearch' --data-binary @requests; echo
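The requests file can be generated from the parent hits of the first search. The Python sketch below is one way to do that. The helper name `build_msearch_body` and its parameters are assumptions; the `product#<id>` value of the `_parent` term filter and the `size` of 3 follow the sample request and requests file in the slides.

```python
import json


def build_msearch_body(parent_ids, inner_query, parent_type="product", size=3):
    """Build the newline-delimited msearch body: one header and one body line per parent."""
    lines = []
    for parent_id in parent_ids:
        header = {"routing": parent_id}
        body = {
            "query": {
                "filtered": {
                    "query": inner_query,
                    "filter": {"term": {"_parent": f"{parent_type}#{parent_id}"}},
                }
            },
            "size": size,
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The body is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n"


inner_query = {"range": {"price": {"lte": 50}}}
requests_body = build_msearch_body(["1", "2"], inner_query)
print(requests_body)
```

The resulting string can be written to the `requests` file and sent with the curl command shown on the slide.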
  34. Nested objects - Nested sorting
        curl -XGET 'localhost:9200/books/book/_search' -d '{
          "query" : {
            "match" : { "summary" : { "query" : "guide" } }
          },
          "sort" : [
            {
              "chapters.number_of_pages" : {
                "sort_mode" : "avg",
                "nested_filter" : {
                  "range" : { "chapters.number_of_pages" : {"lte" : 15} }
                }
              }
            }
          ]
        }'
    “sort_mode” sets the sort mode.
  35. Parent child - sorting
    • Parent/child sorting isn’t possible at the moment.
    • But there is a “custom_score” query workaround.
    • Downsides:
      • Forces a script to be executed for each matching document.
      • The child sort value is converted into a float value.
        "has_child" : {
          "type" : "offer",
          "score_mode" : "avg",
          "query" : {
            "custom_score" : {
              "query" : { ... },
              "script" : "doc[\"price\"].value"
            }
          }
        }
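The second downside can be seen with a short Python sketch. It assumes the sort value travels as a 32-bit float, as Lucene scores do. The helper `as_float32` is hypothetical and simply round-trips a value through 32 bits.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


price = 26.87
print(as_float32(price) == price)  # False: 26.87 changes when squeezed into 32 bits
print(as_float32(16777217.0))      # 16777216.0: large integers lose their last digit
```

For prices the error is tiny, but two child values that differ only beyond float precision become indistinguishable when sorting.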