Document relations at OSSC

Introduces the various ways to model relational data in Elasticsearch. Discusses the nested field type and the parent-child feature.

Presented at OSS basis tech week.

Martijn van Groningen

November 06, 2013

Transcript

  1. Topics
    • Background
    • Parent / child support
    • Nested support
    • Future developments
    Wednesday, November 6, 13
  2. Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.

    Introduction
    • What is Elasticsearch?
        • A document-based search engine.
        • Built on Apache Lucene, JSON based.
    • Major characteristics
        • Dynamic schema that adjusts to your data.
        • Distributed & multi-tenant.
        • Easily scalable.
        • API centric.
  3. Introduction
    • Key features
        • Free text search: find any emails containing “Apache Lucene released”.
        • Structured search: find any emails sent by [email protected]; find any emails sent between April 2013 and now.
        • Statistics: return the top ‘from’ senders.
        • And a combination of all of the above.
    • All features work in (near) real time.
  4. Introduction
    • Write APIs: index, delete, update, bulk, delete_by_query
    • Read APIs: search, get, suggest, mget, msearch
    • Admin APIs: health, cluster_state, mapping, settings and more
  5. Introduction
    Indexing a document (the URL path is index / type / id):
        curl -XPUT 'localhost:9200/emails/email/1' -d '{
          "title" : "Apache Lucene 4.5 released",
          "to" : ["[email protected]", "[email protected]"],
          "body" : " ... ",
          ...
        }'
    Index response:
        {
          "ok" : true,
          "_index" : "emails",
          "_type" : "email",
          "_id" : "1",
          "_version" : 1
        }
  6. Introduction
    Search request (the URL is the endpoint, the body is the query DSL):
        curl -XGET 'localhost:9200/_search' -d '{
          "query" : {
            "match" : { "body" : "apache lucene released" }
          }
        }'
    Search response:
        {
          "total" : 1,
          "max_score" : 0.434322,
          "hits" : [ {
            "_index" : "emails",
            "_type" : "email",
            "_id" : "1",
            "_score" : 0.434322,
            "_source" : { "title" : "Apache Lucene 4.5 released", ... }
            ...
        }
  7. Introduction
    A more sophisticated search request:
        curl -XGET 'localhost:9200/_search' -d '{
          "query" : {
            "bool" : {
              "must" : [
                { "match" : { "body" : "apache lucene released" } },
                { "range" : { "sent_at" : { "gte" : "2013-04-01" } } }
              ]
            }
          }
        }'
    The query DSL supports many types of queries and filters.
  8. Background
    • Elasticsearch is a document-based system. Documents are defined as JSON.
    • An Elasticsearch document is always converted to a Lucene document. A Lucene document is just a set of key-value pairs.
    • Both the nested and the parent-child support are just tools for document design.
  9. Background
    • Let's design a book document.
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "published_year" : 2013,
          "num_pages" : 289,
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"]
        }
  10. Background
    • Let's design a book document.
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "published_year" : 2013,
          "num_pages" : 289,
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"]
        }
    • But how do we add related data to it, such as chapter or page data?
  11. Background
    • Let's add a book with chapter data: each chapter is a separate document that repeats the book data.
    Document 1:
        {
          "book_title" : "Elasticsearch",
          "book_summary" : "The definitive guide for Elasticsearch ...",
          "book_num_pages" : 289,
          "chapter_title" : "Introduction",
          "chapter_text" : "Short introduction about Elasticsearch’s features ...",
          "chapter_num_pages" : 12
        }
    Document 2:
        {
          "book_title" : "Elasticsearch",
          "book_summary" : "The definitive guide for Elasticsearch ...",
          "book_num_pages" : 289,
          "chapter_title" : "Data in, Data out",
          "chapter_text" : "How to manage your data with Elasticsearch ...",
          "chapter_num_pages" : 39
        }
  12. Background
    • Let's add a book with chapter data: each chapter is an inner object.
        {
          "title" : "Elasticsearch",
          "author" : ["Clinton Gormley", "Zachary Tong"],
          "categories" : ["programming", "information retrieval"],
          "published_year" : 2013,
          "summary" : "The definitive guide for Elasticsearch ...",
          "chapters" : [
            {
              "title" : "Introduction",
              "summary" : "Short introduction about Elasticsearch’s features ...",
              "number_of_pages" : 12
            },
            {
              "title" : "Data in, Data out",
              "summary" : "How to manage your data with Elasticsearch ...",
              "number_of_pages" : 39
            },
            ...
          ]
        }
  13. Background
    • Let's add a book with chapter data: the book and its chapters are separate documents, joined at query time.
    Document 1:
        {
          "title" : "Elasticsearch",
          "summary" : "The definitive guide for Elasticsearch ...",
          "num_pages" : 289
        }
    Document 2:
        {
          "title" : "Introduction",
          "text" : "Short introduction about Elasticsearch’s features ...",
          "number_of_pages" : 12
        }
    Document 3:
        {
          "chapter_title" : "Data in, Data out",
          "chapter_text" : "How to manage your data with Elasticsearch ...",
          "chapter_num_pages" : 39
        }
  14. Background
    • Document granularity (DG) is the unit of your data.
    • DG differs per application. The right DG depends on how your data is used.
    • But how does DG affect the scalability of the data?
  15. Background
    • We need more capacity.
    • But how to divide the relational data?
  16. Background
    • When dealing with relations, you pay the price either at write time or at read time.
    • Alternatively, document relations can balance the costs between read time and write time. For example: one join to reduce duplicated data.
    • Supporting “many-to-many” joins in a distributed system is difficult. The result is either unbalanced partitions or a very expensive join.
  17. Parent child
    • Parent / child is a query-time join between different document types in the same index.
    • Parent and child documents are stored as separate documents in the same index.
    • A child document can point to only one parent.
    • A parent document can be referred to by multiple child documents.
    • A parent document can also be a child document of a different parent.
  18. Parent child
    • A parent document and its child documents are routed to the same shard. The parent id is used as the routing value.
    • Combined with an in-memory data structure of parent ids, this makes the parent-child join fast.
    • Use the warmer API to pre-load the id cache.
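
The routing rule above can be sketched in a few lines of Python. This is only an illustration: the shard count, the CRC32 stand-in hash, and the names `shard_for` and `route` are assumptions, and Elasticsearch uses its own hash function internally.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id: str, parent_id: Optional[str] = None) -> int:
    """A child is routed by its parent's id, any other document by its own id."""
    return shard_for(parent_id if parent_id is not None else doc_id)


parent_shard = route("p2345")
# Every child of p2345 lands on the same shard, whatever its own id is.
child_shards = {route(f"offer-{i}", parent_id="p2345") for i in range(100)}
```

Because the join never has to cross shards, each shard can resolve parent-child matches locally.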
  19. Parent child - Indexing
    • The parent document doesn’t need to exist at the time of indexing.
    An offer document is a child of a product document:
        curl -XPUT 'localhost:9200/products' -d '{
          "mappings" : {
            "offer" : {
              "_parent" : { "type" : "product" }
            }
          }
        }'
    Then, when indexing, specify which product the offer points to:
        curl -XPUT 'localhost:9200/products/offer/12?parent=p2345' -d '{
          "valid_from" : "2013-05-01",
          "valid_to" : "2013-10-01",
          "price" : 26.87
        }'
  20. Parent child - Querying
    • The has_child query returns parent documents based on matches in their child documents. The optional “score_mode” defines how child hits are mapped to their parent document.
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_child" : {
              "type" : "offer",
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              }
            }
          }
        }'
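
To make the role of “score_mode” concrete, here is a small in-memory Python sketch of a has_child style join. It is not how Elasticsearch implements the query; the sample offers, the constant child score, and the function name `has_child` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical offers (children); each points to exactly one product (parent).
offers = [
    {"_id": "12", "_parent": "p1", "price": 26.87},
    {"_id": "13", "_parent": "p1", "price": 45.00},
    {"_id": "14", "_parent": "p2", "price": 80.00},
    {"_id": "15", "_parent": "p3", "price": 10.00},
]

# How the scores of matching children are combined into one parent score.
SCORE_MODES = {
    "max": max,
    "sum": sum,
    "avg": lambda scores: sum(scores) / len(scores),
}


def has_child(children, child_matches, child_score, score_mode="max"):
    """Return {parent_id: score} for parents with at least one matching child."""
    scores_by_parent = defaultdict(list)
    for child in children:
        if child_matches(child):
            scores_by_parent[child["_parent"]].append(child_score(child))
    combine = SCORE_MODES[score_mode]
    return {parent: combine(scores) for parent, scores in scores_by_parent.items()}


hits = has_child(
    offers,
    child_matches=lambda offer: offer["price"] <= 50,
    child_score=lambda offer: 1.0,  # assumed constant score per matching child
    score_mode="sum",
)
```

With “sum”, a product with two cheap offers scores higher than a product with one; p2 has no offer at or below 50 and is not returned.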
  21. Parent child - Querying
    • The has_parent query returns child documents based on matches in their parent documents.
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_parent" : {
              "type" : "product",
              "query" : {
                "match" : { "category" : "electronics" }
              }
            }
          }
        }'
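
The opposite direction can be sketched the same way. The sample data and the function name `has_parent` below are hypothetical; the sketch only shows which side of the relation is returned.

```python
# Hypothetical parents (products) keyed by id, and children (offers).
products = {
    "p1": {"category": "electronics"},
    "p2": {"category": "books"},
}
offers = [
    {"_id": "12", "_parent": "p1", "price": 26.87},
    {"_id": "13", "_parent": "p2", "price": 9.99},
    {"_id": "14", "_parent": "p1", "price": 45.00},
]


def has_parent(children, parents, parent_matches):
    """Return the child documents whose parent document matches."""
    matching_parent_ids = {
        parent_id for parent_id, parent in parents.items() if parent_matches(parent)
    }
    return [child for child in children if child["_parent"] in matching_parent_ids]


hits = has_parent(offers, products, lambda product: product["category"] == "electronics")
hit_ids = [offer["_id"] for offer in hits]
```

Only offers come back; the matching products themselves are not part of the result.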
  22. Nested objects
    • In many cases domain models have the same write / update life cycle.
        • Books & chapters.
        • Movies & actors.
    • De-normalizing results in the fastest queries.
        • Compared to using parent/child queries.
    • Nested objects allow smart de-normalization.
  23. Nested objects
        {
          "title" : "Elasticsearch",
          "author" : "Clinton Gormley",
          "categories" : ["programming", "information retrieval"],
          "published_year" : 2013,
          "summary" : "The definitive guide for Elasticsearch ...",
          "chapters" : [
            {
              "title" : "Introduction",
              "summary" : "Short introduction about Elasticsearch’s features ...",
              "number_of_pages" : 12
            },
            {
              "title" : "Data in, Data out",
              "summary" : "How to manage your data with Elasticsearch ...",
              "number_of_pages" : 39
            },
            ...
          ]
        }
    • JSON allows complex nesting of objects.
    • But how does this get indexed?
  24. Nested objects
    Original JSON document:
        {
          "title" : "Elasticsearch",
          ...
          "chapters" : [
            {"title" : "Introduction", "summary" : "Short ...", "number_of_pages" : 12},
            {"title" : "Data in, ...", "summary" : "How to ...", "number_of_pages" : 39},
            ...
          ]
        }
    Lucene document structure:
        {
          "title" : "Elasticsearch",
          ...
          "chapters.title" : ["Data in, Data out", "Introduction"],
          "chapters.summary" : ["How to ...", "Short ..."],
          "chapters.number_of_pages" : [12, 39]
        }
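
A short Python sketch of this flattening shows why plain inner objects lose the association between fields of the same chapter. The `flatten` helper is a hypothetical illustration of the idea, not Elasticsearch code.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted keys holding multi-valued fields."""
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_key, sub_values in flatten(item, name + ".").items():
                    flat.setdefault(sub_key, []).extend(sub_values)
            else:
                flat.setdefault(name, []).append(item)
    return flat


book = {
    "title": "Elasticsearch",
    "chapters": [
        {"title": "Introduction", "number_of_pages": 12},
        {"title": "Data in, Data out", "number_of_pages": 39},
    ],
}
flat = flatten(book)

# Query: a chapter titled "Introduction" that has 39 pages. No such chapter
# exists, but the flattened document matches because the pairing is lost.
false_match = (
    "Introduction" in flat["chapters.title"]
    and 39 in flat["chapters.number_of_pages"]
)
```

This cross-object match is the problem the nested type addresses by keeping each inner object as its own Lucene document.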
  25. Nested objects - Mapping
    • The nested type triggers Lucene’s block indexing.
    • Multiple levels of inner objects are possible.
    In the mapping, “book” is the document type and the “chapters” field has the field type ‘nested’:
        curl -XPUT 'localhost:9200/books' -d '{
          "mappings" : {
            "book" : {
              "properties" : {
                "chapters" : { "type" : "nested" }
              }
            }
          }
        }'
  26. Nested objects - Block indexing
    Lucene documents structure:
        {"chapters.title" : "Intro...", "chapters.summary" : "...", "chapters.number_of_pages" : 12},
        {"chapters.title" : "Data...", "chapters.summary" : "...", "chapters.number_of_pages" : 39},
        ...
        { "title" : "Elasticsearch", ... }
    • The inner objects are inlined as separate Lucene documents right before the root document.
    • The root document and its nested documents always remain in the same block.
  27. Nested objects - Nested query
    • The nested query returns the complete “book” (the root document) as a hit.
    • “path” specifies the nested level, “score_mode” sets the score mode, and the inner “query” is the chapter-level query.
        curl -XGET 'localhost:9200/books/book/_search' -d '{
          "query" : {
            "nested" : {
              "path" : "chapters",
              "score_mode" : "avg",
              "query" : {
                "match" : {
                  "chapters.summary" : { "query" : "indexing data" }
                }
              }
            }
          }
        }'
  28. Nested objects
    [Diagram: a block of Lucene documents alongside the root documents bitset.]
    • Marked documents: nested Lucene documents that match the inner query.
    • Set bit: represents a root document.
    • The nested scores are aggregated and pushed to the root document.
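
The block layout and the bitset can be put together in a toy Python model. This is a sketch under stated assumptions, not Lucene's implementation: the list-based segment, the boolean list standing in for the bitset, the toy term-count score, and all function names are hypothetical.

```python
AGGREGATE = {
    "avg": lambda scores: sum(scores) / len(scores),
    "total": sum,
    "max": max,
}


def index_block(segment, root_bits, root, nested_docs):
    """Append the nested docs first, then the root doc, as one contiguous block."""
    for nested in nested_docs:
        segment.append(nested)
        root_bits.append(False)
    segment.append(root)
    root_bits.append(True)  # a set bit marks a root document


def nested_query(segment, root_bits, nested_score, score_mode="avg"):
    """Score matching nested docs and push the aggregate to the next root doc."""
    hits, pending = [], []
    for doc, is_root in zip(segment, root_bits):
        if not is_root:
            score = nested_score(doc)
            if score > 0:
                pending.append(score)
            continue
        if pending:
            hits.append((doc["title"], AGGREGATE[score_mode](pending)))
        pending = []
    return hits


def score(chapter):
    # Toy score: number of query terms found in the chapter summary.
    return sum(term in chapter["summary"].split() for term in ("indexing", "data"))


segment, root_bits = [], []
index_block(segment, root_bits, {"title": "Elasticsearch"},
            [{"summary": "short introduction"}, {"summary": "indexing data in and out"}])
index_block(segment, root_bits, {"title": "Some other book"}, [{"summary": "searching"}])

hits = nested_query(segment, root_bits, score)
```

Because a root document always directly follows its nested documents, one forward pass over the segment is enough to attach every nested match to the right root.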
  29. Parent child - Querying
    • Both has_parent and has_child can only return one side of the relation.
    • A workaround with the multi search API can return both sides, at the cost of one extra subsequent request.
    • Future development will introduce the concept of inner_hits, which can include the related parent and the top child hits.
  30. Parent child - Querying
    • Search request with a has_child query:
        curl -XGET 'localhost:9200/products/_search' -d '{
          "query" : {
            "has_child" : {
              "type" : "offer",
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              }
            }
          }
        }'
  31. Parent child - Querying
    • Search response:
        {
          ...
          "hits" : [
            { "_index" : "index1", "_type" : "product", "_id" : "1", ... },
            { "_index" : "index1", "_type" : "product", "_id" : "2", ... },
            ...
          ]
        }
  32. Parent child - Querying
    • For each hit, create a search request that returns the top child docs. A sample request for the first hit:
        {
          "query" : {
            "filtered" : {
              "query" : {
                "range" : { "price" : { "lte" : 50 } }
              },
              "filter" : {
                "term" : { "_parent" : "product#1" }
              }
            }
          }
        }
    • The “query” part is the has_child query’s inner query.
    • The “filter” part only includes child docs that are related to the first parent hit.
  33. Parent child - Querying
    • Wrap the requests into a single msearch request. Each search request has its own header line and body line. Including the routing in the header optimizes the execution of the search request.
    Requests file (each item is a header line followed by a body line):
        {"routing" : "1"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        {"routing" : "2"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        {"routing" : "3"}
        {"query" : {"filtered":{"query":{"range":{"price":{"lte":50}}},"f...}}, "size" : 3}
        ...
    Request:
        curl -XGET 'localhost:9200/products/offer/_msearch' --data-binary @requests; echo
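
Writing the requests file by hand gets tedious, so here is a minimal Python sketch that generates it from the parent ids of the first response. The helper names are hypothetical, and the body layout follows the sample request from the earlier slide (a filtered query with a term filter on “_parent”).

```python
import json


def top_children_request(parent_type, parent_id, child_query, size=3):
    """Build the header and body for one parent hit."""
    header = {"routing": parent_id}
    body = {
        "query": {
            "filtered": {
                "query": child_query,
                "filter": {"term": {"_parent": f"{parent_type}#{parent_id}"}},
            }
        },
        "size": size,
    }
    return header, body


def build_msearch(parent_type, parent_ids, child_query, size=3):
    """Serialize all requests as alternating header and body lines."""
    lines = []
    for parent_id in parent_ids:
        for part in top_children_request(parent_type, parent_id, child_query, size):
            lines.append(json.dumps(part))
    return "\n".join(lines) + "\n"  # newline-delimited, with a trailing newline


child_query = {"range": {"price": {"lte": 50}}}
payload = build_msearch("product", ["1", "2", "3"], child_query)
```

The resulting `payload` can be written to the requests file that the curl command sends with `--data-binary`.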
  34. Nested objects - Nested sorting
    Sorting on a nested field, with “sort_mode” as the sort mode:
        curl -XGET 'localhost:9200/books/book/_search' -d '{
          "query" : {
            "match" : { "summary" : { "query" : "guide" } }
          },
          "sort" : [
            {
              "chapters.number_of_pages" : {
                "sort_mode" : "avg",
                "nested_filter" : {
                  "range" : { "chapters.number_of_pages" : {"lte" : 15} }
                }
              }
            }
          ]
        }'
  35. Parent child - sorting
    • Parent/child sorting isn’t possible at the moment.
    • But there is a “custom_score” query workaround.
    • Downsides:
        • It forces a script to be executed for each matching document.
        • The child sort value is converted into a float value.
        "has_child" : {
          "type" : "offer",
          "score_mode" : "avg",
          "query" : {
            "custom_score" : {
              "query" : { ... },
              "script" : "doc[\"price\"].value"
            }
          }
        }
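
The workaround turns the child field into the score, so ordering by score becomes ordering by the child value. A small Python sketch of the effect follows; the sample offers and helper names are hypothetical, and the 32-bit round-trip stands in for the float conversion mentioned above.

```python
import struct
from collections import defaultdict

# Hypothetical offers (children) pointing at their product (parent).
offers = [
    {"_parent": "p1", "price": 26.87},
    {"_parent": "p1", "price": 30.00},
    {"_parent": "p2", "price": 10.00},
    {"_parent": "p3", "price": 99.50},
]


def to_float32(value):
    """Round-trip through a 32-bit float, the precision a score is kept in."""
    return struct.unpack("f", struct.pack("f", value))[0]


def parents_sorted_by_avg_child_price(children):
    prices = defaultdict(list)
    for child in children:
        # The "script" runs once per matching child document.
        prices[child["_parent"]].append(to_float32(child["price"]))
    scores = {parent: sum(values) / len(values) for parent, values in prices.items()}
    # Hits come back ordered by score, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = parents_sorted_by_avg_child_price(offers)
order = [parent for parent, _ in ranking]
```

The precision loss is harmless for prices like these, but it can reorder hits when the sort values are large numbers such as timestamps.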