Document relations at OSSC

Martijn van Groningen [email protected] @mvgroningen
Document relations
Wednesday, November 6, 13

Topics
• Background
• Parent / child support
• Nested support
• Future developments

Elasticsearch
Introduction

Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited.
Introduction
• What is Elasticsearch?
  Document based search engine.
  • Apache Lucene, JSON based
• Major characteristics
  Dynamic schema that adjusts to your data.
  Distributed & multi-tenant
  Easily scalable
  API centric

Introduction
• Key features
  Free text search
  • Find any emails containing: “Apache Lucene released”
  Structured search
  • Find any emails sent by: [email protected]
  • Find any emails sent between April 2013 and now.
  Statistics
  • Return the top ‘from’ senders
  And a combination of all of the above.
• All features work in (near) realtime

Introduction
• Write APIs: index, delete, update, bulk, delete_by_query
• Read APIs: search, get, suggest, mget, msearch
• Admin APIs: health, cluster_state, mapping, settings and more

Introduction
Indexing a document (index: emails, type: email, id: 1):
    curl -XPUT 'localhost:9200/emails/email/1' -d '{
      "title" : "Apache Lucene 4.5 released",
      "to" : ["[email protected]", "[email protected]"],
      "body" : " ... ",
      ...
    }'
Index response:
    {
      "ok" : true,
      "_index" : "emails",
      "_type" : "email",
      "_id" : "1",
      "_version" : 1
    }

Introduction
Search request (endpoint: _search, body: Query DSL):
    curl -XGET 'localhost:9200/_search' -d '{
      "query" : {
        "match" : { "body" : "apache lucene released" }
      }
    }'
Search response:
    {
      "total" : 1,
      "max_score" : 0.434322,
      "hits" : [
        {
          "_index" : "emails",
          "_type" : "email",
          "_id" : "1",
          "_score" : 0.434322,
          "_source" : {
            "title" : "Apache Lucene 4.5 released",
            ...
          }
        }
        ...
      ]
    }

Introduction
A more sophisticated search request:
    curl -XGET 'localhost:9200/_search' -d '{
      "query" : {
        "bool" : {
          "must" : [
            { "match" : { "body" : "apache lucene released" } },
            { "range" : { "sent_at" : { "gte" : "2013-04-01" } } }
          ]
        }
      }
    }'
Query DSL supports many types of queries and filters.
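The bool query on this slide can also be assembled programmatically before it is sent. A minimal sketch in Python using only the standard library; the helper name build_email_query is hypothetical, while the field names body and sent_at come from the slide.

```python
import json


def build_email_query(text, sent_since):
    """Build a request body with two 'must' clauses: a match and a range."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"body": text}},
                    {"range": {"sent_at": {"gte": sent_since}}},
                ]
            }
        }
    }


body = build_email_query("apache lucene released", "2013-04-01")
# The serialized string is what would be passed to curl with -d.
print(json.dumps(body, indent=2))
```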

Background

Background
• Elasticsearch is a document based system. Documents are defined as JSON.
• The Elasticsearch document is always converted to a Lucene document. A Lucene document is just key value pairs.
• Both the nested and parent-child support are just tools for document design.

Background
• Let's design a book document.
    {
      "title" : "Elasticsearch",
      "summary" : "The definitive guide for Elasticsearch ...",
      "published_year" : 2013,
      "num_pages" : 289,
      "author" : ["Clinton Gormley", "Zachary Tong"],
      "categories" : ["programming", "information retrieval"]
    }
• But how do we add related data to it, such as chapter or page data?

Background
• Let's add a book with chapter data. Each chapter is a separate document that also carries the book data.
Document 1:
    {
      "book_title" : "Elasticsearch",
      "book_summary" : "The definitive guide for Elasticsearch ...",
      "book_num_pages" : 289,
      "chapter_title" : "Introduction",
      "chapter_text" : "Short introduction about Elasticsearch’s features ...",
      "chapter_num_pages" : 12
    }
Document 2:
    {
      "book_title" : "Elasticsearch",
      "book_summary" : "The definitive guide for Elasticsearch ...",
      "book_num_pages" : 289,
      "chapter_title" : "Data in, Data out",
      "chapter_text" : "How to manage your data with Elasticsearch ...",
      "chapter_num_pages" : 39
    }
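The fully de-normalized layout on this slide amounts to copying the book fields into every chapter document at write time. A minimal sketch in Python, assuming plain dicts as input; the helper name denormalize is hypothetical, and the book_ and chapter_ prefixes follow the slide.

```python
def denormalize(book, chapters):
    """Return one flat document per chapter, each carrying a copy of the book fields."""
    docs = []
    for chapter in chapters:
        doc = {"book_" + key: value for key, value in book.items()}
        doc.update({"chapter_" + key: value for key, value in chapter.items()})
        docs.append(doc)
    return docs


book = {"title": "Elasticsearch", "num_pages": 289}
chapters = [
    {"title": "Introduction", "num_pages": 12},
    {"title": "Data in, Data out", "num_pages": 39},
]
docs = denormalize(book, chapters)
for doc in docs:
    print(doc)
```

The book data is duplicated once per chapter, so changing the book means rewriting every chapter document. That is the write time cost of this design.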

Background
• Let's add a book with chapter data. Each chapter is an inner object.
    {
      "title" : "Elasticsearch",
      "author" : ["Clinton Gormley", "Zachary Tong"],
      "categories" : ["programming", "information retrieval"],
      "published_year" : 2013,
      "summary" : "The definitive guide for Elasticsearch ...",
      "chapters" : [
        {
          "title" : "Introduction",
          "summary" : "Short introduction about Elasticsearch’s features ...",
          "number_of_pages" : 12
        },
        {
          "title" : "Data in, Data out",
          "summary" : "How to manage your data with Elasticsearch ...",
          "number_of_pages" : 39
        },
        ...
      ]
    }

Background
• Let's add a book with chapter data. Both the book and the chapters are separate documents, combined with a query time join.
Document 1:
    {
      "title" : "Elasticsearch",
      "summary" : "The definitive guide for Elasticsearch ...",
      "num_pages" : 289
    }
Document 2:
    {
      "title" : "Introduction",
      "text" : "Short introduction about Elasticsearch’s features ...",
      "number_of_pages" : 12
    }
Document 3:
    {
      "chapter_title" : "Data in, Data out",
      "chapter_text" : "How to manage your data with Elasticsearch ...",
      "chapter_num_pages" : 39
    }

Background
• Document granularity (DG) is the unit of your data.
• DG is different per application. The right DG depends on how your data is used.
• But how does DG affect the scalability of the data?

Background
[Diagram labels: C, Query, Local join]

Background
• We need more capacity.
• But how do we divide the relational data?

Background
[Diagram labels: C, Query, sub-queries]

Background
[Diagram labels: C, Query, sub-query, De-normalized document]

Background
[Diagram labels: C, Query, sub-query, local join, local join]

Background
• When dealing with relations, you pay the price either at write time or at read time.
• Alternatively, document relations can balance the costs between read and write time. For example: one join to reduce duplicated data.
• Supporting “many-to-many” joins in a distributed system is difficult. It leads to either unbalanced partitions or a very expensive join.

The query time join
Parent child

Parent child
• Parent / child is a query time join between different document types in the same index.
• Parent and child documents are stored as separate documents in the same index.
• A child document can point to only one parent.
• A parent document can be referred to by multiple child documents.
• A parent document can also be a child document of a different parent.

Parent child
• A parent document and its child documents are routed to the same shard. The parent id is used as the routing value.
• Combined with an in-memory data structure of parent ids, this makes the parent-child join fast.
• Use the warming API to pre-load the id cache.
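Using the parent id as the routing value is what makes the join local to one shard. The sketch below illustrates the idea in Python. The shard count, the helper names shard_for and route, and the use of CRC32 as the hash are all assumptions for illustration, not the hash function Elasticsearch itself uses.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count, for illustration only


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id, parent_id=None):
    """A child is routed by its parent's id, a parent by its own id."""
    routing_value = parent_id if parent_id is not None else doc_id
    return shard_for(routing_value)


parent_shard = route("p2345")
child_shards = [route(offer_id, parent_id="p2345") for offer_id in ("12", "13", "14")]
print(parent_shard, child_shards)
```

Because every child hashes the same routing value as its parent, all of them land on the parent's shard regardless of their own ids.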

Parent child - Indexing
• The parent document doesn’t need to exist at the time of indexing.
An offer document is a child of a product document:
    curl -XPUT 'localhost:9200/products' -d '{
      "mappings" : {
        "offer" : {
          "_parent" : { "type" : "product" }
        }
      }
    }'
Then, when indexing, specify which product an offer points to:
    curl -XPUT 'localhost:9200/products/offer/12?parent=p2345' -d '{
      "valid_from" : "2013-05-01",
      "valid_to" : "2013-10-01",
      "price" : 26.87
    }'

Parent child - Querying
• The has_child query returns parent documents based on matches in their child documents. The optional “score_mode” defines how child hits are mapped to their parent document.
    curl -XGET 'localhost:9200/products/_search' -d '{
      "query" : {
        "has_child" : {
          "type" : "offer",
          "query" : {
            "range" : { "price" : { "lte" : 50 } }
          }
        }
      }
    }'
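The semantics of has_child and score_mode can be illustrated with a small in-memory simulation. This is a sketch in Python over plain dicts, not how Elasticsearch executes the query. The function name has_child, its parameters, and the set of mode names other than "avg" are assumptions, and the price is used as a stand-in for a relevance score.

```python
from collections import defaultdict


def has_child(children, child_matches, score, score_mode="none"):
    """Return {parent_id: score} for parents with at least one matching child.

    children      -- list of dicts, each with a 'parent' key
    child_matches -- predicate playing the role of the inner query
    score         -- function returning the score of a matching child
    score_mode    -- how child scores are combined: none, max, sum or avg
    """
    scores_by_parent = defaultdict(list)
    for child in children:
        if child_matches(child):
            scores_by_parent[child["parent"]].append(score(child))

    combine = {
        "none": lambda scores: 1.0,
        "max": max,
        "sum": sum,
        "avg": lambda scores: sum(scores) / len(scores),
    }[score_mode]
    return {parent: combine(scores) for parent, scores in scores_by_parent.items()}


offers = [
    {"id": "12", "parent": "p2345", "price": 26.87},
    {"id": "13", "parent": "p2345", "price": 75.00},
    {"id": "14", "parent": "p9999", "price": 120.00},
    {"id": "15", "parent": "p2345", "price": 40.00},
]

# Products that have at least one offer priced at 50 or less.
hits = has_child(offers, lambda o: o["price"] <= 50, lambda o: o["price"], "max")
print(hits)
```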

Parent child - Querying
• The has_parent query returns child documents based on matches in their parent documents.
    curl -XGET 'localhost:9200/products/_search' -d '{
      "query" : {
        "has_parent" : {
          "type" : "product",
          "query" : {
            "match" : { "category" : "electronics" }
          }
        }
      }
    }'

Smart de-normalization
Nested objects

Nested objects
• In many cases the objects in a domain model share the same write / update life cycle.
  • Books & Chapters.
  • Movies & Actors.
• De-normalizing results in the fastest queries.
  • Compared to using parent/child queries.
• Nested objects allow smart de-normalization.

Nested objects
    {
      "title" : "Elasticsearch",
      "author" : "Clinton Gormley",
      "categories" : ["programming", "information retrieval"],
      "published_year" : 2013,
      "summary" : "The definitive guide for Elasticsearch ...",
      "chapters" : [
        {
          "title" : "Introduction",
          "summary" : "Short introduction about Elasticsearch’s features ...",
          "number_of_pages" : 12
        },
        {
          "title" : "Data in, Data out",
          "summary" : "How to manage your data with Elasticsearch ...",
          "number_of_pages" : 39
        },
        ...
      ]
    }
• JSON allows complex nesting of objects.
• But how does this get indexed?

Nested objects
Original JSON document:
    {
      "title" : "Elasticsearch",
      ...
      "chapters" : [
        {"title" : "Introduction", "summary" : "Short ...", "number_of_pages" : 12},
        {"title" : "Data in, ...", "summary" : "How to ...", "number_of_pages" : 39},
        ...
      ]
    }
Lucene document structure:
    {
      "title" : "Elasticsearch",
      ...
      "chapters.title" : ["Data in, Data out", "Introduction"],
      "chapters.summary" : ["How to ...", "Short ..."],
      "chapters.number_of_pages" : [12, 39]
    }
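The flattening shown on this slide loses the link between a chapter's title and its page count. A minimal sketch in Python that mimics the flattening on plain dicts; the function name flatten is hypothetical and the real conversion to a Lucene document involves more than this.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted field names with multi-valued fields."""
    fields = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            inner_objects = [value]
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            inner_objects = value
        else:
            # Plain value or list of plain values: store as a multi-valued field.
            values = value if isinstance(value, list) else [value]
            fields.setdefault(name, []).extend(values)
            continue
        for inner in inner_objects:
            for sub_name, sub_values in flatten(inner, name + ".").items():
                fields.setdefault(sub_name, []).extend(sub_values)
    return fields


book = {
    "title": "Elasticsearch",
    "chapters": [
        {"title": "Introduction", "number_of_pages": 12},
        {"title": "Data in, Data out", "number_of_pages": 39},
    ],
}
flat = flatten(book)
print(flat)

# No chapter is titled "Introduction" AND has 39 pages, yet the flattened
# document satisfies both conditions because the values are no longer paired.
false_match = "Introduction" in flat["chapters.title"] and 39 in flat["chapters.number_of_pages"]
print(false_match)
```

This cross-object match is the problem that the nested type avoids.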

Nested objects - Mapping
• The nested type triggers Lucene’s block indexing.
• Multiple levels of inner objects are possible.
    curl -XPUT 'localhost:9200/books' -d '{
      "mappings" : {
        "book" : {
          "properties" : {
            "chapters" : { "type" : "nested" }
          }
        }
      }
    }'
Here “book” is the document type and ‘nested’ is the field type.

Nested objects - Block indexing
Lucene documents structure:
    {"chapters.title" : "Intro...", "chapters.summary" : "...", "chapters.number_of_pages" : 12},
    {"chapters.title" : "Data...", "chapters.summary" : "...", "chapters.number_of_pages" : 39},
    ...
    { "title" : "Elasticsearch", ... }
• The inner objects are inlined as separate Lucene documents right before the root document.
• The root document and its nested documents always remain in the same block.

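The block layout described on the block indexing slide can be sketched as a function that turns one JSON document into an ordered list of Lucene-style documents, nested ones first and the root last. This is an illustration in Python on plain dicts; the function name to_block is hypothetical.

```python
def to_block(doc, nested_path):
    """Return the documents of one block: nested documents first, root document last."""
    block = []
    for inner in doc.get(nested_path, []):
        # Each inner object becomes its own document with prefixed field names.
        block.append({f"{nested_path}.{key}": value for key, value in inner.items()})
    # The root document keeps every field except the nested one.
    block.append({key: value for key, value in doc.items() if key != nested_path})
    return block


book = {
    "title": "Elasticsearch",
    "chapters": [
        {"title": "Introduction", "number_of_pages": 12},
        {"title": "Data in, Data out", "number_of_pages": 39},
    ],
}
block = to_block(book, "chapters")
for lucene_doc in block:
    print(lucene_doc)
```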
Nested objects - Nested query
• The nested query returns the complete “book” (the root document) as a hit.
    curl -XGET 'localhost:9200/books/book/_search' -d '{
      "query" : {
        "nested" : {
          "path" : "chapters",
          "score_mode" : "avg",
          "query" : {
            "match" : {
              "chapters.summary" : { "query" : "indexing data" }
            }
          }
        }
      }
    }'
Here “path” specifies the nested level, “score_mode” sets the score mode, and the inner “query” is the chapter level query.

Nested objects
[Diagram: root documents bitset]
• Nested Lucene documents that match the inner query.
• Aggregate nested scores and push them to the root document.
• X: set bit that represents a root document.
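The join described on this slide can be sketched as follows: for every nested document that matches the inner query, walk forward to the next set bit in the root documents bitset and aggregate the nested scores there. This is a simplified illustration in Python; the function name block_join and the data layout are assumptions, and it relies on each block ending with its root document.

```python
def block_join(nested_scores, root_bits, score_mode="avg"):
    """Push scores of matching nested documents to their root documents.

    nested_scores -- {position: score} for nested documents matching the inner query
    root_bits     -- list of booleans, True where the position holds a root document
    """
    collected = {}
    for position, score in nested_scores.items():
        root = position
        # The root document is the first set bit at or after the nested document.
        while not root_bits[root]:
            root += 1
        collected.setdefault(root, []).append(score)

    combine = {
        "avg": lambda scores: sum(scores) / len(scores),
        "max": max,
        "sum": sum,
    }[score_mode]
    return {root: combine(scores) for root, scores in collected.items()}


# Positions 0-1 are nested documents of the root at 2; position 3 belongs to the root at 4.
root_bits = [False, False, True, False, True]
nested_scores = {0: 0.5, 1: 1.5, 3: 2.0}
print(block_join(nested_scores, root_bits))
```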

But first questions!
Extra slides

Parent child - Querying
• Both has_parent and has_child can return only one side of the relation.
• A workaround using the multi search API can return both sides, at the cost of one extra subsequent request.
• Future development will introduce the concept of inner_hits. This can include the related parent and top child hits.

Parent child - Querying
• Search request with has_child query:
    curl -XGET 'localhost:9200/products/_search' -d '{
      "query" : {
        "has_child" : {
          "type" : "offer",
          "query" : {
            "range" : { "price" : { "lte" : 50 } }
          }
        }
      }
    }'

Parent child - Querying
• Search response:
    {
      ...
      "hits" : [
        { "_index" : "index1", "_type" : "product", "_id" : "1", ... },
        { "_index" : "index1", "_type" : "product", "_id" : "2", ... },
        ...
      ]
    }

Parent child - Querying
• For each hit, create a search request that returns the top child docs. This sample request is for the first hit:
    {
      "query" : {
        "filtered" : {
          "query" : {
            "range" : { "price" : { "lte" : 50 } }
          },
          "filter" : {
            "term" : { "_parent" : "product#1" }
          }
        }
      }
    }
The “query” part is the has_child’s inner query. The “filter” part only includes child docs that are related to the first parent hit.

Parent child - Querying
• Wrap the requests into a single msearch request. Each search request has its own header line and body line. Including routing in the header optimizes the execution of the search request.
Requests file (each item header line is followed by its item body line):
    {"routing" : "1"}
    {"query" : {"filtered" : {"query" : {"range" : {"price" : {"lte" : 50}}}, "f...}}, "size" : 3}
    {"routing" : "2"}
    {"query" : {"filtered" : {"query" : {"range" : {"price" : {"lte" : 50}}}, "f...}}, "size" : 3}
    {"routing" : "3"}
    {"query" : {"filtered" : {"query" : {"range" : {"price" : {"lte" : 50}}}, "f...}}, "size" : 3}
    ...
    curl -XGET 'localhost:9200/products/offer/_msearch' --data-binary @requests; echo
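The requests file for this workaround can be generated from the parent hits of the first search. A minimal sketch in Python using only the standard library; the function name build_msearch_body and its parameters are hypothetical, while the header, the filtered query and the product#1 style parent term follow the slides.

```python
import json


def build_msearch_body(parent_hits, inner_query, child_size=3, parent_type="product"):
    """Build a newline-delimited msearch body: one header and one body line per parent hit."""
    lines = []
    for hit in parent_hits:
        # Routing by parent id sends the request only to the shard holding the children.
        header = {"routing": hit["_id"]}
        body = {
            "query": {
                "filtered": {
                    "query": inner_query,
                    "filter": {"term": {"_parent": f"{parent_type}#{hit['_id']}"}},
                }
            },
            "size": child_size,
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The msearch format expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


parent_hits = [{"_id": "1"}, {"_id": "2"}]
inner_query = {"range": {"price": {"lte": 50}}}
payload = build_msearch_body(parent_hits, inner_query)
print(payload, end="")
```

The resulting string can be written to the requests file that the curl command passes with --data-binary.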

Nested objects - Nested sorting
    curl -XGET 'localhost:9200/books/book/_search' -d '{
      "query" : {
        "match" : {
          "summary" : { "query" : "guide" }
        }
      },
      "sort" : [
        {
          "chapters.number_of_pages" : {
            "sort_mode" : "avg",
            "nested_filter" : {
              "range" : {
                "chapters.number_of_pages" : {"lte" : 15}
              }
            }
          }
        }
      ]
    }'
Here “sort_mode” sets the sort mode.

Parent child - sorting
• Parent/child sorting isn’t possible at the moment.
• But there is a “custom_score” query workaround.
• Downsides:
  • It forces a script to be executed for each matching document.
  • The child sort value is converted into a float value.
    "has_child" : {
      "type" : "offer",
      "score_mode" : "avg",
      "query" : {
        "custom_score" : {
          "query" : { ... },
          "script" : "doc[\"price\"].value"
        }
      }
    }
