
Join Data Type in Elasticsearch

Luca Gennari
February 21, 2020


Is the Join Data Type in Elasticsearch really so bad?
What is it useful for?

Here you will find information and instructions on how, when and why to use the Join Data Type in Elasticsearch.


Transcript

  1. Join Data Type in Elasticsearch: Did someone say JOIN?
    Luca Gennari, Elasticsearch EMEA, January 28, 2020
    Original author: Luca Gennari - Education Engineer EMEA
  2. A SQL Join walked into an Elastic bar. A little while later it walked out... because it couldn't find a table!
  3. Before talking about JDTs, let's look at denormalization
    • Denormalizing data means “flattening” it
      ‒ storing redundant copies of data in each document, instead of using some type of relationship
      ‒ _source is compressed, which reduces the disk "waste"
    • Denormalization provides the best performance in Elasticsearch
      ‒ It is the standard way to index documents in a search engine and in many NoSQL databases
    Source: elastic.co
  4. A simple example of denormalization (Source: elastic.co)

    Blog
    id   title                  publish_date
    1    Beats 5.0.0 Release    2016-10-26
    2    Build your own Beat    2016-07-14

    blog   author
    1      1
    1      3
    2      2
    2      3

    Author
    id   name               company
    1    Tudor Golubenco    1
    2    Jongmin Kim        1
    3    Monica Sarbu       1
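As a rough illustration of the flattening described above, the following Python sketch merges the three tables into one self-contained document per blog post. The function name `denormalize` and the `authors` field are assumptions made for this example; they do not come from the slides.

```python
import json

# Rows taken from the three tables on the slide.
blogs = [
    {"id": 1, "title": "Beats 5.0.0 Release", "publish_date": "2016-10-26"},
    {"id": 2, "title": "Build your own Beat", "publish_date": "2016-07-14"},
]
authors = {
    1: {"name": "Tudor Golubenco", "company": 1},
    2: {"name": "Jongmin Kim", "company": 1},
    3: {"name": "Monica Sarbu", "company": 1},
}
blog_author = [(1, 1), (1, 3), (2, 2), (2, 3)]  # (blog id, author id)


def denormalize(blogs, authors, blog_author):
    """Return one flat document per blog, with its authors copied inside."""
    docs = []
    for blog in blogs:
        doc = dict(blog)
        # Redundant copy of each author: the same author may appear in many docs.
        doc["authors"] = [
            dict(authors[author_id])
            for blog_id, author_id in blog_author
            if blog_id == blog["id"]
        ]
        docs.append(doc)
    return docs


docs = denormalize(blogs, authors, blog_author)
for doc in docs:
    print(json.dumps(doc))
```

Each resulting document can be indexed on its own, so a search never needs to look up a second document; the cost is that an author's data is stored once per blog post.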
  5. Join Data Type in Elasticsearch
    A way to keep relations between documents stored in the same index. Available since version 6, it can be useful in many use cases, such as classification systems, categorising documents or products, and frequently updated data. Let's have an in-depth look...
  6. FREQUENT UPDATES
    • JDTs can help to redesign a document structure when only a small part of it must be updated frequently.
    • They allow you to split data into multiple documents while maintaining a relationship between them.
  7. CATEGORIZATION
    • Sort documents according to their specifications and characteristics.
    • Documents can be categorised on multiple levels.
    • New categories are easy to add when required.
  8. ANALYSIS
    • Reduce the number of false positives.
    • Increase precision on aggregations and full-text queries.
    • Particularly suitable for analysing scientific or research documents.
  9. The logic behind JDT - Parent & Child
    1. The definition must be done at mapping level
       • custom mapping only; join fields can't be mapped dynamically
    2. More than one relation is allowed per join field
       • list all the parent:child pairs that are needed
    3. Relationships can be hierarchical
       • a child can also be declared as a parent
    4. More than one parent document can be created per relationship
       • parent documents can be created independently and they will be at the same level
    Source: elastic.co
  10. Where JDTs can help: e-commerce, Machine Learning, Data Science
    • Product/category relationship
    • Customer level management
    • Promotions and 'Hot Products'
    • Better precision during the analysis phase
    • More consistent structures for DataFrames
    • Categorisation
    • Structuring the DataSets
  11. JDT mapping - 1
    • Define the name of the join field
      • in this example 'fruits_relations' ... (OK, it's definitely not the best name in the world, but we have a demo later!)
    • Then define the type as 'type: join'
    • The join type requires at least one parent/child relationship
      • in our example we have two parents:
        • category_fruits
        • category_citrus
      • and two children:
        • fruits
        • citrus
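A minimal sketch of the mapping described on this slide, written as a Python dictionary that serialises to the request body. It assumes the typeless mapping format of Elasticsearch 7.x and the index name my_index used later in the deck. The pairing of each category with its child is a guess, because the slide's screenshot is not part of the transcript.

```python
import json

# Join field 'fruits_relations' with two parent -> child relations.
# Assumption: category_fruits is the parent of fruits, and
# category_citrus is the parent of citrus.
mapping = {
    "mappings": {
        "properties": {
            "fruits_relations": {
                "type": "join",
                "relations": {
                    "category_fruits": "fruits",
                    "category_citrus": "citrus",
                },
            }
        }
    }
}

# Print the request in the console style used elsewhere in the deck.
print("PUT my_index")
print(json.dumps(mapping, indent=2))
```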
  12. JDT mapping - 2
    • More than one child can be defined per parent
      • by passing a list of child names as the value.
    • In this case the relationship structure is of the following type: [diagram not captured in the transcript]
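One way the "list of children" variant might look, again as a Python dictionary. Reusing category_fruits as the single parent of both fruits and citrus is an assumption for illustration only; the original slide may have used a different arrangement.

```python
import json

# One parent relation with two children, passed as a list.
# Assumption: 'category_fruits' is the parent of both 'fruits' and 'citrus'.
mapping = {
    "mappings": {
        "properties": {
            "fruits_relations": {
                "type": "join",
                "relations": {
                    "category_fruits": ["fruits", "citrus"],
                },
            }
        }
    }
}

relations = mapping["mappings"]["properties"]["fruits_relations"]["relations"]
for parent, children in relations.items():
    # Show the resulting structure as parent -> child lines.
    for child in children:
        print(parent + " -> " + child)

print(json.dumps(mapping, indent=2))
```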
  13. JDT mapping - 3
    • It is also possible to define the child of one relation as a second-level parent in another relation
    • In this case the relationship structure is of the following type: [diagram not captured in the transcript]
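A sketch of a multi-level relation under assumed names: category_fruits is the parent of fruits, and fruits is in turn the parent of citrus. The helper functions `parent_of` and `ancestors` are hypothetical and only make the hierarchy visible; they are not part of any Elasticsearch API.

```python
import json

# Assumption: 'fruits' is both a child (of category_fruits) and a parent (of citrus).
relations = {
    "category_fruits": "fruits",
    "fruits": "citrus",
}

mapping = {
    "mappings": {
        "properties": {
            "fruits_relations": {"type": "join", "relations": relations}
        }
    }
}


def parent_of(relations, name):
    """Return the relation that lists `name` as a child, or None."""
    for parent, children in relations.items():
        kids = children if isinstance(children, list) else [children]
        if name in kids:
            return parent
    return None


def ancestors(relations, name):
    """Walk up the hierarchy from `name` to the top-level parent."""
    chain = []
    parent = parent_of(relations, name)
    while parent is not None:
        chain.append(parent)
        parent = parent_of(relations, parent)
    return chain


print(json.dumps(mapping, indent=2))
print(ancestors(relations, "citrus"))
```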
  14. To put it simply...
    Simple Relationship
    • Parent/child relationship 1:1
    • Single set of associated documents for each parent
    • Multiple documents can be defined independently for each parent
    Multiple Children
    • Parent/child relationship 1:n
    • Different document sets for each parent
    • All child documents are grouped by the parent they belong to
    Multiple Levels
    • Parent/child relationship n:1
    • Different document sets for each parent
    • Multiple classifications for each set of documents
  15. Parent Document
    PUT my_index/_doc/apple
    • Name of the join field defined in the mapping
    • The document's own 'name' field is required. It is not part of the join, but without it the documents can't be found
    • The rest of the document
    • Inside the join field, 'name' associates the document with a parent relation (the relation must exist in the mapping)
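A sketch of what the annotated parent request might contain. The document id apple and the index my_index come from the slide; the value of the regular 'name' field and the choice of category_fruits as the parent relation are assumptions.

```python
import json

parent_id = "apple"  # document id shown on the slide

parent_doc = {
    # Regular document field (assumed value), not part of the join itself.
    "name": "apple",
    # Join field: 'name' must match a parent relation declared in the mapping.
    "fruits_relations": {"name": "category_fruits"},
}

print("PUT my_index/_doc/" + parent_id)
print(json.dumps(parent_doc, indent=2))
```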
  16. Child Document
    • Name of the join field defined in the mapping
    • The document's own 'name' field is required, as it was in the parent document
    • The rest of the document
    • Inside the join field, 'name' associates the document with a child relation
    • The 'parent' field associates it with the _id of the parent
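A matching sketch for a child document. The child id granny_smith and its 'name' value are hypothetical; the parent id apple comes from the parent slide. The routing query parameter is set to the parent's id so that both documents land on the same shard.

```python
import json
from urllib.parse import quote, urlencode

parent_id = "apple"        # _id of the parent document
child_id = "granny_smith"  # hypothetical child id, for illustration only

child_doc = {
    # Regular document field (assumed value).
    "name": "granny smith",
    "fruits_relations": {
        "name": "fruits",     # child relation declared in the mapping
        "parent": parent_id,  # _id of the parent this child belongs to
    },
}

# Route the child by its parent's id so both are stored on the same shard.
path = "my_index/_doc/{}?{}".format(quote(child_id), urlencode({"routing": parent_id}))

print("PUT " + path)
print(json.dumps(child_doc, indent=2))
```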
  17. JDT Limitations
    • System performance may degrade with large quantities of documents. Queries and aggregations on a join field cost Elasticsearch more effort than usual.
    • An index that maps a join field can also contain documents that have no relationship, but all documents that take part in a relationship must be indexed in the same shard. This makes the _routing parameter mandatory, and it must be supplied for every operation on those documents.
    • Only one join field is allowed per index.
    • A parent can have any number of children, but a child can refer to only one parent.
    Source: elastic.co
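To see why routing keeps a family of documents together, here is a simplified model of shard selection: the shard is derived from a hash of the routing value modulo the number of primary shards. The function `shard_for` and the use of CRC32 are stand-ins chosen for this sketch; Elasticsearch uses its own hash function internally, so the shard numbers here are illustrative only.

```python
import zlib


def shard_for(routing, num_shards):
    """Simplified model: shard = hash(routing) % number of primary shards.

    CRC32 is a stand-in hash for this sketch, not what Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


num_shards = 5  # assumed number of primary shards

# The parent is routed by its own id (the default routing value).
parent_shard = shard_for("apple", num_shards)

# A child indexed with routing=apple hashes the same value as its parent.
child_shard = shard_for("apple", num_shards)

# Without explicit routing, the child would be routed by its own id instead,
# which may or may not map to the parent's shard.
unrouted_child_shard = shard_for("granny_smith", num_shards)

print(parent_shard, child_shard, unrouted_child_shard)
```

Because the routing value is the only input to the hash, passing the parent's id guarantees co-location, while omitting it leaves the child's placement to chance.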