LODOP - Multi-Query Optimization for Linked Data Profiling Queries

Talk at PROFILES2014, ESWC2014

Anja Jentzsch

May 26, 2014

Transcript

  1. LODOP - Multi-Query Optimization for Linked Data Profiling Queries

    Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
    Hasso Plattner Institute, Potsdam, Germany
    1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014, 2014/05/26
  2. OUTLINE

    LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.

    1. Challenges of Linked Data Profiling
    2. Profiling Tasks
    3. LODOP
    4. Multi-Query Optimizations
  3. LINKED DATA PROFILING

    • Metadata often not available
      • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
      • Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author and license information, estimated triple and link count
    • Use cases requiring metadata
      • Query optimization
      • Data cleansing
      • Data integration
      • Schema induction
    • Data profiling: methods for computing metrics / metadata for datasets
  4. TRADITIONAL VS LINKED DATA PROFILING

    • State-of-the-art data profiling
      • Based on columns
      • Assumes well-defined semantics
      • Expects regular data
    • Heterogeneity on the Web of Data
      • Diverse sources
      • Diverse structures
      • Diverse views
  5. CHALLENGES OF LD PROFILING

    • Heterogeneity
      • Nested graphs: makes reasoning difficult
      • Loose structure: things have different predicate sets
      • Incomplete: missing property definitions
      • Poorly formatted: property types used inconsistently
      • Inconsistent: multiple representations claim opposite things
    • Existing (relational) data profiling tools don’t work
    • Volume of data
      • Requires parallelization
  6. LODOP - CONTRIBUTIONS

    • Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
    • System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
    • Development and evaluation of 3 multi-script optimization rules
    • Apache Pig:
      • Platform for analyzing large datasets
      • High-level language: Pig Latin
      • Scripts executed on Hadoop / MapReduce
  7. PROFILING TASKS

    • Groupings
      • e.g. by resource, class, property type, language, vocabulary, …
    • Tasks
      • Number of triples
      • Average number of triples per resource
      • Average number of triples per object URI
      • Average number of triples per context URL
      • Number of property types
      • Average number of property values
      • Number of resources
      • Number of inlinks / outlinks
      • Number of context URLs
      • Number of context PLDs
      • Property co-occurrence
      • Inverse properties
      • URI-literal ratio
      • Property value ranges
      • Average value length
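
A few of the tasks listed above can be illustrated on a toy dataset. The sketch below is plain Python, not the Pig Latin scripts used in LODOP. The sample triples, the `profile` function, and the choice to count distinct subjects as "resources" are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy dataset of (subject, predicate, object) triples.
# Objects starting with "http://" are treated as URIs, everything else as literals.
TRIPLES = [
    ("http://ex.org/a", "rdf:type", "http://ex.org/Person"),
    ("http://ex.org/a", "foaf:name", "Alice"),
    ("http://ex.org/a", "foaf:knows", "http://ex.org/b"),
    ("http://ex.org/b", "rdf:type", "http://ex.org/Person"),
    ("http://ex.org/b", "foaf:name", "Bob"),
]


def profile(triples):
    """Compute a handful of simple profiling metrics over a list of triples."""
    predicates_by_subject = defaultdict(set)
    for s, p, _ in triples:
        predicates_by_subject[s].add(p)

    # Property co-occurrence: how many resources use both predicates of a pair.
    co_occurrence = defaultdict(int)
    for preds in predicates_by_subject.values():
        for pair in combinations(sorted(preds), 2):
            co_occurrence[pair] += 1

    uri_objects = sum(1 for _, _, o in triples if o.startswith("http://"))
    literal_objects = len(triples) - uri_objects

    return {
        "triples": len(triples),
        "resources": len(predicates_by_subject),
        "avg_triples_per_resource": len(triples) / len(predicates_by_subject),
        "property_types": len({p for _, p, _ in triples}),
        # Guard against datasets without any literal objects.
        "uri_literal_ratio": uri_objects / literal_objects if literal_objects else float("inf"),
        "co_occurrence": dict(co_occurrence),
    }


RESULT = profile(TRIPLES)
```

On a cluster, each of these metrics would be a separate grouping and aggregation over the full triple set, which is why several scripts end up repeating the same early steps.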
  8. DATASETS STATISTICS

    Statistics for 1M triples              DBpedia*   Freebase*   WDC RDFa**   EUNIS Species***
    Number of resources                    169,035    226,834     168,736      65,843
    Avg. number of triples per resource    5.9        4.4         5.9          15.2
    Number of classes                      19,585     1,928       61           1
    Number of property types               7,844      2,748       477          16
    Number of URIs                         519,692    642,183     174,317      407,418
    Number of inlinks                      207,712    192,179     35,329       78,377
    Number of literals                     480,279    357,817     825,564      592,582
    Avg. number of property values         127.5      363.9       2,096.2      62,500.0

    * source: BTC 2012 dataset
    ** WDC = Web Data Commons
    *** EUNIS = European Environment Agency
  9. PERFORMANCE EVALUATION
  10. PERFORMANCE EVALUATION

    • 10-15 s scheduling overhead per MapReduce job (~3.4 jobs per script)
    • Earlier MapReduce jobs have longer runtimes
    • Earlier jobs handle more data ⇒ more HDFS activity
    • Most scripts scale linearly
    • Most scripts reduce amount of data in workflow
    • Exceptions, e.g. property co-occurrence scripts
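
To get a feel for the scheduling overhead quoted above, the arithmetic can be spelled out. This is a rough sketch that assumes the jobs of one script run one after another, so that their overheads simply add up. The function name is made up for illustration.

```python
# Figures from the slide: 10-15 s scheduling overhead per MapReduce job,
# roughly 3.4 jobs per script on average.
OVERHEAD_PER_JOB_S = (10.0, 15.0)
JOBS_PER_SCRIPT = 3.4


def scheduling_overhead(jobs, per_job=OVERHEAD_PER_JOB_S):
    """Return the (low, high) overhead in seconds for `jobs` jobs run sequentially."""
    low, high = per_job
    return jobs * low, jobs * high


low, high = scheduling_overhead(JOBS_PER_SCRIPT)
print(f"per script: {low:.0f}-{high:.0f} s")  # prints "per script: 34-51 s"
```

Under that assumption, every job that can be removed from the workflow saves 10-15 s before any data is processed.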
  11. OPTIMIZATION GOALS

    • Optimize concurrent execution of multiple scripts
    • Reduce number of operators
    • Reduce data flow between operators
  12. NUMBER OF INSTANCES (PIG)
  13. LODOP - SYSTEM OVERVIEW
  14. MULTI-QUERY OPTIMIZATION

    1. Merging identical operators
    2. Combining FILTER operators
    3. Combining FOREACH operators
  15. STEP 0: MASTER PLAN

    • Merging all logical plans into one master plan
    • Allows parallel execution
    • Reduces runtime to 25-30% of sequential execution
  16. 1. MERGING IDENTICAL OPERATORS

    [Figure panels: Number of property types per class; URI Literal Ratio per class]
  17. 1. MERGING IDENTICAL OPERATORS

    [Figure panels: Number of property types per class; URI Literal Ratio per class]

    1. Identify and compare sibling operators
    2. Merge matching siblings
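
The two steps above can be sketched on a simplified plan tree. This is a minimal illustration, not LODOP's or Pig's implementation: the `Operator` class, the string signatures used to decide that two operators are identical, and the operator labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Operator:
    """A node in a simplified logical plan (hypothetical, not Pig's own classes)."""
    signature: str                      # stands in for operator type plus arguments
    children: list = field(default_factory=list)   # operators consuming this output


def merge_identical_siblings(node):
    """Merge children of `node` that share a signature, then recurse downwards."""
    merged = {}
    for child in node.children:
        if child.signature in merged:
            # Matching sibling: keep one operator and adopt the other's successors.
            merged[child.signature].children.extend(child.children)
        else:
            merged[child.signature] = child
    node.children = list(merged.values())
    for child in node.children:
        merge_identical_siblings(child)
    return node


def count_operators(node):
    """Count all operators in the plan rooted at `node`."""
    return 1 + sum(count_operators(c) for c in node.children)


# Two scripts that share the same LOAD and the same grouping step.
plan = Operator("LOAD triples", [
    Operator("GROUP BY class", [Operator("COUNT DISTINCT property")]),
    Operator("GROUP BY class", [Operator("URI/literal ratio")]),
])

BEFORE = count_operators(plan)
AFTER = count_operators(merge_identical_siblings(plan))
```

Here the two `GROUP BY class` operators collapse into one, so the plan shrinks from 5 to 4 operators. Because merging proceeds top-down, operators further down become siblings only after their parents have been merged.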
  19. 1. MERGING IDENTICAL OPERATORS

    • Number of operators reduced from 365 to 267
    • Number of MapReduce jobs reduced from 176 to 140
    • Frees up cluster resources
    • Prerequisite step for other optimizations
    • Restricts parallelism
  20. 1. MERGING IDENTICAL OPERATORS
  21. 2. COMBINING FILTER OPERATORS
  22. 2. COMBINING FILTER OPERATORS

    1. Create combined FILTER operator
    2. Rearrange original FILTER operators
    3. Remove redundant operators
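
One plausible reading of these steps is sketched below in plain Python. It assumes that the combined FILTER keeps every row that at least one of the sibling filters would keep, and that the original filters then run on this smaller intermediate result. The slides do not spell out the exact condition, so treat the OR-combination, the function name, and the sample data as assumptions.

```python
def combine_filters(rows, predicates):
    """Apply sibling filters on the same input via one combined filter.

    Step 1: a combined filter keeps rows matching ANY of the predicates.
    Step 2: the original filters then run on that smaller intermediate result.
    """
    combined = [row for row in rows if any(p(row) for p in predicates.values())]
    outputs = {
        name: [row for row in combined if p(row)]
        for name, p in predicates.items()
    }
    return combined, outputs


# Hypothetical sample input and two sibling filters on the same relation.
ROWS = [
    ("http://ex.org/a", "rdf:type", "http://ex.org/Person"),
    ("http://ex.org/a", "foaf:name", "Alice"),
    ("http://ex.org/a", "foaf:age", "42"),
    ("http://ex.org/b", "foaf:name", "Bob"),
]
PREDICATES = {
    "type_triples": lambda row: row[1] == "rdf:type",
    "name_triples": lambda row: row[1] == "foaf:name",
}

COMBINED, OUTPUTS = combine_filters(ROWS, PREDICATES)
```

Each original filter produces the same output as before, but reads 3 rows instead of 4. Step 3, removing operators that have become redundant, is not modeled in this sketch.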
  25. 2. COMBINING FILTER OPERATORS
  26. 3. COMBINING FOREACH OPERATORS
  27. 3. COMBINING FOREACH OPERATORS

    1. Create combined FOREACH operator
    2. Replace with simple projections
    3. Remove redundant projection
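
The three steps can be sketched as follows, again in plain Python and under stated assumptions: sibling FOREACH operators read the same input, each is described as a mapping from output column name to an expression, and two columns with the same name are assumed to compute the same expression. The function name and sample data are hypothetical.

```python
def combine_foreach(rows, foreach_ops):
    """foreach_ops maps an operator name to {column name: expression(row)}."""
    # Step 1: one combined FOREACH evaluates every distinct column once per row.
    all_columns = {}
    for columns in foreach_ops.values():
        all_columns.update(columns)
    combined = [{col: expr(row) for col, expr in all_columns.items()} for row in rows]

    # Step 2: each original FOREACH becomes a simple projection on the combined output.
    outputs = {}
    for name, columns in foreach_ops.items():
        if set(columns) == set(all_columns):
            # Step 3: a projection that keeps every column is redundant.
            outputs[name] = combined
        else:
            outputs[name] = [{col: r[col] for col in columns} for r in combined]
    return combined, outputs


# Hypothetical sample input and two sibling FOREACH operators.
ROWS = [
    ("http://ex.org/a", "foaf:name", "Alice"),
    ("http://ex.org/b", "foaf:name", "Bob"),
]
FOREACH_OPS = {
    "value_length": {"subject": lambda r: r[0], "length": lambda r: len(r[2])},
    "subject_only": {"subject": lambda r: r[0]},
}

COMBINED, OUTPUTS = combine_foreach(ROWS, FOREACH_OPS)
```

The shared `subject` expression is evaluated once per row instead of twice. The trade-off is that the combined output carries every column, so downstream projections read wider rows than they strictly need.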
  29. 3. COMBINING FOREACH OPERATORS
  30. ALL OPTIMIZATIONS
  31. SUMMARY

    • Optimizations reduce
      • Number of operations
      • Number of MapReduce jobs
      • Data flow between operators
      → less HDFS I/O
      → Improved execution time
    • Reduces execution time by 70%
    • … but rules should not be applied in all cases
    • More advanced (cost-based) approach is needed
  32. FUTURE WORK

    • Additional logical optimization rules
      • Ignore projections if it allows further merging of operators
    • Advanced optimization strategies
      • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
    • Materialization of intermediate results
      • Materialize common subsets, e.g. only triples with typed object values for later scripts
  33. http://github.com/bforchhammer/lodop/

    @anjeve [email protected]