(@anjeve), Benedikt Forchhammer, Felix Naumann
Hasso Plattner Institute, Potsdam, Germany
1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014
2014/05/26

Jentzsch. PROFILES2014, ESWC2014.

LINKED DATA PROFILING
• Metadata often not available
  • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
• Data registries, VoiD, and Semantic Sitemaps provide only basic information
  • e.g. description, author & license information, estimated triple and link count
• Use cases requiring metadata
  • Query optimization
  • Data cleansing
  • Data integration
  • Schema induction
• Data profiling: methods for computing metrics / metadata for datasets

TRADITIONAL VS LINKED DATA PROFILING
• State of the art data profiling
  • Based on columns
  • Assumes well-defined semantics
  • Expects regular data
• Heterogeneity on the Web of Data
  • Diverse sources
  • Diverse structures
  • Diverse views

LODOP - CONTRIBUTIONS
• Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
• System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
• Development and evaluation of 3 multi-script optimization rules
• Apache Pig:
  • Platform for analyzing large datasets
  • High-level language: Pig Latin
  • Scripts executed on Hadoop / MapReduce

PROFILING TASKS
• Groupings
  • e.g. by resource, class, property type, language, vocabulary, …
• Tasks
  • Number of triples
  • Average number of triples per resource
  • Average number of triples per object URI
  • Average number of triples per context URL
  • Number of property types
  • Average number of property values
  • Number of resources
  • Number of inlinks / outlinks
  • Number of context URLs
  • Number of context PLDs
  • Property co-occurrence
  • Inverse properties
  • URI-Literal ratio
  • Property value ranges
  • Average value length
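
To make a few of these tasks concrete, here is a minimal Python sketch over a small in-memory triple list. It is not one of the Pig scripts from the talk: the toy data, the `profile` function, and the choice to count only distinct subjects as "resources" are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: (subject, predicate, object, object_is_uri) tuples.
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City", True),
    ("ex:Berlin", "ex:name", "Berlin", False),
    ("ex:Berlin", "ex:country", "ex:Germany", True),
    ("ex:Potsdam", "rdf:type", "ex:City", True),
    ("ex:Potsdam", "ex:name", "Potsdam", False),
]


def profile(triples):
    """Compute a handful of the listed metrics for a list of triples."""
    props_by_subject = defaultdict(set)
    uri_objects = 0
    for s, p, o, is_uri in triples:
        props_by_subject[s].add(p)
        uri_objects += is_uri
    literal_objects = len(triples) - uri_objects
    # Property co-occurrence: unordered pairs of properties on one resource.
    cooccurrence = Counter()
    for props in props_by_subject.values():
        cooccurrence.update(combinations(sorted(props), 2))
    return {
        "triples": len(triples),
        # Assumption: a "resource" is any distinct subject.
        "resources": len(props_by_subject),
        "avg_triples_per_resource": len(triples) / len(props_by_subject),
        "property_types": len({p for _, p, _, _ in triples}),
        "uri_literal_ratio": (
            uri_objects / literal_objects if literal_objects else None
        ),
        "cooccurrence": cooccurrence,
    }


stats = profile(TRIPLES)
```

Most of these metrics are simple counts or averages over a grouping. Property co-occurrence is different in that it enumerates pairs of properties per resource, so its intermediate data can grow rather than shrink.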

PERFORMANCE EVALUATION
• 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script)
• Earlier MapReduce jobs have longer runtimes
  • Earlier jobs handle more data ⇒ more HDFS activity
• Most scripts scale linearly
  • Most scripts reduce amount of data in workflow
  • Exceptions e.g. property co-occurrence scripts
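
The scheduling figures above imply a fixed cost per script that can be worked out directly. The sketch below does that arithmetic in Python; it assumes the jobs of one script run one after another so that their scheduling overheads add up, which the slide does not state. The function name is hypothetical.

```python
JOBS_PER_SCRIPT = 3.4            # average from the slide
OVERHEAD_PER_JOB_S = (10, 15)    # lower and upper bound from the slide


def scheduling_overhead(jobs, per_job_bounds=OVERHEAD_PER_JOB_S):
    """Return (low, high) scheduling overhead in seconds for `jobs` jobs.

    Assumption: jobs are scheduled sequentially, so overheads are additive.
    """
    low, high = per_job_bounds
    return jobs * low, jobs * high


low, high = scheduling_overhead(JOBS_PER_SCRIPT)
print(f"{low:.0f}-{high:.0f} s of scheduling overhead per script")
```

Under that assumption an average script spends roughly 34 to 51 seconds on scheduling alone, which is why reducing the number of MapReduce jobs is an optimization target in its own right.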

OPTIMIZATION GOALS
• Optimize concurrent execution of multiple scripts
• Reduce number of operators
• Reduce data flow between operators

STEP 0: MASTER PLAN
• Merging all logical plans into one master plan
• Allows parallel execution
• Reduces runtime to 25-30% of sequential execution
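
One way to picture the master plan is as the plain union of all per-script operator graphs. The following Python sketch shows that idea only; the `Op` class, `build_master_plan`, and the two heavily simplified plans are hypothetical and do not reflect Pig's internal plan representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One logical operator; `inputs` holds ids of upstream operators."""
    id: str
    kind: str
    args: str
    inputs: tuple = ()


def build_master_plan(plans):
    """Union per-script plans into one plan, prefixing ids to keep them unique."""
    master = []
    for script, ops in plans.items():
        for op in ops:
            master.append(Op(
                id=f"{script}/{op.id}",
                kind=op.kind,
                args=op.args,
                inputs=tuple(f"{script}/{i}" for i in op.inputs),
            ))
    return master


# Hypothetical, heavily simplified plans for two profiling scripts.
plans = {
    "property_types_per_class": [
        Op("load", "LOAD", "dataset"),
        Op("group", "GROUP", "by class", ("load",)),
        Op("count", "FOREACH", "count distinct properties", ("group",)),
        Op("store", "STORE", "property_types_per_class", ("count",)),
    ],
    "uri_literal_ratio_per_class": [
        Op("load", "LOAD", "dataset"),
        Op("group", "GROUP", "by class", ("load",)),
        Op("ratio", "FOREACH", "uri count / literal count", ("group",)),
        Op("store", "STORE", "uri_literal_ratio_per_class", ("ratio",)),
    ],
}

master = build_master_plan(plans)
```

In this sketch the master plan still contains every operator of every script, including duplicated loads and groupings. It only puts them into one graph so that they can be submitted and scheduled together.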

1. MERGING IDENTICAL OPERATORS
Example scripts (figure): Number of property types per class; URI Literal Ratio per class
1. Identify and compare sibling operators
2. Merge matching siblings

1. MERGING IDENTICAL OPERATORS
• Number of operators reduced from 365 to 267
• Number of MapReduce jobs reduced from 176 to 140
• Frees up cluster resources
• Prerequisite step for other optimizations
• Restricts parallelism
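
The merging rule can be sketched as a single top-down pass that keeps one operator per distinct signature. This is an illustrative simplification in Python, not the implementation from the talk: the tuple layout, `merge_identical`, and the toy master plan are hypothetical, and identical LOAD operators are assumed to count as matching.

```python
def merge_identical(ops):
    """Merge operators with equal kind, args and (already merged) inputs.

    `ops` is a list of (id, kind, args, inputs) tuples in topological order.
    Returns the merged operator list and a mapping old id -> surviving id.
    """
    canonical = {}   # signature -> surviving operator id
    renamed = {}     # original id -> surviving id
    merged = []
    for op_id, kind, args, inputs in ops:
        # Rewire inputs first, so siblings of a merged parent compare equal.
        new_inputs = tuple(renamed[i] for i in inputs)
        signature = (kind, args, new_inputs)
        if signature in canonical:
            renamed[op_id] = canonical[signature]
        else:
            canonical[signature] = op_id
            renamed[op_id] = op_id
            merged.append((op_id, kind, args, new_inputs))
    return merged, renamed


# Hypothetical master plan for the two example scripts (8 operators).
master = [
    ("a/load", "LOAD", "dataset", ()),
    ("a/group", "GROUP", "by class", ("a/load",)),
    ("a/count", "FOREACH", "count distinct properties", ("a/group",)),
    ("a/store", "STORE", "property_types_per_class", ("a/count",)),
    ("b/load", "LOAD", "dataset", ()),
    ("b/group", "GROUP", "by class", ("b/load",)),
    ("b/ratio", "FOREACH", "uri count / literal count", ("b/group",)),
    ("b/store", "STORE", "uri_literal_ratio_per_class", ("b/ratio",)),
]

merged, renamed = merge_identical(master)
```

In this toy plan the shared LOAD and GROUP operators collapse, leaving 6 of the original 8 operators. Both scripts now depend on the same upstream operators, which illustrates why merging frees resources but also couples scripts that could otherwise run independently.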

SUMMARY
• Optimizations reduce
  • Number of operations
  • Number of MapReduce jobs
  • Data flow between operators
  → less HDFS I/O
  → Improved execution time
• Reduces execution time by 70%
• … but rules should not be applied in all cases
• More advanced (cost-based) approach is needed

FUTURE WORK
• Additional logical optimization rules
  • Ignore projections if it allows further merging of operators
• Advanced optimization strategies
  • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
• Materialization of intermediate results
  • Materialize common subsets, e.g. only triples with typed object values for later scripts