FishMark: A Linked Data Application Benchmark @ SSWS 2012, Boston, US
Slides for my presentation of our paper on FishMark, a linked data application benchmark that can be used to compare the performance of linked data stores against a classic relational database.
Desiderata
• Use real(istic...) data, queries, and query mixes
• (Realistic) scalability of the data
  ‣ Scale down data for system to handle
  ‣ Test how system scales
• Compare alternative technologies
  ‣ Linked data vs classic relational DB
• Transparency (what is measured and how?)
FishBase & FishDelish
• FishBase: Database about the world’s finned fish species
• Contains information about ~32,000 species
• fishbase.org: Web front-end to the FishBase DB
• Provides interface for (canned) queries
• Backed by MySQL DB
• DB: 195 tables (3GB)
• FishDelish: RDF graph of fishbase.org
• Result of D2R conversion
• 1.38bn triples (250GB)
• 22 query templates based on typical activities on fishbase.org:
  ‣ Generate search results page for common name search
  ‣ Generate species page for a given species and genus
  ‣ Generate pictures page for a given species
  ‣ ...
• SQL queries ported to SPARQL
• Using fixed queries may introduce bias, e.g.:
  ‣ common name search for ‘salmon’: 96 results
  ‣ common name search for ‘borna snakehead’: 1 result
• We want to measure performance of the same query type (‘search for common name’, ‘generate species page’, ...) with different parameter values
• Use query templates to generate queries with random parameters (see the sketch after the query template below)
Example: fixed query for ‘Danio rerio’:

SELECT ..., families.Order, families.Class, morphdat.AddChars, species.DemersPelag,
       species.SpeciesRefNo, species.AnaCat, species.PicPreferredName,
       picturesmain.autoctr, picturesmain.PicName, picturesmain.Entered,
       picturesmain.AuthName
FROM   species, refrens, morphdat, families, picturesmain
WHERE  species.SpeciesRefNo = refrens.RefNo
  AND  species.SpecCode = morphdat.SpecCode
  AND  species.FamCode = families.FamCode
  AND  species.SpecCode = picturesmain.SpecCode
  AND  (species.Genus = "Danio" AND species.Species = "rerio")
The corresponding query template:

SELECT ..., families.Order, families.Class, morphdat.AddChars, species.DemersPelag,
       species.SpeciesRefNo, species.AnaCat, species.PicPreferredName,
       picturesmain.autoctr, picturesmain.PicName, picturesmain.Entered,
       picturesmain.AuthName
FROM   species, refrens, morphdat, families, picturesmain
WHERE  species.SpeciesRefNo = refrens.RefNo
  AND  species.SpecCode = morphdat.SpecCode
  AND  species.FamCode = families.FamCode
  AND  species.SpecCode = picturesmain.SpecCode
  AND  (species.Genus = "%Genus%" AND species.Species = "%Species%")
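To make the template step concrete, below is a minimal Python sketch (not the actual FishMark/BSBM code) of how such a template could be instantiated with randomly drawn parameter values. The seed query, function names, and example rows are assumptions for illustration only.

import random

# SQL template from the slide above (projection list abbreviated);
# %Genus% and %Species% are the placeholders to be instantiated.
SPECIES_TEMPLATE = """
SELECT ...
FROM species, refrens, morphdat, families, picturesmain
WHERE species.SpeciesRefNo = refrens.RefNo
  AND species.SpecCode = morphdat.SpecCode
  AND species.FamCode = families.FamCode
  AND species.SpecCode = picturesmain.SpecCode
  AND (species.Genus = "%Genus%" AND species.Species = "%Species%")
"""

def instantiate(template, bindings):
    """Replace each %Name% placeholder with a concrete value."""
    for name, value in bindings.items():
        template = template.replace("%" + name + "%", value)
    return template

def random_instances(template, seed_rows, n, rng=random):
    """Build n concrete queries from parameter rows drawn at random
    from the results of a seed query (here: genus/species pairs)."""
    return [instantiate(template, {"Genus": g, "Species": s})
            for g, s in rng.sample(seed_rows, n)]

# Illustration only: in the benchmark the rows would come from a seed query
# such as selecting the distinct Genus/Species pairs from the FishBase data.
if __name__ == "__main__":
    rows = [("Danio", "rerio"), ("Salmo", "salar"), ("Betta", "splendens")]
    for q in random_instances(SPECIES_TEMPLATE, rows, 2):
        print(q)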
Framework
• Based on the Berlin SPARQL Benchmark (BSBM) [1] framework
• Extensions:
  ‣ Query generation from query templates
  ‣ Connection for OBDA systems
  ‣ Supports different query scenarios

[1] Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. In: International Journal on Semantic Web and Information Systems, Vol. 5, Issue 2, Pages 1–24, 2009.
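The slide does not spell these extensions out; the following is only a rough sketch of what a pluggable connection layer for the different kinds of tested systems might look like. The class and method names are invented for illustration and are not BSBM's actual API.

import abc

class BenchmarkConnection(abc.ABC):
    """Hypothetical abstraction over a system under test, so the same
    driver can issue queries to a SPARQL store, an OBDA system, or MySQL."""

    @abc.abstractmethod
    def execute(self, query: str):
        """Run one query and return its result rows."""

class SparqlEndpointConnection(BenchmarkConnection):
    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def execute(self, query: str):
        # would send the SPARQL query to self.endpoint_url (omitted here)
        raise NotImplementedError

class ObdaConnection(BenchmarkConnection):
    def __init__(self, mappings_file: str, jdbc_url: str):
        self.mappings_file = mappings_file
        self.jdbc_url = jdbc_url

    def execute(self, query: str):
        # would hand the SPARQL query to the OBDA system (e.g. Quest), which
        # rewrites it to SQL over the mapped relational DB (omitted here)
        raise NotImplementedError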
Frequency-based query mix
• Obtain frequency of each query from fishbase.org server logs
• Execute queries to simulate ‘realistic’ load on the server
• Problem: does not touch some of the rare queries

Query                       Total / Month   Avg. / Day   Avg. / Hour
SpeciesPage                 96154           3205.13      133.55
CommonName                  31008           1033.60      43.07
By Genus                    13331           444.63       18.53
CountrySpeciesInformation   4429            147.63       6.15
CollaboratorPage            4138            137.93       5.75
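A minimal Python sketch of how such log frequencies could be turned into a weighted query mix; the counts are taken from the table above, while the function and all other details are illustrative assumptions rather than the FishMark implementation.

import random

# Monthly request counts taken from the fishbase.org server-log table above.
MONTHLY_COUNTS = {
    "SpeciesPage": 96154,
    "CommonName": 31008,
    "ByGenus": 13331,
    "CountrySpeciesInformation": 4429,
    "CollaboratorPage": 4138,
    # ... remaining, rarer query types
}

def sample_query_mix(counts, mix_length, rng=random):
    """Draw a query mix in which each query type appears roughly in
    proportion to its observed frequency. This also shows the drawback
    noted above: rare query types may not be drawn at all."""
    names = list(counts)
    weights = [counts[name] for name in names]
    return rng.choices(names, weights=weights, k=mix_length)

if __name__ == "__main__":
    print(sample_query_mix(MONTHLY_COUNTS, 25))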
Use case query mix
• Query mix based on a typical ‘session’ of a fishbase.org visitor
• Cf. BSBM ‘Explore’ use case
• Requires chaining of query results

Example session: search for common name ‘Zebrafish’ → generate species page for ‘Danio rerio’ → request picture page for ‘Danio rerio’ → ...
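A small Python sketch of such a chained session; the connection stub and query-builder helpers are made up for illustration, and the point is only that later queries are parameterised with results of earlier ones.

class DummyConnection:
    """Stand-in for a real database/endpoint connection."""
    def execute(self, query):
        print("executing:", query)
        # pretend the name search found the zebrafish
        return [("Danio", "rerio")]

def common_name_search_query(name):
    return f'-- query: search results page for common name "{name}"'

def species_page_query(genus, species):
    return f'-- query: species page for "{genus} {species}"'

def picture_page_query(genus, species):
    return f'-- query: pictures page for "{genus} {species}"'

def run_session(connection, common_name):
    """One simulated fishbase.org visitor session (cf. BSBM 'Explore'):
    each query is parameterised with results of the previous one."""
    hits = connection.execute(common_name_search_query(common_name))
    genus, species = hits[0]                      # chain: reuse the first hit
    connection.execute(species_page_query(genus, species))
    connection.execute(picture_page_query(genus, species))
    # ... further steps of the session

if __name__ == "__main__":
    run_session(DummyConnection(), "Zebrafish")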
• Tested systems:
  ‣ Virtuoso Open Source 6.1.5 triple store
  ‣ ontop Quest 1.7 OBDA system (using a MySQL database)
  ‣ MySQL 5.5 relational DBMS
• Benchmark parameters for scenario 1:
  ‣ 50 warm-up runs
  ‣ 100 timed runs per query instance
  ‣ 20* instantiations of each query template
    → 2000 timed runs per query type

* Arbitrary value! Better: base the number of instantiations on the number of results returned by the seed query.
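A rough sketch of how these parameters could translate into a measurement loop; the connection object is assumed, and whether the warm-up runs apply per query instance or per template is not pinned down by the slide, so this is one possible reading.

import statistics
import time

WARMUP_RUNS = 50      # untimed runs to warm up caches
TIMED_RUNS = 100      # timed runs per query instance
INSTANTIATIONS = 20   # query instances per template -> 20 * 100 = 2000 timed runs

def benchmark_template(connection, query_instances):
    """Time all instantiations of one query template and return the
    mean execution time in seconds over the timed runs."""
    timings = []
    for query in query_instances:          # e.g. 20 instantiated queries
        for _ in range(WARMUP_RUNS):       # warm-up, results discarded
            connection.execute(query)
        for _ in range(TIMED_RUNS):        # timed runs
            start = time.perf_counter()
            connection.execute(query)
            timings.append(time.perf_counter() - start)
    return statistics.mean(timings)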
Summary & future work
• FishMark: Application benchmark based on real data & queries
• Preliminary evaluation:
  ‣ Virtuoso: 5% of MySQL performance
  ‣ Quest (with cache): 70% of MySQL performance
• Future work:
  ‣ Extend framework for query scenarios 2 and 3
  ‣ Perform comprehensive tests
    ‣ More systems
    ‣ Complete FishDelish data (1.38bn triples)