
Efficient Querying for Analytics on Internet of...

Eugene Siow
December 18, 2017

Efficient Querying for Analytics on Internet of Things Databases and Streams (The Distilled Edition)

PhD Defence / Viva Slides. UK. 20.12.2017.



Transcript

  1. 1 Efficient Querying for Analytics on Internet of Things Databases and Streams: The Distilled Edition. Eugene Siow | 20.12.17
  2. 2 Research Question (C1, Section 1.4)

    “How can database systems support the efficient storage and retrieval of a semantically interoperable representation of data and metadata, both historical and real-time, from the Internet of Things, for analytical applications?”
  3. 3 Structure of Thesis: 8 Chapters + 3 Appendices (C1, Section 1.6)

    Introduction.
    02 Background: Categorisation, Databases and Streams, Analytics, Fog Computing.
    03 Characteristics of IoT Data: IoT Data, Data Models, IoT RDF Graphs.
    04 Map-Match-Operate: Abstraction to apply graph queries across in-memory model and on-disk database tier.
    05 TritanDB: Native Time-series Map-Match-Operate.
    06 Streaming and Fog Computing: Map-Match-Operate for streams, Eywa Fog Computing Infrastructure.
    07 Applications: PIOTRe, TritanDB Analytics, SWoT.
    Conclusions.
    Appendices: A. IoT Applications Survey; B. Smart Home Benchmark; C. Operate Query Translation.
  4. 4 Contributions INSI 16 Publications Interoperable &Efficient ISWC 16 ISWC

    P&D INSCI 17 CSUR TOD S IEEE PC S2S PIOTRe Eywa Analytics TritanDB SWoT Published Minor Revision Review Datasets Dweet IO Cross IoT dF Software DistributedFog S2S T Eywa P H.s S2S TritanDB Eywa PIOTRe Hubber C1 Studies Theory and Design IoT Surveys Time-series DB Public IoT Schemata IoT Metamodels IoT RDF Metadata-to-Data Ratios Time-series Compression and Data Structures Implementation and Evaluation Map-Match-Operate TrTables SPARQL Operator Translation Distributed Fog Computing Infrastructure Stream Query Translation S2S + S2SML TritanDB + TSDB Evaluation Eywa + RSP Evaluation Smart Home Bench S2S Streams + Evaluation Applications and Analytics PIOTRe + Apps SWoT + Hubber.space TritanDB Analytics Eywa Section 1.5 + Appendices
  5. 5 Background: Internet of Things Applications, Platforms and Building Blocks (C2)

    Definition and Categorisation.
    Definition: A global infrastructure of Things, interconnected to share data, with applications that have potential technological and societal impacts.
    Categorisation: Applications, Platforms and Building Blocks.
    Building a categorisation for IoT Application Areas, Domains and Themes (Figure 2.3).
    Survey of IoT Applications, Techniques and Data Currency.
    Architectures: 3-Layer, 5-Layer and Edge/Fog (Figure 2.4).
    Platform Horizontals Diagram from Architectural Layers (Figure 2.5, Tables 2.3 + 2.4 + 2.5).
    Table 2.2: Both Real-time and Historical, Vertical Silos, No Common Platform.
    Time-Series Databases: Native, NoSQL, Relational.
    RDF Stores: Open Source, Commercial, Research.
    SPARQL-to-SQL Translation: R2RML, Ontop, Morph.
    Federated SPARQL.
    Stream Processing: CEP, C-SPARQL, CQELS, SPARQLstreams.
    Analytics: 5 capability classification.
    Fog Computing.
    Methodology: Review of Surveys + Evidence-Based Systematic Review.
  6. 6 Platform Horizontals from Architectural Layers Building Blocks within each

    Platform Horizontal C2 Application + Business Layer Network Layer Processing Layer Perception Layer Section 2.1.4, Figure 2.5, Tables 2.3 + 2.4 + 2.5 Operating Systems Tags Hardware Sensors Power Actuators Network Protocols Sensor Networks Discovery Gateway Networks Thing Directories Security Fog/Edge App Protocols Cloud of Things Social Web of Things Social IoT Web of Things Big Data Processing Stream/Complex Event Processing Analytics Classes Descriptive Diagnostic Discovery Predictive Prescriptive Visual Content Text Video Trend Data Mining OLAP Business Anomaly Pattern Types/Research Areas Databases Relational Graph Semantic/ RDF NoSQL Time-series Middleware Resource Discovery Resource Management Data Management Event Management Code Management Functional Requirements Interoperable Context-Aware Autonomous Adaptive Service-oriented Programmable Lightweight Distributed Architectural Requirements 3 1 2 4
  7. 7 Background Surveys: Focused Surveys on State-of-the-Art for the IoT (C2, Sections 2.2 + 2.2.5 + 2.3 + 6.1)

    01 Databases (S2.2). Performance: Time-series. Interoperability: Graph/Semantic/RDF.
    Time-Series Databases. Cassandra-Based: Heroic, KairosDB, Hawkular, Blueflood. Native: InfluxDB, Vulcan/Prometheus, Gorilla/Beringei, BTrDB, Akumuli. Riak: DalmatinerDB, Riak-TS. Other NoSQL: OpenTSDB (HBase), Cube (MongoDB). Relational (PG): Tgres, Timescale.
    RDF Graph Stores: Virtuoso, Jena TDB, GraphDB/Owlim, 3Store, Stardog, AllegroGraph, Blazegraph/Bigdata. Re-purposed: MarkLogic, Oracle 11g. Research: RDF-3X, Hexastore, CumulusRDF, Hagedorn (2013). LDF. Federated-SPARQL. SPARQL-to-SQL: D2RQ, Morph, Ontop.
    02 Streams (S2.2.5). RDF Stream Processing (RSP): C-SPARQL, CQELS, SPARQLstream/Morph-streams, RSP-QL. Complex Event Processing (CEP): CQL, DAHP, Esper/EPL.
    03 Analytics Capability Categorisation (S2.3): Descriptive, Diagnostic, Discovery, Predictive, Prescriptive. Value. Knowledge Hierarchy.
  8. 8 The Role of Fog Computing (C2, Section 2.4)

    04 Focused Survey on State-of-the-Art for the IoT.
    Cloud: Dynamic provisioning of scalable resources, e.g. analytics on a huge volume of historical data.
    Fog Computing: An emerging technology that bridges the gap, deployed close to the source.
    Things: Connected sensors and actuators producing streams of time-series data.
    Database systems should cater for diverse hardware employed in the Cloud-to-Things continuum.
  9. 9 Characteristics of IoT Time-Series Schemata dweet.io Varying Periodicity dweet.io

    Numerical dweet.io Wide dweet.io Flat not Complex 2% 3% 3% 5% 87% 99.2% Numerical String Boolean Categorical Identifiers 80% 100% 100% 100% 50% 53% 84% 100% 100% 83% ArrayOfThings OpenEnergy ThingSpeak 100% - 47% 52% Zero MAD >1 column Flat Section 3.1 + 3.2 C3
  10. 10 IoT Data Metamodels Study of the Interoperability Structures Across

    Domains Graph Structure Tree Structure RDF Data Model Haystack OGC SensorThings IPSO Objects oneM2M SDT W3C Thing Description ETSI SAREF W3C SSN IoT-O LOV4IoT Section 3.3, Table 3.2 C3
  11. 11 Metadata Expansion IoT RDF Graph Data Obs1 Sensor1 RainfallObs

    Data1 Time1 Point1 “40.82” “103.25” produces located at lat lon cm “0.1” “2017-06-01T15:46:08” XSDDateTime value unit sampleTime a result Sensor1 Point1 Point1 “40.82” “103.25” located at lat lon Sensor1 located at Obs1 Obs1 RainfallObs a Data1 Time1 result sampleTime cm Data1 unit Observation Metadata Sensor Metadata Data1 “0.1” “2017-06-01…” value XSDDateTime Observation Data Section 3.4, Table 3.3 C3
  12. 12 Metadata Expansion Ratios of Data-to-Metadata in IoT RDF Graph

    Data 1:7 1:7 1:4.5 1:5.9 1:2 1:3.5 1:6 1:1.8 Observation Data RDF Triples CityPulse Smart Home LinkedSensorData Ratio Blizzard Ike Analytics Parking Weather Traffic Pollution Events 109 Triples in Millions 535 11.2 0.7 0.2 489 552 0.02 Section 3.4.2, Table 3.4 C3
  13. 13 Map-Match-Operate (C4, Section 4.2)

    Model Tier (graph): Sensor1 produces Obs1 and Obs2. Obs1 is a RainfallObs with result Data1 (value 0.1, unit cm) and sampleTime Time1 (XSDDateTime 2017-06-01T15:46:08). Obs2 is a TemperatureObs with result Data2 (value 30.0, unit degreesC).
    Database Tier.
    Input Query: Expressed in a graph query language like SPARQL.
    Translated Query: Expressed in the database tier language.
    Result: Returned as a ResultSet.
  14. 14 Map-Match-Operate: Map and Match Steps (C4, Section 4.2.1 + 4.2.2)

    1 Map: S2SML Data Model Mapping. The Model Tier graph is expressed with bindings to the Database Tier. Sensor1 produces {series._uuid}o1, a RainfallObs with result {series._uuid}d1 (value “series.rainfall”, unit cm) and sampleTime {series._uuid}t (XSDDateTime “series.time”). Sensor1 produces {series._uuid}o2, a TemperatureObs with result {series._uuid}d2 (value “series.temp”, unit degreesC).
    Mapping elements: Faux Node, Literal Map (s2s:literalMap), Bindings.
    Database Tier: A compact 2D structure like a table stores time-series data. A point in the time-series is referred to as a row.
    Table series (Time, Rainfall, Temp): Row1: t1, 0.1, 30.0. Row2: t2, 0.1, 30.1. Row3: t3, 0.1, 30.1. …
    2 Match: Basic Graph Pattern Match. Input Query (SPARQL Select Graph Query): SELECT * WHERE { Sensor1 produces ?obs. ?obs a TemperatureObs. ?obs result ?data. ?obs sampleTime ?time. ?data value ?val. ?time XSDDateTime ?timeVal. }
    ResultSet bindings: ?obs+?data+?time to _uuid; ?timeVal to Time; ?val to Temp.
  15. 15 Map-Match-Operate: Operate Step (C4, Section 4.2.3 + Appendix C)

    3 Operate: SPARQL Algebra and the Match Step Graphs. Translate Query: To the database tier language like SQL.
    Bindings from the Match step: ?obs+?data+?time to _uuid; ?timeVal to Time; ?val to Temp.
    Query Operator Tree: Project ?val, ?val2, ?timeVal over Union of (Filter ?val > 20.0 AND t1 < ?timeVal < t2 over Graph Temperature) and (Filter ?val2 < 1.0 AND t1 < ?timeVal < t2 over Graph Rainfall).
    Translation order: Graph, Filter, Union, Project.
    Filter on Graph Temperature: FROM series WHERE Temp > 20.0 AND Time > t1 AND Time < t2
    Filter on Graph Rainfall: FROM series WHERE Rainfall < 1.0 AND Time > t1 AND Time < t2
    Union: FROM series WHERE Temp > 20.0 AND Rainfall < 1.0 AND Time > t1 AND Time < t2
    Project: SELECT Temp, Rainfall, Time
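The translated query on this slide can be tried out with a minimal sketch, assuming the database tier is a SQL table named `series` with columns `Time`, `Rainfall` and `Temp`. The `translate` helper, the in-memory SQLite table, the fourth data row and the integer timestamps standing in for t1 and t2 are all illustrative assumptions; this is not the S2S engine's implementation.

```python
import sqlite3

def translate(columns, conditions, table="series"):
    """Build a SQL query from projected columns and filter conditions.

    Hypothetical helper: each condition is (column, operator, value) and
    values are passed as bound parameters rather than inlined.
    """
    allowed_ops = {"<", ">", "<=", ">=", "="}
    clauses, params = [], []
    for column, op, value in conditions:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    sql = (f"SELECT {', '.join(columns)} FROM {table} "
           f"WHERE {' AND '.join(clauses)}")
    return sql, params

# Illustrative database tier: the first three rows follow the slide's table,
# the fourth is made up so that the filters exclude something.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (Time INTEGER, Rainfall REAL, Temp REAL)")
conn.executemany(
    "INSERT INTO series VALUES (?, ?, ?)",
    [(1, 0.1, 30.0), (2, 0.1, 30.1), (3, 0.1, 30.1), (4, 2.5, 18.0)],
)

# Filters and projection from the operator tree, with t1 = 1 and t2 = 4.
sql, params = translate(
    ["Temp", "Rainfall", "Time"],
    [("Temp", ">", 20.0), ("Rainfall", "<", 1.0), ("Time", ">", 1), ("Time", "<", 4)],
)
rows = sorted(conn.execute(sql, params).fetchall(), key=lambda r: r[2])
print(sql)
print(rows)
```

Bound parameters keep the literal values out of the query string, which is one reasonable design choice for a translator that receives filter values from user queries.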
  16. 16 Map-Match-Operate Evaluation (C4, Section 4.3 + 4.4, Figure 4.5 + 4.6, Table 4.5)

    Using the SPARQL-to-SQL (S2S) Engine.
    [Figure: Distributed SRBench. Max Time Taken (s) per Query (1 to 10) for S2S, TDB, GraphDB, Ontop and Morph. Values labelled on the plot: 47.08, 1328.20, 1747, 2097.]
    [Figure: Smart Home. Time Taken (s) per Query (1 to 4) for S2S, TDB and GraphDB. Value labelled on the plot: 527.]
    Hardware: RPi2, CPU 4 x 0.9GHz, RAM 1GB, SD card 15.6 MB/s.
    Scenarios: (1) Data: LinkedSensorData; Queries: 10x SRBench; Obs: 10mil / 34mil; Method: Distributed + Broadcast. (2) Data: Smart*; Queries: 4x SmartHomeBench; Obs: ~1mil; Method: Centralised Hub.
    Storage (S2S / TDB / GraphDB): Blizzard 1 / 68 / 1352; Ike 1 / 112 / 453; SmartHome 1 / 15 / 9.
  17. 17 TritanDB: Time-series Rapid IoT Analytics (C5, Section 5.2)

    The Rationale and A Study of Time-series Compression + Storage Data Structures.
    Conceptual 2D Database Tier: Columns representing Measurements, Rows over Time. Columns Hum, Rainfall, Temp: t1: 0.77, 0.1, 30.0. t2: 0.78, 0.1, 30.1. …
    Physical 1D Time-series Storage: Interface over block1, block2, … with Jump to Offset.
    Compression: Fast Floating Point Compression (fpc), Gorilla (gor), Delta-of-Delta, Delta-RLE-LEB128 (leb), Delta-RLE-Rice (rice).
    Data Structures: TrTables (Tr), B+ Tree (B+), Hash-Tree (#), LSM-Tree (lsm).
    Example (Timestamp, then Numerical Values Hum, Rainfall): t1: 0.77, 0.1. t2: 0.78, 0.1. t3: 0.79, 0.1. … Characteristics highlighted: Equal Values, Small Deltas, High Precision.
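As a rough illustration of the delta-of-delta idea listed under Compression, the sketch below transforms a list of integer timestamps into a first value followed by delta-of-deltas, and back. It covers only the integer transform, not any bit-level packing, and the function names and sample timestamps are assumptions made for the example rather than TritanDB's code.

```python
def delta_of_delta_encode(timestamps):
    """Return [first timestamp, first delta, then delta-of-deltas]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0  # so the second element is the first delta itself
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded

def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode by accumulating deltas."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# Nearly periodic timestamps (illustrative) collapse to mostly zeros.
samples = [1000, 1060, 1120, 1180, 1245]
encoded = delta_of_delta_encode(samples)
print(encoded)
print(delta_of_delta_decode(encoded) == samples)
```

Near-periodic series produce long runs of zeros and small integers, which is what makes a subsequent variable-length or run-length coding stage effective.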
  18. 18 TrTables: TritanDB Tables (C5, Section 5.2.5.4)

    A Novel Time-Partitioned Block Data Structure.
    1 Quantum Re-ordering Buffer (QRB): Supporting out-of-order timestamp insertion. With quantum q = 6, rows arrive with timestamps 1496331968, …2870, …3700, …5840, …4920, …6910. On Quantum Expiration, an Insertion Sort orders them as 1496331968, …2870, …3700, …4920, …5840, …6910.
    Flush: The Sorted Buffer (a x q = 4) at the sorted QRB’s head is flushed to a memtable. The remaining (1 – a) x q = 2 rows (1496335840, …6910) stay in the QRB while new rows (...7890, …5970) ingress.
    2 Memtable: Blocks plus Index Entries (Timestamp, Offset, Avg, Max, Min, Count). Block layout: Header (Timestamp 1496331000); First Entry (δ 968, Row); Subsequent Entries (ε ‘110’, δδ -66, Row).
    3 TrTables: When Memtable Blocks Reach bsize = 64KB they are Flushed to Disk. Each TrTable receives the New Block (1496331000; 968, Row; ‘110’, -66, Row) and a New Index Entry (Timestamp, Offset, Avg, Max, Min, Count). Files: series.idx, series.tsc.
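The re-ordering behaviour described for the QRB can be sketched as follows. This is a simplified model under stated assumptions: the buffer flushes when it holds q rows (the slide flushes on quantum expiration), the memtable is a plain list, and the class name, row strings and short integer timestamps (the visible trailing digits from the slide) are illustrative.

```python
import bisect

class QuantumReorderingBuffer:
    """Toy QRB: keeps rows sorted by timestamp, flushes the head to a memtable."""

    def __init__(self, q=6, a=2 / 3):
        self.q = q                       # quantum: rows held before a flush
        self.flush_count = round(a * q)  # a x q rows flushed each time
        self.buffer = []                 # (timestamp, row), kept sorted
        self.memtable = []               # flushed rows, in timestamp order

    def insert(self, timestamp, row):
        # Insertion into a sorted list handles out-of-order arrivals.
        bisect.insort(self.buffer, (timestamp, row))
        if len(self.buffer) >= self.q:
            self._flush()

    def _flush(self):
        head = self.buffer[:self.flush_count]
        self.buffer = self.buffer[self.flush_count:]
        self.memtable.extend(head)

qrb = QuantumReorderingBuffer(q=6, a=2 / 3)
for ts in [1968, 2870, 3700, 5840, 4920, 6910]:  # 5840 arrives before 4920
    qrb.insert(ts, f"row@{ts}")
print([ts for ts, _ in qrb.memtable])  # the 4 earliest rows, sorted
print([ts for ts, _ in qrb.buffer])    # the 2 rows kept for late arrivals
```

Holding back (1 – a) x q rows gives late arrivals a chance to be sorted in before their neighbours are flushed; a row that arrives later than that would still land out of order in this sketch.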
  19. 19 TritanDB Design Section 5.3, Figure 5.8 + 5.9 C5

    A Ring-Buffer-Centered Design Journaler For crash recovery QRB, memtable, TrTables Ingesting time-series points Query SPARQL Query New Event Time-series point 1496331968 …2870 1496331000 968 ‘110’ -66 Ring Buffer Router Dealer request reply websocket/comet Parser MatchOp Query Grammar Parse Tree Operate result Walk Persistent Store In-memory Store Import Export Match SWIBRE SWIPE
  20. 20 1 10 100 1000 TritanDB InfluxDb Akumuli MongoDb OpenTSDB

    H2 SQL Cassandra ES Average Execution Time (log10 ms) Shelburne Taxi 1 10 100 TritanDB InfluxDb Akumuli MongoDb OpenTSDB H2 SQL Cassandra ES Average Execution Time (log10 ms) Shelburne Taxi TritanDB Evaluation Section 5.4, Figure 5.11 – 5.13 C5 Comparing Time-series Database Query Performance Cross-sectional S1 CPU 2 x 2.6GHz RAM 32GB Disk 380.7 MB/s Hardware S2 CPU 4 x 2.6GHz RAM 4GB Disk 372.9 MB/s RPi2 CPU 4 x 0.9GHz RAM 1GB SDc 15.6 MB/s Gizmo2 CPU 4 x 0.9GHz RAM 1GB SSD 154 MB/s 1 10 100 1000 TritanDB InfluxDb Akumuli MongoDb OpenTSDB H2 SQL Cassandra ES Average Execution Time (log10 ms) Shelburne Taxi Results from an average of Server 1 and 2 configurations are shown above. Deep-History Aggregation RAM CPU x86
  21. 21 Map-Match-Operate on Streams (C6, Section 6.1)

    Extending S2S for Stream Processing.
    RSP-QL for S2S: Implementation of the W3C Draft Specification for RDF Stream Processing (RSP) Engines in S2S.
    RSP-QL (CityBench Q2):

        SELECT ?v1 ?v2 ?v3
        FROM NAMED WINDOW :traffic ON <IRItraffic> [RANGE 3s]
        FROM NAMED WINDOW :weather ON <IRIweather> [RANGE 3s]
        WHERE {
          WINDOW :weather {
            ?obId1 a ssn:Observation; ssn:observedProperty ?p1; sao:hasValue ?v1.
            ?p1 a ct:Temperature.
            ?obId2 a ssn:Observation; ssn:observedProperty ?p2; sao:hasValue ?v2.
            ?p2 a ct:Humidity.
          }
          WINDOW :traffic {
            ?obId3 a ssn:Observation; ssn:observedProperty ?p3; sao:hasValue ?v3.
            ?p3 a ct:CongestionLevel.
          }
        }

    SPARQL Algebra Tree: Project ?v1,?v2,?v3 over Join of (Window :traffic over Graph traffic) and (Window :weather over Graph weather).
    Translation to EPL: Event Processing Language executed by Esper CEP Engine.
    EPL:

        SELECT weather.tempm AS v1, weather.hum AS v2, traffic.congestionLevel AS v3
        FROM weather.win:time(3 sec), traffic.win:time(3 sec)
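The final step of this translation, emitting the EPL text once variables have been matched to stream fields, can be sketched as a small string builder. The `to_epl` function and its input shapes are assumptions for illustration; in S2S the variable-to-field bindings would come from the mappings rather than being hard-coded as they are here.

```python
def to_epl(projections, windows):
    """Build an EPL statement.

    projections: list of (stream, field, alias) tuples (assumed already
                 resolved from query variables by the match step).
    windows:     dict of stream name -> time window length in seconds.
    """
    select = ", ".join(f"{stream}.{field} AS {alias}"
                       for stream, field, alias in projections)
    sources = ", ".join(f"{stream}.win:time({seconds} sec)"
                        for stream, seconds in windows.items())
    return f"SELECT {select} FROM {sources}"

# Bindings and windows taken from the CityBench Q2 example on the slide.
epl = to_epl(
    [("weather", "tempm", "v1"),
     ("weather", "hum", "v2"),
     ("traffic", "congestionLevel", "v3")],
    {"weather": 3, "traffic": 3},
)
print(epl)
```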
  22. 22 Evaluating S2S for Streams (C6, Section 6.1.3)

    Using SRBench and Comparing Against CQELS.
    Ratio S2S : CQELS (1:?) per Query.
    SRBench: Q1 294; Q2 261; Q3 306; Q4 277k; Q5 3243k; Q7 5245; Q8 426; Q9 280k; Q10 98.
    Smart Home: Q1 196; Q2 167.
    Hardware: RPi2, CPU 4 x 0.9GHz, RAM 1GB, SD card 15.6 MB/s.
    Scenarios: (1) Data: LinkedSensorData; Queries: 10x SRBench; Obs: 10mil / 34mil; Method: Distributed + Broadcast. (2) Data: Smart*; Queries: 4x SmartHomeBench; Obs: ~1mil; Method: Centralised Hub.
  23. 23 Eywa’s Design (C6, Section 6.2, Figure 6.7)

    Eywa: A Fog Computing Infrastructure for Distributed Stream Query Processing.
    Process: Stream Processing by Query Translation.
    Distribute: Workload Distribution by Projection Pushdown.
    Deliver: Inverse-publish-subscribe.
    Example from CityBench Query 2: Finding the traffic congestion level and weather conditions of my planned journey.
    Nodes: Broker (Well-known); Source 1 (S1) Produces Stream μ1; Source 2 (S2) Produces Stream μ2; Client Issues Query q1.
    1 Deliver Queries: Source 1 Processing q1. S1 Produces a Projection Tree from Query.
    2 Distribute Processing: Project Streams: ?v1, ?v2 (temp, hum) and ?v3 (congestionLevel). No Extra Join Variables.
    Client Receives the Projection τ: temp, hum, congestionLevel.
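Projection pushdown, as named on this slide, can be illustrated with a minimal sketch in which each source keeps only the fields the client's query projects. The function names, the source schemas and the extra fields `pressure` and `avgSpeed` are hypothetical; this is not Eywa's implementation and it ignores the broker and delivery steps.

```python
def push_down_projection(query_fields, source_schemas):
    """Split the client's projected fields into one projection per source."""
    plan = {}
    for source, schema in source_schemas.items():
        needed = [field for field in query_fields if field in schema]
        if needed:
            plan[source] = needed
    return plan

def project(event, fields):
    """Applied at the source: drop every field the query does not need."""
    return {field: event[field] for field in fields}

# Fields projected by the query (temp, hum, congestionLevel as on the slide).
query_fields = ["temp", "hum", "congestionLevel"]
source_schemas = {
    "weather": {"temp", "hum", "pressure"},        # 'pressure' is made up
    "traffic": {"congestionLevel", "avgSpeed"},    # 'avgSpeed' is made up
}
plan = push_down_projection(query_fields, source_schemas)
print(plan)

weather_event = {"temp": 21.5, "hum": 0.63, "pressure": 1012}
print(project(weather_event, plan["weather"]))
```

Projecting at the source means unused fields never leave the device, which is the workload-distribution benefit the slide attributes to pushing projections down.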
  24. 24 Eywa Evaluation (C6, Section 6.3, Figure 6.9 + 6.13 + 6.16)

    Performance and Scalability with CityBench on Smart City Traffic, Parking, Events, Weather and Pollution IoT Data.
    [Figure: Latency, CityBench Q1 (Traffic congestion level on two roads). Latency (s) against Experiment Time (min).] C-SPARQL: High Latency. CQELS: Some Fluctuations, Medium Latency. Eywa: Stable, low latency.
    [Figure: Scalability, CityBench Q2 (Traffic congestion level and weather). Memory Consumed (MB) against Experiment Time (min).] C-SPARQL: High Memory Consumption, Large Fluctuations. CQELS: Medium Memory Consumption. Eywa: Very Low Memory Consumption.
    [Figure: Fog Scalability, CityBench Q10 (Most Polluted Area). Memory Consumed (MB) against Experiment Time (min) for Single 5 Streams, Single 8 Streams, Fog 5 Streams and Fog 8 Streams.]
    Hardware: RPi2, CPU 4 x 0.9GHz, RAM 1GB, SD card 15.6 MB/s.
    Setup: Single Node: 1x Client, Source, Broker. Fog: 1x Client, Broker; 2x Source, Broker.
    Tests were run for 15 minutes and averaged over 3 runs.
  25. 25 PIOTRe (C7, Section 7.1)

    PIOTRe: Personal IoT Repository and data store for resource-constrained devices.
    S2S: PIOTRe integrates S2S to perform efficient and interoperable storage and retrieval of IoT data and streams.
    Eywa Node: PIOTRe allows you to join an Eywa Network as a source or client.
    Applications: 2 example applications are deployed: a Smart Home dashboard and an analytics tool.
    Publish Metadata with Hypercat.
  26. 26 TritanDB Analytics (C7, Section 7.2.1, Figure 7.5)

    01 Resampling: Conversion of Non-Periodic / Unevenly-Spaced Time-series. There are more methods to analyse evenly-spaced time-series data than methods for unevenly-spaced time-series data.
    SPARQL 1.1: SELECT AVG(?val) ?hours WHERE … GROUP BY (hours(?time) as ?hours)
    TritanDB SPARQL: SELECT AVG(?val) ?hours WHERE … GROUP BY (SMA(?time,tau) as ?hours)
    02 Simple Moving Average: SMA = area / tau, where area is the area under the Value against Time curve between t-tau and t.
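The relation SMA = area / tau can be made concrete with a short sketch for unevenly-spaced samples. The slide does not say how values are interpolated between samples, so this example assumes a step function (each value holds until the next sample) and, if the window starts before the first sample, extends the first value backwards. The function name `sma_last` and the sample data are illustrative, and `times` is assumed to be sorted in ascending order.

```python
def sma_last(times, values, t, tau):
    """Simple moving average over the window (t - tau, t].

    Computes the area under a step function through the samples
    (assumed interpolation) and divides it by the window length tau.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not times or len(times) != len(values):
        raise ValueError("times and values must be non-empty and equal in length")
    start = t - tau
    area = 0.0
    for i, value in enumerate(values):
        # Each sample holds from its own time until the next sample (or t).
        seg_start = start if i == 0 else max(times[i], start)
        seg_end = min(times[i + 1], t) if i + 1 < len(times) else t
        if seg_end > seg_start:
            area += value * (seg_end - seg_start)
    return area / tau

# Unevenly-spaced samples: value 2.0 at t=0, 4.0 at t=1, 6.0 at t=4.
times = [0, 1, 4]
values = [2.0, 4.0, 6.0]
# Window (1, 5]: 4.0 holds for 3 time units and 6.0 for 1 unit.
print(sma_last(times, values, t=5, tau=4))
```

Weighting each value by how long it was in effect keeps densely sampled stretches from dominating the average, which a plain mean over the samples would not do.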
  27. 27 The Social Web of Things (C7, Section 7.3.1, Figure 7.6)

    Layered Design of a Social-Capital Inspired SWoT.
    01 Infrastructure: Sensors and Networking (IoT Core).
    02 Primary App: SWoT Core: Social Graph and Common Interface.
    03 Patterns: Decentralisation and Collaborative Work.
    04 Tertiary Apps: Builds on core and patterns to provide apps.
    Building and Maintaining a Network; Information Flow; Feedback and Interoperability.
  28. 28 The Social Web of Things: Human-to-Machine-Translation and a Pub-Sub Infrastructure (C7, Section 7.3.2 + 7.3.3, Figure 7.7 + 7.8)

    Human-Driven to Machine-Automated, across Network Management, Information Flow, and Patterns and Value (SWoT Functions).
    Network Management: • Web2.0/Mobile • Rule-Based • Algorithmic/Learned • Strong AI • Manual/Policy • Edge Prediction • Game Theory • Trustless Nets • Manual • Semantically-enriched • Self-awareness
    Information Flow: • Messaging Clients • Chatbots • Neural Representations • Per Message • Aspects/Policy • P2P Trust • Social Capital • Manual • Content/Collaborative Filtering • Machine-Reasoning
    Patterns and Value: • Collaborative Editing • Crowd-sourcing • Trustless/Proof • Manual • Big Data Model-Driven • Big Data AI-Driven • Descriptive • Diagnostic • Discovery • Predictive • Prescriptive
    SWoT Functions: Management Interface, Social Graph and Profile; Messaging, Sharing and Feed Management; Collaboration, Apps and Analytics.
    [Figure: Selective-Subscribe. Steps 1 to 5 through Broker, RelLogic and Auth; publishers p(1), p(2), p(3); subscribers s(1), s(2) and s(2), s(3).]
    [Figure: Selective-Publish. Steps 1 to 5 through RelLogic, Broker and Auth; publishers p(1), p(2), p(3), then p(4), p(4), p(5), p(5); subscribers s(4), s(5).]
    1.2 – 4.2x better.
  29. 29 Limitations Leading to Future Work (C8, Section 8.2)

    Focus of this Thesis: Current time-series IoT data characteristics, a particular graph model, RDF, and processing done on a single node or in a co-operative cloud scenario.
    Multimodal Data: Support for complex data structures and images, sounds and unstructured text.
    Other Graph/Tree Models: Extending Map-Match-Operate to work on other Graph/Tree models.
    RDF Property Paths: Support chain of predicates and predicate cardinality (arbitrary number of hops), across tiers.
    Horizontal Scalability: Feasibility of partitioning time-series data across instances.
  30. 30 Opportunities Leading to Future Work (C8, Section 8.3 + 8.4)

    Eywa Fog Computing Operators: Studying how various workload operators, besides projections, can be processed in the data plane between the source and client nodes.
    Social Web of Things Knowledge Graph: Research on utilising the knowledge graph formed from integrating mappings of Thing metadata provided as S2SML.
    Final Remarks: Map-Match-Operate, TritanDB and Eywa help achieve efficient storage and retrieval of a semantically interoperable representation of data and metadata, both historical and real-time, utilising the unique characteristics of Internet of Things data. The thesis goes beyond the immediate performance and interoperability of the solutions proposed, showing how they can lead to analytical applications and platforms in different directions.