Slide 1

Slide 1 text

A Scalable Distributed Spatial Index for the Internet-of-Things
Anand Iyer, Ion Stoica
ACM SoCC, September 26, 2017

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Big Data Analytics

Slide 4

Slide 4 text

Big Data Analytics
• From batch data to advanced analytics

Slide 5

Slide 5 text

Big Data Analytics
• From batch data to advanced analytics
• From live data to real-time decisions

Slide 6

Slide 6 text

IoT Data Challenge #1: Inherently geospatial data
• Complex polygons
• Existing spatial indices not designed for dynamic data

Slide 8

Slide 8 text

IoT Data Challenge #1: Inherently geospatial data
• Complex polygons
• Existing spatial indices not designed for dynamic data
Need robust dynamic spatial indexing

Slide 9

Slide 9 text

IoT Data Challenge #2: Human generated → Machine generated
• Location Based Services (LBS) → Spatial analytics

Slide 11

Slide 11 text

IoT Data Challenge #2: Human generated → Machine generated
• Location Based Services (LBS) → Spatial analytics
Need online ingestion at massive rates

Slide 12

Slide 12 text

IoT Data Challenge #3: Heavily skewed
• Operating on fresh data better than using stale data at all
• Post-ingestion load-balancing not sufficient

Slide 13

Slide 13 text

IoT Data Challenge #3: Heavily skewed
• Operating on fresh data better than using stale data at all
• Post-ingestion load-balancing not sufficient
Need good performance under skews

Slide 14

Slide 14 text

Current solutions are not good enough
More details in the paper

Slide 15

Slide 15 text

Current solutions are not good enough
• Suited to LBS workloads
More details in the paper

Slide 16

Slide 16 text

Problem: Ingest, index & query dynamic spatial data having unpredictable skews at unprecedented rates

Slide 17

Slide 17 text

Problem: Ingest, index & query dynamic spatial data having unpredictable skews at unprecedented rates
SIFT: Robust, skew-resistant, massively parallel spatial index

Slide 18

Slide 18 text

SIFT Design

Slide 19

Slide 19 text

SIFT Design
• Basic data structure: simple tree
• Never split/merge tree nodes
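To make the "never split/merge" idea concrete, here is a minimal sketch of a tree whose node boundaries are fixed once a node exists. Everything in it is an assumption made for illustration and is not taken from the slides or from SIFT itself: the quadrant decomposition, the `Node` class, the `max_depth` cap, and the rule that an object lives at the deepest cell that fully contains its bounding box.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Node:
    """Hypothetical tree node; its bounds are fixed for its whole lifetime."""
    bounds: BBox
    depth: int = 0
    items: List[BBox] = field(default_factory=list)
    children: Dict[int, "Node"] = field(default_factory=dict)

    def quadrant_of(self, box: BBox) -> Optional[int]:
        """Return the quadrant (0..3) that fully contains box, else None."""
        min_x, min_y, max_x, max_y = self.bounds
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        bx0, by0, bx1, by1 = box
        if bx0 >= mid_x:
            qx = 1
        elif bx1 <= mid_x:
            qx = 0
        else:
            return None  # box straddles the vertical midline
        if by0 >= mid_y:
            qy = 1
        elif by1 <= mid_y:
            qy = 0
        else:
            return None  # box straddles the horizontal midline
        return qx + 2 * qy

    def child_bounds(self, q: int) -> BBox:
        """Bounds of quadrant q; derived from the parent, never adjusted later."""
        min_x, min_y, max_x, max_y = self.bounds
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        x0, x1 = (mid_x, max_x) if q & 1 else (min_x, mid_x)
        y0, y1 = (mid_y, max_y) if q & 2 else (min_y, mid_y)
        return (x0, y0, x1, y1)

    def insert(self, box: BBox, max_depth: int = 8) -> "Node":
        """Store box at the deepest fixed cell containing it; create children lazily."""
        node = self
        while node.depth < max_depth:
            q = node.quadrant_of(box)
            if q is None:
                break
            if q not in node.children:
                node.children[q] = Node(node.child_bounds(q), node.depth + 1)
            node = node.children[q]
        node.items.append(box)
        return node
```

In this sketch an insert only ever appends to one node or creates a missing child, so existing nodes are never restructured. Large or boundary-straddling objects simply stay higher in the tree.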

Slide 22

Slide 22 text

SIFT Design
• Distributing data
• When/how to create children
• Skew-resistant design

Slide 24

Slide 24 text

SIFT Design
• Distributing data
• When/how to create children
• Skew-resistant design
The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 84

Slide 25

Slide 25 text

SIFT Design
What to use for distribution?

Slide 27

Slide 27 text

SIFT Design
What to use for distribution?
[Plot: Probability vs. Area (× 1 million m²)]

Slide 28

Slide 28 text

SIFT Design layers

Slide 29

Slide 29 text

SIFT Design
How to parallelize?
• Need to address node

Slide 30

Slide 30 text

SIFT Design
How to parallelize?
• Need to address node
• Key: (min_x, min_y), (max_x, max_y)
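As an illustration of addressing a node by its bounding box, the sketch below computes a `((min_x, min_y), (max_x, max_y))` key directly from a point and a level, without walking any tree. The function name `cell_key`, the longitude/latitude `WORLD` extent, and the rule that each level halves both axes are assumptions chosen for the example; the slides only state the shape of the key.

```python
from typing import Tuple

Point = Tuple[float, float]
Key = Tuple[Point, Point]  # ((min_x, min_y), (max_x, max_y))

# Assumed world extent in degrees (longitude, latitude); not from the slides.
WORLD: Key = ((-180.0, -90.0), (180.0, 90.0))


def cell_key(x: float, y: float, level: int, world: Key = WORLD) -> Key:
    """Bounding box of the level-`level` grid cell that contains (x, y).

    Level 0 is the whole world; each further level halves both axes.
    """
    (wx0, wy0), (wx1, wy1) = world
    cells = 1 << level                 # cells per axis at this level
    w = (wx1 - wx0) / cells            # cell width
    h = (wy1 - wy0) / cells            # cell height
    # Clamp so points on the upper world boundary fall in the last cell.
    col = min(int((x - wx0) / w), cells - 1)
    row = min(int((y - wy0) / h), cells - 1)
    return (
        (wx0 + col * w, wy0 + row * h),
        (wx0 + (col + 1) * w, wy0 + (row + 1) * h),
    )
```

Because the key depends only on the coordinates and the level, any machine can compute it independently, which is the property a parallel design needs when it has to address a node.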

Slide 31

Slide 31 text

SIFT Design

Slide 32

Slide 32 text

Cloud Network Latency
[Plot: Avg. Query Time (ms) vs. Number of Machines; series: No Locality, With Locality, No Locality (Batched)]

Slide 33

Slide 33 text

SIFT Design
How to parallelize?
• Need to address node

Slide 34

Slide 34 text

SIFT Design
How to parallelize?
• Need to address node
[Figure: grid cells numbered at successive levels: 0; 0 to 3; 0 to 15]

Slide 36

Slide 36 text

SIFT Design
[Figure: grid cells numbered at successive levels: 0; 0 to 3; 0 to 15]
Prefixes shown: 0, 00*, 11*, 01*, 10*
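The prefixes on this slide suggest cells numbered along a space-filling curve, where a child cell's number extends its parent's. The sketch below uses the textbook Hilbert-curve index computation to show that property. It is not the Google S2 API and not SIFT's code: the names `hilbert_index` and `cell_prefix`, and the 2 bits per level encoding, are assumptions made for illustration.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each level contributes one base-4 digit (2 bits) to the index.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-quadrant is traversed in the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def cell_prefix(order: int, x: int, y: int) -> str:
    """Bit-string form of the index: 2 bits per level, empty at order 0."""
    if order == 0:
        return ""
    return format(hilbert_index(order, x, y), "0{}b".format(2 * order))


# A cell's prefix starts with its parent's prefix, so every cell below a
# given quadrant can be selected with a pattern such as "01*".
child = cell_prefix(2, 0, 2)    # a cell in the 4 x 4 grid
parent = cell_prefix(1, 0, 1)   # the 2 x 2 quadrant that contains it
assert child.startswith(parent)
```

With keys of this form, a range of the curve maps to a contiguous run of keys, which is what makes prefix-based partitioning across machines possible.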

Slide 37

Slide 37 text

Implementation
• MongoDB 3.2
• Google S2 Library for H-Curves
• ~2000 LoC
[Architecture diagram labels: mongos, mongod, KeyGenerator, QueryPlanner, ShardManager; Insert/Query in, Response out]

Slide 38

Slide 38 text

Evaluations
• Amazon EC2: 20 r4.xlarge instances, 30.5GB memory
• Performance compared against PostGIS & MongoDB
Dataset                        Records        Size
All landmarks in USA (Tiger)   122K           406 MB
All cities on Earth (OSM)      542K           844 MB
All parks on Earth (OSM)       234K           102 MB
All rivers on Earth (OSM)      555K           945 MB
Taxi trip records              1.1 billion    280 GB
Cellular network (partial)     500 million    2 TB
Table 2: Real-world datasets used in evaluations (from [27, 45, 49]).

Slide 40

Slide 40 text

Evaluations: Indexing

Slide 41

Slide 41 text

Evaluations: Indexing
[Plot: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

Slide 43

Slide 43 text

Evaluations: Indexing
[Plot: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Plot: Index Size (GB) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

Slide 45

Slide 45 text

Evaluations: Querying

Slide 46

Slide 46 text

Evaluations: Querying
[Plot: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

Slide 48

Slide 48 text

Evaluations: Querying
[Plot: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Plot: Probability vs. Query Time (ms); series: SIFT, MongoDB]

Slide 49

Slide 49 text

Evaluations: Skew Handling

Slide 50

Slide 50 text

Evaluations: Skew Handling
[Plot: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

Slide 51

Slide 51 text

Evaluations: Skew Handling
[Plot: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Plot: Chunks/Partition vs. Number of Objects Stored (Billions)]

Slide 52

Slide 52 text

Summary
Emerging IoT workloads are challenging:
• Inherently geospatial, heavy skews, unprecedented volume
• Need efficient support for storing & querying
Our solution, SIFT:
• Robust, skew-resistant, massively parallel
• Performs well