A Scalable Distributed Spatial Index for the Internet-of-Things

Anand Iyer, Ion Stoica
ACM SoCC, September 26, 2017

Big Data Analytics
• From batch data to advanced analytics
• From live data to real-time decisions

IoT Data Challenge #1: Inherently geospatial data
• Complex polygons
• Existing spatial indices not designed for dynamic data
Need robust dynamic spatial indexing

IoT Data Challenge #2: Human generated → Machine generated
• Location Based Services (LBS) → Spatial analytics
Need online ingestion at massive rates

IoT Data Challenge #3: Heavily skewed
• Operating on fresh data is better than using stale data
• Post-ingestion load-balancing not sufficient
Need good performance under skews

Current solutions are not good enough
• Suited to LBS workloads
• More details in the paper

Problem: Ingest, index & query dynamic spatial data having unpredictable skews at unprecedented rates
SIFT: Robust, skew-resistant, massively parallel spatial index

SIFT Design
• Basic data structure: simple tree
• Never split/merge tree nodes
• Distributing data
• When/how to create children
• Skew-resistant design
Reference: The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 84

SIFT Design
What to use for distribution?
[Figure: Probability vs. Area (× 1 million m²)]

SIFT Design
Layers
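The design points above (a simple tree, nodes that are never split or merged, children created on demand, objects spread over layers) can be illustrated with a small sketch. This is not the authors' implementation: the fanout of four, the rule "store an object at the deepest cell that fully contains its bounding box", and the names `Node`, `quadrant`, `contains`, `insert`, and `max_depth` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as (min_x, min_y, max_x, max_y); an assumed representation.
Box = Tuple[float, float, float, float]


@dataclass
class Node:
    cell: Box                      # region covered by this node (never changes)
    depth: int                     # layer of the tree this node lives in
    objects: List[Box] = field(default_factory=list)
    children: Dict[int, "Node"] = field(default_factory=dict)  # created lazily


def quadrant(cell: Box, i: int) -> Box:
    """Return the i-th of the four equal sub-cells of `cell`."""
    min_x, min_y, max_x, max_y = cell
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    return [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ][i]


def contains(cell: Box, box: Box) -> bool:
    """True if `box` lies entirely inside `cell`."""
    return (cell[0] <= box[0] and cell[1] <= box[1]
            and box[2] <= cell[2] and box[3] <= cell[3])


def insert(root: Node, box: Box, max_depth: int = 8) -> Node:
    """Store `box` at the deepest node whose cell fully contains it.

    Existing nodes are never split or merged; a child is only created
    when an object needs it.
    """
    node = root
    while node.depth < max_depth:
        for i in range(4):
            sub = quadrant(node.cell, i)
            if contains(sub, box):
                if i not in node.children:
                    node.children[i] = Node(sub, node.depth + 1)
                node = node.children[i]
                break
        else:
            break  # box straddles sub-cells, so it stays at this layer
    node.objects.append(box)
    return node
```

Under this assumed rule, small objects sink to deep layers while large or boundary-straddling objects stay near the root, so inserting an object never forces existing data to move.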

SIFT Design
How to parallelize? Need to address node
Key: (min_x, min_y), (max_x, max_y)
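As a minimal sketch of the key shape shown on the slide, (min_x, min_y), (max_x, max_y), the snippet below derives such a key from a polygon's vertices and tests two keys for overlap. The function names `bounding_box_key` and `keys_intersect` are hypothetical, and how SIFT actually uses the key internally is not shown here.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Key = Tuple[Point, Point]  # ((min_x, min_y), (max_x, max_y))


def bounding_box_key(points: Iterable[Point]) -> Key:
    """Compute the axis-aligned bounding box of a polygon's vertices."""
    pts = list(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys)), (max(xs), max(ys))


def keys_intersect(a: Key, b: Key) -> bool:
    """True if the two bounding boxes overlap (touching edges count)."""
    (a_min_x, a_min_y), (a_max_x, a_max_y) = a
    (b_min_x, b_min_y), (b_max_x, b_max_y) = b
    return (a_min_x <= b_max_x and b_min_x <= a_max_x
            and a_min_y <= b_max_y and b_min_y <= a_max_y)
```

For example, a triangle with vertices (2, 1), (5, 4), (3, 7) gets the key ((2, 1), (5, 7)).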

Cloud Network Latency
[Figure: Avg. Query Time (ms) vs. Number of Machines; series: No Locality, With Locality, No Locality (Batched)]

SIFT Design
[Figure: grid cells numbered 0 to 3 and 0 to 15, with keys 0, 00*, 01*, 10*, 11*]
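The cell numbering above can be reproduced with a Hilbert-curve index, where a cell's key at one level is a bit-prefix of the keys of the cells it contains. The sketch below uses the widely published iterative xy-to-index conversion, not the Google S2 library named in the implementation, so orientation and cell IDs will differ from S2's. The names `xy2d` and `cell_prefix` are illustrative.

```python
def xy2d(n: int, x: int, y: int) -> int:
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along a Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def cell_prefix(d: int, level: int) -> str:
    """Write a level-`level` cell index as a 2*level-bit prefix key."""
    return bin(d)[2:].zfill(2 * level) + "*"


def parent(d: int) -> int:
    """Index of the enclosing cell one level up (drop the last two bits)."""
    return d >> 2
```

With this conversion, the bottom row of the 4 × 4 grid is numbered 0, 1, 14, 15, and dropping the last two bits of any level-2 index gives the level-1 cell that contains it, which is what makes prefix keys such as `11*` address a whole subtree.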

Implementation
• MongoDB 3.2
• Google S2 Library for H-Curves
• ~2000 LoC
[Diagram components: mongos, mongod, KeyGenerator, QueryPlanner, ShardManager; Insert/Query in, Response out]

Evaluations
• Amazon EC2: 20 r4.xlarge instances, 30.5GB memory
• Performance compared against PostGIS & MongoDB

Table 2: Real-world datasets used in evaluations (from [27, 45, 49]).
Dataset                        Records       Size
All landmark in USA (Tiger)    122K          406 MB
All cities in earth (OSM)      542K          844 MB
All parks in earth (OSM)       234K          102 MB
All rivers in earth (OSM)      555K          945 MB
Taxi trip records              1.1 billion   280 GB
Cellular network (partial)     500 million   2 TB

Evaluations: Indexing
[Figure: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Figure: Index Size (GB) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

Evaluations: Querying
[Figure: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Figure: Probability vs. Query Time (ms); series: SIFT, MongoDB]

Evaluations: Skew Handling
[Figure: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
[Figure: Chunks/Partition vs. Number of Objects Stored (Billions)]

Summary
Emerging IoT workloads challenging
• Inherently geospatial, heavy skews, unprecedented volume
• Need efficient support for storing & querying
Our solution, SIFT:
• Robust, skew-resistant, massively parallel
• Performs well
