A Scalable Distributed Spatial Index for the Internet-of-Things

Anand Iyer
September 26, 2017


Transcript

  1. A Scalable Distributed Spatial Index for the Internet-of-Things

    Anand Iyer, Ion Stoica. ACM SoCC, September 26, 2017
  5. Big Data Analytics

    From batch data to advanced analytics; from live data to real-time decisions
  8. IoT Data Challenge #1: Inherently geospatial data

    • Complex polygons
    • Existing spatial indices not designed for dynamic data
    Need: robust dynamic spatial indexing
  11. IoT Data Challenge #2: Human generated → Machine generated

    • Location Based Services (LBS) → Spatial analytics
    Need: online ingestion at massive rates
  13. IoT Data Challenge #3: Heavily skewed

    • Operating on fresh data is better than using stale data
    • Post-ingestion load-balancing is not sufficient
    Need: good performance under skews
  15. Current solutions are not good enough

    • Suited to LBS workloads
    • More details in the paper
  17. Problem: Ingest, index & query dynamic spatial data having unpredictable skews at unprecedented rates

    SIFT: Robust, skew-resistant, massively parallel spatial index
  21. SIFT Design

    • Basic data structure: simple tree
    • Never split/merge tree nodes
  24. SIFT Design

    • Distributing data
    • When/how to create children
    • Skew-resistant design
    Reference: The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 84
  27. SIFT Design: What to use for distribution?

    [Plot: Probability vs. Area (× 1 million m²)]
  28. SIFT Design layers

  30. SIFT Design: How to parallelize? Need to address node

    Key: (min_x, min_y), (max_x, max_y)
  31. SIFT Design

  32. Cloud Network Latency

    [Plot: Avg. Query Time (ms) vs. Number of Machines; series: No Locality, With Locality, No Locality (Batched)]
  36. SIFT Design: How to parallelize? Need to address node

    [Figure: grid cells numbered 0, then 0 to 3, then 0 to 15; node prefixes 0, 00*, 01*, 10*, 11*]
  37. Implementation

    • MongoDB 3.2
    • Google S2 Library for H-Curves
    • ~2000 LoC
    [Diagram components: mongod, mongos, KeyGenerator, QueryPlanner, ShardManager; flows: Insert/Query, Response]
  39. Evaluations

    • Amazon EC2: 20 r4.xlarge instances, 30.5 GB memory
    • Performance compared against PostGIS & MongoDB

    Dataset                         Records        Size
    All landmarks in USA (Tiger)    122K           406 MB
    All cities on Earth (OSM)       542K           844 MB
    All parks on Earth (OSM)        234K           102 MB
    All rivers on Earth (OSM)       555K           945 MB
    Taxi trip records               1.1 billion    280 GB
    Cellular network (partial)      500 million    2 TB

    Table 2: Real-world datasets used in evaluations (from [27, 45, 49]).
  44. Evaluations: Indexing

    [Plot: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
    [Plot: Index Size (GB) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
  48. Evaluations: Querying

    [Plot: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
    [Plot: Probability vs. Query Time (ms); series: SIFT, MongoDB]
  51. Evaluations: Skew Handling

    [Plot: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
    [Plot: Chunks/Partition vs. Number of Objects Stored (Billions)]
  52. Summary

    Emerging IoT workloads are challenging
    • Inherently geospatial, heavy skews, unprecedented volume
    • Need efficient support for storing & querying
    Our solution, SIFT:
    • Robust, skew-resistant, massively parallel
    • Performs well
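
The "How to parallelize? Need to address node" slides give a bounding-box key, (min_x, min_y), (max_x, max_y), and show cell prefixes such as 00*, 01*, 10*, 11*. The slides do not spell out how a key maps to a node, so the following is only a rough sketch of one way such addressing could work, not SIFT's actual implementation (which, per the Implementation slide, relies on the Google S2 library). Everything here is an assumption made for illustration: the function names `bounding_box`, `hilbert_index`, and `node_address`, the integer grid coordinates in [0, 2^max_order), and the rule that an object is addressed to the deepest cell that fully contains its bounding box.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def bounding_box(polygon: List[Point]) -> Tuple[Point, Point]:
    """Return the key ((min_x, min_y), (max_x, max_y)) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys)), (max(xs), max(ys))


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over a 2^order x 2^order grid.

    Standard iterative conversion: consume one bit of x and y per level,
    emit two bits of the index, then rotate/flip the remaining coordinates.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def node_address(polygon: List[Point], max_order: int) -> Tuple[int, str]:
    """Hypothetical addressing rule: (level, bit prefix) of the deepest cell
    that fully contains the polygon's bounding box.

    Level 0 is the root (empty prefix). Each deeper level appends two bits,
    so a level-1 node has a prefix such as "00" or "11".
    """
    (min_x, min_y), (max_x, max_y) = bounding_box(polygon)
    level = 0
    for k in range(1, max_order + 1):
        shift = max_order - k
        # Both corners must fall in the same level-k cell.
        if (min_x >> shift, min_y >> shift) != (max_x >> shift, max_y >> shift):
            break
        level = k
    shift = max_order - level
    d = hilbert_index(level, min_x >> shift, min_y >> shift)
    prefix = format(d, "b").zfill(2 * level) if level else ""
    return level, prefix


if __name__ == "__main__":
    triangle = [(0, 0), (1, 1), (1, 0)]
    print(bounding_box(triangle))      # ((0, 0), (1, 1))
    print(node_address(triangle, 2))   # (1, '00')
```

On a 4 x 4 grid (`max_order=2`), the triangle above fits inside one level-1 quadrant but not inside a single level-2 cell, so it is addressed to the level-1 node with prefix 00. Because the cell boundaries are fixed in advance, the address can be computed from the key alone, without consulting or restructuring the tree, which is in the spirit of the "never split/merge tree nodes" design point.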