A Scalable Distributed Spatial Index for the Internet-of-Things

Anand Iyer
September 26, 2017

Transcript

  1. A Scalable Distributed Spatial
    Index for the Internet-of-Things
    Anand Iyer, Ion Stoica
    ACM SoCC
    September 26, 2017

  3. Big Data Analytics

  4. Big Data Analytics
    From batch data to advanced analytics

  5. Big Data Analytics
    From batch data to advanced analytics
    From live data to real-time decisions

  6. IoT Data Challenge #1
    Inherently geospatial data
    • Complex polygons
    • Existing spatial indices not designed for dynamic data

  8. IoT Data Challenge #1
    Inherently geospatial data
    • Complex polygons
    • Existing spatial indices not designed for dynamic data
    Need robust dynamic spatial indexing

  9. IoT Data Challenge #2
    Human generated → Machine generated
    • Location Based Services (LBS) → Spatial analytics

  11. IoT Data Challenge #2
    Human generated → Machine generated
    • Location Based Services (LBS) → Spatial analytics
    Need online ingestion at massive rates

  12. IoT Data Challenge #3
    Heavily skewed
    • Operating on fresh data better than using stale data at all
    • Post-ingestion load-balancing not sufficient

  13. IoT Data Challenge #3
    Heavily skewed
    • Operating on fresh data better than using stale data at all
    • Post-ingestion load-balancing not sufficient
    Need good performance under skews

  14. Current solutions are not good enough
    More details in the paper

  15. Current solutions are not good enough
    More details in the paper
    Suited to LBS workloads

  16. Problem: Ingest, index & query dynamic
    spatial data having unpredictable skews
    at unprecedented rates

  17. Problem: Ingest, index & query dynamic
    spatial data having unpredictable skews
    at unprecedented rates
    SIFT: Robust, skew-resistant, massively
    parallel spatial index

  18. SIFT Design

  19. SIFT Design
    Basic data structure: simple tree
    Never split/merge tree nodes
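
The slide states only that the basic structure is a simple tree whose nodes are never split or merged. One way to obtain that property, offered here as a sketch under stated assumptions and not as SIFT's actual algorithm, is a fixed quadrant decomposition: every node covers a region determined purely by its position in the tree, children are created lazily, and an object is stored at the deepest node whose region fully contains its bounding box. The names `Node`, `insert`, `quadrant_cell`, and `MAX_DEPTH`, and the placement rule itself, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

MAX_DEPTH = 8  # hypothetical depth cap, not a value from the talk


@dataclass
class Node:
    cell: Box                      # fixed region; never changes after creation
    depth: int
    items: List[Tuple[Box, Any]] = field(default_factory=list)
    children: Dict[int, "Node"] = field(default_factory=dict)


def quadrant_cell(cell: Box, q: int) -> Box:
    """Return quadrant q (0..3) of a cell: 0=SW, 1=SE, 2=NW, 3=NE."""
    min_x, min_y, max_x, max_y = cell
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    x0, x1 = (min_x, mid_x) if q % 2 == 0 else (mid_x, max_x)
    y0, y1 = (min_y, mid_y) if q < 2 else (mid_y, max_y)
    return (x0, y0, x1, y1)


def contains(cell: Box, box: Box) -> bool:
    """True if box lies entirely inside cell (boundaries included)."""
    return (cell[0] <= box[0] and cell[1] <= box[1]
            and box[2] <= cell[2] and box[3] <= cell[3])


def insert(root: Node, box: Box, payload: Any) -> Node:
    """Store payload at the deepest node whose cell contains box.

    Existing nodes are never split or merged; the only structural change
    is the lazy creation of a child the first time it is needed.
    """
    node = root
    while node.depth < MAX_DEPTH:
        for q in range(4):
            child_cell = quadrant_cell(node.cell, q)
            if contains(child_cell, box):
                if q not in node.children:
                    node.children[q] = Node(child_cell, node.depth + 1)
                node = node.children[q]
                break
        else:
            break  # box straddles a quadrant boundary: keep it here
    node.items.append((box, payload))
    return node
```

Under these assumptions a large polygon that straddles the centre of the space stays near the root, while a small one sinks to a deep node, and no insertion ever forces existing nodes to be reorganised.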

  22. SIFT Design
    Distributing data
    When/how to create children
    Skew-resistant design

  24. SIFT Design
    Distributing data
    When/how to create children
    Skew-resistant design
    The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 84

  25. SIFT Design
    What to use for distribution?

  27. SIFT Design
    What to use for distribution?
    [Plot: Probability vs. Area (× 1 million m²)]

  28. SIFT Design layers

  29. SIFT Design
    How to parallelize?
    Need to address node

  30. SIFT Design
    How to parallelize?
    Need to address node
    Key: (min_x, min_y), (max_x, max_y)
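
The key on this slide is a pair of corner points, which reads like an axis-aligned bounding box. A minimal sketch of deriving such a key, assuming it is the bounding box of an object's vertices; the function name `bounding_box_key` and the vertex-list input format are illustrative and do not come from the talk.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def bounding_box_key(vertices: Iterable[Point]) -> Tuple[Point, Point]:
    """Return ((min_x, min_y), (max_x, max_y)) for a set of vertices."""
    pts = list(vertices)
    if not pts:
        raise ValueError("need at least one vertex")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys)), (max(xs), max(ys))
```

For a triangle with vertices (1.0, 2.0), (4.0, 0.5), and (3.0, 5.0), this yields ((1.0, 0.5), (4.0, 5.0)). A single point collapses to a key whose two corners coincide.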

  31. SIFT Design

  32. Cloud Network Latency
    [Plot: Avg. Query Time (ms) vs. Number of Machines; series: No Locality, With Locality, No Locality (Batched)]

  33. SIFT Design
    How to parallelize?
    Need to address node

  34. SIFT Design
    How to parallelize?
    Need to address node
    [Figure: grid cells numbered at successive levels: 0; 0 to 3; 0 to 15]

  36. SIFT Design
    [Figure: grid cells numbered at successive levels (0; 0 to 3; 0 to 15), with nodes addressed by prefix: 0, 00*, 01*, 10*, 11*]
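
The cell numbers and the 00*, 01*, 10*, 11* labels on this slide are consistent with addressing tree nodes by prefixes of a Hilbert-curve index, and the implementation slide names the Google S2 library for H-curves. The sketch below does not use S2; it is a plain textbook Hilbert mapping, with hypothetical function names `hilbert_index` and `prefix_address`, meant only to show how a two-bits-per-level prefix can name the ancestor of a cell.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell outside the grid")
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def prefix_address(d: int, order: int, level: int) -> str:
    """Binary prefix (two bits per level) naming the level-`level` ancestor
    of leaf cell d; the root (level 0) has the empty prefix."""
    if not 0 <= level <= order:
        raise ValueError("level must be between 0 and order")
    bits = format(d, "0{}b".format(2 * order))
    return bits[: 2 * level] + ("*" if level < order else "")
```

With x as the column and y as the row, the order-2 grid produced by `hilbert_index` reads 0 1 14 15 / 3 2 13 12 / 4 7 8 11 / 5 6 9 10, which is consistent with the cell numbers legible in the slide text. Cell 13 has bits 1101, so its level-1 ancestor is addressed as 11*.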

  37. Implementation
    MongoDB 3.2
    Google S2 Library for H-Curves
    ~2000 LoC
    [Architecture diagram; labels: mongod, KeyGenerator, QueryPlanner, mongos, ShardManager, ..., Insert/Query, Response]

  38. Evaluations
    Amazon EC2, 20 r4.xlarge instances, 30.5 GB memory
    Performance compared against PostGIS & MongoDB
    Dataset                           Records       Size
    All landmarks in the USA (Tiger)  122K          406 MB
    All cities on Earth (OSM)         542K          844 MB
    All parks on Earth (OSM)          234K          102 MB
    All rivers on Earth (OSM)         555K          945 MB
    Taxi trip records                 1.1 billion   280 GB
    Cellular network (partial)        500 million   2 TB
    Table 2: Real-world datasets used in evaluations (from [27, 45, 49]).

  40. Evaluations: Indexing

  41. Evaluations: Indexing
    [Plot: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

  43. Evaluations: Indexing
    [Plot: Index Size (GB) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
    [Plot: Indexing Time (s) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

  45. Evaluations: Querying

  46. Evaluations: Querying
    [Plot: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

  48. Evaluations: Querying
    [Plot: Query Time (ms) vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]
    [Plot: Probability vs. Query Time (ms); series: SIFT, MongoDB]

  49. Evaluations: Skew Handling

  50. Evaluations: Skew Handling
    [Plot: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

  51. Evaluations: Skew Handling
    [Plot: Chunks/Partition vs. Number of Objects Stored (Billions)]
    [Plot: Machines Used vs. Number of Objects Stored (Billions); series: MongoDB, SIFT]

  52. Summary
    Emerging IoT workloads challenging
    • Inherently geospatial, heavy skews, unprecedented volume
    • Need efficient support for storing & querying
    Our solution, SIFT:
    • Robust, skew-resistant, massively parallel
    • Performs well