A Scalable Distributed Spatial Index for the Internet-of-Things

Anand Iyer
September 26, 2017

Transcript

  1. A Scalable Distributed Spatial
    Index for the Internet-of-Things
    Anand Iyer, Ion Stoica
    ACM SoCC
    September 26, 2017

  4. Big Data Analytics
    From batch data to advanced analytics
    From live data to real-time decisions

  7. IoT Data Challenge #1
    Inherently geospatial data
    • Complex polygons
    • Existing spatial indices not designed for dynamic data
    Need robust dynamic spatial indexing

  10. IoT Data Challenge #2
    Human generated → Machine generated
    • Location Based Services (LBS) → Spatial analytics
    Need online ingestion at massive rates

  12. IoT Data Challenge #3
    Heavily skewed
    • Operating on fresh data is better than using stale data
    • Post-ingestion load-balancing not sufficient
    Need good performance under skews

  14. Current solutions are not good enough
    Suited to LBS workloads
    More details in the paper

  16. Problem: Ingest, index & query dynamic spatial data having unpredictable skews at unprecedented rates
    SIFT: Robust, skew-resistant, massively parallel spatial index

  17. SIFT Design
    Basic data structure: simple tree
    Never split/merge tree nodes
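
Editorial sketch: the slide says only that SIFT uses a simple tree whose nodes are never split or merged. One structure with that property is sketched below in Python. It assumes, which the slides do not state, a quadtree-style subdivision of a fixed region in which each object is stored at the deepest cell that fully contains its bounding box. `Node`, `insert`, `query`, and `max_depth` are hypothetical names chosen for illustration, not SIFT's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cell: tuple  # (min_x, min_y, max_x, max_y) covered by this node
    depth: int
    items: list = field(default_factory=list)     # boxes stored at this node
    children: dict = field(default_factory=dict)  # quadrant number -> Node


def quadrant_cells(cell):
    """Return the four equal sub-cells of a cell."""
    min_x, min_y, max_x, max_y = cell
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    return [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ]


def contains(cell, box):
    return (cell[0] <= box[0] and cell[1] <= box[1]
            and box[2] <= cell[2] and box[3] <= cell[3])


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def insert(root, box, max_depth=8):
    """Store box at the deepest node whose cell fully contains it.

    Children are created lazily; items already stored are never moved,
    so no node is ever split or merged.
    """
    node = root
    while node.depth < max_depth:
        for q, cell in enumerate(quadrant_cells(node.cell)):
            if contains(cell, box):
                node = node.children.setdefault(q, Node(cell, node.depth + 1))
                break
        else:
            break  # box straddles a quadrant boundary: it stays here
    node.items.append(box)
    return node


def query(node, box):
    """Return all stored boxes that intersect the query box."""
    hits = [item for item in node.items if intersects(item, box)]
    for child in node.children.values():
        if intersects(child.cell, box):
            hits.extend(query(child, box))
    return hits
```

For example, with `root = Node((0, 0, 16, 16), 0)`, a small box such as `(1, 1, 2, 2)` sinks to a deep node, while `(7, 7, 9, 9)` straddles the centre and stays at the root. Because placement depends only on the object's own bounding box, an insert never forces existing objects to be redistributed.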

  22. SIFT Design
    Distributing data
    When/how to create children
    Skew-resistant design
    The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 84

  25. SIFT Design
    What to use for distribution?
    [Plot: probability vs. area (× 1 million m²)]

  26. SIFT Design layers

  28. SIFT Design
    How to parallelize?
    Need to address node
    Key: (min_x, min_y), (max_x, max_y)
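
Editorial sketch: the slide gives a node's key as the two corners of its bounding box. A key of that form can be computed directly from a point and a tree level, without walking the tree, which is what lets any machine address a node independently. The Python below shows one way to do it, assuming (not stated in the slides) a fixed rectangular world that is subdivided uniformly. `node_key`, `depth`, and `world` are hypothetical names.

```python
def node_key(x, y, depth, world=(0.0, 0.0, 1.0, 1.0)):
    """Return ((min_x, min_y), (max_x, max_y)) of the cell at `depth`
    that contains the point (x, y).

    Assumes the world rectangle is cut into 2**depth by 2**depth equal cells.
    """
    min_x, min_y, max_x, max_y = world
    cells = 2 ** depth
    width = (max_x - min_x) / cells
    height = (max_y - min_y) / cells
    # Clamp so that points on the far edge fall into the last cell.
    col = min(int((x - min_x) / width), cells - 1)
    row = min(int((y - min_y) / height), cells - 1)
    return (
        (min_x + col * width, min_y + row * height),
        (min_x + (col + 1) * width, min_y + (row + 1) * height),
    )


key = node_key(5, 9, depth=2, world=(0, 0, 16, 16))
print(key)  # ((4.0, 8.0), (8.0, 12.0))
```

The key is a pure function of the coordinates and the level, so two machines that see the same point agree on the node without exchanging any state.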

  29. Cloud Network Latency
    [Plot: average query time (ms) vs. number of machines; series: No Locality, With Locality, No Locality (Batched)]

  33. SIFT Design
    How to parallelize?
    Need to address node
    [Figure: cells numbered 0, 0 to 3, and 0 to 15 at successive levels; prefix labels 0, 00*, 01*, 10*, 11*]

  34. Implementation
    [Diagram labels: mongod, KeyGenerator, QueryPlanner, mongos, ShardManager, ..., Insert/Query, Response]
    MongoDB 3.2
    Google S2 Library for H-Curves
    ~2000 LoC
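
Editorial sketch: the deck names Hilbert curves ("H-Curves") via the Google S2 library, and the design slides label nodes with binary prefixes such as 00* and 11*. The Python below is not the S2 API and is not taken from the SIFT code. It uses the textbook conversion from grid coordinates to a Hilbert index on a small 2^order by 2^order grid to show why a prefix of the index can address a whole subtree. `hilbert_index` and `prefix_key` are hypothetical helper names.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) on a 2**order by 2**order grid to its position
    along the Hilbert curve (standard iterative conversion)."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def prefix_key(order, x, y, level):
    """Return the first 2*level bits of the cell's Hilbert index,
    followed by '*'. Each level of the tree consumes two bits."""
    bits = format(hilbert_index(order, x, y), f"0{2 * order}b")
    return bits[: 2 * level] + "*"


print(hilbert_index(2, 2, 0))   # 14
print(prefix_key(2, 2, 0, 1))   # 11*
print(prefix_key(2, 1, 3, 1))   # 01*
```

On the 4 by 4 grid, cells 12 to 15 all share the prefix 11*, so a single prefix identifies the quadrant that contains them. Nearby cells tend to receive nearby indices, which is the locality property that makes the curve useful for assigning cells to machines.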

  35. Evaluations
    Amazon EC2, 20 r4.xlarge instances, 30.5 GB memory
    Performance compared against PostGIS & MongoDB
    Dataset                        Records       Size
    All landmarks in USA (Tiger)   122K          406 MB
    All cities on Earth (OSM)      542K          844 MB
    All parks on Earth (OSM)       234K          102 MB
    All rivers on Earth (OSM)      555K          945 MB
    Taxi trip records              1.1 billion   280 GB
    Cellular network (partial)     500 million   2 TB
    Table 2: Real-world datasets used in evaluations (from [27, 45, 49]).

  41. Evaluations: Indexing
    [Plot: indexing time (s) vs. number of objects stored (billions); MongoDB vs. SIFT]
    [Plot: index size (GB) vs. number of objects stored (billions); MongoDB vs. SIFT]

  45. Evaluations: Querying
    [Plot: query time (ms) vs. number of objects stored (billions); MongoDB vs. SIFT]
    [Plot: probability vs. query time (ms); SIFT vs. MongoDB]

  48. Evaluations: Skew Handling
    [Plot: machines used vs. number of objects stored (billions); MongoDB vs. SIFT]
    [Plot: chunks/partition vs. number of objects stored (billions)]

  49. Summary
    Emerging IoT workloads challenging
    • Inherently geospatial, heavy skews, unprecedented volume
    • Need efficient support for storing & querying
    Our solution, SIFT:
    • Robust, skew-resistant, massively parallel
    • Performs well