
Processing and publishing big data with GeoServer and Databricks in Azure


The amount of data we have to process and publish keeps growing every day; fortunately, the infrastructure, technologies, and methodologies to handle such streams of data keep improving and maturing. GeoServer is a web service for publishing your geospatial data using industry standards for vector, raster, and mapping. It powers a number of open source projects like GeoNode and geOrchestra and is widely used throughout the world by organizations to manage and disseminate data at scale. We integrated GeoServer with well-known big data technologies like Kafka and Databricks, and deployed the system in the Azure cloud, to handle use cases that required near-real-time display of the latest received data on a map as well as background batch processing of historical data.
This presentation describes the architecture put in place and the challenges that GeoSolutions had to overcome to publish big data through GeoServer OGC services (WMS, WFS, and WPS), finding the balance that maximized both ingestion and visualization performance. We had to integrate with a stream processing platform that took care of most of the processing and of storing the data in an Azure data lake, allowing GeoServer to efficiently query for the latest available features while respecting all the authorization policies that were put in place. A few custom GeoServer extensions were implemented to handle the authorization complexity, the advanced styling needs, and the big data integration needs.

Simone Giannecchini

August 31, 2022

Transcript

  1. Nuno Oliveira
    Simone Giannecchini
    GeoSolutions
    Processing and publishing big data with GeoServer and Databricks in Azure


  2. GeoSolutions
    • Offices in Italy & US, Global Clients/Team
    • 40+ collaborators, 30+ Engineers
    • Our products: GeoNode
    • Our offer: Enterprise Support Services, Deployment, Subscription, Professional Training, Customized Solutions


  3. Affiliations
    • We strongly support Open Source; it is at our core
    • We actively participate in OGC working groups and get funded to advance new open standards
    • We support standards critical to GEOINT


  4. What’s big data?


  5. When does data become big data?
    • We can start with the usual three V’s:
    • Velocity
    • Volume
    • Variety
    • A practical definition from Wikipedia: big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.
    • We also need to take into account our functional needs …


  6. Using maritime data as our use case
    • Maritime Data is produced by a variety of
    sources:
    • Ship positions: AIS, SAR, VMS, …
    • Maritime assets: ports, navigational aid systems, …


  7. Our use case in numbers
    • In 24 hours:
    • We receive up to 50 million position reports
    • We handle up to 500K different ships
    • Peaks of activity during daylight:
    • Up to 2500 messages per second!
    • Azure data lake with 7 years of data:
    ~125 billion positions!
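    A quick sanity check on that last figure, assuming roughly 50 million position reports per day sustained over the whole 7-year span:

    $50{,}000{,}000 \times 365 \times 7 \approx 1.3 \times 10^{11}$, consistent with the ~125 billion positions in the data lake.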


  8. Maritime data overview
    • Provides a foundation for informed
    decision-making applications:
    • Maritime traffic monitoring
    • Search and rescue operations
    • Environmental marine disasters monitoring
    • …
    • Several datasets need to be combined:
    • Fisheries data
    • Ship registries information
    • …
    Interoperability!


  9. Implemented scenarios
    • We have implemented the following scenarios
    in GeoServer using WMS, WFS and WPS:
    • Real-time visualization of ship positions
    • Density map computation and visualization
    • Real-time visualization of aids to navigation systems
    • Correlation and visualization of detected ship positions
    • Publishing electronic navigational charts through WMS
    • Visualization of historical ship positions


  10. Authorization rights …
    • Authorization rights need to be respected:
    • Different authorization rights will result in different views of maritime assets!
    • Example: user 1 can see all vessel positions, while user 2 can only see SAT-AIS vessel positions.
    [Diagram: ship sensors (SAT-AIS, T-AIS) reporting positions at times t0 and t1]


  11. Visualizing ship positions in real time


  12. Use case overview
    • Displays the latest position for each known
    vessel in the last 24 hours.
    • System designed to handle up to 5K positions per second, i.e. 432 million positions per day (see the quick arithmetic below)!
    • Positions are enriched with several datasets,
    e.g. fisheries.
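    The 432 million figure above follows directly from the design rate:

    $5{,}000 \text{ positions/s} \times 86{,}400 \text{ s/day} = 432{,}000{,}000 \text{ positions/day}$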


  13. Deployment diagram in Azure
    Kafka Cluster (SaaS - Managed Kafka; critical: RAM and disk):
    • kafka.head.x (2 vCPUs, 16 GB RAM, 135 GB disk) x2
    • kafka.worker.x (4 vCPUs, 4 GB RAM, 200 GB disk) x3
    • kafka.zookeeper.x (2 vCPUs, 14 GB RAM, 135 GB disk) x3
    • kafka.storage (1 TB SSD)
    PostgreSQL (SaaS - Gen 5, 32 vCore):
    • postgresql (32 vCPUs, 160 GB RAM, 8 TB, 20,000 IOPS)
    Kubernetes Cluster:
    • aks.x (SaaS - DS2 v2) x3
    Ingestion Cluster (IaaS - F4s v2; critical: CPU and RAM):
    • ingestion.1 and ingestion.2 (4 CPUs, 8 GB RAM, 64 GB SSD + 64 GB Premium SSD)
    GeoServer Cluster (PaaS - F4s v2; critical: CPU):
    • geoserver.1 and geoserver.2 (4 CPUs, 8 GB RAM, 64 GB SSD + 32 GB Premium SSD)
    Data flow: the ingestion cluster reads from and writes to the Kafka topics and writes to the PostgreSQL tables; GeoServer reads the tables; the Kubernetes cluster manages the deployments.


  14. Data ingestion quick overview
    • Finding the right balance between writing and reading:
    • Indexes make reads more performant, but make writes slower!
    • Processing uses user-defined functions:
    • Filtering: creating layers for different views
    • Computations: creating new attributes
    • Transformations: transforming existing attributes
    • Processing was implemented with Kafka Streams.
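    The deck does not include the pipeline code; the following is a minimal Kafka Streams sketch of the filter / compute / transform pattern described above. Topic names, the message format, and the SAT-AIS filter are illustrative assumptions, not the project's actual configuration.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PositionsPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "positions-processing");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical input topic carrying raw position reports as JSON strings
        KStream<String, String> positions = builder.stream("raw-positions");

        positions
            // Filtering: keep only the messages belonging to a given view/layer
            .filter((vesselId, json) -> json.contains("\"source\":\"SAT-AIS\""))
            // Computation/transformation: derive an extra attribute from the raw message
            .mapValues(json -> json.replaceFirst("\\}$",
                    ",\"ingestedAt\":" + System.currentTimeMillis() + "}"))
            // Hypothetical output topic consumed by the PostgreSQL writer
            .to("satais-positions");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```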


  15. Generating geometries
    • Storing numerical latitudes and longitudes:
    • Allows us to use efficient numerical indexes
    • Geometries are generated on the fly on the GeoServer side (see the sketch below)
    • OGC spatial operators are supported
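    As an illustration of the on-the-fly geometry generation, here is a minimal sketch using the JTS library that GeoServer is built on; the class is hypothetical, and the real work happens inside the custom GeoServer extension while it reads rows.

```java
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.geom.PrecisionModel;

// Build point geometries from the stored numeric longitude/latitude columns,
// so the database only needs plain numerical indexes.
public final class PositionGeometry {

    // SRID 4326: plain WGS84 longitude/latitude
    private static final GeometryFactory GEOMETRY_FACTORY =
            new GeometryFactory(new PrecisionModel(), 4326);

    public static Point toPoint(double longitude, double latitude) {
        return GEOMETRY_FACTORY.createPoint(new Coordinate(longitude, latitude));
    }
}
```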


  16. Advanced authorization
    • Each position is associated with a list of
    authorized roles:
    • GeoServer injects the authorization rules when querying the data (see the sketch below)!
    • GeoServer SQL views are sometimes used for
    better control of the final query.
    • Administrators can configure pre-authorized
    caches:
    • GeoServer takes care of the routing!
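    A minimal sketch of the "inject the authorization rules into the query" idea, assuming (this is not stated in the deck) that each row stores its authorized roles in a comma-separated auth_roles column; the actual extension's implementation may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.geotools.filter.text.cql2.CQLException;
import org.geotools.filter.text.ecql.ECQL;
import org.opengis.filter.Filter;

// Build an ECQL predicate that only matches rows the current user's roles
// are allowed to see; GeoServer would AND this with the OGC request filter.
public final class AuthorizationFilter {

    // Assumes a non-empty list of roles for the current user
    public static Filter forUserRoles(List<String> userRoles) throws CQLException {
        String ecql = userRoles.stream()
                .map(role -> "auth_roles LIKE '%" + role + "%'")
                .collect(Collectors.joining(" OR "));
        return ECQL.toFilter(ecql);
    }
}
```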


  17. Real time ships positions
    • Real time maritime picture displayed using a style that colors each vessel according to its type:


  18. Real time ships positions
    • Real time maritime picture displayed using a style that colors each vessel according to the age of its latest position:


  19. Real time ships positions
    • Real time maritime picture displaying only fishing
    vessels colored based on their fishing gear:


  20. Real time aids to navigation systems
    • Real time aircraft search and rescue operation:


  21. Real time aids to navigation systems
    • Real time aids to navigation systems positions:


  22. Real time ships positions
    • Real time maritime picture displayed using a polar
    projection (EPSG:5041):


  23. Advanced styling
    • Several performance optimizations were implemented.
    • Highlighting thousands of fast-moving objects:


  24. Advanced styling
    • Real time maritime picture displaying only cargo
    vessels, some of them highlighted:


  25. Real time ships positions demo


  26. Historical vessel tracks visualization


  27. Use case overview
    • Retrieve the historical positions of one or more ships from an Azure Data Lake:
    • Make them available through GeoServer OGC
    WFS, WMS and WPS services.
    • We can afford an initial load time, but then we
    need to be fast!
    ~125 billion positions!


  28. Architecture overview [PROTOTYPE]
    [Diagram: the Databricks Connector, the Databricks SQL End-Point (Photon), Apache Spark, and the Azure Delta Lake, connected by read/write, coordination, and read links]


  29. Architecture overview
    [Diagram: same components as the previous slide: the Databricks Connector, the Databricks SQL End-Point (Photon), Apache Spark, and the Azure Delta Lake, connected by read/write, coordination, and read links]


  30. Databricks quick overview
    • Databricks Photon SQL query engine:
    • Apache Sedona:
    • Out-of-the-box distributed spatial datasets
    • Spatial SQL with PostGIS-like functions!
    • Supports both raster and vector data
    Compatible with Apache Spark!
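    The deck only names the technologies; below is a minimal sketch of what a Sedona spatial SQL query against a Delta table could look like. Table and column names are illustrative, and Sedona's SQL functions are assumed to be registered on the Spark session.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class SedonaQueryExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("positions-spatial-query")
                .getOrCreate();
        // Assumes Apache Sedona's SQL functions have already been registered
        // on this session (e.g. via SedonaSQLRegistrator.registerAll in Sedona 1.x).

        // PostGIS-like spatial SQL: build a point from the numeric columns and
        // keep only the positions falling inside a bounding box.
        Dataset<Row> inArea = spark.sql(
                "SELECT mmsi, `timestamp`, "
              + "       ST_Point(position_longitude, position_latitude) AS geom "
              + "FROM positions "
              + "WHERE ST_Contains("
              + "        ST_PolygonFromEnvelope(102.8, 0.4, 110.1, 8.0), "
              + "        ST_Point(position_longitude, position_latitude))");

        inArea.show(20);
    }
}
```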


  31. Indexing and partitioning in practice?
    • How efficient is it to query Databricks with SQL?
    • It depends!
    • Partitioning
    • Indexing
    • Structure of the data
    • …
    Geologic time scale vs. human history time scale


  32. Our use case?
    • The data is too big and supports too many use cases to have a partitioning schema that is efficient for all of them!
    • Retrieving 10K or 100K positions usually takes around 20 seconds
    • Solutions?
    • Downsampling | pre-processing
    • Caching


  33. Our solution
    • New DatabricksSelector vendor parameter:
    • Allows us to select a subset from the Azure
    data lake through Databricks:
    http://.../geoserver/wms?SERVICE=WMS&VERSION=1.1.1&
    REQUEST=GetMap&DATABRICKSSELECTOR=vesselId in
    (36547) AND timestamp between '2022-05-16
    00:00:00.000' and '2022-05-23
    00:20:00.000'&STYLES&LAYERS=positions...
    Data will be cached in PostgreSQL
    and managed by GeoServer!
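    Since the selector is a SQL predicate, a client has to URL-encode it before appending it to the GetMap request; a minimal sketch follows (the host, layer, and column names are illustrative, taken from the example above).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public final class SelectorRequestExample {
    public static void main(String[] args) {
        // The raw SQL predicate passed through the vendor parameter
        String selector = "vesselId in (36547) AND timestamp between "
                + "'2022-05-16 00:00:00.000' and '2022-05-23 00:20:00.000'";

        String url = "http://example.org/geoserver/wms"
                + "?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap"
                + "&LAYERS=positions&STYLES="
                + "&SRS=EPSG:4326&BBOX=-180,-90,180,90&WIDTH=1000&HEIGHT=500"
                + "&FORMAT=image/png"
                + "&DATABRICKSSELECTOR="
                // Quotes and spaces in the predicate must be percent-encoded
                + URLEncoder.encode(selector, StandardCharsets.UTF_8);

        System.out.println(url);
    }
}
```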


  34. Our solution


  35. Our solution


  36. Historic ships positions demo
    http://20.101.92.136:8080/geoserver/big-data/wms?SERVICE=
    WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/p
    ng&TRANSPARENT=true&DATABRICKSSELECTOR=mmsi IN
    (477100XXX, 457650XXX) AND timestamp between '2021-01-01
    00:00:00.000' and '2021-07-01
    00:00:00.000'&STYLES&TIME=2021-01-01T00:00:00.00Z/2021-0
    7-01T00:00:00.00Z&LAYERS=big-data:enc,big-data:positions&
    exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326
    &WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.851562
    5,154.16015625,82.6171875
    ● Display the positions of two vessels for
    six months in the past (2021):


  37. Historic ships positions demo


  38. Historic ships positions demo


  39. Historic ships positions demo
    http://20.101.92.136:8080/geoserver/big-data/wms?SERVICE=
    WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/p
    ng&TRANSPARENT=true&DATABRICKSSELECTOR=position_
    longitude > 102.843018 AND position_longitude < 110.061035
    AND position_latitude > 0.406491 AND position_latitude <
    8.015716 AND timestamp between '2021-02-05 00:00:00.000'
    and '2021-02-05
    23:59:59.999'&STYLES&TIME=2021-02-05T00:00:00.00Z/2021-0
    2-05T23:59:59.99Z&LAYERS=big-data:enc,big-data:positions&
    exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326
    &WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.851562
    5,154.16015625,82.6171875
    ● Display all the vessels that were in that specific area on the 5th of February 2021:


  40. Historic ships positions demo


  41. Next steps


  42. Next steps
    • Improve the GeoServer Databricks integration: WPS
    • We are going to propose to the GeoServer community to contribute:
    • Generated geometries
    • Databricks SQL connector
    • If they are accepted, they should be available towards the end of the year (2022)


  43. The End
    Questions?
    primary author email
    secondary author email
    [email protected]
