
Processing and publishing big data with GeoServer and Databricks in Azure

The amount of data we have to process and publish keeps growing every day. Fortunately, the infrastructure, technologies, and methodologies to handle such streams of data keep improving and maturing. GeoServer is a web service for publishing your geospatial data using industry standards for vector, raster, and mapping. It powers a number of open source projects like GeoNode and geOrchestra, and it is widely used throughout the world by organizations to manage and disseminate data at scale. We integrated GeoServer with some well-known big data technologies like Kafka and Databricks, and deployed the systems in the Azure cloud, to handle use cases that required near-real-time display of the latest received data on a map as well as background batch processing of historical data.
This presentation will describe the architecture put in place, and the challenges that GeoSolutions had to overcome to publish big data through GeoServer OGC services (WMS, WFS, and WPS), finding the correct balance that maximized ingestion performance and visualization performance. We had to integrate with a streaming processing platform that took care of most of the processing and storing of the data in an Azure data lake that allows GeoServer to efficiently query for the latest available features, respecting all the authorization policies that were put in place. A few custom GeoServer extensions were implemented to handle the authorization complexity, the advanced styling needs, and big data integration needs.

Simone Giannecchini

August 31, 2022


  1. GeoSolutions
    • Offices in Italy & US, Global Clients/Team
    • 40+ collaborators, 30+ Engineers
    • Our products: GeoNode
    • Our Offer: Enterprise Support Services, Deployment Subscription, Professional Training, Customized Solutions
  2. Affiliations
    • We strongly support Open Source; it is in our core
    • We actively participate in OGC working groups and get funded to advance new open standards
    • We support standards critical to GEOINT
  3. When does data become big data?
    • We can start with the usual three V's:
      • Velocity
      • Volume
      • Variety
    • A practical definition from Wikipedia:
      • Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.
    • We also need to take into account our functional needs …
  4. Using maritime data as our use case
    • Maritime data is produced by a variety of sources:
      • Ship positions: AIS, SAR, VMS, …
      • Maritime assets: ports, navigational aid systems, …
  5. Our use case in numbers
    • In 24 hours:
      • We receive up to 50 million position reports
      • We handle up to 500K different ships
    • Peaks of activity during daylight:
      • Up to 2500 messages per second!
    • Azure data lake with 7 years of data: ~125 billion positions!
  6. Maritime data overview
    • Provide a foundation for informed decision-making applications:
      • Maritime traffic monitoring
      • Search and rescue operations
      • Environmental marine disasters monitoring
      • …
    • Several datasets need to be combined:
      • Fisheries data
      • Ships registries information
      • …
    • Interoperability!
  7. Implemented scenarios
    • We have implemented the following scenarios in GeoServer using WMS, WFS and WPS:
      • Visualize ship positions in real time
      • Density maps computation and visualization
      • Visualize navigational aid systems in real time
      • Detected ships positions correlation and visualization
      • Electronic navigational charts publishing through WMS
      • Historical ship positions visualization
  8. Authorization rights
    • Authorization rights need to be respected:
      • Different authorization rights will result in different views of maritime assets!
    • Diagram (ships and sensors, SAT-AIS and T-AIS, with positions at times t0 and t1): User 1 can see all vessel positions. User 2 can only see SAT-AIS vessel positions.
  9. Use case overview
    • Displays the latest position for each known vessel in the last 24 hours.
    • System designed to handle up to 5K positions per second: 432 million positions per day!
    • Positions are enriched with several datasets, e.g. fisheries.
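The daily figure follows directly from the per-second design rate. As a quick check, the sketch below redoes the arithmetic and, for comparison, derives the average rate implied by the 50 million reports per 24 hours quoted earlier in the deck. The function name is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def positions_per_day(rate_per_second: int) -> int:
    """Positions handled in one day at a sustained per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


# Design capacity quoted on the slide: 5K positions per second.
design_capacity = positions_per_day(5_000)

# Average rate implied by "up to 50 million position reports" in 24 hours.
average_observed_rate = 50_000_000 / SECONDS_PER_DAY

print(design_capacity)               # 432000000
print(round(average_observed_rate))  # 579
```

The design rate is therefore several times the average observed rate, which leaves headroom for the daylight peaks of up to 2500 messages per second.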
  10. Deployment diagram in Azure
    • Kafka Cluster (Critical: RAM and DISK), SaaS - Managed Kafka:
      • kafka.head.x (2 vCPUs, 16 GB RAM, 135 GB DISK)
      • kafka.worker.x (4 vCPUs, 4 GB RAM, 200 GB DISK)
      • kafka.zookeeper.x (2 vCPUs, 14 GB RAM, 135 GB DISK)
      • kafka.storage (1 TB SSD)
    • PostgreSQL, SaaS - Gen 5, 32 vCore:
      • postgresql (32 vCPUs, 160 GB RAM, 8 TB, 20000 IOPS)
    • Kubernetes Cluster, SaaS - DS2 v2:
      • aks.x
    • Ingestion Cluster (Critical: CPU and RAM), IaaS - F4s v2:
      • ingestion.1 (4 CPU, 8 GB RAM, 64 SSD + 64 PREMIUM SSD)
      • ingestion.2 (4 CPU, 8 GB RAM, 64 SSD + 64 PREMIUM SSD)
    • GeoServer Cluster (Critical: CPU), PaaS - F4s v2:
      • geoserver.1 (4 CPU, 8 GB RAM, 64 SSD + 32 PREMIUM SSD)
      • geoserver.2 (4 CPU, 8 GB RAM, 64 SSD + 32 PREMIUM SSD)
    • Diagram labels: instance counts x3, x2, x3, x3; connections "reads \ writes from topics", "reads tables", "writes to tables", "manages"
  11. Data ingestion quick overview
    • Finding the right balance between writing and reading:
      • Indexes make reads faster, but make writes slower!
    • Processing uses user defined functions:
      • Filtering: creating layers for different views
      • Computations: create new attributes
      • Transformations: transform existing attributes
    • Processing was implemented with Kafka Streams.
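The three kinds of user defined functions can be pictured as small composable steps. The sketch below is a plain-Python stand-in, not Kafka Streams code and not the project's implementation. The record fields and function names are assumptions made for the illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator, Optional, Sequence


@dataclass(frozen=True)
class Position:
    # Hypothetical record layout, for illustration only.
    vessel_id: int
    source: str
    latitude: float
    longitude: float
    speed_knots: float
    speed_kmh: Optional[float] = None


def only_sat_ais(p: Position) -> bool:
    """Filtering: keep only the records that belong to one view."""
    return p.source == "SAT-AIS"


def add_speed_kmh(p: Position) -> Position:
    """Computation: derive a new attribute from existing ones."""
    return replace(p, speed_kmh=round(p.speed_knots * 1.852, 3))


def normalize_longitude(p: Position) -> Position:
    """Transformation: rewrite an existing attribute into [-180, 180)."""
    return replace(p, longitude=((p.longitude + 180.0) % 360.0) - 180.0)


def process(
    positions: Iterable[Position],
    keep: Callable[[Position], bool],
    steps: Sequence[Callable[[Position], Position]],
) -> Iterator[Position]:
    """Apply the filter first, then every step in order, record by record."""
    for p in positions:
        if not keep(p):
            continue
        for step in steps:
            p = step(p)
        yield p


raw = [
    Position(1, "SAT-AIS", 10.0, 190.0, 10.0),
    Position(2, "T-AIS", 20.0, 30.0, 5.0),
]
result = list(process(raw, only_sat_ais, [add_speed_kmh, normalize_longitude]))
print(result)
```

In a streaming engine the same three roles would typically map to filter and map-style operators applied to each incoming message.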
  12. Generating geometries
    • Storing numerical latitudes and longitudes:
      • Allows us to use efficient numerical indexes
      • Geometries are generated on the fly on the GeoServer side
      • OGC spatial operators are supported
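To illustrate the idea, the sketch below builds a point geometry (as WKT) from stored numeric coordinates and evaluates a bounding-box test using only numeric comparisons, which is the kind of predicate a plain numeric index can serve. It is an illustration of the principle only and does not reflect how the GeoServer extension is implemented. The function names are hypothetical.

```python
from typing import Tuple

# (min_lon, min_lat, max_lon, max_lat), the same order as a WMS 1.1.1 BBOX.
BBox = Tuple[float, float, float, float]


def to_wkt_point(longitude: float, latitude: float) -> str:
    """Generate a point geometry on the fly from two numeric columns."""
    return f"POINT ({longitude} {latitude})"


def in_bbox(longitude: float, latitude: float, bbox: BBox) -> bool:
    """A bounding-box test expressed purely as numeric range checks."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= longitude <= max_lon and min_lat <= latitude <= max_lat


area = (102.843018, 0.406491, 110.061035, 8.015716)
print(to_wkt_point(105.0, 4.0))    # POINT (105.0 4.0)
print(in_bbox(105.0, 4.0, area))   # True
print(in_bbox(12.5, 41.9, area))   # False
```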
  13. Advanced authorization
    • Each position is associated with a list of authorized roles:
      • GeoServer injects the authorization rules when querying the data!
      • GeoServer SQL views are sometimes used for better control of the final query.
    • Administrators can configure pre-authorized caches:
      • GeoServer takes care of the routing!
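The role-list model can be sketched in a few lines: a position is visible when the user holds at least one of the roles attached to it. This is an in-memory illustration under assumed names; the role identifiers and record layout are hypothetical, and in the described system the rule is injected into the data query rather than applied in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Position:
    # Hypothetical record layout, for illustration only.
    vessel_id: int
    sensor: str
    authorized_roles: FrozenSet[str]


def visible_positions(
    positions: Iterable[Position], user_roles: Iterable[str]
) -> List[Position]:
    """Keep the positions that share at least one role with the user."""
    granted = set(user_roles)
    return [p for p in positions if granted & p.authorized_roles]


positions = [
    Position(1, "SAT-AIS", frozenset({"ROLE_SAT_AIS", "ROLE_ALL"})),
    Position(2, "T-AIS", frozenset({"ROLE_ALL"})),
]

user_1 = visible_positions(positions, ["ROLE_ALL"])      # sees every vessel
user_2 = visible_positions(positions, ["ROLE_SAT_AIS"])  # sees SAT-AIS only
print([p.vessel_id for p in user_1])  # [1, 2]
print([p.vessel_id for p in user_2])  # [1]
```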
  14. Real time ships positions
    • Real time maritime picture displayed using a style that colors each vessel according to its type:
  15. Real time ships positions
    • Real time maritime picture displayed using a style that colors each vessel according to the age:
  16. Real time ships positions
    • Real time maritime picture displaying only fishing vessels colored based on their fishing gear:
  17. Use case overview
    • Retrieve ship(s) historical positions from an Azure Data Lake:
      • Make them available through GeoServer OGC WFS, WMS and WPS services.
      • We can afford an initial load time, but then we need to be fast!
    • ~125 billion positions!
  18. Architecture overview (diagram)
    • Components: Azure Delta Lake, Apache Spark, SQL End-Point (Photon), Databricks, Databricks Connector
    • Flows: Read \ Write, Coordination, Read [PROTOTYPE]
  19. Architecture overview (diagram)
    • Components: Azure Delta Lake, Apache Spark, SQL End-Point (Photon), Databricks, Databricks Connector
    • Flows: Read \ Write, Coordination, Read
  20. Databricks quick overview
    • Databricks Photon SQL query engine
    • Apache Sedona:
      • Out-of-the-box distributed Spatial Datasets
      • Spatial SQL with PostGIS look-alike functions!
      • Supports both raster and vector data
    • Compatible with Apache Spark!
  21. Indexing and partitioning in practice?
    • How efficient is it to query Databricks with SQL?
    • It depends!
      • Partitioning
      • Indexing
      • Structure of the data
      • …
    • Geologic time scale vs human history time scale
  22. Our use case?
    • Too big, and supports too many use cases, to have an efficient partitioning schema for all!
    • Retrieving 10K or 100K positions usually takes around 20 seconds
    • Solutions?
      • Downsampling | pre-processing
      • Caching
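The deck does not say which downsampling strategy was considered. As one possible illustration, the sketch below keeps only the first position of each vessel within a fixed time bucket, which reduces the number of rows to retrieve and render. The function name and the tuple layout (vessel id, epoch seconds) are assumptions.

```python
from typing import Iterable, List, Set, Tuple

# (vessel_id, timestamp as epoch seconds); a hypothetical minimal record.
Sample = Tuple[int, int]


def downsample(positions: Iterable[Sample], bucket_seconds: int) -> List[Sample]:
    """Keep the earliest position per vessel and per time bucket."""
    seen: Set[Tuple[int, int]] = set()
    kept: List[Sample] = []
    for vessel_id, ts in sorted(positions):
        key = (vessel_id, ts // bucket_seconds)
        if key not in seen:
            seen.add(key)
            kept.append((vessel_id, ts))
    return kept


samples = [(1, 0), (1, 30), (1, 60), (2, 10), (2, 50)]
print(downsample(samples, 60))  # [(1, 0), (1, 60), (2, 10)]
```

A larger bucket trades positional detail for fewer rows, so the bucket size would have to be chosen per zoom level or per use case.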
  23. Our solution
    • New DatabricksSelector vendor parameter:
      • Allows us to select a subset from the Azure data lake through Databricks:
        http://.../geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&DATABRICKSSELECTOR=vesselId in (36547) AND timestamp between '2022-05-16 00:00:00.000' and '2022-05-23 00:20:00.000'&STYLES&LAYERS=positions...
    • Data will be cached in PostgreSQL and managed by GeoServer!
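A client would normally percent-encode the selector before sending it, since it contains spaces, quotes and colons. The sketch below assembles such a GetMap request with Python's standard library. The host, the helper name and the default size are placeholders invented for the example; only the parameter names come from the requests shown in the deck.

```python
from typing import Sequence
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getmap_url(
    base_url: str,
    layers: str,
    selector: str,
    bbox: Sequence[float],
    width: int = 1000,
    height: int = 600,
) -> str:
    """Build a WMS 1.1.1 GetMap URL carrying a DATABRICKSSELECTOR value."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "FORMAT": "image/png",
        "TRANSPARENT": "true",
        "DATABRICKSSELECTOR": selector,
        "STYLES": "",
        "LAYERS": layers,
        "SRS": "EPSG:4326",
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "BBOX": ",".join(str(v) for v in bbox),
    }
    # urlencode percent-encodes quotes and colons and turns spaces into "+".
    return f"{base_url}?{urlencode(params)}"


selector = (
    "vesselId in (36547) AND timestamp between "
    "'2022-05-16 00:00:00.000' and '2022-05-23 00:20:00.000'"
)
url = build_getmap_url(
    "http://localhost:8080/geoserver/wms",  # placeholder host
    "positions",
    selector,
    (-180.0, -90.0, 180.0, 90.0),
)
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["DATABRICKSSELECTOR"][0] == selector)  # True
```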
  24. Historic ships positions demo
    • Display the positions of two vessels for six months in the past (2021):
      WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/png&TRANSPARENT=true&DATABRICKSSELECTOR=mmsi IN (477100XXX, 457650XXX) AND timestamp between '2021-01-01 00:00:00.000' and '2021-07-01 00:00:00.000'&STYLES&TIME=2021-01-01T00:00:00.00Z/2021-07-01T00:00:00.00Z&LAYERS=big-data:enc,big-data:positions&exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326&WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.8515625,154.16015625,82.6171875
  25. Historic ships positions demo
    • Display all the vessels that were in that specific area on the 5th of February 2021:
      WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/png&TRANSPARENT=true&DATABRICKSSELECTOR=position_longitude > 102.843018 AND position_longitude < 110.061035 AND position_latitude > 0.406491 AND position_latitude < 8.015716 AND timestamp between '2021-02-05 00:00:00.000' and '2021-02-05 23:59:59.999'&STYLES&TIME=2021-02-05T00:00:00.00Z/2021-02-05T23:59:59.99Z&LAYERS=big-data:enc,big-data:positions&exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326&WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.8515625,154.16015625,82.6171875
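The area-and-day selector in this request follows a regular pattern, so it can be generated from a bounding box and a date. The helper below is a hypothetical convenience written for this illustration. It reuses the column names from the request and accepts only numeric and date inputs, because building SQL-like text from untrusted strings would be unsafe.

```python
from datetime import date


def area_and_day_selector(
    min_lon: float, min_lat: float, max_lon: float, max_lat: float, day: date
) -> str:
    """Build a selector for all positions inside a box during one day."""
    d = day.isoformat()  # e.g. "2021-02-05"
    return (
        f"position_longitude > {min_lon} AND position_longitude < {max_lon} "
        f"AND position_latitude > {min_lat} AND position_latitude < {max_lat} "
        f"AND timestamp between '{d} 00:00:00.000' and '{d} 23:59:59.999'"
    )


selector = area_and_day_selector(
    102.843018, 0.406491, 110.061035, 8.015716, date(2021, 2, 5)
)
print(selector)
```

With these inputs the helper reproduces the selector used in the demo request.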
  26. Next steps
    • Improve the GeoServer Databricks integration:
      • WPS
    • We are going to propose the following contributions to the GeoServer community:
      • Generated geometries
      • Databricks SQL connector
    • If they are accepted, they should be available towards the end of the year (2022)