Slide 1

Slide 1 text

Processing and publishing big data with GeoServer and Databricks in Azure
Nuno Oliveira, Simone Giannecchini
GeoSolutions

Slide 2

Slide 2 text

GeoSolutions
• Offices in Italy & US, Global Clients/Team
• 40+ collaborators, 30+ Engineers
• Our products: GeoNode
• Our Offer: Enterprise Support Services, Deployment Subscription, Professional Training, Customized Solutions

Slide 3

Slide 3 text

Affiliations
• We strongly support Open Source; it is in our core
• We actively participate in OGC working groups and get funded to advance new open standards
• We support standards critical to GEOINT

Slide 4

Slide 4 text

What’s big data?

Slide 5

Slide 5 text

When does data become big data?
• We can start with the usual three V’s:
  • Velocity
  • Volume
  • Variety
• A practical definition from Wikipedia:
  • Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.
• We also need to take into account our functional needs …

Slide 6

Slide 6 text

Using maritime data as our use case
• Maritime data is produced by a variety of sources:
  • Ship positions: AIS, SAR, VMS, …
  • Maritime assets: ports, navigational aid systems, …

Slide 7

Slide 7 text

Our use case in numbers
• In 24 hours:
  • We receive up to 50 million position reports
  • We handle up to 500K different ships
• Peaks of activity during daylight:
  • Up to 2500 messages per second!
• Azure data lake with 7 years of data: ~125 billion positions!
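A quick back-of-the-envelope check of these figures, as a sketch only. It assumes 365-day years and that every day reaches the quoted daily maximum, which is a simplification. Under those assumptions the average rate sits well below the 2500 messages per second peak, and seven years of data lands close to the quoted ~125 billion positions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

reports_per_day = 50_000_000   # upper bound quoted on the slide
peak_per_second = 2_500        # peak rate quoted on the slide

# Average rate implied by the daily maximum.
average_per_second = reports_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

# Assumption: 365-day years, every day at the daily maximum.
upper_bound_7_years = reports_per_day * 365 * 7

print(f"average rate: {average_per_second:.0f} msg/s")
print(f"peak / average: {peak_to_average:.2f}x")
print(f"7-year upper bound: {upper_bound_7_years / 1e9:.2f} billion")
```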

Slide 8

Slide 8 text

Maritime data overview
• Provide a foundation for informed decision-making applications:
  • Maritime traffic monitoring
  • Search and rescue operations
  • Marine environmental disaster monitoring
  • …
• Several datasets need to be combined:
  • Fisheries data
  • Ship registry information
  • …
Interoperability!

Slide 9

Slide 9 text

Implemented scenarios
• We have implemented the following scenarios in GeoServer using WMS, WFS and WPS:
  • Visualize ship positions in real time
  • Density maps computation and visualization
  • Visualize navigation to aid systems in real time
  • Detected ship positions correlation and visualization
  • Electronic navigational charts publishing through WMS
  • Historical ship positions visualization

Slide 10

Slide 10 text

Authorization rights
• Authorization rights need to be respected:
  • Different authorization rights will result in different views of maritime assets!
• User 1 can see all vessel positions.
• User 2 can only see SAT-AIS vessel positions.
[Figure: ships and sensors (SAT-AIS, T-AIS) with positions at times t0 and t1]

Slide 11

Slide 11 text

Visualize ship positions in real time

Slide 12

Slide 12 text

Use case overview
• Displays the latest position for each known vessel in the last 24 hours.
• System designed to handle up to 5K positions per second: 432 million positions per day!
• Positions are enriched with several datasets, e.g. fisheries.
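The "latest position per vessel in the last 24 hours" rule can be sketched in a few lines. This is an in-memory illustration, not the deployed implementation; the field names (`vessel_id`, `timestamp`, `lon`, `lat`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def latest_positions(reports, now, window=timedelta(hours=24)):
    """Return the most recent report per vessel within the time window."""
    latest = {}
    cutoff = now - window
    for report in reports:
        if report["timestamp"] < cutoff:
            continue  # older than the window, not part of the live picture
        current = latest.get(report["vessel_id"])
        if current is None or report["timestamp"] > current["timestamp"]:
            latest[report["vessel_id"]] = report
    return latest

now = datetime(2022, 5, 23, 12, 0, tzinfo=timezone.utc)
reports = [
    {"vessel_id": 1, "timestamp": now - timedelta(hours=2), "lon": 10.0, "lat": 40.0},
    {"vessel_id": 1, "timestamp": now - timedelta(minutes=5), "lon": 10.1, "lat": 40.1},
    {"vessel_id": 2, "timestamp": now - timedelta(hours=30), "lon": -5.0, "lat": 36.0},
]
picture = latest_positions(reports, now)

# Design capacity from the slide: 5K positions per second over a full day.
positions_per_day = 5_000 * 24 * 60 * 60
```

Vessel 2 drops out because its only report is older than 24 hours, and vessel 1 is represented by its newest report.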

Slide 13

Slide 13 text

Deployment diagram in Azure
• Kafka Cluster (Critical: RAM and DISK), SaaS - Managed Kafka:
  • kafka.head.x (2 VCPUs, 16 GB RAM, 135 GB DISK)
  • kafka.worker.x (4 VCPUs, 4 GB RAM, 200 GB DISK)
  • kafka.zookeeper.x (2 VCPUs, 14 GB RAM, 135 GB DISK)
  • kafka.storage (1TB SSD)
• PostgreSQL, SaaS - Gen 5, 32 vCore:
  • postgresql (32 vCPUs, 160 GB RAM, 8 TB, 20000 IOPS)
• Kubernetes Cluster, SaaS - DS2 v2:
  • aks.x
• Ingestion Cluster (Critical: CPU and RAM), IaaS - F4s v2:
  • ingestion.1 (4 CPU, 8 GB RAM, 64 SSD + 64 PREMIUM SSD)
  • ingestion.2 (4 CPU, 8 GB RAM, 64 SSD + 64 PREMIUM SSD)
• GeoServer Cluster (Critical: CPU), PaaS - F4s v2:
  • geoserver.1 (4 CPU, 8 GB RAM, 64 SSD + 32 PREMIUM SSD)
  • geoserver.2 (4 CPU, 8 GB RAM, 64 SSD + 32 PREMIUM SSD)
• Multipliers shown in the diagram: x3, x2, x3, x3
• Arrow labels: reads \ writes from topics, reads tables, writes to tables, manages

Slide 14

Slide 14 text

Data ingestion quick overview
• Finding the right balance between writing and reading:
  • Indexes make reads faster, but make writes slower!
• Processing uses user defined functions:
  • Filtering: creating layers for different views
  • Computations: create new attributes
  • Transformations: transform existing attributes
• Processing was implemented with Kafka Streams.
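The three kinds of user defined functions can be illustrated with plain Python generators. This is a sketch, not the Kafka Streams API (which is a Java library). The attribute names and the specific functions are hypothetical examples of a filter, a computation and a transformation.

```python
def is_fishing_vessel(position):
    """Filtering: decide whether a position belongs to a given view (layer)."""
    return position.get("ship_type") == "fishing"

def add_speed_kmh(position):
    """Computation: derive a new attribute from existing ones."""
    return {**position, "speed_kmh": position["speed_knots"] * 1.852}

def normalize_heading(position):
    """Transformation: rewrite an existing attribute."""
    return {**position, "heading": position["heading"] % 360}

def process(stream):
    """Apply the filter, then the computation and the transformation."""
    for position in stream:
        if not is_fishing_vessel(position):
            continue
        yield normalize_heading(add_speed_kmh(position))

incoming = [
    {"vessel_id": 1, "ship_type": "fishing", "speed_knots": 10, "heading": 370},
    {"vessel_id": 2, "ship_type": "cargo", "speed_knots": 14, "heading": 90},
]
fishing_layer = list(process(incoming))
```

Each function stays small and independent, so a new view can be built by composing a different filter with the same computations.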

Slide 15

Slide 15 text

Generating geometries
• Storing numerical latitudes and longitudes:
  • Allows us to use efficient numerical indexes
  • Geometries are generated on the fly on the GeoServer side
  • OGC spatial operators are supported
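The idea can be sketched as follows: a bounding-box filter becomes plain numeric range comparisons that an ordinary index can serve, and the point geometry is only built when rows are read. The sketch uses SQLite to stay self-contained (the deployment described here uses PostgreSQL). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (vessel_id INTEGER, lon REAL, lat REAL)")
# Plain B-tree index on the numeric columns; no spatial extension involved.
conn.execute("CREATE INDEX idx_positions_lon_lat ON positions (lon, lat)")
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [(1, 103.5, 1.2), (2, 12.4, 41.9), (3, 109.0, 7.5)],
)

def bbox_query(conn, min_lon, min_lat, max_lon, max_lat):
    """Evaluate a bounding-box filter as numeric range comparisons."""
    rows = conn.execute(
        "SELECT vessel_id, lon, lat FROM positions "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "ORDER BY vessel_id",
        (min_lon, max_lon, min_lat, max_lat),
    )
    # The geometry only exists at read time, built from the two numbers.
    return [(vessel_id, f"POINT ({lon} {lat})") for vessel_id, lon, lat in rows]

features = bbox_query(conn, 102.843018, 0.406491, 110.061035, 8.015716)
```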

Slide 16

Slide 16 text

Advanced authorization
• Each position is associated with a list of authorized roles:
  • GeoServer injects the authorization rules when querying the data!
  • GeoServer SQL views are sometimes used for better control of the final query.
• Administrators can configure pre-authorized caches:
  • GeoServer takes care of the routing!
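A minimal sketch of role-based filtering, assuming each position carries a list of authorized roles as described above. The role names and the in-memory filter are hypothetical. In the deployed system the equivalent rule is injected by GeoServer into the data query.

```python
def authorization_filter(user_roles):
    """Build a predicate that keeps positions visible to at least one user role."""
    roles = frozenset(user_roles)

    def is_visible(position):
        return not roles.isdisjoint(position["authorized_roles"])

    return is_visible

positions = [
    {"vessel_id": 1, "source": "SAT-AIS", "authorized_roles": ["ROLE_ALL", "ROLE_SAT"]},
    {"vessel_id": 2, "source": "T-AIS", "authorized_roles": ["ROLE_ALL"]},
]

user_1 = authorization_filter(["ROLE_ALL"])  # may see every source
user_2 = authorization_filter(["ROLE_SAT"])  # may only see SAT-AIS
visible_1 = [p["vessel_id"] for p in positions if user_1(p)]
visible_2 = [p["vessel_id"] for p in positions if user_2(p)]
```

The same data therefore yields two different views: the first user sees both vessels, the second only the SAT-AIS one.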

Slide 17

Slide 17 text

Real time ship positions
• Real time maritime picture displayed using a style that colors each vessel according to its type:

Slide 18

Slide 18 text

Real time ship positions
• Real time maritime picture displayed using a style that colors each vessel according to the age:

Slide 19

Slide 19 text

Real time ship positions
• Real time maritime picture displaying only fishing vessels colored based on their fishing gear:

Slide 20

Slide 20 text

Real time navigation to aid systems • Real time aircraft search and rescue operation:

Slide 21

Slide 21 text

Real time navigation to aid systems • Real time aids to navigation systems positions:

Slide 22

Slide 22 text

Real time ship positions
• Real time maritime picture displayed using a polar projection (EPSG:5041):

Slide 23

Slide 23 text

Advanced styling
• Several performance optimizations were implemented.
• Highlighting thousands of fast moving objects:

Slide 24

Slide 24 text

Advanced styling • Real time maritime picture displaying only cargo vessels, some of them highlighted:

Slide 25

Slide 25 text

Real time ship positions demo

Slide 26

Slide 26 text

Historical vessel tracks visualization

Slide 27

Slide 27 text

Use case overview
• Retrieve ship(s) historical positions from an Azure Data Lake (~125 billion positions!):
  • Make them available through GeoServer OGC WFS, WMS and WPS services.
  • We can afford an initial load time, but then we need to be fast!

Slide 28

Slide 28 text

Architecture overview
• Diagram components: Azure Delta Lake, Apache Spark, SQL End-Point (Photon), Databricks, Databricks Connector
• Arrow labels: Read \ Write, Read \ Write, Coordination, Read
• [PROTOTYPE]

Slide 29

Slide 29 text

Architecture overview
• Diagram components: Azure Delta Lake, Apache Spark, SQL End-Point (Photon), Databricks, Databricks Connector
• Arrow labels: Read \ Write, Read \ Write, Coordination, Read

Slide 30

Slide 30 text

Databricks quick overview
• Databricks Photon SQL query engine:
• Apache Sedona:
  • Out-of-the-box distributed spatial datasets
  • Spatial SQL with PostGIS look-alike functions!
  • Supports both raster and vector data
Compatible with Apache Spark!

Slide 31

Slide 31 text

Indexing and partitioning in practice?
• How efficient is it to query Databricks with SQL?
  • It depends!
    • Partitioning
    • Indexing
    • Structure of the data
    • …
geologic time scale vs human history time scale
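Why "it depends" can be illustrated with a toy partitioned store. The layout below (one partition per day, held in a dict) is a hypothetical stand-in, not how Databricks stores data. It only shows that a filter matching the partition key reads one partition, while a filter on another attribute has to read them all.

```python
from datetime import date

# Hypothetical layout: one partition per day.
partitions = {
    date(2021, 2, 4): [{"vessel_id": 1}, {"vessel_id": 2}],
    date(2021, 2, 5): [{"vessel_id": 1}, {"vessel_id": 3}],
    date(2021, 2, 6): [{"vessel_id": 2}],
}

def query_day(day):
    """Time filter matches the partition key: a single partition is read."""
    rows = partitions.get(day, [])
    return rows, 1

def query_vessel(vessel_id):
    """Vessel filter does not match the key: every partition is read."""
    rows = [
        row
        for partition in partitions.values()
        for row in partition
        if row["vessel_id"] == vessel_id
    ]
    return rows, len(partitions)

day_rows, day_partitions_read = query_day(date(2021, 2, 5))
vessel_rows, vessel_partitions_read = query_vessel(1)
```

A schema tuned for time-window queries is therefore a poor fit for per-vessel queries, and the reverse holds as well.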

Slide 32

Slide 32 text

Our use case?
• Too big and supports too many use cases to have an efficient partitioning schema for all!
• Retrieving 10K or 100K positions usually takes around 20 seconds
• Solutions?
  • Downsampling | pre-processing
  • Caching
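Both mitigations can be sketched in memory. The downsampling rule (first position per vessel per 10-minute bucket), the field names and the dict-based cache are assumptions made for illustration; they are not the deployed implementation.

```python
from datetime import datetime, timedelta, timezone

def downsample(positions, bucket=timedelta(minutes=10)):
    """Keep the earliest position per vessel in each time bucket."""
    bucket_seconds = bucket.total_seconds()
    seen = set()
    kept = []
    for position in sorted(positions, key=lambda p: p["timestamp"]):
        slot = int(position["timestamp"].timestamp() // bucket_seconds)
        key = (position["vessel_id"], slot)
        if key not in seen:
            seen.add(key)
            kept.append(position)
    return kept

class SelectionCache:
    """Cache results per selector so only the first request pays the slow query."""

    def __init__(self, slow_query):
        self._slow_query = slow_query
        self._results = {}
        self.lake_queries = 0

    def get(self, selector):
        if selector not in self._results:
            self.lake_queries += 1
            self._results[selector] = self._slow_query(selector)
        return self._results[selector]

start = datetime(2022, 5, 16, tzinfo=timezone.utc)
track = [
    {"vessel_id": 36547, "timestamp": start + timedelta(minutes=m)}
    for m in range(30)
]
thinned = downsample(track)

def fake_lake_query(selector):
    # Stand-in for the slow (around 20 seconds) data lake query.
    return downsample(track)

cache = SelectionCache(fake_lake_query)
first = cache.get("vesselId in (36547)")
second = cache.get("vesselId in (36547)")
```

Thirty one-minute reports collapse to three, and the second request for the same selection is answered without touching the lake.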

Slide 33

Slide 33 text

Our solution
• New DatabricksSelector vendor parameter:
  • Allows us to select a subset from the Azure data lake through Databricks:
http://.../geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&DATABRICKSSELECTOR=vesselId in (36547) AND timestamp between '2022-05-16 00:00:00.000' and '2022-05-23 00:20:00.000'&STYLES&LAYERS=positions...
Data will be cached in PostgreSQL and managed by GeoServer!
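A client would normally percent-encode the selector rather than paste it raw into the URL. The sketch below builds such a request with the standard library. The host `example.org` and the helper name `getmap_url` are hypothetical, and the remaining GetMap parameters (elided on the slide) would be passed through `extra`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def getmap_url(base_url, layers, selector, **extra):
    """Build a WMS GetMap URL carrying the DATABRICKSSELECTOR vendor parameter."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "DATABRICKSSELECTOR": selector,
        "STYLES": "",
        "LAYERS": layers,
        **extra,  # e.g. BBOX, WIDTH, HEIGHT, SRS, FORMAT
    }
    return f"{base_url}?{urlencode(params)}"

selector = (
    "vesselId in (36547) AND timestamp between "
    "'2022-05-16 00:00:00.000' and '2022-05-23 00:20:00.000'"
)
url = getmap_url("http://example.org/geoserver/wms", "positions", selector)

# Decoding the query string gives back the original selector unchanged.
decoded = parse_qs(urlsplit(url).query)["DATABRICKSSELECTOR"][0]
```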

Slide 34

Slide 34 text

Our solution

Slide 35

Slide 35 text

Our solution

Slide 36

Slide 36 text

Historical ship positions demo
• Display the positions of two vessels over six months in the past (2021):
http://20.101.92.136:8080/geoserver/big-data/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/png&TRANSPARENT=true&DATABRICKSSELECTOR=mmsi IN (477100XXX, 457650XXX) AND timestamp between '2021-01-01 00:00:00.000' and '2021-07-01 00:00:00.000'&STYLES&TIME=2021-01-01T00:00:00.00Z/2021-07-01T00:00:00.00Z&LAYERS=big-data:enc,big-data:positions&exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326&WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.8515625,154.16015625,82.6171875

Slide 37

Slide 37 text

Historical ship positions demo

Slide 38

Slide 38 text

Historical ship positions demo

Slide 39

Slide 39 text

Historical ship positions demo
• Display all the vessels that were in that specific area on the 5th of February 2021:
http://20.101.92.136:8080/geoserver/big-data/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image/png&TRANSPARENT=true&DATABRICKSSELECTOR=position_longitude > 102.843018 AND position_longitude < 110.061035 AND position_latitude > 0.406491 AND position_latitude < 8.015716 AND timestamp between '2021-02-05 00:00:00.000' and '2021-02-05 23:59:59.999'&STYLES&TIME=2021-02-05T00:00:00.00Z/2021-02-05T23:59:59.99Z&LAYERS=big-data:enc,big-data:positions&exceptions=application/vnd.ogc.se_inimage&SRS=EPSG:4326&WIDTH=1000&HEIGHT=600&BBOX=-21.62109375,-22.8515625,154.16015625,82.6171875
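A selector of this shape (bounding box plus one full day) can be assembled from typed values instead of hand-written text. The helper name below is hypothetical. The column names are the ones used in the demo request, and forcing the inputs through `float` and `date` keeps arbitrary text out of the expression.

```python
from datetime import date

def bbox_day_selector(min_lon, min_lat, max_lon, max_lat, day):
    """Build a selector for all positions inside a bounding box on a given day."""
    min_lon, min_lat = float(min_lon), float(min_lat)
    max_lon, max_lat = float(max_lon), float(max_lat)
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    iso_day = day.isoformat()
    return (
        f"position_longitude > {min_lon} AND position_longitude < {max_lon} "
        f"AND position_latitude > {min_lat} AND position_latitude < {max_lat} "
        f"AND timestamp between '{iso_day} 00:00:00.000' "
        f"and '{iso_day} 23:59:59.999'"
    )

selector = bbox_day_selector(
    102.843018, 0.406491, 110.061035, 8.015716, date(2021, 2, 5)
)
```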

Slide 40

Slide 40 text

Historical ship positions demo

Slide 41

Slide 41 text

Next steps

Slide 42

Slide 42 text

Next steps
• Improve the GeoServer Databricks integration: WPS
• We are going to propose to the GeoServer community to contribute:
  • Generated geometries
  • Databricks SQL connector
• If they are accepted, they should be available towards the end of the year (2022)

Slide 43

Slide 43 text

The End
Questions?