Slide 1

Slide 1 text

A Scripted Approach for Normalizing Geospatial Data in a Relational Database and Assessing Alternative NoSQL Database Storage Options Michael Shensky, GIS & Geospatial Data Coordinator vs. vs.

Slide 2

Slide 2 text

Personal background  DBMS experience and GIS experience  New to working in a library role Project background  Spreadsheet developed for Buildings of Texas donated to UT’s Alexander Architectural Archives  Initial goal #1: prepare dataset for dissemination through new UT GeoData portal  Initial goal #2: determine how we might be able to link data from multiple related datasets Motivation  Is it feasible/advantageous to use a NoSQL (not only SQL) graph or document database for storing related geospatial data? Backround Information

Slide 3

Slide 3 text

 CSV with 6000+ records of built works in Texas  Many built works are related via “events”’ and “contributors”  Data had to be split into multiple datasets (built works, events, contributors, roles) to allow relationships to be established between features Buildings of Texas Dataset

Slide 4

Slide 4 text

Relational database schema development Data cleaning with OpenRefine

Slide 5

Slide 5 text

Python script used to automate data processing ArcPy package to save data as related tables and feature classes in a file geodatabase during testing phase Geocoding of addresses also automated Once normalization logic worked out, script modified to also export data to:  Neo4J graph database  MongoDB document database Script also re-flattens normalized data to a reorganized CSV and shapefile at the end Scripting the Data Normalization Process

Slide 6

Slide 6 text

PostgreSQL Relational databases allow relationships to be modeled between tables with formalized schemas Free, open source, widely used, and increasing in popularity Well developed functionality in both ArcGIS & QGIS We elected to create an ArcGIS Enterprise Geodatabase in our PostgreSQL RDBMS  Allows use of ArcGIS relationship functions and other useful features  Not all geodatabase configurations will work with QGIS (ArcGIS relationship classes) To simplify RDBMS schema development testing, some work was done with a file geodatabase

Slide 7

Slide 7 text

PostgreSQL Pr os Good support in ArcMap, ArcGIS Pro, and QGIS Facilitates editing related data Cons Difficult to visualize relationships Difficult to map normalized data

Slide 8

Slide 8 text

Neo4J  Most widely used graph database  Community edition is free & open source, covered by GPL v3 license  Database populated using Cypher code which was easy to learn  Excellent relationship visualization capabilities  Python script used to programmatically generate Cypher code from the Buildings of Texas CSV (py2neo package used later)  No spatial data editing interface  No current integration with QGIS or ArcGIS software but…  Neo4J spatial library allows for spatial indexing and spatial queries  Integration available for GeoServer, uDIG, and other GIS applications

Slide 9

Slide 9 text

Neo4J Pr os Easy to visualize relationships Easy to query based on complex relationships GIS server support Cons Currently difficult to view, edit, & map data with desktop GIS software*

Slide 10

Slide 10 text

MongoDB  Data is stored in JSON documents (support for GeoJSON)  Community edition is open source, covered by the Server Side Public License  Multiple GUIs available: MongoDB Compass, Robo 3T, etc.  Relationships in the RDBMS sense as known as references in MongoDB  Embedded references are possible  MongoConnector plugin allows direct database connection in QGIS  Possible to relate documents from different collections using unique ids  Returning related documents requires multiple queries  One-to-many relationships easy to structure but many-to-many relationships are more complicated https://docs.mongodb.com/manual/core/data-modeling-introduction/

Slide 11

Slide 11 text

Pr os Support for GeoJSON objects Easy to view, edit & map GeoJSON data in QGIS with MongoConnector plugin Flexible data schemas Cons No ability to connect to MongoDB with ArcGIS software QGIS MongoConnector plugin can only load GeoJSON collections Requires multiple queries to return related documents Difficult to visualize relationships Difficult to map normalized data Difficult to edit related data MongoDB

Slide 12

Slide 12 text

 Data temporarily still being editing within CSV  Python script re-flattens related tables in geodatabase to:  Shapefile  CSV  Shapefile imported into enterprise geodatabase as flat feature class  Feature class published as a web service (WMS, WFS, map service, feature service)  Feature services incorporated into ArcGIS Online web map  Data successfully loaded into new GeoData portal Current Data Workflows

Slide 13

Slide 13 text

Neo4J Powered Web Map Database: Neo4J Code: HTML, CSS, JavaScript JavaScript Library: Leaflet Experimental Products Relationship Visualization Web App Database: Neo4J Code: HTML, CSS, JavaScript JavaScript Library: 3D Force-Directed Graph

Slide 14

Slide 14 text

 Explore saving data in multi-model databases like Arango or OrientDB as well as AgensGraph 2.0 with PostgreSQL  Experiment with the use of enterprise geodatabase views to test mapping related data stored in PostgreSQL  Determine how best to use GIS software for editing geometry and attributes for related objects  Develop QGIS plugin for loading data from Neo4J database  Ingest other datasets with features that relate to those in the Buildings of Texas dataset  The Buildings of Texas dataset will be available for download from the UT Libraries new GeoData portal later this year What’s Next?

Slide 15

Slide 15 text

 Special thanks to Jessica Trelogan, Katie Pierce Meyer, and Josh Conrad for their work on data processing & schema logic  Thank you to Gerald Moorhead and Mario Sanchez for their contribution of the Buildings of Texas dataset  Thank you! Questions? Michael Shensky  GIS & Geospatial Data Coordinator  University of Texas Libraries  [email protected] Concluding thoughts  Graph databases like Neo4J seem to hold great potential for storing related geospatial data  Document databases like MongoDB might be useful in certain scenarios, but do not currently seem to be a good fit for storing geospatial data linked by complex relationships