A Scripted Approach for Normalizing Geospatial Data in a Relational Database and Assessing Alternative NoSQL Database Storage Options

A Scripted Approach for Normalizing Geospatial Data in a Relational Database and Assessing Alternative NoSQL Database Storage Options

Michael Shensky is the GIS & Geospatial Data Coordinator for the UT Libraries at the University of Texas at Austin. In this role he develops GIS training resources, provides personalized research consultation services for the university community, and designs GIS solutions for improving access to geospatial resources in the UT Libraries’ collections. He earned his MA in Geography from California State University, Fullerton in 2013.

Transcript

  1. A Scripted Approach for Normalizing Geospatial Data in a Relational

    Database and Assessing Alternative NoSQL Database Storage Options Michael Shensky, GIS & Geospatial Data Coordinator vs. vs.
  2. Personal background  DBMS experience and GIS experience  New

    to working in a library role Project background  Spreadsheet developed for Buildings of Texas donated to UT’s Alexander Architectural Archives  Initial goal #1: prepare dataset for dissemination through new UT GeoData portal  Initial goal #2: determine how we might be able to link data from multiple related datasets Motivation  Is it feasible/advantageous to use a NoSQL (not only SQL) graph or document database for storing related geospatial data? Backround Information
  3.  CSV with 6000+ records of built works in Texas

     Many built works are related via “events”’ and “contributors”  Data had to be split into multiple datasets (built works, events, contributors, roles) to allow relationships to be established between features Buildings of Texas Dataset
  4. Relational database schema development Data cleaning with OpenRefine

  5. Python script used to automate data processing ArcPy package to

    save data as related tables and feature classes in a file geodatabase during testing phase Geocoding of addresses also automated Once normalization logic worked out, script modified to also export data to:  Neo4J graph database  MongoDB document database Script also re-flattens normalized data to a reorganized CSV and shapefile at the end Scripting the Data Normalization Process
  6. PostgreSQL Relational databases allow relationships to be modeled between tables

    with formalized schemas Free, open source, widely used, and increasing in popularity Well developed functionality in both ArcGIS & QGIS We elected to create an ArcGIS Enterprise Geodatabase in our PostgreSQL RDBMS  Allows use of ArcGIS relationship functions and other useful features  Not all geodatabase configurations will work with QGIS (ArcGIS relationship classes) To simplify RDBMS schema development testing, some work was done with a file geodatabase
  7. PostgreSQL Pr os Good support in ArcMap, ArcGIS Pro, and

    QGIS Facilitates editing related data Cons Difficult to visualize relationships Difficult to map normalized data
  8. Neo4J  Most widely used graph database  Community edition

    is free & open source, covered by GPL v3 license  Database populated using Cypher code which was easy to learn  Excellent relationship visualization capabilities  Python script used to programmatically generate Cypher code from the Buildings of Texas CSV (py2neo package used later)  No spatial data editing interface  No current integration with QGIS or ArcGIS software but…  Neo4J spatial library allows for spatial indexing and spatial queries  Integration available for GeoServer, uDIG, and other GIS applications
  9. Neo4J Pr os Easy to visualize relationships Easy to query

    based on complex relationships GIS server support Cons Currently difficult to view, edit, & map data with desktop GIS software*
  10. MongoDB  Data is stored in JSON documents (support for

    GeoJSON)  Community edition is open source, covered by the Server Side Public License  Multiple GUIs available: MongoDB Compass, Robo 3T, etc.  Relationships in the RDBMS sense as known as references in MongoDB  Embedded references are possible  MongoConnector plugin allows direct database connection in QGIS  Possible to relate documents from different collections using unique ids  Returning related documents requires multiple queries  One-to-many relationships easy to structure but many-to-many relationships are more complicated https://docs.mongodb.com/manual/core/data-modeling-introduction/
  11. Pr os Support for GeoJSON objects Easy to view, edit

    & map GeoJSON data in QGIS with MongoConnector plugin Flexible data schemas Cons No ability to connect to MongoDB with ArcGIS software QGIS MongoConnector plugin can only load GeoJSON collections Requires multiple queries to return related documents Difficult to visualize relationships Difficult to map normalized data Difficult to edit related data MongoDB
  12.  Data temporarily still being editing within CSV  Python

    script re-flattens related tables in geodatabase to:  Shapefile  CSV  Shapefile imported into enterprise geodatabase as flat feature class  Feature class published as a web service (WMS, WFS, map service, feature service)  Feature services incorporated into ArcGIS Online web map  Data successfully loaded into new GeoData portal Current Data Workflows
  13. Neo4J Powered Web Map Database: Neo4J Code: HTML, CSS, JavaScript

    JavaScript Library: Leaflet Experimental Products Relationship Visualization Web App Database: Neo4J Code: HTML, CSS, JavaScript JavaScript Library: 3D Force-Directed Graph
  14.  Explore saving data in multi-model databases like Arango or

    OrientDB as well as AgensGraph 2.0 with PostgreSQL  Experiment with the use of enterprise geodatabase views to test mapping related data stored in PostgreSQL  Determine how best to use GIS software for editing geometry and attributes for related objects  Develop QGIS plugin for loading data from Neo4J database  Ingest other datasets with features that relate to those in the Buildings of Texas dataset  The Buildings of Texas dataset will be available for download from the UT Libraries new GeoData portal later this year What’s Next?
  15.  Special thanks to Jessica Trelogan, Katie Pierce Meyer, and

    Josh Conrad for their work on data processing & schema logic  Thank you to Gerald Moorhead and Mario Sanchez for their contribution of the Buildings of Texas dataset  Thank you! Questions? Michael Shensky  GIS & Geospatial Data Coordinator  University of Texas Libraries  m.shensky@austin.utexas.edu Concluding thoughts  Graph databases like Neo4J seem to hold great potential for storing related geospatial data  Document databases like MongoDB might be useful in certain scenarios, but do not currently seem to be a good fit for storing geospatial data linked by complex relationships