Streaming Big Data Analytics: OmniSci on Microsoft Azure

OmniSci
December 07, 2018

2018 Philadelphia Azure DataFest: Advanced Analytics and Big Data Conference
Friday, December 7, 2018 from 9:00 AM to 5:00 PM (EST)
Malvern, PA

Randy Zwitch, Senior Developer Advocate and Ashish Bambroo, VP of Business Development

Transcript

  1. © OmniSci 2018 Ashish Bambroo, Vice President of Business Development

    [email protected] /in/ashishbambroo/
    Randy Zwitch, Senior Developer Advocate
    @randyzwitch [email protected] /in/randyzwitch/ /randyzwitch
  2. Data Grows Faster Than CPU Processing

    Data Growth: 40% per year
    CPU Processing Power: 20% per year
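The two rates on this slide compound, so the gap widens every year. As a rough illustration only, assuming both rates stay constant (a simplification that is not claimed on the slide), the ratio of data volume to CPU processing power grows by a factor of 1.4 / 1.2 per year. The function names below are hypothetical.

```python
import math

DATA_GROWTH = 0.40  # data growth per year, from the slide
CPU_GROWTH = 0.20   # CPU processing power growth per year, from the slide


def gap_ratio(years: float) -> float:
    """Data volume relative to CPU power after `years`, both starting at 1.0."""
    return ((1 + DATA_GROWTH) / (1 + CPU_GROWTH)) ** years


def years_to_double_gap() -> float:
    """Years until data has grown twice as much as CPU power has."""
    return math.log(2) / math.log((1 + DATA_GROWTH) / (1 + CPU_GROWTH))


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"after {n:>2} years: gap = {gap_ratio(n):.2f}x")
    print(f"gap doubles roughly every {years_to_double_gap():.1f} years")
```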
  3. About OmniSci

    - Top-tier venture backing
    - Used by 100+ global orgs
    - $92 million in funding
    - Open-source community
  4. Four Ways to Get Started

    - Open Source: GitHub repo
    - Community: website download
    - OmniSci Cloud: OmniSci as a service
    - Enterprise: contact sales
  5. Top OmniSci Use Cases

    - Automotive: vehicle telematics analysis, preventative maintenance, supply chain logistics
    - Capital Markets: investing with alternative data, net asset valuation, portfolio performance risk
    - Defense & Intelligence: geospatial intelligence (GEOINT), pattern of life analysis, battlespace information dominance
    - Telecommunications: network reliability analysis, location-enabled services, field service tracking
    - Utilities: smart meter analysis, grid reliability analysis, preventative maintenance
    - Other: oil & gas well log analysis, pharmaceutical clinical trial analysis, fleet telematics analysis, logistics telematics analysis
  6. OmniSci Innovations Powering Extreme Analytics

    - 3-Tier Memory Caching
    - Query Compilation
    - In-Situ Rendering
  7. Goal: Monitor Bike Availability in Real-Time

    Architecture Considerations:
    - Hundreds of API feeds conforming to the GBFS specification: https://github.com/NABSA/gbfs
    - Each feed provides a relatively small amount of info as JSON; need to pre-process before loading to OmniSci
    - Feeds have different TTL values; want to be respectful when pinging API endpoints
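One way to honor per-feed TTLs is to schedule each feed's next request from the `ttl` field that GBFS responses carry (seconds until the data is expected to change). The sketch below is not the pipeline from the talk, which used StreamSets; `PollSchedule`, `next_poll_at`, and the `MIN_INTERVAL_S` floor are hypothetical names and values chosen for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Assumed floor so that a ttl of 0 never turns into a tight polling loop.
MIN_INTERVAL_S = 5.0


def next_poll_at(payload: Dict, fetched_at: float) -> float:
    """Earliest time (epoch seconds) at which the feed should be requested again."""
    ttl = float(payload.get("ttl", 0))
    return fetched_at + max(ttl, MIN_INTERVAL_S)


class PollSchedule:
    """Min-heap of (due_time, feed_url); feeds come out in due-time order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def add(self, url: str, due: float = 0.0) -> None:
        heapq.heappush(self._heap, (due, url))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every feed whose due time is at or before `now`."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

After fetching a feed, a caller would re-add it with `schedule.add(url, next_poll_at(payload, fetched_at))`, so feeds with long TTLs are contacted less often than feeds with short ones.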
  8. Data Pre-Processing: HDInsight and StreamSets

    Artwork: https://www.jowanza.com/blog/2018/9/8/real-time-station-tracking-ford-gobike-and-mapd
    - Using Azure HDInsight, we can set up a managed Apache Kafka cluster
    - Kafka serves several purposes: aggregating feeds into a single stream, and buffering for more consistent throughput
    - StreamSets is a data pipeline tool, provided as an option during HDInsight setup
  9. Data Pre-Processing: HDInsight and StreamSets

    Each GBFS feed is a separate pipeline in StreamSets. JSON gets processed into records, values are transformed, then each pipeline writes to the same Kafka topic
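The JSON-to-records step can be pictured in plain Python. This is only a sketch of the idea, not the StreamSets pipeline itself: it assumes a GBFS `station_status` payload (with `last_updated`, `data.stations`, `num_bikes_available`, `num_docks_available`), and the output field names, the `system_id` tag, and both function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterator


def station_records(system_id: str, payload: Dict) -> Iterator[Dict]:
    """Flatten one station_status payload into one flat record per station."""
    # Transform the epoch timestamp into an ISO 8601 string in UTC.
    feed_ts = datetime.fromtimestamp(payload["last_updated"], tz=timezone.utc)
    for station in payload["data"]["stations"]:
        yield {
            "system_id": system_id,  # tags the record with its source feed
            "station_id": str(station["station_id"]),
            "bikes_available": int(station["num_bikes_available"]),
            "docks_available": int(station["num_docks_available"]),
            "feed_updated_at": feed_ts.isoformat(),
        }


def to_kafka_value(record: Dict) -> bytes:
    """Serialize a record as UTF-8 JSON, a common encoding for Kafka message values."""
    return json.dumps(record, sort_keys=True).encode("utf-8")
```

Because every record carries its own `system_id`, records from many feeds can share one topic and still be told apart downstream.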
  10. Data Ingestion: StreamSets to OmniSci

    Artwork: https://www.jowanza.com/blog/2018/9/8/real-time-station-tracking-ford-gobike-and-mapd
    - With feeds aggregated into a single Kafka topic, ingest to OmniSci via JDBC
    - OmniSci supports data streaming directly from Kafka, but using StreamSets allows for additional transformation
    - Using JDBC along with StreamSets also allows StreamSets to manage retries and Kafka offsets
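The point about retries and offsets comes down to ordering: write a batch to the database first, and only mark its Kafka offsets as consumed once the write has committed. The sketch below shows that pattern with Python's built-in `sqlite3` module standing in for the OmniSci JDBC destination; it is not OmniSci's API or StreamSets' implementation, and the table name, columns, and `Message` shape are assumptions.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

# (Kafka offset, decoded record); this shape is an assumption for the sketch.
Message = Tuple[int, Dict]


def write_batch(conn: sqlite3.Connection, batch: List[Message]) -> Optional[int]:
    """Insert a batch in one transaction; return the last offset that is safe to commit."""
    if not batch:
        return None
    rows = [
        (rec["system_id"], rec["station_id"], rec["bikes_available"])
        for _, rec in batch
    ]
    with conn:  # commits on success, rolls back if the insert raises
        conn.executemany(
            "INSERT INTO station_status (system_id, station_id, bikes_available) "
            "VALUES (?, ?, ?)",
            rows,
        )
    # Reached only after the transaction committed, so a failed batch is
    # never acknowledged and can be retried from the same offsets.
    return batch[-1][0]
```

If the insert fails, the exception propagates and no offset is returned, which gives at-least-once delivery: a retried batch may be re-read, but it is never silently dropped.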
  11. Real-Time Dashboard in OmniSci

    - Time slider snaps to hour-width. Moving slider crossfilters entire dashboard
    - Clicking on bike share location crossfilters entire dashboard
    - With the massive compute power of GPUs, data is available to the dashboard as soon as it is ingested; no indexing or other backend operations needed
  12. Ashish Bambroo, Vice President of Business Development

    [email protected] /in/ashishbambroo/
    Randy Zwitch, Senior Developer Advocate
    @randyzwitch [email protected] /in/randyzwitch/ /randyzwitch
  13. Appendix: HDInsight Cluster Size

    For illustration purposes only; the hardware required depends on use case, data volumes, etc.
  14. Appendix: GPU-enabled VM size for OmniSci

    - OmniSci Hardware Reference Guide: https://www.omnisci.com/docs/latest/4_hardware_configuration_guide.html
    - For best performance, OmniSci recommends having additional Premium SSD(s) attached to the GPU-enabled VM solely for OmniSci read/write
    - The number of GPUs required depends on how much data needs to be kept “hot” for a given workload. More GPUs = More GPU RAM = More data caching
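The "More GPUs = More GPU RAM" relationship can be turned into a back-of-the-envelope estimate. This is a hypothetical sizing helper, not OmniSci guidance: the function name and the `usable_fraction` headroom factor are assumptions, and real sizing should follow the hardware reference guide linked above.

```python
import math


def gpus_needed(hot_data_gb: float, gpu_ram_gb: float, usable_fraction: float = 0.8) -> int:
    """Estimate how many GPUs are needed to keep `hot_data_gb` resident in GPU RAM.

    `usable_fraction` is an assumed share of each GPU's RAM available for data
    caching; the remainder is left as headroom for everything else.
    """
    if hot_data_gb < 0 or gpu_ram_gb <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("sizes must be positive and usable_fraction must be in (0, 1]")
    per_gpu_gb = gpu_ram_gb * usable_fraction
    return max(1, math.ceil(hot_data_gb / per_gpu_gb))
```

For example, `gpus_needed(100, 16)` estimates the GPU count for 100 GB of hot data on 16 GB cards under the assumed 80% usable fraction.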