Ground: A Data Context Service

Talk from CIDR 2017

Joe Hellerstein

January 09, 2017

Transcript

  1. Ground: A Data Context Service
    Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al.
    CIDR 2017
    https://github.com/ground-context/ground
  2. Open Source Big Data Community Health Long-term Data Management Data

    Analysis Data Wrangling FAIL
  3. What was the big data revolution really all about?

  4. Database

  5. A DECOUPLED STACK
    Big Data: Ingest/PubSub, Workflow Scheduler, Storage, Dataflow Engine, Query Optimizer, API / Query Language
  6. A DECOUPLED STACK
    Ingest/PubSub, Workflow Scheduler, Storage, Dataflow Engine, Query Optimizer, API / Query Language
    SQL, GP, ORCA
    The Good: Agility
  7. A DECOUPLED STACK
    SQL, GP, ORCA
    The Bad: Dis-integration.

  8. CRISIS: HOW DO WE SHARE INFORMATION?

  9. WHAT IS METADATA?

  10. WHAT IS METADATA?
    • Data about data
    • This used to be so simple!
    • But … schema on use
    • One of many changes
  11. OPPORTUNITY: A BIGGER CONTEXT
    Don’t just fill a metadata-sized hole in the big data stack.
    Lay the groundwork for rich data context.
  12. WHAT IS DATA CONTEXT? All the information surrounding the use

    of data.
  13. The ABCs of Data Context
    • Application Context: Views, models, code
    • Behavioral Context: Data lineage & usage
    • Change Over Time: Version histories
    Generated by, and useful to, many applications and components.
  14. Janet: I bet social media content can predict which customers might cancel their accounts!
    Ground: Hey Janet! We already paid for a full Gnip feed from Twitter; you can find it here. By the way: Sue used the following related table and script.
  15. Janet: I bet social media content can predict which customers might cancel their accounts!
    Ground: Hey Janet! This looks like Twitter JSON. Many people use this script to turn it into a table. Be careful: When people store outputs from this script, the following fields are often flagged by IT as PII. BTW, have you tried the sentiment analysis package?
  16. Janet: It looks true! Tweets predict churn!
    [Chart; shared with Sue.]
  17. Sue: I wonder if Janet’s sentiment analysis will help with my discount targeting pipeline.

    TweetId  Text           Sentiment
    47       “sad!”         negative
    53       “awesome!”     positive
    57       “go packers!”  neutral
    64       “fleek!”       positive

    TweetId  Text           neg  pos  neut
    47       “sad!”         1    0    0
    53       “awesome!”     0    1    0
    57       “go packers!”  0    0    1
    64       “fleek!”       0    1    0

    [Chart]
  18. Time passes…
    Sue: Uh oh, prediction accuracy metrics are down!

    TweetId  Text           neg  pos  neut
    47       “sad!”         0    0    0
    53       “awesome!”     0    0    0
    57       “go packers!”  0    0    0
    64       “fleek!”       0    0    0

    TweetId  Text           Sentiment
    47       “sad!”         sadness
    53       “awesome!”     elation
    57       “go packers!”  sports
    64       “fleek!”       trendy

    Ground: FYI: Janet’s wrangling script changed!
    [Chart: Prediction Accuracy (0 to 100), 1/1/2017 00:00 to 1/2/17 12:00]
    VERSION HISTORY
    12/31/2016 00:00 -800
    hash: 6dda491064bcce14f558bf83867b8c247027c423
    user: will
    Sue: Oh dear. I better call a meeting to introduce better governance on sentiment labeler.
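The failure on slides 17 and 18 can be reproduced in a few lines. The sketch below is a guess at how such a wrangling step might behave, assuming it one-hot encodes a fixed set of three sentiment labels. The names `one_hot`, `unknown_labels`, and `LABEL_TO_COLUMN` are hypothetical and come from neither the talk nor the Ground codebase.

```python
# Hypothetical reconstruction of the wrangling step; not code from the talk.
COLUMNS = ["neg", "pos", "neut"]
LABEL_TO_COLUMN = {"negative": "neg", "positive": "pos", "neutral": "neut"}


def one_hot(rows):
    """Turn (tweet_id, text, sentiment) rows into one-hot encoded rows."""
    encoded_rows = []
    for tweet_id, text, sentiment in rows:
        column = LABEL_TO_COLUMN.get(sentiment)  # None for an unknown label
        flags = {name: int(name == column) for name in COLUMNS}
        encoded_rows.append({"TweetId": tweet_id, "Text": text, **flags})
    return encoded_rows


def unknown_labels(rows):
    """Return the labels the encoder does not recognize, sorted."""
    return sorted({s for _, _, s in rows if s not in LABEL_TO_COLUMN})


before = [(47, "sad!", "negative"), (53, "awesome!", "positive"),
          (57, "go packers!", "neutral"), (64, "fleek!", "positive")]
after = [(47, "sad!", "sadness"), (53, "awesome!", "elation"),
         (57, "go packers!", "sports"), (64, "fleek!", "trendy")]

print(one_hot(before)[0])     # one flag set
print(one_hot(after)[0])      # every flag is 0, and nothing raised an error
print(unknown_labels(after))  # the check that would have caught the change
```

The encoder never fails; it silently emits all-zero rows once the label vocabulary changes, which is why the problem only shows up downstream as a drop in prediction accuracy.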
  19. WHAT DID CONTEXT ENABLE?
    • Figuring out which changes introduced the error. (VERSION HISTORY)
    • Determining who made the change to help us resolve the issue. (user: will)
    • Fueling our model accuracy monitor. [Chart: 0 to 100, 1/1/2017 00:00 to 1/2/17 00:00]
    • Self-service catalog, wrangling and analytics.
    • Collective governance of data.
  20. THE BIG CONTEXT
    Where are the interesting technical challenges? All over!
    Our goal is not to solve all these challenges. It’s to provide an environment to enable solutions.
  21. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Time Travel, Model Serving
    COMMON GROUND: METAMODEL
    UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
  22. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Time Machine, Model Serving
    COMMON GROUND: METAMODEL (CONTEXT MODEL)
    UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
    Also shown: Pachyderm, Chronos
  23. DESIGN REQUIREMENTS
    • Model-agnostic
    • Immutable
    • Scalable
    • Politically Neutral
  24. Postel’s Law
    Be conservative in what you do, be liberal in what you accept from others.
  25. COMMON GROUND: The metamodel
    A. Model Graphs

  26. [Diagram: two model graphs.
    RELATIONAL SCHEMA: Schema 1; Table 1 with Column 1 … Column c; Table t with Column 1 … Column d; foreign key.
    JSON DOCUMENT: Root; Object 2; members k1, k1: string, k2, k2: number, k11: string, k12; two arrays of element 1, element 2, element 3.]
  27. COMMON GROUND: The versioning model
    A. Model Graphs
    B. Version Graphs
  28. COMMON GROUND: The versioning model
    A. Model Graphs
    B. Version Graphs
  29. COMMON GROUND: The usage model
    A. Model Graphs
    B. Version Graphs
    C. Lineage Graphs
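To make the three layers concrete, here is a minimal in-memory sketch, assuming a model graph of named nodes and edges, a version graph that links each version to its parents, and a lineage graph whose edges connect versions. `ContextStore` and all of its methods are hypothetical names for illustration only; this is not Ground's API or its storage design.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Toy holder for model, version, and lineage graphs (hypothetical)."""

    nodes: set = field(default_factory=set)       # A. model graph nodes
    edges: set = field(default_factory=set)       # A. model graph edges
    versions: dict = field(default_factory=dict)  # B. id -> (item, parents)
    lineage: set = field(default_factory=set)     # C. (from_ver, to_ver)

    def add_node(self, name):
        self.nodes.add(name)

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.add((src, dst))

    def add_version(self, version_id, item, parents=()):
        # No overwrite: an existing version can never be replaced.
        if version_id in self.versions:
            raise ValueError("versions are immutable; add a new one instead")
        if item not in self.nodes:
            raise KeyError(f"unknown item: {item}")
        missing = [p for p in parents if p not in self.versions]
        if missing:
            raise KeyError(f"unknown parent versions: {missing}")
        self.versions[version_id] = (item, tuple(parents))

    def add_lineage(self, src_version, dst_version):
        if src_version not in self.versions or dst_version not in self.versions:
            raise KeyError("lineage must connect existing versions")
        self.lineage.add((src_version, dst_version))

    def history(self, version_id):
        """Return the version followed by all of its ancestors."""
        seen, stack, order = set(), [version_id], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            order.append(current)
            stack.extend(self.versions[current][1])
        return order


store = ContextStore()
for name in ("tweets_table", "sentiment_column", "wrangle_script"):
    store.add_node(name)
store.add_edge("tweets_table", "sentiment_column")        # table -> column
store.add_version("script@1", "wrangle_script")
store.add_version("script@2", "wrangle_script", parents=["script@1"])
store.add_version("tweets@1", "tweets_table")
store.add_lineage("script@1", "tweets@1")                 # script produced table
print(store.history("script@2"))
```

Keeping the layers separate means a schema-like structure, its change history, and its usage can each be queried on their own, which is one plausible reading of the layering on these slides.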
  30. SCALABLE, IMMUTABLE BACKEND
    Longstanding open problem
    Workloads?
    • Graph queries for metamodel traversal
    • Log analysis queries for usage
    Room for improvement
    • Goal: compete with in-memory performance (“the McSherry baseline”)

    Ground 0 makes use of LinkedIn’s Gobblin system for crawling and ingest from files, databases, web sources and the like. We have integrated and evaluated a number of backing stores for versioned storage, including PostgreSQL, Cassandra, TitanDB and Neo4j; we report on results later in this section. We are currently integrating ElasticSearch for text indexing and are still evaluating options for ID/Authorization and Workflow/Scheduling.

    To exercise our initial design and provide immediate functionality, we built support for three sources of metadata most commonly used in the Big Data ecosystem: file metadata from HDFS, schemas from Hive, and code versioning from git.

    To support HDFS, we extended Gobblin to extract file system metadata from its HDFS crawls and publish to Ground’s Kafka connector. The resulting metadata is then ingested into Ground, and notifications are published on a Kafka channel for applications to respond to.

    To support Hive, we built an API shim that allows Ground to serve as a drop-in replacement for the Hive Metastore. One key benefit of using Ground as Hive’s relational catalog is Ground’s built-in support for versioning, which, combined with the append-only nature of HDFS, makes it possible to time travel and view Hive tables as they appeared in the past.

    To support git, we have built crawlers to extract git history graphs as ExternalVersions in Ground. These three scenarios guided our design for Common Ground.

    Figure 8: Dwell time analysis.
    Figure 9: Impact analysis.
    Figure 10: PostgreSQL transitive closure variants.
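An impact analysis over lineage amounts to a transitive closure: everything reachable downstream of a changed item. The sketch below illustrates the idea with a recursive query, using Python's built-in `sqlite3` as a stand-in for PostgreSQL. The `lineage` table, its rows, and the `impacted` helper are assumptions made for illustration; this is not one of the query variants evaluated in the paper.

```python
import sqlite3

# Hypothetical lineage edges: src feeds into dst.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?)",
    [
        ("gnip_feed", "labeler"),
        ("labeler", "wrangle_script"),
        ("wrangle_script", "one_hot_table"),
        ("one_hot_table", "churn_model"),
        ("one_hot_table", "discount_pipeline"),
    ],
)


def impacted(connection, start):
    """Return every item reachable downstream of `start`, sorted by name."""
    rows = connection.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM lineage WHERE src = ?
            UNION  -- UNION drops duplicates, so cycles still terminate
            SELECT l.dst FROM lineage AS l JOIN reach AS r ON l.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]


print(impacted(conn, "wrangle_script"))
```

Reversing the join direction (following `dst` back to `src`) would give the upstream closure instead, which answers the "which change introduced the error" question from the earlier scenario.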
  31. NEUTRALITY
    Reminder: There will be k competing solutions for:
    • Data wrangling
    • Data cataloging
    • Schema extraction
    • Feature extraction
    • Social network analysis
    • Etc.
    • This will consolidate somewhat, but only over time
    Goal: foster the ecosystem
  32. NEUTRALITY YOU

  33. MANY OPEN RESEARCH QUESTIONS
    Underground
    • Workloads
    • Common Ground representations
    • No-overwrite versioned DB
    • Time travel queries: point and trend; graph queries + log analysis
    • Consistency
    Aboveground
    • Content extraction
    • Analytic user exhaust
    • Socio-technical networks
    • Collective governance
    • Reproducibility
    • Lifecycle of systems that learn
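One way to read "point and trend" time travel over a no-overwrite store: a point query asks what was current at a given instant, and a trend query asks for every version within a time range. The sketch below assumes that reading and an append-only, time-ordered log. `VersionLog`, `as_of`, and `trend` are hypothetical names, and the timestamps are illustrative.

```python
import bisect
from datetime import datetime


class VersionLog:
    """Append-only log of (timestamp, value) pairs; nothing is overwritten."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, when, value):
        if self._times and when < self._times[-1]:
            raise ValueError("appends must arrive in time order")
        self._times.append(when)
        self._values.append(value)

    def as_of(self, when):
        """Point query: the value current at `when`, or None if none yet."""
        index = bisect.bisect_right(self._times, when)
        return self._values[index - 1] if index else None

    def trend(self, start, end):
        """Trend query: all (timestamp, value) pairs with start <= t <= end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


log = VersionLog()
log.append(datetime(2016, 12, 31, 0, 0), "script v1")
log.append(datetime(2017, 1, 1, 18, 0), "script v2")

print(log.as_of(datetime(2017, 1, 1, 12, 0)))
print(log.trend(datetime(2016, 12, 31), datetime(2017, 1, 2)))
```

Because entries are only ever appended in time order, both queries reduce to binary searches over the timestamp list.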
  34. CURRENT STATUS
    Alpha Release
    • Integrated with LinkedIn Gobblin, Kafka, Hive Metastore, GitHub
    • All components have Docker images on DockerHub
    • We’d love feedback!
    www.ground-context.org