
Apache Gobblin

This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, its data sources and sinks, and its work unit processing.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

June 09, 2020

Transcript

1. What Is Apache Gobblin?
   • A big data integration framework
   • Simplifies integration issues like data ingestion, replication, organization and lifecycle management
   • For streaming and batch
   • An Apache incubator project

2. Gobblin Execution Modes
   • Gobblin has a number of execution modes
   • Standalone – run on a single box / JVM / embedded mode
   • MapReduce – run as a MapReduce application
   • YARN / Mesos (proposed?) – run on a cluster via a scheduler, supports HA
   • Cloud – run on AWS / Azure, supports HA

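The standalone / embedded mode can be driven directly from Java. Below is a minimal sketch using the EmbeddedGobblin helper from Gobblin's runtime API; the source, converter and writer class names come from the gobblin-example and gobblin-core modules and may differ between releases, so treat them as assumptions to verify against your version.

    import org.apache.gobblin.runtime.api.JobExecutionResult;
    import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

    public class EmbeddedJob {
      public static void main(String[] args) throws Exception {
        // Build a Gobblin job that runs inside the current JVM
        // (standalone / embedded mode). Class names are illustrative.
        EmbeddedGobblin job = new EmbeddedGobblin("wikipedia-demo")
            .setConfiguration("source.class",
                "org.apache.gobblin.example.wikipedia.WikipediaSource")
            .setConfiguration("converter.classes",
                "org.apache.gobblin.example.wikipedia.WikipediaConverter")
            .setConfiguration("writer.builder.class",
                "org.apache.gobblin.writer.ConsoleWriterBuilder");

        JobExecutionResult result = job.run(); // blocks until the job finishes
        System.out.println("Succeeded: " + result.isSuccessful());
      }
    }
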
3. Gobblin Sinks/Writers
   • Gobblin supports the following sinks: Avro HDFS, Parquet HDFS, HDFS byte array, Console (stdout), Couchbase, HTTP, JDBC, Kafka
   (a configuration sketch wiring a source to one of these sinks follows slide 4)

4. Gobblin Sources
   • Gobblin supports the following sources: Avro files, file copy, query based, REST API, Google Analytics, Google Drive, Google Webmaster, Hadoop text input, Hive Avro to ORC, Hive compliance purging, JSON, Kafka, MySQL, Oracle, Salesforce, FTP / SFTP, SQL Server, Teradata, Wikipedia

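A job picks one of these sources and one of the sinks from slide 3 purely through configuration. The sketch below assembles the relevant key/value pairs in Java; the keys (source.class, writer.builder.class, writer.destination.type, writer.output.format, data.publisher.type) are standard Gobblin configuration keys, while the concrete class names are examples that may vary by release.

    import java.util.Properties;

    public class JobWiring {
      // Bind a Kafka source to an Avro-on-HDFS writer through
      // configuration alone; swapping either end needs no code change.
      static Properties kafkaToHdfs() {
        Properties p = new Properties();
        p.setProperty("source.class",
            "org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource");
        p.setProperty("writer.builder.class",
            "org.apache.gobblin.writer.AvroDataWriterBuilder");
        p.setProperty("writer.destination.type", "HDFS");
        p.setProperty("writer.output.format", "AVRO");
        p.setProperty("data.publisher.type",
            "org.apache.gobblin.publisher.BaseDataPublisher");
        return p;
      }
    }
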
5. Gobblin Architecture
   • A Gobblin job is built on a set of pluggable constructs, which are extensible
   • A job is a set of tasks created from work units
   • The work unit serves as a container at runtime
   • Tasks are executed by the Gobblin runtime on the chosen deployment, e.g. MapReduce
   • The runtime handles scheduling, error handling etc.
   • Utilities handle metadata, state, metrics etc.

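The central pluggable construct is the Source, which produces the work units and the per-task extractors described above. The interface below is paraphrased from recent Gobblin releases (package org.apache.gobblin.source); check your version for the exact signatures.

    import java.io.IOException;
    import java.util.List;

    import org.apache.gobblin.configuration.SourceState;
    import org.apache.gobblin.configuration.WorkUnitState;
    import org.apache.gobblin.source.extractor.Extractor;
    import org.apache.gobblin.source.workunit.WorkUnit;

    // S = schema type, D = data record type.
    public interface Source<S, D> {

      // Called once per job run: partition the input into independent
      // work units, each of which becomes a task container at runtime.
      List<WorkUnit> getWorkunits(SourceState state);

      // Called once per task: create the Extractor that reads the
      // records belonging to that task's work unit.
      Extractor<S, D> getExtractor(WorkUnitState state) throws IOException;

      // Called when the job completes, for cleanup.
      void shutdown(SourceState state);
    }
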
6. Gobblin Job
   • Optionally acquire a job lock (to stop the next job instance)
   • Create the source instance
   • Create tasks from the source's work units
   • Launch and run the tasks
   • Publish the data if OK to do so
   • Persist the job/task states into the state store
   • Clean up temporary work data
   • Release the job lock (optional)

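The same flow can be summarised in code. The sketch below is not the Gobblin runtime's actual implementation; every method name is invented purely to mirror the slide's steps in order.

    import java.util.List;

    // Illustrative outline of a Gobblin job run; all members are hypothetical.
    abstract class JobFlowSketch {

      void runJob() throws Exception {
        if (!tryAcquireJobLock()) return;        // optional: stop overlapping runs
        try {
          Object source = createSourceInstance();
          List<?> workUnits = getWorkUnits(source);
          List<?> tasks = createTasks(workUnits); // one task per work unit
          runTasks(tasks);                        // on the chosen deployment
          if (allSucceeded(tasks)) {
            publishData(tasks);                   // publish only if OK to do so
          }
          persistState(tasks);                    // job/task state -> state store
        } finally {
          cleanUpTemporaryWorkData();
          releaseJobLock();                       // optional job lock
        }
      }

      abstract boolean tryAcquireJobLock();
      abstract Object createSourceInstance();
      abstract List<?> getWorkUnits(Object source);
      abstract List<?> createTasks(List<?> workUnits);
      abstract void runTasks(List<?> tasks);
      abstract boolean allSucceeded(List<?> tasks);
      abstract void publishData(List<?> tasks);
      abstract void persistState(List<?> tasks);
      abstract void cleanUpTemporaryWorkData();
      abstract void releaseJobLock();
    }
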
7. Gobblin Constructs
   • The source partitions data into work units
   • The source creates work unit data extractors
   • Converters convert schemas and data records
   • Quality checkers check data at the row and task level
   • The fork operator allows control to flow into multiple streams
   • Writers send data records to the sink
   • The publisher publishes the job's records

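Of these constructs, the converter is the easiest to show end to end. The sketch below extends Gobblin's Converter base class (org.apache.gobblin.converter) with a trivial String-to-String implementation; the method names match recent releases, but verify them against your version.

    import java.util.Collections;

    import org.apache.gobblin.configuration.WorkUnitState;
    import org.apache.gobblin.converter.Converter;
    import org.apache.gobblin.converter.DataConversionException;
    import org.apache.gobblin.converter.SchemaConversionException;

    // Passes the schema through unchanged and upper-cases each record.
    // The Iterable return type is what lets a converter drop a record
    // (empty iterable) or expand one into many (multiple elements).
    public class UpperCaseConverter
        extends Converter<String, String, String, String> {

      @Override
      public String convertSchema(String inputSchema, WorkUnitState workUnit)
          throws SchemaConversionException {
        return inputSchema; // no schema change
      }

      @Override
      public Iterable<String> convertRecord(String outputSchema,
          String inputRecord, WorkUnitState workUnit)
          throws DataConversionException {
        return Collections.singleton(inputRecord.toUpperCase());
      }
    }
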
8. Gobblin Job Configuration
   • Gobblin jobs are configured via configuration files
   • These may be named .pull / .job plus .properties
   • The source properties file defines connection / converter / quality / publisher properties
   • The job file defines name / group / description / schedule, extraction properties and source properties

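Concretely, the keys such a job file carries look like the set below, here assembled as a java.util.Properties object. The job.* and extract.* key names are standard Gobblin configuration keys; the values are illustrative, loosely following the Wikipedia example that ships with Gobblin.

    import java.util.Properties;

    public class WikipediaJobConfig {
      static Properties jobProperties() {
        Properties p = new Properties();
        p.setProperty("job.name", "PullFromWikipedia");  // name
        p.setProperty("job.group", "Wikipedia");         // group
        p.setProperty("job.description", "Pull revisions of Wikipedia pages");
        p.setProperty("job.schedule", "0 0/2 * * * ?");  // Quartz cron schedule
        p.setProperty("extract.namespace", "org.apache.gobblin.example.wikipedia");
        p.setProperty("source.class",
            "org.apache.gobblin.example.wikipedia.WikipediaSource");
        p.setProperty("converter.classes",
            "org.apache.gobblin.example.wikipedia.WikipediaConverter");
        return p;
      }
    }
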
9. Available Books
   • See "Big Data Made Easy" – Apress, Jan 2015
   • See "Mastering Apache Spark" – Packt, Oct 2015
   • See "Complete Guide to Open Source Big Data Stack" – Apress, Jan 2018
   • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
   • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020

10. Connect
   • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
   • See my open source blog at open-source-systems.blogspot.com/
   • I am always interested in new technology, opportunities, technology-based issues and big data integration