
Apache Gobblin



This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, its data sources and sinks, and its work unit processing.

Links for further information and for connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/


Mike Frampton

June 09, 2020


Transcript

1. What Is Apache Gobblin?
   • A big data integration framework
   • To simplify integration issues like
     – Data ingestion
     – Replication
     – Organization
     – Lifecycle management
   • For streaming and batch
   • An Apache incubator project
2. Gobblin Execution Modes
   • Gobblin has a number of execution modes
   • Standalone
     – Run on a single box / JVM / embedded mode (see the sketch below)
   • Map Reduce
     – Run as a map reduce application
   • Yarn / Mesos (proposed?)
     – Run on a cluster via a scheduler, supports HA
   • Cloud
     – Run on AWS / Azure, supports HA
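As a concrete illustration of the standalone/embedded mode, the sketch below runs a job inside the current JVM. It assumes Gobblin's EmbeddedGobblin runtime class and borrows the source and converter classes from the Wikipedia example that ships with Gobblin; treat package locations and configuration keys as version-dependent.

```java
// Minimal embedded-mode sketch, assuming Gobblin's EmbeddedGobblin API.
// Package locations may vary between Gobblin versions.
import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedJobExample {
  public static void main(String[] args) throws Exception {
    // Build an in-JVM job; the source and converter come from Gobblin's
    // bundled Wikipedia example, so no external cluster is needed.
    EmbeddedGobblin job = new EmbeddedGobblin("wikipedia-embedded")
        .setConfiguration("source.class",
            "org.apache.gobblin.example.wikipedia.WikipediaSource")
        .setConfiguration("converter.classes",
            "org.apache.gobblin.example.wikipedia.WikipediaConverter");

    JobExecutionResult result = job.run();   // blocks until the job finishes
    System.out.println("Job successful? " + result.isSuccessful());
  }
}
```

In recent Gobblin versions the same builder can also be switched onto MapReduce execution, which is what keeps the modes listed above interchangeable for a given job definition.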
3. Gobblin Sinks/Writers
   • Gobblin supports the following sinks
     – Avro HDFS
     – Parquet HDFS
     – HDFS byte array
     – Console (StdOut)
     – Couchbase
     – HTTP
     – JDBC
     – Kafka
4. Gobblin Sources
   Gobblin supports the following sources
   • Avro files
   • File copy
   • Query based
   • REST API
   • Google Analytics
   • Google Drive
   • Google Webmaster
   • Hadoop text input
   • Hive Avro to ORC
   • Hive compliance purging
   • JSON
   • Kafka
   • MySQL
   • Oracle
   • Salesforce
   • FTP / SFTP
   • SQL Server
   • Teradata
   • Wikipedia
5. Gobblin Architecture
   • A Gobblin job is built on a set of pluggable constructs (sketched below)
   • Which are extensible
   • A job is a set of tasks created from its work units
   • The work unit serves as a container at runtime
   • Tasks are executed by the Gobblin runtime
     – On the chosen deployment, e.g. MapReduce
   • The runtime handles scheduling, error handling etc
   • Utilities handle metadata, state, metrics etc
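To make the pluggable-construct idea concrete, here is a minimal sketch of a custom Source, assuming Gobblin's org.apache.gobblin.source.Source interface. The partition count, the partition.id property key, and MyExtractor are illustrative assumptions, not part of Gobblin.

```java
// A minimal pluggable-Source sketch against the Gobblin Source interface.
// MyExtractor and the "partition.id" key are hypothetical.
import java.util.ArrayList;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.WorkUnit;

public class PartitionedSource implements Source<String, String> {

  // Partition the input into work units; the runtime creates one task each.
  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    List<WorkUnit> workUnits = new ArrayList<>();
    for (int i = 0; i < 4; i++) {        // four partitions, purely illustrative
      WorkUnit wu = WorkUnit.createEmpty();
      wu.setProp("partition.id", i);     // hypothetical property key
      workUnits.add(wu);
    }
    return workUnits;
  }

  // Create a data extractor for a single work unit.
  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    return new MyExtractor(state);       // hypothetical Extractor implementation
  }

  @Override
  public void shutdown(SourceState state) {
    // release any connections or resources held by this source
  }
}
```

The runtime calls getWorkunits() once per job run, turns each work unit into a task, and each task then asks the source for its own extractor.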
6. Gobblin Job
   • Optionally acquire a job lock (to stop the next job instance)
   • Create a source instance
   • From the source's work units, create tasks
   • Launch and run the tasks
   • Publish the data if OK to do so
   • Persist the job/task states into the state store
   • Clean up temporary work data
   • Release the job lock (optional)
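The same flow, written out as code to make the ordering and the conditional publish step explicit. Everything here is a hypothetical stand-in, not the Gobblin runtime API; it exists only to show the sequence: lock, run, publish on success, persist, clean up, unlock.

```java
// Illustrative outline of the job flow above; all types are local stubs.
import java.util.List;

public class JobFlowSketch {

  interface Task { void run(); boolean succeeded(); }
  interface JobLock { boolean tryAcquire(); void release(); }

  void runJob(JobLock lock, List<Task> tasks) {
    // Optionally acquire the job lock so two instances cannot overlap.
    if (!lock.tryAcquire()) {
      return;                        // previous instance is still running
    }
    try {
      tasks.forEach(Task::run);      // launch and run the tasks

      // Publish only if every task completed successfully.
      if (tasks.stream().allMatch(Task::succeeded)) {
        publishData(tasks);
      }
      persistStates(tasks);          // job/task states into the state store
    } finally {
      cleanUpWorkingData();          // always remove temporary work data
      lock.release();                // release the (optional) job lock
    }
  }

  void publishData(List<Task> tasks) { /* move task output to final location */ }
  void persistStates(List<Task> tasks) { /* write states to the state store */ }
  void cleanUpWorkingData() { /* delete staging/temporary directories */ }
}
```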
7. Gobblin Constructs
   • Source partitions data into work units
   • Source creates work unit data extractors
   • Converter converts schema and data records (see the sketch below)
   • Quality checker checks row and task level data
   • Fork operator allows control to flow into multiple streams
   • Writer sends data records to the sink
   • Publisher publishes the job's data
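As one example of these constructs, here is a minimal Converter sketch, assuming Gobblin's org.apache.gobblin.converter.Converter base class and its SingleRecordIterable helper; the pass-through schema and upper-casing logic are illustrative only.

```java
// Minimal Converter sketch; the upper-casing transformation is illustrative.
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.SingleRecordIterable;

public class UpperCaseConverter extends Converter<String, String, String, String> {

  // Convert the input schema; here it passes through unchanged.
  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    return inputSchema;
  }

  // Convert one input record into zero or more output records.
  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) {
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}
```

Because convertRecord returns an Iterable, one input record can expand to many output records, or be dropped by returning an empty iterable.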
8. Gobblin Job Configuration
   • Gobblin jobs are configured via configuration files
   • May be named .pull / .job plus .properties
   • Source properties file defines
     – Connection / converter / quality / publisher
   • Job file defines
     – Name / group / description / schedule
     – Extraction properties
     – Source properties
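For a flavour of such a configuration, below is a sketch of a .pull job file, modelled on the Wikipedia example that ships with Gobblin; treat the exact keys, class names, and the cron-style schedule as version-dependent assumptions.

```
# Sketch of a Gobblin .pull job file, modelled on the bundled Wikipedia
# example; keys and class names may vary by Gobblin version.
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=Pull revisions of a few Wikipedia pages

# Quartz cron expression: run every two minutes
job.schedule=0 0/2 * * * ?

source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

extract.namespace=org.apache.gobblin.example.wikipedia
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```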
9. Available Books
   • See "Big Data Made Easy" – Apress, Jan 2015
   • See "Mastering Apache Spark" – Packt, Oct 2015
   • See "Complete Guide to Open Source Big Data Stack" – Apress, Jan 2018
   • Find the author on Amazon
     – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
   • Connect on LinkedIn
     – www.linkedin.com/in/mike-frampton-38563020
10. Connect
   • Feel free to connect on LinkedIn
     – www.linkedin.com/in/mike-frampton-38563020
   • See my open source blog at
     – open-source-systems.blogspot.com/
   • I am always interested in
     – New technology
     – Opportunities
     – Technology based issues
     – Big data integration