
Apache Gobblin



This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, its data sources and sinks, and its work unit processing.

Links for further information and for connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/


Mike Frampton

June 09, 2020


Transcript

1. What Is Apache Gobblin?
   • A big data integration framework
   • To simplify integration issues like
     – Data ingestion
     – Replication
     – Organization
     – Lifecycle management
   • For streaming and batch
   • An Apache incubator project
2. Gobblin Execution Modes
   • Gobblin has a number of execution modes
   • Standalone
     – Run on a single box / JVM / embedded mode (see the sketch below)
   • Map Reduce
     – Run as a map reduce application
   • Yarn / Mesos (proposed?)
     – Run on a cluster via a scheduler, supports HA
   • Cloud
     – Run on AWS / Azure, supports HA
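As a concrete illustration of the standalone/embedded mode, the sketch below runs a job inside the current JVM. It assumes Gobblin's EmbeddedGobblin runtime class and borrows the source and converter classes from the Wikipedia example that ships with Gobblin; treat package locations and configuration keys as version-dependent.

```java
// Minimal embedded-mode sketch, assuming Gobblin's EmbeddedGobblin API.
// Package locations may vary between Gobblin versions.
import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedJobExample {
  public static void main(String[] args) throws Exception {
    // Build an in-JVM job; the source and converter come from Gobblin's
    // bundled Wikipedia example, so no external cluster is needed.
    EmbeddedGobblin job = new EmbeddedGobblin("wikipedia-embedded")
        .setConfiguration("source.class",
            "org.apache.gobblin.example.wikipedia.WikipediaSource")
        .setConfiguration("converter.classes",
            "org.apache.gobblin.example.wikipedia.WikipediaConverter");

    JobExecutionResult result = job.run();   // blocks until the job finishes
    System.out.println("Job successful? " + result.isSuccessful());
  }
}
```

In recent Gobblin versions the same builder can also be switched onto MapReduce execution, which is what keeps the modes listed above interchangeable for a given job definition.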
3. Gobblin Sinks/Writers
   • Gobblin supports the following sinks
     – Avro HDFS
     – Parquet HDFS
     – HDFS byte array
     – Console (StdOut)
     – Couchbase
     – HTTP
     – JDBC
     – Kafka
4. Gobblin Sources
   Gobblin supports the following sources
   • Avro files
   • File copy
   • Query based
   • REST API
   • Google Analytics
   • Google Drive
   • Google Webmaster
   • Hadoop text input
   • Hive Avro to ORC
   • Hive compliance purging
   • JSON
   • Kafka
   • MySQL
   • Oracle
   • Salesforce
   • FTP / SFTP
   • SQL Server
   • Teradata
   • Wikipedia
5. Gobblin Architecture
   • A Gobblin job is built on a set of pluggable constructs (sketched below)
   • Which are extensible
   • A job is a set of tasks created from its work units
   • The work unit serves as a container at runtime
   • Tasks are executed by the Gobblin runtime
     – On the chosen deployment, e.g. MapReduce
   • The runtime handles scheduling, error handling etc
   • Utilities handle metadata, state, metrics etc
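To make the pluggable-construct idea concrete, here is a minimal sketch of a custom Source, assuming Gobblin's org.apache.gobblin.source.Source interface. The partition count, the partition.id property key, and MyExtractor are illustrative assumptions, not part of Gobblin.

```java
// A minimal pluggable-Source sketch against the Gobblin Source interface.
// MyExtractor and the "partition.id" key are hypothetical.
import java.util.ArrayList;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.WorkUnit;

public class PartitionedSource implements Source<String, String> {

  // Partition the input into work units; the runtime creates one task each.
  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    List<WorkUnit> workUnits = new ArrayList<>();
    for (int i = 0; i < 4; i++) {        // four partitions, purely illustrative
      WorkUnit wu = WorkUnit.createEmpty();
      wu.setProp("partition.id", i);     // hypothetical property key
      workUnits.add(wu);
    }
    return workUnits;
  }

  // Create a data extractor for a single work unit.
  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    return new MyExtractor(state);       // hypothetical Extractor implementation
  }

  @Override
  public void shutdown(SourceState state) {
    // release any connections or resources held by this source
  }
}
```

The runtime calls getWorkunits() once per job run, turns each work unit into a task, and each task then asks the source for its own extractor.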
6. Gobblin Job
   • Optionally acquire a job lock (to stop the next job instance)
   • Create a source instance
   • From the source's work units, create tasks
   • Launch and run the tasks
   • Publish the data if OK to do so
   • Persist the job/task states into the state store
   • Clean up temporary work data
   • Release the job lock (optional)
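The same flow, written out as code to make the ordering and the conditional publish step explicit. Everything here is a hypothetical stand-in, not the Gobblin runtime API; it exists only to show the sequence: lock, run, publish on success, persist, clean up, unlock.

```java
// Illustrative outline of the job flow above; all types are local stubs.
import java.util.List;

public class JobFlowSketch {

  interface Task { void run(); boolean succeeded(); }
  interface JobLock { boolean tryAcquire(); void release(); }

  void runJob(JobLock lock, List<Task> tasks) {
    // Optionally acquire the job lock so two instances cannot overlap.
    if (!lock.tryAcquire()) {
      return;                        // previous instance is still running
    }
    try {
      tasks.forEach(Task::run);      // launch and run the tasks

      // Publish only if every task completed successfully.
      if (tasks.stream().allMatch(Task::succeeded)) {
        publishData(tasks);
      }
      persistStates(tasks);          // job/task states into the state store
    } finally {
      cleanUpWorkingData();          // always remove temporary work data
      lock.release();                // release the (optional) job lock
    }
  }

  void publishData(List<Task> tasks) { /* move task output to final location */ }
  void persistStates(List<Task> tasks) { /* write states to the state store */ }
  void cleanUpWorkingData() { /* delete staging/temporary directories */ }
}
```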
7. Gobblin Constructs
   • Source partitions data into work units
   • Source creates work unit data extractors
   • Converter converts schema and data records (see the sketch below)
   • Quality checker checks row and task level data
   • Fork operator allows control to flow into multiple streams
   • Writer sends data records to the sink
   • Publisher publishes the job's data
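As one example of these constructs, here is a minimal Converter sketch, assuming Gobblin's org.apache.gobblin.converter.Converter base class and its SingleRecordIterable helper; the pass-through schema and upper-casing logic are illustrative only.

```java
// Minimal Converter sketch; the upper-casing transformation is illustrative.
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.SingleRecordIterable;

public class UpperCaseConverter extends Converter<String, String, String, String> {

  // Convert the input schema; here it passes through unchanged.
  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    return inputSchema;
  }

  // Convert one input record into zero or more output records.
  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) {
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}
```

Because convertRecord returns an Iterable, one input record can expand to many output records, or be dropped by returning an empty iterable.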
8. Gobblin Job Configuration
   • Gobblin jobs are configured via configuration files
   • May be named .pull / .job plus .properties
   • Source properties file defines
     – Connection / converter / quality / publisher
   • Job file defines
     – Name / group / description / schedule
     – Extraction properties
     – Source properties
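For a flavour of such a configuration, below is a sketch of a .pull job file, modelled on the Wikipedia example that ships with Gobblin; treat the exact keys, class names, and the cron-style schedule as version-dependent assumptions.

```
# Sketch of a Gobblin .pull job file, modelled on the bundled Wikipedia
# example; keys and class names may vary by Gobblin version.
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=Pull revisions of a few Wikipedia pages

# Quartz cron expression: run every two minutes
job.schedule=0 0/2 * * * ?

source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

extract.namespace=org.apache.gobblin.example.wikipedia
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```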
9. Available Books
   • See "Big Data Made Easy" – Apress, Jan 2015
   • See "Mastering Apache Spark" – Packt, Oct 2015
   • See "Complete Guide to Open Source Big Data Stack" – Apress, Jan 2018
   • Find the author on Amazon
     – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
   • Connect on LinkedIn
     – www.linkedin.com/in/mike-frampton-38563020
10. Connect
   • Feel free to connect on LinkedIn
     – www.linkedin.com/in/mike-frampton-38563020
   • See my open source blog at
     – open-source-systems.blogspot.com/
   • I am always interested in
     – New technology
     – Opportunities
     – Technology based issues
     – Big data integration