Slide 1

What Is Apache Gobblin?
● A big data integration framework
● Simplifies common integration tasks such as
  – Data ingestion
  – Replication
  – Organization
  – Lifecycle management
● Supports both streaming and batch data
● An Apache Incubator project

Slide 2

Gobblin Execution Modes
● Gobblin supports a number of execution modes
● Standalone
  – Runs on a single box / JVM, including an embedded mode
● MapReduce
  – Runs as a MapReduce application
● YARN / Mesos (proposed?)
  – Runs on a cluster via a scheduler, supports HA
● Cloud
  – Runs on AWS / Azure, supports HA
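
As an illustration of embedded mode, the sketch below runs a Gobblin job from inside a Java application. It assumes the EmbeddedGobblin helper documented in the gobblin-runtime module; the job name and the choice of source are placeholders, and package paths may vary between releases.

import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedModeExample {
  public static void main(String[] args) throws Exception {
    // Build and run an embedded Gobblin job; "ExampleJob" is a placeholder name
    JobExecutionResult result = new EmbeddedGobblin("ExampleJob")
        // Placeholder source choice; a real job also needs its source-specific settings
        .setConfiguration("source.class",
            "org.apache.gobblin.example.simplejson.SimpleJsonSource")
        .run(); // blocks until the job finishes

    System.out.println("Job succeeded: " + result.isSuccessful());
  }
}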

Slide 3

Gobblin Sinks/Writers
● Gobblin supports the following sinks
  – Avro HDFS
  – Parquet HDFS
  – HDFS byte array
  – Console (StdOut)
  – Couchbase
  – HTTP
  – JDBC
  – Kafka
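
As a rough illustration, a sink is normally selected through writer properties in the job file. The keys below are taken from the Gobblin configuration glossary; the builder class name and filesystem URI are placeholders to verify against the release in use.

# Write Avro records to HDFS
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
# Placeholder cluster URI
writer.fs.uri=hdfs://namenode:8020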

Slide 4

Gobblin Sources
Gobblin supports the following sources
● Avro files
● File copy
● Query based
● REST API
● Google Analytics
● Google Drive
● Google Webmaster
● Hadoop text input
● Hive Avro to ORC
● Hive compliance purging
● JSON
● Kafka
● MySQL
● Oracle
● Salesforce
● FTP / SFTP
● SQL Server
● Teradata
● Wikipedia
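
A source is selected the same way, via the source.class property. The snippet below assumes the Wikipedia example source bundled with Gobblin; the property names and package prefix are taken from the bundled example and may differ between releases.

# Pull recent revisions of selected Wikipedia pages (bundled example source)
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
# Placeholder page list and revision count
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5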

Slide 5

Gobblin Architecture

Slide 6

Gobblin Architecture
● A Gobblin job is built on a set of pluggable, extensible constructs
● A job is a set of tasks created from the source's work units
● A work unit serves as a task's container at runtime
● Tasks are executed by the Gobblin runtime
  – On the chosen deployment, e.g. MapReduce
● The runtime handles scheduling, error handling, etc.
● Utilities handle metadata, state, metrics, etc.
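
The pluggable constructs are plain Java interfaces. The sketch below restates the documented Source contract in simplified form to show how work units and extractors fit together; exact signatures should be checked against the Gobblin release in use.

import java.io.IOException;
import java.util.List;
import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.WorkUnit;

// Simplified restatement of the Source contract; S = schema type, D = record type
public interface SimplifiedSource<S, D> {
  // Partition the input data into work units (the runtime creates one task per work unit)
  List<WorkUnit> getWorkunits(SourceState state);

  // Create an extractor that pulls the records for a single work unit
  Extractor<S, D> getExtractor(WorkUnitState state) throws IOException;

  // Called once all work units have completed, for cleanup
  void shutdown(SourceState state);
}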

Slide 7

Gobblin Job

Slide 8

Gobblin Job
● Optionally acquire a job lock (to stop the next job instance from starting)
● Create a source instance
● Create tasks from the source's work units
● Launch and run the tasks
● Publish the data if it is OK to do so
● Persist the job/task states into the state store
● Clean up temporary work data
● Release the job lock (optional)
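
In pseudo-Java, the job lifecycle above looks roughly like the sketch below. Every type and helper method here is an illustrative placeholder, not the actual Gobblin runtime API.

import java.util.List;

// Illustrative sketch only: all types and helpers are placeholders
abstract class JobLifecycleSketch<T> {
  abstract boolean lockEnabled();
  abstract AutoCloseable acquireJobLock() throws Exception;
  abstract List<T> createTasksFromSourceWorkUnits();
  abstract void runTasks(List<T> tasks);
  abstract boolean allTasksSucceeded(List<T> tasks);
  abstract void publishData(List<T> tasks);
  abstract void persistStates(List<T> tasks);
  abstract void cleanUpTempData();

  void runJob() throws Exception {
    // Optionally stop the next job instance from overlapping this one
    AutoCloseable lock = lockEnabled() ? acquireJobLock() : null;
    try {
      List<T> tasks = createTasksFromSourceWorkUnits(); // source -> work units -> tasks
      runTasks(tasks);                                  // on the chosen deployment, e.g. MapReduce
      if (allTasksSucceeded(tasks)) {
        publishData(tasks);                             // move output to its final location
      }
      persistStates(tasks);                             // job/task state into the state store
    } finally {
      cleanUpTempData();
      if (lock != null) {
        lock.close();                                   // release the job lock (optional)
      }
    }
  }
}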

Slide 9

Gobblin Constructs

Slide 10

Gobblin Constructs
● The source partitions input data into work units
● The source creates a data extractor for each work unit
● Converters convert schemas and data records
● Quality checkers validate data at row and task level
● The fork operator allows a flow to be split into multiple streams
● Writers send data records to a sink
● The publisher publishes the job's data
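
For example, a converter implements a two-method contract, one call for the schema and one per data record. The toy converter below follows the documented Converter API, upper-casing string records while passing the schema through; signatures may differ slightly between releases.

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.SingleRecordIterable;

// Toy converter that upper-cases string records; the schema passes through unchanged
public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    return inputSchema; // no schema change
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) {
    // A converter may emit zero, one, or many records per input record
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}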

Slide 11

Gobblin Job Configuration
● Gobblin jobs are configured via configuration files
● Job files may be named .pull or .job, plus .properties files for shared defaults
● The source properties file defines
  – Connection / converter / quality / publisher settings
● The job file defines
  – Name / group / description / schedule
  – Extraction properties
  – Source properties
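
Putting this together, a minimal .pull job file might look like the following. The property keys come from the Gobblin configuration glossary; all values are placeholders, and the source, converter, and writer classes assume Gobblin's bundled SimpleJson example.

# my-example.pull - a minimal sketch of a Gobblin job file
job.name=ExampleIngest
job.group=Examples
job.description=Pull JSON records and write them as Avro
# Quartz cron: run every two minutes
job.schedule=0 0/2 * * * ?

source.class=org.apache.gobblin.example.simplejson.SimpleJsonSource

extract.namespace=org.apache.gobblin.example
extract.table.name=ExampleTable
extract.table.type=APPEND_ONLY

converter.classes=org.apache.gobblin.example.simplejson.SimpleJsonConverter

writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher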

Slide 12

Gobblin Users

Slide 13

Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Slide 14

Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration