How to Write Complex Data Pipeline in Ruby

Slide 1

Slide 1 text

How To Write Complex Data Pipeline In Ruby 2016.12.02 RubyConf Taiwan 2016 Kazuyuki Honda

Slide 2

Slide 2 text

HELLO! I am Kazuyuki Honda, DevOps Engineer at Quipper. You can find me at: https://github.com/hakobera on GitHub, @hakobera on Twitter

Slide 3

Slide 3 text

○ EdTech Company
  □ Our service supports both teachers and students
○ Services are launched in 6 countries
  □ Indonesia, Philippines, Mexico and Japan
  □ Trials starting in 2 other countries

Slide 4

Slide 4 text

What is Data Pipeline? Let’s dive into data and pipes

Slide 5

Slide 5 text

What is Data Pipeline?
○ A data pipeline is a sequence of data and tasks
○ Represented as a DAG (Directed Acyclic Graph)
○ A task can read data generated by previous tasks
[Diagram: Task 1, Task 2 and Task 3 connected so that each task's output is the next task's input]
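The DAG idea can be sketched with Ruby's standard `TSort` module, which orders nodes so that dependencies come first. This is only an illustration: the `Pipeline` class and the task names are hypothetical and not taken from the talk.

```ruby
require "tsort"

# A tiny DAG: each key is a task, each value lists the tasks it depends on.
# Class and task names are hypothetical, for illustration only.
class Pipeline
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort needs to know how to visit every node ...
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # ... and how to visit the dependencies of a given node.
  def tsort_each_child(node, &block)
    @dependencies.fetch(node).each(&block)
  end
end

PIPELINE = Pipeline.new({
  task3: [:task1, :task2],
  task1: [],
  task2: [:task1]
})

# tsort lists dependencies before the tasks that need them
# and raises TSort::Cyclic if the graph is not acyclic.
EXECUTION_ORDER = PIPELINE.tsort
puts EXECUTION_ORDER.inspect
```

Whatever order the tasks are declared in, the computed order always runs a task after everything it reads from.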

Slide 6

Slide 6 text

Examples of Data Pipeline
○ ETL (Extract-Transform-Load)
  □ Extract data from production database
  □ Transform, clean and pre-process data
  □ Load it into report database
○ Generate reports using SQL which has dependencies

Slide 7

Slide 7 text

Examples of Data Pipeline
[Diagram: Kinesis, Lambda, BigQuery]

Slide 8

Slide 8 text

Examples of Data Pipeline
[Diagram: Kinesis, Lambda, BigQuery, labelled "ETL" and "Generate reports using SQL with dependencies"]

Slide 9

Slide 9 text

How much data does Quipper process?

Slide 10

Slide 10 text

Daily Activity
○ 100GB of events and logs inserted
○ 40k queries executed
○ 4TB scanned

Slide 11

Slide 11 text

“This is not so big, but not so small. So we have to consider how to write complex data pipelines more easily and run them stably.”

Slide 12

Slide 12 text

Why is writing a complex data pipeline so difficult?

Slide 13

Slide 13 text

Why is writing a complex data pipeline so difficult?
Because we have to consider many things that are not directly related to data and tasks, such as:
○ Dependency resolution
○ Error handling
○ Idempotence
○ Retry
○ Logging
○ Parallel execution

Slide 14

Slide 14 text

Write data pipeline in Ruby (Early Days)

Slide 15

Slide 15 text

Early Days Example
Let’s write this simple data pipeline in Ruby.
[Diagram: Task 1 -> Data 1 -> Task 2 -> Data 2]

Slide 16

Slide 16 text

Early Days Example (1)
    do_task1()
    do_task2()
✓ Dependency resolution
○ Error handling
○ Idempotence
○ Logging
○ Retry

Slide 17

Slide 17 text

Early Days Example (2)
    begin
      do_task1()
      do_task2()
    rescue => e
      exit 1
    end
✓ Dependency resolution
✓ Error handling
○ Idempotence
○ Logging
○ Retry

Slide 18

Slide 18 text

Early Days Example (3)
    begin
      do_task1() unless data1_exist?
      do_task2() unless data2_exist?
    rescue => e
      exit 1
    end
✓ Dependency resolution
✓ Error handling
✓ Idempotence
○ Logging
○ Retry

Slide 19

Slide 19 text

Early Days Example (4)
    begin
      log.info("start task1")
      do_task1() unless data1_exist?
      log.info("success task1")
      log.info("start task2")
      do_task2() unless data2_exist?
      log.info("success task2")
    rescue => e
      log.error(e.message)
      exit 1
    end
✓ Dependency resolution
✓ Error handling
✓ Idempotence
✓ Logging
○ Retry

Slide 20

Slide 20 text

Early Days Example (5)
    Retriable.configure do … end
    begin
      Retriable.retriable do
        log.info("start task1")
        do_task1() unless data1_exist?
        log.info("success task1")
        log.info("start task2")
        do_task2() unless data2_exist?
        log.info("success task2")
      end
    rescue => e
      log.error(e.message)
      exit 1
    end
✓ Dependency resolution
✓ Error handling
✓ Idempotence
✓ Logging
✓ Retry
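For readers unfamiliar with retry wrappers, the sketch below shows roughly what such a wrapper does, using only Ruby's built-in `retry` keyword. The `with_retry` helper and its `tries` and `wait` options are hypothetical names for illustration; this is not the Retriable gem's implementation.

```ruby
# Hypothetical helper: runs the block and retries it when it raises,
# giving up after `tries` attempts.
def with_retry(tries: 3, wait: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    raise if attempts >= tries # out of attempts: re-raise the last error
    sleep(wait)
    retry                      # re-run the begin block
  end
end

# A flaky task that fails twice before succeeding.
RESULT = with_retry(tries: 3) do |attempt|
  raise "temporary failure" if attempt < 3
  "succeeded on attempt #{attempt}"
end
puts RESULT
```

Retrying is only safe when the wrapped work is idempotent, which is why the `unless data1_exist?` guards matter.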

Slide 21

Slide 21 text

Too much boilerplate!

Slide 22

Slide 22 text

I need a solution!!

Slide 23

Slide 23 text

Solution
There are well-known open source workflow engines:
○ Luigi by Spotify
○ Airflow by Airbnb
  □ Now an Apache Incubator project
A workflow engine can solve this problem.

Slide 24

Slide 24 text

They are awesome, but they are for Python, not for Ruby!!

Slide 25

Slide 25 text

So I made it myself

Slide 26

Slide 26 text

tumugi: an open source, plugin-based Ruby library to build, run and manage complex workflows
Document: https://tumugi.github.io/
Source: https://github.com/tumugi/tumugi

Slide 27

Slide 27 text

What tumugi provides
○ Workflow definition using an internal DSL
○ Task dependency resolution
○ Error handling and retry
○ Support for building idempotent data pipelines
○ Centralized logging
○ Parallel task execution using threads
○ Visualization
○ Plugin architecture

Slide 28

Slide 28 text

What tumugi does not provide
○ Scheduler (Event Trigger)
  □ Use external tools like cron, Jenkins
○ Executor for multiple machines
  □ One data pipeline can only run on one machine
  □ Control cloud distributed resources, e.g. BigQuery, Cloud Storage
  □ Sync multiple data pipelines using an external task

Slide 29

Slide 29 text

Data Pipeline with tumugi
Let’s write this very simple data pipeline using tumugi.
[Diagram: Task 1 -> Data 1 -> Task 2 -> Data 2]

Slide 30

Slide 30 text

Data Pipeline with tumugi
    task :task1 do
      output target(:data1)
      run { "do something" }
    end
    task :task2 do
      requires :task1
      output target(:data2)
      run { output.write("do something using #{input.value}") }
    end

Slide 31

Slide 31 text

Before and After
Ruby only:
    Retriable.configure do … end
    begin
      Retriable.retriable do
        log.info("start task1")
        do_task1() unless data1_exist?
        log.info("success task1")
        log.info("start task2")
        do_task2() unless data2_exist?
        log.info("success task2")
      end
    rescue => e
      log.error(e.message)
      exit 1
    end
tumugi DSL:
    task :task1 do
      output target(:data1)
      run { output.write("do something") }
    end
    task :task2 do
      requires :task1
      output target(:data2)
      run { output.write("do something using #{input.value}") }
    end

Slide 32

Slide 32 text

[Diagram: Task 1 -> Data 1 -> Task 2 -> Data 2]
    task :task1 do
      output target(:data1)
      run { output.write("do something") }
    end
    task :task2 do
      requires :task1
      output target(:data2)
      run { output.write("do something using #{input.value}") }
    end

Slide 33

Slide 33 text

How to abstract data pipeline in Ruby

Slide 34

Slide 34 text

Core Components of tumugi
○ Target
○ Task
○ Parameter

Slide 35

Slide 35 text

What is Task?
A Task represents a `task` in a data pipeline. Write your own data processing logic in the #run method, using input data to generate output data.
[Diagram: Input -> Task#run -> Output]

Slide 36

Slide 36 text

Task Dependency
A Task also has a `#requires` method, which declares the tasks it depends on. tumugi uses this information to build the DAG.
[Diagram: tasks linked by #requires, where each task's output is the next task's input]

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Requires multiple tasks
    task :first_task do
    end
    task :second_task do
      requires :first_task
    end
    task :another_second_task do
      requires :first_task
    end
    task :last_task do
      requires [:second_task, :another_second_task]
    end
[Diagram: First Task -> Second Task and Another Second Task -> Last Task]
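To show how an internal DSL like this can collect dependency information, here is a minimal sketch built on `instance_eval`. The `Workflow` and `TaskDefinition` classes are hypothetical stand-ins and not tumugi's actual implementation.

```ruby
# Hypothetical mini-DSL for illustration; not tumugi's implementation.
class TaskDefinition
  attr_reader :name, :dependencies

  def initialize(name)
    @name = name
    @dependencies = []
  end

  # Accepts a single task name or an array of names, as on the slide.
  def requires(names)
    @dependencies = Array(names)
  end
end

class Workflow
  attr_reader :tasks

  def initialize
    @tasks = {}
  end

  # Evaluates the block in the context of a new TaskDefinition,
  # so that `requires` inside the block records dependencies.
  def task(name, &block)
    definition = TaskDefinition.new(name)
    definition.instance_eval(&block) if block
    @tasks[name] = definition
  end

  def self.define(&block)
    workflow = new
    workflow.instance_eval(&block)
    workflow
  end
end

WORKFLOW = Workflow.define do
  task :first_task do
  end
  task :second_task do
    requires :first_task
  end
  task :another_second_task do
    requires :first_task
  end
  task :last_task do
    requires [:second_task, :another_second_task]
  end
end

WORKFLOW.tasks.each_value do |t|
  puts "#{t.name} requires #{t.dependencies.inspect}"
end
```

The recorded dependencies form the edges of the DAG, which an engine can then sort before running anything.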

Slide 39

Slide 39 text

What is Target?
A Target abstracts `data` in a data pipeline, e.g. a local file, remote object or database table.
A Target must implement the #exist? method.
[Diagram: Task -> Target#exist? -> Task]

Slide 40

Slide 40 text

Example of Target
    class LocalFileTarget < Tumugi::Target
      attr_reader :path
      def initialize(path)
        @path = path
      end
      def exist?
        File.exist?(@path)
      end
    end

Slide 41

Slide 41 text

Target#exist? method is for Idempotence
tumugi uses Target#exist? to build idempotent data pipelines.
    while !dag.complete?
      task = dag.next()
      if task.output.exist? #<= HERE!
        task.trigger_event(:skip)
      elsif task.ready?
        task.run && task.trigger_event(:complete)
      else
        dag.push(task)
      end
    end
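The loop above is pseudocode. A runnable approximation, assuming in-memory targets and a plain array as the work queue, might look like the following. `MemoryTarget`, `SimpleTask` and `run_pipeline` are hypothetical names and do not come from tumugi.

```ruby
# In-memory stand-in for a target; a real target would check a file, table, etc.
class MemoryTarget
  def initialize
    @value = nil
  end

  def write(value)
    @value = value
  end

  def exist?
    !@value.nil?
  end
end

# A task is ready once the outputs of all the tasks it requires exist.
SimpleTask = Struct.new(:name, :requires, :output, :body) do
  def ready?(tasks)
    requires.all? { |dep| tasks.fetch(dep).output.exist? }
  end
end

# Skips tasks whose output already exists, runs ready tasks and re-queues the rest.
# Note: this sketch loops forever if a dependency never produces its output.
def run_pipeline(tasks)
  events = []
  queue = tasks.values
  until queue.empty?
    task = queue.shift
    if task.output.exist?
      events << [task.name, :skip]
    elsif task.ready?(tasks)
      task.body.call(task.output)
      events << [task.name, :complete]
    else
      queue.push(task)
    end
  end
  events
end

TASKS = {
  task2: SimpleTask.new(:task2, [:task1], MemoryTarget.new, ->(out) { out.write("data2") }),
  task1: SimpleTask.new(:task1, [], MemoryTarget.new, ->(out) { out.write("data1") })
}

FIRST_RUN = run_pipeline(TASKS)
SECOND_RUN = run_pipeline(TASKS)
puts FIRST_RUN.inspect
puts SECOND_RUN.inspect
```

Running the same pipeline twice does the work only once: the second pass finds every output in place and skips each task.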

Slide 42

Slide 42 text

Slide 43

Slide 43 text

What is Parameter?
A Parameter is another kind of input to a Task. You can read parameters in the Task#run method.
[Diagram: Input and Parameters -> Task -> Output]

Slide 44

Slide 44 text

Example of Parameters
○ You can define a parameter using #param
○ You can read parameters in the #run method
○ You can override parameters of a plugin in the DSL
    task :some_task do
      param :key1, type: :string, default: 'value1'
    end
    task :another_task, type: :awesome_plugin do
      key2 'another param name'
      run do
        puts key2 #=> 'another param name'
      end
    end

Slide 45

Slide 45 text

Parameter Auto Binding
If auto_bind is enabled for a parameter, tumugi binds its value from CLI options.
    task :some_task do
      param :param1, type: :string, auto_bind: true
      run { puts param1 } #=> "hello"
    end
    $ tumugi run -f workflow.rb -p param1:hello
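As a rough illustration of what auto binding involves, the sketch below parses `name:value` pairs and applies them to declared parameters. `parse_params` and `bind_params` are hypothetical helpers; how tumugi actually parses its CLI options may differ.

```ruby
# Hypothetical parser for "name:value" pairs such as those passed with -p.
def parse_params(pairs)
  pairs.each_with_object({}) do |pair, params|
    name, value = pair.split(":", 2) # split only on the first colon
    raise ArgumentError, "expected name:value, got #{pair.inspect}" if value.nil?
    params[name.to_sym] = value
  end
end

# Binds CLI values to declared parameters that opted in with auto_bind,
# falling back to the declared default otherwise.
def bind_params(declared, cli_params)
  declared.each_with_object({}) do |(name, options), bound|
    bound[name] =
      if options[:auto_bind] && cli_params.key?(name)
        cli_params[name]
      else
        options[:default]
      end
  end
end

CLI_PARAMS = parse_params(["param1:hello"])
BOUND = bind_params(
  { param1: { auto_bind: true }, param2: { default: "value2" } },
  CLI_PARAMS
)
puts BOUND.inspect
```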

Slide 46

Slide 46 text

Plugin Architecture

Slide 47

Slide 47 text

Everything is a plugin
All tasks and targets are plugins

Slide 48

Slide 48 text

Task Plugin
Definition:
    module Tumugi
      module Plugin
        class AwesomeTask < Tumugi::Task
          Plugin.register_task('awesome', self)
          param :key
          def run
            puts key
          end
        end
      end
    end
Usage:
    task :task1, type: :awesome do
      key "value"
    end
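The registration call can be read as a registry pattern: a plugin class registers itself under a name, and the DSL later looks the class up by `type:`. The sketch below shows that pattern in isolation. `MiniPlugin`, `BaseTask` and `lookup_task` are hypothetical names, and tumugi's real registry may work differently.

```ruby
# Hypothetical registry for illustration; not tumugi's implementation.
module MiniPlugin
  @tasks = {}

  def self.register_task(name, klass)
    @tasks[name.to_s] = klass
  end

  def self.lookup_task(name)
    @tasks.fetch(name.to_s) { raise ArgumentError, "unknown task type: #{name}" }
  end
end

class BaseTask
  def initialize(params = {})
    @params = params
  end
end

class AwesomeTask < BaseTask
  # Registration happens once, when the class body is evaluated.
  MiniPlugin.register_task("awesome", self)

  def run
    "awesome task ran with key=#{@params[:key]}"
  end
end

# What a DSL could do when it sees `type: :awesome`.
TASK_CLASS = MiniPlugin.lookup_task(:awesome)
puts TASK_CLASS.new({ key: "value" }).run
```

Because registration happens at load time, requiring a plugin gem is enough to make its task types available by name.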

Slide 49

Slide 49 text

Target Plugin
Definition:
    module Tumugi
      module Plugin
        class AwesomeTarget < Tumugi::Target
          Plugin.register_target('awesome', self)
          def exist?
            # check whether the awesome resource exists
          end
        end
      end
    end
Usage:
    task :task1 do
      output target :awesome
    end

Slide 50

Slide 50 text

Distributing tumugi plugins
You can find and register tumugi plugins on RubyGems.org. Each gem provides useful tasks and targets.
https://rubygems.org/search?query=tumugi-plugin

Slide 51

Slide 51 text

Real world example

Slide 52

Slide 52 text

Dynamic Workflow
Export multiple MongoDB collections using embulk
    configs = Dir.glob("schema/*.json")
    task :main do
      requires configs.map {|config| "#{File.basename(config, ".*")}_export" }
      run { log "done" }
    end
    configs.each do |config|
      collection = File.basename(config, ".*")
      task "#{collection}_export", type: :command do
        param :day, auto_bind: true, required: true
        command { "embulk_wrapper --day=#{day} --collection=#{collection}" }
        output {
          target(:bigquery_table, project_id: "xxx", dataset_id: "yyy", table_id: "#{collection}_#{day}")
        }
      end
    end
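The dynamic part is ordinary Ruby: task names are derived from file names. The sketch below isolates that step with a fixed list instead of `Dir.glob`, so it runs without any schema files. The file names and the `export_task_names` helper are hypothetical.

```ruby
# Hypothetical schema files; the slide obtains this list with Dir.glob("schema/*.json").
CONFIGS = ["schema/users.json", "schema/lessons.json"]

# File.basename with ".*" strips the directory and any extension,
# leaving the collection name used to build each task name.
def export_task_names(configs)
  configs.map { |config| "#{File.basename(config, ".*")}_export" }
end

TASK_NAMES = export_task_names(CONFIGS)
puts TASK_NAMES.inspect
```

Adding a schema file is then enough to add an export task, with no change to the workflow definition.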

Slide 53

Slide 53 text

Run query, export and notify
DEMO: Export a BigQuery query result to Google Drive and post the URL to Slack.
http://tumugi.github.io/recipe2/

Slide 54

Slide 54 text

THANKS! Any questions? You can find me at: https://github.com/hakobera on Github @hakobera on Twitter