WHO AM I?
• A developer
• Google, Microsoft, Outbrain, Gigaspaces, Totango etc.
• Hector, flask-restful-swagger, meteor-migrations, monitoring...
• Podcast: reversim.com
• devdev.io
• Gormim
Tuesday, July 15, 14
WHAT IS LUIGI?
• A Workflow Engine.
• Who the fuck needs a workflow engine?
• You do!!!
• If you run Hadoop (or other ETL jobs)
• If you have dependencies between them (who doesn’t?!)
• If they fail (s/if/when/)
• Luigi doesn’t replace Hadoop, Scalding, Pig, Hive, Redshift.
• It orchestrates them
SCREENSHOTS
HOW DO YOU ETL YOUR DATA?
• Hadoop
• Spark
• Redshift
• Postgres
• Ad-hoc java/python/ruby/go/...
RUNNING ONE JOB IS EASY
RUNNING MANY IS HARD
• Hundreds of concurrent jobs, thousands daily.
• Job dependencies.
• E.g. first copy the file, then crunch it.
• Errors / retries
• Idempotency
• Monitoring / Visuals
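The retry and idempotency bullets can be illustrated without any workflow engine. What follows is only a minimal sketch: `run_idempotent` and `flaky_copy` are hypothetical names made up for this example (they are not part of Luigi). The helper skips a job whose output file already exists and retries it a few times otherwise.

```python
import os
import tempfile


def run_idempotent(job, output_path, max_attempts=3):
    """Run job(output_path) unless its output already exists.

    Hypothetical helper for illustration only. Returns the number of
    attempts actually made (0 means the job was skipped).
    """
    if os.path.exists(output_path):
        return 0  # output is already there, so a re-run is a no-op
    for attempt in range(1, max_attempts + 1):
        try:
            job(output_path)
            return attempt
        except IOError:
            if attempt == max_attempts:
                raise  # give up after the last attempt


# Demo job: fails on its first call, succeeds on the second.
calls = {"n": 0}


def flaky_copy(path):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("simulated transient failure")
    with open(path, "w") as f:
        f.write("copied\n")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "copied.txt")
first = run_idempotent(flaky_copy, target)   # one failure, then success
second = run_idempotent(flaky_copy, target)  # skipped: output exists
```

Because the check is on the output rather than on "did the job run", re-running the whole pipeline after a crash only redoes the missing pieces.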
THE WRONG WAY TO DO IT
EXAMPLE WORKFLOW
Log data (4 inputs)
-> Subsample and extract features
-> Features
-> Train classification model
-> Model
-> Upload model to servers
THE CRON PHENOMENON
Don’t try this at home!!!
THE WRONG WAY TO DO IT
ENTER LUIGI
• Like a Makefile, but in Python
• And for data
• Integrates well with data targets
• Hadoop, Spark, Databases
• Atomic file/db operations
• Visualization
• CLI - really nice developer interface!
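The "atomic file operations" bullet usually comes down to one pattern: write to a temporary file, then rename it into place. The block below is a minimal sketch of that pattern and not Luigi's own code; `atomic_write` and the file name are hypothetical and exist only for this example.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp_path, path)  # swaps the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial file behind
        raise


workdir = tempfile.mkdtemp()
out = os.path.join(workdir, "features.tsv")
atomic_write(out, "user\tscore\n")
with open(out) as f:
    written = f.read()
leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]
```

The payoff for a workflow engine is that "the output exists" can safely be read as "the task finished", since a crashed task never leaves the final path behind.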
LUIGI TASK
RUN FROM THE CLI
TASK PARAMETERS
AWESOME HADOOP (MR) SUPPORT
WEB UI
PROCESS SYNCHRONIZATION
USED BY
SEMI-DEEP DIVE
Programming for Luigi
LUIGI TASKS
• Implement 4 methods:
def input(self) (optional)
def output(self)
def run(self)
def requires(self)
LUIGI TASKS
• Or extend one of the predefined tasks
• S3CopyToTable
• RedshiftManifestTask
• SparkJob
• HiveQueryTask
• HadoopJobTask
EXAMPLE
LOCAL WORDCOUNT
import luigi

# InputText is an external task defined elsewhere (not shown on the slide).
class WordCount(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.LocalTarget('/var/tmp/text-count/%s' % self.date_interval)

    def run(self):
        count = {}
        for file in self.input():
            for line in file.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        # output data
        f = self.output().open('w')
        for word, n in count.items():
            f.write("%s\t%d\n" % (word, n))
        f.close()
EXAMPLE
HADOOP WORDCOUNT
import luigi
import luigi.hadoop
import luigi.hdfs

# Module paths as of 2014; newer Luigi releases moved these to
# luigi.contrib.hadoop and luigi.contrib.hdfs.
class WordCount(luigi.hadoop.JobTask):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        return [InputText(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.hdfs.HdfsTarget('/tmp/text-count/%s' % self.date_interval)

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
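To see what the mapper/reducer pair computes without a Hadoop cluster, the two functions can be driven by a few lines of plain Python. This is only a sketch: `run_local` is a hypothetical stand-in for the shuffle step (grouping mapper output by key), not something Luigi or Hadoop provides under that name.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as the slide's mapper, minus `self`.
    for word in line.strip().split():
        yield word, 1


def reducer(key, values):
    # Same logic as the slide's reducer, minus `self`.
    yield key, sum(values)


def run_local(lines):
    """Hypothetical stand-in for the shuffle: group mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            result[out_key] = out_value
    return result


counts = run_local(["to be or not", "to be"])
```

Here `counts` maps each word to its total, which is the same result the Hadoop job writes out as key/value pairs.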
LUIGI TARGETS
• HDFS
• Local File
• Postgres / MySQL, Redshift, ElasticSearch
• ... Easy to extend
DEFINING A TARGET
• Implement:
def exists(self)
And optionally:
connect or open / close
EXAMPLE
MYSQL TARGET
import luigi

class MySqlTarget(luigi.Target):
    def touch(self, connection=None):
        ...

    def exists(self, connection=None):
        # Fall back to a fresh connection when the caller did not pass one.
        if connection is None:
            connection = self.connect()
        cursor = connection.cursor()
        cursor.execute("""SELECT 1 FROM {marker_table}
                          WHERE update_id = %s
                          LIMIT 1""".format(marker_table=self.marker_table),
                       (self.update_id,))
        row = cursor.fetchone()
        return row is not None

    def connect(self, autocommit=False):
        ...

    def create_marker_table(self):
        ...
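The marker-table idea behind `exists()` can be tried out locally. The sketch below swaps MySQL for the standard library's in-memory SQLite so that it runs anywhere. `SqliteMarkerTarget`, the table name, and the `update_id` value are all hypothetical; the bodies of `touch` and `create_marker_table` are guesses at what the elided methods do, not the original code.

```python
import sqlite3


class SqliteMarkerTarget(object):
    """Hypothetical target: it 'exists' once a marker row is present."""

    marker_table = "table_updates"

    def __init__(self, connection, update_id):
        self.connection = connection
        self.update_id = update_id

    def create_marker_table(self):
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS {t} (update_id TEXT PRIMARY KEY)"
            .format(t=self.marker_table))

    def touch(self):
        """Record that the work identified by update_id has completed."""
        self.create_marker_table()
        self.connection.execute(
            "INSERT OR REPLACE INTO {t} (update_id) VALUES (?)"
            .format(t=self.marker_table),
            (self.update_id,))
        self.connection.commit()

    def exists(self):
        self.create_marker_table()
        row = self.connection.execute(
            "SELECT 1 FROM {t} WHERE update_id = ? LIMIT 1"
            .format(t=self.marker_table),
            (self.update_id,)).fetchone()
        return row is not None


conn = sqlite3.connect(":memory:")
target = SqliteMarkerTarget(conn, "LoadAccounts(date=2014-07-15)")
before = target.exists()  # no marker row yet
target.touch()            # the task would call this after loading data
after = target.exists()
```

Writing the marker row last means a database load that dies halfway is still reported as "not done" and gets retried.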
THE GRAND SCHEME
Run Task -> Check Deps -> Run Deps -> Use Deps Output -> Invoke Run -> Write Output
Calls shown alongside the steps: self.requires(), target.exists(), Check Scheduler, self.input(), self.output()
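The steps above can be sketched as a toy, single-process runner. This is a simplified illustration and not Luigi's implementation: it has no scheduler (so the "Check Scheduler" step is left out), and the `Target`, `Task`, `build`, `LogData` and `Features` definitions are hypothetical stand-ins that only borrow the method names from the slides.

```python
class Target(object):
    """In-memory stand-in for a file or database target (hypothetical)."""

    def __init__(self):
        self.value = None

    def exists(self):
        return self.value is not None


class Task(object):
    """Toy base class; mirrors the method names on the slides only."""

    def __init__(self):
        self._target = Target()

    def requires(self):
        return []

    def output(self):
        return self._target

    def input(self):
        # The outputs of the dependencies become this task's inputs.
        return [dep.output() for dep in self.requires()]

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run task after its dependencies, skipping anything already done."""
    if task.output().exists():   # target.exists(): nothing to do
        return
    for dep in task.requires():  # check deps, run the missing ones
        build(dep, log)
    task.run()                   # invoke run, which writes the output
    log.append(type(task).__name__)


class LogData(Task):
    def run(self):
        self.output().value = ["a b", "b c"]


class Features(Task):
    def __init__(self, source):
        Task.__init__(self)
        self.source = source

    def requires(self):
        return [self.source]

    def run(self):
        lines = self.input()[0].value
        self.output().value = sorted(set(" ".join(lines).split()))


log = []
features = Features(LogData())
build(features, log)
build(features, log)  # second call is a no-op: the output already exists
```

After both calls `log` holds each task name once, dependency first, which is the Makefile-like behaviour the talk describes.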
OPEN SOURCE
BY
WHAT DID I DO?
• Added Redshift support
• Added MySQL support
• Various small features (improved notifications, dep.py, historydb etc.)
• Various bug reports
• And fixes!
LUIGI @ TOTANGO
• Daily computation
• Hourly computation
• Ad-hoc data loading (for data analysis activities, to Redshift)
TOTANGO’S SETUP
invoke
Luigi Workers
coordinate
Luigi Scheduler
report
TOTANGO’S SETUP
Luigi Worker
read / write
MOAR SCREENSHOTS
GAMEBOY!!!
AND... GAMEBOY
GAMEBOY IS
• A Totango-specific controller for Luigi
• The transition process (to Luigi)
• Provides a high-level overview
• Manual re-run of tasks
• Monitors progress, performance, run times, queues, worker load etc.
• Implemented using Flask and AngularJS
IS GAMEBOY OPEN SOURCE?
• Well, no. At least not right away
• Right now it’s very Totango-specific.
• Integrates with Librato Metrics
• Queries on Totango databases
• Uses Jenkins for controlling executions
• Displays Totango’s account metadata (Totango’s business logic)
• Maybe some other day...
WHAT ELSE IS OUT THERE?
• Oozie
• Azkaban
• AWS Data Pipeline
• Chronos
• spring-batch
• Dataswarm (Facebook)
• River (Outbrain internal)
• What’s your favorite WF engine? (did you build one?)
MY OTHER PROJECTS
• https://github.com/hector-client/hector
• https://github.com/rantav/flask-restful-swagger
• https://github.com/sebastien/monitoring
• https://github.com/rantav/meteor-migrations
• https://github.com/rantav/node-github-list-packages
• https://github.com/rantav/devdev
REFS
• https://github.com/spotify/luigi
• Facebook’s Dataswarm: https://www.youtube.com/watch?v=M0VCbhfQ3HQ
• Outbrain’s River: https://www.youtube.com/watch?v=EzsckTggDiM