Slide 1

Slide 1 text

@rantav totango Tuesday, July 15, 14

Slide 2

Slide 2 text

WHO AM I?

Slide 3

Slide 3 text

WHO AM I?
• A developer
• Google, Microsoft, Outbrain, Gigaspaces, Totango etc.
• Hector, flask-restful-swagger, meteor-migrations, monitoring...
• Podcast: reversim.com
• devdev.io
• Gormim

Slide 4

Slide 4 text

WHAT IS LUIGI?

Slide 5

Slide 5 text

WHAT IS LUIGI?

Slide 6

Slide 6 text

WHAT IS LUIGI?
• A Workflow Engine.
• Who the fuck needs a workflow engine?

Slide 7

Slide 7 text

WHAT IS LUIGI?
• A Workflow Engine.
• Who the fuck needs a workflow engine?
• You do!!!
    • If you run Hadoop (or other ETL jobs)
    • If you have dependencies b/w them (who doesn’t?!)
    • If they fail (s/if/when/)

Slide 8

Slide 8 text

WHAT IS LUIGI?
• A Workflow Engine.
• Who the fuck needs a workflow engine?
• You do!!!
    • If you run Hadoop (or other ETL jobs)
    • If you have dependencies b/w them (who doesn’t?!)
    • If they fail (s/if/when/)
• Luigi doesn’t replace Hadoop, Scalding, Pig, Hive, Redshift.
    • It orchestrates them

Slide 9

Slide 9 text

SCREENSHOTS

Slide 10

Slide 10 text

HOW DO YOU ETL YOUR DATA?
• Hadoop
• Spark
• Redshift
• Postgres
• Ad-hoc java/python/ruby/go/...

Slide 11

Slide 11 text

RUNNING ONE JOB IS EASY, RUNNING MANY IS HARD
• 100s of concurrent jobs, 1000s daily.
• Job dependencies.
    • E.g. first copy the file, then crunch it.
• Errors / retries
• Idempotency
• Monitoring / Visuals
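
A minimal sketch of the "errors / retries" and "idempotency" bullets, in plain Python rather than Luigi. The names `run_with_retries` and `make_flaky_job` are hypothetical and exist only for this illustration: a job is skipped when its output file already exists, and retried a bounded number of times when it fails.

```python
import os
import tempfile
import time


def run_with_retries(job, output_path, max_attempts=3, delay_seconds=0.0):
    """Run job(output_path) unless its output already exists; retry on failure.

    Returns the number of attempts made (0 means the job was skipped).
    """
    if os.path.exists(output_path):
        return 0  # idempotency: the output is already there, nothing to do
    for attempt in range(1, max_attempts + 1):
        try:
            job(output_path)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries, surface the error
            time.sleep(delay_seconds)


def make_flaky_job(failures_before_success):
    """Build a hypothetical job that fails a fixed number of times, then writes a file."""
    state = {"calls": 0}

    def job(path):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient failure")
        with open(path, "w") as f:
            f.write("done\n")

    return job
```

Running the same call twice against the same output path does the work once; the second call returns 0 without invoking the job.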

Slide 12

Slide 12 text

THE WRONG WAY TO DO IT

Slide 13

Slide 13 text

EXAMPLE WORKFLOW
Log data (x4) -> Subsample and extract features -> Features -> Train classification model -> Model -> Upload model to servers

Slide 14

Slide 14 text

THE CRON PHENOMENON
THE WRONG WAY TO DO IT

Slide 15

Slide 15 text

THE CRON PHENOMENON
Don’t try this at home!!!
THE WRONG WAY TO DO IT

Slide 16

Slide 16 text

ENTER LUIGI

Slide 17

Slide 17 text

ENTER LUIGI
• Like Makefile, but in Python
• And for data
• Integrates well with data targets
    • Hadoop, Spark, Databases
• Atomic file/db operations
• Visualization
• CLI: really nice developer interface!
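
The "atomic file operations" bullet can be illustrated with the general write-then-rename idea. This is a generic sketch, not Luigi's own code; `atomic_write` is a hypothetical helper name. The data goes to a temporary file in the same directory and is renamed into place only once it is complete, so a crashed job never leaves a half-written output that looks finished.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the destination so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp_path, path)  # the output appears all at once
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This matters for a Makefile-like tool because "does the output exist?" is how it decides whether a step is done.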

Slide 18

Slide 18 text

LUIGI TASK

Slide 19

Slide 19 text

LUIGI TASK

Slide 20

Slide 20 text

RUN FROM THE CLI

Slide 21

Slide 21 text

TASK PARAMETERS

Slide 22

Slide 22 text

AWESOME HADOOP (MR) SUPPORT

Slide 23

Slide 23 text

WEB UI

Slide 24

Slide 24 text

PROCESS SYNCHRONIZATION

Slide 25

Slide 25 text

USED BY

Slide 26

Slide 26 text

SEMI-DEEP DIVE
Programming for Luigi

Slide 27

Slide 27 text

LUIGI TASKS
• Implement 4 methods:
    def input(self) (optional)
    def output(self)
    def run(self)
    def requires(self)

Slide 28

Slide 28 text

LUIGI TASKS
• Or extend one of the predefined tasks
    • S3CopyToTable
    • RedshiftManifestTask
    • SparkJob
    • HiveQueryTask
    • HadoopJobTask

Slide 29

Slide 29 text

EXAMPLE LOCAL WORDCOUNT

    import luigi

    class WordCount(luigi.Task):
        date_interval = luigi.DateIntervalParameter()

        def requires(self):
            # InputText is an upstream task defined elsewhere (not shown on the slide)
            return [InputText(date) for date in self.date_interval.dates()]

        def output(self):
            return luigi.LocalTarget('/var/tmp/text-count/%s' % self.date_interval)

        def run(self):
            counts = {}
            for target in self.input():
                with target.open('r') as infile:
                    for line in infile:
                        for word in line.strip().split():
                            counts[word] = counts.get(word, 0) + 1

            # output data
            with self.output().open('w') as outfile:
                for word, count in counts.items():
                    outfile.write("%s\t%d\n" % (word, count))

Slide 30

Slide 30 text

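EXAMPLE HADOOP WORDCOUNT

    import luigi
    import luigi.hadoop  # newer Luigi releases moved this to luigi.contrib.hadoop
    import luigi.hdfs    # newer Luigi releases moved this to luigi.contrib.hdfs

    class WordCount(luigi.hadoop.JobTask):
        date_interval = luigi.DateIntervalParameter()

        def requires(self):
            # InputText is an upstream task defined elsewhere (not shown on the slide)
            return [InputText(date) for date in self.date_interval.dates()]

        def output(self):
            return luigi.hdfs.HdfsTarget('/tmp/text-count/%s' % self.date_interval)

        def mapper(self, line):
            # Emit (word, 1) for every word in the input line
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            # Sum the counts emitted for each word
            yield key, sum(values)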
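
To see what the mapper and reducer above compute without a Hadoop cluster, here is a single-process sketch of the map, sort, and reduce phases. `run_local` is a hypothetical helper written for this illustration; it is not how Luigi or Hadoop actually execute the job.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Same logic as the slide's mapper: one (word, 1) pair per word
    for word in line.strip().split():
        yield word, 1


def reducer(key, values):
    # Same logic as the slide's reducer: sum the counts for one word
    yield key, sum(values)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # the "shuffle": bring equal keys together
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results
```

Calling `run_local(["a b a", "b a"])` yields one `(word, count)` pair per distinct word, sorted by word.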

Slide 31

Slide 31 text

LUIGI TARGETS
• HDFS
• Local File
• Postgres / MySQL, Redshift, ElasticSearch
• ... Easy to extend

Slide 32

Slide 32 text

DEFINING A TARGET
• Implement:
    def exists(self)
• And optionally: connect or open / close

Slide 33

Slide 33 text

EXAMPLE MYSQL TARGET

    class MySqlTarget(luigi.Target):

        def touch(self, connection=None):
            ...

        def exists(self, connection=None):
            # Fall back to a fresh connection when the caller did not pass one
            if connection is None:
                connection = self.connect()
            cursor = connection.cursor()
            cursor.execute(
                """SELECT 1 FROM {marker_table}
                   WHERE update_id = %s
                   LIMIT 1""".format(marker_table=self.marker_table),
                (self.update_id,))
            row = cursor.fetchone()
            return row is not None

        def connect(self, autocommit=False):
            ...

        def create_marker_table(self):
            ...
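
The marker-table idea can be tried end to end with SQLite from the standard library. This is a sketch under assumptions: `SqliteMarkerTarget` and the table name `table_updates` are made up for the illustration and are not part of Luigi. The point is that `touch()` records a completed update and `exists()` answers "was this update already done?".

```python
import sqlite3


class SqliteMarkerTarget:
    """Hypothetical stand-in for a database target; not part of Luigi."""

    marker_table = "table_updates"  # assumed name, for illustration only

    def __init__(self, connection, update_id):
        self.connection = connection
        self.update_id = update_id

    def create_marker_table(self):
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS {t} (update_id TEXT PRIMARY KEY)".format(
                t=self.marker_table))

    def touch(self):
        """Record that the work identified by update_id is complete."""
        self.create_marker_table()
        self.connection.execute(
            "INSERT OR IGNORE INTO {t} (update_id) VALUES (?)".format(
                t=self.marker_table),
            (self.update_id,))
        self.connection.commit()

    def exists(self):
        """Return True if update_id was already recorded."""
        self.create_marker_table()
        row = self.connection.execute(
            "SELECT 1 FROM {t} WHERE update_id = ? LIMIT 1".format(
                t=self.marker_table),
            (self.update_id,)).fetchone()
        return row is not None
```

A task that loads data into a database would call `touch()` in the same transaction as the load, so the marker and the data appear together.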

Slide 34

Slide 34 text

THE GRAND SCHEME

Slide 35

Slide 35 text

THE GRAND SCHEME Run Task

Slide 36

Slide 36 text

THE GRAND SCHEME Run Task Check Deps

Slide 37

Slide 37 text

THE GRAND SCHEME Run Task Check Deps self.requires()

Slide 38

Slide 38 text

THE GRAND SCHEME Run Task Check Deps self.requires() target.exists()

Slide 39

Slide 39 text

THE GRAND SCHEME Run Task Check Deps Run Deps self.requires() target.exists()

Slide 40

Slide 40 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output self.requires() target.exists()

Slide 41

Slide 41 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output Invoke Run self.requires() target.exists()

Slide 42

Slide 42 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output Invoke Run self.requires() target.exists() Check Scheduler

Slide 43

Slide 43 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output Invoke Run self.input() self.requires() target.exists() Check Scheduler

Slide 44

Slide 44 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output Invoke Run self.input() self.output() self.requires() target.exists() Check Scheduler

Slide 45

Slide 45 text

THE GRAND SCHEME Run Task Check Deps Run Deps Use Deps Output Invoke Run Write Output self.input() self.output() self.requires() target.exists() Check Scheduler
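
The flow built up over the last slides can be sketched as a toy, in-memory loop. This is an illustration under assumptions, not Luigi's implementation: `Target`, `Task`, `build`, `Numbers`, and `Total` are hypothetical stand-ins, and the "Check Scheduler" step is left out entirely. It shows the order of events: check whether the output exists, run missing dependencies first, then let `run()` read `self.input()` and write `self.output()`.

```python
class Target:
    """In-memory stand-in for an output target (hypothetical, not Luigi's class)."""

    def __init__(self, store, key):
        self.store, self.key = store, key

    def exists(self):
        return self.key in self.store

    def write(self, value):
        self.store[self.key] = value

    def read(self):
        return self.store[self.key]


class Task:
    def requires(self):
        return []  # no dependencies by default

    def input(self):
        # "Use Deps Output": the outputs of the required tasks
        return [dep.output() for dep in self.requires()]

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run task after its dependencies, skipping anything already done."""
    if task.output().exists():      # target.exists()
        return
    for dep in task.requires():     # Check Deps / Run Deps
        build(dep, log)
    task.run()                      # Invoke Run / Write Output
    log.append(type(task).__name__)


STORE = {}


class Numbers(Task):
    def output(self):
        return Target(STORE, "numbers")

    def run(self):
        self.output().write([1, 2, 3])


class Total(Task):
    def requires(self):
        return [Numbers()]

    def output(self):
        return Target(STORE, "total")

    def run(self):
        (numbers,) = self.input()
        self.output().write(sum(numbers.read()))
```

Building `Total()` runs `Numbers` first and then `Total`; building it a second time does nothing because both outputs already exist.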

Slide 46

Slide 46 text

OPEN SOURCE

Slide 47

Slide 47 text

BY

Slide 48

Slide 48 text


Slide 49

Slide 49 text

WHAT DID I DO?
• Add Redshift support
• Add MySQL support
• Various small features (improved notifications, dep.py, historydb etc.)
• Various bug reports
• And fixes!

Slide 50

Slide 50 text

LUIGI @ TOTANGO
• Daily computation
• Hourly computation
• Ad-hoc data loading (for data analysis activities, to Redshift)

Slide 51

Slide 51 text

TOTANGO’S SETUP invoke Luigi Workers coordinate Luigi Scheduler report

Slide 52

Slide 52 text

TOTANGO’S SETUP Luigi Worker read / write

Slide 53

Slide 53 text

MOAR SCREENSHOTS

Slide 54

Slide 54 text

MOAR SCREENSHOTS

Slide 55

Slide 55 text

MOAR SCREENSHOTS

Slide 56

Slide 56 text

MOAR SCREENSHOTS

Slide 57

Slide 57 text

MOAR SCREENSHOTS

Slide 58

Slide 58 text

GAMEBOY!!!

Slide 59

Slide 59 text

AND... GAMEBOY

Slide 60

Slide 60 text

GAMEBOY

Slide 61

Slide 61 text

GAMEBOY

Slide 62

Slide 62 text

GAMEBOY IS
• A Totango-specific controller for Luigi
• The transition process (to Luigi)
• Provides a high-level overview
• Manual re-run of tasks
• Monitors progress, performance, run times, queues, worker load etc.
• Implemented using Flask and AngularJS

Slide 63

Slide 63 text

IS GAMEBOY OPEN SOURCE?
• Well, no. At least not right away
• Right now it’s very Totango-specific.
    • Integrations with Librato Metrics
    • Queries on Totango databases
    • Uses Jenkins for controlling executions
    • Displays Totango’s account metadata (Totango’s business logic)
• Maybe some other day...

Slide 64

Slide 64 text

WHAT ELSE IS OUT THERE?

Slide 65

Slide 65 text

WHAT ELSE IS OUT THERE?
• Oozie
• Azkaban
• AWS Data Pipeline
• Chronos
• spring-batch
• Dataswarm (Facebook)
• River (Outbrain internal)
• What’s your favorite WF engine? (did you build one?)

Slide 66

Slide 66 text

MY OTHER PROJECTS
• https://github.com/hector-client/hector
• https://github.com/rantav/flask-restful-swagger
• https://github.com/sebastien/monitoring
• https://github.com/rantav/meteor-migrations
• https://github.com/rantav/node-github-list-packages
• https://github.com/rantav/devdev

Slide 67

Slide 67 text

REFS
• https://github.com/spotify/luigi
• Facebook’s Dataswarm https://www.youtube.com/watch?v=M0VCbhfQ3HQ
• Outbrain’s River https://www.youtube.com/watch?v=EzsckTggDiM