Luigi


An overview of the Luigi open source framework for the Committers forum at Aleph & Reversim


Ran Tavory

July 07, 2014

Transcript

  1. @rantav totango Tuesday, July 15, 14

  2. WHO AM I?

  3. WHO AM I?
    • A developer
    • Google, Microsoft, Outbrain, Gigaspaces, Totango etc.
    • Hector, flask-restful-swagger, meteor-migrations, monitoring...
    • Podcast: reversim.com
    • devdev.io
    • Gormim
  4. WHAT IS LUIGI?

  5. WHAT IS LUIGI?

  8. WHAT IS LUIGI?
    • A Workflow Engine.
    • Who the fuck needs a workflow engine?
    • You do!!!
    • If you run Hadoop (or other ETL jobs)
    • If you have dependencies between them (who doesn’t?!)
    • If they fail (s/if/when/)
    • Luigi doesn’t replace Hadoop, Scalding, Pig, Hive, Redshift.
    • It orchestrates them
  9. SCREENSHOTS

  10. HOW DO YOU ETL YOUR DATA?
    • Hadoop
    • Spark
    • Redshift
    • Postgres
    • Ad-hoc java/python/ruby/go/...

  11. RUNNING ONE JOB IS EASY, RUNNING MANY IS HARD
    • 100s of concurrent jobs, 1000s daily.
    • Job dependencies.
    • E.g. first copy the file, then crunch it.
    • Errors / retries
    • Idempotency
    • Monitoring / Visuals
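
The retry and idempotency bullets above can be illustrated with a small sketch. This is not Luigi code: `run_with_retries`, `flaky_job` and the "done marker" file are hypothetical names chosen for illustration, assuming a job counts as finished once a marker file exists.

```python
import os
import tempfile


def run_with_retries(job, done_marker, max_attempts=3):
    """Run job until it succeeds and return the number of attempts made.

    Returns 0 without running anything when done_marker already exists,
    which is what makes a re-run of the whole pipeline idempotent.
    """
    if os.path.exists(done_marker):
        return 0
    for attempt in range(1, max_attempts + 1):
        try:
            job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            continue
        # Record success only after the job finished without raising.
        with open(done_marker, "w") as marker:
            marker.write("done\n")
        return attempt


calls = []


def flaky_job():
    """Hypothetical job that fails on its first call only."""
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("transient failure")


marker_path = os.path.join(tempfile.mkdtemp(), "job.done")
first = run_with_retries(flaky_job, marker_path)   # retried once, then succeeded
second = run_with_retries(flaky_job, marker_path)  # skipped: marker already exists
```

The second call does no work at all, which is the property a scheduler relies on when it re-runs a partially failed pipeline.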
  12. THE WRONG WAY TO DO IT

  13. EXAMPLE WORKFLOW
    [Diagram: Log data (several inputs) -> Subsample and extract features -> Features -> Train classification model -> Model -> Upload model to servers]
  14. THE CRON PHENOMENON
    THE WRONG WAY TO DO IT

  15. THE CRON PHENOMENON
    Don’t try this at home!!!
    THE WRONG WAY TO DO IT
  16. ENTER LUIGI

  17. ENTER LUIGI
    • Like Makefile - but in Python
    • And - For data
    • Integrates well with data targets
    • Hadoop, Spark, Databases
    • Atomic file/db operations
    • Visualization
    • CLI - really nice developer interface!
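
The "atomic file operations" bullet can be sketched in plain Python. The usual technique is to write to a temporary file and rename it into place, so a downstream task never sees a half-written output. `atomic_write` is a hypothetical helper for illustration and is not part of Luigi's API.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the destination so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # os.replace swaps the file into place in one step on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "out.txt")
atomic_write(out_path, "hello")
```

Because the output either exists completely or not at all, "does the output exist?" becomes a reliable test for "did the task finish?".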
  18. LUIGI TASK

  19. LUIGI TASK

  20. RUN FROM THE CLI

  21. TASK PARAMETERS

  22. AWESOME HADOOP (MR) SUPPORT

  23. WEB UI

  24. PROCESS SYNCHRONIZATION

  25. USED BY

  26. SEMI-DEEP DIVE
    Programming for Luigi

  27. LUIGI TASKS
    • Implement 4 methods:
      def input(self) (optional)
      def output(self)
      def run(self)
      def requires(self)
  28. LUIGI TASKS
    • Or extend one of the predefined tasks
    • S3CopyToTable
    • RedshiftManifestTask
    • SparkJob
    • HiveQueryTask
    • HadoopJobTask
  29. EXAMPLE LOCAL WORDCOUNT

    import luigi

    class WordCount(luigi.Task):
        date_interval = luigi.DateIntervalParameter()

        def requires(self):
            # InputText (one task per date) is not shown on the slide
            return [InputText(date) for date in self.date_interval.dates()]

        def output(self):
            return luigi.LocalTarget('/var/tmp/text-count/%s' % self.date_interval)

        def run(self):
            # Count words across the outputs of all required tasks
            count = {}
            for file in self.input():
                for line in file.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1

            # output data
            f = self.output().open('w')
            for word, n in count.items():
                f.write("%s\t%d\n" % (word, n))
            f.close()
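  30. EXAMPLE HADOOP WORDCOUNT

    import luigi
    import luigi.contrib.hadoop
    import luigi.contrib.hdfs

    # The slide used luigi.hadoop / luigi.hdfs; later Luigi releases
    # moved these modules under luigi.contrib.
    class WordCount(luigi.contrib.hadoop.JobTask):
        date_interval = luigi.DateIntervalParameter()

        def requires(self):
            # InputText (one task per date) is not shown on the slide
            return [InputText(date) for date in self.date_interval.dates()]

        def output(self):
            return luigi.contrib.hdfs.HdfsTarget('/tmp/text-count/%s' % self.date_interval)

        def mapper(self, line):
            # Emit (word, 1) for every word in the input line
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            # Sum the counts emitted for each word
            yield key, sum(values)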
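
To see what the mapper and reducer above compute without a Hadoop cluster, here is a minimal local simulation. It is a sketch that assumes the usual map, sort-by-key, reduce sequence. `run_local` is a hypothetical helper and does not reflect how Luigi actually submits jobs.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Same logic as the slide's mapper: emit (word, 1) per word.
    for word in line.strip().split():
        yield word, 1


def reducer(key, values):
    # Same logic as the slide's reducer: sum the counts per word.
    yield key, sum(values)


def run_local(lines):
    """Map every line, sort by key (the 'shuffle'), then reduce each group."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


counts = dict(run_local(["a b a", "b c"]))
```

Here `counts` maps each word to its number of occurrences across both input lines.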
  31. LUIGI TARGETS
    • HDFS
    • Local File
    • Postgres / MySQL, Redshift, ElasticSearch
    • ... Easy to extend

  32. DEFINING A TARGET
    • Implement: def exists(self)
    And optionally: connect or open / close
  33. EXAMPLE MYSQL TARGET

    class MySqlTarget(luigi.Target):

        def touch(self, connection=None):
            ...

        def exists(self, connection=None):
            # Open a connection when the caller did not pass one
            if connection is None:
                connection = self.connect()
            cursor = connection.cursor()
            # The target "exists" once a marker row for this update is present
            cursor.execute("""SELECT 1 FROM {marker_table}
                              WHERE update_id = %s
                              LIMIT 1""".format(marker_table=self.marker_table),
                           (self.update_id,))
            row = cursor.fetchone()
            return row is not None

        def connect(self, autocommit=False):
            ...

        def create_marker_table(self):
            ...
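
The marker-table idea behind the MySQL target can be tried locally with SQLite from the standard library. This is a sketch of the pattern only. `MarkerTableTarget`, the `table_updates` table name and its columns are assumptions for illustration, not Luigi's actual schema or class.

```python
import sqlite3


class MarkerTableTarget:
    """Hypothetical target: it 'exists' once a marker row was recorded."""

    marker_table = "table_updates"  # assumed name, for illustration only

    def __init__(self, connection, update_id):
        self.connection = connection
        self.update_id = update_id

    def create_marker_table(self):
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS {t} ("
            "update_id TEXT PRIMARY KEY, "
            "inserted TIMESTAMP DEFAULT CURRENT_TIMESTAMP)".format(t=self.marker_table)
        )

    def touch(self):
        """Mark this update as done (safe to call more than once)."""
        self.create_marker_table()
        self.connection.execute(
            "INSERT OR REPLACE INTO {t} (update_id) VALUES (?)".format(t=self.marker_table),
            (self.update_id,),
        )
        self.connection.commit()

    def exists(self):
        """Return True when a marker row for update_id is present."""
        self.create_marker_table()
        row = self.connection.execute(
            "SELECT 1 FROM {t} WHERE update_id = ? LIMIT 1".format(t=self.marker_table),
            (self.update_id,),
        ).fetchone()
        return row is not None


conn = sqlite3.connect(":memory:")
target = MarkerTableTarget(conn, "load_2014-07-15")
before = target.exists()  # no marker row yet
target.touch()
after = target.exists()   # marker row recorded
```

A task that loads data into a database would call `touch()` in the same transaction as the load, so the marker and the data appear together.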
  34. THE GRAND SCHEME

  45. THE GRAND SCHEME
    [Diagram steps: Run Task -> Check Deps -> Run Deps -> Use Deps Output -> Invoke Run -> Write Output]
    [Diagram annotations: self.requires(), target.exists(), Check Scheduler, self.input(), self.output()]
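
The flow on this slide can be sketched as a tiny dependency resolver in plain Python. This is an illustration under simplifying assumptions, not Luigi's implementation: `Task`, `MemoryTarget`, `build`, `CopyFile` and `Crunch` are hypothetical stand-ins, and the "Check Scheduler" step (coordination between workers) is left out entirely.

```python
class MemoryTarget:
    """Hypothetical target backed by a shared in-memory dict."""

    store = {}

    def __init__(self, key):
        self.key = key

    def exists(self):
        return self.key in self.store

    def write(self, value):
        self.store[self.key] = value

    def read(self):
        return self.store[self.key]


class Task:
    """Hypothetical stand-in for a workflow task (not luigi.Task)."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def input(self):
        # "Use Deps Output": the inputs are the outputs of the requirements.
        return [dep.output() for dep in self.requires()]

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run task after its dependencies, skipping anything already done."""
    if task.output().exists():      # Check: target.exists()
        return
    for dep in task.requires():     # Check Deps / Run Deps: self.requires()
        build(dep, log)
    task.run()                      # Invoke Run, which writes the output
    log.append(type(task).__name__)


class CopyFile(Task):
    def output(self):
        return MemoryTarget("copied")

    def run(self):
        self.output().write("a b a")


class Crunch(Task):
    def requires(self):
        return [CopyFile()]

    def output(self):
        return MemoryTarget("crunched")

    def run(self):
        (source,) = self.input()
        self.output().write(len(source.read().split()))


log = []
build(Crunch(), log)
build(Crunch(), log)  # second call does nothing: the output already exists
```

After both calls, `log` records that `CopyFile` ran before `Crunch` and that neither ran a second time.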
  46. OPEN SOURCE

  47. BY

  49. WHAT DID I DO?
    • Add Redshift support
    • Add MySQL support
    • Various small features (improved notifications, dep.py, historydb etc.)
    • Various bug reports
    • And fixes!

  50. LUIGI @ TOTANGO
    • Daily computation
    • Hourly computation
    • Ad-hoc data loading (for data analysis activities, to Redshift)

  51. TOTANGO’S SETUP
    [Diagram labels: invoke, Luigi Workers, coordinate, Luigi Scheduler, report]

  52. TOTANGO’S SETUP
    [Diagram labels: Luigi Worker, read / write]
  53. MOAR SCREENSHOTS

  54. MOAR SCREENSHOTS

  55. MOAR SCREENSHOTS

  56. MOAR SCREENSHOTS

  57. MOAR SCREENSHOTS

  58. GAMEBOY!!!

  59. AND... GAMEBOY

  60. GAMEBOY

  61. GAMEBOY

  62. GAMEBOY IS
    • A Totango-specific controller for Luigi
    • The transition process (to Luigi)
    • Provide high level overview
    • Manual re-run of tasks
    • Monitor progress, performance, run times, queues, worker load etc...
    • Implemented using Flask and AngularJS

  63. IS GAMEBOY OPEN SOURCE?
    • Well, no. At least not right away
    • Right now it’s very Totango-specific.
    • Integrations with Librato metrics
    • Queries on Totango Databases
    • Uses Jenkins for controlling executions
    • Displays Totango’s Account metadata (Totango’s business logic)
    • Maybe some other day...
  64. WHAT ELSE IS OUT THERE?

  65. WHAT ELSE IS OUT THERE?
    • Oozie
    • Azkaban
    • AWS Data Pipeline
    • Chronos
    • spring-batch
    • Dataswarm (Facebook)
    • River (Outbrain internal)
    • What’s your favorite WF engine? (did you build one?)

  66. MY OTHER PROJECTS
    • https://github.com/hector-client/hector
    • https://github.com/rantav/flask-restful-swagger
    • https://github.com/sebastien/monitoring
    • https://github.com/rantav/meteor-migrations
    • https://github.com/rantav/node-github-list-packages
    • https://github.com/rantav/devdev

  67. REFS
    • https://github.com/spotify/luigi
    • Facebook’s Dataswarm https://www.youtube.com/watch?v=M0VCbhfQ3HQ
    • Outbrain’s River https://www.youtube.com/watch?v=EzsckTggDiM