Lessons from 6 Months of using Luigi

AKA Why it's better to be woken up by your cat than by the server alarm

peteowlett

May 07, 2016

Transcript

  1. Lessons from 6 months of using Luigi in production @peterowlett @deliveroo

  2. Hello! I’m Pete

  3. None
  4. I work for these folks

  5. WE DO THIS

  6. A BETTER TITLE: Why it’s better to be woken up by your cat than by the server alarm

  7. This is Kitty

  8. NO RESPECT FOR PERSONAL SPACE

  9. This is PagerDuty

  10. Even less respect for personal space

  11. Let’s Compare!

    PagerDuty:
    - Goes off at any time, day or night
    - Loud ring tone, text messages, answer phone messages and flashing
    - Resolution can take hours

    Kitty:
    - Goes off only once at precisely 6am
    - Cute batting motion to wake
    - Resolved in time it takes to open cat food packet
  12. I think we can all agree with my premise: Kitty >> PagerDuty

  13. Let’s get to it

  14. Chapter 1 “The Model”

  15. Let’s build a model

  16. I’m ready, where’s the data?

  17. “Just pg_dump the prod db”

  18. OH PLS PLS NO DON’T DO THAT

  19. Let’s spin up a read slave and ETL the data to a warehouse …

  20. … then train our models from that

  21. How do we ensure tasks run in Order?

  22. I want them to run one after the other

    (diagram: tasks numbered 1 to 8)
  23. Directed Acyclic Graph

  24. None
  25. Enter Stage Left …

  26. Simple Task

  27. Postgres Loader Task

  28. We string these together to make DAGs

    (diagram nodes: CHECK MAX ROW ID and LOAD DATA for each of TABLE1 and TABLE2, then MOD DATA, then MAKE MODEL)
  29. DAGs solve the dependency problem
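
As a library-free illustration of why a DAG solves the ordering problem, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+) rather than Luigi itself. The task names come from slide 28; the edges between them are an assumption, since the diagram is not part of the transcript.

```python
import graphlib

# Map each task to the set of tasks it depends on.
# These edges are assumed for illustration only.
dependencies = {
    "CHECK MAX ROW ID": set(),
    "LOAD DATA": {"CHECK MAX ROW ID"},
    "MOD DATA": {"LOAD DATA"},
    "MAKE MODEL": {"MOD DATA"},
}


def run_order(deps):
    """Return one valid execution order.

    Raises graphlib.CycleError if the graph contains a cycle,
    i.e. if it is not actually a DAG.
    """
    return list(graphlib.TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because every task lists what it needs, the order falls out of the graph instead of being hard-coded in a script.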

  30. Bung it all on EC2

  31. Define an entry point

  32. Run the scheduler

  33. Kick it all off with CRON

  34. With Luigi we were up and running in a few hours

  35. Chapter 2 “The Nuclear Option”

  36. A few weeks later, something happened …

  37. None
  38. Schema can change anytime without warning

    HAS THE SCHEMA CHANGED?
    NO! RELOAD JUST NEW ROWS
    YES! DROP AND CREATE WHOLE SCHEMA, RELOAD ALL TABLES
  39. None
  40. Handle schema changes robustly
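
The decision on slide 38 can be sketched roughly as follows. This is not the pipeline from the talk: it uses two `sqlite3` connections as stand-ins for the read replica and the warehouse, and it assumes every table has a positive integer `id` column. The names `columns` and `sync_all` are hypothetical.

```python
import sqlite3


def columns(conn, table):
    """Return [(name, declared_type), ...], or [] if the table is missing."""
    # Table names are interpolated directly, so pass trusted names only.
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def sync_all(source, warehouse, tables):
    """Copy tables from source to warehouse; return which path was taken."""
    # If any table's schema differs, take the nuclear option for all of them.
    changed = any(columns(source, t) != columns(warehouse, t) for t in tables)
    for table in tables:
        cols = columns(source, table)
        if changed:
            ddl = ", ".join(f"{name} {ctype}" for name, ctype in cols)
            warehouse.execute(f"DROP TABLE IF EXISTS {table}")
            warehouse.execute(f"CREATE TABLE {table} ({ddl})")
            last_id = 0
        else:
            # Schema unchanged: only fetch rows past the highest id loaded so far.
            last_id = warehouse.execute(
                f"SELECT COALESCE(MAX(id), 0) FROM {table}"
            ).fetchone()[0]
        rows = source.execute(
            f"SELECT * FROM {table} WHERE id > ? ORDER BY id", (last_id,)
        ).fetchall()
        if rows:
            marks = ", ".join("?" * len(cols))
            warehouse.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    warehouse.commit()
    return "full reload" if changed else "new rows only"
```

The cost of this design is visible in the code: one changed column anywhere triggers a reload of every table.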

  41. Let’s test our pipeline before we deploy it. But how?

  42. Two new operating modes

    TEST MODE: Run the whole pipeline but only write to a test schema
    UNIT MODE: Run the current task, ignoring its dependencies
  43. Configure these modes in the pipeline using luigi.Parameter
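
A rough sketch of what the two modes imply, written in plain Python so it runs without Luigi installed. In a real pipeline the `mode` value would be a `luigi.Parameter` on each task, as the slide says. The class name `RunConfig`, the mode strings and the schema names are all hypothetical, and sending UNIT MODE writes to the test schema is an assumption.

```python
from dataclasses import dataclass

MODES = ("prod", "test", "unit")


@dataclass(frozen=True)
class RunConfig:
    """Carries the operating mode that every task consults."""

    mode: str = "prod"

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")

    @property
    def schema(self):
        # TEST MODE only writes to a test schema.
        # Assumption: UNIT MODE also stays away from the live schema.
        return "warehouse" if self.mode == "prod" else "warehouse_test"

    def requirements(self, declared):
        # UNIT MODE runs the current task and ignores its dependencies.
        return [] if self.mode == "unit" else list(declared)


config = RunConfig(mode="unit")
print(config.schema, config.requirements(["LoadOrders", "LoadUsers"]))
```

A task's `requires()` would return `config.requirements([...])` and its loader would write to `config.schema`, so switching modes never needs a code change.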

  44. Now nothing will ever go wrong, ever again …

  45. Make your testing comprehensive

  46. Make your testing fast

  47. Adding in external API services

  48. Build Loaders for each API

  49. Loading Schedules

  50. Plumbing them in

  51. Keep def rows as short as possible

  52. Be consistent in loader design pattern

  53. Expect external API services to misbehave

  54. Expect external API services to misbehave X

  55. Trust external API services as if they actively want to hurt you

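
Two defensive habits follow from that stance: retry calls that fail, and check every row before it reaches the warehouse. The sketch below is one possible shape for both, assuming the loader receives rows as dictionaries. `fetch_with_retry` and `validate_rows` are hypothetical helper names, not part of Luigi or of the talk.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    The last failure is re-raised so the task is still marked as failed.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def validate_rows(rows, required):
    """Split rows into (good, bad).

    A row is good if it is a dict and every required field is present
    and not None.
    """
    good, bad = [], []
    for row in rows:
        ok = isinstance(row, dict) and all(row.get(key) is not None for key in required)
        (good if ok else bad).append(row)
    return good, bad
```

Passing `sleep` in as an argument keeps the retry logic testable without real waiting.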
  56. Hey cool, all our data is in one place, we might as well use it for BI Reporting

  57. Chapter 3 “The Management Report”

  58. This happened

  59. And then your ad hoc database is now supporting global business critical apps

  60. None
  61. Stuff that was happening

    • Irrelevant upstream failures
    • Low priority upstream failures
    • Flakey Data (but it worked!)
  62. Our DAG looked like this:

    (diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END)
  63. And this was happening

    (diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END)
  64. This one doesn’t need LOAD3

    (diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END)
  65. So we changed it to this

    (diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD ALL, MAKE1, MAKE2, MAKE3, END)
  66. Now when 3 fails:

    (diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD ALL, MAKE1, MAKE2, MAKE3, END)
  67. Decide what can be allowed to fail, and what can’t

  68. Isolate the path to critical ops jobs
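
To see why a single barrier task hurts, the sketch below computes which tasks are blocked when one load fails. The node names are from slides 62 to 66, but the edges are assumptions: the transcript does not say which MAKE task needs which LOAD, and the explicit graph here leaves out the LOAD ALL node entirely.

```python
def blocked_tasks(deps, failed):
    """Return every task that cannot run because something upstream failed."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for task, needs in deps.items():
            if task in blocked or task in failed:
                continue
            # Blocked if any requirement failed or is itself blocked.
            if needs & (failed | blocked):
                blocked.add(task)
                changed = True
    return blocked


# Everything waits on one barrier task.
barrier = {
    "LOAD DONE": {"LOAD1", "LOAD2", "LOAD3"},
    "MAKE1": {"LOAD DONE"},
    "MAKE2": {"LOAD DONE"},
    "MAKE3": {"LOAD DONE"},
}

# Each job names only the loads it really needs (assumed edges).
explicit = {
    "MAKE1": {"LOAD1"},
    "MAKE2": {"LOAD1", "LOAD2"},
    "MAKE3": {"LOAD3"},
}

print(sorted(blocked_tasks(barrier, {"LOAD3"})))   # ['LOAD DONE', 'MAKE1', 'MAKE2', 'MAKE3']
print(sorted(blocked_tasks(explicit, {"LOAD3"})))  # ['MAKE3']
```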

  69. Loading tables more reliably 5AM SATURDAY

  70. The currency table failed to update

  71. Loading tables more reliably

    Task 1: DROP TABLE
    Task 2: CREATE TABLE
    Task 3: LOAD DATA
  72. Loading tables more reliably

    Task 1: DROP TABLE (THIS CAN GO WRONG)
    Task 2: CREATE TABLE (THIS CAN GO WRONG)
    Task 3: LOAD DATA (THIS CAN GO WRONG)
  73. Expect Failure, Rollback Transaction

    Task 1 (There is no task 2): CREATE TEMP TABLE, LOAD DATA, RENAME OLD TABLE, RENAME NEW TABLE
    On failure: ROLLBACK
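
The single-task version on slide 73 can be sketched as one transaction that builds the replacement table beside the live one and swaps names only at the end. This is an illustration against `sqlite3`, not the Postgres code from the talk. The function name `reload_table`, the two-column currency layout, and the final `DROP` of the old table are assumptions.

```python
import sqlite3


def reload_table(conn, table, rows):
    """Replace `table` with `rows`, or leave it untouched if anything fails.

    Assumes `conn` was opened with isolation_level=None so that BEGIN,
    COMMIT and ROLLBACK are issued explicitly. Table names are interpolated
    directly, so pass trusted names only.
    """
    new, old = f"{table}_new", f"{table}_old"
    conn.execute("BEGIN")
    try:
        conn.execute(f"CREATE TABLE {new} (code TEXT NOT NULL, rate REAL NOT NULL)")
        conn.executemany(f"INSERT INTO {new} VALUES (?, ?)", rows)
        conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
        conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
        # Assumption: the old copy is dropped so the next run can reuse the name.
        conn.execute(f"DROP TABLE {old}")
        conn.execute("COMMIT")
    except Exception:
        # Undo everything, including the CREATE and both renames.
        conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE currency (code TEXT NOT NULL, rate REAL NOT NULL)")
conn.execute("INSERT INTO currency VALUES ('GBP', 1.0)")
reload_table(conn, "currency", [("GBP", 1.0), ("EUR", 1.27)])
```

If the load fails halfway, readers of `currency` keep seeing yesterday's rows instead of an empty or missing table.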
  74. None
  75. Encapsulate logic in bigger chunks

  76. None
  77. Anticipating problems early

  78. Going beyond system monitoring

  79. Defined Monitoring Tests

  80. Measuring outcomes directly

  81. And get gentler alerts in slack

  82. Monitor (and alert on) outcomes as well as system metrics
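
An outcome check asks "did the data arrive?" rather than "is the server up?". The sketch below is one hypothetical way to express such checks as small named functions. `is_fresh` and `run_checks` are illustrative names, and the freshness rule is an assumed example of an outcome worth measuring.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, max_age, now=None):
    """True if the most recent load happened within max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


def run_checks(checks):
    """Run named outcome checks; return a message for each that fails or crashes."""
    problems = []
    for name, check in checks.items():
        try:
            if not check():
                problems.append(f"{name}: FAILED")
        except Exception as exc:
            # A check that cannot run is itself worth an alert.
            problems.append(f"{name}: ERROR {exc!r}")
    return problems
```

The returned messages can then be routed to whatever channel fits their urgency.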

  83. Try / Except / Slack Alert low priority tasks
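
One way to apply the try / except / alert pattern is a decorator around the body of a low priority task. This is a sketch under assumptions: `low_priority` is a hypothetical name, and `notify` stands for any callable that delivers a message (for example, a function that posts to a Slack webhook), which is left out here so the example has no network dependency.

```python
import functools


def low_priority(notify):
    """Decorator: report failures through notify(message) instead of raising."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Swallow the error so downstream critical jobs still run.
                notify(f"{func.__name__} failed: {exc}")
                return None
        return wrapper
    return decorate


alerts = []


@low_priority(alerts.append)
def load_weather():
    raise TimeoutError("API timed out")


result = load_weather()
print(alerts)
```

The trade-off is deliberate: the task reports success to the scheduler, so only use this where stale or missing data is acceptable.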

  84. Chapter 4 “Hi, this is Australia calling …”

  85. In the beginning there was the UK

    YAY! DOWNTIME!! (timeline from midnight to midnight, showing UK Ops)
  86. Then Europe

    Ok cool still loads of downtime (timeline from midnight to midnight)
  87. Then Some Other Places

    No such thing as downtime anymore (timeline from midnight to midnight)
  88. Table Loading - Take 2

    HASH THE TABLE SCHEMA
    COMPARE TO LAST HASH
    SAME! JUST LOAD ROWS
    CHANGED! DROP AND REBUILD
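
The per-table check on slide 88 might look something like this. It is a sketch against `sqlite3`, so the way the column list is read (`PRAGMA table_info`) is specific to SQLite and an assumption here. `schema_hash` and `load_plan` are hypothetical names, and the dictionary of previous hashes stands in for wherever the last hash is stored.

```python
import hashlib
import sqlite3


def schema_hash(conn, table):
    """Hash the ordered list of (column name, declared type) for a table."""
    # Table names are interpolated directly, so pass trusted names only.
    cols = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    return hashlib.sha256(repr(cols).encode("utf-8")).hexdigest()


def load_plan(conn, table, last_hashes):
    """Compare with the stored hash, decide what to do, and remember the new hash."""
    current = schema_hash(conn, table)
    plan = "just load rows" if last_hashes.get(table) == current else "drop and rebuild"
    last_hashes[table] = current
    return plan
```

Because the decision is made per table, a changed column in one table no longer forces every other table to be rebuilt.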
  89. Get rid of the nuclear option

  90. SORRY RIPLEY

  91. Chapter 5 “Moving to Scale”

  92. When it comes to BI, Old School Rules Still Apply

  93. Configuration Management (Docker + ECS)

  94. Distributed workers make some pain go away

  95. Protobuf3 on Message Bus

  96. Final Thoughts

  97. I regret nothing!

  98. Everything is defined in code

  99. Two people, tiny budget

  100. Time spent speeding up the build process is time well spent

  101. Think carefully about what dependencies *mean*

  102. To finish …

  103. None
  104. We’re hiring! Grab me after :) https://roo.it/peteo Also £5 off your first order!

  105. Sleep Well! @peterowlett @deliveroo Sleep Well!