Lessons from 6 Months of using Luigi

AKA Why it's better to be woken up by your cat than by the server alarm

peteowlett

May 07, 2016

Transcript

  1. Lessons from 6 months of
    using Luigi in production
    @peterowlett @deliveroo

  2. Hello!
    I’m Pete

  4. I work for these folks

  5. WE DO
    THIS

  6. Why it’s better to be
    woken up by your cat
    than by the server alarm
    A BETTER TITLE

  7. This is Kitty

  8. NO RESPECT
    FOR PERSONAL SPACE

  9. This is
    PagerDuty

  10. Even less respect for
    personal space

  11. Let’s Compare!
    PagerDuty
    - Goes off at any time, day or night
    - Loud ringtone, text messages, answerphone messages and flashing
    - Resolution can take hours
    Kitty
    - Goes off only once, at precisely 6am
    - Cute batting motion to wake
    - Resolved in the time it takes to open a cat food packet

  12. I think we can all
    agree with my premise
    Kitty >> PagerDuty

  13. Let’s get to it

  14. Chapter 1
    “The Model”

  15. Let’s build a model

  16. I’m ready, where’s the data?

  17. “Just pg_dump the
    prod db”

  18. OH PLS PLS NO
    DON’T DO THAT

  19. Let’s spin up a read slave
    and ETL the data to a
    warehouse …

  20. … then train our
    models from that

  21. How do we ensure
    tasks run in order?

  22. I want them to run one after
    the other
    [Diagram: eight numbered tasks, 1 to 8]

  23. Directed
    Acyclic
    Graph

  25. Enter Stage Left …

  26. Simple Task

  27. Postgres Loader Task

  28. We string these together to
    make DAGs
    [Diagram nodes: CHECK MAX ROW ID and LOAD DATA (one pair each for TABLE1 and TABLE2), MOD DATA, MAKE MODEL]

  29. DAGs solve the
    dependency problem
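
A minimal sketch of what "run in order" means for a DAG, using Python’s standard-library `graphlib` rather than Luigi itself. The task names are hypothetical, loosely based on the boxes on the previous slide, and the edges are an assumption. In Luigi the same information comes from each task’s `requires()` method.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "load_table1": {"check_max_row_id_table1"},
    "load_table2": {"check_max_row_id_table2"},
    "mod_data": {"load_table1", "load_table2"},
    "make_model": {"mod_data"},
}


def run_order(deps):
    """Return one valid execution order: every task comes after its dependencies.

    Raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because the graph is acyclic there is always at least one valid order; a cycle is reported as an error instead of looping forever.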

  30. Bung it all on EC2

  31. Define an entry point

  32. Run the scheduler

  33. Kick it all off with cron

  34. With Luigi we were up and
    running in a few hours

  35. Chapter 2
    “The Nuclear Option”

  36. A few weeks later,
    something happened …

  38. Schema can change
    anytime without warning
    [Flowchart: HAS THE SCHEMA CHANGED? NO: RELOAD JUST NEW ROWS. YES: DROP AND CREATE WHOLE SCHEMA, then RELOAD ALL TABLES.]

  40. Handle schema
    changes robustly

  41. Let’s test our pipeline
    before we deploy it. But
    how?

  42. Two new operating modes
    TEST MODE
    Run the whole pipeline
    but only write to a test
    schema
    UNIT MODE
    Run the current task,
    ignoring its
    dependencies

  43. Configure these modes in
    the pipeline using
    luigi.Parameter
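
A rough, Luigi-free sketch of how the two modes could behave. In the pipeline itself these flags would be declared as parameters on the tasks, as the slide says; here `RunConfig`, the schema names and the task names are all hypothetical, and the dependency walk is a simplification of what a scheduler does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    test_mode: bool = False  # write to a test schema instead of the live one
    unit_mode: bool = False  # run only the requested task, ignoring dependencies


def target_schema(cfg, live="warehouse", test="warehouse_test"):
    """Pick the schema every task writes to (schema names are assumptions)."""
    return test if cfg.test_mode else live


def tasks_to_run(task, deps, cfg):
    """List the tasks to execute, dependencies first; unit mode skips them."""
    if cfg.unit_mode:
        return [task]
    ordered = []

    def visit(name):
        # Depth-first walk: schedule everything a task needs before the task.
        for dep in deps.get(name, ()):
            visit(dep)
        if name not in ordered:
            ordered.append(name)

    visit(task)
    return ordered


deps = {"make_model": ["load_data"], "load_data": ["check_max_row_id"]}
print(tasks_to_run("make_model", deps, RunConfig()))
print(tasks_to_run("make_model", deps, RunConfig(unit_mode=True)))
print(target_schema(RunConfig(test_mode=True)))
```

Keeping both switches in one config object means a task never has to know why it is writing to a different schema.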

  44. Now nothing will ever
    go wrong, ever again …

  45. Make your testing
    comprehensive

  46. Make your testing fast

  47. Adding in external API
    services

  48. Build Loaders for each API

  49. Loading Schedules

  50. Plumbing them in

  51. Keep def rows as
    short as possible

  52. Be consistent in
    loader design pattern

  53. Expect external API
    services to misbehave

  54. Expect external API
    services to misbehave
    [crossed out]

  55. Trust external API
    services as if they
    actively want to hurt you
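
One way to act on that distrust, sketched under assumptions: wrap every API call in retries with backoff, and validate the payload before it gets near the warehouse. `fetch` stands in for whatever client call a loader makes; the function names, the list-of-records payload shape and the retry counts are all hypothetical.

```python
import time


class BadResponse(Exception):
    """Raised when a payload does not have the shape the loader expects."""


def validate(payload, required_keys):
    """Check that the payload is a list of dicts carrying every required key."""
    if not isinstance(payload, list):
        raise BadResponse(f"expected a list of records, got {type(payload).__name__}")
    for i, record in enumerate(payload):
        if not isinstance(record, dict):
            raise BadResponse(f"record {i} is not an object")
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise BadResponse(f"record {i} is missing {missing}")


def fetch_with_retries(fetch, required_keys, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a valid payload, backing off between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            payload = fetch()
            validate(payload, required_keys)
            return payload
        except Exception as exc:  # timeouts, HTTP errors, malformed payloads
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"API still failing after {attempts} attempts") from last_error
```

Passing `sleep` in as an argument keeps the backoff testable without actually waiting.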

  56. Hey cool, all our data is in
    one place, we might as
    well use it for BI Reporting

  57. Chapter 3
    “The Management Report”

  58. This happened

  59. And then your ad hoc
    database is now supporting
    global business critical apps

  61. Stuff that was happening
    • Irrelevant upstream failures
    • Low priority upstream failures
    • Flaky Data (but it worked!)

  62. Our DAG looked like this:
    [Diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END]

  63. And this was happening
    [Diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END]

  64. This one doesn’t need
    LOAD3
    [Diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD DONE, MAKE1, MAKE2, MAKE3, END]

  65. So we changed it to this
    [Diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD ALL, MAKE1, MAKE2, MAKE3, END]

  66. Now when 3 fails:
    [Diagram nodes: START, LOAD1, LOAD2, LOAD3, LOAD ALL, MAKE1, MAKE2, MAKE3, END]

  67. Decide what can be
    allowed to fail, and
    what can’t

  68. Isolate the path to
    critical ops jobs
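
To make the difference concrete, here is a small sketch that computes which tasks a failed load takes down with it. The node names come from the diagrams above, but the exact edges are not recoverable from the transcript, so both layouts below are assumptions for illustration only.

```python
def blocked_by(failed, deps):
    """Return every task that depends, directly or indirectly, on `failed`."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for task, needs in deps.items():
            needs = set(needs)
            if task not in blocked and (failed in needs or blocked & needs):
                blocked.add(task)
                changed = True
    return blocked


# Funnel layout (assumed): every MAKE task waits on a single gate task.
funnel = {
    "LOAD_DONE": {"LOAD1", "LOAD2", "LOAD3"},
    "MAKE1": {"LOAD_DONE"},
    "MAKE2": {"LOAD_DONE"},
    "MAKE3": {"LOAD_DONE"},
}

# Isolated layout (hypothetical edges): each MAKE task names only the loads it reads.
isolated = {
    "MAKE1": {"LOAD1"},
    "MAKE2": {"LOAD1", "LOAD2"},
    "MAKE3": {"LOAD3"},
}

print(sorted(blocked_by("LOAD3", funnel)))
print(sorted(blocked_by("LOAD3", isolated)))
```

With the funnel, one failed load blocks every downstream job; with explicit per-job dependencies, only the jobs that actually read the failed table stop.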

  69. Loading tables more reliably
    5AM SATURDAY

  70. The currency table
    failed to update

  71. Loading tables more reliably
    [Diagram: Task 1 DROP TABLE -> Task 2 CREATE TABLE -> Task 3 LOAD DATA]

  72. Loading tables more reliably
    [Diagram: Task 1 DROP TABLE -> Task 2 CREATE TABLE -> Task 3 LOAD DATA, each labelled THIS CAN GO WRONG]

  73. Expect Failure, Rollback
    Transaction
    [Diagram: Task 1 (There is no task 2): CREATE TEMP TABLE -> LOAD DATA -> RENAME OLD TABLE -> RENAME NEW TABLE, with ROLLBACK on failure]

  75. Encapsulate logic in
    bigger chunks
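
A sketch of the single-task, single-transaction load described two slides back. It uses the standard-library `sqlite3` module purely so the example is self-contained; the talk does not say which database driver was used. The table layout, the `_new` and `_old` suffixes and the final `DROP` of the old table are assumptions.

```python
import sqlite3


def swap_load(conn, table, rows):
    """Build a fresh copy of `table` and swap it in, all inside one transaction.

    If anything fails, the transaction is rolled back and the old table is
    left untouched. `table` must be a trusted identifier, not user input.
    """
    conn.execute("BEGIN")
    try:
        conn.execute(f"CREATE TABLE {table}_new (code TEXT PRIMARY KEY, rate REAL NOT NULL)")
        conn.executemany(f"INSERT INTO {table}_new VALUES (?, ?)", rows)
        conn.execute(f"ALTER TABLE {table} RENAME TO {table}_old")
        conn.execute(f"ALTER TABLE {table}_new RENAME TO {table}")
        conn.execute(f"DROP TABLE {table}_old")  # assumption: old copy is not kept
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN / COMMIT / ROLLBACK above are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE currency (code TEXT PRIMARY KEY, rate REAL NOT NULL)")
conn.execute("INSERT INTO currency VALUES ('GBP', 1.0)")

swap_load(conn, "currency", [("GBP", 1.0), ("EUR", 1.27)])
```

Because drop, create and load now live in one unit, there is no intermediate state in which the table is missing or half-loaded.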

  77. Anticipating problems early

  78. Going beyond system
    monitoring

  79. Defined Monitoring Tests

  80. Measuring outcomes
    directly

  81. And get gentler alerts in
    Slack

  82. Monitor (and alert on)
    outcomes as well as
    system metrics
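
An outcome check might look something like the sketch below: instead of asking whether the server is healthy, it asks whether each table actually has fresh data in it. The thresholds, the table names and the shape of `stats` are hypothetical; the talk does not show how its monitoring tests were defined.

```python
from datetime import datetime, timedelta


def check_outcomes(stats, now, max_age=timedelta(hours=24), min_rows=1):
    """Return human-readable alerts for tables that look empty or stale.

    `stats` maps table name -> (row_count, last_loaded_at).
    """
    alerts = []
    for table, (row_count, last_loaded_at) in sorted(stats.items()):
        if row_count < min_rows:
            alerts.append(f"{table}: only {row_count} rows")
        if now - last_loaded_at > max_age:
            alerts.append(f"{table}: last loaded {last_loaded_at:%Y-%m-%d %H:%M}")
    return alerts


now = datetime(2016, 5, 7, 6, 0)
stats = {
    "currency": (150, datetime(2016, 5, 5, 5, 0)),  # loaded two days ago
    "orders": (0, datetime(2016, 5, 7, 5, 0)),      # loaded, but empty
}
for alert in check_outcomes(stats, now):
    print(alert)
```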

  83. Try / Except / Slack
    Alert low priority tasks
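
The pattern on this slide, sketched with a stand-in notifier. `notify` is a hypothetical callable; in practice it might post the message to a Slack channel, but no Slack API is called here and the message format is an assumption.

```python
def run_low_priority(name, task, notify):
    """Run task(); on failure, send a message and carry on.

    Returns True if the task succeeded, False if it failed and was reported.
    """
    try:
        task()
        return True
    except Exception as exc:
        # A low-priority failure should be visible, but should not page anyone
        # or stop the rest of the pipeline.
        notify(f"Low-priority task {name} failed: {exc!r}")
        return False


def broken_task():
    raise ValueError("boom")


messages = []
ok = run_low_priority("refresh_report", broken_task, messages.append)
print(ok, messages)
```

Critical tasks should not be wrapped like this; their failures still need to stop the run and raise a proper alert.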

  84. Chapter 4
    “Hi, this is Australia calling …”

  85. In the beginning there was
    the UK
    [Timeline from midnight through midday to midnight, showing UK Ops and “YAY! DOWNTIME!!”]

  86. Then Europe
    Ok cool still
    loads of downtime
    [Timeline from midnight through midday to midnight]

  87. Then Some Other Places
    No such thing as downtime anymore
    [Timeline from midnight through midday to midnight]

  88. Table Loading - Take 2
    [Flowchart: HASH THE TABLE SCHEMA -> COMPARE TO LAST HASH. SAME!: JUST LOAD ROWS. CHANGED!: DROP AND REBUILD.]

  89. Get rid of the nuclear
    option
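
The per-table check from the previous slide could be sketched roughly as follows. How the column list is obtained is an assumption (for example, a query against the source database’s catalogue), as are the function names and the choice of SHA-256.

```python
import hashlib


def schema_hash(columns):
    """Hash a table's schema, given (column_name, data_type) pairs in column order."""
    canonical = "\n".join(f"{name}:{dtype}" for name, dtype in columns)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_load(table, columns, last_hashes):
    """Decide how to load one table; returns (action, new_hash).

    A table with no stored hash is treated as changed, so it gets built once.
    """
    new_hash = schema_hash(columns)
    if last_hashes.get(table) == new_hash:
        return "load_rows", new_hash
    return "drop_and_rebuild", new_hash


columns = [("id", "integer"), ("status", "text")]
last_hashes = {}

action, digest = plan_load("orders", columns, last_hashes)
last_hashes["orders"] = digest  # remember the hash for the next run
print(action)
```

Only the table whose schema changed gets rebuilt; every other table carries on loading rows, which matters once there is no quiet period left to rebuild everything.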

  90. SORRY RIPLEY

  91. Chapter 5
    “Moving to Scale”

  92. When it comes to BI, Old
    School Rules Still Apply

  93. Configuration
    Management
    (Docker + ECS)

  94. Distributed workers make
    some pain go away

  95. Protobuf3 on
    Message Bus

  96. Final Thoughts

  97. I regret nothing!

  98. Everything is defined
    in code

  99. Two people, tiny
    budget

  100. Time spent speeding up
    the build process is time
    well spent

  101. Think carefully about
    what dependencies
    *mean*

  102. To finish …

  104. We’re hiring! Grab me
    after :)
    https://roo.it/peteo
    Also £5 off your first order!

  105. Sleep Well!
    @peterowlett @deliveroo