Promises and Queues: Using Unlikely Suspects to Handle Asynchronous Parallel Processing

Jacob Mather
February 09, 2019

My last job was with a distributed manufacturing platform for turning digital ideas into physical products. It enabled customers to upload 3D models, have the models manufactured into physical goods, and have them delivered into the customer's hands, all within 24 hours. Every time a digital model was uploaded, we processed the file with an array of tools that inspected the model and made determinations about its manufacturability, size, and, perhaps most importantly, price. One of the very first things I did there was to completely overhaul this process, converting it from a mystical black box into a clear set of discrete processes with copious amounts of highly visible logging. While there are many possible ways to do this, I chose Jenkins, and I supported that system for over a year. Though I might use different tools today, Jenkins bought us a lot of time, and I left the company with quite a bit of runway before any changes would be required.

In this talk I will cover where we started, why I chose Jenkins, why it works so well for this use case, and how to use these same patterns to solve your asynchronous parallel processing problems, regardless of your platform. Our use patterns showed us that managing jobs in Jenkins can be a very similar experience to managing code deployed to serverless solutions such as AWS Lambda. Let me show you how.


Transcript

  1. Promises and Queues
    Using Unlikely Suspects to Handle Async Parallel Processes


  2. About Me
    What I do:
    • Senior Web Developer at Dart Container
    • Spouter of Opposing Ideas
    • Community Evangelist
    Where I can be found:
    • Blog: http://jmather.com
    • Twitter: @thejmather
    • Medium: @jacobmather


  3. User Uploads Digital Model → User Selects Material → User Receives Physical Model


  4. Focus Area
    User Uploads Digital Model → User Selects Material → User Receives Physical Model


  5. User Uploads Digital Model → User Selects Material → System Inspects Model


  6. System Inspects Model
    File Upload → File Inspection → Part Order

  7. System Inspects Model
    File Upload (Cloud) → File Inspection → Part Order (Cloud)


  8. System Inspects Model
    File Upload (Cloud) → File Inspection (On Prem) → Part Order (Cloud)


  9. How did we get the uploaded file from the
    cloud to the computer under the stairs for
    processing, and then get the results back
    into the cloud for our application?


  10. Problems with this setup
    • No Clarity
    • Not really sure which thing does what, or when
    • No Visibility
    • Not really sure when things break
    • No Verifiability
    • Not really sure when things are done


  11. Other issues…
    • There was another problem with the upload process — it
    could time out.
    • Heroku requires calls to complete in under 30 seconds.
    The file had to be small enough to be uploaded twice (once
    from the browser to Heroku, then from Heroku to Dropbox)
    within 30 seconds, which was constraining for customers.


  12. What do I expect from a
    system?


  13. Clarity
    Consistency
    Communication


  14. Clarity
    • Clarity means you can understand *how* the system runs
    through its phases of execution.
    • Often, race conditions enter a system due to a lack of
    clarity within the execution process.


  15. Consistency
    • Consistency means the system performs the same way
    with the same inputs each and every time.
    • This is also known as being deterministic.


  16. Communication
    • Communication means the system is capable of
    broadcasting its current state in a way which can be
    readily understood by those who are interacting with and
    maintaining the system.


  17. So let’s redesign this


  18. Another view of where we started


  19. Phase 1: The Upload


  20. Phase 2: The Processing


  21. Full Redesign


  22. Inputs and Outputs
    • Inputs:
    • SOURCE_S3_FILE_PATH (required)
    • RESULT_S3_FILE_PATH (required)
    • REGISTRATION_RETURN_URL (required, added later)
    • REPORT_RETURN_URL (required)
    • TOOLS_COMMIT (default: master)
    • Outputs:
    • Immediate: Redirect Header, http://jenkins-master-2938.serverfarm.com/queue/29219
    • Eventual: Report JSON
    • Reporting Process: (string)
    • Status: (SUCCESS/FAILURE)
    • Reason: (string)
    • Data: (object)
    • Log: (object)
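
    The eventual report payload listed above can be sketched as JSON. This is an illustrative shape only, not the production schema; the field names follow the bullets on this slide, and the values are invented.

    ```shell
    # Hypothetical report.json matching the output fields listed above.
    # Field names and values are illustrative, not the production schema.
    cat > report.json <<'EOF'
    {
      "reporting_process": "file_inspection",
      "status": "SUCCESS",
      "reason": "",
      "data": { "volume_cm3": 12.4, "printable": true },
      "log": { "duration_s": 38 }
    }
    EOF

    # A consumer only needs the contract: check status before trusting data.
    python3 -c "import json; print(json.load(open('report.json'))['status'])"
    ```

    The point of the contract is that the origin application never inspects the job internals; it only parses this one document.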


  23. Why I like Jenkins
    • Lots of visibility into the execution of CLI scripts
    • Simple interface, deep links, and great interaction patterns
    • Very old tool, reasonably secure, extensive community
    • We were able to tie it into our Google auth to allow all
    employees easy access to log in and see why something failed.
    • It is incredibly flexible, with a lot of plugins which added
    functionality that made all of this possible VERY quickly.
    • GREAT scaling architecture


  24. Favorite Jenkins Plugins
    • Workflow — it enabled us to have our processing logic self-contained
    within the job. Without Groovy logic, we would have been
    forced to make our Platform responsible for handling the decision
    tree, requiring much greater complexity in the implementation.
    • Build Authorization Token Root — lets you enable token
    authentication (the token is defined in the job) to trigger builds via
    API calls even when Jenkins denies anonymous access.
    • Node and Label Parameter — lets you target a specific Jenkins slave,
    which we used to detect if a machine instance was failing.
    • Rebuild — makes it easy to re-run a job, pre-populating parameters.
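
    The Build Authorization Token Root plugin exposes builds at `/buildByToken/buildWithParameters`. Here is a sketch of triggering a parameterized job through it; the host, job name, token, and parameter values are all invented, and the `curl` call is only printed, not executed.

    ```shell
    # Dry-run sketch of triggering a parameterized build through the
    # Build Authorization Token Root plugin. All names here are invented.
    JENKINS_URL="http://jenkins-master.example.com"
    JOB="file_inspection"
    TOKEN="s3cret-job-token"   # the token configured on the job itself

    trigger_url="${JENKINS_URL}/buildByToken/buildWithParameters?job=${JOB}&token=${TOKEN}"

    # Print the call instead of making it (no live Jenkins here).
    echo curl -X POST "$trigger_url" \
      --data-urlencode "SOURCE_S3_FILE_PATH=s3://uploads/model.stl" \
      --data-urlencode "REPORT_RETURN_URL=https://app.example.com/parts/42/report"
    ```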


  25. How this scales


  26. How this scales


  27. How this scales


  28. What to call this architecture
    • Eventually I came to call this Promise-based Architecture.
    • The origin application had an environment variable with the
    base URL to send jobs to.
    • The job submission payload contained the URLs to which jobs
    should report expectations and results.
    • The job submission result payload contained a URL to query
    for job status.
    • Jobs returned their contractual callbacks at least 99.9%
    of the time.
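
    The contract on this slide can be sketched end to end. Every URL and function name here is hypothetical; the real calls were HTTP requests, stubbed out with `echo` so the flow stays visible.

    ```shell
    # Sketch of the promise contract: submit with callback URLs, get a
    # status URL back, and eventually deliver the report to the callback.
    REPORT_RETURN_URL="https://app.example.com/parts/42/report"   # invented

    submit_job() {
      # Real version: POST to the Jenkins base URL; the redirect header
      # pointed at the queue item, which became the "promise" to poll.
      echo "promise: http://jenkins-master.example.com/queue/29219"
    }

    deliver_report() {
      # Real version: the job POSTs report.json to REPORT_RETURN_URL.
      echo "report delivered to $REPORT_RETURN_URL"
    }

    submit_job
    deliver_report
    ```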


  29. Then new requirements
    came along…


  30. We added another processor


  31. So we went from…


  32. So that’s how we’re going to
    play it, huh?
    It’s time to get a little more deliberate in how this stuff is defined.


  33. How this works in Jenkins
    • Jenkins jobs were broken into two categories:
    • Orchestration (what order to do things in)
    • Written in Groovy, using Jenkins Workflow Plugin
    • Worker Process (what to do)
    • Written in Bash + Other Scripting Languages
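
    A minimal Bash skeleton of a Worker Process in this split: run a tool, then translate its exit code into the report contract. The tool invocation and file name are stand-ins, not the production scripts.

    ```shell
    # Worker Process sketch: do the work, then emit a report the
    # orchestration layer can evaluate. The "tool" here is a stand-in.
    run_worker() {
      local output
      if output=$(echo "inspected 1 model"); then   # real job: run the CLI tool
        status=SUCCESS; reason=""
      else
        status=FAILURE; reason="tool exited non-zero"
      fi
      cat > worker_report.json <<EOF
    { "reporting_process": "worker_demo", "status": "$status",
      "reason": "$reason", "data": {}, "log": { "output": "$output" } }
    EOF
    }

    run_worker
    echo "worker finished: $status"
    ```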


  34. How I think about
    Orchestration
    • Within a given orchestration job, I think it makes the
    most sense to view it as a series of consecutive Phases
    which may have multiple concurrent steps.
    • Remember, Orchestration jobs (in my world) don’t
    perform any direct work. Their responsibility is executing
    Worker Processes, evaluating their results, and
    determining the next steps to perform.
    • I felt this was a very SOLID view to take.


  35. How orchestration works


  36. How orchestration is run


  37. Phases run in order, steps
    run concurrently
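
    The rule on this slide maps directly onto shell job control: steps fork with `&`, and `wait` closes out the phase. The phase and step names here are invented for illustration.

    ```shell
    # Phases run in order; steps inside a phase run concurrently.
    run_phase() {
      local phase=$1; shift
      for step in "$@"; do
        ( echo "phase $phase: step $step done" ) &   # each step in parallel
      done
      wait   # the phase only completes when every step has finished
      echo "phase $phase: complete"
    }

    run_phase 1 fetch_file checkout_tools > phase1.log
    run_phase 2 inspect price > phase2.log
    cat phase1.log phase2.log
    ```

    Within a phase the step output order is nondeterministic, but the "complete" line is always last: that is the whole guarantee an orchestration layer needs.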


  38. But it’s not always sunny in
    San Francisco


  39. My Problems With Jenkins
    • I am *NOT* a GUI guy.
    • Driving a browser all day was driving me insane.
    • I am *REALLY* impatient.
    • I HATED debugging by re-running, especially when it could take
    5-10 minutes to get a result.
    • Writing production-worthy code in a browser textarea was
    beginning to get on my last nerve…
    • Just… yeah. Need I say more on that one?


  40. Step 1: Get away from the UI


  41. Getting away from the GUI
    • Jenkins offers the Jenkins CLI, which has probably 95%
    of the features you need on a daily basis for Jenkins
    server management.
    • Add/delete/update Jobs
    • Install Plugins
    • Restarting
    • Managing nodes
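
    These are real Jenkins CLI subcommands, but since there is no live master here, everything is wrapped in a dry-run `cli` function that prints the invocation instead of executing it; the host, job, and node names are invented.

    ```shell
    # Dry-run wrapper: print each Jenkins CLI invocation instead of
    # running it. jenkins-cli.jar normally comes from your own master,
    # at ${JENKINS_URL}/jnlpJars/jenkins-cli.jar.
    JENKINS_URL="http://jenkins-master.example.com"
    cli() { echo java -jar jenkins-cli.jar -s "$JENKINS_URL" "$@"; }

    cli create-job file_inspection      # job XML is piped in on stdin
    cli update-job file_inspection      # push a changed definition
    cli install-plugin rebuild -deploy  # install a plugin
    cli offline-node worker-03 -m "suspect instance"  # manage nodes
    cli safe-restart                    # restart once running builds drain
    ```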


  42. Getting away from the GUI
    • Jenkins Jobs are XML definitions with code payloads
    • I made templates of the XML definitions using doT.js
    • I put Groovy code in .groovy files, editing them with a
    Groovy-friendly editor. Thanks for IntelliJ, JetBrains!
    • I put Bash code in .bash files, editing them with a
    Bash-friendly editor. Another win for IntelliJ!
    • I put the tools I would call within jobs in a repository which
    I could easily check out within a Job context, to execute via
    Bash scripts.
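
    The real system rendered the job XML with doT.js templates; `sed` stands in here to show the shape of the idea, and the XML skeleton is trimmed far below what a real Jenkins job definition contains.

    ```shell
    # A freestyle-job XML skeleton with placeholders, rendered with sed.
    # Real definitions are much larger; doT.js did this job in practice.
    cat > job.tpl.xml <<'EOF'
    <project>
      <description>{{DESCRIPTION}}</description>
      <builders>
        <hudson.tasks.Shell><command>{{SCRIPT}}</command></hudson.tasks.Shell>
      </builders>
    </project>
    EOF

    sed -e 's|{{DESCRIPTION}}|Inspect an uploaded model|' \
        -e 's|{{SCRIPT}}|bash run_inspection.bash|' \
        job.tpl.xml > job.xml

    grep '<command>' job.xml
    ```

    The rendered job.xml is exactly what `create-job` or `update-job` would read on stdin.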


  43. Getting away from the GUI
    • Jobs got a generated “version” attached to them
    • {job name}-v{version number of job}-{latest git hash}
    • The version number of the job was calculated from the file and its
    dependencies (a “Job” was a JSON file)
    • The system would also replace references to the “common name” of a
    job with its “deployed name”, allowing us to write code that said
    “build(‘process_1’)” even though in production the job was actually
    called “process_1-v45-1kljh21lk1j2h12lkj1h1”.
    • I built a cleanup script which would query all of the jobs on a Jenkins
    Master and prune any job which had not been run in 7 days and was
    not the latest version.
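
    The naming scheme above can be sketched by hashing the job's JSON definition. The real system derived the version from the file plus its dependencies and appended the latest git hash; the counter value and definition file here are illustrative stand-ins.

    ```shell
    # Sketch of {job name}-v{version}-{hash}: hash the JSON definition so
    # any content change produces a new deployed name. Values are invented.
    cat > process_1.json <<'EOF'
    { "name": "process_1", "steps": ["fetch", "inspect", "report"] }
    EOF

    content_hash=$(sha256sum process_1.json | cut -c1-12)
    version=45   # stand-in for the computed version number
    deployed_name="process_1-v${version}-${content_hash}"
    echo "$deployed_name"
    ```

    Because the deployed name changes with the content, deployed jobs are effectively immutable, which is what makes the 7-day pruning safe.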


  44. Step 2: Improving the Speed & Safety of Development


  45. Improving Speed & Safety
    • Now that I had Groovy in .groovy files, things got really interesting very quickly.
    • I created a compile process for the Groovy that let me prepend and
    append code around the Groovy which defined the actual work. This
    allowed for common libraries shared between jobs, and made the
    source directory configurable, to enable production and test builds.
    • I created mocks for the Jenkins API that I used in the test builds so I could test
    that my Groovy was executable without having to run the job in Jenkins.
    • I created a configuration file to define what those mocks should respond
    with, and how I expected them to be called. I then made the mocks record
    their calls, and added an append process to produce a report, which could
    then be validated.
    • I made a Groovy test suite for Jenkins Workflows!
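
    The mock-record-verify loop described here translates to any language. A shell sketch of the pattern, with an invented `build` function standing in for the mocked Jenkins API:

    ```shell
    # Mock the "Jenkins API": build() records every call and fakes a
    # result, so orchestration logic can run with no Jenkins at all.
    : > calls.log
    build() { echo "build $1" >> calls.log; echo SUCCESS; }

    orchestrate() {               # the logic under test
      local result
      result=$(build process_1)
      [ "$result" = SUCCESS ] && build process_2 > /dev/null
    }
    orchestrate

    # The recorded calls are the "report" to validate against expectations.
    cat calls.log
    ```

    The test then asserts on calls.log: exactly the expected calls, in the expected order, without ever starting Jenkins.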


  46. Improving Speed & Safety
    • Since the callback URLs were parameterized, I could build
    an external functional test suite, using ngrok to expose my
    laptop for cloud callbacks, allowing me to easily validate
    that the jobs were performing as expected in a real
    execution environment.
    • The external functional test suite also doubled as a
    performance test suite by simply upping the number of
    configured files to process, and the rate at which they
    were submitted for processing.


  47. And then things expanded!


  48. Problems we ran into
    • Logs. Logs. logs. logs. logs. logs. logs. logs. logs. logs.
    • Finding the balance between enough logs for business
    purposes (2-3 days of logs) and too many logs for Jenkins
    to cope well (100-200 builds on the Master instances we
    were using).
    • Jobs. Jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs.
    • Sometimes during heavy dev periods there would be a LOT
    of jobs, which Jenkins also didn’t seem to really like. And it
    got confusing for others trying to figure out which jobs to
    look at for production information.


  49. The next round of work…


  50. Operationalizing Jenkins
    • Chef recipes to build Jenkins Master and Jenkins Slave workers, and Groovy to configure
    Jenkins without ever having to go into the UI.
    • Groovy handled things like enabling security, assigning SSH public keys to enable CLI
    usage, and setting other various global configuration options.
    • OpsWorks configuration to build a Jenkins farm of Master and Workers.
    • Chef recipes to consume OpsWorks environment config and automatically attach new Worker
    instances to the Master.
    • Chef recipes to deploy new suites of Jenkins jobs to the Master, including any required plugins.
    • CloudFormation templates to build OpsWorks environments, and ensure workers were
    distributed across multiple availability zones within a region.
    • Scripts to generate CloudFormation templates for a given scale (how many of each worker
    type, instance sizes, etc…).
    • Jenkins jobs to build deployable components for Chef recipes and Jenkins jobs.


  51. Operationalizing Jenkins
    • Orchestration job builder got a “no-op” Worker Process
    mode (with mocked report.json artifacts), so we could
    read the report.json and confirm the orchestration was
    operating as expected *quickly*.
    • Soon realized I had the levers to do something else
    pretty interesting…


  52. Bonus Content
    What if we could operationalize QA?


  53. Component Files
    • Each deployable module got a “Component File” which
    detailed several items:
    • What repository it lived in
    • Path in the repository to watch for changes
    • High level environment requirements (node, ruby, linux,
    etc…)
    • Build, Test, Stage, and Release steps.


  54. Definition Example for JavaScript UI
    • Repo: http://github.com/somesite/core-system.git
    • Path: frontend/src
    • Requirements:
    • node, version 8.12.0
    • Build:
    • npm install
    • npm run build
    • Test:
    • npm test
    • Stage:
    • s3cmd cp build/* s3://somesite-public/release/${GIT_HASH}
    • Release:
    • heroku config:set -a ${HEROKU_APP} JAVASCRIPT_URL=https://cdn.somesite.com/release/${GIT_HASH}


  55. Operationalizing QA
    • Added “Environments” to component files (for lack of a better
    location at the time).
    • Jenkins pre-scripts ensured RVM and/or NVM was installed for
    Ruby/JavaScript requirements. Using the Jenkins plugins for
    these didn’t seem to work consistently.
    • Auto-built orchestration and worker jobs for each component’s
    Build-Test, Build-Test-Stage, and Build-Test-Stage-Release flows.
    • Used the environment names in the “Environments” section to build
    per-environment -Stage and -Release jobs, and
    enabled variables to be set per environment which were
    populated in the build, test, stage, and deploy job parameters.


  56. Operationalizing QA
    • Once I had high-level jobs to Stage and Release to an environment, I
    added “dependencies” to the component file, to document
    dependencies between components — that the JavaScript Frontend
    relied on the Platform API and the Teams API, for example.
    • The generation system would then automatically ensure that
    dependent components were released to the targeted test
    environment, and that the test suites for all of the components which
    relied on the new version *also* passed during the integration test phase.
    • For example, when we updated the Teams API, it would ensure that
    the latest production Platform API and JavaScript Frontend were
    deployed, and that the JavaScript Frontend’s integration test suite
    passed, ensuring we hadn’t broken anything in the stack.


  57. Operationalizing QA
    • Initially the QA system began behaving unpredictably, releasing
    strange versions of code to production. It feels obvious in retrospect,
    but this was because Jenkins should not be responsible for releasing
    itself. Safe Restart doesn’t really apply when Jenkins is restarting itself.
    • We ended up with 3 QA environments - QA Production, QA Test, and
    QA QA. QA QA’s only job was to monitor the Chef recipes and Jenkins
    Jobs for QA, and release to QA Production when appropriate. QA Test
    was QA QA’s integration test environment.
    • Thanks to all of the previous work, QA QA was able to run a full
    functional test of QA Production in QA Test, ensuring everything
    worked as expected, including being able to spin up a test
    environment for each component in our application stack to ensure
    that the build and release processes were behaving appropriately.


  58. Recap
    • Created a high-visibility, asynchronous, parallelizable job
    queue using Jenkins, with a Promise-based architecture for
    managing inter-system dependencies.
    • Treated Jenkins as a PaaS (Platform as a Service) for
    deploying jobs, which made deploying safer, as jobs were
    treated as immutable.
    • Created unit, functional, and performance test suites to
    measure and ensure stability and reliability.
    • Created a repeatable way to build Jenkins environments,
    enabling rapid scalability and ensuring business continuity.


  59. Thank You!
    Please rate and review my talk!

    https://joind.in/talk/8ceeb

    The slides will be posted on joind.in shortly.
