Promises and Queues: Using Unlikely Suspects to Handle Asynchronous Parallel Processing

My last job was with a distributed manufacturing platform for turning digital ideas into physical products. It enabled customers to upload 3D models, have the models manufactured into physical goods, and have them delivered into their hands, all within 24 hours. Every time a digital model was uploaded, we processed the file with an array of tools that inspected the model and made determinations about its manufacturability, size, and, perhaps most importantly, price. One of the very first things I did there was completely overhaul this process, converting it from a mystical black box into a clear set of discrete processes with copious amounts of highly visible logging. There are many possible ways to do this; I chose Jenkins. Having supported that system for over a year, I might use different tools today, but Jenkins bought us a lot of time in the interim, and I left the company with quite a bit of runway before any changes would be required.

In this talk I will cover where we started, why I chose Jenkins, why it works so well for this use case, and how to use these same patterns to solve your asynchronous parallel processing problems, regardless of your platform. Our usage patterns showed us that managing jobs in Jenkins can be a very similar experience to managing code deployed to serverless solutions such as AWS Lambda. Let me show you how.


Jacob Mather

February 09, 2019

Transcript

  1. Promises and Queues Using Unlikely Suspects to Handle Async Parallel Processes
  2. About Me What I do: • Senior Web Developer at Dart Container • Spouter of Opposing Ideas • Community Evangelist Where I can be found: • Blog: http://jmather.com • Twitter: @thejmather • Medium: @jacobmather
  3. User Uploads Digital Model User Receives Physical Model User Selects Material
  4. Focus Area User Uploads Digital Model User Receives Physical Model User Selects Material
  5. User Uploads Digital Model User Selects Material System Inspects Model

  6. System Inspects Model File Upload File Inspection Part Order

  7. System Inspects Model File Upload File Inspection Part Order Cloud

    Cloud
  8. System Inspects Model File Upload File Inspection Part Order Cloud

    Cloud On Prem
  9. How did we get the uploaded file from the cloud to the computer under the stairs for processing, and then get the results back into the cloud for our application?
  10. None
  11. None
  12. Problems with this setup • No Clarity • Not really sure which thing does what, or when • No Visibility • Not really sure when things break • No Verifiability • Not really sure when things are done
  13. Other issues… • There was another problem with the upload process: it could time out. • Heroku requires calls to complete in under 30 seconds. The file had to be small enough to be uploaded twice (once from the browser to Heroku, then from Heroku to Dropbox) within that window, which was constraining for customers.
  14. What do I expect from a system?

  15. Clarity Consistency Communication

  16. Clarity • Clarity means you can understand *how* the system runs through its phases of execution. • Often, race conditions enter a system due to a lack of clarity within the execution process.
  17. Consistency • Consistency means the system performs the same way with the same inputs each and every time. • This is also known as being deterministic.
  18. Communication • Communication means the system is capable of broadcasting its current state in a way which can be readily understood by those who are interacting with and maintaining the system.
  19. So let’s redesign this

  20. Another view of where we started

  21. Phase 1: The Upload

  22. Phase 2: The Processing

  23. Full Redesign

  24. Inputs and Outputs • Inputs: • SOURCE_S3_FILE_PATH (required) • RESULT_S3_FILE_PATH (required) • REGISTRATION_RETURN_URL (required, added later) • REPORT_RETURN_URL (required) • TOOLS_COMMIT (default: master) • Outputs: • Immediate: Redirect Header, http://jenkins-master-2938.serverfarm.com/queue/29219 • Eventual: Report JSON • Reporting Process: (string) • Status: (SUCCESS/FAILURE) • Reason: (string) • Data: (object) • Log: (object)
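The eventual Report JSON above can be sketched as a small artifact the worker writes and later POSTs to REPORT_RETURN_URL. The field names follow the slide; the values and exact schema here are illustrative assumptions:

```shell
# Sketch of the eventual report.json a worker job might write as its
# build artifact. Field names (Status, Reason, Data, Log) come from the
# slide; the values and exact shape are assumptions.
cat > report.json <<'EOF'
{
  "status": "SUCCESS",
  "reason": "model passed all manufacturability checks",
  "data": { "volume_cm3": 12.4, "printable": true },
  "log": { "inspect_mesh": "ok", "price_model": "ok" }
}
EOF

# The job would then POST this back, e.g.:
#   curl -X POST -d @report.json "$REPORT_RETURN_URL"
grep '"status"' report.json
```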
  25. Why I like Jenkins • Lots of visibility into the execution of CLI scripts • Simple interface, deep links, and great interaction patterns • Very old tool, reasonably secure, extensive community • We were able to tie it into our Google auth to allow all employees easy access to log in and see why something failed. • It is incredibly flexible, with a lot of plugins which added functionality that made all of this possible VERY quickly. • GREAT scaling architecture
  26. Favorite Jenkins Plugins • Workflow — it enabled us to have our processing logic self-contained within the job. Without Groovy logic, we would have been forced to make our Platform responsible for handling the decision tree, requiring much greater complexity in the implementation. • Build Authorization Token Root — lets you enable token authentication (the token is defined in the job) to trigger builds via API calls even when Jenkins denies anonymous access. • Node and Label Parameter — lets you target a Jenkins slave specifically to detect if a machine instance is failing. • Rebuild — Makes it easy to re-run a job, pre-populating parameters.
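The Build Authorization Token Root plugin exposes a buildByToken endpoint, so an external system can trigger a parameterized job with nothing but a URL. A minimal sketch, with a placeholder host, job name, and token, printed as a dry run rather than actually called:

```shell
# Triggering a parameterized build via the Build Authorization Token
# Root plugin. Host, job name, and token are placeholders; the token is
# defined on the job itself.
JENKINS=http://jenkins-master.example.com
JOB=process_model
TOKEN=some-job-token

URL="$JENKINS/buildByToken/buildWithParameters?job=$JOB&token=$TOKEN"
URL="$URL&SOURCE_S3_FILE_PATH=uploads/model.stl"

# Dry run: print the call instead of making it.
# For real: curl -X POST "$URL"
echo "POST $URL"
```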
  27. How this scales

  28. How this scales

  29. How this scales

  30. What to call this architecture • Eventually I came to call this Promise-based Architecture. • The origin application has an environment variable with the base URL to send jobs to. • The job submission payload contained the URLs to which jobs should report expectations and results. • The job submission result payload contained a URL to query for job status. • Jobs returned their contractual callbacks with at least 99.9% reliability.
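The promise contract described above can be mocked end to end in a few lines of Bash. submit_job and deliver_report are hypothetical stand-ins that write local files instead of making HTTP calls, purely to show the shape of the exchange:

```shell
# Mock of the promise contract: submission returns a status URL
# immediately; the result arrives later at a callback URL chosen by the
# caller. Local file writes stand in for HTTP calls.
REPORT_RETURN_URL="callback.json"   # caller-chosen callback target

submit_job() {
  # Real version: POST to Jenkins; the redirect header points at the
  # queue item, which is the "promise" to poll.
  echo "http://jenkins-master.example.com/queue/29219"
}

deliver_report() {
  # Real version: the finished job POSTs its report to the callback.
  echo '{"status": "SUCCESS"}' > "$REPORT_RETURN_URL"
}

STATUS_URL=$(submit_job)   # immediate result: where to check on the job
echo "poll: $STATUS_URL"
deliver_report             # eventual result: the promise resolves
cat "$REPORT_RETURN_URL"
```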
  31. Then new requirements came along…

  32. We added another processor

  33. So we went from…

  34. To…

  35. So that’s how we’re going to play it, huh? It’s time to get a little more deliberate in how this stuff is defined.
  36. How this works in Jenkins • Jenkins jobs were broken into two categories: • Orchestration (what order to do things in) • Written in Groovy, using Jenkins Workflow Plugin • Worker Process (what to do) • Written in Bash + Other Scripting Languages
  37. How I think about Orchestration • Within a given orchestration job, I think it makes the most sense to view it as a series of consecutive Phases which may have multiple concurrent steps. • Remember, Orchestration jobs (in my world) don’t perform any direct work. Their responsibility is executing Worker Processes, evaluating their results, and determining the next steps to perform. • I felt this was a very SOLID view to take.
  38. How orchestration works

  39. How orchestration is run

  40. Phases run in order, steps run concurrently
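That phase/step model can be sketched in plain shell: steps within a phase run in the background, and the phase completes only when every step has finished. The sleep calls stand in for real worker processes:

```shell
# Phases run in order; steps within a phase run concurrently. The
# sleep commands stand in for real worker processes. (A production
# version would also collect each step's exit status.)
run_phase() {
  name=$1; shift
  echo "phase $name: start"
  for step in "$@"; do
    sh -c "$step" &     # launch every step in the background
  done
  wait                  # the phase ends only when all steps finish
  echo "phase $name: done"
}

run_phase inspect "sleep 0.1" "sleep 0.1" "sleep 0.1"
run_phase price   "sleep 0.1"
```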

  41. But it’s not always sunny in San Francisco

  42. My Problems With Jenkins • I am *NOT* a GUI guy. • Driving a browser all day was driving me insane. • I am *REALLY* impatient. • I HATED debug by re-running, especially when it could take 5-10 minutes to get a result. • Writing production worthy code in a <TEXTAREA> was beginning to get on my last nerve… • Just… yeah. Need I say more on that one?
  43. Step 1: Get away from the UI

  44. Getting away from the GUI • Jenkins offers the Jenkins CLI, which has probably 95% of the features you need on a daily basis for Jenkins server management. • Add/delete/update Jobs • Install Plugins • Restarting • Managing nodes
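Those operations map onto jenkins-cli.jar commands. A dry-run sketch follows: the run wrapper only prints each command, and the server URL is a placeholder:

```shell
# Day-to-day Jenkins management without the GUI, via jenkins-cli.jar.
# The run wrapper only prints each command here; drop it to execute for
# real against the placeholder server URL.
JENKINS=http://jenkins-master.example.com
run() { echo "+ $*"; }

run java -jar jenkins-cli.jar -s "$JENKINS" list-jobs
run java -jar jenkins-cli.jar -s "$JENKINS" create-job process_model   # config.xml on stdin
run java -jar jenkins-cli.jar -s "$JENKINS" install-plugin workflow-aggregator
run java -jar jenkins-cli.jar -s "$JENKINS" safe-restart
```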
  45. Getting away from the GUI • Jenkins Jobs are XML definitions with code payloads • I made templates of the XML definitions using DOT.js • I put Groovy code in .groovy files, editing them with a Groovy-friendly editor. Thanks for IntelliJ, JetBrains! • I put Bash code in .bash files, editing them with a Bash-friendly editor. Another win for IntelliJ! • I put the tools I would call within jobs in a repository which I could easily check out within a Job context, to execute via Bash scripts.
  46. Getting away from the GUI • Jobs got a generated “version” attached to them • {job name}-v{version number of job}-{latest git hash} • The version number of the job was calculated from the file and its dependencies (a “Job” was a JSON file) • The system would also replace references to the “common name” of a job with its “deployed name”, allowing us to write code that said “build(‘process_1’)” even though in production the job was actually called “process_1-v45-1kljh21lk1j2h12lkj1h1”. • I built a cleanup script which would query all of the jobs on a Jenkins Master and prune any job which had not been run in 7 days and was not the latest version.
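The versioning scheme above can be sketched in a few lines of shell. The file names are illustrative, and a content hash of the job definition stands in for the latest git hash:

```shell
# Sketch of the job-versioning scheme: derive a deployed name, then
# rewrite "common name" references in generated code. File names are
# illustrative; a content hash stands in for the latest git hash.
printf 'job definition v1\n' > process_1.json
HASH=$(md5sum process_1.json | cut -c1-12)
VERSION=45                         # computed from the file and its deps
DEPLOYED="process_1-v${VERSION}-${HASH}"
echo "deployed name: $DEPLOYED"

# Source says build('process_1'); production gets the deployed name:
printf "build('process_1')\n" > pipeline.groovy
sed "s/process_1/$DEPLOYED/" pipeline.groovy > deployed.groovy
cat deployed.groovy
```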
  47. Step 2: Improving the Speed & Safety of Development

  48. Improving Speed & Safety • Now that I had Groovy in .groovy files, things got really interesting very quickly. • I created a compile process for the Groovy, allowing me to prepend and append code around the Groovy which defined the actual work, allowing for common libraries shared between jobs, and made the source directory configurable, to enable production and test builds. • I created mocks for the Jenkins API that I used in the test builds so I could test that my Groovy was executable without having to run the job in Jenkins. • I created a configuration file to define what those mocks should respond with, and how I expected them to be called. I then made the mocks record their calls, and added an append process to produce a report, which could then be validated. • I made a Groovy test suite for Jenkins Workflows!
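The "compile" step for the Groovy amounts to concatenation: a shared prelude and an epilogue wrapped around each job's body, with the source directory configurable so test builds can swap in mocks. A minimal sketch with made-up file names:

```shell
# Minimal version of the Groovy "compile" step: concatenate a shared
# prelude and epilogue around each job body. SRC_DIR is configurable so
# test builds can point at a directory of mocks. File names are made up.
SRC_DIR=${SRC_DIR:-src}
mkdir -p "$SRC_DIR" build
echo 'def log(m) { println m }'     > "$SRC_DIR/prelude.groovy"
echo "log('the actual job work')"   > "$SRC_DIR/job_body.groovy"
echo 'log("emit the call report")'  > "$SRC_DIR/epilogue.groovy"

cat "$SRC_DIR/prelude.groovy" \
    "$SRC_DIR/job_body.groovy" \
    "$SRC_DIR/epilogue.groovy" > build/job.groovy
cat build/job.groovy
```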
  49. Improving Speed & Safety • Since I had made the callback URLs parameterized, it enabled me to build an external functional test suite, using ngrok to expose my laptop for cloud callbacks, allowing me to easily validate that the jobs were performing in a real execution environment as expected. • The external functional test suite was also capable of performing as a performance test suite by simply upping the number of configured files to process, and the rate at which they were submitted for processing.
  50. And then things expanded!

  51. From

  52. To

  53. And then to

  54. Problems we ran into • Logs. Logs. logs. logs. logs. logs. logs. logs. logs. logs. • Finding the balance between enough logs for business purposes (2-3 days of logs) and too many logs for Jenkins to cope well (100-200 builds on the Master instances we were using). • Jobs. Jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. • Sometimes during heavy dev periods there would be a LOT of jobs, which Jenkins also didn’t seem to really like. And it got confusing for others trying to figure out which jobs to look at for production information.
  55. The next round of work…

  56. Operationalizing Jenkins • Chef recipes to build Jenkins Master and Jenkins Slave workers, and Groovy to configure Jenkins without ever having to go into the UI. • Groovy handled things like enabling security, assigning SSH public keys to enable CLI usage, and setting other various global configuration options. • OpsWorks configuration to build Jenkins farm of Master and Workers. • Chef recipes to consume OpsWorks environment config and automatically attach new Worker instances to the Master. • Chef recipes to deploy new suites of Jenkins jobs to the Master, including any required plugins. • CloudFormation templates to build OpsWorks environments, and ensure workers were distributed across multiple availability zones within a region. • Scripts to generate CloudFormation templates for a given scale (how many of each worker type, instance sizes, etc…). • Jenkins jobs to build deployable components for Chef recipes and Jenkins jobs.
  57. Operationalizing Jenkins • The Orchestration job builder got a “no-op” Worker Process mode (with mocked report.json artifacts), so we could read the report.json and confirm the orchestration was operating as expected *quickly*. • I soon realized I had the levers to do something else pretty interesting…
  58. Bonus Content What if we could operationalize QA?

  59. Component Files • Each deployable module got a “Component File” which detailed several items: • What repository it lived in • Path in the repository to watch for changes • High level environment requirements (node, ruby, linux, etc…) • Build, Test, Stage, and Release steps.
  60. Definition Example for JavaScript UI • Repo: http://github.com/somesite/core-system.git • Path: frontend/src • Requirements: • node, version 8.12.0 • Build: • npm install • npm run build • Test: • npm test • Stage: • s3cmd cp build/* s3://somesite-public/release/${GIT_HASH} • Release: • heroku update:config -a {HEROKU_APP} JAVASCRIPT_URL https://cdn.somesite.com/release/${GIT_HASH}
  61. Operationalizing QA • Added “Environments” to component files (for lack of a better location at the time). • Jenkins pre-scripts ensured RVM and/or NVM was installed for Ruby/JavaScript requirements. Using the Jenkins plugins for these didn’t seem to work consistently. • Auto-built orchestration and worker jobs for each component’s Build-Test, Build-Test-Stage, and Build-Test-Stage-Release • Used environment names in the “environment” section to build <Environment>-Stage and <Environment>-Release jobs, and enabled variables to be set per-environment which were populated in the build, test, stage, and deploy job parameters.
  62. Operationalizing QA • Once I had high-level jobs to Stage and Release to an environment, I added “dependencies” to the component file, to document dependencies between components — that the JavaScript Frontend relied on the Platform API and the Teams API, for example. • The generation system would then automatically ensure that dependent components were released to the test environment targeted, and that the test suites for all of the components which relied on the new version *also* passed during the integration test phase. • For example, when we updated the Teams API, it would ensure that the latest production Platform API and JavaScript Frontend were deployed, and that the JavaScript Frontend’s integration test suite passed, ensuring we hadn’t broken anything in the stack.
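Working out which dependent components need to be staged and retested is a topological sort over the dependency edges declared in the component files. coreutils' tsort can sketch it, using hypothetical names for the components from the slide:

```shell
# Release order from component-file dependencies. Each line is
# "dependency dependent"; tsort prints a valid build/release order,
# so js_frontend comes out after both of the APIs it depends on.
cat > deps.txt <<'EOF'
platform_api js_frontend
teams_api js_frontend
EOF
tsort deps.txt
```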
  63. Operationalizing QA • Initially the QA system began behaving unpredictably, releasing strange versions of code to production. It feels obvious in retrospect, but this was because Jenkins should not be responsible for releasing itself. Safe Restart doesn’t really apply when Jenkins is restarting itself. • We ended up with 3 QA environments: QA Production, QA Test, and QA QA. QA QA’s only job was to monitor the Chef recipes and Jenkins Jobs for QA, and release to QA Production when appropriate. QA Test was QA QA’s integration test environment. • Thanks to all of the previous work, QA QA was able to run a full functional test of QA Production in QA Test, ensuring everything worked as expected, including being able to spin up a test environment for each component in our application stack to ensure that the build and release processes were behaving appropriately.
  64. Recap • Created a high-visibility asynchronous parallelizable job queue in Jenkins, using a Promise-based architecture for managing inter-system dependencies. • Treated Jenkins as a PAAS (Platform as a Service) for deploying jobs, which made deploying safer, as jobs were treated as immutable. • Created unit, functional, and performance test suites to measure and ensure stability and reliability. • Created a repeatable way to build Jenkins environments, enabling rapid scalability and ensuring business continuity.
  65. Thank You! Please rate and review my talk! https://joind.in/talk/8ceeb

The slides will be posted on joind.in shortly.