Promises and Queues: Using Unlikely Suspects to Handle Asynchronous Parallel Processing

My last job was with a distributed manufacturing platform for turning digital ideas into physical products. It enabled customers to upload 3D models, have the models manufactured into physical goods, and have them delivered into their hands, all within 24 hours. Every time a digital model was uploaded, we processed the file with an array of tools that inspected the model and made determinations about its manufacturability, size, and, perhaps most importantly, price. One of the very first things I did there was completely overhaul this process, converting it from a mystical black box into a clear set of discrete processes with copious amounts of highly visible logging. There are many possible ways to do this; I chose Jenkins. Having supported that system for over a year, I might use different tools today, but Jenkins bought us a lot of time in the interim, and I left the company with quite a bit of runway before any changes would be required.

In this talk I will cover where we started, why I chose Jenkins, why it works so well for this use case, and how to use these same patterns to solve your asynchronous parallel processing problems, regardless of your platform. Our usage patterns showed us that managing jobs in Jenkins can be a very similar experience to managing code deployed to serverless solutions such as AWS Lambda. Let me show you how.


Jacob Mather

February 09, 2019

Transcript

  1. Promises and Queues Using Unlikely Suspects to Handle Async Parallel Processes
  2. About Me What I do: • Senior Web Developer at Dart Container • Spouter of Opposing Ideas • Community Evangelist Where I can be found: • Blog: http://jmather.com • Twitter: @thejmather • Medium: @jacobmather
  3. User Uploads Digital Model User Receives Physical Model User Selects Material
  4. Focus Area User Uploads Digital Model User Receives Physical Model User Selects Material
  5. User Uploads Digital Model User Selects Material System Inspects Model

  6. System Inspects Model File Upload File Inspection Part Order

  7. System Inspects Model File Upload File Inspection Part Order Cloud

    Cloud
  8. System Inspects Model File Upload File Inspection Part Order Cloud

    Cloud On Prem
  9. How did we get the uploaded file from the cloud to the computer under the stairs for processing, and then get the results back into the cloud for our application?
  10. None
  11. None
  12. Problems with this setup • No Clarity • Not really sure which thing does what, or when • No Visibility • Not really sure when things break • No Verifiability • Not really sure when things are done
  13. Other issues… • There was another problem with the upload process: it could time out. • Heroku requires calls to complete in under 30 seconds. The file had to be small enough to be uploaded twice (once from the browser to Heroku, then from Heroku to Dropbox) within that window, which was constraining for customers.
  14. What do I expect from a system?

  15. Clarity Consistency Communication

  16. Clarity • Clarity means you can understand *how* the system runs through its phases of execution. • Often, race conditions enter a system due to a lack of clarity within the execution process.
  17. Consistency • Consistency means the system performs the same way with the same inputs each and every time. • This is also known as being deterministic.
  18. Communication • Communication means the system is capable of broadcasting its current state in a way which can be readily understood by those who are interacting with and maintaining the system.
  19. So let’s redesign this

  20. Another view of where we started

  21. Phase 1: The Upload

  22. Phase 2: The Processing

  23. Full Redesign

  24. Inputs and Outputs • Inputs: • SOURCE_S3_FILE_PATH (required) • RESULT_S3_FILE_PATH (required) • REGISTRATION_RETURN_URL (required, added later) • REPORT_RETURN_URL (required) • TOOLS_COMMIT (default: master) • Outputs: • Immediate: Redirect Header, http://jenkins-master-2938.serverfarm.com/queue/29219 • Eventual: Report JSON • Reporting Process: (string) • Status: (SUCCESS/FAILURE) • Reason: (string) • Data: (object) • Log: (object)
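The eventual Report JSON above can be sketched as a small artifact the worker writes and later POSTs to REPORT_RETURN_URL. The field names follow the slide; the values and exact schema here are illustrative assumptions:

```shell
# Sketch of the eventual report.json a worker job might write as its
# build artifact. Field names (Status, Reason, Data, Log) come from the
# slide; the values and exact shape are assumptions.
cat > report.json <<'EOF'
{
  "status": "SUCCESS",
  "reason": "model passed all manufacturability checks",
  "data": { "volume_cm3": 12.4, "printable": true },
  "log": { "inspect_mesh": "ok", "price_model": "ok" }
}
EOF

# The job would then POST this back, e.g.:
#   curl -X POST -d @report.json "$REPORT_RETURN_URL"
grep '"status"' report.json
```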
  25. Why I like Jenkins • Lots of visibility into the execution of CLI scripts • Simple interface, deep links, and great interaction patterns • Very old tool, reasonably secure, extensive community • We were able to tie it into our Google auth to allow all employees easy access to log in and see why something failed. • It is incredibly flexible, with a lot of plugins which added functionality that made all of this possible VERY quickly. • GREAT scaling architecture
  26. Favorite Jenkins Plugins • Workflow — it enabled us to have our processing logic self-contained within the job. Without Groovy logic, we would have been forced to make our Platform responsible for handling the decision tree, requiring much greater complexity in the implementation. • Build Authorization Token Root — lets you enable token authentication (the token is defined in the job) to trigger builds via API calls even when Jenkins denies anonymous access. • Node and Label Parameter — lets you target a Jenkins slave specifically to detect if a machine instance is failing. • Rebuild — Makes it easy to re-run a job, pre-populating parameters.
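The Build Authorization Token Root plugin exposes a buildByToken endpoint, so an external system can trigger a parameterized job with nothing but a URL. A minimal sketch, with a placeholder host, job name, and token, printed as a dry run rather than actually called:

```shell
# Triggering a parameterized build via the Build Authorization Token
# Root plugin. Host, job name, and token are placeholders; the token is
# defined on the job itself.
JENKINS=http://jenkins-master.example.com
JOB=process_model
TOKEN=some-job-token

URL="$JENKINS/buildByToken/buildWithParameters?job=$JOB&token=$TOKEN"
URL="$URL&SOURCE_S3_FILE_PATH=uploads/model.stl"

# Dry run: print the call instead of making it.
# For real: curl -X POST "$URL"
echo "POST $URL"
```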
  27. How this scales

  28. How this scales

  29. How this scales

  30. What to call this architecture • Eventually I came to call this Promise-based Architecture. • The origin application has an environment variable with the base URL to send jobs to. • The job submission payload contained the URLs to which jobs should report expectations and results. • The job submission result payload contained a URL to query for job status. • Jobs returned their contractual callbacks with at least 99.9% reliability.
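The promise contract described above can be mocked end to end in a few lines of Bash. submit_job and deliver_report are hypothetical stand-ins that write local files instead of making HTTP calls, purely to show the shape of the exchange:

```shell
# Mock of the promise contract: submission returns a status URL
# immediately; the result arrives later at a callback URL chosen by the
# caller. Local file writes stand in for HTTP calls.
REPORT_RETURN_URL="callback.json"   # caller-chosen callback target

submit_job() {
  # Real version: POST to Jenkins; the redirect header points at the
  # queue item, which is the "promise" to poll.
  echo "http://jenkins-master.example.com/queue/29219"
}

deliver_report() {
  # Real version: the finished job POSTs its report to the callback.
  echo '{"status": "SUCCESS"}' > "$REPORT_RETURN_URL"
}

STATUS_URL=$(submit_job)   # immediate result: where to check on the job
echo "poll: $STATUS_URL"
deliver_report             # eventual result: the promise resolves
cat "$REPORT_RETURN_URL"
```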
  31. Then new requirements came along…

  32. We added another processor

  33. So we went from…

  34. To…

  35. So that’s how we’re going to play it, huh? It’s time to get a little more deliberate in how this stuff is defined.
  36. How this works in Jenkins • Jenkins jobs were broken into two categories: • Orchestration (what order to do things in) • Written in Groovy, using Jenkins Workflow Plugin • Worker Process (what to do) • Written in Bash + Other Scripting Languages
  37. How I think about Orchestration • Within a given orchestration job, I think it makes the most sense to view it as a series of consecutive Phases which may have multiple concurrent steps. • Remember, Orchestration jobs (in my world) don’t perform any direct work. Their responsibility is executing Worker Processes, evaluating their results, and determining the next steps to perform. • I felt this was a very SOLID view to take.
  38. How orchestration works

  39. How orchestration is run

  40. Phases run in order, steps run concurrently
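That phase/step model can be sketched in plain shell: steps within a phase run in the background, and the phase completes only when every step has finished. The sleep calls stand in for real worker processes:

```shell
# Phases run in order; steps within a phase run concurrently. The
# sleep commands stand in for real worker processes. (A production
# version would also collect each step's exit status.)
run_phase() {
  name=$1; shift
  echo "phase $name: start"
  for step in "$@"; do
    sh -c "$step" &     # launch every step in the background
  done
  wait                  # the phase ends only when all steps finish
  echo "phase $name: done"
}

run_phase inspect "sleep 0.1" "sleep 0.1" "sleep 0.1"
run_phase price   "sleep 0.1"
```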

  41. But it’s not always sunny in San Francisco

  42. My Problems With Jenkins • I am *NOT* a GUI guy. • Driving a browser all day was driving me insane. • I am *REALLY* impatient. • I HATED debug by re-running, especially when it could take 5-10 minutes to get a result. • Writing production worthy code in a <TEXTAREA> was beginning to get on my last nerve… • Just… yeah. Need I say more on that one?
  43. Step 1: Get away from the UI

  44. Getting away from the GUI • Jenkins offers the Jenkins CLI, which has probably 95% of the features you need on a daily basis for Jenkins server management. • Add/delete/update Jobs • Install Plugins • Restarting • Managing nodes
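Those operations map onto jenkins-cli.jar commands. A dry-run sketch follows: the run wrapper only prints each command, and the server URL is a placeholder:

```shell
# Day-to-day Jenkins management without the GUI, via jenkins-cli.jar.
# The run wrapper only prints each command here; drop it to execute for
# real against the placeholder server URL.
JENKINS=http://jenkins-master.example.com
run() { echo "+ $*"; }

run java -jar jenkins-cli.jar -s "$JENKINS" list-jobs
run java -jar jenkins-cli.jar -s "$JENKINS" create-job process_model   # config.xml on stdin
run java -jar jenkins-cli.jar -s "$JENKINS" install-plugin workflow-aggregator
run java -jar jenkins-cli.jar -s "$JENKINS" safe-restart
```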
  45. Getting away from the GUI • Jenkins Jobs are XML definitions with code payloads • I made templates of the XML definitions using DOT.js • I put Groovy code in .groovy files, editing them with a Groovy-friendly editor. Thanks for IntelliJ, JetBrains! • I put Bash code in .bash files, editing them with a Bash-friendly editor. Another win for IntelliJ! • I put the tools I would call within jobs in a repository which I could easily check out within a Job context, to execute via Bash scripts.
  46. Getting away from the GUI • Jobs got a generated “version” attached to them • {job name}-v{version number of job}-{latest git hash} • The version number of the job was calculated from the file and its dependencies (a “Job” was a JSON file) • The system would also replace references to the “common name” of a job with its “deployed name”, allowing us to write code that said “build(‘process_1’)” even though in production the job was actually called “process_1-v45-1kljh21lk1j2h12lkj1h1”. • I built a cleanup script which would query all of the jobs on a Jenkins Master and prune any job which had not been run in 7 days and was not the latest version.
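The versioning scheme above can be sketched in a few lines of shell. The file names are illustrative, and a content hash of the job definition stands in for the latest git hash:

```shell
# Sketch of the job-versioning scheme: derive a deployed name, then
# rewrite "common name" references in generated code. File names are
# illustrative; a content hash stands in for the latest git hash.
printf 'job definition v1\n' > process_1.json
HASH=$(md5sum process_1.json | cut -c1-12)
VERSION=45                         # computed from the file and its deps
DEPLOYED="process_1-v${VERSION}-${HASH}"
echo "deployed name: $DEPLOYED"

# Source says build('process_1'); production gets the deployed name:
printf "build('process_1')\n" > pipeline.groovy
sed "s/process_1/$DEPLOYED/" pipeline.groovy > deployed.groovy
cat deployed.groovy
```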
  47. Step 2: Improving the Speed & Safety of Development

  48. Improving Speed & Safety • Now that I had Groovy in .groovy files, things got really interesting very quickly. • I created a compile process for the Groovy, allowing me to prepend and append code around the Groovy which defined the actual work, allowing for common libraries shared between jobs, and made the source directory configurable, to enable production and test builds. • I created mocks for the Jenkins API that I used in the test builds so I could test that my Groovy was executable without having to run the job in Jenkins. • I created a configuration file to define what those mocks should respond with, and how I expected them to be called. I then made the mocks record their calls, and added an append process to produce a report, which could then be validated. • I made a Groovy test suite for Jenkins Workflows!
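The "compile" step for the Groovy amounts to concatenation: a shared prelude and an epilogue wrapped around each job's body, with the source directory configurable so test builds can swap in mocks. A minimal sketch with made-up file names:

```shell
# Minimal version of the Groovy "compile" step: concatenate a shared
# prelude and epilogue around each job body. SRC_DIR is configurable so
# test builds can point at a directory of mocks. File names are made up.
SRC_DIR=${SRC_DIR:-src}
mkdir -p "$SRC_DIR" build
echo 'def log(m) { println m }'     > "$SRC_DIR/prelude.groovy"
echo "log('the actual job work')"   > "$SRC_DIR/job_body.groovy"
echo 'log("emit the call report")'  > "$SRC_DIR/epilogue.groovy"

cat "$SRC_DIR/prelude.groovy" \
    "$SRC_DIR/job_body.groovy" \
    "$SRC_DIR/epilogue.groovy" > build/job.groovy
cat build/job.groovy
```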
  49. Improving Speed & Safety • Since I had made the callback URLs parameterized, it enabled me to build an external functional test suite, using ngrok to expose my laptop for cloud callbacks, allowing me to easily validate that the jobs were performing in a real execution environment as expected. • The external functional test suite was also capable of performing as a performance test suite by simply upping the number of configured files to process, and the rate at which they were submitted for processing.
  50. And then things expanded!

  51. From

  52. To

  53. And then to

  54. Problems we ran into • Logs. Logs. logs. logs. logs. logs. logs. logs. logs. logs. • Finding the balance between enough logs for business purposes (2-3 days of logs) and too many logs for Jenkins to cope well (100-200 builds on the Master instances we were using). • Jobs. Jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. • Sometimes during heavy dev periods there would be a LOT of jobs, which Jenkins also didn’t seem to really like. And it got confusing for others trying to figure out which jobs to look at for production information.
  55. The next round of work…

  56. Operationalizing Jenkins • Chef recipes to build Jenkins Master and Jenkins Slave workers, and Groovy to configure Jenkins without ever having to go into the UI. • Groovy handled things like enabling security, assigning SSH public keys to enable CLI usage, and setting other various global configuration options. • OpsWorks configuration to build Jenkins farm of Master and Workers. • Chef recipes to consume OpsWorks environment config and automatically attach new Worker instances to the Master. • Chef recipes to deploy new suites of Jenkins jobs to the Master, including any required plugins. • CloudFormation templates to build OpsWorks environments, and ensure workers were distributed across multiple availability zones within a region. • Scripts to generate CloudFormation templates for a given scale (how many of each worker type, instance sizes, etc…). • Jenkins jobs to build deployable components for Chef recipes and Jenkins jobs.
  57. Operationalizing Jenkins • The Orchestration job builder got a “no-op” Worker Process mode (with mocked report.json artifacts), so we could read the report.json and confirm the orchestration was operating as expected *quickly*. • I soon realized I had the levers to do something else pretty interesting…
  58. Bonus Content What if we could operationalize QA?

  59. Component Files • Each deployable module got a “Component File” which detailed several items: • What repository it lived in • Path in the repository to watch for changes • High level environment requirements (node, ruby, linux, etc…) • Build, Test, Stage, and Release steps.
  60. Definition Example for JavaScript UI • Repo: http://github.com/somesite/core-system.git • Path: frontend/src • Requirements: • node, version 8.12.0 • Build: • npm install • npm run build • Test: • npm test • Stage: • s3cmd cp build/* s3://somesite-public/release/${GIT_HASH} • Release: • heroku update:config -a {HEROKU_APP} JAVASCRIPT_URL https://cdn.somesite.com/release/${GIT_HASH}
  61. Operationalizing QA • Added “Environments” to component files (for lack of a better location at the time). • Jenkins pre-scripts ensured RVM and/or NVM was installed for Ruby/JavaScript requirements. Using the Jenkins plugins for these didn’t seem to work consistently. • Auto-built orchestration and worker jobs for each component’s Build-Test, Build-Test-Stage, and Build-Test-Stage-Release • Used environment names in the “environment” section to build <Environment>-Stage and <Environment>-Release jobs, and enabled variables to be set per-environment which were populated in the build, test, stage, and deploy job parameters.
  62. Operationalizing QA • Once I had high-level jobs to Stage and Release to an environment, I added “dependencies” to the component file, to document dependencies between components — that the JavaScript Frontend relied on the Platform API and the Teams API, for example. • The generation system would then automatically ensure that dependent components were released to the test environment targeted, and that the test suites for all of the components which relied on the new version *also* passed during the integration test phase. • For example, when we updated the Teams API, it would ensure that the latest production Platform API and JavaScript Frontend were deployed, and that the JavaScript Frontend’s integration test suite passed, ensuring we hadn’t broken anything in the stack.
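Working out which dependent components need to be staged and retested is a topological sort over the dependency edges declared in the component files. coreutils' tsort can sketch it, using hypothetical names for the components from the slide:

```shell
# Release order from component-file dependencies. Each line is
# "dependency dependent"; tsort prints a valid build/release order,
# so js_frontend comes out after both of the APIs it depends on.
cat > deps.txt <<'EOF'
platform_api js_frontend
teams_api js_frontend
EOF
tsort deps.txt
```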
  63. Operationalizing QA • Initially the QA system began behaving unpredictably, releasing strange versions of code to production. It feels obvious in retrospect, but this was because Jenkins should not be responsible for releasing itself. Safe Restart doesn’t really apply when Jenkins is restarting itself. • We ended up with 3 QA environments: QA Production, QA Test, and QA QA. QA QA’s only job was to monitor the Chef recipes and Jenkins Jobs for QA, and release to QA Production when appropriate. QA Test was QA QA’s integration test environment. • Thanks to all of the previous work, QA QA was able to run a full functional test of QA Production in QA Test, ensuring everything worked as expected, including being able to spin up a test environment for each component in our application stack to ensure that the build and release processes were behaving appropriately.
  64. Recap • Created a high-visibility asynchronous parallelizable job queue in Jenkins, using a Promise-based architecture for managing inter-system dependencies. • Treated Jenkins as a PAAS (Platform as a Service) for deploying jobs, which made deploying safer, as jobs were treated as immutable. • Created unit, functional, and performance test suites to measure and ensure stability and reliability. • Created a repeatable way to build Jenkins environments, enabling rapid scalability and ensuring business continuity.
  65. Thank You! Please rate and review my talk! https://joind.in/talk/8ceeb

The slides will be posted on joind.in shortly.