Slide 1

Slide 1 text

Promises and Queues: Using Unlikely Suspects to Handle Async Parallel Processes

Slide 2

Slide 2 text

About Me
What I do:
• Senior Web Developer at Dart Container
• Spouter of Opposing Ideas
• Community Evangelist
Where I can be found:
• Blog: http://jmather.com
• Twitter: @thejmather
• Medium: @jacobmather

Slide 3

Slide 3 text

User Uploads Digital Model
User Receives Physical Model
User Selects Material

Slide 4

Slide 4 text

Focus Area
User Uploads Digital Model
User Receives Physical Model
User Selects Material

Slide 5

Slide 5 text

User Uploads Digital Model
User Selects Material
System Inspects Model

Slide 6

Slide 6 text

System Inspects Model
File Upload → File Inspection → Part Order

Slide 7

Slide 7 text

System Inspects Model
File Upload (Cloud) → File Inspection → Part Order (Cloud)

Slide 8

Slide 8 text

System Inspects Model
File Upload (Cloud) → File Inspection (On Prem) → Part Order (Cloud)

Slide 9

Slide 9 text

How did we get the uploaded file from the cloud to the computer under the stairs for processing, and then get the results back into the cloud for our application?

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Problems with this setup
• No Clarity
  • Not really sure which thing does what, or when
• No Visibility
  • Not really sure when things break
• No Verifiability
  • Not really sure when things are done

Slide 13

Slide 13 text

Other issues…
• There was another problem with the upload process — it could time out.
• Heroku requires calls to complete in under 30 seconds. The file had to be small enough to be uploaded twice (once from the browser to Heroku, then from Heroku to Dropbox) within that window, which was constraining for customers.

Slide 14

Slide 14 text

What do I expect from a system?

Slide 15

Slide 15 text

Clarity Consistency Communication

Slide 16

Slide 16 text

Clarity
• Clarity means you can understand *how* the system runs through its phases of execution.
• Often, race conditions enter a system due to a lack of clarity within the execution process.

Slide 17

Slide 17 text

Consistency
• Consistency means the system performs the same way with the same inputs each and every time.
• This is also known as being deterministic.

Slide 18

Slide 18 text

Communication
• Communication means the system is capable of broadcasting its current state in a way that can be readily understood by those who are interacting with and maintaining the system.

Slide 19

Slide 19 text

So let’s redesign this

Slide 20

Slide 20 text

Another view of where we started

Slide 21

Slide 21 text

Phase 1: The Upload

Slide 22

Slide 22 text

Phase 2: The Processing

Slide 23

Slide 23 text

Full Redesign

Slide 24

Slide 24 text

Inputs and Outputs
• Inputs:
  • SOURCE_S3_FILE_PATH (required)
  • RESULT_S3_FILE_PATH (required)
  • REGISTRATION_RETURN_URL (required, added later)
  • REPORT_RETURN_URL (required)
  • TOOLS_COMMIT (default: master)
• Outputs:
  • Immediate: Redirect Header, http://jenkins-master-2938.serverfarm.com/queue/29219
  • Eventual: Report JSON
    • Reporting Process: (string)
    • Status: (SUCCESS/FAILURE)
    • Reason: (string)
    • Data: (object)
    • Log: (object)
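As a sketch of how those Inputs would be supplied as Jenkins build parameters: the job name ("inspect-model") and all the parameter values below are hypothetical, and only the Jenkins host appears in the slides. The script builds and prints the trigger call rather than executing it.

```shell
#!/bin/sh
# Sketch: submit a processing job with the documented input parameters.
# Job name and values are hypothetical illustrations.
JENKINS_URL="http://jenkins-master-2938.serverfarm.com"
JOB="inspect-model"

SOURCE_S3_FILE_PATH="s3://uploads/model-123.stl"
RESULT_S3_FILE_PATH="s3://results/model-123.json"
REPORT_RETURN_URL="https://app.example.com/jobs/123/report"
TOOLS_COMMIT="master"

# Build the parameterized trigger URL; a real submission would POST it
# with curl and read the Redirect Header (the queue URL) from the response.
QUERY="SOURCE_S3_FILE_PATH=${SOURCE_S3_FILE_PATH}"
QUERY="${QUERY}&RESULT_S3_FILE_PATH=${RESULT_S3_FILE_PATH}"
QUERY="${QUERY}&REPORT_RETURN_URL=${REPORT_RETURN_URL}"
QUERY="${QUERY}&TOOLS_COMMIT=${TOOLS_COMMIT}"
echo "POST ${JENKINS_URL}/job/${JOB}/buildWithParameters?${QUERY}"
```

The immediate output (the queue URL in the Redirect Header) is what lets the caller hold a "promise" it can poll later.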

Slide 25

Slide 25 text

Why I like Jenkins
• Lots of visibility into the execution of CLI scripts
• Simple interface, deep links, and great interaction patterns
• Very old tool, reasonably secure, extensive community
  • We were able to tie it into our Google auth to allow all employees easy access to log in and see why something failed.
• It is incredibly flexible, with a lot of plugins which added functionality that made all of this possible VERY quickly.
• GREAT scaling architecture

Slide 26

Slide 26 text

Favorite Jenkins Plugins
• Workflow — it enabled us to have our processing logic self-contained within the job. Without Groovy logic, we would have been forced to make our Platform responsible for handling the decision tree, requiring much greater complexity in the implementation.
• Build Authorization Token Root — lets you enable token authentication (the token is defined in the job) to trigger builds via API calls even when Jenkins denies anonymous access.
• Node and Label Parameter — lets you target a specific Jenkins slave, to detect if a machine instance is failing.
• Rebuild — makes it easy to re-run a job, pre-populating parameters.
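To make the Build Authorization Token Root flow concrete: the plugin exposes a /buildByToken/ root that accepts the job-defined token. The host, job name, and token below are hypothetical; the URL shape is the plugin's.

```shell
#!/bin/sh
# Sketch: trigger a build through the Build Authorization Token Root
# plugin, which works even when anonymous access is otherwise denied.
# Host, job name, and token are all hypothetical.
JENKINS_URL="http://jenkins-master-2938.serverfarm.com"
JOB="inspect-model"
TOKEN="job-defined-token"

TRIGGER_URL="${JENKINS_URL}/buildByToken/buildWithParameters?job=${JOB}&token=${TOKEN}"
# A real caller would: curl -X POST "${TRIGGER_URL}&TOOLS_COMMIT=master"
echo "${TRIGGER_URL}"
```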

Slide 27

Slide 27 text

How this scales

Slide 28

Slide 28 text

How this scales

Slide 29

Slide 29 text

How this scales

Slide 30

Slide 30 text

What to call this architecture
• Eventually I came to call this Promise-based Architecture.
• The origin application has an environment variable with the base URL to send jobs to.
• The job submission payload contains the URLs to which expectations and results should be reported.
• The job submission result payload contains a URL to query for job status.
• Jobs returned their contractual callbacks with at least 99.9% reliability.
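The worker's side of that promise can be sketched as follows. The callback URL and report contents are illustrative; the field names follow the Report JSON shape from the Inputs and Outputs slide.

```shell
#!/bin/sh
# Sketch: at the end of a run, the worker posts its report back to the
# REPORT_RETURN_URL it was handed at submission time, fulfilling the
# "promise". URL and report contents are illustrative.
REPORT_RETURN_URL="https://app.example.com/jobs/123/report"
REPORT='{"Status":"SUCCESS","Reason":"inspection passed","Data":{},"Log":{}}'

# A real worker would: curl -X POST -d "${REPORT}" "${REPORT_RETURN_URL}"
# Pull Status back out (sed instead of jq, to stay dependency-free):
STATUS=$(echo "${REPORT}" | sed -n 's/.*"Status":"\([A-Z]*\)".*/\1/p')
echo "${STATUS}"
```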

Slide 31

Slide 31 text

Then new requirements came along…

Slide 32

Slide 32 text

We added another processor

Slide 33

Slide 33 text

So we went from…

Slide 34

Slide 34 text

To…

Slide 35

Slide 35 text

So that’s how we’re going to play it, huh? It’s time to get a little more deliberate in how this stuff is defined.

Slide 36

Slide 36 text

How this works in Jenkins
• Jenkins jobs were broken into two categories:
  • Orchestration (what order to do things in)
    • Written in Groovy, using the Jenkins Workflow Plugin
  • Worker Process (what to do)
    • Written in Bash + other scripting languages

Slide 37

Slide 37 text

How I think about Orchestration
• Within a given orchestration job, I think it makes the most sense to view it as a series of consecutive Phases, each of which may have multiple concurrent steps.
• Remember, Orchestration jobs (in my world) don’t perform any direct work. Their responsibility is executing Worker Processes, evaluating their results, and determining the next steps to perform.
• I felt this was a very SOLID view to take.

Slide 38

Slide 38 text

How orchestration works

Slide 39

Slide 39 text

How orchestration is run

Slide 40

Slide 40 text

Phases run in order, steps run concurrently
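That execution model can be sketched in plain shell: background each step within a phase, and join them all before the next phase starts. The step names are illustrative.

```shell
#!/bin/sh
# Sketch of the orchestration model: steps within a phase run
# concurrently; a phase ends only when every one of its steps has
# finished, and only then does the next phase begin.
run_phase() {
  phase="$1"; shift
  for step in "$@"; do
    ( echo "[${phase}] ${step}" ) &   # each step runs in the background
  done
  wait                                # barrier: join all steps
}

run_phase "phase-1" "fetch-file" "checkout-tools"
run_phase "phase-2" "inspect-model"
run_phase "phase-3" "upload-result" "post-report"
```

In the real system the Groovy Workflow plugin provides the equivalent barrier; this is just the shape of the control flow.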

Slide 41

Slide 41 text

But it’s not always sunny in San Francisco

Slide 42

Slide 42 text

My Problems With Jenkins
• I am *NOT* a GUI guy.
  • Driving a browser all day was driving me insane.
• I am *REALLY* impatient.
  • I HATED debug-by-re-running, especially when it could take 5-10 minutes to get a result.
• Writing production-worthy code in a textarea was beginning to get on my last nerve…
  • Just… yeah. Need I say more on that one?

Slide 43

Slide 43 text

Step 1: Get away from the UI

Slide 44

Slide 44 text

Getting away from the GUI
• Jenkins offers the Jenkins CLI, which has probably 95% of the features you need on a daily basis for Jenkins server management:
  • Add/delete/update jobs
  • Install plugins
  • Restarting
  • Managing nodes
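For flavor, the daily driver calls look roughly like this. The commands are standard Jenkins CLI commands; the job and plugin names are illustrative, and the calls are printed rather than executed since they need a live master.

```shell
#!/bin/sh
# Sketch of day-to-day Jenkins CLI usage against a master.
# Job/plugin names are illustrative.
CLI="java -jar jenkins-cli.jar -s http://jenkins-master-2938.serverfarm.com/"

echo "${CLI} list-jobs"
echo "${CLI} create-job inspect-model < inspect-model.xml"
echo "${CLI} install-plugin workflow-aggregator"
echo "${CLI} safe-restart"
```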

Slide 45

Slide 45 text

Getting away from the GUI
• Jenkins Jobs are XML definitions with code payloads.
• I made templates of the XML definitions using doT.js.
• I put Groovy code in .groovy files, editing them with a Groovy-friendly editor. Thanks for IntelliJ, JetBrains!
• I put Bash code in .bash files, editing them with a Bash-friendly editor. Another win for IntelliJ!
• I put the tools I would call within jobs in a repository which I could easily check out within a Job context, to execute via Bash scripts.

Slide 46

Slide 46 text

Getting away from the GUI
• Jobs got a generated “version” attached to them:
  • {job name}-v{version number of job}-{latest git hash}
• The version number of the job was calculated from the file and its dependencies (a “Job” was a JSON file).
• The system would also replace references to the “common name” of a job with its “deployed name”, allowing us to write code that said “build(‘process_1’)” even though in production the job was actually called “process_1-v45-1kljh21lk1j2h12lkj1h1”.
• I built a cleanup script which would query all of the jobs on a Jenkins Master and prune any job which had not been run in 7 days and was not the latest version.
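The common-name substitution can be sketched as a plain string rewrite. The version number and git hash below are the slide's own example values; in the real system the rewrite would run over the orchestration source before deployment.

```shell
#!/bin/sh
# Sketch: rewrite a job's "common name" to its generated "deployed name"
# ({job name}-v{version}-{git hash}), as described on this slide.
JOB_NAME="process_1"
JOB_VERSION="45"
GIT_HASH="1kljh21lk1j2h12lkj1h1"
DEPLOYED_NAME="${JOB_NAME}-v${JOB_VERSION}-${GIT_HASH}"

SRC="build('process_1')"
REWRITTEN=$(echo "${SRC}" | sed "s/${JOB_NAME}/${DEPLOYED_NAME}/")
echo "${REWRITTEN}"
```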

Slide 47

Slide 47 text

Step 2: Improving the Speed & Safety of Development

Slide 48

Slide 48 text

Improving Speed & Safety
• Now that I had Groovy in .groovy files, things got really interesting very quickly.
• I created a compile process for the Groovy, allowing me to prepend and append code around the Groovy which defined the actual work. This allowed for common libraries shared between jobs, and I made the source directory configurable to enable production and test builds.
• I created mocks for the Jenkins API that I used in the test builds, so I could test that my Groovy was executable without having to run the job in Jenkins.
• I created a configuration file to define what those mocks should respond with, and how I expected them to be called. I then made the mocks record their calls, and added an append process to produce a report, which could then be validated.
• I made a Groovy test suite for Jenkins Workflows!

Slide 49

Slide 49 text

Improving Speed & Safety
• Since I had made the callback URLs parameterized, I could build an external functional test suite, using ngrok to expose my laptop for cloud callbacks. This let me easily validate that the jobs were performing as expected in a real execution environment.
• The external functional test suite could also serve as a performance test suite, simply by upping the number of configured files to process and the rate at which they were submitted.

Slide 50

Slide 50 text

And then things expanded!

Slide 51

Slide 51 text

From

Slide 52

Slide 52 text

To

Slide 53

Slide 53 text

And then to

Slide 54

Slide 54 text

Problems we ran into
• Logs. Logs. logs. logs. logs. logs. logs. logs. logs. logs.
  • Finding the balance between enough logs for business purposes (2-3 days of logs) and too many logs for Jenkins to cope well (100-200 builds on the Master instances we were using).
• Jobs. Jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs. jobs.
  • Sometimes during heavy dev periods there would be a LOT of jobs, which Jenkins also didn’t seem to really like. And it got confusing for others trying to figure out which jobs to look at for production information.

Slide 55

Slide 55 text

The next round of work…

Slide 56

Slide 56 text

Operationalizing Jenkins
• Chef recipes to build Jenkins Master and Jenkins Slave workers, and Groovy to configure Jenkins without ever having to go into the UI.
  • Groovy handled things like enabling security, assigning SSH public keys to enable CLI usage, and setting other various global configuration options.
• OpsWorks configuration to build the Jenkins farm of Master and Workers.
• Chef recipes to consume OpsWorks environment config and automatically attach new Worker instances to the Master.
• Chef recipes to deploy new suites of Jenkins jobs to the Master, including any required plugins.
• CloudFormation templates to build OpsWorks environments, and ensure workers were distributed across multiple availability zones within a region.
• Scripts to generate CloudFormation templates for a given scale (how many of each worker type, instance sizes, etc…).
• Jenkins jobs to build deployable components for Chef recipes and Jenkins jobs.

Slide 57

Slide 57 text

Operationalizing Jenkins
• The Orchestration job builder got a “no-op” Worker Process mode (with mocked report.json artifacts), so we could read the report.json and confirm the orchestration was operating as expected *quickly*.
• Soon realized I had the levers to do something else pretty interesting…

Slide 58

Slide 58 text

Bonus Content
What if we could operationalize QA?

Slide 59

Slide 59 text

Component Files
• Each deployable module got a “Component File” which detailed several items:
  • What repository it lived in
  • Path in the repository to watch for changes
  • High-level environment requirements (node, ruby, linux, etc…)
  • Build, Test, Stage, and Release steps

Slide 60

Slide 60 text

Definition Example for JavaScript UI
• Repo: http://github.com/somesite/core-system.git
• Path: frontend/src
• Requirements:
  • node, version 8.12.0
• Build:
  • npm install
  • npm run build
• Test:
  • npm test
• Stage:
  • s3cmd cp build/* s3://somesite-public/release/${GIT_HASH}
• Release:
  • heroku config:set -a {HEROKU_APP} JAVASCRIPT_URL=https://cdn.somesite.com/release/${GIT_HASH}

Slide 61

Slide 61 text

Operationalizing QA
• Added “Environments” to component files (for lack of a better location at the time).
• Jenkins pre-scripts ensured RVM and/or NVM was installed for Ruby/JavaScript requirements. Using the Jenkins plugins for these didn’t seem to work consistently.
• Auto-built orchestration and worker jobs for each component’s Build-Test, Build-Test-Stage, and Build-Test-Stage-Release flows.
• Used environment names in the “environment” section to build -Stage and -Release jobs, and enabled variables to be set per-environment which were populated in the build, test, stage, and deploy job parameters.

Slide 62

Slide 62 text

Operationalizing QA
• Once I had high-level jobs to Stage and Release to an environment, I added “dependencies” to the component file, to document dependencies between components — that the JavaScript Frontend relied on the Platform API and the Teams API, for example.
• The generation system would then automatically ensure that dependent components were released to the targeted test environment, and that the test suites for all of the components which relied on the new version *also* passed during the integration test phase.
• For example, when we updated the Teams API, it would ensure that the latest production Platform API and JavaScript Frontend were deployed, and that the JavaScript Frontend’s integration test suite passed, ensuring we hadn’t broken anything in the stack.

Slide 63

Slide 63 text

Operationalizing QA
• Initially the QA system began behaving unpredictably, releasing strange versions of code to production. It feels obvious in retrospect, but this was because Jenkins should not be responsible for releasing itself. Safe Restart doesn’t really apply when Jenkins is restarting itself.
• We ended up with 3 QA environments: QA Production, QA Test, and QA QA. QA QA’s only job was to monitor the Chef recipes and Jenkins Jobs for QA, and release to QA Production when appropriate. QA Test was QA QA’s integration test environment.
• Thanks to all of the previous work, QA QA was able to run a full functional test of QA Production in QA Test, ensuring everything worked as expected, including being able to spin up a test environment for each component in our application stack to ensure that the build and release processes were behaving appropriately.

Slide 64

Slide 64 text

Recap
• Created a high-visibility, asynchronous, parallelizable job queue using Jenkins, with a Promise-based architecture for managing inter-system dependencies.
• Treated Jenkins as a PaaS (Platform as a Service) for deploying jobs, which made deploying safer, as jobs were treated as immutable.
• Created unit, functional, and performance test suites to measure and ensure stability and reliability.
• Created a repeatable way to build Jenkins environments, enabling rapid scalability and ensuring business continuity.

Slide 65

Slide 65 text

Thank You!
Please rate and review my talk!
https://joind.in/talk/8ceeb
The slides will be posted on joind.in shortly.