Slide 1

Slide 1 text

Shipping a Replacement Architecture in Elixir
Chris Bell • @cjbell_
EMPEX LA 2018

Slide 2

Slide 2 text

I’m Chris and I <3 Elixir
• 3 years of writing production Elixir apps
• EMPEX NYC Organizer
• ElixirTalk Co-host (with @desmondmonster)

Slide 3

Slide 3 text

• Powers Review & Collaboration for Video teams
• Used by Vice, Turner, Buzzfeed, NYTimes, NASA
• 450,000+ customers
• Founded in 2014, based in NYC

Slide 4

Slide 4 text

OUR JOURNEY FROM:
Ruby on Rails to Elixir & Phoenix

Slide 5

Slide 5 text

PART I
Why Rewrite & Why Elixir?

Slide 6

Slide 6 text

OOPS, SORRY JOEL
We did that thing you should never do: a rewrite

Slide 7

Slide 7 text

Rapid growth and lots of shifting requirements left the codebase not in the best state

Slide 8

Slide 8 text

API ISSUES PRE-MIGRATION
• Ruby 1.9.3, Rails 3.2
• No tests… at all… for anything
• Custom ORM into DynamoDB
• Metaprogramming everywhere
• No logical separation of concerns (authorization, persistence, domain logic)
• No metrics or visibility into performance

Slide 9

Slide 9 text

DATABASE ISSUES PRE-MIGRATION
• Everything is a string… even nulls.
• Foreign keys stored as a JSON-encoded list of strings on the model. No atomic updates = lost data.
• DynamoDB cost $$$ to support our workload
• No ability to paginate anything = up to 45s response times and large payloads (> 1 MB of JSON)
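
Why a list of foreign keys encoded into one field loses data can be shown with a small sketch. This is an illustration, not the legacy code: the module name `LegacyRecord` is hypothetical, and a comma-joined string stands in for the JSON-encoded list so the example needs no JSON library.

```elixir
defmodule LegacyRecord do
  # Stand-in for the legacy encoding: the slides describe a JSON-encoded
  # list of strings; a comma-joined string keeps this sketch dependency-free.
  def decode(""), do: []
  def decode(field), do: String.split(field, ",")
  def encode(ids), do: Enum.join(ids, ",")

  # Read-modify-write: rebuild the whole field from a previously read snapshot.
  def add_id(snapshot, id), do: encode(decode(snapshot) ++ [id])
end

snapshot = "a1,a2"

# Two writers start from the same snapshot and each write back a full field.
writes = [LegacyRecord.add_id(snapshot, "a3"), LegacyRecord.add_id(snapshot, "a4")]

# Each write replaces the field wholesale, so only the last one survives.
stored = List.last(writes)
IO.inspect(LegacyRecord.decode(stored))
```

The id added by the first writer never reaches the stored value, because neither write can be applied atomically on top of the other.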

Slide 10

Slide 10 text

We felt we were technically bankrupt with our existing API and services

Slide 11

Slide 11 text

So why Elixir for us?

Slide 12

Slide 12 text

WHY ELIXIR?
• Easy-ish ramp-up for our existing Ruby / Python developers
• Highly concurrent: uses resources much more efficiently
• Building on a mature VM (BEAM) and established language (Erlang & OTP)
• Language attributes promote explicitness: immutability, pattern matching, multiple function heads.
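
A minimal sketch of what pattern matching and multiple function heads look like in practice. The module `UploadStatus` and the result shapes are hypothetical and chosen only for illustration; each clause names exactly the shape it handles, which is the explicitness the last bullet refers to.

```elixir
defmodule UploadStatus do
  # Hypothetical module: one function head per result shape.
  def describe({:ok, %{name: name}}), do: "uploaded #{name}"
  def describe({:error, :too_large}), do: "file exceeds the size limit"
  def describe({:error, reason}) when is_atom(reason), do: "failed: #{reason}"
end

IO.puts(UploadStatus.describe({:ok, %{name: "cut_v2.mov"}}))
IO.puts(UploadStatus.describe({:error, :too_large}))
IO.puts(UploadStatus.describe({:error, :timeout}))
```

A shape that no head matches raises a `FunctionClauseError` instead of silently falling through.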

Slide 13

Slide 13 text

PART II
What We Shipped & A Peek Into The System

Slide 14

Slide 14 text

LEGACY ARCHITECTURE
• Client (iOS / Web / Adobe)
• API (Ruby on Rails)
• Websocket connection
• Real-time Service (Node.js)
• Support Tool (Node.js)
• Push Notifications
• SES
• Email Digest Service (Node.js)
• DynamoDB

Slide 15

Slide 15 text

LEGACY ARCHITECTURE: WHAT WE REPLACED
• Client (iOS / Web / Adobe)
• API (Ruby on Rails)
• Websocket connection
• Real-time Service (Node.js)
• Support Tool (Node.js)
• Push Notifications
• SES
• Email Digest Service (Node.js)
• DynamoDB

Slide 16

Slide 16 text

WHAT WE SHIPPED
• Elixir-powered API, notifications system, real-time service, and support tool
• Migrated all of our data to Postgres from DynamoDB
• Dockerized all of the above and rebuilt our tooling and deploy process from the ground up

Slide 17

Slide 17 text

UPDATED ARCHITECTURE
• Client (iOS / Web / Adobe)
• Munger API (Phoenix)
• V2 API (Phoenix)
• Websocket connection
• Real-time Service (Elixir / Phoenix)
• Support Tool (Phoenix)
• Core Business Logic
• Umbrella App (diagram grouping label)
• Push Notifications
• SES
• Email Service
• Postgres
• Memcached

Slide 18

Slide 18 text

WHAT WE SHIPPED: RESULTS
• ~40 EC2 instances down to ~5 (running on ECS)
• API 95th percentile: ~30ms @ ~120rps
• Database cost reduced by 91%
• Full visibility into all parts of the system (via statsd & Datadog)
• Modular, documented, maintainable codebase

Slide 19

Slide 19 text

AND BEST OF ALL: Predictable, stable performance with no out-of-hours incidents yet (and very few in general)

Slide 20

Slide 20 text

A WHIRLWIND TOUR THROUGH THE SYSTEM
1. The Intermediary API
2. Umbrella App Structure
3. Event System
4. Moving millions of records

Slide 21

Slide 21 text

THE INTERMEDIARY API
• Consumes the new API, emits the old schemas, and maintains the legacy contract (by stringifying everything)
• Allowed us to ship our new stack sooner with fewer implications for our different clients
• Complexity is high, but it is designed to be thrown away (in ~6 months' time)
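
"Stringifying everything" can be pictured as a recursive translation from typed values to the all-strings shape the legacy clients expect. The sketch below is an assumption about the general idea, not the Munger implementation: the module `LegacyShape`, the sample fields, and the choice to render nil as the string "null" are all hypothetical.

```elixir
defmodule LegacyShape do
  # Hypothetical translator: turn typed values from a new API into the
  # all-strings shape an old client expects. Rendering nil as "null" is an
  # assumption; the slides only say that nulls were stored as strings.
  def stringify(nil), do: "null"
  def stringify(value) when is_binary(value), do: value
  def stringify(value) when is_boolean(value) or is_number(value), do: to_string(value)
  def stringify(list) when is_list(list), do: Enum.map(list, &stringify/1)

  def stringify(map) when is_map(map) do
    Map.new(map, fn {key, value} -> {to_string(key), stringify(value)} end)
  end
end

new_asset = %{id: 42, name: "cut_v2.mov", archived: false, parent_id: nil, tags: ["rough", 1]}
IO.inspect(LegacyShape.stringify(new_asset))
```

Keeping the translation in one pure function makes it easy to test against recorded legacy payloads and easy to delete once clients move to the new contract.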

Slide 22

Slide 22 text

THE INTERMEDIARY API: HOW IT WORKS
1. HTTP Request
2. Fetch Resources
3. Translate New to Old
4. Serialize

Slide 25

Slide 25 text

A WHIRLWIND TOUR THROUGH THE SYSTEM
1. The Intermediary API
2. Umbrella App Structure
3. Event System
4. Moving millions of records

Slide 26

Slide 26 text

UMBRELLA APP STRUCTURE
• We use a single ‘monorepo’ to contain all our separate applications, structured as an Umbrella
• Total of 11 apps right now

Slide 27

Slide 27 text

UMBRELLA APP STRUCTURE
• PHOENIX APPS: API, Support Tool, Munger
• BUSINESS LOGIC: Core
• SHARED COMPONENTS: DB, Cron, Dynasaur, Monitoring, Middleware, Auth, Email

Slide 29

Slide 29 text

UMBRELLA APP STRUCTURE
• Apps built and deployed as separate Docker containers in CircleCI via Distillery
• Each build & deploy takes ~5 minutes (run in parallel)
• Blue / green deploys via ECS
• All auto-scaled via CPU / memory threshold alarms

Slide 31

Slide 31 text

UMBRELLA APP STRUCTURE: CORE
• Core houses all of our business logic, services, Ecto schemas, access policies, deferred logic, and more
• Broken into two contexts: Accounts & Projects
• API & Support Tool use the Core to fetch data and execute requests: they are effectively dumb HTTP wrappers
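
The split between Core and its "dumb HTTP wrappers" might look roughly like the sketch below. Everything here is hypothetical: the module names `Core.Projects` and `Api.ProjectHandler`, the role check, and the status codes are assumptions, and plain tuples stand in for Phoenix controllers and Ecto changesets so the example runs on its own.

```elixir
defmodule Core.Projects do
  # Hypothetical context function: access policy and domain rules live here.
  def rename_project(%{role: :admin}, project, name) when byte_size(name) > 0 do
    {:ok, %{project | name: name}}
  end

  def rename_project(%{role: :admin}, _project, _name), do: {:error, :invalid_name}
  def rename_project(_user, _project, _name), do: {:error, :forbidden}
end

defmodule Api.ProjectHandler do
  # Thin wrapper: only maps the Core result onto an HTTP status and body.
  def rename(user, project, params) do
    case Core.Projects.rename_project(user, project, params["name"]) do
      {:ok, updated} -> {200, updated}
      {:error, :forbidden} -> {403, %{error: "forbidden"}}
      {:error, :invalid_name} -> {422, %{error: "invalid name"}}
    end
  end
end

IO.inspect(Api.ProjectHandler.rename(%{role: :admin}, %{name: "Old"}, %{"name" => "Teaser"}))
```

Because the wrapper holds no rules of its own, a second front end (such as a support tool) can call the same Core function and get the same policy decisions.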

Slide 32

Slide 32 text

… And a lot of tests ✨
$ Finished in 10.7 seconds
$ 1183 tests, 0 failures

Slide 33

Slide 33 text

A WHIRLWIND TOUR THROUGH THE SYSTEM
1. The Intermediary API
2. Umbrella App Structure
3. Event System
4. Moving millions of records

Slide 34

Slide 34 text

EVENT SYSTEM: WHAT IS IT?
• All changes in our system are broadcast through a single, local event bus
• Provides a powerful hook to build deferred functionality on top of (like notifications, analytics tracking, etc.)
• Implemented using GenStage and Protocols

Slide 35

Slide 35 text

EVENT SYSTEM
• Service
• Broadcaster
• Consumer
• Implementation

Slide 38

Slide 38 text

EVENT SYSTEM
AssetCreated Event → Broadcaster → Audits, Analytics, Notifications, Usage, Cache
• Broadcaster will notify all consumers concurrently.
• Events are always typed as structs
• Consumers are implemented as DynamicSupervisors
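
A dependency-free sketch of the idea: events are structs, a protocol dispatches on the event type, and a broadcaster hands each event to every consumer concurrently. This is not the production design. `Task.async_stream` from the standard library stands in for the GenStage pipeline, consumers are plain functions rather than supervised processes, and the names `Describable`, `Broadcaster`, and the struct fields are hypothetical.

```elixir
defmodule AssetCreated do
  # Events are plain structs, so consumers can match on the event type.
  defstruct [:asset_id, :user_id]
end

defprotocol Describable do
  @doc "Human-readable summary of an event (hypothetical protocol)."
  def describe(event)
end

defimpl Describable, for: AssetCreated do
  def describe(%AssetCreated{asset_id: id, user_id: user}) do
    "user #{user} created asset #{id}"
  end
end

defmodule Broadcaster do
  # Stand-in for a GenStage-based broadcaster: runs every consumer function
  # in its own process and returns the results in consumer order.
  def broadcast(event, consumers) do
    consumers
    |> Task.async_stream(fn {name, handle} -> {name, handle.(event)} end)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

consumers = [
  audits: fn event -> Describable.describe(event) end,
  analytics: fn %AssetCreated{asset_id: id} -> {:track, "asset_created", id} end
]

IO.inspect(Broadcaster.broadcast(%AssetCreated{asset_id: 7, user_id: 3}, consumers))
```

Adding a new event type means defining a struct and its protocol implementations; the broadcaster and existing consumers stay unchanged.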

Slide 39

Slide 39 text

A WHIRLWIND TOUR THROUGH THE SYSTEM
1. The Intermediary API
2. Umbrella App Structure
3. Event System
4. Moving millions of records

Slide 40

Slide 40 text

MOVING MILLIONS OF RECORDS
• Moved all our records from DynamoDB into Postgres
• Migrated through Flow tasks that streamed data from our tables and converted it into the appropriate Postgres schemas
• Largest table was ~9m records, each record > 100 KB (lots of JSON)

Slide 41

Slide 41 text

MOVING MILLIONS OF RECORDS: HOW IT WORKS
1. Define Dynamo Schema
2. Stream from table
3. Translate Old to New
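
The three steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual migration code: `Task.async_stream` replaces Flow, an in-memory stream replaces the DynamoDB table scan, the `insert_batch` function replaces the Postgres insert, and the module name, field names, and comma-joined foreign keys are hypothetical.

```elixir
defmodule Migration do
  # Step 1 (implicit here): the legacy shape is all strings, including "null"
  # and an encoded list of foreign keys (comma-joined in this sketch).

  # Step 3: translate one legacy row into a typed row.
  def translate(%{"id" => id, "name" => name, "asset_ids" => ids, "size" => size}) do
    %{
      id: id,
      name: name,
      asset_ids: String.split(ids, ",", trim: true),
      size: parse_int(size)
    }
  end

  defp parse_int("null"), do: nil
  defp parse_int(value), do: String.to_integer(value)

  # Step 2: lazily walk the source, translate rows in parallel, and hand
  # them to the insert function in small batches.
  def run(source, insert_batch, parallelism) do
    source
    |> Task.async_stream(&translate/1, max_concurrency: parallelism)
    |> Stream.map(fn {:ok, row} -> row end)
    |> Stream.chunk_every(2)
    |> Enum.map(insert_batch)
  end
end

legacy_rows =
  Stream.map(1..5, fn n ->
    %{"id" => "p#{n}", "name" => "Project #{n}", "asset_ids" => "a#{n},b#{n}", "size" => "null"}
  end)

batch_sizes = Migration.run(legacy_rows, fn batch -> length(batch) end, 4)
IO.inspect(batch_sizes)
```

The `parallelism` argument is the knob to tune per table: large rows favor fewer concurrent translations, small rows favor more.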

Slide 44

Slide 44 text

MOVING MILLIONS OF RECORDS: HOW IT WORKS
• Each table migration job runs in its own isolated Docker container, started via ECS RunTask
• We monitored errors in our jobs and constantly refined and tweaked the parallelism for each job
• Ran weekly migrations and manually checked the migrated data in our QA environment

Slide 45

Slide 45 text

PART III
Challenges & Takeaways

Slide 46

Slide 46 text

CHALLENGES DURING THE MIGRATION
• Bugs, and replicating old bugs
• Team ramp-up: 3 new developers learning Elixir while trying to ship a thing is hard. Protip: establish patterns.
• Understanding the performance characteristics of a new system and new database.
• Estimation of complexity: went 6 weeks over our planned delivery date.

Slide 47

Slide 47 text

TAKEAWAY #1
Elixir was a huge win for us, but might not be for you.

Slide 48

Slide 48 text

TAKEAWAY #2
If you do rewrite, don’t move databases at the same time.

Slide 49

Slide 49 text

TAKEAWAY #3
Good code isn’t about getting it right the first time. Good code is just legacy code that doesn’t get in the way.
@tef_ebooks

Slide 50

Slide 50 text

Thank you. Questions?
 [email protected] • @cjbell_