Slide 1

Slide 1 text

A Journey Through Wonderland
Paul Seiffert, Mathias Lafeldt

Slide 2

Slide 2 text

The Purpose of Wonderland

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

How we got here
● Took Jimdo 5 years to migrate core infrastructure from bare metal to AWS
● Teams started to love the cloud
● Many experiments in different AWS accounts
● “Reinvented” production stacks

Slide 5

Slide 5 text

Werkzeugschmiede Team
● Founded to solve common infrastructure problems of Jimdo teams
● Provides standard platform that is reliable and simple to use: Wonderland
● Allows Jimdo developers to focus on product development

Slide 6

Slide 6 text

Wonderland’s History

Slide 7

Slide 7 text

Wonderland 101

Slide 8

Slide 8 text

PaaS allowing Jimdo developers to run their dockerized applications

Slide 9

Slide 9 text

Features
● Long-running stateless services
  ○ DNS, load balancing, health checks, auto scaling, …
● One-off tasks and cron jobs
● Centralized logging and metrics collection via external providers

Slide 10

Slide 10 text

Interfaces
● APIs
● CLI tool wl
● Chatbot Alice
● Docker registry
● Vault
● No SSH access

Slide 11

Slide 11 text

Internal service provider
● SLA
● Status page
● Documentation
● Workshops
● Use-case-driven development

Slide 12

Slide 12 text

Wonderland Internals

Slide 13

Slide 13 text

We run...
● AWS infrastructure
● Services providing our APIs

Slide 14

Slide 14 text

AWS Infrastructure
● Networking
● Cluster of EC2 instances
● Jenkins
● Route 53, DynamoDB, S3, SQS, SNS, ...

Slide 15

Slide 15 text

“Crims” Cluster
● Runs user applications + system services
● EC2 auto-scaling group
● Providing resources to ECS
● CoreOS

Slide 16

Slide 16 text

[Diagram labels: AWS ECS; AWS EC2 Auto-Scaling Group]

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Two-Dimensional Auto-Scaling
● Services (based on resource consumption)
● Cluster (based on available slots)
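The cluster dimension can be illustrated with a small sketch. It assumes that a "slot" is a fixed bundle of CPU and memory and that the cluster should always keep a minimum number of slots free. The slot size, the headroom value, and every name below are hypothetical; the slides do not show how Wonderland's AutoScaler actually computes its result.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    free_cpu: int     # unreserved CPU units on this instance
    free_memory: int  # unreserved memory in MiB on this instance


# Hypothetical slot size: the resources one typical task is assumed to need.
SLOT_CPU = 256
SLOT_MEMORY = 512


def free_slots(instance):
    """Number of whole slots that still fit on one instance."""
    return min(instance.free_cpu // SLOT_CPU, instance.free_memory // SLOT_MEMORY)


def desired_cluster_size_delta(instances, min_free_slots, slots_per_instance):
    """Positive result: add instances. Negative: remove. Zero: keep the size."""
    available = sum(free_slots(i) for i in instances)
    missing = min_free_slots - available
    if missing > 0:
        # Not enough headroom: add as many instances as the missing slots need.
        return math.ceil(missing / slots_per_instance)
    # Only scale in by whole instances' worth of surplus slots.
    surplus = available - min_free_slots
    return -(surplus // slots_per_instance)
```

A delta computed this way could be published as a custom metric and used to adjust the desired capacity of the EC2 auto-scaling group. The service dimension would work independently, scaling the task count of each service on its own resource consumption.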

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

[Chart over 1 week: AWS/AutoScaling GroupDesiredCapacity and Wonderland/ECS DesiredClusterSizeDelta]

Slide 21

Slide 21 text

A Crims Cluster Instance
[Diagram labels: ECS Agent, Log Forwarder, Datadog Agent, AWS ECS, Service A, Service B, Service C, ELB, ELB; ports HTTP :80, HTTPS :443, HTTP :11411, TCP :1234, TCP :11412]

Slide 22

Slide 22 text

Infrastructure Development
● Infrastructure as code
● CloudFormation and Ansible
● Applied by a Central State Enforcer
● Workflow based on GitHub pull requests
● Automated rollout to production

Slide 23

Slide 23 text

QA
● We test everything
● Unit, integration, and system tests
● Tests in staging environment
● Staging is set up from scratch every week
● Periodic GameDays

Slide 24

Slide 24 text

Our Services
● provide APIs
● deploy other services
● are Wonderland services

Slide 25

Slide 25 text

[Diagram labels: SQS Queue, Status Check Service, AutoScaler, Deployer API, (Dash-)Boards, Oraculum (Logs), AWS Route53, AWS Application AutoScaling, Notifications, AWS SNS, Alice (Chatbot), Deployer Worker, WL (CLI Tool), AWS S3]

Slide 26

Slide 26 text

Service Configuration

$ cat wonderland-autoscaler/wonderland.yaml
---
scale: 2

components:
- name: autoscaler
  image: registry.example.com/wonderland-autoscaler:v1.0.3
  env:
    DYNAMODB_TABLE_NAME: wonderland-autoscaling-configs

endpoint:
  domain: autoscaler.example.com
  load-balancer:
    healthcheck:
      path: /v1/health
    ports:
    - port: 443
      protocol: HTTPS
  component: autoscaler
  port: 80
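To make the shape of this configuration concrete, here is a minimal validation sketch. The configuration is written as a Python dict so that no YAML library is needed. The nesting of the keys is an assumption read off the slide, and the `validate` function and its rules are hypothetical; they are not the checks the Wonderland Deployer performs.

```python
# The configuration from the slide as a Python dict.
# Assumption: "component" and "port" belong to "endpoint" and name the
# component (and its container port) that receives the traffic.
CONFIG = {
    "scale": 2,
    "components": [
        {
            "name": "autoscaler",
            "image": "registry.example.com/wonderland-autoscaler:v1.0.3",
            "env": {"DYNAMODB_TABLE_NAME": "wonderland-autoscaling-configs"},
        }
    ],
    "endpoint": {
        "domain": "autoscaler.example.com",
        "load-balancer": {
            "healthcheck": {"path": "/v1/health"},
            "ports": [{"port": 443, "protocol": "HTTPS"}],
        },
        "component": "autoscaler",
        "port": 80,
    },
}


def validate(config):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not isinstance(config.get("scale"), int) or config["scale"] < 1:
        problems.append("scale must be a positive integer")
    names = {c.get("name") for c in config.get("components", [])}
    if not names:
        problems.append("at least one component is required")
    endpoint = config.get("endpoint")
    if endpoint is not None and endpoint.get("component") not in names:
        problems.append("endpoint.component must name a defined component")
    return problems
```

Calling `validate(CONFIG)` returns an empty list. Pointing `endpoint.component` at a name that is not listed under `components` yields one problem.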

Slide 27

Slide 27 text

Deploy it!

$ wl deploy autoscaler -f wonderland-autoscaler/wonderland.yaml
autoscaler/1466583476 This is try 1
autoscaler/1466583476 Updating ELB autoscaler-1466437217
autoscaler/1466583476 Configuring health check HTTP:11011/v1/health
autoscaler/1466583476 Enabling cross-zone load balancing
autoscaler/1466583476 Configuring connection draining with a timeout of 180s
autoscaler/1466583476 Not enabling access log
autoscaler/1466583476 Letting autoscaler.example.com point to autoscaler-1363526915.eu-west-1.elb.amazonaws.com
autoscaler/1466583476 Registered new ECS TaskDefinition (autoscaler:58) for service autoscaler
autoscaler/1466583476 Updating ECS service autoscaler-1466437217
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 180s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 170s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 160s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 150s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 140s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 130s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 120s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 110s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 100s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 90s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 80s)
autoscaler/1466583476 Rolling update completed successfully.
autoscaler/1466583476 Waiting for ELB to have at least one healthy instance
autoscaler/1466583476 Deleting old ECS Task Definition service-autoscaler:57
autoscaler/1466583476 Marking deployment autoscaler/1466583476 active
autoscaler/1466583476 [Boards] Creating Board for Service [werkzeugschmiede] autoscaler
autoscaler/1466583476 [Datadog] Creating Deployment Event
autoscaler/1466583476 [Notifications] Notification channel is /v1/teams/werkzeugschmiede/channels/autoscaler
autoscaler/1466583476 [StatusCheck] CheckID is f85ded4d-9ad0-4375-81b4-5989964e8ed5
autoscaler/1466583476 Deployment successful
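The repeated "Waiting for service ... (timeout in Ns)" lines suggest a polling loop with a fixed interval and an overall timeout. The sketch below shows that general pattern only. The function name, its parameters, and the 10-second interval are assumptions inferred from the log output, and `is_complete` stands in for whatever check the Deployer really performs against ECS.

```python
import time


def wait_for_rolling_update(is_complete, timeout=180, interval=10,
                            sleep=time.sleep, log=print):
    """Poll is_complete() until it returns True or the timeout is used up.

    Returns True if the update completed in time, False otherwise.
    `sleep` and `log` are injectable so the loop can be tested without waiting.
    """
    remaining = timeout
    while remaining > 0:
        if is_complete():
            log("Rolling update completed successfully.")
            return True
        log(f"Waiting for service to complete rolling update (timeout in {remaining}s)")
        sleep(interval)
        remaining -= interval
    # One last check after the final sleep, without logging success.
    return is_complete()
```

With the defaults, a check that first succeeds on the twelfth call would print eleven "Waiting" lines counting down from 180s to 80s, which matches the shape of the log above.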

Slide 28

Slide 28 text

Monitor it!

$ wl status autoscaler
Current deployment: 1466583491
Desired scale: 2

Machine     Component   Status   Started               Deployment  ELB
-------     ---------   ------   -------               ----------  ---
i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST  1466583491  InService
i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST  1466583491  InService

$ wl logs -f autoscaler
...

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

The Future

Slide 31

Slide 31 text

Improvements
● Persistent disk storage
● Dynamic load balancing
● Long-running / memory hungry jobs
● Speed up ECS cluster rotation
● Make crons more reliable
● Outsource Docker registry

Slide 32

Slide 32 text

Thank you.
Questions?
Twitter: @seiffertp / @mlafeldt
https://medium.com/production-ready