A Journey Through Wonderland

Paul Seiffert, Mathias Lafeldt

The Purpose of Wonderland

How we got here

• Took Jimdo 5 years to migrate core infrastructure from bare metal to AWS
• Teams started to love the cloud
• Many experiments in different AWS accounts
• “Reinvented” production stacks

Werkzeugschmiede Team

• Founded to solve common infrastructure problems of Jimdo teams
• Provides standard platform that is reliable and simple to use: Wonderland
• Allows Jimdo developers to focus on product development

Wonderland’s History

Wonderland 101

PaaS allowing Jimdo developers to run their dockerized applications

Features

• Long-running stateless services
  ◦ DNS, load balancing, health checks, auto scaling, …
• One-off tasks and cron jobs
• Centralized logging and metrics collection via external providers

Interfaces

• APIs
• CLI tool wl
• Chatbot Alice
• Docker registry
• Vault
• No SSH access

Internal service provider

• SLA
• Status page
• Documentation
• Workshops
• Use-case-driven development

Wonderland Internals

We run...

• AWS infrastructure
• Services providing our APIs

AWS Infrastructure

• Networking
• Cluster of EC2 instances
• Jenkins
• Route 53, DynamoDB, S3, SQS, SNS, ...

“Crims” Cluster

• Runs user applications + system services
• EC2 auto-scaling group
• Providing resources to ECS
• CoreOS

[Diagram labels: AWS ECS, AWS EC2 Auto-Scaling Group]

Auto-Scaling

Two-Dimensional
• Services (based on resource consumption)
• Cluster (based on available slots)
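The cluster dimension of auto-scaling can be illustrated with a minimal sketch. It assumes a "slot" is a fixed amount of CPU and memory that one task needs, and that the cluster grows or shrinks when the number of free slots leaves a configured range. The `Instance` type, the slot sizes, the thresholds, and both function names are hypothetical and only mirror the idea on the slide; they are not Wonderland's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    free_cpu: int     # unreserved CPU units on the instance (assumed unit)
    free_memory: int  # unreserved memory in MiB (assumed unit)


def available_slots(instances, slot_cpu, slot_memory):
    """Count how many slot-sized tasks still fit, instance by instance."""
    return sum(
        min(i.free_cpu // slot_cpu, i.free_memory // slot_memory)
        for i in instances
    )


def desired_cluster_size_delta(instances, slot_cpu, slot_memory,
                               min_slots, max_slots, slots_per_instance):
    """Return how many instances to add (positive) or remove (negative)."""
    slots = available_slots(instances, slot_cpu, slot_memory)
    if slots < min_slots:
        # Too few free slots: add enough instances to reach the minimum.
        return math.ceil((min_slots - slots) / slots_per_instance)
    if slots > max_slots:
        # Too many free slots: remove only whole surplus instances.
        return -((slots - max_slots) // slots_per_instance)
    return 0


cluster = [Instance(free_cpu=1024, free_memory=2048),
           Instance(free_cpu=256, free_memory=4096)]
delta = desired_cluster_size_delta(cluster, slot_cpu=512, slot_memory=1024,
                                   min_slots=4, max_slots=16,
                                   slots_per_instance=4)
print(delta)
```

Counting slots per instance, and not across the summed free resources of the cluster, matters because a task cannot be split across machines: fragmented capacity on many instances does not add up to a usable slot.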

[Graph: AWS/AutoScaling GroupDesiredCapacity and Wonderland/ECS DesiredClusterSizeDelta, 1 week]

A Crims Cluster Instance

[Diagram labels: ECS Agent, Log Forwarder, Datadog Agent, AWS ECS, Service A, Service B, Service C, ELB, ELB; ports HTTP :80, HTTPS :443, HTTP :11411, TCP :1234, TCP :11412]

Infrastructure Development

• Infrastructure as code
• CloudFormation and Ansible
• Applied by a Central State Enforcer
• Workflow based on GitHub pull requests
• Automated rollout to production

QA

• We test everything
• Unit, integration, and system tests
• Tests in staging environment
• Staging is set up from scratch every week
• Periodic GameDays

Our Services

• provide APIs
• deploy other services
• are Wonderland services

[Diagram labels: SQS Queue, Status Check Service, AutoScaler, Deployer API, (Dash-) Boards, Oraculum (Logs), AWS Route53, AWS Application AutoScaling, Notifications, AWS SNS, Alice (Chatbot), Deployer Worker, WL (CLI Tool), AWS S3]

Service Configuration

$ cat wonderland-autoscaler/wonderland.yaml
---
scale: 2
components:
- name: autoscaler
  image: registry.example.com/wonderland-autoscaler:v1.0.3
  env:
    DYNAMODB_TABLE_NAME: wonderland-autoscaling-configs
endpoint:
  domain: autoscaler.example.com
  load-balancer:
    healthcheck:
      path: /v1/health
    ports:
    - port: 443
      protocol: HTTPS
  component: autoscaler
  port: 80
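A configuration like this lends itself to a few sanity checks before deployment. The following is a minimal sketch only: it assumes the YAML has already been parsed into a Python dict and that the endpoint's `component` key is meant to name one of the entries under `components` (the nesting is an assumption, since the slide does not show indentation). The function `validate_config` and its rules are hypothetical and not part of the `wl` tool.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    scale = config.get("scale")
    if not isinstance(scale, int) or scale < 1:
        problems.append("scale must be a positive integer")

    # Collect component names while checking required keys.
    names = set()
    for component in config.get("components", []):
        for key in ("name", "image"):
            if key not in component:
                problems.append(f"component is missing '{key}'")
        names.add(component.get("name"))

    # Assumption: an endpoint routes traffic to exactly one named component.
    endpoint = config.get("endpoint")
    if endpoint is not None:
        target = endpoint.get("component")
        if target not in names:
            problems.append(f"endpoint refers to unknown component '{target}'")

    return problems


config = {
    "scale": 2,
    "components": [
        {"name": "autoscaler",
         "image": "registry.example.com/wonderland-autoscaler:v1.0.3"},
    ],
    "endpoint": {"domain": "autoscaler.example.com",
                 "component": "autoscaler", "port": 80},
}
print(validate_config(config))
```

Returning a list of problems, not raising on the first one, lets a caller report every issue in a single run.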

Deploy it!

$ wl deploy autoscaler -f wonderland-autoscaler/wonderland.yaml
autoscaler/1466583476 This is try 1
autoscaler/1466583476 Updating ELB autoscaler-1466437217
autoscaler/1466583476 Configuring health check HTTP:11011/v1/health
autoscaler/1466583476 Enabling cross-zone load balancing
autoscaler/1466583476 Configuring connection draining with a timeout of 180s
autoscaler/1466583476 Not enabling access log
autoscaler/1466583476 Letting autoscaler.example.com point to autoscaler-1363526915.eu-west-1.elb.amazonaws.com
autoscaler/1466583476 Registered new ECS TaskDefinition (autoscaler:58) for service autoscaler
autoscaler/1466583476 Updating ECS service autoscaler-1466437217
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 180s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 170s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 160s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 150s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 140s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 130s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 120s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 110s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 100s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 90s)
autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 80s)
autoscaler/1466583476 Rolling update completed successfully.
autoscaler/1466583476 Waiting for ELB to have at least one healthy instance
autoscaler/1466583476 Deleting old ECS Task Definition service-autoscaler:57
autoscaler/1466583476 Marking deployment autoscaler/1466583476 active
autoscaler/1466583476 [Boards] Creating Board for Service [werkzeugschmiede] autoscaler
autoscaler/1466583476 [Datadog] Creating Deployment Event
autoscaler/1466583476 [Notifications] Notification channel is /v1/teams/werkzeugschmiede/channels/autoscaler
autoscaler/1466583476 [StatusCheck] CheckID is f85ded4d-9ad0-4375-81b4-5989964e8ed5
autoscaler/1466583476 Deployment successful
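The repeated "Waiting for service ... (timeout in Ns)" lines suggest a polling loop with a fixed interval and an overall deadline. A minimal sketch of such a loop follows, assuming a 10-second interval and a 180-second budget as seen in the log. The function name, the injectable `is_complete`, `sleep`, and `log` parameters, and the timeout message are all hypothetical; the real deployer may work differently.

```python
import time


def wait_for_rolling_update(is_complete, timeout=180, interval=10,
                            sleep=time.sleep, log=print):
    """Poll is_complete() until it returns True or the timeout is used up."""
    remaining = timeout
    while True:
        if is_complete():
            log("Rolling update completed successfully.")
            return True
        if remaining <= 0:
            # Hypothetical message; the slide does not show a failed deploy.
            log("Rolling update timed out.")
            return False
        log(f"Waiting for service to complete rolling update "
            f"(timeout in {remaining}s)")
        sleep(min(interval, remaining))
        remaining -= interval


# Usage with a fake status check that succeeds on the third poll.
states = iter([False, False, True])
messages = []
ok = wait_for_rolling_update(lambda: next(states),
                             sleep=lambda seconds: None,
                             log=messages.append)
print(ok, len(messages))
```

Passing `sleep` and `log` as parameters keeps the loop testable without real waiting, which is why the usage example swaps in a no-op sleep.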

Monitor it!

$ wl status autoscaler
Current deployment: 1466583491
Desired scale: 2

Machine     Component   Status   Started               Deployment  ELB
-------     ---------   ------   -------               ----------  ---
i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST  1466583491  InService
i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST  1466583491  InService

$ wl logs -f autoscaler
...

The Future

Improvements

• Persistent disk storage
• Dynamic load balancing
• Long-running / memory hungry jobs
• Speed up ECS cluster rotation
• Make crons more reliable
• Outsource Docker registry

Questions?

Thank you.

Twitter: @seiffertp / @mlafeldt
https://medium.com/production-ready
