A Journey Through Wonderland

In this talk we're going to introduce you to Wonderland, Jimdo's in-house PaaS for microservices. Wonderland provides Jimdo developers with an API and other tools that make deploying Dockerized applications easy. Our PaaS utilizes Amazon ECS to run Docker containers on a CoreOS cluster in EC2. Besides those basic building blocks, we integrate different external services for metrics and logging.

We'll show how Wonderland works under the hood, what we like and what we don't like about the current setup, and how our teams are using the platform for running services in production.

(Talk given at ContainerDays 2016: http://www.containerdays.de/)

Mathias Lafeldt

June 28, 2016

Transcript

  1. 3.
  2. 4.

    How we got here

    • Took Jimdo 5 years to migrate core infrastructure from bare metal to AWS
    • Teams started to love the cloud
    • Many experiments in different AWS accounts
    • “Reinvented” production stacks
  3. 5.

    Werkzeugschmiede Team

    • Founded to solve common infrastructure problems of Jimdo teams
    • Provides standard platform that is reliable and simple to use: Wonderland
    • Allows Jimdo developers to focus on product development
  4. 9.

    Features

    • Long-running stateless services
      ◦ DNS, load balancing, health checks, auto scaling, …
    • One-off tasks and cron jobs
    • Centralized logging and metrics collection via external providers
  5. 10.

    Interfaces

    • APIs
    • CLI tool wl
    • Chatbot Alice
    • Docker registry
    • Vault
    • No SSH access
  6. 11.

    Internal service provider

    • SLA
    • Status page
    • Documentation
    • Workshops
    • Use-case-driven development
  7. 14.

    AWS Infrastructure

    • Networking
    • Cluster of EC2 instances
    • Jenkins
    • Route 53, DynamoDB, S3, SQS, SNS, ...
  8. 15.

    “Crims” Cluster

    • Runs user applications + system services
    • EC2 auto-scaling group
    • Providing resources to ECS
    • CoreOS
  9. 17.
  10. 19.
  11. 21.

    A Crims Cluster Instance

    (Diagram. Labels: ECS Agent, Log Forwarder, Datadog Agent, AWS ECS, Service A, Service B, Service C, two ELBs; ports HTTP :80, HTTPS :443, HTTP :11411, TCP :1234, TCP :11412.)
  12. 22.

    Infrastructure Development

    • Infrastructure as code
    • CloudFormation and Ansible
    • Applied by a Central State Enforcer
    • Workflow based on GitHub pull requests
    • Automated rollout to production
  13. 23.

    QA

    • We test everything
    • Unit, integration, and system tests
    • Tests in staging environment
    • Staging is set up from scratch every week
    • Periodic GameDays
  14. 25.

    (Architecture diagram. Component labels: SQS Queue, Status Check Service, AutoScaler, Deployer API, (Dash-)Boards, Oraculum (Logs), AWS Route53, AWS Application AutoScaling, Notifications, AWS SNS, Alice (Chatbot), Deployer Worker, WL (CLI Tool), AWS S3.)
  15. 26.

    Service Configuration

    $ cat wonderland-autoscaler/wonderland.yaml
    ---
    scale: 2
    components:
    - name: autoscaler
      image: registry.example.com/wonderland-autoscaler:v1.0.3
      env:
        DYNAMODB_TABLE_NAME: wonderland-autoscaling-configs
    endpoint:
      domain: autoscaler.example.com
      load-balancer:
        healthcheck:
          path: /v1/health
        ports:
        - port: 443
          protocol: HTTPS
      component: autoscaler
      port: 80
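A configuration like the one on this slide lends itself to a few sanity checks before it is handed to the deployer. The following is only a sketch: the function `validate_service_config` is hypothetical and not part of Wonderland, the config is given as an already-parsed Python dict, and the nesting of `component` and `port` under `endpoint` is an assumption, since the slide text does not preserve indentation.

```python
from typing import Any, Dict, List

# The slide's example as a parsed dict. The nesting is an assumption.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "scale": 2,
    "components": [
        {
            "name": "autoscaler",
            "image": "registry.example.com/wonderland-autoscaler:v1.0.3",
            "env": {"DYNAMODB_TABLE_NAME": "wonderland-autoscaling-configs"},
        }
    ],
    "endpoint": {
        "domain": "autoscaler.example.com",
        "load-balancer": {
            "healthcheck": {"path": "/v1/health"},
            "ports": [{"port": 443, "protocol": "HTTPS"}],
        },
        "component": "autoscaler",
        "port": 80,
    },
}


def validate_service_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the config (hypothetical checks)."""
    errors: List[str] = []

    # scale must be a real integer >= 1 (bool is excluded explicitly).
    scale = config.get("scale")
    if isinstance(scale, bool) or not isinstance(scale, int) or scale < 1:
        errors.append("scale must be a positive integer")

    components = config.get("components") or []
    if not components:
        errors.append("at least one component is required")
    for component in components:
        if not component.get("image"):
            errors.append(f"component {component.get('name')!r} has no image")

    # If an endpoint is defined, it must point at a declared component.
    endpoint = config.get("endpoint")
    if endpoint is not None:
        names = [component.get("name") for component in components]
        target = endpoint.get("component")
        if target not in names:
            errors.append(f"endpoint refers to unknown component {target!r}")

    return errors
```

Calling `validate_service_config(EXAMPLE_CONFIG)` returns an empty list; a config whose endpoint names a component that is not declared yields one error message.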
  16. 27.

    Deploy it!

    $ wl deploy autoscaler -f wonderland-autoscaler/wonderland.yaml
    autoscaler/1466583476 This is try 1
    autoscaler/1466583476 Updating ELB autoscaler-1466437217
    autoscaler/1466583476 Configuring health check HTTP:11011/v1/health
    autoscaler/1466583476 Enabling cross-zone load balancing
    autoscaler/1466583476 Configuring connection draining with a timeout of 180s
    autoscaler/1466583476 Not enabling access log
    autoscaler/1466583476 Letting autoscaler.example.com point to autoscaler-1363526915.eu-west-1.elb.amazonaws.com
    autoscaler/1466583476 Registered new ECS TaskDefinition (autoscaler:58) for service autoscaler
    autoscaler/1466583476 Updating ECS service autoscaler-1466437217
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 180s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 170s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 160s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 150s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 140s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 130s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 120s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 110s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 100s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 90s)
    autoscaler/1466583476 Waiting for service autoscaler-1466437217 to complete rolling update (timeout in 80s)
    autoscaler/1466583476 Rolling update completed successfully.
    autoscaler/1466583476 Waiting for ELB to have at least one healthy instance
    autoscaler/1466583476 Deleting old ECS Task Definition service-autoscaler:57
    autoscaler/1466583476 Marking deployment autoscaler/1466583476 active
    autoscaler/1466583476 [Boards] Creating Board for Service [werkzeugschmiede] autoscaler
    autoscaler/1466583476 [Datadog] Creating Deployment Event
    autoscaler/1466583476 [Notifications] Notification channel is /v1/teams/werkzeugschmiede/channels/autoscaler
    autoscaler/1466583476 [StatusCheck] CheckID is f85ded4d-9ad0-4375-81b4-5989964e8ed5
    autoscaler/1466583476 Deployment successful
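The deploy log shows the deployer polling the rolling update every 10 seconds against a 180-second budget. A minimal sketch of such a wait loop could look like the following. It is not the deployer's actual code: `wait_for_rolling_update`, its parameters, and the message wording are assumptions, and the stability check is passed in as a callable so that no real ECS call is implied.

```python
import time
from typing import Callable, List


def wait_for_rolling_update(
    is_complete: Callable[[], bool],
    timeout: int = 180,
    interval: int = 10,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> bool:
    """Poll is_complete() every `interval` seconds until it returns True
    or the `timeout` budget (in seconds) is used up."""
    remaining = timeout
    while True:
        if is_complete():
            log("Rolling update completed successfully.")
            return True
        if remaining <= 0:
            log("Rolling update timed out.")
            return False
        log(f"Waiting for rolling update to complete (timeout in {remaining}s)")
        sleep(interval)
        remaining -= interval


def demo() -> List[str]:
    """Simulate a service that reports completion on the 12th check,
    without actually sleeping, and return the collected log lines."""
    checks = iter([False] * 11 + [True])
    lines: List[str] = []
    wait_for_rolling_update(
        lambda: next(checks),
        sleep=lambda _seconds: None,  # skip real waiting in the demo
        log=lines.append,
    )
    return lines
```

Injecting `sleep` and `log` keeps the loop testable: `demo()` produces eleven waiting messages, counting down from 180s to 80s as in the log above, followed by the completion message.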
  17. 28.

    Monitor it!

    $ wl status autoscaler
    Current deployment: 1466583491
    Desired scale: 2

    Machine     Component   Status   Started                Deployment  ELB
    -------     ---------   ------   -------                ----------  ---
    i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST   1466583491  InService
    i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST   1466583491  InService

    $ wl logs -f autoscaler
    ...
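Status output in this shape can also be checked by a script, for example in a smoke test after a deployment. The sketch below parses the task rows from the text shown on the slide and verifies that enough tasks are running on the current deployment and are in service at the ELB. The helper names `parse_status_rows` and `is_healthy` are hypothetical, and the column layout is assumed to be exactly as in the slide.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_OUTPUT = """\
Current deployment: 1466583491
Desired scale: 2
Machine     Component   Status   Started                Deployment  ELB
-------     ---------   ------   -------                ----------  ---
i-7db992f7  autoscaler  RUNNING  22 Jun 16 11:14 CEST   1466583491  InService
i-fb2f5b77  autoscaler  RUNNING  24 Jun 16 01:13 CEST   1466583491  InService
"""


@dataclass
class TaskStatus:
    machine: str
    component: str
    status: str
    started: str
    deployment: str
    elb: str


def parse_status_rows(output: str) -> List[TaskStatus]:
    """Extract task rows; a row is assumed to start with an EC2 instance ID."""
    rows: List[TaskStatus] = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("i-"):
            continue  # header, separator, or summary line
        rows.append(
            TaskStatus(
                machine=fields[0],
                component=fields[1],
                status=fields[2],
                # The start time contains spaces, so take everything
                # between the first three and the last two columns.
                started=" ".join(fields[3:-2]),
                deployment=fields[-2],
                elb=fields[-1],
            )
        )
    return rows


def is_healthy(rows: List[TaskStatus], deployment: str, desired_scale: int) -> bool:
    """True if at least desired_scale tasks of the given deployment are
    RUNNING and InService."""
    good = [
        row
        for row in rows
        if row.status == "RUNNING"
        and row.elb == "InService"
        and row.deployment == deployment
    ]
    return len(good) >= desired_scale
```

For the sample above, `is_healthy(parse_status_rows(SAMPLE_OUTPUT), "1466583491", 2)` evaluates to `True`.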
  18. 29.
  19. 31.

    Improvements

    • Persistent disk storage
    • Dynamic load balancing
    • Long-running / memory-hungry jobs
    • Speed up ECS cluster rotation
    • Make crons more reliable
    • Outsource Docker registry