
Fancy Containers (Velocity SF 2018)

Abby Fuller
June 14, 2018

Transcript

  1. What we’re going to talk about
     • First things first
     • You’ve deployed your cluster: now what?
     • Speeding up deployments
     • Cleaning up after yourself
     • Fun with user-data
     • A little bit on monitoring/observability/debugging
  2. In a lot of cases, the first problem that crops up is deployments. As in, they’re SLOW. Lots of reasons this happens!
  3. Check your image sizes
     Larger images mean slower deploys (slower to build/push/pull). Image size is determined (mostly) by the number of layers your image has, and how large those layers are. A few tips (sketched below):
     • Use shared base images wherever possible
     • Limit the data written to the container layer
     • Chain RUN statements
     • Prevent cache misses at build for as long as possible
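To make the RUN-chaining tip concrete, here is a minimal sketch (the package name and image tag are placeholders I picked, not from the talk): chaining install and cleanup into one RUN keeps the package cache from being baked into its own layer.

# Chained RUN: install and cleanup happen in the same layer,
# so the yum cache never ends up in the final image
cat > Dockerfile <<'EOF'
FROM amazonlinux:2
RUN yum install -y unzip && \
    yum clean all && \
    rm -rf /var/cache/yum
EOF
docker build -t myapp:latest .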
  4. Smaller images mean understanding the cache
     Starting from the parent image, Docker will look at each following instruction to see if it matches the cached version. Only ADD and COPY will look at checksums of the files for a match; for every other instruction, only the string of the command is used, not the contents of any files. Once the cache is broken, every subsequent layer is built again (example below).
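A hedged sketch of the ordering side of this (assuming a Python app with a requirements.txt; the filenames are illustrative): put the instructions whose inputs rarely change first, so the cache only breaks at the layers that actually changed.

cat > Dockerfile <<'EOF'
FROM python:3.9-slim
WORKDIR /app
# COPY compares file checksums: this layer (and the pip install under it)
# only rebuilds when requirements.txt actually changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Source code changes constantly, so copy it last to break the cache as late as possible
COPY . .
EOF
docker build -t myapp .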
  5. How quickly your images are pulling from your registry is important for speeding up your deploys: smaller, lighter images can build and push more quickly during your CI/CD process, and pull more quickly during your deploys.
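Two stock Docker commands are enough to see where the weight is (the image name is a placeholder):

# Overall image size
docker images myapp
# Per-layer breakdown, with the instruction that created each layer
docker history myapp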
  6. “Secret”* ASG settings
     There are some less intuitive settings (UX!) that can impact a) how quickly your group can autoscale, and b) how long it takes to mark your containers healthy.
  7. What are these?
     • Default cooldown: how long the autoscaling group will wait before evaluating the scaling rule again
     • Health check grace period: the amount of time, in seconds, that the autoscaling group waits before health checking newly launched instances
     Both can be tuned from the CLI (sketch below).
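A hedged sketch of tuning both with the AWS CLI (the group name and values are made up, not recommendations from the talk):

# Shorter cooldown: the group re-evaluates scaling policies sooner after an action.
# Shorter grace period: new instances get health checked sooner after launch.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-ecs-asg \
  --default-cooldown 120 \
  --health-check-grace-period 60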
  8. Since services running through ECS and EKS are backed by autoscaling groups, changing these settings can cut down the amount of time it takes your service to a) scale, and b) become healthy after starting.
  9. Advanced health check settings can also influence your deployments! This happens through a few settings: healthy and unhealthy thresholds determine how many times your health check endpoint is tried before your container is declared healthy or unhealthy. The shorter you can make the interval between these checks, the faster your deployment will be. Be careful though: you need to give your application enough time to actually pass (sketch below).
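As a sketch, for an ALB target group via the AWS CLI (the ARN and numbers are placeholders): with these values, two fast, successful checks in a row and the target counts as healthy.

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
  --health-check-interval-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2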
  10. Garbage collection for containers is used to remove things like dangling/untagged images, containers, and volumes. Most orchestration platforms (like ECS/EKS, or Kubernetes) will do some garbage collection for you, but it’s not always enough. Too many unused images, containers, and volumes will:
     • Steal your disk space
     • Wake you up
     • Cost you cash money
  11. Tune the image cleanup parameters with ECS
     • With ECS, you can tune the options you have available for garbage collection.
     • If you want to clean up more aggressively, you can look at setting these ecs-agent options* (sketch below):
       ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
       ECS_IMAGE_CLEANUP_INTERVAL
       ECS_NUM_IMAGES_DELETE_PER_CYCLE
     * Don’t know how to set ecs-agent options? We’ll talk about it in a second!
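For example, a hedged sketch of /etc/ecs/ecs.config (the values are illustrative, not the talk’s recommendations): shorten the wait before stopped tasks are cleaned up, run image cleanup more often, and delete more images per cycle.

cat >> /etc/ecs/ecs.config <<'EOF'
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
ECS_IMAGE_CLEANUP_INTERVAL=10m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=10
EOF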
  12. Tune the image cleanup parameters with kubelet
     • Kubelet also has a cleanup function!
     • It can clean up images and containers, controlled through kubelet flags (sketch below)
     • Image collection is based on disk usage
     • Container collection is based on flags (or the defaults):
       • minimum-container-ttl-duration
       • maximum-dead-containers-per-container
       • maximum-dead-containers
     Just like with ECS, be careful not to collect so aggressively that you lose useful containers.
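A hedged sketch of what those flags look like on the kubelet command line (the values are illustrative; the image-gc thresholds are the disk-usage knobs mentioned above, and newer kubelet versions move these settings into the kubelet config file):

# ...plus whatever flags your cluster already passes to the kubelet
kubelet \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=1 \
  --maximum-dead-containers=100 \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=75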
  13. What happens when the defaults aren’t enough? You can:
     A) Tune (like we just talked about)
     B) Use a 3rd-party tool (like Spotify’s docker-gc)
     C) All of the above
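And as a blunt manual fallback (an assumption on my part, not something the talk prescribes), Docker’s own prune can be run by hand or from cron:

# Removes stopped containers, unused networks, unused images, and unused volumes.
# -a = all unused images (not just dangling ones), -f = no prompt.
# Aggressive: anything not in use by a running container is fair game, so time it carefully.
docker system prune -af --volumes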
  14. It’s also worth mentioning that unused images (and volumes and containers, you know the drill) aren’t the only way to lose your disk space. logrotate is also configurable: rotate away more files, or more often, or both. With Ubuntu, you can do that in /etc/logrotate.conf. For example, you could change monthly to daily (sketch below).
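A hedged sketch of that change (this assumes /etc/logrotate.conf currently contains a "monthly" line and a "rotate 4" line; both the assumption and the replacement values are illustrative):

# Rotate daily instead of monthly, and keep a week's worth of rotated logs
sudo sed -i 's/^monthly$/daily/' /etc/logrotate.conf
sudo sed -i 's/^rotate 4$/rotate 7/' /etc/logrotate.conf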
  15. *Configuring the ecs-agent, a belated footnote
     For ECS, you can customize quite a bit through the ecs-agent. This is good for things like customizing image cleanup, and changing how your instances interact with Docker and AWS (for example, changing Docker flags, or resource usage). The full list of options is available here. Options can be set in /etc/ecs/ecs.config.
  16. Sometimes, you might want to bring your own AMI to ECS or EKS (or Fargate, but that’s not possible yet). With EKS, this is easy! Just select your own when you start the cluster. For ECS, this is a little trickier. Fun fact: this is the icon for AMI.
  17. The easiest way to use ECS is with the ECS-optimized AMI, but life is too short to not have custom AMIs. A few steps to registering your own AMI with ECS:
     0. Make sure your AMI has the right requirements!
     1. Install ecs-agent
     2. Install the Docker daemon (if it’s not already installed)
     3. Register the instance with the cluster
     4. Optional (but good): some sort of init process to manage the ecs-agent
     Most of these options can be set in EC2 user-data (more on that in a sec).
  18. Your instance needs a few things before you can use it. If you’re using Amazon Linux, you need the correct role, and Docker installed. If you’re not using Amazon Linux, you have a few more steps to follow: documentation is here.
  19. Install and start the ecs-agent and Docker daemon:
     $ yum install -y ecs-init
     $ service docker start
     $ start ecs
     Full instructions here.
  20. Register your instance with the cluster
     With Amazon Linux, in /etc/ecs/ecs.config:
     ECS_ENABLE_TASK_IAM_ROLE=true
     ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
     ECS_LOGFILE=/log/ecs-agent.log
     ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
     ECS_LOGLEVEL=info
     ECS_CLUSTER=default
     Full instructions here.
  21. All of those options we just talked about (plus many more) can be scripted in EC2 user-data. To start with, here is a walkthrough on bootstrapping ECS container instances with user-data.
  22. You can do all kinds of stuff here, both with shell scripts and cloud-init directives. Or MIME multi-part files. Wild. User-data is good for everything from starting services, to configuring your environment, to enabling options and flags that aren’t supported in the UI. Scripts run at boot, and you can modify them through the EC2 Console (sketch below).
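Putting the last few slides together, a hedged sketch of a user-data shell script for an ECS container instance (the cluster name and the Docker flag are placeholders I picked, not the talk’s):

#!/bin/bash
# Join a (placeholder) cluster and enable task IAM roles before the agent starts
cat >> /etc/ecs/ecs.config <<'EOF'
ECS_CLUSTER=my-cluster
ECS_ENABLE_TASK_IAM_ROLE=true
EOF
# Sneak in a Docker daemon option the console doesn't expose; it takes effect
# the next time the Docker daemon (re)starts
echo 'OPTIONS="${OPTIONS} --log-opt max-size=10m"' >> /etc/sysconfig/docker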
  23. Speaking of unsupported flags, what happens when I want to use a Docker flag that’s not supported in the UI?
     $ echo "WHATEVER_FLAG=hi" >> /etc/sysconfig/docker
     An actual, candid reaction from the ECS team when I told them I do this.
  24. If it doesn’t make sense to you at 3am, it’s probably not helping much: a memoir.
  25. Just a few things to think about:
     • Reduce the noise
     • Page only on issues that require immediate, emergency attention
     • You need all of it: both monitoring AND observability (more than just logs!)
     • Make sure you’re asking and answering the right questions.
  26. Hear from the experts: Charity Majors (@mipsytipsy), on Twitter and on the honeycomb.io blog. And a great article from Cindy Sridharan (@copyconstruct) on the differences between monitoring and observability.