disclaimer: these opinions are my own • Previous systems management things: • IBM - storage management • Red Hat - wrote Cobbler, co-wrote Func, others • Puppet Labs - short stint helping with Product Management, but learned a lot • rPath - reproducible immutable systems before its time, and also way too complicated • Ansible - side project started 3 years ago, now 120k downloads/month on PyPI (reality = x4?) • Ansible, Inc - CTO. Ran all of Engineering/Strategy/Architecture, Ansible Tower, & the OSS project
basic manual → virt → basic private/public cloud → effective use of IaaS → immutable systems → metal PaaS → self-managing clusters? → SKYNET → robots who can build/rack HW
many things. • Originally, much of the interest was about automated tooling, regardless of the meaning of the phrase. This was “Infrastructure As Code” - sort of the “Software Craftsmanship” or “Test Engineering” of the sysadmin world: “I don’t just type in and click stuff”. • Then it became about communication/culture (conferences started rejecting tooling talks). • Some interesting parts are actually about Japanese auto manufacturing. • Pipelines: Continuous Integration, push-button (sometimes Continuous) Deployment. • There are often lots of cloud and monitoring bits. • Doesn’t matter. Let’s talk about “Ops”.
Application Deployment - your software • Orchestration - controlling the above over a network • Cloud Automation / Provisioning Systems • Image Build Systems / Continuous Integration / Deployment • Monitoring/Trending - Critical, Super Interesting, And We’re Not Talking About It Much Today
CFEngine (DSL) - the original config management system, very uncommonly chosen today, though still present in some large shops, but mired in complexity • Puppet (DSL) - first usable system, IMHO. Gained popularity during an incompatibility between CFEngine 2 and 3. • Chef (Ruby) - founded by Puppet users unhappy with ordering and other kinks in Puppet, who also wanted to write directly in Ruby. • Various others - Pallet (Clojure!), Salt (impure YAML), bcfg2 (XML) • Ansible (YAML) - focused on multi-node management and converging application deployment cases, over SSH versus a custom protocol/agent
Imperative: drive 2850 miles East (assuming you are in CA) • Declarative: be in North Carolina - just do it; if you’re already there, do nothing. • Idempotence: most misused word ever, but F(x) = F(F(x)). • Repeated re-application minimizes “drift”. • Drift is a phantom fear if you use your management tool to edit things properly. It’s real if you don’t. • Centralized sources of truth. Manage everything from ONE place.
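A minimal sketch of the contrast in Ansible-style YAML (the task names and the script path are hypothetical):

    # Imperative: runs the steps every single time - not idempotent by itself.
    - name: drive 2850 miles East
      command: /usr/local/bin/drive_east.sh   # hypothetical script

    # Declarative: state the destination; if already there, report "ok" and do nothing.
    - name: be in North Carolina
      yum: name=httpd state=present           # re-applying converges: F(x) = F(F(x))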
key resources: • Service - make this service be running, stopped, or disabled; possibly automatically restart when certain files/packages change. • Files - templates, copies, attributes, SELinux, etc. • Packages - install this package (usually yum/apt) and make sure it’s at the latest version, a specific version, or maybe just installed.
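A sketch of all three resource types in one Ansible play, using a hypothetical ntp example; the handler restarts the service only when the templated file actually changes:

    ---
    - hosts: all
      tasks:
        - name: package - installed (or state=latest for the newest version)
          yum: name=ntp state=present
        - name: file - template with owner/mode attributes
          template: src=ntp.conf.j2 dest=/etc/ntp.conf owner=root mode=0644
          notify: restart ntpd
        - name: service - running now and enabled at boot
          service: name=ntpd state=started enabled=yes
      handlers:
        - name: restart ntpd
          service: name=ntpd state=restarted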
not just single nodes, but nodes working in concert • Sometimes a disconnect between config tools and deployment tools (a large # of Puppet/Chef users also use Fabric/Capistrano) • Avoiding head-desking over common problems with other tools (for me). • Avoiding historical agent-fun (NTP, SSL, certs/CAs, has the agent crashed?, how do I upgrade it?, CPU/RAM drain)
“Push can’t scale.” Not true. 10k nodes is possible from one node with major caveats, but you must limit the total number of tasks (or use ansible-pull, which runs Ansible locally). Also - do you want to possibly break 10k nodes at once? Not usually. Updates should roll. (Talking to several hundred at once is totally reasonable; it will auto-loop.) • With push OR pull, anything doing 10k nodes can set your network on fire. Package mirror? First to fall. Don’t wget tarballs from people’s personal web space - a personal favorite misuse of automation (DDoS!) • Pull can actually create a thundering herd; historically, Puppet compilation was very CPU-bound
push means you can “do this now” on all nodes a bit more easily, without a separate system to tell the pull agents to “pull now”. • Quick to choreograph steps between tiers - maximum speed versus a 30m+30m+30m worst case for a 3-tier op (web+db+other, etc.); see the sketch below.
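A sketch of tier choreography in one push run; the group names and migration script are hypothetical. Plays run in order, so the db step finishes before the web tier moves:

    ---
    - hosts: dbservers
      tasks:
        - name: apply schema changes first
          command: /opt/myapp/migrate.sh    # hypothetical migration script

    - hosts: webservers
      tasks:
        - name: then restart the web tier against the new schema
          service: name=httpd state=restarted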
a strong type/provider model - skipped to save time early on. Not critical, but it would have been nice, and it’s now difficult (~300 modules in core) • Fewer modules in core - they take lots of time to support (but the large set is also really good for adoption!) • More focus on modular architecture (for maintenance/testing, not speed) earlier on, versus chasing large-scale contribution rates, which were really too much. The ‘v2’ effort now underway will take care of most of this, optionally enabling nodes to deploy at their own pace versus in lockstep.
• no agents - just log in over SSH (with ControlPersist). • deploys ‘modules’ as units of work, which are declarative/idempotent, etc., and emit JSON. • the language is just YAML - easier for machines to read/write, and good enough for people
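A minimal complete playbook - plain YAML all the way down; each task invokes a module that is pushed over SSH, runs remotely, and reports back JSON such as {"changed": true}:

    ---
    - hosts: webservers
      tasks:
        - name: keep openssl current
          yum: name=openssl state=latest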
what you test should be exactly what is deployed • avoid failures/surprises during install/upgrade/autoscale from: • a package updated or missing on the mirror • a network outage on the mirror • miscellaneous wget failures
every commit runs integration tests • tests are required for Continuous Integration • successful tests result in new image builds, if you are going an image-based route that you’ll use later in deployment (recommended); see the sketch below
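A sketch of the “tests passed, bake an image” step using the ec2_ami module; the instance id and build number variables are hypothetical and would come from your CI job:

    ---
    - hosts: localhost
      connection: local
      tasks:
        - name: bake a new AMI from the instance that just passed tests
          ec2_ami:
            instance_id: "{{ tested_instance_id }}"   # hypothetical CI-supplied variable
            name: "myapp-{{ build_number }}"          # hypothetical CI-supplied variable
            wait: yes
          register: baked_ami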
not everyone can get to Continuous Deployment. • Getting to “frequent deployment” is good. • Automated rollout from a button, or automatically from Jenkins, upgrading all nodes in the cloud/system. • Relies on orchestration tooling - and either images (better) or running config automation. Often uses load balancers to take upgrading nodes offline or to swap old instances out for newer ones (sketch below).
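A sketch of a rolling deployment: serial: 1 upgrades one node at a time, and the load balancer steps are placeholders you would swap for your LB’s module or API:

    ---
    - hosts: webservers
      serial: 1
      pre_tasks:
        - name: pull this node out of the load balancer (placeholder)
          command: /usr/local/bin/lb_remove {{ inventory_hostname }}   # hypothetical script
          delegate_to: lb_host                                         # hypothetical LB host
      tasks:
        - name: upgrade the application (or swap in a new image instead)
          yum: name=myapp state=latest
      post_tasks:
        - name: put the node back in the load balancer (placeholder)
          command: /usr/local/bin/lb_add {{ inventory_hostname }}
          delegate_to: lb_host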
monitoring completes the pipeline. • Not just about reporting failure: detect trends before they become problems (slow queries, resource issues, space issues, etc.). • Hosted monitoring is growing popular because the monitoring system stays available in the event of a crash of your infrastructure. • Log-file analysis is growing popular; ELK and others are rising up because of the high cost of proprietary options (ex: Splunk).
maybe I travel in the wrong circles, so I could be wrong; I want this to be very much alive. • The PaaS assumption: I just want my code to run in the cloud - give me however many instances I need and don’t make me see them. Classic automation then supports bringing up the PaaS and stops. “Just let me be a developer.” • Hard for existing apps; may be great for green-field (but sometimes expensive).
containers are exciting to the cloud-image-based crowd. Personally I think they’re most interesting for blue/green upgrades. • Sometimes confusing, as some are running additional “cloud” software on top of another cloud. • The best/most reliable management software for the future is not entirely certain yet (Mesos/Fleet/OpenStack/Kubernetes/other).
Ansible is about making some earlier automation-tool concepts more accessible, and it mostly succeeds at this. The YAML language is not great, but it’s quick. Fewer moving parts is a huge win and makes the tooling accessible to audiences that struggled with previous efforts, which is why it’s so widely deployed. Still, it’s a stepping stone towards immutable - with various shops at various points on that journey. Many have enough other things to deal with that they aren’t ready for that now. • Still, IT systems are evolving towards immutable (image-based) and PaaS-enabling systems over time, particularly in leading-edge shops or new ventures. Progress is good! Various things are still getting refined. Containers will help, but right now they also add a degree of complexity. Just building AMIs if you’re on Amazon is a great start. Immutable systems mean you can skip learning automation languages, which is nice (ex: Dockerfiles), but you likely still need automation to deploy your container management system itself. • The infrastructure and nuts and bolts behind the apps, cloud, and network will matter less to more people over time. More and more, intent can be coded, rather than form and common building blocks. • True IaaS applications are significantly different, and should be written differently. Flexibility, lock-in, and cost are traded for better reliability, scalability, and ease of management. Write apps for the business, instead of reinventing the same wheels everyone has to invent. • Many more people are writing code in Ops land. Will “10 years in AWS services” be the new “10 years in J2EE” for Ops professionals?