Slide 1

Slide 1 text

Terraform Orchestration at Cloud-Scale

Slide 2

Slide 2 text

@armon Armon Dadgar

Slide 3

Slide 3 text

Towards a software-managed datacenter HashiCorp hashicorp.com I work at HashiCorp, which is a DevOps tooling company, and we work on moving towards a software-managed datacenter.

Slide 4

Slide 4 text

You may recognize some of our tools: we make Vagrant, Packer, Serf, Consul, and Terraform.

Slide 5

Slide 5 text

Overview 1. Orchestration 2. Origin Story 3. Modern Datacenter 4. Complexity of “Cloud-Scale” Today we are talking about Orchestration at Cloud-Scale. This raises a few questions: what do we mean by orchestration, where did we start, what is the modern datacenter, and how do we manage the increasing complexity of cloud-scale?

Slide 6

Slide 6 text

Orchestration? The word “orchestration” gets thrown around a lot, but it’s not always clear what is meant by it.

Slide 7

Slide 7 text

Orchestration • Infrastructure Lifecycle • Acquisition • Provisioning • Updating • Destroying When we talk about orchestration, we are talking about managing the lifecycle of infrastructure. The lifecycle consists of four phases. The first is acquisition. This could be calling a hardware vendor or making an API call to EC2. Next is provisioning, which means installing an OS, running configuration management tools, and initializing the resource for production use. Once a resource is provisioned, at some point in its life it will likely require updating. Lastly, when pulling something out of service, it must be destroyed. These phases form the lifecycle of a given resource, which could be a network switch, server, load balancer, etc. This entire process is what we refer to as orchestration.

Slide 8

Slide 8 text

In the beginning, there was only one. One server, that is. Acquisition wasn’t really an issue, because you only needed that one. It had a caretaker, and everything was done manually, because ultimately it wasn’t something you needed to do repeatably.

Slide 9

Slide 9 text

Shortly after, the datacenter evolved to many more servers. You had a room with a handful of servers, and they were lovingly tended to. Of course, this was done by a select group of sysadmins who were the ultimate gatekeepers.

Slide 10

Slide 10 text

Datacenter of Yore • Acquisition: O(Weeks) • Provisioning: Manual • Updating: Manual • Destroying: Manual Of course, the datacenter wasn’t exactly a panacea. Acquisition of hardware was slow, on the order of weeks or months. You had to actually call Dell and wait for them to physically ship servers to you, then wait for the sysadmins to rack them and do the initial setup. Any provisioning or changes to configuration were manual. This meant things were rather slow, tedious, and error prone.

Slide 11

Slide 11 text

Around 2006, there was a major tectonic shift: The Cloud. Less ambiguously, we saw the rise of elastic compute. Cheap, on-demand, and infinitely scalable. Instead of calling Dell and waiting for weeks, a server is just an API call away.

Slide 12

Slide 12 text

Elastic Compute • Specialization of Labor • On-Demand Utility • Minutes vs. Months • CapEx vs. OpEx • SaaS It’s hard to overstate the impact of elastic compute. Even today, we are still just beginning to appreciate the changes. Instead of forcing every company to have domain knowledge in datacenter management, procurement, and financing, we can now treat compute like a utility: cheap, always on, on demand. It also shifted the financial burden from being a capital expense to an operational expense. This has lowered the barriers to entry and enabled a whole new class of innovation in SaaS.

Slide 13

Slide 13 text

The Cloud Wars It wasn’t long, however, until the start of the cloud wars. It’s hard to keep track of the number of options these days, but luckily, as they all compete, we are experiencing a race to the bottom, where we have access to incredible amounts of compute time for practically nothing.

Slide 14

Slide 14 text

On-Premise Cloud At the same time, we see the rise of the on-premise cloud. Companies willing to make the capital expenditure want to have access to the same sort of API driven elasticity as the public clouds.

Slide 15

Slide 15 text

Zero to 60? • Acquisition: O(Minutes) • Provisioning? It is clear that one of the biggest advantages of elastic compute is the acquisition of new resources. We have gone from waiting months to minutes. Now we are spinning up servers left and right, and we must answer the provisioning question: how do we set up these servers?

Slide 16

Slide 16 text

Provisioning • Manual • Bash • Perl • Knowledge Silos Clearly, we’ve been deploying applications for decades, so the problem is not new. However, the tooling that is available has improved dramatically. It used to be that server setup was done manually by SSH’ing in and running commands. Some of this work was automated using Bash or Perl. However, there was generally a knowledge silo: those administering the servers knew the magic incantations, and everybody else didn’t.

Slide 17

Slide 17 text

Config Management • Chef, Puppet, Salt, Ansible • Higher Level Abstractions • Codify Knowledge • Automate • Faster • Fewer Errors Luckily, around the same time elastic compute was on the rise, we saw broader development and adoption of configuration management tools. These tools provided a higher level of abstraction and allowed knowledge to be codified. Because that knowledge was expressed as code, it could be automated, which was faster and less error prone than previous approaches.

Slide 18

Slide 18 text

Containerization • OS-agnostic Packaging • Simplify Delivery • Config Management Even more recently, containerization has become popular. It provides an OS-agnostic way to package applications and simplifies delivery. We can still make use of our existing configuration management tools to build our containers as well.

Slide 19

Slide 19 text

Ultimately, by using both elastic compute and configuration management, we are able to quickly provision new resources and reliably configure them into our desired state.

Slide 20

Slide 20 text

SaaS Outsourcing • Better, Faster, Cheaper • Wu Wei: “Doing without Action” • Outsource to SaaS • Specialize on Value Add Just because we are now doing things better, faster, and cheaper doesn’t mean we want to be doing it all ourselves. Instead, SaaS products allow us to outsource pieces of our infrastructure and to specialize and focus on our value add.

Slide 21

Slide 21 text

SaaS • Diversity and Competition • DNS: Route53, Zerigo, DynDNS, SimpleDNS • CDN: CloudFlare, CloudFront, Akamai, Fastly • Monitoring: DataDog, Librato, Pingdom, PagerDuty • Email: SendGrid, SES, MailChimp • … When evaluating SaaS, there are dozens of options for any possible service. This is great: we get competition, innovation, and lower costs. The only problem is that we end up with a very diverse ecosystem. While great for evolution, this diversity comes with a complexity cost.

Slide 22

Slide 22 text

In fact, this rise of complexity is the general trend. We are seeing new platforms, services and tools all the time. This is leading to an increasingly heterogeneous environment in the modern datacenter.

Slide 23

Slide 23 text

Modern Datacenter • Hybrid of Physical, On-Premise, and Public Cloud • Globally Distributed • Matrix of Provisioning Systems • Integrated PaaS and SaaS All of this leads us to the modern datacenter. It is now entirely common to see hybrid deployments across physical, on-premise, and public cloud. These sites are usually globally distributed to more efficiently deliver to customers. Most organizations have adopted multiple provisioning systems, and have selectively adopted PaaS and SaaS.

Slide 24

Slide 24 text

All this raises the question: how is an operator of modern infrastructure supposed to manage the complexity while increasing the efficiency of application delivery?

Slide 25

Slide 25 text

Operators • Understand Infrastructure • Adopt New Technology • Deploy Efficiently So a modern operator is faced with many problems. First, they need to understand the increasingly complex infrastructure. They need to move quickly to adopt new technology. And most importantly, they need to help developers deliver applications quickly and efficiently.

Slide 26

Slide 26 text

Move Fast Without Breaking Things In sum, how can we move fast without breaking things?

Slide 27

Slide 27 text

To that end, we’ve built a tool called Terraform.

Slide 28

Slide 28 text

Terraform Goals • One Workflow, Technology Agnostic • Modern Datacenter • PaaS and SaaS first class • Operator First With Terraform, we had a number of goals. We wanted a single workflow to unify the heterogeneous technologies being deployed. We wanted to support the modern datacenter, meaning physical, virtual, and containerized infrastructure must all be supported. It also means PaaS and SaaS solutions are first class. But most importantly, we put the operator first.

Slide 29

Slide 29 text

Infrastructure as Code • Description of Desired State • Management Automation • Infrastructure Documented • Knowledge Sharing To address all our goals, we needed to treat infrastructure as code. This means that our configuration should describe our desired state. The burden of change management should be handled by the tooling, not the operator: computers are very good at managing dependencies and parallelization, humans not so much. More importantly, our infrastructure is now documented. This means anyone can reference the source to figure out the current state of the world. It also means that knowledge sharing can take place: once you know the Terraform workflow, you can deploy anything from a Redis cluster to your internal web app.

Slide 30

Slide 30 text

Flexibility • Must support the entire Datacenter • PaaS and SaaS • Networks • Storage However, for infrastructure as code to work, it must be flexible enough to represent any resource in the datacenter. It is not enough to only capture servers, especially with the increasing prevalence of PaaS and SaaS. Most infrastructure also relies on being able to configure networks and storage to meet the needs of our application.

Slide 31

Slide 31 text

HCL • Based on libucl • Human readable • JSON interoperable All of this led us to develop HCL. HCL is based closely on libucl. It is meant to be human readable, since as operators we want to be able to understand our own infrastructure. It is also JSON interoperable, allowing automated tooling to interact with Terraform.

Slide 32

Slide 32 text

Here is an example of HCL in Terraform. In this small example, we are defining a droplet, which is a unit of compute for DigitalOcean. We provide a number of attributes which define the type, size, and region of our instance. We can also define a DNS record for the DNSimple provider. As an operator, we don’t need to know the API for these services or click around in a web UI. Instead, we describe what we want to Terraform and let it automatically handle the creation.
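
The slide image itself is not reproduced here; the following is a minimal sketch of the kind of configuration described above, with the image, size, region, and domain values assumed for illustration.

# Sketch: a DigitalOcean droplet plus a DNSimple record pointing at it.
# All literal values are illustrative, not taken from the slide.
resource "digitalocean_droplet" "web" {
  image  = "ubuntu-14-04-x64"
  name   = "web-1"
  region = "nyc2"
  size   = "512mb"
}

resource "dnsimple_record" "web" {
  domain = "example.com"
  name   = "web"
  type   = "A"

  # Interpolating the droplet's IP is what creates the dependency between
  # the DNS record and the compute instance discussed on the next slides.
  value = "${digitalocean_droplet.web.ipv4_address}"
}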

Slide 33

Slide 33 text

Resource Graph • Dependency Management • Change Ordering • Parallelization • Visualization We use HCL to describe to Terraform what we want our infrastructure to look like, but not how to make it happen. That is the responsibility of Terraform. Under the hood, Terraform makes use of a resource graph. The resource graph is used to represent the dependencies between resources. In the previous example, our DNS record depended on the IP address of our new compute instance. By using a resource graph, Terraform is able to automatically determine the order in which resources should be created, updated, and destroyed. It also allows Terraform to automatically parallelize where possible. Lastly, it can be visualized to help operators understand the interconnections in their infrastructure.

Slide 34

Slide 34 text

As an example, this is the resource graph that was created for our little example. The DNS record depends on both the droplet and the DNSimple provider. The droplet only relies on the DigitalOcean provider.

Slide 35

Slide 35 text

Providers • Integration Point • Expose Resources • CRUD API • Core vs Providers Providers are the integration point between Terraform Core and the outside world. They expose resources to Terraform and allow them to be described using HCL. The providers themselves only need to satisfy a simple CRUD API. Having the split between Terraform Core and providers allows the core to contain all the complex logic, while making it simple to integrate new providers.

Slide 36

Slide 36 text

Composition • “Layer Cake”: Physical → Virtual (OpenStack) → Containers • Provider for each Layer • Unified Configuration • “terraform apply” Another major advantage of this design is that we can compose providers together. This allows us to manage the “layer cake” of infrastructure. Using Terraform, we can incrementally transform physical hardware into an OpenStack cluster, provision virtual machines, and deploy containers on top, as sketched below. The best part is that we don’t need to figure out how to piece together a jigsaw of tools. Instead we use a unified configuration language, HCL, and do all the orchestration with a single “terraform apply”. This allows operators to adopt a single workflow for any underlying technology.
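
As a rough illustration of composing two layers in a single configuration, here is a hedged sketch assuming the OpenStack and Docker providers are available; the resource names follow those providers, but the argument values are assumptions.

# Hypothetical sketch: a VM on the OpenStack layer and a container on top.
# Flavor, image, and container values are placeholders for illustration.
resource "openstack_compute_instance_v2" "docker_host" {
  name        = "docker-host-1"
  image_name  = "ubuntu-14.04"
  flavor_name = "m1.small"
}

resource "docker_image" "nginx" {
  name = "nginx:latest"
}

resource "docker_container" "web" {
  name  = "web"
  image = "${docker_image.nginx.latest}"
}

# A single "terraform apply" then orchestrates both layers in dependency order.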

Slide 37

Slide 37 text

Entropy • Initial Construction Simple • Changes? • Manual • Automated It’s great that we can use Terraform to manage the construction of this initial layer cake of infrastructure. Unfortunately, nothing lasts forever. At some point, the infrastructure needs to be modified and updated. Entropy, the enemy of stability, rears its head. So the question becomes how we manage these changes. When we have a few dozen servers, it may be fine to handle changes manually. Eventually, though, we are talking about hundreds or thousands of machines across datacenters. At that point, this is usually the answer.

Slide 38

Slide 38 text

This is the typical answer to change orchestration. In a complex infrastructure, operators are forced to “divine” the changes needed. This is an almost impossible task in a modern datacenter.

Slide 39

Slide 39 text

Measure Twice, Cut Once • Hard to “divine” changes • Complex inter-dependencies • Separate planning from execution • Strict Plan Application However, Terraform has a much better approach to this problem. We are firm believers that it is better to measure twice and cut once. In a complex infrastructure it becomes very hard to divine the changes, especially when we account for the complex inter-dependencies of services. The Terraform solution is to separate the planning phase from the execution phase. This allows operators to inspect the execution plan and verify it is doing what they intend, or better yet, catch an unexpected action. Once the plan is verified, it can be applied, and Terraform ensures only the changes in the plan take place.

Slide 40

Slide 40 text

Configuration + State → Execution Plan → Plan Application → State′ This gives operators a way to measure twice and cut once. This is essential to moving fast without breaking things.

Slide 41

Slide 41 text

Terraform • Infrastructure as Code • Compose Providers • Safely Iterate • Technology Agnostic In summary, Terraform is a tool for infrastructure orchestration. It allows infrastructure to be expressed as code, while composing IaaS, PaaS, and SaaS together. Its use of a resource graph and planning phase allows operators to safely iterate without any unexpected actions. Most importantly, Terraform is designed to be technology agnostic. This allows it to manage the modern datacenter of today, and the datacenter of tomorrow.

Slide 42

Slide 42 text

Operators! • Unified Configuration • Atlas for Infrastructure • One Tool • Future Proof What does this mean for the operators of Terraform? Terraform provides a single unified configuration that can be used to represent the entire infrastructure. This can be used not only to act on the infrastructure, but also as an atlas of it. Best of all, it’s only a single tool to manage it all. The alternative is to piece together a mosaic of tools spanning every technology used today, plus every new piece being added daily.

Slide 43

Slide 43 text

End-Users • Ops maintains Infrastructure • Devs responsible for application • Self-Serve! The benefits of Terraform actually extend past the operators. While only a few people are responsible for the core infrastructure and maintenance of the Terraform configuration, the workflow is simple enough for end users to deploy application changes while treating Terraform as a black box. This allows Terraform to be used as part of a self-serve pipeline.

Slide 44

Slide 44 text

Self-Serve • Layer Cake: Physical → Virtual (OpenStack) → Containers, split between Ops and Dev • Decompose • Delegate • Deploy Just as before, if we think about our infrastructure as a layer cake (you can tell I’m very fond of cake), then we can decompose and delegate responsibility for each layer. For example, we might have a centralized ops team responsible for the physical substrate, but then allow developers to provision and deploy the virtualized and containerized workloads using a self-serve model.

Slide 45

Slide 45 text

Modules • Abstract Infrastructure Components • Higher Level Reasoning • Re-Use Configuration (DRY) In addition to allowing self-serve, we want to make it easier to extract and abstract chunks of our infrastructure. This allows developers and operators to reason about their application architecture at a higher level while re-using configuration. This makes it easier to deploy new applications using existing templates for infrastructure.

Slide 46

Slide 46 text

From the perspective of a module developer, there is nothing different from standard Terraform configuration.
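
For instance, a module body is just ordinary resources, variables, and outputs; the module name and values below are purely illustrative.

# Hypothetical module "web": plain Terraform configuration with inputs and outputs.
variable "name" {}

variable "region" {
  default = "nyc2"
}

resource "digitalocean_droplet" "web" {
  image  = "ubuntu-14-04-x64"
  name   = "${var.name}"
  region = "${var.region}"
  size   = "512mb"
}

# Outputs are how the module exposes values back to its caller.
output "address" {
  value = "${digitalocean_droplet.web.ipv4_address}"
}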

Slide 47

Slide 47 text

But from the perspective of a user, modules provide a “black box”-like mechanism for adding the functionality they need without worrying about the details of orchestrating those components.
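
A consumer might then pull that module in with a few lines, knowing only its inputs and outputs; the source path here is a made-up example.

# Using the hypothetical "web" module as a black box.
module "frontend" {
  source = "./modules/web"
  name   = "frontend"
}

# Downstream configuration can reference module outputs, e.g.
# "${module.frontend.address}", without knowing what is inside.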

Slide 48

Slide 48 text

“Cloud-Scale” • Complex Infrastructure • Shared Infrastructure • Dev vs Ops • Decentralized Development This brings us to the modern cloud-scale infrastructure. We’ve already seen how the complexity of a modern datacenter has grown. This infrastructure is also shared, and is a common fabric for the applications of an organization. However, there is a split between dev and ops. While we generally have centralized operations teams, development is often decentralized.

Slide 49

Slide 49 text

Continuous Delivery • Reduce Friction • Self-Serve • Ready for Change In this world, how do we continue to focus on application delivery? Given the split between ops and dev, we need to try to reduce the friction of application delivery, and move towards elastic self-serve. And most importantly we need to be ready to embrace change.

Slide 50

Slide 50 text

Terraform Pipeline • Split interactions with Terraform • Operations: “substrate” and modules • Developers: plug-and-play • Move Fast Without Breaking Things As we’ve seen, we can do all of this by building a pipeline around Terraform. We split the interactions between ops and dev. The operations team focuses on the infrastructure substrate and on writing re-usable modules for that substrate. Developers can then use Terraform in a plug-and-play model, using the shared infrastructure and simply importing ready-to-use modules. Together, this allows an organization to move fast without breaking things.

Slide 51

Slide 51 text

Conclusion • One Tool • Enable Operators (and Users!) • Tame the Complexity Curve In conclusion, Terraform provides a single tool with a unified configuration language. It can be used to enable the operators of infrastructure as well as the end users. But most importantly, it helps us to tame the complexity curve, and hopefully allows us to unbuckle our seat belts.

Slide 52

Slide 52 text

Thanks! Questions?