Terraform: Orchestration at Cloud-Scale

@armon Armon Dadgar

Towards a software-managed datacenter • HashiCorp • hashicorp.com
I work at HashiCorp, a DevOps tools company working towards a software-managed datacenter.

You may recognize some of our tools: we make Vagrant, Packer, Serf, Consul and Terraform.

Overview 1. Orchestration 2. Origin Story 3. Modern Datacenter 4. Complexity of “Cloud-Scale”
Today we are talking about orchestration at cloud-scale. This raises a few questions: what do we mean by orchestration, where did we start, what is the modern datacenter, and how do we manage the increasing complexity of cloud-scale?

Orchestration?
The word orchestration gets thrown around a lot, but it’s not always clear what is meant by it.

Orchestration • Infrastructure Lifecycle • Acquisition • Provisioning • Updating • Destroying
When we talk about orchestration, we are talking about managing the lifecycle of infrastructure. The lifecycle consists of four phases. The first is acquisition. This could be calling a hardware vendor or making an API call to EC2. Next is provisioning, which means installing an OS, running configuration management tools, and initializing the resource for production use. Once a resource is provisioned, at some point in its life it will likely require updating. Lastly, when a resource is pulled out of service, it must be destroyed. These phases form the lifecycle of a given resource, which could be a network switch, server, load balancer, etc. This entire process is what we refer to as orchestration.

In the beginning, there was only one. One server, that is. Acquisition wasn’t really an issue, because you only needed that one. It had a caretaker and everything was done manually, because ultimately it wasn’t something you needed to do repeatably.

Shortly after, the datacenter evolved to many more servers. You had a room with a handful of servers and they were lovingly tended to. Of course, this was done by a select group of sysadmins who were the ultimate gatekeepers.

Datacenter of Yore • Acquisition: O(Weeks) • Provisioning: Manual • Updating: Manual • Destroying: Manual
Of course, the datacenter wasn’t exactly a panacea. Acquisition of hardware was slow, on the order of weeks or months. You had to actually call Dell and wait for them to physically ship the servers to you. Then you had to wait for the sysadmins to rack them and do the initial setup. Any provisioning or configuration changes were manual. This meant things were rather slow, tedious, and error prone.

Around 2006, there was a major tectonic shift: The Cloud.
Less ambiguously, we saw the rise of elastic compute. Cheap, on-demand, and infinitely scalable. Instead of calling Dell and waiting for weeks, a server is just an API call away.

Elastic Compute • Specialization of Labor • On-Demand Utility • Minutes vs. Months • CapEx vs OpEx • SaaS
It’s hard to overstate the impact of elastic compute. Even today, we are still just beginning to appreciate the changes. Instead of forcing every company to have domain knowledge in datacenter management, procurement and financing, we can now treat compute like a utility. Cheap, always on, on demand. It also shifted the financial burden from being a capital expense to an operational expense. This has lowered the barriers to entry and enabled a whole new class of innovation in SaaS.

The Cloud Wars
It wasn’t long, however, until the start of the cloud wars. It’s hard to keep track of the number of options these days, but luckily, as they all compete, we are experiencing a race to the bottom, where we have access to incredible amounts of compute time for practically nothing.

On-Premise Cloud
At the same time, we see the rise of the on-premise cloud. Companies willing to make the capital expenditure want to have access to the same sort of API-driven elasticity as the public clouds.

Zero to 60? • Acquisition: O(Minutes) • Provisioning?
It is clear that one of the biggest advantages of elastic compute is the acquisition of new resources. We have gone from waiting months to waiting minutes. But now that we are spinning up servers left and right, we must answer the provisioning question. How do we set up these servers?

Provisioning • Manual • Bash • Perl • Knowledge Silos
Clearly, we’ve been deploying applications for decades, so the problem is not new. However, the tooling that is available has improved dramatically. It used to be that server setup was done manually by SSH’ing in and running commands. Some of this work was automated using Bash or Perl. However, there was generally a knowledge silo: the administrators knew the magic incantations and everybody else didn’t.

Config Management • Chef, Puppet, Salt, Ansible • Higher Level Abstractions • Codify Knowledge • Automate • Faster • Less Errors
Luckily, around the same time elastic compute was on the rise, we saw broader development and adoption of configuration management tools. These tools provided a higher level of abstraction. They also allowed knowledge to be codified. Because that knowledge was now code, it could be automated, which was faster and less error prone than previous approaches.

Containerization • OS-agnostic Packaging • Simplify Delivery • Config Management
Even more recently, containerization has been gaining popularity. It provides an OS-agnostic way to package applications, and simplifies delivery. We can still make use of our existing configuration management tools to build our containers as well.

Ultimately, by using both elastic compute and configuration management, we are able to quickly provision new resources and reliably configure them into our desired state.

SaaS Outsourcing • Better, Faster, Cheaper • Wu Wei: “Doing without Action” • Outsource to SaaS • Specialize on Value Add
Just because we are now doing things better, faster and cheaper doesn’t mean we want to be doing it all. Instead, SaaS products allow us to outsource pieces of our infrastructure so that we can specialize and focus on our value add.

SaaS • Diversity and Competition • DNS: Route53, Zerigo, DynDNS, SimpleDNS • CDN: CloudFlare, CloudFront, Akamai, Fastly • Monitoring: DataDog, Librato, PingDom, PagerDuty • Email: SendGrid, SES, MailChimp • …
When evaluating SaaS, there are dozens of options for any given service. This is great: we get competition, innovation and lower costs. The only problem is we end up with a very diverse ecosystem. While great for evolution, this diversity comes with a complexity cost.

In fact, this rise of complexity is the general trend.
We are seeing new platforms, services and tools all the time. This is leading to an increasingly heterogeneous environment in the modern datacenter.

Modern Datacenter • Hybrid of Physical, On-Premise / Public Cloud • Globally Distributed • Matrix of Provisioning Systems • Integrated PaaS and SaaS
All of this leads us to the modern datacenter. It is now entirely common to see hybrid deployments of physical, on-premise and public cloud. These sites are usually globally distributed to more efficiently deliver to customers. Most organizations have adopted multiple different provisioning systems throughout their organization, and have selectively adopted PaaS and SaaS.

All this raises the question: how is an operator of modern infrastructure supposed to manage the complexity while increasing the efficiency of application delivery?

Operators • Understand Infrastructure • Adopt New Technology • Deploy Efficiently
So a modern operator is faced with many problems. Firstly, they need to understand the increasingly complex infrastructure. They need to move quickly to adopt new technology. And most importantly, they need to help developers deliver applications quickly and efficiently.

Move Fast Without Breaking Things
In sum, how can we move fast without breaking things?

To that end, we’ve built a tool called Terraform.

Terraform Goals • One Workflow, Technology Agnostic • Modern Datacenter • PaaS and SaaS first class • Operator First
With Terraform, we had a number of goals. We wanted to have a single workflow to unify the heterogeneous technologies being deployed. We wanted to support a modern datacenter, meaning physical, virtual and containerized infrastructure must be supported. It also means PaaS and SaaS solutions are first class. But most importantly, we put the operator first.

Infrastructure as Code • Description of Desired State • Management Automation • Infrastructure Documented • Knowledge Sharing
To address all our goals, we needed to treat infrastructure as code. This means that our configuration should describe our desired state. The burden of change management should be handled by the tooling and not the operator. Computers are very good at managing dependencies and parallelization, humans not so much. More importantly, however, our infrastructure is now documented. This means anyone can reference the source to figure out the current state of the world. It also means that knowledge sharing can take place. Once you know the Terraform workflow, you can deploy anything from a Redis cluster to your internal web app.

Flexibility • Must support the entire Datacenter • PaaS and SaaS • Networks • Storage
However, for infrastructure as code to work, it must be flexible enough to represent any resource in the datacenter. It is not enough to only capture servers, especially with the increasing prevalence of PaaS and SaaS. Most infrastructure also relies on being able to configure networks and storage to meet the needs of our application.

HCL • Based on libucl • Human readable • JSON interoperable
All of this led to us developing HCL, the HashiCorp Configuration Language. HCL is based closely on libucl. It is meant to be human readable, since as operators we want to be able to understand our own infrastructure. It is also JSON interoperable, allowing automated tooling to be used to interact with Terraform.

Here is an example of HCL in Terraform. In this
small example, we are defining a droplet, which is a unit of compute for DigitalOcean. We provide some number of attributes which define the type, size, and region of our instance. We also can define a DNS record for the DNSimple provider. As an operator, we don’t need to know the API for these services or click around in a Web UI. Instead, we describe what we want to Terraform, and let it automatically handle the creation.

Resource Graph • Dependency Management • Change Ordering • Parallelization • Visualization
We use HCL to describe to Terraform what we want our infrastructure to look like, but not how to make it happen. This is the responsibility of Terraform. Under the hood, Terraform makes use of a resource graph. The resource graph is used to represent the dependencies between resources. In the previous example, our DNS record depended on the IP address of our new compute instance. By using a resource graph, Terraform is able to automatically determine the order in which resources should be created, updated and destroyed. It also allows Terraform to automatically parallelize where possible. Lastly, it can be visualized to help operators understand the interconnections in their infrastructure.

As an example, this is the resource graph that was
created for our little example. The DNS record depends on both the droplet and the DNSimple provider. The droplet only relies on the DigitalOcean provider.
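The ordering and parallelization described above can be sketched with a plain topological sort. This is only an illustration and not Terraform's implementation: the node names are hypothetical stand-ins for the droplet, the DNS record and their two providers, and the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical node names standing in for the talk's example.
# Each key maps to the set of nodes it depends on.
DEPENDENCIES = {
    "provider.digitalocean": set(),
    "provider.dnsimple": set(),
    "droplet.web": {"provider.digitalocean"},
    "dns_record.hello": {"droplet.web", "provider.dnsimple"},
}


def creation_batches(dependencies):
    """Group nodes into batches; nodes within a batch have no
    dependencies on each other and could be handled in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def destruction_batches(dependencies):
    """Destroy in the reverse of creation order, dependents first."""
    return list(reversed(creation_batches(dependencies)))


if __name__ == "__main__":
    for step, batch in enumerate(creation_batches(DEPENDENCIES), start=1):
        print(step, batch)
```

Both providers land in the first batch, the droplet in the second, and the DNS record last, since it needs the droplet's IP address.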

Providers • Integration Point • Expose Resources • CRUD API • Core vs Providers
Providers are the integration point between Terraform Core and the outside world. They expose resources to Terraform and allow them to be described using HCL. The providers themselves only need to satisfy a simple CRUD API. Having the split between Terraform Core and Providers allows the core to contain all the complex logic, while making it simple to integrate new providers.
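To make the "simple CRUD API" idea concrete, here is a minimal sketch of what such a contract might look like. The `Provider` and `InMemoryProvider` classes and their method signatures are hypothetical and written for illustration only; they are not Terraform's actual provider interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class Provider(ABC):
    """Hypothetical CRUD contract; not Terraform's real plugin API."""

    @abstractmethod
    def create(self, attrs: Dict[str, str]) -> str:
        """Create a resource and return its ID."""

    @abstractmethod
    def read(self, resource_id: str) -> Optional[Dict[str, str]]:
        """Return the resource's attributes, or None if it is gone."""

    @abstractmethod
    def update(self, resource_id: str, attrs: Dict[str, str]) -> None:
        """Change attributes of an existing resource."""

    @abstractmethod
    def delete(self, resource_id: str) -> None:
        """Destroy the resource."""


class InMemoryProvider(Provider):
    """Toy provider that keeps resources in a dict instead of calling an API."""

    def __init__(self) -> None:
        self._resources: Dict[str, Dict[str, str]] = {}
        self._next_id = 1

    def create(self, attrs: Dict[str, str]) -> str:
        resource_id = f"res-{self._next_id}"
        self._next_id += 1
        self._resources[resource_id] = dict(attrs)
        return resource_id

    def read(self, resource_id: str) -> Optional[Dict[str, str]]:
        found = self._resources.get(resource_id)
        return dict(found) if found is not None else None

    def update(self, resource_id: str, attrs: Dict[str, str]) -> None:
        self._resources[resource_id].update(attrs)

    def delete(self, resource_id: str) -> None:
        self._resources.pop(resource_id, None)
```

Under this assumption, the core only ever calls these four methods, so dependency ordering, planning and parallelism stay out of the provider.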

[Diagram: layers labeled Physical, Virtual, Physical (OpenStack), Virtual, and four Containers]
Composition • “Layer Cake” • Provider for each Layer • Unified Configuration • “terraform apply”
Another major advantage of this design is that we can compose providers together. This allows us to manage the “layer cake” of infrastructure. Using Terraform, we can incrementally transform physical hardware into an OpenStack cluster, provision virtual machines, and deploy containers on top. The best part is that we don’t need to figure out how to piece together a jigsaw of tools. Instead we use a unified configuration language, HCL, and do all the orchestration with a single “terraform apply”. This allows operators to adopt a single workflow for any underlying technology.

Entropy • Initial Construction Simple • Changes? • Manual • Automated
It’s great that we can use Terraform to manage the construction of this initial layer cake of infrastructure. Unfortunately, nothing lasts forever. At some point, the infrastructure needs to be modified or updated. Entropy, the enemy of stability, rears its head. So then the question becomes: how do we manage these changes? When we have a few dozen servers it may be fine to handle changes manually. Eventually, we are talking about hundreds or thousands of machines across datacenters. At this point, this is usually the answer.

This is the typical answer to change orchestration. In a complex infrastructure, operators are required to “divine” the changes required. This is an almost impossible task in a modern datacenter.

Measure Twice, Cut Once • Hard to “divine” changes • Complex inter-dependencies • Separate planning from execution • Strict Plan Application
However, Terraform has a much better approach to this problem. We are firm believers that it is better to measure twice and cut once. In a complex infrastructure it becomes very hard to divine the changes, especially when we account for the complex inter-dependencies of services. The Terraform solution to this is to separate the planning phase from the execution phase. This allows operators to inspect the execution plan and verify it is doing what they intend or, better yet, catch an unexpected action. Once the plan is verified, it can be applied, and Terraform ensures only the changes in the plan take place.

[Diagram: Configuration, State, State’, Execution Plan, Plan Application]
This gives operators a way to measure twice and cut once. This is essential to moving fast without breaking things.
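The separation of planning from execution can be sketched as two small functions. This is a simplified illustration under stated assumptions, not Terraform's algorithm: configuration and state are modeled as plain dictionaries, `make_plan` and `apply_plan` are hypothetical names, and dependencies between resources are ignored.

```python
def make_plan(desired, state):
    """Compare desired configuration with current state and return a
    list of (action, name, attrs) tuples without changing anything."""
    actions = []
    for name in sorted(desired.keys() | state.keys()):
        if name not in state:
            actions.append(("create", name, desired[name]))
        elif name not in desired:
            actions.append(("destroy", name, None))
        elif desired[name] != state[name]:
            actions.append(("update", name, desired[name]))
    return actions


def apply_plan(actions, state):
    """Apply exactly the actions in the plan and return the new state.
    Nothing outside the plan is touched; the input state is not mutated."""
    new_state = dict(state)
    for action, name, attrs in actions:
        if action == "destroy":
            new_state.pop(name)
        else:  # "create" or "update"
            new_state[name] = attrs
    return new_state


if __name__ == "__main__":
    desired = {"droplet.web": {"size": "1gb"}, "dns.hello": {"type": "A"}}
    state = {"droplet.web": {"size": "512mb"}, "droplet.old": {"size": "512mb"}}
    plan = make_plan(desired, state)
    for step in plan:
        print(step)  # an operator would review these before applying
    state = apply_plan(plan, state)
```

The plan is an ordinary value that can be inspected, stored or rejected before any change is made, which is the point of measuring twice.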

Terraform • Infrastructure as Code • Compose Providers • Safely Iterate • Technology Agnostic
In summary, Terraform is a tool for infrastructure orchestration. It allows infrastructure to be expressed as code, while composing IaaS, PaaS and SaaS together. Its use of a resource graph and planning phase allows operators to safely iterate without any unexpected actions. Most importantly, Terraform is designed to be technology agnostic. This allows it to manage the modern datacenter of today, and the datacenter of tomorrow.

Operators! • Unified Configuration • Atlas for Infrastructure • One Tool • Future Proof
What does this mean for the operators of Terraform? Terraform provides a single unified configuration that can be used to represent the entire infrastructure. This configuration can be used not only to act on the infrastructure but also as an atlas of it. Best of all, it’s only a single tool to manage it all. The alternative is to piece together a mosaic of tools spanning every technology used today, and every new piece being added daily.

End-Users • Ops maintains Infrastructure • Devs responsible for application • Self-Serve!
The benefits of Terraform actually extend past the operators. While only a few people are responsible for the core infrastructure and maintenance of the Terraform configuration, the workflow is simple enough to enable end users to deploy application changes while treating Terraform as a black box. This allows Terraform to be used as part of a self-serve pipeline.

[Diagram: layers labeled Physical, Virtual, Physical (OpenStack), Virtual, and four Containers, with the labels Ops and Dev]
Self-Serve • Decompose • Delegate • Deploy
Just as before, if we think about our infrastructure as a layer cake (you can tell I’m very fond of cake), then we can decompose and delegate responsibility for each layer. For example, we might have a centralized ops team responsible for the physical substrate, but then allow developers to provision and deploy the virtualized and containerized workloads using a self-serve model.

Modules • Abstract Infrastructure Components • Higher Level Reasoning • Re-Use Configuration (DRY)
In addition to allowing self-serve, we want to make it easier to extract and abstract chunks of our infrastructure. This allows developers and operators to reason about their application architecture at a higher level while re-using configuration. This makes it easier to deploy new applications using existing templates for infrastructure.

From the perspective of a module developer, there is nothing
different from standard Terraform configuration.

But from the perspective of a user, modules provide a
“black box” like mechanism for adding the functionality they need without worrying about the details of orchestrating those components.

“Cloud-Scale” • Complex Infrastructure • Shared Infrastructure • Dev vs Ops • Decentralized Development
This brings us to the modern cloud-scale infrastructure. We’ve already seen how the complexity of a modern datacenter has grown. This infrastructure is also shared, and is a common fabric for the applications of an organization. However, there is a split between dev and ops. While we generally have centralized operations teams, development is often decentralized.

Continuous Delivery • Reduce Friction • Self-Serve • Ready for Change
In this world, how do we continue to focus on application delivery? Given the split between ops and dev, we need to try to reduce the friction of application delivery, and move towards elastic self-serve. And most importantly, we need to be ready to embrace change.

Terraform Pipeline • Split interactions with Terraform • Operations: “substrate” and modules • Developers: plug-and-play • Move Fast Without Breaking Things
As we’ve seen, we can do all of this by building a pipeline around Terraform. We split the interactions between ops and dev. The operations team focuses on the infrastructure substrate and on writing re-usable modules for that substrate. Developers can then use Terraform in a plug-and-play model, by using the shared infrastructure and simply importing ready-to-use modules. Together, this allows an organization to move fast without breaking things.

Conclusion • One Tool • Enable Operators (and Users!) • Tame the Complexity Curve
In conclusion, Terraform provides a single tool with a unified configuration language. It can be used to enable the operators of infrastructure as well as the end users. But most importantly, it helps us to tame the complexity curve, and hopefully allows us to unbuckle our seat belts.

Thanks! Questions?
