
Hosted and Managed Elasticsearch: How It’s Built

Elastic Co
February 18, 2016

At last year's Elastic{ON} we announced our acquisition of Found, an Elasticsearch-as-a-service offering. Come hear how Elastic Cloud (formerly Found) is built under the hood, its use of Docker, and its security implementation. Also learn how we're evolving this architecture to let you deploy Found on your own hardware.

Transcript

  1. Agenda
     1 Why hosted Elasticsearch?
     2 Attempted architectures
     3 Living with systems that fail
     4 Current architecture
     5 Demo: Bootstrapping and using Cloud

  2. From Alpha to Cloud Enterprise
     1 Alpha: Shared Clusters
     2 Private Beta: Managing Nodes
     3 Public SaaS: Managing Clusters
     4 Cloud Enterprise
     5 The Future™

  3. Alpha (diagram): one shared cluster with Nodes A-D, where Index A belongs to Customer A and Index B to Customer B

  4. Alpha (diagram): the same shared cluster and customer indices, now fronted by a Proxy

  5. Resource Governance
     • Elasticsearch wants lots of memory: page cache and heap space
     • Elasticsearch also wants lots of CPU
     • Operating systems govern this. Elasticsearch cannot.

  6. Reliability Concerns
     • Runaway scripts
     • Mapping explosion
     • Experimental queries affecting the response time of production queries

  7. Version Upgrades
     • When to (not) upgrade should be your decision.
     • Particularly when there are breaking changes.

  8. Separate Clusters for Separate Concerns
     • Don't mix testing and production
     • Separate clusters for searching and logging
     • … and for different scaling profiles

  9. Bootstrapping a cluster is easy
     • How many nodes?
     • Which size?
     • Running where? How many zones?
     • Which version?

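     Those four questions are essentially the whole bootstrap input. A tiny sketch of how such a plan could be modelled as data (the ClusterPlan type and its fields are illustrative, not Elastic Cloud's actual schema):

     from dataclasses import dataclass

     @dataclass
     class ClusterPlan:
         """Hypothetical cluster plan: the four bootstrap questions as fields."""
         node_count: int             # how many nodes?
         node_memory_mb: int         # which size?
         region: str                 # running where?
         zones: int                  # how many zones?
         elasticsearch_version: str  # which version?

     plan = ClusterPlan(node_count=3, node_memory_mb=8192,
                        region="us-east-1", zones=2,
                        elasticsearch_version="2.2.0")
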
  10. Extending a cluster is easy
     • Add new node
     • Let Elasticsearch do its thing
     • Send traffic to it

  11. Changing a live cluster is hard
     • Now change all your nodes while serving requests
     • … with minimal disruption
     • … with safe rollback if something goes wrong

  12. Assuming nothing goes wrong, a change touches all of these steps:
      verify version compatibility, ensure snapshot repository, create snapshot,
      prepare new nodes, start nodes, wait on new quorum, adjust allocation settings,
      wait for migrations, adjust proxy routing, reroute traffic, drain connections,
      suspend periodic snapshotting, stop old nodes, clean up old nodes,
      resume periodic snapshotting

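     A rough sketch of how such a sequence could be orchestrated with safe rollback, in Python. The (do, undo) step pairs and their names are illustrative placeholders, not Elastic Cloud's actual implementation.

     # Minimal sketch: run a plan change as ordered (do, undo) pairs and roll
     # back in reverse order if any step fails. Step names are placeholders.
     def apply_change(steps):
         done = []
         for do, undo in steps:
             try:
                 do()
                 done.append(undo)
             except Exception:
                 for undo_step in reversed(done):
                     undo_step()   # best-effort rollback, newest first
                 raise

     # Usage sketch with throwaway steps; the real steps would be the ones on
     # the slide (create snapshot, start nodes, reroute traffic, ...).
     apply_change([
         (lambda: print("create snapshot"), lambda: print("delete snapshot")),
         (lambda: print("start new nodes"), lambda: print("stop new nodes")),
     ])
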
  13. "Let me do all those things in the right order. Looks easy." - not a single beta user

  14. Needed a Cluster Scheduler
     • Have a cluster with a certain topology
     • Need to reconfigure it to a different topology
     • Working with thousands of nodes

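     A minimal sketch of the scheduling idea: diff the desired topology against the current one to decide what to create, remove, or resize. The data shapes here (node name mapped to zone and memory) are assumptions for illustration, not the real scheduler's model.

     # Illustrative topology diff; not the actual cluster scheduler.
     def plan_moves(current, desired):
         """Return nodes to create, remove, and resize, given two topologies."""
         to_create = {n: spec for n, spec in desired.items() if n not in current}
         to_remove = [n for n in current if n not in desired]
         to_resize = {n: desired[n] for n in desired
                      if n in current and current[n] != desired[n]}
         return to_create, to_remove, to_resize

     current = {"node-1": ("zone-1", 4096), "node-2": ("zone-2", 4096)}
     desired = {"node-1": ("zone-1", 8192), "node-3": ("zone-3", 8192)}
     print(plan_moves(current, desired))
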
  15. Where do you store things?
     • Persistent storage sounds great
     • … until it brings down everything with it
     • … or just has generally jittery performance

  16. (image-only slide)

  17. Limiting Blast Radius
     • In distributed systems, failure is inevitable
     • Important to control cascading failures for your services
     • Design accordingly

  18. Treating Nodes like Cattle
     • Ephemeral storage: better performance, minimal blast radius
     • Replicate to multiple availability zones
     • Snapshot often, just in case

  19. … Not Pets
     • Sign of a problem? Move away
     • Large-scale failure? Thundering herd inhibits repair
     • Additional capacity already running - fail over fast
     • Copy data from a replica
     • Worst case: recover from a recent snapshot

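     The worst case, recovering from a recent snapshot, is a standard Elasticsearch API call. A hedged sketch using the _restore endpoint (the repository and snapshot names are placeholders):

     import requests

     # Restore from a snapshot; repository and snapshot names are placeholders,
     # the endpoint is Elasticsearch's standard snapshot restore API.
     resp = requests.post(
         "http://localhost:9200/_snapshot/my-snapshots/snapshot-2016-02-18/_restore",
         json={"indices": "*", "include_global_state": False},
     )
     resp.raise_for_status()
     print(resp.json())
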
  20. Kernel Bugs Crashing All The Servers
     • Private beta days.
     • Two Linux kernel bugs:
       • One related to how we did internal networking between containers.
       • One bug as Linux switched scheduling algorithms as CPU load transitioned.
     • Long story short: a lot of servers died every day. For two weeks.

  21. (image-only slide)

  22. (image-only slide)

  23. Mass Reboots
     • Xen bug prompted large-scale rebooting for cloud providers.
     • Could not migrate to new instances to avoid it.
     • Our infrastructure provider, our problem.
     • Upgraded every cluster to high-availability, for free.

  24. Proxies and Failovers
     • Proxies / load balancers are important for failovers.
     • The client need not change.
     • Stop routing to problematic nodes
     • Request and metric logging

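     A minimal sketch of "stop routing to problematic nodes": a round-robin choice that skips nodes failing a cheap health check. Everything here (node list, timeout, strategy) is illustrative, not the actual proxy.

     import itertools
     import requests

     NODES = ["http://node-a:9200", "http://node-b:9200", "http://node-c:9200"]
     _rotation = itertools.cycle(NODES)

     def healthy(node):
         """A node is routable if it answers a cheap request quickly."""
         try:
             return requests.get(node, timeout=1).ok
         except requests.RequestException:
             return False

     def pick_node():
         """Round-robin over nodes, skipping any that fail the health check."""
         for _ in range(len(NODES)):
             node = next(_rotation)
             if healthy(node):
                 return node
         raise RuntimeError("no healthy nodes available")
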
  25. Logging and Monitoring
     • We use a lot of Elasticsearch to manage Elasticsearch.
     • Some duct tape, lots of dogfooding.
     • We heavily influence the entire stack.

  26. Security
     • Biggest attack vector: Elasticsearch itself.
     • Scripts, custom plugins, bugs, etc.
     • Design assumption: the user has compromised Elasticsearch.
     • Isolated processes.
     • Dedicated resources, like S3 buckets for snapshotting.

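     "Dedicated resources, like S3 buckets for snapshotting" amounts to registering a per-cluster repository on each cluster. A hedged sketch using Elasticsearch's _snapshot API with an S3 repository (names are placeholders; the nodes need the S3 repository plugin):

     import requests

     # Register a dedicated S3 snapshot repository for this one cluster, so a
     # compromised node cannot reach any other customer's snapshots.
     resp = requests.put(
         "http://localhost:9200/_snapshot/cluster-abc123-snapshots",
         json={
             "type": "s3",
             "settings": {
                 "bucket": "snapshots-cluster-abc123",   # one bucket per cluster
                 "region": "us-east-1",
             },
         },
     )
     resp.raise_for_status()
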
  27. Security Improvements in Elasticsearch
     • Elasticsearch is improving a lot.
     • Huge efforts to enable the SecurityManager.
     • Even so, will always assume the hosted nodes can't be trusted.

  28. Containers
     • Isolation and resource governance.
     • Memory, CPU weighting / hard limits, IOPS/network weighting.
     • Currently using Docker

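     A brief sketch of what those limits look like when starting a container with the Docker SDK for Python; the image, heap size, and limit values are illustrative, not what Elastic Cloud actually runs.

     import docker

     client = docker.from_env()

     # Start an Elasticsearch container with a hard memory limit, a relative
     # CPU weight, and a relative block-IO weight. Values are illustrative.
     container = client.containers.run(
         "elasticsearch:2.2",
         detach=True,
         mem_limit="4g",        # hard memory cap
         cpu_shares=512,        # relative CPU weight
         blkio_weight=300,      # relative block-IO weight (10-1000)
         environment={"ES_HEAP_SIZE": "2g"},
     )
     print(container.id)
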
  29. Runners – managing multiple containers
     • The runner is a service that manages the life-cycle of a container on a given server.
     • It reads the set of containers and their definitions from ZooKeeper
     • Publishes meta-data and inspection data about the active containers back to ZooKeeper
     (diagram: a Runner hosting Allocator, Proxy, Admin Console, Secure Tunnel, and generic containers)

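     A minimal sketch of that read/publish loop against ZooKeeper using the kazoo client; the znode paths and JSON shapes are assumptions for illustration, not the runner's real layout.

     import json
     import socket
     from kazoo.client import KazooClient

     HOST = socket.gethostname()
     zk = KazooClient(hosts="zookeeper:2181")
     zk.start()

     # Read the container definitions assigned to this server.
     defs_path = "/runners/%s/containers" % HOST   # made-up path layout
     zk.ensure_path(defs_path)
     definitions = {}
     for name in zk.get_children(defs_path):
         data, _stat = zk.get("%s/%s" % (defs_path, name))
         definitions[name] = json.loads(data.decode("utf-8"))

     # ... reconcile running containers against `definitions` here ...

     # Publish inspection data about the active containers back to ZooKeeper.
     status_path = "/runners/%s/status" % HOST
     zk.ensure_path(status_path)
     zk.set(status_path, json.dumps({"active": sorted(definitions)}).encode("utf-8"))
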
  30. Allocators – managing multiple cluster instances
     • Manages the life-cycle of Elasticsearch and Kibana instances.
     • Generates configuration files
     • Other maintenance tasks
     • Has a set of attributes and features.
     (diagram: an Allocator hosting several Elasticsearch instances, a Kibana instance, and a service container)

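     A hedged sketch of the "generates configuration files" part: rendering a per-instance elasticsearch.yml from the cluster plan. The settings chosen and the quorum calculation are illustrative, not the allocator's actual template.

     # Illustrative config rendering; not the real allocator template.
     ES_YML_TEMPLATE = (
         "cluster.name: {cluster_id}\n"
         "node.name: {node_name}\n"
         "network.host: 0.0.0.0\n"
         "discovery.zen.ping.unicast.hosts: [{unicast_hosts}]\n"
         "discovery.zen.minimum_master_nodes: {min_masters}\n"
     )

     def render_es_config(cluster_id, node_name, all_nodes):
         min_masters = len(all_nodes) // 2 + 1   # quorum of master-eligible nodes
         return ES_YML_TEMPLATE.format(
             cluster_id=cluster_id,
             node_name=node_name,
             unicast_hosts=", ".join('"%s"' % n for n in all_nodes),
             min_masters=min_masters,
         )

     print(render_es_config("cluster-abc123", "instance-0000000001",
                            ["10.0.1.5:9300", "10.0.2.7:9300", "10.0.3.9:9300"]))
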
  31. Constructor: Managing a pool of resources
     • Contains the cluster scheduler
     • Allocators make themselves available in a resource pool.
     • Considers the constraints of the cluster topology when performing changes.
     • Constructor assigns Elasticsearch and Kibana instances to allocators.

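     A small sketch of zone-aware placement as the constructor might reason about it, assuming allocators advertise free capacity per zone. The data shapes and the greedy strategy are illustrative, not the real constructor.

     # Spread a cluster's instances across zones, picking the allocator with the
     # most free capacity in each zone. Purely illustrative.
     def place_instances(allocators, instance_count, capacity_per_instance):
         """allocators: list of dicts like {"id": ..., "zone": ..., "free": ...}."""
         zones = sorted({a["zone"] for a in allocators})
         placements = []
         for i in range(instance_count):
             zone = zones[i % len(zones)]            # round-robin across zones
             candidates = [a for a in allocators
                           if a["zone"] == zone and a["free"] >= capacity_per_instance]
             if not candidates:
                 raise RuntimeError("no capacity left in " + zone)
             best = max(candidates, key=lambda a: a["free"])
             best["free"] -= capacity_per_instance
             placements.append(("instance-%d" % i, best["id"]))
         return placements

     allocators = [
         {"id": "allocator-1a", "zone": "zone-1", "free": 65536},
         {"id": "allocator-2a", "zone": "zone-2", "free": 65536},
     ]
     print(place_instances(allocators, instance_count=2, capacity_per_instance=32768))
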
  32. POST cloud.elastic.co/regions/us-east-1/_new
      {
        "name": "staging",
        "capacity": 32768,
        "zones": 2
      }
      (diagram: Console, API, and Constructor coordinate through ZooKeeper to drive allocators in zone 1, zone 2, and zone 3)

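     For completeness, a sketch of issuing that request from a client with Python's requests library. The path and payload are exactly what the slide shows; the scheme, authentication, and response handling are assumptions, and the endpoint may not be publicly reachable as written.

     import requests

     # Recreate the request from the slide; auth headers are omitted here.
     resp = requests.post(
         "https://cloud.elastic.co/regions/us-east-1/_new",
         json={"name": "staging", "capacity": 32768, "zones": 2},
     )
     print(resp.status_code, resp.text)
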