
Elastic Cloud @ Fandango: How They Shifted Deployment Model to Scale & Meet Their Deadlines

Elastic Co
March 09, 2017


Every month, more than 60 million people visit Fandango’s websites to browse and buy movie tickets, as well as rent or buy TV and movie content. To understand the effectiveness of their outbound marketing and offer campaigns, Fandango deployed the Elastic Stack to monitor and analyze more than 5 billion web log events per month.

In this talk, Adam will walk you through how, in one weekend, the team at FandangoNOW redesigned and re-architected their existing on-premise deployment onto Elastic Cloud in order to hit their launch date. He’ll cover lessons learned and the journey of scaling to analyze up to 500 million records per day.

Adam Kane | Director of Engineering | Fandango
Jason Rojas | Sr. Manager, Engineering Effectiveness | Fandango

Transcript

  1. Elastic Cloud @ Fandango (Fandango, January 2017). Adam Kane, Director of Engineering; Jason Rojas, Sr. Manager, Engineering Effectiveness
  2. Agenda: 1) Problem Space & Goals, 2) On-premise Elastic Stack, 3) Elastic Cloud, 4) Migration Challenges, 5) Cost Mitigation, 6) Future Planning
  3. About Us
     • Adam Kane leads Operations and Site Reliability Engineering @ Fandango
     • Jason Rojas leads Engineering Effectiveness @ Fandango
     • We bleed operations
     • Elastic Stack users since 2010
     • Elastic Stack customers since 2016
  4. About Fandango
     • The ultimate digital network for all things movies
     • Our portfolio reaches more than 60 million unique visitors per month
  5. How do we use Elasticsearch?
     • Audit logs, error logs, and access logs
     • Shipped to both On-Prem Elastic and Elastic Cloud
  6. Problem Space: Integration, Scalability, Stability
     • Multiple business units
     • Global footprint
     • VPNs, datacenters, cloud
     • Compliance & SSL
     • Physical hardware limits
     • Slow elasticity
     • New services
     • Can't afford cluster outages
     • Marketing campaigns
     • Blockbuster movies
     • Business intelligence
  7. Problem Space: Integration - Data Silos
     [Diagram: separate business units (hybrid cloud, cloud native, hybrid cloud), each with its own Cloud and/or On-Prem ES clusters]
  8. Problem Space: Scalability
     • Physical hardware limitations
     • Slow elasticity
     • Big movie releases, marketing campaigns, etc.
     • Adding a new cluster member takes weeks: hardware procurement, a datacenter trip, configuration, etc.
  9. Problem Space: Stability
     • Engineering teams depend on cluster stability
     • We can't debug production issues when the cluster is down
     • Insight into KPMs & KPIs is important to the business
     • Adoption slows when Kibana isn't working
  10. Problem Space: Goals
     • Elasticity
     • Velocity
     • Easily integrate existing and future business units
     • Security (Shield: SSL, audit logs, index access restrictions)
     • Easy upgrade paths
     • Stability
  11. On-premise Elastic: Overview
     • Elasticsearch 2.4.4 (previously 1.x) / Logstash 5.1.1 (previously 1.x)
     • 11-node physical cluster: 7 data nodes, 3 search nodes, 1 tiebreaker
     • Data nodes (x7): 6x 1 TB SSD, 24 CPUs, 96 GB RAM each
     • Cluster size: 15-20 TB
     • 100+ million documents per day
     • 15-20 billion documents on average at any given time
     • Daily index rotation per log type (see the sketch below)
     • Logstash, Logspout, Filebeat, Winlogbeat, Curator, Redis, Kafka
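     A minimal sketch of what daily, per-log-type index rotation can look like with the Python Elasticsearch client; the endpoint, index prefix, and document fields are placeholders rather than Fandango's actual configuration.

        # Daily index rotation: one index per log type per day.
        from datetime import datetime, timezone
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        def daily_index(log_type):
            # e.g. "access-logs-2017.01.15"
            return "{0}-{1}".format(log_type, datetime.now(timezone.utc).strftime("%Y.%m.%d"))

        es.index(index=daily_index("access-logs"), doc_type="log", body={
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "message": "GET /movies HTTP/1.1 200",
        })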
  12. Elastic Stack: Architecture
     [Diagram: applications, Beats (Winlogbeat, Topbeat/metrics, Packetbeat), and Logstash running locally on all clients feed Kafka and Redis messaging queues; Logstash nodes (7) index into Elasticsearch with master nodes (4), ingest nodes (7), hot data nodes (7), and warm data nodes (7); Kibana instances (3) with X-Pack authentication, notification, and LDAP sit on top.] The hot/warm split is sketched below.
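     A minimal sketch of how a hot/warm split like this is typically wired up with node attributes and index-level allocation filtering; the box_type attribute, index names, and endpoint are assumptions, not Fandango's exact settings.

        # Hot/warm allocation: new daily indices on hot nodes, older ones on warm.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        # Pin a fresh index to nodes tagged with the box_type: hot attribute
        es.indices.put_settings(index="access-logs-2017.01.15", body={
            "index.routing.allocation.require.box_type": "hot"
        })

        # Once it ages out of the hot tier, move it to the warm nodes
        es.indices.put_settings(index="access-logs-2017.01.01", body={
            "index.routing.allocation.require.box_type": "warm"
        })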
  13. On-premise Elastic: Issues
     1) Marketing campaign launches
     2) Traffic increases 5-20x
     3) Cluster heap skyrockets; cluster hangs
     4) Application issues due to the traffic increase; engineers can't view error logs and marketing can't see impact
     5) Elastic cluster deemed useless; operations team reverts to manual log gathering
  14. On-premise Elastic: Issues (step 3: cluster heap skyrockets, cluster hangs)
     • Heap spikes / long GC pauses (see the sketch below)
     • Cluster hangs (mapping/state)
     • Shards & CPU
     • Deprecated settings
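     A minimal sketch of catching that heap pressure early via the cluster health and nodes stats APIs; the endpoint and the 85% threshold are illustrative assumptions.

        # Alert when any node's JVM heap is high enough to expect long GC pauses.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        health = es.cluster.health()
        stats = es.nodes.stats(metric="jvm")

        for node_id, node in stats["nodes"].items():
            heap_pct = node["jvm"]["mem"]["heap_used_percent"]
            if heap_pct > 85:  # sustained high heap usually precedes long GC pauses
                print("{0}: heap at {1}% (cluster status: {2})".format(
                    node["name"], heap_pct, health["status"]))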
  15. On-premise Elastic: Issues (management, stability, scalability)
     • Cluster stability: heap, field data, cluster state
     • Time-consuming upgrade paths
     • Limited hardware capacity
  16. Elastic Cloud: A single weekend ("We move pretty damn fast")
     • Our last big marketing campaign left us blind due to cluster scale issues
     • Decided to go to the cloud
     • We looked at multiple options
     • Signed, sealed, and delivered in 3 days
     • Made possible for many reasons:
       ๏ Great sales team, great support team
       ๏ The Elastic Cloud interface was easy to use
       ๏ Proper configuration management allowed us to modify the Logstash config across our ecosystem quickly
  17. Elastic Cloud: A single weekend, extended ("Too fast?")
     • Went straight to Elastic Cloud 5.x
     • Had issues with Shield causing the cluster to go red every few hours
     • Tested removing Shield; no luck
     • Downgraded to 2.4 to regain stability
     • Mapping issues (keyword vs. string) caused end-user confusion (see the sketch below)
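     A minimal sketch of the keyword-vs-string confusion: in 5.x the old analyzed "string" type is split into "text" and "keyword", and a terms aggregation only behaves as users expect on a keyword (or not_analyzed) field. The index pattern, field name, and endpoint are examples, not Fandango's data.

        # A terms aggregation on an analyzed field buckets individual tokens;
        # on a keyword field it returns whole values as users expect.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic-cloud.example.com:9243"])  # placeholder endpoint

        resp = es.search(index="access-logs-*", body={
            "size": 0,
            "aggs": {"top_titles": {"terms": {"field": "movie_title", "size": 10}}},
        })
        for bucket in resp["aggregations"]["top_titles"]["buckets"]:
            print(bucket["key"], bucket["doc_count"])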
  18. Problem Space: Integration - Data Silos (redundant)
     [Same diagram as item 7: separate business units (hybrid cloud, cloud native, hybrid cloud), each with its own Cloud and/or On-Prem ES clusters]
  19. Elastic Cloud: Lessons Learned (so far) ("You don't know what you don't know, you know?")
     • It's important to clean up indexes
     • Field mappings are important
     • Storage is not infinite ($$$)
     • Index shard settings matter
     • Dashboard migration from on-prem to cloud can be tricky
     • Keeping the on-prem cluster and Cloud cluster mappings in sync is key (see the sketch below)
     • Index cleanup with Curator (cloud vs. on-prem)
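     A minimal sketch of one way to keep mappings and shard settings in sync across clusters: push the same index template to both the on-prem and Cloud endpoints. The hosts, template name, fields, and shard counts are illustrative assumptions, shown with 5.x field types (on 2.4 the equivalent would be not_analyzed strings).

        # Apply an identical index template to both clusters so new daily indices
        # get the same mappings and shard settings everywhere.
        from elasticsearch import Elasticsearch

        TEMPLATE = {
            "template": "error-logs-*",
            "settings": {"number_of_shards": 3, "number_of_replicas": 1},
            "mappings": {
                "log": {
                    "properties": {
                        "service": {"type": "keyword"},
                        "message": {"type": "text"},
                    }
                }
            },
        }

        for endpoint in ["https://on-prem.example.com:9200",
                         "https://cloud.example.found.io:9243"]:  # placeholder hosts
            Elasticsearch([endpoint]).indices.put_template(name="error-logs", body=TEMPLATE)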
  20. Migration Challenges ("Upgrades & Re-indexing Galore")
     • Original goal was to force the pain of going from 1.5 to 5.x
     • Upgrade on-prem from 1.x to 2.x, then to 2.4
     • Upgrade Logstash from 1.5 to 2.4 / 5.1
     • Re-indexing of data (see the sketch below)
     • Deadlines
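     A minimal sketch of re-indexing during an upgrade using the scan/scroll + bulk helper from the Python client (useful on clusters that predate the server-side _reindex API); index names and the endpoint are illustrative assumptions.

        # Copy documents from an old index into one created with the new mappings.
        from elasticsearch import Elasticsearch, helpers

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        helpers.reindex(es, source_index="access-logs-2016.12.01",
                        target_index="access-logs-v2-2016.12.01")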
  21. Migration Challenges: Lessons Learned ("We learned things")
     • Field data settings
     • Refresh intervals (see the sketch below)
     • 5.x issues in Cloud due to Shield; these are fixed now! We worked closely with Dev/Support
     • Rollback to 2.4
     • Cluster state size
     • Shard sizing / allocation
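     A minimal sketch of one of those refresh-interval tweaks: relaxing the refresh interval on write-heavy log indices cuts indexing overhead during traffic spikes. The 30s value, index pattern, and endpoint are illustrative assumptions.

        # Default refresh is 1s; log searches rarely need near-real-time visibility.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        es.indices.put_settings(index="access-logs-*", body={
            "index": {"refresh_interval": "30s"}
        })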
  22. Cost Mitigation ("Do we really need everything in the cloud?")
     • Decrease cloud costs with index retention strategies
     • Production logs go to both on-prem and cloud for redundancy; this has saved us a few times!
     • Dev/Integration/Staging logs go to on-prem only
     • Long-term retention goes to on-prem / S3 / Glacier (see the snapshot sketch below)
     • Elastic will soon support hot/warm architecture; this is amazing! It will let us rethink the cost mitigation strategy and move more to the cloud
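     A minimal sketch of the S3 leg of that retention strategy: register an S3 snapshot repository (requires the S3 repository plugin) and snapshot aging indices into it; the Glacier transition is handled by an S3 lifecycle rule outside Elasticsearch. Bucket, repository, and index names are illustrative assumptions.

        # Snapshot older on-prem indices to S3 for long-term retention.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        es.snapshot.create_repository(repository="s3-archive", body={
            "type": "s3",
            "settings": {"bucket": "example-log-archive"},
        })

        es.snapshot.create(repository="s3-archive", snapshot="access-logs-2016.12",
                           body={"indices": "access-logs-2016.12.*"})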
  23. Cost Mitigation: Index Lifecycle
     1) Prod logs go to both cloud and on-prem
     2) Prod logs in cloud have short retention (7d)
     3) On-prem has longer retention (60-120d)
     4) On-prem snapshots to S3
     5) S3 transitions to Glacier
     (A Curator-style cleanup for steps 2-3 is sketched below.)
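     A minimal, Curator-style sketch of steps 2-3: delete daily indices once they fall outside the retention window, with a shorter window in Cloud than on-prem. Endpoints, the index prefix, and the windows are illustrative assumptions; in practice Curator's age-based delete_indices action does this work.

        # Delete daily indices that are older than the retention window.
        from datetime import datetime, timedelta, timezone
        from elasticsearch import Elasticsearch

        def delete_old_indices(es, prefix, retention_days):
            cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
            for name in es.indices.get(index=prefix + "-*"):
                try:
                    day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d")
                except ValueError:
                    continue  # skip indices that don't follow the daily naming scheme
                if day.replace(tzinfo=timezone.utc) < cutoff:
                    es.indices.delete(index=name)

        delete_old_indices(Elasticsearch(["https://elastic-cloud.example.com:9243"]), "access-logs", 7)
        delete_old_indices(Elasticsearch(["https://on-prem.example.com:9200"]), "access-logs", 60)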
  24. Future Planning ("What's next?")
     • Upgrade to 5.x on-prem and in cloud
     • Deeper Kafka integration: queuing during traffic spikes
     • Improve log libraries across the codebase (common logging formats; see the sketch below)
     • Logstash 5.x everywhere
     • Logstash 5.2 persistent queues
     • SAML/LDAP integration with Shield (we hope!)
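     A minimal sketch of what a common logging format can look like: every application emits one JSON object per line so Logstash and Beats can parse it without per-app grok patterns. Field names and the service name are illustrative assumptions.

        # Emit structured JSON log lines from application code.
        import json
        import logging
        from datetime import datetime, timezone

        class JsonFormatter(logging.Formatter):
            def format(self, record):
                return json.dumps({
                    "@timestamp": datetime.now(timezone.utc).isoformat(),
                    "level": record.levelname,
                    "service": "fandangonow-web",  # placeholder service name
                    "message": record.getMessage(),
                })

        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger = logging.getLogger("app")
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        logger.info("checkout completed")  # -> {"@timestamp": "...", "level": "INFO", ...}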