
Elastic Cloud @ Fandango: How They Shifted Deployment Model to Scale & Meet Their Deadlines

Elastic Co
March 09, 2017


Every month, more than 60 million people visit Fandango’s websites to browse and buy movie tickets, as well as rent or buy TV and movie content. To understand the effectiveness of their outbound marketing and offer campaigns, Fandango deployed the Elastic Stack to monitor and analyze more than 5 billion web log events per month.

In this talk, Adam will walk you through how, in one weekend, the team at FandangoNOW redesigned and re-architected their existing on-premise deployment onto Elastic Cloud in order to hit their launch date. He’ll cover lessons learned and the journey of scaling to analyze up to 500 million records per day.

Adam Kane | Director of Engineering | Fandango
Jason Rojas | Sr. Manager, Engineering Effectiveness | Fandango

Transcript

  1. Elastic Cloud @ Fandango (Fandango, January 2017). Adam Kane, Director of Engineering; Jason Rojas, Sr. Manager, Engineering Effectiveness
  2. Agenda: 1) Problem Space & Goals, 2) On-premise Elastic Stack, 3) Elastic Cloud, 4) Migration Challenges, 5) Cost Mitigation, 6) Future Planning
  3. About Us
     • Adam Kane leads Operations and Site Reliability Engineering @ Fandango
     • Jason Rojas leads Engineering Effectiveness @ Fandango
     • We bleed operations
     • Elastic Stack users since 2010
     • Elastic Stack customers since 2016
  4. About Fandango
     • The ultimate digital network for all things movies
     • Our portfolio reaches more than 60 million unique visitors per month
  5. How do we use Elasticsearch?
     • Audit logs, error logs, and access logs
     • Shipped to both On-Prem Elastic and Elastic Cloud
  6. Problem Space: Integration, Scalability, Stability
     • Multiple business units
     • Global footprint
     • VPNs, datacenters, cloud
     • Compliance & SSL
     • Physical hardware limits
     • Slow elasticity
     • New services
     • Can't afford cluster outages
     • Marketing campaigns
     • Blockbuster movies
     • Business intelligence
  7. Problem Space: Integration - Data Silos
     [Diagram: separate business units (hybrid cloud, cloud native, hybrid cloud), each with its own Cloud and/or On-Prem ES clusters]
  8. Problem Space: Scalability
     • Physical hardware limitations
     • Slow elasticity
     • Big movie releases, marketing campaigns, etc.
     • Adding a new cluster member takes weeks: hardware procurement, a datacenter trip, configuration, etc.
  9. Problem Space: Stability
     • Engineering teams depend on cluster stability
     • We can't debug production issues when the cluster is down
     • Insight into KPMs & KPIs is important to the business
     • Adoption slows when Kibana isn't working
  10. Problem Space: Goals
     • Elasticity
     • Velocity
     • Easily integrate existing and future business units
     • Security (Shield: SSL, audit logs, index access restrictions)
     • Easy upgrade paths
     • Stability
  11. On-premise Elastic: Overview
     • Elasticsearch 2.4.4 (previously 1.x) / Logstash 5.1.1 (previously 1.x)
     • 11-node physical cluster: 7 data nodes, 3 search nodes, 1 tiebreaker
     • Data nodes (x7): 6x 1 TB SSD, 24 CPUs, 96 GB RAM each
     • Cluster size: 15-20 TB
     • 100+ million documents per day
     • 15-20 billion documents on average at any given time
     • Daily index rotation per log type (see the sketch below)
     • Logstash, Logspout, Filebeat, Winlogbeat, Curator, Redis, Kafka
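     A minimal sketch of what daily, per-log-type index rotation can look like with the Python Elasticsearch client; the endpoint, index prefix, and document fields are placeholders rather than Fandango's actual configuration.

        # Daily index rotation: one index per log type per day.
        from datetime import datetime, timezone
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        def daily_index(log_type):
            # e.g. "access-logs-2017.01.15"
            return "{0}-{1}".format(log_type, datetime.now(timezone.utc).strftime("%Y.%m.%d"))

        es.index(index=daily_index("access-logs"), doc_type="log", body={
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "message": "GET /movies HTTP/1.1 200",
        })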
  12. Elastic Stack: Architecture
     [Diagram: applications, Beats (Winlogbeat, Topbeat/metrics, Packetbeat), and Logstash running locally on all clients feed Kafka and Redis messaging queues; Logstash nodes (7) index into Elasticsearch with master nodes (4), ingest nodes (7), hot data nodes (7), and warm data nodes (7); Kibana instances (3) with X-Pack authentication, notification, and LDAP sit on top.] The hot/warm split is sketched below.
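     A minimal sketch of how a hot/warm split like this is typically wired up with node attributes and index-level allocation filtering; the box_type attribute, index names, and endpoint are assumptions, not Fandango's exact settings.

        # Hot/warm allocation: new daily indices on hot nodes, older ones on warm.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        # Pin a fresh index to nodes tagged with the box_type: hot attribute
        es.indices.put_settings(index="access-logs-2017.01.15", body={
            "index.routing.allocation.require.box_type": "hot"
        })

        # Once it ages out of the hot tier, move it to the warm nodes
        es.indices.put_settings(index="access-logs-2017.01.01", body={
            "index.routing.allocation.require.box_type": "warm"
        })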
  13. On-premise Elastic: Issues
     1) Marketing campaign launches
     2) Traffic increases 5-20x
     3) Cluster heap skyrockets; cluster hangs
     4) Application issues due to the traffic increase; engineers can't view error logs and marketing can't see impact
     5) Elastic cluster deemed useless; operations team reverts to manual log gathering
  14. On-premise Elastic: Issues (step 3: cluster heap skyrockets, cluster hangs)
     • Heap spikes / long GC pauses (see the sketch below)
     • Cluster hangs (mapping/state)
     • Shards & CPU
     • Deprecated settings
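     A minimal sketch of catching that heap pressure early via the cluster health and nodes stats APIs; the endpoint and the 85% threshold are illustrative assumptions.

        # Alert when any node's JVM heap is high enough to expect long GC pauses.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic.example.com:9200"])  # placeholder endpoint

        health = es.cluster.health()
        stats = es.nodes.stats(metric="jvm")

        for node_id, node in stats["nodes"].items():
            heap_pct = node["jvm"]["mem"]["heap_used_percent"]
            if heap_pct > 85:  # sustained high heap usually precedes long GC pauses
                print("{0}: heap at {1}% (cluster status: {2})".format(
                    node["name"], heap_pct, health["status"]))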
  15. On-premise Elastic: Issues (management, stability, scalability)
     • Cluster stability: heap, field data, cluster state
     • Time-consuming upgrade paths
     • Limited hardware capacity
  16. Elastic Cloud: A single weekend ("We move pretty damn fast")
     • Our last big marketing campaign left us blind due to cluster scale issues
     • Decided to go to the cloud
     • We looked at multiple options
     • Signed, sealed, and delivered in 3 days
     • Made possible for many reasons:
       ๏ Great sales team, great support team
       ๏ The Elastic Cloud interface was easy to use
       ๏ Proper configuration management allowed us to modify the Logstash config across our ecosystem quickly
  17. Elastic Cloud: A single weekend, extended ("Too fast?")
     • Went straight to Elastic Cloud 5.x
     • Had issues with Shield causing the cluster to go red every few hours
     • Tested removing Shield; no luck
     • Downgraded to 2.4 to regain stability
     • Mapping issues (keyword vs. string) caused end-user confusion (see the sketch below)
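     A minimal sketch of the keyword-vs-string confusion: in 5.x the old analyzed "string" type is split into "text" and "keyword", and a terms aggregation only behaves as users expect on a keyword (or not_analyzed) field. The index pattern, field name, and endpoint are examples, not Fandango's data.

        # A terms aggregation on an analyzed field buckets individual tokens;
        # on a keyword field it returns whole values as users expect.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://elastic-cloud.example.com:9243"])  # placeholder endpoint

        resp = es.search(index="access-logs-*", body={
            "size": 0,
            "aggs": {"top_titles": {"terms": {"field": "movie_title", "size": 10}}},
        })
        for bucket in resp["aggregations"]["top_titles"]["buckets"]:
            print(bucket["key"], bucket["doc_count"])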
  18. Problem Space: Integration - Data Silos (redundant)
     [Same diagram as item 7: separate business units (hybrid cloud, cloud native, hybrid cloud), each with its own Cloud and/or On-Prem ES clusters]
  19. Elastic Cloud: Lessons Learned (so far) ("You don't know what you don't know, you know?")
     • It's important to clean up indexes
     • Field mappings are important
     • Storage is not infinite ($$$)
     • Index shard settings matter
     • Dashboard migration from on-prem to cloud can be tricky
     • Keeping the on-prem cluster and Cloud cluster mappings in sync is key (see the sketch below)
     • Index cleanup with Curator (cloud vs. on-prem)
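     A minimal sketch of one way to keep mappings and shard settings in sync across clusters: push the same index template to both the on-prem and Cloud endpoints. The hosts, template name, fields, and shard counts are illustrative assumptions, shown with 5.x field types (on 2.4 the equivalent would be not_analyzed strings).

        # Apply an identical index template to both clusters so new daily indices
        # get the same mappings and shard settings everywhere.
        from elasticsearch import Elasticsearch

        TEMPLATE = {
            "template": "error-logs-*",
            "settings": {"number_of_shards": 3, "number_of_replicas": 1},
            "mappings": {
                "log": {
                    "properties": {
                        "service": {"type": "keyword"},
                        "message": {"type": "text"},
                    }
                }
            },
        }

        for endpoint in ["https://on-prem.example.com:9200",
                         "https://cloud.example.found.io:9243"]:  # placeholder hosts
            Elasticsearch([endpoint]).indices.put_template(name="error-logs", body=TEMPLATE)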
  20. Migration Challenges ("Upgrades & Re-indexing Galore")
     • Original goal was to force the pain of going from 1.5 to 5.x
     • Upgrade on-prem from 1.x to 2.x, then to 2.4
     • Upgrade Logstash from 1.5 to 2.4 / 5.1
     • Re-indexing of data (see the sketch below)
     • Deadlines
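     A minimal sketch of re-indexing during an upgrade using the scan/scroll + bulk helper from the Python client (useful on clusters that predate the server-side _reindex API); index names and the endpoint are illustrative assumptions.

        # Copy documents from an old index into one created with the new mappings.
        from elasticsearch import Elasticsearch, helpers

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        helpers.reindex(es, source_index="access-logs-2016.12.01",
                        target_index="access-logs-v2-2016.12.01")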
  21. Migration Challenges: Lessons Learned ("We learned things")
     • Field data settings
     • Refresh intervals (see the sketch below)
     • 5.x issues in Cloud due to Shield; these are fixed now! We worked closely with Dev/Support
     • Rollback to 2.4
     • Cluster state size
     • Shard sizing / allocation
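     A minimal sketch of one of those refresh-interval tweaks: relaxing the refresh interval on write-heavy log indices cuts indexing overhead during traffic spikes. The 30s value, index pattern, and endpoint are illustrative assumptions.

        # Default refresh is 1s; log searches rarely need near-real-time visibility.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        es.indices.put_settings(index="access-logs-*", body={
            "index": {"refresh_interval": "30s"}
        })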
  22. Cost Mitigation ("Do we really need everything in the cloud?")
     • Decrease cloud costs with index retention strategies
     • Production logs go to both on-prem and cloud for redundancy; this has saved us a few times!
     • Dev/Integration/Staging logs go to on-prem only
     • Long-term retention goes to on-prem / S3 / Glacier (see the snapshot sketch below)
     • Elastic will soon support hot/warm architecture; this is amazing! It will let us rethink the cost mitigation strategy and move more to the cloud
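     A minimal sketch of the S3 leg of that retention strategy: register an S3 snapshot repository (requires the S3 repository plugin) and snapshot aging indices into it; the Glacier transition is handled by an S3 lifecycle rule outside Elasticsearch. Bucket, repository, and index names are illustrative assumptions.

        # Snapshot older on-prem indices to S3 for long-term retention.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["https://on-prem.example.com:9200"])  # placeholder endpoint

        es.snapshot.create_repository(repository="s3-archive", body={
            "type": "s3",
            "settings": {"bucket": "example-log-archive"},
        })

        es.snapshot.create(repository="s3-archive", snapshot="access-logs-2016.12",
                           body={"indices": "access-logs-2016.12.*"})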
  23. Cost Mitigation: Index Lifecycle
     1) Prod logs go to both cloud and on-prem
     2) Prod logs in cloud have short retention (7d)
     3) On-prem has longer retention (60-120d)
     4) On-prem snapshots to S3
     5) S3 transitions to Glacier
     (A Curator-style cleanup for steps 2-3 is sketched below.)
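     A minimal, Curator-style sketch of steps 2-3: delete daily indices once they fall outside the retention window, with a shorter window in Cloud than on-prem. Endpoints, the index prefix, and the windows are illustrative assumptions; in practice Curator's age-based delete_indices action does this work.

        # Delete daily indices that are older than the retention window.
        from datetime import datetime, timedelta, timezone
        from elasticsearch import Elasticsearch

        def delete_old_indices(es, prefix, retention_days):
            cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
            for name in es.indices.get(index=prefix + "-*"):
                try:
                    day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d")
                except ValueError:
                    continue  # skip indices that don't follow the daily naming scheme
                if day.replace(tzinfo=timezone.utc) < cutoff:
                    es.indices.delete(index=name)

        delete_old_indices(Elasticsearch(["https://elastic-cloud.example.com:9243"]), "access-logs", 7)
        delete_old_indices(Elasticsearch(["https://on-prem.example.com:9200"]), "access-logs", 60)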
  24. Future Planning ("What's next?")
     • Upgrade to 5.x on-prem and in cloud
     • Deeper Kafka integration: queuing during traffic spikes
     • Improve log libraries across the codebase (common logging formats; see the sketch below)
     • Logstash 5.x everywhere
     • Logstash 5.2 persistent queues
     • SAML/LDAP integration with Shield (we hope!)
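     A minimal sketch of what a common logging format can look like: every application emits one JSON object per line so Logstash and Beats can parse it without per-app grok patterns. Field names and the service name are illustrative assumptions.

        # Emit structured JSON log lines from application code.
        import json
        import logging
        from datetime import datetime, timezone

        class JsonFormatter(logging.Formatter):
            def format(self, record):
                return json.dumps({
                    "@timestamp": datetime.now(timezone.utc).isoformat(),
                    "level": record.levelname,
                    "service": "fandangonow-web",  # placeholder service name
                    "message": record.getMessage(),
                })

        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger = logging.getLogger("app")
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        logger.info("checkout completed")  # -> {"@timestamp": "...", "level": "INFO", ...}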