arRESTful Development: How Netflix Uses Elasticsearch to Better Understand Their Data

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

With 700-800 production nodes spread across 100 Elasticsearch clusters, Netflix is pushing the envelope when it comes to extracting real-time insights on a massive scale. This talk will highlight how they deploy and manage an environment of that scale and touch on Raigad - their internally-built open source sidecar management tool for Elasticsearch.

Presented by Sagar Loke & Homajeet Cheema, Netflix

Elastic Co

March 10, 2015

Transcript

  1. arRESTful Development: How Netflix Uses Elasticsearch to Better Understand Their Data

    Sagar Loke & Homajeet Cheema (Senior Software Engineers)
  2. Who are we

    Cloud Database Engineering @ Netflix
    •  Cassandra
    •  RDS
    •  Elasticsearch
    •  Dynomite – Netflix OSS
    •  Priam, Raigad – Netflix OSS
    Netflix OSS - http://netflix.github.io
    CC-BY-ND 4.0
  3. Summary

    •  Why Elasticsearch
    •  How we use Elasticsearch
    •  How we run Elasticsearch
    •  Raigad
  4. Why Elasticsearch

    •  Quick retrieval
    •  Full text search
    •  Distributed system
    •  Sharding, replication
    •  Cluster scale up or down fairly easy
    •  Flexible schema
  5. Who uses Elasticsearch @ Netflix

    Events generated by user activity
    •  Customer service
    •  Playback
    •  Signups/User Logins/Referrer URLs
    Service usage
    •  Security
  6. ES ecosystem @ Netflix

    •  Suro Data Pipeline - Netflix OSS
    •  Handles backpressure
    •  Retries
    •  Transport Client
    •  REST
    •  Logstash
    •  Kibana
    Netflix OSS - http://netflix.github.io
  7. How we run Elasticsearch

    Deployment
    •  AWS AMI
    •  Jenkins Job
    •  Python Scripts
    •  Raigad
  8. How we run Elasticsearch

    Configuration
    •  Asgard – Netflix OSS
    •  Archaius – Netflix OSS
    •  Eureka – Netflix OSS
    Monitoring, Alerting, Dashboard
    •  Servo – Netflix OSS
    Netflix OSS - http://netflix.github.io
  9. A Typical Cluster

    •  Dedicated master nodes
    •  Dedicated data nodes
    •  Search nodes
    •  Zone aware replication
    •  At least 1 replica
    •  Instance replacement
    •  Zone outages
  10. A Typical Cluster

  11. An Example Cluster

    Deployed in one AWS region
    •  More than 3 billion documents (event logs) indexed (per day)
    •  More than 5TB (per day)
    •  Indexes stored for 5 days
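
The figures on slide 11 can be turned into rough per-second and per-document numbers. The sketch below is only back-of-the-envelope arithmetic: it treats the "more than" figures as lower bounds, assumes 1 TB = 10^12 bytes, and assumes the load is spread evenly over the day, none of which the slide states.

```python
# Lower-bound figures taken from the slide ("more than ...").
DOCS_PER_DAY = 3_000_000_000
BYTES_PER_DAY = 5 * 10**12      # assumption: decimal terabytes
RETENTION_DAYS = 5
SECONDS_PER_DAY = 24 * 60 * 60

# Average indexing rate if the load were perfectly even (an assumption).
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY

# Average size of one indexed event log.
avg_doc_bytes = BYTES_PER_DAY / DOCS_PER_DAY

# Documents and bytes held at any time with a 5-day retention window.
retained_docs = DOCS_PER_DAY * RETENTION_DAYS
retained_bytes = BYTES_PER_DAY * RETENTION_DAYS

print(f"{docs_per_second:,.0f} docs/s, {avg_doc_bytes:,.0f} bytes/doc")
print(f"{retained_docs:,} docs and {retained_bytes / 10**12:.0f} TB retained")
```

Whether the 5TB figure already includes replicas is not stated on the slide, so the retained size may differ from actual disk usage.
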
  12. ES Deployment Growth

  13. Raigad

    An Elasticsearch Sidecar
  14. Raigad – Motivation

    •  Helps to automate ES deployments, upgrades
    •  Node Discovery and Tracking
    •  Automatic Index Management
    •  Scheduled Backup and Restore
    •  Geared towards running in AWS Environment
  15. Raigad – How it runs

    •  Elasticsearch Side Car installed on every ES instance
    •  Tunes elasticsearch.yml file based on configuration parameters
    •  Overwrites existing yml file with new parameters
    •  Updates Security Groups
    •  Bootstraps ES process
    •  Gathers information about peers and passes on to ES process during bootstrap
  16. Raigad – Auto ES Deployments

    •  Based on configuration parameters; tunes elasticsearch.yml file
    •  Single-region deployments
       node.rack_id : us-east-1c / us-east-1d / us-east-1e {Availability Zone}
    •  Multi-region deployments
       node.rack_id : us-east-1 {Region Name}
       network.publish_host: 54.123.456.789
    •  Currently follows dedicated Master-Data-Search deployment based on ASG Names
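
The single-region versus multi-region choice on slide 16 can be sketched as a small function. This is not Raigad's code: `rack_settings` and `render_yml` are hypothetical names, the rule "region = zone name minus its trailing letter" is an assumption about AWS naming, and `203.0.113.10` is a documentation-range placeholder address.

```python
def rack_settings(availability_zone, multi_region=False, public_ip=None):
    """Return elasticsearch.yml overrides for one node (illustrative only)."""
    if not multi_region:
        # Single-region: the rack id is the Availability Zone, e.g. us-east-1c.
        return {"node.rack_id": availability_zone}
    if public_ip is None:
        raise ValueError("multi-region deployments need a publish host")
    # Assumption: the region name is the zone name without its last letter.
    region = availability_zone[:-1]
    return {"node.rack_id": region, "network.publish_host": public_ip}


def render_yml(settings):
    """Render flat key/value settings as simple 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


print(render_yml(rack_settings("us-east-1c")))
print(render_yml(rack_settings("us-east-1c", True, "203.0.113.10")))
```

In the multi-region case every node in a region shares one rack id, which matches the slide's use of the region name in place of the zone.
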
  17. Raigad – Node Discovery and Tracking

    •  Sample implementation using Cassandra
    •  C* keeps track of metadata information of ES Clusters
    •  ES instance reads C* to discover other nodes during bootstrap
    •  Storing metadata in C* helps in Multi-Region deployments
  18. Raigad – Metadata in C* cluster
  19. Raigad – Auto Index Management

    •  Provides configuration properties for Auto Index Management
    •  Based on specific index date suffix (YYYYMMDD), old indices are cleaned and new indices are created
    •  Index Manager job can be scheduled or invoked through REST call
    •  Scheduled job runs only on Master node
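
The date-suffix rule on slide 19 can be illustrated with a planning function that decides which indices to drop and which to create. This is a minimal sketch, not Raigad's Index Manager: the function name, the `logs-` prefix, the retention count, and the idea of pre-creating tomorrow's index are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta


def plan_index_changes(existing, prefix, today, retention_days, precreate_days=1):
    """Return (to_delete, to_create) for indices named prefix + YYYYMMDD."""
    oldest_kept = today - timedelta(days=retention_days - 1)
    to_delete = []
    for name in existing:
        suffix = name[len(prefix):]
        # Only touch indices that match the prefix and a strict 8-digit date.
        if not name.startswith(prefix) or len(suffix) != 8 or not suffix.isdigit():
            continue
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a real calendar date
        if day < oldest_kept:
            to_delete.append(name)
    # Make sure today's index and the next few days' indices exist.
    wanted = [
        prefix + (today + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(precreate_days + 1)
    ]
    to_create = [name for name in wanted if name not in existing]
    return sorted(to_delete), to_create


indices = ["logs-20150301", "logs-20150305", "logs-20150310", "other-20150101"]
print(plan_index_changes(indices, "logs-", date(2015, 3, 10), retention_days=5))
```

The function only plans the changes; issuing the actual delete and create calls against the cluster is left out on purpose.
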
  20. Raigad – Running Index Manager …

    Before Running Index Manager
    After Running Index Manager
  21. Raigad – Configuration Parameters

    •  By default, uses Dynamic Properties in Archaius (https://github.com/Netflix/archaius)
    •  Supports configuration parameters through properties file / System Properties
    •  Based on configuration parameters, update following:
       –  Single/Multi-region deployment
       –  Tuning ES yml file
       –  Tribe Node setup
       –  Security Group settings
       –  Backup / Restore properties
       –  Frequency of Snapshot backup (daily / hourly etc)
  22. Raigad – Running in AWS

    •  Automatic updates to Security Groups when new nodes are added or removed
    •  Supports IAM Credentials
    •  Scheduled Snapshot Backup to S3 -- uses elasticsearch-cloud-aws plugin
    •  Publish ES Metrics to Servo - Centralized Monitoring System
  23. Raigad – Miscellaneous

    •  Tribe Node Setup
       –  Requires Source Clusters running on different TCP Ports
       –  Tested for Single Region Tribe Cluster
    •  REST API Support
       –  Start ES Process
       –  Stop ES Process
       –  Run Index Manager
       –  Get Peer information
       –  Run Snapshot Backup / Restore
  24. Lessons Learned …

    Tuning JVM
    •  Assign approximately (Available RAM/2) for ES Process
    •  Following JVM settings worked well for us:
       •  CMS Collector
       •  Young Gen = min(500MB * num_cores, 1/4 * heap size)
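
The sizing rule on slide 24 is simple enough to compute directly. The sketch below applies it and formats the result as standard HotSpot flags; the function name is hypothetical, and setting `-Xms` equal to `-Xmx` is an assumption of this example, not something the slide states.

```python
def suggest_jvm_flags(ram_mb, num_cores):
    """Apply the slide's rule of thumb and return HotSpot JVM flags."""
    # Heap: approximately half of the available RAM.
    heap_mb = ram_mb // 2
    # Young generation: min(500MB * num_cores, 1/4 * heap size).
    young_mb = min(500 * num_cores, heap_mb // 4)
    return [
        f"-Xms{heap_mb}m",          # assumption: fixed heap, min equals max
        f"-Xmx{heap_mb}m",
        f"-Xmn{young_mb}m",         # young generation size
        "-XX:+UseConcMarkSweepGC",  # CMS collector, as named on the slide
    ]


print(" ".join(suggest_jvm_flags(ram_mb=61440, num_cores=8)))
```

On a large-memory machine the per-core term is the smaller one and bounds the young generation; on a small heap the quarter-of-heap term takes over.
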
  25. Lessons Learned …

    Write Heavy Workloads
    •  refresh.interval = Disabled
    •  replication factor = Reduce
    •  schema changes = selectively index fields
    •  queue size = Unbounded queue for bulk indexing (Check heap usage)
    •  number of shards = Increase
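
Three of the write-heavy suggestions on slide 25 map onto Elasticsearch index settings. The sketch below only builds a request body for creating an index; it sends nothing to a cluster. The function name and the shard and replica counts are placeholders chosen for illustration, not values from the talk.

```python
import json


def write_heavy_index_body(shards, replicas):
    """Build an index-creation body tuned for heavy indexing (illustrative)."""
    return {
        "settings": {
            "index": {
                "refresh_interval": "-1",        # "-1" disables periodic refresh
                "number_of_replicas": replicas,  # keep the replica count low
                "number_of_shards": shards,      # fixed when the index is created
            }
        }
    }


# Placeholder values; pick counts that suit the cluster in question.
print(json.dumps(write_heavy_index_body(shards=20, replicas=1), indent=2))
```

With refresh disabled, newly indexed documents do not become searchable until a refresh happens, so this trade-off suits bulk loads better than interactive search.
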
  26. Lessons Learned

    •  Dedicated master nodes
    •  Queue to regulate indexing load for heavy write applications
    •  Set High file descriptor limit
    •  Ideally ES Clients and Servers should have same JVM Versions
    •  Do NOT run ES Cluster with Mixed JVM Versions
  27. Thank you.

    We are hiring! Apply here: jobs@netflix.com
    Homajeet Cheema (www.linkedin.com/in/homajeetcheema)
    Sagar Loke (@sagar_loke) (www.linkedin.com/in/sagarloke)
  28. CC-BY-ND 4.0

    This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA