arRESTful Development: How Netflix Uses Elasticsearch to Better Understand Their Data

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

With 700-800 production nodes spread across 100 Elasticsearch clusters, Netflix is pushing the envelope when it comes to extracting real-time insights on a massive scale. This talk will highlight how they deploy and manage an environment of that scale and touch on Raigad - their internally-built open source sidecar management tool for Elasticsearch.

Presented by Sagar Loke & Homajeet Cheema, Netflix

Elastic Co

March 10, 2015

Transcript

  1. arRESTful Development: How Netflix Uses
    Elasticsearch to Better Understand Their Data
    Sagar Loke & Homajeet Cheema (Senior Software Engineers)

  2. { } CC-BY-ND 4.0
    Who are we
    Cloud Database Engineering @ Netflix
    •  Cassandra
    •  RDS
    •  Elasticsearch
    •  Dynomite – Netflix OSS
    •  Priam, Raigad – Netflix OSS
    Netflix OSS - http://netflix.github.io

  3. Summary
    •  Why Elasticsearch
    •  How we use Elasticsearch
    •  How we run Elasticsearch
    •  Raigad

  4. Why Elasticsearch
    •  Quick retrieval
    •  Full text search
    •  Distributed system
    •  Sharding, replication
    •  Scaling a cluster up or down is fairly easy
    •  Flexible schema

  5. Who uses Elasticsearch @ Netflix
    Events generated by user activity
    •  Customer service
    •  Playback
    •  Signups/User Logins/Referrer URLs
    Service usage
    •  Security

  6. ES ecosystem @ Netflix
    •  Suro Data Pipeline - Netflix OSS
    •  Handles backpressure
    •  Retries
    •  Transport Client
    •  REST
    •  Logstash
    •  Kibana

  7. How we run Elasticsearch
    Deployment
    •  AWS AMI
    •  Jenkins Job
    •  Python Scripts
    •  Raigad

  8. How we run Elasticsearch
    Configuration
    •  Asgard – Netflix OSS
    •  Archaius – Netflix OSS
    •  Eureka – Netflix OSS
    Monitoring, Alerting, Dashboard
    •  Servo – Netflix OSS

  9. A Typical Cluster
    •  Dedicated master nodes
    •  Dedicated data nodes
    •  Search nodes
    •  Zone aware replication
    •  At least 1 replica
    •  Instance replacement
    •  Zone outages

  10. A Typical Cluster

  11. An Example Cluster
    Deployed in one AWS region
    •  More than 3 billion documents (event logs) indexed per day
    •  More than 5TB per day
    •  Indexes stored for 5 days
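
The retention figures on this slide imply a rough steady-state footprint. A minimal Python sketch of that arithmetic follows, treating the "more than" figures as lower bounds in decimal units. It assumes the 5TB per day is primary data and that exactly one replica is kept (the minimum named on the typical-cluster slide); the slide states neither, and `steady_state` is a hypothetical helper.

```python
# Back-of-envelope sizing from the slide's lower bounds (decimal units).
DOCS_PER_DAY = 3_000_000_000   # "more than 3 billion documents" per day
BYTES_PER_DAY = 5 * 10**12     # "more than 5TB" per day
RETENTION_DAYS = 5             # "indexes stored for 5 days"
REPLICAS = 1                   # assumption: one replica, not stated for this cluster


def steady_state(docs_per_day, bytes_per_day, retention_days, replicas):
    """Return (documents retained, bytes retained incl. replicas, mean bytes per document)."""
    docs = docs_per_day * retention_days
    total_bytes = bytes_per_day * retention_days * (1 + replicas)
    avg_doc_bytes = bytes_per_day / docs_per_day
    return docs, total_bytes, avg_doc_bytes


docs, total_bytes, avg_doc = steady_state(
    DOCS_PER_DAY, BYTES_PER_DAY, RETENTION_DAYS, REPLICAS
)
# prints "15 billion docs, 50 TB, ~1667 bytes/doc"
print(f"{docs / 1e9:.0f} billion docs, {total_bytes / 1e12:.0f} TB, ~{avg_doc:.0f} bytes/doc")
```

Under those assumptions the cluster holds at least 15 billion documents and roughly 50 TB at any time, with an average event of about 1.7 KB.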

  12. ES Deployment Growth

  13. Raigad
    An Elasticsearch Sidecar

  14. Raigad – Motivation
    •  Helps automate ES deployments and upgrades
    •  Node Discovery and Tracking
    •  Automatic Index Management
    •  Scheduled Backup and Restore
    •  Geared towards running in an AWS environment

  15. Raigad – How it runs
    •  Elasticsearch sidecar installed on every ES instance
    •  Tunes the elasticsearch.yml file based on configuration parameters
    •  Overwrites the existing yml file with the new parameters
    •  Updates Security Groups
    •  Bootstraps the ES process
    •  Gathers information about peers and passes it on to the ES process during bootstrap

  16. Raigad – Auto ES Deployments
    •  Tunes the elasticsearch.yml file based on configuration parameters
    •  Single-region deployments
    –  node.rack_id : us-east-1c / us-east-1d / us-east-1e {Availability Zone}
    •  Multi-region deployments
    –  node.rack_id : us-east-1 {Region Name}
    –  network.publish_host: 54.123.456.789
    •  Currently follows a dedicated Master-Data-Search deployment based on ASG names
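
The rack_id choice on this slide can be sketched in Python. This is an illustration, not Raigad's code: `es_settings` and `to_yaml_lines` are hypothetical helpers, the region name is derived by dropping the availability zone's trailing letter, and pairing `node.rack_id` with Elasticsearch's `cluster.routing.allocation.awareness.attributes` setting is an assumption about how the zone awareness is wired up.

```python
def es_settings(availability_zone, multi_region=False, public_ip=None):
    """Build elasticsearch.yml overrides for one node as a flat dict (hypothetical helper)."""
    settings = {
        # Elasticsearch allocation awareness: spread shard copies across rack_id values.
        # Assumption: the slides do not say this setting is what Raigad writes.
        "cluster.routing.allocation.awareness.attributes": "rack_id",
    }
    if multi_region:
        if public_ip is None:
            raise ValueError("multi-region nodes need a publicly reachable address")
        # Region name = availability zone minus its trailing letter (us-east-1c -> us-east-1).
        settings["node.rack_id"] = availability_zone[:-1]
        settings["network.publish_host"] = public_ip
    else:
        settings["node.rack_id"] = availability_zone
    return settings


def to_yaml_lines(settings):
    """Render flat key/value settings as simple 'key: value' lines."""
    return [f"{key}: {value}" for key, value in sorted(settings.items())]


for line in to_yaml_lines(es_settings("us-east-1c", multi_region=True, public_ip="203.0.113.10")):
    print(line)
```

With single-region settings each availability zone acts as a rack, so replicas land in different zones; with multi-region settings the whole region is the rack, so replicas land in different regions.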

  17. Raigad – Node Discovery and Tracking
    •  Sample implementation using Cassandra (C*)
    •  C* keeps track of metadata for the ES clusters
    •  Each ES instance reads C* to discover other nodes during bootstrap
    •  Storing metadata in C* helps in multi-region deployments
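
The register-then-discover flow on this slide might look roughly like the Python sketch below. An in-memory dictionary stands in for the Cassandra table, and every name here (`NodeRecord`, `InMemoryRegistry`, `bootstrap`, the field names) is hypothetical; the slides do not show Raigad's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    cluster: str      # ES cluster name
    instance_id: str  # hypothetical instance identifier
    host: str         # address other nodes should connect to
    region: str       # kept so peers in other regions stay visible


class InMemoryRegistry:
    """Stand-in for the Cassandra table: one row per (cluster, instance)."""

    def __init__(self):
        self._rows = {}

    def register(self, record):
        self._rows[(record.cluster, record.instance_id)] = record

    def peers(self, cluster, exclude_instance):
        """Hosts of all other registered nodes in the same cluster, in any region."""
        return sorted(
            row.host
            for (name, instance), row in self._rows.items()
            if name == cluster and instance != exclude_instance
        )


def bootstrap(registry, me):
    """Register this node, then return the peer hosts to hand to ES at startup."""
    registry.register(me)
    return registry.peers(me.cluster, exclude_instance=me.instance_id)
```

Because the registry is shared rather than tied to one region's network, a node starting in one region can find peers in another, which is the multi-region benefit the slide mentions.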

  18. Raigad – Metadata in C* cluster

  19. Raigad – Auto Index Management
    •  Provides configuration properties for Auto Index Management
    •  Based on a specific index date suffix (YYYYMMDD), old indices are cleaned and new indices are created
    •  Index Manager job can be scheduled or invoked through a REST call
    •  The scheduled job runs only on the master node
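
The date-suffix rule on this slide can be sketched in Python as a pure planning function. This is not Raigad's implementation: `plan_index_changes`, its parameters, and the choice to keep `retention_days` daily indices counting today and to pre-create tomorrow's index are all assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def plan_index_changes(existing, prefix, today, retention_days, precreate_days=1):
    """Return (indices to delete, indices to create) for daily <prefix>YYYYMMDD indices."""
    existing = set(existing)
    oldest_kept = today - timedelta(days=retention_days - 1)
    to_delete = []
    for name in existing:
        if not name.startswith(prefix):
            continue  # index is not managed by this rule
        suffix = name[len(prefix):]
        if len(suffix) != 8 or not suffix.isdigit():
            continue  # no YYYYMMDD suffix
        try:
            index_day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a real calendar date
        if index_day < oldest_kept:
            to_delete.append(name)
    wanted = [
        f"{prefix}{today + timedelta(days=n):%Y%m%d}" for n in range(precreate_days + 1)
    ]
    to_create = [name for name in wanted if name not in existing]
    return sorted(to_delete), to_create


deletes, creates = plan_index_changes(
    ["events20150304", "events20150305", "events20150306", "events20150310"],
    prefix="events",
    today=date(2015, 3, 10),
    retention_days=5,
)
print(deletes, creates)
```

Keeping the planning step free of side effects means the same function could back both a scheduled job and an on-demand REST invocation.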

  20. Raigad – Running Index Manager …
    (Figure panels: Before Running Index Manager, After Running Index Manager)

  21. Raigad – Configuration Parameters
    •  By default, uses Dynamic Properties in Archaius (https://github.com/Netflix/archaius)
    •  Supports configuration parameters through a properties file or System Properties
    •  Based on configuration parameters, updates the following:
    –  Single/Multi-region deployment
    –  Tuning ES yml file
    –  Tribe Node setup
    –  Security Group settings
    –  Backup / Restore properties
    –  Frequency of snapshot backup (daily, hourly, etc.)

  22. Raigad – Running in AWS
    •  Automatic updates to Security Groups when nodes are added or removed
    •  Supports IAM Credentials
    •  Scheduled snapshot backup to S3 (uses the elasticsearch-cloud-aws plugin)
    •  Publishes ES metrics to Servo (centralized monitoring system)
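
One way to picture the automatic Security Group updates is as a set difference between the rules that exist and the nodes that are currently alive. The Python sketch below only computes that difference and makes no AWS calls; `security_group_changes` and the `(cidr, port)` rule shape are hypothetical, and 9300 is used because it is Elasticsearch's default transport port.

```python
def security_group_changes(current_rules, live_node_ips, port=9300):
    """Return (rules to authorize, rules to revoke) as sets of (cidr, port) pairs."""
    desired = {(f"{ip}/32", port) for ip in live_node_ips}
    # Only touch rules for the ES transport port; leave unrelated rules alone.
    managed = {rule for rule in current_rules if rule[1] == port}
    return desired - managed, managed - desired


current = {("10.0.0.1/32", 9300), ("10.0.0.2/32", 9300), ("0.0.0.0/0", 443)}
to_add, to_remove = security_group_changes(current, ["10.0.0.2", "10.0.0.3"])
print(to_add, to_remove)
```

Here the replaced node 10.0.0.1 loses its rule, the new node 10.0.0.3 gains one, and the unrelated port 443 rule is untouched.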

  23. Raigad – Miscellaneous
    •  Tribe Node Setup
    –  Requires source clusters to run on different TCP ports
    –  Tested for a single-region tribe cluster
    •  REST API Support
    –  Start ES Process
    –  Stop ES Process
    –  Run Index Manager
    –  Get Peer information
    –  Run Snapshot Backup / Restore

  24. Lessons Learned …
    Tuning JVM
    •  Assign approximately half of the available RAM (Available RAM/2) to the ES process
    •  The following JVM settings worked well for us:
    –  CMS Collector
    –  Young Gen = min(500MB * num_cores, 1/4 * heap size)
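
The sizing rules on this slide translate directly into arithmetic. The Python sketch below applies them and renders standard HotSpot flags; `jvm_sizes` and `jvm_flags` are hypothetical helpers, and setting the minimum heap equal to the maximum is an assumption the slide does not state.

```python
MB = 1024 * 1024
GIB = 1024 * MB


def jvm_sizes(ram_bytes, num_cores):
    """Heap = RAM / 2; young gen = min(500MB * cores, heap / 4), both in bytes."""
    heap = ram_bytes // 2
    young = min(500 * MB * num_cores, heap // 4)
    return heap, young


def jvm_flags(ram_bytes, num_cores):
    """Render the sizes as HotSpot options, using the CMS collector as on the slide."""
    heap, young = jvm_sizes(ram_bytes, num_cores)
    heap_mb, young_mb = heap // MB, young // MB
    return [
        f"-Xms{heap_mb}m",          # assumption: initial heap pinned to the maximum
        f"-Xmx{heap_mb}m",
        f"-Xmn{young_mb}m",
        "-XX:+UseConcMarkSweepGC",
    ]


# 32 GiB of RAM, 4 cores: 16 GiB heap, young gen capped by the per-core term (2000 MB).
print(" ".join(jvm_flags(32 * GIB, 4)))
```

On a machine with many cores and little memory the quarter-of-heap term wins instead: 16 GiB of RAM and 8 cores gives an 8 GiB heap and a 2 GiB young generation.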

  25. Lessons Learned …
    Write Heavy Workloads
    •  refresh.interval = Disabled
    •  replication factor = Reduce
    •  schema changes = selectively index fields
    •  queue size = Unbounded queue for bulk indexing (Check heap usage)
    •  number of shards = Increase
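
Two of these knobs correspond to dynamic index settings in Elasticsearch: `index.refresh_interval` (where `-1` disables periodic refresh) and `index.number_of_replicas`. The Python sketch below only builds the JSON bodies for the update index settings API; the helper names are hypothetical, and restoring normal values after a load is an assumption, since the slide does not say whether the settings were changed temporarily or permanently.

```python
import json


def bulk_load_settings(replicas_during_load):
    """Index settings to apply while a heavy write load is running."""
    return {
        "index": {
            "refresh_interval": "-1",  # -1 disables periodic refresh
            "number_of_replicas": replicas_during_load,
        }
    }


def normal_settings(refresh_interval="1s", replicas=1):
    """Settings to restore afterwards; 1s is the Elasticsearch default refresh interval."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


# Request bodies for PUT /<index>/_settings
print(json.dumps(bulk_load_settings(0)))
print(json.dumps(normal_settings()))
```

With refresh disabled, newly indexed documents are not searchable until a refresh happens, so this trade only suits workloads that can tolerate delayed visibility.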

  26. Lessons Learned
    •  Dedicated master nodes
    •  Queue to regulate indexing load for write-heavy applications
    •  Set a high file descriptor limit
    •  Ideally, ES clients and servers should run the same JVM version
    •  Do NOT run an ES cluster with mixed JVM versions
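
The slide does not say which queue sits in front of Elasticsearch. As a generic illustration of the idea, the Python sketch below uses a bounded in-process buffer: producers are refused when it is full, and an indexer drains it in batches sized for bulk requests. `offer` and `drain_batch` are hypothetical names, and a production system would more likely use a durable message queue.

```python
import queue


def offer(buffer, doc, timeout=0.1):
    """Try to enqueue a document; return False so the caller can retry or shed load."""
    try:
        buffer.put(doc, timeout=timeout)
        return True
    except queue.Full:
        return False


def drain_batch(buffer, max_batch):
    """Take up to max_batch queued documents without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch


buffer = queue.Queue(maxsize=3)  # bounded, so a slow indexer pushes back on producers
accepted = [offer(buffer, n, timeout=0.01) for n in range(4)]
first_batch = drain_batch(buffer, 2)
print(accepted, first_batch)
```

The bound is the point of the design: when Elasticsearch falls behind, pressure shows up at the producers rather than as unbounded memory growth inside the cluster.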

  27. Thank you.
    We are hiring!!
    Apply here: [email protected]
    Homajeet Cheema
    (www.linkedin.com/in/homajeetcheema)
    Sagar Loke (@sagar_loke)
    (www.linkedin.com/in/sagarloke)

  28. { } CC-BY-ND 4.0
    This work is licensed under the Creative Commons
    Attribution-NoDerivatives 4.0 International License.
    To view a copy of this license, visit:
    http://creativecommons.org/licenses/by-nd/4.0/
    or send a letter to:
    Creative Commons
    PO Box 1866
    Mountain View, CA 94042
    USA