
Scaling Elasticsearch at NetSuite: The Good, The Bad, and The Ugly

Elastic Co
February 18, 2016


Learn how NetSuite has scaled its Elastic Stack deployment to handle 3 billion daily events and a petabyte of data while meeting security and compliance requirements such as HIPAA, EU data rules, PACRIM, and PCI/DSS. Topics covered include hardware selection, core configuration, deployment, security, and monitoring.


Transcript

  1. Bryan Washer
     Manager, Engineering Operations Architecture; Principal Engineer
     [email protected]
     #elasticsearch: _Bryan_
     Vertical Scaling Elasticsearch: The Good, The Bad, and The Ugly
  2. Agenda — loosely followed. Ask questions…I like to discuss.
     1. The Beginning…seems like so long ago…
     2. The UGLY
     3. The GOOD
     4. The BAD
     5. The End…not really, we are still growing…
  3. The Beginning…seems like so long ago…
     • Capable of processing 1.25 billion events a day
       - Peak will be roughly 2x of trough
       - Store 13 months of data (estimated 600-750 TB)
     • 10-12 physical data nodes
       - 2 instances (1 Hot, 1 Warm), lots of storage, fast CPUs, lots of memory (DB server)
     • Secure data ingest and access
       - Shield: validated client, separation of duties, authorization/authentication, least-privileges rule, control of application accounts
     • Resilient
       - Can handle hardware failure and/or node (software) crashes (not if but when)
     • Small footprint in the DC
     Elasticsearch to the rescue!!!
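The hot/warm split above can be sketched with tag-based shard allocation, which the deck's later slides reference via `node.tag`. A minimal sketch; the `logs-2016.02.18` index name is illustrative, not from the deck:

```yaml
# elasticsearch.yml on the "hot" instance (fast disks, receives new indexes)
node.tag: hot

# elasticsearch.yml on the "warm" instance (bulk storage, older indexes)
node.tag: warm

# Per-index allocation filter: a new index is pinned to hot nodes, then the
# setting is flipped to "warm" as the index ages:
#   index.routing.allocation.require.tag: hot
```

The tag is an arbitrary custom node attribute; the `index.routing.allocation.require.*` setting tells the cluster to keep that index's shards only on nodes whose attribute matches.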
  4. "Can I add another index? I estimate it is only going to be ###,### events a day."
     — Anonymous Developer (said too often to keep track)
  5. Scaling
     • Bare hardware / virtual servers
     • Usage profiles
     • Growth expectations
     • Storage requirements
     • Existing environment standards
     • Number of clusters
     • Risk acceptance
     [Diagram: the factors above plotted along horizontal vs. vertical scaling axes]
  6. Characteristics of The UGLY — what is wrong here?
     • Inefficient use of hardware
       - Powerful CPUs, large amounts of memory, massive storage, high-speed networks
     • Limited number of nodes
       - Fewer resources for queries
       - Fewer locations for distribution of shards
       - Fewer total shards can be supported
     • Larger amounts of data to replicate on node failure
  7. Characteristics of The Good — how do we make things better?
     • Use the hardware
       - Push the CPUs
       - Enable the memory we have (up to 50%)
       - Split up the storage
     • Increase nodes
       - More resources for queries
       - Better distribution of the larger number of shards
       - More shards supported, thus more indexes, thus more data available online
     • Faster replication on node failure due to smaller data sets
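The "up to 50%" memory point matches the Elasticsearch guidance of that era: give each instance's JVM heap no more than half the RAM allotted to it, leaving the rest for the filesystem cache. A minimal sketch of the per-instance sysconfig settings; the file name and the 30g figure are illustrative:

```shell
# /etc/sysconfig/elasticsearch-hot  (illustrative file name)
# Heap at most 50% of this instance's share of RAM, and kept under
# ~32 GB so the JVM can still use compressed object pointers.
ES_HEAP_SIZE=30g
# Lock the heap in RAM so it is never swapped out
# (requires bootstrap.mlockall and a memlock ulimit).
MAX_LOCKED_MEMORY=unlimited
```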
  8. Characteristics of The Bad — what did it cost us?
     • More complicated cluster management
       - Need to address each instance for all management
       - Need to set up proper allocation rules to protect data from a physical node crash
     • Non-standard configuration makes it more difficult to get assistance from public resources and documentation
       - Support is your friend here (shout-out to Chris Earle and the Elastic developers)
     • Physical node loss means greater cluster impact
     • Higher cost of hardware for data nodes
     • More power and environment controls necessary
  9. The End…well, not really, we are still growing…
     • Capable of processing 5.75 billion events a day
       - Peak is roughly 2x of trough (as high as 10x)
       - Store 13 months of data (1.75 PB) and growing
     • 25-30 physical data nodes
       - 5 instances (1 Hot, 5 Warm), lots of storage, fast CPUs, lots of memory
     • Secure data ingest and access
       - Shield: validated client, separation of duties, authorization/authentication, least privileges
     • Resilient
       - Can handle hardware failure and/or node crashes (not IF but WHEN)
       - Can handle maintenance/upgrades
     • Small footprint in the DC (could be bigger)
     Elasticsearch continues to rescue!!!
  10. Winner…Winner…Winner! Bonus Material
      The following slides are extras that may help should you decide to use some of the things we discussed today. They come completely unsupported, and should they cause any pain, suffering, or possibly a zombie apocalypse, I am completely absolved of any responsibility.
  11. Modifications to the init file — to enable multiple ES configurations to be read

      35 prog=$(basename "$0")
         …
      47 # Modifications based on Ticket #17293 with Elastic Support to continue to support separate instance /etc/sysconfig/* files.
      48 # BEGIN COMMENT OUT OLD
      49 #ES_ENV_FILE="/etc/sysconfig/elasticsearch"
      50 #if [ -f "$ES_ENV_FILE" ]; then
      51 #    . "$ES_ENV_FILE"
      52 #fi
      53 # END COMMENT OUT OLD
      54 # BEGIN TICKET #17293 MODIFICATIONS
      55 ES_ENV_FILE="/etc/sysconfig/${prog}"
      56 if [ -f "$ES_ENV_FILE" ]; then
      57     . "$ES_ENV_FILE"
      58 fi
      59 # END TICKET #17293 MODIFICATIONS
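With the init script now sourcing `/etc/sysconfig/${prog}`, each instance is driven by a copy (or symlink) of the init script whose basename selects its own sysconfig file. A minimal sketch, assuming instance names like `elasticsearch-hot` (the names and paths are illustrative, not from the deck):

```shell
# Create one init script per instance; $prog resolves to its basename:
#   cp /etc/init.d/elasticsearch /etc/init.d/elasticsearch-hot
#   cp /etc/init.d/elasticsearch /etc/init.d/elasticsearch-warm

# /etc/sysconfig/elasticsearch-hot — sourced because $prog is "elasticsearch-hot"
ES_HOME=/usr/share/elasticsearch
CONF_DIR=/etc/elasticsearch-hot        # matches path.conf for this instance
DATA_DIR=/var/data/elasticsearch-hot   # matches path.data
LOG_DIR=/var/log/elasticsearch-hot     # matches path.logs
ES_HEAP_SIZE=30g
```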
  12. Things you must address with multi-instances

      All of these need to be addressed; they must be set for proper allocation rules to be applied.
      1. node.name – to identify the specific instance (for consistency, I suggest naming the process, configuration files, and node all the same)
      2. node.server – to identify the physical server for the instance
      3. node.tag (optional) – to be able to move shards around
      4. path.conf – to identify the unique configuration directory
      5. path.data – to identify the unique data directory
      6. path.logs – to identify the unique logs directory
      7. path.work – to identify the unique work directory
      8. cluster.routing.allocation.same_shard.host: true – enables the check that prevents allocation of multiple copies of the same shard on a single host
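The checklist above can be sketched as one instance's elasticsearch.yml. A minimal sketch: the cluster, host, and directory names are illustrative, and `node.server` / `node.tag` are custom node attributes consumed by allocation rules rather than built-in settings:

```yaml
# elasticsearch.yml for the "hot" instance on physical host db01 (names illustrative)
cluster.name: netsuite-logs
node.name: db01-hot            # process, config directory, and node share one name
node.server: db01              # custom attribute: identifies the physical server
node.tag: hot                  # custom attribute: used to steer shards between tiers

path.conf: /etc/elasticsearch-hot
path.data: /var/data/elasticsearch-hot
path.logs: /var/log/elasticsearch-hot
path.work: /var/work/elasticsearch-hot

# Never place a shard and its replica on the same physical host, even
# though that host runs several Elasticsearch instances:
cluster.routing.allocation.same_shard.host: true
```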
  13. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third-party marks and brands are the property of their respective holders.