
Scaling Elasticsearch at NetSuite: The Good, The Bad, and The Ugly

Elastic Co
February 18, 2016


Learn how NetSuite has scaled its Elastic Stack deployment to handle 3 billion daily events and a petabyte of data while meeting security and compliance requirements such as HIPAA, EU data rules, PACRIM, and PCI/DSS. Topics covered include hardware selection, core configuration, deployment, security, and monitoring.


Transcript

  1. Bryan Washer
     Manager, Engineering Operations Architecture; Principal Engineer
     [email protected]
     #elasticsearch: _Bryan_
     Vertical Scaling Elasticsearch: The Good, The Bad, and The Ugly
  2. Agenda — loosely followed. Ask questions…I like to discuss.
     1. The Beginning…seems like so long ago…
     2. The UGLY
     3. The GOOD
     4. The BAD
     5. The End…not really, we are still growing…
  3. The Beginning…seems like so long ago…
     • Capable of processing 1.25 billion events a day
       - Peak will be roughly 2x of trough
       - Store 13 months of data (estimated 600-750 TB)
     • 10-12 physical data nodes
       - 2 instances (1 Hot, 1 Warm), lots of storage, fast CPUs, lots of memory (DB server)
     • Secure data ingest and access
       - Shield: validated client, separation of duties, authorization/authentication, least-privileges rule, control of application accounts
     • Resilient
       - Can handle hardware failure and/or node (software) crashes (not if but when)
     • Small footprint in the DC
     Elasticsearch to the rescue!!!
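The hot/warm split above can be sketched with tag-based shard allocation, which the deck's later slides reference via `node.tag`. A minimal sketch; the `logs-2016.02.18` index name is illustrative, not from the deck:

```yaml
# elasticsearch.yml on the "hot" instance (fast disks, receives new indexes)
node.tag: hot

# elasticsearch.yml on the "warm" instance (bulk storage, older indexes)
node.tag: warm

# Per-index allocation filter: a new index is pinned to hot nodes, then the
# setting is flipped to "warm" as the index ages:
#   index.routing.allocation.require.tag: hot
```

The tag is an arbitrary custom node attribute; the `index.routing.allocation.require.*` setting tells the cluster to keep that index's shards only on nodes whose attribute matches.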
  4. "Can I add another index? I estimate it is only going to be ###,### events a day."
     — Anonymous Developer (said too often to keep track)
  5. Scaling
     • Bare hardware / virtual servers
     • Usage profiles
     • Growth expectations
     • Storage requirements
     • Existing environment standards
     • Number of clusters
     • Risk acceptance
     [Diagram: the factors above plotted along horizontal vs. vertical scaling axes]
  6. Characteristics of The UGLY — what is wrong here?
     • Inefficient use of hardware
       - Powerful CPUs, large amounts of memory, massive storage, high-speed networks
     • Limited number of nodes
       - Fewer resources for queries
       - Fewer locations for distribution of shards
       - Fewer total shards can be supported
     • Larger amounts of data to replicate on node failure
  7. Characteristics of The Good — how do we make things better?
     • Use the hardware
       - Push the CPUs
       - Enable the memory we have (up to 50%)
       - Split up the storage
     • Increase nodes
       - More resources for queries
       - Better distribution of the larger number of shards
       - More shards supported, thus more indexes, thus more data available online
     • Faster replication on node failure due to smaller data sets
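The "up to 50%" memory point matches the Elasticsearch guidance of that era: give each instance's JVM heap no more than half the RAM allotted to it, leaving the rest for the filesystem cache. A minimal sketch of the per-instance sysconfig settings; the file name and the 30g figure are illustrative:

```shell
# /etc/sysconfig/elasticsearch-hot  (illustrative file name)
# Heap at most 50% of this instance's share of RAM, and kept under
# ~32 GB so the JVM can still use compressed object pointers.
ES_HEAP_SIZE=30g
# Lock the heap in RAM so it is never swapped out
# (requires bootstrap.mlockall and a memlock ulimit).
MAX_LOCKED_MEMORY=unlimited
```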
  8. Characteristics of The Bad — what did it cost us?
     • More complicated cluster management
       - Need to address each instance for all management
       - Need to set up proper allocation rules to protect data from a physical node crash
     • Non-standard configuration makes it more difficult to get assistance from public resources and documentation
       - Support is your friend here (shout-out to Chris Earle and the Elastic developers)
     • Physical node loss means greater cluster impact
     • Higher cost of hardware for data nodes
     • More power and environment controls necessary
  9. The End…well, not really, we are still growing…
     • Capable of processing 5.75 billion events a day
       - Peak is roughly 2x of trough (as high as 10x)
       - Store 13 months of data (1.75 PB) and growing
     • 25-30 physical data nodes
       - 5 instances (1 Hot, 5 Warm), lots of storage, fast CPUs, lots of memory
     • Secure data ingest and access
       - Shield: validated client, separation of duties, authorization/authentication, least privileges
     • Resilient
       - Can handle hardware failure and/or node crashes (not IF but WHEN)
       - Can handle maintenance/upgrades
     • Small footprint in the DC (could be bigger)
     Elasticsearch continues to rescue!!!
  10. Winner…Winner…Winner! Bonus Material
      The following slides are extras that may help should you decide to use some of the things we discussed today. They come completely unsupported, and should they cause any pain, suffering, or possibly a zombie apocalypse, I am completely absolved of any responsibility.
  11. Modifications to the init file — to enable multiple ES configurations to be read

      35 prog=$(basename "$0")
         …
      47 # Modifications based on Ticket #17293 with Elastic Support to continue to support separate instance /etc/sysconfig/* files.
      48 # BEGIN COMMENT OUT OLD
      49 #ES_ENV_FILE="/etc/sysconfig/elasticsearch"
      50 #if [ -f "$ES_ENV_FILE" ]; then
      51 #    . "$ES_ENV_FILE"
      52 #fi
      53 # END COMMENT OUT OLD
      54 # BEGIN TICKET #17293 MODIFICATIONS
      55 ES_ENV_FILE="/etc/sysconfig/${prog}"
      56 if [ -f "$ES_ENV_FILE" ]; then
      57     . "$ES_ENV_FILE"
      58 fi
      59 # END TICKET #17293 MODIFICATIONS
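With the init script now sourcing `/etc/sysconfig/${prog}`, each instance is driven by a copy (or symlink) of the init script whose basename selects its own sysconfig file. A minimal sketch, assuming instance names like `elasticsearch-hot` (the names and paths are illustrative, not from the deck):

```shell
# Create one init script per instance; $prog resolves to its basename:
#   cp /etc/init.d/elasticsearch /etc/init.d/elasticsearch-hot
#   cp /etc/init.d/elasticsearch /etc/init.d/elasticsearch-warm

# /etc/sysconfig/elasticsearch-hot — sourced because $prog is "elasticsearch-hot"
ES_HOME=/usr/share/elasticsearch
CONF_DIR=/etc/elasticsearch-hot        # matches path.conf for this instance
DATA_DIR=/var/data/elasticsearch-hot   # matches path.data
LOG_DIR=/var/log/elasticsearch-hot     # matches path.logs
ES_HEAP_SIZE=30g
```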
  12. Things you must address with multi-instances

      All of these need to be addressed; they must be set for proper allocation rules to be applied.
      1. node.name – to identify the specific instance (for consistency, I suggest naming the process, configuration files, and node all the same)
      2. node.server – to identify the physical server for the instance
      3. node.tag (optional) – to be able to move shards around
      4. path.conf – to identify the unique configuration directory
      5. path.data – to identify the unique data directory
      6. path.logs – to identify the unique logs directory
      7. path.work – to identify the unique work directory
      8. cluster.routing.allocation.same_shard.host: true – enables the check that prevents allocation of multiple copies of the same shard on a single host
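The checklist above can be sketched as one instance's elasticsearch.yml. A minimal sketch: the cluster, host, and directory names are illustrative, and `node.server` / `node.tag` are custom node attributes consumed by allocation rules rather than built-in settings:

```yaml
# elasticsearch.yml for the "hot" instance on physical host db01 (names illustrative)
cluster.name: netsuite-logs
node.name: db01-hot            # process, config directory, and node share one name
node.server: db01              # custom attribute: identifies the physical server
node.tag: hot                  # custom attribute: used to steer shards between tiers

path.conf: /etc/elasticsearch-hot
path.data: /var/data/elasticsearch-hot
path.logs: /var/log/elasticsearch-hot
path.work: /var/work/elasticsearch-hot

# Never place a shard and its replica on the same physical host, even
# though that host runs several Elasticsearch instances:
cluster.routing.allocation.same_shard.host: true
```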
  13. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third-party marks and brands are the property of their respective holders.